
Scrapy Interpreting Html Entities On Extract

While crawling, I usually extract links like this: response.xpath('//a[contains(@class, "something")]/@href').extract() But for some reason it was not working on one specific page.

Solution 1:

After some time, I noticed that the same page also rendered strangely in Firefox. The problem was that the page being crawled was served with the content type "text/xml" rather than HTML.

To fix my code, I built a new selector from the page text, which Scrapy parses as HTML by default instead of XML:

# Selector(text=...) expects a str, so pass response.text rather than response.body
sel = scrapy.Selector(text=response.text)
sel.xpath('//a[contains(@class, "something")]/@href').extract()

And now I get the correct result:

['details?lm=&printerView=true&accessType=1&id=A43', (...)]
