Can ItemLoaders be used to parse HTML nodes?

Typically, an item loader extracts the data automatically before passing the values to an input processor:

  • Data from xpath1 is extracted, and passed through the input processor of the name field. (Scrapy docs)

Is it possible to change this behavior for some fields of the item loader, so that the input processor receives a more complex structure (a selector, I suppose) instead of extracted strings?

I have an HTML document:

<a class="foo" href="http://example.com">example 1</a>
<a class="foo" href="http://example.org">example 2</a>

      

Now I would like to select these link elements in the spider:

loader.add_css('links', '.foo')

      

and parse them in an item loader to get a list of values (after the output processor) like this:

[("http://example.com", "example 1"), ("http://example.org", "example 2")]

      

However, since item loaders automatically convert the selected nodes to unicode strings, this doesn't seem so easy.


1 answer


You can use .add_value() and "manually" build the list of href and text values:



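# Build (href, text) tuples, in the order the question asks for.
links = [(item.css('::attr(href)').extract()[0],
          item.css('::text').extract()[0])
         for item in response.css('.foo')]
# add_value() skips the automatic extraction step and takes the list as-is.
loader.add_value('links', links)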

      
