Can ItemLoaders be used to parse HTML nodes?

Question

Typically, an item loader extracts the data automatically before passing the values to an input processor:

Data from xpath1 is extracted, and passed through the input processor of the name field. ( Scrapy Docs )

Is it possible to change this behavior for some fields of the item loader, so that the input processor receives a more complex structure (a selector, I suppose)?

I have an HTML document:

<a class="foo" href="http://example.com">example 1</a>
<a class="foo" href="http://example.org">example 2</a>

Now I would like to select these link elements in the spider

loader.add_css('links', '.foo')

and parse them in an item loader to get a list of values (after the output processor) like this:

[("http://example.com", "example 1"), ("http://example.org", "example 2")]

However, since item loaders automatically convert the input to unicode, this does not seem to be easy.

python scrapy

Aufziehvogel Dec 16 14 at 20:11

1 answer

alecxe · Accepted Answer · 2014-12-16T20:21:12+0000

You can use .add_value() and "manually" build the list of texts and hrefs:

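# Build (href, text) tuples, in the order the question asks for
links = [(item.css('::attr(href)').extract()[0],
          item.css('::text').extract()[0])
         for item in response.css('.foo')]
# Pass the ready-made tuples to the loader instead of a CSS expression
loader.add_value('links', links)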
