Can ItemLoaders be used to parse HTML nodes?
Typically, an item loader fetches data automatically before passing values to an input processor:
- Data from xpath1 is extracted and passed through the input processor of the name field. (Scrapy docs)
Is it possible to change this behavior for some fields of the item loader, so that I can pass in a more complex structure (a selector, in my case)?
I have an HTML document:
<a class="foo" href="http://example.com">example 1</a>
<a class="foo" href="http://example.org">example 2</a>
Now I would like to collect these link elements in the spider:
loader.add_css('links', '.foo')
and parse them in an item loader to get a list of values (after the output processor) like this:
[("http://example.com", "example 1"), ("http://example.org", "example 2")]
However, since item loaders automatically convert the input to unicode, this does not seem so easy.
1 answer
You can use .add_value() and "manually" build the list of text and href pairs:
# Build (href, text) tuples, in the order the question asks for
links = [(item.css('::attr(href)').extract()[0],
          item.css('::text').extract()[0])
         for item in response.css('.foo')]
loader.add_value('links', links)