Checking text for the presence of a large set of keywords

Suppose I want to check a web page for the presence of an arbitrarily large set of keywords. What is the best way to do it?

I have tested the XPath selector

if response.xpath('//*[text()[contains(.,"red") or contains(.,"blue") or contains(.,"green")]]'):

and it works as expected. However, the actual set of keywords I am interested in is too large to be conveniently typed in manually as above. I want to automate the process by building my selector from the contents of a keyword file.

Starting from a text file with each keyword on its own line, how can I open that file and use it to check for the presence of those keywords in the text elements matched by a given XPath?
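
For illustration, the kind of construction I have in mind looks something like this (a rough sketch only; it assumes keys.txt holds one keyword per line, with no blank lines or embedded double quotes):

# build the XPath predicate from the keyword file
with open('keys.txt') as f:
    predicate = ' or '.join('contains(.,"%s")' % word.strip() for word in f)
selector = '//*[text()[%s]]' % predicate
# for red/blue/green this reproduces the manual selector above
if response.xpath(selector):
    pass  # keywords found; proceed with scraping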

I used the threads Xpath contains value A or value B and XPATH Multiple Element Filters to come up with my manual-input solution, but didn't find anything that addresses automating it.

Update

I'm not interested in just checking whether a given XPath contains any of the keywords in my list; I also want to use their presence as a prerequisite for scraping content from the page. My manually-entered version works as follows:

item_info = ItemLoader(item=info_categories(), response=response)
# scrape only if at least one keyword appears in some text node
if response.xpath('//*[text()[contains(.,"red") or contains(.,"blue") or contains(.,"green")]]'):
    item_info.add_xpath('title', './/some/x/path/text()')
    item_info.add_xpath('description', './/some/other/x/path/text()')
return item_info.load_item()

Whereas @alecxe's solution lets me validate the text of a page against a set of keywords, switching from print to if and trying to manipulate the information I am retrieving gives me SyntaxError: invalid syntax. Can I combine reading keywords from a file with the conditional scraping shown above?
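
To make the goal concrete, this is roughly what I am aiming for, grafting @alecxe's any() check onto my manual version above (untested sketch):

item_info = ItemLoader(item=info_categories(), response=response)
with open('keys.txt') as f:
    keywords = [word.strip() for word in f]
# scrape only if at least one keyword occurs anywhere in the page body
if any(word in response.body_as_unicode() for word in keywords):
    item_info.add_xpath('title', './/some/x/path/text()')
    item_info.add_xpath('description', './/some/other/x/path/text()')
return item_info.load_item()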

Update: investigating Frederic Bazin's regex solution

Over the past few days I have been working with regex to limit my parsing. My code, which adopts Frederic's suggestion with a few changes to account for errors, looks like this:

item_info = ItemLoader(item=info_categories(), response=response)
# build a 'red|blue|green'-style alternation from the keyword file
keywords = '|'.join(re.escape(word.strip()) for word in open('keys.txt'))
r = re.compile('.*(%s).*' % keywords, re.MULTILINE|re.UNICODE)
if r.match(response.body_as_unicode()):
    item_info.add_xpath('title', './/some/x/path/text()')
    item_info.add_xpath('description', './/some/other/x/path/text()')
return item_info.load_item()

This code runs without error, but Scrapy reports crawling 0 pages and scraping 0 items, so something is clearly going wrong.

I tried to debug by running this from the Scrapy shell. My results show that both the keywords and r steps behave as expected. If I define and call keywords using the method above on a .txt file containing the words red, blue, and green, I get 'red|blue|green'. Defining and calling r as above gives me <_sre.SRE_Pattern object at 0x17bc980>, which I believe is the expected result. However, when I run r.match(response.body_as_unicode()) I get nothing back, even on pages that I know contain one or more of my keywords.
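
For reference, the shell session looks roughly like this (reconstructed from the values above):

>>> import re
>>> keywords = '|'.join(re.escape(word.strip()) for word in open('keys.txt'))
>>> keywords
'red|blue|green'
>>> r = re.compile('.*(%s).*' % keywords, re.MULTILINE|re.UNICODE)
>>> r
<_sre.SRE_Pattern object at 0x17bc980>
>>> r.match(response.body_as_unicode())    # returns None, so nothing is printed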

Does anyone have any thoughts on what I'm missing here? As I understand it, whenever one of my keywords appears in response.body, a match should be found and Scrapy should proceed to extract information from that response using the XPaths I have defined. Obviously I'm wrong, but I'm not sure how or why.

Solution

I think I've finally gotten to the bottom of this problem. My conclusion is that the difficulty was caused by running r.match on response.body_as_unicode(). The documentation linked here says of match:

If zero or more characters at the beginning of string match the regular expression pattern, return a corresponding MatchObject instance. Return None if the string does not match the pattern; note that this is different from a zero-length match.

Note that even in MULTILINE mode, re.match() will only match at the beginning of the string and not at the beginning of each line.

This behavior was not appropriate for my situation. I am interested in identifying and scraping information from pages that contain my keywords anywhere, not just from pages where one of my keywords appears at the very start of the body. To accomplish this I needed re.search, which scans through the string until it finds a match for the compiled regex pattern and returns a MatchObject, or returns None if no match is found.
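
A minimal illustration of the difference, using a toy pattern of my own rather than anything from my spider:

import re

pat = re.compile('blue')
print pat.match('red and blue')     # None: 'blue' is not at the start of the string
print pat.search('red and blue')    # a MatchObject: 'blue' is found mid-string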

Below is my current (working!) code. Note that in addition to switching from match to search, I've added word boundaries to the keyword definition so that only whole-word matches count.

item_info = ItemLoader(item=info_categories(), response=response)
# \b on each side restricts matching to whole words only
with open('keys.txt') as f:
    keywords = '|'.join(r"\b" + re.escape(word.strip()) + r"\b" for word in f)
r = re.compile('.*(%s).*' % keywords, re.MULTILINE|re.UNICODE)
# search() scans the whole body, unlike match(), which is anchored to the start
if r.search(response.body_as_unicode()):
    item_info.add_xpath('title', './/some/x/path/text()')
    item_info.add_xpath('description', './/some/other/x/path/text()')
return item_info.load_item()
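
As a quick sanity check of the word-boundary addition (again a toy example of my own, not part of the spider):

import re

r = re.compile(r'\bred\b')
print r.search('a red door') is not None     # True: whole word
print r.search('one hundred') is not None    # False: 'red' sits inside 'hundred'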

      


2 answers


You can also check whether each keyword is inside response.body:

source = response.body
with open('input.txt') as f:
    for word in f:
        # prints each keyword and True/False for its presence in the raw body
        print word, word.strip() in source

      



Or, using any():

with open('input.txt') as f:
    print any(word.strip() in source for word in f)

      

A regex is probably the fastest way to run this test on a large number of pages:

import re
keywords = '|'.join(re.escape(word.strip()) for word in open('keywords.txt'))
r = re.compile('.*(%s).*' % keywords, re.MULTILINE|re.UNICODE)
if r.match(response.body_as_unicode()):
    pass  # at least one keyword occurs on the page; proceed with scraping



Generating an XPath expression for multiple keywords might also work, but it adds extra CPU overhead (typically ~100 ms) to parse the page as XML before running the XPath.
