Xpath to get information in next sibling tag using Scrapy

I am trying to learn Scrapy and am currently trying to extract information from the etymology website http://www.etymonline.com. Right now, I just want to get the words and their raw descriptions. This is how a typical HTML snippet is represented on etymonline:

<dt><a href="/index.php?term=address&allowed_in_frame=0">address (n.)</a> <a href="http://dictionary.reference.com/search?q=address" class="dictionary" title="Look up address at Dictionary.com"><img src="graphics/dictionary.gif" width="16" height="16" alt="Look up address at Dictionary.com" title="Look up address at Dictionary.com" /></a></dt> <dd>1530s, "dutiful or courteous approach," from <a href="/index.php?term=address&allowed_in_frame=0" class="crossreference">address</a> (v.) and from French <span class="foreign">adresse</span>. Sense of "formal speech" is from 1751. Sense of "superscription of a letter" is from 1712 and led to the meaning "place of residence" (1888).</dd>

The word is contained in the <dt> tag, and the description in its next sibling, <dd>. To get a list of the words on a page like http://www.etymonline.com/index.php?l=a&p=9&allowed_in_frame=0, one can write word = sel.xpath('//dl/dt/a/text()').extract().

Then I tried to iterate over this word list and extract the relevant information using info = selInfo.xpath("//dl/dt[a='"+word[i]+"']/following-sibling::dd"), but it doesn't seem to work. Any ideas?

+3




3 answers


To go from a <dt> to the <dd> after it, you are right that you can use the following-sibling axis.

following-sibling::dd selects all dd elements that come after the context node, so you need to restrict the XPath to just the first one using the position predicate [1].

So for each dt element you select with //dl/dt, you select following-sibling::dd[1].
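Outside of Scrapy, the effect of the [1] predicate is easy to check with a minimal lxml snippet (the tiny <dl> below is made up for illustration):

```python
# Minimal illustration of restricting following-sibling with [1], using lxml
from lxml import html

doc = html.fromstring(
    '<dl>'
    '<dt>first</dt><dd>def 1</dd>'
    '<dt>second</dt><dd>def 2</dd>'
    '</dl>'
)

dt = doc.xpath('//dl/dt')[0]
# Without [1], every later <dd> sibling of the first <dt> is selected:
print([d.text for d in dt.xpath('following-sibling::dd')])     # ['def 1', 'def 2']
# With [1], only the nearest following <dd>:
print([d.text for d in dt.xpath('following-sibling::dd[1]')])  # ['def 1']
```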



Here's an example scrapy shell session searching for the term "address":

$ scrapy shell "http://www.etymonline.com/index.php?allowed_in_frame=0&search=address&searchmode=none"
...
2014-11-26 10:34:53+0100 [default] DEBUG: Crawled (200) <GET http://www.etymonline.com/index.php?allowed_in_frame=0&search=address&searchmode=none> (referer: None)
[s] Available Scrapy objects:
[s]   crawler    <scrapy.crawler.Crawler object at 0x7f1396cc6950>
[s]   item       {}
[s]   request    <GET http://www.etymonline.com/index.php?allowed_in_frame=0&search=address&searchmode=none>
[s]   response   <200 http://www.etymonline.com/index.php?allowed_in_frame=0&search=address&searchmode=none>
[s]   settings   <scrapy.settings.Settings object at 0x7f1397399bd0>
[s]   spider     <Spider 'default' at 0x7f13966c05d0>
[s] Useful shortcuts:
[s]   shelp()           Shell help (print this help)
[s]   fetch(req_or_url) Fetch request (or URL) and update local objects
[s]   view(response)    View response in a browser

In [1]: for dt in response.xpath('//dl/dt'):
    print "Word:", dt.xpath('string(a)').extract()
    print "Definition:", dt.xpath('string(following-sibling::dd[1])').extract()
    print
   ...:     
Word: [u'address (n.)']
Definition: [u'1530s, "dutiful or courteous approach," from address (v.) and from French adresse. Sense of "formal speech" is from 1751. Sense of "superscription of a letter" is from 1712 and led to the meaning "place of residence" (1888).']

Word: [u'addressee (n.)']
Definition: [u'1810; see address (v.) + -ee.']

Word: [u'address (v.)']
Definition: [u'early 14c., "to guide or direct," from Old French adrecier "go straight toward; straighten, set right; point, direct" (13c.), from Vulgar Latin *addirectiare "make straight," from Latin ad "to" (see ad-) + *directiare, from Latin directus "straight, direct" (see direct (v.)). Late 14c. as "to set in order, repair, correct." Meaning "to write as a destination on a written message" is from mid-15c. Meaning "to direct spoken words (to someone)" is from late 15c. Related: Addressed; addressing.']

Word: [u'salutatorian (n.)']
Definition: [u'1841, American English, from salutatory "of the nature of a salutation," here in the specific sense "designating the welcoming address given at a college commencement" (1702) + -ian. The address was originally usually in Latin and given by the second-ranking graduating student.']

...

Word: [u'reverend (adj.)']
Definition: [u'early 15c., "worthy of respect," from Middle French reverend, from Latin reverendus "(he who is) to be respected," gerundive of revereri (see reverence). As a form of address for clergymen, it is attested from late 15c.; earlier reverent (late 14c. in this sense). Abbreviation Rev. is attested from 1721, earlier Revd. (1690s). Very Reverend is used of deans, Right Reverend of bishops, Most Reverend of archbishops.']

Word: [u'nun (n.)']
Definition: [u'Old English nunne "nun, vestal, pagan priestess, woman devoted to religious life under vows," from Late Latin nonna "nun, tutor," originally (along with masc. nonnus) a term of address to elderly persons, perhaps from children\'s speech, reminiscent of nana (compare Sanskrit nona, Persian nana "mother," Greek nanna "aunt," Serbo-Croatian nena "mother," Italian nonna, Welsh nain "grandmother;" see nanny).']


In [2]: 

      

+3




The idea is not to loop over the extracted word list, but to loop over the parent nodes in the XPath.

I currently don't have Scrapy on my Mac, but the same technique applies here:



# I use lxml for loose html string parsing
from lxml import html

s = '''<dt><a href="/index.php?term=address&allowed_in_frame=0">address (n.)</a> <a href="http://dictionary.reference.com/search?q=address" class="dictionary" title="Look up address at Dictionary.com"><img src="graphics/dictionary.gif" width="16" height="16" alt="Look up address at Dictionary.com" title="Look up address at Dictionary.com" /></a></dt>
<dd>1530s, "dutiful or courteous approach," from <a href="/index.php?term=address&allowed_in_frame=0" class="crossreference">address</a> (v.) and from French <span class="foreign">adresse</span>. Sense of "formal speech" is from 1751. Sense of "superscription of a letter" is from 1712 and led to the meaning "place of residence" (1888).</dd>'''

sel = html.fromstring(s)

# rather than extracting the words straight away, you loop from the parent xpath
for nodes in sel.xpath('//dt'):
    # then access a node to get the text
    print nodes.xpath('a/text()')
    # and go back to parent and search the dd node
    print nodes.xpath('../dd/text()')

# sample results
['address (n.)']
['1530s, "dutiful or courteous approach," from ', ' (v.) and from French ', '. Sense of "formal speech" is from 1751. Sense of "superscription of a letter" is from 1712 and led to the meaning "place of residence" (1888).']
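Note that text() only returns the text nodes sitting directly under <dd>, which is why the cross-reference links split the description into fragments above. If you want the definition as one string, lxml's text_content() (the equivalent of the XPath string() function used in the first answer) flattens the whole element, e.g.:

```python
# Joining the fragmented <dd> text with lxml's text_content()
from lxml import html

s = ('<div><dt><a href="#">address (n.)</a></dt>'
     '<dd>1530s, from <a href="#">address</a> (v.) and from French '
     '<span class="foreign">adresse</span>.</dd></div>')

root = html.fromstring(s)
dd = root.xpath('//dd')[0]

# text() yields only the direct text nodes, split around child elements:
print(dd.xpath('text()'))       # ['1530s, from ', ' (v.) and from French ', '.']
# text_content() flattens the element, links included, into one string:
print(dd.text_content())        # 1530s, from address (v.) and from French adresse.
```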

      

Hope it helps.

+1




A solution using the following sibling:

class SingleSpider(scrapy.Spider):
    name = "etym"
    allowed_domains = ["etymonline.com"]
    start_urls = [
        "http://www.etymonline.com/index.php?l=d&allowed_in_frame=0"]

    def parse(self, response):


        for nodes in response.xpath('//dl'):
            for i in nodes.xpath('dt'):
                print i.xpath('a/text()').extract()   
                print i.xpath('following-sibling::dd[1]/text()').extract()    

      

Basically:

  • you get the dt elements one by one
  • print the text contained in the link
  • go to the next sibling and print the text it contains

Here's an excerpt from the output:

[u'daiquiri (n.) ']
[u'1920, type of alcoholic beverage (first recorded in F. Scott Fitzgerald), from ', u', the name of a district or village in eastern Cuba.']

[u'dairy (n.) ']
[u'late 13c., "building for making butter and cheese; dairy farm," formed with Anglo-French ', u' attached to Middle English ', u' (in ', u' "dairymaid"), from Old English ', u' "kneader of bread, housekeeper, female servant" (see ', u' (n.1)). The native word was ', u'.']

[u'dais (n.) ']
[u'mid-13c., from Anglo-French ', u', Old French ', u' "table, platform," from Latin ', u' "disk-shaped object," in Medieval Latin "table," from Greek ', u' "quoit, disk, dish" (see ', u' (n.)). Died out in English c. 1600, preserved in Scotland, revived 19c. by antiquarians.']

+1








