NameError: name "hxs" is not defined when using Scrapy

I launched the Scrapy shell and pinged Wikipedia successfully.

scrapy shell http://en.wikipedia.org/wiki/Main_Page

I am sure this step worked, judging by Scrapy's verbose output.

Next, I would like to see what happens when I write

hxs.select('/html').extract()

At this point, I am getting the error:

NameError: name 'hxs' is not defined

What is the problem? I know Scrapy is installed fine and it accepted the destination URL, so why does the hxs command fail?



3 answers


I suspect you are using a version of Scrapy that no longer has the hxs shortcut.

Use sel instead (itself deprecated after 0.24; see below):

$ scrapy shell http://en.wikipedia.org/wiki/Main_Page
>>> sel.xpath('//title/text()').extract()[0]
u'Wikipedia, the free encyclopedia'


Or, as of Scrapy 1.0, use the response object directly, with its convenience selector methods .xpath() and .css():



$ scrapy shell http://en.wikipedia.org/wiki/Main_Page
>>> response.xpath('//title/text()').extract()[0]
u'Wikipedia, the free encyclopedia'


FYI, a quote from Using Selectors in the Scrapy documentation:

... after loading the shell, you will have the response available as the shell variable response, and its attached selector in the response.selector attribute.
...
Querying responses using XPath and CSS is so common that responses include two convenience shortcuts: response.xpath() and response.css():

>>> response.xpath('//title/text()')
[<Selector (text) xpath=//title/text()>]
>>> response.css('title::text')
[<Selector (text) xpath=//title/text()>]


Check the verbose output the Scrapy shell prints when it starts.

$ scrapy shell http://en.wikipedia.org/wiki/Main_Page


If your output looks like this:



2014-09-20 23:02:14-0400 [scrapy] INFO: Scrapy 0.14.4 started (bot: scrapybot)
2014-09-20 23:02:14-0400 [scrapy] DEBUG: Enabled extensions: TelnetConsole, CloseSpider, WebService, CoreStats, MemoryUsage, SpiderState
2014-09-20 23:02:15-0400 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, RedirectMiddleware, CookiesMiddleware, HttpCompressionMiddleware, ChunkedTransferMiddleware, DownloaderStats
2014-09-20 23:02:15-0400 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2014-09-20 23:02:15-0400 [scrapy] DEBUG: Enabled item pipelines: 
2014-09-20 23:02:15-0400 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2014-09-20 23:02:15-0400 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2014-09-20 23:02:15-0400 [default] INFO: Spider opened
2014-09-20 23:02:15-0400 [default] DEBUG: Crawled (200) <GET http://en.wikipedia.org/wiki/Main_Page> (referer: None)
[s] Available Scrapy objects:
[s]   hxs        <HtmlXPathSelector xpath=None data=u'<html lang="en" dir="ltr" class="client-'>
[s]   item       {}
[s]   request    <GET http://en.wikipedia.org/wiki/Main_Page>
[s]   response   <200 http://en.wikipedia.org/wiki/Main_Page>
[s]   settings   <CrawlerSettings module=None>
[s]   spider     <BaseSpider 'default' at 0xb5d95d8c>
[s] Useful shortcuts:
[s]   shelp()           Shell help (print this help)
[s]   fetch(req_or_url) Fetch request (or URL) and update local objects
[s]   view(response)    View response in a browser
Python 2.7.6 (default, Mar 22 2014, 22:59:38) 
Type "copyright", "credits" or "license" for more information.


then the startup output lists the Available Scrapy objects, and whether you have hxs or sel depends on what that list shows. In your case hxs is not available, so you will need to use sel (the shortcut in newer Scrapy versions). In short: hxs works for some installations, while others must use sel.



The "sel" shortcut is deprecated too; use response.xpath('/html').extract() instead.


