NameError: name "hxs" is not defined when using Scrapy
I launched the Scrapy shell and fetched the Wikipedia main page successfully:
scrapy shell http://en.wikipedia.org/wiki/Main_Page
I am sure this step worked, judging by the verbose output Scrapy printed.
Next, I would like to see what happens when I write
hxs.select('/html').extract()
At this point, I am getting the error:
NameError: name 'hxs' is not defined
What is the problem? I know Scrapy is installed correctly and it accepted the URL, so why does the name hxs fail?
I suspect you are using a version of Scrapy that no longer has the hxs shortcut.
Use sel instead (itself deprecated after 0.24, see below):
$ scrapy shell http://en.wikipedia.org/wiki/Main_Page
>>> sel.xpath('//title/text()').extract()[0]
u'Wikipedia, the free encyclopedia'
Or, as of Scrapy 1.0, use the response object directly, with its convenience methods .xpath() and .css():
$ scrapy shell http://en.wikipedia.org/wiki/Main_Page
>>> response.xpath('//title/text()').extract()[0]
u'Wikipedia, the free encyclopedia'
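Outside the Scrapy shell, you can get a feel for what that XPath query extracts with a quick sketch. This uses only the standard library and a tiny stand-in HTML snippet (an assumption, just for illustration); Scrapy's real selectors are lxml-based and support full XPath, while ElementTree only handles a small subset, but `.//title` is enough to mimic `//title/text()` here:

```python
from xml.etree import ElementTree as ET

# Minimal stand-in for the Wikipedia page (illustrative only, not the real markup).
html = "<html><head><title>Wikipedia, the free encyclopedia</title></head><body /></html>"

# ElementTree supports a limited XPath subset; './/title' finds the title element,
# and .text gives what //title/text() would extract in Scrapy.
root = ET.fromstring(html)
title = root.find('.//title').text
print(title)
```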
FYI, a quote from Using Selectors in the Scrapy documentation:
... after loading the shell, you will have the response available as the shell variable response, and its attached selector in the response.selector attribute.
...
Querying responses using XPath and CSS is so common that responses include two convenience shortcuts: response.xpath() and response.css():
>>> response.xpath('//title/text()')
[<Selector (text) xpath=//title/text()>]
>>> response.css('title::text')
[<Selector (text) xpath=//title/text()>]
Check the verbose output Scrapy prints when the shell starts.
$ scrapy shell http://en.wikipedia.org/wiki/Main_Page
If your output looks like this:
2014-09-20 23:02:14-0400 [scrapy] INFO: Scrapy 0.14.4 started (bot: scrapybot)
2014-09-20 23:02:14-0400 [scrapy] DEBUG: Enabled extensions: TelnetConsole, CloseSpider, WebService, CoreStats, MemoryUsage, SpiderState
2014-09-20 23:02:15-0400 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, RedirectMiddleware, CookiesMiddleware, HttpCompressionMiddleware, ChunkedTransferMiddleware, DownloaderStats
2014-09-20 23:02:15-0400 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2014-09-20 23:02:15-0400 [scrapy] DEBUG: Enabled item pipelines:
2014-09-20 23:02:15-0400 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2014-09-20 23:02:15-0400 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2014-09-20 23:02:15-0400 [default] INFO: Spider opened
2014-09-20 23:02:15-0400 [default] DEBUG: Crawled (200) <GET http://en.wikipedia.org/wiki/Main_Page> (referer: None)
[s] Available Scrapy objects:
[s] hxs <HtmlXPathSelector xpath=None data=u'<html lang="en" dir="ltr" class="client-'>
[s] item {}
[s] request <GET http://en.wikipedia.org/wiki/Main_Page>
[s] response <200 http://en.wikipedia.org/wiki/Main_Page>
[s] settings <CrawlerSettings module=None>
[s] spider <BaseSpider 'default' at 0xb5d95d8c>
[s] Useful shortcuts:
[s] shelp() Shell help (print this help)
[s] fetch(req_or_url) Fetch request (or URL) and update local objects
[s] view(response) View response in a browser
Python 2.7.6 (default, Mar 22 2014, 22:59:38)
Type "copyright", "credits" or "license" for more information.
the output lists the Available Scrapy objects, so whether hxs or sel exists depends on what that list shows. hxs is not available in your case, so you will need to use sel (provided by newer versions of Scrapy). In short, hxs works for some versions, and sel is what others need to use.
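The idea above, picking whichever selector object your shell actually defines, can be sketched outside the shell as well. This is a minimal simulation: FakeSelector and shell_vars are hypothetical stand-ins for the objects the Scrapy shell binds, not real Scrapy API:

```python
# Sketch: simulate the shell namespace to show how to fall back across versions.
class FakeSelector:
    """Hypothetical stand-in for the shell's selector/response object."""
    def xpath(self, query):
        return self          # real selectors return a SelectorList
    def extract(self):
        return ['Wikipedia, the free encyclopedia']

# A modern shell defines `response`; older shells define `sel` or `hxs`.
shell_vars = {'response': FakeSelector()}

selector = (shell_vars.get('response')
            or shell_vars.get('sel')
            or shell_vars.get('hxs'))
print(selector.xpath('//title/text()').extract()[0])
```

In the real shell you would achieve the same fallback with try/except NameError around the bare names.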