RCurl fails to load URL content
Page loading is not working. Here is the error I am getting:
Error in which(value == defs) :
argument "code" is missing, with no default
Here is my code:
require(RCurl)
require(XML)
ok <- "http://www.okcupid.com/match?filter1=0,34&filter2=2,22,40&filter3=3,5&filter4=5,3600&filter5=9,486&filter6=1,1&locid=4265540&lquery=San%20Francisco,%20California&timekey=1&matchOrderBy=MATCH&custom_search=0&fromWhoOnline=0&mygender=m&update_prefs=1&sort_type=0&sa=1&using_saved_search=&count=50"
okc <- getURL(ok, encoding="UTF-8") #Download the page
okcHTML <- htmlParse(okc, asText = TRUE, encoding = "utf-8")
If you want to live on the cutting edge of the Hadleyverse, rvest does this well enough:
library(rvest)
ok_search <- "https://www.okcupid.com/match?filter1=0,34&filter2=2,22,40&filter3=3,5&filter4=5,3600&filter5=9,486&filter6=1,1&locid=4265540&lquery=San%20Francisco,%20California&timekey=1&matchOrderBy=MATCH&custom_search=0&fromWhoOnline=0&mygender=m&update_prefs=1&sort_type=0&sa=1&using_saved_search=&count=50"
pg <- html_session(ok_search)
pg %>% html_nodes("div.profile_info") %>% html_text()
## [1] " phenombom 32·San Francisco, CA " " sylvea 24·San Francisco, CA "
## [3] " haafu 40·San Francisco, CA " " Rebamania 31·San Francisco, CA "
## [5] " Brilikedacheese 26·San Francisco, CA " " cloudhunteress 23·San Francisco, CA "
## [7] " Lizzieisdizzy 28·San Francisco, CA " " liddybird80 34·San Francisco, CA "
## [9] " wander_found 32·San Francisco, CA " " Crunchyisinabox 31·San Francisco, CA "
...
I'll write up why the direct RCurl approach (rvest wraps RCurl) doesn't work.
UPDATE
I went one level deeper and used httr (another abstraction over RCurl):
library(httr)
library(XML)
res <- GET(ok_search)
ok_html <- content(res, as="parsed")
xpathSApply(ok_html, "//div[@class='profile_info']", xmlValue)
It returns the same as above, so it works fine too.
UPDATE / RESOLVED
library(RCurl)
library(XML)
okc <- getURL(ok, followlocation=TRUE)
ok_html <- htmlParse(okc)
xpathSApply(ok_html , "//div[@class='profile_info']", xmlValue)
You need to add followlocation=TRUE. The original URL triggers a 302 response (the server sends a redirect from http to https), and RCurl does not follow redirects by default, so htmlParse never received the actual page. httr and rvest seem to enable that option by default.
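To check whether a redirect is what is biting you, you can look at the status line and the Location header of the response. Below is a minimal base-R sketch, not part of RCurl: redirect_info is a hypothetical helper name, and the sample header text is an abbreviated stand-in for what the server sends. With RCurl, passing header=TRUE to getURL should include such header text in the result, but treat that as an assumption to verify on your setup.

```r
# Hypothetical helper (not an RCurl function): extract the status code and
# the redirect target from raw HTTP response header text.
redirect_info <- function(header_text) {
  # Header lines are separated by CRLF (or just LF)
  lines <- strsplit(header_text, "\r?\n")[[1]]
  # The first line is the status line, e.g. "HTTP/1.1 302"
  status <- as.integer(sub("^HTTP/[0-9.]+ +([0-9]{3}).*$", "\\1", lines[1]))
  # A redirect response names its target in the Location header
  loc <- grep("^Location:", lines, ignore.case = TRUE, value = TRUE)
  target <- if (length(loc) > 0) {
    sub("^Location: *", "", loc[1], ignore.case = TRUE)
  } else {
    NA_character_
  }
  list(status = status,
       is_redirect = status >= 300 && status < 400,
       location = target)
}

# Abbreviated sample headers (assumed for illustration; query string shortened)
sample_headers <- paste(
  "HTTP/1.1 302",
  "Content-Type: application/octet-stream",
  "Location: https://www.okcupid.com/match?count=50",
  sep = "\r\n"
)
info <- redirect_info(sample_headers)
```

If info$is_redirect is TRUE, either request info$location directly or set followlocation=TRUE so curl follows the redirect for you.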
You can pass verbose=TRUE to getURL to see the request and the server's response as console messages:
## * Adding handle: conn: 0x114ade000
## * Adding handle: send: 0
## * Adding handle: recv: 0
## * Curl_addHandleToPipeline: length: 1
## * - Conn 12 (0x114ade000) send_pipe: 1, recv_pipe: 0
## * About to connect() to www.okcupid.com port 80 (#12)
## * Trying 198.41.209.131...
## * Connected to www.okcupid.com (198.41.209.131) port 80 (#12)
## > GET /match?filter1=0,34&filter2=2,22,40&filter3=3,5&filter4=5,3600&filter5=9,486&filter6=1,1&locid=4265540&lquery=San%20Francisco,%20California&timekey=1&matchOrderBy=MATCH&custom_search=0&fromWhoOnline=0&mygender=m&update_prefs=1&sort_type=0&sa=1&using_saved_search=&count=50 HTTP/1.1
## User-Agent: curl/7.30.0 Rcurl/1.95.4.3
## Host: www.okcupid.com
## Accept: */*
##
## < HTTP/1.1 302
## < Date: Mon, 20 Oct 2014 20:07:12 GMT
## < Content-Type: application/octet-stream
## < Transfer-Encoding: chunked
## < Connection: keep-alive
## < Set-Cookie: __cfduid=d0d55f2c9c990d97b0d02dba7148881741413835631999; expires=Mon, 23-Dec-2019 23:50:00 GMT; path=/; domain=.okcupid.com; HttpOnly
## < X-OKWS-Version: OKWS/3.1.30.2
## < Location: https://www.okcupid.com/match?filter1=0,34&filter2=2,22,40&filter3=3,5&filter4=5,3600&filter5=9,486&filter6=1,1&locid=4265540&lquery=San%20Francisco,%20California&timekey=1&matchOrderBy=MATCH&custom_search=0&fromWhoOnline=0&mygender=m&update_prefs=1&sort_type=0&sa=1&using_saved_search=&count=50
## < P3P: CP="NOI CURa ADMa DEVa TAIa OUR BUS IND UNI COM NAV INT", policyref="http://www.okcupid.com/w3c/p3p.xml"
## < X-XSS-Protection: 1; mode=block
## < Set-Cookie: guest=10834912674894888479; Expires=Tue, 20 Oct 2015 20:07:12 GMT; Path=/; Domain=okcupid.com; HttpOnly
## * Server cloudflare-nginx is not blacklisted
## < Server: cloudflare-nginx
## < CF-RAY: 17c7d71bf1880412-EWR
## <
## * Connection #12 to host www.okcupid.com left intact
This is very useful for debugging problems like this. With httr or rvest, you can pass verbose() to the URL-fetching functions to get the same kind of output.