Google Trends Scraping Off With R by Metro Area

I am using the following code in R to download data from Google Trends which I took mostly from here http://christophriedl.net/2013/08/22/google-trends-with-r/

    ############################################
##    Query GoogleTrends from R
##
## by Christoph Riedl, Northeastern University
## Additional help and bug-fixing re cookies by
## Philippe Massicotte Université du Québec à Trois-Rivières (UQTR)
############################################

# Load required libraries
library(RCurl)      # For getURL() and curl handler / cookie / google login
library(stringr)    # For str_trim() to trip whitespace from strings
# Google account settings
username <- "USERNAME"
password <- "PASSWORD"

# URLs
loginURL        <- "https://accounts.google.com/accounts/ServiceLogin"
authenticateURL <- "https://accounts.google.com/accounts/ServiceLoginAuth"
trendsURL       <- "http://www.google.com/trends/TrendsRepport?"

############################################
## This gets the GALX cookie which we need to pass back with the login form
############################################
getGALX <- function(curl) {
  txt = basicTextGatherer()
  curlPerform( url=loginURL, curl=curl, writefunction=txt$update, header=TRUE, ssl.verifypeer=FALSE )

  tmp <- txt$value()

  val <- grep("Cookie: GALX", strsplit(tmp, "\n")[[1]], val = TRUE)
  strsplit(val, "[:=;]")[[1]][3]

  return( strsplit( val, "[:=;]")[[1]][3]) 
}


############################################
## Function to perform Google login and get cookies ready
############################################
gLogin <- function(username, password) {
  ch <- getCurlHandle()

  ans <- (curlSetOpt(curl = ch,
                     ssl.verifypeer = FALSE,
                     useragent = getOption('HTTPUserAgent', "R"),
                     timeout = 60,         
                     followlocation = TRUE,
                     cookiejar = "./cookies",
                     cookiefile = ""))

  galx <- getGALX(ch)
  authenticatePage <- postForm(authenticateURL, .params=list(Email=username, Passwd=password, GALX=galx, PersistentCookie="yes", continue="http://www.google.com/trends"), curl=ch)

  authenticatePage2 <- getURL("http://www.google.com", curl=ch)

  if(getCurlInfo(ch)$response.code == 200) {
    print("Google login successful!")
  } else {
    print("Google login failed!")
  }
  return(ch)
}

##

# returns string w/o leading or trailing whitespace
trim <- function (x) gsub("^\\s+|\\s+$", "", x)

get_interest_over_time <- function(res, clean.col.names = TRUE) {
  # remove all text before "Interest over time" data block begins
  data <- gsub(".*Interest over time", "", res)

  # remove all text after "Interest over time" data block ends
  data <- gsub("\n\n.*", "", data)

  # convert "interest over time" data block into data.frame
  data.df <- read.table(text = data, sep =",", header=TRUE)

  # Split data range into to only end of week date 
  data.df$Week <- gsub(".*\\s-\\s", "", data.df$Week)
  data.df$Week <- as.Date(data.df$Week)

  # clean column names
  if(clean.col.names == TRUE) colnames(data.df) <- gsub("\\.\\..*", "", colnames(data.df))


  # return "interest over time" data.frame
  return(data.df)
}

############################################
## Read data for a query
############################################
ch <- gLogin( username, password )
authenticatePage2 <- getURL("http://www.google.com", curl=ch)

res <- getForm(trendsURL, q="sugar", geo="US", content=1, export=1, graph="all_csv", curl=ch)
# Check if quota limit reached
if( grepl( "You have reached your quota limit", res ) ) {
  stop( "Quota limit reached; You should wait a while and try again lateer" )
}
df <- get_interest_over_time(res)
head(df)

write.csv(df,"sugar.csv")

      

When I search for USA only or any one country everything works fine, but I need more split data in Metropolitan Area. However, I cannot seem to get these requests to work with this script. Whenever I do this by typing, for example, "US-IL" in a geofield, I get an error:

Error in read.table(text = data, sep = ",", header = TRUE) : 
more columns than column names 

      

The same thing happens if I try to trend the metropolitan area (for example using, for example, "US-IL-602" for Chicago). Does anyone know how I can modify this script to make it work?

Many thanks,

Brian.

+3


source to share





All Articles