R reads a comma txt file with comma inside one column

I have some user behavior logs. This is from the data collector and apparently it used commas to separate the variables. However, some URLs have commas inside. I cannot read the txt file in R.

20091,2009-06-02 22:06:14,84,taobao.com,search1.taobao.com,http://search1.taobao.com/browse/0/n-g,grdsa2kqn5scattbnzxq-------2-------b--40--commend-0-all-0.htm?at_topsearch=1&ssid=e-s1,www.taobao.com,shopping,e-commerce,C2C
20092,2009-06-16 12:25:35,8,sohu.com,www.wap.sohu.com,http://www.wap.sohu.com/info/index.html?url=http://wap.sohu.com/sports/pic/?lpn=1&resIdx=0&nid=336&rid=KL39,PD21746&v=2&ref=901981387,www.sohu.com,portal,entertainment,mobile
20092,2009-06-07 16:02:03,14,eetchina.com,www.powersystems.eetchina.com,http://www.powersystems.eetchina.com/ART_8800533274_2600005_TA_346f6b13.HTM?click_from=8800024853,8875136323,2009-05-26,PSCOL,ARTICLE_ALERT,,others,marketing,enterprise
20096,2009-06-30 07:51:38,7,taobao.com,search1.taobao.com,http://search1.taobao.com/browse/0/n-1----------------------0----------------------g,zhh3viy-g,ywtmf7glxeqnhjgt263ps-------2-------b--40--commend-0-all-0.htm?ssid=p1-s1,search1.taobao.com,shopping,e-commerce,C2C
2009184,2009-06-25 14:40:39,6,mktginc.com,surv.mktginc.com,,,unknown,unknown,unknown
20092,2009-06-07 15:13:06,32,ccb.com.cn,ibsbjstar.ccb.com.cn,https://ibsbjstar.ccb.com.cn/app/V5/CN/STY1/login.jsp,,e-bank,finance,e-bank

      

The above URLs should be:

http://search1.taobao.com/browse/0/n-g,grdsa2kqn5scattbnzxq-------2-------b--40--commend-0-all-0.htm?at_topsearch=1&ssid=e-s1
http://www.wap.sohu.com/info/index.html?url=http://wap.sohu.com/sports/pic/?lpn=1&resIdx=0&nid=336&rid=KL39,PD21746&v=2&ref=901981387
http://www.powersystems.eetchina.com/ART_8800533274_2600005_TA_346f6b13.HTM?click_from=8800024853,8875136323,2009-05-26,PSCOL,ARTICLE_ALERT
http://search1.taobao.com/browse/0/n-1----------------------0----------------------g,zhh3viy-g,ywtmf7glxeqnhjgt263ps-------2-------b--40--commend-0-all-0.htm?ssid=p1-s1

https://ibsbjstar.ccb.com.cn/app/V5/CN/STY1/login.jsp

      

How can I tell there are exactly 10 variables on each line and put commas in the url? Thank!

df <- read.table('2009.txt', sep= ',', quote= '', comment.char= '', stringsAsFactors= F)
Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings,  : line 130 did not have 10 elements

      

+3


source to share


3 answers


If you are reading data so that each row is one row (for example sep="\n"

), you can process each row just before inserting the corresponding dataframe.

If only the 6th entry can have commas (it looks like the other URLs are only primary domains), then something like the following might work:



d <- strsplit(d, ",")

for (i in 1:length(d)) {
  x <- d[[i]]
  n <- length(x)
  if (n > 10) {
    d[[i]] <- c(x[1:5], paste(x[6:(n-4)], collapse=","), x[(n-3):n])
  }
}

d <- do.call(rbind,lapply(d, matrix, ncol=10, byrow=TRUE))

      

If other URLs might have a problem, this approach might still work, but it can be very complicated.

+3


source


You may try:

  dat <- read.table(text=gsub("http:.*(?=(,www)|,,)(*SKIP)(*F)|,", "*",
           Lines, perl=TRUE), sep="*", header=FALSE, stringsAsFactors=FALSE)


  dat
  #    V1                  V2 V3           V4                            V5
  #1 20091 2009-06-02 22:06:14 84   taobao.com            search1.taobao.com
  #2 20092 2009-06-16 12:25:35  8     sohu.com              www.wap.sohu.com
  #3 20092 2009-06-07 16:02:03 14 eetchina.com www.powersystems.eetchina.com
   #                     V6
  #1               http://search1.taobao.com/browse/0/n-g,grdsa2kqn5scattbnzxq------- 2-------b--40--commend-0-all-0.htm?at_topsearch=1&ssid=e-s1
  #2       http://www.wap.sohu.com/info/index.html?url=http://wap.sohu.com/sports/pic/?lpn=1&resIdx=0&nid=336&rid=KL39,PD21746&v=2&ref=901981387
  #3 http://www.powersystems.eetchina.com/ART_8800533274_2600005_TA_346f6b13.HTM?click_from=8800024853,8875136323,2009-05-26,PSCOL,ARTICLE_ALERT
  #            V7       V8            V9        V10
  #1 www.taobao.com shopping    e-commerce        C2C
  #2   www.sohu.com   portal entertainment     mobile
  #3                  others     marketing enterprise

      

data

 Lines <-  readLines(textConnection(txt)) #(`txt` from @Richard Scriven)

      

Update



Using a new dataset

 indx <- grep("http", Lines)
 Lines1 <- Lines[indx]
 pat1 <- paste(unique(gsub(".*http[s]?.{3}(\\w+)\\..*", "\\1", Lines1)), collapse="|")
 pat1N <-  paste0("http:.*(?=,(", pat1, "|,))(*SKIP)(*F)|,") 

 dat1 <-  read.table(text=gsub(pat1N, "*", Lines, perl=TRUE),
                   sep="*", header=FALSE, stringsAsFactors=FALSE)

 dat1
 #           V1                  V2 V3           V4                            V5
 #1   20091 2009-06-02 22:06:14 84   taobao.com            search1.taobao.com
 #2   20092 2009-06-16 12:25:35  8     sohu.com              www.wap.sohu.com
 #3   20092 2009-06-07 16:02:03 14 eetchina.com www.powersystems.eetchina.com
 #4   20096 2009-06-30 07:51:38  7   taobao.com            search1.taobao.com
 #5 2009184 2009-06-25 14:40:39  6  mktginc.com              surv.mktginc.com
 #6   20092 2009-06-07 15:13:06 32   ccb.com.cn          ibsbjstar.ccb.com.cn
 #                                     V6
 # 1                                            http://search1.taobao.com/browse/0/n-g,grdsa2kqn5scattbnzxq-------2-------b--40--commend-0-all-0.htm?at_topsearch=1&ssid=e-s1
 # 2                                    http://www.wap.sohu.com/info/index.html?url=http://wap.sohu.com/sports/pic/?lpn=1&resIdx=0&nid=336&rid=KL39,PD21746&v=2&ref=901981387
 # 3                              http://www.powersystems.eetchina.com/ART_8800533274_2600005_TA_346f6b13.HTM?click_from=8800024853,8875136323,2009-05-26,PSCOL,ARTICLE_ALERT
 # 4 http://search1.taobao.com/browse/0/n-1----------------------0----------------------g,zhh3viy-g,ywtmf7glxeqnhjgt263ps-------2-------b--40--commend-0-all-0.htm?ssid=p1-s1
 #5                                                                                                                                                                         
 #6                                                                                                                       https://ibsbjstar.ccb.com.cn/app/V5/CN/STY1/login.jsp
#                 V7       V8            V9        V10
#1     www.taobao.com shopping    e-commerce        C2C
#2       www.sohu.com   portal entertainment     mobile
#3                      others     marketing enterprise
#4 search1.taobao.com shopping    e-commerce        C2C
#5                     unknown       unknown    unknown
#6                      e-bank       finance     e-bank

      

data

 txt <- '20091,2009-06-02 22:06:14,84,taobao.com,search1.taobao.com,http://search1.taobao.com/browse/0/n-g,grdsa2kqn5scattbnzxq-------2-------b--40--commend-0-all-0.htm?at_topsearch=1&ssid=e-s1,www.taobao.com,shopping,e-commerce,C2C
20092,2009-06-16 12:25:35,8,sohu.com,www.wap.sohu.com,http://www.wap.sohu.com/info/index.html?url=http://wap.sohu.com/sports/pic/?lpn=1&resIdx=0&nid=336&rid=KL39,PD21746&v=2&ref=901981387,www.sohu.com,portal,entertainment,mobile
20092,2009-06-07 16:02:03,14,eetchina.com,www.powersystems.eetchina.com,http://www.powersystems.eetchina.com/ART_8800533274_2600005_TA_346f6b13.HTM?click_from=8800024853,8875136323,2009-05-26,PSCOL,ARTICLE_ALERT,,others,marketing,enterprise
20096,2009-06-30 07:51:38,7,taobao.com,search1.taobao.com,http://search1.taobao.com/browse/0/n-1----------------------0----------------------g,zhh3viy-g,ywtmf7glxeqnhjgt263ps-------2-------b--40--commend-0-all-0.htm?ssid=p1-s1,search1.taobao.com,shopping,e-commerce,C2C
2009184,2009-06-25 14:40:39,6,mktginc.com,surv.mktginc.com,,,unknown,unknown,unknown
20092,2009-06-07 15:13:06,32,ccb.com.cn,ibsbjstar.ccb.com.cn,https://ibsbjstar.ccb.com.cn/app/V5/CN/STY1/login.jsp,,e-bank,finance,e-bank'

  Lines <- readLines(textConnection(txt))

      

+3


source


It looks like it might work for you.

txt <- '20091,2009-06-02 22:06:14,84,taobao.com,search1.taobao.com,http://search1.taobao.com/browse/0/n-g,grdsa2kqn5scattbnzxq-------2-------b--40--commend-0-all-0.htm?at_topsearch=1&ssid=e-s1,www.taobao.com,shopping,e-commerce,C2C
  20092,2009-06-16 12:25:35,8,sohu.com,www.wap.sohu.com,http://www.wap.sohu.com/info/index.html?url=http://wap.sohu.com/sports/pic/?lpn=1&resIdx=0&nid=336&rid=KL39,PD21746&v=2&ref=901981387,www.sohu.com,portal,entertainment,mobile
  20092,2009-06-07 16:02:03,14,eetchina.com,www.powersystems.eetchina.com,http://www.powersystems.eetchina.com/ART_8800533274_2600005_TA_346f6b13.HTM?click_from=8800024853,8875136323,2009-05-26,PSCOL,ARTICLE_ALERT,,others,marketing,enterprise'

readLog <-  function(file, stringsAsFactors = TRUE)
{
    s <- strsplit(readLines(file), ",")
    loop <- t(sapply(s, function(x) {
            g <- grep("http", x)
            x[g] <- paste(x[g:(length(x)-4)], collapse = ",")
            x[-c((g+1):(length(x)-4))]
        }))
    data.frame(loop, stringsAsFactors = stringsAsFactors)
}
## readLog(textConnection(txt))
readLog(yourFile)

      

This gives the following in the 6th column and each row has 10 columns

                                                                        V6
1               http://search1.taobao.com/browse/0/n-g,grdsa2kqn5scattbnzxq-------2-------b--40--commend-0-all-0.htm?at_topsearch=1&ssid=e-s1
2       http://www.wap.sohu.com/info/index.html?url=http://wap.sohu.com/sports/pic/?lpn=1&resIdx=0&nid=336&rid=KL39,PD21746&v=2&ref=901981387
3 http://www.powersystems.eetchina.com/ART_8800533274_2600005_TA_346f6b13.HTM?click_from=8800024853,8875136323,2009-05-26,PSCOL,ARTICLE_ALERT                                                                                    

      

And from 7 to 10

              V7       V8            V9        V10
1 www.taobao.com shopping    e-commerce        C2C
2   www.sohu.com   portal entertainment     mobile
3                  others     marketing enterprise

      

+1


source







All Articles