R reads a comma txt file with comma inside one column
I have some user behavior logs. This is from the data collector and apparently it used commas to separate the variables. However, some URLs have commas inside. I cannot read the txt file in R.
20091,2009-06-02 22:06:14,84,taobao.com,search1.taobao.com,http://search1.taobao.com/browse/0/n-g,grdsa2kqn5scattbnzxq-------2-------b--40--commend-0-all-0.htm?at_topsearch=1&ssid=e-s1,www.taobao.com,shopping,e-commerce,C2C
20092,2009-06-16 12:25:35,8,sohu.com,www.wap.sohu.com,http://www.wap.sohu.com/info/index.html?url=http://wap.sohu.com/sports/pic/?lpn=1&resIdx=0&nid=336&rid=KL39,PD21746&v=2&ref=901981387,www.sohu.com,portal,entertainment,mobile
20092,2009-06-07 16:02:03,14,eetchina.com,www.powersystems.eetchina.com,http://www.powersystems.eetchina.com/ART_8800533274_2600005_TA_346f6b13.HTM?click_from=8800024853,8875136323,2009-05-26,PSCOL,ARTICLE_ALERT,,others,marketing,enterprise
20096,2009-06-30 07:51:38,7,taobao.com,search1.taobao.com,http://search1.taobao.com/browse/0/n-1----------------------0----------------------g,zhh3viy-g,ywtmf7glxeqnhjgt263ps-------2-------b--40--commend-0-all-0.htm?ssid=p1-s1,search1.taobao.com,shopping,e-commerce,C2C
2009184,2009-06-25 14:40:39,6,mktginc.com,surv.mktginc.com,,,unknown,unknown,unknown
20092,2009-06-07 15:13:06,32,ccb.com.cn,ibsbjstar.ccb.com.cn,https://ibsbjstar.ccb.com.cn/app/V5/CN/STY1/login.jsp,,e-bank,finance,e-bank
The above URLs should be:
http://search1.taobao.com/browse/0/n-g,grdsa2kqn5scattbnzxq-------2-------b--40--commend-0-all-0.htm?at_topsearch=1&ssid=e-s1
http://www.wap.sohu.com/info/index.html?url=http://wap.sohu.com/sports/pic/?lpn=1&resIdx=0&nid=336&rid=KL39,PD21746&v=2&ref=901981387
http://www.powersystems.eetchina.com/ART_8800533274_2600005_TA_346f6b13.HTM?click_from=8800024853,8875136323,2009-05-26,PSCOL,ARTICLE_ALERT
http://search1.taobao.com/browse/0/n-1----------------------0----------------------g,zhh3viy-g,ywtmf7glxeqnhjgt263ps-------2-------b--40--commend-0-all-0.htm?ssid=p1-s1
https://ibsbjstar.ccb.com.cn/app/V5/CN/STY1/login.jsp
How can I tell there are exactly 10 variables on each line and put commas in the url? Thank!
df <- read.table('2009.txt', sep= ',', quote= '', comment.char= '', stringsAsFactors= F)
Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, : line 130 did not have 10 elements
source to share
If you are reading data so that each row is one row (for example sep="\n"
), you can process each row just before inserting the corresponding dataframe.
If only the 6th entry can have commas (it looks like the other URLs are only primary domains), then something like the following might work:
d <- strsplit(d, ",")
for (i in 1:length(d)) {
x <- d[[i]]
n <- length(x)
if (n > 10) {
d[[i]] <- c(x[1:5], paste(x[6:(n-4)], collapse=","), x[(n-3):n])
}
}
d <- do.call(rbind,lapply(d, matrix, ncol=10, byrow=TRUE))
If other URLs might have a problem, this approach might still work, but it can be very complicated.
source to share
You may try:
dat <- read.table(text=gsub("http:.*(?=(,www)|,,)(*SKIP)(*F)|,", "*",
Lines, perl=TRUE), sep="*", header=FALSE, stringsAsFactors=FALSE)
dat
# V1 V2 V3 V4 V5
#1 20091 2009-06-02 22:06:14 84 taobao.com search1.taobao.com
#2 20092 2009-06-16 12:25:35 8 sohu.com www.wap.sohu.com
#3 20092 2009-06-07 16:02:03 14 eetchina.com www.powersystems.eetchina.com
# V6
#1 http://search1.taobao.com/browse/0/n-g,grdsa2kqn5scattbnzxq------- 2-------b--40--commend-0-all-0.htm?at_topsearch=1&ssid=e-s1
#2 http://www.wap.sohu.com/info/index.html?url=http://wap.sohu.com/sports/pic/?lpn=1&resIdx=0&nid=336&rid=KL39,PD21746&v=2&ref=901981387
#3 http://www.powersystems.eetchina.com/ART_8800533274_2600005_TA_346f6b13.HTM?click_from=8800024853,8875136323,2009-05-26,PSCOL,ARTICLE_ALERT
# V7 V8 V9 V10
#1 www.taobao.com shopping e-commerce C2C
#2 www.sohu.com portal entertainment mobile
#3 others marketing enterprise
data
Lines <- readLines(textConnection(txt)) #(`txt` from @Richard Scriven)
Update
Using a new dataset
indx <- grep("http", Lines)
Lines1 <- Lines[indx]
pat1 <- paste(unique(gsub(".*http[s]?.{3}(\\w+)\\..*", "\\1", Lines1)), collapse="|")
pat1N <- paste0("http:.*(?=,(", pat1, "|,))(*SKIP)(*F)|,")
dat1 <- read.table(text=gsub(pat1N, "*", Lines, perl=TRUE),
sep="*", header=FALSE, stringsAsFactors=FALSE)
dat1
# V1 V2 V3 V4 V5
#1 20091 2009-06-02 22:06:14 84 taobao.com search1.taobao.com
#2 20092 2009-06-16 12:25:35 8 sohu.com www.wap.sohu.com
#3 20092 2009-06-07 16:02:03 14 eetchina.com www.powersystems.eetchina.com
#4 20096 2009-06-30 07:51:38 7 taobao.com search1.taobao.com
#5 2009184 2009-06-25 14:40:39 6 mktginc.com surv.mktginc.com
#6 20092 2009-06-07 15:13:06 32 ccb.com.cn ibsbjstar.ccb.com.cn
# V6
# 1 http://search1.taobao.com/browse/0/n-g,grdsa2kqn5scattbnzxq-------2-------b--40--commend-0-all-0.htm?at_topsearch=1&ssid=e-s1
# 2 http://www.wap.sohu.com/info/index.html?url=http://wap.sohu.com/sports/pic/?lpn=1&resIdx=0&nid=336&rid=KL39,PD21746&v=2&ref=901981387
# 3 http://www.powersystems.eetchina.com/ART_8800533274_2600005_TA_346f6b13.HTM?click_from=8800024853,8875136323,2009-05-26,PSCOL,ARTICLE_ALERT
# 4 http://search1.taobao.com/browse/0/n-1----------------------0----------------------g,zhh3viy-g,ywtmf7glxeqnhjgt263ps-------2-------b--40--commend-0-all-0.htm?ssid=p1-s1
#5
#6 https://ibsbjstar.ccb.com.cn/app/V5/CN/STY1/login.jsp
# V7 V8 V9 V10
#1 www.taobao.com shopping e-commerce C2C
#2 www.sohu.com portal entertainment mobile
#3 others marketing enterprise
#4 search1.taobao.com shopping e-commerce C2C
#5 unknown unknown unknown
#6 e-bank finance e-bank
data
txt <- '20091,2009-06-02 22:06:14,84,taobao.com,search1.taobao.com,http://search1.taobao.com/browse/0/n-g,grdsa2kqn5scattbnzxq-------2-------b--40--commend-0-all-0.htm?at_topsearch=1&ssid=e-s1,www.taobao.com,shopping,e-commerce,C2C
20092,2009-06-16 12:25:35,8,sohu.com,www.wap.sohu.com,http://www.wap.sohu.com/info/index.html?url=http://wap.sohu.com/sports/pic/?lpn=1&resIdx=0&nid=336&rid=KL39,PD21746&v=2&ref=901981387,www.sohu.com,portal,entertainment,mobile
20092,2009-06-07 16:02:03,14,eetchina.com,www.powersystems.eetchina.com,http://www.powersystems.eetchina.com/ART_8800533274_2600005_TA_346f6b13.HTM?click_from=8800024853,8875136323,2009-05-26,PSCOL,ARTICLE_ALERT,,others,marketing,enterprise
20096,2009-06-30 07:51:38,7,taobao.com,search1.taobao.com,http://search1.taobao.com/browse/0/n-1----------------------0----------------------g,zhh3viy-g,ywtmf7glxeqnhjgt263ps-------2-------b--40--commend-0-all-0.htm?ssid=p1-s1,search1.taobao.com,shopping,e-commerce,C2C
2009184,2009-06-25 14:40:39,6,mktginc.com,surv.mktginc.com,,,unknown,unknown,unknown
20092,2009-06-07 15:13:06,32,ccb.com.cn,ibsbjstar.ccb.com.cn,https://ibsbjstar.ccb.com.cn/app/V5/CN/STY1/login.jsp,,e-bank,finance,e-bank'
Lines <- readLines(textConnection(txt))
source to share
It looks like it might work for you.
txt <- '20091,2009-06-02 22:06:14,84,taobao.com,search1.taobao.com,http://search1.taobao.com/browse/0/n-g,grdsa2kqn5scattbnzxq-------2-------b--40--commend-0-all-0.htm?at_topsearch=1&ssid=e-s1,www.taobao.com,shopping,e-commerce,C2C
20092,2009-06-16 12:25:35,8,sohu.com,www.wap.sohu.com,http://www.wap.sohu.com/info/index.html?url=http://wap.sohu.com/sports/pic/?lpn=1&resIdx=0&nid=336&rid=KL39,PD21746&v=2&ref=901981387,www.sohu.com,portal,entertainment,mobile
20092,2009-06-07 16:02:03,14,eetchina.com,www.powersystems.eetchina.com,http://www.powersystems.eetchina.com/ART_8800533274_2600005_TA_346f6b13.HTM?click_from=8800024853,8875136323,2009-05-26,PSCOL,ARTICLE_ALERT,,others,marketing,enterprise'
readLog <- function(file, stringsAsFactors = TRUE)
{
s <- strsplit(readLines(file), ",")
loop <- t(sapply(s, function(x) {
g <- grep("http", x)
x[g] <- paste(x[g:(length(x)-4)], collapse = ",")
x[-c((g+1):(length(x)-4))]
}))
data.frame(loop, stringsAsFactors = stringsAsFactors)
}
## readLog(textConnection(txt))
readLog(yourFile)
This gives the following in the 6th column and each row has 10 columns
V6
1 http://search1.taobao.com/browse/0/n-g,grdsa2kqn5scattbnzxq-------2-------b--40--commend-0-all-0.htm?at_topsearch=1&ssid=e-s1
2 http://www.wap.sohu.com/info/index.html?url=http://wap.sohu.com/sports/pic/?lpn=1&resIdx=0&nid=336&rid=KL39,PD21746&v=2&ref=901981387
3 http://www.powersystems.eetchina.com/ART_8800533274_2600005_TA_346f6b13.HTM?click_from=8800024853,8875136323,2009-05-26,PSCOL,ARTICLE_ALERT
And from 7 to 10
V7 V8 V9 V10
1 www.taobao.com shopping e-commerce C2C
2 www.sohu.com portal entertainment mobile
3 others marketing enterprise
source to share