R: Reading tab-delimited data with duplicated tabs

I need to read a fairly large delimited text file in R (about two gigabytes). The problem is that the file contains a lot of duplicated tabs (two consecutive tabs with nothing in between). They seem to cause problems in that some (all?) of them are interpreted as end of line.

Since the data is huge, I have uploaded a small part to illustrate the problem, see the code below.

count.fields(file = "http://m.uploadedit.com/ba3c/1429271380882.txt", sep = "\t")
read.table(file = "http://m.uploadedit.com/ba3c/1429271380882.txt", 
       header = TRUE, sep = "\t")

      

Thanks for your help.

Edit: The example doesn't fully illustrate the original problem. In the full data, every line should have a total of 6312 fields, but when I run count.fields() on it, the lines break into a 4571 - 1741 - 4571 - 1741 - ... pattern, i.e. with an extra line ending after field number 4571.



3 answers


Line breaks (\n) seem to be randomly scattered across the column names. If we look at the first five or so occurrences of \n in the file with substr() and gregexpr(), the results look strange:

library(readr) # useful pkg to read files
df <- read_file("http://m.uploadedit.com/ba3c/1429271380882.txt")

> substr(df, gregexpr("\n", df)[[1]][1]-10, gregexpr("\n", df)[[1]][1]+10)
[1] "1-024.Top \nAlleles\tCF"

> substr(df, gregexpr("\n", df)[[1]][2]-10, gregexpr("\n", df)[[1]][2]+10)
[1] "053.Theta\t\nCFF01-053."

> substr(df, gregexpr("\n", df)[[1]][3]-10, gregexpr("\n", df)[[1]][3]+10)
[1] "CFF01-072.\nTop Allele"

> substr(df, gregexpr("\n", df)[[1]][4]-10, gregexpr("\n", df)[[1]][4]+10)
[1] "CFF01-086.\nTheta\tCFF0"

> substr(df, gregexpr("\n", df)[[1]][5]-10, gregexpr("\n", df)[[1]][5]+10)
[1] "ype\tCFF01-\n303.Top Al"
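The repeated substr()/gregexpr() calls above can be wrapped in a small helper. This is just a sketch of the same idea; show_newline_context() is a hypothetical name, not an existing function:

```r
# Show `width` characters of context around the first `n` newlines.
show_newline_context <- function(txt, n = 5, width = 10) {
  pos <- gregexpr("\n", txt, fixed = TRUE)[[1]]   # positions of all newlines
  pos <- head(pos, n)                             # keep the first n of them
  substring(txt, pmax(pos - width, 1), pos + width)
}

show_newline_context("one\ttwo\nthree\tfour\nfive", n = 2, width = 4)
```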

      

So the problem is apparently not the two consecutive \t characters, but the randomly scattered line breaks, which obviously break the read.table parser.
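A minimal in-memory check (synthetic data, not the real file) supports this: two consecutive tabs on their own are parsed as an empty field, not as an end of line:

```r
# Two consecutive tabs yield an empty field, not a premature line end.
con <- textConnection("a\tb\tc\nx\t\tz")
n <- count.fields(con, sep = "\t")
close(con)
n  # both lines report 3 fields
```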



But if the randomly scattered line breaks are the problem, we can remove all of them and reinsert them at the correct positions. The following code correctly reads the posted example data. You will probably need to find a suitable regular expression for the ID_REF variable to automatically insert \n before each ID string, in case the full data contains more IDs than the example:

library(readr)

df <- read_file("http://m.uploadedit.com/ba3c/1429271380882.txt")

df <- gsub("\n", "", df)            # remove all (misplaced) line breaks
df <- gsub("abph1", "\nabph1", df)  # reinsert a line break before each ID
df <- read_delim(df, delim = "\t")

      



Check the file for quotes and comments. By default, tabs and other separators that appear inside quotes (or after a comment character) are not counted. The fact that the field counts alternate between two values that add up to the correct total (4571 + 1741 = 6312) suggests there is a quote character after field 4570 on each line. The parser reads the first 4570 fields of line 1, sees the quote, and reads the rest of line 1 together with the first 4570 fields of line 2 as a single field; it then reads the remaining 1741 fields of line 2 as separate fields, and the pattern repeats for lines 3 and 4, and so on.



count.fields and read.table, and their related functions, have arguments for setting the quote and comment characters. Setting both to empty strings tells R to ignore quotes and comments, which is a quick way to test my theory.
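As a minimal sketch of that test (synthetic data, not the real file): a quote character at the start of a field swallows separators and the line break, while quote = "" and comment.char = "" restore per-line counts:

```r
txt <- "a\tb\t\"c\nd\"\te\tf"   # a quoted field spanning the line break

con <- textConnection(txt)
merged <- count.fields(con, sep = "\t")   # default quoting: the two
close(con)                                # physical lines parse as one record

con <- textConnection(txt)
plain <- count.fields(con, sep = "\t", quote = "", comment.char = "")
close(con)
plain  # 3 fields on each physical line
```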



Well, I didn't get to the root of the problem, but I found that you have duplicated row names in the table. I loaded your data into the R workspace like this:

to.load <- readLines("http://m.uploadedit.com/ba3c/1429271380882.txt")
data <- read.csv(text = to.load, sep = "\t", nrows = length(to.load) - 1, row.names = NULL)

      
