Most efficient way to read key-value pairs where values span multiple lines?

What is the fastest way to parse a text file like the one below into a two-column data.frame, which is then converted to wide format?

FN Thomson Reuters Web of Science™
VR 1.0
PT J
AU Panseri, Sara
   Chiesa, Luca Maria
   Brizzolari, Andrea
   Santaniello, Enzo
   Passero, Elena
   Biondi, Pier Antonio
TI Improved determination of malonaldehyde by high-performance liquid
   chromatography with UV detection as 2,3-diaminonaphthalene derivative
SO JOURNAL OF CHROMATOGRAPHY B-ANALYTICAL TECHNOLOGIES IN THE BIOMEDICAL
   AND LIFE SCIENCES
VL 976
BP 91
EP 95
DI 10.1016/j.jchromb.2014.11.017
PD JAN 22 2015
PY 2015

      

Using readLines is problematic because the continuation lines of multi-line fields have no keys. Reading the file as a fixed-width table does not work either. Suggestions? If it weren't for the multi-line problem, this would be easy to do with a function that works on each line (record), like this:

x <- "FN Thomson Reuters Web of Science"
re <- "^([^\\s]+)\\s*(.*)$"
key <- sub(re, "\\1", x, perl=TRUE)
value <- sub(re, "\\2", x, perl=TRUE)
data.frame(key, value)
key                          value
1  FN Thomson Reuters Web of Science

      

Note: the field keys are always uppercase and two characters long. The full title and the list of authors can each be combined into one cell.



3 answers


Here's another idea that might be helpful if you want to stay in base R:



parseEntry <- function(entry) {
    ## Split at beginning of each line that starts with a non-space character    
    ll <- strsplit(entry, "\\n(?=\\S)", perl=TRUE)[[1]]
    ## Replace the newline and indent of continuation lines with a single space,
    ## so that words on adjacent lines are not glued together
    ll <- gsub("\\n(\\s){3}", " ", ll)
    ## Split each field into its two components
    read.fwf(textConnection(ll), c(2, max(nchar(ll))))
}

## Read in and collapse entry into one long character string.
## (If file contained more than one entry, you could preprocess it accordingly.)
ee <- paste(readLines("egFile.txt"), collapse="\n")
## Parse the entry
parseEntry(ee)
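
As a quick check of the splitting idea without a file, here is a minimal sketch on an inline string. The sample text and the variable names (entry, fields, keys, values) are made up for illustration, and continuation lines are joined with a single space, which is only one possible choice:

```r
## Hypothetical inline entry in the same layout as the question
entry <- "PT J\nAU Panseri, Sara\n   Chiesa, Luca Maria\nPY 2015"

## Split only at newlines that are followed by a non-space character,
## so continuation lines stay attached to their field
fields <- strsplit(entry, "\\n(?=\\S)", perl = TRUE)[[1]]

## Join continuation lines: newline plus indent becomes one space
fields <- gsub("\\n\\s+", " ", fields)

## The first two characters are the key, the text after the separator is the value
keys   <- substr(fields, 1, 2)
values <- substring(fields, 4)

long <- data.frame(key = keys, value = values, stringsAsFactors = FALSE)
print(long)
```

Here the lookahead `(?=\\S)` is what keeps the indented lines from becoming fields of their own.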

      



This should work:



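library(zoo)  # provides na.locf()

## Read as fixed width: a 2-character key, then the rest of the line
x <- read.fwf(file="tempSO.txt",widths=c(2,500),as.is=TRUE)

## Continuation lines have a blank key: mark it NA, then carry the last key forward
x$V1[x$V1=="  "] <- NA
x$V1 <- na.locf(x$V1)

## Paste the pieces of each field together, giving one row per key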
res <- aggregate(V2 ~ V1, data = x, FUN = paste, collapse = "")
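
If you would rather avoid the zoo dependency, the same carry-the-key-forward idea can be sketched in base R. This is only an illustration under assumptions: the sample lines are typed inline, the names (lines, is_new, field_id, long, wide) are hypothetical, and the pieces of a field are joined with a space. Unlike aggregate, it keeps the keys in file order:

```r
## Hypothetical sample lines in the same layout as the question
lines <- c(
  "PT J",
  "AU Panseri, Sara",
  "   Chiesa, Luca Maria",
  "TI Improved determination of malonaldehyde by high-performance liquid",
  "   chromatography with UV detection",
  "PY 2015"
)

key   <- substr(lines, 1, 2)          # two-character key, blank on continuation lines
value <- trimws(substring(lines, 4))  # remainder of the line

## A line starts a new field when its key is not blank;
## cumsum() then labels every line with the index of its field
is_new   <- grepl("\\S", key)
field_id <- cumsum(is_new)

## Paste the lines of each field together, in file order
joined <- vapply(split(value, field_id), paste, character(1), collapse = " ")

long <- data.frame(key = key[is_new], value = unname(joined),
                   stringsAsFactors = FALSE)

## One-row wide version: one column per key
wide <- as.data.frame(as.list(setNames(long$value, long$key)),
                      stringsAsFactors = FALSE)
print(long)
```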

      



Read the lines of the file into a character vector with readLines and append a colon to each key. The result is in DCF format, so we can read it with read.dcf, which is the function used to read R package DESCRIPTION files. The result of read.dcf is wide, a matrix with one column per key. Finally, we create long, a long data.frame with one row per key:

L <- readLines("myfile.dat")
L <- sub("^(\\S\\S)", "\\1:", L)
wide <- read.dcf(textConnection(L))
long <- data.frame(key = colnames(wide), value = wide[1,], stringsAsFactors = FALSE)
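
To try the DCF approach without a file, here is a small sketch on inline lines. The sample data and the whitespace clean-up step are my additions, not part of the answer above: read.dcf may keep the line breaks of continuation lines inside a value, so the sketch collapses any run of whitespace to a single space before building the long table.

```r
## Hypothetical sample lines in the same layout as the question
L <- c(
  "PT J",
  "AU Panseri, Sara",
  "   Chiesa, Luca Maria",
  "PY 2015"
)

## Append a colon to each two-character key; indented lines are left alone
## and are treated by read.dcf as continuation lines
L <- sub("^(\\S\\S)", "\\1:", L)

con  <- textConnection(L)
wide <- read.dcf(con)   # character matrix, one column per key
close(con)

## Normalize embedded line breaks and indentation to single spaces
long <- data.frame(
  key   = colnames(wide),
  value = unname(gsub("\\s+", " ", wide[1, ])),
  stringsAsFactors = FALSE
)
print(long)
```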

      
