Most efficient way to read key value pairs where values span multiple rows?

Question

What is the fastest way to parse a text file like the one below into a two-column data.frame, which is then converted to wide format?

FN Thomson Reuters Web of Science™
VR 1.0
PT J
AU Panseri, Sara
   Chiesa, Luca Maria
   Brizzolari, Andrea
   Santaniello, Enzo
   Passero, Elena
   Biondi, Pier Antonio
TI Improved determination of malonaldehyde by high-performance liquid
   chromatography with UV detection as 2,3-diaminonaphthalene derivative
SO JOURNAL OF CHROMATOGRAPHY B-ANALYTICAL TECHNOLOGIES IN THE BIOMEDICAL
   AND LIFE SCIENCES
VL 976
BP 91
EP 95
DI 10.1016/j.jchromb.2014.11.017
PD JAN 22 2015
PY 2015

Using readLines is problematic because the continuation lines of multi-line fields have no keys. Reading the file as a fixed-width table also doesn't work. Suggestions? If it weren't for the multi-line problem, this would be easy to accomplish with a function that works on each line / record like this:

x <- "FN Thomson Reuters Web of Science"
re <- "^([^\\s]+)\\s*(.*)$"
key <- sub(re, "\\1", x, perl=TRUE)
value <- sub(re, "\\2", x, perl=TRUE)
data.frame(key, value)
  key                          value
1  FN Thomson Reuters Web of Science

Notes: the field keys are always two uppercase characters. The entire title and the list of authors can each be combined into one cell.

+3

r dataframe

Maiasaura Apr 29 '15 at 19:25

3 answers

This should work:

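library(zoo)

## Split each line into a 2-character key (V1) and the rest of the line (V2)
x <- read.fwf(file="tempSO.txt",widths=c(2,500),as.is=TRUE)

## Continuation lines have a blank key: mark it NA, then carry the last key forward
x$V1[x$V1=="  "] <- NA
x$V1 <- na.locf(x$V1)

## Paste together all pieces of V2 that share a key
res <- aggregate(V2 ~ V1, data = x, FUN = paste, collapse = "")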
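
For comparison, the same fill-forward idea can be sketched in base R without zoo. This is only an illustration under assumptions: the sample lines are typed in by hand rather than read from a file, and the names lines, is_start, field_id, long and wide are made up for this sketch, not taken from the answer above.

```r
# Hand-typed sample in the question's format (assumed; no file needed)
lines <- c(
  "PT J",
  "AU Panseri, Sara",
  "   Chiesa, Luca Maria",
  "TI Improved determination of malonaldehyde by high-performance liquid",
  "   chromatography with UV detection",
  "PY 2015"
)
lines <- lines[nzchar(lines)]         # drop empty lines, if any

key   <- substr(lines, 1, 2)          # first two characters: key or blanks
value <- trimws(substring(lines, 4))  # rest of the line, padding removed

# A new field starts wherever the key is not blank; cumsum() numbers the fields
is_start <- key != "  "
field_id <- cumsum(is_start)

# Paste the pieces of each field together, keeping the keys in file order
pieces <- split(value, field_id)
long <- data.frame(
  key   = key[is_start],
  value = unname(vapply(pieces, paste, character(1), collapse = " ")),
  stringsAsFactors = FALSE
)

# Wide format: one row, one column per key
wide <- as.data.frame(as.list(setNames(long$value, long$key)),
                      stringsAsFactors = FALSE)
```

Unlike aggregate, which sorts its output by key, this keeps the keys in the order they appear in the file.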

+5

zx8754 Apr 29 '15 at 19:58

Read the lines of the file into a character vector with readLines and add a colon after each key. The result is in DCF format, so we can read it with read.dcf, the function used to read R package DESCRIPTION files. The result of read.dcf is wide, a matrix with one column per key. Finally, we create long, a data.frame with one row per key:

L <- readLines("myfile.dat")
L <- sub("^(\\S\\S)", "\\1:", L)
wide <- read.dcf(textConnection(L))
long <- data.frame(key = colnames(wide), value = wide[1,], stringsAsFactors = FALSE)
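
A small self-contained run of the same approach, using in-memory lines instead of myfile.dat. The sample text and the whitespace-collapsing step are assumptions added for illustration; the gsub call simply squeezes any line breaks or runs of spaces left inside multi-line values into single spaces.

```r
# Hand-typed sample in the question's format (assumed, not read from a file)
L <- c(
  "PT J",
  "AU Panseri, Sara",
  "   Chiesa, Luca Maria",
  "TI Improved determination of malonaldehyde by high-performance liquid",
  "   chromatography with UV detection as 2,3-diaminonaphthalene derivative",
  "PY 2015"
)

# Turn "AU value" into "AU: value"; indented continuation lines are untouched
L <- sub("^(\\S\\S)", "\\1:", L)

# Parse as DCF: one row per record, one column per key
con <- textConnection(L)
wide <- read.dcf(con)
close(con)

# Squeeze line breaks and repeated spaces inside multi-line values
value <- gsub("\\s+", " ", wide[1, ])

long <- data.frame(key = colnames(wide), value = unname(value),
                   stringsAsFactors = FALSE)
```

As far as I can tell from the read.dcf documentation, DCF records are separated by empty lines, so a file with several blank-line-separated entries should come back as one row of wide per entry.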

+3

G. Grothendieck Apr 30 at 1:42

Josh O'Brien · Accepted Answer · 2015-04-29T20:20:52+0000

Here's another idea that might be helpful if you want to stay in base R:

parseEntry <- function(entry) {
    ## Split at beginning of each line that starts with a non-space character
    ll <- strsplit(entry, "\\n(?=\\S)", perl=TRUE)[[1]]
    ## Join continuation lines: replace the newline plus 3-space indent with a
    ## single space so that words from adjacent lines do not run together
    ll <- gsub("\\n\\s{3}", " ", ll)
    ## Split each field into its two components
    read.fwf(textConnection(ll), c(2, max(nchar(ll))))
}

## Read in and collapse entry into one long character string.
## (If file contained more than one entry, you could preprocess it accordingly.)
ee <- paste(readLines("egFile.txt"), collapse="\n")
## Parse the entry
parseEntry(ee)
