Most efficient way to read key value pairs where values span multiple rows?
What is the fastest way to parse a text file like the one below into a two-column data.frame, which is then converted to wide format?
FN Thomson Reuters Web of Science™
VR 1.0
PT J
AU Panseri, Sara
   Chiesa, Luca Maria
   Brizzolari, Andrea
   Santaniello, Enzo
   Passero, Elena
   Biondi, Pier Antonio
TI Improved determination of malonaldehyde by high-performance liquid
   chromatography with UV detection as 2,3-diaminonaphthalene derivative
SO JOURNAL OF CHROMATOGRAPHY B-ANALYTICAL TECHNOLOGIES IN THE BIOMEDICAL
   AND LIFE SCIENCES
VL 976
BP 91
EP 95
DI 10.1016/j.jchromb.2014.11.017
PD JAN 22 2015
PY 2015
Using readLines is problematic because the continuation lines of multi-line fields have no keys. Reading the file as a fixed-width table doesn't work either. Suggestions? If it weren't for the multi-line problem, this would be easy to do with a function that works on each line / record, like this:
x <- "FN Thomson Reuters Web of Science"
re <- "^([^\\s]+)\\s*(.*)$"
key <- sub(re, "\\1", x, perl=TRUE)
value <- sub(re, "\\2", x, perl=TRUE)
data.frame(key, value)
  key                          value
1  FN Thomson Reuters Web of Science
Notes: The field keys are always two characters, in uppercase. The entire title can be combined into one cell, and so can the list of authors.
Here's another idea that might be helpful if you want to stay in base R:
parseEntry <- function(entry) {
    ## Split at beginning of each line that starts with a non-space character
    ll <- strsplit(entry, "\\n(?=\\S)", perl=TRUE)[[1]]
    ## Join continuation lines: swap the newline and 3-space indent for one space
    ll <- gsub("\\n\\s{3}", " ", ll)
    ## Split each field into its two components (2-character key, rest of line)
    read.fwf(textConnection(ll), c(2, max(nchar(ll))))
}
## Read in and collapse entry into one long character string.
## (If file contained more than one entry, you could preprocess it accordingly.)
ee <- paste(readLines("egFile.txt"), collapse="\n")
## Parse the entry
parseEntry(ee)
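If the file holds more than one entry, the preprocessing mentioned in the comment above could look roughly like the sketch below. It assumes records are separated by blank lines, which the question does not state, and the second record is made up for illustration. Each element of the hypothetical `entries` vector has the same shape as `ee`.

```r
## Hypothetical two-record input; the blank-line separator is an assumption,
## and the second record is invented sample data.
lines <- c(
  "PT J",
  "AU Panseri, Sara",
  "   Chiesa, Luca Maria",
  "PY 2015",
  "",
  "PT J",
  "AU Doe, Jane",
  "PY 2014"
)

## Number the records: every blank line bumps the record counter.
is_blank  <- grepl("^\\s*$", lines)
record_id <- cumsum(is_blank)

## Drop the separator lines and collapse each record into one string,
## the same shape that paste(readLines(...), collapse = "\n") produces.
entries <- vapply(
  split(lines[!is_blank], record_id[!is_blank]),
  paste, character(1), collapse = "\n"
)
```

Each record could then be parsed with `lapply(entries, parseEntry)`.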
Read the lines of the file into a character vector with readLines and add a colon to each key. The result is in DCF format, so we can read it with read.dcf, the function used to read R package DESCRIPTION files. The result of read.dcf is wide, a matrix with one column per key. Finally, we create long, a long-format data.frame with one row per key:
L <- readLines("myfile.dat")
L <- sub("^(\\S\\S)", "\\1:", L)
wide <- read.dcf(textConnection(L))
long <- data.frame(key = colnames(wide), value = wide[1,], stringsAsFactors = FALSE)
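To try these steps without a file on disk, here is a minimal sketch that runs them on an inline excerpt of the record. It assumes continuation lines are indented with whitespace (three spaces here), which read.dcf needs in order to fold them into the preceding field. The final `gsub` is an addition, not part of the answer: it flattens any line breaks kept inside folded values.

```r
## Inline excerpt standing in for "myfile.dat" (an assumption for this demo);
## continuation lines are indented by three spaces.
txt <- c(
  "PT J",
  "AU Panseri, Sara",
  "   Chiesa, Luca Maria",
  "TI Improved determination of malonaldehyde by high-performance liquid",
  "   chromatography with UV detection as 2,3-diaminonaphthalene derivative",
  "PY 2015"
)

## Turn the two-character key at the start of a line into a DCF tag ("PT:").
L <- sub("^(\\S\\S)", "\\1:", txt)

## read.dcf folds the indented continuation lines into the preceding field.
con  <- textConnection(L)
wide <- read.dcf(con)
close(con)

## One row per key; row.names = NULL avoids reusing the keys as row names.
long <- data.frame(key = colnames(wide), value = wide[1, ],
                   stringsAsFactors = FALSE, row.names = NULL)

## Added step: replace any line breaks left inside folded values with a space.
long$value <- gsub("\\s*\n\\s*", " ", long$value)
```

Because `wide` is already a one-row matrix with a column per key, no separate long-to-wide reshaping step is needed for a single record.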