Convert single column and multiple row data to multiple columns multiple columns
I have a data output with a web list in R that looks like this
Name1
Email: email1@xyz.com
City/Town: Location1
Name2
Email: email2@abc.com
City/Town: Location2
Name3
Email: email3@pqr.com
City/Town: Location3
Some names may not include an email address or location. I want to convert the data above to tabular format. The result should look like
Name Email City/Town
Name1 email1@xyz.com Location1
Name2 email2@abc.com Location2
Name3 email3@pqr.com Location3
Name4 Location4
Name5 email5@abc.com
source to share
Using:
txt <- readLines(txt)
library(data.table)
library(zoo)
dt <- data.table(txt = txt)
dt[!grepl(':', txt), name := txt
][, name := na.locf(name)
][grepl('^Email:', txt), email := sub('Email: ','',txt)
][grepl('^City/Town:', txt), city_town := sub('City/Town: ','',txt)
][txt != name, lapply(.SD, function(x) toString(na.omit(x))), by = name, .SDcols = c('email','city_town')]
gives:
name email city_town
1: Name1 email1@xyz.com Location1
2: Name2 email2@abc.com Location2
3: Name3 email3@pqr.com Location3
4: Name4 Location4
5: Name5 email5@abc.com
This also works with real names. With @uweBlock data, you get:
name email city_town
1: John Doe email1@xyz.com Location1
2: Save the World Fund email2@abc.com Location2
3: Best Shoes Ltd. email3@pqr.com Location3
4: Mother Location4
5: Jane email5@abc.com
And with multiple keys in the section (again with @UweBlock data):
name email city_town
1: John Doe email1@xyz.com, email1@abc.com Location1
2: Save the World Fund email2@abc.com Location2
3: Best Shoes Ltd. email3@pqr.com Location3
4: Mother Location4, everywhere
5: Jane email5@abc.com
Data used:
txt <- textConnection("Name1
Email: email1@xyz.com
City/Town: Location1
Name2
Email: email2@abc.com
City/Town: Location2
Name3
Email: email3@pqr.com
City/Town: Location3
Name4
City/Town: Location4
Name5
Email: email5@abc.com")
source to share
Insert \nName:
before each name and then read it in use read.dcf
(If the data comes from a file, replace textConnection(Lines)
with the file name, for example "myfile.dat"
, in the first line of code.) No packages are used.
L <- trimws(readLines(textConnection(Lines)))
ix <- !grepl(":", L)
L[ix] <- paste("\nName:", L[ix])
read.dcf(textConnection(L))
using the following values in the note at the end:
Name Email City/Town
[1,] "Name1" "email1@xyz.com" "Location1"
[2,] "Name2" NA "Location2"
[3,] "Name3" "email3@pqr.com" NA
Note: The input used. This is slightly modified from the question to show that it works when email or city / city is missing:
Lines <- "Name1
Email: email1@xyz.com
City/Town: Location1
Name2
City/Town: Location2
Name3
Email: email3@pqr.com"
source to share
The inputs offer several problems:
- The data is specified as a vector of direct characters, not as a data.frame with predefined columns.
- Strings are partially composed of key / value pairs separated by a character
": "
- Other lines act as section headers. All key / value pairs in the lines below refer to the same section until the next header is reached.
The code below relies on only two assumptions:
- key / value pairs contain one and only one
": "
- there are no section headers at all.
Multiple keys in a section, for example, multiple lines with email addresses, are processed by specifying toString()
as an aggregation function to dcast()
.
library(data.table)
# coerce to data.table
data.table(text = txt)[
# split key/value pairs in columns
, tstrsplit(text, ": ")][
# pick section headers and create new column
is.na(V2), Name := V1][
# fill in Name into the rows below
, Name := zoo::na.locf(Name)][
# reshape key/value pairs from long to wide format using Name as row id
!is.na(V2), dcast(.SD, Name ~ V1, fun = toString, value.var = "V2")]
Name City/Town Email
1: Name1 Location1 email1@xyz.com
2: Name2 Location2 email2@abc.com
3: Name3 Location3 email3@pqr.com
4: Name4 Location4 NA
5: Name5 NA email5@abc.com
Data
txt <- c("Name1", "Email: email1@xyz.com", "City/Town: Location1", "Name2",
"Email: email2@abc.com", "City/Town: Location2", "Name3", "Email: email3@pqr.com",
"City/Town: Location3", "Name4", "City/Town: Location4", "Name5",
"Email: email5@abc.com")
Or try some more "realistic" names
txt1 <- c("John Doe", "Email: email1@xyz.com", "City/Town: Location1", "Save the World Fund",
"Email: email2@abc.com", "City/Town: Location2", "Best Shoes Ltd.", "Email: email3@pqr.com",
"City/Town: Location3", "Mother", "City/Town: Location4", "Jane",
"Email: email5@abc.com")
which will result in:
Name City/Town Email
1: Best Shoes Ltd. Location3 email3@pqr.com
2: Jane NA email5@abc.com
3: John Doe Location1 email1@xyz.com
4: Mother Location4 NA
5: Save the World Fund Location2 email2@abc.com
Or with multiple keys in a section
txt2 <- c("John Doe", "Email: email1@xyz.com", "Email: email1@abc.com", "City/Town: Location1", "Save the World Fund",
"Email: email2@abc.com", "City/Town: Location2", "Best Shoes Ltd.", "Email: email3@pqr.com",
"City/Town: Location3", "Mother", "City/Town: Location4", "City/Town: everywhere", "Jane",
"Email: email5@abc.com")
Name City/Town Email
1: Best Shoes Ltd. Location3 email3@pqr.com
2: Jane email5@abc.com
3: John Doe Location1 email1@xyz.com, email1@abc.com
4: Mother Location4, everywhere
5: Save the World Fund Location2 email2@abc.com
source to share
Using dplyr and tidyr checked against both the data provided by @Jaap txt
and by @UweBlock txt1
:
library(dplyr)
library(tidyr)
# data_frame(txt = txt1) %>%
data_frame(txt = txt) %>%
mutate(txt = if_else(grepl(":", txt), txt, paste("Name:", txt)),
rn = row_number()) %>%
separate(txt, into = c("mytype", "mytext"), sep = ":") %>%
spread(key = mytype, value = mytext) %>%
select(-rn) %>%
fill(Name) %>%
group_by(Name) %>%
fill(1:2, .direction = "down") %>%
fill(1:2, .direction = "up") %>%
unique() %>%
ungroup() %>%
select(3:1)
# # A tibble: 5 x 3
# Name Email `City/Town`
# <chr> <chr> <chr>
# 1 Name1 email1@xyz.com Location1
# 2 Name2 email2@abc.com Location2
# 3 Name3 email3@pqr.com Location3
# 4 Name4 <NA> Location4
# 5 Name5 email5@abc.com <NA>
Notes:
- See here why we need it
rn
. - Hoping someone would suggest a better / simpler code using only tidyverse.
source to share
Landmarks:
Code:
txt2 <- c("John Doe", "Email: email1@xyz.com", "Email: email1@abc.com", "City/Town: Location1", "Save the World Fund",
"Email: email2@abc.com", "City/Town: Location2", "Best Shoes Ltd.", "Email: email3@pqr.com",
"City/Town: Location3", "Mother", "City/Town: Location4", "City/Town: everywhere", "Jane",
"Email: email5@abc.com")
library(microbenchmark)
library(data.table)
library(dplyr)
library(tidyr)
microbenchmark(ans.uwe = data.table(text = txt2)[, tstrsplit(text, ": ")
][is.na(V2), Name := V1
][, Name := zoo::na.locf(Name)
][!is.na(V2), dcast(.SD, Name ~ V1, fun = toString, value.var = "V2")],
ans.zx8754 = data_frame(txt = txt2) %>%
mutate(txt = ifelse(grepl(":", txt), txt, paste("Name:", txt)),
rn = row_number()) %>%
separate(txt, into = c("mytype", "mytext"), sep = ":") %>%
spread(key = mytype, value = mytext) %>%
select(-rn) %>%
fill(Name) %>%
group_by(Name) %>%
fill(1:2, .direction = "down") %>%
fill(1:2, .direction = "up") %>%
unique() %>%
ungroup() %>%
select(3:1),
ans.jaap = data.table(txt = txt2)[!grepl(':', txt), name := txt
][, name := zoo::na.locf(name)
][grepl('^Email:', txt), email := sub('Email: ','',txt)
][grepl('^City/Town:', txt), city_town := sub('City/Town: ','',txt)
][txt != name, lapply(.SD, function(x) toString(na.omit(x))), by = name, .SDcols = c('email','city_town')],
ans.G.Grothendieck = {
L <- trimws(readLines(textConnection(txt2)))
ix <- !grepl(":", L)
L[ix] <- paste("\nName:", L[ix])
read.dcf(textConnection(L))},
times = 1000)
Result:
Unit: microseconds
expr min lq mean median uq max neval cld
ans.uwe 4243.754 4885.4765 5305.8688 5139.0580 5390.360 92604.820 1000 c
ans.zx8754 39683.911 41771.2925 43940.7646 43168.4870 45291.504 130965.088 1000 d
ans.jaap 2153.521 2488.0665 2788.8250 2640.1580 2773.150 91862.177 1000 b
ans.G.Grothendieck 266.268 304.0415 332.6255 331.8375 349.797 721.261 1000 a
source to share