Change data format in long format with multiple sets of dimension columns
I have an R dataframe that I popped off the internet using readHTMLTable()
in a package XML
. The table looks like the following excerpt with several variables / columns for population and year. (Note that years are not duplicated across columns and represent a unique identifier for the population.)
year1 pop1 year2 pop2 year3 pop3
1
2 16XX 4675,0 1900 6453,0 1930 9981,2
3 17XX 4739,3 1901 6553,5 1931 ...
4 17XX 4834,0 1902 6684,0 1932
5 180X 4930,0 1903 6818,0 1933
6 180X 5029,0 1904 6955,0 1934
7 181X 5129,0 1905 7094,0 1935
8 181X 5231,9 1906 7234,7 1936
9 182X 5297,0 1907 7329,0 1937
10 182X 5362,0 1908 7422,0 1938
I would like to reorganize the data into only two columns, one for the year and one for the aggregate, which looks like this:
year pop
1
2 16XX 4675,0
3 17XX 4739,3
4 17XX 4834,0
5 180X 4930,0
6 180X 5029,0
7 181X 5129,0
8 181X 5231,9
9 182X 5297,0
10 182X 5362,0
11 1900 6453,0
12 1901 6553,5
13 1902 6684,0
... ... ...
21 1930 9981,2
22 ...
The values ββfrom the variables / columns year2
and year3
are added below year1
, as well as the corresponding population values.
I have considered the following:
(1) Looping over the population and year columns ( n>2
) and adding those values ββas new observations to year1, and population1 will work, but it seems unnecessarily cumbersome.
(2) I tried to melt like below, but either it can't handle the id variable shared by multiple columns, or I'm not implementing it correctly.
df.melt <- melt(df, id=c("year1", "year2",...)
(3) Finally, I considered the possibility to output the column each year as my own vector, and add each of those vectors together here:
year.all <- c(df$year1, df$year2,...)
However, the above returns the next year. all
[1] 1 2 3 3 4 4 5 5 6 6 7 8 8 9 9 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 1 1 2 ...
but not
[1] 16XX 17XX 17XX 180X 180X 181X 181X 182X 182X 1900 1901 1902...
If there is an easy way to accomplish this refactoring, I would like to look into it. Many thanks for the help.
source to share
If the columns "year", "pop" alternate, we can multiply by c(TRUE, FALSE)
to get columns 1, 3, 5, etc. and c(FALSE, TRUE)
to get 2, 4, 6, .. due to recycling. Then we unlist
columns and create a new 'data.frame.
df2 <- data.frame(year=unlist(df1[c(TRUE, FALSE)]),
pop=unlist(df1[c(FALSE, TRUE)]))
row.names(df2) <- NULL
head(df2)
# year pop
#1
#2 16XX 4675,0
#3 17XX 4739,3
#4 17XX 4834,0
#5 180X 4930,0
#6 180X 5029,0
Or another option
library(splitstackshape)
merged.stack(transform(df1, id=1:nrow(df1)), var.stubs=c('year', 'pop'),
sep='var.stubs')[order(.time_1), 3:4, with=FALSE]
data
df1 <- structure(list(year1 = c("", "16XX", "17XX", "17XX", "180X",
"180X", "181X", "181X", "182X", "182X"), pop1 = c("", "4675,0",
"4739,3", "4834,0", "4930,0", "5029,0", "5129,0", "5231,9", "5297,0",
"5362,0"), year2 = c(NA, 1900L, 1901L, 1902L, 1903L, 1904L, 1905L,
1906L, 1907L, 1908L), pop2 = c("", "6453,0", "6553,5", "6684,0",
"6818,0", "6955,0", "7094,0", "7234,7", "7329,0", "7422,0"),
year3 = c(NA, 1930L, 1931L, 1932L, 1933L, 1934L, 1935L, 1936L,
1937L, 1938L), pop3 = c("", "9981,2", "", "", "", "", "",
"", "", "")), .Names = c("year1", "pop1", "year2", "pop2",
"year3", "pop3"), class = "data.frame", row.names = c(NA, -10L))
source to share
Using the new function in melt
from data.table v1.9.5+
:
require(data.table) # v1.9.5+
melt(setDT(df), measure = patterns("^year", "^pop"), value.name = c("year", "pop"))
You can find the rest of the vignettes here .
source to share
Another option is to use split.default
to split a dataframe in a data list and then concatenate them:
lst <- lapply(split.default(df1, sub('.*(\\d)', '\\1', names(df1))),
setNames, c('year','pop'))
do.call(rbind, lst)
which gives the desired output:
year pop
1.1 16XX 4675,0
1.2 17XX 4739,3
1.3 17XX 4834,0
1.4 180X 4930,0
1.5 180X 5029,0
1.6 181X 5129,0
1.7 181X 5231,9
1.8 182X 5297,0
1.9 182X 5362,0
2.1 1900 6453,0
2.2 1901 6553,5
2.3 1902 6684,0
2.4 1903 6818,0
2.5 1904 6955,0
2.6 1905 7094,0
2.7 1906 7234,7
2.8 1907 7329,0
2.9 1908 7422,0
3.1 1930 9981,2
3.2 1931 10583,5
3.3 1932 8671,0
3.4 1933 9118,0
3.5 1934 9625,0
3.6 1935 8097,0
3.7 1936 7984,7
3.8 1937 8729,0
3.9 1938 10462,0
You can also use rbindlist
from package data.table
for the last step:
library(data.table)
rbindlist(lst)
Data used:
df1 <- structure(list(year1 = c("16XX", "17XX", "17XX", "180X", "180X", "181X", "181X", "182X", "182X"),
pop1 = c("4675,0", "4739,3", "4834,0", "4930,0", "5029,0", "5129,0", "5231,9", "5297,0", "5362,0"),
year2 = c(1900L, 1901L, 1902L, 1903L, 1904L, 1905L, 1906L, 1907L, 1908L),
pop2 = c("6453,0", "6553,5", "6684,0", "6818,0", "6955,0", "7094,0", "7234,7", "7329,0", "7422,0"),
year3 = c(1930L, 1931L, 1932L, 1933L, 1934L, 1935L, 1936L, 1937L, 1938L),
pop3 = c("9981,2", "10583,5", "8671,0", "9118,0", "9625,0", "8097,0", "7984,7", "8729,0", "10462,0")),
.Names = c("year1", "pop1", "year2", "pop2", "year3", "pop3"), class = "data.frame", row.names = c(NA, -9L))
source to share