Change data format in long format with multiple sets of dimension columns

Question

Change data format in long format with multiple sets of dimension columns

I have an R dataframe that I popped off the internet using readHTMLTable()

in a package XML

. The table looks like the following excerpt with several variables / columns for population and year. (Note that years are not duplicated across columns and represent a unique identifier for the population.)

        year1   pop1      year2   pop2     year3   pop3     
1                                                        
2       16XX    4675,0    1900    6453,0    1930   9981,2       
3       17XX    4739,3    1901    6553,5    1931   ...      
4       17XX    4834,0    1902    6684,0    1932   
5       180X    4930,0    1903    6818,0    1933        
6       180X    5029,0    1904    6955,0    1934        
7       181X    5129,0    1905    7094,0    1935
8       181X    5231,9    1906    7234,7    1936
9       182X    5297,0    1907    7329,0    1937
10      182X    5362,0    1908    7422,0    1938

I would like to reorganize the data into only two columns, one for the year and one for the aggregate, which looks like this:

        year    pop     
1                                                        
2       16XX    4675,0
3       17XX    4739,3  
4       17XX    4834,0  
5       180X    4930,0
6       180X    5029,0  
7       181X    5129,0
8       181X    5231,9  
9       182X    5297,0
10      182X    5362,0  
11      1900    6453,0
12      1901    6553,5
13      1902    6684,0
...     ...     ...
21      1930    9981,2
22      ...

The values from the variables / columns year2

and year3

are added below year1

, as well as the corresponding population values.

I have considered the following:

(1) Looping over the population and year columns ( n>2

) and adding those values as new observations to year1, and population1 will work, but it seems unnecessarily cumbersome.

(2) I tried to melt like below, but either it can't handle the id variable shared by multiple columns, or I'm not implementing it correctly.

df.melt <- melt(df, id=c("year1", "year2",...)

(3) Finally, I considered the possibility to output the column each year as my own vector, and add each of those vectors together here:

year.all <- c(df$year1, df$year2,...)

However, the above returns the next year. all

[1]  1  2  3  3  4  4  5  5  6  6  7  8  8  9  9  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24  1  1  2 ...

but not

[1] 16XX 17XX 17XX 180X 180X 181X 181X 182X 182X 1900 1901 1902...

If there is an easy way to accomplish this refactoring, I would like to look into it. Many thanks for the help.

+3

r dataframe reshape2 reshape

user2489854 Jul 19 15 at 18:51

source to share

3 answers

Using the new function in melt

from data.table v1.9.5+

:

require(data.table) # v1.9.5+
melt(setDT(df), measure = patterns("^year", "^pop"), value.name = c("year", "pop"))

You can find the rest of the vignettes here .

+7

Arun Jul 19 15 at 19:02

source to share

Another option is to use split.default

to split a dataframe in a data list and then concatenate them:

lst <- lapply(split.default(df1, sub('.*(\\d)', '\\1', names(df1))),
              setNames, c('year','pop'))

do.call(rbind, lst)

which gives the desired output:

    year     pop
1.1 16XX  4675,0
1.2 17XX  4739,3
1.3 17XX  4834,0
1.4 180X  4930,0
1.5 180X  5029,0
1.6 181X  5129,0
1.7 181X  5231,9
1.8 182X  5297,0
1.9 182X  5362,0
2.1 1900  6453,0
2.2 1901  6553,5
2.3 1902  6684,0
2.4 1903  6818,0
2.5 1904  6955,0
2.6 1905  7094,0
2.7 1906  7234,7
2.8 1907  7329,0
2.9 1908  7422,0
3.1 1930  9981,2
3.2 1931 10583,5
3.3 1932  8671,0
3.4 1933  9118,0
3.5 1934  9625,0
3.6 1935  8097,0
3.7 1936  7984,7
3.8 1937  8729,0
3.9 1938 10462,0

You can also use rbindlist

from package data.table

for the last step:

library(data.table)
rbindlist(lst)

Data used:

df1 <- structure(list(year1 = c("16XX", "17XX", "17XX", "180X", "180X", "181X", "181X", "182X", "182X"),
                      pop1 = c("4675,0", "4739,3", "4834,0", "4930,0", "5029,0", "5129,0", "5231,9", "5297,0", "5362,0"),
                      year2 = c(1900L, 1901L, 1902L, 1903L, 1904L, 1905L, 1906L, 1907L, 1908L),
                      pop2 = c("6453,0", "6553,5", "6684,0", "6818,0", "6955,0", "7094,0", "7234,7", "7329,0", "7422,0"), 
                      year3 = c(1930L, 1931L, 1932L, 1933L, 1934L, 1935L, 1936L, 1937L, 1938L),
                      pop3 = c("9981,2", "10583,5", "8671,0", "9118,0", "9625,0", "8097,0", "7984,7", "8729,0", "10462,0")),
                 .Names = c("year1", "pop1", "year2", "pop2", "year3", "pop3"), class = "data.frame", row.names = c(NA, -9L))

+2

Jaap Jul 19 15 at 19:50

source to share

akrun · Accepted Answer · 2015-07-19T18:56:13+0000

If the columns "year", "pop" alternate, we can multiply by c(TRUE, FALSE)

to get columns 1, 3, 5, etc. and c(FALSE, TRUE)

to get 2, 4, 6, .. due to recycling. Then we unlist

columns and create a new 'data.frame.

 df2 <- data.frame(year=unlist(df1[c(TRUE, FALSE)]), 
                  pop=unlist(df1[c(FALSE, TRUE)]))
 row.names(df2) <- NULL
 head(df2)
 #   year    pop
 #1            
 #2 16XX 4675,0
 #3 17XX 4739,3
 #4 17XX 4834,0
 #5 180X 4930,0
 #6 180X 5029,0

Or another option

library(splitstackshape)
merged.stack(transform(df1, id=1:nrow(df1)), var.stubs=c('year', 'pop'), 
        sep='var.stubs')[order(.time_1), 3:4, with=FALSE]

data

df1 <- structure(list(year1 = c("", "16XX", "17XX", "17XX", "180X", 
"180X", "181X", "181X", "182X", "182X"), pop1 = c("", "4675,0", 
"4739,3", "4834,0", "4930,0", "5029,0", "5129,0", "5231,9", "5297,0", 
"5362,0"), year2 = c(NA, 1900L, 1901L, 1902L, 1903L, 1904L, 1905L, 
1906L, 1907L, 1908L), pop2 = c("", "6453,0", "6553,5", "6684,0", 
"6818,0", "6955,0", "7094,0", "7234,7", "7329,0", "7422,0"), 
year3 = c(NA, 1930L, 1931L, 1932L, 1933L, 1934L, 1935L, 1936L, 
1937L, 1938L), pop3 = c("", "9981,2", "", "", "", "", "", 
"", "", "")), .Names = c("year1", "pop1", "year2", "pop2", 
"year3", "pop3"), class = "data.frame", row.names = c(NA, -10L))

Change data format in long format with multiple sets of dimension columns

data

More articles: