Splitting a data frame to create new columns

I have a data frame with columns for "Count", "Transect Number", "Data" and "Year". My goal is to split the dataframe by Transect, then again by Year and create a new dataframe with a column for "Transect" and then the corresponding data for the year in the following columns.

To create a dummy data frame:

Count1<-1:27
Count2<-1:30
Count3<-1:25
T1<-c(1,1,1,2,2,2,3,3,3,1,1,1,2,2,2,3,3,3,1,1,1,2,2,2,3,3,3)
T2<-c(1,1,1,2,2,2,3,3,3,1,1,1,2,2,2,3,3,3,1,1,1,1,2,2,2,2,3,3,3,3)
T3<-c(1,1,1,2,2,2,3,3,3,1,1,1,2,2,2,3,3,3,1,1,1,2,2,2,3)
Data1<-c(1,2,3,2,1,2,3,4,3,2,1,2,3,4,3,2,1,2,3,4,5,4,3,2,3,3,2)
Data2<-c(1,2,3,2,1,4,3,2,1,2,4,3,2,3,4,3,2,3,4,5,6,4,3,2,1,4,5,4,3,2)
Data3<-c(1,2,3,4,5,4,3,3,3,4,5,4,3,3,2,3,4,5,4,3,4,3,2,3,4)
Year1<-c(2014,2014,2014,2014,2014,2014,2014,2014,2014,2015,2015,2015,2015,2015,2015,2015,2015,2015,2016,2016,2016,2016,2016,2016,2016,2016,2016)
Year2<-c(2014,2014,2014,2014,2014,2014,2014,2014,2014,2015,2015,2015,2015,2015,2015,2015,2015,2015,2016,2016,2016,2016,2016,2016,2016,2016,2016,2016,2016,2016)
Year3<-c(2014,2014,2014,2014,2014,2014,2014,2014,2014,2015,2015,2015,2015,2015,2015,2015,2015,2015,2016,2016,2016,2016,2016,2016,2016)




DF1<-data.frame(Count1,T1,Data1,Year1)
colnames(DF1)<-c("Count","Transect","Data","Year")
DF2<-data.frame(Count2,T2,Data2,Year2)
colnames(DF2)<-c("Count","Transect","Data","Year")
DF3<-data.frame(Count3,T3,Data3,Year3)
colnames(DF3)<-c("Count","Transect","Data","Year")


All<-rbind(DF1,DF2,DF3)

      

Once I have a dataframe I thought about splitting the data into a transect as this would be a permanent aspect of my current dataset.

#Step 1-Break down by T
Trans1<-All[All$Transect==1,]
Trans2<-All[All$Transect==2,]
Trans3<-All[All$Transect==3,]
Trans4<-All[All$Transect==4,]
Trans5<-All[All$Transect==5,]

      

But in the next step, I'm a little less clear. I need to pull data from a Data column, organized by year. Something similar to further data destruction:

Trans1_Year1<-Trans1[Trans1$Year==2014,]
Trans2_Year1<-Trans2[Trans2$Year==2014,]
Trans3_Year1<-Trans3[Trans3$Year==2014,]
Trans4_Year1<-Trans4[Trans4$Year==2014,]
Trans5_Year1<-Trans5[Trans5$Year==2014,]

      

or even using split

ByYear1<-split(Trans1,Trans1$Year)

      

But I would rather not write the code like above as I hope to add new data every year as this dataset moves forward. And I would like the code to be able to accommodate new Year data as it is added, rather than writing new lines of code every year.

Once I have my data set like this, I would like to create a second data frame with columns for each year. One problem is that each year contains a different number of lines, which was a problem for me. But my end result would have columns:

 "Transect", "Data 2014", "Data 2015", "Data 2016"

      

Since each year can have a different number of lines inside the transom, I would like to leave NA at the end of each Transect section when the number of lines in each cut differs between years.

+3


source to share


2 answers


It looks like you have been basically trying to convert your data to semi-wide format with columns for years, not "long" format.

If so, you'd be better off adding an extra index column that shows the repeated combination of "Transect" and "Year".

This can be easily done with the help getanID

from my splitstackshape package. "splitstackshape" also loads "data.table" from which you can then use dcast.data.table

to get a wide format.



library(splitstackshape)
dcast.data.table(getanID(All, c("Transect", "Year")), 
                 Transect + .id ~ Year, value.var = "Data")
#     Transect .id 2014 2015 2016
#  1:        1   1    1    2    3
#  2:        1   2    2    1    4
#  3:        1   3    3    2    5
#  4:        1   4    1    2    4
#  5:        1   5    2    4    5
#  6:        1   6    3    3    6
#  7:        1   7    1    4    4
#  8:        1   8    2    5    4
#  9:        1   9    3    4    3
# 10:        1  10   NA   NA    4
# 11:        2   1    2    3    4
# 12:        2   2    1    4    3
# 13:        2   3    2    3    2
# 14:        2   4    2    2    3
# 15:        2   5    1    3    2
# 16:        2   6    4    4    1
# 17:        2   7    4    3    4
# 18:        2   8    5    3    3
# 19:        2   9    4    2    2
# 20:        2  10   NA   NA    3
# 21:        3   1    3    2    3
# 22:        3   2    4    1    3
# 23:        3   3    3    2    2
# 24:        3   4    3    3    5
# 25:        3   5    2    2    4
# 26:        3   6    1    3    3
# 27:        3   7    3    3    2
# 28:        3   8    3    4    4
# 29:        3   9    3    5   NA
#     Transect .id 2014 2015 2016

      

Then, if you really want to split the "Transect" column, you can go ahead and use split

, but since you now have "data.table" it would be better to stick with that and take advantage of its many convenient features, including subset-related and aggregation.

+3


source


I think you are forcing your data in a format that it doesn’t have, naturally. There are many advantages of processing to leave it in the "long" format. Take a look at this article if you haven't seen it yet, it's a classic.



http://www.jstatsoft.org/v21/i12

0


source







All Articles