Importing and analyzing non-rectangular CSV files in R
I am switching to R from Mathematica where I do not need to anticipate data structures during import, in particular I do not need to anticipate the rectangularity of my data before importing.
I have a lot of files .csv
formatted like this:
tasty,chicken,cinnamon not_tasty,butter,pepper,onion,cardamom,cayenne tasty,olive_oil,pepper okay,olive_oil,onion,potato,black_pepper not_tasty,tomato,fenugreek,pepper,onion,potato tasty,butter,cheese,wheat,ham
The strings are of varying length and will only contain strings.
In R, how should I approach this problem?
What have you tried?
I've tried with read.table
:
dataImport <- read.table("data.csv", header = FALSE)
class(dataImport)
##[1] "data.frame"
dim(dataImport)
##[1] 6 1
dataImport[1]
##[1] tasty,chicken,cinnamon
##6 Levels: ...
I am interpreting this from the documentation as a special column with each ingredient list as a separate row. I can extract the first three lines like this, each line has class
factor
, but appears to contain more data than expected:
dataImport[c(1,2,3),1]
## my rows
rowOne <- dataImport[c(1),1];
class(rowOne)
## "factor"
rowOne
## [1] tasty,chicken,cinnamon
## 6 Levels: not_tasty,butter,cheese [...]
So far, as far as I have pursued this issue, I would appreciate some advice on the suitability of read.table
this data structure.
My goal is to group data by the first element of each row and analyze the difference between each type of recipe. If it helps influence the data structure advice, in Mathematica I would do the following:
dataImport=Import["data.csv"];
tasty = Cases[dataImport, {"tasty", ingr__} :> {ingr}]
Discussion of answers
@ G.Grothendieck provided a solution in use read.table
and post-processing using the package reshape2
- it seems extremely useful and I will investigate it later. The general advice here solved my problem, hence agree.
@MrFlick package usage tm
was helpful for later analysis usingDataframeSource
source to share
read.table Try read.table
with fill=TRUE
:
d1 <- read.table("data.csv", sep = ",", as.is = TRUE, fill = TRUE)
giving:
> d1
V1 V2 V3 V4 V5 V6
1 tasty chicken cinnamon
2 not_tasty butter pepper onion cardamom cayenne
3 tasty olive_oil pepper
4 okay olive_oil onion potato black_pepper
5 not_tasty tomato fenugreek pepper onion potato
6 tasty butter cheese wheat ham
read.table with NAs
or fill empty cells with NA values, add na.strings = ""
:
d2 <- read.table("data.csv", sep = ",", as.is = TRUE, fill = TRUE, na.strings = "")
giving:
> d2
V1 V2 V3 V4 V5 V6
1 tasty chicken cinnamon <NA> <NA> <NA>
2 not_tasty butter pepper onion cardamom cayenne
3 tasty olive_oil pepper <NA> <NA> <NA>
4 okay olive_oil onion potato black_pepper <NA>
5 not_tasty tomato fenugreek pepper onion potato
6 tasty butter cheese wheat ham <NA>
long form
If you want it in long form:
library(reshape2)
long <- na.omit(melt(d2, id.var = c("id", "V1"))[-3])
long <- long[order(long$id), ]
giving:
> long
id V1 value
1 1 tasty chicken
7 1 tasty cinnamon
2 2 not_tasty butter
8 2 not_tasty pepper
14 2 not_tasty onion
20 2 not_tasty cardamom
26 2 not_tasty cayenne
3 3 tasty olive_oil
9 3 tasty pepper
4 4 okay olive_oil
10 4 okay onion
16 4 okay potato
22 4 okay black_pepper
5 5 not_tasty tomato
11 5 not_tasty fenugreek
17 5 not_tasty pepper
23 5 not_tasty onion
29 5 not_tasty potato
6 6 tasty butter
12 6 tasty cheese
18 6 tasty wheat
24 6 tasty ham
wide form 0/1 binary variables
To represent the variable part as 0/1 binary variables, try this:
wide <- cast(id + V1 ~ value, data = long)
wide[-(1:2)] <- 0 + !is.na(wide[-(1:2)])
:
in data frame
Another representation would be the following list in a data frame, so that ag$value
is a list of character vectors:
ag <- aggregate(value ~., transform(long, value = as.character(value)), c)
ag <- ag[order(ag$id), ]
giving:
> ag
id V1 value
4 1 tasty chicken, cinnamon
1 2 not_tasty butter, pepper, onion, cardamom, cayenne
5 3 tasty olive_oil, pepper
3 4 okay olive_oil, onion, potato, black_pepper
2 5 not_tasty tomato, fenugreek, pepper, onion, potato
6 6 tasty butter, cheese, wheat, ham
> str(ag)
'data.frame': 6 obs. of 3 variables:
$ id : int 1 2 3 4 5 6
$ V1 : chr "tasty" "not_tasty" "tasty" "okay" ...
$ value:List of 6
..$ 15: chr "chicken" "cinnamon"
..$ 1 : chr "butter" "pepper" "onion" "cardamom" ...
..$ 17: chr "olive_oil" "pepper"
..$ 11: chr "olive_oil" "onion" "potato" "black_pepper"
..$ 6 : chr "tomato" "fenugreek" "pepper" "onion" ...
..$ 19: chr "butter" "cheese" "wheat" "ham"
source to share
I don't think dragging and dropping your data into a data.frame or data.table will help you, as both of these shapes usually use rectangular data. If you just want a list of character vectors, you can read them with.
strsplit(readLines("data.csv"), ",")
It all depends on what you are going to do with the data after reading it. If you plan on using an existing feature, what kind of input do they expect?
It looks like you can keep track of the terms in each of these recipes. Perhaps the corresponding data structure would be a "corpus" from a tm
text mining package .
source to share