Importing and analyzing non-rectangular CSV files in R

I am switching to R from Mathematica, where I do not need to anticipate data structures during import. In particular, I do not need to know whether my data is rectangular before importing it.

I have a lot of .csv files formatted like this:

tasty,chicken,cinnamon
not_tasty,butter,pepper,onion,cardamom,cayenne
tasty,olive_oil,pepper
okay,olive_oil,onion,potato,black_pepper
not_tasty,tomato,fenugreek,pepper,onion,potato
tasty,butter,cheese,wheat,ham

      

The rows are of varying length and contain only strings.

In R, how should I approach this problem?

What have you tried?

I've tried read.table:

dataImport <- read.table("data.csv", header = FALSE)
class(dataImport)
##[1] "data.frame"
dim(dataImport)
##[1] 6   1
dataImport[1]
##[1] tasty,chicken,cinnamon
##6 Levels: ...

      

From the documentation, I interpret this as a single column with each ingredient list as a separate row. I can extract the first three rows like this; each row has class factor, but appears to contain more data than expected:

dataImport[c(1,2,3),1]
## my rows
rowOne <- dataImport[c(1),1];
class(rowOne)
## "factor"
rowOne
## [1] tasty,chicken,cinnamon
## 6 Levels: not_tasty,butter,cheese [...]

      

That is as far as I have pursued this issue. I would appreciate some advice on the suitability of read.table for this data structure.

My goal is to group data by the first element of each row and analyze the difference between each type of recipe. If it helps influence the data structure advice, in Mathematica I would do the following:

dataImport=Import["data.csv"];
tasty = Cases[dataImport, {"tasty", ingr__} :> {ingr}]

      

Discussion of answers

@G.Grothendieck provided a solution using read.table and post-processing with the reshape2 package. It seems extremely useful and I will investigate it later. The general advice there solved my problem, hence the accept.

The suggestion by @MrFlick to use the tm package was helpful for later analysis using DataframeSource.

+3




2 answers


read.table

Try read.table with fill = TRUE:

d1 <- read.table("data.csv", sep = ",", as.is = TRUE, fill = TRUE)

      

giving:

> d1
         V1        V2        V3     V4           V5      V6
1     tasty   chicken  cinnamon                            
2 not_tasty    butter    pepper  onion     cardamom cayenne
3     tasty olive_oil    pepper                            
4      okay olive_oil     onion potato black_pepper        
5 not_tasty    tomato fenugreek pepper        onion  potato
6     tasty    butter    cheese  wheat          ham   

      

read.table with NAs

Or, to fill the empty cells with NA values, add na.strings = "":

d2 <- read.table("data.csv", sep = ",", as.is = TRUE, fill = TRUE, na.strings = "")

      

giving:

> d2
         V1        V2        V3     V4           V5      V6
1     tasty   chicken  cinnamon   <NA>         <NA>    <NA>
2 not_tasty    butter    pepper  onion     cardamom cayenne
3     tasty olive_oil    pepper   <NA>         <NA>    <NA>
4      okay olive_oil     onion potato black_pepper    <NA>
5 not_tasty    tomato fenugreek pepper        onion  potato
6     tasty    butter    cheese  wheat          ham    <NA>
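One caveat, offered as a side note rather than as part of the answer above: the read.table documentation states that the number of columns is determined from the first five lines of input unless col.names is supplied and is longer. The sample file is safe because its second line already has six fields, but a file whose longest row appears later might not be. A minimal sketch of a workaround, assuming the same comma-separated layout and using a temporary file in place of data.csv, counts the fields first with count.fields and passes explicit column names:

```r
# Write the sample data to a temporary file so the sketch is self-contained.
path <- tempfile(fileext = ".csv")
writeLines(c(
  "tasty,chicken,cinnamon",
  "not_tasty,butter,pepper,onion,cardamom,cayenne",
  "tasty,olive_oil,pepper",
  "okay,olive_oil,onion,potato,black_pepper",
  "not_tasty,tomato,fenugreek,pepper,onion,potato",
  "tasty,butter,cheese,wheat,ham"
), path)

# count.fields() returns the number of fields found on each line.
n_fields <- count.fields(path, sep = ",")
max_fields <- max(n_fields)

# Supplying col.names makes read.table allocate one column per field
# of the longest row, wherever that row sits in the file.
d <- read.table(path, sep = ",", as.is = TRUE, fill = TRUE,
                na.strings = "",
                col.names = paste0("V", seq_len(max_fields)))
dim(d)
```

The column names V1, V2, ... are chosen here only to match read.table's defaults.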

      

long form

If you want it in long form:



library(reshape2)
d2$id <- seq_len(nrow(d2))  # row id; melt() below needs an "id" column
long <- na.omit(melt(d2, id.var = c("id", "V1"))[-3])
long <- long[order(long$id), ]

      

giving:

> long
   id        V1        value
1   1     tasty      chicken
7   1     tasty     cinnamon
2   2 not_tasty       butter
8   2 not_tasty       pepper
14  2 not_tasty        onion
20  2 not_tasty     cardamom
26  2 not_tasty      cayenne
3   3     tasty    olive_oil
9   3     tasty       pepper
4   4      okay    olive_oil
10  4      okay        onion
16  4      okay       potato
22  4      okay black_pepper
5   5 not_tasty       tomato
11  5 not_tasty    fenugreek
17  5 not_tasty       pepper
23  5 not_tasty        onion
29  5 not_tasty       potato
6   6     tasty       butter
12  6     tasty       cheese
18  6     tasty        wheat
24  6     tasty          ham

      

wide form 0/1 binary variables

To represent the variable part as 0/1 binary variables, try this:

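# dcast() is the reshape2 casting function (cast() belongs to the older reshape package)
wide <- dcast(long, id + V1 ~ value, value.var = "value")
wide[-(1:2)] <- 0 + !is.na(wide[-(1:2)])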

      

giving:

(screenshot of the resulting wide 0/1 data frame)
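For readers without reshape2, a similar 0/1 indicator layout can be sketched in base R. This is an alternative route, not the method used above; the sample rows are typed inline rather than read from data.csv, and the object names (rows, labels, indicator, counts) are chosen only for illustration:

```r
# Sample rows typed inline so the sketch is self-contained.
lines <- c(
  "tasty,chicken,cinnamon",
  "not_tasty,butter,pepper,onion,cardamom,cayenne",
  "tasty,olive_oil,pepper",
  "okay,olive_oil,onion,potato,black_pepper",
  "not_tasty,tomato,fenugreek,pepper,onion,potato",
  "tasty,butter,cheese,wheat,ham"
)
rows <- strsplit(lines, ",")

# First field is the label; the remaining fields are the ingredients.
labels <- vapply(rows, function(r) r[1], character(1))
n_ingr <- lengths(rows) - 1

# Long form: one row per (recipe id, ingredient) pair.
long <- data.frame(
  id    = rep(seq_along(rows), n_ingr),
  value = unlist(lapply(rows, function(r) r[-1])),
  stringsAsFactors = FALSE
)

# Cross-tabulate id against ingredient, then clamp counts to 0/1.
indicator <- unclass(table(long$id, long$value))
indicator <- (indicator > 0) + 0L

# Number of recipes per label that contain each ingredient.
counts <- rowsum(indicator, group = labels)
counts[, c("pepper", "onion", "butter")]
```

The counts matrix has one row per label, which gives a starting point for comparing ingredient use between recipe types.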

list in data frame

Another representation would be the following list in a data frame, so that ag$value is a list of character vectors:

ag <- aggregate(value ~., transform(long, value = as.character(value)), c)
ag <- ag[order(ag$id), ]

giving:

> ag
  id        V1                                    value
4  1     tasty                        chicken, cinnamon
1  2 not_tasty butter, pepper, onion, cardamom, cayenne
5  3     tasty                        olive_oil, pepper
3  4      okay   olive_oil, onion, potato, black_pepper
2  5 not_tasty tomato, fenugreek, pepper, onion, potato
6  6     tasty               butter, cheese, wheat, ham

> str(ag)
'data.frame':   6 obs. of  3 variables:
 $ id   : int  1 2 3 4 5 6
 $ V1   : chr  "tasty" "not_tasty" "tasty" "okay" ...
 $ value:List of 6
  ..$ 15: chr  "chicken" "cinnamon"
  ..$ 1 : chr  "butter" "pepper" "onion" "cardamom" ...
  ..$ 17: chr  "olive_oil" "pepper"
  ..$ 11: chr  "olive_oil" "onion" "potato" "black_pepper"
  ..$ 6 : chr  "tomato" "fenugreek" "pepper" "onion" ...
  ..$ 19: chr  "butter" "cheese" "wheat" "ham"

      

+5




I don't think forcing your data into a data.frame or data.table will help you, as both of these structures expect rectangular data. If you just want a list of character vectors, you can read them in with:

strsplit(readLines("data.csv"), ",")
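Starting from such a list, the grouping asked for in the question (the Mathematica Cases pattern) can be sketched in base R as follows. This is only one possible approach; the sample lines are typed inline instead of being read with readLines, and the names rows, labels, ingredients and by_label are illustrative:

```r
# Sample lines typed inline; in practice these would come from readLines("data.csv").
lines <- c(
  "tasty,chicken,cinnamon",
  "not_tasty,butter,pepper,onion,cardamom,cayenne",
  "tasty,olive_oil,pepper",
  "okay,olive_oil,onion,potato,black_pepper",
  "not_tasty,tomato,fenugreek,pepper,onion,potato",
  "tasty,butter,cheese,wheat,ham"
)
rows <- strsplit(lines, ",")

# Separate the label (first element) from the ingredients (the rest).
labels      <- vapply(rows, function(r) r[1], character(1))
ingredients <- lapply(rows, function(r) r[-1])

# Group the ingredient vectors by label; the result is a named list of lists.
by_label <- split(ingredients, labels)

# Roughly what Cases[dataImport, {"tasty", ingr__} :> {ingr}] returns.
by_label$tasty

# Ingredient frequencies within one group.
table(unlist(by_label$tasty))
```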

      



It all depends on what you are going to do with the data after reading it. If you plan on using existing functions, what kind of input do they expect?

It looks like you want to keep track of the terms in each of these recipes. Perhaps the appropriate data structure would be a "corpus" from the tm text mining package.

+4

