Splitting strings into a jagged list and populating a data.frame from the list
I spent a lot of time trying to fix this issue and was unsuccessful.
I have a data.frame with a column containing variable length strings. The data.frame format looks like this:
Taxa <- as.character(c(
  "cellularorganisms(norank)_Bacteria(superkingdom)_Actinobacteria(phylum)",
  "cellularorganisms(norank)_Bacteria(superkingdom)_Proteobacteria(phylum)_Gammaproteobacteria(class)_Enterobacteriales(order)_Enterobacteriaceae(family)",
  "cellularorganisms(norank)_Bacteria(superkingdom)_Proteobacteria(phylum)_Gammaproteobacteria(class)_Enterobacteriales(order)",
  "cellularorganisms(norank)_Bacteria(superkingdom)_Proteobacteria(phylum)_Gammaproteobacteria(class)_Enterobacteriales(order)_Enterobacteriaceae(family)_Klebsiella(genus)",
  "cellularorganisms(norank)_Bacteria(superkingdom)_Proteobacteria(phylum)_Gammaproteobacteria(class)_Enterobacteriales(order)_Enterobacteriaceae(family)_Klebsiella(genus)_Klebsiellapneumoniae(species)",
  "cellularorganisms(norank)_Bacteria(superkingdom)_Firmicutes(phylum)_Clostridia(class)_Clostridiales(order)",
  "cellularorganisms(norank)_Bacteria(superkingdom)_Firmicutes(phylum)_Clostridia(class)",
  "cellularorganisms(norank)_Bacteria(superkingdom)_Firmicutes(phylum)_Clostridia(class)_Clostridiales(order)_Clostridiaceae(family)",
  "cellularorganisms(norank)_Bacteria(superkingdom)_Firmicutes(phylum)_Clostridia(class)_Clostridiales(order)_Clostridiaceae(family)_Clostridium(genus)",
  "cellularorganisms(norank)_Bacteria(superkingdom)_Firmicutes(phylum)_Clostridia(class)_Clostridiales(order)_Clostridiaceae(family)_Clostridium(genus)_Clostridiumbotulinum(species)",
  "cellularorganisms(norank)_Bacteria(superkingdom)_Firmicutes(phylum)_Clostridia(class)_Clostridiales(order)_Clostridiaceae(family)_Clostridium(genus)_Clostridiumbotulinum(species)_ClostridiumbotulinumCDC66177(strain)",
  "cellularorganisms(norank)_Bacteria(superkingdom)_Actinobacteria(phylum)_Actinobacteria(class)_Actinobacteridae(subclass)_Actinomycetales(order)_Micrococcineae(suborder)",
  "cellularorganisms(norank)_Bacteria(superkingdom)_Actinobacteria(phylum)_Actinobacteria(class)_Actinobacteridae(subclass)_Actinomycetales(order)_Micrococcineae(suborder)_Microbacteriaceae(family)",
  "cellularorganisms(norank)_Bacteria(superkingdom)_Actinobacteria(phylum)_Actinobacteria(class)_Actinobacteridae(subclass)_Actinomycetales(order)_Micrococcineae(suborder)_Microbacteriaceae(family)_Microbacterium(genus)",
  "cellularorganisms(norank)_Bacteria(superkingdom)_Actinobacteria(phylum)_Actinobacteria(class)_Actinobacteridae(subclass)_Actinomycetales(order)_Micrococcineae(suborder)_Microbacteriaceae(family)_Microbacterium(genus)_Microbacteriumlaevaniformans(species)_MicrobacteriumlaevaniformansOR221(strain)"
))
Percent <- c("0.000400","0.006800","0.005034","0.001760","0.000000","0.000000","0.344400","0.000000","0.000000","0.000000","0.006500","0.002819","0.000487","0.000000","0.001090")
Test <- data.frame(Percent, Taxa)
Test$Taxa <- as.character(Test$Taxa)
I can split these strings on the underscores into a list of vectors of unequal lengths:
NewDF <- strsplit(Test$Taxa, "_", fixed=TRUE)
But I can't figure out how to take this parsed output and format it into a useful structure.
Each parsed section has two components: a descriptor and a taxonomic level (e.g. Bacteria(superkingdom) has the descriptor Bacteria and the taxonomic level superkingdom).
What I want to do is take this parsed output and populate a data.frame that has the following column headers (norank, superkingdom, phylum, class, order, family, genus, species, strain). The output should skip taxonomic levels that are not in the above list (for example, some rows have a subclass level between class and order; I need to drop the subclass).
Also, if the line stops at a certain taxonomic level and there are still some blank columns left, they should be set to NA (i.e. the first row ends in phylum, so class, order, family, etc. should be NA) ...
The end result should look like this:
norank superkingdom phylum class order family genus species strain
1 cellularorganisms(norank) Bacteria(superkingdom) Actinobacteria(phylum) <NA> <NA> <NA> <NA> <NA> <NA>
2 cellularorganisms(norank) Bacteria(superkingdom) Proteobacteria(phylum) Gammaproteobacteria(class) Enterobacteriales(order) Enterobacteriaceae(family) <NA> <NA> <NA>
3 cellularorganisms(norank) Bacteria(superkingdom) Proteobacteria(phylum) Gammaproteobacteria(class) Enterobacteriales(order) <NA> <NA> <NA> <NA>
4 cellularorganisms(norank) Bacteria(superkingdom) Proteobacteria(phylum) Gammaproteobacteria(class) Enterobacteriales(order) Enterobacteriaceae(family) Klebsiella(genus) <NA> <NA>
5 cellularorganisms(norank) Bacteria(superkingdom) Proteobacteria(phylum) Gammaproteobacteria(class) Enterobacteriales(order) Enterobacteriaceae(family) Klebsiella(genus) Klebsiellapneumoniae(species) <NA>
6 cellularorganisms(norank) Bacteria(superkingdom) Firmicutes(phylum) Clostridia(class) Clostridiales(order) <NA> <NA> <NA> <NA>
7 cellularorganisms(norank) Bacteria(superkingdom) Firmicutes(phylum) Clostridia(class) <NA> <NA> <NA> <NA> <NA>
8 cellularorganisms(norank) Bacteria(superkingdom) Firmicutes(phylum) Clostridia(class) Clostridiales(order) Clostridiaceae(family) <NA> <NA> <NA>
9 cellularorganisms(norank) Bacteria(superkingdom) Firmicutes(phylum) Clostridia(class) Clostridiales(order) Clostridiaceae(family) Clostridium(genus) <NA> <NA>
10 cellularorganisms(norank) Bacteria(superkingdom) Firmicutes(phylum) Clostridia(class) Clostridiales(order) Clostridiaceae(family) Clostridium(genus) Clostridiumbotulinum(species) <NA>
11 cellularorganisms(norank) Bacteria(superkingdom) Firmicutes(phylum) Clostridia(class) Clostridiales(order) Clostridiaceae(family) Clostridium(genus) Clostridiumbotulinum(species) ClostridiumbotulinumCDC66177(strain)
12 cellularorganisms(norank) Bacteria(superkingdom) Actinobacteria(phylum) Actinobacteria(class) Actinomycetales(order) <NA> <NA> <NA> <NA>
13 cellularorganisms(norank) Bacteria(superkingdom) Actinobacteria(phylum) Actinobacteria(class) Actinomycetales(order) Microbacteriaceae(family) <NA> <NA> <NA>
14 cellularorganisms(norank) Bacteria(superkingdom) Actinobacteria(phylum) Actinobacteria(class) Actinomycetales(order) Microbacteriaceae(family) Microbacterium(genus) <NA> <NA>
15 cellularorganisms(norank) Bacteria(superkingdom) Actinobacteria(phylum) Actinobacteria(class) Actinomycetales(order) Microbacteriaceae(family) Microbacterium(genus) Microbacteriumlaevaniformans(species) MicrobacteriumlaevaniformansOR221(strain)
You can do this by building a small one-row data.frame for each string and then combining the list of data.frames into one:
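library(dplyr)
NewDF <-
  lapply(strsplit(Test$Taxa, "_", fixed = TRUE),
         function(x) {
           # pull the taxonomic level out of the parentheses of each piece
           vars <- lapply(x, function(y) {
             m <- regexec("\\((.+?)\\)", y)
             regmatches(y, m)[[1]][2]
           })
           # name each piece by its level and turn it into a one-row data.frame
           vals <- as.list(x)
           names(vals) <- unlist(vars)
           data.frame(vals, stringsAsFactors = FALSE)
         }) %>%
  bind_rows()  # replaces the older rbind_all(); missing levels become NA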
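which gives me the desired output (with nice variable names too). The result also contains subclass and suborder columns for the rows that include those levels; select only the columns you need to drop them.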
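If you would rather skip the unwanted levels while parsing, a base R variant is one option. The following is only a minimal sketch under my own assumptions: the names `ranks`, `taxa_to_row`, `taxa` and `wide` are made up for illustration, each piece is assumed to end in a single `(level)` suffix, and if a level occurred twice in one string only the first occurrence would be kept.

```r
# Levels to keep, in the desired column order (taken from the question).
ranks <- c("norank", "superkingdom", "phylum", "class", "order",
           "family", "genus", "species", "strain")

# Hypothetical helper: map the pieces of one taxonomy string onto `ranks`.
taxa_to_row <- function(pieces, ranks) {
  # level = text inside the trailing parentheses, e.g. "phylum"
  level <- sub("^.*\\((.*)\\)$", "\\1", pieces)
  # match() gives NA for absent levels, and indexing with NA yields NA;
  # levels not listed in `ranks` (subclass, suborder) are never selected
  setNames(pieces[match(ranks, level)], ranks)
}

# Two shortened strings in the same format as Test$Taxa
taxa <- c(
  "cellularorganisms(norank)_Bacteria(superkingdom)_Actinobacteria(phylum)",
  "cellularorganisms(norank)_Bacteria(superkingdom)_Actinobacteria(phylum)_Actinobacteria(class)_Actinobacteridae(subclass)_Actinomycetales(order)"
)

rows <- lapply(strsplit(taxa, "_", fixed = TRUE), taxa_to_row, ranks = ranks)
wide <- as.data.frame(do.call(rbind, rows), stringsAsFactors = FALSE)
```

With the data from the question you would pass `Test$Taxa` instead of `taxa`, and something like `cbind(Test["Percent"], wide)` should put the percentages back next to the parsed columns.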