How to count strings separated by semicolons
My data looks like this:
df <- structure(list(V1 = structure(c(7L, 4L, 8L, 8L, 5L, 3L, 1L, 1L,
2L, 1L, 6L), .Label = c("", "cell and biogenesis;transport",
"differentiation;metabolic process;regulation;stimulus", "MAPK cascade;cell and biogenesis",
"MAPK cascade;cell and biogenesis;transport", "metabolic process;regulation;stimulus;transport",
"mRNA;stimulus;transport", "targeting"), class = "factor")), .Names = "V1", class = "data.frame", row.names = c(NA,
-11L))
I want to count how many times each string occurs, and also keep track of which rows it comes from. The strings are separated by the character ; and each one belongs to the row it is on.
I want to get output like this:
String Count position
mRNA 1 1
stimulus 3 1,6,11
transport 4 1,5,9,11
MAPK cascade 2 2,5
cell and biogenesis 3 2,5,9
targeting 2 3,4
regulation of mRNA stability 1 1
regulation 2 6,11
differentiation 1 6
metabolic process 2 6,11
The count shows how many times each string (strings are separated by semicolons) occurs in the whole data set. The position column shows which rows it was found in: for example, mRNA appears only in the first row, so its position is 1, while stimulus appears in three rows: 1, 6 and 11.
Some rows are empty, and they still count as rows when numbering the positions.
In the code below, we do the following:
- Add row numbers as a column.
- Use strsplit to split each row into its components and store the result in a column named string.
- strsplit returns a list. We use unnest to unstack the list components into a "long" data frame, giving us a tidy data frame that is ready to summarise.
- Group by string and return a new data frame that counts the frequency of each string and gives the row numbers where each instance of it originally appeared.
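library(tidyverse)
# strsplit() needs a character vector, and V1 is stored as a factor
df$V1 = as.character(df$V1)
df %>%
  rownames_to_column() %>%                 # row numbers go into a column named "rowname"
  mutate(string = strsplit(V1, ";")) %>%   # list column: one character vector per row
  unnest(string) %>%                       # one line per string; empty rows drop out
  group_by(string) %>%
  summarise(count = n(),
            rows = paste(rowname, collapse=","))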
string count rows
1 cell and biogenesis 3 2,5,9
2 differentiation 1 6
3 MAPK cascade 2 2,5
4 metabolic process 2 6,11
5 mRNA 1 1
6 regulation 2 6,11
7 stimulus 3 1,6,11
8 targeting 2 3,4
9 transport 4 1,5,9,11
If you plan to do further processing on the row numbers, you can store them as numeric values rather than as a single pasted string. In this case, you can do this:
df.new = df %>%
  rownames_to_column("rows") %>%
  mutate(rows = as.integer(rows),          # keep the row numbers numeric
         string = strsplit(V1, ";")) %>%
  select(-V1) %>%
  unnest(string)
This will give you a long data frame with one line for each combination of string and rows.
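As a rough illustration of why numeric row numbers are convenient, here is a minimal sketch on a made-up three-row tibble. The names toy, per_string, first_row and last_row are hypothetical and not part of the answer above, and row_number() is used here simply as one way of getting integer row numbers.

```r
library(dplyr)
library(tidyr)

# Toy stand-in for the long data frame: one line per (rows, string) pair,
# with the row number stored as an integer. The empty second row drops out.
toy <- tibble(V1 = c("a;b", "", "b;c")) %>%
  mutate(rows = row_number(),
         string = strsplit(V1, ";")) %>%
  select(-V1) %>%
  unnest(string)

# Because rows is numeric, it can be aggregated directly
per_string <- toy %>%
  group_by(string) %>%
  summarise(count = n(),
            first_row = min(rows),
            last_row = max(rows))
```

With a pasted string such as "1,3" the same summary would first need the row numbers to be split and converted back.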
Base R approach:
# convert 'V1' to a character vector (only necessary if it isn't already)
df$V1 <- as.character(df$V1)
# get the unique strings
strng <- unique(unlist(strsplit(df$V1,';')))
# create a list with the rows for each unique string
lst <- lapply(strng, function(x) grep(x, df$V1, fixed = TRUE))
# get the counts for each string
count <- lengths(lst)
# collapse the row positions for each string into a single string of row numbers
pos <- sapply(lst, toString)
# put everything together in one dataframe
d <- data.frame(strng, count, pos)
You can shorten this approach to:
d <- data.frame(strng = unique(unlist(strsplit(df$V1,';'))))
lst <- lapply(d$strng, function(x) grep(x, df$V1, fixed = TRUE))
transform(d, count = lengths(lst), pos = sapply(lst, toString))
Result:
> d
strng count pos
1 mRNA 1 1
2 stimulus 3 1, 6, 11
3 transport 4 1, 5, 9, 11
4 MAPK cascade 2 2, 5
5 cell and biogenesis 3 2, 5, 9
6 targeting 2 3, 4
7 differentiation 1 6
8 metabolic process 2 6, 11
9 regulation 2 6, 11
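One thing to be aware of: grep with fixed = TRUE matches substrings, so a string such as mRNA would also be found in a row containing "regulation of mRNA stability". That does not affect the data shown here, but if it matters for your data, a possible variant that compares whole strings only is sketched below. The vector v1 is made-up toy data and the names parts, row_id and string are hypothetical.

```r
# Toy data: same shape as df$V1, already converted to character
v1 <- c("mRNA;stimulus;transport", "", "regulation of mRNA stability;transport")

# Split each row; an empty row yields character(0) and contributes nothing
parts <- strsplit(v1, ";", fixed = TRUE)

# Long format: one element per (row, string) pair
row_id <- rep(seq_along(parts), lengths(parts))
string <- unlist(parts)

# Group the row numbers by string, keeping the order of first appearance
lst <- split(row_id, factor(string, levels = unique(string)))

d <- data.frame(strng = names(lst),
                count = lengths(lst),
                pos   = sapply(lst, toString),
                row.names = NULL)
```

Note that this sketch counts every occurrence, so a string repeated within one row would be counted twice, whereas the grep version counts rows.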
A possible data.table solution, for completeness:
library(data.table)
setDT(df)[, .(.I, unlist(tstrsplit(V1, ";", fixed = TRUE)))
][!is.na(V2), .(count = .N, pos = toString(sort(I))),
by = .(String = V2)]
# String count pos
# 1: mRNA 1 1
# 2: MAPK cascade 2 2, 5
# 3: targeting 2 3, 4
# 4: differentiation 1 6
# 5: cell and biogenesis 3 2, 5, 9
# 6: metabolic process 2 6, 11
# 7: stimulus 3 1, 6, 11
# 8: transport 4 1, 5, 9, 11
# 9: regulation 2 6, 11
This basically splits the column V1 on ; while converting to long format, and at the same time binds each piece to its row index (.I). After that, it is just a simple aggregation, by String, of the row count (.N) and of the positions collapsed into a single string.
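To make the long-format step easier to see, here is a sketch of the same idea written in two separate steps on made-up toy data. It is a variant rather than the code above: it splits per row with strsplit and by instead of tstrsplit, and the names dt, row_id, long and res are hypothetical.

```r
library(data.table)

# Toy data with an empty second row
dt <- data.table(V1 = c("a;b", "", "b;c"))
dt[, row_id := .I]

# Step 1: long format, one line per (row_id, string) pair.
# The empty row yields character(0), so it produces no lines.
long <- dt[, .(string = unlist(strsplit(V1, ";", fixed = TRUE))), by = row_id]

# Step 2: aggregate the count and the collapsed positions per string
res <- long[, .(count = .N, pos = toString(sort(row_id))), by = string]
```

Since empty rows produce no lines in step 1, this variant does not need the !is.na filter.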