How to count strings separated by semicolons

My data looks like this:

df <- structure(list(V1 = structure(c(7L, 4L, 8L, 8L, 5L, 3L, 1L, 1L, 
2L, 1L, 6L), .Label = c("", "cell and biogenesis;transport", 
"differentiation;metabolic process;regulation;stimulus", "MAPK cascade;cell and biogenesis", 
"MAPK cascade;cell and biogenesis;transport", "metabolic process;regulation;stimulus;transport", 
"mRNA;stimulus;transport", "targeting"), class = "factor")), .Names = "V1", class = "data.frame", row.names = c(NA, 
-11L))

      

I want to count how many times each string occurs, and also keep track of the rows it comes from. The strings are separated by the character ; but each one belongs to the row it is on.

I want to get output like this:

String                           Count        position 
mRNA                                 1        1
stimulus                             3        1,6,11
transport                            4        1,5,9,11
MAPK cascade                         2        2,5
cell and biogenesis                  3        2,5,9
targeting                            2        3,4
regulation of mRNA stability         1        1
regulation                           2        6,11
differentiation                      1        6
metabolic process                    2        6,11

      

The count shows how many times each string (strings are separated by semicolons) occurs in the whole data set. The position column shows the rows where it occurred. For example, mRNA was only in the first row, so its position is 1, while stimulus was in three rows: 1, 6 and 11.

Some rows are empty, and they still count as rows.



3 answers


In the code below, we do the following:

  • Add the row numbers as a column.
  • Use strsplit to split each row into its components and store the result in a column named string.
  • strsplit returns a list. We use unnest to expand that list column into a "long" data frame, which gives us a tidy data frame that is easy to work with.
  • Group by string and return a new data frame that counts the frequency of each string and gives the row numbers where each instance originally appeared.

library(tidyverse)

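# strsplit() needs a character vector, so convert the factor first
df$V1 = as.character(df$V1)

df %>% 
  rownames_to_column() %>%                  # row numbers go into a column named "rowname"
  mutate(string = strsplit(V1, ";")) %>%    # list column: one character vector per row
  unnest(string) %>%                        # one row per string; empty rows are dropped
  group_by(string) %>%
  summarise(count = n(),
            rows = paste(rowname, collapse=","))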

      



               string count     rows
1 cell and biogenesis     3    2,5,9
2     differentiation     1        6
3        MAPK cascade     2      2,5
4   metabolic process     2     6,11
5                mRNA     1        1
6          regulation     2     6,11
7            stimulus     3   1,6,11
8           targeting     2      3,4
9           transport     4 1,5,9,11

      

If you plan to do further processing on the row numbers, you may want to keep them as numeric values rather than pasted together into a single string. In that case, you can do this:

df.new = df %>% 
  rownames_to_column("rows") %>% 
  mutate(rows = as.integer(rows),           # row names are character, so convert them
         string = strsplit(V1, ";")) %>% 
  select(-V1) %>%
  unnest(string)

      

This will give you a long data frame with one row for each combination of string and rows.
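As an illustration of such further processing, here is a minimal sketch that finds the first and last row in which each string appears. It assumes tidyr 1.0 or later (where unnest takes the column to expand) and rebuilds the question's data as a plain character column; the names row_summary, first_row and last_row are made up for this example.

```r
library(dplyr)
library(tidyr)
library(tibble)

# The data from the question, as a character column instead of a factor
df <- data.frame(
  V1 = c("mRNA;stimulus;transport", "MAPK cascade;cell and biogenesis",
         "targeting", "targeting", "MAPK cascade;cell and biogenesis;transport",
         "differentiation;metabolic process;regulation;stimulus", "", "",
         "cell and biogenesis;transport", "",
         "metabolic process;regulation;stimulus;transport"),
  stringsAsFactors = FALSE
)

# Long format with integer row numbers
df.new <- df %>%
  rownames_to_column("rows") %>%
  mutate(rows = as.integer(rows),
         string = strsplit(V1, ";")) %>%
  select(-V1) %>%
  unnest(string)

# Numeric row numbers allow numeric summaries such as min() and max()
row_summary <- df.new %>%
  group_by(string) %>%
  summarise(count = n(),
            first_row = min(rows),
            last_row = max(rows))
```

Because rows is an integer column here, it can also be used directly for filtering or joining back to the original data.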



Base R approach:

# convert 'V1' to a character vector (only necessary if it isn't already)
df$V1 <- as.character(df$V1)

# get the unique strings
strng <- unique(unlist(strsplit(df$V1,';')))

# create a list with the rows for each unique string
# (grep with fixed = TRUE matches substrings, so a string contained in a longer one also matches)
lst <- lapply(strng, function(x) grep(x, df$V1, fixed = TRUE))

# get the counts for each string
count <- lengths(lst)

# collapse the row numbers for each string into a single comma-separated string
pos <- sapply(lst, toString)

# put everything together in one dataframe
d <- data.frame(strng, count, pos)

      

You can shorten this approach to:



d <- data.frame(strng = unique(unlist(strsplit(df$V1,';'))))
lst <- lapply(d$strng, function(x) grep(x, df$V1, fixed = TRUE))
transform(d, count = lengths(lst), pos = sapply(lst, toString))

      

Result:

> d
                strng count         pos
1                mRNA     1           1
2            stimulus     3    1, 6, 11
3           transport     4 1, 5, 9, 11
4        MAPK cascade     2        2, 5
5 cell and biogenesis     3     2, 5, 9
6           targeting     2        3, 4
7     differentiation     1           6
8   metabolic process     2       6, 11
9          regulation     2       6, 11
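One thing to keep in mind: grep with fixed = TRUE matches substrings, so a string such as "regulation" would also be found in a row containing "regulation of mRNA stability". The data in the question does not contain such a pair, so the result above is unaffected. If it matters for your data, a minimal sketch of an exact-match variant could look like this. The input vector x and the names grep_rows, exact_rows and d_exact are hypothetical, chosen only to show the difference.

```r
# Hypothetical input where one string is a substring of another
x <- c("regulation;transport", "regulation of mRNA stability", "")

# split every row once and collect the unique strings
parts <- strsplit(x, ";", fixed = TRUE)
strng <- unique(unlist(parts))

# substring matching, as in the grep-based approach
grep_rows <- lapply(strng, function(s) grep(s, x, fixed = TRUE))

# exact matching: a row counts only if the string is one of its split components
exact_rows <- lapply(strng, function(s) {
  which(vapply(parts, function(p) s %in% p, logical(1)))
})

d_exact <- data.frame(strng,
                      count = lengths(exact_rows),
                      pos = vapply(exact_rows, toString, character(1)))
```

Here grep_rows reports "regulation" in rows 1 and 2, while exact_rows reports it only in row 1.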

      



A possible data.table solution, for completeness:

library(data.table)
setDT(df)[, .(.I, unlist(tstrsplit(V1, ";", fixed = TRUE)))
          ][!is.na(V2), .(count = .N, pos = toString(sort(I))), 
            by = .(String = V2)]
#                 String count         pos
# 1:                mRNA     1           1
# 2:        MAPK cascade     2        2, 5
# 3:           targeting     2        3, 4
# 4:     differentiation     1           6
# 5: cell and biogenesis     3     2, 5, 9
# 6:   metabolic process     2       6, 11
# 7:            stimulus     3    1, 6, 11
# 8:           transport     4 1, 5, 9, 11
# 9:          regulation     2       6, 11

      

This basically splits the column V1 on ; while converting to long format, and at the same time binds each piece to its row index (.I). After that, it's just a simple aggregation by String: the row count (.N) and the positions collapsed into a single string.
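To make the two steps easier to see, here is the same split-to-long-then-aggregate idea sketched in base R rather than data.table. This is not the code above, only an illustration of the intermediate long table; the names parts, long and res are made up for this example, and the data is the question's data written as a plain character vector.

```r
# The data from the question, as a plain character vector
V1 <- c("mRNA;stimulus;transport", "MAPK cascade;cell and biogenesis",
        "targeting", "targeting", "MAPK cascade;cell and biogenesis;transport",
        "differentiation;metabolic process;regulation;stimulus", "", "",
        "cell and biogenesis;transport", "",
        "metabolic process;regulation;stimulus;transport")

parts <- strsplit(V1, ";", fixed = TRUE)

# Step 1: long format, one row per (row index, string) pair.
# Empty rows split into zero pieces, so they add nothing here.
long <- data.frame(I = rep(seq_along(parts), lengths(parts)),
                   String = unlist(parts),
                   stringsAsFactors = FALSE)

# Step 2: aggregate by String, giving the count and the collapsed positions
pos <- tapply(long$I, long$String, toString)
res <- data.frame(String = names(pos),
                  count = as.vector(tapply(long$I, long$String, length)),
                  pos = as.vector(pos),
                  stringsAsFactors = FALSE)
```

The rows of res come out in alphabetical order of String, unlike the data.table result, which keeps the order of first appearance.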
