Combining irrelevant / similar observations into one (s)

Question

After running a survey on perceived problems in each district, I ended up with the data frame below. Since the survey offered several fixed choices plus an open-ended one, the open-ended answers are often irrelevant (see below):

library(dplyr)
library(splitstackshape)
df = read.csv("http://pastebin.com/raw.php?i=tQKHWMvL")

# Splitting multiple answers into different rows.
df = cSplit(df, "Problems", ",", direction = "long")

df = df %>%
  group_by(Problems) %>%
  summarise(Total = n()) %>%
  mutate(freq = Total/sum(Total)*100) %>%
  arrange(desc(freq))

This results in the following data frame:

> df
Source: local data table [34 x 3]

                       Problems Total       freq
1  Hurtos o robos sin violencia   245 25.6008359
2                        Drogas   232 24.2424242
3             Peleas callejeras   162 16.9278997
4               Ningún problema   149 15.5694880
5                    Agresiones    66  6.8965517
6           Robos con violencia    62  6.4785789
7            Quema contenedores     6  0.6269592
8                        Ruidos     5  0.5224660
9                         NS/NC     4  0.4179728
10                    Desempleo     2  0.2089864
..                          ...   ...        ...
>

As you can see, the results after row 9 are mostly irrelevant (only one or two respondents per category), so I would like to group them into a single category (like "others") without losing the relationship to the neighborhood (so I can't just rename the values at this point). Any suggestions?

r dataframe dplyr splitstackshape

ccamara Jul 24 15 at 11:47

1 answer

David Arenburg · Answer 1 · 2015-07-24T12:11:57+0000

splitstackshape imports the data.table package (so you don't even need to call library on it) and assigns the data.table class to your dataset, so I'll just continue with data.table syntax from there, especially since nothing beats data.table when it comes to assignments on a subset.

In other words, instead of this long dplyr pipeline, you can simply do the following on df as it stands right after the cSplit step (before it is overwritten by the summarise pipeline):

df[, freq := .N / nrow(df) * 100 , by = Problems]
df[freq < 6, Problems := "OTHER"]

And you're good to go.
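
To see why this keeps the link to the neighborhood, here is a minimal sketch on made-up data. The table toy, its Neighborhood column and the 15% cutoff are assumptions for illustration only; they are not taken from the survey file.

```r
library(data.table)

# Hypothetical toy data: one row per (neighborhood, reported problem),
# i.e. the shape df has right after cSplit(..., direction = "long").
toy <- data.table(
  Neighborhood = c("A", "A", "B", "B", "B", "C", "C", "C", "C", "C"),
  Problems     = c("Drogas", "Ruidos", "Drogas", "Drogas", "Desempleo",
                   "Agresiones", "Agresiones", "Agresiones", "Drogas", "Drogas")
)

# Share of all rows taken by each answer, written onto every row by reference.
toy[, freq := .N / nrow(toy) * 100, by = Problems]

# Relabel the rare answers (arbitrary 15% cutoff for this toy example).
# Only the Problems column changes; Neighborhood stays attached to each row.
toy[freq < 15, Problems := "OTHER"]

print(toy)
```

Because the relabeling happens row by row on the long table rather than on the summary, every "OTHER" row still carries its neighborhood. Note that the freq column is not recomputed after the relabeling, so it still holds the share of the original answer; recompute it per group if you need the share of "OTHER" itself.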

You can check the new summary table using

df[, .(freq = .N/nrow(df) * 100), by = Problems][order(-freq)]
# 1: Hurtos o robos sin violencia 25.600836
# 2:                       Drogas 24.242424
# 3:            Peleas callejeras 16.927900
# 4:              Ningún problema 15.569488
# 5:                   Agresiones  6.896552
# 6:          Robos con violencia  6.478579
# 7:                        OTHER  4.284222
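
If you would rather stay in dplyr, the same idea can be sketched as below. This is not part of the approach above, just a rough equivalent under assumptions: the data frame toy, its Neighborhood column and the 15% cutoff are made up for illustration, and add_count() with a name argument needs a reasonably recent dplyr.

```r
library(dplyr)

# Hypothetical toy data in long format: one row per (neighborhood, problem).
toy <- data.frame(
  Neighborhood = c("A", "A", "B", "B", "B", "C", "C", "C", "C", "C"),
  Problems     = c("Drogas", "Ruidos", "Drogas", "Drogas", "Desempleo",
                   "Agresiones", "Agresiones", "Agresiones", "Drogas", "Drogas"),
  stringsAsFactors = FALSE
)

lumped <- toy %>%
  add_count(Problems, name = "Total") %>%   # per-answer count on every row
  mutate(
    freq     = Total / n() * 100,           # share of all rows, in percent
    Problems = if_else(freq < 15, "OTHER", Problems)  # arbitrary 15% cutoff
  )

# Summary after lumping; the long table `lumped` still has Neighborhood.
lumped %>% count(Problems, sort = TRUE)
```

The key difference from the pipeline in the question is that the counts are added to the long table with add_count() instead of collapsing it with summarise(), so the rare answers can be relabeled while each row keeps its neighborhood.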
