Combining irrelevant / similar observations into one

After running a survey on perceived problems in each district, I got the data frame below. Since the survey offered several fixed choices plus an open-ended one, the answers to the open-ended option are often irrelevant (see below):

library(dplyr)
library(splitstackshape)
df = read.csv("http://pastebin.com/raw.php?i=tQKHWMvL")

# Splitting multiple answers into different rows.
df = cSplit(df, "Problems", ",", direction = "long")

df = df %>%
  group_by(Problems) %>%
  summarise(Total = n()) %>%
  mutate(freq = Total/sum(Total)*100) %>%
  arrange(desc(freq))

      

This results in the following data frame:

> df
Source: local data table [34 x 3]

                       Problems Total       freq
1  Hurtos o robos sin violencia   245 25.6008359
2                        Drogas   232 24.2424242
3             Peleas callejeras   162 16.9278997
4               Ningún problema   149 15.5694880
5                    Agresiones    66  6.8965517
6           Robos con violencia    62  6.4785789
7            Quema contenedores     6  0.6269592
8                        Ruidos     5  0.5224660
9                         NS/NC     4  0.4179728
10                    Desempleo     2  0.2089864
..                          ...   ...        ...
>

      

As you can see, the results after row 9 are mostly irrelevant (only one or two respondents per category), so I would like to group them into a single category (like "others") without losing the relationship to the neighborhood (so I can't simply rename the values at this point). Any suggestions?


1 answer


splitstackshape imports the data.table package (so you don't even need to load it with library) and assigns the data.table class to your dataset, so I'll just continue with the data.table syntax from there, especially since nothing beats data.table when it comes to subset assignments.

In other words, instead of this long dplyr pipeline, you can simply do

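# Percentage share of each problem, computed on the long-format df from cSplit().
df[, freq := .N / nrow(df) * 100, by = Problems]
# Relabel every problem with a share below 6% as "OTHER".
df[freq < 6, Problems := "OTHER"]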

      



And you're good to go.

You can check the new summary table using

df[, .(freq = .N/nrow(df) * 100), by = Problems][order(-freq)]
# 1: Hurtos o robos sin violencia 25.600836
# 2:                       Drogas 24.242424
# 3:            Peleas callejeras 16.927900
# 4:              Ningún problema 15.569488
# 5:                   Agresiones  6.896552
# 6:          Robos con violencia  6.478579
# 7:                        OTHER  4.284222
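
Because the relabelling is done row by row on the long-format table, each answer keeps its link to the district it came from. The following is only a minimal sketch on made-up data to illustrate that point: the District column name, the toy values, and the 15% cutoff are all assumptions, since the real column names in the survey file are not shown here.

```r
library(data.table)

# Toy long-format data; the District column and its values are hypothetical.
toy <- data.table(
  District = c("A", "A", "A", "B", "B", "B", "B", "C", "C", "C"),
  Problems = c("Drogas", "Drogas", "Ruidos", "Drogas", "Agresiones",
               "Agresiones", "Desempleo", "Drogas", "Agresiones", "Agresiones")
)

# Share of each problem across all answers, in percent.
toy[, freq := .N / nrow(toy) * 100, by = Problems]

# Lump rare answers together; 15 is an arbitrary cutoff chosen for this toy data.
toy[freq < 15, Problems := "OTHER"]

# Every row still carries its district, so per-district counts remain available.
per_district <- toy[, .(Total = .N), by = .(District, Problems)]
print(per_district)
```

With the real data you would replace District with whatever the neighborhood column is actually called and keep the 6% cutoff used above.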

      
