Aggregation and mapping of observations from an open questionnaire

Summary

I want to create a boxplot like this one, showing the most common problems in each neighborhood of the city. (Example boxplot with real data.)

Unfortunately, the boxplot is useless as it stands, because the data comes from an open questionnaire and has two main problems:

  • There are many irrelevant answers (by irrelevant I mean answers given by only one or a few people).
  • Some answers refer to the same concept but were phrased differently, so they are treated as separate problems.

To make it more useful, I would like to collapse the irrelevant answers into a single group (e.g. "Other problems") and rename the problems that mean the same thing so that they are spelled identically and can be displayed correctly in the plot. Unfortunately, I have not managed to do this.

Detailed explanation

Let's take a look at some example code (the names in the data frame are just examples: I've changed them to make it easier to see that two or more problems are related, but the real terms cannot always be inferred with a regex):

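library(plyr)   # load plyr before dplyr so that dplyr's verbs are not masked
library(dplyr)
library(tidyr)

df <- read.csv("http://pastebin.com/raw/bUxANQw6")

# Count how many times each answer was given, most frequent first
problems <- df %>%
  select(Problems) %>%
  gather(variable, value) %>%
  group_by(value) %>%
  summarise(Total = n()) %>%
  arrange(desc(Total))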

      

This results in the following data frame:

> problems
Source: local data frame [27 x 2]

          value Total
1     Problem 1   282
2     Problem 3   268
3     Problem 2   186
4   No problems   160
5     Problem 4    76
6     Problem 5    68
7     Problem 6     6
8     Problem 7     5
9  Doesn't know     4
10    Problem 8     2
..          ...   ...
> 

      

As you can see, we have 27 problems, and looking at them more carefully, we could create several groups:

  • Relevant data: Problems 1 through 7, plus No problems and Doesn't know
  • Synonyms: we have Problem 9, Problem 9', Problem 9'' or Problem 9''' (among others)
  • Irrelevant data that should be grouped under a single label, such as "Other problems": Problems 12 to 18

My suggested approach

What I thought I could do to overcome these two problems:

To deal with synonyms, I was thinking about renaming the synonymous values to a single one, perhaps using revalue, something like this:

df$Problems = revalue(df$Problems, c('Problem 9’' = 'Problem 9',
                                     'Problem 9’’' = 'Problem 9',
                                     'Problem 9’’’' = 'Problem 9'))

      

However, as an R newbie (and new to programming languages, too), I think there must be a faster way to achieve this, as the task of keeping a dictionary of synonyms is going to be very tedious and will grow with more answers.

To deal with irrelevant answers, I could use a similar approach and revalue them as Other problems, but I would like to do it automatically, since the list of irrelevant terms will grow (the questionnaire is not yet complete) and I cannot match all of them by hand. For example, I would like to match all values that were given by fewer than 5 people (Total < 5). I think I should write a function and use a control structure (for ... in), but I haven't succeeded yet.

Since I need to display a boxplot of the answers grouped by neighborhood, I am afraid that I cannot use the problems data frame as it is. So while it is useful for calculating the total number of votes for a problem, I have no idea what to do with it other than using it for reference. On the other hand, I cannot decide whether an answer is irrelevant based only on the responses received in each neighborhood, as that would bias the results: different neighborhoods are expected to have different problems.

Any help with these two issues would be really appreciated. Thanks.



1 answer


I looked at your data and code. Your data frame, problems, contains Problem 9’, Problem 7', etc. So you want to remove ’ and '. This is one of your tasks. You can accomplish it with the following line.

# Strip trailing apostrophes only, so that "Doesn't know" keeps its apostrophe
problems$value <- gsub(pattern = "(’|')+$", replacement = "", x = problems$value)
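
After the substitution, several rows of problems can share a label (for example, four rows all reading Problem 9), so the totals need to be summed again before any cut-off is applied. Here is a minimal sketch in base R. It uses a toy subset of the data shown under DATA and a pattern anchored at the end of the string, so that Doesn't know keeps its apostrophe. The name merged is only illustrative.

```r
# Toy subset of the `problems` data frame (values copied from the DATA block)
problems <- data.frame(
  value = c("Problem 1", "Problem 7", "Problem 7'", "Problem 9",
            "Problem 9’", "Problem 9’’", "Problem 9’’’", "Doesn't know"),
  Total = c(282L, 5L, 1L, 2L, 2L, 1L, 1L, 4L),
  stringsAsFactors = FALSE
)

# Drop trailing apostrophes, straight or curly
problems$value <- gsub("(’|')+$", "", problems$value)

# Sum the totals of the rows that now share a label
merged <- aggregate(Total ~ value, data = problems, FUN = sum)

# Most frequent first; Problem 9 is now a single row with Total 6
merged <- merged[order(-merged$Total), ]
```

With the synonyms merged first, Problem 9 reaches 6 answers and is no longer caught by a Total < 5 rule.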

      

You can accomplish the other task using which(). You want to find the rows where Total < 5. Using which(), you can find their indices. Then you replace the corresponding strings in value with Other problems. Hope this is what you need.

problems$value[which(problems$Total < 5)] <- "Other problems"
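
Two details may be worth checking here. First, once several rows are relabelled they should be summed into one Other problems row. Second, Doesn't know has only 4 answers, so a bare Total < 5 rule relabels it too, although the question lists it as relevant. The sketch below is one possible way to handle both, in base R on a toy subset of the data. The names threshold, keep and lumped are my own assumptions, not part of the original code.

```r
# Toy subset of `problems`, with synonyms assumed to be merged already
problems <- data.frame(
  value = c("Problem 1", "Problem 6", "Problem 7", "Doesn't know",
            "Problem 8", "Problem 12", "Problem 13"),
  Total = c(282L, 6L, 5L, 4L, 2L, 1L, 1L),
  stringsAsFactors = FALSE
)

threshold <- 5               # cut-off from the "Total < 5" rule in the question
keep <- c("Doesn't know")    # hypothetical list of labels that must never be lumped

# A logical index does the same job as which() here
rare <- problems$Total < threshold & !(problems$value %in% keep)
problems$value[rare] <- "Other problems"

# Several rows now carry the same label, so sum them into one
lumped <- aggregate(Total ~ value, data = problems, FUN = sum)
```

Because the rule is expressed as a comparison rather than a list of names, it keeps working as new answers arrive, which avoids the for loop the question mentions.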

      

DATA



problems <- structure(list(value = c("Problem 1", "Problem 3", "Problem 2", 
"No problems", "Problem 4", "Problem 5", "Problem 6", "Problem 7", 
"Doesn't know", "Problem 8", "Problem 9", "Problem 9’", "Other problems", 
"Problem 10", "Problem 10’", "Problem 11", "Problem 11'", "Problem 12", 
"Problem 13", "Problem 14", "Problem 15", "Problem 16", "Problem 17", 
"Problem 18", "Problem 7'", "Problem 9’’", "Problem 9’’’"
), Total = c(282L, 268L, 186L, 160L, 76L, 68L, 6L, 5L, 4L, 2L, 
2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L)), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, 
-27L), .Names = c("value", "Total"))

      

EDIT

Following the OP's first comment, the lines below build a data frame suitable for plotting.

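temp <- count(df, Neighborhoods, Problems)   # answers per neighborhood and problem

# Strip trailing apostrophes so that synonyms share one label
temp$Problems <- gsub(pattern = "(’|')+$", replacement = "", x = temp$Problems)

# Relabel neighborhood/problem combinations with fewer than 5 answers
temp$Problems[which(temp$n < 5)] <- "Other problems"

# Re-aggregate, since several rows can now carry the same label
temp2 <- temp %>%
  group_by(Neighborhoods, Problems) %>%
  summarize(Total = sum(n))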
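
One caveat: the lines above apply the cut-off to the per-neighborhood counts, whereas the question asks for relevance to be judged on the overall totals. As a rough sketch of that variant, in base R, the labels can be cleaned and lumped on the raw data first and counted per neighborhood afterwards. The toy df below is invented for illustration (only the column names Neighborhoods and Problems come from the code above), and threshold is set to 3 only because the toy data is so small.

```r
# Toy stand-in for `df`; the rows are made up for illustration
df <- data.frame(
  Neighborhoods = c("A", "A", "A", "B", "B", "B", "B", "A"),
  Problems = c("Problem 1", "Problem 1", "Problem 9", "Problem 1",
               "Problem 9’", "Problem 9’’", "Problem 12", "Problem 13"),
  stringsAsFactors = FALSE
)

threshold <- 3  # hypothetical cut-off, applied to the city-wide totals

# 1. Merge synonyms by stripping trailing apostrophes
df$Problems <- gsub("(’|')+$", "", df$Problems)

# 2. Find labels that are rare across the whole city
totals <- table(df$Problems)
rare <- names(totals)[totals < threshold]

# 3. Lump the rare labels together
df$Problems[df$Problems %in% rare] <- "Other problems"

# 4. Only now count per neighborhood
temp2 <- as.data.frame(
  table(Neighborhoods = df$Neighborhoods, Problems = df$Problems),
  responseName = "Total", stringsAsFactors = FALSE
)
temp2 <- temp2[temp2$Total > 0, ]
```

In this toy data, Problem 9 appears only once in neighborhood A but is kept there, because it is judged on its city-wide total of 3.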

      
