Convert irregular data to usable format in R

Suppose I work for a company that provides a range of different services to its clients. I have been given a report of service data that I need to analyze. The report is formatted to be easy to read and print, but not suitable for data analysis.

The report format is as follows:

Input:

customer <- c(1,2,2,3,3,3)
service1 <- c(1,3,5,1,3,5)
fee1 <- c(100,290,500,100,300,500)
service2 <- c("",4,"",2,4,8)
fee2 <- c("",400,"",200,390,800)

library(data.table)
DT <- data.table(customer, service1, fee1, service2, fee2)

      

Output:

> DT
   customer service1 fee1 service2 fee2
1:        1        1  100              
2:        2        3  290        4  400
3:        2        5  500              
4:        3        1  100        2  200
5:        3        3  300        4  390
6:        3        5  500        8  800

      

There are several customers, and each one consumes a range of services with associated fees. Services and fees are printed horizontally across four columns and then overflow to a new line. There can be any number of services for each customer, but each service can only be performed once per customer, and the fee for a service may differ between customers. The services probably always print in the same order, although the solution shouldn't rely on this.

The challenge is to transform the data into a usable format. I see two different ways to do this.

Option one (long format): drop the last two columns, add a new row for each service they contained, and move the service and fee values into those rows.

Option one will look like this:

    customer service fee
 1:        1       1 100
 2:        2       3 290
 3:        2       4 400
 4:        2       5 500
 5:        3       1 100
 6:        3       2 200
 7:        3       3 300
 8:        3       4 390
 9:        3       5 500
10:        3       8 800

      

Option two (wide format): collapse the data to one row per customer, create a column for every service number, and use the service labels as column headers (making sure each fee lands in the right column).

Option two will look like this:

   customer service.1 service.2 service.3 service.4 service.5 service.6 service.7 service.8
1:        1       100                                                                      
2:        2                           290       400       500                              
3:        3       100       200       300       390       500                           800

      

I can work with any format (and converting between long and wide is pretty easy).
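As a rough illustration of that conversion, here is a minimal sketch using `dcast` and `melt` from data.table. The table names `long`, `wide` and `long_again` are made up for the example, and the sample rows are just customers 1 and 2 from the data above.

```r
library(data.table)

# Long-format sample: the rows for customers 1 and 2 from the example above
long <- data.table(
  customer = c(1, 2, 2, 2),
  service  = c(1, 3, 4, 5),
  fee      = c(100, 290, 400, 500)
)

# Long -> wide: one row per customer, one column per service, fees as values
wide <- dcast(long, customer ~ service, value.var = "fee")

# Wide -> long: stack the service columns again and drop the empty (NA) cells
long_again <- melt(wide, id.vars = "customer",
                   variable.name = "service", value.name = "fee",
                   na.rm = TRUE)
```

Note that `melt` returns `service` as a factor here, so it would need converting if numeric service codes are required.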

As a starting point, I figured I would need to find either the number of services for each customer (option 1) or the number of unique services (option 2), expand the data table to the required size, and move the data.

It seems to me that data.table should be able to handle this, and I would prefer a solution using this package because of its efficiency.



1 answer


I don't see how this is solvable with melt, but you can use a simple rbind here, like this:

res <- rbind(DT[, c(1, 2:3), with = FALSE],
             DT[, c(1, 4:5), with = FALSE],
             use.names = FALSE)[service1 != ""]
res
#     customer service1 fee1
#  1:        1        1  100
#  2:        2        3  290
#  3:        2        5  500
#  4:        3        1  100
#  5:        3        3  300
#  6:        3        5  500
#  7:        2        4  400
#  8:        3        2  200
#  9:        3        4  390
# 10:        3        8  800
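
The result above still has the original column names, character columns (because of the "" placeholders in service2 and fee2), and rows in stacking order. A minimal sketch of tidying it into the layout requested for option one could look like this; the name `tidy` is only illustrative, and converting with `as.numeric` assumes every remaining value is a number.

```r
library(data.table)

# Sample data from the question
DT <- data.table(
  customer = c(1, 2, 2, 3, 3, 3),
  service1 = c(1, 3, 5, 1, 3, 5),
  fee1     = c(100, 290, 500, 100, 300, 500),
  service2 = c("", 4, "", 2, 4, 8),
  fee2     = c("", 400, "", 200, 390, 800)
)

# Same stacking step as above
res <- rbind(DT[, c(1, 2:3), with = FALSE],
             DT[, c(1, 4:5), with = FALSE],
             use.names = FALSE)[service1 != ""]

# Work on a copy (illustrative name `tidy`) so `res` stays unchanged
tidy <- copy(res)
setnames(tidy, c("service1", "fee1"), c("service", "fee"))

# The "" placeholders forced both columns to character; convert them back
tidy[, c("service", "fee") := lapply(.SD, as.numeric),
     .SDcols = c("service", "fee")]

# Sort by customer, then by service
setorder(tidy, customer, service)
```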

      



For your second output, you can try something like

Range <- range(as.numeric(unlist(DT[, c(2, 4), with = FALSE])), na.rm = TRUE)
res[, service1 := factor(service1, levels = Range[1L]:Range[2L])]
dcast(res, customer ~ service1, drop = FALSE, fill = "", value.var = "fee1")
#    customer   1   2   3   4   5 6 7   8
# 1:        1 100                        
# 2:        2         290 400 500        
# 3:        3 100 200 300 390 500     800
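
If the fees are kept numeric and the `service.N` headers from the question are wanted, a variation along the following lines might work. It is only a sketch: it starts from a hand-built long table (illustrative name `long`) rather than from `res`, and leaves empty cells as `NA` instead of "".

```r
library(data.table)

# Long-format data (the "option one" table from the question)
long <- data.table(
  customer = c(1, 2, 2, 2, 3, 3, 3, 3, 3, 3),
  service  = c(1, 3, 4, 5, 1, 2, 3, 4, 5, 8),
  fee      = c(100, 290, 400, 500, 100, 200, 300, 390, 500, 800)
)

# Make every service number between the smallest and largest a factor level,
# so services nobody used (6 and 7 here) still get a column
long[, service := factor(service, levels = min(service):max(service))]

# drop = FALSE keeps the unused levels; numeric fees leave NA in empty cells
wide <- dcast(long, customer ~ service, drop = FALSE, value.var = "fee")

# Prefix the service columns to match the header in the question
setnames(wide, names(wide)[-1], paste0("service.", names(wide)[-1]))
```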

      
