Creating a Sankey diagram using the NetworkD3 package in R

Question

I am currently trying to create an interactive Sankey diagram with the networkD3 package, as described by Christopher Gandrud (https://christophergandrud.github.io/networkD3/).

What I don't understand is the required table format, since the example uses just two columns (source and target) to render several transitions. To be more specific, I have a dataset with four columns, one for each of four years. The columns contain hotel names, and each row represents one customer who has been "tracked" over the four years.

    library(networkD3)

    URL <- paste0(
        "https://cdn.rawgit.com/christophergandrud/networkD3/",
        "master/JSONdata/energy.json")
    Energy <- jsonlite::fromJSON(URL)

    sankeyNetwork(Links = Energy$links, Nodes = Energy$nodes, Source = "source",
         Target = "target", Value = "value", NodeID = "name",
         units = "TWh", fontSize = 12, nodeWidth = 30)

To give you an overview of my data, here's a screenshot:

SampleDataScreenshot

I would give you more "coded" information, but I'm very new to R, so I hope you can follow my thoughts on this issue. If not, please do not hesitate to ask.

Thanks :)

r plot sankey-diagram networkd3 htmlwidgets

Phipsy May 23 '17 at 10:36

2 answers

scheddy · Answer 1 · 2017-05-26T20:19:57+0000

You need two data frames: one listing all nodes (containing their names) and one listing the links. The latter contains three columns: source node, target node, and a value indicating the strength or width of the link. In the link data frame, you reference nodes by their position (starting at zero) in the node data frame.
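To make the zero-based referencing concrete, here is a minimal sketch with made-up data (the hotel names and values are purely illustrative and not taken from the question):

```r
# Hypothetical toy data: three nodes and two links between them.
nodes_toy <- data.frame(name = c("Hotel A", "Hotel B", "Hotel C"),
                        stringsAsFactors = FALSE)

# source/target are zero-based row positions in nodes_toy:
# 0 = "Hotel A", 1 = "Hotel B", 2 = "Hotel C"
links_toy <- data.frame(source = c(0, 0),
                        target = c(1, 2),
                        value  = c(5, 3))

# Translate the indices back into names to check the links
# (add 1 because R itself indexes from 1).
readable <- data.frame(from  = nodes_toy$name[links_toy$source + 1],
                       to    = nodes_toy$name[links_toy$target + 1],
                       value = links_toy$value,
                       stringsAsFactors = FALSE)
print(readable)
```

Translating the indices back into names like this is a quick way to check that the links point at the nodes you intended.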

Assuming your data looks like this:

df <- data.frame(Year1=sample(paste0("Hotel", 1:4), 1000, replace = TRUE),
                 Year2=sample(paste0("Hotel", 1:4), 1000, replace = TRUE),
                 Year3=sample(paste0("Hotel", 1:4), 1000, replace = TRUE),
                 Year4=sample(paste0("Hotel", 1:4), 1000, replace = TRUE),
                 stringsAsFactors = FALSE)

For the diagram, it is necessary to distinguish not only the hotels but also each hotel / year combination, since each combination must be its own node:

df$Year1 <- paste0("Year1_", df$Year1)
df$Year2 <- paste0("Year2_", df$Year2)
df$Year3 <- paste0("Year3_", df$Year3)
df$Year4 <- paste0("Year4_", df$Year4)

Links are "transitions" between hotels from one year to the next:

library(dplyr)
trans1_2 <- df %>% group_by(Year1, Year2) %>% summarise(sum=n())
trans2_3 <- df %>% group_by(Year2, Year3) %>% summarise(sum=n())
trans3_4 <- df %>% group_by(Year3, Year4) %>% summarise(sum=n())

colnames(trans1_2)[1:2] <- colnames(trans2_3)[1:2] <- colnames(trans3_4)[1:2] <- c("source","target")

links <- rbind(as.data.frame(trans1_2), 
               as.data.frame(trans2_3), 
               as.data.frame(trans3_4))

Finally, the two data frames must be tied together by replacing the node names in the link data frame with zero-based positions in the node data frame:

nodes <- data.frame(name=unique(c(links$source, links$target)))
links$source <- match(links$source, nodes$name) - 1
links$target <- match(links$target, nodes$name) - 1

Then the diagram can be drawn:

library(networkD3)
sankeyNetwork(Links = links, Nodes = nodes, Source = "source",
              Target = "target", Value = "sum", NodeID = "name",
              fontSize = 12, nodeWidth = 30)

There may be more elegant solutions, but this can be a starting point for your problem. If you don't like the "Year ..." prefix in the node names, you can remove it after setting up the data frames.
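One way to remove the prefix is sketched below. It assumes node names of the form "Year1_Hotel3" as built above, and uses a small stand-in data frame (nodes_demo, a hypothetical name) so that the snippet runs on its own:

```r
# Stand-in for the nodes data frame created earlier (illustrative values only).
nodes_demo <- data.frame(name = c("Year1_Hotel1", "Year2_Hotel1", "Year3_Hotel4"),
                         stringsAsFactors = FALSE)

# Drop the leading "Year<digits>_" prefix from each label.
# Do this only after links$source and links$target have been converted to
# indices, because the stripped names are no longer unique.
nodes_demo$name <- sub("^Year[0-9]+_", "", nodes_demo$name)
print(nodes_demo$name)
```

Because the links refer to nodes by position rather than by name, duplicate labels in the node data frame do not affect how the links are drawn.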

Cj yetman · Answer 2 · 2018-09-08T16:21:48+0000

This question comes up a lot: how to transform a dataset that has multiple links / edges defined on each row across multiple columns. Here's how I convert it to the type of dataset that sankeyNetwork() (and many other packages that deal with edges / links / network data) uses: a dataset with one edge / link per row.

Starting with an example dataset:

df <- read.csv(header = TRUE, as.is = TRUE, text = '
name,year1,year2,year3,year4
Bob,Hilton,Sheraton,Westin,Hyatt
John,Four Seasons,Ritz-Carlton,Westin,Sheraton
Tom,Ritz-Carlton,Westin,Sheraton,Hyatt
Mary,Westin,Sheraton,Four Seasons,Ritz-Carlton
Sue,Hyatt,Ritz-Carlton,Hilton,Sheraton
Barb,Hilton,Sheraton,Ritz-Carlton,Four Seasons
')

#   name        year1        year2        year3        year4
# 1  Bob       Hilton     Sheraton       Westin        Hyatt
# 2 John Four Seasons Ritz-Carlton       Westin     Sheraton
# 3  Tom Ritz-Carlton       Westin     Sheraton        Hyatt
# 4 Mary       Westin     Sheraton Four Seasons Ritz-Carlton
# 5  Sue        Hyatt Ritz-Carlton       Hilton     Sheraton
# 6 Barb       Hilton     Sheraton Ritz-Carlton Four Seasons

- Create a row number so you can still tell which row / observation each individual link came from after converting the data to long format.
- Use tidyr's gather() to convert the dataset to long format.
- Convert the column-name variable to the index / number of that column in the original dataset.
- Grouped by row (each observation in the original dataset), order the nodes by the column they were in and create a "target" variable by setting it to the node from the next column.
- Filter out any rows with NA for "target" (nodes in the last column of the original dataset have no "target", so these rows do not represent a link).

library(dplyr)
library(tidyr)

links <-
  df %>%
  mutate(row = row_number()) %>%
  gather('column', 'source', -row) %>%
  mutate(column = match(column, names(df))) %>%
  group_by(row) %>%
  arrange(column) %>%
  mutate(target = lead(source)) %>%
  ungroup() %>%
  filter(!is.na(target))

# # A tibble: 24 x 4
#      row column source       target
#    <int>  <int> <chr>        <chr>
#  1     1      1 Bob          Hilton
#  2     2      1 John         Four Seasons
#  3     3      1 Tom          Ritz-Carlton
#  4     4      1 Mary         Westin
#  5     5      1 Sue          Hyatt
#  6     6      1 Barb         Hilton
#  7     1      2 Hilton       Sheraton
#  8     2      2 Four Seasons Ritz-Carlton
#  9     3      2 Ritz-Carlton Westin
# 10     4      2 Westin       Sheraton
# # ... with 14 more rows
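As a side note, gather() is marked as superseded in tidyr 1.0.0 and later, where pivot_longer() is the recommended replacement. The following is only a sketch of the same reshaping with pivot_longer(), run on a smaller made-up dataset (df_small and links_long are hypothetical names) so it does not overwrite the objects above:

```r
library(dplyr)
library(tidyr)

# Smaller illustrative dataset with the same layout as the example above.
df_small <- read.csv(header = TRUE, as.is = TRUE, text = '
name,year1,year2
Bob,Hilton,Sheraton
Sue,Hyatt,Hilton
')

links_long <-
  df_small %>%
  mutate(row = row_number()) %>%
  # long format: one line per (row, column) cell
  pivot_longer(cols = -row, names_to = "column", values_to = "source") %>%
  # replace the column name with its position in the original dataset
  mutate(column = match(column, names(df_small))) %>%
  group_by(row) %>%
  arrange(column, .by_group = TRUE) %>%
  # the target is the node in the next column of the same original row
  mutate(target = lead(source)) %>%
  ungroup() %>%
  filter(!is.na(target))

print(links_long)
```

The rows come out ordered by original row rather than by column, but the set of links is the same as the gather() version would produce for this data.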

The data is now in a typical network data format, with one link per row defined by the "source" and "target" columns, and could be used as the basis for sankeyNetwork(). However, you will most likely want nodes that refer to the same thing but appear multiple times in your story to be distinct. If someone visited the Hilton in year 1 and then visited the Hilton again in year 3, you will probably want two separate nodes, both named Hilton, that appear in different parts of the plot. To do this, you need to identify each node in the "source" and "target" columns by the year it was visited. This is where keeping the "row" and "column" variables comes in handy.

Append the column index to the "source" name and the column index + 1 to the "target" name. Now you can distinguish between, for example, a node for the Hilton visited in year 1 and a node for the Hilton visited in year 3. (In this example the name column is column 1, so year1 nodes carry the suffix _2.)

links <-
  links %>%
  mutate(source = paste0(source, '_', column)) %>%
  mutate(target = paste0(target, '_', column + 1)) %>%
  select(source, target)

# # A tibble: 24 x 2
#    source         target
#    <chr>          <chr>
#  1 Bob_1          Hilton_2
#  2 John_1         Four Seasons_2
#  3 Tom_1          Ritz-Carlton_2
#  4 Mary_1         Westin_2
#  5 Sue_1          Hyatt_2
#  6 Barb_1         Hilton_2
#  7 Hilton_2       Sheraton_3
#  8 Four Seasons_2 Ritz-Carlton_3
#  9 Ritz-Carlton_2 Westin_3
# 10 Westin_2       Sheraton_3
# # ... with 14 more rows

You can now follow the fairly standard procedure of using a source/target link list to create the data frames required by sankeyNetwork(). Create a nodes data frame with all the unique nodes found in the "source" and "target" vectors. Convert the "source" and "target" vectors in the links data frame to the 0-based index of each node in the nodes data frame. Add an arbitrary value for each link in the links data frame, since sankeyNetwork() requires one. Now you can remove the added column index from the names in the nodes data frame, since they are only used to label the nodes in the resulting plot (so it no longer matters whether they are unique). Then plot it with sankeyNetwork()!

nodes <- data.frame(name = unique(c(links$source, links$target)))

links$source <- match(links$source, nodes$name) - 1
links$target <- match(links$target, nodes$name) - 1
links$value <- 1

nodes$name <- sub('_[0-9]+$', '', nodes$name)

library(networkD3)
library(htmlwidgets)

sankeyNetwork(Links = links, Nodes = nodes, Source = 'source',
              Target = 'target', Value = 'value', NodeID = 'name')
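
With links$value <- 1, every customer's transition is a separate link, so identical source/target pairs appear as parallel links of width 1. If you would rather have one link per pair whose width reflects the number of customers, one option (a sketch, not part of the original answer) is to sum the values per pair. The toy data below (links_toy, links_agg) is hypothetical:

```r
# Toy link list after index conversion: two customers share the 0 -> 2 transition.
links_toy <- data.frame(source = c(0, 0, 1),
                        target = c(2, 2, 2),
                        value  = 1)

# Sum the values of identical source/target pairs (base R, no extra packages).
links_agg <- aggregate(value ~ source + target, data = links_toy, FUN = sum)
print(links_agg)
```

The aggregated data frame has the same three columns, so it can be passed as the Links argument in place of the unaggregated one.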
