Creating a Sankey diagram using the networkD3 package in R
I am currently trying to create an interactive Sankey diagram with the networkD3 package,
as described by Christopher Gandrud ( https://christophergandrud.github.io/networkD3/ ).
What I don't understand is the table format, since it uses just two columns to render multiple transitions. To be more specific, I have a dataset with four columns, one for each of four years. These columns contain hotel names, and each row represents one customer who has been "tracked" over the four years.
URL <- paste0(
"https://cdn.rawgit.com/christophergandrud/networkD3/",
"master/JSONdata/energy.json")
Energy <- jsonlite::fromJSON(URL)
sankeyNetwork(Links = Energy$links, Nodes = Energy$nodes, Source = "source",
Target = "target", Value = "value", NodeID = "name",
units = "TWh", fontSize = 12, nodeWidth = 30)
To give you an overview of my data, here's a screenshot:
I would give you more "coded" information, but since I'm very new to R, I hope you can follow my thoughts on this issue. If not, please do not hesitate to ask.
Thanks :)
You need two data frames: one listing all nodes (containing their names) and one listing the links. The latter contains three columns: the source node, the target node, and a value indicating the strength or width of the link. In the link data frame, you reference nodes by their position (starting at zero) in the node data frame.
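As a minimal sketch of that layout (the hotel names and values here are invented purely for illustration), the two data frames could look like this:

```r
# Hypothetical toy data: three nodes and two links (all values invented)
nodes <- data.frame(name = c("Hotel A", "Hotel B", "Hotel C"),
                    stringsAsFactors = FALSE)

# source/target are zero-based positions in `nodes`:
# 0 = "Hotel A", 1 = "Hotel B", 2 = "Hotel C"
links <- data.frame(source = c(0, 0),
                    target = c(1, 2),
                    value  = c(5, 3))

# Look up the node names behind each link (add 1 because R indexes from 1)
link_names <- data.frame(from  = nodes$name[links$source + 1],
                         to    = nodes$name[links$target + 1],
                         value = links$value,
                         stringsAsFactors = FALSE)
link_names
```

The last step is only a sanity check: it translates the zero-based indices back into names so you can confirm each link points where you expect.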
Assuming your data looks like this:
df <- data.frame(Year1=sample(paste0("Hotel", 1:4), 1000, replace = TRUE),
Year2=sample(paste0("Hotel", 1:4), 1000, replace = TRUE),
Year3=sample(paste0("Hotel", 1:4), 1000, replace = TRUE),
Year4=sample(paste0("Hotel", 1:4), 1000, replace = TRUE),
stringsAsFactors = FALSE)
For the diagram, you need to distinguish not just the hotels but each hotel/year combination, since each combination must be its own node:
df$Year1 <- paste0("Year1_", df$Year1)
df$Year2 <- paste0("Year2_", df$Year2)
df$Year3 <- paste0("Year3_", df$Year3)
df$Year4 <- paste0("Year4_", df$Year4)
The links are the "transitions" between hotels from one year to the next:
library(dplyr)
trans1_2 <- df %>% group_by(Year1, Year2) %>% summarise(sum=n())
trans2_3 <- df %>% group_by(Year2, Year3) %>% summarise(sum=n())
trans3_4 <- df %>% group_by(Year3, Year4) %>% summarise(sum=n())
colnames(trans1_2)[1:2] <- colnames(trans2_3)[1:2] <- colnames(trans3_4)[1:2] <- c("source","target")
links <- rbind(as.data.frame(trans1_2),
as.data.frame(trans2_3),
as.data.frame(trans3_4))
Finally, the link data frame must reference the node data frame by (zero-based) index:
nodes <- data.frame(name=unique(c(links$source, links$target)))
links$source <- match(links$source, nodes$name) - 1
links$target <- match(links$target, nodes$name) - 1
Then the diagram can be drawn:
library(networkD3)
sankeyNetwork(Links = links, Nodes = nodes, Source = "source",
Target = "target", Value = "sum", NodeID = "name",
fontSize = 12, nodeWidth = 30)
There may be more elegant solutions, but this can be a starting point for your problem. If you don't like the "Year ..." prefix in the node names, you can remove it after setting up the data frames.
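One way to strip the prefix, sketched here on a small invented node table, is a regular expression with base R's sub(). This assumes the names follow the "YearN_" pattern created above and that the step runs only after the links have been converted to indices with match(), because the cleaned names are no longer unique:

```r
# Toy node table with the "YearN_" prefixes used above (values invented)
nodes <- data.frame(name = c("Year1_Hotel1", "Year2_Hotel1", "Year2_Hotel3"),
                    stringsAsFactors = FALSE)

# Drop the leading "Year<digits>_" so only the hotel name is displayed
nodes$name <- sub("^Year[0-9]+_", "", nodes$name)
nodes$name
```

Because the links refer to nodes by position rather than by name, duplicate labels are harmless at this stage.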
This question comes up a lot: how to transform a dataset that has multiple links/edges defined on each row across multiple columns. Here's how I convert it to the kind of dataset that sankeyNetwork()
(and many other packages that deal with edge/link/network data) uses: a dataset with one edge/link per row.
Starting with an example dataset:
df <- read.csv(header = TRUE, as.is = TRUE, text = '
name,year1,year2,year3,year4
Bob,Hilton,Sheraton,Westin,Hyatt
John,Four Seasons,Ritz-Carlton,Westin,Sheraton
Tom,Ritz-Carlton,Westin,Sheraton,Hyatt
Mary,Westin,Sheraton,Four Seasons,Ritz-Carlton
Sue,Hyatt,Ritz-Carlton,Hilton,Sheraton
Barb,Hilton,Sheraton,Ritz-Carlton,Four Seasons
')
# name year1 year2 year3 year4
# 1 Bob Hilton Sheraton Westin Hyatt
# 2 John Four Seasons Ritz-Carlton Westin Sheraton
# 3 Tom Ritz-Carlton Westin Sheraton Hyatt
# 4 Mary Westin Sheraton Four Seasons Ritz-Carlton
# 5 Sue Hyatt Ritz-Carlton Hilton Sheraton
# 6 Barb Hilton Sheraton Ritz-Carlton Four Seasons
- create a row number so you can still tell which row/observation each individual link came from once the data is converted to long format
- use tidyr's gather() to convert the dataset to long format
- convert the column name variable to the index/column number in the original dataset
- grouped by row (each observation in the original dataset), order each node by the column it was in and create a "target" variable by setting it to the node from the following column
- filter out any rows with NA for "target" (the nodes in the last column of the original dataset have no "target", so those rows do not describe a link)
library(dplyr)
library(tidyr)
links <-
df %>%
mutate(row = row_number()) %>%
gather('column', 'source', -row) %>%
mutate(column = match(column, names(df))) %>%
group_by(row) %>%
arrange(column) %>%
mutate(target = lead(source)) %>%
ungroup() %>%
filter(!is.na(target))
# # A tibble: 24 x 4
# row column source target
# <int> <int> <chr> <chr>
# 1 1 1 Bob Hilton
# 2 2 1 John Four Seasons
# 3 3 1 Tom Ritz-Carlton
# 4 4 1 Mary Westin
# 5 5 1 Sue Hyatt
# 6 6 1 Barb Hilton
# 7 1 2 Hilton Sheraton
# 8 2 2 Four Seasons Ritz-Carlton
# 9 3 2 Ritz-Carlton Westin
# 10 4 2 Westin Sheraton
# # ... with 14 more rows
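In current tidyr releases gather() is superseded by pivot_longer(). The following is only a sketch of the same reshaping with pivot_longer(), assuming tidyr 1.0.0 or later and using a cut-down, invented version of the example data; it is an alternative spelling of the step above, not part of the original answer:

```r
library(dplyr)
library(tidyr)

# Cut-down toy version of the example data (two people, two years)
df_small <- read.csv(header = TRUE, as.is = TRUE, text = '
name,year1,year2
Bob,Hilton,Sheraton
Sue,Hyatt,Hilton
')

links_alt <-
  df_small %>%
  mutate(row = row_number()) %>%
  # same reshaping as gather('column', 'source', -row)
  pivot_longer(-row, names_to = "column", values_to = "source") %>%
  mutate(column = match(column, names(df_small))) %>%
  group_by(row) %>%
  arrange(column) %>%
  mutate(target = lead(source)) %>%
  ungroup() %>%
  filter(!is.na(target))

links_alt
```

Every column other than `row` must have the same type (here, character) for pivot_longer() to stack them into a single "source" column.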
The data is now already in a typical network data format, with one link per row defined by the "source" and "target" columns, and could be used with sankeyNetwork().
However, you will most likely want to distinguish nodes that refer to the same thing but appear at different points in the history: if someone visited the Hilton in year 1 and then visited the Hilton again in year 3, you will probably want two separate nodes, both named Hilton, that appear in different parts of the plot. To do this, you need to tag each node in the "source" and "target" columns with the year it was visited. This is where keeping the "row" and "column" variables comes in handy.
Append the column index to the "source" name and the column index + 1 to the "target" name, and you can distinguish between, for example, a node for the Hilton visited in year 1 and a node for the Hilton visited in year 3.
links <-
links %>%
mutate(source = paste0(source, '_', column)) %>%
mutate(target = paste0(target, '_', column + 1)) %>%
select(source, target)
# # A tibble: 24 x 2
# source target
# <chr> <chr>
# 1 Bob_1 Hilton_2
# 2 John_1 Four Seasons_2
# 3 Tom_1 Ritz-Carlton_2
# 4 Mary_1 Westin_2
# 5 Sue_1 Hyatt_2
# 6 Barb_1 Hilton_2
# 7 Hilton_2 Sheraton_3
# 8 Four Seasons_2 Ritz-Carlton_3
# 9 Ritz-Carlton_2 Westin_3
# 10 Westin_2 Sheraton_3
# # ... with 14 more rows
You can now follow the fairly standard procedure for turning a link list into the data frames that sankeyNetwork() requires.
Create a nodes data frame with all the unique nodes found in the "source" and "target" vectors. Convert the "source" and "target" vectors in the links data frame to the 0-based index of the node in the nodes data frame. Add an arbitrary value for each link in the links data frame, as sankeyNetwork() requires. Now you can remove the column index that was appended to the names in the nodes data frame, since those names will only be used to label the nodes in the resulting plot (so it no longer matters whether they are unique). Then pass everything to sankeyNetwork()!
nodes <- data.frame(name = unique(c(links$source, links$target)))
links$source <- match(links$source, nodes$name) - 1
links$target <- match(links$target, nodes$name) - 1
links$value <- 1
nodes$name <- sub('_[0-9]+$', '', nodes$name)
library(networkD3)
library(htmlwidgets)
sankeyNetwork(Links = links, Nodes = nodes, Source = 'source',
Target = 'target', Value = 'value', NodeID = 'name')
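If several rows share the same source and target, every one of them is kept as its own link with value 1. One option, sketched below on a few invented rows, is to collapse duplicates with dplyr's count() so that `value` holds the number of observations per transition. This is an optional variation rather than part of the procedure above, and it would be applied to the name-based link list before the names are converted to indices:

```r
library(dplyr)

# Toy name-based link list with one duplicated transition (values invented)
links_named <- data.frame(
  source = c("Hilton_2", "Hilton_2", "Hyatt_2"),
  target = c("Sheraton_3", "Sheraton_3", "Ritz-Carlton_3"),
  stringsAsFactors = FALSE
)

# One row per distinct source/target pair; `value` is the number of rows merged
links_agg <- links_named %>% count(source, target, name = "value")
links_agg
```

The aggregated data frame already has a `value` column, so the `links$value <- 1` step would be skipped in that case.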