Combine data files

I have the following data frames in R:

Id   Class
@a    64
@b    7
@c    98 

      

And the second data frame:

SOURCE    TARGET 
@d        @b
@c        @a 

      

They describe the nodes and edges in a social network. Users (all with @ in front) belong to a specific community, and its number is given in the column Class. To analyze the connections between classes, I want to merge these data frames and create a new data frame that looks like this:

SOURCE    TARGET    SOURCE.Class    TARGET.Class 
@a        @i        56               2
@f        @k        90               49 

      

When I try to use merge(), R stops responding and I have to terminate R. The data frames have 20,000 (node file) and 30,000 (edge file) lines.

Then I want to know how many records in a given source class have the same target class and the percentage of connections between classes.

I will be so happy if someone can help me as I am very new to R.

EDIT: I think I was able to create the columns with this code using match() instead of merge() (rt_nodes contains the columns "id" and "class", and rt_edges contains the columns "source" and "target"):

#match source in rt_edges with id in rt_node
match(rt_edges$Source,rt_nodes$id)

#match target in rt_edges with id in rt_node
match(rt_edges$Target,rt_nodes$id)

#create source_class 
rt_nodes$modularity_class[match(rt_edges$Source,rt_nodes$id)]
rt_edges$Source_Class=rt_nodes$modularity_class[match(rt_edges$Source,rt_nodes$id)]

#create target_class
rt_nodes$modularity_class[match(rt_edges$Target,rt_nodes$id)]
rt_edges$Target_Class=rt_nodes$modularity_class[match(rt_edges$Target,rt_nodes$id)]

      

Now I just need to figure out how I can find the percentage of connections in each class and the percentage of connections with other classes. Any advice on how to do this?



1 answer


Question 1: Merging

This requires two separate join operations: an initial join of rt_edges with rt_nodes on Target, and a subsequent join of the intermediate result with rt_nodes on Source. In addition, all rows of rt_edges should appear in the result.

The approach below uses data.table. (I adopted the variable and column names the OP used in the edited code of the question, but note that these do not match the sample data provided by the OP.)

Reading data

library(data.table)
rt_nodes <- fread(
  "id   Class
  @a    64
  @b    7
  @c    98
  @d    23
  @f    59")
rt_edges <-fread(
  "Source    Target 
  @d        @b
  @c        @a
  @a        @e")

      

Note that extra rows have been added to the sample data provided by the OP to demonstrate the effect of

  • a node (@f) not participating in any edge, and
  • an edge (@a -> @e) where one id is missing in rt_nodes.

Double join

By default, joins in data.table are right joins. Therefore, rt_edges appears on the right side.

result <- rt_nodes[rt_nodes[rt_edges, on = c(id = "Target")], on = c(id = "Source")]

# rename columns
setnames(result, c("Source", "Source.Class", "Target", "Target.Class"))

result
#   Source Source.Class Target Target.Class
#1:     @d           23     @b            7
#2:     @c           98     @a           64
#3:     @a           64     @e           NA

      

As a result, all three edges appear. NA indicates that @e is missing in rt_nodes.
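If edges that refer to unknown nodes should be dropped instead of kept with NA, the same double join can be turned into an inner join. The following is only a sketch under that assumption: it rebuilds the sample data from above so that it runs on its own, and the name `complete` is made up for illustration. It uses `nomatch = NULL`, which requires a reasonably recent data.table version.

```r
library(data.table)

# Same sample data as above, rebuilt so the sketch is self-contained
rt_nodes <- data.table(id = c("@a", "@b", "@c", "@d", "@f"),
                       Class = c(64L, 7L, 98L, 23L, 59L))
rt_edges <- data.table(Source = c("@d", "@c", "@a"),
                       Target = c("@b", "@a", "@e"))

# nomatch = NULL turns each right join into an inner join,
# so edges with an unknown Source or Target are dropped
complete <- rt_nodes[rt_nodes[rt_edges, on = c(id = "Target"), nomatch = NULL],
                     on = c(id = "Source"), nomatch = NULL]
setnames(complete, c("Source", "Source.Class", "Target", "Target.Class"))

complete
```

With the sample data, the edge @a -> @e is removed because @e has no entry in rt_nodes.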

Question 2

The OP included a second question (and also created a new post):

Then I want to know how many records in a given source class have the same target class and the percentage of connections between classes.

result[, .(.N, share_of_occurrence_in_Target.Class = sum(Source.Class == Target.Class)/.N), 
       by = Source.Class]
#   Source.Class N share_of_occurrence_in_Target.Class
#1:           23 1                                   0
#2:           98 1                                   0
#3:           64 1                                  NA

      

The counts are 1 and the share is 0 because the sample data does not contain enough class matches. However, the code has been tested with the data presented in the other post by the OP.
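For the other half of the question, the percentage of connections between classes, one possible approach is to count the edges per pair of source and target class and divide by the number of edges leaving each source class. This is a sketch only: the table `edges` and its values are made up for illustration and are not the OP's data.

```r
library(data.table)

# Hypothetical edge list with classes already attached (made-up values)
edges <- data.table(
  Source.Class = c(1L, 1L, 1L, 2L, 2L),
  Target.Class = c(1L, 1L, 2L, 2L, 1L)
)

# Count edges per (source class, target class) pair ...
pairs <- edges[, .N, by = .(Source.Class, Target.Class)]
# ... and express each count as a share of all edges leaving that source class
pairs[, share := N / sum(N), by = Source.Class]

pairs
```

Rows where Source.Class equals Target.Class give the share of connections that stay within a class. Edges with NA in either class column would form their own groups, so it may be preferable to filter them out first.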
