R: convert email addresses to unique integers

R newbie with what seems like a pretty simple problem: I have several email logs that I read into R in the following format:

    Date        Time            From                 To
1   2000-01-01  00:00:00    bob@mail.com            test1@mail.com
2   2000-01-02  01:00:00    carolyn @mail.com       test2@mail.com
3   2000-01-03  02:00:00    chris@mail.com          test3@mail.com
4   2000-01-04  03:00:00    chris @mail.com         test4@mail.com
5   2000-01-05  04:00:00    alan@mail.com           test5@mail.com
6   2000-01-06  05:00:00    alan.@mail.com          test6@mail.com


I need to convert log1$From and log1$To to globally unique numeric IDs, so that when I read other logs later, any email address gets the same ID it had in the previous logs.

I tried:

id <- as.numeric(as.character(log1[,3]))
id <- strtoi(charToRaw(log1[,4]), base=16)


Would some kind soul please help me? Thank you!

Apologies, I probably should have included the following:

 Date=c( "01/01/2000" ,"02/01/2000" ,"03/01/2000", "04/01/2000" ,"05/01/2000" ,"06/01/2000","07/01/2000","08/01/2000",
    "09/01/2000","10/01/2000","11/01/2000", "12/01/2000" ,"13/01/2000", "14/01/2000", "15/01/2000","16/01/2000"
    Time=c("00:00:00","01:00:00","02:00:00", "03:00:00" ,"04:00:00" ,"05:00:00", "06:00:00" ,"07:00:00", "08:00:00", "09:00:00" ,"10:00:00",
    "11:00:00", "12:00:00","13:00:00", "14:00:00","15:00:00","16:00:00","17:00:00","18:00:00","19:00:00","00:00:00" ,"00:00:00")


Trying MD5 to generate unique IDs: note that ana.correa@mail.com gets a consistent ID in ID_to, but a different one in ID_from.


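    library(digest)  # provides hmac()
    # one row per log entry: two 32-bit chunks plus their pasted ID
    ID_to   <- data.frame(matrix(NA, nrow(log), 3))
    ID_from <- data.frame(matrix(NA, nrow(log), 3))

    for (i in 1:nrow(log)) {
        to <- as.numeric(paste('0x', substr(rep(hmac('secret', log[i,4], algo='md5'), 2), c(1, 9, 17, 25), c(8, 16, 24, 32)), sep=""))
        from <- as.numeric(paste('0x', substr(rep(hmac('secret', log[i,3], algo='md5'), 2), c(1, 9, 17, 25), c(8, 16, 24, 32)), sep=""))
        ID_to[i, 1:2]   <- to
        ID_from[i, 1:2] <- from
    }

    ID_to[,3] <- paste(ID_to[,1], ID_to[,2], sep="")
    ID_from[,3] <- paste(ID_from[,1], ID_from[,2], sep="")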

    ID_from...3.                 log...3.           ID_to...3.            log...4.   log...1. log...2.
    27488842661591306920      bob.shults@mail.com 18727221862165338513 ana.correa@mail.com 01/01/2000 00:00:00
    38124472891255273775   carolyn.green@mail.com  1251903296725454474      test2@mail.com 02/01/2000 01:00:00
    29070047663451376630      chris.long@mail.com 17074276751156451031      test3@mail.com 03/01/2000 02:00:00
    8261398433828474582 christi.nicolay@mail.com  1563683670909194033      test4@mail.com 04/01/2000 03:00:00
    18727221862165338513  alan.aronowitz@mail.com 26735368323826533112      test5@mail.com 05/01/2000 04:00:00
    5680838251168988404     alan.comnes@mail.com  2923605896229594830      test6@mail.com 06/01/2000 05:00:00
    2351312285811012730       dab@sprintmail.com 17171333544033270402      test7@mail.com 07/01/2000 06:00:00
    328278708432069254      ana.correa@mail.com 33840664403556851587      test8@mail.com 08/01/2000 07:00:00
    1127901879852039037   andrew.fastow@mail.com  1973548136161209824      test9@mail.com 09/01/2000 08:00:00
    7349515121496417787 elena.kapralova@mail.com  5680838251168988404     test10@mail.com 10/01/2000 09:00:00
    27488842661591306920      bob.shults@mail.com   328278708432069254     test11@mail.com 11/01/2000 10:00:00
    38124472891255273775   carolyn.green@mail.com  1127901879852039037     test12@mail.com 12/01/2000 11:00:00
    29070047663451376630      chris.long@mail.com 27488842661591306920     test13@mail.com 13/01/2000 12:00:00
    8261398433828474582 christi.nicolay@mail.com 38124472891255273775     test14@mail.com 14/01/2000 13:00:00
    18727221862165338513  alan.aronowitz@mail.com 29070047663451376630     test15@mail.com 15/01/2000 14:00:00
    5680838251168988404     alan.comnes@mail.com  8261398433828474582     test16@mail.com 16/01/2000 15:00:00
    2351312285811012730       dab@sprintmail.com  2351312285811012730     test17@mail.com 17/01/2000 16:00:00
    328278708432069254      ana.correa@mail.com  7349515121496417787     test18@mail.com 18/01/2000 17:00:00
    1127901879852039037   andrew.fastow@mail.com 41762759923562968495     test19@mail.com 19/01/2000 18:00:00
    7349515121496417787 elena.kapralova@mail.com 24894056753582090007     test20@mail.com 20/01/2000 19:00:00
    27488842661591306920      bob.shults@mail.com 18727221862165338513 ana.correa@mail.com 01/01/2000 00:00:00
    27488842661591306920      bob.shults@mail.com 18727221862165338513 ana.correa@mail.com 02/01/2000 00:00:00


Trying the factor levels method, I get an error:

log <- union(levels(log[,3]), levels(log[,4]))
>Error in emails[, 3] : incorrect number of dimensions




3 answers

You can use MD5 to generate globally unique identifiers, as it has a very low chance of collisions. Since its output is 128 bits, you need several numbers to represent it: R's integer type is 32 bits wide on every platform, and a double holds integers exactly only up to 2^53, so four 32-bit chunks are the safe choice. These are easy to handle as short numeric vectors.

Here's how to create such a vector of four numbers for an email address (or any other string, for that matter):

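library(digest)  # provides hmac()

email <- 'test1@gmail'
# split the 32 hex digits of the HMAC-MD5 into four 8-digit chunks and read each as a number
as.numeric(paste('0x', substr(rep(hmac('secret56f8a7', email, algo='md5'), 4), c(1, 9, 17, 25), c(8, 16, 24, 32)), sep=''))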


You could also use algo='crc32' and get a single integer, but this is not recommended, as CRC collisions are much more likely.
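To apply the same idea to a whole column of addresses, the call can be wrapped in a small function. This is only a sketch: `email_to_chunks` is a hypothetical helper name, the sample addresses are made up, and it assumes the digest package is installed.

```r
library(digest)  # assumed installed; provides hmac()

# Hypothetical helper: four 32-bit chunks of the HMAC-MD5 of one address.
email_to_chunks <- function(email, key = "secret56f8a7") {
  h <- hmac(key, email, algo = "md5")   # 32 hex characters
  starts <- c(1, 9, 17, 25)
  # substring() is vectorised over the start/stop positions
  as.numeric(paste0("0x", substring(h, starts, starts + 7)))
}

# Made-up sample column; the first and third addresses are identical.
emails <- c("test1@mail.com", "test2@mail.com", "test1@mail.com")

# One row per address, four numeric chunks per row.
ids <- t(vapply(emails, email_to_chunks, numeric(4), USE.NAMES = FALSE))
```

Because the chunks depend only on the key and the address, the same address yields the same row in any later log, provided the key is kept unchanged.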



You need to create a unique ID for each email address in your logs. One way is to compute the CRC checksum of each address and use that as the identifier, but that would be a very long number. Alternatively, you can implement a hash map in R with the email address as the key.
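A minimal sketch of the hash map idea, using a base R environment as the map. The names `id_map` and `get_id` are hypothetical, and the addresses are made up; each new address simply receives the next unused integer.

```r
# Hypothetical registry: an environment used as a hash map from address to ID.
id_map <- new.env(hash = TRUE)

# Return the ID stored for an address, assigning the next free integer
# the first time the address is seen.
get_id <- function(address, map = id_map) {
  if (!exists(address, envir = map, inherits = FALSE)) {
    next_id <- length(ls(map, all.names = TRUE)) + 1L
    assign(address, next_id, envir = map)
  }
  get(address, envir = map, inherits = FALSE)
}

addresses <- c("bob@mail.com", "test1@mail.com", "bob@mail.com")
ids <- vapply(addresses, get_id, integer(1), USE.NAMES = FALSE)
```

The IDs stay stable only as long as the same environment is reused, so it would have to be saved between sessions (for example with `saveRDS`) before later logs are processed.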



I think this will do what you want. It is efficient, and you can do it using only base R:


1. Convert both columns to factors.

2. Give both factors the same set of levels, so that each email address has a unique position in the factor levels.

3. Replace the entries in each column with the number of their factor level. As a result, we can find when "test1@gmail.com" sent and received emails by simply looking for its number (say, 1) in both columns.

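log1$From <- as.factor(log1$From)
log1$To <- as.factor(log1$To)
emails <- union(levels(log1$From), levels(log1$To))
# Re-create each factor with the shared levels; assigning levels(x) <- emails
# would only rename the existing levels and mislabel the To column.
log1$From <- factor(log1$From, levels = emails)
log1$To <- factor(log1$To, levels = emails)
log1$From <- as.numeric(log1$From)
log1$To <- as.numeric(log1$To)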


It is probably a good idea to keep a record of the original email addresses, as I did with emails. Then, if you want to know, say, which emails were sent from test1@gmail.com:

log1[log1$From == which(emails == "test1@gmail.com"), ]


should do the trick! You could write a function to make this lookup much cleaner ...
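One possible shape for such a function, extended so that later logs reuse the same IDs. This is a sketch under assumptions: the helper names (`update_registry`, `to_id`, `sent_from`), the `FromID`/`ToID` columns, and the two tiny sample logs are all made up for illustration.

```r
# Hypothetical helper: append unseen addresses to the registry.
# Existing positions, and therefore existing IDs, never change.
update_registry <- function(registry, addresses) {
  union(registry, as.character(addresses))
}

# Hypothetical helper: an address's ID is its position in the registry.
to_id <- function(addresses, registry) {
  match(as.character(addresses), registry)
}

# Hypothetical helper: rows of a log sent from a given address.
sent_from <- function(log, address, registry) {
  log[which(log$FromID == to_id(address, registry)), ]
}

# Made-up sample logs.
log1 <- data.frame(From = c("bob@mail.com", "alan@mail.com"),
                   To   = c("test1@mail.com", "bob@mail.com"),
                   stringsAsFactors = FALSE)
log2 <- data.frame(From = c("test1@mail.com", "carol@mail.com"),
                   To   = c("bob@mail.com", "alan@mail.com"),
                   stringsAsFactors = FALSE)

registry <- update_registry(character(0), c(log1$From, log1$To))
log1$FromID <- to_id(log1$From, registry)
log1$ToID   <- to_id(log1$To, registry)

# A later log: only carol@mail.com is new, everyone else keeps their ID.
registry <- update_registry(registry, c(log2$From, log2$To))
log2$FromID <- to_id(log2$From, registry)
log2$ToID   <- to_id(log2$To, registry)

bob_rows <- sent_from(log1, "bob@mail.com", registry)
```

Saving `registry` between sessions (for example with `saveRDS`) is what would keep the IDs consistent across logs read at different times.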


