R - Is a high-performance Amazon EC2 instance slower than a laptop i7?

I am working with a large dataset and am trying to offload it to Amazon EC2 for faster processing.

The data starts out as two tables - 6.5M x 6 and 11K x 15. Then I combine them into one 6.5M x 20 table.

Here's my R code:

library(data.table)
library(dplyr)

download.file("http://d396qusza40orc.cloudfront.net/exdata%2Fdata%2FNEI_data.zip", "data.zip")

unzip("data.zip")

data <- readRDS("summarySCC_PM25.rds")               # 6.5M x 6
scckey <- readRDS("Source_Classification_Code.rds")  # 11K x 15

system.time(data <- data %>% inner_join(scckey))
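As an aside, `inner_join` with no `by` argument joins on every column name the two tables share (dplyr prints "Joining by: ..."); naming the key explicitly is safer. A sketch with small stand-in tables, assuming SCC is the shared key, as the merge() calls further down suggest:

```r
library(dplyr)

# Stand-ins for the real tables read from the RDS files
data   <- data.frame(SCC = c("a", "b"), Emissions = c(1.2, 3.4))
scckey <- data.frame(SCC = c("a", "b"), Short.Name = c("coal", "gas"))

# Name the join key explicitly instead of relying on column-name detection
data <- data %>% inner_join(scckey, by = "SCC")
```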


On my home laptop (i7 1.9 GHz, 8 GB RAM, SSD), here is the result:

   user  system elapsed 
 226.91    0.36  228.39 


On an Amazon EC2 c4.8xlarge (36 vCPU, 132 ECU, 60 GB RAM, EBS storage):

   user  system elapsed 
302.016   0.396 302.422 


On an Amazon EC2 c3.8xlarge (32 vCPU, 108 ECU, 60 GB RAM, SSD storage):

   user  system elapsed 
374.839   0.367 375.178


How can it be that both EC2 systems are slower than my own laptop? The c4.8xlarge, in particular, seems to be the most powerful compute-optimized instance Amazon offers.

Am I doing something wrong?

EDIT:

I checked the monitoring console - the instance runs at 3-5% CPU usage during the join. That seems very low; on my laptop it runs at 30-40%.

EDIT:

As suggested, I tried data.table's merge().

c3.8xlarge @ ~1% CPU load:

system.time(datamerge <- merge(data, scckey, by = "SCC"))
   user  system elapsed 
193.012   0.658 193.654


c4.8xlarge @ ~2% CPU load:

system.time(datamerge <- merge(data, scckey, by = "SCC"))
   user  system elapsed 
162.829   0.822 163.638 


Laptop:

The first run took 5 minutes, so I restarted R.

system.time(datamerge <- merge(data, scckey, by = "SCC"))
   user  system elapsed 
133.45    1.34  135.81 


merge() is obviously more efficient here, but my laptop still beats the best Amazon EC2 machines by a decent margin.

EDIT:

A keyed data.table join,

scckey[data]

reduces the time for this operation to less than 1 second on my laptop. I'm still wondering how I could use EC2 better.
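For completeness, the keyed-join pattern looks like this (a sketch with tiny stand-in tables, since the real objects come from the RDS files above):

```r
library(data.table)

# Stand-ins for the real tables; both share the SCC key column
data   <- data.table(SCC = c("a", "b", "a"), Emissions = c(1.2, 3.4, 5.6))
scckey <- data.table(SCC = c("a", "b"), Short.Name = c("coal", "gas"))

# setkey sorts each table by SCC and marks it as keyed; the binary-search
# lookup this enables is what makes the join near-instant
setkey(data, SCC)
setkey(scckey, SCC)

# X[Y]: for each row of data, look up the matching row(s) in scckey
datamerge <- scckey[data]
```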



1 answer


I'm not an expert on Amazon EC2, but it is probably built on commodity servers as the underlying hardware platform. "Commodity" in this context means x86 processors with the same underlying architecture as your laptop's. Depending on how powerful your laptop is, its cores may even run at a higher clock speed than the cores in your EC2 instance.



What EC2 gives you is scalability, meaning more cores and more memory than you have locally. But your code has to be written to take advantage of those cores, which means it must execute in parallel. I'm pretty sure data.table is single-threaded, like almost all R packages, so getting more cores won't speed things up on its own. Likewise, if your data already fits in memory, getting more memory won't produce a significant gain.
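To actually exercise those cores from R you have to opt in to parallelism explicitly, for example with the base parallel package. A minimal sketch (the per-chunk work is a placeholder, not the poster's actual computation; mclapply forks, so it falls back to serial execution on Windows):

```r
library(parallel)

n_cores <- detectCores()  # e.g. 36 on a c4.8xlarge

# Fork one worker per chunk; this only pays off when each chunk does
# substantial CPU-bound work, since forking and collecting results
# carries real overhead
chunks  <- split(1:1e6, rep(1:4, length.out = 1e6))
results <- mclapply(chunks, function(idx) sum(sqrt(idx)),
                    mc.cores = min(n_cores, 4))
```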
