Scoring a huge dataset
I have built a machine learning classifier on a 1-2% sample of the data using R / Python, and I am quite happy with the evaluation metrics (precision, recall and F-score).
Now I would like to score a huge 70-million-row dataset that lives in a Hadoop / Hive environment with this classifier, which is coded in R.
Data set information:
70 million rows × 40 variables (columns): about 18 variables are categorical and the remaining 22 are numeric (including integers)
How should I do it? Any suggestions?
What I was thinking of:
a) Extracting the data from Hadoop in 1-million-row chunks as CSV files and feeding them to R
b) Batch processing
This is not a real-time system, so it does not need to run every day, but I would still like the scoring to complete in about 2-3 hours.
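The chunked approach in (a) can be sketched as follows. This is a minimal stand-alone Python sketch, not the actual classifier: `score_row` is a hypothetical placeholder for the real model's predict step, and the column names (`id`, `x1`, `x2`) are assumptions for illustration.

```python
import csv
import io

def score_row(row):
    # Placeholder for the real classifier's predict step; the model
    # trained in R would be loaded and applied here instead.
    return 0.1 * float(row["x1"]) + 0.1 * float(row["x2"])

def score_in_chunks(reader, chunk_size):
    """Score rows in fixed-size chunks, yielding one list of
    (primary_key, score) pairs per chunk."""
    chunk = []
    for row in reader:
        chunk.append(row)
        if len(chunk) == chunk_size:
            yield [(r["id"], score_row(r)) for r in chunk]
            chunk = []
    if chunk:  # flush the final, possibly short, chunk
        yield [(r["id"], score_row(r)) for r in chunk]

# Tiny in-memory stand-in for one CSV partition exported from Hive.
data = io.StringIO("id,x1,x2\n1,1.0,2.0\n2,3.0,4.0\n3,5.0,6.0\n")
chunks = list(score_in_chunks(csv.DictReader(data), chunk_size=2))
```

In the real pipeline, `data` would be each 1-million-row CSV extract and `chunk_size` whatever fits comfortably in memory alongside the model.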
I suppose you want to run your R code (the classifier) on the full dataset rather than on samples.
So you are looking for a way to execute R code in a widely distributed fashion,
and it should have tight integration with the Hadoop components.
RHadoop fits this problem statement well.
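As an aside, the same map-side scoring idea can also be run through Hadoop Streaming, which lets any executable act as a mapper over the input splits. Below is a minimal Python sketch under stated assumptions: the input is tab-separated `key<TAB>feature...` lines, and the linear score is a stand-in for the real classifier.

```python
import sys

def map_score(lines):
    """Hadoop Streaming-style mapper: reads tab-separated lines of the
    form 'key<TAB>f1<TAB>f2...', emits 'key<TAB>score' lines. The linear
    score below is a placeholder, not the real model."""
    for line in lines:
        key, *feats = line.rstrip("\n").split("\t")
        score = sum(0.1 * float(f) for f in feats)
        yield f"{key}\t{score:.4f}"

def main():
    # In a real streaming job, Hadoop pipes each input split through stdin.
    for out in map_score(sys.stdin):
        print(out)
```

A job like this would be submitted with the standard `hadoop jar hadoop-streaming*.jar -input ... -output ... -mapper ...` invocation; since scoring is embarrassingly parallel, no reducer is needed.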
Scoring 80 million records in 8.5 seconds

The code below was run on an off-lease Dell T7400 workstation (64 GB RAM, dual quad-core 3 GHz Xeons, and two RAID 0 SSD arrays on separate channels) which I purchased for $600. I also use the free SPDE engine to partition the dataset. For small datasets like this you might want to consider SAS or WPS: the code below scores 80 million 40-character records in about 8.5 seconds, running 8 parallel sessions of SAS 9.4 (64-bit) on Windows Pro 64-bit. The combination of in-memory R and SAS/WPS makes a great pairing; many SAS users consider datasets under 1 TB to be small.

```sas
%let pgm=utl_score_spde;

/* remove any previous copy of the partitioned dataset */
proc datasets library=spde;
  delete gig23ful_spde;
run;quit;

libname spde spde 'd:/tmp'
  datapath=("f:/wrk/spde_f" "e:/wrk/spde_e" "g:/wrk/spde_g")
  partsize=4g;

/* build 80 million test records: 20 numeric and 20 character variables */
data spde.littledata_spde (compress=char drop=idx);
  retain primary_key;
  array num n1-n20;
  array chr $4 c1-c20;
  do primary_key=1 to 80000000;
    do idx=31 to 50;
      num[idx-30]=uniform(-1);
      chr[idx-30]=repeat(byte(idx),40);
    end;
    output;
  end;
run;quit;

%let _s=%sysfunc(compbl(C:\Progra~1\SASHome\SASFoundation\9.4\sas.exe
  -sysin c:\nul -nosplash -sasautos c:\oto -autoexec c:\oto\Tut_Oto.sas));

/* write the scoring macro to a file so each child session can run it */
data _null_;
  file "c:\oto\utl_scoreit.sas" lrecl=512;
  input;
  put _infile_;
  putlog _infile_;
cards4;
%macro utl_scoreit(beg=1,end=10000000);
  libname spde spde 'd:/tmp'
    datapath=("f:/wrk/spde_f" "e:/wrk/spde_e" "g:/wrk/spde_g")
    partsize=4g;
  libname out "G:/wrk";
  data keyscore;
    set spde.littledata_spde(firstobs=&beg obs=&end keep=
      primary_key n1 n12 n3 n14 n5 n16 n7 n18 n9 n10 c18 c19 c12);
    score= (.1*n1 + .1*n12 + .1*n3 + .1*n14 + .1*n5 +
            .1*n16 + .1*n7 + .1*n18 + .1*n9 + .1*n10 +
            (c18='0000') + (c19='0000') + (c12='0000'))/3;
    keep primary_key score;
  run;
%mend utl_scoreit;
;;;;
run;quit;

%utl_scoreit;

/* launch 8 parallel SAS sessions, each scoring a 10-million-row slice */
%let tym=%sysfunc(time());
systask kill sys101 sys102 sys103 sys104 sys105 sys106 sys107 sys108;
systask command "&_s -termstmt %nrstr(%utl_scoreit(beg=1,end=10000000);) -log G:\wrk\sys101.log" taskname=sys101;
systask command "&_s -termstmt %nrstr(%utl_scoreit(beg=10000001,end=20000000);) -log G:\wrk\sys102.log" taskname=sys102;
systask command "&_s -termstmt %nrstr(%utl_scoreit(beg=20000001,end=30000000);) -log G:\wrk\sys103.log" taskname=sys103;
systask command "&_s -termstmt %nrstr(%utl_scoreit(beg=30000001,end=40000000);) -log G:\wrk\sys104.log" taskname=sys104;
systask command "&_s -termstmt %nrstr(%utl_scoreit(beg=40000001,end=50000000);) -log G:\wrk\sys105.log" taskname=sys105;
systask command "&_s -termstmt %nrstr(%utl_scoreit(beg=50000001,end=60000000);) -log G:\wrk\sys106.log" taskname=sys106;
systask command "&_s -termstmt %nrstr(%utl_scoreit(beg=60000001,end=70000000);) -log G:\wrk\sys107.log" taskname=sys107;
systask command "&_s -termstmt %nrstr(%utl_scoreit(beg=70000001,end=80000000);) -log G:\wrk\sys108.log" taskname=sys108;
waitfor _all_ sys101 sys102 sys103 sys104 sys105 sys106 sys107 sys108;
systask list;
%put %sysevalf( %sysfunc(time()) - &tym);
```

Total elapsed wall time printed by the final `%put`: 8.56500005719863 seconds. Log excerpt from one of the eight child sessions:

```
NOTE: AUTOEXEC processing completed.
NOTE: Libref SPDE was successfully assigned as follows:
      Engine:        SPDE
      Physical Name: d:\tmp\
NOTE: Libref OUT was successfully assigned as follows:
      Engine:        V9
      Physical Name: G:\wrk
NOTE: There were 10000000 observations read from the data set SPDE.LITTLEDATA_SPDE.
NOTE: The data set WORK.KEYSCORE has 10000000 observations and 2 variables.
NOTE: DATA statement used (Total process time):
      real time           7.05 seconds
      cpu time            6.98 seconds
NOTE: SAS Institute Inc., SAS Campus Drive, Cary, NC USA 27513-2414
NOTE: The SAS System used:
      real time           8.34 seconds
      cpu time            7.36 seconds
```
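The split-and-score pattern used by the eight systask sessions above (each child gets a contiguous `beg=`/`end=` slice of the key space) is language-independent and can be sketched in Python as well. This is a minimal sketch, not the author's setup: the stand-in score replaces the real model, and threads are used only to keep the sketch portable where a production job would launch separate OS processes, as the systask commands do.

```python
from multiprocessing.pool import ThreadPool

def partition_bounds(n_rows, n_workers):
    """Split [0, n_rows) into n_workers contiguous slices, mimicking the
    beg=/end= arguments handed to each child SAS session."""
    step = n_rows // n_workers
    bounds = [[i * step, (i + 1) * step] for i in range(n_workers)]
    bounds[-1][1] = n_rows  # last slice absorbs any remainder
    return [tuple(b) for b in bounds]

def score_partition(bounds):
    """Score one slice; a stand-in expression replaces the real model."""
    beg, end = bounds
    return [(k, 0.1 * (k % 10)) for k in range(beg, end)]

def parallel_score(n_rows, n_workers):
    # Threads keep this sketch self-contained; a production job would use
    # separate processes (or separate machines) per slice.
    with ThreadPool(n_workers) as pool:
        parts = pool.map(score_partition, partition_bounds(n_rows, n_workers))
    return [pair for part in parts for pair in part]
```

Because each slice touches a disjoint range of keys, the workers share nothing and the work scales roughly linearly with the number of slices, which is what makes the 8-way split effective.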