Scoring a huge dataset
I have fitted a machine learning classifier on a 1-2% sample of the data using R / Python, and I am quite happy with the accuracy measures (precision, recall and F-score).
Now I would like to score a huge database of 70 million rows (instances), which lives in a Hadoop / Hive environment, with this classifier, which is coded in R.
Data set information:
70 million rows x 40 variables (columns): about 18 variables are categorical and the remaining 22 are numeric (including integers)
How should I do it? Any suggestions?
What I was thinking about is:
a) Extracting the data from the Hadoop system in chunks of 1 million rows as CSV files and feeding them to R
b) Batch processing.
This is not a real-time system, so scoring does not have to run every day, but I would still like to score the full dataset in about 2-3 hours.
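The chunked, batch-style plan described above could be sketched roughly as follows. This is only an illustration in Python (the question mentions R / Python), not the actual classifier: `toy_predict`, the column names `id`, `x1` and `colour`, and the CSV layout are all assumptions made for the example.

```python
import csv
import io
from itertools import islice


def read_chunks(rows, chunk_size):
    """Yield successive lists of at most chunk_size rows."""
    it = iter(rows)
    while True:
        chunk = list(islice(it, chunk_size))
        if not chunk:
            return
        yield chunk


def score_stream(src, dst, predict, key="id", chunk_size=1_000_000):
    """Score a CSV file object chunk by chunk; return the number of rows written."""
    reader = csv.DictReader(src)
    writer = csv.writer(dst)
    writer.writerow([key, "score"])
    total = 0
    for chunk in read_chunks(reader, chunk_size):
        # Only one chunk is held in memory at a time
        scores = predict(chunk)
        for row, score in zip(chunk, scores):
            writer.writerow([row[key], score])
        total += len(chunk)
    return total


def toy_predict(chunk):
    """Placeholder model (hypothetical): stands in for the real classifier."""
    return [0.1 * float(r["x1"]) + (r["colour"] == "red") for r in chunk]


if __name__ == "__main__":
    # Tiny in-memory sample instead of a real 1 M row extract
    sample = io.StringIO("id,x1,colour\n1,1.0,red\n2,2.0,blue\n3,3.0,red\n")
    result = io.StringIO()
    n = score_stream(sample, result, toy_predict, chunk_size=2)
    print(n, "rows scored")
```

Because each chunk is scored independently, the same loop could be run over separate extract files, one per chunk, which is what makes the batch approach easy to spread across machines.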
If you can set up the R runtime on all your datanodes, you can simply run a Hadoop Streaming map-only job that calls the R code.
You can also take a look at SparkR.
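To make the map-only streaming idea concrete, here is a minimal sketch of what such a mapper does: it reads one record per line and writes one `key<TAB>score` line per record, with no reduce step. It is written in Python for illustration; an R script would follow the same read-lines, write-lines pattern. The tab delimiter, the field positions and `score_record` are assumptions, not part of the original answer.

```python
import io


def score_record(fields):
    """Placeholder scoring rule (hypothetical): stands in for the real classifier."""
    x1, x2 = float(fields[1]), float(fields[2])
    return 0.5 * (x1 + x2)


def run_mapper(lines, out, delimiter="\t"):
    """Map-only step: one input record in, one 'key<TAB>score' line out."""
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        fields = line.split(delimiter)
        try:
            score = score_record(fields)
        except (ValueError, IndexError):
            # Malformed record: skip it rather than fail the whole task
            continue
        out.write(f"{fields[0]}\t{score}\n")


if __name__ == "__main__":
    # In-memory demo; a deployed mapper would pass sys.stdin and sys.stdout
    demo_in = io.StringIO("a\t1.0\t3.0\nb\t2.0\t2.0\n")
    demo_out = io.StringIO()
    run_mapper(demo_in, demo_out)
    print(demo_out.getvalue(), end="")
```

When deployed as a streaming mapper, the script would call `run_mapper(sys.stdin, sys.stdout)`, and the job would be made map-only by setting the number of reducers to zero (for example via `mapreduce.job.reduces=0`).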
I assume you want to run your R code (your classifier) on the full dataset rather than on the sample.
So we are looking for a way to execute R code on a large distributed system.
It should also integrate tightly with the Hadoop components.
RHadoop therefore fits your problem statement well.
Scoring 80 million records in 8.5 seconds
The code below was run on an off-lease Dell T7400 workstation with 64 GB of RAM, dual quad-core 3 GHz Xeons and two RAID 0 SSD arrays on separate channels, which I purchased for $600. I also use the free SPDE engine to partition the dataset.
For small datasets like your 80 million rows, you might want to consider SAS or WPS.
The code below scores 80 million 40-character records in about 9 seconds.
In-memory R together with SAS/WPS makes a great combination. Many SAS users consider datasets smaller than 1 TB to be small.
I ran 8 parallel processes on SAS 9.4 64-bit under Win Pro 64-bit.
8.5
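%let pgm=utl_score_spde;
* assign the partitioned SPDE library before it is referenced;
libname spde spde 'd:/tmp'
  datapath=("f:/wrk/spde_f" "e:/wrk/spde_e" "g:/wrk/spde_g")
  partsize=4g;
* drop a table left over from an earlier run;
proc datasets library=spde;
  delete gig23ful_spde;
run;quit;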
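* build 80 million test rows: 20 numeric and 20 four-character columns;
data spde.littledata_spde (compress=char drop=idx);
  retain primary_key;
  array num[20] n1-n20;
  array chr[20] $4 c1-c20;
  do primary_key=1 to 80000000;
    do idx=31 to 50;
      num[idx-30]=uniform(-1);
      * the repeated character is truncated to the declared length of 4;
      chr[idx-30]=repeat(byte(idx),40);
    end;
    output;
  end;
run;quit;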
%let _s=%sysfunc(compbl(C:\Progra~1\SASHome\SASFoundation\9.4\sas.exe -sysin c:\nul -nosplash -sasautos c:\oto -autoexec c:\oto\Tut_Oto.sas));
* score it;
data _null_;file "c:\oto\utl_scoreit.sas" lrecl=512;input;put _infile_;putlog _infile_;
cards4;
%macro utl_scoreit(beg=1,end=10000000);
  libname spde spde 'd:/tmp'
    datapath=("f:/wrk/spde_f" "e:/wrk/spde_e" "g:/wrk/spde_g")
    partsize=4g;
  libname out "G:/wrk";
  data keyscore;
    set spde.littledata_spde(firstobs=&beg obs=&end
      keep=
        primary_key
        n1
        n12
        n3
        n14
        n5
        n16
        n7
        n18
        n9
        n10
        c18
        c19
        c12);
    score= (.1*n1 +
            .1*n12 +
            .1*n3 +
            .1*n14 +
            .1*n5 +
            .1*n16 +
            .1*n7 +
            .1*n18 +
            .1*n9 +
            .1*n10 +
            (c18='0000') +
            (c19='0000') +
            (c12='0000'))/3 ;
    keep primary_key score;
  run;
%mend utl_scoreit;
;;;;
run;quit;
%utl_scoreit;
%let tym=%sysfunc(time());
systask kill sys101 sys102 sys103 sys104 sys105 sys106 sys107 sys108;
systask command "&_s -termstmt %nrstr(%utl_scoreit(beg=1,end=10000000);) -log G:\wrk\sys101.log" taskname=sys101;
systask command "&_s -termstmt %nrstr(%utl_scoreit(beg=10000001,end=20000000);) -log G:\wrk\sys102.log" taskname=sys102 ;
systask command "&_s -termstmt %nrstr(%utl_scoreit(beg=20000001,end=30000000);) -log G:\wrk\sys103.log" taskname=sys103 ;
systask command "&_s -termstmt %nrstr(%utl_scoreit(beg=30000001,end=40000000);) -log G:\wrk\sys104.log" taskname=sys104 ;
systask command "&_s -termstmt %nrstr(%utl_scoreit(beg=40000001,end=50000000);) -log G:\wrk\sys105.log" taskname=sys105 ;
systask command "&_s -termstmt %nrstr(%utl_scoreit(beg=50000001,end=60000000);) -log G:\wrk\sys106.log" taskname=sys106 ;
systask command "&_s -termstmt %nrstr(%utl_scoreit(beg=60000001,end=70000000);) -log G:\wrk\sys107.log" taskname=sys107 ;
systask command "&_s -termstmt %nrstr(%utl_scoreit(beg=70000001,end=80000000);) -log G:\wrk\sys108.log" taskname=sys108 ;
waitfor _all_ sys101 sys102 sys103 sys104 sys105 sys106 sys107 sys108;
systask list;
%put %sysevalf( %sysfunc(time()) - &tym);
8.56500005719863
NOTE: AUTOEXEC processing completed.
NOTE: Libref SPDE was successfully assigned as follows:
Engine: SPDE
Physical Name: d:\tmp\
NOTE: Libref OUT was successfully assigned as follows:
Engine: V9
Physical Name: G:\wrk
NOTE: There were 10000000 observations read from the data set SPDE.LITTLEDATA_SPDE.
NOTE: The data set WORK.KEYSCORE has 10000000 observations and 2 variables.
NOTE: DATA statement used (Total process time):
real time 7.05 seconds
cpu time 6.98 seconds
NOTE: SAS Institute Inc., SAS Campus Drive, Cary, NC USA 27513-2414
NOTE: The SAS System used:
real time 8.34 seconds
cpu time 7.36 seconds
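For readers who do not use SAS, the pattern in the program above (split the key range into 8 equal slices, score each slice in its own process, then wait for all of them) could be sketched in Python roughly as follows. The column names and the score formula are taken from the SAS step; `make_row` and its synthetic values are purely hypothetical stand-ins for reading the real table, and the sketch makes no claim about reproducing the timings.

```python
from concurrent.futures import ProcessPoolExecutor

NUMERIC = ["n1", "n12", "n3", "n14", "n5", "n16", "n7", "n18", "n9", "n10"]
FLAGS = ["c18", "c19", "c12"]


def slices(n_rows, n_workers):
    """Split keys 1..n_rows into contiguous inclusive (beg, end) ranges."""
    size, extra = divmod(n_rows, n_workers)
    out, beg = [], 1
    for i in range(n_workers):
        end = beg + size - 1 + (1 if i < extra else 0)
        if end >= beg:
            out.append((beg, end))
        beg = end + 1
    return out


def score_row(row):
    """Same arithmetic as the SAS step: (0.1 * each numeric + 3 flags) / 3."""
    total = sum(0.1 * row[name] for name in NUMERIC)
    total += sum(row[name] == "0000" for name in FLAGS)
    return total / 3


def make_row(key):
    """Synthetic record (hypothetical data, not the SPDE table)."""
    row = {name: (key % 10) / 10 for name in NUMERIC}
    row.update({name: "0000" if key % 2 == 0 else "1111" for name in FLAGS})
    return row


def score_slice(bounds):
    """Score every key in one inclusive (beg, end) slice."""
    beg, end = bounds
    return [(key, score_row(make_row(key))) for key in range(beg, end + 1)]


def score_all(n_rows, n_workers=8):
    """Score all slices in separate processes and collect the results in key order."""
    with ProcessPoolExecutor(max_workers=n_workers) as pool:
        parts = pool.map(score_slice, slices(n_rows, n_workers))
        return [pair for part in parts for pair in part]


if __name__ == "__main__":
    results = score_all(n_rows=80, n_workers=8)
    print(len(results), "rows scored")
```

With `n_rows=80_000_000` and `n_workers=8`, `slices` yields the same boundaries as the eight `systask` calls: 1 to 10000000, 10000001 to 20000000, and so on.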