Scoring a huge dataset
I have built a machine learning classifier on a 1-2% sample of the data using R / Python, and I am quite happy with the evaluation metrics (precision, recall and F-score).
Now I would like to score a huge 70-million-row dataset that lives in a Hadoop / Hive environment with this classifier, which is coded in R.
Data set information:
70 million rows × 40 variables (columns): about 18 variables are categorical and the remaining 22 are numeric (including integers)
How should I do it? Any suggestions?
What I was thinking of:
a) Extracting the data from Hadoop in 1-million-row chunks as CSV files and feeding them to R
b) Batch processing
This is not a real-time system, so it does not need to run every day, but I would still like the scoring to complete in about 2-3 hours.
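The chunked approach in (a) can be sketched as follows. This is a minimal stand-alone Python sketch, not the actual classifier: `score_row` is a hypothetical placeholder for the real model's predict step, and the column names (`id`, `x1`, `x2`) are assumptions for illustration.

```python
import csv
import io

def score_row(row):
    # Placeholder for the real classifier's predict step; the model
    # trained in R would be loaded and applied here instead.
    return 0.1 * float(row["x1"]) + 0.1 * float(row["x2"])

def score_in_chunks(reader, chunk_size):
    """Score rows in fixed-size chunks, yielding one list of
    (primary_key, score) pairs per chunk."""
    chunk = []
    for row in reader:
        chunk.append(row)
        if len(chunk) == chunk_size:
            yield [(r["id"], score_row(r)) for r in chunk]
            chunk = []
    if chunk:  # flush the final, possibly short, chunk
        yield [(r["id"], score_row(r)) for r in chunk]

# Tiny in-memory stand-in for one CSV partition exported from Hive.
data = io.StringIO("id,x1,x2\n1,1.0,2.0\n2,3.0,4.0\n3,5.0,6.0\n")
chunks = list(score_in_chunks(csv.DictReader(data), chunk_size=2))
```

In the real pipeline, `data` would be each 1-million-row CSV extract and `chunk_size` whatever fits comfortably in memory alongside the model.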
I suppose you want to run your R code (the classifier) on the full dataset rather than on samples.
So you are looking for a way to execute R code in a widely distributed fashion,
and it should have tight integration with the Hadoop components.
RHadoop fits this problem statement well.
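As an aside, the same map-side scoring idea can also be run through Hadoop Streaming, which lets any executable act as a mapper over the input splits. Below is a minimal Python sketch under stated assumptions: the input is tab-separated `key<TAB>feature...` lines, and the linear score is a stand-in for the real classifier.

```python
import sys

def map_score(lines):
    """Hadoop Streaming-style mapper: reads tab-separated lines of the
    form 'key<TAB>f1<TAB>f2...', emits 'key<TAB>score' lines. The linear
    score below is a placeholder, not the real model."""
    for line in lines:
        key, *feats = line.rstrip("\n").split("\t")
        score = sum(0.1 * float(f) for f in feats)
        yield f"{key}\t{score:.4f}"

def main():
    # In a real streaming job, Hadoop pipes each input split through stdin.
    for out in map_score(sys.stdin):
        print(out)
```

A job like this would be submitted with the standard `hadoop jar hadoop-streaming*.jar -input ... -output ... -mapper ...` invocation; since scoring is embarrassingly parallel, no reducer is needed.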
Scoring 80 million records in 8.5 seconds

The code below was run on an off-lease Dell T7400 workstation (64 GB RAM, dual quad-core 3 GHz Xeons, and two RAID 0 SSD arrays on separate channels) which I purchased for $600. I also use the free SPDE engine to partition the dataset. For small datasets like this you might want to consider SAS or WPS: the code below scores 80 million 40-character records in about 8.5 seconds, running 8 parallel sessions of SAS 9.4 (64-bit) on Windows Pro 64-bit. The combination of in-memory R and SAS/WPS makes a great pairing; many SAS users consider datasets under 1 TB to be small.

```sas
%let pgm=utl_score_spde;

/* remove any previous copy of the partitioned dataset */
proc datasets library=spde;
  delete gig23ful_spde;
run;quit;

libname spde spde 'd:/tmp'
  datapath=("f:/wrk/spde_f" "e:/wrk/spde_e" "g:/wrk/spde_g")
  partsize=4g;

/* build 80 million test records: 20 numeric and 20 character variables */
data spde.littledata_spde (compress=char drop=idx);
  retain primary_key;
  array num n1-n20;
  array chr $4 c1-c20;
  do primary_key=1 to 80000000;
    do idx=31 to 50;
      num[idx-30]=uniform(-1);
      chr[idx-30]=repeat(byte(idx),40);
    end;
    output;
  end;
run;quit;

%let _s=%sysfunc(compbl(C:\Progra~1\SASHome\SASFoundation\9.4\sas.exe
  -sysin c:\nul -nosplash -sasautos c:\oto -autoexec c:\oto\Tut_Oto.sas));

/* write the scoring macro to a file so each child session can run it */
data _null_;
  file "c:\oto\utl_scoreit.sas" lrecl=512;
  input;
  put _infile_;
  putlog _infile_;
cards4;
%macro utl_scoreit(beg=1,end=10000000);
  libname spde spde 'd:/tmp'
    datapath=("f:/wrk/spde_f" "e:/wrk/spde_e" "g:/wrk/spde_g")
    partsize=4g;
  libname out "G:/wrk";
  data keyscore;
    set spde.littledata_spde(firstobs=&beg obs=&end keep=
      primary_key n1 n12 n3 n14 n5 n16 n7 n18 n9 n10 c18 c19 c12);
    score= (.1*n1 + .1*n12 + .1*n3 + .1*n14 + .1*n5 +
            .1*n16 + .1*n7 + .1*n18 + .1*n9 + .1*n10 +
            (c18='0000') + (c19='0000') + (c12='0000'))/3;
    keep primary_key score;
  run;
%mend utl_scoreit;
;;;;
run;quit;

%utl_scoreit;

/* launch 8 parallel SAS sessions, each scoring a 10-million-row slice */
%let tym=%sysfunc(time());
systask kill sys101 sys102 sys103 sys104 sys105 sys106 sys107 sys108;
systask command "&_s -termstmt %nrstr(%utl_scoreit(beg=1,end=10000000);) -log G:\wrk\sys101.log" taskname=sys101;
systask command "&_s -termstmt %nrstr(%utl_scoreit(beg=10000001,end=20000000);) -log G:\wrk\sys102.log" taskname=sys102;
systask command "&_s -termstmt %nrstr(%utl_scoreit(beg=20000001,end=30000000);) -log G:\wrk\sys103.log" taskname=sys103;
systask command "&_s -termstmt %nrstr(%utl_scoreit(beg=30000001,end=40000000);) -log G:\wrk\sys104.log" taskname=sys104;
systask command "&_s -termstmt %nrstr(%utl_scoreit(beg=40000001,end=50000000);) -log G:\wrk\sys105.log" taskname=sys105;
systask command "&_s -termstmt %nrstr(%utl_scoreit(beg=50000001,end=60000000);) -log G:\wrk\sys106.log" taskname=sys106;
systask command "&_s -termstmt %nrstr(%utl_scoreit(beg=60000001,end=70000000);) -log G:\wrk\sys107.log" taskname=sys107;
systask command "&_s -termstmt %nrstr(%utl_scoreit(beg=70000001,end=80000000);) -log G:\wrk\sys108.log" taskname=sys108;
waitfor _all_ sys101 sys102 sys103 sys104 sys105 sys106 sys107 sys108;
systask list;
%put %sysevalf( %sysfunc(time()) - &tym);
```

Total elapsed wall time printed by the final `%put`: 8.56500005719863 seconds. Log excerpt from one of the eight child sessions:

```
NOTE: AUTOEXEC processing completed.
NOTE: Libref SPDE was successfully assigned as follows:
      Engine:        SPDE
      Physical Name: d:\tmp\
NOTE: Libref OUT was successfully assigned as follows:
      Engine:        V9
      Physical Name: G:\wrk
NOTE: There were 10000000 observations read from the data set SPDE.LITTLEDATA_SPDE.
NOTE: The data set WORK.KEYSCORE has 10000000 observations and 2 variables.
NOTE: DATA statement used (Total process time):
      real time           7.05 seconds
      cpu time            6.98 seconds
NOTE: SAS Institute Inc., SAS Campus Drive, Cary, NC USA 27513-2414
NOTE: The SAS System used:
      real time           8.34 seconds
      cpu time            7.36 seconds
```
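The split-and-score pattern used by the eight systask sessions above (each child gets a contiguous `beg=`/`end=` slice of the key space) is language-independent and can be sketched in Python as well. This is a minimal sketch, not the author's setup: the stand-in score replaces the real model, and threads are used only to keep the sketch portable where a production job would launch separate OS processes, as the systask commands do.

```python
from multiprocessing.pool import ThreadPool

def partition_bounds(n_rows, n_workers):
    """Split [0, n_rows) into n_workers contiguous slices, mimicking the
    beg=/end= arguments handed to each child SAS session."""
    step = n_rows // n_workers
    bounds = [[i * step, (i + 1) * step] for i in range(n_workers)]
    bounds[-1][1] = n_rows  # last slice absorbs any remainder
    return [tuple(b) for b in bounds]

def score_partition(bounds):
    """Score one slice; a stand-in expression replaces the real model."""
    beg, end = bounds
    return [(k, 0.1 * (k % 10)) for k in range(beg, end)]

def parallel_score(n_rows, n_workers):
    # Threads keep this sketch self-contained; a production job would use
    # separate processes (or separate machines) per slice.
    with ThreadPool(n_workers) as pool:
        parts = pool.map(score_partition, partition_bounds(n_rows, n_workers))
    return [pair for part in parts for pair in part]
```

Because each slice touches a disjoint range of keys, the workers share nothing and the work scales roughly linearly with the number of slices, which is what makes the 8-way split effective.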