Create a table of duplicates from a SAS dataset with more than 50 variables

I have a large SAS dataset (54 variables and over 10 million observations) that I need to load into Teradata. There are duplicates that must also be loaded, and my machine is not configured for MultiLoad. I simply want to create a table of the roughly 300,000 duplicates so that I can append them to the original load that did not accept them. The logic I have read in other posts seems suited to tables with only a few variables. Is there another way to create a new table that identifies all cases sharing the same combination of all 54 variables? I am trying to avoid the proc sort ... by logic with all 54 variables. The query-builder approach also proved ineffective. Thanks.

+3




4 answers


Using proc sort is a good way to do this; you just need to create a nicer key to sort on.

Create some test data.

data have;
  x = 1;
  y = 'a';
  output; 
  output;   /* a complete duplicate of the previous record */
  x = 2;
  output;
run;


Create a new field that is essentially all of the fields in the record concatenated together and then run through the md5() (hashing) algorithm. This gives you a nice short field that uniquely identifies the combination of all the values in that record.



data temp;
  length hash $16;
  set have;
  hash = md5(cats(of _all_));   /* concatenate every variable, then hash the result */
run;


Now use proc sort with nodupkey on our new hash field, and output the duplicate records to a table named want:

proc sort data=temp nodupkey dupout=want;
  by hash;
run;


+1




You can do something like this:

proc sql;
  create table rem_dups as 
    select <key_fields>, count(*) as n_recs
    from duplicates
    group by <key_fields>
    having count(*) > 1;
quit; 

proc sql; 
  create table target as 
    select dp.*
    from duplicates dp 
    left join rem_dups rd 
      on <key_fields>
    where rd.<key_fields> is null;
quit; 




If there are more than 300K duplicates, though, this option won't help much. Also, I'm afraid I don't know Teradata or how you load tables into it.
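For completeness, a minimal sketch of how the same kind of query could pull the duplicate rows themselves, reusing the two-variable have dataset from the first answer (x and y stand in here for the real <key_fields>):

proc sql;
  /* keys that occur more than once */
  create table dup_keys as
    select x, y, count(*) as n_recs
    from have
    group by x, y
    having count(*) > 1;

  /* every row that shares one of those keys */
  create table dup_rows as
    select h.*
    from have h
    inner join dup_keys d
      on h.x = d.x and h.y = d.y;
quit;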

0




First, a few comments on proc sort, and then the core of the "fast" suggestion after the break.


If the table is completely unsorted (i.e., duplicates could appear anywhere in the dataset), then proc sort is probably your simplest option. If you have a key that guarantees duplicate records will be adjacent, you can do the following:

proc sort data=have out=uniques noduprecs dupout=dups;
  by <key>;
run;


This will put the duplicated records (note noduprecs, not nodupkey; noduprecs requires all 54 variables to be identical) into the secondary dataset (dups in the above). However, if the complete duplicates are not physically adjacent (i.e., you have 4 or 5 records with the same key, but only two of them are complete duplicates), it may not catch them; you would need a second sort, or you would need to list all of the variables in your by statement (which can get messy). You could also use Rob's md5 technique to simplify that, as sketched below.
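A sketch of that combination, again using the toy have dataset with x standing in for the real key: compute the md5 field as in the first answer, then sort by the key plus the hash so that complete duplicates end up adjacent, where noduprecs can catch them.

data temp;
  length hash $16;
  set have;
  hash = md5(cats(of _all_));    /* one short field standing in for all of the variables */
run;

proc sort data=temp out=uniques noduprecs dupout=dups;
  by x hash;                     /* x stands in for the real key; the hash avoids listing the rest */
run;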

If the table is not sorted but the duplicate records are adjacent, you can use by with the notsorted option.

data uniques dups;
  set have;
  by <all 54 variables> notsorted;
  if not (first.<last variable in the list>) then output dups;
  else output uniques;
run;


The notsorted option tells SAS not to complain if the by groups are out of order, while still allowing first./last. processing. Not a great option, though, especially since you would have to type out all 54 variables.
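A runnable miniature of the same idea, using the two-variable have dataset from the first answer in place of the 54-variable table:

data uniques dups;
  set have;
  by x y notsorted;                   /* no sort required, but complete duplicates must already be adjacent */
  if not first.y then output dups;    /* second and later records in a group are complete duplicates */
  else output uniques;
run;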


The fastest way to do this is probably with a hash table, if you have enough RAM to handle it, or you could partition the table somehow first (without splitting up the duplicates). 10M rows of 54 (say, 10-byte) variables means roughly 5.4 GB of data, so this only works if you have 5.4 GB of RAM available to SAS to build a hash table with.

If you know that a subset of your 54 variables is sufficient to establish uniqueness, then the unq hash only needs to contain that subset of variables (i.e., it might be only four or five key variables). The dup hash table must contain all of the variables, since it is used to write out the duplicates.

This works by using modify to process the dataset in place quickly rather than rewriting most of the observations, using remove to delete the duplicates, and using the hash object's output method to write the duplicates out to a new dataset. The unq hash table is only used for lookups, so, again, it could contain just a subset of the variables.

I am also using a technique here to get the complete variable list into a macro variable, so you don't have to type out 54 variable names.

data class;   *make some dummy data with a few true duplicates;
  set sashelp.class;
  if age=15 then output;
  output;
run;

proc sql;
  select quote(name) 
    into :namelist separated by ','
    from dictionary.columns
    where libname='WORK' and memname='CLASS'
  ;  *note UPCASE names almost always here;
quit;

data class;
  if 0 then set class;
  if _n_=1 then do;               *make a pair of hash tables;
     declare hash unq();
     unq.defineKey(&namelist.);
     unq.defineData(&namelist.);
     unq.defineDone();
     declare hash dup(multidata:'y'); *the latter allows this to have dups in it (if your dups can have dups);
     dup.defineKey(&namelist.);
     dup.defineData(&namelist.);
     dup.defineDone();
  end;
  modify class end=eof;
  rc_c = unq.check();           *check to see if it is in the unique hash;
  if rc_c ne 0 then unq.add();  *if it is not, add it;
  else do;                      *otherwise add it to the duplicate hash and mark to remove it;
    dup.add();
    delete_check=1;
  end;

  if eof then do;                      *if you are at the end, output the dups;
    rc_d = dup.output(dataset:'work.dups');
  end;

  if delete_check eq 1 then remove;        *actually remove it from unique dataset;
run;


0




Instead of trying to avoid proc sort, I would recommend that you use proc sort with an index.

Read the documentation on indexes.

I'm pretty sure there must be some identifier(s), other than _n_, that distinguish observations, and using an index together with noduprecs or nodupkey and dupout=dataset would be an efficient choice. In addition, an index can also speed up other operations such as merging and reporting.
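As a rough sketch of one way an index could be put to use here (again borrowing the toy have dataset, with x and y standing in for the real identifiers): build a composite index with proc datasets; a by statement in a subsequent data step can then read the table in key order without a physical sort.

proc datasets library=work nolist;
  modify have;
  index create keyidx = (x y);    /* composite index on the stand-in key */
quit;

data uniques dups;
  set have;
  by x y;                         /* satisfied by the index, so no proc sort is needed */
  if first.y and last.y then output uniques;
  else output dups;               /* every record whose key occurs more than once */
run;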

Anyway, I don't think a dataset with 10 million observations is a good dataset to work with, let alone one with 54 variables.

-2








