Create a table of duplicates from a SAS dataset with more than 50 variables
I have a large SAS dataset (54 variables and over 10 million observations) that I need to upload to Teradata. There are duplicates that also need to go in, and my machine is not configured for MultiLoad. I just want to create a table of the 300,000 duplicates that I can append after the original load that didn't accept them. The logic I've read in other posts seems good only for tables with a handful of variables. Is there another way to create a new table that lists all cases that have the same combination of all 54 variables? I am trying to avoid a proc sort by all 54 variables. The query-building method also proved ineffective. Thanks.
Using proc sort
is a good way to do it, you just need a more compact key to sort on.
Create some test data.
data have;
x = 1;
y = 'a';
output;
output;
x = 2;
output;
run;
Create a new field that is essentially the result of concatenating all the fields in the row and running them through the md5()
(hashing) algorithm. This gives you a nice short field that uniquely identifies the combination of all the values in that row.
data temp;
length hash $16;
set have;
hash = md5(cats(of _all_));
run;
Now use proc sort on our new hash field, and output the duplicate records to a table named "want":
proc sort data=temp nodupkey dupout=want;
by hash;
run;
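For intuition, here is the same idea sketched outside SAS, in Python's standard library: concatenate each row's values, hash them with md5, and treat a repeated hash as a duplicate. The toy rows mirror the `have` example above; nothing here is SAS-specific.

```python
import hashlib

# Toy rows mirroring the SAS example: two copies of (1, 'a'), one (2, 'a').
rows = [(1, "a"), (1, "a"), (2, "a")]

def row_hash(row):
    # Concatenate all field values (like cats(of _all_)) and hash them.
    # (A delimiter between values would guard against accidental
    # concatenation collisions such as "12"+"3" vs "1"+"23".)
    joined = "".join(str(v) for v in row)
    return hashlib.md5(joined.encode()).hexdigest()

seen = set()
uniques, dups = [], []
for row in rows:
    h = row_hash(row)
    if h in seen:
        dups.append(row)       # same combination already seen -> duplicate
    else:
        seen.add(h)
        uniques.append(row)

print(dups)     # → [(1, 'a')]
print(uniques)  # → [(1, 'a'), (2, 'a')]
```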
You can do something like this:
proc sql;
create table rem_dups as
select <key_fields>, count(*) as cnt from duplicates
group by <key_fields>
having count(*) > 1;
quit;
proc sql;
create table target as
select dp.* from duplicates dp
left join rem_dups rd
on <key_fields>
where rd.<key_field> is null;
quit;
If there are more than 300K duplicates, this option may not help. Also, I'm afraid I can't say much about Teradata and how you load tables into it.
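The two-query pattern (group-by-having to find duplicated keys, then an anti-join to exclude them) can be illustrated with an in-memory SQLite database in Python. The single key field k and the toy rows are assumptions standing in for the <key_fields> placeholder.

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
# Hypothetical table with one key field "k" and one payload field "v".
cur.execute("CREATE TABLE duplicates (k INTEGER, v TEXT)")
cur.executemany("INSERT INTO duplicates VALUES (?, ?)",
                [(1, "a"), (1, "a"), (2, "b")])

# Step 1: key values that occur more than once.
cur.execute("""CREATE TABLE rem_dups AS
               SELECT k, COUNT(*) AS cnt FROM duplicates
               GROUP BY k HAVING COUNT(*) > 1""")

# Step 2: anti-join keeps only rows whose key is not duplicated.
cur.execute("""CREATE TABLE target AS
               SELECT dp.* FROM duplicates dp
               LEFT JOIN rem_dups rd ON dp.k = rd.k
               WHERE rd.k IS NULL""")

print(cur.execute("SELECT * FROM target").fetchall())  # → [(2, 'b')]
```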
First, a couple of sort-based approaches, and then the heart of the fast approach after the break.
If the table is completely unsorted (i.e., duplicates could appear anywhere in the dataset), then proc sort
is probably your simplest option. If you have a key that guarantees duplicate records are adjacent, you can do the following:
proc sort data=have out=uniques noduprecs dupout=dups;
by <key>;
run;
This will put the duplicate records (note noduprecs
, not nodupkey
, which requires all 54 variables to be identical) in the secondary dataset ( dups
in the above). However, if they are not physically adjacent (i.e., you have 4 or 5 records with the same key, but only two are complete duplicates), it may not catch them; you would need a second sort, or you would need to list all the variables in your by
statement (which can be messy). You can also use Rob's md5
technique to simplify this.
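A small Python sketch of that pitfall: rows sorted only by the key can leave complete duplicates non-adjacent, so an adjacency check misses them until you sort by every field. The rows are made-up toy data (first field is the key).

```python
# Rows sorted by the key (first field) only; the two complete duplicates
# (1, 'a') are separated by (1, 'b'), so an adjacency check misses them.
rows = [(1, "a"), (1, "b"), (1, "a")]

def adjacent_dups(rs):
    # A record is flagged only when it equals its immediate predecessor.
    return [r for prev, r in zip(rs, rs[1:]) if r == prev]

print(adjacent_dups(rows))          # → []          (missed!)
print(adjacent_dups(sorted(rows)))  # → [(1, 'a')]  (full sort finds it)
```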
If the table is not sorted but the duplicate records are contiguous, you can use by
with the notsorted
option.
data uniques dups;
set have;
by <all 54 variables> notsorted;
if not (first.<last variable in the list>) then output dups;
else output uniques;
run;
This tells SAS not to complain that the data is not in sorted order, but still lets you use first/last processing. It's not a great option, though, particularly since you need to list every variable.
The fastest way to do this is probably with a hash table, if you have enough RAM to handle it, or if you can split your table somehow (without losing your duplicates). 10M rows of 54 variables at (say) 10 bytes each means 5.4 GB of data, so this only works if you have 5.4 GB of RAM available to SAS for building a hash table.
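The arithmetic behind that estimate, as a quick Python check (the 10-bytes-per-variable figure is this answer's assumption, not a measured value):

```python
# Back-of-envelope RAM estimate for holding the whole table in a hash.
rows = 10_000_000      # observations
n_vars = 54            # variables per observation
bytes_per_var = 10     # assumed average width of a variable
total_gb = rows * n_vars * bytes_per_var / 1e9
print(f"{total_gb:.1f} GB")  # → 5.4 GB
```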
If you know that a subset of your 54 variables is sufficient to establish uniqueness, then the unique hash ( unq
below) only has to contain that subset of variables (i.e., it might be only four or five index variables). The dups
hash table has to contain all the variables (since it will be used to output the duplicates).
This uses modify
to process the dataset quickly, without rewriting most of the observations; remove
to delete the duplicates; and the hash table output
method to write the duplicates to a new dataset. The unq
hash table is used only for lookup, so, again, it can contain just the subset of variables.
I am also using a technique here to get the complete list of variables into a macro variable, so you don't need to type out all 54 names.
data class; *make some dummy data with a few true duplicates;
set sashelp.class;
if age=15 then output;
output;
run;
proc sql;
select quote(name)
into :namelist separated by ','
from dictionary.columns
where libname='WORK' and memname='CLASS'
; *note UPCASE names almost always here;
quit;
data class;
if 0 then set class;
if _n_=1 then do; *make a pair of hash tables;
declare hash unq();
unq.defineKey(&namelist.);
unq.defineData(&namelist.);
unq.defineDone();
declare hash dup(multidata:'y'); *the latter allows this to have dups in it (if your dups can have dups);
dup.defineKey(&namelist.);
dup.defineData(&namelist.);
dup.defineDone();
end;
modify class end=eof;
rc_c = unq.check(); *check to see if it is in the unique hash;
if rc_c ne 0 then unq.add(); *if it is not, add it;
else do; *otherwise add it to the duplicate hash and mark to remove it;
dup.add();
delete_check=1;
end;
if eof then do; *if you are at the end, output the dups;
rc_d = dup.output(dataset:'work.dups');
end;
if delete_check eq 1 then remove; *actually remove it from unique dataset;
run;
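The control flow of that modify/hash step can be sketched in Python with a plain dict and lists (the field values and the choice of key subset are made up for illustration): the first occurrence of a key stays in the dataset, and later occurrences are routed to a duplicates table.

```python
# unq is keyed on a subset of fields assumed sufficient for uniqueness
# (here just the first field); dups keeps the complete records.
rows = [("a1", 1, "x"), ("a1", 1, "x"), ("b2", 2, "y")]

unq = {}     # key-subset -> first full record (lookup only)
dups = []    # complete duplicate records, for output
kept = []    # what survives in the "modified" dataset
for row in rows:
    key = row[0]          # assumed unique-identifying subset
    if key in unq:
        dups.append(row)  # later copy: route to the duplicates table
    else:
        unq[key] = row
        kept.append(row)  # first copy: stays in place

print(kept)  # → [('a1', 1, 'x'), ('b2', 2, 'y')]
print(dups)  # → [('a1', 1, 'x')]
```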
Instead of trying to avoid proc sort, I would recommend you use proc sort
together with an index
.
I'm pretty sure there must be some identifier(s) that distinguish an observation other than _n_
, and sorting on an index with noduprecs
or nodupkey
and dupout=<dataset>
would be an efficient choice. In addition, an index can also facilitate other operations such as merges and reporting.
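The idea can be illustrated with SQLite in Python: a composite index over the identifying fields lets the engine locate duplicate key combinations without fully re-sorting the table (the table and column names here are invented for the example):

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE have (id INTEGER, v TEXT)")
cur.executemany("INSERT INTO have VALUES (?, ?)",
                [(1, "a"), (1, "a"), (2, "b")])

# A composite index over the identifying fields; the GROUP BY that
# finds duplicate combinations can then walk the index in key order.
cur.execute("CREATE INDEX idx_have ON have (id, v)")

cur.execute("""SELECT id, v FROM have
               GROUP BY id, v HAVING COUNT(*) > 1""")
print(cur.fetchall())  # → [(1, 'a')]
```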
Anyway, I don't think a dataset with 10 million observations is a convenient one to work with, let alone one with 54 variables.