SAS Hash Object Sum

Question

I am trying to understand the sum() method of a SAS hash object. As I understand it, suminc: names a variable that the hash object keeps track of, and sum() returns the running total of that variable's values.

Suppose I have a dataset

data sample;
    input id x;
    datalines;
1 350
1 220
1 300
2 300
2 500
;
run;

I want the aggregation to be

id x_sum
2  800
1  870

However, my hash code is:

data _null_;
    set sample end= done;

    length x_sum 8;

    if _N_ = 1 then 
    do;
        declare hash T(suminc:"x");
        T.definekey("id");
        T.definedata("id");
        T.definedata("x_sum");
        T.definedone();
    end;


    T.ref();

    T.sum(sum:x_sum);

    put _all_;


    T.replace();

    if done then T.output(dataset: "my_set");

run;

outputs:

id x_sum
2  800
1  520

as the dataset, and this in the log:

done=0 id=1 x=350 x_sum=350 _ERROR_=0 _N_=1
done=0 id=1 x=220 x_sum=570 _ERROR_=0 _N_=2
**done=0 id=1 x=300 x_sum=520 _ERROR_=0 _N_=3**
done=0 id=2 x=300 x_sum=300 _ERROR_=0 _N_=4
done=1 id=2 x=500 x_sum=800 _ERROR_=0 _N_=5

Can someone explain to me what is going on?

UPDATE AFTER ALL COMMENTS:

Hey everyone, I'm completely new to Stack Overflow, so I'm still getting my head around this accept-answer system ... I felt like everyone contributed something.

Anyway, after a lot of experimentation, I figured out what was going on.

Basically, when .replace() is called, the sum counter is reset. That, and not anything .sum() does, is the reason the results came out this way: replace() reset my count after the second observation, so ref() only summed the last two observations for id 1 (220 + 300 = 520).

Hope this is useful information for everyone. If others have a better understanding, please share.

+3

sas

Matt May 17 '15 at 11:35 PM

4 answers

I think T.replace() is part of your problem. I don't know what good it does, only harm. If you comment it out, PUT _ALL_ shows what you want. A surprise to me (I'm still new to SAS hashing) is that I couldn't get T.output() to write the x_sum variable to the output dataset. I hope someone else chimes in. Perhaps these summary accumulators are treated differently? Below, since the PDV had the correct data, I switched to writing the output dataset from the normal DATA step instead of using output().

data my_set(keep=id x_sum);
  set sample;
  by id;

  if _N_ = 1 then 
  do;
    declare hash T(suminc:"x");
    T.definekey("id");
    T.definedata("x");
    T.definedone();
  end;

  T.ref();

  T.sum(sum:x_sum);

  put _all_;

  if last.id;
run;

0

Quentin May 18 '15 at 1:30

Maybe you can try this code:

data _null_;

length x_sum 8;

if _N_ = 1 then 
do;
    declare hash T(suminc:"x");
    T.definekey("id");
    T.definedata("id");
    T.definedata("x_ct","x_sum");
    T.definedone();
end;

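do until(done);
    set sample end = done;
    /* New key: start its count and running total at zero */
    if t.find() ^= 0 then do;
        x_ct = 0;
        x_sum = 0;
    end;
    x_ct ++ 1;     /* sum statement: count this observation */
    x_sum ++ x;    /* sum statement: add x to the running total */
    t.replace();   /* store the updated count and total under this id */
end;
t.output(dataset:'want');
stop;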
run;

0

smilepj May 18 '15 at 12:32

Your problem is that, the way you have declared your hash table, your key id is not unique for every record. You can fix this by enabling the multidata option, multidata:"yes":

if _N_ = 1 then 
do;
    declare hash T(suminc:"x", multidata:"yes");
    T.definekey("id");
    T.definedata("id");
    T.definedata("x_sum");
    T.definedone();
end;

This gives:

done=0 id=1 x=350 x_sum=350 _ERROR_=0 _N_=1
done=0 id=1 x=220 x_sum=570 _ERROR_=0 _N_=2
**done=0 id=1 x=300 x_sum=870 _ERROR_=0 _N_=3**
done=0 id=2 x=300 x_sum=300 _ERROR_=0 _N_=4
done=1 id=2 x=500 x_sum=800 _ERROR_=0 _N_=5

0

Bendy May 18 '15 at 13:18

Quentin · Accepted Answer · 2015-05-19T13:22:03+0000

The problem is that you are using replace(). From the docs (the 9.3 language reference, on using the hash object):

The SUMINC argument tag instructs the hash object to allocate internal storage for maintaining a summary value for each key. The summary value of a hash key is initialized to the value of the SUMINC variable whenever the ADD or REPLACE method is used. The summary value of a hash key is incremented by the value of the SUMINC variable whenever the FIND, CHECK, or REF method is used.

I think the important point is that the summary value is neither the DATA step variable x_sum nor the x_sum stored as a data variable in the hash table. It is stored outside the hash table's data, as ancillary information that is effectively an attribute of the key (in my mind...).
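To make the quoted lifecycle concrete, here is a minimal sketch. It is an illustration added for this write-up, not part of the original answer, and it has not been run here; the variable name total and the PUT labels are made up for the example. It drives a single key through ref(), replace(), and ref() again, printing the key's summary value after each stage. If the documented behavior holds, the first and last values should match the 570 and 520 in the question's log, with the drop to the current x happening right after replace().

```sas
data _null_;
    length id x total 8;

    declare hash T(suminc: "x");
    T.definekey("id");
    T.definedata("id");
    T.definedone();

    id = 1;

    x = 350;
    T.ref();                 /* key not present yet, so it is added */
    x = 220;
    T.ref();                 /* key found: summary incremented by x */
    T.sum(sum: total);       /* copy the key's summary into total */
    put "after two ref() calls: " total=;

    T.replace();             /* documented side effect: summary re-initialized to x */
    T.sum(sum: total);
    put "after replace(): " total=;

    x = 300;
    T.ref();                 /* increments whatever replace() left behind */
    T.sum(sum: total);
    put "after third ref(): " total=;
run;
```

Note that sum() only reads the summary value; it is replace() that changes it.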

If you comment out replace(), your code works (you get the correct value for x_sum in the PDV), but the problem is that x_sum is never written to the hash table. So you call replace() to write x_sum to the hash table, which has the unfortunate side effect of re-initializing the summary value to x. I think the workaround is to assign x = x_sum before calling replace(). That way, when replace() re-initializes the summary value to x, x contains the current total. I find it hard to put into words, but it takes only one added statement, shown below.

data _null_;
  set sample end= done;

  length x_sum 8;

  if _N_ = 1 then 
  do;
    declare hash T(suminc:"x");
    T.definekey("id");
    T.definedata("id");
    T.definedata("x_sum");
    T.definedone();
  end;
  T.ref();
  T.sum(sum:x_sum);
  put _all_;

  x=x_sum;  *Replace method will initialize the summary value to x! ;

  T.replace();
  if done then T.output(dataset: "my_set");
run;
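
Another way to get the same table, sketched here as an assumption rather than something from the original answer, is to avoid replace() entirely: let ref() maintain the summary values while reading sample, then walk the keys with a hash iterator and call sum() once per key. The names it and rc and the ordered:"a" option are choices made for this example, and the step has not been run here.

```sas
data my_set(keep=id x_sum);
    length x_sum 8;

    declare hash T(suminc: "x", ordered: "a");
    T.definekey("id");
    T.definedata("id");
    T.definedone();
    declare hiter it("T");

    /* Pass 1: ref() adds new keys and increments existing summaries by x */
    do until (done);
        set sample end=done;
        T.ref();
    end;

    /* Pass 2: the iterator loads each id into the PDV,
       then sum() fetches that key's summary into x_sum */
    rc = it.first();
    do while (rc = 0);
        T.sum(sum: x_sum);
        output;
        rc = it.next();
    end;
    stop;
run;
```

Because x_sum is never stored as hash data in this version, nothing ever needs to be written back, so the summary value is never re-initialized partway through a group.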
