When using multiple datasets in SET or MERGE with duplicate BY values, why aren't my variables reset between data step iterations?

Question

When performing a data step with two datasets in a SET statement, variables are sometimes not reset to missing between iterations. This also applies to MERGE when you have duplicate BY values (i.e., when the BY variables are not guaranteed to be unique).

For example:

data have1;
  do x=1 to 5;
    y=1;
    output;
  end;
run;

data have2;
  do x = 6 to 10;
    z=x+1;
    output;
  end;
run;

data want;
  set have1 have2;
  if missing(y) and mod(z,2)=0 then y=2;
run;

This sets y to 2 for every record coming from have2 once the first even value of z has been reached, not just for the even values of z.

Similarly,

data have1;
  do x = 1 to 5;
    y=1;
    output;    
  end;
run;

data have2;
  do x = 1 to 5;
    do z = 1 to 4;
      output;
    end;
  end;
run;

data want;
  merge have1 have2;
  by x;
  if mod(z,4)=3 then y=3;
run;

Why is this happening, and how can I prevent unintended consequences?

sas

Joe 11 Aug 14 at 19:47

1 answer

Joe · Accepted Answer · 2014-08-11T19:47:28+0000

Why is this happening?

As discussed at length in the SAS documentation in Combining SAS Data Sets: Methods, this happens because variables that are defined on a SET, MERGE, or UPDATE statement are not set to missing on each iteration of the data step (this is equivalent to using RETAIN on all the variables in the datasets).

In the first example, this follows naturally from the concept of RETAIN: y is retained, so when it is not replaced by a new record from SET that has a value for y, it keeps its last value. (As we will see later, it is cleared once: when the dataset being read by SET changes, which is why it does not still hold the earlier value from the previous dataset.)
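As a rough way to watch this retention happen, the first example can be rerun with PUT statements that write the program data vector to the log. This is only a sketch: the IN= flag name from_have2 and the log labels are hypothetical additions, not part of the original question.

```sas
/* Recreate the two inputs from the question */
data have1;
  do x = 1 to 5;
    y = 1;
    output;
  end;
run;

data have2;
  do x = 6 to 10;
    z = x + 1;
    output;
  end;
run;

/* Trace the variables before and after the IF statement.      */
/* from_have2 is a hypothetical IN= flag: 1 when the current   */
/* record came from have2, 0 otherwise.                        */
data _null_;
  set have1 have2(in=from_have2);
  put 'after SET: ' _n_= from_have2= x= y= z=;
  if missing(y) and mod(z,2)=0 then y=2;
  put 'after IF:  ' _n_= from_have2= x= y= z=;
run;
```

If the behavior described above holds, the log should show y as missing on the first record from have2 (the one-time reset when the dataset changes), y becoming 2 at the first even z, and y already equal to 2 right after SET on every later record.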

However, that doesn't quite explain the behavior of MERGE (how y goes back and forth). That is caused by different behavior when a BY group is involved.

In particular, variables are not set to missing between data step iterations; however, they are set to missing at the start of each new BY group or dataset. From the documentation:

The values of the variables in the program data vector are set to missing each time SAS starts to read a new data set and when the BY group changes.

The consequence of this is that, in the second example, y is back at 1 for the first two iterations of each BY group (z=1 and z=2), but is retained at 3 for the z=4 iteration.

To label each iteration with its value of z:

Z = 1: First record in the BY group, so everything is set to missing. HAVE1 is read, HAVE2 is read. X=1, Y=1, and Z=1 are set.
Z = 2: The second record is read from HAVE2. y keeps the value 1 from the previous iteration.
Z = 3: The third record is read from HAVE2. y is set to 3.
Z = 4: The fourth record is read from HAVE2. y keeps the value 3 from the previous iteration.

Note that HAVE1 is read only once, for the Z=1 iteration. If this were a many-to-many merge, HAVE1 would be read once for each additional row with the same value of x on it.
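The walkthrough above can be checked with a small tracing step. This is a sketch under the assumption that the inputs are built exactly as in the question; the log labels are arbitrary.

```sas
/* Recreate the two inputs from the question */
data have1;
  do x = 1 to 5;
    y = 1;
    output;
  end;
run;

data have2;
  do x = 1 to 5;
    do z = 1 to 4;
      output;
    end;
  end;
run;

/* Write the variables to the log right after MERGE reads a   */
/* record and again after the IF statement has run.           */
data _null_;
  merge have1 have2;
  by x;
  put 'after MERGE: ' x= y= z=;
  if mod(z,4)=3 then y=3;
  put 'after IF:    ' x= y= z=;
run;
```

Within each value of x, the "after MERGE" lines should show y=1 for z=1 through z=3 and y=3 for z=4, with y returning to 1 when the next BY group starts.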

How can we prevent this?

You have several options for dealing with this if you want the data step to behave as if the variables were not automatically retained.

Add a BY statement

As noted earlier, a new BY value automatically resets everything to missing. Therefore, if you run

data want;
  set have1 have2;
  by x;
  if missing(y) and mod(z,2)=0 then y=2;
run;

This will work as expected (although it gives a slightly different result).

Set some or all of the variables to missing on your own

You can do this in two places:

data want;
  set have1 have2;
  if missing(y) and mod(z,2)=0 then y=2;
  output;
  call missing(of _all_);
run;

or

data want;
  y=.;
  set have1 have2;
  if missing(y) and mod(z,2)=0 then y=2;
run;

One or the other might be more appropriate for your program depending on your needs: the first sets everything to missing but requires an additional statement (output;), and the second sets only y to missing (which is all that is needed here) but changes the variable order by putting y first.

For MERGE with duplicate BY values, if you want to keep the original value of y, you may need to do something like:

data want;
  merge have1 have2;
  by x;
  y_new=y;
  if mod(z,4)=3 then y_new=3;
  rename y_new=y;
  drop y;
run;

which gets around the problem by using a separate variable to store the new value. You can also set y to missing in the same way as above, if that's what you need.
