SAS: data step versus proc sql

So, I'm starting to pick up some SAS, and I've realized that many of the data step operations I've learned can also be done with proc sql, including merging, creating variables, subsetting, and many more.

So my question is: when is it best to use which? Is one better than the other? Always? Which one is faster, and which consumes less memory?

Note that I fully expect the answer to be "it depends", in which case I would like to know on what.

+3


3 answers


If you run the following two steps:

data temp_new;
set temp;
run;

      

and

proc sql;
create table temp_new as
select *
from temp;
quit;

      

you won't see any difference in the result. But there are many differences between the two. I will only cover functionality here: what you can do with a data step and what you can do with proc sql.

A data step can:

  • use loops;
  • read raw data from an external file with infile / input;
  • iterate over data using the open, fetch, and fetchobs functions;
  • use putlog / put to write to the log or to a file;
  • control the data flow with first., last., and retain, and with automatic variables such as _n_ and _error_;
  • write to several tables in one data step; explicit output statements decide which records, and how many, are added to each dataset;
  • use hash objects;
  • use arrays;
  • stop reading early (for example with the stop statement);
  • merge or set any number of input datasets, limited only by memory (in SAS 9.1, proc sql can join at most 32 tables at a time; I don't know whether that has changed in later versions of SAS);
  • in general, integrate complex business logic more easily.
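
A minimal sketch of a few of these data step features: by-group flags, retained counters, and writing to two tables in one pass. It assumes the sashelp.class sample table that ships with SAS; the output table names and the height cutoff are made up for illustration.

```sas
/* By-group processing needs sorted input, so sort a copy first. */
proc sort data=sashelp.class out=work.class_sorted;
  by sex;
run;

/* One pass over the data, two output tables (names are hypothetical). */
data work.tall_students (drop=n_students total_weight)
     work.sex_summary   (keep=sex n_students total_weight);
  set work.class_sorted;
  by sex;

  /* Reset the running totals at the start of each group. */
  if first.sex then do;
    n_students   = 0;
    total_weight = 0;
  end;

  /* Sum statements retain their targets across rows implicitly. */
  n_students   + 1;
  total_weight + weight;

  /* Explicit output statements route each row. */
  if height > 60 then output work.tall_students;
  if last.sex then output work.sex_summary;
run;
```

Doing the same in proc sql would take two separate queries, one for the filter and one for the grouped totals.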

Proc sql can:

  • use grouping and ordering;
  • use nested queries (subqueries);
  • do set operations (union / outer union / intersect / except);
  • do inner / outer joins without sorting the data first;
  • use integrity constraints on insert / delete / update (I am not considering the data step with update / modify statements here);
  • access a DBMS directly.
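
As a rough illustration of the first two points, here is a sketch with a subquery in one statement and grouping plus ordering in another. It again assumes the sashelp.class sample table; the output table names are hypothetical.

```sas
proc sql;
  /* Subquery: students taller than the overall average height. */
  create table work.above_avg as
  select name, sex, height
  from sashelp.class
  where height > (select avg(height) from sashelp.class);

  /* Grouping and ordering in a single statement, no prior sort needed. */
  create table work.by_sex as
  select sex,
         count(*)    as n_students,
         avg(height) as avg_height
  from sashelp.class
  group by sex
  order by sex;
quit;
```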

Another big difference is how the data step and proc sql work with datasets.
The data step reads records sequentially, one at a time, into the program data vector, does some processing on each record, and outputs it to the dataset. http://support.sas.com/documentation/cdl/en/basess/58133/HTML/default/viewer.htm#a001290590.htm

Proc sql, on the other hand, puts everything in memory or in a utility file (if there is not enough memory) and does all the calculations and joins there. Only after that does it write all the data to the dataset.
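
One way to watch the sequential behaviour of the data step is to print the automatic variable _n_ as each record passes through the program data vector. This is only a small sketch, assuming the sashelp.class sample table:

```sas
/* _null_ means no output dataset is created; we only write to the log. */
data _null_;
  set sashelp.class (obs=3);   /* read just the first three records */
  putlog _n_= name= age=;      /* one log line per iteration of the step */
run;
```

Each pass of the step reads one record, runs the statements, and moves on to the next record.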

I use both of them regularly. Proc sql is efficient for operations that insert, update, or delete small pieces of data. Say you want to add one record to a dataset that has 1,000,000 records: you wouldn't use a data step for that (proc append is an alternative).
If I need a lot of joins on large tables, I prefer to do them with proc sort / data step merge combinations or other methods (for example, loading one dataset into a hash object, or using formats), because they are less painful in terms of run time.
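
The hash approach mentioned above can be sketched roughly as follows. The lookup table work.sex_labels and the variable sex_name are made up for the example; the large table is stood in for by sashelp.class.

```sas
/* Hypothetical small lookup table, built inline. */
data work.sex_labels;
  length sex $1 sex_name $6;
  sex = 'F'; sex_name = 'Female'; output;
  sex = 'M'; sex_name = 'Male';   output;
run;

data work.class_labeled;
  /* Compile-time only: put the lookup variables into the PDV. */
  if 0 then set work.sex_labels;

  /* Load the lookup table into memory once, on the first iteration. */
  if _n_ = 1 then do;
    declare hash lk(dataset: 'work.sex_labels');
    lk.defineKey('sex');
    lk.defineData('sex_name');
    lk.defineDone();
  end;

  set sashelp.class;

  /* find() returns 0 on a match and fills sex_name; otherwise blank it. */
  if lk.find() ne 0 then call missing(sex_name);
run;
```

Neither input has to be sorted, and the big table is read only once, which is where the time saving comes from.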

+2


It depends on what you are trying to achieve. Without knowing that, this question cannot be answered. Remember, the SAS language has been growing for decades. SAS works hard to provide backward compatibility, so many features are still there to serve older purposes.



For example, when something new (like SQL) arrives, SAS does not drop support for customers who run programs based on the data step.

The SAS language is separate from SQL syntax, and separate again from other languages (such as DS2, C++, or Java), all of which can be embedded in SAS and can perform many of the same operations.

0


Is proc sql better?

No - just different.

What's the fastest?

Neither. Doing the same steps through SQL usually takes about the same amount of time as doing them with a data step. It is highly unlikely that you will see a noticeable difference in speed by changing a typical data step to a typical SQL query.

Which consumes less memory?

They are probably about the same. To find out for sure, use options fullstimer; which will give you notes in your log similar to the following:

NOTE: PROCEDURE SQL used (Total process time):
      real time           10.69 seconds
      user cpu time       1.62 seconds
      system cpu time     0.06 seconds
      memory              958.25k
      OS Memory           16328.00k
      Timestamp           10/21/2014 08:35:26 AM
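
To compare the two yourself, something like the following sketch would do. It copies the sashelp.class sample table both ways (the target table names are hypothetical); on a table this small the numbers will be tiny, so substitute a dataset of realistic size.

```sas
options fullstimer;      /* switch on detailed resource notes */

data work.copy_ds;
  set sashelp.class;
run;

proc sql;
  create table work.copy_sql as
  select * from sashelp.class;
quit;

options nofullstimer;    /* back to the shorter notes */
```

Each step writes its own NOTE block to the log, so the real time and memory lines can be read side by side.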

      

When is it best to use which?

Use whichever approach makes your code most readable and maintainable by others.

The only case I can think of where I almost always use proc sql is when I need to combine multiple datasets using different join conditions for each dataset. The data step does not provide an easy way to do this in one step, while it is quite simple in proc sql.
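
A sketch of that situation, with three small made-up tables: orders are matched to customers on an id, and to prices on a product code plus a date range. All table and column names are hypothetical.

```sas
data work.orders;
  input order_id customer_id product_code $ order_date :date9.;
  format order_date date9.;
  datalines;
1 10 A 05JAN2014
2 11 B 20FEB2014
3 10 B 15MAR2014
;
run;

data work.customers;
  input customer_id customer_name $;
  datalines;
10 Smith
11 Jones
;
run;

data work.prices;
  input product_code $ valid_from :date9. valid_to :date9. price;
  format valid_from valid_to date9.;
  datalines;
A 01JAN2014 31DEC2014 5.00
B 01JAN2014 28FEB2014 7.50
B 01MAR2014 31DEC2014 8.00
;
run;

proc sql;
  create table work.order_details as
  select o.order_id, c.customer_name, o.product_code, o.order_date, p.price
  from work.orders as o
       /* first join: plain equality on the customer id */
       left join work.customers as c
         on o.customer_id = c.customer_id
       /* second join: different key plus a date-range condition */
       left join work.prices as p
         on  o.product_code = p.product_code
         and o.order_date between p.valid_from and p.valid_to
  order by o.order_id;
quit;
```

With a data step merge, each input would need sorting by its own key and the range condition would need extra logic, so this would normally be split across several steps.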

0

