Reading from XML into multiple SAS tables

I have email data in XML format and I am trying to load it into multiple SAS tables. The XML structure is not flat; it has several levels of hierarchy. From this XML file I want to create several SAS tables (e.g. Sender, Recipients, Attachments, Email Body and Metadata). There will be exactly one sender and one body per email message, but any number of recipients and attachments. I am currently using an XML Map file to translate the data into the tables I need.
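To give an idea of the shape of the data, the structure looks roughly like this (element and attribute names here are made up for illustration; the real file differs):

```xml
<messages>
  <message id="1">
    <sender>alice@example.com</sender>
    <recipients>
      <recipient>bob@example.com</recipient>
      <recipient>carol@example.com</recipient>
    </recipients>
    <attachments>
      <attachment name="report.pdf"/>
    </attachments>
    <body>...</body>
  </message>
</messages>
```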

The problem is that with the XMLV2 engine and the map file, SAS appears to read the XML file once for each table I create, which doesn't scale. For example, with a 200 GB XML file and 10 target tables, SAS would read 2 TB of data in total. Is there a better way to handle XML files so that a single pass over the file populates all of the SAS datasets?
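For reference, the multi-pass behaviour follows from the fact that every SET against the XMLV2 libref re-reads the underlying file. A minimal sketch of the setup described (file names, map name and table names are hypothetical):

```sas
/* Hypothetical paths */
libname emails xmlv2 'emails.xml' xmlmap='emails.map' access=readonly;

/* Each of these steps triggers a separate pass over emails.xml */
data work.sender;      set emails.sender;      run;
data work.recipients;  set emails.recipients;  run;
data work.attachments; set emails.attachments; run;
```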

Thanks in advance.



1 answer

Allocate the directory containing the XML files as an aggregate file location, then use a single data step to read the files in turn. Search for the FILEVAR= option in SAS Help for examples of how to do this. The XML map file gives you the XPath locations you need to find content in the XML file, and you can read that content directly with data step code.
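A rough sketch of such a single-pass data step, writing two output datasets at once (tag names and dataset names are assumptions based on the question, and it assumes each element sits on its own line, which real files may not satisfy):

```sas
data sender(keep=msg_id sender)
     recipients(keep=msg_id recipient);
  infile 'emails.xml' lrecl=32767 truncover;
  length line $32767 sender recipient $256;
  input line $char32767.;
  line = strip(line);

  /* one pass over the file: route each element to the right dataset */
  if line =: '<message' then msg_id + 1;          /* new message: bump id */
  else if line =: '<sender>' then do;
    sender = scan(substr(line, 9), 1, '<');       /* text between the tags */
    output sender;
  end;
  else if line =: '<recipient>' then do;
    recipient = scan(substr(line, 12), 1, '<');
    output recipients;
  end;
run;
```

The same pattern extends to attachments, body and metadata by adding more output datasets to the same step, so the file is still read only once.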

The alternative suggested in an earlier post will work as well. Process the n XML files in a data step as above, but write the shared header content (XML declaration and opening tags) once, then the data content from each file (discarding the header of each subsequent file), then write the closing tags once. This runs very quickly, and your original XML map file can then process the one big XML file in a single pass. You can test this approach pretty quickly by manually editing two XML files into one; that will show you which content is common and only needed once.
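A sketch of that concatenation step, assuming each input file starts with a one-line `<?xml ...?>` declaration plus a single `<emails>` root open tag, each on its own line, and ends with `</emails>` (all of these names and the wildcard path are assumptions):

```sas
data _null_;
  file 'combined.xml';
  length line $32767;
  retain header_done 0;
  /* wildcard infile reads the matching files in sequence */
  infile 'email_part*.xml' truncover end=lastrec;
  input line $char32767.;

  if header_done then do;
    /* skip repeated prologs and root open tags from later files */
    if line =: '<?xml' or strip(line) = '<emails>' then delete;
    /* skip closing root tags except the very last one */
    if strip(line) = '</emails>' and not lastrec then delete;
  end;
  else if strip(line) = '<emails>' then header_done = 1;

  put line;
run;
```

After this runs, pointing the original XMLV2 libname and map file at `combined.xml` gives one pass per table over a single file instead of one pass per table per file.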


