How can I lazily process an XML document using hexpat?

In my search for a Haskell library that can handle large (300-1000 MB) XML files, I came across hexpat. There is an example in the Haskell Wiki whose code carries the comment:

-- Process document before handling error, so we get lazy processing.


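The full example looks roughly like this (sketched here for reference, along the lines of the wiki page linked below; details may differ from the exact wiki code):

import qualified Data.ByteString.Lazy as L
import System.Exit
import System.IO
import Text.XML.Expat.Format (format)
import Text.XML.Expat.Tree

process :: String -> IO ()
process filename = do
  inputText <- L.readFile filename
  let (xml, mErr) = parse defaultParseOptions inputText
        :: (UNode String, Maybe XMLParseError)
  -- Process document before handling error, so we get lazy processing.
  L.hPutStr stdout $ format xml
  case mErr of
    Nothing -> return ()
    Just err -> do
      hPutStrLn stderr $ "XML parse failed: " ++ show err
      exitWith $ ExitFailure 2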
For testing purposes, I redirected the output to /dev/null and threw a 300 MB file at it. Memory consumption kept rising until I had to kill the process.

Then I removed the error handling from the process function:

process :: String -> IO ()
process filename = do
  inputText <- L.readFile filename
  let (xml, mErr) = parse defaultParseOptions inputText
        :: (UNode String, Maybe XMLParseError)

  hFile <- openFile "/dev/null" WriteMode
  L.hPutStr hFile $ format xml
  hClose hFile

  return ()


As a result, the function now runs in constant memory. Why does the error handling lead to massive memory consumption?

As I understand it, xml and mErr are two separate unevaluated thunks after the call to parse. Does evaluating format xml also force xml and thereby build up the mErr result? If so, is there a way to handle the error while running in constant memory?

http://www.haskell.org/haskellwiki/Hexpat/



1 answer


I can't speak with authority on hexpat, but in general, error handling will force the entire file into memory: if you only want to produce output when there are no errors in the input, you have to read the entire input before emitting anything.
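To illustrate the general point with a toy example (nothing hexpat-specific here; parseDigits and its error type are invented for illustration): a lazy parser that returns its results together with a possible error can only tell you whether an error occurred once it has walked the whole input, so forcing the error value before consuming the results keeps everything alive.

import Data.Char (digitToInt, isDigit)

-- A toy lazy parser: converts digit characters until it hits a bad one,
-- returning the output produced so far plus a possible error.
parseDigits :: String -> ([Int], Maybe String)
parseDigits [] = ([], Nothing)
parseDigits (c:cs)
  | isDigit c = let (ns, mErr) = parseDigits cs
                in (digitToInt c : ns, mErr)
  | otherwise = ([], Just ("unexpected character: " ++ [c]))

main :: IO ()
main = do
  let (ns, mErr) = parseDigits (replicate 10000000 '7')
  -- Constant memory: consume ns first, look at mErr afterwards.
  print (length ns)
  maybe (return ()) putStrLn mErr
  -- Forcing mErr *before* consuming ns would run the whole parse while
  -- ns is still live, holding the entire result list in memory.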

As I said, I don't really know hexpat, but with xml-conduit you could do something like:



try $ runResourceT $ parseFile def inputFile $$ renderBytes def =$ sinkFile outputFile


It will run in constant memory, and if there are any processing errors it will throw an exception (which try catches). The downside is that the output file can end up corrupted. My recommendation would be to write to a temporary file and, once the whole process completes, move the temporary file to the output file. On an exception, just delete the temporary file.
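A sketch of that temporary-file variant, building on the one-liner above (copyXml and the ".tmp" naming scheme are my own; this also assumes the same pre-1.3 conduit operators and xml-conduit streaming API as the snippet):

import Control.Exception (SomeException, try)
import Control.Monad.Trans.Resource (runResourceT)
import Data.Conduit (($$), (=$))
import Data.Conduit.Binary (sinkFile)
import Data.Default (def)
import System.Directory (removeFile, renameFile)
import Text.XML.Stream.Parse (parseFile)
import Text.XML.Stream.Render (renderBytes)

-- Stream inputFile through the XML parser and renderer into a temporary
-- file; move it into place only if no exception was thrown along the way.
copyXml :: FilePath -> FilePath -> IO ()
copyXml inputFile outputFile = do
  let tmpFile = outputFile ++ ".tmp"
  result <- try $ runResourceT $
    parseFile def inputFile $$ renderBytes def =$ sinkFile tmpFile
  case result of
    Left err -> do
      putStrLn $ "XML processing failed: " ++ show (err :: SomeException)
      removeFile tmpFile
    Right () -> renameFile tmpFile outputFile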







