F# XML Parsing Using XmlReader
I am trying to process a particularly large XML document using F#. Since loading the entire document is out of the question, I am trying to use XmlReader to accomplish my goals. My first step is to define the XML document as a sequence of nodes.
// Read XML as a lazy sequence
let Read (s: string) =
    let r = XmlReader.Create s
    let src =
        seq {
            while r.Read() do
                if XmlNodeType.Element = r.NodeType then
                    // Yield the element itself, then each of its attributes
                    yield CreateNodeData r
                    while r.MoveToNextAttribute() do
                        yield CreateNodeData r
                else
                    yield CreateNodeData r
        }
    LazyList.ofSeq src
This represents the XML document as a sequence of NodeData values (created by the CreateNodeData function, which is not included here for simplicity). A lazy list is used so that active patterns can be matched against it.
The parser for the schema is then built by defining a grammar in the style of FParsec. For example:
type NodeSeq = NS of LazyList<NodeData>

(* Define a generic parser that takes an XML Reader and returns a singleton list
   containing parsed element and unparsed parser. Failure is denoted by an empty list *)
type 'a Parser = P of (NodeSeq -> list<'a * NodeSeq>)
Monadic constructs are then added to make this a monadic parser, so that the following code parses a NodeData that matches a given criterion.
let item =
    P (fun inp ->
        match inp with
        | NS(LazyList.Nil) -> []
        | NS(LazyList.Cons(a, b)) -> [ (a, NS(b)) ])

let nodeFilter (f: NodeData -> bool) =
    parser {
        let! c = item
        if (f c) then return c
    }
In addition, a choice operator (+++) is defined, so that p +++ q represents alternative parsers.
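The monadic constructs and the choice operator are not shown in the question, so the following is only a minimal sketch of what they might look like. It makes several assumptions: a plain F# list stands in for LazyList so the snippet runs without extra libraries, the NodeData record and the `run` helper are hypothetical, and `+++` is taken to mean "try p, and fall back to q if p fails".

```fsharp
// Hypothetical stand-ins: the real NodeData and LazyList are not shown in the question.
type NodeData = { Name: string; Value: string }
type NodeSeq = NS of NodeData list
type Parser<'a> = P of (NodeSeq -> ('a * NodeSeq) list)

// Unwrap a parser and apply it to some input.
let run (P f) inp = f inp

type ParserBuilder() =
    // Succeed with x without consuming input.
    member _.Return(x) = P (fun inp -> [ (x, inp) ])
    // Fail; used for an `if` without an `else` branch.
    member _.Zero() = P (fun _ -> [])
    // Run p, then feed each result into the parser produced by f.
    member _.Bind(p, f) =
        P (fun inp -> run p inp |> List.collect (fun (a, rest) -> run (f a) rest))

let parser = ParserBuilder()

let item =
    P (fun inp ->
        match inp with
        | NS [] -> []
        | NS (a :: b) -> [ (a, NS b) ])

let nodeFilter (f: NodeData -> bool) =
    parser {
        let! c = item
        if f c then return c
    }

// One plausible choice operator: use q only when p produces no result.
let (+++) p q =
    P (fun inp ->
        match run p inp with
        | [] -> run q inp
        | results -> results)

let input = NS [ { Name = "Color"; Value = "Red" } ]
let isColor = nodeFilter (fun n -> n.Name = "Color")
let isMaterial = nodeFilter (fun n -> n.Name = "Material")
let result = run (isMaterial +++ isColor) input
```

Here `isMaterial` fails on the first node, so `+++` falls back to `isColor`, which consumes it.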
The problem I am facing is parsing XML with an element such as
<Node Color="Red" Transparency="90%" Material="Wood"/>
Here, the Color, Transparency and Material attributes are required, but their order is irrelevant. In addition, there may be other optional attributes. How can I build a parser combinator that handles
- attributes whose order does not matter
- optional attributes
This is equivalent to matching any of the following strings
xabc,xacb,xbac,xbca,xcab,xcba
How can I simplify it?
If you like XElement from LINQ to XML but don't want to load the entire document into memory, you can stream individual XElement instances from an XmlReader:
open System.Xml
open System.Xml.Linq

type XmlReader with
    /// Returns a lazy sequence of XElements matching a given name.
    member reader.StreamElements(name, ?namespaceURI) =
        let readOp =
            match namespaceURI with
            | None -> fun () -> reader.ReadToFollowing(name)
            | Some ns -> fun () -> reader.ReadToFollowing(name, ns)
        seq {
            while readOp() do
                match XElement.ReadFrom reader with
                | :? XElement as el -> yield el
                | _ -> ()
        }
You can then query the attributes of each element, and the original order of the attributes doesn't matter, but you're still streaming the document rather than loading the whole thing into memory.
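As a rough sketch of that querying step, applied to the `<Node>` element from the question: the `NodeInfo` record and the `streamElements`, `tryAttr` and `tryParseNode` helpers are hypothetical names, and `streamElements` is a standalone stand-in for the extension method so the snippet runs on its own.

```fsharp
open System.IO
open System.Xml
open System.Xml.Linq

/// Hypothetical result type for the <Node> element in the question.
type NodeInfo =
    { Color: string
      Transparency: string
      Material: string
      Extras: Map<string, string> }

/// Standalone stand-in for a streaming helper: yields each element with the given local name.
let streamElements (name: string) (reader: XmlReader) =
    seq {
        while not reader.EOF do
            if reader.NodeType = XmlNodeType.Element && reader.LocalName = name then
                // ReadFrom consumes the element and leaves the reader on the node after it.
                match XNode.ReadFrom reader with
                | :? XElement as el -> yield el
                | _ -> ()
            else
                reader.Read() |> ignore
    }

/// Looks an attribute up by name; the position of the attribute is irrelevant.
let tryAttr (name: string) (el: XElement) =
    match el.Attribute(XName.Get name) with
    | null -> None
    | a -> Some a.Value

/// Succeeds only when all three required attributes are present.
let tryParseNode (el: XElement) =
    match tryAttr "Color" el, tryAttr "Transparency" el, tryAttr "Material" el with
    | Some c, Some t, Some m ->
        let required = set [ "Color"; "Transparency"; "Material" ]
        // Everything that is not required is kept as an optional attribute.
        let extras =
            el.Attributes()
            |> Seq.filter (fun a -> not (required.Contains a.Name.LocalName))
            |> Seq.map (fun a -> a.Name.LocalName, a.Value)
            |> Map.ofSeq
        Some { Color = c; Transparency = t; Material = m; Extras = extras }
    | _ -> None

let xml =
    """<Root><Node Material="Wood" Color="Red" Transparency="90%" Weight="3"/><Node Color="Blue"/></Root>"""

let parsed =
    use reader = XmlReader.Create(new StringReader(xml))
    streamElements "Node" reader |> Seq.map tryParseNode |> Seq.toList
```

The first element parses even though its attributes appear in a different order and it carries an extra `Weight` attribute; the second is rejected because two required attributes are missing.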
My impression is that you are reinventing the wheel.
XmlReader is a complete and efficient XML parser. Parsing attributes with XmlReader is simple and order-independent. You can use XmlReader to get required and optional attributes while building the sequence. Just check r.HasAttributes and call r.MoveToNextAttribute() to read the attributes, as in the example on MSDN.
Moreover, writing a parser combinator for this task is overkill, and I doubt that using LazyList will give you any advantage. You will most likely process the sequence with higher-order functions, so starting with seq is a good choice.
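A minimal sketch of that approach, assuming the `<Node>` element from the question: `readAttributes` and `readNodes` are hypothetical names, while HasAttributes, MoveToNextAttribute and MoveToElement are the XmlReader members being used.

```fsharp
open System.IO
open System.Xml

/// Collects all attributes of the current element into a map keyed by attribute name.
let readAttributes (reader: XmlReader) =
    let attrs =
        [ if reader.HasAttributes then
              while reader.MoveToNextAttribute() do
                  yield (reader.Name, reader.Value)
              // Return to the owning element so later reads continue normally.
              reader.MoveToElement() |> ignore ]
    Map.ofList attrs

/// Yields the required attributes of each <Node>, or None when one is missing.
let readNodes (reader: XmlReader) =
    seq {
        while reader.Read() do
            if reader.NodeType = XmlNodeType.Element && reader.LocalName = "Node" then
                let attrs = readAttributes reader
                // Lookups are by name, so attribute order does not matter
                // and extra (optional) attributes are simply ignored here.
                match Map.tryFind "Color" attrs,
                      Map.tryFind "Transparency" attrs,
                      Map.tryFind "Material" attrs with
                | Some c, Some t, Some m -> yield Some (c, t, m)
                | _ -> yield None
    }

let xml =
    """<Root><Node Transparency="90%" Material="Wood" Shape="Round" Color="Red"/><Node Color="Blue"/></Root>"""

let results =
    use reader = XmlReader.Create(new StringReader(xml))
    readNodes reader |> Seq.toList
```

Because the result is an ordinary seq, the usual higher-order functions such as Seq.choose or Seq.filter can be applied to it without a custom parser type.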