F# XML Parsing Using XmlReader

I am trying to process a particularly large XML document using F#. Since loading the entire document into memory is out of the question, I am using XmlReader instead. My first step is to expose the XML document as a sequence of nodes.

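open System.Xml

// Read XML as a lazy sequence of NodeData values.
// CreateNodeData (not shown) snapshots the reader's current node.
let Read (s: string) =
    let r = XmlReader.Create s
    let src = seq {
        while r.Read() do
            if r.NodeType = XmlNodeType.Element then
                yield CreateNodeData r
                // Emit each attribute right after its owning element.
                while r.MoveToNextAttribute() do
                    yield CreateNodeData r
            else
                yield CreateNodeData r
    }
    LazyList.ofSeq src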

      

This exposes the XML document as a sequence of NodeData values (created by the CreateNodeData function, which is omitted here for simplicity). A LazyList is used so that the sequence can be matched with active patterns.

A schema parser is then built by defining a grammar, in the style of FParsec. For example:

type NodeSeq = NS of LazyList<NodeData>

(* 
Define a generic parser that takes a NodeSeq and returns a singleton
list containing the parsed element and the unparsed remainder. Failure
is denoted by an empty list.
*)

type 'a Parser = P of ( NodeSeq -> list<'a * NodeSeq > )

      

Monadic constructs are then added to turn this into a monadic parser, so that the following code parses a NodeData that matches a given criterion.

let item = P ( fun inp ->
    match inp with
    | NS(LazyList.Nil)          -> [] 
    | NS(LazyList.Cons(a,b))    -> [(a,NS(b))]
    )

let nodeFilter (f: NodeData -> bool) = 
    parser {
        let! c = item
        if (f c) then
            return c
        }
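
The `parser` builder used above is not shown in the question. A minimal sketch of what it might look like follows, assuming the usual list-of-successes encoding. To keep it runnable without extra packages, a plain F# list stands in for LazyList, and the `NodeData` record and the `parse` helper are hypothetical.

```fsharp
// Hypothetical stand-in for the NodeData type that the question omits.
type NodeData = { Name: string; Value: string }

// Assumption: a plain list replaces LazyList<NodeData>.
type NodeSeq = NS of NodeData list

type 'a Parser = P of (NodeSeq -> list<'a * NodeSeq>)

// Run a parser against some input.
let parse (P p) inp = p inp

type ParserBuilder() =
    // Succeed with x without consuming any input.
    member _.Return(x) = P (fun inp -> [ (x, inp) ])
    // Run p, then feed each result into f and run the parser it returns.
    member _.Bind(p, f) =
        P (fun inp -> parse p inp |> List.collect (fun (a, rest) -> parse (f a) rest))
    // Failure: used when an `if` inside the builder has no `else` branch.
    member _.Zero() = P (fun _ -> [])

let parser = ParserBuilder()

let item =
    P (fun inp ->
        match inp with
        | NS [] -> []
        | NS (a :: b) -> [ (a, NS b) ])

let nodeFilter (f: NodeData -> bool) =
    parser {
        let! c = item
        if f c then
            return c
    }

// Usage: consume one node named "Node" from the front of the input.
let sample = NS [ { Name = "Node"; Value = "" }; { Name = "Color"; Value = "Red" } ]
let matched = parse (nodeFilter (fun n -> n.Name = "Node")) sample
```

The `Zero` member is what lets `nodeFilter` fail quietly: when the predicate is false, the missing `else` branch becomes a parser that returns an empty list.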

      

In addition, a choice combinator (+++) is added, so that p +++ q represents a choice between alternative parsers.
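
The definition of (+++) is not given in the question. One common way to define it is deterministic choice: try the first parser and fall back to the second only if the first fails. The sketch below assumes that reading, uses a plain list of tokens instead of LazyList<NodeData>, and introduces a hypothetical `sat` helper for illustration.

```fsharp
// Assumption: tokens are kept in a plain list; the question uses LazyList<NodeData>.
type Parser<'t, 'a> = P of ('t list -> ('a * 't list) list)

let parse (P p) inp = p inp

// Consume one token satisfying the predicate, or fail with an empty list.
let sat pred =
    P (fun inp ->
        match inp with
        | x :: rest when pred x -> [ (x, rest) ]
        | _ -> [])

// Deterministic choice: try p; only if it fails, try q on the same input.
let (+++) p q =
    P (fun inp ->
        match parse p inp with
        | [] -> parse q inp
        | results -> results)

// Accepts either attribute name as the next token.
let colorOrMaterial = sat ((=) "Color") +++ sat ((=) "Material")
```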

The problem I am facing is parsing XML with an element such as

<Node Color="Red" Transparency="90%" Material="Wood"/>

      

Here, the Color, Transparency and Material attributes are required, but their order is irrelevant. In addition, there may be other, optional attributes. How can I create a combinator parser that handles

  • order-independent attributes
  • optional attributes

This is equivalent to matching any of the following strings

xabc,xacb,xbac,xbca,xcab,xcba

How can I simplify it?

+3




3 answers


If you like XElement from LINQ to XML but don't want to load the entire document into memory, you can stream individual XElement instances from the XmlReader:

open System.Xml
open System.Xml.Linq

type XmlReader with
    /// Returns a lazy sequence of XElements matching a given name.
    member reader.StreamElements(name, ?namespaceURI) =
        let readOp =
            match namespaceURI with
            | None    -> fun () -> reader.ReadToFollowing(name)
            | Some ns -> fun () -> reader.ReadToFollowing(name, ns)
        seq {
            while readOp() do
                // ReadFrom leaves the reader on the node after the element, so a
                // same-named sibling with no whitespace before it may be skipped.
                match XElement.ReadFrom reader with
                | :? XElement as el -> yield el
                | _ -> ()
        }

      



You can then query the attributes of each element, and the original order of the attributes doesn't matter, but you're still streaming the document rather than loading the whole thing into memory.
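
As a rough illustration of that querying step, the sketch below streams `<Node>` elements and reads the required attributes by name, so their order does not matter. The `streamElements` function is a reduced, no-namespace version of the extension above; the `Node` record, the `tryAttr` and `toNode` helpers, and the optional `Finish` attribute are hypothetical names introduced for this example.

```fsharp
open System.IO
open System.Xml
open System.Xml.Linq

// Reduced, no-namespace version of the StreamElements extension.
let streamElements (name: string) (reader: XmlReader) =
    seq {
        while reader.ReadToFollowing(name) do
            match XNode.ReadFrom reader with
            | :? XElement as el -> yield el
            | _ -> ()
    }

// Look an attribute up by name; None when it is absent.
let tryAttr (name: string) (el: XElement) =
    match el.Attribute(XName.Get name) with
    | null -> None
    | a -> Some a.Value

// Hypothetical target type; "Finish" is an invented optional attribute.
type Node = { Color: string; Transparency: string; Material: string; Finish: string option }

// Some node when all three required attributes are present, in any order.
let toNode (el: XElement) =
    match tryAttr "Color" el, tryAttr "Transparency" el, tryAttr "Material" el with
    | Some c, Some t, Some m ->
        Some { Color = c; Transparency = t; Material = m; Finish = tryAttr "Finish" el }
    | _ -> None

// The sample keeps whitespace between sibling elements.
let xml = """<Root>
  <Node Color="Red" Transparency="90%" Material="Wood"/>
  <Node Material="Steel" Finish="Matte" Color="Blue" Transparency="0%"/>
  <Node Color="Green"/>
</Root>"""

let nodes =
    use reader = XmlReader.Create(new StringReader(xml))
    // Force the sequence before the reader is disposed.
    streamElements "Node" reader |> Seq.choose toNode |> Seq.toList
```

Here the third element is dropped because it lacks two required attributes; raising an error instead would be an equally reasonable choice.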

+4




Check the following snippet; you may find it useful: http://fssnip.net/bd



+3




My impression is that you are reinventing the wheel. XmlReader is a complete and efficient XML parser. Parsing attributes with XmlReader is simple and order-independent. You can use XmlReader to get required and optional attributes while building the sequence. Check r.HasAttributes and r.MoveToNextAttribute(), and read about attributes on MSDN.

Writing a parser combinator for this task is overkill. And I doubt that using LazyList will give you any advantage. You will most likely use higher-order functions to process the sequence; starting with seq is a good choice.
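
A minimal sketch of that approach, assuming the `<Node>` element from the question: `XmlReader.GetAttribute` looks attributes up by name, so their order is irrelevant, and it returns null for a missing attribute. The `NodeAttrs` record, the `readNodes` function and the optional `Finish` attribute are hypothetical names used only for illustration.

```fsharp
open System.IO
open System.Xml

// Hypothetical record for the <Node> element; "Finish" is an invented optional attribute.
type NodeAttrs = { Color: string; Transparency: string; Material: string; Finish: string option }

let readNodes (reader: XmlReader) =
    seq {
        while reader.Read() do
            if reader.NodeType = XmlNodeType.Element && reader.Name = "Node" then
                // Name-based lookup; a missing attribute (null) becomes None.
                let attr (name: string) = Option.ofObj (reader.GetAttribute(name))
                match attr "Color", attr "Transparency", attr "Material" with
                | Some c, Some t, Some m ->
                    yield { Color = c; Transparency = t; Material = m; Finish = attr "Finish" }
                | _ -> failwith "<Node> is missing a required attribute"
    }

let xml =
    """<Root><Node Transparency="90%" Material="Wood" Color="Red"/><Node Color="Blue" Material="Steel" Transparency="0%" Finish="Matte"/></Root>"""

let nodes =
    use reader = XmlReader.Create(new StringReader(xml))
    // Force the sequence before the reader is disposed.
    readNodes reader |> Seq.toList
```

The result is an ordinary seq, so it can be processed further with Seq.filter, Seq.map and similar higher-order functions.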

+2

