LXML

20 August 2019



Description

This module implements a non-validating XML stream parser with a handler-based event API (conceptually similar to SAX). Handlers can post-process the event data as required (e.g. into a tree).

The current functionality is:
  • Tokenises well-formed XML (relatively robustly)
  • Flexible handler-based event API (see below)
  • Parses and generates events for all XML elements, i.e.
    • Tags
    • Text
    • Comments
    • CDATA
    • XML Decl
    • Processing Instructions
    • DOCTYPE declarations
  • Provides limited well-formedness checking (checks for basic syntax & balanced tags only)
  • Flexible whitespace handling (selectable)
  • Entity Handling (selectable)
  • UTF-8 streams handled (beginning with UTF-8 BOM)

The limitations are:
  • Non-validating
  • No charset handling
  • No namespace support
  • Shallow well-formedness checking only (fails to detect most semantic errors)

The distribution also includes sample event handlers to convert the SAX event stream into a Lua table -
  • domHandler generates a DOM-like node tree structure and is capable of representing any valid XML document
  • simpleTreeHandler attempts to generate a more 'natural' table based structure which supports many common XML formats and is generally more useful (there are some restrictions dealing with mixed content however)
  • printHandler prints the XML content

Parser

Overview

This function provides a non-validating XML stream parser in Lua.

Features

  • Tokenises well-formed XML (relatively robustly)
  • Flexible handler-based event API (see below)
  • Parses all XML Infoset elements, i.e.
    • Tags
    • Text
    • Comments
    • CDATA
    • XML Decl
    • Processing Instructions
    • DOCTYPE declarations
  • Provides limited well-formedness checking (checks for basic syntax & balanced tags only)
  • Flexible whitespace handling (selectable)
  • Entity Handling (selectable)

Limitations

  • Non-validating
  • No charset handling
  • No namespace support
  • Shallow well-formedness checking only (fails to detect most semantic errors)

API

The parser provides a partially object-oriented API with functionality split into tokeniser and handler components. The handler instance is passed to the tokeniser and receives callbacks for each XML element processed (if a suitable handler function is defined). The API is conceptually similar to the SAX API but implemented differently.
The following events are generated by the tokeniser:
  1. handler:start - Start Tag
  2. handler:end - End Tag
  3. handler:text - Text
  4. handler:decl - XML Declaration
  5. handler:pi - Processing Instruction
  6. handler:comment - Comment
  7. handler:dtd - DOCTYPE definition
  8. handler:cdata - CDATA

The function prototype for all the callback functions is:
callback(val,attrs,start,end)
where attrs is a table, and val and attrs are overloaded for specific callbacks, i.e.

Callback   val            attrs (table)
start      name           { attributes (name=val).. }
end        name           nil
text       <text>         nil
cdata      <text>         nil
decl       "xml"          { attributes (name=val).. }
pi         pi name        { attributes (if present).. ,
                            _text = <PI Text> }
comment    <text>         nil
dtd        root element   { _root = <Root Element>,
                            _type = SYSTEM|PUBLIC,
                            _name = <name>,
                            _uri = <uri>,
                            _internal = <internal dtd> }

(start & end provide the character positions of the start/end of the element)
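
As a rough illustration of the callback convention above, the sketch below shows how events might reach a handler. The dispatch function is a hypothetical stand-in for the tokeniser, not the module's code, and it assumes that callbacks are invoked as methods (with the handler as the first argument), which the handler:start notation suggests but this document does not confirm.

```lua
-- Hypothetical stand-in for the tokeniser's dispatch step (an assumption,
-- not the module's implementation).
local function dispatch(handler, event, val, attrs, startPos, endPos)
  local callback = handler[event]            -- e.g. handler["start"], handler["text"]
  if type(callback) == "function" then
    -- Assumed calling convention: the handler itself is passed as 'self'.
    callback(handler, val, attrs, startPos, endPos)
    return true
  end
  return false                               -- no suitable handler function: skipped
end

-- A handler that only cares about start tags and text.
local handler = { tags = {}, chunks = {} }

function handler:start(name, attrs)
  self.tags[#self.tags + 1] = name
end

function handler:text(text)
  self.chunks[#self.chunks + 1] = text
end

-- Simulate the events for: <greeting lang="en">hello</greeting>
dispatch(handler, "start", "greeting", { lang = "en" }, 1, 20)
dispatch(handler, "text", "hello", nil, 21, 25)
local handled = dispatch(handler, "end", "greeting", nil, 26, 36)
```

Because the handler defines no callback for the end event, that event is silently skipped and handled is false; the handler only has to implement the events it is interested in.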
XML data is passed to the parser instance through the 'parse' method. (Note: the data must currently be passed as a single string.)

Options

Parser options are controlled through the 'self.options' table. Available options are:
  • stripWS - Strip non-significant whitespace (leading/trailing) and do not generate events for empty text elements
  • expandEntities - Expand entities. Currently only the standard entities and single-character numeric entities are expanded; this could be extended at runtime if a suitable DTD parser added elements to the entity table (see obj._ENTITIES). It may also be possible to expand multibyte entities for UTF-8 only
  • errorHandler - Custom error handler function
NOTE: To disable a Boolean option, set it to 'nil', not '0' (in Lua the number 0 evaluates as true)
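
The effect of these options can be sketched in a few lines of plain Lua. This is only an illustration under stated assumptions: processText and the ENTITIES table are hypothetical names, not the module's internals, and numeric entities are limited here to single-byte values.

```lua
-- The five entities predefined by the XML specification.
local ENTITIES = {
  ["&lt;"] = "<", ["&gt;"] = ">", ["&amp;"] = "&",
  ["&quot;"] = '"', ["&apos;"] = "'",
}

-- Hypothetical helper (not the module's code): applies stripWS and
-- expandEntities to one piece of text. Returns nil when no text event
-- should be generated.
local function processText(text, options)
  if options.stripWS then
    text = text:gsub("^%s+", ""):gsub("%s+$", "")
    if text == "" then return nil end
  end
  if options.expandEntities then
    -- Numeric entities first, limited here to single-byte values.
    text = text:gsub("&#(%d+);", function(digits)
      local n = tonumber(digits)
      if n and n < 256 then return string.char(n) end
      -- returning nothing leaves the entity unchanged
    end)
    -- Named entities; names missing from the table are left unchanged.
    text = text:gsub("&%a+;", ENTITIES)
  end
  return text
end

local options = { stripWS = 1, expandEntities = 0 }   -- 0 does NOT disable the option
local expanded = processText("  a &lt; b &#65;  ", options)

options.expandEntities = nil                           -- nil does
local raw = processText("  a &lt; b &#65;  ", options)
```

With expandEntities set to 0 the entities are still expanded, because 0 is a true value in Lua; only after the option is set to nil is the text passed through unchanged.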

Usage

  • Create a handler instance -
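-- 'end' is a reserved word in Lua, so that key must be written as ["end"]
h = { start = function(t,a,s,e) .... end,
      ["end"] = function(t,a,s,e) .... end,
      text = function(t,a,s,e) .... end }
h.cdata = h.text  -- handle CDATA the same way as text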

NOTE: Predefined handlers available in the module can also be used.
  • Create parser instance
p = Parser(h)

  • Set options
p.options.xxxx = nil

  • Parse XML data
p:parse("<?xml... ")
  • Use the handler object to access the XML data
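
Since the parse method currently expects the whole document as a single string, a document stored in a file has to be read completely before parsing. A minimal sketch using only the standard io library follows; readAll is a hypothetical helper name, and the commented-out call assumes a parser instance p created as described above.

```lua
-- Read a whole file into one string; returns nil plus an error message on failure.
local function readAll(path)
  local f, err = io.open(path, "rb")
  if not f then return nil, err end
  local data = f:read("*a")     -- "*a" reads the entire file
  f:close()
  return data
end

-- Example: write a small document to a temporary file, then read it back.
local path = os.tmpname()
local out = assert(io.open(path, "wb"))
out:write('<?xml version="1.0"?>\n<root><item>1</item></root>\n')
out:close()

local xml = assert(readAll(path))
os.remove(path)

-- p:parse(xml)   -- 'p' would be a parser instance created as described above
```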

Handlers

Overview

Standard XML event handlers for the XML Parser function.

Types of handlers

  1. printHandler - Generate XML event trace
  2. domHandler - Generate DOM-like node tree
  3. simpleTreeHandler - Generate 'simple' node tree

API

Each handler must be passed as the handler to the Parser function and must implement the XML event callbacks (see the Parser documentation above for the callback API definition).

printHandler

printHandler prints an event trace for debugging.

domHandler

domHandler generates a DOM-like node tree structure with a single ROOT node as the top-level parent. Each node is a table comprising the fields below.
 
      node = { _name = <node name>,
               _type = ROOT|ELEMENT|TEXT|COMMENT|PI|DECL|DTD,
               _attr = { Node attributes - see callback API },
               _parent = <parent node>,
               _children = { List of child nodes - ROOT/ELEMENT only }
             }

The DOM structure is capable of representing any valid XML document.
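
To make the node layout more concrete, here is a minimal sketch of a handler that builds such a tree from start, end and text events. It is not the module's domHandler: newDomBuilder is a hypothetical name, the _text field used for text nodes is an assumption, and the events are fed in by hand rather than by the parser.

```lua
-- Minimal DOM-like builder; a sketch, not the module's domHandler.
local function newDomBuilder()
  local root = { _type = "ROOT", _children = {} }
  local h = { root = root, current = root }

  function h:start(name, attrs)
    local node = { _name = name, _type = "ELEMENT", _attr = attrs,
                   _parent = self.current, _children = {} }
    table.insert(self.current._children, node)
    self.current = node                    -- descend into the new element
  end

  -- 'end' is a reserved word in Lua, so the key has to be written as a string.
  h["end"] = function(self, name)
    assert(self.current._name == name, "unbalanced tag: " .. tostring(name))
    self.current = self.current._parent    -- climb back to the parent
  end

  function h:text(text)
    -- '_text' is an assumed field name for the text content.
    table.insert(self.current._children,
                 { _type = "TEXT", _text = text, _parent = self.current })
  end

  return h
end

-- Simulate the events for:
--   <list><item id="1">a</item><item id="2">b</item></list>
local h = newDomBuilder()
h:start("list", {})
h:start("item", { id = "1" }); h:text("a"); h["end"](h, "item")
h:start("item", { id = "2" }); h:text("b"); h["end"](h, "item")
h["end"](h, "list")

local list = h.root._children[1]
```

After the last event, h.current is back at the ROOT node and list holds two ELEMENT children, each with one TEXT child.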

simpleTreeHandler

simpleTreeHandler is a simplified handler which attempts to generate a more 'natural' table-based structure that supports many common XML formats. The XML tree structure is mapped directly into a recursive table structure, with node names as keys and child elements either as a table of values or, for text, directly as a string value. Where there is only a single child element, it is inserted as a named key; if there are multiple elements, they are inserted as a vector. (In some cases it may be preferable to always insert elements as a vector, which can be specified on a per-element basis in the options.) Attributes are inserted as a child element with a key of '_attr'. Only Tag, Text and CDATA elements are processed; all others are ignored. This format has some limitations, primarily:
  • Mixed content behaves unpredictably: the relationship between text elements and embedded tags is lost, and multiple levels of mixed content do not work
  • If a leaf element has both a text element and attributes then the text must be accessed through a vector (to provide a container for the attribute)
In general, however, this format is relatively useful. It is much easier to understand by running some test data through 'textxml.lua -simpletree' than by reading this description.
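
The single-child versus vector rule can be sketched as follows. This is an illustration only, not the handler's actual code: addChild and the vectors bookkeeping table are hypothetical, and the tree is assembled by hand for a small example document.

```lua
-- Marks which tables are child vectors (as opposed to single table-valued children).
local vectors = setmetatable({}, { __mode = "k" })

-- Hypothetical helper illustrating the rule: one child becomes a named key,
-- several children with the same name become a vector.
local function addChild(parent, name, value)
  local existing = parent[name]
  if existing == nil then
    parent[name] = value                   -- only child so far: named key
  elseif vectors[existing] then
    existing[#existing + 1] = value        -- already a vector: append
  else
    local vec = { existing, value }        -- second child: promote to a vector
    vectors[vec] = true
    parent[name] = vec
  end
end

-- Hand-built result for:
--   <catalog lang="en">
--     <title>Books</title>
--     <book><name>A</name></book>
--     <book><name>B</name></book>
--   </catalog>
local catalog = { _attr = { lang = "en" } }
addChild(catalog, "title", "Books")
addChild(catalog, "book", { name = "A" })
addChild(catalog, "book", { name = "B" })
local doc = { catalog = catalog }
```

Here doc.catalog.title is a plain string, while doc.catalog.book is a vector, so the second book is reached as doc.catalog.book[2].name. A document containing only one book would yield doc.catalog.book.name instead, which is the case the per-element vector option is meant to avoid.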

Options

  • simpleTreeHandler.options.noReduce = { <tag> = bool,.. }
    • Tags whose children vector is not reduced, even if there is only one child
  • domHandler.options.(comment|pi|dtd|decl)Node = bool
    • Include/exclude given node types

Usage

The handler is passed as a delegate to the parser constructor, and its callbacks are invoked by the Parser:parse(xml) method.

Authors

Paul Chakravarti (paulc@passtheaardvark.com)
Milind Gupta