SimpleXMLParse: XML to Python Objects and Back Again (evanjones.ca)

[ 2006-April-16 13:39 ]

SimpleXMLParse is a library for parsing XML into Python data structures instead of XML document object models (DOM) or parsing events (SAX). The idea is to make it easier to build applications that use simple XML formats. To build a parser, you supply an XML template that looks very similar to an example document. The parser will then be capable of parsing documents that match the template into Python objects. It can handle nearly any XML vocabulary. Soon it will also be able to generate XML documents from Python objects.

Download SimpleXMLParse: simplexmlparse.tar.bz2

Atom Example

Atom is an XML format for publishing news and updates for feed readers. SimpleXMLParse handle most of the Atom format, but this simplified example will look at only the entry element, which contains information about a chunk of content. Here is a simple valid Atom entry from the developer documentation.

<entry xmlns="http://www.w3.org/2005/Atom">
    <title>Atom-Powered Robots Run Amok</title>
    <link href="http://example.org/2003/12/13/atom03"/>
    <id>urn:uuid:1225c695-cfb8-4ebb-aaaa-80da344efa6a</id>
    <updated>2003-12-13T18:30:02Z</updated>
    <summary>Some text.</summary>
</entry>

SimpleXMLParse can parse the above example document like this:

parser = simplexmlparse.SimpleXMLParser( TEMPLATE )
entry = parser.parse( DOCUMENT )

print "title =", entry.title._text
print "id =", entry.id._text
print "updated =", entry.updated._text
if entry.summary: print "summary =", entry.summary._text
if entry.link: print "link href =", entry.link.href

Notice that we can access all the values and attributes via normal Python attributes. This will output the following:

title = Atom-Powered Robots Run Amok
id = urn:uuid:1225c695-cfb8-4ebb-aaaa-80da344efa6a
updated = 2003-12-13T18:30:02Z
summary = Some text.
link href = http://example.org/2003/12/13/atom03

For comparison, here is the equivalent code using the XML DOM:

dom = minidom.parseString( GOOD_DOCUMENT )
print "title =", dom.firstChild.getElementsByTagName( 'title' )[0].firstChild.data
print "id =", dom.firstChild.getElementsByTagName( 'id' )[0].firstChild.data
print "updated =", dom.firstChild.getElementsByTagName( 'updated' )[0].firstChild.data

summaries = dom.firstChild.getElementsByTagName( 'summary' )
if len( summaries ) > 0: print "summary =", summaries[0].firstChild.data

links = dom.firstChild.getElementsByTagName( 'link' )
if len( links ) > 0: print "link href =", links[0].getAttribute( 'href' )

SimpleXMLParse creates strict parsers. It will not accept documents that do not match the template, unlike the DOM example. If we attempt to pass in a document that is missing a required element, we get the following exception:

ParseError: Element 'http://www.w3.org/2005/Atom:entry' is missing a required element: 'http://www.w3.org/2005/Atom:id' at line 6 column 6

This message tells us exactly what is wrong, and gives us a hint as to where the problem is in the document. Compare this to the following output from the DOM version:

IndexError: list index out of range

Quick Start

Write a template from your example document. To begin, you can just use the example document. With simple documents, that will be sufficient. For more details, see the template format section.
Construct a SimpleXMLParser instance using the template string: parser = simplexmlparse.SimpleXMLParser(template)
Use the parse method to create a Python object from the document string: docObj = parser.parse(document)
Access the attributes on the object to extract the data. If you want to look at the entire data structure, the simplexmlparse.printObjectTree function will display the object, along with types and values.

You can also test your templates at the command line. The module provides a main routine that will parse the template and document, then print the object using printObjectTree: simplexmlparse.py [template file] [document file]

Template Format

SimpleXMLParse uses a template to build Python objects for XML elements. To create a template, you take an example document and annote it to describe which elements and attributes are required and which are optional. The template is very similar to the example document:

<entry xmlns="http://www.w3.org/2005/Atom"
        xmlns:simplexmlparse="http://evanjones.ca/simplexmlparse">
    <title simplexmlparse:count="1">Atom-Powered Robots Run Amok</title>
    <link href="required URI"/>
    <id simplexmlparse:count="1">urn:uuid:1225c695-cfb8-4ebb-aaaa-80da344efa6a</id>
    <updated simplexmlparse:count="1">2003-12-13T18:30:02Z</updated>
    <summary>Some text.</summary>
</entry>

Elements

Each element in the template defines a Python type. Each child element or attribute is turned into an attribute on the Python object. If the element contains any non-whitespace text, it will be permitted to have text in the document. All the text in the element will stored in the _text attribute on the Python object. By default, attributes and elements are optional. To make an attribute required, its value must begin with the string "required". Since attributes and elements both become attributes on the Python object, their names must be unique. Additionally, their names must be ASCII text and cannot begin with underscores (_) or contain colons (:). Because an element defines a type, SimpleXMLParse does not permit element names to be reused, unless they are defined to be the same type using references.

Number of Elements (count)

By default, elements are optional. To change the number of permitted elements, specify the simplexmlparse:count attribute. It takes the following values:

Value	Meaning
?	(default) Zero or one elements.
1	Exactly one element.
*	Zero or more elements (any number).
+	One or more elements.

If the number of elements in the document does not match the number specified by the count attribute, a ParseError exception will be raised.

Referring to Other Types (ref)

An element can be the same type as a previously defined element in the document. This allows a definition to be reused. In this case, both the elements will have the same required attributes and child elements. To specify that an element uses a previously defined type, specify the simplexmlparse:ref attribute on the element, with a value equal to the previous element's name.

Limitations

Order is not preserved. Since the XML document is converted into a Python object, the order of elements and text is not preserved. This is a fundamental limitation of this type of conversion. If you need the order, you need to use DOM or SAX parsers.
No distinction between elements and attributes. This is due to the model of parsing XML documents into Python objects.
No XML output. It should be possible to create a Python object and have it converted into a valid XML document matching the template.
Strict parsing only. Unknown attributes and elements cause parse errors. It should be possible to ignore any XML data that is not understood.
Arbitrary XML content. Some XML formats allow some elements to contain arbitrary XML. For example, the Atom format permits XHTML to be contained in the content element. It should be possible to collect this XML into an attribute on the parent object.