Python etree


Python's built-in XML parser (xml.etree.ElementTree).

Documentation

XML https://www.w3schools.com/XML/default.asp
XSLT (convert XML to other formats) https://www.w3schools.com/xml/xsl_intro.asp
etree/ElementTree Documentation https://docs.python.org/3/library/xml.etree.elementtree.html#element-objects

Usage

Sample XML object (pom.xml)

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>com.willpittman.maven_intro</groupId>
  <artifactId>maven_intro</artifactId>
</project>


from xml.etree import ElementTree  # ElementTree is a module
tree = ElementTree.parse('pom.xml')
root_element = tree.getroot()
root = root_element  # shorter alias, used in the examples below

element.tag


# name of tag (with namespace here)
>>> root.tag
'{http://maven.apache.org/POM/4.0.0}project'
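
The '{namespace}name' form shown above can be split apart when only the local name is needed. A minimal sketch, parsing from a string instead of pom.xml; split_tag is a hypothetical helper, not part of ElementTree.

```python
from xml.etree import ElementTree

POM = """<project xmlns="http://maven.apache.org/POM/4.0.0">
  <modelVersion>4.0.0</modelVersion>
</project>"""


def split_tag(tag):
    """Split '{uri}local' into (uri, local); uri is '' when there is no namespace."""
    if tag.startswith('{'):
        uri, _, local = tag[1:].partition('}')
        return uri, local
    return '', tag


root = ElementTree.fromstring(POM)
uri, local = split_tag(root.tag)
print(uri)    # the namespace the tag belongs to
print(local)  # the tag name as written in the file
```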

element.attrib


A dictionary of the attributes (key/value pairs) set inside the tag's angle brackets. Namespace declarations (xmlns, xmlns:...) are not included.

>>> root.attrib
{
    '{http://www.w3.org/2001/XMLSchema-instance}schemaLocation':
        'http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd',
}

There is a lot going on here, so let's break it down

<project 
  xmlns="http://maven.apache.org/POM/4.0.0" 
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd"
>
  • xmlns="http://..." sets the default namespace: this element and its descendants belong to it unless they are explicitly assigned another. It applies to elements only, not to attributes.
  • xmlns:xsi="http://..." binds the prefix 'xsi' to a namespace. Wherever a name is prefixed with xsi:, ElementTree replaces the prefix with the namespace in braces ({http://...}name).
  • xsi:schemaLocation="http://... http://file.xsd" pairs a namespace with the location of the schema file (.xsd) that describes it.
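
To see how these declarations surface in ElementTree, here is a small sketch on a trimmed copy of the tag above. The id attribute is a made-up addition, included only to show that an unprefixed attribute keeps a plain key.

```python
from xml.etree import ElementTree

XML = (
    '<project xmlns="http://maven.apache.org/POM/4.0.0" '
    'xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" '
    'xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 '
    'http://maven.apache.org/maven-v4_0_0.xsd" '
    'id="demo"/>'  # 'id' is hypothetical, not part of the original pom.xml
)

root = ElementTree.fromstring(XML)

# The element picks up the default namespace ...
print(root.tag)

# ... the xmlns declarations are consumed by the parser, the prefixed
# attribute is expanded, and the unprefixed attribute stays as-is.
for key in sorted(root.attrib):
    print(key)
```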


element.text


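Returns the text between the opening tag and the first child element (or the closing tag, if there are no children)

# above <project> contains only whitespace before its first child
>>> root.text
'\n  '

# its first child (modelVersion) does contain text
>>> list(root)[0].text
'4.0.0'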
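
Text that follows a child's closing tag is stored on that child as element.tail, not on the parent. A short sketch on a made-up HTML-like snippet:

```python
from xml.etree import ElementTree

root = ElementTree.fromstring('<p>before <b>bold</b> after</p>')
bold = root[0]

print(repr(root.text))   # text before the first child
print(repr(bold.text))   # text inside <b>
print(repr(bold.tail))   # text after </b>, up to the next tag

# itertext() walks text and tail in document order
full_text = ''.join(root.itertext())
print(full_text)
```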

list(element)

# get children
>>> list(root)
[
   <Element '{http://maven.apache.org/POM/4.0.0}modelVersion' at 0x7f6f9a0233b8>,
   <Element '{http://maven.apache.org/POM/4.0.0}groupId' at 0x7f6f91601868>,
   <Element '{http://maven.apache.org/POM/4.0.0}artifactId' at 0x7f6f926744a8>,
]

element.get() (attributes)


Normal

# <span style="border:1">...</span>
>>> element.get('style')
'border:1'


Namespaces

Element.get() does not resolve namespace prefixes: its second argument is the default value returned when the attribute is missing, not a dictionary of namespaces. Namespaced attributes must be queried by their expanded name, which you can also build from a dictionary of namespaces.

>>> root_element.get('{http://www.w3.org/2001/XMLSchema-instance}schemaLocation')
'http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd'

>>> namespaces = {"xsi": 'http://www.w3.org/2001/XMLSchema-instance'}
>>> root_element.get('{%s}schemaLocation' % namespaces['xsi'])
'http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd'
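
If you prefer writing attribute names with a prefix, the expansion can be wrapped in a small function. A sketch only; expand is a hypothetical helper, not an ElementTree API.

```python
from xml.etree import ElementTree

XML = (
    '<project xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" '
    'xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 '
    'http://maven.apache.org/maven-v4_0_0.xsd"/>'
)
namespaces = {'xsi': 'http://www.w3.org/2001/XMLSchema-instance'}


def expand(name, namespaces):
    """Turn 'prefix:local' into '{uri}local'; unprefixed names are returned unchanged."""
    prefix, sep, local = name.partition(':')
    if not sep:
        return name
    return '{%s}%s' % (namespaces[prefix], local)


root = ElementTree.fromstring(XML)
location = root.get(expand('xsi:schemaLocation', namespaces))
print(location)

# Passing a second argument to get() only sets the fallback value.
missing = root.get('xsi:schemaLocation', 'fallback')
print(missing)
```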


element.find/findall() (children)


Normal Usage

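# unqualified names only match elements without a namespace;
# for the pom.xml above (default namespace) use the namespaced forms below
root_element.find('modelVersion')     # first direct-child element of type 'modelVersion'
root_element.findall('modelVersion')  # all direct-child elements of type 'modelVersion'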

Namespaces

# create dictionary of your namespaces
ns = {
    "x": "http://blah/blah",
    "y": "http://foo/bar",
}

# find items with namespace
root_element.findall('x:person', ns)              # namespaces will be applied from dict
root_element.findall('{http://blah/blah}person')  # you can also use final result
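
Applied to the pom.xml sample, the same lookups might look like the sketch below. The 'pom' prefix is made up for the query; it does not need to match anything in the file. The '{*}' wildcard requires Python 3.8 or later.

```python
from xml.etree import ElementTree

POM = """<project xmlns="http://maven.apache.org/POM/4.0.0">
  <modelVersion>4.0.0</modelVersion>
  <groupId>com.willpittman.maven_intro</groupId>
  <artifactId>maven_intro</artifactId>
</project>"""

root = ElementTree.fromstring(POM)
ns = {'pom': 'http://maven.apache.org/POM/4.0.0'}  # hypothetical prefix

no_namespace = root.find('groupId')          # no match: groupId is namespaced
with_prefix = root.find('pom:groupId', ns)   # prefix resolved through the dict
expanded = root.find('{http://maven.apache.org/POM/4.0.0}groupId')
wildcard = root.find('{*}groupId')           # any namespace (Python 3.8+)

print(no_namespace)
print(with_prefix.text)
```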

XPath Support

ElementTree supports a limited subset of XPath, which lets you address XML elements much like paths in a filesystem.

root.findall('./country/neighbor')   # all 'neighbor' children under 'root/country'
root.findall('.//year')              # all 'year' descendants at any nesting depth
root.findall('.//year/..')           # parent of every 'year' descendant ('..' cannot climb above the element findall() is called on)
root.findall('./*[@attrib]')         # direct children with attribute 'attrib'
root.findall('./*[@attrib="val"]')   # direct children with attribute 'attrib' whose value is "val"
root.findall('./*[tag]')             # direct children that have a child element named 'tag'
root.findall('./*[tag="val"]')       # direct children that have a child 'tag' whose text is "val"

See https://docs.python.org/3/library/xml.etree.elementtree.html?highlight=etree#xpath-support for more details.
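
A runnable sketch of a few of these queries. The country data is a made-up document, loosely modelled on the sample used in the Python documentation.

```python
from xml.etree import ElementTree

DATA = """<data>
  <country name="Liechtenstein">
    <year>2008</year>
    <neighbor name="Austria"/>
    <neighbor name="Switzerland"/>
  </country>
  <country name="Singapore">
    <year>2011</year>
    <neighbor name="Malaysia"/>
  </country>
</data>"""

root = ElementTree.fromstring(DATA)

neighbors = [n.get('name') for n in root.findall('./country/neighbor')]
years = [y.text for y in root.findall('.//year')]
singapore = root.findall("./country[@name='Singapore']")
in_2008 = [c.get('name') for c in root.findall("./country[year='2008']")]
parents = [p.get('name') for p in root.findall('.//year/..')]

print(neighbors)
print(years)
print(in_2008)
```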


extract namespaces from xml file


>>> from xml.etree import ElementTree
>>> namespaces_list = [ns_tuple for (event, ns_tuple) in ElementTree.iterparse('pom.xml', events=['start-ns'])]
>>> namespaces = dict(namespaces_list)
>>> namespaces
{
    '': 'http://maven.apache.org/POM/4.0.0',
    'xsi': 'http://www.w3.org/2001/XMLSchema-instance',
}
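
iterparse() also accepts a file object, so the same approach works on XML held in memory. A minimal sketch; namespaces_of is a hypothetical helper name.

```python
import io
from xml.etree import ElementTree

POM = b"""<project xmlns="http://maven.apache.org/POM/4.0.0"
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <modelVersion>4.0.0</modelVersion>
</project>"""


def namespaces_of(source):
    """Collect {prefix: uri} from a file path or a binary file object."""
    return dict(
        ns_tuple
        for _event, ns_tuple in ElementTree.iterparse(source, events=['start-ns'])
    )


namespaces = namespaces_of(io.BytesIO(POM))
print(namespaces)
```

Note that dict() keeps only the last binding if a document declares the same prefix more than once.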


Compose XML objects


# create XML object
a  = ElementTree.Element('a')
b1 = ElementTree.SubElement(a, 'b1')
b2 = ElementTree.SubElement(a, 'b2')

b1.text = 'val'
b2.text = 'val'

a.tail = '\n'
b1.tail = '\n'
b2.tail = '\n'

xmlstr = ElementTree.tostring(a)  # bytes by default; pass encoding='unicode' to get a str
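
Attributes can be set while composing as well, either at creation time or with set(). A small sketch with made-up element and attribute names, parsed back in to check the result:

```python
from xml.etree import ElementTree

# attributes passed as a dict at creation time
project = ElementTree.Element('project', {'name': 'demo'})

# attributes added afterwards with set()
dependency = ElementTree.SubElement(project, 'dependency')
dependency.set('scope', 'test')
dependency.text = 'junit'

xml_text = ElementTree.tostring(project, encoding='unicode')  # str instead of bytes
print(xml_text)

# round-trip: parse the serialized text back into elements
parsed = ElementTree.fromstring(xml_text)
print(parsed.get('name'), parsed[0].get('scope'), parsed[0].text)
```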

Write XML objects


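Text between tags, and text following tags, is potentially very important in XML (think of HTML), so ElementTree does not add indentation on its own. Python 3.9 added ElementTree.indent(); on older versions ElementTree has no built-in way of handling indentation, but Python's minidom does.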

from xml.dom import minidom
from xml.etree import ElementTree

etree_element = ElementTree.Element('project')
xml = ElementTree.tostring(etree_element)  # serialize the element created above

minidom_element = minidom.parseString(xml)
pretty_xml = minidom_element.toprettyxml(indent='    ')  # returns the indented XML as a str
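
On Python 3.9 and later, ElementTree.indent() can do the same without minidom. A sketch, guarded so it still runs (without indentation) on older versions. Note that indent() modifies the tree in place by rewriting whitespace-only text and tail values.

```python
from xml.etree import ElementTree

project = ElementTree.Element('project')
ElementTree.SubElement(project, 'modelVersion').text = '4.0.0'
ElementTree.SubElement(project, 'artifactId').text = 'maven_intro'

# indent() exists only on Python 3.9+
if hasattr(ElementTree, 'indent'):
    ElementTree.indent(project, space='    ')

pretty = ElementTree.tostring(project, encoding='unicode')
print(pretty)
```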


Tips

working with namespaces

Working with namespaces in XML can be unintuitive. Here's a quickstart for parsing a document.

<!-- ./doc.xml -->
<section xmlns="http://docbook.org/ns/docbook"
      xmlns:xlink="http://www.w3.org/1999/xlink"
      xmlns:xi="http://www.w3.org/2001/XInclude"
      version="5.0"
      xml:id='ssec-builtins'>

<title>Built-in Functions</title>

<para>This section lists the functions and constants built into the ...</para>
</section>

from xml.etree import ElementTree


def get_namespaces(filepath):
    """ returns dict of namespaces

    Returns:

        .. code-block:: python

            { '':      'http://docbook.org/ns/docbook',
              'xlink': 'http://www.w3.org/1999/xlink',
              'xi':    'http://www.w3.org/2001/XInclude'  }

    """
    ns_list = [ns_tuple for (event, ns_tuple) in ElementTree.iterparse(filepath, events=['start-ns'])]
    namespaces = dict(ns_list)
    return namespaces


def find_title_element(filepath):
    tree = ElementTree.parse(filepath)
    root = tree.getroot()
    ns = get_namespaces(filepath)

    # <title> has no prefix, so it belongs to the default ('') namespace.
    # We must add its url within '{}'s in our query
    titles = root.findall('.//{%s}title' % ns[''])

    # or, with the url written out
    titles = root.findall('.//{http://docbook.org/ns/docbook}title')

    return titles[0] if titles else None
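
The attributes of <section> follow different rules from its child elements: version has no prefix and therefore no namespace, while xml:id uses the reserved xml prefix, which is always bound to http://www.w3.org/XML/1998/namespace. A sketch on a trimmed, in-memory copy of the document; the 'db' prefix is made up for the query.

```python
from xml.etree import ElementTree

DOC = """<section xmlns="http://docbook.org/ns/docbook"
      xmlns:xlink="http://www.w3.org/1999/xlink"
      version="5.0"
      xml:id="ssec-builtins">
<title>Built-in Functions</title>
</section>"""

XML_NS = 'http://www.w3.org/XML/1998/namespace'  # reserved namespace of the xml: prefix
ns = {'db': 'http://docbook.org/ns/docbook'}      # hypothetical prefix for the default namespace

root = ElementTree.fromstring(DOC)

title = root.find('db:title', ns)         # element: in the default namespace
version = root.get('version')             # unprefixed attribute: no namespace
section_id = root.get('{%s}id' % XML_NS)  # xml:id: expanded name required

print(title.text, version, section_id)
```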