Python etree
Python's built-in XML parser.
Documentation
XML: https://www.w3schools.com/XML/default.asp
XSLT (convert XML to other formats): https://www.w3schools.com/xml/xsl_intro.asp
etree/ElementTree Documentation: https://docs.python.org/3/library/xml.etree.elementtree.html#element-objects
Usage
Sample XML object (pom.xml)
    <project xmlns="http://maven.apache.org/POM/4.0.0"
             xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
             xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
      <modelVersion>4.0.0</modelVersion>
      <groupId>com.willpittman.maven_intro</groupId>
      <artifactId>maven_intro</artifactId>
    </project>
    from xml.etree import ElementTree  # ElementTree is a module

    tree = ElementTree.parse('pom.xml')
    root_element = tree.getroot()
    root = root_element  # shorter alias used in some of the examples below

element.tag
    # name of the tag (here including its namespace)
    >>> root.tag
    '{http://maven.apache.org/POM/4.0.0}project'

element.attrib
Returns a dictionary of the attributes (key/value pairs) declared inside the tag's brackets.

    >>> root.attrib
    {
        "{http://www.w3.org/2001/XMLSchema-instance}schemaLocation": "http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd",
    }

There is a lot going on here, so let's break it down.
    <project xmlns="http://maven.apache.org/POM/4.0.0"
             xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
             xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd"
    >
xmlns="http://..."
All tags are assigned this namespace by default, unless they are explicitly assigned another.
xmlns:xsi="http://..."
Binds the prefix 'xsi' to a namespace. Wherever a name is prefixed with xsi:, the prefix is expanded to the bound URI, which ElementTree shows as '{uri}name'.
xsi:schemaLocation="http://... http://file.xsd"
Pairs a namespace with the location of the schema file (.xsd) that describes it. The xmlns declarations themselves do not appear in element.attrib, which is why only this attribute is listed above.
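To make the expansion concrete, here is a minimal sketch. It parses a trimmed copy of the sample pom from an in-memory string (an assumption, so the snippet runs without pom.xml on disk), and split_clark is a hypothetical helper name, not part of ElementTree.

```python
from xml.etree import ElementTree

POM = """<project xmlns="http://maven.apache.org/POM/4.0.0"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
  <modelVersion>4.0.0</modelVersion>
</project>"""


def split_clark(name):
    """Split '{uri}local' into (uri, local); un-namespaced names get uri ''."""
    if name.startswith('{'):
        uri, local = name[1:].split('}', 1)
        return uri, local
    return '', name


root = ElementTree.fromstring(POM)

# the root tag and its child both pick up the default namespace
tag_ns, tag_local = split_clark(root.tag)
child_ns, child_local = split_clark(root[0].tag)

# the only attribute is xsi:schemaLocation, stored under its expanded name
attr_ns, attr_local = split_clark(next(iter(root.attrib)))

# schemaLocation holds a whitespace-separated (namespace, schema file) pair
schema_ns, schema_file = root.attrib[next(iter(root.attrib))].split()
```

The child inherits the default namespace even though it carries no xmlns attribute of its own.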
element.text
Returns the text between the element's opening tag and its first child (or its closing tag, if it has no children).

    # <project> above contains no text of its own, only whitespace
    >>> root.text
    '\n '
    # its first child (modelVersion) does contain text
    >>> list(root)[0].text
    '4.0.0'

list(element)
    # get children
    >>> list(root)
    [
        <Element '{http://maven.apache.org/POM/4.0.0}modelVersion' at 0x7f6f9a0233b8>,
        <Element '{http://maven.apache.org/POM/4.0.0}groupId' at 0x7f6f91601868>,
        <Element '{http://maven.apache.org/POM/4.0.0}artifactId' at 0x7f6f926744a8>,
    ]

element.get() (attributes)
Normal

    # <span style="border:1">...</span>
    >>> element.get('style')
    'border:1'
Namespaces

Element.get() has no namespaces parameter: its second argument is a default value, returned when the attribute is missing. Query a namespaced attribute by its expanded name, either built from a dictionary of namespaces or written out in full.

    >>> namespaces = {"xsi": 'http://www.w3.org/2001/XMLSchema-instance'}
    >>> root_element.get('{%s}schemaLocation' % namespaces['xsi'])
    'http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd'
    >>> root_element.get('{http://www.w3.org/2001/XMLSchema-instance}schemaLocation')
    'http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd'
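A small sketch of attribute lookup by prefix. Element.get() takes (key, default), so a prefixed name has to be expanded before the call. The helper name expand is hypothetical, and the XML is a trimmed in-memory copy of the sample pom.

```python
from xml.etree import ElementTree

XML = ('<project xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" '
       'xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 '
       'http://maven.apache.org/maven-v4_0_0.xsd"/>')


def expand(prefixed_name, namespaces):
    """Turn 'prefix:name' into '{uri}name' (hypothetical helper)."""
    prefix, sep, local = prefixed_name.partition(':')
    if not sep:
        return prefixed_name  # no prefix: the attribute has no namespace
    return '{%s}%s' % (namespaces[prefix], local)


namespaces = {'xsi': 'http://www.w3.org/2001/XMLSchema-instance'}
root = ElementTree.fromstring(XML)

# lookup by expanded name finds the attribute
value = root.get(expand('xsi:schemaLocation', namespaces))

# the raw prefixed form is not a stored key, so the default comes back
missing = root.get('xsi:schemaLocation', 'fallback')
```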
element.find/findall() (children)
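Normal Usage

    root_element.find('modelVersion')     # first direct-child element with tag 'modelVersion'
    root_element.findall('modelVersion')  # all direct-child elements with tag 'modelVersion'

Bare tag names only match elements without a namespace. In pom.xml every tag belongs to the default namespace, so these calls return None and [] there; use the namespace forms instead.

Namespaces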
    # create dictionary of your namespaces
    ns = {
        "x": "http://blah/blah",
        "y": "http://foo/bar",
    }

    # find items with namespace
    root_element.findall('x:person', ns)               # prefixes are expanded using the dict
    root_element.findall('{http://blah/blah}person')   # or use the expanded name directly

XPath Support

ElementTree supports a limited subset of XPath, which lets you address elements much like paths in a filesystem.
    root.findall('./country/neighbor')    # all 'neighbor' children under 'root/country'
    root.findall('.//year')               # all 'year' elements at any nested depth
    root.findall('.//year/..')            # parents of every 'year' element ('..' cannot climb above the element findall() is called on)
    root.findall('./*[@attrib]')          # direct children with attribute 'attrib'
    root.findall('./*[@attrib="val"]')    # direct children with attribute 'attrib' whose value is "val"
    root.findall('./*[tag]')              # direct children that have an immediate child with tag 'tag'
    root.findall('./*[tag="val"]')        # direct children that have an immediate child 'tag' whose text is "val"

See https://docs.python.org/3/library/xml.etree.elementtree.html?highlight=etree#xpath-support for more details.
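The path forms can be tried against a small document. This is a sketch using made-up sample data (the country/year/neighbor layout is only an illustration, not taken from a real file).

```python
from xml.etree import ElementTree

DATA = """<data>
  <country name="Liechtenstein">
    <year>2008</year>
    <neighbor name="Austria"/>
    <neighbor name="Switzerland"/>
  </country>
  <country name="Singapore">
    <year>2011</year>
    <neighbor name="Malaysia"/>
  </country>
</data>"""

root = ElementTree.fromstring(DATA)

# child path: every 'neighbor' directly under a 'country' under root
neighbors = [n.get('name') for n in root.findall('./country/neighbor')]

# descendant path: 'year' elements at any depth
years = [y.text for y in root.findall('.//year')]

# attribute predicate: countries whose 'name' attribute is 'Singapore'
singapore = root.findall("./country[@name='Singapore']")

# child-text predicate: countries with a 'year' child whose text is '2008'
with_2008 = [c.get('name') for c in root.findall("./country[year='2008']")]

# parent step: the parents of every 'year' element
parents = [p.get('name') for p in root.findall('.//year/..')]
```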
Extract namespaces from XML file

    >>> from xml.etree import ElementTree
    >>> namespaces_list = [
    ...     ns_tuple
    ...     for (event, ns_tuple) in ElementTree.iterparse('pom.xml', events=['start-ns'])
    ... ]
    >>> namespaces = dict(namespaces_list)
    >>> namespaces
    {
        '': 'http://maven.apache.org/POM/4.0.0',
        'xsi': 'http://www.w3.org/2001/XMLSchema-instance',
    }
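The same idea as a reusable function, sketched under two assumptions: the document is read from an in-memory io.BytesIO instead of pom.xml (iterparse accepts a filename or a file object), and collect_namespaces is a hypothetical name.

```python
import io
from xml.etree import ElementTree

XML = b"""<project xmlns="http://maven.apache.org/POM/4.0.0"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <modelVersion>4.0.0</modelVersion>
</project>"""


def collect_namespaces(source):
    """Return {prefix: uri} for every namespace declared in `source`."""
    namespaces = {}
    for _event, (prefix, uri) in ElementTree.iterparse(source, events=['start-ns']):
        # a prefix may be redeclared deeper in the tree; keep the first binding
        namespaces.setdefault(prefix, uri)
    return namespaces


namespaces = collect_namespaces(io.BytesIO(XML))
```

Passing the pairs straight to dict() keeps the last binding when a prefix is declared more than once; setdefault keeps the first, which is usually the one declared on the root.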
Compose XML objects
    # create XML object
    a = ElementTree.Element('a')
    b1 = ElementTree.SubElement(a, 'b1')
    b2 = ElementTree.SubElement(a, 'b2')

    b1.text = 'val'
    b2.text = 'val'

    # tail is the text that follows an element's closing tag
    a.tail = '\n'
    b1.tail = '\n'
    b2.tail = '\n'

    xmlstr = ElementTree.tostring(a)  # returns bytes by default

Write XML objects
Text between tags, and text following tags, is potentially very important in XML (think of HTML), so ElementTree never adds whitespace on its own. Before Python 3.9 (which added ElementTree.indent()), ElementTree had no built-in way to indent output, but Python's minidom does on any version.

    from xml.dom import minidom
    from xml.etree import ElementTree

    etree_element = ElementTree.Element('project')
    xml = ElementTree.tostring(etree_element)

    minidom_element = minidom.parseString(xml)
    minidom_element.toprettyxml(indent=' ')
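A short end-to-end sketch of composing a tree and writing it out both compactly and pretty-printed through minidom. The element names are borrowed from the sample pom, and encoding='unicode' is used so that tostring() returns str instead of bytes.

```python
from xml.dom import minidom
from xml.etree import ElementTree

# compose: <project><modelVersion>4.0.0</modelVersion></project>
project = ElementTree.Element('project')
model_version = ElementTree.SubElement(project, 'modelVersion')
model_version.text = '4.0.0'

# compact form: no whitespace is added between tags
compact = ElementTree.tostring(project, encoding='unicode')

# pretty form: round-trip through minidom, which adds newlines and indentation
pretty = minidom.parseString(compact).toprettyxml(indent='  ')
```

Pretty-printing inserts whitespace text nodes, so it is best kept for output meant for people, not for documents where whitespace is significant.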
Tips
Working with namespaces
Working with namespaces in XML can be unintuitive. Here's a quickstart for parsing a document.
    <!-- ./doc.xml -->
    <section xmlns="http://docbook.org/ns/docbook"
             xmlns:xlink="http://www.w3.org/1999/xlink"
             xmlns:xi="http://www.w3.org/2001/XInclude"
             version="5.0"
             xml:id='ssec-builtins'>
      <title>Built-in Functions</title>
      <para>This section lists the functions and constants built into the ...</para>
    </section>

    from xml.etree import ElementTree


    def get_namespaces(filepath):
        """ returns dict of namespaces

        Returns:

            .. code-block:: python

                {
                    '': 'http://docbook.org/ns/docbook',
                    'xlink': 'http://www.w3.org/1999/xlink',
                    'xi': 'http://www.w3.org/2001/XInclude'
                }
        """
        ns_list = [
            ns_tuple
            for (event, ns_tuple) in ElementTree.iterparse(filepath, events=['start-ns'])
        ]
        namespaces = dict(ns_list)
        return namespaces


    def find_title_element(filepath):
        tree = ElementTree.parse(filepath)
        root = tree.getroot()
        ns = get_namespaces(filepath)

        # <title> has no prefix, so it belongs to the default ('') namespace.
        # We must add its URI within '{}'s in our query
        return root.findall('.//{%s}title' % ns[''])
        # or, equivalently:
        # return root.findall('.//{http://docbook.org/ns/docbook}title')
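An alternative to pasting the URI into every query is to give the default namespace a local prefix of your own. A minimal sketch, assuming a trimmed in-memory copy of doc.xml; the prefix 'db' is an arbitrary choice and does not have to appear in the document.

```python
from xml.etree import ElementTree

DOC = """<section xmlns="http://docbook.org/ns/docbook"
         xmlns:xlink="http://www.w3.org/1999/xlink"
         version="5.0" xml:id="ssec-builtins">
  <title>Built-in Functions</title>
</section>"""

# map a prefix of our choosing onto the document's default namespace
ns = {'db': 'http://docbook.org/ns/docbook'}
root = ElementTree.fromstring(DOC)

title = root.find('db:title', ns)   # prefix is expanded using the dict
no_match = root.find('title')       # bare name: looks for an un-namespaced tag

# the reserved xml: prefix is always bound to this fixed namespace
xml_id = root.get('{http://www.w3.org/XML/1998/namespace}id')
```

Only the URI has to match the document; the prefix exists purely in the query.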