XML Basics for Python Using ElementTree

    Learn how you can parse, explore, modify and populate XML files with the Python ElementTree package, for loops and XPath expressions.

    As a data scientist, you'll find that understanding XML is useful both for web scraping and for parsing structured documents in general.

    In this tutorial, you'll cover the following topics:

    • You'll learn more about XML and get introduced to the ElementTree package.
    • Then, you'll discover how to explore XML trees with ElementTree functions, for loops and XPath expressions, so that you better understand the data you're working with.
    • Next, you'll learn how to modify an XML file.
    • Finally, you'll use XPath expressions to populate XML files.

    What is XML?

    XML stands for "Extensible Markup Language". It is widely used on the web and elsewhere to store and exchange data that has a specific structure, which programs can then read by following that structure.

    XML creates a tree-like structure that is easy to interpret and supports a hierarchy. Any document that follows the XML syntax rules can be called an XML document.

    • XML documents have sections, called elements, defined by a beginning and an ending tag. A tag is a markup construct that begins with < and ends with >. The characters between the start-tag and end-tag, if there are any, are the element's content. Elements can contain markup, including other elements, which are called "child elements".
    • The largest, top-level element is called the root, which contains all other elements.
    • Attributes are name–value pairs that exist within a start-tag or empty-element tag. An XML attribute can only have a single value and each attribute can appear at most once on each element.

    To understand this a little bit better, take a look at the following (shortened) XML file:

    <?xml version="1.0"?>
    <collection>
        <genre category="Action">
            <decade years="1980s">
                <movie favorite="True" title="Indiana Jones: The raiders of the lost Ark">
                    <format multiple="No">DVD</format>
                    <year>1981</year>
                    <rating>PG</rating>
                    <description>
                    'Archaeologist and adventurer Indiana Jones is hired by the U.S. government to find the Ark of the Covenant before the Nazis.'
                    </description>
                </movie>
                <movie favorite="True" title="THE KARATE KID">
                    <format multiple="Yes">DVD,Online</format>
                    <year>1984</year>
                    <rating>PG</rating>
                    <description>None provided.</description>
                </movie>
                <movie favorite="False" title="Back 2 the Future">
                    <format multiple="False">Blu-ray</format>
                    <year>1985</year>
                    <rating>PG</rating>
                    <description>Marty McFly</description>
                </movie>
            </decade>
            <decade years="1990s">
                <movie favorite="False" title="X-Men">
                    <format multiple="Yes">dvd, digital</format>
                    <year>2000</year>
                    <rating>PG-13</rating>
                    <description>Two mutants come to a private academy for their kind whose resident superhero team must oppose a terrorist organization with similar powers.</description>
                </movie>
                <movie favorite="True" title="Batman Returns">
                    <format multiple="No">VHS</format>
                    <year>1992</year>
                    <rating>PG13</rating>
                    <description>NA.</description>
                </movie>
                <movie favorite="False" title="Reservoir Dogs">
                    <format multiple="No">Online</format>
                    <year>1992</year>
                    <rating>R</rating>
                    <description>WhAtEvER I Want!!!?!</description>
                </movie>
            </decade>
        </genre>
        <genre category="Thriller">
            <decade years="1970s">
                <movie favorite="False" title="ALIEN">
                    <format multiple="Yes">DVD</format>
                    <year>1979</year>
                    <rating>R</rating>
                    <description>"""""""""</description>
                </movie>
            </decade>
            <decade years="1980s">
                <movie favorite="True" title="Ferris Bueller's Day Off">
                    <format multiple="No">DVD</format>
                    <year>1986</year>
                    <rating>PG13</rating>
                    <description>Funny movie about a funny guy</description>
                </movie>
                <movie favorite="FALSE" title="American Psycho">
                    <format multiple="No">blue-ray</format>
                    <year>2000</year>
                    <rating>Unrated</rating>
                    <description>psychopathic Bateman</description>
                </movie>
            </decade>
        </genre>

    From what you have read above, you see that

    • <collection> is the single root element: it contains all the other elements, such as <genre> or <movie>, which are the child elements or subelements. As you can see, these elements are nested.

    Note that these child elements can also act as parents and contain their own child elements, which are then called "sub-child elements".

    • You'll see that, for example, the <movie> element has a couple of "attributes", favorite and title, that give even more information!

    With this short intro to XML files in mind, you're ready to learn more about ElementTree!

    Introduction to ElementTree

    The XML tree structure makes navigation, modification, and removal relatively simple to do programmatically. Python has a built-in library, ElementTree, that has functions to read and manipulate XML documents (and other similarly structured files).

    First, import ElementTree. It's a common practice to use the alias of ET:

    import xml.etree.ElementTree as ET

    Parsing XML Data

    The XML file provided describes a basic collection of movies. The only problem is that the data is a mess! There have been a lot of different curators of this collection and everyone has their own way of entering data into the file. The main goal in this tutorial will be to read and understand the file with Python, then fix the problems.

    First you need to read in the file with ElementTree.

    tree = ET.parse('data/movies.xml')
    root = tree.getroot()
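
    If you don't have data/movies.xml on hand, the same two steps can be tried on an in-memory string. The following is only a minimal sketch: the xml_text sample is a hypothetical, trimmed-down stand-in for the real file, and the variable names are illustrative. It also shows ET.fromstring(), which parses a string and hands back the root element directly instead of a tree object.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical, trimmed-down stand-in for data/movies.xml
xml_text = """<?xml version="1.0"?>
<collection>
    <genre category="Action">
        <decade years="1980s">
            <movie favorite="True" title="THE KARATE KID">
                <year>1984</year>
            </movie>
        </decade>
    </genre>
</collection>"""

# ET.parse() accepts a filename or a file-like object and returns an ElementTree
tree = ET.parse(io.StringIO(xml_text))
root_from_tree = tree.getroot()

# ET.fromstring() parses a string and returns the root Element directly
root_from_string = ET.fromstring(xml_text)

print(type(tree).__name__)      # ElementTree
print(root_from_tree.tag)       # collection
print(root_from_string.tag)     # collection
```

    Either route ends with a root element; ET.parse() is the natural choice for files on disk, while ET.fromstring() is handy for XML that already lives in a Python string.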

    Now that you have initialized the tree, you should look at the XML and print out values in order to understand how the tree is structured.

    Every part of a tree (root included) has a tag that describes the element. In addition, as you have seen in the introduction, elements might have attributes, which are additional descriptors that are especially useful when the same tag is used repeatedly. Attributes can also help to validate the values entered for a tag, once again contributing to the structured format of an XML document.

    You'll see later on in this tutorial that attributes can be pretty powerful when included in an XML!

    root.tag

    At the top level, you see that this XML is rooted in the collection tag.

    root.attrib

    So the root has no attributes.
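
    To see how tag and attrib behave on elements that do carry attributes, here is a small self-contained sketch. It assumes a hypothetical, trimmed-down sample (xml_text) rather than the full movies file, and it reaches child elements by index purely for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample with the same nesting as the movies file
xml_text = """<?xml version="1.0"?>
<collection>
    <genre category="Action">
        <decade years="1980s">
            <movie favorite="True" title="THE KARATE KID">
                <format multiple="Yes">DVD,Online</format>
                <year>1984</year>
            </movie>
        </decade>
    </genre>
</collection>"""

root = ET.fromstring(xml_text)

print(root.tag)       # collection
print(root.attrib)    # {}

# Elements can be indexed like lists to reach their children
genre = root[0]
print(genre.tag)      # genre
print(genre.attrib)   # {'category': 'Action'}

# attrib is a plain dict; get() looks up a single attribute by name
movie = genre[0][0]
title = movie.get("title")
year_text = movie.find("year").text  # element content is always a string
print(title, year_text)              # THE KARATE KID 1984
```

    Note that attribute values and element text both come back as strings, so a value like 1984 or "True" needs an explicit conversion before you treat it as a number or a boolean.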

    For Loops

    You can easily iterate over subelements (commonly called "children") in the root by using a simple "for" loop.