The ability to capture, store and analyse large volumes of XML documents, and to extract useful insights from them securely and efficiently, is a valuable business capability. XML is a widely adopted data format with a hierarchical data model and its own query languages (XPath, XQuery) for navigating paths and nodes within documents. The XML specification lets developers define virtually any kind of data (structured, semi-structured or unstructured), which makes it an excellent intermediate format for data dissemination, exchange and integration between heterogeneous source systems with differing data models, query capabilities and formats.
More than two decades after its release by the W3C, XML is ubiquitous in virtually every technology footprint thanks to its versatility and independence from any particular software or machine. As use cases multiply, businesses are accumulating XML documents from internal and external sources, containing information that, if mined and extracted, is valuable for day-to-day operations.
Most businesses are aware of the potential of data in XML documents to i) enhance analytic data; ii) promote further automation of functions; iii) improve the performance and reduce the carbon footprint of various technologies; iv) uncover new patterns in customer/user behaviour; v) leverage open data initiatives; and vi) improve cyber security measures. Managing and processing XML documents as a viable data asset, however, requires a slightly different approach from other types of data because of XML's highly flexible format and structure. In this blog, we discuss strategies and key areas of consideration for extending existing end-to-end data architectures to consume quality data from XML sources.
Alternatively, if you are looking for more hands-on material with code, refer to this link on how to evaluate and extract data from XMLs in six simple steps using Python.
2.0. The Strategy Matters
When it comes to XML processing and management, four (4) common methods are observed, and they often co-exist when many use cases are in place.
XML to Relational DBMS. Normalise XML content into a relational table, or a set of relational tables with parent-child foreign key relationships. Relational systems do not handle XML natively, but this is an acceptable solution for many teams dealing with slowly changing, small to medium-sized XMLs. RDBMSs have mature features for scalability, availability and indexing, but document loading and the reconstruction of XML results at query time are expensive. SQL engines are relatively fast for XML queries that map to simple SQL queries, but sluggish for complex ones, since the optimiser has no knowledge of XML semantics and its data model. A minimal shredding sketch follows below.
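To make this first method concrete, here is a minimal sketch of "shredding" an XML document into parent and child relational tables. The order document, table names and columns are hypothetical illustrations, not a prescribed schema:

```python
import sqlite3
import xml.etree.ElementTree as ET

xml_doc = """
<order id="1001" customer="ACME">
  <item sku="A1" qty="2"/>
  <item sku="B7" qty="5"/>
</order>
"""

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer TEXT)")
conn.execute("""CREATE TABLE order_items (
    sku TEXT, qty INTEGER,
    order_id INTEGER REFERENCES orders(order_id))""")

root = ET.fromstring(xml_doc)
order_id = int(root.get("id"))
conn.execute("INSERT INTO orders VALUES (?, ?)", (order_id, root.get("customer")))
for item in root.findall("item"):  # one child row per <item> element
    conn.execute("INSERT INTO order_items VALUES (?, ?, ?)",
                 (item.get("sku"), int(item.get("qty")), order_id))

print(conn.execute(
    "SELECT customer, sku, qty FROM orders JOIN order_items USING (order_id)"
).fetchall())
```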
XML to NoSQL DBMS (native XML DBMS). Another method is to store XMLs in an XML-native DBMS designed to index and manage hierarchical data. While storing XMLs in their native form saves the computation needed to reshape them for processing, the query optimisers of these systems are less mature than their relational counterparts, and they are generally slower than most RDBMSs.
Web-based DOM processors. While the first two approaches take a 'store-query-retrieve' approach, this method formats, queries, combines and presents XML data without storing XMLs on disk. An XML query engine parses documents into an in-memory DOM tree, traverses the tree using XPath expressions, and extracts values and content into a new document or another data model. This suits small to medium-sized XMLs, but workloads can become memory intensive, and query results are only available after documents are fully parsed and processed.
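A minimal sketch of this 'parse the whole document into an in-memory DOM tree, then traverse and extract' style, using Python's standard-library minidom; the catalog content is a hypothetical example:

```python
from xml.dom import minidom

dom = minidom.parseString(
    "<catalog><book lang='en'><title>XML Basics</title></book>"
    "<book lang='de'><title>XML Praxis</title></book></catalog>"
)
# The entire tree is now in memory; traverse it and extract values.
for book in dom.getElementsByTagName("book"):
    title = book.getElementsByTagName("title")[0].firstChild.data
    print(book.getAttribute("lang"), title)
```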
Other web-oriented processors. Others build custom processors similar to DOM using Python or C++ and XML libraries such as ElementTree or RapidXML to process specific regions of the XML tree (via pull messages). This scales well even for very large files but incurs higher network cost, as it requires considerable communication between the data source and the consumer.
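A sketch of pull-style processing for very large files using ElementTree's iterparse, which keeps only the subtree currently being processed in memory. The file name and tag names are hypothetical:

```python
import xml.etree.ElementTree as ET

def stream_totals(path):
    total = 0.0
    for event, elem in ET.iterparse(path, events=("end",)):
        if elem.tag == "record":          # process one subtree at a time
            total += float(elem.findtext("amount", default="0"))
            elem.clear()                  # free the subtree to bound memory
    return total

print(stream_totals("records.xml"))
```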
The first two approaches are best if the need is to directly query the XMLs for business answers. By contrast, the last two methods emphasise extracting XML data and combining it with other sources to answer business queries. The questions to resolve (predefined queries and workload), time to resolve, consumers of the service, nature of the data sources, type of service (real-time or historical perspective), service delivery mechanism, cost and carbon footprint are all important factors in choosing the right combination of techniques for working with XMLs.
3.0. Get an Execution Framework in place
Including a new data source, especially one that is highly mutable with loose schemas, requires careful consideration and guardrails to preserve quality and safety. For instance, additional steps are included in the XML data processing workflow to verify, validate and evaluate contents before data is extracted and consumed, to ensure compliance with security policies. To get the best value from XML processing investments, the management of XML documents or datasets must be treated as a critical business function, just like database management and any other semi-structured or unstructured data source.
It’s best to start from high-value experimentation (groups of XMLs based on an application or functional unit) and work backwards to outline key processes, tools, technologies, skills requirements, best practices, internal policies and conventions. Python and its various XML parsing libraries (e.g. DOM, ElementTree) are extremely useful at this stage for quickly spinning up modules to facilitate and automate tasks in the process, from capturing/requesting XMLs to productising the data in them for consumers, as illustrated in diagram 1 above.
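As a flavour of the kind of throwaway module that helps at this experimentation stage, the sketch below scans a folder of captured XMLs and reports which ones are well-formed; the directory name is a placeholder:

```python
import pathlib
import xml.etree.ElementTree as ET

for path in pathlib.Path("captured_xmls").glob("*.xml"):
    try:
        ET.parse(path)                      # raises ParseError if malformed
        print(f"OK        {path.name}")
    except ET.ParseError as err:
        print(f"MALFORMED {path.name}: {err}")
```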
4.0. Prioritise the right Use Cases
Each business has its own needs and priorities, which means use cases can vary: i) conversion to other formats (e.g. JSON, CSV, Parquet or tables) for big data processing; ii) updates to a DBMS powering digital applications; iii) updates to DW systems for analytic/OLAP consumption; iv) triggering an ML model or another function for automation. Each may result in a slightly different architecture for delivering ready data to consumers, and significantly affects the choice of tools and technologies (e.g. scripting language, XML libraries, XML storage). Some high-level samples of common use cases are shown in Diagram 2 below, and a minimal conversion sketch follows.
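As an illustration of the first use case, here is a minimal sketch converting an XML document to JSON with the standard library only. The recursion is deliberately simplified (it ignores attributes and repeated tags), and the XML sample is hypothetical:

```python
import json
import xml.etree.ElementTree as ET

def element_to_dict(elem):
    # Leaf elements become their text; containers become nested dicts.
    children = list(elem)
    if not children:
        return elem.text
    return {child.tag: element_to_dict(child) for child in children}

root = ET.fromstring("<user><name>Ana</name><city>Oslo</city></user>")
print(json.dumps({root.tag: element_to_dict(root)}, indent=2))
```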
The higher the value of the data in an XML use case (improved ROI of existing datasets, faster go-to-market, reduced risk), the easier it becomes to justify and present the business case for an XML data management architecture to the C-suite.
5.0. Some traits of XMLs to keep in mind throughout deployment
The XML specification is easy for developers and integrators to follow and use. XMLs are constructed using simple predefined rules and schema constructs (e.g. opening tags, closing tags, elements, text nodes, attributes, namespace inclusions). XMLs can contain data of various types, both scalar (e.g. numerics, strings) and compound (e.g. arrays, structures), arranged in a hierarchy of arbitrary depth. A document contains a single root element that branches out into child elements, each of which can contain numerous children of its own, as depicted in the sample in diagram 3 below.
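A small hypothetical document illustrating those building blocks (root element, nested children, attributes and text nodes), parsed with ElementTree:

```python
import xml.etree.ElementTree as ET

sample = """
<library>                          <!-- single root element -->
  <book id="42" lang="en">         <!-- child element with attributes -->
    <title>XML at Work</title>     <!-- text node -->
    <authors>
      <author>J. Doe</author>      <!-- arbitrary depth of nesting -->
    </authors>
  </book>
</library>
"""
root = ET.fromstring(sample)
print(root.tag, root[0].get("id"), root[0].find("title").text)
```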
For decades, this strength of XML has helped us avoid expensive overhead, especially when exchanging messages/data across multiple transport protocol layers and their proprietary APIs. However, the fluidity of XML opens up opportunities for mistakes (e.g. duplicated data, missing data, typos, loose or changing schemas) and misuse by bad actors in various ways. Crystallising a set of internal rules for processing XMLs, specifying among other things the accepted sources, XML schemas, data boundaries, data patterns, document sizes, estimated processing times and error handling methods, is a great start for tightening the loose XML schema and optimising formats for the business need at hand.
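A sketch of how a few of those internal rules might be enforced before a document enters the pipeline. The size limit, the accepted root tags and the use of defusedxml (a third-party library that guards against entity-expansion and similar XML attacks) are illustrative assumptions, not a prescribed policy:

```python
import os
import defusedxml.ElementTree as DET

MAX_BYTES = 5 * 1024 * 1024            # internal rule: size of document
ACCEPTED_ROOTS = {"order", "invoice"}  # internal rule: accepted schemas

def admit(path):
    if os.path.getsize(path) > MAX_BYTES:
        raise ValueError(f"{path}: exceeds size boundary")
    tree = DET.parse(path)             # rejects dangerous constructs by default
    root = tree.getroot()
    if root.tag not in ACCEPTED_ROOTS:
        raise ValueError(f"{path}: unexpected root element <{root.tag}>")
    return tree
```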
6.0. Tools and Technologies
6.1. XML Parsers - DOM, SAX or ElementTree for Python?
When it comes to XML parsing in Python, there are several useful libraries to explore, depending on the capabilities required by the use case.
DOM. DOM is probably the easiest parser to use regardless of language. It parses the XML as a whole in one go and generates a tree-like structure (a hierarchy of objects) in memory. This approach is slow and has a significant memory footprint, especially for large files. Python supports three DOM APIs: i) xml.dom, ii) xml.dom.minidom, a minimal DOM implementation, and iii) xml.dom.pulldom, which supports partial DOM trees, as sketched below.
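A small sketch of the partial-DOM variant, xml.dom.pulldom: stream events, then expand only the nodes of interest into DOM subtrees. The file name and tag are hypothetical:

```python
from xml.dom import pulldom

doc = pulldom.parse("orders.xml")
for event, node in doc:
    if event == pulldom.START_ELEMENT and node.tagName == "order":
        doc.expandNode(node)          # build a DOM subtree for this element only
        print(node.toxml())
```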
SAX. SAX was developed by the Java developer community to overcome the limitations of DOM. It is an event-based streaming API that parses individual elements rather than the whole tree. Elements are processed top-down to extract data, and deeply nested structures may require multiple passes. Python ships with an API of SAX classes and functions (xml.sax).
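A minimal sketch of event-driven parsing with Python's xml.sax: the handler receives start, text and end events in document order instead of a whole tree. The XML content is a hypothetical example:

```python
import io
import xml.sax

class TitleHandler(xml.sax.ContentHandler):
    def __init__(self):
        super().__init__()
        self.in_title = False

    def startElement(self, name, attrs):
        self.in_title = (name == "title")   # fired when <title> opens

    def characters(self, content):
        if self.in_title:
            print(content)                  # text node inside <title>

    def endElement(self, name):
        self.in_title = False

xml.sax.parse(io.StringIO("<books><title>XML 101</title></books>"),
              TitleHandler())
```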
ElementTree. Python’s ElementTree is another option for parsing and processing XMLs. It is a lightweight processor capable of returning a tree for the whole XML or just a substructure (a single node of the tree). It is suitable for traversing XMLs of all sizes for validation, extracting data and making modifications.
Python and ElementTree are a great match for building complex parsing and data extraction routines over XMLs with deep structures and nested data, as shown in the sample in Diagram 4 above. ElementTree comes with a rich set of methods, functions and objects for traversing the whole (ElementTree) or part (Element) of an XML document, extracting text and attribute values, or modifying the tree. Data can be captured into Python dictionaries and converted into a table, CSV or Parquet file using a Pandas DataFrame, or updated directly into a database table, as sketched below.
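A sketch of that routine, assuming a hypothetical order document and field names: traverse nested elements with ElementTree, collect rows into dictionaries, then hand them to a Pandas DataFrame for CSV output:

```python
import xml.etree.ElementTree as ET
import pandas as pd

xml_doc = """
<orders>
  <order id="1"><customer>ACME</customer><item sku="A1" qty="2"/></order>
  <order id="2"><customer>Byte</customer><item sku="B7" qty="5"/></order>
</orders>
"""

rows = []
for order in ET.fromstring(xml_doc).iter("order"):
    for item in order.findall("item"):          # flatten nested items into rows
        rows.append({
            "order_id": order.get("id"),
            "customer": order.findtext("customer"),
            "sku": item.get("sku"),
            "qty": int(item.get("qty")),
        })

df = pd.DataFrame(rows)
df.to_csv("orders.csv", index=False)            # or df.to_parquet(...)
```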
Similarly, you can build customised XML validators and implement internal schemas for filtering threats and poorly formed XMLs if a ready-made tool is not available for the task. Alternatively, C++ with RapidXML is another powerful combination for heavy-duty XML parsing and data processing.
6.2. XML Storage and Query Engine
There are three (3) options for storing and managing XMLs in strategies that employ XML storage.
File systems. XMLs can be stored in a file system, but this method does not scale well and suffers from availability and manageability issues. Metadata management, a query engine and other administrative functions must be addressed separately in the XML data processing architecture.
RDBMS with XML support. A second option is to store XMLs in a relational database with XML support, such as SQL Server, MySQL, PostgreSQL or Oracle. In a relational database, XML data is stored in a column of a table or as a separate table. RDBMSs come with data management, a query engine and indexing, and are highly scalable and available. This simplifies the architecture, but relational systems have limitations when handling complex XML queries and larger document sizes.
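A hedged sketch of this option using PostgreSQL's built-in xml column type via psycopg2. The connection details and table are hypothetical; xpath() is PostgreSQL's server-side XPath function for querying inside stored XML:

```python
import psycopg2

conn = psycopg2.connect("dbname=demo user=demo")
with conn, conn.cursor() as cur:
    cur.execute("CREATE TABLE IF NOT EXISTS docs (id serial, body xml)")
    cur.execute("INSERT INTO docs (body) VALUES (%s::xml)",
                ("<order><total>99.50</total></order>",))
    # Evaluate an XPath expression against the stored XML on the server side.
    cur.execute("SELECT xpath('/order/total/text()', body) FROM docs")
    print(cur.fetchall())
```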
Native XML databases. The third option is a native XML database such as eXist-db, Sedna or BaseX, which comes with support for a rich set of XQuery and XPath expressions. This makes XML databases more flexible than relational databases for XML querying, but they require additional setup effort and are less mature in query processing architecture (e.g. pipelining, adaptive query optimisation).
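As a flavour of querying a native XML database, here is a hedged sketch of sending an XQuery to a local BaseX server over its REST interface. The port, database name ('factbook'), query and default credentials are all assumptions to verify against your own installation:

```python
import requests

# XQuery: countries with a population over 100 million (illustrative).
query = "for $c in //country where $c/@population > 100000000 return $c/name"
resp = requests.get(
    "http://localhost:8984/rest/factbook",
    params={"query": query},
    auth=("admin", "admin"),   # placeholder credentials; change in production
)
print(resp.text)
```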
7.0. Build and deploy XML Processing Functions
Regardless of the use cases, common functions to effectively validate and verify XML sources, capture and store XMLs, and query XMLs in native form are the minimum competence for a business. Many of these features come bundled with RDBMSs with XML support or native XML DBMSs.
Once the basics are sufficiently covered, address other utility functions to extract XML data, perform transformations, combine it with other sources, conduct various types of data analysis and enhance business insights through various outlets (e.g. analytics, ML, data products). These higher-level capabilities connect XML sources securely to existing data processing and management architectures while increasing the overall ROI of data investments.
Conclusion - Enhance existing data management and processing capabilities to support XML Sources
The surging need to extract data from XMLs suggests increased momentum in efforts to optimise the value and ROI of data investments. Businesses are shifting their focus beyond structured sources, pushing the boundaries of their end-to-end data processing and management architectures to extract value from semi-structured and unstructured sources using a variety of methods, including analytics and machine learning. Businesses rarely just retrieve data from a single source; they combine it with other heterogeneous sources and dynamic queries to surface correlations and enhance business understanding, uncovering a competitive edge in performance, innovation and industry leadership.
Managing XMLs is a valuable business function that continuously evaluates current demands and adds the necessary capabilities to i) extract, combine and restructure data from sets of XMLs of arbitrary size, safely and economically; ii) deliver ready XML data services to consumers in a timely manner (digital services, OLAP services, etc.); iii) combine XML data with multiple other sources to form new functional and organisational insights; and iv) employ an effective query processing model to extract and combine live data from heterogeneous sources and data models for dynamic queries. Build the parts needing heavy customisation from scratch, and use ready building blocks where solutions are available.
Some References:
XML Specifications; Cyber Attacks through XML Vulnerabilities; Python XML Libraries