In XML: Using SQL to Link Below the Root - Part 1
Published: April 1, 2004
Published in TDAN.com April 2004
SQL/XML integration is in a mess today. Every SQL vendor’s XML support is proprietary and still does not integrate seamlessly, fully, or satisfactorily. The current working standards solutions, XQuery (when used in SQL) and the SQL/XML Standard are very XML centric and do not integrate fully with SQL either. The reason most given for this SQL/XML integration problem is that relational data is flat and XML is hierarchical and never the twain shall meet. All of the current solutions have one thing in common, they flatten the XML data so that it can be integrated with the relational data. This is the lowest common denominator approach. It does have its problems such as introducing XML centric syntax and being very inefficient in memory and processing. Often overlooked, because hierarchical processing has been in the closet for many years, is the powerful hierarchical semantics in the XML that are being thrown away or at the very least ignored. This is meta information that can be inherently utilized by a nonprocedural processor to automatically process intuitive queries that internally involve complex hierarchical logic to process.
Now for the good news. ANSI standard SQL can perform hierarchically out of the box. All that is required is to model the join relationships hierarchically using the standard Left Outer Join operation, and the relational engine will process the modeled data hierarchically. In retrospect this should be obvious, because this is how SQL naturally operates. Just specify the data relationships and the data to be returned, and a general purpose SQL processor will produce the logically correct answer automatically based on the data relationships. This is why hierarchical data relationships produce hierarchical results with SQL.
You probably have your doubts and a number of questions about SQL performing hierarchically. Such as how can a flat row-set hold a hierarchical structure where multiple leg occurrences can vary in length from occurrence to occurrence, or how can a query hierarchically select data from one leg based on data values from a different leg of the structure, especially when rows are qualified a single row at a time by the relational engine. I will cover these basic hierarchical operations, and then the newer hierarchical operations made popular by XML. These new operations include node promotion and fragment processing. All of these capabilities including native XML integration can also be handled naturally by SQL’s inherent hierarchical processing.
These hierarchical operations are discussed only briefly because there is a lot of information to introduce. More detailed documentation about these advanced SQL hierarchical capabilities can be found at www.adatinc.com. Also included in this article is an example labeled Figure 1 that will visually help clarify and confirm what is being described. I will also cover a hierarchical data modeling problematic area where a hierarchical structure is referenced below its root instead of its root and discuss what the derived structure semantics of this operation are. This is an interesting problem which was solved to my satisfaction by analyzing how SQL’s inherent hierarchical data modeling and structure processing naturally handled it.
Basic Hierarchical Processing
ANSI SQL-92 introduced outer joins and a join syntax where tables are joined two at a time controlled by an ON clause which specifies the join criteria at each join point. Left Outer Joins preserve the left table data with null values when there is no matching right table data. This enables left table rows to exist without matching right table rows, but right table rows can not exist without matching left table rows. This is the precise nature of hierarchical data structures and is the building block principle for hierarchical structures.
The newer SQL-92 join operation can chain any series of Left Outer Joins together so that they can build a complete hierarchical structure from the left to right where hierarchically lower level tables are introduced from the right table argument. This is demonstrated in Figure 1. Each ON clause controls where its associated lower level table is joined to the upper structure being built. In this hierarchical model, each table represents a node. When a table is joined to an upper structure node that already has a leg, another leg is created so that hierarchical structures can contain multiple legs. SQL views can be used to store these SQL data modeling operations to define an entire structure for flexible use and reused as shown in Figure1. These data modeling Outer Join views can be constructed automatically from XML DTDs, Schemas or even COBOL File Definitions by utility operations.
The semantics of the Left Outer Join controls how the hierarchical structure materializes and operates. The hierarchical structure materializes so that it represents the correct hierarchically preserved data. It operates hierarchically since all the relations are hierarchical and each node is related hierarchically to every other node. This naturally enables the relational engine to perform hierarchically on the entire hierarchical structure. This makes hierarchical processing a valid subset of relational processing.
Because of the hierarchical data preservation, legs of hierarchical structures vary in length dynamically from data occurrence to occurrence. Variable length hierarchical legs are stored and operated on in the relational rowset. This is performed naturally because null values are inserted automatically to represent the missing data which keeps all of the different legs aligned consistently, even if they are intermixed in the row-set.
Hierarchical query filtering specified by the WHERE clause automatically performs global data filtering across entire multi-leg hierarchical data structures. This requires hierarchical logic to determine common ancestor nodes in order to control hierarchical ranges of sibling data comparisons. It may seem like the row-at-a-time qualification logic of the relational engine could not perform this complex qualification logic. But the restricted Cartesian product where all hierarchical combinations are generated enables this advanced hierarchical selection to be carried out automatically a single row at a time. The affect of this WHERE clause filtering is precisely how hierarchical query processing should operate. It operates across the entire structure filtering complex relationships not only within a leg, but also across sibling legs. This complex operation is performed non-procedurally making SQL a very powerful and flexible hierarchical processor.
Let me point out that the ON clause join criteria described earlier can also contain data filtering criteria. This filtering criteria is handled differently than on the WHERE clause. While the WHERE clause performs global filtering across the entire structure as described above, ON clause filtering is isolated to the node point in the structure where it was used. This causes filtering to be limited to a specific area of the structure and downward. This ON clause filtering operation very closely resembles XPath filtering.
As specified previously, the modeling of hierarchical structures using Left Outer Joins can be defined in ANSI SQL views and used as any table would. This also supports the ANSI SQL hierarchical controlled joining of hierarchical structures using their views. The ON clause is used to specify exactly where the lower structure is linked to the upper structure being built (shown in Figure 1). Hierarchical views, because of their hierarchical data preservation, can be hierarchically optimized to avoid accessing legs or portions of legs that are not needed for the current query, eliminating their columns from internal working row-sets. This is why nodes M and V are not access and are missing from the working set in Figure 1.This can greatly reduce unnecessary data explosions. Only data nodes selected or on a path to selected data nodes need to be accessed. This optimization can be determined at view invocation to also support ad hoc queries. This means that global views can be specified without incurring any overhead reducing the number of views necessary and increasing data view abstraction. This hierarchical optimization is not currently used in relational engines, the hierarchical data structures need to be recognized by the relational engine so that this optimization can be applied. This dynamic data structure extraction capability is described further later.
So ANSI SQL out of the box can define logical relational hierarchical structures, adding semantic value to the data. The relational engine can hierarchically process this hierarchical modeled data utilizing its hierarchical semantics. This inherent hierarchical processing allows SQL processing to operate as a tree structure model. This concept can be applied to the SQL modeling of physical hierarchical structures like XML so that the seamless integration of native XML can be performed at a hierarchical level where the hierarchical semantics are naturally utilized. This level of automatic hierarchical processing has so far been overlooked in XML processing because of the required processing complexity needed with some of XML’s advanced processing requirements. But when easy to specify nonprocedural languages like SQL inherently use the hierarchical semantics in the data, this complex multi-leg hierarchical processing can be easily specified and automatically performed.
Seamless XML Integration
With ANSI SQL able to perform hierarchically, how can XML document structures be integrated so that their access becomes seamless and transparent? This can be accomplished by physically modeling the XML structure as a hierarchical structure in an SQL view. This gives standard SQL processing direct and seamless access to every data item in the view, making XML support not only seamless but transparent. Whenever there is an internal request for the next root node occurrence of the XML mapped SQL view, the next XML document occurrence is retrieved and returned as a row-set that maps exactly to the hierarchical Left Outer Join view that models the XML structure. Because of the hierarchical optimization described earlier, only the XML data necessary is accessed, unnecessary paths are bypassed. The SAX XML access interface can be instructed to perform this type of optimized access.
To be totally XML enabled, SQL needs to be able to also output or publish native XML from the hierarchically processed relational result. The mapping of the relational result into element or attribute types can be controlled by specifying a choice of common predefined mappings such as Elements VS Attributes. The default hierarchical output structure can automatically be determined from the final runtime global structure created from the expansion of all the logical and physical hierarchical views into the single unified view that was used for processing. This unified view makes SQL’s processing of multiple hierarchical structures totally seamless and consistent for heterogeneous and disparate data sources. To accomplish this, the fully expanded Outer Join specification must be analyzed to determine the unified hierarchical structure formed at run time.
XML Introduced Capabilities
XML has introduced a number of advanced hierarchical capabilities such as node promotion and fragment processing. These are actually basic hierarchical operations, so SQL’s inherent hierarchical processing can perform them. Node promotion is where unselected nodes (nodes that have no data selected for output) are removed from the structure which causes children nodes to be moved up and around unselected ancestor nodes. This is a natural operation that occurs automatically in the relational result row-set as it is constructed from the working row-set by slicing out values that are not to be output or returned. The remaining structure still reflects the original structure. This is because the existing ancestor nodes will still exhibit the same control over existing descendent nodes after intermediate descendant nodes have been sliced out by data selection. The result set reflects this semantics.
For our use here, fragments are basically portions of a structure from anywhere in the structure that retain their basic shape and semantics. Fragments can be easily identified and controlled simply by which data items in the accessed structure are selected for output. This utilizes the action of node promotion described above applied to the structure as a whole. In Figure 1, the SELECT list forms fragments D/T and B in the lower relational structure and the join operation manipulates them by joining them to the L node in the upper structure specified by the ON clause. In this case, the L node is not selected, so node promotion pushes the fragments up under node X. This is very intuitive, the SQL programmer does not even need to be aware of all these different hierarchical processes occurring.
Linking Below the Root
In Figure 1, there is an interesting and advanced data modeling operation taking place probably unnoticed. This is the linking of the lower level structure based on a value below its root node instead of its root. This introduces interesting problems such as: What are the semantics of this operation and what are its limitations? Does it only bring in the referenced node and its descendents? Does it wrap the higher level structure portion under the node referenced making it the new root of the lower level structure? Maybe linking below the root of the lower structure should not be allowed at all. This was my thinking originally. But I found it very restrictive to only be able to reference the lower structure limited to its root.
Figure 1: SQL transparent integration of native XML utilizing inherent ANSI SQL hierarchical processing
So far in this article, SQL has performed hierarchical processing inherently, and in a perfectly valid and logical way that mirrored the way hierarchical nonprocedural languages operated. But these previous hierarchical languages had different solutions with restrictions for handling this lower level linking operation which were usually influenced to accommodate their particular internal operation.
A possible definitive solution to linking below the root node is to see how SQL would naturally handle this non root hierarchical joining of lower level structures. What I found after analyzing the results made perfect sense, was easy to understand, and was consistent in operation for all queries. There were no problem areas which needed to be mitigated by restrictions. Most important of all, it worked seamlessly and consistently across logical and physical structures.
The semantics for linking below the root is that the originally physical root remains the root for data modeling purposes, this makes sense because the original root still controls its descendents and the node referenced. This allows everything else to fall into place naturally. The data modeling semantics are that the lower level structure is linked from its original root to the upper structure being built to its link point specified in the ON clause. The lower level structure is filtered based on the ON clause join criteria applied to the actual lower link point(s). Node promotion automatically removes unselected nodes, which could include removing the original link-from node (L), lower structure root node (R) and/or link-to node (C) which is the case in the example in Figure 1. Other selected nodes that were cousins to the link point such as node B, meaning they were not directly related under the lower level link point but are still related because the original ancestor root (R), remains the root and will automatically be linked in correctly as shown in Figure 1.
You may wonder how logical structures such as relational view structures can be referenced under their root node before the root node is accessed. In other words, how is the lower level link point available before its root is accessed (the cart before the horse). It is not generally known, but the Outer Join syntax allows view execution nesting to occur. This is controlled by the placement of the ON clause. If the occurrence of an ON clause is delayed, its matching join operation execution is stacked until its matching ON clause is encountered. For example, as in A Left Join (B Left Join C ON B=C) ON A=C where the expanded lower level view is represented in the parentheses which are optional. This insertion of another ON clause allows the A=C ON clause condition to be deferred until table C had been accessed by the nested outer join. This happens automatically when the lower level structure view is expanded, causing the lower level structure to be materialized before being joined to the proceeding structure or table (node table A in this SQL example). This operation was introduced as an automatic way to protect against side effects of views.
While this ability to link to the lower level structure based on a non root naturally exists in SQL, this capability is a very useful operation. Its semantics are generically applicable to hierarchical data modeling in general, even outside of SQL. Another point to note is that hierarchical structure semantics are independent of their structure implementation.
From what has been shown, you can see that ANSI SQL data modeling and structure processing is complete, flexible, powerful, and capable of processing XML seamlessly and transparently while greatly limiting the need for XML centric syntax. This is because XML and SQL are integrated at a full hierarchical level utilizing SQL’s inherent and powerful hierarchical processing capability.
There are also no unnecessary limitations to the hierarchical processing, it is very orthogonal allowing hierarchical processing to be applied consistently and seamlessly across all of SQL processing. The hierarchical processing and therefore XML processing can be fully performed non-procedurally and dynamically. Other SQL/XML standards approaches are XML centric and procedural making programming required and ad hoc use not possible.
With hierarchical processing, the hierarchical results are also valid ANSI SQL. The results were created naturally within ANSI SQL and its inherent hierarchical processing. This is not another proprietary or SQL/XML standard put forth. This is a way XML can be integrated naturally today in ANSI SQL because hierarchical processing is an inherent subset of relational processing. This also means that relational and legacy data sources can also be integrated in the same seamless and unified way. Relational and legacy data can also perform the advanced hierarchical capabilities, node promotion and fragment processing, introduced by XML because these capabilities are standard hierarchical capabilities. More of these naturally supported advanced XML hierarchical capabilities, such as structure transformation, variable structures, and logical network structures will be described in Part 2 of this article.
Recent articles by Michael M. David
Michael M. David - Michael is is the founder of Advanced Data Access Technologies, Inc. Previously, he was the lead XML architect for NCR/Teradata, and served as their representative to the ANSI SQLX Group. Before that he was a staff scientist for Teradata and designed high level multi-featured SQL utilities. From his earlier career, he has more than 25 years of experience researching and designing commercial nonprocedural heterogeneous database hierarchical query processing products using flat, relational and hierarchical data. From this experience, he authored the book Advanced ANSI SQL Data Modeling and Structure Processing, as well as numerous papers and articles on this subject. His research on hierarchical and relational systems and data integration has resulted in discoveries that led to the development of an ANSI SQL transparent XML hierarchical processor prototype that integrates and processes relational and XML data at a full multipath hierarchical level. This also proves that inherent hierarchical processing capability is possible in ANSI SQL.
Contact Mike at firstname.lastname@example.org, and read his blog at www.adatinc.com/blog1.