|
In SQL: Processing XML's Complex Hierarchical Structures
Published: July 1, 2004
Published in TDAN.com July 2004
ReviewIn Part 1 of last issue’s article entitled In SQL: Using XML to Link Below the Root I described how ANSI SQL can transparently integrate native XML, and do it at a full hierarchical level that utilizes the hierarchical semantics in the data. This inherent hierarchical SQL processing became possible in the SQL-92 standard with the introduction of the Left Outer Join operation which can model multi-leg hierarchical structures. The Left Outer Join, also known simply as the Left Join, enables hierarchical data structures to be defined in SQL by placing the left join data argument over the right data argument and linking the two using the ON clause specification. This hierarchical semantics instructs the relational engine how to process the data structure hierarchically. In effect, hierarchical processing became a subset of relational processing with the Left Join operating on tree structures where the nodes are tables or elements. This produces a valid hierarchical result that is also ANSI SQL accurate and mathematically sound. Complex XML CapabilitiesThere are many complex and unconventional capabilities being introduced by XML that are not being currently addressed in commercial SQL-based products. In this Part 2, we will examine how ANSI SQL inherently goes beyond the standard hierarchical processing described in Part 1 to naturally process XML’s complex hierarchical capabilities. These include network structures, variable structures and structural transformations. These capabilities are not being fully addressed because they impose significant problems for SQL and its relational processing. These SQL capabilities are described in this article. It is felt that these XML capabilities are useful and it is important that SQL-based products naturally handle these new unconventional XML features and capabilities. You will notice how well these complex capabilities naturally integrate with and utilize SQL’s inherent hierarchical processing capabilities. Because of this, the examples used in this article can be processing relational, XML, legacy data or a combination. Network StructuresNetwork structures can be created by XML’s duplicate and shared elements when they occur in an XML data structure. Duplicate elements occur where the same (named) element type occurs in multiple locations in the XML data structure. Shared elements are created by the XML IDREF and ID attribute constructs which creates a logical alternative pathway in the physical XML data structure forming a network structure. Both of these unconventional structures are demonstrated in Figure 1 as the top two structure diagrams. The Addr node in these diagrams represents the duplicate and shared element. The dotted arrow in the XML Shared Element diagram in Figure 1 represents the logical IDREF pathway. The problem with these two structures is that they are both ambiguous for a nonprocedural hierarchical query language such as SQL because there is no single unambiguous access path to a specific Addr node location. What makes a nonprocedural hierarchical query language so powerful is that the hierarchical data structures they operate on are naturally unambiguous because they only have a single path to each node. In this way, the query can be specified unambiguously because each node in the structure has its own unique access path and specific semantics that can be utilized automatically to process the query.
The Alias feature of SQL allows the duplicate and shared element structures shown in Figure 1 to be data modeled as unambiguous hierarchical structures by using the optional AS keyword to rename the nodes (tables or elements). In this example, both of the ambiguous structures can use the same data modeling SQL to produce an unambiguous structure which maintains the original semantics of both input structures. This is possible because the semantics of both input structures with differing storage distribution of the Addr node are the same and produce the same result. With the unambiguously modeled structure shown in Figure 1, each specific Addr node can be unambiguously referenced by using its unique node name as a prefix to the field name. In this way each Addr node reference has its own logical path with its own hierarchical semantics. This allows the full nonprocedural hierarchical power of SQL to be easily controlled with simple intuitive queries. Avoiding the use of node name prefixes can be accomplished by using the SQL Alias feature on the SELECT list to rename the duplicate field names to unique names. The underlying XML access module’s logic can adapt transparently to the physical storage representation of the Addr element whether it is shared or a duplicated. Variable StructuresXML can define variable structures which allow for considerable variability of structure formats for a single definition of the structure. This means that from structure occurrence to occurrence or even within a single structure occurrence of a record or document, the structure format can vary. Usually with a varying structure, some piece of information from a higher level in the structure indicates how a variable substructure is to or has formed. This operation is very similar to COBOL’s variable structure capability which is controlled by its DEPENDING ON clause. With this indicator information, SQL data modeling which controls the building of each structure occurrence can define the appropriate substructures dynamically. An example of this is shown in Figure 2 below, where the ON clause is used to control the generation of the variable portion of the structure.
Figure 2 is a simple example where the generation of the data structure in memory can vary depending on the value of the field StoreType in the Store node. In this case, only one substructure was affected, but there is no limit to the number of variations which can be controlled separately by the ON clauses testing for any number of structure indicator values to control whether they are generated. These structure decision tests can be coded to duplicate the rules specified in XML DTD and Schemas for variable element generation which can become quite complex. These indicator values can be embedded in variable portions of the structure offering increased capabilities. This variable structure modeling in SQL also controls how variable physical structures are nonprocedurally navigated. The example in Figure 2 also demonstrates that ON clauses and WHERE clauses are very different in their operation. WHERE clauses remove entire rows, while ON clauses can remove isolated portions of rows which are replaced with null values. Looking at this operation from a hierarchical processing perspective, WHERE clauses can remove nodes above and below the node used as the qualifying filter, while ON clauses can only remove data from their qualification node downward. This means that the WHERE clause is logically applied after each record or row occurrence is complete, while the ON clause is applied as the record is being built. This is why the ON clause can dynamically control how the structured record occurrence is built. Structure TransformationIn Part 1, it was shown how the natural operation of node promotion controlled by which nodes were selected for output formed fragments that maintained their general structure and could be manipulated by joining as a contiguous structure. Figure 3 below demonstrated two fragments indicated by the dashed circles that encircle them. Unselected nodes are represented by a dashed box. These unselected nodes are not transferred when output causing lower level nodes to be promoted to the next ancestor on its path, retaining its basic structure and semantics For example, the Cust node which is an ancestor of the Item node controlled its existence before and after its parent node was removed.
By combining fragment processing and the Alias feature used in processing network structures above, it is possible to flexibly perform powerful structure transformations. This operates by specifying different fragments from the same structure by using the Alias capability to logically create multiple copies of the same structure. This enables different fragments in the same structure can be isolated and then independently manipulated. The Alias feature renames views allowing duplicate input structures to be logically defined so that references to them can be made unambiguously. The example in Figure 3 above demonstrates this by creating two separate and independent fragments from the StoreView view structure which are encircled. These now independent fragments are then recombined into a different structure by re-joining them. Notice that this is a hierarchical controlled join (described in Part 1) that precisely controls the hierarchical placement of the two fragments made possible by the Outer Joins’s ON clause. This is a simple example of structure transformation, multiple structures can each have multiple fragments extracted which can all be combined in any order. The fragments can be joined as they are needed. Structural transformations can also be defined in an SQL view for abstraction and reuse in larger structures. Heterogeneous Hierarchical StructuresIt is important and interesting to note that the XML advanced hierarchical capabilities described in this article, while brought into prominence by XML are not unique to XML. These are inherent hierarchical operations and are applicable to any logical or physical hierarchical structure. This means that they will operate seamlessly and consistently across logical (relational) and physical (XML and legacy data) hierarchically modeled structures under the control of SQL hierarchical processing. ConclusionThis article has demonstrated how ANSI SQL’s standard well known capabilities coupled with its inherent hierarchical processing naturally perform XML’s advance hierarchical operations. Part 1 in the last issue demonstrated how hierarchical processing is a valid subset of relational processing. This enables all of SQL’s basic hierarchical processing capability to integrate transparently with native XML at a full hierarchical processing level. This includes dynamic hierarchical controlled joining of structures and powerful hierarchical optimization seamlessly across relational and XML native data. Part 3 will examine hierarchical query semantics to see how and why standard SQL exactly follows it. I will also compare the natural ANSI hierarchical solution described here to proprietary solutions, XQuery used in SQL, and the SQL/XML Standard. I will also examine the synergy between hierarchical processing optimization and hierarchical SQL view support that produces a flexible and powerful, and at the same time, user friendly and intuitive nonprocedural hierarchical processor. For additional information on the capabilities described in this article, see www.adatinc.com. Any potential customers or SQL/XML developer/researchers interested in becoming involved with this transparent ANSI SQL native XML technology, please get in touch with mike@adatinc.com. Go to Current Issue | Go to Issue Archive Recent articles by Michael M. David
Michael M. David - Michael is is the founder of Advanced Data Access Technologies, Inc. Previously, he was the lead XML architect for NCR/Teradata, and served as their representative to the ANSI SQLX Group. Before that
he was a staff scientist for Teradata and designed high level multi-featured SQL utilities. From his earlier career, he has more than 25 years of experience researching and designing commercial
nonprocedural heterogeneous database hierarchical query processing products using flat, relational and hierarchical data. From this experience, he authored the book Advanced ANSI SQL Data Modeling
and Structure Processing, as well as numerous papers and articles on this subject. His research on hierarchical and relational systems and data integration has resulted in discoveries that led to the
development of an ANSI SQL transparent XML hierarchical processor prototype that integrates and processes relational and XML data at a full multipath hierarchical level. This also proves that
inherent hierarchical processing capability is possible in ANSI SQL.
Contact Mike at mike@adatinc.com, and read his blog at www.adatinc.com/blog1. |