|
The Ghost in the Machine
ANSI SQL Inherent Hierarchical Data Processing
Published: August 1, 2009 Now that hierarchical data structures are popular again because of XML, their full hierarchical processing is still being limited to flat two dimensional linear path processing by relational
processing. This will change when database professionals realize that ANSI SQL relational processing can now support full multipath nonlinear hierarchical processing.
Three decades ago, during the reign of hierarchical databases, full hierarchical data processing was commonplace. Hierarchical query languages were supporting the most complex and powerful multipath
queries that could be thrown at hierarchical databases. This ended with the advent of relational databases. Now that hierarchical data structures are popular again because of XML, their full
hierarchical processing is still being limited to flat two dimensional linear path processing by relational processing. This will change when database professionals realize that ANSI SQL relational
processing can now support full multipath nonlinear hierarchical processing. The physical or scientific proof that this technology exists and works can be difficult to find since relational
processing was not designed to support full hierarchical processing and could be hiding in plain site or buried deep in the relational engine. This article will find it and expose it.
Relational Hierarchical Processing is PossibleMost database professionals still do not know or believe that ANSI SQL can now inherently perform hierarchical data processing. I still see this belief and supposed fact being mentioned often as the reason for the impedance mismatch problem preventing full relational/XML seamless integration. The fact is ANSI SQL can perform full hierarchical processing directly out of the box today. This makes relational data and XML data directly compatible at a full hierarchical level. More amazingly, this is a full multipath navigationless hierarchical processing (schema-free query) that far exceeds today’s commonly limited single path linear processing. This processing can bring back powerful full multipath nonprocedural query processing again, this time naturally in ANSI SQL. This advanced ANSI SQL hierarchical processing also allows nontechnical users to once again specify any multipath query without having to know or specify the data structure. This increases the number of queries possible many times over, and the additional semantics natural existing between the hierarchical pathways dynamically increases the value of the database data as needed. There is good reason for skepticism about ANSI SQL’s inherent full multipath hierarchical processing. After all, how is it possible that this powerful capability, very advanced beyond today’s standards, suddenly appeared in SQL with no design or additional code added for its support? The required hierarchical processing logic required to support this capability certainly seems too powerful, seamless, and well thought out for it to appear spontaneously. Hierarchical Data Modeling and Data PreservationFull hierarchical data modeling and hierarchical data preservation was introduced unknowingly in the SQL-92 standard with the introduction of the Outer Join operation, specifically the Left Outer Join (or Left Join), which preserves unmatched rows on the left side of the operation. This preserves the data rows on the left side of the operation if there is no matching row on the right side. This results in a powerful basic hierarchical operation of modeling the left side data over the right side data. Additionally with the Left Join operation, another very important addition was made that completed the hierarchical modeling processing capability. This is the ON clause which replaces the WHERE clause for joining tables based on matching relationships and is specified locally at each specific join point affecting only the join point where it is used. The WHERE clause is still necessary to specify global data filtering that affects the entire structure. In hierarchical terminology, the ON clause specifies the hierarchical node linkage. This gives the Left Join and the ON clause combination extremely precise and flexible control to model and define full multipath hierarchical structures and process them hierarchically. The Left Join/ON clause combination supplies the hierarchical data modeling syntax and its associated hierarchical data processing semantics for controlling the hierarchical data processing and preservation principles. Figure 1 is a hierarchical view that uses a Left Join sequence to define the hierarchical structure shown in Figure 2. The entire hierarchical structure defined by the ANSI SQL standard hierarchical view will operate fully hierarchically automatically when processed by SQL. A close examination of the ON clause use in Figure1 will show how the pathways are extended and how new pathways are created to model the hierarchical structure shown in Figure 2. ![]() ![]() Lowest Common Ancestor (LCA) LogicThe hierarchical data processing described so far is quite surprising to have occurred with no hierarchical processing design or forethought behind it. But this is only the beginning. Full hierarchical processing also contains and requires advanced capabilities that occur across multiple pathways supported automatically by semantics that naturally occur across the pathways. These different capabilities are used in two ways. One way is for selecting data from one pathway based on data in another pathway. The other way is needed to process compound WHERE data filtering logic that references different pathways. Each of these reasons uses the Lowest Common Ancestor (LCA) node between its pathways to control the range of operation under it to assure the result is meaningful for the hierarchical query.The locating of the LCA node and the controlled range of the processing of the pathways under the LCA node was originally performed in physical hierarchical structures by hierarchical tree walking requiring many navigational steps. The finding of the LCA node required an algorithm and much research went into locating it efficiently. Compounding this processing is that LCA nodes can be nested depending on the number of different pathways requiring accessing in a single query. The selection of qualified data under an LCA from another pathway is limited to only the data occurrences under the LCA node data occurrence. When the WHERE clause is performing compound decision logic, all of the combinations of the data in multiple paths being tested for a single match is limited to the data under the LCA node occurrence. In both of these LCA uses, the LCA node data occurrence limits the processing to keep the data meaningfully related. This LCA technology has not made it into commercial database software today except for its natural occurrence in ANSI SQL. When relating two pieces of data in a hierarchical structure, the lower in the hierarchy they are related, the stronger the relationships are. For this reason, by default unless overridden, the convention is to always use the Lowest Common Ancestor node as the highest level to restrict the testing range to. This type of hierarchical association measurement is also known as hierarchical proximity. Examples of Determining LCAThis section has additional examples of the two types of LCA usage using Figure 2 above. The examples use the ViewX view defined in Figure 1 and shown in Figure 2. Starting with a data selection example: SELECT E FROM ViewX WHERE C=2. The LCA in this case is node A found by following nodes C and E up to their Lowest Common Ancestor. Let’s look at a WHERE qualification logic LCA as in: SELECT A FROM ViewX WHERE F=2 AND G=4. In this case the LCA is node E.What happens if we combine the last two types of LCA use as in: SELECT C FROM ViewX WHERE F=2 AND G=4. In this example, the LCA’s are nested. Starting inside out, the WHERE qualification logic has determined node E is the LCA. Continuing outward, the selecting data logic has chosen node A as the LCA using node C and the nested LCA node E in its LCA determination. Both of these LCA uses can generate nested or multiple LCA by themselves. For example, in the case of data selection: SELECT D, G FROM ViewX WHERE C=2. Node combinations D and C produce the LCA node B, and node combinations C and G produce the LCA node A. Similarly, WHERE C=2 AND D=4 OR G=8, would also produce nested LCA nodes of B and A. Nodes C and D produce LCA node B. Then LCA node B and node G produce LCA node A. It is worth mentioning that the OR conditional processing in the WHERE operation directly above can dynamically affect which LCAs are used during processing depending on which side of the OR is true. This also means that both sides of the OR operation always needs testing and operating on depending on the result even if the first OR condition is true. This is necessary in hierarchical processing logic and ANSI SQL performs this operation correctly in its Cartesian processing. For a more detailed look at this OR conditional hierarchical processing in SQL, see my blog entry. www.adatinc.com/blog1/?p=19 The above examples demonstrate the logic required to process multipath hierarchical processing. You can also see why procedural navigation would be impractical and error prone performing these complex operations. Academic projects have been attempting to add LCA functionality to XQuery using externally supplied functions. This will have significant complexity limitations and require knowledge of how to use with each different query requiring a different LCA programming. How LCA Logic Works in Relational ProcessingThe above complex LCA logic being performed in relational processing is hard to explain when it was not designed into SQL and because the LCA concept is totally foreign to relational processing. Another problem is that while the LCA logic was originally processed by tree walking on physical hierarchical structures, relational processing is basically processed a row at a time. The solution to the relational processing of the LCA logic can be found in the relational Cartesian product internal operation which is the brain of relational processing. This has also made the current problem of locating LCAs efficiently during processing a non issue.The relational Cartesian product operates by generating all combinations of data around the join points. With hierarchical relationship modeling using the Left join, these Cartesian product join points represent the LCA nodes. The combination of the data in Cartesian products causes replicated data to be generated. This properly generated data with its replicated data allows the two previously discussed types of LCA logic processing to be naturally simulated a row at a time in standard ANSI SQL processing when it is processing hierarchical structures modeled by the Left Join operation. Complex LCA nesting is automatically handled correctly by the algorithm used by the Cartesian product. The Ghost in the MachineMultipath hierarchical query processing (known academically as LCA Query or Schema-free Query) which is inherent in ANSI SQL can be isolated and explained logically. The fact that this internally complex and powerful hierarchical operation does work automatically and fully in ANSI SQL without any design or forethought being put into its operation is still quite amazing.Multidimensional hierarchical structure processing is more complex than two-dimensional flat relational data processing. But what accounts for the additional ability to perform this more complex hierarchical logic in standard SQL? It is inherent in the Cartesian product operation. This totally unexpected hierarchical LCA processing is “the ghost in the machine.” But the LCA operation has now been located and its processing explained in this article and can now be relied upon to be hierarchically accurate. This also proves that hierarchical processing is a valid subset of relational processing opening the door to the next generation of advanced hierarchical XML processing. ConclusionIn the same way it is hard to visualize a multidimensional hierarchical processing occurring in two dimensional relational SQL, it is also difficult to see the hierarchical result in SQL’s two-dimensional result set. But it is reassuring to know that the result is not just relationally accurate, it is also hierarchically correct. This makes sense because hierarchical processing is a subset of relational processing that goes through a more restricted specialized hierarchical processing by limiting the join operations to hierarchical Left Outer Joins.The hierarchical structure has not been lost in the two-dimensional relational rowset. It can be hierarchically mapped using the extracted structure metadata in the hierarchical Left Outer Join data modeled SQL view to transform the data result into a physical hierarchical structure such as XML. This can be utilized in ANSI SQL to automatically support transparent XML hierarchical processing, input and output. If you want to understand hierarchical processing principles and LCA processing more fully, access the article and interactive ANSI SQL Transparent XML Hierarchical Processor demo below. Navigationless Database XML: Hierarchical Data Processing article ANSI SQL Transparent XML Hierarchical Processor demo Go to Current Issue | Go to Issue Archive Recent articles by Michael M. David
Michael M. David - Michael is is the founder of Advanced Data Access Technologies, Inc. Previously, he was the lead XML architect for NCR/Teradata, and served as their representative to the ANSI SQLX Group. Before that
he was a staff scientist for Teradata and designed high level multi-featured SQL utilities. From his earlier career, he has more than 25 years of experience researching and designing commercial
nonprocedural heterogeneous database hierarchical query processing products using flat, relational and hierarchical data. From this experience, he authored the book Advanced ANSI SQL Data Modeling
and Structure Processing, as well as numerous papers and articles on this subject. His research on hierarchical and relational systems and data integration has resulted in discoveries that led to the
development of an ANSI SQL transparent XML hierarchical processor prototype that integrates and processes relational and XML data at a full multipath hierarchical level. This also proves that
inherent hierarchical processing capability is possible in ANSI SQL.
Contact Mike at mike@adatinc.com, and read his blog at www.adatinc.com/blog1. |