Introduction to VTD XML
May 6, 2005 -- VTD-XML is a new, open-source, non-validating, non-extractive eXtensible Markup Language (XML) processing Application Programming Interface (API) written in Java. It is positioned as an alternative to the Simple API for XML (SAX) and the Document Object Model (DOM) that does not force you to trade processing performance for usability: the VTD-XML parser is faster than DOM and easier to use than SAX. Unlike other XML processing technologies, VTD-XML is designed to be random-access capable without incurring excessive resource overhead.
An important optimization feature of VTD-XML is non-extractive tokenization. Internally, VTD-XML keeps the XML message in memory intact and un-decoded, and represents each token exclusively by its starting offset and length. Tokenization in VTD-XML is based on the Virtual Token Descriptor (VTD) core binary encoding specification. A VTD record is a 64-bit integer that encodes the length, starting offset, type, and nesting depth of a token in the XML document.
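To make the idea of a 64-bit record concrete, here is a minimal Java sketch that packs the four fields named above into a single long and reads them back. The class name VtdRecordSketch, the field widths, and the bit positions are assumptions chosen for illustration; they are not presented as the layout defined by the VTD specification.

```java
// Illustrative sketch only: the bit layout below is an assumption made for
// this example, not the layout defined by the VTD specification.
final class VtdRecordSketch {
    // Assumed layout, from high bits to low bits:
    // [type: 4][depth: 8][length: 20][unused: 2][offset: 30]
    static final int OFFSET_BITS = 30;
    static final int LENGTH_BITS = 20;
    static final int DEPTH_BITS = 8;
    static final int TYPE_BITS = 4;
    static final int LENGTH_SHIFT = 32;
    static final int DEPTH_SHIFT = 52;
    static final int TYPE_SHIFT = 60;

    private static long mask(int bits) {
        return (1L << bits) - 1;
    }

    // Validates that a field value fits in its bit width, then widens it to long.
    private static long field(String name, int value, int bits) {
        if (value < 0 || value > mask(bits)) {
            throw new IllegalArgumentException(
                name + " does not fit in " + bits + " bits: " + value);
        }
        return value;
    }

    // Packs the four token properties into one 64-bit record.
    static long pack(int type, int depth, int length, int offset) {
        return (field("type", type, TYPE_BITS) << TYPE_SHIFT)
             | (field("depth", depth, DEPTH_BITS) << DEPTH_SHIFT)
             | (field("length", length, LENGTH_BITS) << LENGTH_SHIFT)
             | field("offset", offset, OFFSET_BITS);
    }

    static int type(long rec)   { return (int) ((rec >>> TYPE_SHIFT) & mask(TYPE_BITS)); }
    static int depth(long rec)  { return (int) ((rec >>> DEPTH_SHIFT) & mask(DEPTH_BITS)); }
    static int length(long rec) { return (int) ((rec >>> LENGTH_SHIFT) & mask(LENGTH_BITS)); }
    static int offset(long rec) { return (int) (rec & mask(OFFSET_BITS)); }

    public static void main(String[] args) {
        // A token of (hypothetical) type 0 at nesting depth 2,
        // 5 bytes long, starting at byte offset 17 of the document.
        long rec = pack(0, 2, 5, 17);
        System.out.println("type=" + type(rec) + " depth=" + depth(rec)
            + " length=" + length(rec) + " offset=" + offset(rec));
    }
}
```

Because every record is a primitive long of the same size, a parser built this way can keep all of its tokens in a plain long[] instead of allocating one object per token.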
Because the records are constant in length, memory buffers can be allocated in bulk to store them. This avoids creating the multitude of string and node objects usually associated with other XML processing technologies. As a result, VTD-XML greatly reduces both memory usage and object creation cost, which leads to significantly higher processing performance. For example, on a 1.5 GHz Athlon machine, VTD-XML delivers random access at a performance level of 25 to 35 MB/sec, outperforming most SAX parsers with null content handlers. An in-memory VTD-XML document typically consumes only 1.3 to 1.5 times the size of the XML document.
VTD-XML provides several benefits for software developers. Suppose, for example, that you need a processing model to start work on a project involving XML. DOM is slow and consumes too much memory, particularly for large documents. SAX is difficult to use, especially for XML documents with complex structures. VTD-XML is therefore an attractive option, as it does not force you to trade processing performance for usability, and its random-access capability comes with high performance. Even though SAX is fast due to its forward-only nature, it does not suit every situation.
In some situations, you must do a lot of buffering to extract the data you need, while in others, you may have to repeat SAX parsing on the same document multiple times. Either way, SAX programming usually results in ugly and unmaintainable code, while the performance benefit over DOM is not always significant. VTD-XML lets you achieve ease of use and high performance at the same time, and its performance benefit over DOM is substantial.
To process an XML document with VTD-XML, whether it comes from disk or via HTTP, perform the following steps. First, find out the length of the XML document, allocate a byte array large enough to hold it, and read the entire document into memory. Next, create an instance of VTDGen and assign the byte array to it using setDoc(). Finally, call parse(boolean ns) to generate the parsed XML representation. When ns is set to true, subsequent document navigation is namespace aware. If parsing succeeds, you can retrieve an instance of VTDNav by calling getNav().
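The first of these steps needs nothing beyond the Java standard library, so it can be sketched on its own. The class name XmlLoaderSketch and the method readWholeFile are hypothetical names introduced for this example. The remaining steps are shown only as comments, because they require the VTD-XML library on the classpath.

```java
import java.io.DataInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;

// Hypothetical helper for step 1: load the whole XML document into memory.
final class XmlLoaderSketch {

    static byte[] readWholeFile(File f) throws IOException {
        long len = f.length();                 // find out the document length
        if (len > Integer.MAX_VALUE) {
            throw new IOException("Document too large for one byte array: " + len);
        }
        byte[] doc = new byte[(int) len];      // allocate a buffer of that size
        try (DataInputStream in = new DataInputStream(new FileInputStream(f))) {
            in.readFully(doc);                 // keeps reading until the buffer is full
        }
        return doc;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java XmlLoaderSketch <file.xml>");
            return;
        }
        byte[] doc = readWholeFile(new File(args[0]));
        System.out.println("Read " + doc.length + " bytes");

        // Remaining steps, as described in the text (VTD-XML library required):
        //   1. create a VTDGen instance and hand it the array with setDoc(doc)
        //   2. call parse(true) for namespace-aware parsing
        //   3. if parsing succeeds, obtain a VTDNav from getNav()
    }
}
```

Reading the document in one piece matters here because the parsed representation refers back to byte offsets in this same array.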
At the start of navigation, the cursor of the VTDNav instance points at the root element of the XML document. You can use one of the overloaded versions of the toElement() method to move the cursor manually to different positions in the hierarchy. In the form toElement(int direction), the method takes an integer that indicates the direction in which the cursor moves. The six possible values of this integer are defined as class variables of VTDNav: ROOT, PARENT, FIRST_CHILD, LAST_CHILD, NEXT_SIBLING, and PREV_SIBLING. Each has a corresponding short form: R, P, FC, LC, NS, and PS. The toElement() method returns a boolean value indicating the status of the operation: true when the cursor has moved successfully, and false when the requested location does not exist (for example, the first child of a childless element), in which case the cursor does not move.
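The cursor contract described above (move and return true, or stay put and return false) can be illustrated with a small self-contained toy model. ElementCursorSketch is a hypothetical class written for this article, not part of VTD-XML. It stores elements in document order with a parent index for each, and the numeric values of its direction constants are arbitrary assumptions.

```java
// Toy model of a single-cursor element hierarchy; not the VTD-XML implementation.
final class ElementCursorSketch {
    // Direction constants named after those in the text; values are assumptions.
    static final int ROOT = 0, PARENT = 1, FIRST_CHILD = 2,
                     LAST_CHILD = 3, NEXT_SIBLING = 4, PREV_SIBLING = 5;

    private final String[] names;  // element names in document order
    private final int[] parent;    // parent[i] = index of i's parent, -1 for the root
    private int cursor = 0;        // navigation starts at the root element

    ElementCursorSketch(String[] names, int[] parent) {
        this.names = names;
        this.parent = parent;
    }

    // Scans from 'from' in steps of 'step' for the first element whose parent
    // is 'wantedParent'; returns its index, or -1 if there is none.
    private int find(int from, int step, int wantedParent) {
        for (int i = from; i >= 0 && i < parent.length; i += step) {
            if (parent[i] == wantedParent) return i;
        }
        return -1;
    }

    // Moves the cursor and returns true, or leaves it untouched and returns false.
    boolean toElement(int direction) {
        int target;
        switch (direction) {
            case ROOT:         target = 0; break;
            case PARENT:       target = parent[cursor]; break;
            case FIRST_CHILD:  target = find(cursor + 1, +1, cursor); break;
            case LAST_CHILD:   target = find(parent.length - 1, -1, cursor); break;
            case NEXT_SIBLING: target = find(cursor + 1, +1, parent[cursor]); break;
            case PREV_SIBLING: target = find(cursor - 1, -1, parent[cursor]); break;
            default: throw new IllegalArgumentException("unknown direction " + direction);
        }
        if (target < 0) return false;  // no such element: cursor does not move
        cursor = target;
        return true;
    }

    int currentIndex()   { return cursor; }
    String currentName() { return names[cursor]; }

    public static void main(String[] args) {
        // <orders><order><item/><item/></order><order/></orders>
        ElementCursorSketch c = new ElementCursorSketch(
            new String[] {"orders", "order", "item", "item", "order"},
            new int[]    {-1,       0,       1,      1,      0});
        if (c.toElement(FIRST_CHILD)) {
            do {
                System.out.println("child of root: " + c.currentName()
                    + " at index " + c.currentIndex());
            } while (c.toElement(NEXT_SIBLING));
        }
    }
}
```

The do/while loop in main shows the idiom this contract encourages: keep moving to the next sibling until the method reports that there is none.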
The method getAttrVal(String attrName) retrieves the value of the named attribute of the element at the cursor position. Similarly, the getText() method retrieves the text content of the cursor element. In addition, if namespace awareness was turned on during parsing, you can use the toElementNS() and getAttrValNS() methods to navigate the document hierarchy in a namespace-aware fashion. AutoPilot is the other mode of navigation. An instance of AutoPilot can automatically move the cursor through the node hierarchy in document order. To use AutoPilot, first call its constructor, which accepts an instance of VTDNav as the input. Then call the selectElement() or selectElementNS() method to specify which descendant elements to select. Once this is done, each call to the iterate() method moves the cursor to the next matching element.
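The select-then-iterate pattern can be illustrated with a self-contained toy. AutoPilotSketch below is a hypothetical stand-in written for this article, not the VTD-XML class: it walks an array of element names in document order and uses each element's nesting depth to decide where the descendants of the starting element end. As a simplifying assumption, it visits only descendants and never tests the starting element itself.

```java
// Toy model of document-order iteration over matching descendants.
// Not the VTD-XML AutoPilot; names and behavior are assumptions of this sketch.
final class AutoPilotSketch {
    private final String[] names;  // element names in document order
    private final int[] depth;     // nesting depth of each element
    private final int start;       // element whose descendants are visited
    private String wanted;         // element name chosen with selectElement()
    private int cursor;            // current position of the iteration

    AutoPilotSketch(String[] names, int[] depth, int start) {
        this.names = names;
        this.depth = depth;
        this.start = start;
        this.cursor = start;
    }

    // Chooses the element name to look for and restarts the iteration.
    void selectElement(String name) {
        wanted = name;
        cursor = start;
    }

    // Moves the cursor to the next matching descendant and returns true,
    // or returns false (cursor unchanged) when no further match exists.
    boolean iterate() {
        if (wanted == null) {
            throw new IllegalStateException("call selectElement() first");
        }
        // Descendants of 'start' are the following elements with greater depth,
        // up to the first element that is not nested inside it.
        for (int i = cursor + 1; i < names.length && depth[i] > depth[start]; i++) {
            if (names[i].equals(wanted)) {
                cursor = i;
                return true;
            }
        }
        return false;
    }

    int cursor() { return cursor; }

    public static void main(String[] args) {
        // <orders><order><item/><item/></order><order><item/></order></orders>
        String[] names = {"orders", "order", "item", "item", "order", "item"};
        int[] depth    = {0,        1,       2,      2,      1,       2};
        AutoPilotSketch ap = new AutoPilotSketch(names, depth, 0);
        ap.selectElement("item");
        while (ap.iterate()) {
            System.out.println("matched element at index " + ap.cursor());
        }
    }
}
```

The while loop is the whole client-side protocol: there is no callback to implement and no state to carry between matches beyond the cursor itself.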
Now let us look at some of the properties that set VTD-XML apart from similar XML APIs, such as DOM and XMLCursor. The hierarchy of VTD-XML consists exclusively of element nodes. This is very different from DOM, which treats every node, whether it is an attribute node or a text node, as a part of the hierarchy. In VTD-XML, every instance of VTDNav has only one cursor. The cursor can be moved back and forth in the hierarchy, but you cannot duplicate it. However, you can temporarily save the location of the cursor on a global stack. VTDNav has two stack access methods: push(), which saves the cursor state, and pop(), which restores it. For example, suppose you are somewhere in the element hierarchy and want to visit a different area of the document and then continue from where you left off. To accomplish this, first push() the current location onto the stack. After moving the cursor to a different part of the document, you can quickly jump back to the saved location by popping it off the stack.
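The save-and-restore idea can be shown with a few lines of self-contained Java. CursorStackSketch is a hypothetical toy, not the VTDNav implementation: it reduces the cursor state to a single element index (an assumption; a real navigator may need to save more than that), and having pop() return false on an empty stack is a design choice of this sketch.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Toy model of a single cursor with a stack of saved positions.
final class CursorStackSketch {
    private int cursor = 0;                                   // current element index
    private final Deque<Integer> saved = new ArrayDeque<>();  // saved cursor positions

    void moveTo(int elementIndex) { cursor = elementIndex; }
    int position()                { return cursor; }

    // Saves the current cursor position on top of the stack.
    void push() {
        saved.push(cursor);
    }

    // Restores the most recently saved position.
    // Returns false, leaving the cursor alone, if nothing was saved.
    boolean pop() {
        if (saved.isEmpty()) return false;
        cursor = saved.pop();
        return true;
    }

    public static void main(String[] args) {
        CursorStackSketch nav = new CursorStackSketch();
        nav.moveTo(3);   // somewhere in the hierarchy
        nav.push();      // remember this location
        nav.moveTo(42);  // go and inspect another part of the document
        nav.pop();       // jump straight back to the saved location
        System.out.println("cursor restored to index " + nav.position());
    }
}
```

Since there is only one cursor per navigator, a stack like this is what stands in for the multiple node references that a DOM program would simply hold in variables.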
One aspect of VTD-XML that distinguishes it from other XML processing APIs is its non-extractive tokenization based on the Virtual Token Descriptor. Non-extractive parsing is what enables VTD-XML's processing and memory efficiency, and it shows up in the API in the following ways. Most of the member methods of VTDNav, such as getAttrVal(), getCurrentIndex(), and getText(), return an integer. This integer is the index of the VTD record that describes the token requested by the caller. After parsing, VTD-XML produces a linear buffer filled with VTD records. Because all the VTD records have the same length, you can access any record in the buffer if you know its index. Because the records are not objects, they are addressed by index rather than by pointer. When a VTDNav method has no meaningful value to return, it returns -1, which is more or less equivalent to a NULL pointer in DOM.
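The following self-contained sketch illustrates this index-based style. TokenIndexSketch is a hypothetical toy, not VTD-XML code: the token table is filled in by hand rather than by a parser, lookups compare raw bytes in the original document, and a String is created only when the caller asks for one. The method names find and text are assumptions of this example.

```java
import java.nio.charset.StandardCharsets;

// Toy model of non-extractive tokens: each token is just an (offset, length)
// pair pointing into the undecoded document bytes.
final class TokenIndexSketch {
    private final byte[] doc;     // the XML document, kept intact in memory
    private final int[] offsets;  // starting offset of each token
    private final int[] lengths;  // length in bytes of each token

    TokenIndexSketch(byte[] doc, int[] offsets, int[] lengths) {
        this.doc = doc;
        this.offsets = offsets;
        this.lengths = lengths;
    }

    // Returns the index of the first token whose bytes equal 'wanted',
    // or -1 if there is no such token. No token is decoded during the search.
    int find(String wanted) {
        byte[] w = wanted.getBytes(StandardCharsets.UTF_8);
        for (int i = 0; i < offsets.length; i++) {
            if (lengths[i] == w.length && matchesAt(offsets[i], w)) return i;
        }
        return -1;
    }

    private boolean matchesAt(int offset, byte[] w) {
        for (int k = 0; k < w.length; k++) {
            if (doc[offset + k] != w[k]) return false;
        }
        return true;
    }

    // Decodes one token into a String only when the caller needs it.
    String text(int index) {
        return new String(doc, offsets[index], lengths[index], StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] doc = "<a id=\"7\">hi</a>".getBytes(StandardCharsets.UTF_8);
        // Hand-built token table: "a", "id", "7", "hi"
        TokenIndexSketch t = new TokenIndexSketch(
            doc, new int[] {1, 3, 7, 10}, new int[] {1, 2, 1, 2});

        int i = t.find("id");
        if (i != -1) {  // -1 plays the role of a null result
            System.out.println("token " + i + " is \"" + t.text(i) + "\"");
        }
        System.out.println("missing token -> " + t.find("nope"));
    }
}
```

Checking the returned index against -1 before using it is the counterpart of a null check in an object-based API.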
Visit http://www.xml-training-guide.com for a complete introduction to XML programming. Learn XML, DTD, Schema, XSLT, SOAP, and other related technologies. To access the above information online, please visit http://www.xml-training-guide.com/vtd-xml.html