This package holds two fast "XML Processors" as defined in the XML 1.0 specification; these are parsers (validating and non-validating) with some supporting classes and interfaces. The parsers support the SAX 1.0 API, with extensions that expose DTD and lexical information useful for advanced applications (such as XML editors) which sometimes need such data. They are highly conformant to the XML 1.0 specification, and support a large number of character encodings beyond the UTF-8 and UTF-16 support that is required of all XML processors.

Compatible Extensions to SAX

For most purposes, the parsers in this package will be viewed as generic SAX parsers. However, there are a few features which may be of note, and are not supported by all SAX parsers.

Callbacks for Parser and ValidatingParser

All SAX parsers provide methods to set customization options, such as entity resolvers and error handlers. These parsers also provide methods to examine those option settings. This facilitates examining the configuration of such a parser, as well as intercepting and chaining handlers for parsing events.
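The intercept-and-chain pattern described above can be sketched as follows. This is an illustration only: it uses the JDK's bundled SAX2 XMLReader (which also offers getErrorHandler) as a stand-in rather than this package's own parser classes, and the CountingErrorHandler class and parseCounting method are hypothetical names introduced for the example.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical wrapper: counts each event, then forwards it to the handler it replaced. */
final class CountingErrorHandler implements ErrorHandler {
    private final ErrorHandler next; // previously installed handler; may be null
    int warnings, errors, fatals;

    CountingErrorHandler(ErrorHandler next) { this.next = next; }

    @Override public void warning(SAXParseException e) throws SAXException {
        warnings++;
        if (next != null) next.warning(e);
    }

    @Override public void error(SAXParseException e) throws SAXException {
        errors++;
        if (next != null) next.error(e);
    }

    @Override public void fatalError(SAXParseException e) throws SAXException {
        fatals++;
        if (next != null) next.fatalError(e);
        else throw e; // nothing to chain to, so stop the parse here
    }

    /** Parses the text with a chained handler installed and returns the counts. */
    static CountingErrorHandler parseCounting(String xml) {
        try {
            XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            reader.setErrorHandler(new DefaultHandler()); // stands in for an earlier setting

            // Examine the current setting, wrap it, and install the wrapper in its place.
            CountingErrorHandler counter = new CountingErrorHandler(reader.getErrorHandler());
            reader.setErrorHandler(counter);
            try {
                reader.parse(new InputSource(new StringReader(xml)));
            } catch (SAXException alreadyCounted) {
                // The chained handler saw the problem before the parse was aborted.
            }
            return counter;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Calling CountingErrorHandler.parseCounting with a document whose tags do not nest properly should leave the fatals counter at one, while the original handler still receives the same event.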

For better support of DOM, and ID-based element retrieval (such as that used in XSL and in XML Linking), the attribute lists passed to the DocumentHandler.startElement method implement the AttributeListEx interface. Similarly, additional DTD information (exposed by DOM for editor support) is accessible through an extended DtdEventListener interface.

An optimization that may be of interest is usable only with the nonvalidating parser and applies only to standalone documents. When the input document is a valid standalone document, it can be processed more quickly by not reading external parameter entities and by not normalizing or defaulting attributes. The optimization is not enabled by default, because some validity errors will then be misreported as fatal well-formedness errors. Also, applications are generally written to expect normalized attribute values, and those values may not be correctly normalized when the document is not in fact valid.

If the document handler is a LexicalEventListener, information such as comments and CDATA section delimiters is provided.
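As a rough illustration of what lexical information looks like to an application, the sketch below records comments and CDATA section boundaries. It does not use LexicalEventListener itself; it assumes the analogous SAX2 interface org.xml.sax.ext.LexicalHandler (via DefaultHandler2) that ships with the JDK, and the LexicalEvents class name is hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.ext.DefaultHandler2;

/** Hypothetical collector for comments and CDATA section delimiters. */
final class LexicalEvents extends DefaultHandler2 {
    final List<String> events = new ArrayList<>();

    @Override public void comment(char[] ch, int start, int length) {
        events.add("comment:" + new String(ch, start, length));
    }

    @Override public void startCDATA() { events.add("startCDATA"); }

    @Override public void endCDATA() { events.add("endCDATA"); }

    /** Parses the text and returns the lexical events in document order. */
    static List<String> collect(String xml) {
        LexicalEvents handler = new LexicalEvents();
        try {
            XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            reader.setContentHandler(handler);
            // SAX2 registers lexical handlers through a standard property name.
            reader.setProperty("http://xml.org/sax/properties/lexical-handler", handler);
            reader.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return handler.events;
    }
}
```

An editor could use events like these to preserve comments and CDATA sections when it writes a document back out, which ordinary SAX document events do not allow.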

If the DTD handler is a DtdEventListener, information such as general entity declarations and validity rules for elements and attributes is reported.
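The kind of DTD information involved can be sketched with the JDK's SAX2 extension interface org.xml.sax.ext.DeclHandler, which is assumed here as an analogue of DtdEventListener rather than the same API. The DeclarationEvents class name is hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.ext.DefaultHandler2;

/** Hypothetical collector for element, attribute, and entity declarations. */
final class DeclarationEvents extends DefaultHandler2 {
    final List<String> events = new ArrayList<>();

    @Override public void elementDecl(String name, String model) {
        events.add("element:" + name);
    }

    @Override public void attributeDecl(String eName, String aName, String type,
                                        String mode, String value) {
        events.add("attribute:" + eName + "/" + aName);
    }

    @Override public void internalEntityDecl(String name, String value) {
        events.add("entity:" + name);
    }

    /** Parses the text and returns the declarations that were reported. */
    static List<String> collect(String xml) {
        DeclarationEvents handler = new DeclarationEvents();
        try {
            XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            reader.setContentHandler(handler);
            // SAX2 registers declaration handlers through a standard property name.
            reader.setProperty("http://xml.org/sax/properties/declaration-handler", handler);
            reader.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return handler.events;
    }

    public static void main(String[] args) {
        String xml = "<!DOCTYPE note [<!ELEMENT note (#PCDATA)>"
                + "<!ATTLIST note id ID #IMPLIED><!ENTITY who \"world\">]>"
                + "<note id=\"n1\">hello &who;</note>";
        System.out.println(collect(xml));
    }
}
```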

Resolver Class

The Resolver class has basic support to register catalog entries, used to map the public identifiers for external entities to storage locations other than those provided by XML documents. This mechanism is used to reduce network traffic by providing local caches of entities (such as DTD components, style sheets, schemas, and more) in local files or Java resources. The mechanism also supports local administration of such reusable components.
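The catalog idea can be sketched with the standard org.xml.sax.EntityResolver interface. This is not the Resolver class's implementation; CatalogResolver, its register method, and the identifiers in the usage note are hypothetical names chosen for illustration.

```java
import java.util.HashMap;
import java.util.Map;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;

/** Hypothetical minimal catalog: maps public identifiers to local URIs. */
final class CatalogResolver implements EntityResolver {
    private final Map<String, String> catalog = new HashMap<>();

    /** Adds a catalog entry; localUri might name a local file or a bundled resource. */
    void register(String publicId, String localUri) {
        catalog.put(publicId, localUri);
    }

    @Override
    public InputSource resolveEntity(String publicId, String systemId) {
        String local = (publicId == null) ? null : catalog.get(publicId);
        if (local == null) {
            return null; // no entry: the parser falls back to the document's system identifier
        }
        InputSource source = new InputSource(local);
        source.setPublicId(publicId);
        return source;
    }
}
```

An application might register an entry such as ("-//Example//DTD Sample 1.0//EN", "file:/usr/local/dtds/sample.dtd") and install the resolver with setEntityResolver, so that documents naming that public identifier never trigger a network fetch.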

It also includes static factory methods to return input sources given java.io.File and java.net.URL objects.

Another feature of the Resolver class is a factory method that transforms MIME typed byte streams into SAX InputSource objects. This simplifies building systems that use MIME with XML and can't pass URIs to parsers, such as applications that send and receive XML documents as messages. Examples of such applications include servlets accepting input data as XML from HTTP methods such as POST or PUT, JavaMail based applications, and clients using HTTP POST/PUT to send data to a servlet or another web application server component.
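To make the MIME case concrete, here is a minimal sketch of turning a Content-Type header value and a message body into an InputSource. It is not the Resolver factory method: MimeInputSources and fromMime are hypothetical names, and the sketch assumes that honoring an explicit charset parameter, and otherwise leaving detection to the parser, is sufficient.

```java
import java.io.InputStream;
import java.util.Locale;
import org.xml.sax.InputSource;

/** Hypothetical helper: builds an InputSource from a MIME typed byte stream. */
final class MimeInputSources {
    static InputSource fromMime(String contentType, InputStream body) {
        InputSource source = new InputSource(body);
        // A header value looks like: text/xml; charset=ISO-8859-1
        for (String part : contentType.split(";")) {
            String p = part.trim();
            if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String charset = p.substring("charset=".length()).trim();
                // Parameter values may be quoted, as in charset="utf-8".
                if (charset.length() >= 2 && charset.startsWith("\"") && charset.endsWith("\"")) {
                    charset = charset.substring(1, charset.length() - 1);
                }
                source.setEncoding(charset);
            }
        }
        // With no charset parameter the encoding stays unset, so the parser
        // detects it from the byte stream and any encoding declaration.
        return source;
    }
}
```

A servlet could pass the request's content type and input stream to such a helper and hand the result straight to a parser, with no URI involved.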

Supported Character Encodings

This parser supports all of the character encodings supported by the Java platform with which it is used. (See the JDK 1.1 and JDK 1.2 documentation for lists of those encodings. For JDK 1.2 this is a total of approximately 140 encodings, with some encodings having multiple names.) In some cases, the preferred Internet Standard (IANA) encoding names are not supported, but an alternative name may be used. The following IANA encoding names should all work:

UTF-8, UTF-16 (mandatory for all XML parsers)
US-ASCII, ISO-8859-1 (common English and European Encodings)
ISO-2022-JP, EUC-JP, Shift_JIS (common Japanese encodings)
Big5, GB2312 (common Chinese encodings)
EBCDIC-CP-US, and other EBCDIC-CP-* encodings: AR1, AR2, BE, CA, CH, DK, ES, FI, FR, GB, HE, IS, IT, NL, NO, ROECE, SE, WT, and YU.
ISO-10646-UCS-2 (the Unicode subset of UTF-16)
Many other encodings.

Whenever an encoding other than UTF-8 or UTF-16 is in use, the document should include an encoding declaration.
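The effect of an encoding declaration can be seen in a small sketch that parses an ISO-8859-1 byte stream. It uses the JDK's bundled SAX parser as a stand-in for the parsers in this package; EncodingDeclarationDemo and textOf are hypothetical names.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical demo: the declaration tells the parser how to decode the bytes. */
final class EncodingDeclarationDemo {
    /** Parses the raw bytes and returns all character data in the document. */
    static String textOf(byte[] document) {
        StringBuilder text = new StringBuilder();
        try {
            SAXParserFactory.newInstance().newSAXParser().parse(
                    new ByteArrayInputStream(document),
                    new DefaultHandler() {
                        @Override public void characters(char[] ch, int start, int length) {
                            text.append(ch, start, length);
                        }
                    });
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        // The e-acute is one byte (0xE9) in ISO-8859-1; the declaration names that encoding.
        String xml = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><word>caf\u00e9</word>";
        byte[] latin1 = xml.getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(textOf(latin1));
    }
}
```

Without the declaration, a parser has no indication that the stream is anything other than UTF-8 or UTF-16, and the single 0xE9 byte is not a valid UTF-8 sequence.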