This package holds two fast "XML Processors" as defined in the XML 1.0
specification; these are parsers (validating and non-validating)
with some supporting classes and interfaces. The parsers
support the SAX 1.0 API, with extensions that expose DTD and lexical
information useful for advanced applications (such as XML editors)
which sometimes need such data. They are highly conformant to the
XML 1.0 specification, and support a large number of character
encodings beyond the UTF-8 and UTF-16 support that is required
of all XML processors.
Compatible Extensions to SAX
For most purposes, the parsers in this package will be viewed as
generic SAX parsers. However, there are a few features which
may be of note, and are not supported by all SAX parsers.
Callbacks for Parser and ValidatingParser
All SAX parsers provide methods to set customization options,
such as entity resolvers and error handlers. These parsers also
provide methods to examine those option settings. This facilitates
examining the configuration of such a parser, as well as intercepting
and chaining handlers for parsing events.
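As a rough illustration of the chaining idea, the sketch below wraps a previously installed SAX error handler and forwards every event to it while counting errors. The class name CountingErrorHandler is hypothetical and is not part of this package. The accessor used to fetch the current handler from the parser is not named in this text, so it appears only as an assumption in a comment.

```java
import org.xml.sax.ErrorHandler;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

// Hypothetical example class; not part of this package.
final class CountingErrorHandler implements ErrorHandler {
    // The handler that was installed before this one; may be null.
    private final ErrorHandler next;
    private int errors;

    CountingErrorHandler(ErrorHandler next) {
        this.next = next;
    }

    // Number of error and fatalError events seen so far.
    int errorCount() {
        return errors;
    }

    @Override
    public void warning(SAXParseException e) throws SAXException {
        if (next != null) next.warning(e);
    }

    @Override
    public void error(SAXParseException e) throws SAXException {
        errors++;
        if (next != null) next.error(e);
    }

    @Override
    public void fatalError(SAXParseException e) throws SAXException {
        errors++;
        if (next != null) {
            next.fatalError(e);
        } else {
            // With nothing to delegate to, stop the parse.
            throw e;
        }
    }
}

// Assumed usage; "getErrorHandler" is a guessed accessor name:
//   parser.setErrorHandler(new CountingErrorHandler(parser.getErrorHandler()));
```

Delegating instead of replacing keeps whatever reporting the application had already configured.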
For better support of DOM, and ID-based element retrieval (such
as that used in XSL and in XML Linking), the attribute lists passed
to the DocumentHandler.startElement method implement the
AttributeListEx interface. Similarly, additional DTD information
(exposed by DOM for editor support) is accessible through an extended
DtdEventListener interface.
An optimization that may be of interest is usable only with
the nonvalidating parser and applies only to standalone documents.
When the input document is a valid standalone document, it can be
processed more quickly by not reading external parameter entities and by
not normalizing or defaulting attributes. The optimization is not enabled
by default, because some validity errors will then be misreported (as
fatal "well formedness" errors). Also, applications are often written to
expect normalized attribute values, and those values may not be correctly
normalized when the document is not in fact valid.
If the document handler is a LexicalEventListener,
information such as comments and CDATA section delimiters is provided.
If the DTD handler is a DtdEventListener,
information such as general entity declarations and
validity rules for elements and attributes is reported.
Resolver Class
The Resolver class has basic support to register catalog
entries, used to map the public identifiers for external entities to
storage locations other than those provided by XML documents. This
mechanism is used to reduce network traffic by providing local caches
of entities (such as DTD components, style sheets, schemas, and more)
in local files or Java resources. The mechanism also supports local
administration of such reusable components.
It also includes static factory methods to return input sources given
java.io.File and java.net.URL objects.
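The catalog mapping described above can be sketched with the standard SAX EntityResolver interface. This is a minimal illustration under stated assumptions, not the Resolver class itself: the class name CatalogSketch and its register method are hypothetical, and the real class may behave differently (for example, when loading entities from Java resources).

```java
import java.util.HashMap;
import java.util.Map;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;

// Hypothetical illustration; not the package's Resolver class.
final class CatalogSketch implements EntityResolver {
    // Maps public identifiers to alternative (typically local) URIs.
    private final Map<String, String> catalog = new HashMap<>();

    // Registers a mapping from a public identifier to a replacement URI.
    void register(String publicId, String uri) {
        catalog.put(publicId, uri);
    }

    // SAX callback: returns a source for the mapped URI, or null so that
    // the parser falls back to the system identifier from the document.
    @Override
    public InputSource resolveEntity(String publicId, String systemId) {
        String mapped = (publicId == null) ? null : catalog.get(publicId);
        if (mapped == null) {
            return null;
        }
        InputSource source = new InputSource(mapped);
        source.setPublicId(publicId);
        return source;
    }
}
```

An instance would be installed with the parser's setEntityResolver method, so a DTD known by its public identifier is read from the local copy and not fetched over the network.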
Another feature of the Resolver class is a factory method
that transforms MIME typed byte streams into SAX InputSource
objects. This simplifies building systems that use MIME with XML and
can't pass URIs to parsers, such as applications that send and receive
XML documents as messages. Examples of such applications include
servlets accepting input data as XML from HTTP methods such as POST or PUT,
JavaMail-based applications, and clients using HTTP POST/PUT
to send data to a servlet or another web application server component.
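A minimal sketch of turning a MIME typed byte stream into a SAX InputSource is shown below. It is not the Resolver factory method: the class MimeSourceSketch and both of its methods are hypothetical. It assumes that the only MIME information of interest is the charset parameter of the Content-Type value; when that parameter is absent, the encoding is left unset so that the parser can detect it from the document.

```java
import java.io.InputStream;
import java.util.Locale;
import org.xml.sax.InputSource;

// Hypothetical helper; not the package's Resolver class.
final class MimeSourceSketch {
    // Extracts the charset parameter from a Content-Type value, or
    // returns null when there is none.
    static String charsetOf(String contentType) {
        if (contentType == null) {
            return null;
        }
        String[] parts = contentType.split(";");
        // parts[0] is the media type itself; parameters follow it.
        for (int i = 1; i < parts.length; i++) {
            String p = parts[i].trim();
            if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String value = p.substring("charset=".length()).trim();
                // Strip optional surrounding double quotes.
                if (value.length() >= 2 && value.startsWith("\"")
                        && value.endsWith("\"")) {
                    value = value.substring(1, value.length() - 1);
                }
                return value.isEmpty() ? null : value;
            }
        }
        return null;
    }

    // Wraps the message body in an InputSource, passing along the
    // declared charset when the Content-Type value supplies one.
    static InputSource toInputSource(String contentType, InputStream body) {
        InputSource source = new InputSource(body);
        String charset = charsetOf(contentType);
        if (charset != null) {
            source.setEncoding(charset);
        }
        return source;
    }
}
```

A servlet could, for instance, pass the request's content type and input stream to toInputSource and hand the result to a parser.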
Supported Character Encodings
These parsers support all of the character encodings supported
by the Java platform with which they are used. (Lists of those
encodings are given in the documentation for JDK 1.1 and for
JDK 1.2. For JDK 1.2 this is
a total of approximately 140 encodings, with some encodings
having multiple names.)
In some cases, the preferred Internet Standard (IANA)
encoding names are not supported, but an alternative name may
be used. The following IANA encoding names should all work:
- UTF-8, UTF-16 (mandatory for all XML parsers)
- US-ASCII, ISO-8859-1 (common English and European encodings)
- ISO-2022-JP, EUC-JP, Shift_JIS (common Japanese encodings)
- Big5, GB2312 (common Chinese encodings)
- EBCDIC-CP-US, and other EBCDIC-CP-* encodings:
AR1, AR2, BE, CA, CH, DK, ES, FI, FR, GB,
HE, IS, IT, NL, NO, ROECE, SE, WT, and YU.
- ISO-10646-UCS-2 (the Unicode subset of UTF-16)
- Many other encodings.
In all cases where an encoding other than UTF-8 or UTF-16 is in use,
an encoding declaration should be used.
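To illustrate the role of the encoding declaration, the sketch below parses a small ISO-8859-1 document supplied as raw bytes. It uses the SAX parser bundled with current JDKs through JAXP and not the parsers in this package, and the class name EncodingDeclSketch is hypothetical. The parser is given only bytes, so it relies on the declaration in the document to decode the accented character.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical example class; uses the JDK's bundled SAX parser.
final class EncodingDeclSketch {
    // Parses a tiny ISO-8859-1 document and returns its character data.
    static String parseLatin1() throws Exception {
        String xml = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?>"
                + "<name>caf\u00e9</name>";
        // Encode as ISO-8859-1, matching the encoding declaration.
        byte[] bytes = xml.getBytes(StandardCharsets.ISO_8859_1);

        StringBuilder text = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }
        };
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new ByteArrayInputStream(bytes)), handler);
        return text.toString();
    }
}
```

Without the declaration, a conforming parser would assume UTF-8 for these bytes and reject the lone 0xE9 byte as malformed input.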