Performs character set transcoding on all XML documents in a fileset, roundtripping through a Unicode representation.
Can optionally replace characters in the XML file with substitution strings. This latter feature is intended primarily for use when preparing an XML file for a specific output medium. One example is speech synthesizers, which typically do not recognize and pronounce all characters in the Unicode repertoire. Another example is when an XML document is being prepared for Braille.
The transformer is written to work on any file/fileset that can be represented by the org.daisy.util.fileset package.
Character set transcoding will only be done on XML members of the input fileset; all other types of members pass through untouched.
If no file in the fileset is of type XML, then the whole fileset will pass through untouched. It is therefore safe to place this transformer in contexts whose dataflow varies considerably.
A file/fileset whose XML members have been transcoded and, optionally, have had certain characters substituted by replacement strings. See parameters.
No specific recovery scheme. On error, this transformer will send a fatal message, then throw an exception and abort.
The substitution is attempted using a series of methods in order of preference; each successor is considered a fallback to its predecessor.
All fallbacks are disabled by default.
By setting an "exclusion repertoire", a set of characters is defined that is considered "allowed": replacement will not be attempted on a character that is a member of an excluded repertoire.
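The exclusion check can be illustrated with a minimal Java sketch. It assumes, purely for illustration, that the excluded repertoire is "every character a given charset can encode"; the class and method names (ExclusionSketch, isExcluded) are hypothetical and are not part of the transformer.

```java
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

class ExclusionSketch {
    // Assumption: the excluded ("allowed") repertoire is everything the
    // given charset encoder can represent. Hypothetical helper.
    static boolean isExcluded(int codePoint, CharsetEncoder encoder) {
        return encoder.canEncode(new String(Character.toChars(codePoint)));
    }

    public static void main(String[] args) {
        CharsetEncoder latin1 = StandardCharsets.ISO_8859_1.newEncoder();
        // U+00A5 (yen sign) is in ISO-8859-1, so it would be left alone.
        System.out.println(isExcluded(0x00A5, latin1));
        // U+05E2 (Hebrew ayin) is not, so replacement would be attempted.
        System.out.println(isExcluded(0x05E2, latin1));
    }
}
```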
The use of this class may result in a change in Unicode character composition between input and output. If you need a certain normalization form, normalize after the use of this class.
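A minimal sketch of such a post-processing step, using the standard java.text.Normalizer class. The choice of NFC and the names NormalizeSketch and toNfc are assumptions made for illustration; the transformer itself does not perform this step.

```java
import java.text.Normalizer;

class NormalizeSketch {
    // Normalizes text to NFC (canonical composition). Hypothetical helper;
    // pick the Normalizer.Form that your output medium requires.
    static String toNfc(String text) {
        return Normalizer.normalize(text, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        String decomposed = "e\u0301";          // 'e' followed by a combining acute accent
        String composed = toNfc(decomposed);    // single precomposed character U+00E9
        System.out.println(composed.length());  // prints 1
    }
}
```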
The character translation table with a mapping between characters and their replacement strings must comply with the XML format used in java.util.Properties. See http://java.sun.com/dtd/properties.dtd and java.util.Properties for details.
The key attribute of the entry element must be a hex value representing a Unicode code point, and the entry element value an arbitrary-length string of characters.
Example of replacement text table (this also exists as a real file (example-table.xml) in the transformer directory):
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
<comment>
This is an example of an input translation table for int_daisy_unicodeTranscoder.
The key attribute contains the hex codepoint to be translated,
and the entry text node the replacement string.
The entries match two hebrew characters and some other stuff.
The table can be built using: www.unicode.org/Public/UNIDATA/UnicodeData.txt
</comment>
<entry key="05E2">hebrew ayin</entry>
<entry key="05DD">hebrew final mem</entry>
<entry key="00A5">currency yen</entry>
<entry key="00AE">registered sign</entry>
</properties>
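To show how a table in this format can be consumed, here is a minimal Java sketch that reads it with java.util.Properties.loadFromXML and applies the replacements to a string. The class and method names (TableSketch, loadTable, substitute) are hypothetical, and the sketch is not the transformer's actual implementation: it ignores fallbacks and exclusion repertoires.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

class TableSketch {
    // Reads a properties-format XML table and maps each hex key
    // to its code point. Hypothetical helper.
    static Map<Integer, String> loadTable(InputStream in) throws IOException {
        Properties props = new Properties();
        props.loadFromXML(in);
        Map<Integer, String> table = new HashMap<>();
        for (String key : props.stringPropertyNames()) {
            table.put(Integer.parseInt(key, 16), props.getProperty(key));
        }
        return table;
    }

    // Replaces every code point that has a table entry; all others pass through.
    static String substitute(String text, Map<Integer, String> table) {
        StringBuilder out = new StringBuilder();
        text.codePoints().forEach(cp -> {
            String replacement = table.get(cp);
            if (replacement != null) {
                out.append(replacement);
            } else {
                out.appendCodePoint(cp);
            }
        });
        return out.toString();
    }

    public static void main(String[] args) throws IOException {
        String xml = "<?xml version=\"1.0\" encoding=\"utf-8\"?>\n"
                + "<!DOCTYPE properties SYSTEM \"http://java.sun.com/dtd/properties.dtd\">\n"
                + "<properties>\n"
                + "<entry key=\"00A5\">currency yen</entry>\n"
                + "<entry key=\"00AE\">registered sign</entry>\n"
                + "</properties>\n";
        Map<Integer, String> table =
                loadTable(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        System.out.println(substitute("100 \u00A5", table)); // prints: 100 currency yen
    }
}
```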
Note: judging from code review alone, the sjsxp StAX implementation seems safer to use than Woodstox when it comes to transcoding. This should be tested.
Markus Gylling, Daisy Consortium
LGPL