This transformer can do abbreviation, initialism and acronym detection, sentence detection and word detection in XML documents.
Multiple XML grammars can be supported, since a new grammar only requires a configuration file. So far, only support for DTBook documents has been added.
The internal Java BreakIterator is used to perform the sentence and word
detection, so any language supported by Java should work with this transformer.
xml:lang markup is used to switch the language.
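The following is a minimal sketch of locale-aware sentence and word detection with java.text.BreakIterator. It is not the transformer's actual code: the class name BreakIteratorSketch and its helper methods are hypothetical, and the locale is assumed to come from an xml:lang value such as "en".

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration class, not part of the transformer.
class BreakIteratorSketch {

    /** Splits text into sentences using the rules for the given locale. */
    static List<String> sentences(String text, Locale locale) {
        return split(BreakIterator.getSentenceInstance(locale), text, false);
    }

    /** Splits text into words, skipping whitespace and punctuation segments. */
    static List<String> words(String text, Locale locale) {
        return split(BreakIterator.getWordInstance(locale), text, true);
    }

    private static List<String> split(BreakIterator it, String text, boolean wordsOnly) {
        List<String> result = new ArrayList<>();
        it.setText(text);
        int start = it.first();
        // Walk consecutive boundaries; each (start, end) pair is one segment.
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String segment = text.substring(start, end).trim();
            if (segment.isEmpty()) {
                continue;
            }
            // Word iterators also return punctuation as segments; drop those.
            if (wordsOnly && !Character.isLetterOrDigit(segment.codePointAt(0))) {
                continue;
            }
            result.add(segment);
        }
        return result;
    }

    public static void main(String[] args) {
        // Assumption: the language tag is taken from an xml:lang attribute.
        Locale locale = Locale.forLanguageTag("en");
        String text = "This is a sentence. Here is another one.";
        System.out.println(sentences(text, locale));
        System.out.println(words(text, locale));
    }
}
```

Switching language then amounts to requesting a new iterator for the locale derived from the nearest xml:lang value.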
The abbreviation, initialism and acronym detection is based on word lists in configuration files. So far, there are configuration files for English, Swedish and French. The transformer will not fail if it encounters a language it has no configuration file for; no abbreviations, initialisms or acronyms will be found for that particular language.
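The fallback behavior for unconfigured languages can be pictured with the sketch below. It only illustrates the idea and is not the transformer's implementation: the class AbbreviationLookupSketch and its methods are hypothetical, and the word lists are assumed to be held in memory per language code.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration class, not part of the transformer.
class AbbreviationLookupSketch {

    // One word list per language code, e.g. "en", "sv", "fr".
    private final Map<String, Set<String>> listsByLanguage = new HashMap<>();

    /** Registers the word list for a language (assumed to be read from a configuration file). */
    void register(String language, String... abbreviations) {
        listsByLanguage.put(language, new HashSet<>(Arrays.asList(abbreviations)));
    }

    /** Returns false, rather than failing, when the language has no word list. */
    boolean isAbbreviation(String language, String token) {
        Set<String> empty = Collections.emptySet();
        return listsByLanguage.getOrDefault(language, empty).contains(token);
    }

    public static void main(String[] args) {
        AbbreviationLookupSketch lookup = new AbbreviationLookupSketch();
        lookup.register("en", "e.g.", "etc.");
        System.out.println(lookup.isAbbreviation("en", "e.g."));
        System.out.println(lookup.isAbbreviation("de", "z.B."));
    }
}
```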
This transformer differentiates between three types of abbreviations. In an initialism, each letter is pronounced (e.g. HTML). An acronym is pronounced as a word (e.g. DAISY), whereas an abbreviation is pronounced by reading out its expansion (e.g. "e.g." is pronounced as "for example").
A document having a doctype declaration or root element XML namespace supported by the configuration files.
An XML document having abbreviation and acronym markup, sentence markup or word markup.
On error, this transformer will throw an exception and abort execution.
If true (the default), abbreviation, initialism and acronym detection will be performed.
If true (the default), sentence detection will be performed.
If true (the default), word detection will be performed.
If true (the default), referred files, such as images referenced from a DTBook document, will be copied to the output.
If true (the default is false), the abbreviations, initialisms and acronyms in the custom language file will override the language-specific ones defined in the language-dependent configuration files.
If false (the default is true, which is also the original behavior of this transformer), sent elements will not be added where they would become the only descendant of the parent element.
The language root element contains three subelements
(initialism, acronym and abbreviation). Each of these
elements can have three attributes:
Each abbreviation, initialism or acronym consists of a key element. Each
key has one or more name elements describing the string(s) to
be matched. The expansion element contains the expanded version of the
abbreviation, initialism or acronym.
<key>
  <name>o.s.v.</name>
  <name>o.s.v</name>
  <name>osv.</name>
  <name>o s v</name>
  <expansion>och så vidare</expansion>
</key>
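As a rough illustration of how such key entries could be read with StAX, the sketch below maps every name to the expansion of its key. It is an assumption-laden example, not the transformer's own parser: the class KeyEntrySketch and the method readExpansions are hypothetical, and the expansion element is assumed to follow the name elements inside each key, as in the example above.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical illustration class, not part of the transformer.
class KeyEntrySketch {

    /** Maps every name inside each key element to that key's expansion. */
    static Map<String, String> readExpansions(String xml) {
        Map<String, String> expansions = new LinkedHashMap<>();
        List<String> names = new ArrayList<>();
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(xml));
            while (reader.hasNext()) {
                if (reader.next() != XMLStreamConstants.START_ELEMENT) {
                    continue;
                }
                String element = reader.getLocalName();
                if ("key".equals(element)) {
                    names.clear(); // a new entry starts
                } else if ("name".equals(element)) {
                    names.add(reader.getElementText());
                } else if ("expansion".equals(element)) {
                    // Assumption: expansion comes after all names of the key.
                    String expansion = reader.getElementText();
                    for (String name : names) {
                        expansions.put(name, expansion);
                    }
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("Malformed configuration XML", e);
        }
        return expansions;
    }

    public static void main(String[] args) {
        // \u00e5 is the Swedish letter a with ring above.
        String xml = "<key><name>o.s.v.</name><name>osv.</name>"
                + "<expansion>och s\u00e5 vidare</expansion></key>";
        System.out.println(readExpansions(xml));
    }
}
```

Keeping one map entry per name variant means a matched string can be looked up directly, at the cost of storing the expansion reference once per variant.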
This section can be expanded.
This section remains to be written
StAX is used for XML processing.
Linus Ericson, TPB
LGPL