Localizing and customizing Pipeline Narrator

Martin Blomberg

Latest update: 2006-08-24

Introduction

The purpose of this document is to show anyone localizing Pipeline Narrator where to find the localizable features, and which files to edit or create new versions of. Localizing Pipeline Narrator means adjusting it to produce digital talking books in languages not yet covered by Narrator. It can also mean localizing the user interface. The sections Narrator Transformer Localization and User Interface Localization describe each of those tasks. Note: user interface localization is not necessary in order to localize the production of books.

You will need to fill in language codes in several places when localizing Narrator. These language codes are the lower-case, two-letter codes defined by ISO 639-1. You can find a full list of these codes at a number of sites, such as: http://www.loc.gov/standards/iso639-2/englangn.html.
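If you want to double-check a code before typing it into a configuration file, a few lines of Java are enough. The sketch below is not part of Narrator; the class name LanguageCodeCheck and its method are made up for illustration. It only relies on Locale.getISOLanguages(), which returns the two-letter language codes known to the Java runtime.

```java
import java.util.Arrays;
import java.util.Locale;

// Hypothetical helper, not part of Pipeline Narrator.
class LanguageCodeCheck {

    // True if the code is a lower-case, two-letter language code
    // known to this Java runtime.
    static boolean isTwoLetterLanguageCode(String code) {
        return code != null
                && code.length() == 2
                && code.equals(code.toLowerCase(Locale.ROOT))
                && Arrays.asList(Locale.getISOLanguages()).contains(code);
    }

    public static void main(String[] args) {
        System.out.println("sv:  " + isTwoLetterLanguageCode("sv"));
        System.out.println("SV:  " + isTwoLetterLanguageCode("SV"));
        System.out.println("swe: " + isTwoLetterLanguageCode("swe"));
    }
}
```

Upper-case and three-letter codes are rejected on purpose, since the files described in this document expect the lower-case, two-letter form.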

Note: This document is not transformer documentation. To learn more about each of the transformers, please read the respective transformer documentation, which should be found in the doc/transformers/ directory.

Available Localizations

There is no neat way of finding out which localizations are available in your Pipeline Narrator installation. The easiest way is to examine the files in each transformer directory and see what they contain, or to process a book with xml:lang="xx", where xx is your language code, and see what comes out.

Default Configuration

Pipeline Narrator is supposed to work for English texts out of the box. The default configuration is the one used at TPB when producing university-level course literature in English. There are more settings to tweak than the localizable ones, and they are described elsewhere. Please read the documentation of each transformer to learn more.

Narrator Transformer Localization

Abbreviation and Acronym Detection (se_tpb_xmldetection)

Transformer documentation.

se_tpb_xmldetection is a highly language-dependent transformer when used for abbreviation and acronym detection (see Sentence Detection for other usage). Despite the transformer name, it isn't really XML that is detected, but rather patterns and strings in the text. Such patterns and strings are defined in language files that reside in ../../transformers/se_tpb_xmldetection/lang/. The language files contain abbreviations, acronyms and initialisms together with their corresponding expansions, for the TTS to read. That way, the TTS may be able to say "that is" instead of just "i e", and so on.

If you are using Narrator to produce digital talking books in a language not yet covered by Narrator, you probably want to write your own language file. A short example follows, but you may want to consult the transformer documentation for a more thorough description of how to write such files.

<language xml:lang="en">
    <initialism before=".*[\s(]|^" after="([\-,\.\s:;?!)].*)|$" suffix="s|:s">
        <key>
            <name>ACP</name>
            <expansion>African, Caribbean and Pacific Countries</expansion>
        </key>
    </initialism>

    <acronym before=".*[\s(]|^" after="([\-,\.\s:;?!)].*)|$" suffix="s|:s">
        <key>
            <name>DAISY</name>
            <expansion id="daisyBook">Digital Accessible Information System</expansion>
        </key>
    </acronym>

    <abbreviation before=".*[\s(]|^" after="([,\.\s:;?!)].*)|$">
        <key>
            <name>e.g.</name>
            <name>eg.</name>
            <expansion>for example</expansion>
        </key>
    </abbreviation>
</language>

In the above example, there are three main elements: initialism, acronym and abbreviation. All three can have multiple key children.
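When writing before and after patterns for a new language, it can help to try them out on a few sample strings first. The following Java sketch does that for the abbreviation element of the example. Note the assumption: it treats a name as detected when the whole text matches "before, then the name, then after". How se_tpb_xmldetection actually combines the attributes is described in the transformer documentation, and the class name AbbreviationPatternSketch is made up for illustration.

```java
import java.util.regex.Pattern;

// Hypothetical test bench, not the transformer's own matching code.
class AbbreviationPatternSketch {

    // Copied from the <abbreviation> element in the example
    // (backslashes doubled for Java string literals).
    static final String BEFORE = ".*[\\s(]|^";
    static final String AFTER = "([,\\.\\s:;?!)].*)|$";

    // Assumption: a name is detected when the text has the form
    // before + name + after. The name itself is matched literally.
    static boolean isDetected(String text, String name) {
        Pattern pattern = Pattern.compile(
                "(?:" + BEFORE + ")" + Pattern.quote(name) + "(?:" + AFTER + ")",
                Pattern.DOTALL);
        return pattern.matcher(text).matches();
    }

    public static void main(String[] args) {
        System.out.println(isDetected("Fruit, e.g. apples", "e.g."));
        System.out.println(isDetected("(e.g.)", "e.g."));
        System.out.println(isDetected("The e.g.x case", "e.g."));
    }
}
```

Under this assumption the first two strings match and the third does not, because the character after the name is neither punctuation nor whitespace.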

Once you have produced a file for your language, you have to tell Narrator the file exists. You do so by editing the file ../../transformers/se_tpb_xmldetection/lang.xml, adding the mapping between a language code and your new file.

Structure Announcer (se_tpb_annonsator)

Transformer documentation.

Structure announcer adds spoken introductions and/or terminations of structures, such as tables, sidebars and notes. The announcements are read by the TTS and need to be rewritten if a language not yet covered by Narrator is being used. The announcements are found in the ../../transformers/se_tpb_annonsator/type directory. The file dtbook-2005.xml contains the announcements made in a book that complies with the DTBook 2005 standard.

The file contains rule elements, each one with the attribute match, which contains an XPath expression defining which elements the rule should be applied to. Typically, when localizing Narrator, no new rules have to be added. What you need to add instead is a lang child of the rule element, with the lang attribute matching your language. The lang element has two optional children, before and after, that contain the text to be read before and after any matching structure from the book.
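To make the structure concrete, the Java sketch below embeds a small rule and looks up the announcement for a given language. The rule is hypothetical: it is modelled only on the description above (a rule with a match attribute, lang children with a lang attribute, and optional before and after children), and the XPath and announcement texts are invented. Check dtbook-2005.xml for the real element names and nesting before copying anything.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical illustration, not code from se_tpb_annonsator.
class AnnouncementRuleSketch {

    // Invented rule; the Swedish lang element is the kind of
    // addition a localizer would make.
    static final String RULE =
            "<rule match=\"//table\">"
          + "<lang lang=\"en\"><before>Table.</before><after>End of table.</after></lang>"
          + "<lang lang=\"sv\"><before>Tabell.</before><after>Tabellen slutar.</after></lang>"
          + "</rule>";

    // Returns the text of the "before" or "after" child for the given
    // language, or null if the rule has no such announcement.
    static String announcement(String ruleXml, String language, String position) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(ruleXml)));
            NodeList langs = doc.getDocumentElement().getElementsByTagName("lang");
            for (int i = 0; i < langs.getLength(); i++) {
                Element lang = (Element) langs.item(i);
                if (language.equals(lang.getAttribute("lang"))) {
                    NodeList texts = lang.getElementsByTagName(position);
                    return texts.getLength() > 0 ? texts.item(0).getTextContent() : null;
                }
            }
            return null;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not read rule", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(announcement(RULE, "sv", "before"));
        System.out.println(announcement(RULE, "sv", "after"));
    }
}
```

A language without a lang element in the rule simply yields no announcement in this sketch, which is a quick way to spot rules you have not yet localized.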

The file also contains an element called copy. That element contains XSLT code that produces the spoken announcements of list items in numbered lists (<list type="ol"...). If you want the spoken announcements to appear in lists with roman numerals, you have to edit the file, adding a <xsl:when test="lang('xx')">... where xx is your language code. You'll see tests for lang('yy'), and the easiest way is just to copy one of them and change the language code and the announcement text. If you don't have numbered lists using roman numerals, you can skip this and your lists will be fine anyway.

Sentence Detection (se_tpb_xmldetection)

Transformer documentation.

The sentence detection uses Java's java.text.BreakIterator to find sentence boundaries. All localization is done automagically by Java using the document's current locale.
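To get a feeling for how java.text.BreakIterator splits text in your language, you can run it on a sample outside Narrator. The sketch below is a minimal stand-alone example, not the transformer's code. The class name SentenceBoundarySketch is made up, and building the locale from a language code with Locale.forLanguageTag is an assumption about how the document language is turned into a locale.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical stand-alone demo, not code from se_tpb_xmldetection.
class SentenceBoundarySketch {

    // Splits text into sentences using the rules Java provides
    // for the given locale.
    static List<String> sentences(String text, Locale locale) {
        BreakIterator iterator = BreakIterator.getSentenceInstance(locale);
        iterator.setText(text);
        List<String> result = new ArrayList<>();
        int start = iterator.first();
        for (int end = iterator.next(); end != BreakIterator.DONE;
                start = end, end = iterator.next()) {
            String sentence = text.substring(start, end).trim();
            if (!sentence.isEmpty()) {
                result.add(sentence);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Replace "en" with your own language code to try other locales.
        Locale locale = Locale.forLanguageTag("en");
        for (String s : sentences("This is one sentence. This is another one.", locale)) {
            System.out.println("[" + s + "]");
        }
    }
}
```

If the boundaries look wrong for your language, that is a limitation of the Java runtime's locale data rather than something you can adjust in a Narrator language file.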

Synchronization Point Normalization (se_tpb_syncPointNormalizer)

Transformer documentation.

Language agnostic.

Speech Generator (se_tpb_speechgenerator)

Transformer documentation.

se_tpb_speechgenerator takes care of the audio file and speech generation. It has several language-specific features that need to be adjusted to get the most out of the system; see the transformer documentation for details.

File Set Creator (se_tpb_filesetcreator)

Transformer documentation.

A Z39.86 fileset contains a resource file. To add more languages, just extend the existing file by adding more resources with another xml:lang. Note that audio must be supplied.

Audio Encoder (se_tpb_dtbAudioEncoder)

Transformer documentation.

Language agnostic.

Z3986-2005 to Daisy 2.02 Converter (se_tpb_zed2daisy202)

Transformer documentation.

Language agnostic.

User Interface Localization

The Pipeline transformers make use of the internationalization features in the DMFC package. That way the messages displayed via the standard EventSender during transformer execution are localizable. There is no need to do interface localization in order to produce books in different languages.

Default messages.properties

In every transformer directory, there is a file called messages.properties. The file has a simple syntax and describes a key-value mapping. The key is typically a fairly understandable and descriptive name of a message, and the value is the message itself, i.e. what's supposed to be printed on screen. messages.properties contains the default messages as defined by the transformer developer. The language used should be English. The file should not be removed or edited.

The following example of a messages.properties file comes from the se_tpb_filesetcreator transformer. Lines starting with # are considered comments. The left-hand side of the equals sign is the key (the message name) and the right-hand side is the value (the message text). The curly braces in the message text denote parameters sent by the transformer:

########## Message properties for FileSetCreator ##########
# {0} is the current input filename
USING_INPUT_FILE = Using input file {0}
# {0} is the current output directory name
USING_OUTPUT_DIR = Using output directory {0}
SEARCHING_FOR_REFERRED_FILES = Searching for referred files...
GENERATING_SMIL = Generating SMIL files...
GENERATING_NCX = Generating NCX...
GENERATING_OPF = Generating OPF...
AUDIO_FILE_COPY = Copying audio files...
DONE = Done!
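
The {0} placeholders use the same syntax as Java's java.text.MessageFormat. Assuming that this is how the parameters are filled in (this document does not say which class DMFC uses), the following sketch shows what happens to a message at run time. The class name MessageParameterSketch and the file name book.xml are made up for illustration.

```java
import java.text.MessageFormat;

// Hypothetical illustration of parameter substitution.
class MessageParameterSketch {

    // Replaces {0}, {1}, ... in the pattern with the given parameters.
    static String render(String pattern, Object... params) {
        return MessageFormat.format(pattern, params);
    }

    public static void main(String[] args) {
        // Prints: Using input file book.xml
        System.out.println(render("Using input file {0}", "book.xml"));
    }
}
```

If MessageFormat is indeed what is used, keep in mind that it treats the single quote as a special character: a literal apostrophe in a message has to be written as two single quotes.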
	

Localized Messages

If you'd like to rewrite some of the messages, or have messages displayed in a language other than the default English, you can do so by adding a localized message properties file. The file must follow the same simple syntax as messages.properties and have the same keys. You only need to change the values, and save the file with a name like messages_xx.properties, where xx is your language code. Place the localized file in the transformer directory.

The name of a Swedish message properties file is messages_sv.properties and to write a Swedish localization of the file shown above, one could produce the following:

########## Message properties for FileSetCreator ##########
USING_INPUT_FILE = Använder som indata {0}
USING_OUTPUT_DIR = Använder som utkatalog {0}
SEARCHING_FOR_REFERRED_FILES = Söker efter refererade filer...
GENERATING_SMIL = Genererar SMIL-filer...
GENERATING_NCX = Genererar NCX...
GENERATING_OPF = Genererar OPF...
AUDIO_FILE_COPY = Kopierar ljudfiler...
DONE = Klart!
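
The relationship between the default and the localized file can be tried out with plain java.util.Properties. The sketch below is not the DMFC loading code; the class name LocalizedMessagesSketch is made up, and the file contents are inlined as strings to keep the example self-contained. It layers a localized set of messages on top of the defaults. Falling back to the default for a missing key is an assumption of this sketch, so a real localized file should still contain every key.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical illustration, not the DMFC i18n implementation.
class LocalizedMessagesSketch {

    // Parses properties syntax; keys missing from the content
    // are looked up in the defaults instead.
    static Properties load(String content, Properties defaults) {
        Properties properties = new Properties(defaults);
        try {
            properties.load(new StringReader(content));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return properties;
    }

    // File name convention described in this section.
    static String fileNameFor(String languageCode) {
        return "messages_" + languageCode + ".properties";
    }

    public static void main(String[] args) {
        Properties defaults = load(
                "GENERATING_OPF = Generating OPF...\nDONE = Done!\n", null);
        // Deliberately incomplete, to show the fallback.
        Properties swedish = load("DONE = Klart!\n", defaults);

        System.out.println(fileNameFor("sv"));
        System.out.println(swedish.getProperty("DONE"));
        System.out.println(swedish.getProperty("GENERATING_OPF"));
    }
}
```

Note that Properties keeps trailing whitespace in values, so avoid stray tabs or spaces at the end of a line in your localized file.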
	

Committing Localizations

If you have produced a localization of Narrator and want to share it with others, please contact one of the developers listed as administrators on the SourceForge daisymfc project members list. That way your localization may be committed to the project CVS, where it is free for anyone to download, and it may be included in future releases of Pipeline Narrator.

Author

Martin Blomberg, TPB

Licensing

LGPL