This script converts Word documents saved as XML from within Word 2003 into DTBook. The purpose is to provide an automatic conversion process from structured Word files into DTBook. The output can be used for further processing by other scripts in the Daisy Pipeline, e.g. to produce a Daisy book.
This documentation covers both the simple script "Word 2003 XML to DTBook" and the advanced script "Word 2003 XML to DTBook (production)". What applies to the simple script also applies to the "Word 2003 XML to XHTML" script.
This script accepts Word documents saved as XML from within Word 2003 as input. To ensure that the output is error free, the following restrictions apply.
Only use a single flow of text. Most people only use one text flow; you would have to put some effort into your layout before breaking this rule by accident.
Never use floating objects. This applies to images as well as to text and other objects. A floating object is an object that is positioned on a page without reference to the surrounding text. To test whether an object is floating, insert about a page of text on any page preceding the object. If the object stays on the same page and in the same position as before while the text around it has changed, it is a floating object.
To create high quality output containing footnotes, use the footnotes feature in Word.
Note: A production facility with knowledge of DTBook markup might benefit more from semi-automatic footnote creation, especially when working with OCR material. Refer to the transformer documentation for further details.
The following built-in paragraph styles can be used to structure the document: heading 1, heading 2, heading 3, heading 4, heading 5, heading 6, block text.
The style names given here are in English; the actual names as they appear in Word may differ depending on which version of Word you have purchased. The localized style names will work as described.
Using styles not defined in this list will not cause an error, but will not enhance the result in any way.
Note: The script can be customized to accept other styles. Refer to the transformer documentation for further details.
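As a rough illustration of how one might check which paragraph styles a document actually uses before conversion, the following Python sketch reads a Word 2003 XML file and lists the style names that are not in the recognized set above. It is not part of the script. It assumes the Word 2003 XML namespace `http://schemas.microsoft.com/office/word/2003/wordml` and that each style is declared as a `w:style` element with a `w:name` child; the function names and the sample document are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Assumption: namespace used by Word 2003 XML documents.
W = "http://schemas.microsoft.com/office/word/2003/wordml"
NS = {"w": W}

# Built-in paragraph styles named in this documentation.
RECOGNIZED = {f"heading {n}" for n in range(1, 7)} | {"block text"}


def paragraph_style_usage(xml_text):
    """Count how often each paragraph style name is applied in the document."""
    root = ET.fromstring(xml_text)

    # Map style ids (as referenced by paragraphs) to style names.
    names = {}
    for style in root.iterfind(".//w:styles/w:style", NS):
        style_id = style.get(f"{{{W}}}styleId")
        name = style.find("w:name", NS)
        if style_id and name is not None:
            names[style_id] = name.get(f"{{{W}}}val", style_id)

    usage = Counter()
    for p_style in root.iterfind(".//w:p/w:pPr/w:pStyle", NS):
        style_id = p_style.get(f"{{{W}}}val")
        if style_id is None:
            continue
        usage[names.get(style_id, style_id).lower()] += 1
    return usage


def unrecognized_styles(usage):
    """Return the used style names that are not in the recognized set."""
    return sorted(name for name in usage if name not in RECOGNIZED)


# Hypothetical minimal document used only to exercise the functions.
SAMPLE = """<w:wordDocument xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml">
  <w:styles>
    <w:style w:type="paragraph" w:styleId="Heading1"><w:name w:val="heading 1"/></w:style>
    <w:style w:type="paragraph" w:styleId="MyBox"><w:name w:val="My Box"/></w:style>
  </w:styles>
  <w:body>
    <w:p><w:pPr><w:pStyle w:val="Heading1"/></w:pPr><w:r><w:t>Title</w:t></w:r></w:p>
    <w:p><w:pPr><w:pStyle w:val="MyBox"/></w:pPr><w:r><w:t>Aside</w:t></w:r></w:p>
    <w:p><w:r><w:t>Plain paragraph</w:t></w:r></w:p>
  </w:body>
</w:wordDocument>"""

if __name__ == "__main__":
    print(unrecognized_styles(paragraph_style_usage(SAMPLE)))
```

Paragraphs without an explicit style reference are not counted in this sketch. A style reported as unrecognized is not an error; it simply will not enhance the result unless the script has been customized to accept it.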
It is recommended, although not an absolute requirement, that the first heading in a document is a heading 1 and that no heading has a level more than one greater than the heading before it. Not following this recommendation will still produce error-free output, but might cause subsequent scripts that use the output to fail.
Note! Never apply a paragraph style to only part of a paragraph. This is a very common mistake and can be very hard to spot. It usually happens when the entire paragraph is selected except for the paragraph marker, so the result looks perfectly fine on visual inspection. The output will be error free, but it will not reflect the author's intention. This is not a malfunction of the script, but a design flaw/feature in Word.
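The heading recommendation can be expressed as a small check. The following Python sketch is only an illustration and is not part of the script: it takes the heading levels of a document in reading order and reports every heading that breaks the rule. The function name and the tuple layout of the result are hypothetical.

```python
def heading_level_problems(levels):
    """Return (position, previous_level, level) for each heading that
    is more than one level deeper than the heading before it.

    Positions are 1-based. The level before the first heading is taken
    to be 0, so a document that does not start with heading 1 is reported.
    """
    problems = []
    previous = 0
    for position, level in enumerate(levels, start=1):
        if level > previous + 1:
            problems.append((position, previous, level))
        previous = level
    return problems


if __name__ == "__main__":
    # heading 1, 2, 3, back to 2, then a jump straight to 4
    print(heading_level_problems([1, 2, 3, 2, 4]))  # [(5, 2, 4)]
```

Moving back up by any number of levels (for example from heading 3 to heading 1) is allowed; only jumps to a deeper level are restricted.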
The following built-in character styles can be used to structure the document: strong, emphasis, page number.
The style names given here are in English; the actual names as they appear in Word may differ depending on which version of Word you have purchased. The localized style names will work as described.
Using styles not defined in this list will not cause an error, but will not enhance the result in any way.
Note: The script can be customized to accept other styles. Refer to the transformer documentation for further details.
The following manual formatting is preserved: italic, bold, superscript and subscript. Any other formatting applied directly to a group of characters will not enhance the result and should only be used for layout that does not communicate anything important to the reader. If the layout is important to the reader (as it should be), use styles to express it.
Use list nesting on list styles only (identified by a list icon next to the name in the Styles and Formatting Pane).
Keep list nesting neat by using the same principle that applies to headings: the first list item in a list must not be indented, and no list item may be indented more than one level deeper than the item before it (use tab to indent).
Note! Never use list nesting on paragraph styles with list formatting (identified by a paragraph icon next to the name in the Styles and Formatting Pane). Using tab to indent a paragraph style list will appear correct, but the result will be wrong.
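To see why the indentation rule matters, consider how a flat sequence of indented items has to be turned into a nested list. The Python sketch below is an illustration only and does not describe how the script is implemented: it builds a nested structure from hypothetical (indent, text) pairs and rejects any sequence that starts indented or skips a level, because such an item has no parent item to attach to.

```python
def nest_list_items(items):
    """Build a nested list from (indent, text) pairs.

    indent 0 means "not indented". Raises ValueError when the first item
    is indented or when an item is more than one level deeper than the
    item before it.
    """
    root = []
    # stack[i] is the list that receives items at indent i.
    stack = [root]
    for indent, text in items:
        if indent < 0 or indent > len(stack) - 1:
            raise ValueError(f"item {text!r} has invalid indent {indent}")
        # Close any deeper levels that were open.
        del stack[indent + 1:]
        node = {"text": text, "children": []}
        stack[indent].append(node)
        # Items one level deeper will become children of this node.
        stack.append(node["children"])
    return root


if __name__ == "__main__":
    nested = nest_list_items(
        [(0, "Fruit"), (1, "Apple"), (1, "Pear"), (0, "Vegetables")]
    )
    print([item["text"] for item in nested])
```

With the input above, "Apple" and "Pear" end up as children of "Fruit". A sequence such as `[(0, "a"), (2, "b")]` raises `ValueError`, since there is no item at indent 1 that "b" could belong to.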
All images that are to be part of the result must be embedded in the original document. To ensure that images are embedded, do the following:
Images can be converted to JPEG by checking the "Convert images to JPEG" checkbox.
Two document templates are available in the transformer directory. Both include a macro that prepares a document for input into the pipeline; run the macro when the document is finished. To run the pipeline preparation macro:
To use this feature, macro security must be set to "medium" or lower in Word (click Macros/Security... in the Tools menu).
The behaviour is similar regardless of which template you are using:
Note! This procedure can involve one or two Save As dialogs in sequence. Pay attention to which dialog you are currently in.
This template is designed to be used with the simple script and contains a few basic styles. Focus is on documents that were created in Word.
Documents that are created in Word have a page numbering that matches the layout on the screen. Therefore, the macro contained in this template will insert the current page number automatically at the top of each page.
This template is designed to be used with the advanced script and contains a wider set of styles. Focus is on documents that were imported into Word from another source, e.g. OCR software or print publishing software. A basic understanding of the DTBook format is highly recommended, as manual corrections are usually needed.
Documents that originate from a source other than Word never have page numbering that matches the layout on the screen. Therefore, the page breaks in the source format have to be inserted manually using the page number style.
The output of the script is a DTBook document including images.
The documents linked below are parts of the Transformer technical documentation. They are aimed at developers and system administrators.