
LaTeXML The Manual

A LaTeX to XML/HTML/MathML Converter

Version 0.8.3

Bruce R. Miller

May 3, 2019

Contents

List of Figures

1 Introduction

2 Using LaTeXML
  2.1 Conversion
  2.2 Postprocessing
  2.3 Splitting
  2.4 Sites
  2.5 Individual Formula

3 Architecture
  3.1 latexml architecture
  3.2 latexmlpost architecture

4 Customization
  4.1 LaTeXML Customization
    4.1.1 Expansion
    4.1.2 Digestion
    4.1.3 Construction
    4.1.4 Document Model
    4.1.5 Rewriting
    4.1.6 Packages and Options
    4.1.7 Miscellaneous
  4.2 latexmlpost Customization
    4.2.1 XSLT
    4.2.2 CSS

5 Mathematics
  5.1 Math Details
    5.1.1 Internal Math Representation
    5.1.2 Grammatical Roles

6 Localization
  6.1 Numbering
  6.2 Input Encodings
  6.3 Output Encodings
  6.4 Babel

7 Alignments
  7.1 TeX Alignments
  7.2 Tabular Header Heuristics
  7.3 Math Forks
  7.4 eqnarray
  7.5 AMS Alignments

8 Metadata
  8.1 RDFa

9 ToDo

A Commands
  latexml
  latexmlpost
  latexmlmath

B Bindings

C Modules
  LaTeXML
  LaTeXML::Global
  LaTeXML::Package
  LaTeXML::MathParser
  C.1 Common Modules
    LaTeXML::Common::Config
    LaTeXML::Common::Object
    LaTeXML::Common::Color
    LaTeXML::Common::Color::rgb
    LaTeXML::Common::Color::hsb
    LaTeXML::Common::Color::cmy
    LaTeXML::Common::Color::cmyk
    LaTeXML::Common::Color::gray
    LaTeXML::Common::Color::Derived
    LaTeXML::Common::Number
    LaTeXML::Common::Float
    LaTeXML::Common::Dimension
    LaTeXML::Common::Glue
    LaTeXML::Common::Font
    LaTeXML::Common::Model
    LaTeXML::Common::Model::DTD
    LaTeXML::Common::Model::RelaxNG
    LaTeXML::Common::Error
  C.2 Core Modules
    LaTeXML::Core::State
    LaTeXML::Core::Mouth
    LaTeXML::Core::Gullet
    LaTeXML::Core::Stomach
    LaTeXML::Core::Document
    LaTeXML::Core::Rewrite
    LaTeXML::Core::Token
    LaTeXML::Core::Tokens
    LaTeXML::Core::Box
    LaTeXML::Core::List
    LaTeXML::Core::Comment
    LaTeXML::Core::Whatsit
    LaTeXML::Core::Alignment
    LaTeXML::Core::KeyVals
    LaTeXML::Core::MuDimension
    LaTeXML::Core::MuGlue
    LaTeXML::Core::Pair
    LaTeXML::Core::PairList
    LaTeXML::Core::Definition
    LaTeXML::Core::Definition::CharDef
    LaTeXML::Core::Definition::Conditional
    LaTeXML::Core::Definition::Constructor
    LaTeXML::Core::Definition::Expandable
    LaTeXML::Core::Definition::Primitive
    LaTeXML::Core::Definition::Register
    LaTeXML::Core::Parameter
    LaTeXML::Core::Parameters
  C.3 Utility Modules
    LaTeXML::Util::Pathname
    LaTeXML::Util::WWW
    LaTeXML::Util::Pack
  C.4 Preprocessing Modules
    LaTeXML::Pre::BibTeX
  C.5 Postprocessing Modules
    LaTeXML::Post
    LaTeXML::Post::MathML

D Schema
  D.1 Module LaTeXML
  D.2 Module LaTeXML-common
  D.3 Module LaTeXML-inline
  D.4 Module LaTeXML-block
  D.5 Module LaTeXML-misc
  D.6 Module LaTeXML-meta
  D.7 Module LaTeXML-para
  D.8 Module LaTeXML-math
  D.9 Module LaTeXML-tabular
  D.10 Module LaTeXML-picture
  D.11 Module LaTeXML-structure
  D.12 Module LaTeXML-bib

E Error Codes

F CSS Classes

Index

List of Figures

3.1 Flow of data through LaTeXML's digestive tract.

Chapter 1

Introduction

Note: Some of the more detailed portions of this manual have not kept up to date with the evolution of the code and style of LaTeXML, but rather than delay release, we'll improve the documentation in a later update.

For many, LaTeX is the preferred format for document authoring, particularly for documents involving significant mathematical content and where quality typesetting is desired. On the other hand, content-oriented XML is an extremely useful representation for documents, allowing them to be used, and reused, for a variety of purposes, not least presentation on the Web. Yet the style and intent of LaTeX markup, as compared to XML markup, not to mention its programmability, present difficulties in converting documents from the former format to the latter. Perhaps ironically, these difficulties can be particularly large for mathematical material, where there is a tendency for the markup to focus on appearance rather than meaning.

The choice of LaTeX for authoring and XML for delivery were natural and uncontroversial choices for the Digital Library of Mathematical Functions (DLMF, http://dlmf.nist.gov). Faced with the need to perform this conversion and the lack of suitable tools to perform it, the DLMF project proceeded to develop their own tool, LaTeXML, for this purpose.

Design Goals. The idealistic goals of LaTeXML are:

• Faithful emulation of TeX's behaviour;
• Easily extensible;
• Lossless, preserving both semantic and presentation cues;
• Use an abstract LaTeX-like, extensible, document type;
• Infer the semantics of mathematical content (Good Presentation MathML, eventually Content MathML and OpenMath).

As these goals are not entirely practical, even somewhat contradictory, they are implicitly qualified by "as much as possible". Completely mimicking TeX's, and LaTeX's, behaviour would seem to require the sneakiest modifications to TeX itself; redefining LaTeX's internals does not really guarantee compatibility. "Ease of use" is, of course, in the eye of the beholder; this manual is an attempt to make it easier! More significantly, few documents are likely to have completely unambiguous mathematics markup; human understanding of both the topic and the surrounding text is needed to properly interpret any particular fragment. Thus, while we'll try to provide a "turn-key" solution that does the 'Right Thing' automatically, we expect that applications requiring high semantic content will require document-specific declarations and tuning to achieve the desired result. Towards this end, we provide a variety of means to customize the processing and declare the author's intent. At the same time, especially for new documents, we encourage a more logical, content-oriented markup style over a purely presentation-oriented style.

Overview of this Manual. Chapter 2 describes the usage of LaTeXML, along with common use cases and techniques. Chapter 3 describes the system architecture in some detail. Strategies for customization and implementation of new packages are described in Chapter 4. The special considerations for mathematics, including details of representation and how to improve the conversion, are covered in Chapter 5. Several specialized topics are covered in the remaining chapters. An overview of outstanding issues and planned future improvements is given in Chapter 9.

Finally, the Appendices give detailed documentation of the system components: Appendix A describes the command-line programs provided by the system; Appendix B lists the LaTeX style packages for which we've provided LaTeXML-specific bindings. Appendix C describes the various Perl modules, in groups, that comprise the system. Appendix D describes the XML schema used by LaTeXML. Appendix E gives an overview of the warning and error messages that LaTeXML may generate. Appendix F describes the strategy and naming conventions used for CSS styling of the resulting HTML.

Using LaTeXML, and programming for it, can be somewhat confusing as one is dealing with several languages not normally combined, often within the same file: Perl, TeX and XML (along with XSLT, HTML, CSS), plus the occasional shell programming. To help visually distinguish different contexts in this manual we will put 'programming' oriented material (Perl, TeX) in a typewriter font, like this; XML material will be put in a sans-serif face like this.

If you encounter difficulties, there is a support mailing list at latexml-project (http://lists.informatik.uni-erlangen.de/mailman/listinfo/latexml). Bugs and enhancement requests can be reported at GitHub (https://github.com/brucemiller/LaTeXML). If all else fails, please consult the source code, or the author.

Danger! When you see this sign, be warned that the material presented is somewhat advanced and may not make much sense until you have dabbled quite a bit in LaTeXML's internals. Such advanced or 'dangerous' material will be presented like this paragraph to make it easier to skip over.

Chapter 2

Using LaTeXML

The main commands provided by the LaTeXML system are

latexml: for converting TeX and BibTeX sources to XML.

latexmlpost: for various postprocessing tasks including conversion to HTML, processing images, conversion to MathML and so on.

The usage of these commands can be as simple as

latexml doc.tex | latexmlpost --dest=doc.html -

to convert a single document into an HTML5 document, or as complicated as

latexml --dest=1.xml ch1

latexml --dest=2.xml ch2

latexml --dest=b.xml b

latexml --dest=B.bib.xml B.bib

latexmlpost --prescan --db=my.db --dest=1.html 1

latexmlpost --prescan --db=my.db --dest=2.html 2

latexmlpost --prescan --db=my.db --dest=b.html b

latexmlpost --noscan --db=my.db --dest=1.html 1

latexmlpost --noscan --db=my.db --dest=2.html 2

latexmlpost --noscan --db=my.db --dest=b.html b

to convert a whole set of documents, including a bibliography, into a complete interconnected site.

How best to use the commands depends, of course, on what you are trying to achieve. In the next section, we'll describe the use of latexml, which performs the conversion to XML. The following sections consider a sequence of successively more complicated postprocessing situations, using latexmlpost, by which one or more TeX sources can be converted into one or more web documents or a complete site.

Additionally, there is a convenience command latexmlmath for converting individual formulas into various formats.

2.1 Basic XML Conversion

The command

latexml {options} --destination=doc.xml doc

converts the TeX document doc.tex, or standard input if - is used in place of the filename, to XML. It loads any required definition bindings (see below), reads, tokenizes, expands and digests the document, creating an XML structure. It then performs some document rewriting, parses the mathematical content and writes the result, in this case, to doc.xml; if no --destination is supplied, it writes the result to standard output. For details on the processing, see Chapter 3, and Chapter 5 for more information about math parsing.

BibTeX processing. If the source file has an explicit extension of .bib, or if the --bibtex option is used, the source will be treated as a BibTeX database. See 2.2 for how BibTeX files are included in the final output.

Note that the timing is different than with BibTeX and LaTeX. Normally, BibTeX simply selects and formats a subset of the bibliographic entries according to the .aux file; all TeX expansion and processing is carried out only when the result is included in the main LaTeX document. In contrast, latexml processes and expands the entire bibliography, including any TeX markup within it, when it is converted to XML; the selection of entries is done during postprocessing. One implication is that latexml does not know about packages included in the main document; if the bibliography uses macros defined in such packages, the packages must be explicitly specified using the --preload option.

Useful Options. The number and detail of progress and debugging messages printed during processing can be controlled using

--verbose or --quiet

They can be repeated to get even more or fewer details.

Directories to search (in addition to the working directory) for various files can be specified using

--path={directory}

This option can be repeated.

Whenever multiple sources are being used (including multiple bibliographies), the option

--documentid=id

should be used to provide a unique ID for the document root element. This ID is used as the base for the ids of the child elements within the document, so that they are unique as well.

See the documentation for the command latexml for less common options.

Loading Bindings. Although LaTeXML is reasonably adept at processing TeX macros, it generally benefits from having its own implementation of the macros, primitives, environments and other control sequences appearing in a document because these are what define the mapping into XML. The LaTeXML analogue of a style or class file we call a LaTeXML-binding file, or binding for short; these files have an additional extension .ltxml.

In fact, since style files often bypass structurally or semantically meaningful macros by directly invoking macros internal to LaTeX, LaTeXML actually avoids processing style files when a binding is unavailable. The option

--includestyles

can be used to override this behaviour and allow LaTeXML to (attempt to) process raw style files. [A more selective, per-file, option may be developed in the future, if there is sufficient demand; please provide use cases.]

LaTeXML always starts with the TeX.pool binding loaded, and if LaTeX-specific commands are recognized, LaTeX.pool as well. Any input directive within the source loads the appropriate binding. For example, \documentclass{article} or \usepackage{graphicx} will load the bindings article.cls.ltxml or graphicx.sty.ltxml, respectively; the obsolete directive \documentstyle is also recognized. An \input directive will search for files with both .tex and .sty extensions; it will prefer a binding file if one is found, but will load and digest a .tex if no binding is found. An \include directive (and related ones) searches only for a .tex file, which is processed and digested as usual.

There are two mechanisms for customization: a document-specific binding file doc.latexml will be loaded, if present; the option

--preload=binding

will load the binding file binding.ltxml. The --preload option can be repeated; both kinds of preload are loaded before document processing, and are processed in order.

See Chapter 4 for details about what can go in these bindings, and Appendix B for a list of bindings currently included in the distribution.

2.2 Basic Postprocessing

In the simplest situation, you have a single TeX source document from which you want to generate a single output document. The command

latexmlpost options --destination=doc.html doc

or similarly with --destination=doc.html4 or --destination=doc.xhtml, will carry out a set of appropriate transformations in sequence:

• scanning of labels and ids;
• filling in the index and bibliography (if needed);
• cross-referencing;
• conversion of math;
• conversion of graphics and picture environments to web format (png);
• applying an XSLT stylesheet.

The output format affects the defaults for each step, and particularly the XSLT stylesheet that is used. It is determined by the file extension of --destination, or by the option

--format=(html|html5|html4|xhtml|xml)

which overrides the extension used in the destination. The recognized formats are:

html or html5: math is converted to Presentation MathML, some 'vector' style graphics are converted to SVG, other graphics are converted to images; LaTeXML-html5.xslt is used. The file extension html generates html5.

html4: both math and graphics are converted to png images; LaTeXML-html4.xslt is used.

xhtml: math is converted to Presentation MathML, other graphics are converted to images; LaTeXML-xhtml.xslt is used.

xml: no math, graphics or XSLT conversion is carried out.

Of course, all of these conversions can be controlled or overridden by explicit options described below. For more details about less common options, see the command documentation latexmlpost, as well as Appendix C.5.
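The extension-to-format rule described above can be pictured as a small lookup. The following is only an illustrative sketch in shell, not latexmlpost's own logic; the function name format_for is made up for this example, and the "unknown" fallback is an assumption about what to report for other extensions.

```shell
# Illustrative only: mirrors the format list above, not latexmlpost's code.
# format_for is a hypothetical helper name.
format_for() {
  case "$1" in
    *.html)  echo html5 ;;   # the extension html generates html5
    *.html4) echo html4 ;;
    *.xhtml) echo xhtml ;;
    *.xml)   echo xml ;;
    *)       echo unknown ;; # assumed fallback for this sketch
  esac
}

format_for doc.html
format_for doc.html4
format_for doc.xhtml
format_for doc.xml
```

An explicit --format option would take precedence over whatever this lookup suggests, since it overrides the extension.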

Scanning. The scanning step collects information about all labels, ids, indexing commands, cross-references and so on, to be used in the following postprocessing stages.

Indexing. An index is built from \index markup, if makeidx's \printindex command has been used, but this can be disabled by

--noindex

The index entries can be permuted with the option

--permutedindex

Thus \index{term a!term b} also shows up as \index{term b!term a}. This leads to a more complete, but possibly rather silly, index, depending on how the terms have been written.

Bibliography. When a document contains a request for bibliographies, typically due to the \bibliography{..} command, the postprocessor will look for the named bibliographies. It first looks for preconverted bibliographies with the extension .bib.xml; otherwise it will look for .bib and convert it internally (the latter is a somewhat experimental feature).

If you want to override that search, for example to use a bibliography with a different name, you can supply that filename using the option

--bibliography=bibfile.bib.xml

Note that the internal bibliography list will then be ignored. The bibliography would typically have been produced by running

latexml --dest=bibfile.bib.xml bibfile.bib

Note that the XML file, bibfile, is not used to directly produce an HTML-formatted bibliography; rather, it is used to fill in the \bibliography{..} within a TeX document.
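As a sketch of the two-step bibliography workflow, the following builds the two command lines and prints them instead of running them. The names refs, mymacros (standing for a binding file mymacros.ltxml) and doc are hypothetical, and the sketch assumes doc.xml has already been produced by latexml.

```shell
# Hypothetical names: refs.bib, mymacros (a binding mymacros.ltxml), doc.
BIB=refs

# Step 1: convert the BibTeX database to XML, preloading the macros
# that the bibliography entries rely on.
convert_cmd="latexml --preload=mymacros --dest=$BIB.bib.xml $BIB.bib"

# Step 2: postprocess the main document, naming the converted bibliography
# explicitly so the search for the document's own list is overridden.
post_cmd="latexmlpost --bibliography=$BIB.bib.xml --dest=doc.html doc"

# Print both commands; pipe the output to sh to actually run them.
printf '%s\n' "$convert_cmd" "$post_cmd"
```

The --preload step matters only when the bibliography uses macros from packages that latexml would otherwise not know about.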

Cross-Referencing. In this stage, the scanned information is used to fill in the text and links of cross-references within the document. The option

--urlstyle=(server|negotiated|file)

can control the format of urls within the document.

server: formats urls appropriate for use from a web server. In particular, trailing index.html are omitted. (default)

negotiated: formats urls appropriate for use by a server that implements content negotiation. File extensions for html and xhtml are omitted. This enables you to set up a server that serves the appropriate format depending on the browser being used.

file: formats urls explicitly, with full filename and extension. This allows the files to be browsed from the local filesystem.

Math Conversion. Specific conversions of the mathematics can be requested using the options

--mathimages # converts math to png images
--presentationmathml or --pmml # creates Presentation MathML
--contentmathml or --cmml # creates Content MathML
--openmath or --om # creates OpenMath
--keepXMath # preserves LaTeXML's XMath

(Each of these options can also be negated if needed, e.g. --nomathimages.) It must be pointed out that the Content MathML and OpenMath conversions are currently rather experimental.

If more than one of these conversions is requested, parallel math markup will be generated with the first format being the primary one, and the additional ones added as secondary formats. The secondary format is incorporated using whatever means the primary format uses; e.g. MathML combines formats using m:semantics and m:annotation-xml.

Given the state of current browsers, you may wish to use a polyfill such as MathJax (http://mathjax.org/) to support MathML on more platforms. See the example in 2.2 for one way to do it.
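To make the ordering rule for parallel markup concrete, here is a minimal sketch that assembles one possible command line and prints it. The variable names and the file names doc and doc.xhtml are assumptions for illustration; the point is only that the option listed first selects the primary format.

```shell
# The first math option requested becomes the primary format;
# later ones are attached as secondary (parallel) markup.
primary=--pmml     # Presentation MathML as the primary format
secondary=--cmml   # Content MathML as the secondary format (experimental)

cmd="latexmlpost $primary $secondary --dest=doc.xhtml doc"

# Print the command; pipe the output to sh to actually run it.
printf '%s\n' "$cmd"
```

Swapping the two variables would, by the rule above, make Content MathML the primary format instead.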

Graphics processing. Conversion of graphics (e.g. from the graphic(s|x) packages' \includegraphics) can be enabled or disabled using

--graphicsimages or --nographicsimages

Similarly, the conversion of picture environments can be controlled with

--pictureimages or --nopictureimages

An experimental capability for converting the latter to SVG can be controlled by

--svg or --nosvg

Stylesheets and Javascript. If you wish to restyle the generated HTML either by adding CSS or by customizing the XSLT, change its functionality by adding javascript, or even generate an alternative output format with XSLT, some combination of the following options will be useful.

--nodefaultresources # Omits the default resources (css..)
--css=stylesheet.css # Adds a new CSS stylesheet
--javascript=program.js # Adds a Javascript file
--stylesheet=stylesheet.xsl # Uses an alternative XSLT stylesheet
--xsltparameter=name:value # Sets an XSLT parameter

All but --stylesheet can be repeated to include multiple files or set multiple parameters. When a local CSS or javascript file is included, it will be copied to the destination directory, but otherwise urls are accepted.

The core CSS stylesheet, LaTeXML.css, along with certain styles or classes (article, report, book, amsart) which add stylesheets automatically, helps match the styling of LaTeX to HTML. You can also request the inclusion of your own stylesheets from the command line using the --css option. Some sample CSS enhancements are included with the distribution:

LaTeXML-navbar-left.css: Places a navigation bar on the left.
LaTeXML-navbar-right.css: Places a navigation bar on the right.
LaTeXML-blue.css: Colors various features in a soft blue.

In cases where you wish to completely manage the CSS, the option --nodefaultcss causes only explicitly requested (command-line) css files to be included.

Javascript files are included in the generated HTML by using the --javascript option. The distribution includes a sample LaTeXML-maybeMathJax.js which is useful for supporting MathML: it invokes MathJax (http://mathjax.org) to render the mathematics in browsers without native support for MathML.

--javascript=LaTeXML-maybeMathJax.js

The option can also reference a remote script; for example, to invoke MathJax unconditionally from the 'cloud':

latexmlpost --format=html5 \
--javascript='https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/MathJax.js?config=MML_CHTML' \
--destination=somewhere/doc.html doc

See 4.2.2 for more information on developing your own stylesheets. To develop CSS and XSLT stylesheets, a knowledge of the LaTeXML document type is also necessary; see Appendix D.

Individual XSLT stylesheets may have parameters that can customize the conversion from LaTeXML's XML to the target format. An obscure example is

--xsltparameter=SIMPLIFY_HTML:true

which causes a 'simpler' HTML to be generated. Generally, LaTeXML's HTML relies on CSS to recreate the appearance of many features of LaTeX, but this sometimes results in somewhat convoluted HTML that may not be ideal in situations where CSS is not available. This parameter 'dumbs down' itemizations and enumerations by ignoring any custom item labels or numbers.

2.3 Splitting the Output

For larger documents, it is often desirable to break the result into several interlinked pages. This split, carried out before scanning, is requested by

--splitat=level

where level is one of chapter, section, subsection, or subsubsection. For example, section would split the document into chapters (if any) and sections, along with separate bibliography, index and any appendices. (See also --splitxpath in latexml.) The removed document nodes are replaced by a Table of Contents.

The extra files are named using either the id or label of the root node of each new page document according to

--splitnaming=(id|idrelative|label|labelrelative)

The relative forms create shorter names in subdirectories for each level of splitting. (See also --urlstyle and --documentid in latexml.)

Additionally, the index and bibliography can be split into separate pages according to the initial letter of entries by using the options

--splitindex and --splitbibliography
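As a sketch of how the splitting options combine, the following builds one possible command line for a hypothetical document doc.xml and prints it rather than running it. The destination site/index.html is an assumption; only options named in this section are used.

```shell
# Hypothetical input doc.xml and output directory site/.
# Split at sections, name pages by label in per-level subdirectories,
# and give the index and bibliography one page per initial letter.
split_cmd="latexmlpost --splitat=section --splitnaming=labelrelative"
split_cmd="$split_cmd --splitindex --splitbibliography"
split_cmd="$split_cmd --dest=site/index.html doc"

# Print the command; pipe the output to sh to actually run it.
printf '%s\n' "$split_cmd"
```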

2.4 Site processing

A more complicated situation combines several TeX sources into a single interlinked site consisting of multiple pages and a composite index and bibliography.

Conversion: First, all TeX sources must be converted to XML, using latexml. Since every target-able element in all files to be combined must have a unique identifier, it is useful to prefix each identifier with a unique value for each file. The latexml option --documentid=id provides this.

Scanning: Secondly, all XML files must be split and scanned using the command

latexmlpost --prescan --dbfile=DB --dest=i.xhtml i

where DB names a file in which to store the scanned data. Other conversions, including writing the output file, are skipped in this prescanning step.

Pagination: Finally, all XML files are cross-referenced and converted into the final format using the command

latexmlpost --noscan --dbfile=DB --dest=i.xhtml i

which skips the unnecessary scanning step.
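The three passes above lend themselves to a small driver script. The following is a minimal sketch under stated assumptions: the source names ch1, ch2 and app, the database name site.db, and the function name site_commands are all placeholders, and the script only prints the commands so that it can be inspected before anything is run.

```shell
#!/bin/sh
# Sketch of the three passes described above, for hypothetical sources.
# The names ch1, ch2, app and site.db are placeholders, not part of LaTeXML.
SOURCES="ch1 ch2 app"
DB=site.db

site_commands() {
  # Pass 1: convert each TeX source to XML with its own id prefix.
  for s in $SOURCES; do
    echo "latexml --documentid=$s --dest=$s.xml $s"
  done
  # Pass 2: prescan every XML file into the shared database.
  for s in $SOURCES; do
    echo "latexmlpost --prescan --dbfile=$DB --dest=$s.xhtml $s"
  done
  # Pass 3: cross-reference and write the final pages, skipping the scan.
  for s in $SOURCES; do
    echo "latexmlpost --noscan --dbfile=$DB --dest=$s.xhtml $s"
  done
}

# Print the nine commands in order; pipe the output to sh to run them.
site_commands
```

The ordering is the important design point: every file is prescanned before any file is written, so that cross-references between files can be resolved from the shared database.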

2.5 Individual Formula

For cases where you'd just like to convert a single formula to, say, MathML, and don't mind the overhead, we've combined the pre- and post-processing into a single, handy, command latexmlmath. For example,

latexmlmath --pmml=- \\frac{b\\pm\\sqrt{b^2-4ac}}{2a}

will print the MathML to standard output. To convert the formula to a png image, say quad.png, use the option --mathimage=quad.png.

Note that this involves putting TeX code on the command line. You've got to 'slashify' your code in whatever way is necessary so that, after your shell is finished with it, the string that latexmlmath sees is normal TeX. In the example above, in most unix-like shells, we only needed to double up the backslashes.
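The quoting point can be checked without running latexmlmath at all. This sketch, assuming a POSIX-style shell, stores the formula twice, once with doubled backslashes and once in single quotes, and prints what the shell would actually hand to a command; the variable names are made up for the example.

```shell
# Two spellings of the same formula argument.
# Doubled backslashes in an unquoted word, as in the example above:
doubled=\\frac{b\\pm\\sqrt{b^2-4ac}}{2a}
# Single quotes: the shell passes the text through unchanged.
quoted='\frac{b\pm\sqrt{b^2-4ac}}{2a}'

# Show the strings as a command would receive them.
printf '%s\n' "$doubled"
printf '%s\n' "$quoted"

# If latexmlmath happens to be installed, convert the formula as well.
if command -v latexmlmath >/dev/null 2>&1; then
  latexmlmath --pmml=- "$quoted"
fi
```

Both variables hold the same normal TeX string, so single quotes are an alternative to doubling each backslash in shells that support them.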

Chapter 3

Architecture

As has been said, LaTeXML consists of two main programs: latexml, responsible for converting the TeX source into XML; and latexmlpost, responsible for converting to target formats. See Figure 3.1 for illustration.

The casual user needs only a superficial understanding of the architecture. The programmer who wants to extend or customize LaTeXML will, however, need a fairly good understanding of the process and the distinctions between text, Tokens, Boxes, Whatsits and XML, on the one hand, and Macros, Primitives and Constructors, on the other. In a way, the implementer of a LaTeXML binding for a LaTeX package may need a better understanding than when implementing for LaTeX since they have to understand not only the TeX-view, primarily just the macros and the intended appearance, but also the LaTeXML-view, with XML and representation questions, as well.

The intention is that all semantics of the original document are preserved by latexml, or even inferred by parsing; latexmlpost is for formatting and conversion. Depending on your needs, the LaTeXML document resulting from latexml may be sufficient. Alternatively, you may want to enhance the document by applying third party programs before postprocessing.

3.1 latexml architecture

Like TeX, latexml is data-driven: the text and executable control sequences (i.e. macros and primitives) in the source file (and any packages loaded) direct the processing. For LaTeXML, the user exerts control over the conversion, and customizes it, by providing alternative bindings of the control sequences and packages, by declaring properties of the desired document structure, and by defining rewrite rules to be applied to the constructed document tree.

The top-level class, LaTeXML, manages the processing, providing several methods for converting a TeX document or string into an XML document, with varying degrees of postprocessing and writing the document to file. It binds a (LaTeXML::Core::)State object (to $STATE) to maintain the current state of bindings for control sequence definitions and emulates TeX's scoping rules. The processing is broken into the following stages:

Digestion: the TeX-like digestion phase which converts the input into boxes.

Construction: converts the resulting boxes into an XML DOM.

Rewriting: applies rewrite rules to modify the DOM.

Math Parsing: parses the tokenized mathematics.

Serialization: converts the XML DOM to a string, or writes to file.

Figure 3.1: Flow of data through LaTeXML's digestive tract.

Digestion. Digestion is carried out primarily in a pull mode: The (LaTeXML::Core::)Stomach pulls expanded (LaTeXML::Core::)Tokens from the (LaTeXML::Core::)Gullet, which itself pulls Tokens from the (LaTeXML::Core::)Mouth. The Mouth converts characters from the plain text input into Tokens according to the current catcodes (category codes) assigned to them (as bound in the State). The Gullet is responsible for expanding Macros, that is, control sequences currently bound to (LaTeXML::Core::Definition::)Expandables, and for parsing sequences of tokens into common core datatypes ((LaTeXML::Common::)Number, (LaTeXML::Common::)Dimension, etc.). See 4.1.1 for how to define macros and affect expansion.

The Stomach then digests these tokens by executing (LaTeXML::Core::Definition::)Primitive control sequences, usually for side effect, but often for converting material into (LaTeXML::Core::)Lists of (LaTeXML::Core::)Boxes and (LaTeXML::Core::)Whatsits (a Macro should never digest). Normally, textual tokens are converted to Boxes in the current font. The main (intentional) deviation of LaTeXML's digestion from that of TeX is the introduction of a new type of definition, a (LaTeXML::Core::Definition::)Constructor, responsible for constructing XML fragments. A control sequence bound to a Constructor is digested by reading and processing its arguments and wrapping these up in a Whatsit. Before- and after-daemons, essentially anonymous primitives, associated with the Constructor are executed before and after digesting the Constructor arguments' markup, which can affect the context of that digestion, as well as augmenting the Whatsit with additional properties. See 4.1.2 for how to define primitives and affect digestion.

Construction Given the List of Boxes and Whatsits, we proceed to constructing an XML document. This consists of creating a (LaTeXML::Core::)Document object, containing a libxml2 document, XML::LibXML::Document, and having it absorb the digested material. Absorbing a Box converts it to text content, with provision made to track and set the current font. A Whatsit is absorbed by invoking the associated Constructor to insert an appropriate XML fragment, including elements and attributes, and recursively processing its arguments as necessary. See 4.1.3 for how to define constructors.

A (LaTeXML::Common::)Model is maintained throughout the digestion phase; it accumulates any document model declarations, in particular the document type (RelaxNG is preferred, but DTD is also supported). As LaTeX markup is more like SGML than XML, additional declarations may be used (see Tag in (LaTeXML::)Package) to indicate which elements may be automatically opened or closed when needed to build a document tree that matches the document type. As an example, a <subsection> will automatically be closed when a <section> is begun. Additionally, extra bits of code can be executed whenever particular elements are opened or closed (also specified by Tag). See 4.1.4 for how to affect the schema.

Rewriting Once the basic document is constructed, (LaTeXML::Core::)Rewrite rules are applied which can perform various functions. Ligatures, and the combining of mathematical digits and letters (in certain fonts) into composite math tokens, are handled this way. Additionally, declarations of the type or grammatical role of math tokens can be applied here. See 4.1.5 for how to define rewrite rules.

Math Parsing After rewriting, a grammar-based parser is applied to the mathematical nodes in order to infer, at least, the structure of the expressions, if not the meaning. Mathematics parsing, and how to control it, is covered in detail in Chapter 5.

Serialization Here, we simply convert the DOM into string form, and output it.

3.2 latexmlpost architecture

LaTeXML's postprocessor is primarily for format conversion. It operates by applying a sequence of filters responsible for transforming or splitting documents, or their parts, from one format to another.

Exactly which postprocessing filter modules are applied depends on the command-line options to latexmlpost. Postprocessing filter modules are generally applied in the following order:

Split splits the document into several 'page' documents, according to the --split or --splitxpath options.

Scan scans the document for all IDs, labels and cross-references. This data may be stored in an external database, depending on the --db option.

MakeIndex fills in the index element (due to a \printindex) with material generated by \index.

MakeBibliography fills in the bibliography element (from \bibliography) with material extracted from the file specified by the --bibliography option, for all \cite'd items.

CrossRef establishes all cross-references between documents and parts thereof, filling in the references with appropriate text for the hyperlink.

MathImages, MathML, OpenMath perform various conversions of the internal Math representation.

PictureImages, Graphics, SVG perform various graphics conversions.

XSLT applies an XSLT transformation to each document.

Writer writes the document to a file in the appropriate location.

See 4.2 for how to customize the postprocessing.

Chapter 4

Customization

The processing of the LaTeX document, its conversion into XML and ultimately to XHTML or other formats can be customized in various ways, at different stages of processing and at different levels of complexity. Depending on what you are trying to achieve, some approaches may be easier than others: recall Larry Wall's adage "There's more than one way to do it."

By far the easiest way to customize the style of the output is to modify the CSS (see 4.2.2), so that is the recommended way when it applies.

The basic conversion from TeX markup to XML is done by latexml, and is obviously affected by the mapping between the TeX markup and the XML markup. This mapping is defined by macros, primitives and, of course, constructors. The mapping that is in force at any time is determined by the LaTeXML-specific implementations of the TeX packages involved, what we call 'bindings'. Consequently, you can customize the conversion by modifying the bindings used by latexml.

Likewise, you can extend latexml by creating bindings for TeX styles that have not yet been covered, or by defining your own TeX style file along with its LaTeXML binding.

In all these cases, you'll need the same skills: understanding and using text, tokens, boxes and whatsits, as well as macros and macro expansion, primitives and digestion, and finally whatsits and constructors. Understanding TeX helps; reading the LaTeXML bindings in the distribution will give an idea of how we use it. To teach LaTeXML about new macros, to implement bindings for a package not yet covered, or to modify the way TeX control sequences are converted to XML, you will want to look at 4.1. To modify the way that XML is converted to other formats such as HTML, see 4.2.

A particularly powerful strategy when you have control over the source documents is to develop a semantically oriented LaTeX style file, say smacros.sty, and then provide a LaTeXML binding as smacros.sty.ltxml. In the LaTeX version, you may style the terms as you like; in the LaTeXML version, you could control the conversion so as to preserve the semantics in the XML. If LaTeXML's schema is insufficient, then you would need to extend it with your own representation; although that is beyond the scope of the current manual, see the discussion below in 4.1.4. In such a case, you would also need to extend the XSLT stylesheets, as discussed in 4.2.1.


4.1 LaTeXML Customization

This layer of customization deals with modifying the way a LaTeX document is transformed into LaTeXML's XML, primarily through defining the way that control sequences are handled. In 2.1 the loading of various bindings was described. The facilities described in the following subsections apply in all such cases, whether used to customize the processing of a particular document or to implement a new LaTeX package. We make no attempt to be comprehensive here; please consult the documentation for (LaTeXML::)Global and Package, as well as the binding files included with the system, for more guidance.

A LaTeXML binding is actually a Perl module, and as such, a familiarity with Perl is helpful. A binding file will look something like:

use LaTeXML::Package;
use strict;
use warnings;
# Your code here!
1;

The final '1' is required; it tells Perl that the module has loaded successfully. In between comes any Perl code you wish, along with the definitions and declarations as described here.

Actually, familiarity with Perl is more than merely helpful, as is familiarity with TeX and XML! When writing a binding, you will be programming with all three languages. Of course, you need to know the TeX corresponding to the macros that you intend to implement, but sometimes it is most convenient to implement them completely, or in part, in TeX itself (e.g. using DefMacro), rather than in Perl. At the other end, constructors (e.g. using DefConstructor) are usually defined by patterns of XML.

4.1.1 Expansion & Macros

DefMacro($prototype,$replacement,%options) Macros are defined using DefMacro, such as the pointless:

DefMacro('\mybold{}','\textbf{#1}');

The two arguments to DefMacro we call the prototype and the replacement. In the prototype, the {} specifies a single normal TeX parameter. The replacement is here a string which will be tokenized, and the #1 will be replaced by the tokens of the argument. Presumably the entire result will eventually be further expanded and/or processed.

Whereas TeX normally uses #1, and LaTeX has developed a complex scheme where it is often necessary to peek ahead token by token to recognize optional arguments, we have attempted to develop a suggestive, and easier to use, notation for parameters. Thus a prototype \foo{} specifies a single normal argument, whereas \foo[]{} would take an optional argument followed by a required one. More complex argument prototypes can be found in Package. As in TeX, the macro's arguments are neither expanded nor digested until the expansion itself is further expanded or digested.

The macro's replacement can also be Perl code, typically an anonymous sub, which gets the current Gullet followed by the macro's arguments as its arguments. It must return a list of Tokens which will be used as the expansion of the macro. The following two examples show alternative ways of writing the above macro:

DefMacro('\mybold{}', sub {
  my($gullet,$arg)=@_;
  (T_CS('\textbf'),T_BEGIN,$arg,T_END); });

or alternatively

DefMacro('\mybold{}', sub {
  Invocation(T_CS('\textbf'),$_[1]); });

Generally, the body of the macro should not involve side-effects, assignments or other changes to state other than reading Tokens from the Gullet; of course, the macro may expand into control sequences which do have side-effects.
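As a further illustration of a code-valued replacement, here is a minimal sketch of a macro with an optional argument. The macro name \shout is hypothetical and exists only for this example. The sketch assumes that an omitted optional argument is passed to the sub as undef, and it uses the unlist method of Tokens to flatten an argument into its individual tokens.

```perl
# Hypothetical binding fragment, for illustration only.
use LaTeXML::Package;
use strict;
use warnings;

# \shout[prefix]{text} expands to \textbf{prefix text},
# or to \textbf{text} when the optional argument is omitted.
DefMacro('\shout[]{}', sub {
    my ($gullet, $prefix, $text) = @_;
    # Assumption: an omitted optional argument arrives as undef.
    my @body = (defined $prefix ? ($prefix->unlist, T_SPACE) : ());
    push(@body, $text->unlist);
    # Invocation wraps the collected tokens as the argument of \textbf.
    return Invocation(T_CS('\textbf'), Tokens(@body));
});

1;
```

With this loaded, \shout[Note:]{read this} and \shout{read this} would both expand to a \textbf invocation; as recommended above, the sub only builds tokens and leaves all side effects to whatever the expansion later invokes.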

Tokens, Catcodes and friends Functions that are useful for dealing with Tokens and writing macros include the following:

• Constants for the corresponding TeX catcodes:
  CC_ESCAPE, CC_BEGIN, CC_END, CC_MATH,
  CC_ALIGN, CC_EOL, CC_PARAM, CC_SUPER,
  CC_SUB, CC_IGNORE, CC_SPACE, CC_LETTER,
  CC_OTHER, CC_ACTIVE, CC_COMMENT, CC_INVALID

• Constants for tokens with the appropriate content and catcode:
  T_BEGIN, T_END, T_MATH, T_ALIGN, T_PARAM,
  T_SUB, T_SUPER, T_SPACE, T_CR

• T_LETTER($char), T_OTHER($char), T_ACTIVE($char) create tokens of the appropriate catcode with the given text content.

• T_CS($cs) creates a control sequence token; the string $cs should typically begin with a backslash.

• Token($string,$catcode) creates a token with the given content and catcode.

• Tokens($token,...) creates a (LaTeXML::Core::)Tokens object containing the list of Tokens.

• Tokenize($string) converts the string to a Tokens, using TeX's standard catcode assignments.

• TokenizeInternal($string) is like Tokenize, but treats @ as a letter.

• Explode($string) converts the string to a Tokens where letter characters are given catcode CC_OTHER.

• Expand($tokens) expands $tokens (a Tokens), returning a Tokens; there should be no expandable tokens in the result.

• Invocation($cstoken,$arg,...) returns a Tokens representing the sequence needed to invoke $cstoken on the given arguments (each is a Tokens, or undef for an unsupplied optional argument).
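The functions listed above can be combined as in the following sketch. Both macro names (\pkgfile and \important) and the string 'my_pkg.sty' are hypothetical, chosen only to show Explode and explicit token construction side by side.

```perl
# Hypothetical binding fragment, for illustration only.
use LaTeXML::Package;
use strict;
use warnings;

# \pkgfile expands to the literal characters of a file name.
# Explode gives the characters catcode CC_OTHER, so the underscore
# is not read as a subscript token.
DefMacro('\pkgfile', sub { Explode('my_pkg.sty'); });

# \important{text} builds \emph{\textbf{text}} token by token,
# using T_CS for the control sequences and T_BEGIN/T_END for the braces.
DefMacro('\important{}', sub {
    my ($gullet, $text) = @_;
    return Tokens(T_CS('\emph'), T_BEGIN,
        T_CS('\textbf'), T_BEGIN, $text->unlist, T_END,
        T_END);
});

1;
```

Building the expansion from explicit tokens, rather than from a tokenized string, avoids any dependence on the catcodes in force when the macro is used.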

4.1.2 Digestion & Primitives

Primitives are processed during the digestion phase in the Stomach, after macro expansion (in the Gullet), and before document construction (in the Document). Our primitives generalize TeX's notion of primitive; they are used to implement TeX's primitives, to invoke other side effects and to convert Tokens into Boxes, in particular, Unicode strings in a particular font.

Here are a few primitives from TeX.pool:

DefPrimitive('\begingroup',sub {
  $_[0]->begingroup; });
DefPrimitive('\endgroup', sub {
  $_[0]->endgroup; });
DefPrimitiveI('\batchmode', undef,undef);
DefPrimitiveI('\OE', undef, "\x{0152}");
DefPrimitiveI('\tiny', undef,undef,
  font=>{size=>5});

Other than for implementing TeX's own primitives, DefPrimitive is needed less often than DefMacro or DefConstructor. The main thing to keep in mind is that primitives are processed after macro expansion, by the Stomach. They are most useful for side-effects, changing the State.

DefPrimitive($prototype,$replacement,%options) The replacement is either a string, which will be used to create a Box in the current font, or code taking the Stomach and the control sequence arguments as arguments; as with macros, these arguments are not expanded or digested by default, and must be explicitly digested if necessary. The replacement code must either return nothing (e.g. ending with return;) or return a list (i.e. a Perl list (...)) of digested Boxes or Whatsits.

Options to DefPrimitive are:

• mode=>('math'|'text') switches to math or text mode, if needed;

• requireMath=>1, forbidMath=>1 requires, or forbids, this primitive to appear in math mode;

• bounded=>1 specifies that all digestion (of arguments and daemons) will take place within an implicit TeX group, so that any side-effects are localized, rather than affecting the global state;

• font=>{hash} switches the font used for any created text; recognized font keys are family, series, shape, size, color. Note that if the font change should only affect the material digested within this command itself, then bounded=>1 should be used; otherwise, the font change will remain in effect after the command is processed.

• beforeDigest=>CODE($stomach), afterDigest=>CODE($stomach) provides code to be run before and after processing the main part of the primitive.
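To make the interaction of font and bounded concrete, here is a minimal sketch with two hypothetical primitives, \loud and \shoutbox. It relies only on the options described above and on Digest, which is described below; the exact boxes produced are not shown, since they depend on the surrounding document.

```perl
# Hypothetical binding fragment, for illustration only.
use LaTeXML::Package;
use strict;
use warnings;

# \loud is a declaration in the manner of \tiny: without bounded=>1,
# the font change remains in effect after the command is processed.
DefPrimitive('\loud', sub { return; }, font => { series => 'bold' });

# \shoutbox{text} digests its argument in bold. bounded=>1 places the
# digestion in an implicit TeX group, so text after it is unaffected.
DefPrimitive('\shoutbox{}', sub {
    my ($stomach, $text) = @_;
    # Arguments are not digested by default; digest explicitly.
    return Digest($text);
  },
  bounded => 1, font => { series => 'bold' });

1;
```

The first form suits commands that switch style for the rest of the current group; the second suits commands that style only their own argument.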

DefRegister(...) Needs description!

Other Utilities for Digestion Other functions useful for dealing with digestion and state are important for writing before- and after-daemons in constructors, as well as in Primitives; we give an overview here:

• Digest($tokens) digests $tokens (a (LaTeXML::Core::)Tokens), returning a list of Boxes and Whatsits.

• Let($token1,$token2) gives $token1 the same meaning as $token2, like \let.

Bindings The following functions are useful for accessing and storing information in the current State. It maintains a stack-like structure that mimics TeX's approach to binding; braces { and } open and close stack frames. (The Stomach methods bgroup and egroup can be used when explicitly needed.)

• LookupValue($symbol), AssignValue($symbol,$value,$scope) maintain arbitrary values in the current State, looking up or assigning the current value bound to $symbol (a string). For assignments, the $scope can be 'local' (the default, if $scope is omitted), which changes the binding in the current stack frame. If $scope is 'global', it assigns the value globally by undoing all bindings. The $scope can also be another string, which indicates a named scope, but that is a more advanced topic.

• PushValue($symbol,$value,...), PopValue($symbol), UnshiftValue($symbol,$value,...), ShiftValue($symbol) These maintain the value of $symbol as a list, with the operations having the same sense as in Perl; modifications are always global.

• LookupCatcode($char), AssignCatcode($char,$catcode,$scope) maintain the catcodes associated with characters.

• LookupMeaning($token), LookupDefinition($token) look up the current meaning of the token, being any executable definition bound for it. If there is no such definition, LookupMeaning returns the token itself, whereas LookupDefinition returns undef.
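The scoping behaviour of AssignValue can be sketched as follows. The control sequences \setdraft, \setdraftglobally and \draftmark, and the symbol name 'mydoc@draft', are all hypothetical; only AssignValue, LookupValue and Tokenize are taken from this section.

```perl
# Hypothetical binding fragment, for illustration only.
use LaTeXML::Package;
use strict;
use warnings;

# \setdraft records a flag in the State. With the default 'local' scope,
# the binding lives in the current stack frame and is dropped when the
# enclosing TeX group closes.
DefPrimitive('\setdraft', sub {
    AssignValue('mydoc@draft' => 1);
    return;
});

# \setdraftglobally makes the same assignment with 'global' scope,
# so it survives the end of the current group.
DefPrimitive('\setdraftglobally', sub {
    AssignValue('mydoc@draft' => 1, 'global');
    return;
});

# \draftmark expands to DRAFT if the flag is currently set,
# and to nothing otherwise.
DefMacro('\draftmark', sub {
    return (LookupValue('mydoc@draft') ? Tokenize('DRAFT') : ());
});

1;
```

Under these definitions, {\setdraft \draftmark} \draftmark would be expected to produce the mark only inside the braces, whereas \setdraftglobally would make it appear in both places.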


Counters The following functions maintain LaTeX-like counters, and generally also associate an ID with them. A counter's print form (i.e. \theequation for equations) often ends up on the refnum attribute of elements; the associated ID is used for the xml:id attribute.

• NewCounter($name,$within,options) creates a LaTeX-style counter. When $within is used, the given counter will be reset whenever the counter $within is incremented. This also causes the associated ID to be prefixed with $within's ID. The option idprefix=>$string causes the ID to be prefixed with that string. For example,

NewCounter('section', 'document', idprefix=>'S');
NewCounter('equation','document', idprefix=>'E',
  idwithin=>'section');

would cause the third equation in the second section to have ID='S2.E3'.

• CounterValue($name) returns the Number representing the current value.

• ResetCounter($name) resets the counter to 0.

• StepCounter($name) steps the counter (and resets any others 'within' it), and returns the expansion of \the$name.

• RefStepCounter($name) steps the counter and any IDs associated with it. It returns a hash containing refnum (expansion of \the$name) and id (expansion of \the$name@ID).

• RefStepID($name) steps the ID associated with the counter, without actually stepping the counter; this is useful for unnumbered units that normally would have both a refnum and ID.
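A counter is typically stepped from the properties of a constructor, so that the number and ID end up on the generated element. The following sketch assumes a document class that already defines a section counter, and it assumes that RefStepCounter returns the refnum and id properties as described above; the \remark command, the 'remark' counter and the 'rem' prefix are hypothetical.

```perl
# Hypothetical binding fragment, for illustration only.
use LaTeXML::Package;
use strict;
use warnings;

# A 'remark' counter that is reset whenever 'section' is stepped.
NewCounter('remark', 'section', idprefix => 'rem');

# \remark{text} produces a numbered note. The properties sub steps the
# counter; the hash it returns is stored in the Whatsit, where the
# pattern picks up #refnum and #id.
DefConstructor('\remark{}',
    "<ltx:note class='remark' mark='#refnum' xml:id='#id'>#1</ltx:note>",
    mode       => 'text',
    properties => sub { RefStepCounter('remark'); });

1;
```

Stepping the counter in properties, during digestion, rather than in the pattern keeps the numbering in document order even if construction later moves the element.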

4.1.3 Construction & Constructors

Constructors are where things get interesting, but also complex; they are responsible for defining how the XML is built. There are basic constructors corresponding to normal control sequences, as well as environments. Mathematics generally comes down to constructors, as well, but is covered in Chapter 5.

Here are a couple of trivial examples of constructors:

DefConstructor('\emph{}',
  "<ltx:emph>#1</ltx:emph>", mode=>'text');
DefConstructor('\item[]',
  "<ltx:item>?#1(<ltx:tag>#1</ltx:tag>)");
DefEnvironment('{quote}',
  '<ltx:quote>#body</ltx:quote>',
  beforeDigest=>sub{Let('\\\\','\@block@cr');});
DefConstructor('\footnote[]{}',
  "<ltx:note class='footnote' mark='#refnum'>#2</ltx:note>",
  mode=>'text',
  properties=> sub {
    ($_[1] ? (refnum=>$_[1]) : RefStepCounter('footnote')) });

DefConstructor($prototype,$replacement,%options) The $replacement for a constructor describes the XML to be generated during the construction phase. It can either be a string representing the XML pattern (described below), or a subroutine CODE($document,$arg1,...props) receiving the arguments and properties from the Whatsit; it would invoke the methods of Document to construct the desired XML.

At its simplest, the XML pattern is just a serialization of the desired XML. For more expressivity, XML trees, text content, attributes and attribute values can be effectively 'interpolated' into the XML being constructed by use of the following expressions:

• #1, #2, ... #name returns the construction of the numbered argument or named property of the Whatsit;

• &function(arg1,arg2,...) invokes the Perl function on the given arguments, arg1, ..., returning the result. The arguments should be expressions for values, rather than XML subtrees.

• ?test(if pattern) or ?test(if pattern)(else pattern) returns the result of either the if or else pattern depending on whether the result of test (typically also an expression) is non-empty;

• %expression returns a hash (or rather assumes the result is a hash or KeyVals object); this is only allowed within an opening XML tag, where all the key-value pairs are inserted as attributes;

• ^ if this appears at the beginning of the pattern, the replacement is allowed to float up the current tree to wherever it might be allowed.

In each case, the result of an expression is expected to be either an XML tree, a string or a hash, depending on the context it was used in. In particular, values of attributes are typically given by quoted strings, but expressions within those strings are interpolated into the computed attribute value. The special characters @ # ? % which introduce these expressions can be escaped by preceding with a backslash, when the literal character is desired.
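The conditional and interpolation expressions can be sketched with two small constructors. The commands \term and \status are hypothetical; the element ltx:text and the class attribute are taken from LaTeXML's own schema.

```perl
# Hypothetical binding fragment, for illustration only.
use LaTeXML::Package;
use strict;
use warnings;

# \term[class]{text}: the class attribute is emitted only when the
# optional argument was given (the ?#1(...) conditional inside the tag);
# #2 becomes the content of the element.
DefConstructor('\term[]{}',
    "<ltx:text ?#1(class='#1')>#2</ltx:text>");

# \status[text]: the if/else form of the conditional. The content is
# the optional argument if present, and the literal text 'unknown' if not.
DefConstructor('\status[]',
    "<ltx:text class='status'>?#1(#1)(unknown)</ltx:text>");

1;
```

Keeping such decisions in the pattern avoids writing a subroutine when the only variation is whether an argument was supplied.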

A subroutine used as the $replacement allows programmatic insertion of XML into, or modification of, the document being constructed. Although one could use LibXML's DOM API to manipulate the document tree, it is strongly recommended to use Document's API wherever possible, as it maintains consistency and manages namespace prefixes. This is particularly true for insertion of new content, setting attributes and finding existing nodes in the tree using XPath.
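A minimal sketch of a subroutine replacement follows. The command \keyword is hypothetical; openElement, absorb and closeElement are methods of (LaTeXML::Core::)Document, used here in place of direct LibXML calls, as recommended above.

```perl
# Hypothetical binding fragment, for illustration only.
use LaTeXML::Package;
use strict;
use warnings;

# \keyword{word}: equivalent to the pattern
#   <ltx:text class='keyword'>#1</ltx:text>
# but written out against the Document API.
DefConstructor('\keyword{}', sub {
    my ($document, $word, %props) = @_;
    # Open the element with its attributes, let the document absorb
    # the digested argument as content, then close the element.
    $document->openElement('ltx:text', class => 'keyword');
    $document->absorb($word);
    $document->closeElement('ltx:text');
    return;
});

1;
```

For a case this simple the pattern form is shorter; the subroutine form pays off when the structure to build depends on computed values or on nodes already in the document.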

Options:

• mode=>('math'|'text') switches to math or text mode, if needed;

• requireMath=>1, forbidMath=>1 requires, or forbids, this constructor to appear in math mode;

• bounded=>1 specifies that all digestion (of arguments and daemons) will take place within an implicit TeX group, so that any side-effects are localized, rather than affecting the global state;

• font=>{hash} switches the font used for any created text; recognized font keys are family, series, shape, size, color;

• properties=>{hash} | CODE($stomach,$arg1,...) provides a set of properties to store in the Whatsit for eventual use in the constructor $replacement. If a subroutine is used, it also should return a hash of properties;

• beforeDigest=>CODE($stomach), afterDigest=>CODE($stomach,$whatsit) provides code to be run before and after digesting the arguments of the constructor, typically to alter the context of the digestion (before), or to augment the properties of the Whatsit (after);

• beforeConstruct=>CODE($document,$whatsit), afterConstruct=>CODE($document,$whatsit) provides code to be run before and after the main $replacement is effected; occasionally it is convenient to use the pattern form for the main $replacement, but one still wants to execute a bit of Perl code, as well;

• captureBody=>(1 | $token) specifies that an additional argument (like an environment body) will be read until the current TeX grouping ends, or until the specified $token is encountered. This argument is available to $replacement as #body;

• scope=>('global'|'local'|$name) specifies whether this definition is made globally, or in the current stack frame (default), or in a named scope;

• reversion=>$string|CODE(...), alias=>$cs can be used when the Whatsit needs to be reverted into TeX code, and the default of simply reassembling based on the prototype is not desired. See the code for examples.

Some additional functions useful when writing constructors:

• ToString($stuff) converts $stuff to a string, hopefully without TeX markup, suitable for use as document content and attribute values. Note that if $stuff contains Whatsits generated by Constructors, it may not be possible to avoid TeX code. Contrast ToString with the following two functions.

• UnTeX($stuff) returns a string containing the TeX code that would generate $stuff (this might not be the original TeX). The function Revert($stuff) returns the same information as a Tokens list.

• Stringify($stuff) returns a string more intended for debugging purposes; it reveals more of the structure and type information of the object and its parts.

• CleanLabel($arg), CleanIndexKey($arg), CleanBibKey($arg), CleanURL($arg) clean up arguments (converting to string, handling invalid characters, etc.) to make the argument appropriate for use as an attribute representing a label, index ID, etc.

• UTF($hex) returns the Unicode character for the given codepoint; this is useful for characters below 0x100 where Perl becomes confused about the encoding.

DefEnvironment($prototype,$replacement,%options) Environments are largely a special case of constructors, but the prototype starts with {envname}, rather than \cmd, and the replacement will also typically involve #body, representing the contents of the environment.

DefEnvironment takes the same options as DefConstructor, with the addition of

• afterDigestBegin=>CODE($stomach,$whatsit) provides code to run after the \begin{env} is digested;

• beforeDigestEnd=>CODE($stomach) provides code to run before the \end{env} is digested.

For those cases where you do not want an environment to correspond to a constructor, you may still (as in LaTeX) define the two control sequences \envname and \endenvname as you like.
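An environment with an optional argument and a before-daemon might be sketched as follows. The environment name sidenote and the symbol 'mydoc@insidenote' are hypothetical; that the local assignment is confined to the environment rests on the assumption that the environment body is digested within its own group.

```perl
# Hypothetical binding fragment, for illustration only.
use LaTeXML::Package;
use strict;
use warnings;

# \begin{sidenote}[mark] ... \end{sidenote}
# The mark attribute is only added when the optional argument is given;
# #body stands for the digested contents of the environment.
DefEnvironment('{sidenote}[]',
    "<ltx:note class='sidenote' ?#1(mark='#1')>#body</ltx:note>",
    beforeDigest => sub {
        # Other definitions can test this flag with LookupValue
        # while the body is being digested.
        AssignValue('mydoc@insidenote' => 1);
        return;
    });

1;
```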

4.1.4 Document Model

The following declarations are typically only needed when customizing the schema used by LaTeXML.

• RelaxNGSchema($schema,namespaces) declares that the created XML document should be fit to the RelaxNG schema in $schema; a file $schema.rng should be findable in the current search paths. (Note that currently, LaTeXML is unable to directly parse compact notation.)

• RegisterNamespace($prefix,$url) associates the prefix with the given namespace url. This allows you to use $prefix as a namespace prefix when writing Constructor patterns or XPath expressions.

• Tag($tag,properties) specifies properties for the given XML $tag. Recognized properties include: autoOpen=>1 indicates that the tag can automatically be opened if needed to create a valid document; autoClose=>1 indicates that the tag can automatically be closed if needed to create a valid document; afterOpen=>$code specifies code to be executed after opening the tag; the code is passed the Document being constructed as well as the Box (or Whatsit) responsible for its creation; afterClose=>$code is similar to afterOpen, but executed after closing the element.
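These declarations might be used as in the sketch below. The namespace prefix 'ex', its URL and the class value 'quoted' are hypothetical. The argument order ($document, $node, $box) for the afterOpen code is an assumption based on how the standard bindings use such hooks, and should be checked against the Package documentation.

```perl
# Hypothetical binding fragment, for illustration only.
use LaTeXML::Package;
use strict;
use warnings;

# Make the prefix 'ex' usable in constructor patterns and XPath.
RegisterNamespace('ex' => 'http://example.org/ns/notes');

# Let an open subsection be closed automatically, for example when a
# new section begins (the standard bindings likely declare this already).
Tag('ltx:subsection', autoClose => 1);

# Run a bit of code each time an ltx:quote element has been opened.
Tag('ltx:quote', afterOpen => sub {
    # Assumed argument order: document, new node, responsible box.
    my ($document, $node, $box) = @_;
    $document->setAttribute($node, class => 'quoted');
    return;
});

1;
```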


4.1.5 Rewriting

The following functions are a bit tricky to use (and describe), but can be quite useful in some circumstances.

DefLigature($regexp,%options) applies a regular expression to substitute text nodes after they are closed; the only option is fontTest=>$code, which restricts the ligature to text nodes where the current font passes &$code($font).

DefMathLigature($code,%options) allows replacement of sequences of math nodes. It applies $code to the current Document and each sequence of math nodes encountered in the document; if a replacement should occur, $code should return a list of the form ($n,$string,attributes), in which case the text content of the first node is replaced by $string, the given attributes are added, and the following $n-1 nodes are removed.

DefRewrite(%spec) defines document rewrite rules. These specifications describe what document nodes match:

• label=>$label restricts to nodes contained within an element whose labels include $label;

• scope=>$scope generalizes label; the most useful form is a string like 'section:1.3.2', where it matches the section element whose refnum is 1.3.2;

• xpath=>$xpath selects nodes matching the given XPath;

• match=>$tex selects nodes that look like what processing the TeX string $tex would produce;

• regexp=>$regexp selects text nodes that match the given regular expression.

The following specifications describe what to do with the matched nodes:

• attributes=>{attr} adds the given attributes to the matching nodes;

• replace=>$tex replaces the matching nodes with the result of processing the TeX string $tex.

4.1.6 Packages and Options

The following declarations are useful for defining LaTeXML bindings, including option handling. As when defining LaTeX packages, the following, if needed at all, need to appear in the order shown.

• DeclareOption($option,$handler) specifies the handler for $option when it is passed to the current package or class. If $option is undef, it defines the default handler, for options that are otherwise unrecognized. $handler can be either a string to be expanded, or a sub which is executed like a primitive.

• PassOptions($name,$type,@options) specifies that the given options should be passed to the package (if $type is sty) or class (if $type is cls) $name, if it is ever loaded.

• ProcessOptions(keys) processes any options that have been passed to the current package or class. If inorder=>1 is specified, the options will be processed in the order passed to the package (\ProcessOptions*); otherwise they will be processed in the declared order (\ProcessOptions).

• ExecuteOptions(@options) executes the handlers for the specific set of options @options.

• RequirePackage($pkgname,keys) loads the specified package. The keyword options have the following effect: options=>$options can provide an explicit array of strings specifying the options to pass to the package; withoptions=>1 means that the options passed to the currently loading class or package should be passed to the requested package; type=>$ext specifies the type of the package file (default is sty); raw=>1 specifies that reading the raw style file (e.g. pkg.sty) is permissible if there is no specific LaTeXML binding (e.g. pkg.sty.ltxml); after=>$after specifies a string or (LaTeXML::Core::)Tokens to be expanded after the package has finished loading.

• LoadClass($classname,keys) is similar to RequirePackage, but loads a class file (type=>'cls').

• AddToMacro($cstoken,$tokens) is a little-used utility to add material to the expansion of $cstoken, like an \edef; typically used to add code to a class or package hook.
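Put together, the option handling of a binding might look like the following sketch, with the declarations in the order given above. The package name mydoc, its options draft and final, and the symbol 'mydoc@draft' are hypothetical; graphicx is used only as an example of a dependency.

```perl
# mydoc.sty.ltxml: hypothetical binding, for illustration only.
use LaTeXML::Package;
use strict;
use warnings;

# Handlers for the options this package recognizes; each records
# its choice globally in the State.
DeclareOption('draft', sub { AssignValue('mydoc@draft' => 1, 'global'); return; });
DeclareOption('final', sub { AssignValue('mydoc@draft' => 0, 'global'); return; });

# Default handler (option undef): ignore anything unrecognized.
DeclareOption(undef, sub { return; });

# Run the handlers for the options actually passed, in declared order.
ProcessOptions();

# Load a package this binding depends on.
RequirePackage('graphicx');

1;
```

A document would then select the behaviour with \usepackage[draft]{mydoc}, and other definitions in the binding could consult LookupValue('mydoc@draft').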

4.1.7 Miscellaneous

Other useful stuff:

• RawTeX($texstring) expands and processes the $texstring. This is typically useful to include definitions copied from a TeX style file, when they are appropriate for LaTeXML as is. Single-quoting the $texstring is useful, since it isn't interpolated by Perl, which avoids having to double all the backslashes!
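For more than a line or two of TeX, a single-quoted here-document is a convenient way to follow this advice. In the sketch below the macros \vect and \half are hypothetical, and the terminator name EoTeX is arbitrary.

```perl
# Hypothetical binding fragment, for illustration only.
use LaTeXML::Package;
use strict;
use warnings;

# The quoted terminator ('EoTeX') makes the here-document behave like a
# single-quoted string: no Perl interpolation, no doubled backslashes.
RawTeX(<<'EoTeX');
\newcommand{\vect}[1]{\mathbf{#1}}
\newcommand{\half}{\frac{1}{2}}
EoTeX

1;
```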

4.2 latexmlpost Customization

The current postprocessing framework works by passing the document through a sequence of postprocessing filter modules. Each module is responsible for carrying out a specific transformation, augmentation or conversion on the document. In principle, this architecture has the flexibility to employ new filters to perform new or customized conversions. However, the driver, latexmlpost, currently provides no convenient means to instantiate and incorporate outside filters, short of developing your own specialized version.

Consequently, we will consider custom postprocessing filters outside the scope of this manual (but of course, you are welcome to explore the code, or contact us with suggestions).

The two areas where customization is most practical are altering the XSLT transforms used and extending the CSS stylesheets.

4.2.1 XSLT

EXML provides stylesheets for transforming its XML format to XHTML and HTML.

These stylesheets are modular with components corresponding to the schema modules.

Probably the best strategy for customizing the transform involves making a copy of

the standard base stylesheets, LaTeXML-xhtml.xsl,LaTeXML-html.xsl and

LaTeXML-html5.xsl, found at installationdir/LaTeXML/style/ — they’re

short, consisting mainly of an xsl:include and setting appropriate parameters and

output method; thus modifying the parameters and and adding your own rules, or in-

cluding your own modules should be relatively easy.

Naturally, this requires a familiarity with L

EXML’s schema (see D), as well as

XSLT and XHTML. See the other stylesheet modules in the same directory as the base

stylesheet for guidance. Generally the strategy is to use various parameters to switch

between common behaviors and to use templates with modes that can be overridden

in the less common cases.

Conversion to formats other than XHTML are, of course, possible, as well, but are

neither supplied nor covered here. How complex the transformation will be depends

on the extent that the L

EXML schema can be mapped to the desired one, and to what

extent L

EXML has lost or hidden information represented in the original document.

Again, familiarity with the schema is needed, and the provided XHTML stylesheets may

suggest an approach.

NOTE: I’m trying to make stylesheets easily customizable. However, this is getting

tricky.

• You can import stylesheets, which allows the templates to be overridden.

• You can call the overridden template using apply-imports.

• You cannot use apply-imports to call an overridden named template (although you seemingly can override them).

• You can refer to XSLT modules using URNs, provided you have loaded the LaTeXML.catalog.
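A minimal sketch of such a customization, under stated assumptions: the URN is assumed to follow the pattern urn:x-LaTeXML:XSLT:stylesheet-name, and the choice of ltx:note as the overridden element is purely illustrative. The stylesheet imports the HTML5 base stylesheet, overrides one template rule, and defers to the imported rule with apply-imports.

```xml
<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:ltx="http://dlmf.nist.gov/LaTeXML"
    exclude-result-prefixes="ltx">

  <!-- Assumed URN form; resolving it requires LaTeXML.catalog to be loaded. -->
  <xsl:import href="urn:x-LaTeXML:XSLT:LaTeXML-html5.xsl"/>

  <!-- Illustrative override: emit a marker comment, then let the
       imported rule for ltx:note do the real work. -->
  <xsl:template match="ltx:note">
    <xsl:comment>customized note</xsl:comment>
    <xsl:apply-imports/>
  </xsl:template>

</xsl:stylesheet>
```

Because the base stylesheet is imported rather than included, the local rule takes precedence over the imported one, which is what makes the override possible. Such a stylesheet can be passed to latexmlpost through its --stylesheet option.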

4.2.2 CSS

CSS stylesheets can be supplied to latexmlpost to be included in the generated doc-

uments in addition to, or as a replacement for, the standard stylesheet LaTeXML.css.

See the directory installationdir/LaTeXML/style/ for samples.

To best take advantage of this capability so as to design CSS rules with the correct

speciﬁcity, the following points are helpful:

• LaTeXML converts the TeX to its own schema, with structural elements (like equation) getting their own tag; others are transformed to something more generic, such as note. In the latter case, a class attribute is often used to distinguish. For example, a \footnote generates

    <note class='footnote'>...

whereas an \endnote generates

    <note class='endnote'>...

• The provided XSLT stylesheets transform LaTeXML’s schema to XHTML, generating a combined class attribute consisting of any class attributes already present as well as the LaTeXML tag name. However, there are some variations on the theme. For example, LaTeX’s \section yields a LaTeXML element section, with a title element underneath. When transformed to XHTML, the former becomes a <div class='section'>, while the latter becomes <h2 class='section-title'> (for example; the h-level may vary with the document structure).

Mode begin and end For most elements, once the main html element has been

opened and the primary attributes have been added but before any content has been

added, a template with mode begin is called; thus it can add either attributes or con-

tent. Just before closing the main html element, a template with mode end is called.
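As an illustration of these two modes, the sketch below attaches an attribute to the HTML element generated for each ltx:section and appends a comment just before that element is closed. It assumes the base stylesheet applies the begin and end templates with the LaTeXML element as the current node; the imported URN and the data-kind attribute are illustrative assumptions, not part of the supplied stylesheets.

```xml
<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:ltx="http://dlmf.nist.gov/LaTeXML"
    exclude-result-prefixes="ltx">

  <!-- Assumed URN form for the base stylesheet. -->
  <xsl:import href="urn:x-LaTeXML:XSLT:LaTeXML-html5.xsl"/>

  <!-- Runs after the html element is opened and before any content
       is added, so an attribute can still be attached here. -->
  <xsl:template match="ltx:section" mode="begin">
    <xsl:attribute name="data-kind">section</xsl:attribute>
  </xsl:template>

  <!-- Runs just before the html element is closed. -->
  <xsl:template match="ltx:section" mode="end">
    <xsl:comment>end of section</xsl:comment>
  </xsl:template>

</xsl:stylesheet>
```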

Computing class and style Templates with mode classes and styling.

Chapter 5

Mathematics

There are several issues that have to be dealt with in treating the mathematics. On the

one hand, the TeX markup gives a pretty good indication of what the author wants the

math to look like, and so we would seem to have a good handle on the conversion to

presentation forms. On the other hand, content formats are desirable as well; there

are a few, but too few, clues about what the intent of the mathematics is. And in

fact, the generation of even Presentation MathML of high quality requires recognizing

the mathematical structure, if not the actual semantics. The mathematics processing

must therefore preserve the presentational information provided by the author, while

inferring, likely with some help, the mathematical content.

From a parsing point of view, the TeX-like processing serves as the lexer, tokenizing the input which LaTeXML will then parse [perhaps eventually a type-analysis

phase will be added]. Of course, there are a few twists. For one, the tokens, repre-

sented by XMTok, can carry extra attributes such as font and style, but also the name,

meaning and grammatical role, with defaults that can be overridden by the author —

more on those, in a moment. Another twist is that, although LaTeX’s math markup

is not nearly as semantic as we might like, there is considerable semantics and struc-

ture in the markup that we can exploit. For example, given a \frac, we’ve already

established the numerator and denominator which can be parsed individually, but the

fraction as a whole can be directly represented as an application, using XMApp, of a

fraction operator; the resulting structure can be treated as atomic within its containing

expression. This structure-preserving character greatly simplifies the parsing task and

helps reduce misinterpretation.
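A rough sketch of how \frac{a}{b} might look before parsing, to make the structure concrete. The element names are those used in this chapter (with XMArg wrapping each separately parsed argument); the name value on the operator token is a hypothetical placeholder, and other attributes are omitted.

```xml
<!-- Sketch for \frac{a}{b}; attribute values are illustrative only. -->
<ltx:XMApp xmlns:ltx="http://dlmf.nist.gov/LaTeXML">
  <!-- First child: the fraction operator (hypothetical name). -->
  <ltx:XMTok name="fraction"/>
  <!-- Numerator and denominator, each parsed on its own. -->
  <ltx:XMArg><ltx:XMTok>a</ltx:XMTok></ltx:XMArg>
  <ltx:XMArg><ltx:XMTok>b</ltx:XMTok></ltx:XMArg>
</ltx:XMApp>
```

Within a larger expression the whole XMApp behaves as a single atomic item, so the parser never has to rediscover where the fraction begins and ends.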

The parser, invoked by the postprocessor, works only with the top-level lists of

lexical tokens, or with those sublists contained in an XMArg. The grammar works

primarily through the name and grammatical role. The name is given by an attribute,

or the content if it is the same. The role (things like ID, FUNCTION, OPERATOR,

OPEN, . . . ) is also given by an attribute, or, if not present, the name is looked up in a

document-speciﬁc dictionary (jobname.dict), or in a default dictionary.

Additional exceptions that need fuller explanation are:

• Constructors may wish to create a dual object (XMDual) whose children are the semantic and presentational forms.

• Spacing and similar markup generates XMHint elements, which are currently ignored during parsing, but probably shouldn’t be.

5.1 Math Details

LaTeXML processes mathematical material by proceeding through several stages:

• Basic processing of macros, primitives and constructors, resulting in an XML document; the math is primarily represented by a sequence of tokens (XMTok) or structured items (XMApp, XMDual) and hints (XMHint, which are ignored).

• Document tree rewriting, where rules are applied to modify the document tree. User-supplied rules can be used here to clarify the intent of markup used in the document.

• Math parsing: a grammar-based parser is applied, depth first, to each level of the math; in particular, at the top level of each math expression, as well as to each subexpression within structured items (these will have been contained in an XMArg or XMWrap element). This results in an expression tree that will hopefully be an accurate representation of the expression’s structure, but may be ambiguous in specifics (e.g. what the meaning of a superscript is). The parsing is driven almost entirely by the grammatical role assigned to each item.

• Not yet implemented: a following stage must be developed to resolve the semantic ambiguities by analyzing and augmenting the expression tree.

• Target conversion: from the internal XM* representation to MathML or OpenMath.

The Math element is a top-level container for any math mode material, serving as the container for various representations of the math including images (through attributes mathimage, width and height), textual (through attributes tex, content-tex and text), MathML and the internal representation itself. The mode attribute specifies whether the math should be in display or inline mode.

5.1.1 Internal Math Representation

The XMath element is the container for the internal representation.

The following attributes can appear on all XM* elements:

role the grammatical role that this element plays.

open,close parentheses or delimiters that were used to wrap the expression represented by this element.

argopen,argclose,separators delimiters on a function or operator (the first element of an XMApp) that were used to delimit the arguments of the function. The separators is a string of the punctuation characters used to separate arguments.

xml:id a unique identifier to allow reference (XMRef) to this element.

Math Tags The following tags are used for the intermediate math representation:

XMTok represents a math token. It may contain text for presentation. Additional

attributes are:

name the name that represents the meaning of the token; this overrides the

content for identifying the token.

omcd the OpenMath content dictionary that the name belongs to.

font the font to be used for presenting the content.

style ?

size ?

stackscripts whether scripts should be stacked above/below the item, instead

of the usual script position.

XMApp represents the generalized application of some function or operator to arguments. The first child element is the operator; the remaining elements are the arguments. Additional attributes:

name the name that represents the meaning of the construct as a whole.

stackscripts ?

XMDual combines representations of the content (the ﬁrst child) and presentation (the

second child), useful when the two structures are not easily related.

XMHint represents spacing or other apparent purely presentation material.

name names the effect that the hint was intended to achieve.

style ?

XMWrap serves to assert the expected type or role of a subexpression that may other-

wise be difﬁcult to interpret — the parser is more forgiving about these.

name ?

style ?

XMArg serves to wrap individual arguments or subexpressions, created by structured

markup, such as \frac. These subexpressions can be parsed individually.

rule the grammar rule that this subexpression should match.

XMRef refers to another subexpression. This is used to avoid duplicating arguments

when constructing an XMDual to represent a function application, for example.

The arguments will be placed in the content branch (wrapped in an XMArg)

while XMRef’s will be placed in the presentation branch.

idref the identiﬁer of the referenced math subexpression.
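The interplay of XMDual, XMArg and XMRef can be sketched for a binomial coefficient such as \binom{n}{k}. This follows the description given here (content first, presentation second, arguments held in the content branch and referenced from the presentation branch); the IDs and the name value are hypothetical, and real output may carry further attributes.

```xml
<!-- Sketch only: IDs and the name value are hypothetical. -->
<ltx:XMDual xmlns:ltx="http://dlmf.nist.gov/LaTeXML">
  <!-- First child: content form, holding the actual arguments. -->
  <ltx:XMApp>
    <ltx:XMTok name="binomial"/>
    <ltx:XMArg xml:id="m1.n"><ltx:XMTok>n</ltx:XMTok></ltx:XMArg>
    <ltx:XMArg xml:id="m1.k"><ltx:XMTok>k</ltx:XMTok></ltx:XMArg>
  </ltx:XMApp>
  <!-- Second child: presentation form, a parenthesized stack that
       refers back to the arguments instead of copying them. -->
  <ltx:XMApp open="(" close=")">
    <ltx:XMTok role="STACKED"/>
    <ltx:XMRef idref="m1.n"/>
    <ltx:XMRef idref="m1.k"/>
  </ltx:XMApp>
</ltx:XMDual>
```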

5.1.2 Grammatical Roles

As mentioned above, the grammar takes advantage of the structure (however minimal) of the markup. Thus, the grammar is applied in layers, to sequences of tokens or atomic subexpressions (like fractions or arrays). It is the role attribute that indicates

the syntactic and/or presentational nature of each item. On the one hand, this drives

the parsing: the grammar rules are keyed on the role (say, ADDOP), rather than content

(say + or -), of the nodes [In some cases, the content is used to distinguish special

synthesized roles]. The role is also used to drive the conversion to presentation markup,

(say, as an inﬁx operator), especially Presentation MATHML. Some values of role are

used only in the grammar, some are only used in presentation; most are used both ways.
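To make the role-driven parsing concrete, here is a sketch of a + b before and after parsing. The sketch, before and after elements are a hypothetical wrapper used only to hold the two fragments in one well-formed listing; they are not part of the schema, and the role values are chosen for illustration.

```xml
<!-- Hypothetical wrapper elements; only the ltx: elements are real. -->
<sketch xmlns:ltx="http://dlmf.nist.gov/LaTeXML">
  <before>
    <!-- A flat token sequence; the grammar sees ID ADDOP ID. -->
    <ltx:XMath>
      <ltx:XMTok role="ID">a</ltx:XMTok>
      <ltx:XMTok role="ADDOP">+</ltx:XMTok>
      <ltx:XMTok role="ID">b</ltx:XMTok>
    </ltx:XMath>
  </before>
  <after>
    <!-- An application: operator first, then its arguments. -->
    <ltx:XMath>
      <ltx:XMApp>
        <ltx:XMTok role="ADDOP">+</ltx:XMTok>
        <ltx:XMTok role="ID">a</ltx:XMTok>
        <ltx:XMTok role="ID">b</ltx:XMTok>
      </ltx:XMApp>
    </ltx:XMath>
  </after>
</sketch>
```

The same rule would fire for a - b, since the grammar is keyed on the ADDOP role rather than on the token content.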

The following grammatical roles are recognized by the math parser. These values

can be speciﬁed in the role attribute during the initial document construction or by

rewrite rules. Although the precedence of operators is loosely described in the follow-

ing, since the grammar contains various special case productions, no rigidly ordered

precedence is given. Also note that in the current design, an expression has only a single role, although that role may be involved in grammatical rules with distinct syntax

and semantics (some roles directly reﬂect this ambiguity).

ATOM a general atomic subexpression (atomic at the level of the expression; it may

have internal structure);

ID a variable-like token, whether scalar or otherwise, but not a function;

NUMBER a number;

ARRAY a structure with internal components and alignments; typically has a particular

syntactic relationship to OPEN and CLOSE tokens.

UNKNOWN an unknown expression. This is the default for token elements. Such tokens are treated essentially as ID, but generate a warning if they seem to be used as a function.

OPEN,CLOSE opening and closing delimiters, group expressions or enclose arguments

among other structures;

MIDDLE a middle operator used to group items between an OPEN,CLOSE pair;

PUNCT,PERIOD punctuation; a period ‘ends’ a formula (note that numbers, including

ﬂoating point, are recognized earlier in processing);

VERTBAR a vertical bar (single or doubled) which serves a confusing variety of nota-

tions: absolute values, “at”, divides;

RELOP a relational operator, loosely binding;

ARROW an arrow operator (with little semantic signiﬁcance), but generally treated

equivalently to RELOP;

METARELOP an operator used for relations between relations, with lower precedence;

MODIFIER an atomic expression following an object that ‘modiﬁes’ it in some way,

such as a restriction (<0) or modulus expression;

MODIFIEROP an operator (such as mod) between two expressions such that the latter

modiﬁes the former;

ADDOP an addition operator, between RELOP and MULOP operators in precedence;

MULOP a multiplicative operator, higher precedence than ADDOP;

BINOP a generic inﬁx operator, can act as either an ADDOP or MULOP, typically used

for cases wrapped in \mathbin;

SUPOP An operator appearing in a superscript, such as a collection of primes, or per-

haps a T for transpose. This is distinct from an expression in a superscript with

an implied power or index operator;

PREFIX for a preﬁx operator;

POSTFIX for a postﬁx operator;

FUNCTION a function which (may) apply to following arguments with higher prece-

dence than addition and multiplication, or to parenthesized arguments (enclosed

between OPEN,CLOSE);

OPFUNCTION a variant of FUNCTION which doesn’t require fenced arguments;

TRIGFUNCTION a variant of OPFUNCTION with special rules for recognizing which

following tokens are arguments and which are not;

APPLYOP an explicit inﬁx application operator (high precedence);

COMPOSEOP an inﬁx operator that composes two FUNCTION’s (resulting in another

FUNCTION);

OPERATOR a general operator; higher precedence than function application. For example, for an operator A and a function F, A F x would be interpreted as (A(F))(x);

SUMOP,INTOP,LIMITOP,DIFFOP,BIGOP a summation/union, integral, limiting, differential or general purpose operator. These are treated equivalently by the grammar, but are distinguished to facilitate (eventually) analyzing the argument structure (e.g. bound variables and differentials within an integral). Note: are SUMOP and LIMITOP significantly different in this sense?

POSTSUBSCRIPT,POSTSUPERSCRIPT intermediate form of sub- and superscript, roughly as TeX processes them. The script is (essentially) treated as an argument but the base will be determined by parsing.

FLOATINGSUBSCRIPT,FLOATINGSUPERSCRIPT A special case for a sub- and superscript on an empty base, i.e. {}^{x}. It is often used to place a pre-superscript or for non-math uses (e.g. 10${}^{th}$);

The following roles are not used in the grammar, but are used to capture the presen-

tation style; they are typically used directly in macros that construct structured objects,

or used in representing the results of parsing an expression.

STACKED corresponds to stacked structures, such as \atop, and the presentation of

binomial coefﬁcients.

SUPERSCRIPTOP,SUBSCRIPTOP after parsing, the operator involved in various sub/superscript constructs above will be converted to these;

OVERACCENT,UNDERACCENT these are special cases of the above that indicate the 2nd operand acts as an accent (typically smaller); expressions using these roles are usually directly constructed for accenting macros;

FENCED this operator is used to represent containers enclosed by OPEN and CLOSE,

possibly with punctuation, particularly when no semantic is known for the con-

struct, such as an arbitrary list.

The content of a token is actually used in a few special cases to distinguish distinct

syntactic constructs, but these roles are not assigned to the role attribute of expressions:

LANGLE,RANGLE recognizes use of < and > in the bra-ket notation used in quantum

mechanics;

LBRACE,RBRACE recognizes use of { and } on either side of stacked or array constructions representing various kinds of cases or choices;

SCRIPTOPEN recognizes the use of { in opening specialized set notations.

Chapter 6

Localization

In this chapter, a few issues relating to various national or cultural styles, languages or

text encodings, which we’ll refer to collectively as ‘localization’, are briefly discussed.

6.1 Numbering

Generally when titles and captions are formatted or when equations are numbered and

when they are referred to in a cross reference or table of contents, text consisting of

some combination of the raw title or caption text, a reference number and a type name

(e.g. ‘Chapter’) or symbol (e.g. §) is composed and used. The exact composition that is used at each level can depend on language, culture and subject matter, as well as on both journal and individual style preferences. LaTeX has evolved to accommodate many of these styles and LaTeXML attempts to follow that lead, while preserving its options (the demands of extensively hyper-linked online material sometimes seem to require more options and flexibility than traditional print formatting).

For example, the various macros \chaptername, \partname, \refname,

etc. are respected and used. Likewise, the various counters and formatters such as

\theequation are supported.

LaTeX’s mechanism for formatting caption tags (\fnum@figure and \fnum@table) is extended to cover more cases. If you define \fnum@type (where type is chapter, section, subsection, etc.) it will be used to format the reference

number and/or type name for instances of that type. The macro \fnum@toc@type is

used when formatting numbers for tables of contents.

Alternatively, you can define a macro \format@title@type that will be used to format the whole title, including reference number and type, as desired; it takes a single argument, the title text. The macro \format@toctitle@type is used for formatting a (typically) short form for use in tables of contents.

6.2 Input Encodings

LaTeXML supports the standard LaTeX mechanism for handling non-ASCII encodings of the input TeX sources: using the inputenc package. The LaTeXML binding of inputenc loads the encoding definition (generally with extension def) directly from the LaTeX distribution (these are generally well-enough behaved to be easily processed). These encoding definitions make the upper 128 code points (of 8 bit) active and define TeX macros to handle them.

Using the commandline option --inputencoding=utf8 to latexml allows

processing of sources encoded as utf8, without any special packages loaded. [future work will make LaTeXML compatible with xetex]

6.3 Output Encodings

At some level, as far as TeX is concerned, what you type ends up pointing into a font that causes some blob of ink to be printed. This mechanism is used to print a unique mathematical operator, say ‘subset of and not equals’. It is also used to print Greek when you seemed to have been typing ASCII!

So, we must accommodate that mechanism, as well. At the stage when character tokens are digested to create boxes in the current font, a font encoding table (a FontMap)

is consulted to map the token’s text (viewed as an index into the table) to Unicode.

The declaration DeclareFontMap is used to associate a FontMap with an encoding

name, or font.

Note that this mapping is only used for text originating from the source document; the text within a Constructor’s XML pattern is used without any such font conversion.

6.4 Babel

The babel package supports multiple languages by redefining various internal bits of text to replace, e.g., “Chapter” by “Kapitel” and by defining various shorthand mechanisms to make it easy to type the extra non-Latin characters and glyphs used by those languages. Each supported language or dialect has a module which is loaded to provide the needed definitions.

To the extent that LaTeXML’s input and output encoding handling is sufficient, that its processing of raw TeX is good enough, and that it proceeds through the appropriate LaTeX internals, LaTeXML should be able to support babel and arbitrary languages by reading in the raw TeX implementation of the language module from the TeX distribution itself.

At least, that is the strategy that we use.

Chapter 7

Alignments

There are several situations where TeX stacks or aligns a number of objects into one- or two-dimensional grids. In most cases, these are built upon low-level primitives,

like \halign, and so share characteristics: using & to separate alignment columns;

either \\ or \cr to separate rows. Yet, there are many different markup patterns

and environments used for quite different purposes from tabular text to math arrays to

composing symbols and so it is worth recognizing the intended semantics in each case,

while still processing them as TeX would.

In this chapter, we will describe some of the special complications presented by

alignments and the strategies used to infer and represent the appropriate semantic struc-

tures, particularly for math.

7.1 TeX Alignments

NOTE This section needs to be written.

Many utilities for setting up and processing alignments are deﬁned in TeX.pool

with support from the module (LaTeXML::Core::)Alignment. Typically, one

binds a set of control sequences specially for the alignment environment or structure

encountered, particularly for & and \\. An alignment object is created which records

information about each row and cell that was processed, such as width, alignment,

span, etc. Then the alignment is converted to XML by specifying what tag wraps the

entire alignment, each row and each cell.

The content of alignments is being expanded before the column and row markers

are recognized; this allows more ﬂexibility in deﬁning markup since row and column

markers can be hidden in macros, but it also means that simple means, such as delimited

parameter lists, to parse the structure won’t work.

7.2 Tabular Header Heuristics

To be written

7.3 Math Forks

There are several constructs for aligning mathematics in LaTeX and common packages. Here we are concerned with the large-scale alignments where one or more equations are displayed in a grid, such as eqnarray in standard LaTeX, and a suite of constructs from the amsmath packages. The arrangements are worth preserving as they often convey important information to the reader by the grouping, or by drawing attention to similarities or differences in the formulas. At the same time, the individual fragments within the grid cells often have little ‘meaning’ on their own: it is subsequences of these fragments that represent the logical mathematical objects or formulas. Thus, we would also like to recognize those sequences and synthesize complete formulas for use in content-oriented services. We therefore have to devise an XML structure to represent this duality, as well as develop strategies for inferring and rearranging the mathematics as it was authored into the desired form.

The needed structure shares some characteristics with XMDual, which needs to be described, but needs to reside at the document level, containing several, possibly numbered, equations, each of which provides two views. Additional objects, such as textual insertions (e.g. amsmath’s \intertext), must also be accommodated.

The following XML is used to represent these structures:

    <ltx:equationgroup>
      <ltx:equation>
        <ltx:MathFork>
          <ltx:Math>logical math here</ltx:Math>
          <ltx:MathBranch>
            <ltx:td><ltx:Math>cell math</ltx:Math></ltx:td>...
            or
            <ltx:tr><ltx:td><ltx:Math>...
          </ltx:MathBranch>
        </ltx:MathFork>
      </ltx:equation>
      <ltx:text>inter-text</ltx:text>
      ...more text or equations
    </ltx:equationgroup>

Typically, the contents of the MathBranch will be a sequence of td, each containing a Math, or of tr, each containing a sequence of such td. This structure can thus represent both eqnarray, where a logical equation consists of one or more complete rows, and AMS’ aligned, where equations consist of pairs of columns. The XSLT transformation that converts to end formats recognizes which case applies and lays out the result appropriately.

In most cases, the material that will yield a MathFork is given as a set of partial

math expressions representing rows and/or columns; these must be concatenated (and

parsed) to form the composite logical expression.

Any IDs within the expressions (and references to them) must be modified to avoid duplicate IDs. Moreover, a useful application associates the displayed tokens from the aligned presentation of the MathBranch with the presumably semantic tokens in the logical content of the main branch of the MathFork. Thus, we want the IDs in the two branches to have a known relationship; in particular, those in the branch should have .fork1 appended.
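The ID relationship can be sketched as follows. The ID values are hypothetical; the only point illustrated is that a token in the MathBranch carries the ID of its counterpart in the main branch with .fork1 appended.

```xml
<!-- Sketch only: the ID values are hypothetical. -->
<ltx:MathFork xmlns:ltx="http://dlmf.nist.gov/LaTeXML">
  <!-- Main (logical) branch keeps the original IDs. -->
  <ltx:Math>
    <ltx:XMath>
      <ltx:XMTok xml:id="m1.1">a</ltx:XMTok>
    </ltx:XMath>
  </ltx:Math>
  <!-- Aligned presentation: same token, ID suffixed with .fork1. -->
  <ltx:MathBranch>
    <ltx:td>
      <ltx:Math>
        <ltx:XMath>
          <ltx:XMTok xml:id="m1.1.fork1">a</ltx:XMTok>
        </ltx:XMath>
      </ltx:Math>
    </ltx:td>
  </ltx:MathBranch>
</ltx:MathFork>
```

With this convention, an application can map a displayed token to its logical counterpart by stripping the suffix, and back by appending it.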

7.4 eqnarray

The eqnarray environment seems intended to represent one or more equations, but

each equation can be continued with additional right-hand-sides (by omitting the 1st

column), or the RHS itself can be continued on multiple lines by omitting the 1st two

columns on a row. With our goal of constructing well-structured mathematics, this

gives us a fun little puzzle to sort out. However, being essentially the only structure for aligning mathematical stuff in standard LaTeX, eqnarray tended to be stretched into

various other use cases; aligning numbered equations with bits of text on the side, for

example. We therefore have some work to do to guess what the intent is.

The strategy used for eqnarray is to process the material as an alignment in math

mode and convert initially to the following XML structure:

    <ltx:equationgroup>
      <ltx:equation>
        <ltx:_Capture_>
          <ltx:Math><ltx:XMath>column math here</ltx:XMath></ltx:Math>
        </ltx:_Capture_>
        ...
      </ltx:equation>
      ...
    </ltx:equationgroup>

The results are then studied to recognize the patterns of empty columns so that the rows

can be regrouped into logical equations. MathFork structures are used to contain those

logical equations while preserving the layout in the MathBranch.
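As a sketch of the intended outcome, consider an eqnarray whose second row omits the first column and so continues the first equation. The regrouping below is an illustration of the strategy described, not captured output; equation numbering and all attributes are omitted, and the math content is abbreviated to plain text.

```xml
<!-- Input (illustrative):
       \begin{eqnarray}
         a & = & b \\
           & = & c
       \end{eqnarray}
     The empty first column on row 2 marks it as a continuation. -->
<ltx:equationgroup xmlns:ltx="http://dlmf.nist.gov/LaTeXML">
  <ltx:equation>
    <ltx:MathFork>
      <!-- One logical equation synthesized from both rows. -->
      <ltx:Math>a = b = c</ltx:Math>
      <!-- The original two-row layout is preserved here. -->
      <ltx:MathBranch>
        <ltx:tr>
          <ltx:td><ltx:Math>a</ltx:Math></ltx:td>
          <ltx:td><ltx:Math>=</ltx:Math></ltx:td>
          <ltx:td><ltx:Math>b</ltx:Math></ltx:td>
        </ltx:tr>
        <ltx:tr>
          <ltx:td/>
          <ltx:td><ltx:Math>=</ltx:Math></ltx:td>
          <ltx:td><ltx:Math>c</ltx:Math></ltx:td>
        </ltx:tr>
      </ltx:MathBranch>
    </ltx:MathFork>
  </ltx:equation>
</ltx:equationgroup>
```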

NOTE We need to deal better with the cases that have more rows numbered than we would like.

7.5 AMS Alignments

The AMS math packages deﬁne a number of useful math alignment structures. These

have been well thought out and designed with particular logical structures in mind, as

well as the layout. Thus these environments are less often abused than is eqnarray.

In this section, we list the environments, their expected use case and describe the strat-

egy used for converting them.

To be done: Describe alternates for equation and things inside equations; describe single vs multiple logical equations (and starred variants).

This list outlines the intended use of the AMS alignment environments. The following constructs are intended as top-level environments, used like equation.

Several of the constructs are used in place of a top-level equation and represent

one or more logical equations. The following describes the intended usage, as a guide

to understanding the implementation code (or its limitations!)

• align, flalign, alignat, xalignat: Each row may be numbered; has an even number of columns; each pair of columns, aligned right then left, represents a logical equation. Note that the documentation suggests that annotative text can be added by putting \text{} in a column followed by an empty column.

• gather: Each row is a single centered column representing an equation.

• multline: This environment represents a single equation broken to multiple lines; the lines are aligned left, center (repeated) and finally, right. [alignment not yet implemented]

The following environments are used within an equation (or similar) environment and thus do not generate MathFork structures. Moreover, except for aligned, their semantic intent is less clear. The preservation of the alignment has not yet been implemented; they presumably would yield an XMDual.

• split

• gathered

• aligned, alignedat

Note that the case of a single equation containing a single aligned is transformed

into and treated equivalently to a top-level align.

Chapter 8

Metadata

8.1 RDFa

LaTeXML has support for representing and generating RDFa metadata in LaTeXML documents. The core attributes property, rel, rev, about, resource, typeof and content are included. Provision is also made for about and resource to be specified using LaTeX-style labels, or plain XML ids.

The default set of vocabularies is specified in the HTML Role Vocabulary (footnote 1), and the associated set of prefixes is predefined.

It is intended that the support will be extended to automatically generate RDFa data from the implied semantics of LaTeX markup; the idea would be not to inadvertently

override any explicitly provided metadata supplied by one of the following packages.

The hyperref package The hyperref and hyperxmp packages provide a means to specify metadata which will be embedded in the generated pdf file; LaTeXML converts

that data to RDFa in its generated XML.

The lxRDFa package There is also a LaTeXML-specific package, lxRDFa, which provides several commands for annotating the generated XML. The most powerful of these is \lxRDFa, which allows you to specify any set or subset of RDFa attributes on the current XML element and thus take advantage of the arbitrary shorthands, chaining and partial triples that RDFa allows. Correspondingly, you must beware of clashes or unintended changes to the set of triples generated by explicit and hidden RDFa data.

1 http://www.w3.org/1999/xhtml/vocab/#XHTMLRoleVocabulary

Chapter 9

ToDo

Lots. . . !

• Many useful LaTeX packages have not been implemented, and those that are aren’t necessarily complete. Contributed bindings are, of course, welcome!

• Low-level TeX capabilities, such as text modes (e.g. vertical, horizontal), box details like width and depth, as well as fonts, aren’t mimicked faithfully, although it isn’t clear how much can be done at the ‘semantic’ level.

• A richer math grammar, or more flexible parsing engine, better inferencing of math structure, better inferencing of math meaning. . . and thus better Content MathML and OpenMath support!

• Could be faster.

• Easier customization of the document schema, XSLT stylesheets.

• ...um,...documentation!

Acknowledgements

Thanks to the DLMF project and its Editors (Frank Olver, Dan Lozier, Ron Boisvert, and Charles Clark) for providing the motivation and opportunity to pursue this.

Thanks to the arXMLiv project, in particular Michael Kohlhase and Heinrich

Stamerjohanns, for providing a rich testbed and testing framework to exercise the sys-

tem. Additionally, thanks to Ioan Sucan, Catalin David and Silviu Oprea for testing

help and for implementing additional packages.

Particular thanks go to Deyan Ginev as an enthusiastic supporter and developer.

Appendix A

Command Documentation

latexml

Transforms a TeX/LaTeX ﬁle into XML.

Synopsis

latexml [options] texﬁle

Options:

  --destination=file   sets destination file (default stdout).
  --output=file        [obsolete synonym for --destination]
  --preload=module     requests loading of an optional module;
                       can be repeated
  --preamble=file      sets a preamble file which will
                       effectively be prepended to the main file.
  --postamble=file     sets a postamble file which will
                       effectively be appended to the main file.
  --includestyles      allows latexml to load raw *.sty file;
                       by default it avoids this.
  --path=dir           adds to the paths searched for files,
                       modules, etc;
  --documentid=id      assign an id to the document root.
  --quiet              suppress messages (can repeat)
  --verbose            more informative output (can repeat)
  --strict             makes latexml less forgiving of errors
  --bibtex             processes as a BibTeX bibliography.
  --xml                requests xml output (default).
  --tex                requests TeX output after expansion.
  --box                requests box output after expansion
                       and digestion.
  --noparse            suppresses parsing math
  --nocomments         omit comments from the output
  --inputencoding=enc  specify the input encoding.
  --VERSION            show version number.
  --debug=package      enables debugging output for the named
                       package
  --help               shows this help message.

If texfile is ’-’, latexml reads the TeX source from standard input. If texfile has an explicit extension of .bib, it is processed as a BibTeX bibliography.

Options & Arguments

--destination=ﬁle

Speciﬁes the destination ﬁle; by default the XML is written to stdout.

--preload=module

Requests the loading of an optional module or package. This may be useful

if the TeX code does not specifically require the module (eg. through input or

usepackage). For example, use --preload=LaTeX.pool to force LaTeX

mode.

--preamble=file, --postamble=file

Specifies a file whose contents will effectively be prepended or appended to the main document file's content. This can be useful when processing TeX fragments, in which case the preamble would contain the \documentclass and \begin{document} control sequences. This option is not used when processing BibTeX files.
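As a minimal sketch of the fragment case described above: the file names (fragment.tex, pre.tex, post.tex, fragment.xml) are hypothetical, and placing \end{document} in the postamble is an assumption about how one would typically close the document. The conversion itself runs only if latexml is installed.

```shell
#!/bin/sh
# Sketch: convert a TeX fragment by wrapping it with a preamble and postamble.
# All file names are hypothetical; nothing here is prescribed by LaTeXML.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# The fragment has no \documentclass or \begin{document} of its own.
printf '%s\n' 'Let $a^2+b^2=c^2$ be given.' > fragment.tex

# The preamble supplies what the fragment lacks ...
printf '%s\n' '\documentclass{article}' '\begin{document}' > pre.tex

# ... and the postamble closes the document (an assumption of this sketch).
printf '%s\n' '\end{document}' > post.tex

# Run the conversion only when latexml is available.
if command -v latexml >/dev/null 2>&1; then
  latexml --preamble=pre.tex --postamble=post.tex \
          --destination=fragment.xml fragment.tex
fi
```

Keeping the wrapper in separate files lets the same preamble be reused for many fragments.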

--includestyles

This option allows processing of style files (files with extensions sty, cls, clo, cnf). By default, these files are ignored unless a latexml implementation of them is found (with an extension of ltxml).

These style files generally fall into two classes. Those that merely affect document style can be ignored in the XML. Others define new markup and document structure, often using deeper LaTeX macros to achieve their ends. Although omitting these will lead to other errors (missing macro definitions), it is unlikely that processing the TeX code in the style file will lead to a correct document.

--path=dir

Add dir to the search paths used when searching for ﬁles, modules, style ﬁles,

etc; somewhat like TEXINPUTS. This option can be repeated.

--documentid=id

Assigns an ID to the root element of the XML document. This ID is generally

inherited as the preﬁx of ID’s on all other elements within the document. This

is useful when constructing a site of multiple documents so that all nodes have

unique IDs.

--quiet

Reduces the verbosity of output during processing; used twice, it is pretty silent.

--verbose

Increases the verbosity of output during processing; used twice, it is pretty chatty. This can be useful for getting more details when errors occur.

--strict

Specifies a strict processing mode. By default, undefined control sequences and invalid document constructs (that violate the DTD) give warning messages, and latexml attempts to continue processing. Using --strict makes them generate fatal errors.

--bibtex

Forces latexml to treat the ﬁle as a BibTeX bibliography. Note that the timing

is slightly different than the usual case with BibTeX and LaTeX. In the latter

case, BibTeX simply selects and formats a subset of the bibliographic entries;

the actual TeX expansion is carried out when the result is included in a LaTeX

document. In contrast, latexml processes and expands the entire bibliography;

the selection of entries is done during postprocessing. This also means that any

packages that deﬁne macros used in the bibliography must be speciﬁed using the

--preload option.
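The two-stage handling described above might be scripted roughly as follows. This is only a sketch: the file names (refs.bib, refs.xml, paper.xml, paper.html), the dummy entry, and the choice of amsmath as the preloaded package are all hypothetical, and the commands run only if the tools (and the input document) are present.

```shell
#!/bin/sh
# Sketch: convert a BibTeX file up front; entries are selected later by latexmlpost.
# File names, the dummy entry and the preloaded package are hypothetical.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# A one-entry bibliography whose title uses a macro (\text) from amsmath.
cat > refs.bib <<'EOF'
@article{example2019,
  author = {An Author},
  title  = {On $x \text{ and } y$},
  year   = {2019}
}
EOF

if command -v latexml >/dev/null 2>&1; then
  # The .bib extension already selects BibTeX mode; --bibtex makes it explicit.
  # The package defining \text must be preloaded, since the whole file is expanded.
  latexml --bibtex --preload=amsmath --destination=refs.xml refs.bib
fi

# Later, when postprocessing a document, point latexmlpost at the converted file.
if command -v latexmlpost >/dev/null 2>&1 && [ -f paper.xml ] && [ -f refs.xml ]; then
  latexmlpost --bibliography=refs.xml --destination=paper.html paper.xml
fi
```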

--xml

Requests XML output; this is the default.

--tex

Requests TeX output for debugging purposes; processing is only carried out

through expansion and digestion. This may not be quite valid TeX, since Uni-

code may be introduced.

--box

Requests Box output for debugging purposes; processing is carried out through expansion and digestion, and the result is printed.

--nocomments

Normally latexml preserves comments from the source ﬁle, and adds a comment

every 25 lines as an aid in tracking the source. The option --nocomments discards

such comments.

--inputencoding=encoding

Specify the input encoding, eg. --inputencoding=iso-8859-1. The en-

coding must be one known to Perl’s Encode package. Note that this only enables

the translation of the input bytes to UTF-8 used internally by LaTeXML, but

does not affect catcodes. It is usually better to use LaTeX’s inputenc package.

Note that this does not affect the output encoding, which is always UTF-8.

--VERSION

Shows the version number of the LaTeXML package.

--debug=package

Enables debugging output for the named package. The package is given without

the leading LaTeXML::.

--help

Shows this help message.

See also

latexmlpost,latexmlmath,LaTeXML

latexmlpost

Postprocesses an XML file generated by latexml to perform common tasks, such as converting math to images and processing graphics inclusions for the web.

Synopsis

latexmlpost [options] xmlﬁle

Options:

--verbose shows progress during processing.

--VERSION show version number.

--help shows help message.

--sourcedirectory=sourcedir sets directory of the original

source TeX file.

--validate, --novalidate Enables (the default) or disables

validation of the source xml.

--format=html|html5|html4|xhtml|xml requests the output format.

(html defaults to html5)

--destination=file sets output file (and directory).

--omitdoctype omits the Doctype declaration,

--noomitdoctype disables the omission (the default)

--numbersections enables (the default) the inclusion of

section numbers in titles, crossrefs.

--nonumbersections disables the above

--stylesheet=xslfile requests the XSL transform using the

given xslfile as stylesheet.

--css=cssfile adds css stylesheet to (x)html(5)

(can be repeated)

--nodefaultresources disables processing built-in resources

--javascript=jsfile adds a link to a javascript file into

html4/html5/xhtml (can be repeated)

--xsltparameter=name:value passes parameters to the XSLT.

--split requests splitting each document

--nosplit disables the above (default)

--splitat sets level to split the document

--splitpath=xpath sets xpath expression to use for

splitting (default splits at

sections, if splitting is enabled)

--splitnaming=(id|idrelative|label|labelrelative) specifies

how to name split files (idrelative).

--scan scans documents to extract ids,

labels, section titles, etc. (default)

--noscan disables the above

--crossref fills in crossreferences (default)

--nocrossref disables the above

--urlstyle=(server|negotiated|file) format to use for urls

(default server).

--navigationtoc=(context|none) generates a table of contents

in navigation bar

--index requests creating an index (default)

--noindex disables the above

--splitindex Splits index into pages per initial.

--nosplitindex disables the above (default)

--permutedindex permutes index phrases in the index

--nopermutedindex disables the above (default)

--bibliography=file sets a bibliography file

--splitbibliography splits the bibliography into pages per

initial.

--nosplitbibliography disables the above (default)

--prescan carries out only the split (if

enabled) and scan, storing

cross-referencing data in dbfile

(default is complete processing)

--dbfile=dbfile sets file to store crossreferences

--sitedirectory=dir sets the base directory of the site

--mathimages converts math to images

(default for html4 format)

--nomathimages disables the above

--mathsvg converts math to svg images

--nomathsvg disables the above

--mathimagemagnification=mag sets magnification factor

--presentationmathml converts math to Presentation MathML

(default for xhtml & html5 formats)

--pmml alias for --presentationmathml

--nopresentationmathml disables the above

--linelength=n formats presentation mathml to a

linelength max of n characters

--contentmathml converts math to Content MathML

--nocontentmathml disables the above (default)

--cmml alias for --contentmathml

--openmath converts math to OpenMath

--noopenmath disables the above (default)

--om alias for --openmath

--keepXMath preserves the intermediate XMath

representation (default is to remove)

--mathtex adds TeX annotation to parallel markup

--nomathtex disables the above (default)

--mathlex adds linguistic lexeme annotation to parallel markup

--nomathlex disables the above (default)

--plane1 use plane-1 unicode for symbols

(default, if needed)

--noplane1 do not use plane-1 unicode

--graphicimages converts graphics to images (default)

--nographicimages disables the above

--graphicsmap=type.type specifies a graphics file mapping

--pictureimages converts picture environments to

images (default)

--nopictureimages disables the above

--svg converts picture environments to SVG

--nosvg disables the above (default)

If xmlﬁle is ’-’, latexmlpost reads the XML from standard input.

Options & Arguments

General Options

--verbose

Requests informative output as processing proceeds. Can be repeated to increase

the amount of information.

--VERSION

Shows the version number of the LaTeXML package.

--help

Shows this help message.

Source Options

--sourcedirectory=source

Speciﬁes the directory where the original latex source is located. Unless latexml-

post is run from that directory, or it can be determined from the xml ﬁlename, it

may be necessary to specify this option in order to ﬁnd graphics and style ﬁles.

--validate,--novalidate

Enables (the default) or disables the validation of the source XML document.

Format Options

--format=(html|html5|html4|xhtml|xml)

Speciﬁes the output format for post processing. By default, it will be guessed

from the ﬁle extension of the destination (if given), with html implying html5,

xhtml implying xhtml and the default being xml, which you probably don’t

want.

The html5 format converts the material to html5 form with mathematics as

MathML; html5 supports SVG. html4 format converts the material to the ear-

lier html form, version 4, and the mathematics to png images. xhtml format

converts to xhtml and uses presentation MathML (after attempting to parse the

mathematics) for representing the math. html5 similarly converts math to pre-

sentation MathML. In these cases, any graphics will be converted to web-friendly

formats and/or copied to the destination directory. If you simply specify html,

it will treat that as html5.

For the default, xml, the output is left in LaTeXML’s internal xml, but the math

is parsed and converted to presentation MathML. For html, html5 and xhtml, a

default stylesheet is provided, but see the --stylesheet option.
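The guessing rule described above (format inferred from the destination's extension) can be illustrated with a small shell function. This is a sketch of the documented rule only, not LaTeXML's actual code; the function name guess_format and the file names are hypothetical.

```shell
#!/bin/sh
# Sketch of the documented rule: .html implies html5, .xhtml implies xhtml,
# anything else falls back to xml. Not taken from LaTeXML's implementation.
guess_format() {
  case "$1" in
    *.html)  echo html5 ;;
    *.xhtml) echo xhtml ;;
    *)       echo xml ;;
  esac
}

guess_format site/paper.html    # html5
guess_format site/paper.xhtml   # xhtml
guess_format site/paper.out     # xml

# Passing --format explicitly avoids relying on the guess (hypothetical files).
if command -v latexmlpost >/dev/null 2>&1 && [ -f paper.xml ]; then
  latexmlpost --format=html5 --destination=site/paper.html paper.xml
fi
```

Since the fallback is xml, which is rarely what is wanted, giving either a recognizable extension or an explicit --format is the safer habit.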

--destination=destination

Speciﬁes the destination ﬁle and directory. The directory is needed for mathim-

ages, mathsvg and graphics processing.

--omitdoctype,--noomitdoctype

Omits (or includes) the document type declaration. The default is to include it if

the document model was based on a DTD.

--numbersections,--nonumbersections

Includes (default), or disables the inclusion of section, equation, etc, numbers in

the formatted document and crossreference links.

--stylesheet=xslﬁle

Requests the XSL transformation of the document using the given xslﬁle as

stylesheet. If the stylesheet is omitted, a ‘standard’ one appropriate for the format

(html4, html5 or xhtml) will be used.

--css=cssﬁle

Adds cssﬁle as a css stylesheet to be used in the transformed html/html5/xhtml.

Multiple stylesheets can be used; they are included in the html in the order given,

following the default ltx-LaTeXML.css (unless --nodefaultcss). The

stylesheet is copied to the destination directory, unless it is an absolute url.

Some stylesheets included in the distribution are:

--css=navbar-left Puts a navigation bar on the left (the default omits the navbar).

--css=navbar-right Puts a navigation bar on the right.

--css=theme-blue A blue coloring theme for headings.

--css=amsart A style suitable for journal articles.

--javascript=jsﬁle

Includes a link to the javascript file jsfile, to be used in the transformed html/html5/xhtml. Multiple javascript files can be included; they are linked in the html in the order given. The javascript file is copied to the destination directory, unless it is an absolute url.

--icon=iconﬁle

Copies iconfile to the destination directory and sets up the linkage in the transformed html/html5/xhtml to use that as the "favicon".

--nodefaultresources

Disables the copying and inclusion of resources added by the binding files; this includes CSS, javascript or other files. This does not affect resources explicitly requested by the --css or --javascript options.

--timestamp=timestamp

Provides a timestamp (typically a time and date) to be embedded in the com-

ments by the stock XSLT stylesheets. If you don’t supply a timestamp, the cur-

rent time and date will be used. (You can use --timestamp=0 to omit the

timestamp).

--xsltparameter=name:value

Passes parameters to the XSLT stylesheet. See the manual or the stylesheet itself

for available parameters.

Site & Crossreferencing Options

--split,--nosplit

Enables or disables (default) the splitting of documents into multiple 'pages'. If enabled, the document will be split into sections, bibliography, index and appendices (if any) by default, unless --splitpath is specified.

--splitat=unit

Speciﬁes what level of the document to split at. Should be one of chapter,

section (the default), subsection or subsubsection. For more con-

trol, see --splitpath.

--splitpath=xpath

Speciﬁes an XPath expression to select nodes that will generate separate

pages. The default splitpath is //ltx:section |//ltx:bibliography |//ltx:appendix |

//ltx:index

Specifying

--splitpath="//ltx:section | //ltx:subsection

| //ltx:bibliography | //ltx:appendix | //ltx:index"

would split the document at subsections as well as sections.
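Because the XPath expression contains spaces and '|' characters, it has to reach latexmlpost as a single quoted argument. The following sketch builds the value shown above from a list of element names; the input and output file names are hypothetical, and the command runs only if latexmlpost and the input exist.

```shell
#!/bin/sh
# Build a --splitpath value from a list of element names.
splitpath=""
for unit in section subsection bibliography appendix index; do
  # Append " | " between alternatives, but not before the first one.
  splitpath="${splitpath:+$splitpath | }//ltx:$unit"
done
printf '%s\n' "$splitpath"

# paper.xml and the site/ directory are hypothetical.
if command -v latexmlpost >/dev/null 2>&1 && [ -f paper.xml ]; then
  # The double quotes keep the expression together as one argument.
  latexmlpost --split --splitpath="$splitpath" \
              --destination=site/paper.html paper.xml
fi
```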

--splitnaming=(id|idrelative|label|labelrelative)

Specifies how to name the files for subdocuments created by splitting. The values id and label simply use the id or label of the subdocument's root node for its filename. idrelative and labelrelative use the portion of the id or label that follows the parent document's id or label. Furthermore, to impose structure and uniqueness, if a split document has children that are also split, that document (and its children) will be in a separate subdirectory with the name index.

--scan,--noscan

Enables (default) or disables the scanning of documents for ids, labels, refer-

ences, indexmarks, etc, for use in ﬁlling in refs, cites, index and so on. It may

be useful to disable when generating documents not based on the LaTeXML

doctype.

--crossref,--nocrossref

Enables (default) or disables the filling in of references, hrefs, etc. based on a previous scan (either from --scan, or --dbfile). It may be useful to disable when generating documents not based on the LaTeXML doctype.

--urlstyle=(server|negotiated|file)

This option determines the way that URLs within the documents are formatted, depending on the way they are intended to be served. The default, server, eliminates unnecessary trailing index.html. With negotiated, the trailing file extension (typically html or xhtml) is eliminated. The scheme file preserves complete (but relative) urls so that the site can be browsed as files without any server.

--navigationtoc=(context|none)

Generates a table of contents in the navigation bar; default is none. The 'context' style of TOC is somewhat verbose and reveals more detail near the current page; it is most suitable for navigation bars placed on the left or right. Other styles of TOC should be developed and added here, such as a short form.

--index,--noindex

Enables (default) or disables the generation of an index from indexmarks em-

bedded within the document. Enabling this has no effect unless there is an index

element in the document (generated by \printindex).

--splitindex,--nosplitindex

Enables or disables (default) the splitting of generated indexes into separate

pages per initial letter.

--bibliography=pathname

Specifies a bibliography generated from a BibTeX file to be used to fill in a bibliography element. Hand-written bibliographies placed in a thebibliography environment do not need this. The option has no effect unless there is a bibliography element in the document (generated by \bibliography).

Note that this option provides the bibliography to be used to ﬁll in the bibliogra-

phy element (generated by \bibliography); latexmlpost does not (currently)

directly process and format such a bibliography.

--splitbibliography,--nosplitbibliography

Enables or disables (default) the splitting of generated bibliographies into sepa-

rate pages per initial letter.

--prescan

By default latexmlpost processes a single document into one (or more; see

--split) destination ﬁles in a single pass. When generating a complicated site

consisting of several documents it may be advantageous to ﬁrst scan through the

documents to extract and store (in dbfile) cross-referencing data (such as ids,

titles, urls, and so on). A later pass then has complete information allowing all

documents to reference each other, and also constructs an index and bibliography

that reﬂects the entire document set. The same effect (though less efﬁcient) can

be achieved by running latexmlpost twice, provided a dbfile is speciﬁed.

--dbfile=ﬁle

Speciﬁes a ﬁlename to use for the crossreferencing data when using two-pass

processing. This ﬁle may reside in the intermediate destination directory.

--sitedirectory=dir

Speciﬁes the base directory of the overall web site. Pathnames in the database

are stored in a form relative to this directory to make it more portable.
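The two-pass processing described under --prescan and --dbfile could be organized as in the sketch below. The document names, site.db and the site/ directory are hypothetical, and the use of --noscan in the second pass assumes that the stored data can simply be reused without rescanning. The commands are collected into a plan and executed only where latexmlpost and the input file exist.

```shell
#!/bin/sh
# Sketch: two passes over several documents sharing one cross-reference database.
docs="intro reference"   # hypothetical document base names (no spaces)
db=site.db               # hypothetical database file
site=site                # hypothetical site base directory

plan=""
for pass in --prescan --noscan; do
  for doc in $docs; do
    cmd="latexmlpost $pass --dbfile=$db --sitedirectory=$site --destination=$site/$doc.html $doc.xml"
    plan="$plan$cmd
"
    # Execute only when the tool and the input are really there.
    # Unquoted $cmd relies on the names above containing no spaces.
    if command -v latexmlpost >/dev/null 2>&1 && [ -f "$doc.xml" ]; then
      $cmd
    fi
  done
done

# Show the planned command lines: first all prescans, then the full passes.
printf '%s' "$plan"
```

Running every prescan before any full pass is the point of the arrangement: by the second pass the database already describes the whole document set.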

Math Options These options specify how math should be converted into other for-

mats. Multiple formats can be requested; how they will be combined depends on the

format and other options.

--mathimages,--nomathimages

Requests or disables the conversion of math to images (png by default). Conver-

sion is the default for html4 format.

--mathsvg,--nomathsvg

Requests or disables the conversion of math to svg images.

--mathimagemagnification=factor

Speciﬁes the magniﬁcation used for math images (both png and svg), if they are

made. Default is 1.75.

--presentationmathml,--nopresentationmathml

Requests or disables conversion of math to Presentation MathML. Conversion is

the default for xhtml and html5 formats.

--linelength=number

(Experimental) Line-breaks the generated Presentation MathML so that it is no

longer than number ‘characters’.

--plane1

Converts the content of Presentation MathML token elements to the appropriate

Unicode Plane-1 codepoints according to the selected font, when applicable (the

default).

--hackplane1

Converts the content of Presentation MathML token elements to the appropri-

ate Unicode Plane-1 codepoints according to the selected font, but only for the

mathvariants double-struck, fraktur and script. This gives support for current (as

of August 2009) versions of Firefox and MathPlayer, provided a sufﬁcient set of

fonts is available (eg. STIX).

--contentmathml,--nocontentmathml

Requests or disables conversion of math to Content MathML. Conversion is dis-

abled by default. Note that this conversion is only partially implemented.

--openmath,--noopenmath

Requests or disables conversion of math to OpenMath. Conversion is disabled

by default. Note that this conversion is only partially implemented.

--keepXMath

By default, when any of the MathML or OpenMath conversions are used, the

intermediate math representation will be removed; this option preserves it; it

will be used as secondary parallel markup, when it follows the options for other

math representations.

Graphics Options

--graphicimages,--nographicimages

Enables (default) or disables the conversion of graphics to web-appropriate for-

mat (png).

--graphicsmap=sourcetype.desttype

Speciﬁes a mapping of graphics ﬁle types. Typically, graphics elements specify

a graphics ﬁle that will be converted to a more appropriate ﬁle target format;

for example, postscript ﬁles used for graphics with LaTeX will be converted to

png format for use on the web. As with LaTeX, when a graphics ﬁle is speciﬁed

without a ﬁle type, the system will search for the most appropriate target type

ﬁle.

When this option is used, it overrides and replaces the defaults and provides

a mapping of sourcetype to desttype. The option can be repeated to provide

several mappings, with the earlier formats preferred. If the desttype is omitted, it

speciﬁes copying ﬁles of type sourcetype, unchanged.

The default setting is equivalent to having supplied the options:

--graphicsmap=svg

--graphicsmap=png

--graphicsmap=gif

--graphicsmap=jpg

--graphicsmap=jpeg

--graphicsmap=eps.png

--graphicsmap=ps.png

--graphicsmap=ai.png

--graphicsmap=pdf.png

The ﬁrst formats are preferred and used unchanged, while the latter ones are

converted to png.
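To make the sourcetype.desttype notation concrete, the following sketch interprets a few mapping values according to the rule stated above (a missing desttype means the file is copied unchanged). The helper describe_map and the file names are hypothetical; this is an illustration of the notation, not LaTeXML's parser.

```shell
#!/bin/sh
# Sketch: interpret --graphicsmap values as described in the text.
describe_map() {
  case "$1" in
    # "source.dest": text before the first dot is the source type, the rest the target.
    *.*) echo "convert ${1%%.*} to ${1#*.}" ;;
    # No dot: files of this type are copied as they are.
    *)   echo "copy $1 unchanged" ;;
  esac
}

for map in svg png eps.png pdf.png; do
  describe_map "$map"
done

# Supplying any --graphicsmap replaces the defaults, so list every type needed.
if command -v latexmlpost >/dev/null 2>&1 && [ -f paper.xml ]; then
  latexmlpost --graphicsmap=svg --graphicsmap=png --graphicsmap=eps.png \
              --destination=site/paper.html paper.xml
fi
```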

--pictureimages,--nopictureimages

Enables (default) or disables the conversion of picture environments and pstricks

material into images.

--svg,--nosvg

Enables or disables (default) the conversion of picture environments and pstricks

material to SVG.

See also

latexml,latexmlmath,LaTeXML

latexmlmath

Transforms a TeX/LaTeX math expression into various formats.

Synopsis

latexmlmath [options] texmath

Options:

--mathimage=file converts to image in file

--mathsvg=file converts to svg image in file

--magnification=mag specifies magnification factor

--presentationmathml=file converts to Presentation MathML

--pmml=file alias for --presentationmathml

--linelength=n do linewrapping of pMML

--contentmathml=file convert to Content MathML

--cmml=file alias for --contentmathml

--openmath=file convert to OpenMath

--om=file alias for --openmath

--XMath=file output LaTeXML’s internal format

--noparse disables parsing of math

(not useful for cMML or openmath)

--preload=file loads a style file.

--includestyles allows processing raw *.sty files

(normally it avoids this)

--path=dir adds a search path for style files.

--quiet reduces verbosity (can repeat)

--verbose increases verbosity (can repeat)

--strict be more strict about errors.

--documentid=id assign an id to the document root.

--debug=package enables debugging output for the

named package

--inputencoding=enc specify the input encoding.

--VERSION show version number and exit.

--help shows this help message.

-- ends options

If texmath is ’-’, latexmlmath reads the TeX from standard input. If any of the

output ﬁles are ’-’, the result is printed on standard output.

Input notes Note that, unless you are reading texmath from standard input, the texmath string will be processed by whatever shell you are using before latexmlmath even sees it. This means that many so-called meta characters, such as backslash and star, may confuse the shell or be changed. Consequently, you will need to quote and/or escape the input appropriately. Most particularly, \ will need to be doubled to \\ for latexmlmath to see it as a control sequence.

Using -- to explicitly end the option list is useful for cases when the math starts

with a minus (and would otherwise be interpreted as an option, probably an unrecog-

nized one). Alternatively, wrapping the texmath with {} will hide the minus.

Simple examples:

latexmlmath \\frac{-b\\pm\\sqrt{b^2-4ac}}{2a}

echo "\\sqrt{b^2-4ac}" | latexmlmath --pmml=quad.mml -
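The quoting rules mentioned in the input notes can be checked without running latexmlmath at all. In this sketch, the helper show (a hypothetical name) prints its argument exactly as a program would receive it from a POSIX shell; the final command, which uses -- to protect a leading minus, runs only if latexmlmath is installed.

```shell
#!/bin/sh
# show prints its first argument exactly as the called program receives it.
show() { printf '%s\n' "$1"; }

show \\frac{a}{b}      # unquoted: the backslash must be doubled
show "\\frac{a}{b}"    # double quotes: a doubled backslash also yields one
show '\frac{a}{b}'     # single quotes: no doubling needed

# A leading minus would be read as an option, so end the option list with --.
# --pmml=- sends the Presentation MathML to standard output.
if command -v latexmlmath >/dev/null 2>&1; then
  latexmlmath --pmml=- -- '-b\pm\sqrt{b^2-4ac}'
fi
```

All three calls to show pass the same string, \frac{a}{b}; single quotes are usually the least error-prone choice for longer expressions.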

Options & Arguments

Conversion Options These options specify what formats the math should be con-

verted to. In each case, the destination ﬁle is given. Except for mathimage, the ﬁle can

be given as ’-’, in which case the result is printed to standard output.

If no conversion option is speciﬁed, the default is to output presentation MathML

to standard output.

--mathimage=ﬁle

Requests conversion to png images.

--mathsvg=ﬁle

Requests conversion to svg images.

--magnification=factor

Specifies the magnification used for math images. Default is 1.75.

--presentationmathml=ﬁle

Requests conversion to Presentation MathML.

--linelength=number

(Experimental) Line-breaks the generated Presentation MathML so that it is no

longer than number ‘characters’.

--plane1

Converts the content of Presentation MathML token elements to the appropriate

Unicode Plane-1 codepoints according to the selected font, when applicable.

--hackplane1

Converts the content of Presentation MathML token elements to the appropri-

ate Unicode Plane-1 codepoints according to the selected font, but only for the

mathvariants double-struck, fraktur and script. This gives support for current (as

of August 2009) versions of Firefox and MathPlayer, provided a sufﬁcient set of

fonts is available (eg. STIX).

--contentmathml=ﬁle

Requests conversion to Content MathML. Note that this conversion is only par-

tially implemented.

--openmath=ﬁle

Requests conversion to OpenMath. Note that this conversion is only partially

implemented.

--XMath=ﬁle

Requests conversion to LaTeXML's internal format.

Other Options

--preload=module

Requests the loading of an optional module or package. This may be useful if the TeX code does not specifically require the module (e.g. through \input or \usepackage). For example, use --preload=LaTeX.pool to force LaTeX mode.

--includestyles

This option allows processing of style files (files with extensions sty, cls, clo, cnf). By default, these files are ignored unless a latexml implementation of them is found (with an extension of ltxml).

These style files generally fall into two classes. Those that merely affect document style can be ignored in the XML. Others define new markup and document structure, often using deeper LaTeX macros to achieve their ends. Although omitting these will lead to other errors (missing macro definitions), it is unlikely that processing the TeX code in the style file will lead to a correct document.

--path=dir

Add dir to the search paths used when searching for ﬁles, modules, style ﬁles,

etc; somewhat like TEXINPUTS. This option can be repeated.

--documentid=id

Assigns an ID to the root element of the XML document. This ID is generally

inherited as the preﬁx of ID’s on all other elements within the document. This

is useful when constructing a site of multiple documents so that all nodes have

unique IDs.

--quiet

Reduces the verbosity of output during processing; used twice, it is pretty silent.

--verbose

Increases the verbosity of output during processing; used twice, it is pretty chatty. This can be useful for getting more details when errors occur.

--strict

Specifies a strict processing mode. By default, undefined control sequences and invalid document constructs (that violate the DTD) give warning messages, and latexmlmath attempts to continue processing. Using --strict makes them generate fatal errors.

--inputencoding=encoding

Specify the input encoding, eg. --inputencoding=iso-8859-1. The en-

coding must be one known to Perl’s Encode package. Note that this only enables

the translation of the input bytes to UTF-8 used internally by LaTeXML, but

does not affect catcodes. It is usually better to use LaTeX’s inputenc package.

Note that this does not affect the output encoding, which is always UTF-8.

--VERSION

Shows the version number of the LaTeXML package.

--debug=package

Enables debugging output for the named package. The package is given without

the leading LaTeXML::.

--help

Shows this help message.

BUGS

This program runs much slower than would seem justified. This is a result of the relatively slow initialization, including loading TeX and LaTeX macros and the schema. Normally, this cost would be amortized over large documents, whereas, in this case, we're processing a single math expression.

See also

latexml,latexmlpost,LaTeXML

Appendix B

Implemented Bindings

Bindings for the following classes and packages are supplied with the distribution:

classes: IEEEtran, JHEP, JHEP2, JHEP3, OmniBus, a0poster, aa, aastex, aastex6, aastex61, acmart, amsart, amsbook, amsproc, article, book, elsart, elsarticle, emulateapj, gen-j-l, gen-m-l, gen-p-l, ieeeconf, iopart, llncs, mn, mn2e, mnras, moderncv, quantumarticle, report, revtex, revtex4-1, revtex4, slides, subfiles, svjour, svjour3, svmult

packages: a0size, a4, a4wide, aas_macros, aasms, aaspp, aastex, accents, acronym, ae, aecompl, afterpage, algc, algcompatible, algmatlab, algorithm, algorithm2e, algorithmic, algorithmicx, algpascal, algpseudocode, alltt, amsbsy, amscd, amsfonts, amsgen, amsmath, amsopn, amsppt, amsrefs, amssymb, amstex, amstext, amsthm, amsxtra, apjfonts, appendix, array, attachfile, authblk, avant, babel, balance, bbm, bbold, beton, bm, bookman, booktabs, braket, breakurl, calc, cancel, caption, cases, ccfonts, chancery, charter, chngcntr, circuitikz, cite, citesort, cleveref, cmbright, color, colordvi, colortbl, comment, concmath, courier, crop, cropmark, csquotes, dcolumn, deluxetable, doublespace, dsfont, ellipsis, elsart, empheq, emulateapj, emulateapj5, endnotes, enumerate, enumitem, epigraph, epsf, epsfig, epstopdf, esint, etex, etoolbox, eucal, eufrak, euler, eulervm, eurosym, euscript, exscale, fancyhdr, fix-cm, fixltx2e, flafter, fleqn, float, floatfig, floatflt, floatpag, flowchart, flushend, fontenc, fontspec, footmisc, fourier, framed, fullpage, gensymb, geometry, german, graphics, graphicx, grffile, helvet, here, hhline, html, hyperref, hyperxmp, icml2016, icml2017, icml2018, ifluatex, ifpdf, ifthen, ifvtex, ifxetex, import, indentfirst, inputenc, iopams, jheppub, keyval, lastpage, latexml, latexsym, lineno, lipsum, listings, listingsutf8, llamapun, lmodern, longtable, lscape, luximono, lxRDFa, makecell, makeidx, marvosym, mathbbol, mathpazo, mathpple, mathptm, mathptmx, mathrsfs, mathtools, microtype, mleftright, multicol, multido, multirow, nameref, natbib, newcent, newfloat, newlfont, newtxmath, newtxtext, ngerman, nicefrac, ntheorem, numprint, palatino, paralist, parskip, pdflscape, pdfpages, pdfsync, pgf, pgfplots, pifont, placeins, preview, psfig, pslatex, pspicture, pst-grad, pst-node, pstricks, pxfonts, ragged2e, relsize, remreset, revsymb, revtex, revtex4, rotate, rotating, rsfs, scalefnt, sectsty, setspace, showkeys, siunitx, slashed, soul, srcltx, stfloats, stmaryrd, subcaption, subfig, subfigure, subfiles, subfloat, supertabular, svg, t1enc, tablefootnote, tabularx, tabulary, textcase, textcomp, texvc, theorem, thmtools, threeparttable, tikz-3dplot, tikz, times, titlesec, titling, tocbibind, todonotes, tracefnt, transparent, turing, txfonts, type1cm, ulem, units, upgreek, upref, url, utopia, verbatim, wasysym, wiki, wrapfig, xargs, xcolor, xkeyval, xkvview, xspace, xunicode, yfonts

Appendix C

Perl Modules Documentation

LaTeXML

A converter that transforms TeX and LaTeX into XML/HTML/MathML

Synopsis

use LaTeXML;

my $converter = LaTeXML->get_converter($config);

my $converter = LaTeXML->new($config);

$converter->prepare_session($opts);

$converter->initialize_session; # SHOULD BE INTERNAL

$hashref = $converter->convert($tex);

my ($result,$log,$status)

= map {$hashref->{$_}} qw(result log status);

Description

LaTeXML is a converter that transforms TeX and LaTeX into XML/HTML/MathML

and other formats.

A LaTeXML object represents a converter instance and can convert ﬁles on de-

mand, until dismissed.

Methods

my $converter = LaTeXML->new($config);

Creates a new converter object for a given LaTeXML::Common::Config object, $config.

my $converter = LaTeXML->get_converter($config);

Either creates, or looks up a cached converter for the $config configuration object.

$converter->prepare_session($opts);

Top-level preparation routine that prepares both a correct options object and an initialized LaTeXML object, using the "initialize_options" and "initialize_session" routines, when needed.

Contains optimization checks that skip initializations unless necessary.

Also adds support for partial option specifications during daemon runtime, falling back on the option defaults given when the converter object was created.

my ($result,$status,$log) = $converter->convert($tex);

Converts a TeX input string $tex into the LaTeXML::Core::Document object

$result.

Supplies detailed information of the conversion log ($log), as well as a brief

conversion status summary ($status).

INTERNAL ROUTINES

$converter->initialize_session($opts);

Given an options hash reference $opts, initializes a session by creating a new

LaTeXML object with initialized state and loading a daemonized preamble (if

any).

Sets the ”ready” ﬂag to true, making a subsequent ”convert” call immediately

possible.

my $latexml = new_latexml($opts);

Creates a new LaTeXML object and initializes its state.

my $postdoc = $converter->convert_post($dom);

Post-processes a LaTeXML::Core::Document object $dom into a final format, based on the preferences specified in $$self{opts}.

Typically used only internally by convert.

$converter->bind log;

Binds STDERR to a ”log” ﬁeld in the $converter object

my $log = $converter->flush log;

Flushes out the accumulated conversion log into $log, reseting STDERR to its

usual stream.

LaTeXML::Global

Global exports used within LaTeXML, and in Packages.

Synopsis

use LaTeXML::Global;

Description

This module exports the various constants and constructors that are useful throughout

LaTeXML, and in Package implementations.

Global state

$STATE;

This is bound to the currently active LaTeXML::Core::State by an
instance of LaTeXML during processing.

LaTeXML::Package

Support for package implementations and document customization.

Synopsis

This package defines and exports most of the procedures users will need to customize or
extend LaTeXML. The LaTeXML implementation of some package might look
something like the following, but see the installed LaTeXML/Package directory for
realistic examples.

package LaTeXML::Package::Pool;  # to put new subs & variables in common pool
use LaTeXML::Package;            # to load these definitions
use strict;                      # good style
use warnings;

# Load "anotherpackage"
RequirePackage('anotherpackage');

# A simple macro, just like in TeX
DefMacro('\thesection', '\thechapter.\roman{section}');

# A constructor defines how a control sequence generates XML:
DefConstructor('\thanks{}', "<ltx:thanks>#1</ltx:thanks>");

# And a simple environment ...
DefEnvironment('{abstract}', '<abstract>#body</abstract>');

# A math symbol \Real to stand for the Reals:
DefMath('\Real', "\x{211D}", role => 'ID');

# Or a semantic floor:
DefMath('\floor{}', '\left\lfloor#1\right\rfloor');

# More esoteric ...
# Use a RelaxNG schema
RelaxNGSchema("MySchema");

# Or use a special DocType if you have to:
# DocType("rootelement",
#         "-//Your Site//Your DocType", 'your.dtd',
#         prefix => "http://whatever/");

# Allow sometag elements to be automatically closed if needed
Tag('prefix:sometag', autoClose => 1);

# Don't forget this, so perl knows the package loaded.
1;

Description

This module provides a large set of utilities and declarations that are useful for writing
'bindings': LaTeXML-specific implementations of a set of control sequences such as
would be defined in a LaTeX style or class file. They are also useful for controlling
and customizing LaTeXML's processing. See the "See also" section, below, for
additional lower-level modules imported & re-exported.

To a limited extent (and currently only when explicitly enabled), LaTeXML can
process the raw TeX code found in style files. However, to preserve document
structure and semantics, as well as for efficiency, it is usually necessary to
supply a LaTeXML-specific 'binding' for style and class files. For example, a binding
mypackage.sty.ltxml would encode LaTeXML-specific implementations of all
the control sequences in mypackage.sty so that \usepackage{mypackage}
would work. Similarly for myclass.cls.ltxml. Additionally, document-specific
bindings can be supplied: before processing a TeX source file, eg mydoc.tex,
LaTeXML will automatically include the definitions and settings in mydoc.latexml.
These .ltxml and .latexml files should be placed in LaTeXML's searchpaths, where
it will find them: either in the current directory, in a directory given to the
--path option, or in a directory added to the variable SEARCHPATHS.

Since LaTeXML mimics TeX, a familiarity with TeX's processing model is
critical. LaTeXML models: catcodes and tokens (see LaTeXML::Core::Token,
LaTeXML::Core::Tokens), which are extracted from the plain source text
characters by the LaTeXML::Core::Mouth; Macros, which are expanded within the
LaTeXML::Core::Gullet; and Primitives, which are digested within the
LaTeXML::Core::Stomach to produce LaTeXML::Core::Box and LaTeXML::Core::List
objects. A key additional feature is the Constructors: when digested they
generate a LaTeXML::Core::Whatsit which, upon absorption by
LaTeXML::Core::Document, inserts text or XML fragments in the final
document tree.

Notation: Many of the following forms take code references as arguments or
options. That is, either a reference to a defined sub, eg. \&somesub, or an anonymous
function sub {... }. To document these cases, and the arguments that are passed
in each case, we'll use a notation like code($stomach,...).
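As an illustration of the two kinds of code reference, the following sketch defines the same sort of macro twice. The control sequence names and the sub name are hypothetical; ExplodeText is the helper used in the DefMacro examples of this manual.

```perl
package LaTeXML::Package::Pool;
use strict;
use warnings;
use LaTeXML::Package;

# A named sub, passed by reference as \&hypothetical_expansion.
# In the code($gullet,@args) notation, the gullet is passed first,
# followed by the control sequence's arguments (none here).
sub hypothetical_expansion {
  my ($gullet) = @_;
  return ExplodeText('named sub'); }

DefMacro('\hypotheticalnamed', \&hypothetical_expansion);

# The same idea with an anonymous function.
DefMacro('\hypotheticalanon', sub {
    my ($gullet) = @_;
    return ExplodeText('anonymous sub'); });

1;
```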

Control Sequences Many of the following forms define the behaviour of control
sequences. While in TeX you'll typically only define macros, LaTeXML is
effectively redefining TeX itself, so we define Macros as well as Primitives,
Registers, Constructors and Environments. These define the behaviour of these
control sequences when processed during the various phases of LaTeXML's
imitation of TeX's digestive tract.

Prototypes LaTeXML uses a more convenient method of specifying parameter
patterns for control sequences. The first argument to each of these defining forms
(DefMacro, DefPrimitive, etc) is a prototype consisting of the control sequence
being defined along with the specification of parameters required by the control
sequence. Each parameter describes how to parse tokens following the control
sequence into arguments or how to delimit them. To simplify coding and capture
common idioms in TeX/LaTeX programming, LaTeXML's parameter specifications are
more expressive than TeX's \def or LaTeX's \newcommand. Examples of the
prototypes for familiar TeX or LaTeX control sequences are:

DefConstructor('\usepackage[]{}',...

DefPrimitive('\multiply Variable SkipKeyword:by Number',...

DefPrimitive('\newcommand OptionalMatch:* DefToken[]{}', ...

The general syntax for parameter speciﬁcation is

{spec}

reads a regular TeX argument. spec can be omitted (ie. {}). Otherwise spec is
itself a parameter specification and the argument is reparsed accordingly. ({}
is a shorthand for Plain.)

[spec]

reads a LaTeX-style optional argument. spec can be omitted (ie. []). Otherwise,
if spec is of the form Default:stuff, then stuff would be the default value.
Otherwise spec is itself a parameter specification and the argument, if supplied,
is reparsed according to that specification. ([] is a shorthand for Optional.)

Type

Reads an argument of the given type, where either Type has been declared, or

there exists a ReadType function accessible from LaTeXML::Package::Pool. See

the available types, below.

Type:value | Type:value1:value2...

These forms invoke the parser for Type but pass additional Tokens to the reader

function. Typically this would supply defaults or parameters to a match.

OptionalType

Similar to Type, but it is not considered an error if the reader returns undef.


SkipType

Similar to OptionalType, but the value returned from the reader is ignored,

and does not occupy a position in the arguments list.

The predeﬁned argument Types are as follows.

Plain, Semiverbatim

Reads a standard TeX argument being either the next token, or if the next token

is an {, the balanced token list. In the case of Semiverbatim, many catcodes

are disabled, which is handy for URL’s, labels and similar.

Token, XToken

Read a single TeX Token. For XToken, if the next token is expandable, it is

repeatedly expanded until an unexpandable token remains, which is returned.

Number, Dimension, Glue, MuGlue

Read an Object corresponding to Number, Dimension, Glue or MuGlue, using

TeX’s rules for parsing these objects.

Until:match | XUntil:match

Reads tokens until a match to the tokens match is found, returning the tokens pre-

ceding the match. This corresponds to TeX delimited arguments. For XUntil,

tokens are expanded as they are matched and accumulated.

UntilBrace

Reads tokens until the next open brace {. This corresponds to the peculiar TeX

construct \def\foo#{....

Match:match(|match)* | Keyword:match(|match)*

Reads tokens expecting a match to one of the token lists match, returning the

one that matches, or undef. For Keyword, case and catcode of the matches are

ignored. Additionally, any leading spaces are skipped.

Balanced

Read tokens until a closing }, but respecting nested {} pairs.

BalancedParen

Reads parenthesis-delimited tokens, but does not balance any nested
parentheses.

Undigested, Digested, DigestUntil:match

These types alter the usual sequence of tokenization and digestion in separate
stages (like TeX). An Undigested parameter inhibits digestion completely and
remains in token form. A Digested parameter gets digested until the
(required) opening { is balanced; this is useful when the content would usually need
to have been protected in order to correctly deal with catcodes. DigestUntil
digests tokens until a token matching match is found.

Variable

Reads a token, expanding if necessary, and expects a control sequence naming

a writable register. If such is found, it returns an array of the corresponding

deﬁnition object, and any arguments required by that deﬁnition.

SkipSpaces, Skip1Space

Skips any number of space tokens (SkipSpaces), or a single one (Skip1Space),
if present, but contributes nothing to the argument list.
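The parameter specifications above can be sketched in a few prototypes. All control sequence names below are hypothetical and the expansions are only placeholders; the point is how each specification maps onto #1, #2 and so on.

```perl
package LaTeXML::Package::Pool;
use strict;
use warnings;
use LaTeXML::Package;

# [] then {}: an optional argument (#1) followed by a required one (#2).
DefMacro('\hypotheticalnote[]{}', '\textbf{#1} #2');

# [Default:...]: the optional argument falls back to "world" when omitted.
DefMacro('\hypotheticalgreet[Default:world]', 'Hello, #1!');

# Until:match: a delimited argument, everything up to \hypotheticalend.
DefMacro('\hypotheticalbegin Until:\hypotheticalend', '(#1)');

# Semiverbatim: many catcodes are disabled while the argument is read.
DefMacro('\hypotheticalurl Semiverbatim', '\texttt{#1}');

# SkipSpaces occupies no position in the argument list,
# so the Number that follows is still #1.
DefMacro('\hypotheticalcount SkipSpaces Number', 'count=#1');

1;
```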

Common Options

scope=>'local' | 'global' | scope

Most defining commands accept an option to control how the definition is stored,
for global or local definitions, or using a named scope. A named scope saves a set
of definitions and values that can be activated at a later time.

Particularly interesting forms of scope are those that get automatically
activated upon changes of counter and label. For example, definitions that have
scope=>'section:1.1' will be activated when the section number is
"1.1", and will be deactivated when that section ends.

locked=>boolean

This option controls whether this deﬁnition is locked from further changes in

the TeX sources; this keeps local ’customizations’ by an author from overriding

important LaTeXML deﬁnitions and breaking the conversion.
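A short sketch of the two common options follows. The macro names are hypothetical, and the scope name simply reuses the section:1.1 form described above.

```perl
package LaTeXML::Package::Pool;
use strict;
use warnings;
use LaTeXML::Package;

# Stored globally, regardless of any enclosing TeX group.
DefMacro('\hypotheticalstatus', 'draft', scope => 'global');

# Stored in a named scope, activated while the section number is 1.1.
DefMacro('\hypotheticalstatus', 'final', scope => 'section:1.1');

# Locked: protected from redefinition in the TeX sources.
DefMacro('\hypotheticalbrand', 'LaTeXML', locked => 1);

1;
```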

Macros

DefMacro(prototype,expansion,%options);

Defines the macro expansion for prototype; a macro control sequence that is
expanded during macro expansion time in the LaTeXML::Core::Gullet.
The expansion should be one of tokens | string | code($gullet,@args): a string
will be tokenized upon first usage. Any macro arguments will be substituted for
parameter indicators (eg #1) in the tokens or tokenized string and the result is
used as the expansion of the control sequence. If code is used, it is called at
expansion time and should return a list of tokens as its result.

DefMacro options are

scope=>scope,

locked=>boolean

See LaTeXML::Package/"Common Options".

mathactive=>boolean

speciﬁes a deﬁnition that will only be expanded in math mode; the control

sequence must be a single character.


Examples:

DefMacro('\thefootnote', '\arabic{footnote}');

DefMacro('\today', sub { ExplodeText(today()); });

DefMacroI(cs,paramlist,expansion,%options);

Internal form of DefMacro where the control sequence and parameter list have
already been separated; useful for definitions from within code. Also, slightly
more efficient for macros with no arguments (use undef for paramlist), and
useful for obscure cases like defining \begin{something*} as a Macro.
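The three ways of supplying an expansion might look as follows. The control sequence names are hypothetical; the use of unlist to obtain the individual tokens of an argument is an assumption about LaTeXML::Core::Tokens rather than something stated in this section.

```perl
package LaTeXML::Package::Pool;
use strict;
use warnings;
use LaTeXML::Package;

# String expansion with a parameter indicator.
DefMacro('\hypotheticalshout{}', '\textbf{#1!}');

# Code expansion: called at expansion time with the gullet and the
# arguments; it returns a list of tokens (here the argument, twice).
DefMacro('\hypotheticaltwice{}', sub {
    my ($gullet, $arg) = @_;
    return ($arg->unlist, $arg->unlist); });

# DefMacroI: control sequence and parameter list given separately;
# undef means there are no parameters.
DefMacroI('\hypotheticalname', undef, 'Unnamed');

1;
```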

Conditionals

DefConditional(prototype,test,%options);

Deﬁnes a conditional for prototype; a control sequence that is processed dur-

ing macro expansion time (in the LaTeXML::Core::Gullet). A condi-

tional corresponds to a TeX \if. If the test is undef, a \newif type of condi-

tional is deﬁned, which is controlled with control sequences like \footrue and

\foofalse. Otherwise the test should be code($gullet,@args) (with

the control sequence’s arguments) that is called at expand time to determine the

condition. Depending on whether the result of that evaluation returns a true or

false value (in the usual Perl sense), the result of the expansion is either the ﬁrst

or else code following, in the usual TeX sense.

DefConditional options are

scope=>scope,

locked=>boolean

See LaTeXML::Package/"Common Options".

skipper=>code($gullet)

This option is only used to deﬁne \ifcase.

Example:

DefConditional('\ifmmode', sub {
  LookupValue('IN_MATH'); });

DefConditionalI(cs,paramlist,test,%options);

Internal form of DefConditional where the control sequence and parameter
list have already been parsed; useful for definitions from within code.
Also, slightly more efficient for conditionals with no arguments (use undef for
paramlist).

IfCondition($ifcs,@args)

IfCondition allows you to test a conditional from within perl. Thus
something like if(IfCondition('\ifmmode')){ domath } else { dotext }
might be equivalent to TeX's \ifmmode domath \else dotext \fi.
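A sketch of the conditional forms described above follows. The conditional names and the hypothetical_draft value are illustrative, and IfCondition is called with a string as in the manual's own example.

```perl
package LaTeXML::Package::Pool;
use strict;
use warnings;
use LaTeXML::Package;

# Tested by code: true when the (hypothetical) value is set.
DefConditional('\ifhypotheticaldraft', sub {
    my ($gullet) = @_;
    return LookupValue('hypothetical_draft'); });

# With an undef test, a \newif-style conditional is defined, controlled
# by \hypotheticalflagtrue and \hypotheticalflagfalse.
DefConditional('\ifhypotheticalflag', undef);

# Consulting a conditional from Perl, inside another definition.
DefMacro('\hypotheticalwhere', sub {
    return ExplodeText(IfCondition('\ifmmode') ? 'math' : 'text'); });

1;
```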

Primitives

DefPrimitive(prototype,replacement,%options);

Deﬁnes a primitive control sequence; a primitive is processed during digestion

(in the LaTeXML::Core::Stomach), after macro expansion but before

Construction time. Primitive control sequences generate Boxes or Lists, gen-

erally containing basic Unicode content, rather than structured XML. Primitive

control sequences are also executed for side effect during digestion, effecting

changes to the LaTeXML::Core::State.

The replacement can be a string used as the text content of a Box to be created

(using the current font). Alternatively replacement can be code($stomach,@args)

(with the control sequence’s arguments) which is invoked at digestion time, prob-

ably for side-effect, but returning Boxes or Lists or nothing. replacement may

also be undef, which contributes nothing to the document, but does record the

TeX code that created it.

DefPrimitive options are

scope=>scope,

locked=>boolean

See LaTeXML::Package/"Common Options".

mode=>('text' | 'display_math' | 'inline_math')

Changes to this mode during digestion.

font=>{%fontspec}

Specifies the font to use (see LaTeXML::Package/"Fonts"). If the
font change is to only apply to material generated within this command,
you would also use bounded=>1; otherwise, the font will remain in
effect afterwards as for a font switching command.

bounded=>boolean

If true, TeX grouping (ie. {}) is enforced around this invocation.

requireMath=>boolean,

forbidMath=>boolean

speciﬁes whether the given constructor can only appear, or cannot appear,

in math mode.

beforeDigest=>code($stomach)

supplies a hook to execute during digestion just before the main part

of the primitive is executed (and before any arguments have been read).

The code should either return nothing (return;) or a list of digested items

(Box’s,List,Whatsit). It can thus change the State and/or add to the digested

output.

afterDigest=>code($stomach)

supplies a hook to execute during digestion just after the main part of the
primitive is executed. It should either return nothing (return;) or digested
items. It can thus change the State and/or add to the digested output.


isPrefix=>boolean

indicates whether this is a prefix type of command; this is only used for
the special TeX assignment prefixes, like \global.

Example:

DefPrimitive('\begingroup', sub { $_[0]->begingroup; });

DefPrimitiveI(cs,paramlist,code($stomach,@args), %options);

Internal form of DefPrimitive where the control sequence and parameter list

have already been separated; useful for deﬁnitions from within code.
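The replacement forms for primitives can be sketched as below. The control sequences and the hypothetical_language value are made up for illustration; AssignValue and ToString are assumed to be available through LaTeXML::Package.

```perl
package LaTeXML::Package::Pool;
use strict;
use warnings;
use LaTeXML::Package;

# String replacement: becomes the text of a Box in the current font.
DefPrimitive('\hypotheticalellipsis', "\x{2026}");

# Code replacement, run at digestion time for its side effect on the
# State; returning nothing contributes no Box to the output.
DefPrimitive('\hypotheticalsetlang{}', sub {
    my ($stomach, $lang) = @_;
    AssignValue('hypothetical_language' => ToString($lang), 'global');
    return; });

# DefPrimitiveI with separated control sequence and parameter list.
DefPrimitiveI('\hypotheticalreset', undef, sub {
    AssignValue('hypothetical_language' => undef, 'global');
    return; });

1;
```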

Registers

DefRegister(prototype,value,%options);

Defines a register with value as the initial value (a Number, Dimension, Glue,
MuGlue or Tokens; I haven't handled Box's yet). Usually, the prototype
is just the control sequence, but registers are also handled by prototypes like
\count{Number}. DefRegister arranges that the register value can be
accessed when a numeric, dimension, ... value is being read, and also defines the
control sequence for assignment.

Options are

readonly=>boolean

speciﬁes if it is not allowed to change this value.

getter=>code(@args),

setter=>code($value,@args)

By default value is stored in the State’s Value table under a name con-

catenating the control sequence and argument values. These options allow

other means of fetching and storing the value.

Example:

DefRegister('\pretolerance', Number(100));

DefRegisterI(cs,paramlist,value,%options);

Internal form of DefRegister where the control sequence and parameter list

have already been parsed; useful for deﬁnitions from within code.
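A few register definitions, as a sketch: the register names and the hypothetical_depth key are invented, and it is assumed that Dimension accepts a string with TeX units.

```perl
package LaTeXML::Package::Pool;
use strict;
use warnings;
use LaTeXML::Package;

# A dimension register with an initial value
# (assumption: Dimension parses a string such as '10pt').
DefRegister('\hypotheticalindent', Dimension('10pt'));

# A read-only count register.
DefRegister('\hypotheticalversion', Number(3), readonly => 1);

# Custom storage via getter and setter, instead of the default Value table entry.
DefRegister('\hypotheticaldepth', Number(0),
  getter => sub { LookupValue('hypothetical_depth') || Number(0); },
  setter => sub {
    my ($value) = @_;
    AssignValue('hypothetical_depth' => $value);
    return; });

1;
```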

Constructors

DefConstructor(prototype,$replacement,%options);

The Constructor is where LaTeXML really starts getting interesting; invoking
the control sequence will generate an arbitrary XML fragment in the document
tree. More specifically: during digestion, the arguments will be read and
digested, creating a LaTeXML::Core::Whatsit to represent the object.
During absorption by the LaTeXML::Core::Document, the Whatsit
will generate the XML fragment according to replacement. The replacement
can be code($document,@args,%properties) which is called during
document absorption to create the appropriate XML (See the methods of
LaTeXML::Core::Document).

More conveniently, replacement can be a pattern: simply a bit of XML as a
string with certain substitutions to be made. The substitutions are of the
following forms:

#1, #2 ... #name

These are replaced by the corresponding argument (for #1) or property (for
#name) stored with the Whatsit. Each is turned into a string when it appears
in an attribute position, or recursively processed when it appears as content.

&function(@args)

Another form of substituted value is prefixed with & which invokes a function.
For example, &func(#1) would invoke the function func on the first argument
to the control sequence; what it returns will be inserted into the document.

?test(pattern) or ?test(ifpattern)(elsepattern)

Patterns can be made conditional using this form. The test is any of the
above expressions (eg. #1), considered true if the result is non-empty. Thus
?#1(<foo/>) would add the empty element foo if the first argument were given.

If the constructor begins with ^, the XML fragment is allowed to float up
to a parent node that is allowed to contain it, according to the Document
Type.

The Whatsit property font is defined by default. Additional properties body
and trailer are defined when captureBody is true, or for environments.
By using $whatsit->setProperty(key=>$value); within afterDigest, or by
using the properties option, other properties can be added.
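The substitution forms can be sketched as follows. The control sequences, the helper function and the property name are hypothetical, and the elements and attributes used are assumed to be permitted by the schema in effect.

```perl
package LaTeXML::Package::Pool;
use strict;
use warnings;
use LaTeXML::Package;

# #1 in attribute position is turned into a string; #2 as content is
# processed recursively. ?#1(...) emits the attribute only when the
# optional argument was given.
DefConstructor('\hypotheticalnote[]{}',
  "<ltx:note ?#1(role='#1')>#2</ltx:note>");

# &function(...) calls a Perl function and inserts what it returns.
sub hypothetical_class {
  my ($arg) = @_;
  return 'level-' . ToString($arg); }

DefConstructor('\hypotheticallevel{}{}',
  "<ltx:text class='&hypothetical_class(#1)'>#2</ltx:text>");

# #name refers to a property stored with the Whatsit.
DefConstructor('\hypotheticalstamp{}',
  "<ltx:text class='#kind'>#1</ltx:text>",
  properties => { kind => 'stamp' });

1;
```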

DefConstructor options are

scope=>scope,


locked=>boolean

See LaTeXML::Package/"Common Options".

mode=>mode,

font=>{%fontspec},

bounded=>boolean,

requireMath=>boolean,

forbidMath=>boolean

These options are the same as for LaTeXML::Package/Primitives

reversion=>texstring | code($whatsit,#1,#2,...)

specifies the reversion of the invocation back into TeX tokens (if the default
reversion is not appropriate). The texstring string can include #1, #2...
The code is called with the $whatsit and digested arguments and must
return a list of Token's.

alias=>control sequence

provides a control sequence to be used in the reversion instead of the

one deﬁned in the prototype. This is a convenient alternative for rever-

sion when a ’public’ command conditionally expands into an internal one,

but the reversion should be for the public command.

sizer=>string | code($whatsit)

specifies how to compute (approximate) the displayed size of the object,
if that size is ever needed (typically needed for graphics generation). If
a string is given, it should contain only a sequence of #1 or #name to
access arguments and properties of the Whatsit: the size is computed from
these items laid out side-by-side. If code is given, it should return the
three Dimensions (width, height and depth). If neither is given, and the
reversion specification is of suitable format, it will be used for the sizer.

properties=>{%properties} | code($stomach,#1,#2...)

supplies additional properties to be set on the generated Whatsit. In the first
form, the values can be of any type, but if a value is a code reference, it
takes the same args ($stomach,#1,#2,...) and should return the value; it is
executed before creating the Whatsit. In the second form, the code should
return a hash of properties.

beforeDigest=>code($stomach)

supplies a hook to execute during digestion just before the Whatsit is cre-

ated. The code should either return nothing (return;) or a list of digested

items (Box’s,List,Whatsit). It can thus change the State and/or add to the

digested output.

afterDigest=>code($stomach,$whatsit)

supplies a hook to execute during digestion just after the Whatsit is created

(and so the Whatsit already has its arguments and properties). It should

either return nothing (return;) or digested items. It can thus change the

State, modify the Whatsit, and/or add to the digested output.

beforeConstruct=>code($document,$whatsit)

supplies a hook to execute before constructing the XML (generated by re-

placement).

afterConstruct=>code($document,$whatsit)

Supplies code to execute after constructing the XML.

captureBody=>boolean |Token

if true, arbitrary following material will be accumulated into a ‘body’ until

the current grouping level is reverted, or till the Token is encountered if

the option is a Token. This body is available as the body property of the

Whatsit. This is used by environments and math.

nargs=>nargs

This gives a number of args for cases where it can't be inferred directly from
the prototype (eg. when more args are explicitly read by hooks).

DefConstructorI(cs,paramlist,replacement,%options);

Internal form of DefConstructor where the control sequence and parameter

list have already been separated; useful for deﬁnitions from within code.
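Several of the constructor options might be combined as in this sketch. The control sequences, class name and property names are hypothetical, and the choice of \fbox as an alias is only for illustration.

```perl
package LaTeXML::Package::Pool;
use strict;
use warnings;
use LaTeXML::Package;

DefConstructor('\hypotheticalbox{}',
  "<ltx:text class='hypothetical-box'>#1</ltx:text>",
  mode    => 'text',     # switch to text mode during digestion
  bounded => 1,          # enforce TeX grouping around the invocation
  alias   => '\fbox',    # control sequence to use in the reversion
  beforeDigest => sub {
    my ($stomach) = @_;
    AssignValue('hypothetical_in_box' => 1);    # before the Whatsit exists
    return; },
  afterDigest => sub {
    my ($stomach, $whatsit) = @_;
    $whatsit->setProperty(hypothetical_checked => 1);
    return; });

# DefConstructorI: control sequence and parameter list given separately.
DefConstructorI('\hypotheticalrule', undef, "<ltx:rule/>");

1;
```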

DefMath(prototype,tex,%options);

A common shorthand constructor; it defines a control sequence that creates a
mathematical object, such as a symbol, function or operator application. The
options given can effectively create semantic macros that contribute to the eventual
parsing of mathematical content. In particular, it generates an XMDual using the
replacement tex for the presentation. The content information is drawn from the
name and options.

DefMath accepts the options:

scope=>scope,

locked=>boolean

See LaTeXML::Package/"Common Options".

font=>{%fontspec},

reversion=>reversion,

alias=>cs,

sizer=>sizer,

properties=>properties,

beforeDigest=>code($stomach),

afterDigest=>code($stomach,$whatsit),

These options are the same as for LaTeXML::Package/Constructors

name=>name

gives a name attribute for the object


omcd=>cdname

gives the OpenMath content dictionary that name is from.

role=>grammatical role

adds a grammatical role attribute to the object; this speciﬁes the grammati-

cal role that the object plays in surrounding expressions. This direly needs

documentation!

mathstyle=>(’display’ |’text’ |’script’ |’scriptscript’)

Controls whether this object will be presented in a specific mathstyle,
or according to the current setting of mathstyle.

scriptpos=>(’mid’ |’post’)

Controls the positioning of any sub and super-scripts relative to this object;

whether they be stacked over or under it, or whether they will appear in

the usual position. TeX.pool deﬁnes a function doScriptpos() which

is useful for operators like \sum in that it sets to mid position when in

displaystyle, otherwise post.

stretchy=>boolean

Whether or not the object is stretchy when displayed.

operator_role=>grammatical role,

operator_scriptpos=>boolean,

operator_stretchy=>boolean

These three are similar to role, scriptpos and stretchy, but are
used in unusual cases. These apply the given attributes to the operator
token in the content branch.

nogroup=>boolean

Normally, these commands are digested with an implicit grouping around
them, localizing changes to fonts, etc; nogroup=>1 inhibits this.

Example:

DefMath('\infty', "\x{221E}",
  role=>'ID', meaning=>'infinity');

DefMathI(cs,paramlist,tex,%options);

Internal form of DefMath where the control sequence and parameter list have

already been separated; useful for deﬁnitions from within code.
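Some DefMath variations, as a sketch: the control sequences are hypothetical, and passing \&doScriptpos as the scriptpos value is an assumption based on the description of doScriptpos() above.

```perl
package LaTeXML::Package::Pool;
use strict;
use warnings;
use LaTeXML::Package;

# A plain symbol with a grammatical role and a meaning.
DefMath('\hypotheticalReal', "\x{211D}", role => 'ID', meaning => 'real-numbers');

# A semantic macro: the TeX string gives the presentation, while the
# name contributes to the content information.
DefMath('\hypotheticalfloor{}', '\left\lfloor#1\right\rfloor',
  name => 'floor');

# A big operator whose scripts are placed by doScriptpos
# (mid in displaystyle, otherwise post).
DefMath('\hypotheticalsum', "\x{2211}",
  role      => 'SUMOP',
  scriptpos => \&doScriptpos);

# DefMathI: control sequence and parameter list given separately.
DefMathI('\hypotheticalinfty', undef, "\x{221E}", role => 'ID');

1;
```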

Environments

DefEnvironment(prototype,replacement,%options);

Defines an Environment that generates a specific XML fragment. replacement
is of the same form as for DefConstructor, but will generally include reference
to the #body property. Upon encountering a \begin{env}: the mode is
switched, if needed, else a new group is opened; then the environment name is
noted; the beforeDigest hook is run. Then the Whatsit representing the begin
command (but ultimately the whole environment) is created and the
afterDigestBegin hook is run. Next, the body will be digested and collected until
the balancing \end{env}. Then, any afterDigest hook is run, the environment
is ended, and finally the mode is ended or the group is closed. The body and
\end{env} whatsit are added to the \begin{env}'s whatsit as body and
trailer, respectively.

DefEnvironment takes the following options:

scope=>scope,

locked=>boolean

See LaTeXML::Package/"Common Options".

mode=>mode,

font=>{%fontspec},

requireMath=>boolean,

forbidMath=>boolean,

These options are the same as for LaTeXML::Package/Primitives

reversion=>reversion,

alias=>cs,

sizer=>sizer,

properties=>properties,

nargs=>nargs

These options are the same as for LaTeXML::Package/DefConstructor

beforeDigest=>code($stomach)

This hook is similar to that for DefConstructor, but it applies to the
\begin{environment} control sequence.

afterDigestBegin=>code($stomach,$whatsit)

This hook is similar to DefConstructor's afterDigest but it applies
to the \begin{environment} control sequence. The Whatsit is
the one for the beginning control sequence, but represents the environment
as a whole. Note that although the arguments and properties are present in
the Whatsit, the body of the environment is not yet available!

beforeDigestEnd=>code($stomach)

This hook is similar to DefConstructor's beforeDigest but it applies
to the \end{environment} control sequence.

afterDigest=>code($stomach,$whatsit)

This hook is similar to DefConstructor's afterDigest but it applies
to the \end{environment} control sequence. Note, however, that the
Whatsit is only for the ending control sequence, not the Whatsit for the
environment as a whole.


afterDigestBody=>code($stomach,$whatsit)

This option supplies a hook to be executed during digestion after the ending
control sequence has been digested (and all 4 other digestion hooks have
executed) and after the body of the environment has been obtained. The
Whatsit is the (useful) one representing the whole environment, and it now
does have the body and trailer available, stored as properties.

Example:

DefConstructor('\emph{}',
  "<ltx:emph>#1</ltx:emph>", mode=>'text');

DefEnvironmentI(name,paramlist,replacement,%options);

Internal form of DefEnvironment where the control sequence and parameter

list have already been separated; useful for deﬁnitions from within code.
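An environment definition using the hooks described above might be sketched like this. The environment name and property names are hypothetical, and ltx:quote with a class attribute is assumed to be acceptable to the schema.

```perl
package LaTeXML::Package::Pool;
use strict;
use warnings;
use LaTeXML::Package;

DefEnvironment('{hypotheticalquote}[]',
  "<ltx:quote ?#1(class='#1')>#body</ltx:quote>",
  mode => 'text',
  # Runs for \begin{hypotheticalquote}, before arguments are read.
  beforeDigest => sub {
    my ($stomach) = @_;
    AssignValue('hypothetical_in_quote' => 1);
    return; },
  # The Whatsit has its arguments here, but not yet its body.
  afterDigestBegin => sub {
    my ($stomach, $whatsit) = @_;
    $whatsit->setProperty(hypothetical_started => 1);
    return; },
  # Now the body and trailer properties are available.
  afterDigestBody => sub {
    my ($stomach, $whatsit) = @_;
    my $body = $whatsit->getProperty('body');
    $whatsit->setProperty(hypothetical_empty => 1) unless defined $body;
    return; });

1;
```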

Inputting Content and Definitions

FindFile(name,%options);

Find an appropriate ﬁle with the given name in the current directories in

SEARCHPATHS. If a ﬁle ending with .ltxml is found, it will be preferred.

Note that if the name starts with a recognized protocol (currently one of

(literal|http|https|ftp)) followed by a colon, the name is returned,

as is, and no search for ﬁles is carried out.

The options are:

type=>type

speciﬁes the ﬁle type. If not set, it will search for both name.tex and

name.

noltxml=>1

inhibits searching for a LaTeXML binding (name.type.ltxml) to use

instead of the ﬁle itself.

notex=>1

inhibits searching for raw tex version of the ﬁle. That is, it will only search

for the LaTeXML binding.
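The effect of the FindFile options can be sketched as follows; the file name hypothetical is a placeholder, and nothing is assumed here about the value returned when no file is found.

```perl
package LaTeXML::Package::Pool;
use strict;
use warnings;
use LaTeXML::Package;

# Accept either a binding (hypothetical.sty.ltxml, preferred) or the raw file.
my $either = FindFile('hypothetical', type => 'sty');

# Only the raw TeX file is acceptable.
my $raw = FindFile('hypothetical', type => 'sty', noltxml => 1);

# Only a LaTeXML binding is acceptable.
my $binding = FindFile('hypothetical', type => 'sty', notex => 1);

# Names with a recognized protocol are returned as is, without searching.
my $literal = FindFile('literal:\relax');

1;
```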

InputContent(request,%options);

InputContent is used for cases when the ﬁle (or data) is plain TeX material

that is expected to contribute content to the document (as opposed to pure deﬁni-

tions). A Mouth is opened onto the ﬁle, and subsequent reading and/or digestion

will pull Tokens from that Mouth until it is exhausted, or closed.

In some circumstances it may be useful to provide a string containing the TeX
material explicitly, rather than referencing a file. In this case, the literal
pseudo-protocol may be used:

InputContent('literal:\textit{Hey}');

If a file named $request.latexml exists, it will be read in as if it were a
latexml binding file, before processing. This can be used for ad hoc customization
of the conversion of specific files, without modifying the source, or creating more
elaborate bindings.

The only option to InputContent is:

noerror=>boolean

Inhibits signalling an error if no appropriate ﬁle is found.

Input(request);

Input is analogous to LaTeX's \input, and is used in cases where it isn't
completely clear whether content or definitions are expected. Once a file is found,
the approach specified by InputContent or InputDefinitions is used,
depending on which type of file is found.
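As a sketch, these input functions could be wrapped in primitives of a binding. The control sequence names are hypothetical, and calling InputContent and Input from within a primitive is an assumption about typical usage.

```perl
package LaTeXML::Package::Pool;
use strict;
use warnings;
use LaTeXML::Package;

# Pull in content from a file; with noerror, a missing file is not
# signalled as an error.
DefPrimitive('\hypotheticalinclude Semiverbatim', sub {
    my ($stomach, $name) = @_;
    InputContent(ToString($name) . '.tex', noerror => 1);
    return; });

# Supply the content directly through the literal pseudo-protocol.
DefPrimitive('\hypotheticalhey', sub {
    InputContent('literal:\textit{Hey}');
    return; });

# Let LaTeXML decide whether the file holds content or definitions.
DefPrimitive('\hypotheticalinput Semiverbatim', sub {
    my ($stomach, $name) = @_;
    Input(ToString($name));
    return; });

1;
```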

InputDefinitions(request,%options);

InputDefinitions is used for loading deﬁnitions, ie. various macros, set-

tings, etc, rather than document content; it can be used to load LaTeXML’s

binding ﬁles, or for reading in raw TeX deﬁnitions or style ﬁles. It reads and

processes the material completely before returning, even in the case of TeX def-

initions. This procedure optionally supports the conventions used for standard

LaTeX packages and classes (see RequirePackage and LoadClass).

Options for InputDefinitions are:

type=>type

the ﬁle type to search for.

noltxml=>boolean

inhibits searching for a LaTeXML binding; only raw TeX ﬁles will be

sought and loaded.

notex=>boolean

inhibits searching for raw TeX ﬁles, only a LaTeXML binding will be

sought and loaded.

noerror=>boolean

inhibits reporting an error if no appropriate ﬁle is found.

The following options are primarily useful when InputDefinitions is sup-

porting standard LaTeX package and class loading.

withoptions=>boolean

indicates whether to pass in any options from the calling class or package.

handleoptions=>boolean

indicates whether options processing should be handled.


options=>[...]

speciﬁes a list of options (in the ’package options’ sense) to be passed

(possibly in addition to any provided by the calling class or package).

after=>tokens |code($gullet)

provides tokens or code to be processed by a name.type-h@@k macro.

as_class=>boolean

ﬁshy option that indicates that this deﬁnitions ﬁle should be treated as if it

were deﬁning a class; typically shows up in latex compatibility mode, or

AMSTeX.

A handy method to use most of the TeX distribution's raw TeX definitions for
a package, but override only a few with LaTeXML bindings, is by defining a
binding file, say tikz.sty.ltxml, to contain

InputDefinitions('tikz', type => 'sty', noltxml => 1);

which would find and read in tikz.sty, and then follow it by a couple of
strategic LaTeXML definitions, DefMacro, etc.
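Put together, such a binding file might look like the following sketch; the overriding macro is hypothetical and stands in for whatever definitions actually need replacing.

```perl
# tikz.sty.ltxml (sketch)
package LaTeXML::Package::Pool;
use strict;
use warnings;
use LaTeXML::Package;

# Read the raw tikz.sty from the TeX distribution, skipping any binding.
InputDefinitions('tikz', type => 'sty', noltxml => 1);

# Then override selected definitions; this macro is only a placeholder.
DefMacro('\hypotheticaltikzhelper', '\relax');

1;
```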

Class and Packages

RequirePackage(package,%options);

Finds and loads a package implementation (usually package.sty.ltxml,
unless noltxml is specified) for the requested package. It returns the pathname
of the loaded package. The options are:

type=>type

specifies the file type (default sty).

options=>[...]

specifies a list of package options.

noltxml=>boolean

inhibits searching for the LaTeXML binding for the file (ie. name.type.ltxml).

notex=>1

inhibits searching for raw tex version of the file. That is, it will only search
for the LaTeXML binding.
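A few RequirePackage calls, as a sketch; the package name hypothetical is a placeholder and the option values are only examples.

```perl
package LaTeXML::Package::Pool;
use strict;
use warnings;
use LaTeXML::Package;

# Load a package, passing package options.
RequirePackage('geometry', options => ['margin=1in']);

# Prefer the raw hypothetical.sty over any LaTeXML binding.
RequirePackage('hypothetical', noltxml => 1);

# Only accept a LaTeXML binding, never the raw TeX file.
my $path = RequirePackage('amsmath', notex => 1);

1;
```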

LoadClass(class,%options);

Finds and loads a class definition (usually class.cls.ltxml). It returns the pathname of the loaded class. The only option is

options=>[...]

specifies a list of class options.

LoadPool(pool,%options);

Loads a pool file (usually pool.pool.ltxml), one of the top-level definition files, such as TeX, LaTeX or AMSTeX. It returns the pathname of the loaded file.

DeclareOption(option,tokens |string |code($stomach));

Declares an option for the current package or class. The 2nd argument can be a

string (which will be tokenized and expanded) or tokens (which will be macro

expanded), to provide the value for the option, or it can be a code reference which

is treated as a primitive for side-effect.

If a package or class wants to accommodate options, it should start with one or more DeclareOptions, followed by ProcessOptions().
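The pattern just described might look as follows. This is a sketch only: the package name mypkg, the state name mypkg_draft and the macro \mypkgmode are all invented for illustration.

```perl
# mypkg.sty.ltxml: hypothetical package binding with two options.
package LaTeXML::Package::Pool;
use strict;
use warnings;
use LaTeXML::Package;

# Code references are run for side effect; here each one records the
# choice in a (made-up) global state value.
DeclareOption('draft', sub { AssignValue('mypkg_draft' => 1, 'global'); return; });
DeclareOption('final', sub { AssignValue('mypkg_draft' => 0, 'global'); return; });

# Run the handlers for whatever options were passed to the package.
ProcessOptions();

# Later definitions can consult the recorded choice.
DefMacro('\mypkgmode', sub {
    return Tokenize(LookupValue('mypkg_draft') ? 'draft' : 'final'); });

1;
```

Declaring all options before calling ProcessOptions() matters, since an option that has no declaration at that point cannot be handled by this package.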

PassOptions(name,ext,@options);

Causes the given @options (strings) to be passed to the package (if ext is sty)

or class (if ext is cls) named by name.

ProcessOptions(%options);

Processes the options that have been passed to the current package or class in a fashion similar to LaTeX. The only option (to ProcessOptions) is inorder=>boolean, indicating whether the (package) options are processed in the order they were used, like ProcessOptions*.

ExecuteOptions(@options);

Process the options given explicitly in @options.

AtBeginDocument(@stuff);

Arranges for @stuff to be carried out after the preamble, at the beginning of the document. @stuff should typically be macro-level stuff, but carried out for side effect; it should be tokens, token lists, strings (which will be tokenized), or code($gullet) which would yield tokens to be expanded.

This operation is useful for style files loaded with --preload or document-specific customization files (i.e. ending with .latexml); normally the contents would be executed before LaTeX and other style files are loaded and thus can be overridden by them. By deferring the evaluation to begin-document time, these contents can override those style files. This is likely to be meaningful only for LaTeX documents.

AtEndDocument(@stuff)

Arranges for @stuff to be carried out just before \end{document}. These tokens can be used for side effect, or any content they generate will appear as the last children of the document.

Counters and IDs

NewCounter(ctr,within,%options);

Defines a new counter, like LaTeX's \newcounter, but extended. It defines a counter that can be used to generate reference numbers, and defines \thectr, etc. It also defines an "uncounter" which can be used to generate IDs (xml:id) for unnumbered objects. ctr is the name of the counter. If defined, within is the name of another counter which, when incremented, will cause this counter to be reset. The options are:

idprefix=>string

Specifies a prefix to be used to generate IDs when using this counter.

nested

Not sure that this is even sane.

$num = CounterValue($ctr);

Fetches the value associated with the counter $ctr.

$tokens = StepCounter($ctr);

Analog of \stepcounter, steps the counter and returns the expansion of

\the$ctr. Usually you should use RefStepCounter($ctr) instead.

$keys = RefStepCounter($ctr);

Analog of \refstepcounter, steps the counter and returns a hash containing the keys refnum=>$refnum, id=>$id. This makes it suitable for use in a properties option to constructors. The id is generated in parallel with the reference number to assist debugging.

$keys = RefStepID($ctr);

Like RefStepCounter, but only steps the "uncounter", and returns only the id. This is useful for unnumbered cases of objects that normally get both a refnum and id.

ResetCounter($ctr);

Resets the counter $ctr to zero.
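The counter functions above are typically combined with a constructor's properties option. The following is a sketch under assumptions: the counter remark, the prefix rem and both control sequences are hypothetical, and ltx:note is used only as a convenient element that carries an xml:id.

```perl
package LaTeXML::Package::Pool;
use strict;
use warnings;
use LaTeXML::Package;

# Hypothetical counter 'remark', reset whenever 'section' is stepped;
# IDs generated from it use the prefix 'rem'.
NewCounter('remark', 'section', idprefix => 'rem');

# Numbered form: RefStepCounter supplies refnum and id as properties,
# so #id is available in the constructor pattern.
DefConstructor('\remark{}',
  "<ltx:note role='remark' xml:id='#id'>#1</ltx:note>",
  properties => sub { RefStepCounter('remark'); });

# Unnumbered form: only the "uncounter" is stepped, yielding just an id.
DefConstructor('\remarknonum{}',
  "<ltx:note role='remark' xml:id='#id'>#1</ltx:note>",
  properties => sub { RefStepID('remark'); });

1;
```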

GenerateID($document,$node,$whatsit,$prefix);

Generates an ID for nodes during the construction phase, useful for cases where the counter-based scheme is inappropriate. The calling pattern makes it appropriate for use in Tag, as in

Tag('ltx:para', afterClose => sub { GenerateID(@_, 'p'); });

If $node doesn't already have an xml:id set, it computes an appropriate id by concatenating the xml:id of the closest ancestor with an id (if any), the prefix (if any) and a unique counter.

Document Model

Constructors define how TeX markup will generate XML fragments, but the Document Model is used to control exactly how those fragments are assembled.

Tag(tag,%properties);

Declares properties of elements with the name tag. Note that Tag can set or add properties to any element from any binding file, unlike the properties set on control sequences by DefPrimitive, DefConstructor, etc. And, since the properties are recorded in the current Model, they are not subject to TeX grouping; once set, they remain in effect until changed or the end of the document.

The tag can be specified in one of three forms:

prefix:name    matches a specific name in a specific namespace;
prefix:*       matches any tag in the specific namespace;
*              matches any tag in any namespace.

There are two kinds of properties:

Scalar properties

For scalar properties, only a single value is returned for a given element.

When the property is looked up, each of the above forms is considered (the

speciﬁc element name, the namespace, and all elements); the ﬁrst deﬁned

value is returned.

The recognized scalar properties are:

autoOpen=>boolean

Speciﬁes whether tag can be automatically opened if needed to insert

an element that can only be contained by tag. This property can help

match the more SGML-like LaTeX to XML.

autoClose=>boolean

Speciﬁes whether this tag can be automatically closed if needed to

close an ancestor node, or insert an element into an ancestor. This

property can help match the more SGML-like LaTeX to XML.

Code properties

These properties provide a bit of code to be run at the times of certain events associated with an element. All the code bits that match a given element will be run, and since they can be added by any binding file, and be specified in an arbitrary order, a little bit of extra control is desirable.

Firstly, any early codes are run (e.g. afterOpen:early), then any normal codes (without modifier) are run, and finally any late codes are run (e.g. afterOpen:late).

Within each of those groups, the codes assigned for an element's specific name are run first, then those assigned for its namespace and finally the generic one (*); that is, the most specific codes are run first.

When code properties are accumulated by Tag for normal or late events, the code is appended to the end of the current list (if there were any previous codes added); for early events, the code is prepended.

The recognized code properties are:

afterOpen=>code($document,$box)

Provides code to be run whenever a node with this tag is opened. It is called with the document being constructed, and the initiating digested object as arguments. It is called after the node has been created, and after any initial attributes due to the constructor (passed to openElement) are added.

afterOpen:early or afterOpen:late can be used in place of afterOpen; these will be run as a group before, or after (respectively) the unmodified blocks.

afterClose=>code($document,$box)

Provides code to be run whenever a node with this tag is closed. It is called with the document being constructed, and the initiating digested object as arguments.

afterClose:early or afterClose:late can be used in place of afterClose; these will be run as a group before, or after (respectively) the unmodified blocks.
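To illustrate how scalar and code properties combine, here is a small sketch. The messages printed to STDERR are purely illustrative, and the GenerateID call repeats the pattern shown earlier for Tag.

```perl
package LaTeXML::Package::Pool;
use strict;
use warnings;
use LaTeXML::Package;

# Scalar properties: let paragraphs open and close implicitly.
Tag('ltx:para', autoOpen => 1, autoClose => 1);

# Code properties accumulate. The registration order is deliberately
# scrambled to show that the modifier, not the call order, decides
# which group a block belongs to.
Tag('ltx:para', 'afterClose:late'  => sub { print STDERR "late block\n"; });
Tag('ltx:para', afterClose         => sub { GenerateID(@_, 'p'); });
Tag('ltx:para', 'afterClose:early' => sub { print STDERR "early block\n"; });

1;
```

Following the rules described above, when an ltx:para closes the early block should run first, then the unmodified GenerateID block, and the late block last.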

RelaxNGSchema(schemaname);

Specifies the schema to use for determining the document model. You can leave off the extension; it will look for schemaname.rng (and maybe eventually, .rnc if that is ever implemented).

RegisterNamespace(prefix,URL);

Declares the prefix to be associated with the given URL. These prefixes may be used in ltxml files, particularly for constructors, xpath expressions, etc. They are not necessarily the same as the prefixes that will be used in the generated document. Use the prefix #default for the default, non-prefixed, namespace. (See RegisterDocumentNamespace, as well as DocType or RelaxNGSchema).

RegisterDocumentNamespace(prefix,URL);

Declares the prefix to be associated with the given URL used within the generated XML. They are not necessarily the same as the prefixes used in code (RegisterNamespace). This function is rarely needed, as the namespace declarations are generally obtained from the DTD or Schema themselves. Use the prefix #default for the default, non-prefixed, namespace. (See DocType or RelaxNGSchema).

DocType(rootelement,publicid,systemid,%namespaces);

Declares the expected rootelement, and the public and system IDs of the document type to be used in the final document. The hash %namespaces specifies the namespace prefixes that are expected to be found in the DTD, along with each associated namespace URI. Use the prefix #default for the default namespace (i.e. the namespace of non-prefixed elements in the DTD).

The prefixes defined for the DTD may be different from the prefixes used in implementation code (e.g. in ltxml files; see RegisterNamespace). The generated document will use the namespaces and prefixes defined for the DTD.

Document Rewriting

During document construction, as each node gets closed, the text content gets simplified. We'll call it applying ligatures, for lack of a better name.

DefLigature(regexp,%options);

Apply the regular expression (given as a string: "/fa/fa/" since it will be converted internally to a true regexp), to the text content. The only option is fontTest=>code($font); if given, then the substitution is applied only when fontTest returns true.

Predefined ligatures combine sequences of "." or single-quotes into appropriate Unicode characters.

DefMathLigature($string=>$replacement,%options);

A Math Ligature typically combines a sequence of math tokens (XMTok) into a

single one. A simple example is

DefMathLigature(":=" => ":=", role => 'RELOP', meaning => 'assign');

replaces the two tokens for colon and equals by a token representing assignment. The options are those characterising an XMTok, namely: role, meaning and name.

For more complex cases (recognizing numbers, for example), you may supply a function matcher=>CODE($document,$node), which is passed the current document and the last math node in the sequence. It should examine $node and any preceding nodes (using previousSibling) and return a list of ($n,$string,%attributes) to replace the $n nodes by a new one with text content $string and the given attributes. If no replacement is called for, CODE should return undef.

After document construction, various rewriting and augmenting of the document

can take place.

DefRewrite(%specification);

DefMathRewrite(%specification);

These two declarations define document rewrite rules that are applied to the document tree after it has been constructed, but before math parsing, or any other postprocessing, is done. The %specification consists of a sequence of key/value pairs with the initial specs successively narrowing the selection of document nodes, and the remaining specs indicating how to modify or replace the selected nodes.

The following select portions of the document:

label=>label

Selects the part of the document with label=$label.

scope=>scope

The scope could be "label:foo" or "section:1.2.3" or something similar. These select a subtree labelled 'foo', or a section with reference number "1.2.3".

xpath=>xpath

Select those nodes matching an explicit xpath expression.

match=>tex

Selects nodes that look like what the processing of tex would produce.

regexp=>regexp

Selects text nodes that match the regular expression.

The following act upon the selected node:

attributes=>hashref

Adds the attributes given in the hash reference to the node.

replace=>replacement

Interprets replacement as TeX code to generate nodes that will replace the

selected nodes.
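A rewrite rule therefore pairs one or more selecting keys with an acting key. The sketch below uses only keys listed above; the label flow, the choice of the token f and the class name reviewed are assumptions made for illustration, and the exact matching behaviour should be checked against the installed version.

```perl
package LaTeXML::Package::Pool;
use strict;
use warnings;
use LaTeXML::Package;

# Within the part of the document labelled 'flow' (a made-up label),
# mark math that looks like the result of processing 'f' as a function.
DefMathRewrite(scope => 'label:flow', match => 'f',
  attributes => { role => 'FUNCTION' });

# Select nodes by an explicit XPath expression and add an attribute;
# the class name is hypothetical.
DefRewrite(xpath => 'descendant-or-self::ltx:note',
  attributes => { class => 'reviewed' });

1;
```

Listing scope before match mirrors the description above: earlier keys narrow the selection, and the final key states what to do with it.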

Mid-Level support

$tokens = Expand($tokens);

Expands the given $tokens according to current deﬁnitions.

$boxes = Digest($tokens);

Processes and digests the $tokens. Any arguments needed by control sequences in $tokens must be contained within the $tokens itself.

@tokens = Invocation($cs,@args);

Constructs a sequence of tokens that would invoke the token $cs on the arguments.

RawTeX(’... tex code ...’);

RawTeX is a convenience function for including chunks of raw TeX (or LaTeX)

code in a Package implementation. It is useful for copying portions of the normal

implementation that can be handled simply using macros and primitives.

Let($token1,$token2);

Gives $token1 the same 'meaning' (definition) as $token2; like TeX's \let.

StartSemiVerbatim(); ... ; EndSemiVerbatim();

Disables most TeX catcodes.

$tokens = Tokenize($string);

Tokenizes the $string using the standard catcodes, returning a LaTeXML::Core::Tokens.

$tokens = TokenizeInternal($string);

Tokenizes the $string according to the internal cattable (where @ is a letter),

returning a LaTeXML::Core::Tokens.
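A few of these mid-level functions in combination, as a sketch: the macros \half and \onehalf are invented for the example and are not part of any real package.

```perl
package LaTeXML::Package::Pool;
use strict;
use warnings;
use LaTeXML::Package;

# Copy a simple macro definition as raw TeX, as one might when porting
# part of a (hypothetical) style file.
RawTeX('\def\half{\frac{1}{2}}');

# Make \onehalf an alias with the same meaning, as \let would.
Let('\onehalf', '\half');

# Tokenize a string with the standard catcodes, then expand it
# according to the current definitions.
my $expanded = Expand(Tokenize('\onehalf'));

# With TokenizeInternal, @ counts as a letter, so \my@name is read as
# a single control sequence here.
my $internal = TokenizeInternal('\my@name');

1;
```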

Argument Readers

ReadParameters($gullet,$spec);

Reads from $gullet the tokens corresponding to $spec (a Parameters object).

DefParameterType(type,code($gullet,@values), %options);

Defines a new Parameter type, type, with code for its reader.

Options are:

reversion=>code($arg,@values);

This code is responsible for converting a previously parsed argument back into a sequence of Tokens.

optional=>boolean

whether it is an error if no matching input is found.

novalue=>boolean

whether the value returned should contribute to argument lists, or simply

be passed over.

semiverbatim=>boolean

whether the catcode table should be modified before reading tokens.

DefColumnType(proto,expansion);

Defines a new column type for tabular and arrays. proto is the prototype for the pattern, analogous to the pattern used for other definitions, except that the macro being defined is a single character. The expansion is a string specifying what it should expand into, typically a more verbose column specification.

Access to State

$value = LookupValue($name);

Look up the current value associated with the string $name.

AssignValue($name,$value,$scope);

Assign $value to be associated with the string $name, according to the given scoping rule.

Values are also used to specify most configuration parameters (which can therefore also be scoped). The recognized configuration parameters are:

VERBOSITY         : the level of verbosity for debugging output, with 0 being default.
STRICT            : whether errors (e.g. undefined macros) are fatal.
INCLUDE_COMMENTS  : whether to preserve comments in the source, and to add occasional line number comments. (Default true).
PRESERVE_NEWLINES : whether newlines in the source should be preserved (not 100% TeX-like). By default this is true.
SEARCHPATHS       : a list of directories to search for sources, implementations, etc.

PushValue($name,@values);

This function, along with the next three, is like AssignValue, but maintains a global list of values. PushValue pushes the provided values onto the end of a list. The data stored for $name is global and must be a LIST reference; it is created if needed.

UnshiftValue($name,@values);

Similar to PushValue, but pushes a value onto the front of the list. The data

stored for $name is global and must be a LIST reference; it is created if needed.

PopValue($name);

Removes and returns the value on the end of the list named by $name. The data

stored for $name is global and must be a LIST reference. Returns undef if

there is no data in the list.

ShiftValue($name);

Removes and returns the ﬁrst value in the list named by $name. The data stored

for $name is global and must be a LIST reference. Returns undef if there is

no data in the list.
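The four list functions behave like Perl's own push, unshift, pop and shift applied to a named global list. A brief sketch, in which the state name my_queue is made up:

```perl
package LaTeXML::Package::Pool;
use strict;
use warnings;
use LaTeXML::Package;

# The list named 'my_queue' is created on first use.
PushValue('my_queue', 'b', 'c');       # list is now (b, c)
UnshiftValue('my_queue', 'a');         # list is now (a, b, c)

my $last  = PopValue('my_queue');      # 'c'; list is now (a, b)
my $first = ShiftValue('my_queue');    # 'a'; list is now (b)

# The stored data is an ordinary array reference.
my $remaining = LookupValue('my_queue');

1;
```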

LookupMapping($name,$key);

This function maintains a hash association named by $name. It returns the value

associated with $key within that mapping. The data stored for $name is global

and must be a HASH reference. Returns undef if there is no data associated

with $key in the mapping, or the mapping is not (yet) deﬁned.

AssignMapping($name,$key,$value);

This function associates $value with $key within the mapping named by

$name. The data stored for $name is global and must be a HASH reference; it

is created if needed.

$value = LookupCatcode($char);

Look up the current catcode associated with the character $char.

AssignCatcode($char,$catcode,$scope);

Set $char to have the given $catcode, with the assignment made according

to the given scoping rule.

This method is also used to specify whether a given character is active in math

mode, by using math:$char for the character, and using a value of 1 to specify

that it is active.

$meaning = LookupMeaning($token);

Looks up the current meaning of the given $token which may be a Definition, another token, or the token itself if it has not otherwise been defined.

$defn = LookupDefinition($token);

Looks up the current definition, if any, of the $token.

InstallDefinition($defn);

Install the Definition $defn into $STATE under its control sequence.

XEquals($token1,$token2)

Tests whether the two tokens are equal in the sense that they are either equal

tokens, or if deﬁned, have the same deﬁnition.

Fonts

MergeFont(%fontspec);

Set the current font by merging the font style attributes with the current font. The %fontspec specifies the properties of the desired font. Likely values include (the values aren't required to be in this set):

family : serif, sansserif, typewriter, caligraphic, fraktur, script
series : medium, bold
shape  : upright, italic, slanted, smallcaps
size   : tiny, footnote, small, normal, large, Large, LARGE, huge, Huge
color  : any named color, default is black

Some families will only be used in math. This function returns nothing so it can be easily used in beforeDigest, afterDigest.
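For instance, MergeFont can be called from a beforeDigest hook so that the font change applies while the argument is digested. This is a sketch; the control sequence \keyword and the class name are hypothetical.

```perl
package LaTeXML::Package::Pool;
use strict;
use warnings;
use LaTeXML::Package;

# Hypothetical \keyword{...}: bold typewriter text. bounded => 1
# wraps the invocation in a group, so the font change stays local.
DefConstructor('\keyword{}',
  "<ltx:text class='keyword'>#1</ltx:text>",
  bounded      => 1,
  beforeDigest => sub { MergeFont(family => 'typewriter', series => 'bold'); });

1;
```

Because only family and series are given, the remaining properties (shape, size, color) are inherited from the font in effect at the point of use.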

DeclareFontMap($name,$map,%options);

Declares a font map for the encoding $name. The map $map is an array of 128 or 256 entries; each element is either a unicode string for the representation of that codepoint, or undef if that codepoint is not supported by this encoding. The only option currently is family, used because some fonts (notably cmr!) have different glyphs in some font families, such as family=>'typewriter'.

FontDecode($code,$encoding,$implicit);

Returns the unicode string representing the given codepoint $code (an integer) in the given font encoding $encoding. If $encoding is undefined, the usual case, the current font encoding and font family is used for the lookup. Explicit decoding is used when \char or similar are invoked ($implicit is false), and the codepoint must be represented in the fontmap, otherwise undef is returned. Implicit decoding (i.e. $implicit is true) occurs within the Stomach when a Token's content is being digested and converted to a Box; in that case only the lower 128 codepoints are converted; all codepoints above 128 are assumed to already be Unicode.

The font map for $encoding is automatically loaded if it has not already been

loaded.

FontDecodeString($string,$encoding,$implicit);

Returns the unicode string resulting from decoding the individual characters in

$string according to FontDecode, above.

LoadFontMap($encoding);

Finds and loads the font map for the encoding named $encoding, if it hasn't been loaded before. It looks for encoding.fontmap.ltxml, which would typically define the font map using DeclareFontMap, possibly including extra maps for families like typewriter.

Color

$color=LookupColor($name);

Lookup the color object associated with $name.

DefColor($name,$color,$scope);

Associates the $name with the given $color (a color object), with the given

scoping.

DefColorModel($model,$coremodel,$tocore,$fromcore);

Deﬁnes a color model $model that is derived from the core color model

$coremodel. The two functions $tocore and $fromcore convert a color

object in that model to the core model, or from the core model to the derived

model. Core models are rgb, cmy, cmyk, hsb and gray.

Low-level Functions

CleanID($id);

Cleans an $id of disallowed characters, trimming space.

CleanLabel($label,$prefix);

Cleans a $label of disallowed characters, trimming space. The preﬁx

$prefix is prepended (or LABEL, if none given).

CleanIndexKey($key);

Cleans an index key, so it can be used as an ID.

CleanBibKey($key);

Cleans a bibliographic citation key, so it can be used as an ID.

CleanURL($url);

Cleans a url.

UTF($code);

Generates a UTF character, handy for the 8 bit characters. For example, UTF(0xA0) generates the non-breaking space.

@tokens = roman($number);

Formats the $number in (lowercase) roman numerals, returning a list of the

tokens.

@tokens = Roman($number);

Formats the $number in (uppercase) roman numerals, returning a list of the

tokens.

See also

LaTeXML::Global, LaTeXML::Common::Object, LaTeXML::Common::Error, LaTeXML::Core::Token, LaTeXML::Core::Tokens, LaTeXML::Core::Box, LaTeXML::Core::List, LaTeXML::Common::Number, LaTeXML::Common::Float, LaTeXML::Common::Dimension, LaTeXML::Common::Glue, LaTeXML::Core::MuDimension, LaTeXML::Core::MuGlue, LaTeXML::Core::Pair, LaTeXML::Core::PairList, LaTeXML::Common::Color, LaTeXML::Core::Alignment, LaTeXML::Common::XML, LaTeXML::Util::Radix.

LaTeXML::MathParser

Parses mathematics content

Description

LaTeXML::MathParser parses the mathematical content of a document. It uses

Parse::RecDescent and a grammar MathGrammar.

Math Representation

Needs description.

Possible Customizations

Needs description.

Convenience functions

The following functions are exported for convenience in writing the grammar productions.

$node = New($name,$content,%attributes);

Creates a new XMTok node with given $name (a string or undef), and

$content (a string or undef) (but at least one of name or content should

be provided), and attributes.

$node = Arg($node,$n);

Returns the $n-th argument of an XMApp node; 0 is the operator node.

Annotate($node,%attributes);

Add attributes to $node.

$node = Apply($op,@args);

Create a new XMApp node representing the application of the node $op to the

nodes @args.

$node = ApplyDelimited($op,@stuff);

Create a new XMApp node representing the application of the node $op to the

arguments found in @stuff. @stuff are delimited arguments in the sense that

the leading and trailing nodes should represent open and close delimiters and the

arguments are separated by punctuation nodes.

$node = InterpretDelimited($op,@stuff);

Similar to ApplyDelimited, this interprets a sequence of delimited, punctuated items as being the application of $op to those items.

$node = recApply(@ops,$arg);

Given a sequence of operators and an argument, forms the nested application op(op(...(arg))).

$node = InvisibleTimes;

Creates an invisible times operator.

$boole = isMatchingClose($open,$close);

Checks whether $open and $close form a 'normal' pair of delimiters, or if either is ".".

$node = Fence(@stuff);

Given a delimited sequence of nodes, starting and ending with open/close de-

limiters, and with intermediate nodes separated by punctuation or such, attempt

to guess what type of thing is represented such as a set, absolute value, interval,

and so on.

This would be a good candidate for customization!

$node = NewFormulae(@stuff);

Given a set of formulas, construct a Formulae application if there is more than one, else just return the first.

$node = NewList(@stuff);

Given a set of expressions, construct a list application if there is more than one, else just return the first.

$node = LeftRec($arg1,@more);

Given an expr followed by repeated (op expr), compose the left recursive tree. For example a+b+c-d would give (- (+ a b c) d).
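The grouping rule in that example can be modelled in a few lines of plain Perl. This is only a sketch of the rule, not the module's implementation: it works on plain strings instead of XMTok and XMApp nodes, and the helper name left_rec is invented.

```perl
use strict;
use warnings;

# Model of left-recursive grouping: takes an operand followed by
# (operator, operand) pairs and returns a prefix-notation string.
sub left_rec {
    my ($arg1, @more) = @_;
    return $arg1 unless @more;
    my $op   = shift @more;
    my @args = ($arg1, shift @more);
    # A run of the same operator is flattened into one n-ary application.
    while (@more && $more[0] eq $op) {
        shift @more;
        push @args, shift @more;
    }
    # The application built so far becomes the left operand of the rest.
    return left_rec("($op @args)", @more);
}

print left_rec(qw(a + b + c - d)), "\n";    # prints "(- (+ a b c) d)"
```

The two plus signs collapse into a single three-argument application, which then becomes the first argument of the subtraction.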

MaybeFunction($token);

Note the possible use of $token as a function, which may cause incorrect pars-

ing. This is used to generate warning messages.

C.1 Common Modules Documentation

LaTeXML::Common::Config

Configuration logic for LaTeXML

SYNOPSIS

use LaTeXML::Common::Config;
my $config = LaTeXML::Common::Config->new(
    profile => 'name',
    timeout => 60,
    ... );
$config->read(\@ARGV);
$config->check;
my $value = $config->get($name);
$config->set($name,$value);
$config->delete($name);
my $bool = $config->exists($name);
my @keys = $config->keys;
my $options_hashref = $config->options;
my $config_clone = $config->clone;

Description

Configuration management class for LaTeXML options.

* Responsible for defining the options interface and parsing the usual Perl command-line options syntax.
* Provides the intuitive getters, setters, as well as hash methods for manipulating the option values.
* Also supports cloning into new configuration objects.

Methods

my $config = LaTeXML::Common::Config->new(%options);

Creates a new configuration object. Note that you should try not to provide your own %options hash but rather create an empty configuration and use $config->read to read in the options.

$config->read(\@ARGV);

This is the main method for parsing in LaTeXML options. The input array should

either be @ARGV, e.g. when the options were provided from the command line

using the classic Getopt::Long syntax, or any other array reference that conforms

to that setup.

$config->check;

Ensures that the conﬁguration obeys the given proﬁle and performs a set of as-

signments of meaningful defaults (when needed) and normalizations (for relative

paths, etc).

my $value = $config->get($name);

Classic getter for the $value of an option $name.

$config->set($name,$value);

Classic setter for the $value of an option $name.

$config->delete($name);

Deletes option $name from the conﬁguration.

my $bool = $config->exists($name);

Checks whether the key $name exists in the options hash of the configuration. Similarly to Perl's "exists" for hashes, it returns true even when the option's value is undefined.

my @keys = $config->keys;

Similar to "keys %hash" in Perl. Returns an array of all option names.

my $options_hashref = $config->options;

Returns the actual hash reference that holds all options within the configuration object.

my $config_clone = $config->clone;

Clones $config into a new LaTeXML::Common::Config object, $config_clone.
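Putting the methods together, a script might build a configuration from an explicit option list rather than from @ARGV. This is a sketch under assumptions: the option values are arbitrary, and it assumes that the keys used with get and set match the option names.

```perl
use strict;
use warnings;
use LaTeXML::Common::Config;

# Start from an empty configuration and let read() parse the options;
# any array reference in Getopt::Long style can stand in for \@ARGV.
my $config = LaTeXML::Common::Config->new();
$config->read(['--format=html5', '--nocomments', '--timeout=120']);
$config->check;    # assign defaults and normalize paths

# Assumed: the key is the option name without the leading dashes.
my $format = $config->get('format');
print "format: ", (defined $format ? $format : '(unset)'), "\n";

# A clone can be modified without touching the original.
my $copy = $config->clone;
$copy->set('timeout', 30);
print "timeout in original: ", $config->get('timeout'), "\n"
    if $config->exists('timeout');
```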

OPTION SYNOPSIS

latexmlc [options]

Options:

--VERSION              show version number.
--help                 shows this help message.
--destination=file     specifies destination file.
--output=file          [obsolete synonym for --destination]
--preload=module       requests loading of an optional module; can be repeated
--preamble=file        loads a tex file containing document frontmatter. MUST include \begin{document} or equivalent
--postamble=file       loads a tex file containing document backmatter. MUST include \end{document} or equivalent
--includestyles        allows latexml to load raw *.sty file; by default it avoids this.
--base=dir             sets the current working directory
--path=dir             adds dir to the paths searched for files, modules, etc;
--log=file             specifies log file (default: STDERR)
--autoflush=count      Automatically restart the daemon after "count" inputs. Good practice for vast batch jobs. (default: 100)
--timeout=secs         Timecap for conversions (default 600)
--expire=secs          Timecap for server inactivity (default 600)
--address=URL          Specify server address (default: localhost)
--port=number          Specify server port (default: 3354)
--documentid=id        assign an id to the document root.
--quiet                suppress messages (can repeat)
--verbose              more informative output (can repeat)
--strict               makes latexml less forgiving of errors
--bibtex               processes a BibTeX bibliography.
--xml                  requests xml output (default).
--tex                  requests TeX output after expansion.
--box                  requests box output after expansion and digestion.
--format=name          requests "name" as the output format. Supported: tex,box,xml,html4,html5,xhtml; html implies html5
--noparse              suppresses parsing math (default: off)
--parse=name           enables parsing math (default: on) and selects parser framework "name". Supported: RecDescent, no
--profile=name         specify profile as defined in LaTeXML::Common::Config. Supported: standard|math|fragment|... (default: standard)
--mode=name            Alias for profile
--cache_key=name       Provides a name for the current option set, to enable daemonized conversions without needing re-initializing
--whatsin=chunk        Defines the provided input chunk, choose from document (default), fragment and formula
--whatsout=chunk       Defines the expected output chunk, choose from document (default), fragment and formula
--post                 requests a followup post-processing
--nopost               forbids followup post-processing
--validate, --novalidate  Enables (the default) or disables validation of the source xml.
--omitdoctype          omits the Doctype declaration,
--noomitdoctype        disables the omission (the default)
--numbersections       enables (the default) the inclusion of section numbers in titles, crossrefs.
--nonumbersections     disables the above
--timestamp            provides a timestamp (typically a time and date) to be embedded in the comments
--embed                requests an embeddable XHTML snippet (requires: --post,--profile=fragment). DEPRECATED: Use --whatsout=fragment. TODO: Remove completely
--stylesheet           specifies a stylesheet, to be used by the post-processor.
--css=cssfile          adds a css stylesheet to html/xhtml (can be repeated)
--nodefaultresources   disables processing built-in resources
--javascript=jsfile    adds a link to a javascript file into html/html5/xhtml (can be repeated)
--icon=iconfile        specify a file to use as a "favicon"
--xsltparameter=name:value  passes parameters to the XSLT.
--split                requests splitting each document
--nosplit              disables the above (default)
--splitat              sets level to split the document
--splitpath=xpath      sets xpath expression to use for splitting (default splits at sections, if splitting is enabled)
--splitnaming=(id|idrelative|label|labelrelative)  specifies how to name split files (idrelative).
--scan                 scans documents to extract ids, labels, section titles, etc. (default)
--noscan               disables the above
--crossref             fills in crossreferences (default)
--nocrossref           disables the above
--urlstyle=(server|negotiated|file)  format to use for urls (default server).
--navigationtoc=(context|none) generates a table of contents

in navigation bar

--index requests creating an index (default)

--noindex disables the above

--splitindex Splits index into pages per initial.

--nosplitindex disables the above (default)

--permutedindex permutes index phrases in the index

--nopermutedindex disables the above (default)

--bibliography=file sets a bibliography file

--splitbibliography splits the bibliography into pages per

initial.

--nosplitbibliography disables the above (default)

--prescan carries out only the split (if

enabled) and scan, storing

cross-referencing data in dbfile

(default is complete processing)

--dbfile=dbfile sets file to store crossreferences

--sitedirectory=dir sets the base directory of the site

--sourcedirectory=dir sets the base directory of the

original TeX source

--source=input as an alternative to passing the input as

the last argument, after the option set

you can also specify it as the value here.

useful for predictable API calls

--mathimages converts math to images

(default for html4 format)

--nomathimages disables the above

--mathimagemagnification=mag specifies magnification factor

--presentationmathml converts math to Presentation MathML

(default for xhtml & html5 formats)

--pmml alias for --presentationmathml

--nopresentationmathml disables the above

--linelength=n formats presentation mathml to a

linelength max of n characters

--contentmathml converts math to Content MathML

--nocontentmathml disables the above (default)

--cmml alias for --contentmathml

--openmath converts math to OpenMath

--noopenmath disables the above (default)

--om alias for --openmath

--keepXMath preserves the intermediate XMath representation (default is to remove)

--mathtex adds TeX annotation to parallel markup

--nomathtex disables the above (default)

--mathlex (EXPERIMENTAL) adds linguistic lexeme annotation to parallel markup

--nomathlex (EXPERIMENTAL) disables the above (default)

--parallelmath use parallel math annotations (default)

--noparallelmath disable parallel math annotations

--plane1 use plane-1 unicode for symbols (default, if needed)

--noplane1 do not use plane-1 unicode

--graphicimages converts graphics to images (default)

--nographicimages disables the above

--graphicsmap=type.type specifies a graphics file mapping

--pictureimages converts picture environments to images (default)

--nopictureimages disables the above

--svg converts picture environments to SVG

--nosvg disables the above (default)

--nocomments omit comments from the output

--inputencoding=enc specify the input encoding.

--debug=package enables debugging output for the named package

If you want to provide a TeX snippet directly on input, rather than supply a filename, use the literal: protocol to prefix your snippet.
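
As a rough illustration of the literal: protocol, the following Python sketch assembles a latexmlc command line for a snippet. The helper name build_literal_command is hypothetical, and the sketch only constructs and prints the argument list; it does not run LaTeXML.

```python
import shlex

def build_literal_command(snippet, profile=None):
    """Return an argv list that passes a TeX snippet via the literal: protocol.

    Hypothetical helper for illustration; it only assembles arguments.
    """
    argv = ["latexmlc"]
    if profile is not None:
        argv.append(f"--profile={profile}")
    # Prefix the snippet so it is read as TeX source rather than as a filename.
    argv.append("literal:" + snippet)
    return argv

argv = build_literal_command(r"\frac{a}{b}", profile="math")
# shlex.join quotes the snippet so it can be pasted into a shell.
print(shlex.join(argv))
```

Passing the arguments as a list (for example to subprocess.run) avoids shell quoting problems with backslashes and braces in the snippet.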

Options & Arguments

General Options

--verbose

Increases the verbosity of output during processing; used twice, it is quite chatty. Can be useful for getting more details when errors occur.

--quiet

Reduces the verbosity of output during processing; used twice, it is nearly silent.

--VERSION

Shows the version number of the LaTeXML package.

--debug=package

Enables debugging output for the named package. The package is given without

the leading LaTeXML::.

--base=dir

Specifies the base working directory for the conversion server. Useful when converting sets of documents that use relative paths.

--log=file

Specifies the log file; by default any conversion messages are printed to STDERR.

--help

Shows this help message.

Source Options

--destination=ﬁle

Speciﬁes the destination ﬁle; by default the XML is written to STDOUT.

--preload=module

Requests the loading of an optional module or package. This may be useful if the TeX code does not specifically require the module (e.g. through \input or \usepackage). For example, use --preload=LaTeX.pool to force LaTeX mode.

--preamble=ﬁle

Requests the loading of a tex ﬁle with document frontmatter, to be read in before

the converted document, but after all --preload entries.

Note that the given file MUST contain \begin{document} or an equivalent environment start, when processing LaTeX documents.

If the ﬁle does not contain content to appear in the ﬁnal document, but only

macro deﬁnitions and setting of internal counters, it is more appropriate to use

--preload instead.

--postamble=ﬁle

Requests the loading of a tex ﬁle with document backmatter, to be read in after

the converted document.

Note that the given file MUST contain \end{document} or an equivalent environment end, when processing LaTeX documents.
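
The reading order implied by --preload, --preamble and --postamble can be sketched as follows. This is a minimal illustration of the descriptions above, not LaTeXML's implementation; the function names and the file names front.tex, body.tex and back.tex are hypothetical.

```python
def reading_order(document, preloads=(), preamble=None, postamble=None):
    """List the inputs in the order the option descriptions say they are read."""
    order = [f"preload:{name}" for name in preloads]  # all --preload entries first
    if preamble:
        order.append(f"preamble:{preamble}")          # frontmatter, before the document
    order.append(f"document:{document}")
    if postamble:
        order.append(f"postamble:{postamble}")        # backmatter, after the document
    return order

def build_command(document, preloads=(), preamble=None, postamble=None):
    """Assemble (but do not run) a matching latexmlc argument list."""
    argv = ["latexmlc"] + [f"--preload={name}" for name in preloads]
    if preamble:
        argv.append(f"--preamble={preamble}")
    if postamble:
        argv.append(f"--postamble={postamble}")
    argv.append(document)
    return argv

print(reading_order("body.tex", ["LaTeX.pool"], "front.tex", "back.tex"))
print(build_command("body.tex", ["LaTeX.pool"], "front.tex", "back.tex"))
```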

--sourcedirectory=source

Specifies the directory where the original latex source is located. Unless LaTeXML is run from that directory, or it can be determined from the xml filename, it may be necessary to specify this option in order to find graphics and style files.

--path=dir

Add dir to the search paths used when searching for ﬁles, modules, style ﬁles,

etc; somewhat like TEXINPUTS. This option can be repeated.

--validate,--novalidate

Enables (or disables) the validation of the source XML document (the default).

--bibtex

Forces latexml to treat the ﬁle as a BibTeX bibliography. Note that the timing

is slightly different than the usual case with BibTeX and LaTeX. In the latter

case, BibTeX simply selects and formats a subset of the bibliographic entries;

the actual TeX expansion is carried out when the result is included in a LaTeX

document. In contrast, latexml processes and expands the entire bibliography;

the selection of entries is done during post-processing. This also means that any

packages that deﬁne macros used in the bibliography must be speciﬁed using the

--preload option.
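
The two-stage bibliography workflow described above might be driven as in the following sketch, which assumes the bibliography is first converted with latexml --bibtex and then supplied to latexmlpost through --bibliography. The function name, the file names and the preloaded package name mybibmacros are hypothetical; the sketch only builds the argument lists.

```python
from pathlib import Path

def bibliography_commands(bibfile, document_xml, preload=()):
    """Return (convert, post) argument lists for a bibliography workflow."""
    bib_xml = bibfile + ".xml"
    # Stage 1: process and expand the entire bibliography.
    convert = ["latexml", "--bibtex"]
    # Packages defining macros used in the bibliography must be preloaded.
    convert += [f"--preload={name}" for name in preload]
    convert += [f"--destination={bib_xml}", bibfile]
    # Stage 2: entries are selected during post-processing of the document.
    html = str(Path(document_xml).with_suffix(".html"))
    post = ["latexmlpost", f"--bibliography={bib_xml}",
            f"--destination={html}", document_xml]
    return convert, post

convert, post = bibliography_commands("refs.bib", "paper.xml", ["mybibmacros"])
print(convert)
print(post)
```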

--inputencoding=encoding

Specify the input encoding, e.g. --inputencoding=iso-8859-1. The encoding must be one known to Perl’s Encode package. Note that this only enables the translation of the input bytes to UTF-8 used internally by LaTeXML, but does not affect catcodes. In such cases, you should be using the inputenc package. Note also that this does not affect the output encoding, which is always UTF-8.
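
To make the byte translation concrete, here is a small Python analogy. It uses Python's codecs rather than Perl's Encode package, so it only illustrates the idea of decoding input bytes with a declared encoding and holding them as UTF-8; the function name is hypothetical.

```python
def to_internal_utf8(raw_bytes, input_encoding):
    """Decode source bytes with the declared encoding and re-encode as UTF-8."""
    return raw_bytes.decode(input_encoding).encode("utf-8")

# "café" as it would be stored in an ISO-8859-1 (Latin-1) source file.
latin1_source = b"caf\xe9"

# Without the right input encoding the bytes are not valid UTF-8.
try:
    latin1_source.decode("utf-8")
    misread = False
except UnicodeDecodeError:
    misread = True

converted = to_internal_utf8(latin1_source, "iso-8859-1")
print(misread, converted)
```

Only the bytes are translated here; nothing in this step corresponds to TeX catcodes, which is why the inputenc package is still needed for those.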

TeX Conversion Options

--includestyles

This option allows processing of style files (files with extensions sty, cls, clo, cnf). By default, these files are ignored unless a latexml implementation of them is found (with an extension of ltxml).

These style files generally fall into two classes: those that merely affect document style are ignorable in the XML. Others define new markup and document structure, often using deeper LaTeX macros to achieve their ends. Although the omission will lead to other errors (missing macro definitions), it is unlikely that processing the TeX code in the style file will lead to a correct document.

--timeout=secs

Sets a time limit for conversion jobs, in seconds. Any job that fails to convert within the limit returns with a Fatal timeout error. The default value is 600; set to 0 to disable.

--nocomments

Normally latexml preserves comments from the source ﬁle, and adds a comment

every 25 lines as an aid in tracking the source. The option --nocomments discards

such comments.

--documentid=id

Assigns an ID to the root element of the XML document. This ID is generally inherited as the prefix of IDs on all other elements within the document. This is useful when constructing a site of multiple documents so that all nodes have unique IDs.
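
When building a site from several sources, one way to keep root IDs distinct is to derive each --documentid from the source file name. The sketch below assumes that convention (it is not prescribed by LaTeXML); the function name and file names are hypothetical, and only argument lists are produced.

```python
from pathlib import Path

def site_commands(sources):
    """Build one latexmlc argv per source, each with a distinct --documentid."""
    commands = []
    seen = set()
    for src in sources:
        doc_id = Path(src).stem  # assumption: the file stem serves as the ID
        if doc_id in seen:
            # Two documents sharing an ID would yield clashing node IDs.
            raise ValueError(f"duplicate document id: {doc_id}")
        seen.add(doc_id)
        commands.append(["latexmlc", f"--documentid={doc_id}",
                         f"--destination={doc_id}.xml", src])
    return commands

for argv in site_commands(["chapters/intro.tex", "chapters/methods.tex"]):
    print(argv)
```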

--strict

Speciﬁes a strict processing mode. By default, undeﬁned control sequences and

invalid document constructs (that violate the DTD) give warning messages, but

attempt to continue processing. Using --strict makes them generate fatal

errors.

--post

Requests post-processing; auto-enabled by any requested post-processor. Disabled by default. If post-processing is enabled, the graphics and cross-referencing processors are on by default.

Format Options

--format=(html|html5|html4|xhtml|xml)

Speciﬁes the output format for post processing. By default, it will be guessed

from the ﬁle extension of the destination (if given), with html implying html5,

xhtml implying xhtml and the default being xml, which you probably don’t

want.

The html5 format converts the material to html5 form with mathematics as MathML; html5 supports SVG. html4 format converts the material to the earlier html form, version 4, and the mathematics to png images. xhtml format converts to xhtml and uses presentation MathML (after attempting to parse the mathematics) for representing the math. html5 similarly converts math to presentation MathML. In these cases, any graphics will be converted to web-friendly formats and/or copied to the destination directory. If you simply specify html, it will treat that as html5.

For the default, xml, the output is left in LaTeXML’s internal xml, but the math

is parsed and converted to presentation MathML. For html, html5 and xhtml, a

default stylesheet is provided, but see the --stylesheet option.
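
The format-guessing rule described above can be summarized in a few lines of Python. This is a sketch of the documented behaviour only, not the code LaTeXML uses; the function name guess_format and its parameters are hypothetical.

```python
from pathlib import PurePath

def guess_format(destination=None, requested=None):
    """Mirror the documented rule for choosing the post-processing format."""
    if requested:
        # Plain "html" is treated as html5.
        return "html5" if requested == "html" else requested
    ext = PurePath(destination).suffix.lower() if destination else ""
    if ext == ".html":
        return "html5"
    if ext == ".xhtml":
        return "xhtml"
    # No destination, or any other extension: the default is xml.
    return "xml"

for dest in ("paper.html", "paper.xhtml", "paper.xml", None):
    print(dest, "->", guess_format(dest))
```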

--xml

Requests XML output; this is the default. DEPRECATED: use --format=xml

instead

--tex

Requests TeX output for debugging purposes; processing is only carried out

through expansion and digestion. This may not be quite valid TeX, since Uni-

code may be introduced.

--box

Requests Box output for debugging purposes; processing is carried out through expansion and digestion, and the result is printed.

--profile

Selects one of a variety of shorthand profiles. Note that each profile comes with a variety of preset options. You can examine any of them in their resources/Profiles/name.opt file.

Example: latexmlc --profile=math '1+2=3'

--omitdoctype,--noomitdoctype

Omits (or includes) the document type declaration. The default is to include it if

the document model was based on a DTD.

--numbersections,--nonumbersections

Includes (default), or disables the inclusion of section, equation, etc, numbers in

the formatted document and crossreference links.

--stylesheet=xslﬁle

Requests the XSL transformation of the document using the given xslﬁle as

stylesheet. If the stylesheet is omitted, a ‘standard’ one appropriate for the format

(html4, html5 or xhtml) will be used.

--css=cssﬁle

Adds cssﬁle as a css stylesheet to be used in the transformed html/html5/xhtml.

Multiple stylesheets can be used; they are included in the html in the order given,

following the default ltx-LaTeXML.css (unless --nodefaultcss). The

stylesheet is copied to the destination directory, unless it is an absolute url.

Some stylesheets included in the distribution are:

--css=navbar-left Puts a navigation bar on the left. (default omits navbar)

--css=navbar-right Puts a navigation bar on the right.

--css=theme-blue A blue coloring theme for headings.

--css=amsart A style suitable for journal articles.
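
The inclusion order for stylesheets can be pictured with the following sketch, which assumes only what is stated above: the default ltx-LaTeXML.css comes first (unless --nodefaultcss is given), followed by the requested stylesheets in the order given. The function names are hypothetical.

```python
def stylesheet_order(requested, default_css=True):
    """Return stylesheets in the order they would be included in the html."""
    order = ["ltx-LaTeXML.css"] if default_css else []
    order.extend(requested)  # requested stylesheets keep their given order
    return order

def css_options(requested):
    """Turn stylesheet names into repeated --css options."""
    return [f"--css={name}" for name in requested]

wanted = ["navbar-left", "theme-blue"]
print(css_options(wanted))
print(stylesheet_order(wanted))
```

Because later stylesheets can override earlier rules, the order of the --css options matters when two stylesheets touch the same elements.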

--javascript=jsﬁle

Includes a link to the javascript file jsfile, to be used in the transformed html/html5/xhtml. Multiple javascript files can be included; they are linked in the html in the order given. The javascript file is copied to the destination directory, unless it is an absolute url.

--icon=iconﬁle

Copies iconfile to the destination directory and sets up the linkage in the transformed html/html5/xhtml to use that as the "favicon".

--nodefaultresources

Disables the copying and inclusion of resources added by the binding files; this includes CSS, javascript or other files. This does not affect resources explicitly requested by the --css or --javascript options.

--timestamp=timestamp

Provides a timestamp (typically a time and date) to be embedded in the comments by the stock XSLT stylesheets. If you don’t supply a timestamp, the current time and date will be used. (You can use --timestamp=0 to omit the timestamp).

--xsltparameter=name:value

Passes parameters to the XSLT stylesheet. See the manual or the stylesheet itself

for available parameters.
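
Since each parameter is passed as name:value, a set of parameters can be expanded into repeated options as in this sketch. The helper name and the parameter names EXAMPLE_FLAG and EXAMPLE_TITLE are hypothetical placeholders, not parameters defined by the stock stylesheets.

```python
def xslt_parameter_options(params):
    """Format a mapping as repeated --xsltparameter=name:value options."""
    options = []
    for name, value in params.items():
        if ":" in name:
            # The first colon separates name from value, so names must not contain one.
            raise ValueError(f"parameter name may not contain ':': {name}")
        options.append(f"--xsltparameter={name}:{value}")
    return options

options = xslt_parameter_options({"EXAMPLE_FLAG": "true", "EXAMPLE_TITLE": "Notes"})
print(options)
```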

Site & Crossreferencing Options

--split,--nosplit

Enables or disables (default) the splitting of documents into multiple ‘pages’. If enabled, the document will be split into sections, bibliography, index and appendices (if any) by default, unless --splitpath is specified.

--splitat=unit

Specifies what level of the document to split at. Should be one of chapter, section (the default), subsection or subsubsection. For more control, see --splitpath.

--splitpath=xpath

Specifies an XPath expression to select nodes that will generate separate pages. The default splitpath is

//ltx:section | //ltx:bibliography | //ltx:appendix | //ltx:index

Specifying

--splitpath="//ltx:section | //ltx:subsection | //ltx:bibliography | //ltx:appendix | //ltx:index"

would split the document at subsections as well as sections.
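
To see which nodes such a splitpath selects, the following sketch applies an equivalent selection to a toy document using Python's standard xml.etree.ElementTree. Because ElementTree does not support XPath unions, the union is emulated by filtering on element names in document order. The toy document and the function name are hypothetical.

```python
import xml.etree.ElementTree as ET

LTX_NS = "http://dlmf.nist.gov/LaTeXML"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"
DEFAULT_SPLIT_TAGS = ("section", "bibliography", "appendix", "index")

def split_nodes(root, local_names=DEFAULT_SPLIT_TAGS):
    """Emulate //ltx:a | //ltx:b | ... by filtering all elements in document order."""
    wanted = {f"{{{LTX_NS}}}{name}" for name in local_names}
    return [el for el in root.iter() if el.tag in wanted]

# A toy document: two sections (one with a subsection) and a bibliography.
doc = ET.fromstring(
    f'<document xmlns="{LTX_NS}">'
    '<section xml:id="S1"><subsection xml:id="S1.SS1"/></section>'
    '<section xml:id="S2"/>'
    '<bibliography xml:id="bib"/>'
    '</document>'
)

default_pages = [el.get(XML_ID) for el in split_nodes(doc)]
finer_pages = [el.get(XML_ID)
               for el in split_nodes(doc, DEFAULT_SPLIT_TAGS + ("subsection",))]
print(default_pages)
print(finer_pages)
```

Adding subsection to the selected names yields one extra page per subsection, which corresponds to the extended --splitpath shown above.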

--splitnaming=(id|idrelative|label|labelrelative)

Specifies how to name the files for subdocuments created by splitting. The values id and label simply use the id or label of the subdocument’s root node for its filename. idrelative and labelrelative use the portion of the id or label that follows the parent document’s id or label. Furthermore, to impose structure and uniqueness, if a split document has children that are also split, that document (and its children) will be in a separate subdirectory with the name index.
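
The difference between id and idrelative naming can be illustrated as follows. This is a sketch under stated assumptions: ids are dot-separated (such as S1.SS2), the separator is dropped from the relative part, and file extensions and the subdirectory rule are ignored. The function name is hypothetical, and label-based naming would work analogously.

```python
def split_filename(node_id, parent_id=None, naming="idrelative"):
    """Derive a base file name for a split page from its id."""
    if naming == "id":
        return node_id
    if naming == "idrelative":
        if parent_id and node_id.startswith(parent_id):
            # Keep only the part following the parent's id (assumed '.' separator).
            rest = node_id[len(parent_id):].lstrip(".")
            return rest or node_id
        return node_id
    raise ValueError(f"unsupported naming scheme: {naming}")

print(split_filename("S1.SS2", "S1", naming="id"))
print(split_filename("S1.SS2", "S1", naming="idrelative"))
```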

--scan,--noscan

Enables (default) or disables the scanning of documents for ids, labels, references, indexmarks, etc, for use in filling in refs, cites, index and so on. It may be useful to disable when generating documents not based on the LaTeXML doctype.

--crossref,--nocrossref

Enables (default) or disables the filling in of references, hrefs, etc based on a previous scan (either from --scan, or --dbfile). It may be useful to disable when generating documents not based on the LaTeXML doctype.

--urlstyle=(server|negotiated|file)

This option determines the way that URLs within the documents are formatted,

depending on the way they are intended to be served. The default, server,
