PDFlib Text and Image Extraction Toolkit (TET) Manual
Text and Image
Extraction Toolkit (TET)
Version 5.1
Toolkit for extracting Text, Images,
and other items from PDF
Copyright © 2002–2017 PDFlib GmbH. All rights reserved.
Protected by European and U.S. patents.
PDFlib GmbH
Franziska-Bilek-Weg 9, 80339 München, Germany
www.pdflib.com
phone +49 • 89 • 452 33 84-0
fax +49 • 89 • 452 33 84-99
If you have questions check the PDFlib mailing list and archive at
groups.yahoo.com/neo/groups/pdflib/info
Licensing contact: sales@pdflib.com
Support for commercial PDFlib licensees: support@pdflib.com (please include your license number)
This publication and the information herein is furnished as is, is subject to change without notice, and
should not be construed as a commitment by PDFlib GmbH. PDFlib GmbH assumes no responsibility or liability for any errors or inaccuracies, makes no warranty of any kind (express, implied or statutory) with respect to this publication, and expressly disclaims any and all warranties of merchantability, fitness for particular purposes and noninfringement of third party rights.
Adobe, Acrobat, PostScript, and XMP are trademarks of Adobe Systems Inc. AIX, IBM, OS/390, WebSphere,
iSeries, and zSeries are trademarks of International Business Machines Corporation. ActiveX, Microsoft,
OpenType, and Windows are trademarks of Microsoft Corporation. Apple, Macintosh and TrueType are
trademarks of Apple Computer, Inc. Unicode and the Unicode logo are trademarks of Unicode, Inc. Unix is a
trademark of The Open Group. Java and Solaris are trademarks of Sun Microsystems, Inc. HKS is a registered trademark of the HKS brand association: Hostmann-Steinberg, K+E Printing Inks, Schmincke. Other
company product and service names may be trademarks or service marks of others.
TET contains modified parts of the following third-party software:
Zlib compression library, Copyright © 1995-2012 Jean-loup Gailly and Mark Adler
TIFFlib image library, Copyright © 1988-1997 Sam Leffler, Copyright © 1991-1997 Silicon Graphics, Inc.
Cryptographic software written by Eric Young, Copyright © 1995-1998 Eric Young (eay@cryptsoft.com)
Independent JPEG Group’s JPEG software, Copyright © 1991-1998, Thomas G. Lane
Cryptographic software, Copyright © 1998-2002 The OpenSSL Project (www.openssl.org)
Expat XML parser, Copyright © 1998, 1999, 2000 Thai Open Source Software Center Ltd
ICU International Components for Unicode, Copyright © 1995-2012 International Business Machines Corporation and others
OpenJPEG library, Copyright © 2002-2014, Université catholique de Louvain (UCL), Belgium
TET contains the RSA Security, Inc. MD5 message digest algorithm.
Contents
0 First Steps with TET 7
0.1 Installing the Software 7
0.2 Applying the TET License Key 8
1 Introduction 11
1.1 Overview of TET Features 11
1.2 Many ways to use TET 13
1.3 Roadmap to Documentation and Samples 14
1.4 What’s new in TET 5.0? 15
1.5 What’s new in TET 5.1? 16
2 TET Command-Line Tool 17
2.1 Command-Line Options 17
2.2 Constructing TET Command Lines 20
2.3 Command-Line Examples 21
2.3.1 Extracting Text 21
2.3.2 Extracting Images 21
2.3.3 Generating TETML 22
2.3.4 Advanced Options 22
3 TET Library Language Bindings 23
3.1 Exception Handling 23
3.2 C Binding 24
3.3 C++ Binding 26
3.4 COM Binding 28
3.5 Java Binding 29
3.6 .NET Binding 31
3.7 Objective-C Binding 32
3.8 Perl Binding 34
3.9 PHP Binding 35
3.10 Python Binding 37
3.11 REALbasic/Xojo Binding 38
3.12 Ruby Binding 39
3.13 RPG Binding 41
4 TET Connectors 43
4.1 Free TET Plugin for Adobe Acrobat 43
4.2 TET Connector for the Lucene Search Engine 44
4.3 TET Connector for the Solr Search Server 47
4.4 TET Connector for Oracle 48
4.5 TET PDF IFilter for Microsoft Products 51
4.6 TET Connector for the Apache TIKA Toolkit 53
4.7 TET Connector for MediaWiki 55
5 Configuration 57
5.1 Extracting Content from protected PDF 57
5.2 Resource Configuration and File Searching 59
5.3 Recommendations for common Scenarios 63
6 Text Extraction 67
6.1 PDF Document Domains 67
6.2 Page and Text Geometry 72
6.3 Text Color 78
6.4 Chinese, Japanese, and Korean Text 80
6.4.1 CJK Encodings and CMaps 80
6.4.2 Word Boundaries for CJK Text 80
6.4.3 Vertical Writing Mode 80
6.4.4 CJK Decompositions: Narrow, wide, vertical, etc. 81
6.5 Bidirectional Arabic and Hebrew Text 83
6.5.1 General Bidi Topics 83
6.5.2 Postprocessing Arabic Text 83
6.6 Content Analysis 85
6.7 Layout Analysis 89
6.8 Check whether an Area is empty 93
7 Advanced Unicode Handling 95
7.1 Important Unicode Concepts 95
7.2 Unicode Preprocessing (Filtering) 98
7.2.1 Filters for all Granularities 98
7.2.2 Filters for Granularity Word and above 99
7.3 Unicode Postprocessing 100
7.3.1 Unicode Folding 100
7.3.2 Unicode Decomposition 103
7.3.3 Unicode Normalization 107
7.4 Supplementary Characters and Surrogates 109
7.5 Unicode Mapping for Glyphs 110
8 Image Extraction 117
8.1 Image Extraction Basics 117
8.2 Extracting Images 120
8.2.1 Placed Images and Image Resources 120
8.2.2 Page-based and Resource-based Image Retrieval 121
8.2.3 Geometry of Placed Images 122
8.3 Merging Fragmented Images 125
8.4 Small Image Filtering 127
8.5 Image Colors and Masking 128
8.5.1 Color Spaces 128
8.5.2 Image Masks and Soft Masks 129
9 TET Markup Language (TETML) 131
9.1 Creating TETML 131
9.2 TETML Examples 133
9.3 Controlling TETML Details 137
9.4 TETML Elements and the TETML Schema 141
9.5 Transforming TETML with XSLT 149
9.6 XSLT Samples 153
10 TET Library API Reference 157
10.1 Option Lists 157
10.1.1 Option List Syntax 157
10.1.2 Basic Types 159
10.1.3 Geometric Types 162
10.1.4 Unicode Support in Language Bindings 163
10.1.5 Encoding Names 163
10.2 General Functions 165
10.2.1 Option Handling 165
10.2.2 Setup 167
10.2.3 PDFlib Virtual Filesystem (PVF) 168
10.2.4 Unicode Conversion Function 170
10.2.5 Exception Handling 172
10.2.6 Logging 173
10.3 Document Functions 175
10.4 Page Functions 184
10.5 Text and Glyph Details Retrieval Functions 194
10.6 Image Retrieval Functions 201
10.7 TET Markup Language (TETML) Functions 205
10.8 pCOS Functions 208
A TET Library Quick Reference 213
B Revision History 215
Index 217
0 First Steps with TET
0.1 Installing the Software
TET is delivered as an MSI or compressed package for Windows systems, and as a compressed archive for all other supported operating systems. All TET packages contain the
TET command-line tool and the TET library/component, plus support files, documentation, and examples. After installing or unpacking TET the following steps are recommended:
> Users of the TET command-line tool can use the executable right away. The available
options are discussed in Section 2.1, »Command-Line Options«, page 17, and are also
displayed when you execute the TET command-line tool without any options.
> Users of the TET library/component should read one of the sections in Chapter 3,
»TET Library Language Bindings«, page 23, corresponding to their preferred development environment, and review the installed examples.
If you obtained a commercial TET license you must enter your TET license key according
to Section 0.2, »Applying the TET License Key«, page 8.
CJK configuration. In order to extract Chinese, Japanese, or Korean (CJK) text which is
encoded with legacy encodings TET requires the corresponding CMap files for mapping
CJK encodings to Unicode. The CMap files are contained in all TET packages, and are installed in the resource/cmap directory within the TET installation directory.
On non-Windows systems you must manually configure the CMap files:
> For the TET command-line tool this can be achieved by supplying the name of the directory holding the CMap files with the --searchpath option.
> For the TET library/component you can set the searchpath at runtime:
tet.set_option("searchpath={/path/to/resource/cmap}");
As an alternative method for configuring access to the CJK CMap files you can set the
TETRESOURCEFILE environment variable to point to a UPR configuration file which contains a suitable searchpath definition.
Restrictions of the evaluation version. The TET command-line tool and library can be
used as fully functional evaluation versions even without a commercial license. Unlicensed versions support all features, but will only process PDF documents with up to 10
pages and 1 MB size. Evaluation versions of TET must not be used for production purposes, but only for evaluating the product. Using TET for production purposes requires
a valid TET license.
0.2 Applying the TET License Key
Using TET for production purposes requires a valid TET license key. Once you have purchased
a TET license you must apply your license key in order to allow processing of arbitrarily
large documents. There are several methods for applying the license key; choose one of
the methods detailed below.
Note TET license keys are platform-dependent, and can only be used on the platform for which they
have been purchased.
Windows installer. If you are working with the Windows installer you can enter the license key when you install the product. The installer will add the license key to the registry (see below).
Working with a license file. PDFlib products read license keys from a license file,
which is a text file according to the format shown below. You can use the template
licensekeys.txt which is contained in all TET distributions. Lines beginning with a ’#’
character contain comments and will be ignored; the second line contains version information for the license file itself:
# Licensing information for PDFlib GmbH products
PDFlib license file 1.0
TET 5.1 ...your license key...
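The file format shown above is simple enough to inspect with a few lines of code. The following Python sketch is not part of TET; it assumes that the first non-comment line is the version line and that each remaining line consists of a product name, a product version, and a key separated by whitespace. The function name parse_license_file and the placeholder key are purely illustrative.

```python
def parse_license_file(text):
    """Return (format_version, entries) for the text of a license file.

    entries is a list of (product, product_version, key) tuples.
    """
    format_version = None
    entries = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and '#' comment lines are skipped
        if format_version is None:
            # Assumption: the first non-comment line carries the file version
            prefix = "PDFlib license file"
            if not line.startswith(prefix):
                raise ValueError("missing license file version line")
            format_version = line[len(prefix):].strip()
            continue
        # Assumption: product, product version and key are whitespace-separated
        parts = line.split(None, 2)
        if len(parts) != 3:
            raise ValueError("unexpected license line: " + line)
        entries.append(tuple(parts))
    return format_version, entries


sample = """# Licensing information for PDFlib GmbH products
PDFlib license file 1.0
TET 5.1 XXXX-PLACEHOLDER-KEY
"""
print(parse_license_file(sample))
```

A check of this kind can help to spot a missing version line or a misplaced key before the file is deployed.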
The license file may contain license keys for multiple PDFlib GmbH products on separate lines. It may also contain license keys for multiple platforms so that the same license file can be shared among platforms. License files can be configured in the following ways:
> A file called licensekeys.txt will be searched in all default locations (see »Default file
search paths«, page 9).
> You can specify the licensefile option with the set_option( ) API function:
tet.set_option("licensefile={/path/to/licensekeys.txt}");
The licensefile option must be set immediately after instantiating the TET object, i.e.,
after calling TET_new( ) (in C) or creating a TET object.
> Supply the --tetopt option of the TET command-line tool and supply the licensefile
option with the name of a license file:
tet --tetopt "licensefile=/path/to/your/licensekeys.txt" ...
If the path name contains space characters you must enclose the path with braces:
tet --tetopt "licensefile={/path/to/your license file.txt}" ...
> You can set an environment (shell) variable which points to a license file. On Windows use the system control panel and choose System, Advanced, Environment
Variables; on Unix apply a command similar to the following:
export PDFLIBLICENSEFILE="/path/to/licensekeys.txt"
On i5/iSeries the license file can be specified as follows (this command can be specified in the startup program QSTRUP and will work for all PDFlib GmbH products):
ADDENVVAR ENVVAR(PDFLIBLICENSEFILE) VALUE(<... path ...>) LEVEL(*SYS)
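The brace rule and the environment variable described above can be combined in a small helper when the TET command-line tool is driven from a script. The following Python sketch only builds the argument list and the environment; the helper names (brace, licensefile_option, tet_command, env_with_license) are hypothetical, and the sketch assumes that the executable is called tet and is reachable via the program search path.

```python
import os


def brace(value):
    """Wrap an option value in braces so that embedded spaces survive."""
    return "{" + value + "}"


def licensefile_option(path):
    # Braces are always added here; the text above shows them as valid
    # for paths with and without space characters.
    return "licensefile=" + brace(path)


def tet_command(pdf_path, license_path):
    # Argument list for the command-line tool. No shell quoting is needed
    # because each list element is passed as one argument.
    return ["tet", "--tetopt", licensefile_option(license_path), pdf_path]


def env_with_license(license_path, base_env=None):
    # Copy the environment and point PDFLIBLICENSEFILE at the license file.
    env = dict(os.environ if base_env is None else base_env)
    env["PDFLIBLICENSEFILE"] = license_path
    return env


print(tet_command("input.pdf", "/path/to/your license file.txt"))
```

The list returned by tet_command( ) and the dictionary returned by env_with_license( ) can be handed to subprocess.run( ) as its argument list and env parameter, respectively. Only one of the two methods is required. Paths which themselves contain brace characters are not handled by this sketch.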
License keys in the registry. On Windows you can also enter the name of the license
file in the following registry key:
HKLM\SOFTWARE\PDFlib\PDFLIBLICENSEFILE
As another alternative you can enter the license key directly in one of the following registry keys:
HKLM\SOFTWARE\PDFlib\TET5\license
HKLM\SOFTWARE\PDFlib\TET5\5.1\license
The MSI installer will write the license key provided at install time in the last of these
entries.
Note Be careful when manually accessing the registry on 64-bit Windows systems: as usual, 64-bit
binaries work with the 64-bit view of the Windows registry, while 32-bit binaries running on a
64-bit system work with the 32-bit view of the registry. If you must add registry keys for a 32-bit
product manually, make sure to use the 32-bit version of the regedit tool. It can be invoked as
follows from the Start, Run... dialog:
%systemroot%\syswow64\regedit
Default file search paths. On Unix, Linux, OS X/macOS and i5/iSeries systems some directories will be searched for files by default even without specifying any path and directory names. Before searching and reading the UPR file (which may contain additional
search paths), the following directories will be searched:
<rootpath>/PDFlib/TET/5.1/resource/cmap
<rootpath>/PDFlib/TET/5.1/resource/codelist
<rootpath>/PDFlib/TET/5.1/resource/glyphlst
<rootpath>/PDFlib/TET/5.1
<rootpath>/PDFlib/TET
<rootpath>/PDFlib
On Unix, Linux, and OS X/macOS <rootpath> will first be replaced with /usr/local and
then with the HOME directory. On i5/iSeries <rootpath> is empty.
Default file names for license and resource files. By default, the following file names
will be searched for in the default search path directories:
licensekeys.txt (license file)
tet.upr (resource file)
This feature can be used to work with a license file without setting any environment
variable or runtime option.
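To check where a license or resource file would have to be placed on Unix-like systems, the default locations can be enumerated. The following Python sketch is an illustration and not the lookup code used by TET. It forms each directory by prepending a root prefix (/usr/local first, then the HOME directory) to the listed paths, and it assumes that all directories below /usr/local are tried before those below HOME; the function names are hypothetical.

```python
import os

# Directory suffixes from the list of default search paths, in the listed order
SUBDIRS = [
    "PDFlib/TET/5.1/resource/cmap",
    "PDFlib/TET/5.1/resource/codelist",
    "PDFlib/TET/5.1/resource/glyphlst",
    "PDFlib/TET/5.1",
    "PDFlib/TET",
    "PDFlib",
]


def default_search_dirs(home):
    """Candidate directories: /usr/local first, then the HOME directory."""
    roots = ["/usr/local", home]
    return [root.rstrip("/") + "/" + sub for root in roots for sub in SUBDIRS]


def find_default_file(name, home, exists=os.path.isfile):
    """Return the first candidate path for which exists() is true, or None."""
    for directory in default_search_dirs(home):
        candidate = directory + "/" + name
        if exists(candidate):
            return candidate
    return None


for directory in default_search_dirs(os.path.expanduser("~")):
    print(directory)
```

Calling find_default_file("licensekeys.txt", home) or find_default_file("tet.upr", home) reports the first matching location under these assumptions.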
Setting the license key in an option for the TET command-line tool. If you use the TET
command-line tool you can supply an option which contains the name of a license file
or the license key itself:
tet --tetopt "license ...your license key..." ...more options...
Setting the license key with a TET API call. If you use the TET API you can add an API
call to your script or program which sets the license key at runtime:
> In COM/VBScript:
oTET.set_option "license=...your license key..."
> In C:
TET_set_option(tet, "license=...your license key...");
> In C++, .NET/C#, Java, Python, and Ruby:
tet.set_option("license=...your license key...");
> In Perl and PHP:
$tet->set_option("license=...your license key...");
> In RPG:
d licensekey      s             20
d licenseval      s             50
c                   eval      licenseopt='license=... your license key ...'+x'00'
c                   callp     TET_set_option(TET:licenseopt:0)
The license option must be set immediately after instantiating the TET object, i.e., after
calling TET_new( ) (in C) or creating a TET object.
Licensing options. Different licensing options are available for TET use on one or more
computers, and for redistributing TET with your own products. We also offer support
and source code contracts. Licensing details and the purchase order form can be found
in the TET distribution. Please contact us if you are interested in obtaining a commercial license, or have any questions:
PDFlib GmbH, Licensing Department
Franziska-Bilek-Weg 9, 80339 München, Germany
www.pdflib.com
phone +49 • 89 • 452 33 84-0
fax +49 • 89 • 452 33 84-99
Licensing contact: sales@pdflib.com
Support for PDFlib licensees: support@pdflib.com
1 Introduction
The PDFlib Text and Image Extraction Toolkit (TET) is targeted at extracting text and images from PDF documents, but can also be used to retrieve other information from PDF.
TET can be used as a base component for realizing the following tasks:
> search the text contents of PDF
> create a list of all words contained in a PDF (concordance)
> implement a search engine for processing large numbers of PDF files
> extract text from PDF to store, translate, or otherwise repurpose it
> convert the text contents of PDF to other formats
> process or enhance PDFs based on their contents
> compare the text contents of multiple PDF documents
> extract the raster images from PDF
> extract metadata and other information from PDF
TET has been designed for stand-alone use, and does not require any third-party software. It is robust and suitable for multi-threaded server use.
1.1 Overview of TET Features
Supported PDF input. TET has been tested against millions of PDF test files from various sources. It accepts PDF 1.0 up to PDF 1.7 extension level 8 and PDF 2.0, corresponding
to Acrobat 1 through Acrobat DC, including encrypted documents. TET attempts to repair various kinds of
malformed and damaged PDF documents.
Note TET does not support XFA forms. XFA is a separate format which is not part of the PDF standard
ISO 32000. Since XFA is packaged inside a small PDF wrapper XFA forms are often confused
with PDF documents although XFA is actually a completely different file format which requires
dedicated software.
Unicode support. TET includes a considerable number of algorithms and data to
achieve reliable Unicode mappings for all text. Since text in PDF documents is not usually encoded in Unicode, TET normalizes the text from a PDF document to Unicode:
> TET converts all text contents to Unicode. In C the text is returned in UTF-8 or UTF-16
format; in other language bindings as native Unicode strings.
> Ligatures and other multi-character glyphs are decomposed into a sequence of their
constituent Unicode characters.
> Vendor-specific Unicode values (Corporate Use Subarea, CUS) are identified and
mapped to characters with precisely defined meanings if possible.
> Glyphs which are lacking Unicode mapping information are identified and mapped
to a configurable replacement character.
> UTF-16 surrogate pairs for characters outside the Basic Multilingual Plane (BMP) are
interpreted and maintained. Surrogate pairs and UTF-32 values can be retrieved in all
language bindings.
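The relationship between a character outside the BMP, its UTF-32 value, and its UTF-16 surrogate pair follows from the Unicode standard and can be illustrated independently of TET. The following Python sketch uses hypothetical helper names and U+1D11E (MUSICAL SYMBOL G CLEF) as an example character.

```python
def to_surrogates(code_point):
    """Split a supplementary code point into (high, low) UTF-16 surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary character")
    offset = code_point - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


def from_surrogates(high, low):
    """Combine a surrogate pair into the UTF-32 value of the character."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


high, low = to_surrogates(0x1D11E)
print(hex(high), hex(low))               # 0xd834 0xdd1e
print(hex(from_surrogates(high, low)))   # 0x1d11e
```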
Some PDF documents do not contain enough information for reliable Unicode mapping. In order to successfully extract the text nevertheless TET offers various configuration options which can be used to supply auxiliary information for proper Unicode
mappings. In order to facilitate writing the required mapping tables we make available
PDFlib FontReporter, a free plugin for Adobe Acrobat. This plugin can be used for analyzing fonts, encodings, and glyphs in PDF.
CJK support. TET includes full support for extracting Chinese, Japanese, and Korean
text:
> All predefined CJK CMaps (encodings) are recognized; CJK text is converted to Unicode. The CMap files for CJK encoding conversion are included in the TET distribution.
> Special character forms (e.g. wide, narrow, prerotated glyphs for vertical text) can optionally be converted (folded) to the corresponding regular forms.
> Horizontal and vertical writing modes are supported.
> CJK font names are normalized to Unicode.
Support for Bidirectional Hebrew and Arabic Text. TET includes the following features
for dealing with Bidi text:
> Re-order right-to-left and Bidi text to logical ordering
> Determine dominant text direction of the page
> Normalize Arabic presentation forms and decompose ligatures
> Remove Arabic Tatweel character used for stretching words
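The effect of the last two items can be illustrated with Python's standard unicodedata module. This is a rough stand-in and not TET's implementation: NFKC normalization is used here as an approximation for normalizing presentation forms, although it also changes characters outside the Arabic script, and re-ordering to logical order is not shown. The function name postprocess_arabic is hypothetical.

```python
import unicodedata

TATWEEL = "\u0640"  # ARABIC TATWEEL, used for stretching words


def postprocess_arabic(text):
    # NFKC maps Arabic presentation forms and ligatures to their base letters
    # (approximation; it also affects non-Arabic compatibility characters).
    text = unicodedata.normalize("NFKC", text)
    # Remove the stretching character afterwards, since some presentation
    # forms decompose into sequences which contain a Tatweel.
    return text.replace(TATWEEL, "")


# U+FEFB is the isolated form of the Lam-Alef ligature
result = postprocess_arabic("\uFEFB")
print([hex(ord(ch)) for ch in result])   # ['0x644', '0x627']
```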
Unicode postprocessing. TET’s Unicode postprocessing features include the following:
> Folding: preserve, replace, or remove one or more characters; affected characters can
conveniently be specified as Unicode sets;
> Decomposition: optionally apply canonical or compatibility decompositions as defined in the Unicode standard. This may make the text better usable in some environments. For example, you can keep or split accented characters, fractions, or symbols like the trademark symbol.
> Normalization: convert the output to Unicode normalization formats NFC, NFD,
NFKC, or NFKD as defined in the Unicode standard. This way TET can produce the exact format required as input in some environments, e.g. databases or search engines.
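The four normalization forms are defined by the Unicode standard, so their effect on accented characters, fractions, ligatures, and symbols can be tried out without TET. The following Python sketch uses the standard unicodedata module; it shows the forms themselves, not how TET is configured to produce them, and the function name all_forms is hypothetical.

```python
import unicodedata

FORMS = ("NFC", "NFD", "NFKC", "NFKD")


def all_forms(text):
    """Return a dictionary which maps each normalization form to its result."""
    return {form: unicodedata.normalize(form, text) for form in FORMS}


samples = {
    "e with acute": "\u00E9",
    "fi ligature": "\uFB01",
    "one half": "\u00BD",
    "trademark": "\u2122",
}
for name, text in samples.items():
    # ascii() makes the code points of each result visible
    print(name, {form: ascii(result) for form, result in all_forms(text).items()})
```

The canonical forms (NFC, NFD) only affect the accented character, while the compatibility forms (NFKC, NFKD) also split the ligature, the fraction, and the trademark symbol.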
Image extraction. TET extracts raster images from PDF. Adjacent parts of a segmented
image are combined to facilitate postprocessing and re-use (e.g. multi-strip images created by some applications). Small images can be filtered in order to exclude tiny image
fragments from cluttering the output. If a mask is attached to an image, the mask can
be extracted as well.
Images are extracted in TIFF, JPEG, JPEG 2000, or JBIG2 format.
Geometry. TET provides precise metrics for the text, such as the position on the page,
glyph widths, and text direction. Specific areas on the page can be excluded or included
in the text extraction process, e.g. to ignore headers and footers or margins.
For images the pixel size, physical size, and color space are available as well as position and angle.
Text color. TET provides information about the color of glyphs. The color spaces for
filling and stroking and the corresponding color values can be retrieved. A convenient
shortcut is available for easily comparing the colors of multiple glyphs without having
to deal with the complexities of PDF color spaces.
Word detection and content analysis. TET can be used to retrieve low-level glyph information, but also includes advanced algorithms for high-level content and layout
analysis:
> Detect word boundaries to retrieve words instead of characters.
> Recombine the parts of hyphenated words (dehyphenation).
> Remove duplicate instances of text, e.g. shadow and fake bold text.
> Recombine paragraphs into reading order.
> Reorder text which is scattered over the page.
> Reconstruct lines of text.
> Recognize tabular structures on the page.
> Recognize superscript, subscript and drop caps (large initial characters at the start of
a paragraph).
TET Markup Language (TETML). The information retrieved from a PDF document can
be presented in an XML format called TET Markup Language (TETML) for processing
with standard XML tools. TETML contains text, image, and metadata information and
can optionally also contain font- and geometry-related details. TETML also contains color and colorspace information as well as interactive elements such as form fields, annotations, bookmarks, etc.
pCOS interface for simple access to PDF objects. TET includes pCOS (PDFlib Comprehensive Object System) for retrieving arbitrary PDF objects. With pCOS you can retrieve
PDF metadata, interactive elements (e.g. bookmark text, contents of form fields), or any
other information from a PDF document with a simple query interface. The syntax of
pCOS query paths is described separately in the pCOS Path Reference.
What is text? While TET deals with a large class of PDF documents, in some cases visible text cannot be extracted. The text must be encoded using PDF’s text and encoding
facilities (i.e., it must be based on a font). Although the following flavors of text may be
visible on the page they cannot be extracted with TET:
> Rasterized (pixel image) text, e.g. scanned pages;
> Text which is represented by vector elements without any font.
Note that metadata and text in hypertext elements (such as bookmarks, form fields,
notes, or annotations) can be retrieved with TETML or the pCOS interface; see Section
6.1, »PDF Document Domains«, page 67, for details. On the other hand, TET may extract
some text which is not visible on the page. This may happen in the following situations:
> Text using PDF’s invisible attribute (however, there is an option to exclude this kind
of text from the text retrieval process)
> Text which is obscured by some other element on the page, e.g. an image.
1.2 Many ways to use TET
TET is available as a programming library (component) for various development environments, and as a command-line tool for batch operations. Both offer similar features,
but are suitable for different deployment tasks. Both the TET library and command-line
tool can create TETML, TET’s XML-based output format.
> The TET programming library can be used for integration into your desktop or server
application. Many different programming languages are supported. Examples for
using the TET library with all supported language bindings are included in the TET
package.
> The TET command-line tool is suited for batch processing PDF documents. It doesn’t
require any programming, but offers command-line options which can be used to
integrate it into complex workflows.
> TETML output is suited for XML-based workflows and developers who are familiar
with the wide range of XML processing tools and languages, e.g. XSLT.
> TET connectors are suited for integrating TET in various common software packages,
e.g. databases and search engines.
> The TET Plugin is a free extension for Adobe Acrobat which makes TET available for
interactive use (see Section 4.1, »Free TET Plugin for Adobe Acrobat«, page 43, for
more information).
1.3 Roadmap to Documentation and Samples
Mini samples for the TET library. The TET distribution contains programming examples for all supported language bindings. These mini samples can serve as a starting
point for your own applications, or to test your TET installation. They comprise source
code for the following applications:
> The extractor sample demonstrates the basic loop for extracting text from a PDF document.
> The images_per_page sample extracts the images on each page and reports about
their geometry and other properties.
> The image_resources sample demonstrates the basic loop for extracting images from
a PDF document in a resource-oriented way (no geometric information available).
> The dumper sample shows the use of the integrated pCOS interface for querying general information about a PDF document.
> The fontfilter sample shows how to process font-related information, such as font
name and font size.
> The glyphinfo sample demonstrates how to retrieve detailed information about
glyphs (font, size, position, etc.) as well as text attributes such as dropcap, shadow,
hyphenation, etc. It also shows how to access text color information.
> The tetml sample contains code for generating TETML (TET’s XML language for expressing PDF contents) from a PDF document.
> The get_attachments sample demonstrates how to process PDF file attachments, i.e.
PDF documents which are embedded in another PDF document.
XSLT samples. The TET distribution contains several XSLT stylesheets. They demonstrate how to process TETML to achieve various goals:
> concordance.xsl: create list of unique words in a document sorted by descending frequency.
> fontfilter.xsl: List all words in a document which use a particular font in a size larger
than a specified value.
> fontfinder.xsl: For all fonts in a document, list all occurrences along with page number
and position information.
> fontstat.xsl: generate font and glyph statistics.
> index.xsl: create an alphabetically sorted »back-of-the-book« index.
> metadata.xsl: extract selected properties from document-level XMP metadata included in TETML.
> solr.xsl: generate input for the Solr enterprise search server.
> table.xsl: Extract a table to a CSV file (comma-separated values).
> tetml2html.xsl: convert TETML to HTML.
> textonly.xsl: extract the raw text from TETML input.
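The idea behind the concordance sample can also be tried on plain extracted text outside of XSLT. The following Python sketch is not a port of concordance.xsl; it merely counts words which have already been extracted and sorts them by descending frequency. Breaking ties alphabetically is a choice made for this sketch, and the function name is hypothetical.

```python
from collections import Counter


def concordance(words):
    """Return (word, count) pairs sorted by descending frequency."""
    counts = Counter(words)
    # Ties are broken alphabetically; this is a choice made for this sketch
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


words = "the cat and the dog and the bird".split()
for word, count in concordance(words):
    print(count, word)
```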
TET Cookbook. The TET Cookbook is a collection of source code examples for solving
specific application problems with the TET library. The Cookbook examples are written
in the Java language, but can easily be adjusted to other programming languages since
the TET API is almost identical for all supported language bindings. Some Cookbook
samples are written in the XSLT language. The TET Cookbook is organized in the following groups:
> Text: samples related to text extraction
> Font: samples related to text with a focus on font properties
> Image: samples related to image extraction
> TET & PDFlib+PDI: samples which extract information from a PDF with TET and construct a new PDF based on the original PDF and the extracted information. These
samples require the PDFlib+PDI product in addition to TET.
> TETML: XSLT samples for processing TETML
> Special: other samples
The TET Cookbook is available at the following URL:
www.pdflib.com/tet-cookbook.
pCOS Cookbook. The pCOS Cookbook is a collection of code fragments for the pCOS interface which is integrated in TET. It is available at the following URL:
www.pdflib.com/pcos-cookbook.
Details of the pCOS interface are documented in the pCOS Path Reference which is
included in the TET package.
1.4 What’s new in TET 5.0?
The features below are new or considerably improved in TET 5.0.
Text retrieval:
> retrieve fill and stroke color of text
> honor vector graphics to improve page and table layout recognition
> support vertical font metrics for CJK text
Image retrieval:
> significantly enhanced merging of fragmented images, e.g. for rotated images
> improved image handling for many special cases and rare PDF image flavors
> extract image masks and soft masks
> merge and convert JPEG 2000-compressed images
> preserve spot color in extracted TIFF images
> restrict image extraction to user-selected area
> collect XMP image metadata stored in non-standard locations by Adobe InDesign
Page processing:
> honor clipping paths to avoid extraction of invisible content
> honor layers (optional content) to avoid extraction of invisible content
> optionally ignore artifacts (irrelevant content) in Tagged PDF
> check whether an area on the page is empty or contains any text, image, or vector
graphics
TETML:
> TETML includes fill and stroke color of glyphs
> TETML includes information about interactive elements including annotations,
form fields, bookmarks, actions, JavaScript, signatures, etc.
> TETML includes color space and ICC profile details
> TETML includes information about layers and page labels
pCOS PDF information retrieval:
> pCOS pseudo objects for ICC profile details and image masking properties
> pCOS pseudo objects for form fields
Other areas:
> additional checks and heuristics for damaged and non-conforming PDF input
> updated TET language bindings, programming samples and TET connectors
> new options for improved PDF processing control
> many improvements in existing functionality
1.5 What’s new in TET 5.1?
The features below are new or considerably improved in TET 5.1:
> numbered and unnumbered lists are identified and expressed in TETML
(with page option structureanalysis={list=true})
> repair mode for damaged input documents with cross-reference streams
> improved workarounds for non-conforming input documents
> improved performance for disabled image, color, and vector engines as well as for
documents without layers
> reduced memory requirements
> pCOS interface updated to version 11 with support for certificate security
> other bug fixes
> updated language bindings
2 TET Command-Line Tool
2.1 Command-Line Options
The TET command-line tool allows you to extract text and images from one or more PDF
documents without the need for any programming. Output can be generated in plain
text (Unicode) format or in TETML, TET’s XML-based output format. The TET program
can be controlled via a number of command-line options. The program will insert space
characters (U+0020) after each word, U+000A after each line, and U+000C after each
page. It is called as follows for one or more input PDF files:
tet [<options>] <filename>...
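The separators described above make it straightforward to split the plain-text output into pages, lines, and words again. The following Python sketch assumes that the output has already been read and decoded into a string, and that a separator follows the last word, line, and page as well; the function name split_tet_text is hypothetical.

```python
def split_tet_text(text):
    """Split extracted text into a list of pages, each a list of word lists."""
    pages = text.split("\f")            # U+000C after each page
    if pages and pages[-1] == "":
        pages.pop()                     # nothing follows the last separator
    result = []
    for page in pages:
        lines = page.split("\n")        # U+000A after each line
        if lines and lines[-1] == "":
            lines.pop()
        # U+0020 after each word; empty pieces are dropped
        result.append([[w for w in line.split(" ") if w] for line in lines])
    return result


sample = "Hello world \nsecond line \n\fPage two \n\f"
print(split_tet_text(sample))
```

Indexing the result with a page number and a line number yields the list of words on that line.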
The TET command-line tool is built on top of the TET library. You can supply library options using the --docopt, --tetopt, --imageopt, and --pageopt options according to the option list tables in Chapter 10, »TET Library API Reference«, page 157. Table 2.1 lists all TET
command-line options (this list will also be displayed if you run the TET program without any options).
Note In order to extract CJK text you must configure access to the CMap files which are shipped with
TET according to Section 0.1, »Installing the Software«, page 7.
Table 2.1 TET command-line options
option | parameters | function
-- | | End the list of options; this is useful if file names start with a - character.
@filename | | Specify a response file with options; for a syntax description see »Response files«, page 20. Response files are only recognized before the -- option and before the first filename. Response files cannot be used to replace the parameter for another option, but must contain complete option/parameter combinations.
--docopt