The GATECloud Paralleliser (GCP)
Large-scale multi-threaded processing with GATE Embedded
Version 3.0-SNAPSHOT

Ian Roberts, Valentin Tablan
GATE Team
February 4, 2018

Contents

1 Introduction
  1.1 Definitions
  1.2 Processing Model
  1.3 Changelog
      1.3.1 3.0 (February 2018)
      1.3.2 2.8.1 (June 2017)
      1.3.3 2.8 (February 2017)
      1.3.4 2.7 (January 2017)
      1.3.5 2.6 (June 2016)
      1.3.6 2.5 (June 2015)
      1.3.7 2.4 (May 2014)
      1.3.8 2.3 (November 2012)
      1.3.9 2.2 (February 2012)

2 Installing and Running GCP
  2.1 Installing GCP
  2.2 Running GCP
      2.2.1 Using gcp-cli.jar
      2.2.2 Using gcp-direct.sh

3 The Batch Definition File
  3.1 The Structure of a Batch Descriptor
  3.2 Specifying the Input Handler
      3.2.1 The FileInputHandler
      3.2.2 The ZipInputHandler
      3.2.3 The ARCInputHandler and WARCInputHandler
      3.2.4 The streaming JSON input handler
  3.3 Specifying the Output Handlers
      3.3.1 File-based Output Handlers
      3.3.2 The Mímir Output Handler
      3.3.3 Conditional Output
  3.4 Specifying the Documents to Process
      3.4.1 The File and ZIP enumerators
      3.4.2 The ARC and WARC enumerators
      3.4.3 The ListDocumentEnumerator

4 Extending GCP
  4.1 Custom Input Handlers
  4.2 Custom Output Handlers
  4.3 Custom Naming Strategies
  4.4 Custom Document Enumerators

5 Advanced Topics
  5.1 GATE Configuration
  5.2 JMX Monitoring

Chapter 1 Introduction

GCP is a tool designed to support the execution of pipelines built using GATE Developer over large collections of thousands or millions of documents, using a multi-threaded architecture to make the best use of today's multi-core processors. GCP tasks, or batches, are defined using an extensible XML syntax describing the location and format of the input files, the GATE application to be run, and the kinds of outputs required.
A number of standard input and output handlers are provided, but all the various components are pluggable, so custom implementations can be used if the task requires it. GCP keeps track of the progress of each batch in a human- and machine-readable XML format, and is designed so that if a running batch is interrupted for any reason it can be re-run with the same settings and GCP will automatically continue from where it left off.

1.1 Definitions

This section defines a number of terms that have specific meanings in GCP.

Batch
A batch is the unit of work for a GCP process. It is described by an XML file, and includes the location of a saved GATE application state (a "gapp" file), the location of the report file, one input handler definition, zero or more output handler definitions, and a specification of which documents from the input handler should be processed (either as an explicit list of document IDs or as a document enumerator which calculates the IDs in an appropriate manner). Chapter 3 describes the format of batch definition files in detail.

Report
The progress of a running batch is recorded in an XML report file as the documents are processed. For each document ID, the report records whether the document was processed successfully or whether processing failed with an error. For successful documents the report includes statistics on how many annotations were found in the document, and for a completed batch it also records overall statistics on the number of documents and total amount of data processed, the total number of successful and failed documents, and the total processing time.

The report file for a batch is also the mechanism which allows GCP to recover if processing is unexpectedly interrupted. If GCP is asked to process a batch where the report file already exists, it will parse the existing report and ignore documents that are marked as having already been successfully processed. Thus you can simply restart a crashed GCP batch with the same command-line settings and it will continue processing from where it left off on the previous run.

GATE application
A GCP batch specifies the GATE application that is to be run over the documents as a standard "GAPP file" saved application state, which would typically be created using GATE Developer.

Input handler
The input handler for a GCP batch specifies the source of documents to be processed. The job of an input handler is to take a document ID and load the corresponding GATE Document object ready to be processed. There are a number of standard input handlers provided with GCP to take input documents from individual files on disk, directly from a ZIP archive file, or from an ARC file as produced by the Heritrix web crawler (http://crawler.archive.org). If the standard handlers do not suit your needs then you can provide a custom implementation by including your handler class in a GATE CREOLE plugin referenced by your saved application.

Document enumerator
While the input handler specifies how to go from document IDs to gate.Document objects, it does not specify which document IDs are to be processed. The IDs can be specified explicitly in the batch XML file, but more commonly an enumerator would be used to build a list of IDs by scanning the input directory or archive file. Standard enumerator implementations are provided, corresponding to the standard input handler types, to select a subset of documents from the input directory or archive according to various criteria. As with input handlers, custom enumerator implementations can be provided through the standard CREOLE plugin mechanism.
Output handler
Most batch definitions will include one or more output handler definitions, which describe what to do with the document once it has been processed by the GATE application. Standard output handler implementations are provided to save the documents as GATE XML files, plain text and XCES standoff annotations, inline XML ("save preserving format" in GATE Developer terms), and to send the annotated documents to a Mímir server for indexing. Custom implementations can be added using the CREOLE plugin mechanism. Note that output handler definitions are optional: if you do not specify any output handlers then GCP will not save the results anywhere, but this may be appropriate if, for example, your pipeline contains a custom PR that saves your results to a relational database or similar.

1.2 Processing Model

GCP processes a batch as follows.

1. Parse the batch definition file, and run the document enumerators (if any) to build the complete list of document IDs to be processed.

2. Parse the existing (possibly partial) report file, if one exists, and remove from this list any documents that are already marked as having been successfully processed.

3. Create a thread pool of a size specified on the command line (the default is 6 threads).

4. Load the saved application state, and use Factory.duplicate to make additional copies of the application such that there is one independent copy of the application per thread in the thread pool.

5. Run the processing threads. Each thread will repeatedly:
   • take the next available unprocessed document ID from the list.
   • ask the input handler for the corresponding gate.Document.
   • put that document into a singleton Corpus and run this thread's copy of the GATE application over that corpus.
   • pass the annotated document to each of the output handlers.
   • write an entry to the report file indicating whether the document was processed successfully or whether an exception occurred during processing.
   • release the document using Factory.deleteResource.

6. Once all the documents have been processed, shut down the thread pool and call Factory.deleteResource to cleanly shut down the GATE applications.

Due to the asynchronous nature of the processing threads, if one document takes a particularly long time to process the other threads can proceed with many other documents in parallel; they are not forced to wait for the slowest thread.

1.3 Changelog

This section summarises the main changes between releases of GCP.

1.3.1 3.0 (February 2018)

Updated to work with GATE Embedded version 8.5:

• Minimum required Java version is now Java 8.
• GCP now builds using Maven rather than Ant, and has been split into "api" and "impl" modules. GATE plugins that want to provide custom input or output handlers should declare a "provided" dependency on the appropriate version of uk.ac.gate:gcp-api.
• New command line parameters -C and -p allow pre-loading of specific GATE plugins in addition to those declared by the saved application. This is useful, for example, to load plugins that provide document format parsers.

1.3.2 2.8.1 (June 2017)

GCP now depends on GATE Embedded 8.4.1. Also, the -i option to gcp-direct.sh can now be a file which lists the documents to process, instead of just a directory of documents.

1.3.3 2.8 (February 2017)

GCP now depends on GATE Embedded 8.4.
1.3.4 2.7 (January 2017)

This is a minor bugfix release; the main change is that GCP now depends on GATE Embedded 8.3. There have been minor changes to the gcp-direct.sh script, in particular it is now possible to run a pipeline with no output handler at all. This is useful in cases where there is a PR within the pipeline that is responsible for handling the output, or if you want to run a pipeline purely for its side effects (e.g. building a frequency table or training some sort of machine learning model).

1.3.5 2.6 (June 2016)

This is a minor bugfix release; the main change is that GCP now depends on GATE Embedded 8.2.

1.3.6 2.5 (June 2015)

• Now depends on GATE Embedded 8.1.
• Introduced "streaming" style input and output handlers for JSON data (e.g. from Twitter), which can read a series of documents from a single JSON input file, and write JSON results to a single concatenated output file (sections 3.2.4 and 3.3.1).
• Introduced the gcp-direct.sh script to cover simple invocations of GCP without the need to write a batch definition XML file (section 2.2.2).
• For "controller-aware" PRs (http://gate.ac.uk/gate/doc/javadoc/gate/creole/ControllerAwarePR.html), the various callbacks are now invoked just once per batch rather than before and after every single document.

1.3.7 2.4 (May 2014)

• Now depends on GATE Embedded 8.0 (and thus requires Java 7 to run).
• Added input handler for WARC format archives, to complement the existing ARC handler (section 3.2.3).
• ARC and WARC handlers can optionally load individual records from remotely hosted archives using HTTP requests with a "Range" header. This facility can be used with publicly-hosted data sets such as Common Crawl (http://www.commoncrawl.org). To support this functionality, document identifiers in a batch definition can now take XML attributes as well as the actual string identifier (exactly how such attributes are used is up to the handler implementations).
• Added an output handler to save documents in a JSON format modelled on that used by Twitter to represent "entities" (e.g. username mentions) in Tweets.
• Efficiency improvements in the Mímir output handler, to send documents to the server in batches rather than opening a new HTTP connection for every document.

1.3.8 2.3 (November 2012)

• Now depends on GATE Embedded 7.1.
• Introduced support for conditional saving of documents (section 3.3.3).
• Added the serialized object output handler (section 3.3.1).
• More robust and reliable counting of the size of each input document.

1.3.9 2.2 (February 2012)

• Now depends on GATE Embedded 7.0.
• Introduced a Java-based command line interface to replace the gcp.sh shell script, which behaves more consistently across platforms.

Chapter 2 Installing and Running GCP

2.1 Installing GCP

Binary releases are available for release versions of GCP starting with version 2.5, as a ZIP file which can be downloaded from GitHub at https://github.com/GateNLP/gcp/releases (versions prior to 3.0 are available from SourceForge at http://sf.net/projects/gate/files/gcp/). For development versions the software must be built from source. The source code is available on GitHub at https://github.com/GateNLP/gcp.

To build GCP you will need a Java 8 JDK. Sun/Oracle and OpenJDK have been tested and are known to work; GCJ is known not to work. You will also need Apache Maven 3.3 or later. Running "mvn install" will build the various components of GCP and create a ZIP file containing the binary distribution under distribution/target. Unpack that file somewhere to create your GCP installation.
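A build from source might look like the following minimal sketch (assuming git, a Java 8 JDK and Maven are already installed; the install location is up to you):

  git clone https://github.com/GateNLP/gcp.git
  cd gcp
  mvn install
  # the binary distribution ZIP is created under distribution/target;
  # unpack it wherever you want your GCP installation to live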
2.2 Running GCP

Once GCP is installed you can run it in one of two ways:

• using the gcp-cli.jar executable JAR file in the installation directory.
• using the gcp-direct.sh bash script.

2.2.1 Using gcp-cli.jar

The usual way to run GCP is to write one or more batch definition XML files (see chapter 3 for details) defining the application you want to run, the documents to process, and the output formats to produce. You then pass these batch definitions to gcp-cli.jar for processing. The CLI tool takes a number of optional arguments:

-m
Specifies the maximum Java heap size, in the format expected by the usual -Xmx Java option, e.g. -m 10G for a 10GB heap limit. The default setting is 12G. The gcp-cli will spawn a separate java process to run each batch, passing this memory limit to that process. This is different from specifying a -Xmx option to gcp-cli itself, which would define the heap size limit for the CLI process, not the batch runner processes it spawns.

-t
Specifies the number of threads that GCP should use to execute the GATE application. Typically this should be set to between 1 and 1.5 times the number of processing cores available on the machine. The default value is 6, which is generally suitable for a 4-core machine.

-D
Java system property settings, for example -Djava.io.tmpdir=/home/bigtmp. -D options specified before the -jar apply to the virtual machine running the CLI; those specified after -jar gcp-cli.jar will be passed to the batch runner processes. If you have an installed copy of GATE Developer you may wish to set -Dgate.home=... to point to your installation (after the -jar, as this is a setting that needs to apply to the batch runner VM). This is required if your saved GATE application refers to standard GATE plugins (using $gatehome$ paths in the xgapp), but is optional if the application is self-contained; GCP includes its own copy of GATE Embedded and does not require a separate installed copy of the core libraries.

GATE plugins can also be pre-loaded using the -C and -p options; see the gcp-direct.sh section below for details of these.

The tool will determine the location where GCP is installed in the following order, taking the first it can find:

1. The value of the environment variable GCP_HOME, if it is set.
2. The value of the property gcp.home, if it is set.
3. The location of the JAR file used for running the program.

Note that if the environment variable GCP_HOME is set to a different directory than the one used to run gcp-cli, the version of the batch runner in the directory pointed to by GCP_HOME will be invoked, which is probably not what is intended.

The settings for the -m and -t options are typically a trade-off: if your application is particularly memory-hungry or you are processing particularly large or complex documents, you may need to lower the number of processing threads in order to give more memory (on average) to each one.

GCP can run in two modes. In the basic "single-batch" mode the final command-line argument is simply the path to a single batch definition XML file (see chapter 3 for details), and GCP will process that batch and then exit. The other (and more commonly used) mode is "multi-batch" mode, signified by the -d command line option. In this mode the final command-line argument is the path to a directory referred to as the working directory.
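For example, a single-batch run might be invoked as follows (the batch file path is purely illustrative):

  java -jar gcp-cli.jar -t 6 -m 8G /data/gcp/batches/sample.xml

A multi-batch run over a working directory is started like this: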
  java -jar gcp-cli.jar -t 4 -m 8G -d /data/gcp

The working directory is expected to contain a subdirectory named "in", and any file in this directory with the extension .xml (in lower case) is assumed to be a batch definition file. For each batch batch.xml in the "in" directory, the script will:

• run the batch, redirecting the standard output and error streams to a file working-dir/logs/batch.xml.log
• if the batch completes successfully, move the definition file to working-dir/out/batch.xml
• or, if the batch fails (i.e. the Java process exits with a non-zero exit code, which occurs if, for example, one of the processing threads encounters an OutOfMemoryError), move the definition file to working-dir/err/batch.xml

Additional batches can be added to the "in" directory at any time; whenever a batch completes the script will re-scan the "in" directory to locate the next available batch. In particular, failed batches can be moved back from "err" to "in" and they will be re-processed, and if the report file for the failed batch is intact GCP will continue on from where it left off on the previous run. Creating a file named shutdown.gcp in the "in" directory will cause the script to exit at the end of the batch it is currently processing (or immediately if it is currently idle).

2.2.2 Using gcp-direct.sh

The gcp-direct.sh script can be used for simple cases where you want to process all the files under one particular directory and output the resulting annotations in GATE XML or FastInfoset format. For this specific case it is not necessary to write an XML batch descriptor; you can specify the required parameters using command line options to gcp-direct.sh:

-t
The number of parallel threads to use.

-x
The path to the saved GATE application that you want to run.

-f
The output format to use for saving results, which must be either "xml" (GATE XML format) or "finf" (FastInfoset format). To use FastInfoset the GATE Format_FastInfoset plugin must be loaded by the saved application.

-i
The directory in which to look for the input files, or a file that contains relative path names of the input files. If this points to a directory, all files in this directory and any subdirectories will be processed (except for standard backup and temporary file name patterns and source control metadata; see http://ant.apache.org/manual/dirtasks.html#defaultexcludes for details). If this points to a file, the content of the file is expected to be one relative file path per line, using UTF-8 encoding. The file paths are interpreted as relative to the directory that contains the list file. If processed documents are written, this will also be their path relative to the output directory.

-o (optional)
The directory in which to place the output files. Each input file will generate an output file with the same name in the output directory. If this option is missing, and the option -b is missing as well, the documents are not saved!

-b (optional)
If this option is specified it can be used to specify a batch file. In that case, the options -x, -i, -o, -r, -I are not required and are ignored if specified; the corresponding information is taken from the batch configuration file instead.

-r (optional)
Path to the report file for this batch; if omitted GCP will use report.xml in the current directory.

-ci
The input files are all gzip-compressed.

-co
The output files should all be gzip-compressed. This only makes sense if -f xml is also specified, since the default output format finf is already a compressed format. If this option is specified, the output file name gets the extension .gz appended, in addition to any other extension it may already have.

-p (optional, may be specified multiple times)
A GATE plugin to pre-load in addition to those specified by the saved application. The value of this option can be in one of three formats (tried in this order):

1. a set of Maven co-ordinates group:artifact:version, to load a plugin from a Maven repository.
2. an absolute URL, e.g. starting file:/... or http://, to load a directory-based plugin from that URL.
3. a local file path, either absolute or relative to the working directory of the GCP process, to load a directory-based plugin from disk.

-C (optional, may be specified multiple times)
When loading Maven-style plugins using -p, GATE will typically go out to the internet to fetch the plugin and its dependencies from a remote Maven repository. If you have a local Maven cache of plugins on your disk you can specify its location with this option, and the local cache will be searched first before attempting to download plugins from the network.

Additionally, you can specify -D and -X options which will be passed through to the Java VM; for example, you can set the maximum amount of heap memory that the JVM can use with an option like -Xmx2G.
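Putting these options together, a typical gcp-direct.sh invocation might look like the following sketch (run from the GCP installation directory; all paths are purely illustrative):

  ./gcp-direct.sh -t 4 -x /data/apps/annie.xgapp \
      -i /data/input-files -o /data/output-files \
      -f xml -r /data/reports/report.xml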
The gcp-direct.sh script is deliberately opinionated, in order to reduce the number of different options that need to be set, and it has a number of hard-coded assumptions. It assumes that your input documents use the UTF-8 character encoding, that the correct document format parser to use can be determined from the file extension, and that you always want to save all the annotations that your application generates. If you need to process documents in a different encoding, you have more complex output requirements (XCES, JSON, Mímir, ...) or want to output only a subset of the GATE annotations from each document, then you should write a batch definition in XML and use gcp-cli.jar as discussed above.

Chapter 3 The Batch Definition File

3.1 The Structure of a Batch Descriptor

GCP batches are defined by an XML file whose format is as follows. The root element defines the batch identifier:

  <batch id="batch-id" xmlns="http://gate.ac.uk/ns/cloud/batch/1.0">

The children of this element are:

application (required)
Specifies the location of the saved GATE application state.

report (required)
Specifies the location of the XML report file. If the report file already exists GCP will read it and process only those documents that have not already been processed successfully.

input (required)
Specifies the input handler which will be the source of documents to process. Most handlers load documents one by one based on their IDs, but certain handlers operate in a streaming mode, processing a block of documents in one pass.

output (zero or more)
Specifies what to do with the documents once they have been processed.

documents (required, except when using a streaming input handler)
Specifies the document IDs to be processed, as any combination of the child elements:

  id: a single document ID, e.g. <id>bbc/article001.html</id>.

  documentEnumerator: an enumerator that generates a list of IDs. The enumerator implementation chosen will typically depend on the specific type of input handler that the batch uses.
The following example shows a simple XML batch definition file which runs ANNIE and saves the results in GATE XML format. The input, output and documents elements are discussed in more detail in the following sections.

  <?xml version="1.0" encoding="UTF-8"?>
  <batch id="sample" xmlns="http://gate.ac.uk/ns/cloud/batch/1.0">
    <application file="../annie.xgapp"/>

    <report file="../reports/sample-report.xml"/>

    <input dir="../input-files"
           mimeType="text/html"
           compression="none"
           encoding="UTF-8"
           class="gate.cloud.io.file.FileInputHandler"/>

    <output dir="../output-files-gate"
            compression="gzip"
            encoding="UTF-8"
            fileExtension=".GATE.xml.gz"
            class="gate.cloud.io.file.GATEStandOffFileOutputHandler"/>

    <documents>
      <id>ft/03082001.html</id>
      <id>gu/04082001.html</id>
      <id>in/09082001.html</id>
    </documents>
  </batch>

It is important to note that all relative file paths specified in a batch descriptor are resolved against the location of the descriptor file itself; thus, if this descriptor file were located at /data/gcp/batches/sample.xml then it would load the application from /data/gcp/annie.xgapp.

3.2 Specifying the Input Handler

Each batch definition must include a single input element defining the source of documents to be processed. Given a document ID, the job of the input handler is to locate the identified document and load it as a gate.Document to be processed by the application. Note that the input handler describes how to find the document for each ID but does not define which IDs are to be processed; that is the job of the documents element described below.

The input element must have a class attribute specifying the name of the Java class implementing the handler. GCP will create an instance of this class and pass the remaining attributes to the handler to allow it to configure itself. Thus, which attributes are supported and/or required depends on the specific handler class. GCP provides four standard input handler types:

• gate.cloud.io.file.FileInputHandler to read documents from individual files on the filesystem
• gate.cloud.io.zip.ZipInputHandler to read documents directly from a ZIP archive
• gate.cloud.io.arc.ARCInputHandler and gate.cloud.io.arc.WARCInputHandler to read documents from an ARC or WARC archive as produced by the Heritrix web crawler (http://crawler.archive.org)

and one streaming handler:

• gate.cloud.io.json.JSONStreamingInputHandler to read a stream of documents from a single large JSON file (for example a collection of Tweets from Twitter's streaming API).

3.2.1 The FileInputHandler

FileInputHandler reads documents from individual files on the filesystem. It can read any document format supported by GATE Embedded, and in addition it can read files that are GZIP compressed, unpacking them on the fly as they are loaded. It supports the following attributes on the input element in the batch descriptor:

encoding (optional)
The character encoding that should be used to read the documents (i.e. the value for the encoding parameter when creating a DocumentImpl using the GATE Factory). If omitted, the default GATE Embedded behaviour applies, i.e. the platform default encoding is used.

mimeType (optional)
The MIME type that should be assumed when creating the document (i.e. the value of the DocumentImpl mimeType parameter). If omitted, GATE Embedded will attempt to guess the appropriate MIME type for each document in the usual way, based on the file name extension and magic number tests.

compression (optional)
The compression that has been applied to the files, either "none" (the default) or "gzip".

The actual mapping from document IDs to file locations is controlled by a naming strategy, another Java object which is configured from the input element's attributes. The default naming strategy (gate.cloud.io.file.SimpleNamingStrategy) treats the document ID as a relative path (technically a relative URI, so forward slashes must be used in document IDs even when running on Windows, where file paths normally use backslashes), and takes the following attributes:

dir (required)
The base directory under which documents are found.

fileExtension (optional)
A file extension to append to the document ID.

Given a document ID such as "ft/03082001", a base directory of "/data" and a file extension of ".html", the SimpleNamingStrategy would load the file "/data/ft/03082001.html". To use a different naming strategy implementation, specify the Java class name of the custom strategy class as the namingStrategy attribute of the input element, along with any other attributes the strategy requires to configure it.
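As a concrete sketch of the naming-strategy example above, the corresponding input definition would be:

  <input class="gate.cloud.io.file.FileInputHandler"
         dir="/data"
         fileExtension=".html"/>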
3.2.2 The ZipInputHandler

The ZIP input handler reads documents directly out of a ZIP archive, and is configured in a similar way to the file-based handler. It supports the following attributes:

encoding (optional)
Exactly as for FileInputHandler.

mimeType (optional)
Exactly as for FileInputHandler.

srcFile (required)
The location of the ZIP file from which documents will be read. This parameter was previously named "zipFile"; the old name is supported for backwards compatibility but not recommended for new batches.

fileNameEncoding (optional)
The default character encoding to assume for file names inside the ZIP file. This attribute is only relevant if the ZIP file contains files whose names contain non-ASCII characters without the "language encoding flag" or "Unicode extra fields", and can be omitted if this does not apply. There is a detailed discussion of file name encodings in ZIP files in the Ant manual (http://ant.apache.org/manual/Tasks/zip.html#encoding), but the rule of thumb is that if the ZIP file was created using Windows "compressed folders" then fileNameEncoding should be set to match the encoding of the machine that created the ZIP file; otherwise the correct value is probably "Cp437" or "UTF-8".

The ZIP input handler does not use pluggable naming strategies, and simply assumes that the document ID is the path of an entry in the ZIP file.
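A ZIP-based input definition might therefore look like this sketch (the archive path is illustrative):

  <input class="gate.cloud.io.zip.ZipInputHandler"
         srcFile="../input-documents.zip"
         mimeType="text/html"
         encoding="UTF-8"/>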
3.2.3 The ARCInputHandler and WARCInputHandler

These two input handlers read documents out of ARC and WARC format web archive files as produced by the Heritrix web crawler and other similar tools. They support the following attributes:

srcFile (optional)
The location of the archive file (for ARC, this parameter was previously called "arcFile"; the old name is supported for backwards compatibility but not recommended for new batches). These input handlers can operate in one of two modes: if srcFile is specified then the handler will load records from this specific archive file on disk, but if srcFile is not specified then each document ID must provide a fully qualified http or https URL to an archive. In the second mode the selected records will be downloaded individually using "byte range" HTTP requests.

defaultEncoding (optional)
The default character encoding to assume for entries that do not specify their encoding in the entry headers. If an entry specifies its own encoding explicitly this will be used. If this attribute is omitted, "Windows-1252" is assumed as the default.

mimeType (optional)
The MIME type that should be assumed when creating the document (i.e. the value of the DocumentImpl mimeType parameter). If omitted, the usual GATE Embedded heuristics will apply. The input handlers make the HTTP headers from the archive entry available to GATE as if the document had been downloaded directly from the web, so the Content-Type header from the archive entry is available to these heuristics.

The web archive input handlers expect document IDs of the following form:

  <id recordPosition="NNN" [url="optional url of archive"]
      recordOffset="NNN" recordLength="NNN">{original entry url}</id>

The content of the id element should be the original URL from which the entry was crawled, and the attributes are:

recordPosition
A numeric value that is used as a sequence number. If the IDs are generated by the corresponding enumerator (see below), then this attribute will contain the actual record position inside the archive file.

recordOffset and recordLength
The byte offset of the required record in the archive, and the record's length in bytes.

url (optional)
A full HTTP or HTTPS URL to the source archive file. If this is provided, GCP will download just the specific target record using a "Range" header on the HTTP request, rather than loading the record from the input handler's usual srcFile.

The standard enumerator implementations (see below) create IDs in the correct form. The ARC input handler adds all the HTTP headers and archive record headers for the entry as features on the GATE Document it creates. HTTP header names are prefixed with "http header " and ARC/WARC record headers with "arc header ".

3.2.4 The streaming JSON input handler

An increasing number of services, most notably Twitter and social media aggregators such as DataSift, provide their data in JSON format. Twitter offers streaming APIs that deliver Tweets as a continuous stream of JSON objects concatenated together, while DataSift typically delivers a large JSON array of documents. The streaming JSON input handler can process either format, treating each JSON object in the "stream" as a separate GATE document. The gate.cloud.io.json.JSONStreamingInputHandler accepts the following attributes:

srcFile
The file containing the JSON objects (either as a top-level array or simply concatenated together, optionally separated by whitespace).

idPointer
The "path" within each JSON object of the property that represents the document identifier. This is an expression in the JSON Pointer language (http://tools.ietf.org/html/draft-ietf-appsawg-json-pointer-03). It must start with a forward slash and then a sequence of property names separated by further slashes. A suitable value for the Twitter JSON format would be /id_str (the property named "id_str" of the object), and for DataSift /interaction/id (the top-level object has an "interaction" property whose value is an object, and we want the "id" property of that object). Any object that does not have a property at the specified path will be ignored.

compression (optional)
The compression format used by the srcFile, if any. If the value is "none" (the default) then the file is assumed not to be compressed; if the value is one of the compression formats supported by Apache Commons Compress ("gz", "bzip2", "xz", "lzma", "snappy-raw", "snappy-framed", "pack200", "z") then it will be unpacked using that library (for backwards compatibility, "gzip" is treated as an alias for "gz"). If the value is "any" then the handler uses the auto-detection capabilities of Commons Compress to attempt to detect the appropriate compression format. Any other value is taken to be the command line for a native decompression program that expects compressed data on stdin and produces decompressed data on stdout, for example "lzop -dc".

mimeType (optional but highly recommended)
The value to pass as the "mimeType" parameter when creating a GATE Document from the JSON string. This will be used by GATE to select an appropriate document format parser, so for Twitter JSON you should use "text/x-json-twitter" and for DataSift "text/x-json-datasift". Note that the GATE plugin defining the relevant format parser must be loaded as part of your GATE application.

This is a streaming handler: it will process all documents in the JSON bundle and does not require a documents section in the batch specification. As with other input handlers, when restarting a failed batch, documents that were successfully processed in the previous run will be skipped.
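For example, an input definition for a gzip-compressed stream of Tweets might look like this sketch (the file name is illustrative):

  <input class="gate.cloud.io.json.JSONStreamingInputHandler"
         srcFile="../tweets.json.gz"
         compression="gz"
         idPointer="/id_str"
         mimeType="text/x-json-twitter"/>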
3.3 Specifying the Output Handlers

Output handlers are responsible for taking the GATE Documents that have been processed by the application and doing something with the results. GCP supplies a number of standard output handlers to save the document text and annotations to files in various formats, and also a handler to send the annotated documents to a remote Mímir server for indexing. Most batches would specify at least one output handler, but GCP does support batches with no outputs (if, for example, the application itself contains a PR responsible for outputting results). Output handlers are specified using