Grobid Service Manual
GROBID manual

1. Overview
2. Build and set up environment for local deployment
3. Build and set up environment for remote deployment
4. Use of grobid-service console
5. grobid-service REST API
6. Examples with curl

Authors: Damien Ridereau, Patrice Lopez
Contact: patrice.lopez@inria.fr

1. Overview

The grobid-service project is a RESTful service implementation for accessing the grobid system. grobid-service is an open source project under the Apache License 2.0. It comes as a war file to deploy in a web container, e.g. Tomcat. The project also contains the grobid-core libraries, which do the extraction work.

2. Build and set up environment for local deployment

To build grobid for local deployment, go to the root of the project and run the following command:

    mvn clean install

Then deploy the generated war to the server. The artifact is in:

    grobid-service/target/grobid-service-.war

3. Build and set up environment for remote deployment

3.1. Logs

Grobid uses Apache log4j as its logging library. By default, the logs are written to a file grobid.log in the directory from which the application is launched. This is not suitable for a production deployment.
To set the path and filename for logging, edit the file grobid/grobid-core/src/resources/log4j.xml and change the following line according to your production logging policy:

You can indicate the desired log path, for instance for Tomcat:

Make sure that Tomcat or JBoss has write permission on the indicated log path.

3.2. Parameters set up

In grobid-service- .war, the file web.xml has 3 parameters to set before starting the server:

    org.grobid.property: path to grobid.properties
    org.grobid.property.service: path to grobid_service.properties
    org.grobid.home: path to grobid_home

These parameters are filled from the variables _GROBID_PROPERTY, _GROBID_SERVICE_PROPERTY and _GROBID_HOME, so that a script can fill in the values for a given environment. It is also possible to set these variables manually before building the war artifact.

3.3. Build

To build grobid for remote deployment, go to the root of the project and run the following command:

    mvn clean install -PgenericBuild

This generates 2 artifacts, one in grobid-home and one in grobid-service:

    grobid-home/target/grobid-home- .zip
    grobid-service/target/grobid-service- .war

Copy these 2 artifacts to your remote server. grobid-home- .zip contains the required native libraries, the models, the lexicons and a config directory that contains 2 properties files, grobid.properties and grobid_service.properties. Unzip grobid-home wherever you want on your server:

    unzip grobid-home- .zip

4. Use of grobid-service console

The welcome page is available at http:// / (e.g. for a local Tomcat, http://localhost:8080/ ). From there you can open the About page (Fig 4.1), run conversions from the "Test Rest Interface" page (Fig 4.2) and access the administration parameters contained in grobid.properties and grobid_service.properties (Fig 4.3):

    Fig 4.1: About
    Fig 4.2: Test Rest Interface
    Fig 4.3: Service administration

The web page "Test Rest Interface" (Fig. 4.2) allows you to test the different REST requests quickly and easily. To look at the code, start from the GrobidRestService class, which is the entry point for each REST service of Grobid.

5. grobid-service REST API

The entries below list the resources provided by grobid-service and the HTTP verbs each one accepts. All URLs described below are relative paths; the root URL is http:// / .

The consolidation parameters (consolidateHeader and consolidateCitations) indicate whether GROBID should try to complete the extracted metadata with an additional external call to the CrossRef API. The CrossRef look-up is based on the reliable subset of the extracted metadata that this API supports.

/admin
    Group: Administration
    Parameters: sha1
    Request: POST, application/x-www-form-urlencoded; or GET /admin?sha1= (String)
    Response: text/html
    Description: Returns the parameters of grobid.properties and grobid_service.properties formatted as an HTML table.

/sha1
    Group: Administration
    Parameters: sha1
    Request: POST, application/x-www-form-urlencoded; or GET /sha1?sha1= (String)
    Response: text/html
    Description: Returns the input string hashed using SHA-1.

/allProperties
    Group: Administration
    Parameters: sha1
    Request: POST, application/x-www-form-urlencoded; or GET /allProperties?sha1= (String)
    Response: text/xml
    Description: Returns all properties (key/value/type) as XML. The XML sent back follows this schema: [schema markup not preserved; remaining text: key value type ...]

/changePropertyValue
    Group: Administration
    Parameters: xml
    Request: POST, application/x-www-form-urlencoded; or GET /changePropertyValue?xml= (String)
    Response: text/xml
    Description: Changes the value of the property whose key is passed in the XML input. The XML input has to follow the following schema:

/grobid
    Group: General
    Parameters: N/A
    Request: GET, N/A
    Response: text/html
    Description: Gives a very brief description of grobid.

/processHeaderDocument
    Group: PDF to tei.xml conversion
    Parameters: input, consolidateHeader
    Request: POST or PUT, multipart/form-data
    Response: application/xml
    Description: Extracts the header of the input PDF document, normalizes it and converts it into TEI format. consolidateHeader is a string of value 0 (no consolidation) or 1 (consolidate).
/processFulltextDocument
    Group: PDF to tei.xml conversion
    Parameters: input, consolidateHeader, consolidateCitations
    Request: POST or PUT, multipart/form-data
    Response: application/xml
    Description: Converts the complete input document into tei.xml format (header, body and bibliographical section). consolidateHeader and consolidateCitations are strings of value 0 (no consolidation) or 1 (consolidate).

/processFulltextAssetDocument
    Group: PDF to tei.xml conversion
    Parameters: input, consolidateHeader, consolidateCitations
    Request: POST or PUT, multipart/form-data
    Response: application/zip
    Description: Converts the complete input document into tei.xml format (header, body and bibliographical section). The result is a ZIP archive containing the TEI fulltext and the embedded images (the document assets) converted to PNG. consolidateHeader and consolidateCitations are strings of value 0 (no consolidation) or 1 (consolidate).

/processReferences
    Group: PDF to tei.xml conversion
    Parameters: input, consolidateCitations
    Request: POST or PUT, multipart/form-data
    Response: application/xml
    Description: Extracts all the references present in the input document and converts them into tei.xml format. consolidateCitations is a string of value 0 (no consolidation) or 1 (consolidate).

/processDate
    Group: Parse/normalize data
    Parameters: date
    Request: POST or PUT, application/x-www-form-urlencoded
    Response: application/xml
    Description: Parses a raw date and returns the corresponding normalized date in ISO 8601, embedded in a TEI fragment.

/processHeaderNames
    Group: Parse/normalize data
    Parameters: names
    Request: POST or PUT, application/x-www-form-urlencoded
    Response: application/xml
    Description: Parses a raw sequence of names from a header section and returns the corresponding normalized authors in TEI format.

/processCitationNames
    Group: Parse/normalize data
    Parameters: names
    Request: POST or PUT, application/x-www-form-urlencoded
    Response: application/xml
    Description: Parses a raw sequence of names from a bibliographical reference and returns the corresponding normalized authors in TEI format.

/processAffiliations
    Group: Parse/normalize data
    Parameters: affiliations
    Request: POST or PUT, application/x-www-form-urlencoded
    Response: application/xml
    Description: Parses a raw sequence of affiliations and returns the corresponding normalized affiliations with address in TEI format.

/processCitations
    Group: Parse/normalize data
    Parameters: citations, consolidateCitations
    Request: POST or PUT, application/x-www-form-urlencoded
    Response: application/xml
    Description: Parses a raw citation and returns the corresponding normalized citation in TEI format. consolidateCitations is a string of value 0 (no consolidation) or 1 (consolidate).

/processCitationPatentTEI
    Group: Citation extraction and normalization from patents
    Parameters: input, consolidateCitations
    Request: POST or PUT, multipart/form-data
    Response: application/xml
    Description: Extracts and parses the patent and non-patent citations in the description of a patent encoded in TEI. Results are added to the original document as TEI stand-off annotations. consolidateCitations is a string of value 0 (no consolidation) or 1 (consolidate).

/processCitationPatentST36
    Group: Citation extraction and normalization from patents
    Parameters: input, consolidateCitations
    Request: POST or PUT, multipart/form-data
    Response: application/xml
    Description: Extracts and parses the patent and non-patent citations in the description of a patent encoded in ST.36. Results are returned as a list of TEI citations. consolidateCitations is a string of value 0 (no consolidation) or 1 (consolidate).

/processCitationPatentTXT
    Group: Citation extraction and normalization from patents
    Parameters: text, consolidateCitations
    Request: POST or PUT, application/x-www-form-urlencoded
    Response: application/xml
    Description: Extracts and parses the patent and non-patent citations in the description of a patent sent as UTF-8 text. Results are returned as a list of TEI citations. consolidateCitations is a string of value 0 (no consolidation) or 1 (consolidate).

/processCitationPatentPDF
    Group: Citation extraction and normalization from patents
    Parameters: input, consolidateCitations
    Request: POST or PUT, multipart/form-data
    Response: application/xml
    Description: Extracts and parses the patent and non-patent citations in the description of a patent sent as PDF. Results are returned as a list of TEI citations. consolidateCitations is a string of value 0 (no consolidation) or 1 (consolidate).

6. Examples with curl

Here are example command lines that call the Grobid service using curl. The server instance is assumed to be localhost, listening on port 8080.
• header extraction of a PDF file in the current directory:

    > curl -v --form input=@./thefile.pdf localhost:8080/processHeaderDocument

• fulltext extraction of a PDF file in the current directory with consolidation of the citations:

    > curl -v --form consolidateCitations=1 --form input=@./thefile.pdf localhost:8080/processFulltextDocument

• parsing of a raw reference string in isolation with default consolidation (by default header metadata are consolidated, but bibliographical references are not):

    > curl -X POST -d "citations=Graff, Expert. Opin. Ther. Targets (2002) 6(1): 103-113" localhost:8080/processCitation

• extraction and parsing of all references in a PDF with default consolidation (by default bibliographical references are not consolidated):

    > curl -v --form input=@./thefile.pdf localhost:8080/processReferences

[XML schema residue, markup not preserved: pwd key value type]
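The curl calls of section 6 can also be issued from a script. The following is a minimal Python sketch using only the standard library, not part of grobid-service itself. It assumes the same server address as the curl examples (localhost:8080). The helper names form_request, multipart_request and send are hypothetical and introduced here for illustration only. The path /processCitation and the field names input, citations and consolidateCitations are taken from the curl examples.

```python
import urllib.parse
import urllib.request
import uuid

# Assumption: the service runs on the same host and port as in the curl examples.
BASE_URL = "http://localhost:8080"


def form_request(path, fields, base_url=BASE_URL):
    """Build a POST request with an application/x-www-form-urlencoded body."""
    body = urllib.parse.urlencode(fields).encode("utf-8")
    return urllib.request.Request(
        base_url + path,
        data=body,
        method="POST",
        headers={"Content-Type": "application/x-www-form-urlencoded"},
    )


def multipart_request(path, pdf_bytes, fields=None, filename="thefile.pdf",
                      base_url=BASE_URL):
    """Build a POST request with a multipart/form-data body.

    The PDF is sent as the 'input' part, as in the curl examples; extra text
    fields such as consolidateCitations are sent as additional parts.
    """
    boundary = uuid.uuid4().hex
    chunks = []
    for name, value in (fields or {}).items():
        chunks.append((
            f'--{boundary}\r\n'
            f'Content-Disposition: form-data; name="{name}"\r\n\r\n'
            f'{value}\r\n'
        ).encode("utf-8"))
    chunks.append((
        f'--{boundary}\r\n'
        f'Content-Disposition: form-data; name="input"; filename="{filename}"\r\n'
        f'Content-Type: application/pdf\r\n\r\n'
    ).encode("utf-8"))
    chunks.append(pdf_bytes)
    chunks.append(f"\r\n--{boundary}--\r\n".encode("utf-8"))
    return urllib.request.Request(
        base_url + path,
        data=b"".join(chunks),
        method="POST",
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
    )


def send(request, timeout=60):
    """Send a prepared request and return the response body decoded as text."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")


# Same call as the third curl example; the request is only built here, not sent.
citation_request = form_request(
    "/processCitation",
    {"citations": "Graff, Expert. Opin. Ther. Targets (2002) 6(1): 103-113"},
)
```

With a running server, send(citation_request) would return the XML response as text. A fulltext extraction with consolidation of the citations could be built in the same way, for example:

    send(multipart_request("/processFulltextDocument", open("thefile.pdf", "rb").read(), {"consolidateCitations": "1"}))

The send helper decodes the response as text, so it does not fit /processFulltextAssetDocument, which returns a ZIP archive.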