Cbrtekstraktor Manual V04 20180627
Page Count: 115
cbrTekStraktor

"Strange adventures on other planets", Space Detective, Issue 4. Published April 1952 by Avon Publications.

Chapter 0: Preface

Notices

Copyright (c) 2017 - 2018 - cbrTekStraktor

cbrTekStraktor is free software. Permission is granted to copy, distribute and/or modify this software under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.

You should have received a copy of the GNU General Public License along with this program; if not, write to the Free Software Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA to obtain the GNU General Public License.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

GNU General Public License: www.gnu.org/copyleft/gpl.html

Contact details for copyright holder: cbrtekstraktor@gmail.com

Trademarks

cbrTekStraktor relies on the following freely available technology:

- Java SDK 1.5 through 1.9 or higher
- Apache Tesseract 4
- Google TensorFlow 1.4 or higher (optional)

The cbrTekStraktor Java source code has been developed to be deployed on various platforms and operating systems. The Java source code has been tested on the following platforms and operating systems.
Platform      Operating system
Intel - AMD   Windows 7 (32 and 64 bit)
Intel - AMD   Linux Ubuntu 16.04
Intel - AMD   Windows 10 (64 bit)

Release note

Release history of this document

Date         Document version           cbrTekStraktor version
2017-06-05   Draft                      V01 - Build 2017_06_05
2018-05-01   Updated version            V02 - Build 2018_05_01
2018-06-27   Added TensorFlow support   V04 - Build 2018_06_27

YouTube channel

https://www.youtube.com/channel/UCy0NfU7-N8RcyI-rj3fSCEw

Comments welcome

Mail your comments or defect reports to: cbrtekstraktor@gmail.com

Chapter 1: Introduction

cbrTekStraktor is an application that automatically extracts text from the text bubbles or speech balloons present in comic book reader files (CBR). Its prime goal is to perform analysis on the texts of comic books. cbrTekStraktor can however also be used for scanlation or similar purposes. The application also lets you manually define text areas in CBR files, and it comprises a simple graphical editor for further processing the extracted text.

The text extraction is achieved by a combination of statistical and graphical processing operations. It is based on the following 3 major algorithms:

- Binarization of color images (Niblack and other methods)
- Connected components
- K-Means clustering

Apache Tesseract is used to perform Optical Character Recognition on the extracted text.

Google's TensorFlow Inception Visual Recognition Convolutional Neural Network can optionally be used to fine-tune the speech balloon detection.

cbrTekStraktor has some known limitations. It has been conceived to perform extraction of Western (Roman) characters and will only work on comic pages with a light background.

Subsequent versions of the application will:

- Integrate with translation software in order to provide automated translation of comic book texts.
- Provide a mechanism to automatically re-inject translated text into the text balloons.

[Wikipedia] Scanlation (also scanslation) is the scanning, translation, and editing of comics from a language into another language. Scanlation is done as an amateur work and is nearly always done without express permission from the copyright holder. The word "scanlation" is a portmanteau of the words scan and translation.

[Wikipedia] A comic book archive or comic book reader file (also called sequential image file) is a type of archive file for the purpose of sequential viewing of images, commonly for comic books. Comic book archive files mainly consist of a series of image files, typically PNG or JPEG files, stored as a single archive file. The file name extension indicates the archive type used, e.g. CBR or CBZ.

Associated documents

[01] Christophe Rigaud, Norbert Tsopze, Jean-Christophe Burie and Jean-Marc Ogier: Robust frame and text extraction from comic books, La Rochelle (France) and Yaoundé (Cameroon)
[02] Christophe Rigaud, Dimosthenis Karatzas, Joost Van De Weijer, Jean-Christophe Burie and Jean-Marc Ogier: Automatic text location in scanned comic books, Barcelona (Spain), 2013
[03] Christophe Rigaud, Jean-Christophe Burie and Jean-Marc Ogier: An active contour model for speech balloon detection in comics, La Rochelle (France)
[04] Karl Tombre, Salvatore Tabbone, Loïc Pélissier, Bart Lamiroy and Philippe Dosch: Text/graphics separation revisited, Vandoeuvre-lès-Nancy (France)
[05] Muhammad Muzamil Luqman, Hoang Nam Ho, Jean-Christophe Burie and Jean-Marc Ogier: Automatic indexing of comic page images for query by example based focused content retrieval, La Rochelle (France)
[06] Zhongliang Fu, Fulin Bian, Songtao Zhou and Qingwu Hu: Algorithm for fast detection and identification of characters in gray-level images, Wuhan (Republic of China)
[07] Olivier Augereau, Motoi Iwata and Koichi Kise: A survey of comics research in computer
science. November 2017, 14th International Conference on Document Analysis and Recognition, Kyoto (Japan).

Public domain Comic Book Archives

The comic book images used in this manual have been downloaded from the "Digital Comic Museum" (DCM). DCM is a great site for downloading free public domain Golden Age comics. All files there have been researched by DCM's staff and users to make sure they are copyright free and in the public domain.

http://digitalcomicmuseum.com/

Chapter 2: Installation

Distribution

The most recent version of cbrTekStraktor is published on:

- GitHub (https://github.com/cbrTekStraktor/cbrTekStraktor)
- SourceForge (https://sourceforge.net/projects/cbrTekStraktor)

The material distributed via GitHub and SourceForge comprises the source code, an executable JAR file and this reference manual.

Quick Installation

Prerequisites

A recent Java Runtime Engine (JRE) or Java Software Development Kit (JSDK) is required. cbrTekStraktor has been tested with 64-bit Oracle Java SDK 7 and 8 on Windows 7 and Linux Ubuntu 16.04. The application is based on standard Java Swing functionality and will therefore more than likely also function correctly on other operating systems (e.g. OS-X, Red Hat, Windows 10, etc.) and on other JREs and JSDKs.

Windows: Windows users need to manually create the folders C:\temp and c:\temp\cbrTekStraktor\bin.

Linux: Linux users need to create the directories $HOME/cbrTekStraktor and $HOME/cbrTekStraktor/bin, in which $HOME is to be substituted by the actual location of the Linux user's home directory.

Installation

Just put the JAR file (cbrTekStraktor.jar) in c:\temp\cbrTekStraktor\bin (or $HOME/cbrTekStraktor/bin).

Starting the application

It should suffice to double-click on the cbrTekStraktor JAR file to start and run the application.
In the event that double-clicking on the JAR file does not work, you can manually start the application as follows.

Windows: open a Windows command window and run

    cd c:\temp\cbrTekStraktor\bin
    java -jar cbrTekStraktor.jar

Linux: open a Linux command window and run

    cd $HOME/cbrTekStraktor/bin
    java -jar ./cbrTekStraktor.jar

Command line parameters

The following command line parameter is supported:

    -D {project folder name}

The -D option specifies the folder name of the project to be opened. If the project root folder is not specified as a command line parameter, it defaults to "c:\temp\cbrTekStraktor" or $HOME/cbrTekStraktor.

First time usage

The following dialog will be shown when the application is started for the first time or whenever one of the required file system components is found to be missing.

The dialog reports on all folders which are missing and will prompt you to confirm whether those missing folders can be created automatically. Click "Yes" if you want to have the missing folders created.

Once the missing folders have been created successfully, the following dialog will be displayed. You should close the application at this stage and then restart it.

Source code installation

Download or clone the cbrTekStraktor application package from GitHub or SourceForge using the approach you are comfortable with and install the material in the workspace folder of your preferred Java IDE.

The source code of cbrTekStraktor was created using the Eclipse Neon IDE. The following screenshot depicts the structure of the Java packages.

[Wikipedia] A JAR (Java ARchive) is a package file format typically used to aggregate many Java class files and associated metadata and resources (text, images, etc.) into one file for distribution. JAR files are archive files that include a Java-specific manifest file.
They are built on the ZIP format and typically have a .jar file extension.

Apache Tesseract Installation

Tesseract is an optical character recognition engine for various operating systems. It is free software, released under the Apache License, Version 2.0. Development has been sponsored by Google since 2006. Tesseract is considered one of the most accurate open-source OCR engines available.

cbrTekStraktor uses Tesseract (e.g. tesseract-ocr 4.00.00alpha) to perform OCR. Given that only basic OCR functionality is needed, it can safely be assumed that other versions of Tesseract will integrate with cbrTekStraktor too.

Read the Tesseract home page on GitHub for a quick introduction: https://github.com/tesseract-ocr/tesseract/wiki

It is recommended to use the latest version of Tesseract; you should therefore regularly upgrade or reinstall Tesseract.

Microsoft Windows

The binaries of the Tesseract OCR engine can be found on https://github.com/tesseract-ocr/tesseract/wiki/Downloads

cbrTekStraktor V02 was tested using the University of Mannheim's experimental 64-bit tesseract-ocr-w64-setup-v4.0.0-beta version, available at https://github.com/UB-Mannheim/tesseract/wiki

Whilst installing Tesseract, make a note of where the installer puts the binaries. cbrTekStraktor accesses the Tesseract OCR client via the Windows command shell or the Linux shell. You therefore need to set the name of the folder holding the Tesseract binaries via the Project Configuration dialog.

You might also consider installing additional language packs for Apache Tesseract. cbrTekStraktor will detect which language packs have been installed and use those when appropriate. Language packs can be found via https://github.com/tesseract-ocr/tesseract/wiki

Once installed, Tesseract's installation directory should resemble the following.
You should run an installation test on Tesseract by issuing the following instruction from a command window:

    tesseract

This will result in an exhaustive usage message.

Alternatively, run the following command. A double dash is required.

    tesseract --version

Linux (Ubuntu)

The installation of Tesseract 3 on Ubuntu 16.04 is straightforward. There are plenty of resources on the World Wide Web commenting on how to install and use Tesseract on Linux. The following URL explains how to install Tesseract on Ubuntu 16.04.

https://www.howtoforge.com/tutorial/tesseract-ocr-installation-and-usage-on-ubuntu-16-04/

In a nutshell, you need to consecutively run the following commands in a Unix shell:

    sudo apt install tesseract-ocr
    sudo apt-get install tesseract-ocr-[lang]

The first command will prompt for a password and then install the latest version of Tesseract for Ubuntu.

The second command will install a language pack. For example, tesseract-ocr-fra will install the French language pack. Re-run the command for each of the language packs you want to install.

Once the installation has completed you should perform a quick installation test on Linux. Open a Unix command window and run the following two commands.

In order to determine where Tesseract has been installed, issue the command "type -a tesseract". In most cases this will result in /usr/bin/tesseract. Should the installation directory be different from /usr/bin, you will need to correctly configure the Tesseract Installation Folder parameter on your cbrTekStraktor project. The section "HowTo: Projects" provides detailed instructions on cbrTekStraktor projects and how to configure the Tesseract installation folder.

Next, simply run the command "tesseract" in the Unix shell. Detailed status and error messages will be displayed.
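The installation test described above can also be scripted. The following is a minimal sketch in Python, not part of cbrTekStraktor itself; the function name tesseract_version is hypothetical, and the sketch assumes that the tesseract executable is reachable via the PATH. It locates the binary and returns the first line of the "tesseract --version" output, or nothing when the binary cannot be found.

```python
import shutil
import subprocess


def tesseract_version(executable="tesseract"):
    """Return the first line of `<executable> --version`, or None if unavailable.

    Hypothetical helper for illustration; it is not part of cbrTekStraktor.
    """
    path = shutil.which(executable)
    if path is None:
        # Not on the PATH: the Tesseract folder must then be configured
        # explicitly in the cbrTekStraktor project settings.
        return None
    result = subprocess.run(
        [path, "--version"],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # some builds print the version on stderr
        universal_newlines=True,
        check=False,
    )
    lines = result.stdout.strip().splitlines()
    return lines[0] if lines else None


if __name__ == "__main__":
    version = tesseract_version()
    print(version if version else "tesseract was not found on the PATH")
```

If the function returns nothing although Tesseract is installed, the binaries are most likely stored in a folder that is not on the PATH; that folder is the one to enter in the project configuration.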
Chapter 3: Main screen

Summary

This section provides step-by-step instructions on how to use the cbrTekStraktor application.

Starting the application

See the previous section to learn how to start the application.

The above picture shows the main screen. The size and the location of the main screen are reused when the application is restarted. The major functions of the application can be accessed via a set of buttons located on the left of the application's main canvas. Status information is displayed in the top right corner of the application's main window.

Image: This button lets you select a scanned image of a comic book page (or any other image in JPG, GIF or PNG format) and display it on the canvas.

Extract Text: This button lets you select a scanned image of a comic book page and extract the textual information on it.

Edit: Pressing this button opens the graphical editor, in which you can examine the various graphical components of a single comic book page. The editor also lets you manually select or deselect text or graphical components and further edit or translate the extracted text.

OCR: The Optical Character Recognition functionality is accessed via this button.

Translate: The translation component is currently not implemented.

Report: Pressing this button opens the reporting component, which provides access to summarized graphical and statistical information on a single comic book page.

Re-inject: This will re-inject text into the previously identified speech balloons or other text areas. This option is currently under development.

Other: The "Bulk processor" checkbox activates the bulk processing option, which performs the text extraction on a set of comic book pages stored in a single folder. This option might therefore be used to extract and OCR the text of an entire comic book.
The "spinner" component enlarges or shrinks the image displayed on the screen. It should only be used in "Image" mode; it is recommended not to use it when in "Edit" mode.

Menu bar

All of the above and additional functions can also be accessed via the menu bar items "File" and "Tools". See the picture below for a quick overview of the available menu items.

Pop-up menu

Right-clicking on the main canvas will open a pop-up menu, comprising similar menu items as the ones on the menus discussed in the above section.

Additional major functionality

Additional functionality can be accessed via the menu bar. There are 2 main menu items:

- Files > Properties
- Tools

The Files>Properties menu item provides access to the "Edit option" and "Tesseract Option" dialogs.

The "Edit option" dialog lets you customize the look and feel of the Comic Page Editor, e.g. the backdrop. See section "Edit mode".

The "Tesseract Option" dialog lets you handpick one of the many Tesseract options and parameters and set it to an appropriate value. See section "OCR mode".

Statistics

The Tools>Statistics function collects statistical information on the entire set of scanned comic book files available in the $PROJECTDIR folder. See the "Developer Notes" section for more information on the folder structure of the application and to learn where to locate the $PROJECTDIR folder.

Housekeeping

Tools>Housekeeping. This will prune (remove) temporary files from the cbrTekStraktor work folders.

Import

Files>Import. The import function is currently not implemented.

Export

Files>Export. The Export function creates a file of all extracted textual information within a project (see cbrTekStraktor Projects).

Chapter 4: How To: Image mode

Introduction

The following picture shows the cbrTekStraktor application when running in "Image" mode.
When you click on the "Image" button a file selection dialog will be presented, in which you can browse through your comic book (or other) image files. In this example, the 18th century caricature image file "1024px-Caricature_gillray_plumpudding.jpg" is rendered.

Note: The 5 most recently selected images can quickly be re-accessed on the "File" menu.

Marquee

The following buttons are available on the marquee:

- Save: saves the image
- Refresh: reloads the original image
- Info: opens the "Image Info" dialog

Colour histogram

A color histogram is present in the bottom left corner. The histogram (or frequency diagram) shows the distribution of the Red, Green and Blue (RGB) color components of the pixels present in a picture (JPG, GIF, PNG) on a scale of 256 values, in which 0 is the most intense and 255 the lightest value.

The circles separate from the vertical axis show the median of each RGB component. The circles on the vertical axis show the means of the RGB components.

The histogram also shows the frequency distribution of the luminance or the "Alpha" channel.

Image filters

A choice of image filters is available on the drop-down list to further process the image.

Filter: Description
Bleach: RGB to HSB conversion followed by lowering the hue component. (See the detailed info on HSL and HSV in the appendix to this reference manual.)
Blueprint: Binarization (by default Niblack is used) and subsequent reduction to the Blue color component.
Convolution Blur: Blurs the image via convolution (explained in the appendix).
Convolution Edge: Applies the edge convolution filter.
Convolution Gaussian: Applies a Gaussian blur filter.
Convolution Sharpen: Applies a sharpening convolution filter.
Gradient narrow: Applies a narrow gradient transformation.
Gradient wide: Applies a wide gradient transformation (explained in the appendix).
Grayscale: RGB to grayscale conversion using the formula described at the end of this section.
Histogram Equalization: Histogram equalization image transformation (explained in the appendix).
Info: Opens a pop-up window displaying the image properties.
Invert: Inverts the colors of the RGB image.
Mainframe: Binarization and subsequent reduction to the Green color component.
Monochrome (Niblak): Binarization using the Niblack transformation (see appendix).
Monochrome (Otsu): Monochromization or binarization using the Otsu transformation (explained in the appendix).
Monochrome (Sauvola): Binarization using the Sauvola transformation (explained in the appendix).
Original: This option redisplays the original image.
Sobel: Applies a Sobel filter (explained in the appendix).
Sobel on grayscale: Applies a Sobel filter on the grayscaled image.

The next picture shows the result of applying the "Invert" image filter. Apart from a bizarre aesthetic interest, there is no known practical use for inverting the color scheme of an image.

You can save the result of the image processing by pressing "Save" and providing a file name. The screenshot below shows how the result of the "Blueprint" image filter is about to be saved to a PNG file.

Comic Page Info Screen

The Comic Page Info Screen provides access to a selection of characteristics of a comic page image. One can either examine or specify various characteristics of a comic book and page.

The Comic Page Info Screen can be accessed via the marquee buttons, the menu bar or the pop-up menu. Click on "Info" to open this dialog.

The top of the screen comprises:

- The RGB histogram of the image, located on the left hand side.
- The histogram of the grayscale information, on the right. The peaks and valleys on this histogram are used to determine whether the picture is a monochrome image or not.
- The Box Plot of the RGB information.
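The quantities shown at the top of this screen can be reproduced along the following lines. This is a minimal sketch in Python under stated assumptions, not the code used by cbrTekStraktor: the function names are hypothetical, the input is assumed to be a plain list of channel values between 0 and 255, and the quartiles are computed as the medians of the lower and upper halves of the sorted data, which is only one of several common conventions.

```python
from statistics import mean, median


def channel_histogram(values):
    """Count how often each intensity 0..255 occurs in one color channel."""
    counts = [0] * 256
    for value in values:
        counts[value] += 1
    return counts


def box_plot_stats(values):
    """Return first quartile, median, third quartile and mean of the values.

    Assumed convention: the quartiles are the medians of the lower and upper
    halves of the sorted data (the middle element is excluded when the number
    of values is odd). At least two values are required.
    """
    ordered = sorted(values)
    half = len(ordered) // 2
    lower = ordered[:half]
    upper = ordered[half + (len(ordered) % 2):]
    return {
        "q1": median(lower),
        "median": median(ordered),
        "q3": median(upper),
        "mean": mean(ordered),
    }


if __name__ == "__main__":
    # Hypothetical red-channel values of six pixels.
    red = [0, 10, 20, 30, 40, 250]
    print(channel_histogram(red)[250])
    print(box_plot_stats(red))
```

A single bright outlier, such as the value 250 in the sample data, pulls the mean well above the median, which is the kind of skew a box plot makes visible at a glance.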
The Box Plot diagram shows the first and third quartiles, median and mean of the RGB and grayscale values of the pixels.

[Figure: box plot legend showing standard deviation, third quartile, mean, median and first quartile]

The histograms at the top of the screen can be collapsed in order to reduce the clutter on your desktop by setting the "Hide histogram" option to "Yes".

[Wikipedia] In descriptive statistics, a box plot or boxplot is a convenient way of graphically depicting groups of numerical data through their quartiles. Box plots may also have lines extending vertically from the boxes (whiskers) indicating variability outside the upper and lower quartiles, hence the terms box-and-whisker plot and box-and-whisker diagram. Outliers may be plotted as individual points. Box plots are non-parametric: they display variation in samples of a statistical population without making any assumptions of the underlying statistical distribution. The spacing between the different parts of the box indicates the degree of dispersion (spread) and skewness in the data, and shows outliers. In addition to the points themselves, they allow one to visually estimate various L-estimators, notably the interquartile range, midhinge, range, mid-range, and trimean. Box plots can be drawn either horizontally or vertically. Box plots received their name from the box in the middle.

The bottom of the Image Info Screen consists of:

Label: Description
CMXUID: Comic Book Unique Identifier, which is a simplified normalization of the filename of the Comic Book Page picture.
The normalization consists of removing all non-alphabetical characters from the file's name.
UID: A unique identifier comprised of a hexadecimal number of 32 characters.
ISBN: The International Standard Book Number of the comic book. You will need to enter the ISBN manually. Most ISBNs can be found on www.isbn.org.
Series: Name of the series of the comic book.
Series sequence: The sequence number of a comic book in a comic book series.
Book title: Comic book title.
Page: The page number of the image in the comic book.
Penciller, Colorer, Writer: Information on the various authors who contributed to the creation of the comic book.
Comment: Can be used to provide additional comments.
Folder: The folder where the comic book page image is stored.
File: The name of the scan (or picture) of the comic book page.
Size: The size of the picture in pixels and in bytes, as well as Dots per Inch (DPI) information (if present in the image).
Color schema: cbrTekStraktor will determine whether the picture has a monochrome, grayscale or color scheme. If needed you can overrule the color schema detected.
Language: The language used in the speech balloons or text areas. If the matching language pack has been installed, Tesseract will be instructed to use it.
Binarization technique: One can select the binarization technique to be used when processing the comic book page image. The options are {NIBLAK, SAUVOLA, OTSU, BLEACHED, ITERATIVE}.
Cluster classification method: This is the method used when determining which one of the connected components clusters comprises the textual information, a.k.a. the character cluster or text paragraph. By default the method is set to "automatic". If needed one can select "Cluster 1" through "Cluster 5" to override the automatically identified character cluster. See section "Text Extraction" for a detailed discussion of cluster classification.
Proximity tolerance: Options are {WIDE, LENIENT, TIGHT, ULTRA_WIDE}. The proximity level is used when adjacent characters are combined into words and paragraphs. The default value is "Tight".
Crop Image: By selecting this option, one can remove the margins of a comic book page. Cropping reduces the size of an image and will therefore shorten, albeit marginally, the duration of the text extraction process.
TesseractCuration: This option applies additional image processing filters prior to performing the OCR step, in particular blurring an image or increasing the DPI (Dots Per Inch). Tesseract notoriously performs best on 300+ DPI images.

The characteristics of a comic page are stored in the zMetadata_.xml file, which is part of the Archive file. See section "Developer Notes" for a discussion of the contents of the cbrTekStraktor Archive file.

Troubleshooting

The cbrTekStraktor application works best when you are using a monitor of substantial size and a display card supporting the higher resolution ranges.

Images present in CBR files often have widths and heights of more than 1000 pixels, so you will also need a hefty computer to support cbrTekStraktor's image processing activities.

Chapter 5: How To: Projects

A project is an arbitrary grouping of comic book images. One could for example create a cbrTekStraktor project that comprises all images of a single comic book, or a project of comic book images created by the same penciller.

Physically, a project is little more than the name of a folder on the Windows or Linux file system containing a predefined set of files and folders which are required by the cbrTekStraktor application. See "Developer Notes" for detailed information on the folders and files that constitute a project.
Editing a project

The properties or settings of a project can be set and modified via the menu item "File>Project>Edit project".

When cbrTekStraktor is started for the first time the "Tutorial" project is automatically created with default configuration settings. It is recommended to change these settings to match your local environment and preferences.

Property: Description
Encoding: Options are {UTF-8, UTF-16, ISO-8859-01, ASCII}. You can select the encoding of the files created by cbrTekStraktor. Default encoding is ISO-8859-01 (Latin 1).
Editor backdrop: Options are {BLEACHED, BLUEPRINT, GRAYSCALE RASTERIZED, COLOUR RASTERIZED, NIBLAK, SAUVOLA, MAINFRAME, NONE, ORIGINAL}. When in Edit mode the backdrop setting defines the picture that is displayed on the work area. Just choose the backdrop which you like best.
Language: The drop-down list contains all the languages supported by Tesseract. The project language is the default language to be used for all image files of that project. The language can be overruled per page.
Browser: The drop-down list contains the supported HTML browsers. These browsers just render the reports that are stored in HTML format. Options are {MOZILLA, EXPLORER, CHROME}.
Project Name: The name of the project. The name will be stripped of non-alphanumeric characters and will be used as the name of the $PROJECTDIR folder.
Description: A short description of the project.
Mean Character count: See developer notes.
Horizontal vertical variance threshold: See developer notes.
Tesseract folder: This is the name of the folder where the Tesseract binaries are stored.
Preferred Font name: The name of the preferred font. The preferred font is used on all screens and dialogs. Be aware that the default font is set to "Comic Sans Serif". Change ad lib.
Preferred Font Size: The preferred size of the font. Sizes between 10 and 12 work best.
Python home: This is the name of the folder where the Python 3.5 binaries are installed. Python is required when using the Google Artificial Intelligence Image Recognition (AI VR) software components. The AI VR module is an optional module available from cbrTekStraktor V04 onwards.
Maximum number of threads: This is the maximum number of threads that will be started when using the AI VR component.
Logging Level: The logging level can range between 0 and 9. Level 0 is terse logging, level 9 provides very detailed logging. The logging and error information is displayed on stdout and stderr and redirected to the log and error files in the $PROJECTDIR folder.
Date format: The Java Date Format string. See docs.oracle.com for the options of the Java Date Time format.
Size: This is the total byte size of all objects in the current project.
Number of archives: This is the number of all objects in the current project.
First accessed: Timestamp of the moment the current project was created.
Last accessed: Timestamp of the last time an object was created.

Prior to saving the project properties, the property values are validated. This will for example prevent you from specifying an incorrect Java Date Format.

Creating a project

A new project can be created via "Files>Projects>New project".

The configuration settings of the current project will be inherited by the newly created project. When creating a new project it often suffices to merely provide the name and description of the new project.

Selecting a project

You can switch between projects via the menu item "Files>Projects>OpenProject".

An overview of the available projects will be presented on the topmost drop-down list. Select the project you want to switch to and then click on the "Switch" button.
Project properties file and current project

The current project will be reused when the application is restarted. The current project is saved in the Project Properties file, which is located one folder up from the current project. On Windows this will more than likely be C:\temp\cbrTekStraktor. On Linux the Project Properties file will be stored in $HOME/cbrTekStraktor. In both cases the file is named "cbrTekStraktorProjectConfig.txt". See the "Developer Notes" section for more information on the contents of this file.

Defining the project via the command line

You can set the project to be used via the command line option -D. For example:

    java -jar cbrTekStraktor.jar -D C:\temp\myComicProject

Chapter 6: HowTo: Text extraction

The text extraction process is started by pressing the "Extract" button on the main screen. Alternatively, when in "Image mode" you can start the text extraction process on the current image by double-clicking on it or by pressing the "Extract" button.

The text extraction process runs through the following four steps. The extraction process can be interrupted by pressing the "Stop" button on the marquee.

First step

You will be prompted to select an image for which the text is to be extracted. The images need not be stored in the cbrTekStraktor $PROJECTDIR folder.

Second step

The image which was chosen in the previous step is displayed and the Comic Book Metadata dialog opens.

In most cases it suffices to just select the language of the comic book's text on this dialog. It is important to correctly define the language, because later on it is used by the Tesseract OCR engine.
Additional fine-tuning of the extraction process

Option / Comment

Color schema
In the event that the color scheme was not correctly identified, you can set it to Color, Grayscale or Monochrome. You should ensure that the correct color scheme is set prior to starting the next phase of the text extraction. The accuracy of the color scheme detection algorithm is monitored for future enhancements: the information is stored in the zMetadata XML file and is used to track the correctness of the color scheme detection logic.

Binarize Method
Options are {FAST_BLEACHED, ITERATIVE, SLOW_NIBLAK, SLOW_SAUVOLA, OTSU}
It is advised to use the default binarization method (Sauvola or Niblack). OTSU and Bleached are valid options too. The “Iterative” option is no longer recommended.

Cluster Classification Method
Options are {AUTOMATIC, CLUSTER1, CLUSTER2, CLUSTER3, CLUSTER4, CLUSTER5}
It is advised to use the “Automatic” option. A key element of the text extraction process is the module that decides which of the clusters created by the K-Means algorithm contains characters. The idea is to create clusters of similar components and subsequently classify these clusters into:
- a cluster containing frames and borders;
- a cluster containing groups of characters, i.e. paragraphs;
- clusters containing noise.
The Cluster Classification module sometimes picks the wrong cluster as the Paragraph cluster. This happens when the characters on the comic page are rather large. The symptoms of such a misclassification can easily be spotted in “Edit” mode: most of the characters will be tagged as “noise” and smaller pictorial elements resembling characters will be tagged as “Letter”. (“Letter” is the tag used for components which are deemed to be characters. It is a misnomer, typical for Dutch native speakers.)
In the event the Automatic Cluster Classification designates the wrong cluster as the text cluster, you can override it by setting it to any of Clusters 1 through 5. A good approach is to start by setting the Text Paragraph to Cluster2 and restart the text extraction process.

Proximity Tolerance
Options are {TIGHT, LENIENT, WIDE, ULTRA_WIDE}
The proximity tolerance is used to group characters into words and paragraphs. In essence this is achieved by clustering graphical components based on the distance between the components. The default Proximity Tolerance is “Tight”. Tight assumes the inter-character space to be rather narrow. Change to Lenient or Wide if deemed appropriate.

Crop Image
Options are {YES, NO}
When Crop is set to Yes, the margins on a comic book page will be detected and removed in order to reduce the size of the image. The text extraction process runs faster when images are cropped to their actual payload.

Tesseract Curation
Options are {IGNORE_DPI, USE_IMAGE_DPI, INCREASE_AND_CONVOLVE, INCREASE_AND_FOURRIER}
Tesseract reads the DPI information from the image file metadata. Tesseract works best on 200+ DPI images. USE_IMAGE_DPI ensures that the DPI is set on the image submitted to Tesseract. INCREASE_DPI will attempt to increase the DPI of an image to approximately 300 DPI. INCREASE_AND_CONVOLVE and INCREASE_AND_FOURRIER will increase the DPI and apply some blurring to the image before it is processed by Tesseract. INCREASE_AND_FOURRIER is not implemented in cbrTekStraktor V01.

The actual text extraction process will commence when pressing OK on the Image Info dialog box.

Third step

The original image is cropped, turned into a grayscale image and then binarized. These intermediary images are displayed as the text extraction progresses. No user actions are required during this phase.

Example grayscale image.

Example binarized (monochrome) image.
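For readers who want a feel for what the binarization in this step does, the following is a minimal sketch of Sauvola thresholding written in Python. It is not the cbrTekStraktor source code (which is Java), and the function name sauvola_binarize, the window size and the constants k = 0.2 and R = 128 are assumptions chosen for illustration only. Each pixel is compared with a local threshold T = m * (1 + k * (s / R - 1)), where m and s are the mean and standard deviation of the grayscale values in a small window around the pixel.

```python
from math import sqrt


def sauvola_binarize(gray, window=15, k=0.2, r=128.0):
    """Binarize a grayscale image given as a list of rows of ints (0..255).

    Returns rows holding 0 for dark (ink) pixels and 255 for light pixels.
    The parameter values are illustrative assumptions, not application defaults.
    """
    height, width = len(gray), len(gray[0])
    half = window // 2
    result = []
    for y in range(height):
        row = []
        for x in range(width):
            # Collect the local neighbourhood, clipped at the image borders.
            values = [gray[j][i]
                      for j in range(max(0, y - half), min(height, y + half + 1))
                      for i in range(max(0, x - half), min(width, x + half + 1))]
            mean = sum(values) / len(values)
            std = sqrt(sum((v - mean) ** 2 for v in values) / len(values))
            # Sauvola's local threshold.
            threshold = mean * (1.0 + k * (std / r - 1.0))
            row.append(255 if gray[y][x] > threshold else 0)
        result.append(row)
    return result


# A light 5x5 patch with one dark pixel in the middle.
page = [[200] * 5 for _ in range(5)]
page[2][2] = 20
binary = sauvola_binarize(page, window=3)
```

This brute-force version recomputes the window statistics for every pixel, which is slow on full comic pages; it is only meant to show why a local threshold keeps dark characters crisp against a light background.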
Binarization is an important step in the process. Its purpose is to get crisp characters that distinctively stand out. By default the Sauvola or Niblack binarization method is used.

Concluding step

The extraction process ends by displaying cut-outs of the text paragraphs which have been identified. Character paragraphs have a green border and non-character paragraphs have a red border. Frames have a lilac border.

The text extraction process stores the results in an “Archive” file in the $ROOTDIR/Output/Archive folder. An “Archive” file is a ZIP file that holds a broad variety of results, e.g. statistical data, image information, etc. See the “Developer Notes” section for detailed information.

Marquee

The following actions can be performed via the buttons on the marquee.
- The Save button saves the resulting picture.
- The Refresh button redisplays the picture.
- The Info button opens the Comic Book Metadata dialog.
The “Edit mode” can be activated by double clicking on the canvas or by pressing the “Edit” button.

Bulk mode

The text extraction process can run on entire sets of comic book pages. The bulk extraction will process all images within a single folder. The bulk extraction process is similar to the single page text extraction process. It can be interrupted by pressing the “Stop” button on the marquee.

Bulk mode initial step : folder selection

You need to set the “Bulk extraction” option on the main screen before the caption on the “Extract” button will change to “Bulk”. When you click on this button you will be prompted to select the folder containing the set of images to be processed.

Bulk mode second step : Comic Page Info

The Comic Page Info screen will then appear. The settings that you define on this screen will be applied to all images present in the bulk extraction folder. It is therefore recommended to only run the bulk extraction process on images sharing the same characteristics, e.g.
pages from the same comic book, a set of images which are all monochrome, etc.

Bulk mode : Progress monitor

The text extraction process will be performed on each image file in the selected folder. The progress of the text extraction can be observed on the monitoring screen.

Bulk mode : concluding step

The extraction process will stop by displaying the cut-out pictorial elements of the last image in the folder. The monitor dialog will close automatically after a short period of time.

Troubleshooting

The cbrTekStraktor application functions best when you are using a monitor of substantial size and a display card supporting the higher resolution ranges.

Chapter 7
HowTo : Edit mode

This section describes actions which can be performed when in Edit mode. The Edit mode provides a GUI that enables you:
- to modify the results of the text classification process: remove paragraphs, define character paragraphs, define non-character paragraphs, etc.;
- to manually enter the text within a speech balloon;
- to translate the text within a speech balloon.

Opening an archive for editing

Click on “Edit” to start editing a previously processed Comic Page. A file browse dialog will open enabling you to select an Archive file. Archive files are located in $PROJECTDIR/Output/Archive and have a name ending in “_set.zip”. Alternatively you can double click on the canvas once the text extraction process has been completed on a Comic Page image. You will be asked whether you want to start editing the current comic book image.

Editing

After choosing the “spacedetective2_set” archive on the previous dialog, the screen below will open.

Marquee

The following buttons are active on the marquee: Save, Quick, Refresh, Ingest and Option. These functions are described in the remainder of this section.
The backdrop of the edit screen is the Comic Image file upon which an image processing filter has been applied. It is recommended to select a backdrop that nicely contrasts with the original comic page image. In the example above the “Black Bleached” filter has been applied.
- Character paragraphs have a green border.
- Non-character paragraphs have an amber border.
- There is a crosshair mouse pointer. Crosshair pointers are tacky, but have the advantage of being able to precisely select an image element.
In the example above you will see that some character paragraphs have erroneously been classified as non-character paragraphs, e.g. “Chapter One Spaceship of the dead” has an orange border, whereas this is a character paragraph and therefore should have a green border. The next section shows how to fix this.

Hovering

When hovering over a pictorial element, an information box will be shown for a couple of seconds, providing succinct information on that element. In the example above a non-character paragraph, size 25x44 and featuring 2 child objects, is in the crosshairs.

When you move the crosshairs over an element enclosed by a border, its background will momentarily take on a rosy sheen and its constituent elements will be outlined in red. In the following example the characters which are part of the “Chapter One” text balloon are displayed.

Quick edit

The quick edit screen can be accessed by clicking on the “Quick” button on the marquee. The quick edit screen puts detailed information on character paragraphs, non-character paragraphs, frames, noise and other types of component at your fingertips. With the tick box in the first column you can define whether or not a paragraph contains text. The tick box in the “removed” column enables you to remove (or delete) a paragraph. The “Extracted text” column can be used to enter or edit the textual information of the speech balloons.
In the event that the image has been OCR'ed, it will contain the automatically extracted textual information. On the drop-down list you can select which type of pictorial information you want to see displayed, e.g. noise, frames, potential text, etc. If you changed the characteristics of a pictorial element, the “Confirm” button will become active.

Detailed edit : Ingest

The detailed edit dialog can be accessed by clicking on the “Ingest” button on the marquee or by double clicking on a paragraph. The dialog enables you to navigate through the various paragraphs. Use the Previous and Next buttons. The image of a character paragraph is displayed between thick green vertical bars. Non-character paragraphs have red borders. You can define whether a paragraph contains text or not via the “Is a text paragraph” tick box. The paragraph can be deleted by pressing the “Delete” button. The monochrome tick box can be used to display a monochrome version of the paragraph image.

Keying in text

The topmost text entry box is used to edit the original text. The bottom text box can be used to store translated text. In the event that the OCR process has been performed, the OCR'ed text will be displayed in the topmost text entry box. See the example above.

Edit options

The edit option dialog opens when you click on Edit Options. The edit option dialog is used to change the appearance of the Edit canvas.

Option / Description

Show payload boundaries
This will set out the boundaries of the comic page margins in lilac.

Show frames
This will put bluish lines around the frames within a comic book page.

Show paragraphs
This will set out the non-character paragraphs in red.

Show text paragraphs
This will draw a green border around character paragraphs, e.g. speech balloons.

Show characters
This will put a pink border on the image components that have been identified as characters.
Show noise
This will put pink borders around any image component.

Show valid components
This will show the components which are valid.

Show invalid components
Puts a border on those image components which are invalid. See “Developer notes”.

Backdrop
This drop-down list enables you to set the type of backdrop you want to see displayed in Edit mode.

Wisker
This drop-down list enables you to define the color of the crosshair pointer.

How to delete a paragraph

If you select a paragraph and left-click on it for more than 2 seconds, a thick red border will be put around the paragraph and you will be asked whether you want the paragraph to be removed (deleted).

In the example above the object that comprises the face of the Space Detective hero and a snippet of text are combined into a single object. A possible way to correct this is to remove the object and create a new object that only contains the text. The objects which have been removed can be seen in the Quick Edit dialog by selecting “Potential Text Area” and looking for images which have been crossed out by a thick red line.

How to create a new text paragraph

A new paragraph can be created by positioning the pointer on the top-left corner of the object to be created and dragging the cursor to the bottom-right corner of the object to be created. Whilst dragging the pointer, a light-blue rectangle will be displayed. When the dragging operation is completed, you will be prompted to confirm the creation of a new object. It is recommended to refresh the screen to reflect the changes made (by pressing the “Refresh” button on the marquee). The freshly created object should now be visible in both the Quick Edit and Detailed Edit dialog screens.

How to quickly delete or change the characteristics of a paragraph

A pop-up menu will open when you position the crosshairs over a paragraph and right-click on it.
The pop-up menu permits you to:
- delete the paragraph;
- toggle between character and non-character.

Pop-up menu

The functionality described in this section can also be accessed via the pop-up menu which is opened by right-clicking anywhere on the canvas.

Saving changes

In the event that changes have been made to any of the components of the comic book page, a greenish hue can be observed around the “Stop Edit” button. When you click on this button you will be asked to confirm the changes.

Note. Changes to the image components will be stored in the _stat.xml file and changes to the text information will be stored in the _language.xml file. These XML files are part of the Archive file. The previous version of these XML files will be timestamped and maintained in the Archive file. This enables you to roll back any of the changes made.

Chapter 8
HowTo : OCR

This section describes the Optical Character Recognition process. Tesseract is an optical character recognition engine for various operating systems. It is free software, released under the Apache License, Version 2.0, and development has been sponsored by Google since 2006.

Prerequisite

Tesseract must be installed before OCR can be performed. You can also opt to install additional language packs. You need to set the name of the folder in which the Tesseract binaries are stored via the Project Configuration dialog.

The Text Extraction process must have been completed on a comic page image before you can perform OCR. The OCR process uses the Archive file as its prime input. If you are not satisfied with the result of the Text Extraction process, you can manually change the contents of the Archive file via the “Edit” option.

The quality of the scanned image greatly affects the results of the OCR process, in particular the resolution and DPI of an image.
Lately the resolution of scanned images has been considerably enhanced, up to resolutions of 1700x2300 and more.

Example

The next screenshot shows the Comic Page which is used as an example to perform OCR upon.

Starting the OCR process

First step

The OCR process is started by clicking on the OCR button and selecting the Archive file of the comic page that you want to OCR.

Second step

The text in the speech balloons will be sourced from the results of the text extraction and edit processes. The text within a paragraph will be extracted from the original image, flattened and put onto a single line. Each paragraph will be preceded by a Unique Identifier (UID). The results will be displayed and saved in an image file (OCRoutput.png).

The header line of the Tesseract OCR image has a reference to the Comic Page image file (p:briningupfather08-08), the DPI (d:150) and the Comic Page 32 character hex UID (u:513A-3078-1D92-55B3-A709-1F58-C3AA-7E84).

In order to enhance the Tesseract OCR process, the resolution of the image file might be increased to 300 DPI and the image might be slightly blurred (using a convolution or Fourier transformation). See the “Tesseract Curation” option on the Comic Page Info dialog.

Third step

The Tesseract options set via the “Tesseract Option” dialog are fetched and stored in a parameter file (TesseractOptionRepository.xml, which is located in the $PROJECTDIR/OCR folder). The Tesseract OCR client is called via its command line interface using the recently created OCR image file and the Tesseract parameter file.

The results of the Tesseract OCR process are stored in the OCR Result file (OCRResult.txt), which is located in the $PROJECTDIR/OCR folder. The header information on the Tesseract OCR file is used to determine the maximum accuracy of the OCR process. The accuracy percentage is reported in the log file and also on the cbrTekStraktor status bar.
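To make the third step more concrete, the sketch below assembles a Tesseract command line in Python. It is an illustration under assumptions, not the way cbrTekStraktor itself invokes Tesseract: the helper name build_tesseract_command is hypothetical, and passing settings with the -c switch is used here only because the exact manner in which the application hands over its parameter file is not documented in this manual. The switches -l, --dpi and -c are standard Tesseract 4 command line options.

```python
def build_tesseract_command(image_file, output_base, language="eng",
                            dpi=None, options=None):
    """Return the argument list for a Tesseract command line call.

    image_file  : the flattened OCR input image, e.g. OCRoutput.png
    output_base : output name without extension; Tesseract appends .txt
    options     : dict of Tesseract control parameters (illustrative)
    """
    command = ["tesseract", image_file, output_base, "-l", language]
    if dpi is not None:
        command += ["--dpi", str(dpi)]
    # Sorted so that the resulting command line is reproducible.
    for name, value in sorted((options or {}).items()):
        command += ["-c", "{}={}".format(name, value)]
    return command


command = build_tesseract_command(
    "OCRoutput.png", "OCRResult", language="eng", dpi=300,
    options={"textord_heavy_nr": 1, "paragraph_debug_level": 1})
print(" ".join(command))
```

The resulting list can be handed to subprocess.run on a system where the tesseract binary is on the PATH; the call is left out here so that the sketch runs without Tesseract being installed.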
See the “Developer Notes” section for detailed information on the OCR folder and files.

Fourth step

The OCR Result file is read and parsed. The text of each paragraph is stored in the Language file, which is part of the Archive file. The OCR'ed text is read from the Archive file and displayed on the following screen.

Note. V02 applied some changes to the OCR process.
- The tags which precede each paragraph have been reworked to contain a repetitive numerical pattern. This enhances the ability of the OCR post-process to link a paragraph tag to its contents when extracting the OCR'ed texts from the OCRResult.txt file.
- The horizontal alignment on the Tesseract input image of the various lines within a single text bubble has been improved.

The output of the Tesseract OCR process is “scrubbed” prior to being loaded into the cbrTekStraktor application. The scrubbing process is rather coarse and removes all characters outside the 0x00000020 to 0x000000ff range.

The contents of the Language file are overwritten after each OCR run.

Bulk mode

The OCR process can run on entire sets of comic book pages. The bulk OCR will process all images within a single folder of which the text has previously been extracted. The bulk OCR process is similar to the single page OCR process. It can be interrupted by clicking on the “Stop” button on the marquee.

OCR Bulk mode initial step : folder selection

OCR Bulk mode : progress monitor

Tesseract Option File

Tesseract has loads of control parameter settings which can be used to modify its behavior. A list of all parameters with default value and short description can be retrieved by issuing the following command:

    tesseract --print-parameters

The cbrTekStraktor application enables you to browse through the various Tesseract V4 parameters and to set or unset them. The Tesseract option dialog is accessed via “File > Properties > Tesseract options”.
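The scrubbing rule mentioned earlier in this chapter (drop every character outside the 0x20 to 0xFF range) can be sketched in a few lines of Python. This is an illustration of the stated rule only; the function name scrub is hypothetical and the actual Java implementation may differ in detail.

```python
def scrub(line):
    """Keep only the characters in the 0x20..0xFF range.

    Intended to be applied to one line of OCR output at a time, because
    line feeds and tabs (below 0x20) are removed as well.
    """
    return "".join(ch for ch in line if 0x20 <= ord(ch) <= 0xFF)


# A tab (0x09) and a typographic apostrophe (U+2019) are dropped,
# the accented E (0xC9) is kept.
cleaned = scrub("CAF\u00c9\t\u2019S")
```

Because the filter also discards typographic quotes and any character beyond Latin-1, it is consistent with the limitation stated in the introduction that the application targets Western (Roman) characters.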
Note. $$DEBUGFILE$$ is an internal cbrTekStraktor variable for the default name of the file holding Tesseract logging information: TesseractLog.txt, which is located in the $PROJECTDIR/OCR folder.

The options which have been activated are displayed at the beginning of the list and have the tick box “Withhold” set. The default settings are documented in the following table.

Parameter                  Setting
debug_file                 c:\temp\cbrTekStraktor\Tutorial\Ocr\TesseractLog.txt
paragraph_debug_level      1
tessedit_char_whitelist    ABCDEFGHIJKLMNOPQRSDTUVWXYZ012345789
textord_heavy_nr           1

Note. If you close the monitor dialog during a bulk run, you can re-open it via “Tools > More > Monitor”.

Chapter 9
HowTo : Artificial Intelligence Visual Recognition

Context

The 2017 version of cbrTekStraktor often gives false positives when classifying the aggregated objects into text or non-text paragraphs, i.e. too many areas of a comic book page are wrongly identified as speech balloons. The screenshot below depicts the results of extracting text from the cover page of “Space Detective N04” (see last page of this manual).

The root cause of this misclassification is possibly located in the process step that groups characters into text paragraphs based on a basic proximity rule. A first attempt to remedy this issue is to “bolt on” an additional classification process, which leverages the visual recognition capabilities of recently commodified artificial intelligence (AI) software components. In the particular case of cbrTekStraktor, Google TensorFlow's Inception V3 Image Classifier now forms an additional and concluding text extraction step. See https://www.tensorflow.org/tutorials/image_recognition for more information on the Inception V3 AI VR component.

The AI VR integration is an optional module in cbrTekStraktor, i.e. text extraction can be performed without this module.
Conceptual design

A classifier determines to which category an object belongs. A classifier can for example be used to detect speech balloons and non-textual graphical components on a comic book page. Contemporary classifiers rely on artificial intelligence techniques and are now readily available. cbrTekStraktor provides an integration with Google's Inception V3 image classifier.

Inception V3 classification capabilities improve automatically through experience gathering. This is commonly referred to as supervised machine learning. After having been shown a number of representative examples of speech balloons and non-textual image objects from a comic book page, Inception is able to make a correct distinction between those.

According to Google: “Inception uses a deep convolutional neural network (CNN) to achieve reasonable performance on visual recognition tasks”.

Technical design

cbrTekStraktor provides the following supporting functionality for the Google image classifier.
- A module to create a set (or sets) of standardized training images. A standardized image is an RGB color image in JPEG format, which is either cropped or centered to fit in a 300x300 frame. The images are extracted from previously processed comic book pages.
- A script to retrain the visual recognition model.
- A component that integrates cbrTekStraktor, Python and TensorFlow. In essence, cbrTekStraktor merely calls a Python script via the operating system shell.

Note. Future releases of cbrTekStraktor will directly call the TensorFlow Java library.

Installation of Python and TensorFlow components

The combination of Python 3.5 and TensorFlow, versions 1.4 through 1.7, proved to be a well-functioning basis for the additional cbrTekStraktor classification step. The installation of Python and TensorFlow components is a bit cumbersome. A step-by-step installation on the Windows platform has been documented in the appendix.
Hopefully, using the AI Visual Recognition add-ons in cbrTekStraktor is more intuitive. There are 2 configuration settings that affect the AI Visual Recognition component: “Python folder” and “Maximum number of threads”. See the “How-To: Projects” section.

Accessing the AI Visual Recognition components

The AI Visual Recognition (VR) add-ons are located on the following menu item: “File > TensorFlow”.

TensorFlow Setting
This item is currently not implemented.

Make training set
This item is used to create a set of images to train the Visual Recognition model.

Extract single set
The purpose of this item is to extract a set of image paragraphs from a single comic book page and use this set to manually validate the correctness of the Visual Recognition model.

Readjust text bubbles via TensorFlow
This item performs an additional classification once cbrTekStraktor has completed its standard text extraction steps. The results of the AI Visual Recognition process are used to re-adjust the previously obtained results.

Training, test and validation of the Inception model

Supervised machine learning is most often based on the following three steps.
- The model is initially fit on a training dataset that consists of a set of example images.
- Subsequently, the fitted model is used to predict the responses for the observations in a second dataset, which is called the validation dataset.
- Finally, the test dataset is used to provide an unbiased evaluation of the final model fit on the training dataset.

cbrTekStraktor facilitates the creation of the training set of images, but still requires that you manually (re)train the Inception V3 model.

First step: Creation of the training image set

Start via “File > TensorFlow > Make training set”. cbrTekStraktor will then extract the training images from comic book pages of which the text has already been extracted. A substantial number of training images is required to train the visual recognition model.
You will need to have previously processed at least 30 comic book pages. Use the bulk extraction to do this. You might encounter the following warning message.

A monitor dialog box similar to the one below is displayed whilst the sample images are being extracted from your set of comic book pages.

The following dialog box will be displayed when the sample images have been created.

Second step: manual classification

The example images are to be found in the “$ROOTDIR\Corpus\Images” folder. You will need to determine visually which image constitutes a genuine speech balloon and which image does not. A working approach is to configure Windows Explorer to display “large icons” and just drag & drop the speech balloons and non-textual image objects into two separate folders.

It is imperative (oh dear!) that you put the speech balloon images in a folder called “validbubble” and the other images in a folder called “invalidbubble”. The TensorFlow model retrain script will use these folder names for the various categories of the classifier. cbrTekStraktor also uses these folder names when interpreting the results of Google's image recognition component.

Third step: retrain the Inception model

We will use a Python script originally provided by Google to retrain the VR model using the set of training images created in the second step. The Python scripts and the operating system command scripts are to be found on GitHub in the AncillarySourceCode folder of the cbrTekStraktor project (https://github.com/cbrtekstraktor/cbrTekStraktor/tree/master/src/AncillarySourceCode).

Note. The script names refer to version 1.4 of TensorFlow.

There are a few prerequisite steps to be performed.
- Create a working folder, e.g. C:\Temp\cbrTekStraktor\Tutorial\VR, henceforth referred to as $VRDIR. It is advised to keep this working directory separate from the $PROJECTDIR\TensorFlow folder. This will enable you to experiment with retraining the VR model.
- Create the folders to store the manually classified images, e.g. $VRDIR\comics\validbubble and $VRDIR\comics\invalidbubble.
- Copy the images that comprise speech balloons into the $VRDIR\comics\validbubble folder; copy the non-textual images into the $VRDIR\comics\invalidbubble folder.
- Download the retrain Python script (retrain14.py) from GitHub and put it in $VRDIR.
- Download the retrain cmd script (cbrTekStraktorRetrain14.cmd) from GitHub and put it in $VRDIR.
- Modify the cbrTekStraktorRetrain batch script to match the installation folder of Python and the working folder on your system.

    REM cbrTekStraktor V04 retrain all command file
    REM
    REM Folder in which Python 3.5 is installed
    SET PYTHON_HOME=C:\temp\devtools\Python35
    PATH=%PYTHON_HOME%;%PYTHON_HOME%\Scripts;%PATH%
    REM Working folder ($VRDIR) and the folder holding the classified images
    SET KDIR=C:\temp\cbrTekStraktor\tutorial\VR
    SET KSTART=%KDIR%
    SET KPROG=%KDIR%
    SET KDATA=%KDIR%\comics
    python %KPROG%\retrain14.py --bottleneck_dir=%KSTART%\bottlenecks --how_many_training_steps 500 --model_dir=%KSTART%\inception --output_graph=%KSTART%\comics_graph.pb --output_labels=%KSTART%\comics_labels.txt --image_dir=%KDATA%
    pause

Verify all of the above. You are all set to run the cbrTekStraktorRetrain14 batch either by double clicking on it or from the command shell.

    C:\temp\cbrTekStraktor\Tutorial\VR>python C:\temp\cbrTekStraktor\tutorial\VR\retrain14.py --bottleneck_dir=C:\temp\cbrTekStraktor\tutorial\VR\bottlenecks --how_many_training_steps 500 --model_dir=C:\temp\cbrTekStraktor\tutorial\VR\inception --output_graph=C:\temp\cbrTekStraktor\tutorial\VR\comics_graph.pb --output_labels=C:\temp\cbrTekStraktor\tutorial\VR\comics_labels.txt --image_dir=C:\temp\cbrTekStraktor\tutorial\VR\comics
    >> Downloading inception-2015-12-05.tgz 100.0%
    Successfully downloaded inception-2015-12-05.tgz 88931400 bytes.
    Looking for images in 'InvalidBubble'
    Looking for images in 'ValidBubble'
    Creating bottleneck at C:\temp\cbrTekStraktor\tutorial\VR\bottlenecks\InvalidBubble\PObj_0dd79ede_46629736087335.jpg.txt
    << etc >>
    2018-05-27 12:33:49.765481: Step 490: Train accuracy = 98.0%
    2018-05-27 12:33:49.765481: Step 490: Cross entropy = 0.047499
    2018-05-27 12:33:50.030682: Step 490: Validation accuracy = 100.0% (N=100)
    2018-05-27 12:33:52.355086: Step 499: Train accuracy = 100.0%
    2018-05-27 12:33:52.355086: Step 499: Cross entropy = 0.029145
    2018-05-27 12:33:52.604686: Step 499: Validation accuracy = 100.0% (N=100)
    Final test accuracy = 100.0% (N=70)
    Converted 2 variables to const ops.

The first time you run the cbrTekStraktorRetrain command, TensorFlow will download a readily available trained Inception model. It will take a couple of minutes before the retraining is completed. This will result in a retrained model file “comics_graph.pb” and a label file “comics_labels.txt”. $VRDIR should have a structure similar to the one below.

Fourth step: verification

- Download the file “test14.py” from GitHub and put it in $VRDIR.
- Download the file “cbrTekStraktorTest14.cmd” from GitHub and put it in $VRDIR.
- Adapt the cmd file to reflect where you installed Python and the working folder.
- Run the cbrTekStraktorTest14 batch either by double clicking on it or from the command shell.

Look for the scores for “validbubble” and “invalidbubble” and use those to assess the accuracy of the image classification model which you have just retrained.

Fifth step: deploy the model in cbrTekStraktor

Copy the model file “comics_graph.pb” and the label file “comics_labels.txt” from $VRDIR to $PROJECTDIR\TensorFlow.

Running the VR post-process

Start via “File > TensorFlow > Readjust text bubbles via TensorFlow”. You will first need to select the comic book page to be re-classified.
Subsequently the dialog below will pop up, enabling you to monitor the re-classification progress.

The re-classification process can be throttled by defining the number of parallel threads. You will need to find a good balance: too many threads might put too demanding a workload on your computer. See the section on cbrTekStraktor projects to learn how this parameter can be set.

The result of the re-classification of the cover page of “Space Detective N04” can be observed in the picture below: only the area comprising “10c N0.4” is maintained as a text paragraph.

Chapter A
HowTo : Report

An overview report will be created in HTML format when you click on the main screen's “Report” button. The report will be displayed in Mozilla or any other browser that you have specified as the reporting client. The Report functionality is rather sparse and will be enhanced in future releases.

Chapter B
HowTo : Miscellaneous items

Archive browser

The archive browser utility “Tools > More > Archive Browser” enables you to examine the contents of a cbrTekStraktor archive file. The “Action” item on the dialog enables you to:
- explore the archive file, which gives an overview of all files in the archive;
- examine the objects extracted, which will open the STAT xml file in a web browser;
- examine the text extracted, which will open the Language file (extracted text and translated text) in a web browser;
- zap an archive, i.e. delete an archive file.

Exporting all extracted text

The option “File > Export > Export Text” enables you to export all textual information to a single file. The file comprises the results of the OCR process, the modifications to the OCR'ed text and the translated texts. Exported text can be used to quickly modify or translate texts.

First step

Select the folder containing the comic book pages from which the texts should be exported.

Second step

The text extraction process runs.
The batch monitor screen is displayed. The monitor screen will close automatically a few seconds after the export is completed.

Third step
A dialog box opens in which you can provide the location and name of the export file.

Example export
_____________________________________________________________
$30253054928566: THAT MUST pF A PICTURE THEY ARC REHEARING
$30253054928567: The count de Cay got me to buy a studio for $100000
$30253054928568: whatstrig?
$30253054928569: OH" IM JUST CRAZY To se A MOVE Stam
$30253054928571: Lovely
$30253054928583: HE Cemtaminr DID THAT WELL
$30253054928584: SHE THE STAQ? WHAT 15 THE NaME Of THIS PAN >
$30253054928585: THAT'S NO PLAy: THAT'S THE SHEmirP. Th?? STUDIO 1?? ATTACHED FOR Gaamics
$30253054928595: cises
$30253054928596: her'. Fextune Service
$30253054928567: Le comte de Cay m'a convaincu d’achèter une studio de cinéma pour $100000
$30253054928569: Oh je suis folle. Je veux être une star du cinéma.
$30253054928571: Parfait.

The texts are grouped per image file, per language and per paragraph. Each paragraph has a unique UID, e.g. 30253054928571. UIDs are enclosed between a dollar sign and a colon. You can quickly assess the correctness of the OCR'ed text and modify it where appropriate. Alternatively, you can quickly enter translated text and prepare it for reloading. When entering or modifying text, make sure not to change the structure of the file, i.e. leave a space between the colon that terminates the UID and the text.

Importing text
This option, found under “File > Import > Import text”, loads text from a text source file into the cbrTekStraktor archive file. Its purpose is to re-import modified or translated texts into the cbrTekStraktor application.
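The export layout described above can be read back with a few lines of Python before re-importing. This is only a sketch: it assumes each entry sits on one line in the shape shown in the example export (a dollar sign, the numeric UID, a colon, one space, then the text), and the names `parse_export_lines` and `format_export_line` are hypothetical helpers, not part of cbrTekStraktor.

```python
import re

# Assumed entry shape, taken from the example export: "$<digits>: <text>"
ENTRY_PATTERN = re.compile(r"^\$(\d+): (.*)$")


def parse_export_lines(lines):
    """Return (uid, text) pairs in file order; lines that do not match are skipped."""
    entries = []
    for line in lines:
        match = ENTRY_PATTERN.match(line.rstrip("\n"))
        if match:
            entries.append((match.group(1), match.group(2)))
    return entries


def format_export_line(uid, text):
    """Rebuild one entry, keeping the single space between separator and text."""
    return "${}: {}".format(uid, text)


sample = [
    "$30253054928567: The count de Cay got me to buy a studio for $100000",
    "$30253054928571: Lovely",
    "_____________________________________________________________",
]
entries = parse_export_lines(sample)
rebuilt = [format_export_line(uid, text) for uid, text in entries]
```

Checking that `rebuilt` equals the matching input lines is a cheap way to confirm that an edited file still has the structure the import expects.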
The structure of the source file must adhere to the format of the text export file (see previous section).

First step
Select the file from which you want to import data.

Second step
The data from the file will be imported. A progress monitor screen will be shown.

Round tripping
Round tripping is a quick way of modifying and translating text. Just perform the following steps.
Export the text for a given folder.
Modify, translate or spell-check the exported data in a text editor.
Import the modified data.

Logging information
There are two log files to be found in the $PROJECTDIR folder.
cbrTekStraktorErrFile.txt, which comprises the errors
cbrTekStraktorLogFile.txt, which comprises the log lines created by the application
[TODO – Comments on details of the logging files in cbrTekStraktor will be provided in future versions of this manual]

Chapter C
Developer notes
This section contains information that might be useful for developers.

Folder structure of the application
cbrTekStraktor relies on a predefined and static folder structure.

$PROJECTDIR folder
The root folder is the Project Directory ($PROJECTDIR). By creating several root folders, the application is able to create multiple, separate projects. The root of the folder structure can be specified on the command line when starting the application (option -D). If the root folder is not specified as a command line parameter, it defaults to “c:\temp\cbrTekStraktor” or $HOME/cbrTekStraktor. The $PROJECTDIR must adhere to the following structure.

$PROJECTDIR folders
Folder    Description
Cache     This is a mandatory directory. It is used to cache files temporarily, for example when in Edit mode files are cached in this folder.
Corpus    This is a mandatory directory. It comprises statistical information gathered by the application, e.g. timing information.
Output    Required directory. Additional information is to be found in the next section.
Temp      This is a required directory. It is used to store temporary files, for example when the Image screen Info dialog is opened, the boxdiagram.png of the RGB box plot diagram is stored in this directory.
OCR       Required directory. It is used to store the Tesseract configuration, debug and result files, as well as a PNG image of the text to be OCR'ed.

$PROJECTDIR files
File                         Description
cbrTekStraktor.xml           This is a required configuration file.
properties.txt               This file comprises the GUI properties of the application, for example the width and height of the main canvas. It is created and maintained by the application.
cbrTekStraktorLogFile.txt    File with logging information
cbrTekStraktorErrFile.txt    File with error information

Output folder
These are the folders which are to be found in $PROJECTDIR/Output.
Folder     Description
Archive    This is a required folder in which the cbrTekStraktor archives are stored. An archive is a ZIP file comprising reports, statistical, pictorial and textual information, all of which have been generated by the application. See the next section for an overview of the content of an archive file.
HTML       This is a required folder in which the HTML reports are stored. It also comprises the CSS (Cascading Style Sheets) file (cbrTekStraktorCSS.txt). If the CSS file is missing, a new one will automatically be created by the application. There is a single HTML file for each file ( .html). When the HTML report file is no longer required it is automatically transferred from the HTML folder into the applicable archive file. Conversely, HTML reports are extracted from the archive files when a report is requested.
Images     This is a required folder for temporarily storing images created by the application. In general temporary files will have a name preceded by “z”, e.g. zPeakDiag_wonderwoman.png is the RGB Peak histogram image.
Stats      This is a required folder for storing the statistical information on each image file in XML format in a file named .xml. Detailed information on the statistical elements maintained in this file is available at the end of this appendix.

Note. A housekeeping routine is automatically performed on a frequent basis. This routine removes obsolete files in the $PROJECTDIR folders. The housekeeping routine can also be started manually from the “Tools > Housekeeping” menu.

OCR folder
These files might be present in the $PROJECTDIR/OCR folder.
File                             Description
TesseractOptionRepository.xml    This file will only be present if you open the “Properties > Tesseract Options” menu. The file comprises all options for Apache Tesseract 4.0.
TesseractLog.txt                 This is the debug file which the Tesseract client creates during the OCR process.
TesseractConfig.txt              This is the Tesseract parameter file which is created by cbrTekStraktor before the Tesseract client is called. It comprises the Tesseract options that have been defined in “Properties > Tesseract Options”.
OCRTextResult.txt                This is the result file created by Tesseract.
OCROutput.png                    This is the image that is created by extracting the characters from the comic book image.

Corpus folder
Filename                   Comment
AllStat.txt                This file collates various metrics and statistical information on the comic book pages in the project. These statistics are created when pressing the “Statistics” button on the “Tools” menu.
TimingAccuracyStats.txt    Statistical information which is used to monitor the accuracy of the execution time predictive analysis. This module uses Euclidean distances.
TimingInputStats.txt       Timing information on the various processes, gathered by the application whilst executing these processes.

File structures of the application

Project File
A reference to the project which was last accessed is stored in the configuration file “cbrTekStraktorProjectConfig.txt”.
This file is located one folder above the current $PROJECTDIR folder. In most cases this will be C:\temp\cbrTekStraktor or, when using Linux, $HOME/cbrTekStraktor. The content of the Project Properties File is rather sparse and might look like this:
=================================================
cbrTekStraktor V0.1 (07-May-2017)
Started=07-may-2017 13:37:30
Stopped=07-may-2017 13:38:36
=================================================
EntryFolder=c:\temp\cbrTekStraktor
RecentProject=Tutorial

cbrTekStraktor configuration file
“cbrTekStraktor.xml” is the main configuration file and is located in the $PROJECTDIR folder.

Example configuration file
Tag                                    Comment
Browser                                Defines which web browser is to be used by the reporting subcomponent. Supported values are {CHROME,MOZILLA,EXPLORER}
Created                                The time the cbrTekStraktor project was created
Dateformat                             Specifies the display format of date and time information.
Description                            A description of the project
Encoding                               {UTF8,UTF16,LATIN1, ASCI}
HorizontalVerticalVarianceThreshold    A threshold value that is used to separate connected components which are characters from non-textual graphical objects. This value should not be modified.
Logginglevel                           A number ranging from 0 to 9 to define the level of detail of the logging information. Setting the level to 9 will provide the most detailed logging information.
MeanCharacterCount                     This is a threshold value which is used to identify the cluster with textual information.
Name                                   The name of the cbrTekStraktor project
PreferredFont                          The name of the font to be used in the majority of the application's screens and dialogs.
PreferredFontSize                      The size of the preferred font.
TesseractFolder                        This is the folder where the Tesseract OCR application is installed.
Updated                                The time the cbrTekStraktor project was last updated
PythonFolder                           Installation folder of Python
MaximumNumberOfThreads                 The maximum number of threads to be used by the TensorFlow integration component.

Content of the archive file
The archive files are to be found in the $PROJECTDIR/Output/Archive folder. An archive file name always ends in “_set.zip”. An archive is a ZIP file comprising reports, statistical, pictorial and textual information generated by the application. The archive file contains the following files.
Filename                                  Comment
Binarized_Out.png                         This is the monochrome version of the comic book image file. This PNG file is used by the OCR component.
{CMXUID}.html                             This is the HTML report file.
{CMXUID}_lang.xml                         XML file containing the extracted and translated textual information
{CMXUID}_Lang_Ver_{YYMMDDHHMISS}.xml      All versions of the Lang.XML file are maintained in the archive. Versions are timestamped.
{CMXUID}_stat.xml                         XML file comprising the statistical and graphical information of the comic book image file.
{CMXUID}_Stat_Ver_{YYMMDDHHMISS}.xml      Previous version of the Stat.XML file
{CMXUID}{00.NNN}.png                      These are the cut-out images of the text and non-text paragraphs. These images are used by the reporting component.
zBoxDiagr_{CMXUID}.png                    Box plot diagram of the RGB histogram. This image is used by the reporting component.
zCharacts_{CMXUID}.png                    This is the image comprising the cut-out paragraphs. It is created and displayed at the end of the text extraction process and stored in the archive for further usage by the reporting component.
zClusters_{CMXUID}.png                    This is an image that comprises an overview of the clusters identified during the text extraction process.
zColrHist_{CMXUID}.png                    An image of the RGB histogram used by the reporting component.
zGrayHist_{CMXUID}.png                    An image of the grayscale histogram used by the reporting component.
zMetaData_{CMXUID}.xml                    This is the Comic Book Metadata XML.
zPeakDiag_{CMXUID}.png                    An image of the Peak Detection histogram.

cbrTekStraktor STAT XML file
The STAT XML file comprises the majority of the results of the image and text processing activities on an image file. Detailed information on this file is provided in a separate section at the end of this appendix.

Language file
The language file contains the result of the OCR and translation processes.

Example configuration file values (XML tags lost in extraction):
20180502102508 20180527093118 Tutorial Created by cbrTekStraktor V0.3 (1-May-2018) dd-MMM-yy HH:mm:ss French 9 500 5 Verdana 12 ISO_8859_1 BLEACHED C:\temp\devtools\Tesseract\Tesseract_4_64Bit C:\Temp\devtools\Python35 6 MOZILLA

Tag                          Comment
TextBundleChangeDate         The time the content of the paragraph was created or updated.
TextBundleIdx                This is the sequence number of the paragraph.
TextBundleRemoved            Possible values are {True,False}
TextBundleUID                This is the unique identifier of the paragraph
TextConfidence               Possible values are {Text,Nontext}
TextFrom                     This field contains the text in its original language
TextOCR                      This is the result of the OCR operation on the paragraph
TranslatedText_{language}    This field contains the translated text.

zFiles
The following zFiles are present in the $PROJECTDIR\temp folder when the extraction and edit processes are active. The files are subsequently stored in the archive file.
Image File     Description
zBoxDiagr      An image of the RGB frequency distribution's quartile box diagrams
zCharacters    An image file in which the character clusters are visualized
zClusters      An image file in which the clusters are visualized
zColHist       Color histogram image
zGrayHist      Grayscale histogram image
zPeakHist      Peak histogram, which is used to determine whether an image file comprises a black and white, grayscale or color image.
zMetadata File
The characteristics of a single comic page are stored in the file $PROJECTDIR\Output\Archive\zMetadata_ .xml. This is an example of a zMetaData _ .xml file (XML tags lost in extraction):
20170501095026 20170501112243 French Afrikaans Albanian .. etc .. comic01.jpg C:\temp\cmcProc\test 288775 comic01 6C58-E110-A319-547F-F401-5341-3623-2EC9 11 100 8870019149256 text false 20170501131957

Tesseract command file
The following command will be generated to run Tesseract. It is written to a temporary file which is removed by the application once Tesseract has completed the OCR.
C:\temp\Tesseract-OCR-4\Tesseract-OCR>tesseract c:\temp\cbrTekStraktor\Tutorial\Ocr\OCROutput.png c:\temp\cbrTekStraktor\Tutorial\Ocr\OCRTextResult -l eng c:\temp\cbrTekStraktor\Tutorial\Ocr\TesseractConfig.txt

Example TesseractOptionRepository file (XML tags lost in extraction)
C:\temp\cmcProc\test superman_01.jpg 6C58-E110-A319-547F-F401-5341-3623-2EC9 superman01 0 1 FRENCH colour TIGHT SLOW_SAUVOLA AUTOMATIC USE_IMAGE_DPI CHROME CROP_IMAGE false allow_blob_division 1 Use divisible blobs chopping
File abbreviated

Example OCROutput.PNG file

Example TesseractConfig file
debug_file c:\temp\cbrTekStraktor\Tutorial\Ocr\TesseractLog.txt
paragraph_debug_level 1
tessedit_char_whitelist ABCDEFGHIJKLMNOPQRSDTUVWXYZ012345789
textord_heavy_nr 1

Example OCRResultFile file
[p:bringingupfather08-08] [d:150] [u:5A3A-3078-1D92-55B3-A709-1F58-C3AA-7E84]
P22826615692200
MRJIGGS THIS 1S Miss Peacy~® JUST ENgAgep FOR OUR MELODRaMA
P22826615692201
THAT 1S MISS TAKE- SHE IS TO TAKE A STAR® PART IN THE COMEDy
P22826615692203
VERyGOOD
P22826615692205
SHE LOOKS LIKE A COMEDY
P22826615692208
ELL WHAT'S TO BE DONE IN THE STUDIO TODAY *
P22826615692210
we CAn Only - POY on ome Tooay | ETHER THE mELopRam - OR THE COMEDY. | was - THiNKNq
P22826615692211
NEVER MIND | Thinkin' weir PUY On THE MELODRAmA
P22826615692218
Jo -g reatume. sarvice. 1

Example TesseractLog file
Tesseract Open Source OCR Engine v4.00.00alpha with Leptonica
# Final Paragraph Segmentation
#row space .. lword[widthSEL] rword[widthSEL] [lmarg,lind;rind,rmarg] model text
0 8 [p:bringingupfather08-08][236SEl] [u:5A3A-3078-1D92-55B3-A709-1F58-C3AA-7E84][490sEl] [ 1, 0; 0, 0] S:1 [p:bringingupfather08-08] [d:150] [u:5A3A-3078-1D92-55B3-A709-1F58-C3AA-7E84]
1 12 P22826615692200[180Sel] P22826615692200[180sel] [ 1, 1;737, 0] C:1 P22826615692200
2 13 MRJIGGS[108Sel] MELODRaMA[147sel] [ 1, -1; 68, 0] S:1 MRJIGGS THIS 1S Miss Peacy~® JUST ENgAgep FOR OUR MELODRaMA
3 12 P22826615692201[177Sel] P22826615692201[177sel] [ 1, 2;739, 0] C:1 P22826615692201
4 13 THAT[62Sel] COMEDy[111sel] [ 1, 0; 0, 0] S:1 THAT 1S MISS TAKE- SHE IS TO TAKE A STAR® PART IN THE COMEDy
Active Paragraph Models:
1: margin: 1, first_indent: 0, body_indent: 0, alignment: LEFT
# Final Paragraph Segmentation
#row space .. lword[widthSEL] rword[widthSEL] [lmarg,lind;rind,rmarg] model text
0 7 P22826615692203[180Sel] P22826615692203[180sel] [ 1, 1; 1, 1] U:0 P22826615692203
1 7 VERyGOOD[142Sel] VERyGOOD[142sel] [ 1, -1; 39, 1] U:0 VERyGOOD
Active Paragraph Models:
# Final Paragraph Segmentation
#row space .. lword[widthSEL] rword[widthSEL] [lmarg,lind;rind,rmarg] model text
0 7 P22826615692205[180Sel] P22826615692205[180sel] [ 1, 1;136, 1] U:0 P22826615692205
1 8 SHE[44Sel] COMEDY[98sel] [ 1, -1; 1, 1] U:0 SHE LOOKS LIKE A COMEDY
Active Paragraph Models:
# Final Paragraph Segmentation
#row space .. lword[widthSEL] rword[widthSEL] [lmarg,lind;rind,rmarg] model text
0 11 P22826615692208[180Sel] P22826615692208[180sel] [ 1, 1;453, 1] U:0 P22826615692208
1 11 ELL[47Sel] *[8SEL] [ 1, -1; 1, 1] U:0 ELL WHAT'S TO BE DONE IN THE STUDIO TODAY *
Active Paragraph Models:
# Final Paragraph Segmentation
#row space .. lword[widthSEL] rword[widthSEL] [lmarg,lind;rind,rmarg] model text
0 10 P22826615692210[180Sel] P22826615692210[180sel] [ 1, 0;1017, 1] U:0 P22826615692210
1 11 we[39sel] THiNKNq[111sel] [ 1, -1; 1, 1] U:0 we CAn Only - POY on ome Tooay | ETHER THE mELopRam OR THE COMEDY. | was - THiNKNq
Active Paragraph Models:
# Final Paragraph Segmentation
#row space .. lword[widthSEL] rword[widthSEL] [lmarg,lind;rind,rmarg] model text
0 7 P22826615692211[177Sel] P22826615692211[177sel] [ 1, 1;585, 1] U:0 P22826615692211
1 11 NEVER[84Sel] MELODRAmA[168sel] [ 1, -1; 1, 1] U:0 NEVER MIND | Thinkin' weir PUY On THE MELODRAmA
Active Paragraph Models:
# Final Paragraph Segmentation
#row space .. lword[widthSEL] rword[widthSEL] [lmarg,lind;rind,rmarg] model text
0 7 P22826615692218[180Sel] P22826615692218[180sel] [ 1, 1; 34, 1] U:0 P22826615692218
1 10 Jo[25Sel] 1[5SeL] [ 1, -1; 1, 1] U:0 Jo -g reatume. sarvice. 1
Active Paragraph Models:

AllStat File
The AllStat.txt file is to be found in the $PROJECTDIR/Corpus/Stats. The file is generated by selecting “Tools > More > Statistics”. Its purpose is to provide key statistical information (metrics) on each comic book page processed. These metrics can be used in further analysis or for various other prediction purposes. The file is a comma-delimited ASCII text file.
Column                          Comment
UID                             The UID of the image processed
UncroppedWidth                  The uncropped width in pixels
UncroppedHeigth                 The uncropped height in pixels
PayloadWidth                    The payload width
PayloadHeigth                   The payload height
NbrOfElementsInLetterCluster    The number of elements found in the paragraph classified to be the text paragraph.
NbrOfParagraphs                 The number of paragraphs
NbrOFLetterParagraphs           The number of paragraphs which have been identified to comprise text
NbrOfLetters                    The total number of characters
Colour                          {TRUE, FALSE} Set to TRUE by the color scheme detection algorithm if a full color scheme is detected.
MonochromeDetected              {TRUE,FALSE} Set to TRUE if the color scheme detection logic found a monochrome scheme
MonochromeDetectionStatus       {TRUE,FALSE}-{POSITIVE,NEGATIVE}
                                True-positive: correctly identified monochrome scheme
                                False-positive: incorrectly identified monochrome scheme
                                True-negative: correctly identified color scheme
                                False-negative: incorrectly identified color scheme
NbrPeak                         The number of peaks found by the color scheme detection logic
NbrValidPreak                   The number of valid peaks
%PeakCoverage                   Coverage of pixels within valid peaks

Example
UID,UncroppedWidth,UncroppedHeigth,PayloadWidth,PayloadHeigth,NbrOfElementsInLetterCluster,NbrOfParagraphs,NbrOFLetterParagraphs,NbrOfLetters,Colour,MonochromeDetected,MonochromeDetectionStatus,NbrPeak,NbrValidPreak,%PeakCoverage
e7f2-2af1-d553-b70e-8cb8-b350-ebc3bd42,975,1465,877,1355,858,25,15,826,unknown,false,false-negative,8,0,46%,
ce6d-3395-7b87-e18e-b53f-2f55-112fd37e,1200,1857,1093,1656,597,36,10,485,unknown,false,false-negative,9,0,21%,
795b-7aa8-39ea-7ee8-f1d4-3e10-06db0717,1024,1105,998,1012,321,31,10,250,unknown,true,false-positive,39,0,87%,

TimingInputStat file
The TimingInputStat.txt file is to be found in the $PROJECTDIR/Corpus/Stats. The file stores the elapsed time in nanoseconds of various text extraction and image loading processes. The timing information is gathered during the image loading and text extraction activities. It is the prime source for estimating the duration of image loading and text extraction activities, i.e. the lead time of a process is calculated using Euclidean distance and is used by the progress indicator on the main screen. It is a vertical-pipe-delimited ASCII file.
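The manual only says that the lead time is derived from the recorded timings using Euclidean distance. One plausible reading, sketched below, is a nearest-neighbour lookup: find the previously processed image whose characteristics are closest to the new image and reuse its recorded duration. The choice of features (width and height only), the absence of scaling, and the function names are all assumptions made for illustration; the actual cbrTekStraktor logic may differ.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance between two equally long feature tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def estimate_duration(history, features):
    """Return the recorded duration of the history record closest to `features`.

    history: list of (feature_tuple, duration_in_nanoseconds) pairs.
    Returns None when no history is available yet.
    """
    if not history:
        return None
    closest = min(history, key=lambda record: euclidean_distance(record[0], features))
    return closest[1]


# (width, height) -> image load time in nanoseconds, values taken from
# the TimingInputStat example rows in this manual.
history = [
    ((975, 1465), 245314798),
    ((1200, 1857), 310097424),
    ((1024, 1105), 186662399),
]
estimate = estimate_duration(history, (1000, 1400))
```

With unscaled features, columns with large numeric ranges (file size, for instance) would dominate the distance, so a real implementation would probably normalize each feature first.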
Column                 Comment
UID                    UID of the image
Width                  Uncropped width of the image (in pixels)
Heigth                 Uncropped height of the image (in pixels)
FileSize               The file size of the image (in bytes)
ColourScheme           {COLOR,GRAYSCALE,MONOCHROME}
FileType               {PNG,JPG,GIF,JPEG}
BinarizationType       This is the binarization mechanism used, e.g. OTSU, SAUVOLA, etc.
ConnectedComponents    The overall number of connected components
Paragraphs             The number of paragraphs
BWDensity              Black and white density of the entire picture
ImageLoadTime          The time required to load the image (in nanoseconds)
PageLoadTime           The time required to load the comic book page (the page is an object that envelops the image) in nanoseconds.
Preprocess             The duration of the pre-process step (in nanoseconds)
BinarizeTime           The duration of the binarization process (in nanoseconds)
CoCoTime               The duration of the connected components process (in nanoseconds)
LetterTime             The duration of the process that identifies and expands characters (in nanoseconds). See the section where the text extraction process is explained.
ParagraphTime          The duration of creating and processing the contents of paragraphs
OverheadTime           This is the duration of the end-to-end text extraction process minus all of the above durations.
Timestamp              The time the record was written to the file

Example
UID|Width|Heigth|FileSize|ColourScheme|FileType|ConnectedComponents|Paragraphs|BWDensity|ImageLoadTime|PageLoadTime|Preprocess|BinarizeTime|CoCoTime|LetterTime|ParagraphTime|OverheadTime
629E-BF04-2D3B-5C27-29BB-C6BC-FBFC-E7F2-2AF1-D553-B70E-8CB8-B350-EBC3BD42|975|1465|256874|COLOR|JPG|SLOW_SAUVOLA|10183|25|0.3010666072368622|245314798|61627908|199194031|4462646481|246614265|904632564|834179973|925567697|03-JUN-2017 08:27:33
CE6D-3395-7B87-E18E-B53F-2F55-112FD37E|1200|1857|634270|COLOR|JPG|SLOW_SAUVOLA|31328|36|0.38288450241088867|310097424|83530141|245702033|7934692773|346997986|1079128458|1232142512|1286570859|03-JUN-2017 08:27:46
795B-7AA8-39EA-7EE8-F1D4-3E10-06DB0717|1024|1105|354385|GRAYSCALE|JPG|SLOW_SAUVOLA|3518|31|0.18374991416931152|186662399|64333023|179845335|3157337579|63852631|549405945|691859669|989257304|03-JUN-2017 08:27:52

TimingAccuracyStat
The TimingAccuracyStat.txt file is to be found in $PROJECTDIR/Corpus/Stats. The file stores the difference between the calculated and actual process times of various image loading and text processing activities. The timing information is gathered during the image loading and text extraction activities. Its major objective is to report on the accuracy of the process time prediction algorithm.
Column                       Comment
Timestamp                    The time the record was created
EstimatedImage               Estimated time to load the image
ActualImage                  Actual time to load the image
Ratio                        Correctness or accuracy ratio: (estimated – actual) / actual
EstimatedBeforePreprocess    Estimated before preprocess
ActualbeforPreprocess        Actual before preprocess
Ratio                        Correctness ratio
EstimatedBeforeBinarize      Estimated before binarize step
ActualBeforeBinarize         Actual before binarize step
Ratio                        Correctness ratio
EstimatedBeforeCoCo          Estimated before connected components
ActualBeforeCoCo             Actual before connected components
Ratio                        Accuracy ratio

Example
TimeStamp (EstimatedImage,ActualImage,Ratio) (EstimatedBeforePreprocess,ActualbeforPreprocess,Ratio) (EstimatedBeforeBinarize,ActualBeforeBinarize,Ratio) (EstimatedBeforeCoCo,ActualBeforeCoCo,Ratio)
03-JUN-2017 13:05:45 ( 250ms, 283ms, -11%) ( 5886ms, 5882ms, 14896ms, -60%) ( 5882ms, 14896ms, -60%)
03-JUN-2017 13:12:00 ( 306ms, 408ms, -25%) ( 7879ms, 7879ms, 8606ms, -8%) ( 7879ms, 8606ms, -8%)
03-JUN-2017 13:12:13 ( 393ms, 301ms, 30%) ( 12518ms, 12518ms, 12436ms, 0%) ( 12518ms, 12436ms, 0%)
03-JUN-2017 13:12:19 ( 283ms, 151ms, 87%) ( 5882ms, 5882ms, 5926ms, 0%) ( 5882ms, 5926ms, 0%)
[Column fragments garbled in the source: 14896ms, -60%) ( 8606ms, -8%) ( 12436ms, 5926ms, 0%) ( 0%) (]

cbrTekStraktor STAT XML file
The Stat file comprises detailed information on the characteristics of the file and image, as well as the results of the image analysis and text extraction processes.

Major segments in the Stat file
The root node in the Stat file is. The following segments are its immediate descendants.
Tag                  Comment
ProcessHistory       Creation and modification timestamps
File                 Details on the image file
Image                High-level details on the image
OriginalHistogram    The RGB histogram data of the original image
PayloadHistogram     The RGB histogram of the image once the margins have been removed.
ConnectedComponentFrequencyDistribution    Connected components frequency distribution
ClusterClassification                      Results of the cluster classification process
ConnectedComponentClusters                 Details on the connected components before the post-process
FinalConnectedComponentClusters            Details on the post-processed connected components, e.g. reallocation of character components.
paragraphs                                 Detailed information on the characteristics of the text and non-text paragraphs.
GraphicalEditorArea                        A dump of the connected components and objects that are part of the text and non-text paragraphs
TimingInfoNanoSec                          Timing information at nanosecond granularity.

ProcessHistory
[TODO – Will be covered in future releases of the cbrTekStraktor manual]
File
[TODO – Will be covered in future releases of the cbrTekStraktor manual]
Image
[TODO – Will be covered in future releases of the cbrTekStraktor manual]
OriginalHistogram
[TODO – Will be covered in future releases of the cbrTekStraktor manual]
PayloadHistogram
[TODO – Will be covered in future releases of the cbrTekStraktor manual]
ConnectedComponentFrequencyDistribution
[TODO – Will be covered in future releases of the cbrTekStraktor manual]
ClusterClassification
[TODO – Will be covered in future releases of the cbrTekStraktor manual]
ConnectedComponentClusters
[TODO – Will be covered in future releases of the cbrTekStraktor manual]
FinalConnectedComponentClusters
[TODO – Will be covered in future releases of the cbrTekStraktor manual]
Paragraphs
[TODO – Will be covered in future releases of the cbrTekStraktor manual]
GraphicalEditorArea
[TODO – Will be covered in future releases of the cbrTekStraktor manual]
TimingInfoNanoSec
[TODO – Will be covered in future releases of the cbrTekStraktor manual]

Text extraction process
Step 1. Determining the payload area of a comic book page image (determining the width and height of the margins).
Step 2. Cropping the image to its payload area (on the Comic Book Metadata dialog one can opt not to crop the image).
Step 3. The grayscale version of the cropped image is displayed.
Step 4. Binarization of the image. Various binarization methods can be selected manually on the previously displayed Comic Book Metadata dialog: Otsu, Niblack or Sauvola. It is recommended to use either Niblack or Sauvola.
Step 5. Display of the binarized image.
Step 6. Gathering the connected components, i.e. gathering information on every single graphical element present on the image. See Connected Component in appendix.
Step 7. K-Means clustering of the connected components. This is a straightforward classification of the connected components. The classification criterion is the pixel height of a connected component. On a comic book page, letters or characters more or less all have the same height. The idea is to create groups of connected components which all have a similar height and therefore constitute a cluster which contains only characters. cbrTekStraktor uses K = 5. See K-Means clustering in appendix.
Step 8. Identification of those connected components which are characters. This is performed in various steps, e.g. by analyzing the number of vertical and horizontal elements of a single connected component. Typically a character of the Latin script has 2 or 3 horizontal elements and 1 or 2 vertical elements. There are also more white pixels than dark pixels in a character, i.e. the density of these connected components should be less than 50%. Connected components having a pixel height of less than 6 are excluded and classified as noise.
Step 9. Identification of the K-Means cluster comprising the characters. There are typically between 300 and 500 characters on a single comic book page. So the cluster holding approximately 500 connected components resembling a character is initially chosen to be the “Text Cluster” (known in Dutch as the Letter Cluster). Additional fine-tuning steps are performed. It is possible to override the automatic detection of the cluster comprising characters via the “Cluster Classification Method” drop-down on the Image Info dialog.
Step 10. Expansion of the characters. Previously determined characters which are part of the Text Cluster are subsequently grouped based on their proximity on the image. This results in sets of characters that are part of the same speech balloon or text area. cbrTekStraktor uses the noun “paragraphs” for these groups. The “Proximity Tolerance” can be set on the Comic Book Metadata screen to be either tight, lenient, wide or ultra-wide.
Step 11. Adjustment of characters. Any connected component that lies within the area delimited by the previously found text paragraph boundaries is re-assessed as a potential character. Paragraphs that comprise few or no characters are then set to be “non-character” or “non-text” paragraphs; the remaining paragraphs are henceforth referred to as “character” or “text” paragraphs.
Step 12. The results of the text extraction are stored in an archive file (see the definition of the stat.xml and language file). Cut-outs of the paragraphs are displayed. Text paragraphs have a green border and non-textual paragraphs have a red border. Frames have a bluish border.

Image processing concepts
This section comprises a quick overview of the image processing concepts, e.g. image processing filters, used in the application.

Concept / Comment

Convolution
[Wikipedia] In mathematics convolution is a mathematical operation on two functions; it produces a third function, that is typically viewed as a modified version of one of the original functions, giving the integral of the pointwise multiplication of the two functions as a function of the amount that one of the original functions is translated.
It has applications that include probability, statistics, computer vision, natural language processing, image and signal processing, engineering, and differential equations. In image processing, a kernel, convolution matrix, or mask is a small matrix. It is used for blurring, sharpening, embossing, edge detection, and more. This is accomplished by doing a convolution between a kernel and an image.

Gaussian blur
[Wikipedia] In image processing, a Gaussian blur is the result of blurring an image by a Gaussian function. It is a widely used effect in graphics software, typically to reduce image noise and reduce detail. The visual effect of this blurring technique is a smooth blur resembling that of viewing the image through a translucent screen. Gaussian smoothing is also used as a pre-processing stage in computer vision algorithms in order to enhance image structures at different scales. Mathematically, applying a Gaussian blur to an image is the same as convolving the image with a Gaussian function. Since the Fourier transform of a Gaussian is another Gaussian, applying a Gaussian blur has the effect of reducing the image's high-frequency components; a Gaussian blur is thus a low pass filter.

Grayscale
[Wikipedia] In photography and computing, a grayscale digital image is an image in which the value of each pixel is a single sample, that is, it carries only intensity information. Images of this sort, also known as black-and-white, are composed exclusively of shades of gray, varying from black at the weakest intensity to white at the strongest. A common strategy is to use the principles of photometry or, more broadly, colorimetry to match the luminance of the grayscale image to the luminance of the original color image.
To convert a color from a colorspace based on an RGB color model to a grayscale representation of its luminance, weighted sums must be calculated in a linear RGB space, that is, after the gamma compression function has been removed first via gamma expansion.

Formula: 0.2126 R + 0.7152 G + 0.0722 B

Concept: Histogram equalization
Comment: [Wikipedia] Histogram equalization is a method in image processing of contrast adjustment using the image's histogram. This method usually increases the global contrast of many images, especially when the usable data of the image is represented by close contrast values. Through this adjustment, the intensities can be better distributed on the histogram. This allows areas of lower local contrast to gain a higher contrast. Histogram equalization accomplishes this by effectively spreading out the most frequent intensity values.

The method is useful in images with backgrounds and foregrounds that are both bright or both dark. In particular, the method can lead to better views of bone structure in x-ray images, and to better detail in photographs that are over- or under-exposed. A key advantage of the method is that it is a fairly straightforward technique and an invertible operator. A disadvantage of the method is that it is indiscriminate: it may increase the contrast of background noise while decreasing the usable signal.

Concept: HSL/HSV
Comment: [Wikipedia] HSL and HSV are the two most common cylindrical-coordinate representations of points in an RGB color model. The two representations rearrange the geometry of RGB in an attempt to be more intuitive and perceptually relevant than the Cartesian (cube) representation. Developed in the 1970s for computer graphics applications, HSL and HSV are used today in color pickers, in image editing software, and less commonly in image analysis and computer vision.

HSL stands for hue, saturation, and lightness (or luminosity), and is also often called HLS.
HSV stands for hue, saturation, and value, and is also often called HSB (B for brightness). A third model, common in computer vision applications, is HSI (I for intensity). However, while typically consistent, these definitions are not standardized, and any of these abbreviations might be used for any of these three or several other related cylindrical models.

In each cylinder, the angle around the central vertical axis corresponds to "hue", the distance from the axis corresponds to "saturation", and the distance along the axis corresponds to "lightness", "value" or "brightness". Note that while "hue" in HSL and HSV refers to the same attribute, their definitions of "saturation" differ dramatically.

Because HSL and HSV are simple transformations of device-dependent RGB models, the physical colors they define depend on the colors of the red, green, and blue primaries of the device or of the particular RGB space, and on the gamma correction used to represent the amounts of those primaries. As a result, each unique RGB device has unique HSL and HSV absolute color spaces to accompany it (just as it has a unique RGB absolute color space to accompany it), and the same numerical HSL or HSV values (just as numerical RGB values) may be displayed differently by different devices.

Concept: Image gradient
Comment: [Wikipedia] An image gradient is a directional change in the intensity or color in an image. In graphics software for digital image editing, the term gradient or color gradient is also used for a gradual blend of color which can be considered as an even gradation from low to high values, for example from white to black. Another name for this is color progression.

Mathematically, the gradient of a two-variable function (here the image intensity function) at each image point is a 2D vector with the components given by the derivatives in the horizontal and vertical directions.
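The mathematical definition above can be sketched in a few lines. The example below is a minimal illustration only: it assumes a grayscale image stored as a list of rows of intensity values and approximates the two derivatives with central differences. The function names gradient_at and gradient_magnitude are hypothetical and do not come from cbrTekStraktor.

```python
import math


def gradient_at(image, x, y):
    """Approximate the gradient (gx, gy) at an interior pixel.

    image is a list of rows; image[y][x] is the intensity at column x, row y.
    Central differences stand in for the horizontal and vertical derivatives.
    """
    gx = (image[y][x + 1] - image[y][x - 1]) / 2.0
    gy = (image[y + 1][x] - image[y - 1][x]) / 2.0
    return gx, gy


def gradient_magnitude(image, x, y):
    """Length of the gradient vector at an interior pixel."""
    gx, gy = gradient_at(image, x, y)
    return math.hypot(gx, gy)


if __name__ == "__main__":
    # Intensity increases from left to right and is constant from top to bottom.
    ramp = [
        [0, 10, 20],
        [0, 10, 20],
        [0, 10, 20],
    ]
    print(gradient_at(ramp, 1, 1), gradient_magnitude(ramp, 1, 1))
```

Border pixels are deliberately left out of this sketch; a real filter has to decide how to treat them, for example by padding the image.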
At each image point, the gradient vector points in the direction of largest possible intensity increase, and the length of the gradient vector corresponds to the rate of change in that direction.

Concept: Niblack / Sauvola
Comment: Niblack and Sauvola thresholds are local thresholding techniques that are useful for images where the background is not uniform, especially for text recognition. Instead of calculating a single global threshold for the entire image, a separate threshold is calculated for every pixel by using specific formulae that take into account the mean and standard deviation of the local neighborhood (defined by a window centered around the pixel).

Concept: Otsu
Comment: [Wikipedia] In computer vision and image processing, Otsu's method, named after Nobuyuki Otsu, is used to automatically perform clustering-based image thresholding, or the reduction of a gray-level image to a binary image. The algorithm assumes that the image contains two classes of pixels following a bi-modal histogram (foreground pixels and background pixels); it then calculates the optimum threshold separating the two classes so that their combined spread (intra-class variance) is minimal, or equivalently (because the sum of pairwise squared distances is constant), so that their inter-class variance is maximal.

Concept: RGB
Comment: [Wikipedia] The RGB color model is an additive color model in which red, green and blue light are added together in various ways to reproduce a broad array of colors. The name of the model comes from the initials of the three additive primary colors, red, green and blue.

The main purpose of the RGB color model is the sensing, representation and display of images in electronic systems, such as televisions and computers, though it has also been used in conventional photography. Before the electronic age, the RGB color model already had a solid theory behind it, based on human perception of colors.
To form a color with RGB, three light beams (one red, one green and one blue) must be superimposed (for example by emission from a black screen or by reflection from a white screen). Each of the three beams is called a component of that color, and each of them can have an arbitrary intensity, from fully off to fully on, in the mixture.

Zero intensity for each component gives the darkest color (no light, considered black), and full intensity of each gives a white; the quality of this white depends on the nature of the primary light sources, but if they are properly balanced, the result is a neutral white matching the system's white point.

When the intensities for all the components are the same, the result is a shade of gray, darker or lighter depending on the intensity. When the intensities are different, the result is a colorized hue, more or less saturated depending on the difference between the strongest and weakest of the intensities of the primary colors employed.

When one of the components has the strongest intensity, the color is a hue near this primary color (reddish, greenish or bluish), and when two components have the same strongest intensity, the color is a hue of a secondary color (a shade of cyan, magenta or yellow).

Concept: RGBA
Comment: [Wikipedia] RGBA stands for red green blue alpha. While it is sometimes described as a color space, it is actually simply a use of the RGB color model, with extra alpha channel information. The color is RGB, and may belong to any RGB color space, but an integral alpha value as invented by Catmull and Smith between 1971 and 1972 enables alpha compositing.

The alpha channel is normally used as an opacity channel. If a pixel has a value of 0% in its alpha channel, it is fully transparent (and, thus, invisible), whereas a value of 100% in the alpha channel gives a fully opaque pixel (traditional digital images).
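To make the role of the alpha value concrete, the following sketch blends one foreground pixel over an opaque background pixel. It assumes 8-bit RGB components and an alpha expressed as a fraction between 0.0 (0%) and 1.0 (100%); the function name composite_over is hypothetical and the snippet is an illustration, not code from cbrTekStraktor.

```python
def composite_over(foreground, alpha, background):
    """Blend an RGB foreground with the given opacity over an opaque RGB background.

    foreground and background are (r, g, b) tuples with values 0..255;
    alpha is the foreground opacity as a fraction between 0.0 and 1.0.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0.0 and 1.0")
    return tuple(
        # Weighted mix per component; adding 0.5 rounds to the nearest integer.
        int(alpha * fg + (1.0 - alpha) * bg + 0.5)
        for fg, bg in zip(foreground, background)
    )


if __name__ == "__main__":
    red = (255, 0, 0)
    blue = (0, 0, 255)
    for alpha in (0.0, 0.5, 1.0):
        print(alpha, composite_over(red, alpha, blue))
```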
Values between 0% and 100% make it possible for pixels to show through a background like a glass, an effect not possible with simple binary (transparent or opaque) transparency. It allows easy image compositing.

Concept: Sobel
Comment: [Wikipedia] The Sobel operator, sometimes called the Sobel–Feldman operator or Sobel filter, is used in image processing and computer vision, particularly within edge detection algorithms where it creates an image emphasizing edges. Technically, it is a discrete differentiation operator, computing an approximation of the gradient of the image intensity function. At each point in the image, the result of the Sobel–Feldman operator is either the corresponding gradient vector or the norm of this vector. The Sobel–Feldman operator is based on convolving the image with a small, separable, and integer-valued filter in the horizontal and vertical directions and is therefore relatively inexpensive in terms of computations.

Chapter D

HowTo: Install Google Inception on Windows

Summary

TensorFlow is most often used on Linux. It is however possible to install and use TensorFlow in CPU mode on Windows. This appendix describes how to install Google TensorFlow locally on a Windows 64-bit based operating system, e.g.

- Windows 7 Service Pack 1
- Windows 10

Caveat

At the time of writing, TensorFlow only works with Python 3.5 and 3.6.
Source: https://stackoverflow.com/questions/38896424/tensorflow-not-found-in-pip

Install Python

Prerequisites

If you are using Windows 7, you need to have Service Pack 1 installed, otherwise the installation will not even start.

Some authors state that TensorFlow only works on Python 3.5.2, so better be safe than sorry and stick to this Python version.

TensorFlow is installed via the Python package installer pip3. You need pip3 version 1.8 in particular, which is not part of the Python 3.5.2 installation, so pip3 will need to be upgraded.

Download

Download version 3.5.2.
From https://www.python.org/downloads/release/python352/, select the “Windows x86-64 executable installer” file.

Perform the Python installation via “Run as administrator”.

Optionally tick the following boxes:

- Install for all users. Python 3.5.2 will then be installed in “c:\Program Files\Python35” rather than in your home folder. You might also opt to store Python in a bespoke folder and hence maintain multiple versions of Python on your computer.
- “Update PATH” at the beginning of the installation process. This is not necessary for cbrTekStraktor, because the integration scripts explicitly set the PATH.

Install TensorFlow

Next, we are going to install the TensorFlow packages via “pip3”.

Prepare TensorFlow installation

Create a Windows Batch or Command file (.bat/.cmd). The idea is to extend the PATH and then open another shell in which Python will be executed from the command line. During installation of TensorFlow, you will be warned that the Scripts folder is not included on the PATH, so let us add this to the batch file as well.

    REM Folder in which Python 3.5.2 was installed
    SET "PYTHON_HOME=c:\Program Files\Python35"
    REM Put Python and its Scripts folder (pip3) in front of the existing PATH
    SET "PATH=%PYTHON_HOME%;%PYTHON_HOME%\Scripts;%PATH%"
    REM Open a new shell that inherits the extended PATH
    cmd.exe
    pause

Note. If during the installation you opted to have the PATH updated, then you only need to extend the PATH with the %PYTHON_HOME%\Scripts folder.

Upgrade pip

Run the batch file (use “Run as administrator”, so you will have extended privileges when upgrading pip).

From the command prompt, upgrade pip via the following command:

    python -m pip install --upgrade pip

Note. You might get a message stating that you are on pip3 version 1.8 and a suggestion to upgrade pip to 1.10. Do not perform this upgrade; TensorFlow only works on pip3 1.8.

Note. Although pip3 is to be upgraded, the package name is pip and not pip3.

Install TensorFlow

Caveats: The current version of TensorFlow is V1.8.0. This version however appears to be incompatible with Python 3.5.2. TensorFlow 1.4 has successfully been tested with Python 3.5.2 and pip3 1.8.
From the command line:

    pip3 install --upgrade tensorflow==1.4

Perform a smoke test

Just run Python from the command line:

    >>> import tensorflow as tf
    >>> hello = tf.constant('Hello, TensorFlow!')
    >>> sess = tf.Session()
    >>> print(sess.run(hello))

You will know whether the installation is successful as soon as you enter the first command (i.e. import tensorflow as tf). The error messages provided by TensorFlow are extensive. The following typical issues might occur.

(A) msvcp140.dll is missing

Visual C++ 2015 redistributable DLLs are missing (msvcp140.dll).

The msvcp140.dll file is part of the Microsoft Visual C++ 2015 Redistributable 64-bit component. It is located in C:\Windows\System32.

The TensorFlow error message will provide the URL where you can download the Visual C++ redistributable package:
https://www.microsoft.com/en-us/download/details.aspx?id=53587

Download the installer (vc_redist.exe) and install it via “Run as administrator”.

Note. This issue has only been observed on Windows 7 SP1.

(B) Libraries which fail to be loaded, e.g.

    python\pywrap_tensorflow_internal.py", line 18, in swig_import_helper

(followed by a complete stack trace)

The issue appears to be related to the TensorFlow 1.8 version and the type of CPU of your computer (a.k.a. the AVX processor issue).
https://github.com/tensorflow/tensorflow/issues/17386

The issue can be solved by downgrading to a version of TensorFlow known to work on Python 3.5.2, for example V1.4. You might then gradually uninstall TensorFlow and upgrade to 1.5 and so forth (or just stick to V1.4).
TensorFlow can be uninstalled by:

    pip3 uninstall tensorflow

Force a reinstall:

    pip3 install --upgrade --no-deps --force-reinstall tensorflow==1.5

Re-do the smoke test

Run Python:

    >>> import tensorflow as tf
    >>> hello = tf.constant('Hello, TensorFlow!')
    >>> sess = tf.Session()
    >>> print(sess.run(hello))

You are all set (hopefully).

Retrain script issues

The retrain.py script is no longer part of the TensorFlow examples on GitHub. To make things worse, the replacement script appears to be incompatible with the current TensorFlow version. The issues are related to the tensorflow_hub module.

Previous versions of this script are however still available on GitHub, so go back in time until you find a functioning one.

An easier approach: a functioning copy of retrain.py can be found here:
https://raw.githubusercontent.com/tensorflow/tensorflow/r1.1/tensorflow/examples/image_retraining/retrain.py

Source: https://stackoverflow.com/questions/41433282/tensorflow-retrain-on-windows

Space Detective - Issue 4
Published April 1952 by Avon Publications. A groovy funny mystery book, cover art by Gene Fawcette and artwork by Gerald McCann.
Source Exif Data:
File Type: PDF
File Type Extension: pdf
MIME Type: application/pdf
PDF Version: 1.5
Linearized: No
Page Count: 115
Language: nl-BE
Tagged PDF: Yes
Author: Berton Koen
Creator: Microsoft® Office Word 2007
Create Date: 2018:05:27 10:42:24
Modify Date: 2018:05:27 10:42:24
Producer: Microsoft® Office Word 2007