Manual Textextractorx Textextractor
User Manual: Pdf
Open the PDF directly: View PDF .
Page Count: 8
Download | |
Open PDF In Browser | View PDF |
MANUAL TEXTEXTRACTOR Regular Expressions in Textfiles with Python (Windows Version) Jasper van Boven Radboud Universiteit Table of Contents 1. 2. 3. What does the script do? ..............................................................................................2 How to prepare my data for the script? ........................................................................2 Filetypes: ...................................................................................................................................... 2 Filenames: .................................................................................................................................... 2 Map with files............................................................................................................................... 2 How do regular expressions work? ...............................................................................2 Regular expressions language ...................................................................................................... 2 Basic expressions .......................................................................................................................... 3 One word: ................................................................................................................................. 3 One of the following words: ...................................................................................................... 3 Combinations of words: ............................................................................................................ 3 Words that can be written differently: ...................................................................................... 4 The easy way out: ..................................................................................................................... 4 Combined: ................................................................................................................................ 5 Expressionsfile:............................................................................................................................. 5 4. How does the script work? ............................................................................................5 5. FAQ & Troubleshoot ......................................................................................................7 1 1. What does the script do? This script enables the user to look up certain words and combinations of words in batches of text documents in an automatic way. This automated script brings contentanalysis to the next level by analyzing amounts of documents that would have been too much to do manually. 2. How to prepare my data for the script? Filetypes: The script only works with textdocuments (.txt). So make sure all the worddocuments , pdffiles etc. are converted to text. There are multiple ways achieve this. Click here to go to a site that gives you multiple options. Filenames: It is necessary to remove all spaces from the names of the documents you want to analyze. You can achieve this by using bash scripts in the command prompt window or use an application like this one and follow the manual they have put online. Map with files The script needs a map with the files as input. Make sure this map only contains the textdocuments you want analyzed. Remove all other files from this map. It is recommended to make sure the name of the map does not contain any spaces. 3. How do regular expressions work? Regular expressions language In order to search analyze the textdocuments, the script uses so called Regular Expressions. Regular Expressions are a universal formal computer language. Based on a code it enables the script to look for certain words, multiple words or combination of words in sentences. Following this link you find a book giving very clear explanations on the Regular Expression code. 2 Basic expressions In the book which has been mentioned in the previous part, a few basic regular expressions are mentioned that are very useful in the content analysis as aimed for with this script. In the next part these will be explained. The expressions always start and end with a basic code and then the words you need can be inserted between the brackets. The basic code makes sure the whole sentence is selected and gets written in one of the output files. One word: ^.*\b(test)\b.*$ This expressions looks for one word. The word “test” can be replaced by any word you want to look for in the textdocuments. One of the following words: ^.*\b(one|two|three)\b.*$ With this expression you can look for one of the following words. The piper can be interpreted as “or”. So this one would say: “This sentence is a hit if one or two or three is mentioned in the sentence.” Combinations of words: ^(?=.*?\b(one)\b)(?=.*?\b(two)\b)(?=.*?\b(three)\b).*$ This expression already becomes a bit more complex. This one searches in both one, and two, and three are found in the sentence. By removing (?=.*?\b(three)\b) for example, i twill look for just one and two. By adding (?=.*?\b(four)\b) directly after (?=.*?\b(three)\b) 3 the expressions will also look for the appearance of four in combination with the others. The order of appearance does not matter. Words that can be written differently: Some words can be written differently or conjugated in different ways. Still you want to use an efficient way to look for all these forms of the word. The Regular Expression language is built for this. Let’s look at the word information, it would end up something like this: Inform(ed|ing|ation)? By placing “|”, “?” and “( )” on strategic places the expression can be made smart. The above expressions always looks for “inform” but alsof for “informed”, “informing” and “information”. The parts between the brackets says to look for “ed” or “ing” of “ation” in combination with inform. The questionmark after the closing bracket tells that it is not necessary to find one of these three perse, so that only “inform” will also be a hit. In the book, the explanation of the code goed way deeper. Take a good look at that and don’t feel overwhelmed. It is a great skill to master and you can use the basics within a few tries. Conclusion: With ( ) you can make groups With ? the group of single character before becomes optional with | you can make a distinction between multiple options The easy way out: The easy way, however way less efficient and esthetically appealing is to just put the options in one long row like this: Inform|informed|information|informing 4 Combined: These codes can all be combined into one big expression that looks for multiple combinations of groups and words. The following is one to strive for in your own analysis ^(?=.*?\b((works?)?council)\b)(?=.*?\b(month(ly)?)\b)(?=.*?\b(Inform(ed|ing|ation) ?)\b).*$ In other words: look for sentences where the workcouncil, workscouncil or council gets some kind of information or other form every month or monthly. Expressionsfile: In order to use the expressions they need to be put in a textfile. One of the best applications in Windows is “Notepad++”. Put each expressions in its own individual row respectively which will result in something like this: 4. How does the script work? The script is written in the computer language “Python” and executed in the command prompt which is a part of every windows distribution. The one necessity is to have Python 2 installed on your machine. If it is not on your computer yet you can find it following the next link: https://www.python.org/downloads/ Make sure to install Python 2.7.XX because it does NOT work in Python 3. 5 To run the scipt you have to drag the file “textextractor.py” to the command prompt window. Press enter and you will get the following lines: “[-] drag the map with your files here” and “path to map:”. Drag the map with your textdocuments to the command prompt window and the path to the map will show up in the window. Depending on your version of windows the path might need a backslash “\” at the end of the path. Press enter and you will go to the “save in map” section. If you get an error, try again and this time check for typos, the backslash at the end and try to add or remove the “ “ at the beginning and end of the path. The window will now display “[-] drag the map where you want to save the output here” and “save in map:”. Drag the map where you want to save the output. Use another map than the one with the textdocuments. Depending on your version of windows the path might need a backslash “\” at the end of the path. Press enter and you will go to the “regexfiles” section. If you get an error, try again and this time check for typos, the backslash at the end and try to add or remove the “ “ at the beginning and end of the path. Now drag the textfile with your expressions to the window. Press enter and all the regular expressions you created appear. They might look different, because Python automatically transformed the expressions to the preference of the Python language. If the expressions do not appear, an error in the path occurred. Try again and this time try to add or remove the “ “ at the beginning and end of the path and check for typos. Press enter again and the analysis will start. Depending on the amount of textdocuments and regular expressions it might take a while. If you get an error, there is an problem with you regular expressions. Take a good look at them and identify the error. Tip: run the expressions individually on a small test batch to identify the one with the mistake. The script createst the textfiles (count and output) where the results are stored. Count is the list where each document and its matched expression are saved. This one is perfect for import in excel and transform it using a pivot table. In output you find all the sentences that have been identified, which might be useful to get more insight into the contents of you analysis. 6 5. FAQ & Troubleshoot The main errors happen through either typos in the regular expressions or problems with the paths to the maps and file with regular expressions. I suggest to just try the suggestions that have been written down at each specific part of this manual. No other errors are currently known. This script is licensed under the MIT license Jasper van Boven 2018 7
Source Exif Data:
File Type : PDF File Type Extension : pdf MIME Type : application/pdf PDF Version : 1.5 Linearized : Yes Create Date : 2018:04:19 14:12:44Z Creator : Word Modify Date : 2018:04:19 16:12:58+02:00 XMP Toolkit : Adobe XMP Core 5.6-c015 84.159810, 2016/09/10-02:41:30 Creator Tool : Word Metadata Date : 2018:04:19 16:12:58+02:00 Keywords : Producer : Adobe Mac PDF Plug-in Format : application/pdf Title : Microsoft Word - manual textextractor.docx Document ID : uuid:8c75c8a3-33ae-5e49-9292-a13ea226842a Instance ID : uuid:3ce8a96a-744f-0e4f-88e6-a12a1940300b Page Count : 8EXIF Metadata provided by EXIF.tools