SentiStrength Java Manual
SentiStrength Java User Manual

This document describes the main tasks and options for the Java version of SentiStrength. Java must be installed on your computer. SentiStrength can then be run via the command prompt using a command like:

java -jar SentiStrength.jar sentidata C:/SentiStrength_Data/ text i+don't+hate+you.

Contents

SentiStrength Java User Manual
Quick start
    Windows, Linux
    Mac
Sentiment classification tasks
    Classify a single text
    Classify all lines of text in a file for sentiment [includes accuracy evaluations]
    Classify texts in a column within a file or folder
    Listen at a port for texts to classify
    Run interactively from the command line
    Process stdin and send to stdout
    Import the JAR file to run within your Java program
Improving the accuracy of SentiStrength
    Basic manual improvements
    Optimise sentiment strengths of existing sentiment terms
    Suggest new sentiment terms (from terms in misclassified texts)
Options:
    Explain the classification
    Only classify text near specified keywords
    Classify positive (1 to 5) and negative (-1 to -5) sentiment strength separately
    Use trinary classification (positive-negative-neutral)
    Use binary classification (positive-negative)
    Use a single positive-negative scale classification
    Location of linguistic data folder
    Location of sentiment term weights
    Location of output folder
    File name extension for output
    Classification algorithm parameters
Additional considerations
    Language issues
    Long texts
    Machine learning evaluations
    Evaluation options

Quick start

Windows, Linux

1. Save SentiStrength.jar to your main computer Desktop and unzip SentiStrength_Data.zip to a folder on your main Desktop called SentiStrength_Data. If you open SentiStrength_Data you should see all the input files (you can also run from a USB drive or elsewhere).
2. Unzip the downloaded SentiStrength text files from the zip file into a new folder – a subfolder of the Desktop folder is easiest.
3. Click the Windows start button, type cmd and then select cmd.exe to start a command prompt. Use Terminal for Linux (Ctrl-Alt-T).
4. (The tricky bit) At the command prompt, navigate to the folder containing SentiStrength.jar by (Windows) entering the drive letter followed by a colon to change the default directory to that drive. Then type cd [name] with the name of the folder containing SentiStrength. More information here (Windows) if you get stuck: http://www.digitalcitizen.life/command-prompt-how-use-basic-commands
5. Test SentiStrength with the following command, where the path of the SentiStrength data folder will need to be changed to the name on your computer (Windows tip: commands can be pasted to the command prompt with the right-click menu).

java -jar SentiStrength.jar sentidata D:/sent/SentiStrength_Data/ text i+like+you. explain

Mac

1. Save SentiStrength.jar to your main computer Desktop and unzip SentiStrength_Data.zip to a folder on your main Desktop called SentiStrength_Data. If you open SentiStrength_Data you should see all the input files (you can also run from a USB drive or elsewhere).
2. Unzip the downloaded SentiStrength text files from the zip file into a new folder – a subfolder of the Desktop folder is easiest.
3. Start Terminal for Macs (Applications|Utilities).
4.
In the terminal window, type the following (case-sensitive) command and press return to navigate to the Desktop (i.e., where SentiStrengthCom.jar is).

cd Desktop

5. Test SentiStrength with the following command.

java -jar SentiStrength.jar sentidata SentiStrength_Data/ text i+like+you. explain

Sentiment classification tasks

SentiStrength can classify individual texts or multiple texts and can be invoked in many different ways. This section covers these methods, although most users only need one of them.

Classify a single text

text [text to process]

The submitted text will be classified and the result returned in the form +ve -space- -ve. If the classification method is trinary, binary or scale then the result will have the form +ve -space- -ve -space- overall. E.g.,

java -jar SentiStrength.jar sentidata C:/SentiStrength_Data/ text i+love+your+dog.

The result will be: 3 -1

Classify all lines of text in a file for sentiment [includes accuracy evaluations]

input [filename]

Each line of [filename] will be classified for sentiment. Here is an example.

java -jar SentiStrength.jar sentidata C:/SentiStrength_Data/ input myfile.txt

A new file will be created with the sentiment classifications added to the end of each line. If the task is to test the accuracy of SentiStrength, then the file may have +ve codes in the 1st column, negative codes in the 2nd column and text in the last column. If using binary/trinary/scale classification then the first column can contain the human-coded values. Columns must be tab-separated. If human-coded sentiment scores are included in the file then the accuracy of SentiStrength will be compared against them.

Classify texts in a column within a file or folder

For each line, the text in the specified column will be extracted and classified, with the result added to an extra column at the end of the file (all three parameters are compulsory).

annotateCol [col # 1..]
(classify text in col, result at line end)
inputFolder [foldername] (all files in folder will be *annotated*)
fileSubstring [text] (string must be present in files to annotate)
overwrite (OK to overwrite the files)

If a folder is specified instead of a filename (i.e., an inputFolder parameter) then all files in the folder are processed as above. If a fileSubstring value is specified, then only files matching the substring will be classified. The parameter overwrite must be specified to explicitly allow the input files to be modified. This is purely a safety feature. E.g.,

java -jar SentiStrength.jar sentidata C:/SentiStrength_Data/ annotateCol 1 inputFolder C:/textfiles/ fileSubstring txt

Listen at a port for texts to classify

listen [port number to listen at]

This sets the program to listen at a port number for texts to classify, e.g., to listen at port 81 for texts for trinary classification:

java -jar SentiStrength.jar sentidata C:/SentiStrength_Data/ listen 81 trinary

The texts must be URL-encoded and submitted as part of the URL. E.g., if listening was set up on port 81 then requesting the following URL would trigger classification of the text "love you":

http://127.0.0.1:81/love%20you

The result for this would be 3 -1 1. This is: (+ve classification) (-ve classification) (trinary classification).

Run interactively from the command line

cmd (can also set options and the sentidata folder). E.g.,

java -jar c:\SentiStrength.jar cmd sentidata C:/SentiStrength_Data/

This allows the program to classify texts from the command prompt. After running this, every line you enter will be classified for sentiment. To finish, enter @end

Process stdin and send to stdout

stdin (can also set options and the sentidata folder). E.g.,

java -jar c:\SentiStrength.jar stdin sentidata C:/SentiStrength_Data/

SentiStrength will classify all texts sent to it from stdin and then will close. This is probably the most efficient way of integrating SentiStrength with non-Java programs.
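As a sketch of how another Java program could drive this stdin mode, the helper below launches the jar as a subprocess, writes one text per line, closes stdin (which tells SentiStrength to finish), and collects the annotated output lines. The jar path and data folder passed in are assumptions for illustration; adjust them for your system.

```java
import java.io.*;
import java.util.*;

// Sketch: driving SentiStrength's stdin mode from another program.
// The jar path and data folder arguments are illustrative assumptions.
public class StdinClient {

    // Build the command line for stdin mode (same form as the example above).
    static List<String> buildCommand(String jarPath, String dataFolder) {
        return Arrays.asList("java", "-jar", jarPath, "stdin", "sentidata", dataFolder);
    }

    // Send each text on its own line, close stdin so SentiStrength finishes,
    // then read the annotated result lines back from stdout.
    static List<String> classifyAll(String jarPath, String dataFolder, List<String> texts)
            throws IOException, InterruptedException {
        Process p = new ProcessBuilder(buildCommand(jarPath, dataFolder)).start();
        try (BufferedWriter w = new BufferedWriter(
                new OutputStreamWriter(p.getOutputStream(), "UTF-8"))) {
            for (String text : texts) {
                w.write(text);
                w.newLine();
            }
        } // closing the stream signals the end of the input
        List<String> results = new ArrayList<>();
        try (BufferedReader r = new BufferedReader(
                new InputStreamReader(p.getInputStream(), "UTF-8"))) {
            String line;
            while ((line = r.readLine()) != null) {
                results.add(line);
            }
        }
        p.waitFor();
        return results;
    }
}
```

For large batches, write and read on separate threads so the operating-system pipe buffer does not fill up and block the writer.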
The alternatives are the listen at a port option, or dumping the texts to be classified into a file and then running SentiStrength on the file. The parameter textCol can be set (default 0 for the first column) if the data is sent in multiple tab-separated columns and one column contains the text to be classified. The results will be appended to the end of the input data and sent to stdout. The Java loop code for this is essentially:

while ((textToParse = stdin.readLine()) != null) {
    //code to analyse sentiment and return results
}

So for greatest efficiency, null should not be sent to stdin as this will close the program.

Import the JAR file to run within your Java program

Import the JAR and initialise it by sending commands to public static void main(String[] args) in public class SentiStrength, and then call public String computeSentimentScores(String sentence), also from public class SentiStrength, to get each text processed. Here is some sample code for after importing the JAR and creating a class:

package uk.ac.wlv.sentistrengthapp; //Whatever package name you choose

import uk.ac.wlv.sentistrength.*;

public class SentiStrengthApp {
    public static void main(String[] args) {
        //Method 1: one-off classification (inefficient for multiple classifications)
        //Create an array of command line parameters, including the text or file to process
        String ssthInitialisationAndText[] = {"sentidata", "f:/SentiStrength_Data/", "text", "I+hate+frogs+but+love+dogs.", "explain"};
        SentiStrength.main(ssthInitialisationAndText);

        //Method 2: one initialisation and repeated classifications
        SentiStrength sentiStrength = new SentiStrength();
        //Create an array of command line parameters to send (not the text or file to process)
        String ssthInitialisation[] = {"sentidata", "f:/SentiStrength_Data/", "explain"};
        sentiStrength.initialise(ssthInitialisation); //Initialise
        //Can now calculate sentiment scores quickly without having to initialise again
        System.out.println(sentiStrength.computeSentimentScores("I hate frogs."));
        System.out.println(sentiStrength.computeSentimentScores("I love dogs."));
    }
}

To instantiate multiple classifiers, you can start and initialise each one separately.

SentiStrength classifier1 = new SentiStrength();
SentiStrength classifier2 = new SentiStrength();
//Also need to initialise both, as above
String ssthInitialisation1[] = {"sentidata", "f:/SentiStrength_Data/", "explain"};
classifier1.initialise(ssthInitialisation1); //Initialise
String ssthInitialisation2[] = {"sentidata", "f:/SentiStrength_Spanish_Data/"};
classifier2.initialise(ssthInitialisation2); //Initialise
//After initialisation, can call both whenever needed:
String result_from_classifier1 = classifier1.computeSentimentScores(input);
String result_from_classifier2 = classifier2.computeSentimentScores(input);

Note: if using Eclipse, the JAR can be imported into your project via the build path (there are also other ways).

Improving the accuracy of SentiStrength

Basic manual improvements

If you see a systematic pattern in the results, such as the term "disgusting" typically having a stronger or weaker sentiment strength in your texts than given by SentiStrength, then you can edit SentiStrength's input files to change this. Please edit SentiStrength's input files using a plain text editor, because if a file is edited with a word processor then SentiStrength may not be able to read it afterwards.

Optimise sentiment strengths of existing sentiment terms

SentiStrength can suggest revised sentiment strengths for the EmotionLookupTable.txt in order to give more accurate classifications for a given set of texts. This option needs a large (>500) set of texts in a plain text file with a human sentiment classification for each text. SentiStrength will then try to adjust the EmotionLookupTable.txt term weights to be more accurate when classifying these texts. It should then also be more accurate when classifying similar texts.

optimise [Filename for optimal term strengths (e.g.
EmotionLookupTable2.txt)]

This creates a new emotion lookup table with improved sentiment weights, based upon an input file with human-coded sentiment values for the texts. This feature allows SentiStrength term weights to be customised for new domains. E.g.,

java -jar c:/SentiStrength.jar minImprovement 3 input C:/twitter4242.txt optimise C:/twitter4242OptimalSentimentLookupTable.txt

This is very slow (hours or days) if the input file is large (hundreds of thousands or millions of texts, respectively). The main optional parameter is minImprovement (default value 2). Set this to specify the minimum overall number of additional correct classifications needed to change a sentiment term weighting. For example, if increasing the sentiment strength of love from 3 to 4 improves the number of correctly classified texts from 500 to 502, then this change would be kept if minImprovement was 1 or 2 but rejected if minImprovement was >2. Set this higher to make more robust changes to the dictionary. Higher settings are possible with larger input files. To check the performance of the new dictionary, the file could be reclassified using it instead of the original SentimentLookupTable.txt as follows:

java -jar c:/SentiStrength.jar input C:/twitter4242.txt EmotionLookupTable C:/twitter4242OptimalSentimentLookupTable.txt

Suggest new sentiment terms (from terms in misclassified texts)

SentiStrength can suggest a new set of terms to add to the EmotionLookupTable.txt in order to give more accurate classifications for a given set of texts. This option needs a large (>500) set of texts in a plain text file with a human sentiment classification for each text. SentiStrength will then list words not found in the EmotionLookupTable.txt that may indicate sentiment. Adding some of these terms should make SentiStrength more accurate when classifying similar texts.

termWeights

This lists all terms in the data set and the proportion of times they are in incorrectly classified positive or negative texts.
Load this into a spreadsheet and sort on PosClassAvDiff and NegClassAvDiff to identify terms that should be added to the sentiment dictionary because one of these two values is high. This option also lists words that are already in the sentiment dictionary. It must be used with a text file containing correct classifications. E.g.,

java -jar c:/SentiStrength.jar input C:/twitter4242.txt termWeights

This is very slow (hours or days) if the input file is large (tens of thousands or millions of texts, respectively).

Interpretation: in the output file, the column PosClassAvDiff means the average difference between the predicted sentiment score and the human-classified sentiment score for texts containing the word. For example, if the word "nasty" was in two texts and SentiStrength had classified them both as +1,-3 but the human classifiers had classified the texts as (+2,-3) and (+3,-5), then PosClassAvDiff would be the average of 2-1 (first text) and 3-1 (second text), which is 1.5. All the negative scores are ignored for PosClassAvDiff. NegClassAvDiff is the same as PosClassAvDiff except for the negative scores.

Options:

Explain the classification

explain

Adding this parameter to most of the options results in an approximate explanation being given for the classification. E.g.,

java -jar SentiStrength.jar text i+don't+hate+you. explain

Only classify text near specified keywords

keywords [comma-separated list - sentiment only classified close to these]
wordsBeforeKeywords [words to classify before keyword (default 4)]
wordsAfterKeywords [words to classify after keyword (default 4)]

Classify positive (1 to 5) and negative (-1 to -5) sentiment strength separately

This is the default and is used unless binary, trinary or scale is selected. Note that 1 indicates no positive sentiment and -1 indicates no negative sentiment. There is no output of 0.
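A caller reading these result strings can split them into integer scores. The sketch below handles the default two-number output as well as the three-number trinary/binary/scale form; ResultParser is a hypothetical helper for illustration, not part of SentiStrength.

```java
// Sketch: splitting a SentiStrength result line such as "3 -1" or "3 -1 1"
// into integer scores: positive, negative, and (if present) overall.
public class ResultParser {

    public static int[] parse(String resultLine) {
        // Result columns may be separated by spaces or tabs.
        String[] parts = resultLine.trim().split("\\s+");
        int[] scores = new int[parts.length];
        for (int i = 0; i < parts.length; i++) {
            scores[i] = Integer.parseInt(parts[i]);
        }
        return scores;
    }
}
```

For example, parse("3 -1") yields {3, -1}: positive strength 3 and negative strength -1.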
Use trinary classification (positive-negative-neutral) trinary (report positive-negative-neutral classification instead) The result for this would be like 3 -1 1. This is: (+ve classiicatons (-ve classiicatons (trinary classiicatons Use binary classification (positive-negative) binary (report positive-negative classification instead) The result for this would be like 3 -1 1. This is: (+ve classiicatons (-ve classiicatons (binary classiicatons Use a single positive-negative scale classification scale (report single -4 to +4 classification instead) The result for this would be like 3 -4 -1. This is: (+ve classiicatons (-ve classiicatons (scale classiicatons 9 Location of linguistic data folder sentidata [folder for SentiStrength data (end in slash, no spaces)] Location of sentiment term weights EmotionLookupTable [filename (default: EmotionLookupTable.txt or SentimentLookupTable.txt)]. Location of output folder outputFolder [foldername where to put the output (default: folder of inputss File name extension for output resultsextension [ile-extension for output (default _out.txtss Classification algorithm parameters These optons change how the sentment analysis algorithm works. 
alwaysSplitWordsAtApostrophes (split words when an apostrophe is met – important for languages that merge words with ', like French; e.g., t'aime -> t ' aime with this option, t'aime without)
noBoosters (ignore sentiment booster words (e.g., very))
noNegatingPositiveFlipsEmotion (don't use negating words to flip +ve words)
noNegatingNegativeNeutralisesEmotion (don't use negating words to neuter -ve words)
negatedWordStrengthMultiplier (strength multiplier when negated (default=0.5))
maxWordsBeforeSentimentToNegate (max words between negator & sentiment word (default 0))
noIdioms (ignore idiom list)
questionsReduceNeg (-ve sentiment reduced in questions)
noEmoticons (ignore emoticon list)
exclamations2 (exclamation marks count as +2 if not a -ve sentence)
mood [-1,0,1] (interpretation of neutral emphasis (e.g., miiike; hello!!). -1 means neutral emphasis is interpreted as -ve; 1 means it is interpreted as +ve; 0 means emphasis is ignored)
noMultiplePosWords (don't allow multiple +ve words to increase +ve sentiment)
noMultipleNegWords (don't allow multiple -ve words to increase -ve sentiment)
noIgnoreBoosterWordsAfterNegatives (don't ignore boosters after negating words)
noDictionary (don't try to correct spellings using the dictionary by deleting duplicate letters from unknown words to make known words)
noDeleteExtraDuplicateLetters (don't delete extra duplicate letters in words even when they are impossible, e.g., heyyyy) [this option does not check whether the new word is legal, in contrast to the above option]
illegalDoubleLettersInWordMiddle [letters never duplicated in word middles] - a list of characters that never occur twice in succession. For English the following list is used (default): ahijkquvxyz. Never include w in this list as it often occurs in www.
illegalDoubleLettersAtWordEnd [letters never duplicated at word ends] - a list of characters that never occur twice in succession at the end of a word.
For English the following list is used (default): achijkmnpqruvwxyz
noMultipleLetters (don't use the presence of additional letters in a word to boost sentiment)

Additional considerations

Language issues

If using a language with a character set that is not the standard ASCII collection, please save the input in UTF-8 format and use the utf8 option to get SentiStrength to read the input files as UTF-8. If using a European language like Spanish with diacritics, please try both with and without the utf8 option – depending on your system, one or the other might work (possibly due to an ANSI/ASCII coding issue with Windows).

Long texts

SentiStrength is designed for short texts but can be used for polarity detection on longer texts with the following options (see the binary and trinary options). This works similarly to Maite Taboada's SO-CAL program. In this mode, the total positive sentiment is calculated and compared to the total negative sentiment. If the total positive is bigger than 1.5 times the total negative sentiment then the classification is positive, otherwise it is negative. Why 1.5? Because negativity is rarer than positivity, so it stands out more (see the work of Maite Taboada).

java -jar SentiStrength.jar sentidata C:/SentiStrength_Data/ text I+hate+frogs+but+love+dogs.+Do+You+like. sentenceCombineTot paragraphCombineTot trinary

If you prefer a multiplier other than 1.5 then set it with the negativeMultiplier option. E.g., for a multiplier of 1 (equally weighting positive and negative texts) try:

java -jar SentiStrength.jar sentidata C:/SentiStrength_Data/ text I+hate+frogs+but+love+dogs.+Do+You+like. sentenceCombineTot paragraphCombineTot trinary negativeMultiplier 1

Machine learning evaluations

These are machine learning options to evaluate SentiStrength for academic research. The basic command is train.

train (evaluate SentiStrength by training term strengths on results in files).
An input file of 500+ human-classified texts is also needed - e.g.,

java -jar SentiStrength.jar train input C:\1041MySpace.txt

This attempts to optimise the sentiment dictionary using a machine learning approach and 10-fold cross-validation. This is equivalent to using the command optimise on a random 90% of the data, then evaluating the results on the remaining 10%, and repeating this 9 more times with the remaining 9 sections of 10% of the data. The accuracy results reported are the average of the 10 attempts. This estimates the improved accuracy gained from using the optimise command to improve the sentiment dictionary.

The output of this is two files. The file ending in _out.txt reports various accuracy statistics (e.g., number and proportion correct; number and proportion within 1 of the correct value; correlation between SentiStrength and human-coded values). The file ending in _out_termStrVars.txt reports the changes to the sentiment dictionary in each of the folds. Both files also report the parameters used for the sentiment algorithm and machine learning. See the "What do the results mean?" section at the end for more information.

Evaluation options

all - test all option variations listed in Classification Algorithm Parameters above rather than using the default options.

tot - optimise by the number of correct classifications rather than the sum of the classification differences.

iterations [number of 10-fold iterations (default 1)] - this sets the number of times that the training and evaluation is conducted. A value of 30 is recommended to help average out differences between runs.

minImprovement [min extra correct classifications to change sentiment weights (default 2)] - this sets the minimum number of extra correct classifications necessary to adjust a term weight during the training phase.

multi [# duplicate term strength optimisations to change sentiment weights (default 1)] - this is a kind of super-optimisation.
Instead of being optimised once, term weights are optimised multiple times from the starting values, and then the average of these weights is taken, optimised again, and used as the final optimised term strengths. This should in theory give better values than optimising once. E.g.,

java -jar SentiStrength.jar multi 8 input C:\1041MySpace.txt iterations 2

Example: Using SentiStrength for 10-fold cross-validation

What is this? This estimates the accuracy of SentiStrength after it has optimised the term weights for the sentiment words (i.e., the values in the file EmotionLookupTable.txt).

What do I need for this test? You need an input file that is a list of texts with human-classified values for positive (1-5) and negative (1-5) sentiment. Each line of the file should be in the format (tab-separated):

Positive Negative text

How do I run the test? Type the following command, replacing the filename with your own file name.

java -jar SentiStrength.jar input C:\1041MySpace.txt iterations 30

This should take up to one hour – much longer for longer files. The output will be a list of accuracy statistics for each 10-fold cross-validation.

What does 10-fold cross-validation mean? See the k-fold section in http://en.wikipedia.org/wiki/Cross-validation_(statistics). Essentially, it means that the same data is used to identify the best sentiment strength values for the terms in EmotionLookupTable.txt as is used to evaluate the accuracy of the revised (trained) algorithm – but this isn't cheating when it is done this way. The first line in the results file gives the accuracy of SentiStrength with the original term weights in EmotionLookupTable.txt.

What do the results mean? The easiest way to read the results is to copy and paste them into a spreadsheet like Excel. The table created lists the options used to classify the texts as well as the results. Here is an extract from the first two rows of the key results. It gives the total number correct for positive sentiment (Pos Correct) and the proportion correct (Pos Correct/Total).
It also reports the number of predictions that are correct or within 1 of being correct (Pos Within1). The same information is given for negative sentiment.

Pos Correct | Pos Correct/Total | Neg Correct | Neg Correct/Total | Pos Within1 | Pos Within1/Total | Neg Within1 | Neg Within1/Total
653         | 0.627281          | 754         | 0.724304          | 1008        | 0.9683            | 991         | 0.951969

Here is another extract of the first two rows of the key results. It gives the correlation between the positive sentiment predictions and the human-coded values for positive sentiment (Pos Corr) and the Mean Percentage Error (Pos MPEnoDiv). The same information is given for negative sentiment.

Pos Corr  | Neg Corr | Pos MPE     | Neg MPE     | Pos MPEnoDiv | Neg MPEnoDiv
0.638382  | 0.61354  | Ignore this | Ignore this | 0.405379     | 0.32853

If you specified 30 iterations then there will be 31 rows: one for the header and 1 for each iteration. Take the average of the rows as the value to use.

Thanks to Hannes Pirker at OFAI for some of the above code.
Author: Mike Thelwall (manual created 2018-05-11)