SentiStrength Java Manual
SentiStrength Java User Manual

This document describes the main tasks and options for the Java version of SentiStrength. Java must be installed on your computer. SentiStrength can then be run via the command prompt using a command like:

java -jar SentiStrength.jar sentidata C:/SentiStrength_Data/ text i+don't+hate+you.

Contents

SentiStrength Java User Manual
Quick start
    Windows, Linux
    Mac
Sentiment classification tasks
    Classify a single text
    Classify all lines of text in a file for sentiment [includes accuracy evaluations]
    Classify texts in a column within a file or folder
    Listen at a port for texts to classify
    Run interactively from the command line
    Process stdin and send to stdout
    Import the JAR file to run within your Java program
Improving the accuracy of SentiStrength
    Basic manual improvements
    Optimise sentiment strengths of existing sentiment terms
    Suggest new sentiment terms (from terms in misclassified texts)
Options:
    Explain the classification
    Only classify text near specified keywords
    Classify positive (1 to 5) and negative (-1 to -5) sentiment strength separately
    Use trinary classification (positive-negative-neutral)
    Use binary classification (positive-negative)
    Use a single positive-negative scale classification
    Location of linguistic data folder
    Location of sentiment term weights
    Location of output folder
    File name extension for output
    Classification algorithm parameters
Additional considerations
    Language issues
    Long texts
    Machine learning evaluations
    Evaluation options

Quick start

Windows, Linux

1. Save SentiStrength.jar to your main computer Desktop and unzip SentiStrength_Data.zip to a folder on your main Desktop called SentiStrength_Data. If you open SentiStrength_Data you should see all the input files (you can also run from a USB drive or elsewhere).
2. Unzip the downloaded SentiStrength text files from the zip file into a new folder – a subfolder of the Desktop folder is easiest.
3. Click the Windows start button, type cmd and then select cmd.exe to start a command prompt. Use Terminal for Linux (Ctrl-Alt-T).
4. (The tricky bit) At the command prompt, navigate to the folder containing SentiStrength.jar by (Windows) entering the drive letter followed by a colon to change the default directory to that drive. Then type cd [name] with the name of the folder containing SentiStrength. More information here (Windows) if you get stuck: http://www.digitalcitizen.life/command-prompt-how-use-basic-commands
5. Test SentiStrength with the following command, where the path of the SentiStrength data folder will need to be changed to the name on your computer (Windows tip: commands can be pasted to the command prompt with the right-click menu).

java -jar SentiStrength.jar sentidata D:/sent/SentiStrength_Data/ text i+like+you. explain

Mac

1. Save SentiStrength.jar to your main computer Desktop and unzip SentiStrength_Data.zip to a folder on your main Desktop called SentiStrength_Data. If you open SentiStrength_Data you should see all the input files (you can also run from a USB drive or elsewhere).
2. Unzip the downloaded SentiStrength text files from the zip file into a new folder – a subfolder of the Desktop folder is easiest.
3. Start Terminal for Macs (Applications|Utilities).
4.
In the terminal window, type the following (case-sensitive) command and press return to navigate to the Desktop (i.e., where SentiStrengthCom.jar is).

cd Desktop

5. Test SentiStrength with the following command.

java -jar SentiStrength.jar sentidata SentiStrength_Data/ text i+like+you. explain

Sentiment classification tasks

SentiStrength can classify individual texts or multiple texts and can be invoked in many different ways. This section covers these methods, although most users only need one of them.

Classify a single text

text [text to process]

The submitted text will be classified and the result returned in the form +ve -space- -ve. If the classification method is trinary, binary or scale then the result will have the form +ve -space- -ve -space- overall. E.g.,

java -jar SentiStrength.jar sentidata C:/SentiStrength_Data/ text i+love+your+dog.

The result will be: 3 -1

Classify all lines of text in a file for sentiment [includes accuracy evaluations]

input [filename]

Each line of [filename] will be classified for sentiment. Here is an example.

java -jar SentiStrength.jar sentidata C:/SentiStrength_Data/ input myfile.txt

A new file will be created with the sentiment classifications added to the end of each line. If the task is to test the accuracy of SentiStrength, then the file may have +ve codes in the 1st column, negative codes in the 2nd column and text in the last column. If using binary/trinary/scale classification then the first column can contain the human-coded values. Columns must be tab-separated. If human-coded sentiment scores are included in the file then the accuracy of SentiStrength will be compared against them.

Classify texts in a column within a file or folder

For each line, the text in the specified column will be extracted and classified, with the result added to an extra column at the end of the file (all three parameters are compulsory).

annotateCol [col # 1..]
(classify text in col, result at line end)
inputFolder [foldername] (all files in folder will be *annotated*)
fileSubstring [text] (string must be present in files to annotate)
overwrite (OK to overwrite the files)

If a folder is specified instead of a filename (i.e., an inputFolder parameter) then all files in the folder are processed as above. If a fileSubstring value is specified, then only files matching the substring will be classified. The parameter overwrite must be specified to explicitly allow the input files to be modified. This is purely a safety feature. E.g.,

java -jar SentiStrength.jar sentidata C:/SentiStrength_Data/ annotateCol 1 inputFolder C:/textfiles/ fileSubstring txt

Listen at a port for texts to classify

listen [port number to listen at]

This sets the program to listen at a port number for texts to classify, e.g., to listen at port 81 for texts for trinary classification:

java -jar SentiStrength.jar sentidata C:/SentiStrength_Data/ listen 81 trinary

The texts must be URL-encoded and submitted as part of the URL. E.g., if listening was set up on port 81 then requesting the following URL would trigger classification of the text "love you":

http://127.0.0.1:81/love%20you

The result for this would be 3 -1 1. This is: (+ve classification) (-ve classification) (trinary classification).

Run interactively from the command line

cmd (can also set options and the sentidata folder). E.g.,

java -jar c:\SentiStrength.jar cmd sentidata C:/SentiStrength_Data/

This allows the program to classify texts from the command prompt. After running this, every line you enter will be classified for sentiment. To finish, enter @end

Process stdin and send to stdout

stdin (can also set options and the sentidata folder). E.g.,

java -jar c:\SentiStrength.jar stdin sentidata C:/SentiStrength_Data/

SentiStrength will classify all texts sent to it from stdin and then will close. This is probably the most efficient way of integrating SentiStrength with non-Java programs.
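As a sketch of how another Java program could drive this stdin mode, the helper below launches the jar as a subprocess, writes one text per line, closes stdin (which tells SentiStrength to finish), and collects the annotated output lines. The jar path and data folder passed in are assumptions for illustration; adjust them for your system.

```java
import java.io.*;
import java.util.*;

// Sketch: driving SentiStrength's stdin mode from another program.
// The jar path and data folder arguments are illustrative assumptions.
public class StdinClient {

    // Build the command line for stdin mode (same form as the example above).
    static List<String> buildCommand(String jarPath, String dataFolder) {
        return Arrays.asList("java", "-jar", jarPath, "stdin", "sentidata", dataFolder);
    }

    // Send each text on its own line, close stdin so SentiStrength finishes,
    // then read the annotated result lines back from stdout.
    static List<String> classifyAll(String jarPath, String dataFolder, List<String> texts)
            throws IOException, InterruptedException {
        Process p = new ProcessBuilder(buildCommand(jarPath, dataFolder)).start();
        try (BufferedWriter w = new BufferedWriter(
                new OutputStreamWriter(p.getOutputStream(), "UTF-8"))) {
            for (String text : texts) {
                w.write(text);
                w.newLine();
            }
        } // closing the stream signals the end of the input
        List<String> results = new ArrayList<>();
        try (BufferedReader r = new BufferedReader(
                new InputStreamReader(p.getInputStream(), "UTF-8"))) {
            String line;
            while ((line = r.readLine()) != null) {
                results.add(line);
            }
        }
        p.waitFor();
        return results;
    }
}
```

For large batches, write and read on separate threads so the operating-system pipe buffer does not fill up and block the writer.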
The alternatives are the listen at a port option, or dumping the texts to be classified into a file and then running SentiStrength on the file. The parameter textCol can be set (default 0 for the first column) if the data is sent in multiple tab-separated columns and one column contains the text to be classified. The results will be appended to the end of the input data and sent to stdout. The Java loop code for this is essentially:

while ((textToParse = stdin.readLine()) != null) {
    //code to analyse sentiment and return results
}

So for greatest efficiency, null should not be sent to stdin as this will close the program.

Import the JAR file to run within your Java program

Import the JAR and initialise it by sending commands to public static void main(String[] args) in public class SentiStrength, and then call public String computeSentimentScores(String sentence), also from public class SentiStrength, to get each text processed. Here is some sample code for after importing the JAR and creating a class:

package uk.ac.wlv.sentistrengthapp; //Whatever package name you choose

import uk.ac.wlv.sentistrength.*;

public class SentiStrengthApp {
    public static void main(String[] args) {
        //Method 1: one-off classification (inefficient for multiple classifications)
        //Create an array of command line parameters, including the text or file to process
        String ssthInitialisationAndText[] = {"sentidata", "f:/SentiStrength_Data/", "text", "I+hate+frogs+but+love+dogs.", "explain"};
        SentiStrength.main(ssthInitialisationAndText);

        //Method 2: one initialisation and repeated classifications
        SentiStrength sentiStrength = new SentiStrength();
        //Create an array of command line parameters to send (not the text or file to process)
        String ssthInitialisation[] = {"sentidata", "f:/SentiStrength_Data/", "explain"};
        sentiStrength.initialise(ssthInitialisation); //Initialise
        //Can now calculate sentiment scores quickly without having to initialise again
        System.out.println(sentiStrength.computeSentimentScores("I hate frogs."));
        System.out.println(sentiStrength.computeSentimentScores("I love dogs."));
    }
}

To instantiate multiple classifiers, you can start and initialise each one separately.

SentiStrength classifier1 = new SentiStrength();
SentiStrength classifier2 = new SentiStrength();
//Also need to initialise both, as above
String ssthInitialisation1[] = {"sentidata", "f:/SentiStrength_Data/", "explain"};
classifier1.initialise(ssthInitialisation1); //Initialise
String ssthInitialisation2[] = {"sentidata", "f:/SentiStrength_Spanish_Data/"};
classifier2.initialise(ssthInitialisation2); //Initialise
//After initialisation, can call both whenever needed:
String result_from_classifier1 = classifier1.computeSentimentScores(input);
String result_from_classifier2 = classifier2.computeSentimentScores(input);

Note: if using Eclipse, the JAR can be imported into your project via the build path (there are also other ways).

Improving the accuracy of SentiStrength

Basic manual improvements

If you see a systematic pattern in the results, such as the term "disgusting" typically having a stronger or weaker sentiment strength in your texts than given by SentiStrength, then you can edit SentiStrength's input files to change this. Please edit SentiStrength's input files using a plain text editor, because if a file is edited with a word processor then SentiStrength may not be able to read it afterwards.

Optimise sentiment strengths of existing sentiment terms

SentiStrength can suggest revised sentiment strengths for the EmotionLookupTable.txt in order to give more accurate classifications for a given set of texts. This option needs a large (>500) set of texts in a plain text file with a human sentiment classification for each text. SentiStrength will then try to adjust the EmotionLookupTable.txt term weights to be more accurate when classifying these texts. It should then also be more accurate when classifying similar texts.

optimise [Filename for optimal term strengths (e.g.
EmotionLookupTable2.txt)]

This creates a new emotion lookup table with improved sentiment weights, based upon an input file with human-coded sentiment values for the texts. This feature allows SentiStrength term weights to be customised for new domains. E.g.,

java -jar c:/SentiStrength.jar minImprovement 3 input C:/twitter4242.txt optimise C:/twitter4242OptimalSentimentLookupTable.txt

This is very slow (hours or days) if the input file is large (hundreds of thousands or millions of texts, respectively). The main optional parameter is minImprovement (default value 2). Set this to specify the minimum overall number of additional correct classifications needed to change a sentiment term weighting. For example, if increasing the sentiment strength of love from 3 to 4 improves the number of correctly classified texts from 500 to 502, then this change would be kept if minImprovement was 1 or 2 but rejected if minImprovement was >2. Set this higher to make more robust changes to the dictionary. Higher settings are possible with larger input files. To check the performance of the new dictionary, the file could be reclassified using it instead of the original SentimentLookupTable.txt as follows:

java -jar c:/SentiStrength.jar input C:/twitter4242.txt EmotionLookupTable C:/twitter4242OptimalSentimentLookupTable.txt

Suggest new sentiment terms (from terms in misclassified texts)

SentiStrength can suggest a new set of terms to add to the EmotionLookupTable.txt in order to give more accurate classifications for a given set of texts. This option needs a large (>500) set of texts in a plain text file with a human sentiment classification for each text. SentiStrength will then list words not found in the EmotionLookupTable.txt that may indicate sentiment. Adding some of these terms should make SentiStrength more accurate when classifying similar texts.

termWeights

This lists all terms in the data set and the proportion of times they are in incorrectly classified positive or negative texts.
Load this into a spreadsheet and sort on PosClassAvDiff and NegClassAvDiff to identify terms that should be added to the sentiment dictionary because one of these two values is high. This option also lists words that are already in the sentiment dictionary. It must be used with a text file containing correct classifications. E.g.,

java -jar c:/SentiStrength.jar input C:/twitter4242.txt termWeights

This is very slow (hours or days) if the input file is large (tens of thousands or millions of texts, respectively).

Interpretation: in the output file, the column PosClassAvDiff means the average difference between the predicted sentiment score and the human-classified sentiment score for texts containing the word. For example, if the word "nasty" was in two texts and SentiStrength had classified them both as +1,-3 but the human classifiers had classified the texts as (+2,-3) and (+3,-5), then PosClassAvDiff would be the average of 2-1 (first text) and 3-1 (second text), which is 1.5. All the negative scores are ignored for PosClassAvDiff. NegClassAvDiff is the same as PosClassAvDiff except for the negative scores.

Options:

Explain the classification

explain

Adding this parameter to most of the options results in an approximate explanation being given for the classification. E.g.,

java -jar SentiStrength.jar text i+don't+hate+you. explain

Only classify text near specified keywords

keywords [comma-separated list - sentiment only classified close to these]
wordsBeforeKeywords [words to classify before keyword (default 4)]
wordsAfterKeywords [words to classify after keyword (default 4)]

Classify positive (1 to 5) and negative (-1 to -5) sentiment strength separately

This is the default and is used unless binary, trinary or scale is selected. Note that 1 indicates no positive sentiment and -1 indicates no negative sentiment. There is no output of 0.
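A caller reading these result strings can split them into integer scores. The sketch below handles the default two-number output as well as the three-number trinary/binary/scale form; ResultParser is a hypothetical helper for illustration, not part of SentiStrength.

```java
// Sketch: splitting a SentiStrength result line such as "3 -1" or "3 -1 1"
// into integer scores: positive, negative, and (if present) overall.
public class ResultParser {

    public static int[] parse(String resultLine) {
        // Result columns may be separated by spaces or tabs.
        String[] parts = resultLine.trim().split("\\s+");
        int[] scores = new int[parts.length];
        for (int i = 0; i < parts.length; i++) {
            scores[i] = Integer.parseInt(parts[i]);
        }
        return scores;
    }
}
```

For example, parse("3 -1") yields {3, -1}: positive strength 3 and negative strength -1.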
Use trinary classification (positive-negative-neutral) trinary (report positive-negative-neutral classification instead) The result for this would be like 3 -1 1. This is: (+ve classiicatons (-ve classiicatons (trinary classiicatons Use binary classification (positive-negative) binary (report positive-negative classification instead) The result for this would be like 3 -1 1. This is: (+ve classiicatons (-ve classiicatons (binary classiicatons Use a single positive-negative scale classification scale (report single -4 to +4 classification instead) The result for this would be like 3 -4 -1. This is: (+ve classiicatons (-ve classiicatons (scale classiicatons 9 Location of linguistic data folder sentidata [folder for SentiStrength data (end in slash, no spaces)] Location of sentiment term weights EmotionLookupTable [filename (default: EmotionLookupTable.txt or SentimentLookupTable.txt)]. Location of output folder outputFolder [foldername where to put the output (default: folder of inputss File name extension for output resultsextension [ile-extension for output (default _out.txtss Classification algorithm parameters These optons change how the sentment analysis algorithm works. 
alwaysSplitWordsAtApostrophes (split words when an apostrophe is met – important for languages that merge words with ', like French; e.g., t'aime -> t ' aime with this option, t'aime without)
noBoosters (ignore sentiment booster words (e.g., very))
noNegatingPositiveFlipsEmotion (don't use negating words to flip +ve words)
noNegatingNegativeNeutralisesEmotion (don't use negating words to neuter -ve words)
negatedWordStrengthMultiplier (strength multiplier when negated (default=0.5))
maxWordsBeforeSentimentToNegate (max words between negator & sentiment word (default 0))
noIdioms (ignore idiom list)
questionsReduceNeg (-ve sentiment reduced in questions)
noEmoticons (ignore emoticon list)
exclamations2 (exclamation marks count as +2 if not a -ve sentence)
mood [-1,0,1] (interpretation of neutral emphasis (e.g., miiike; hello!!). -1 means neutral emphasis is interpreted as -ve; 1 means it is interpreted as +ve; 0 means emphasis is ignored)
noMultiplePosWords (don't allow multiple +ve words to increase +ve sentiment)
noMultipleNegWords (don't allow multiple -ve words to increase -ve sentiment)
noIgnoreBoosterWordsAfterNegatives (don't ignore boosters after negating words)
noDictionary (don't try to correct spellings using the dictionary by deleting duplicate letters from unknown words to make known words)
noDeleteExtraDuplicateLetters (don't delete extra duplicate letters in words even when they are impossible, e.g., heyyyy) [this option does not check whether the new word is legal, in contrast to the above option]
illegalDoubleLettersInWordMiddle [letters never duplicated in word middles] - a list of characters that never occur twice in succession. For English the following list is used (default): ahijkquvxyz. Never include w in this list as it often occurs in www.
illegalDoubleLettersAtWordEnd [letters never duplicated at word ends] - a list of characters that never occur twice in succession at the end of a word.
For English the following list is used (default): achijkmnpqruvwxyz
noMultipleLetters (don't use the presence of additional letters in a word to boost sentiment)

Additional considerations

Language issues

If using a language with a character set that is not the standard ASCII collection, please save the input in UTF-8 format and use the utf8 option to get SentiStrength to read the input files as UTF-8. If using a European language like Spanish with diacritics, please try both with and without the utf8 option – depending on your system, one or the other might work (possibly due to an ANSI/ASCII coding issue with Windows).

Long texts

SentiStrength is designed for short texts but can be used for polarity detection on longer texts with the following options (see the binary and trinary options). This works similarly to Maite Taboada's SO-CAL program. In this mode, the total positive sentiment is calculated and compared to the total negative sentiment. If the total positive is bigger than 1.5 times the total negative sentiment then the classification is positive, otherwise it is negative. Why 1.5? Because negativity is rarer than positivity, so it stands out more (see the work of Maite Taboada).

java -jar SentiStrength.jar sentidata C:/SentiStrength_Data/ text I+hate+frogs+but+love+dogs.+Do+You+like. sentenceCombineTot paragraphCombineTot trinary

If you prefer a multiplier other than 1.5 then set it with the negativeMultiplier option. E.g., for a multiplier of 1 (equally weighting positive and negative texts) try:

java -jar SentiStrength.jar sentidata C:/SentiStrength_Data/ text I+hate+frogs+but+love+dogs.+Do+You+like. sentenceCombineTot paragraphCombineTot trinary negativeMultiplier 1

Machine learning evaluations

These are machine learning options to evaluate SentiStrength for academic research. The basic command is train.

train (evaluate SentiStrength by training term strengths on results in files).
An input file of 500+ human-classified texts is also needed - e.g.,

java -jar SentiStrength.jar train input C:\1041MySpace.txt

This attempts to optimise the sentiment dictionary using a machine learning approach and 10-fold cross-validation. This is equivalent to using the command optimise on a random 90% of the data, then evaluating the results on the remaining 10%, and repeating this 9 more times with the remaining 9 sections of 10% of the data. The accuracy results reported are the average of the 10 attempts. This estimates the improved accuracy gained from using the optimise command to improve the sentiment dictionary.

The output of this is two files. The file ending in _out.txt reports various accuracy statistics (e.g., number and proportion correct; number and proportion within 1 of the correct value; correlation between SentiStrength and human-coded values). The file ending in _out_termStrVars.txt reports the changes to the sentiment dictionary in each of the folds. Both files also report the parameters used for the sentiment algorithm and machine learning. See the "What do the results mean?" section at the end for more information.

Evaluation options

all - test all option variations listed in Classification Algorithm Parameters above rather than using the default options.

tot - optimise by the number of correct classifications rather than the sum of the classification differences.

iterations [number of 10-fold iterations (default 1)] - this sets the number of times that the training and evaluation is conducted. A value of 30 is recommended to help average out differences between runs.

minImprovement [min extra correct classifications to change sentiment weights (default 2)] - this sets the minimum number of extra correct classifications necessary to adjust a term weight during the training phase.

multi [# duplicate term strength optimisations to change sentiment weights (default 1)] - this is a kind of super-optimisation.
Instead of being optimised once, term weights are optimised multiple times from the starting values, and then the average of these weights is taken, optimised again, and used as the final optimised term strengths. This should in theory give better values than optimising once. E.g.,

java -jar SentiStrength.jar multi 8 input C:\1041MySpace.txt iterations 2

Example: Using SentiStrength for 10-fold cross-validation

What is this? This estimates the accuracy of SentiStrength after it has optimised the term weights for the sentiment words (i.e., the values in the file EmotionLookupTable.txt).

What do I need for this test? You need an input file that is a list of texts with human-classified values for positive (1-5) and negative (1-5) sentiment. Each line of the file should be in the format (tab-separated):

Positive Negative text

How do I run the test? Type the following command, replacing the filename with your own file name.

java -jar SentiStrength.jar input C:\1041MySpace.txt iterations 30

This should take up to one hour – much longer for longer files. The output will be a list of accuracy statistics for each 10-fold cross-validation.

What does 10-fold cross-validation mean? See the k-fold section in http://en.wikipedia.org/wiki/Cross-validation_(statistics). Essentially, it means that the same data is used to identify the best sentiment strength values for the terms in EmotionLookupTable.txt as is used to evaluate the accuracy of the revised (trained) algorithm – but this isn't cheating when it is done this way. The first line in the results file gives the accuracy of SentiStrength with the original term weights in EmotionLookupTable.txt.

What do the results mean? The easiest way to read the results is to copy and paste them into a spreadsheet like Excel. The table created lists the options used to classify the texts as well as the results. Here is an extract from the first two rows of the key results. It gives the total number correct for positive sentiment (Pos Correct) and the proportion correct (Pos Correct/Total).
It also reports the number of predictions that are correct or within 1 of being correct (Pos Within1). The same information is given for negative sentiment.

Pos Correct | Pos Correct/Total | Neg Correct | Neg Correct/Total | Pos Within1 | Pos Within1/Total | Neg Within1 | Neg Within1/Total
653         | 0.627281          | 754         | 0.724304          | 1008        | 0.9683            | 991         | 0.951969

Here is another extract of the first two rows of the key results. It gives the correlation between the positive sentiment predictions and the human-coded values for positive sentiment (Pos Corr) and the Mean Percentage Error (Pos MPEnoDiv). The same information is given for negative sentiment.

Pos Corr  | Neg Corr | Pos MPE     | Neg MPE     | Pos MPEnoDiv | Neg MPEnoDiv
0.638382  | 0.61354  | Ignore this | Ignore this | 0.405379     | 0.32853

If you specified 30 iterations then there will be 31 rows: one for the header and 1 for each iteration. Take the average of the rows as the value to use.

Thanks to Hannes Pirker at OFAI for some of the above code.
Author: Mike Thelwall (manual created 2018-05-11)