Data collection procedures
Dylan Weber
June 21, 2017

Things to note before you get started

• All the scripts write their output files to the directory they are run in.
• You will need at least one reddit account that is the developer of a "script application" (i.e., an account with a "client ID" and "client secret"). See the next section on obtaining credentials from reddit.
• You will need Python. The scripts were written for version 2.7 and are not guaranteed to work (and probably will not work) with versions >= 3.
• You will need to install the "numpy" and "praw" Python packages. If you are unsure how to do this, search for "installing python packages with pip."
• Ideally these scripts should be run in a Linux environment, to take advantage of the "screen" command for achieving reasonable runtimes on large data sets. This will become clearer once you read the main implementation instructions. There may be workarounds for Windows that I am not aware of; if you know of any, please let me know!
• Please send all bug reports and ideas for improvement to djw009@gmail.com

Obtaining script credentials from reddit

These scripts use the Python reddit API wrapper "PRAW". In order to access reddit's API, one needs credentials supplied by reddit.

1. Navigate to reddit.com and create an account.
2. Log into your account and navigate to the preferences tab found in the upper right-hand corner of the screen.
3. Click the "apps" tab found in the upper left-hand corner of the screen.
4. Click the "Are you a developer? create an app!" button. If you already have apps, click the "create another app" button.
5. Fill out the form; make sure you choose "script app" and NOT "web app" or "installed app".
6. Your client ID is the alphanumeric string that appears under the name of your app after you have finished creating it. Your client secret is the alphanumeric string that appears in the "client secret" field.
7.
Ideally, to run these scripts you should create several script apps to obtain several client ID and client secret pairs. Keep track of them somewhere.

Collecting interaction data from a subreddit

Step 1 - Generating the username list

1. Make sure the scripts are in the directory that you want output files written to.
2. Run usercounter_final.py with the following command:

    python -i usercounter_final.py

What does the script do? usercounter_final "watches" a subreddit of your choosing. It first creates an empty Python dictionary called name_list ("the namelist"). Each time a submission is made to the subreddit you chose, usercounter grabs the author's name. If the author is not already in the namelist, it adds the author's username to the namelist with a submission counter value of one and a unique usernumber: the first user gets 0, the second user gets 1, and so on. After the scrape is completely finished, an example entry of name_list would look like:

    'username': [4, 64]

So the user with the username 'username' was the 64th user collected by the script and made four submissions to the subreddit of interest while the script was running. The script also creates a .txt file for that user and stores the submission's ID in that .txt file. If the author IS in the namelist, usercounter adds one to their submission counter and adds the current submission ID to the .txt file for that user.

3. The script will prompt you for the subreddit you would like to scrape data from. You must type the name EXACTLY as it appears on reddit, with no spaces before or after. If the script fails to run, make sure you did not accidentally input a space after typing the subreddit name.
4. The script will prompt you for a name for the write file. I usually choose the name of the subreddit and the start time/date.
5. The script will prompt you for the name of the folder in which to store submission IDs.
Inside this folder will be a .txt file, named for each username collected during the scrape, containing the IDs of the submissions made by that user during the scrape.
6. The script will then begin to run, indicated by its printing messages such as:

    'username' is a new submitter

or:

    'username''s count is n

7. Let the script run for as long as you are interested in collecting user data from your subreddit of choice. The script will write the namelist to a .csv file. End the script with Ctrl-C. At this point you can choose to continue the data collection process in your current session of Python; the namelist is stored in the variable name_list. Or, you can exit your current session and load the namelist from the .csv file written by usercounter into a variable of your choosing in a new session at any later point (see next section).

Step 2 - Generating the interaction data, the fast way

If you need to load the namelist from the .csv file written by usercounter, read the bullet points below first. Otherwise, proceed to step 1.

• If you need to import the namelist from the .csv file written by usercounter, first start Python and import multi_breakdown with the following command:

    import multi_breakdown

• We are going to use the load_and_convert_namelists_to_dict function. If we want to name the variable that we load the namelist into "name_list", and the .csv file that the namelist was written to is called "namelist.csv", we call the function with the following command:

    name_list = multi_breakdown.load_and_convert_namelists_to_dict('namelist.csv')

• The namelist should now be stored in a variable named name_list.

For the rest of this section I will assume you are running the scripts in a Linux environment with the screen command available, have n client ID / client secret pairs, and that the namelist is stored in the variable name_list.

1. We first need to break down the namelist into n sub-namelists.
We will do this using the dictbreaker function in the multi_breakdown module. First, import the multi_breakdown module with:

    import multi_breakdown

What does the script do? dictbreaker takes a dictionary and an integer, i, as arguments. It outputs a tuple consisting of ten subdictionaries (this can be changed easily; see the appendix). It takes the first i entries of the given dictionary and adds them to the first subdictionary in the tuple, then takes the next i entries and adds them to the second subdictionary in the tuple, and so on, until the given dictionary is empty. Therefore, if the integer is set high enough, the output tuple may contain empty dictionaries. This is OK.

NOTE: In its current form, dictbreaker does not support breaking a dictionary into more than 10 subdictionaries. If you have more than 10 client IDs, you will want to do this. It is a relatively easy fix, but involves modifying the source code of dictbreaker. See the appendix.

2. Then, check the length of your namelist with:

    len(name_list)

3. Divide the length of your namelist by n and round up (I will refer to this number as x); this is the maximum number of items in each subdictionary.
4. Call dictbreaker to break the namelist into subdictionaries with (the subdictionaries variable name is arbitrary):

    subdictionaries = multi_breakdown.dictbreaker(name_list, x)

5. Now we need to write each subdictionary in the tuple subdictionaries to its own .csv file, to be loaded into variables in different screen sessions later. To do this, call listofdicts_writer from multi_breakdown with:

    multi_breakdown.listofdicts_writer(subdictionaries)

The script will prompt you for how many nonempty dictionaries are in the tuple subdictionaries. It will then write a .csv file for each subdictionary in the tuple and will name these files "j.csv", where j is an integer.

6. Quit your current session of Python.
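To make the splitting behavior concrete, here is a minimal sketch of the logic dictbreaker is described as implementing. The function body below is my own illustration, not the actual source: it assumes the fixed count of ten subdictionaries mentioned above and chunks the dictionary in iteration order.

```python
from itertools import islice

def dictbreaker(d, i, parts=10):
    """Split dictionary d into a tuple of `parts` subdictionaries,
    each holding at most i entries, in iteration order.
    If i is chosen as ceil(len(d) / parts), every entry of d lands
    in some subdictionary; trailing subdictionaries may be empty."""
    items = iter(d.items())
    subdicts = []
    for _ in range(parts):
        subdicts.append(dict(islice(items, i)))
    return tuple(subdicts)

# Example: a 5-entry namelist split into chunks of at most 2.
name_list = {'a': [1, 0], 'b': [3, 1], 'c': [2, 2], 'd': [1, 3], 'e': [5, 4]}
subdictionaries = dictbreaker(name_list, 2)
# subdictionaries[0] holds two entries; subdictionaries[3] onward are empty.
```

As the manual notes, choosing x = ceil(len(name_list) / n) guarantees the namelist is fully consumed before the ten slots run out.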
Make n screen sessions (one for each client ID you have; name them in a way you will remember). Enter one of the sessions. In that session, start Python and import multi_breakdown again.
7. Now we need to generate the interaction data for one subdictionary. We do this by calling the bot_scrape() function from multi_breakdown with:

    interaction_data_j = multi_breakdown.bot_scrape()

What does the script do? bot_scrape() will ask for a subdictionary of the namelist, the namelist, and a set of reddit credentials. For each username in the subdictionary, the script checks every other user in the namelist and counts how many times each user in the namelist has been a top-level replier to a submission made by the current username, or has replied to a comment made by the current username. It then records that count as a value in a dictionary with the key (i, j), where i and j are the usernumbers of the two users. The dictionary is written to a .csv file and returned as a variable. For example, if user 65 was in a subdictionary that you gave to bot_scrape and the following item was in the returned dictionary:

    (65, 34): 8

then user 34 was a top-level replier on a submission by user 65, or replied to a comment by user 65, 8 times.

8. The script will prompt you for your reddit username, password, a client ID, a client secret, and the names of the .csv files that the namelist and the subdictionary are written to. Input these EXACTLY, with no spaces before or after. The script will then prompt you for a name for the output .csv file (I will assume you named it interaction_data_j.csv).
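The counting step that bot_scrape performs can be sketched offline. The sketch below is illustrative only: it assumes namelist entries of the form {'username': [submission_count, usernumber]} as described in Step 1, and a hypothetical pre-fetched list of (submission_author, replier) username pairs standing in for the reply data that the real script gathers through the reddit API.

```python
def count_interactions(subdictionary, name_list, reply_pairs):
    """Count, for each user in `subdictionary`, how often each user in
    `name_list` was a top-level replier to one of their submissions or
    replied to one of their comments.

    `reply_pairs` is a hypothetical list of (author, replier) username
    pairs; the real bot_scrape collects these via PRAW instead.
    Keys of the result are (i, j) usernumber pairs as described above."""
    counts = {}
    for author, replier in reply_pairs:
        if author in subdictionary and replier in name_list:
            i = name_list[author][1]   # usernumber of the submitter
            j = name_list[replier][1]  # usernumber of the replier
            counts[(i, j)] = counts.get((i, j), 0) + 1
    return counts

# Example reproducing the (65, 34): 8 entry discussed above:
name_list = {'alice': [4, 65], 'bob': [2, 34]}
pairs = [('alice', 'bob')] * 8  # bob replied to alice 8 times
interaction_data = count_interactions({'alice': [4, 65]}, name_list, pairs)
# interaction_data == {(65, 34): 8}
```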
The script should begin to run, indicated by its printing messages such as:

    'user_in_namelist' has been a top level replier to 'user_in_subdictionary' i times

or:

    'user_in_namelist' has replied to a comment by 'user_in_subdictionary' i times

The script will return the resulting interaction data as a variable and write it to a .csv file with the name you supplied. At this point you can quit Python and leave this screen session.

9. Repeat steps 6-8 for all the subdictionaries. You should then have an interaction data .csv file for each subdictionary.
10. Start Python and import multi_breakdown. We need to load the interaction data for each subdictionary into a variable in Python using the load_and_convert_namelists_to_dict function in multi_breakdown. Since this function name is annoyingly long and we are going to use it n times, I suggest renaming it:

    load = multi_breakdown.load_and_convert_namelists_to_dict

Now, for each interaction data .csv file you generated, enter the following:

    interaction_data_j = load('interaction_data_j.csv')

You should now have n different variables, each storing the interaction data generated by running bot_scrape on one subdictionary.
11. We now need to merge all these interaction data dictionaries into one interaction data dictionary for the whole scrape using tuplemerger (this is all tuplemerger does; it doesn't merit its own explanation) with the command:

    interaction_data_final = multi_breakdown.tuplemerger(interaction_data_1, interaction_data_2, ..., interaction_data_n)

12. Finally, we need to import array_maker:

    import array_maker

and use it to store the interaction data in a matrix format. This script has an option to symmetrize the matrix.
If you enter 'yes' when it prompts you, the ijth and jith entries of the resulting matrix will be equal and will represent the number of interactions between user i and user j. If you enter 'no', the ijth entry of the matrix will represent how many times user j was a top-level replier to a submission by user i or replied to a comment by user i. The function is called with the command:

    matrix = array_maker.array_maker(interaction_data_final)

This function will return its output as a variable and write it to a .csv file that can be used to import the matrix into any other supported language (MATLAB, R, etc.).

A sample run with pictures

The following set of pictures shows sample inputs and outputs. Note, however, that they were taken over many different scrapes, so the outputs will not correspond to each other; use them only as a general guide. The names of the variables and their order match the instructions above.

[Figures: usercounter input, running, and output; dictbreaker input and output (Figure 1: notice the empty dictionary in the last entry of the subdictionaries tuple output by dictbreaker); listofdicts_writer input; bot_scrape input and running; tuplemerger input and output; array_maker input and output.]
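The conversion from the (i, j): count dictionary to a matrix can be sketched as follows. This is a minimal illustration of the behavior described above, not the actual array_maker source; the symmetrize flag here plays the role of answering 'yes' at the prompt, and the real script additionally writes the matrix to a .csv file.

```python
import numpy as np

def array_maker(interaction_data, symmetrize=False):
    """Convert a {(i, j): count} interaction dictionary to a matrix.
    With symmetrize=True, entries (i, j) and (j, i) both hold the
    total number of interactions between users i and j."""
    n = max(max(i, j) for i, j in interaction_data) + 1
    matrix = np.zeros((n, n))
    for (i, j), count in interaction_data.items():
        matrix[i, j] += count
        if symmetrize and i != j:
            matrix[j, i] += count
    return matrix

# Example: user 1 replied to user 0 three times, user 0 to user 1 twice.
data = {(0, 1): 3, (1, 0): 2}
sym = array_maker(data, symmetrize=True)
# sym[0, 1] == sym[1, 0] == 5.0
raw = array_maker(data)
# raw[0, 1] == 3.0 and raw[1, 0] == 2.0
```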