Lab2 Instructions

Lab 2 - Stat 215A, Fall 2017
Due: Thursday October 5, 9:00 PM
Push a folder called “lab2-0123456789” (where you replace 0123456789 with your student ID) to your stat215a
GitHub repository. This folder, “lab2-0123456789”, should contain the files below:
• lab2.Rmd or lab2.Rnw: the raw report + code with your name
• lab2.pdf : the output of lab2.Rnw/lab2.Rmd. This output should not contain any code.
• lab2_blind.Rmd or lab2_blind.Rnw: exactly the same as lab2.Rmd/lab2.Rnw but with your student
ID instead of your name
• lab2_blind.pdf : the output of lab2_blind.Rnw/lab2_blind.Rmd. This output should not contain
any code.
• R/: a folder containing .R scripts (e.g. load.R and clean.R) that will be sourced in lab2.Rnw
• extra/: an optional folder containing external figures (e.g. images made in Adobe Illustrator or
downloaded from the internet).
You should also have a local data/ folder but please do not push this data/ folder. Do not push any other
files that are not needed for your report. Do not push multiple versions of the lab. It must be very clear to
an external viewer which folder contains your lab. Please make an effort to adhere to these filenames exactly,
otherwise there is a chance that your lab will not be properly transferred for peer grading.

Kernel density plots and smoothing

These tasks use the redwood data from the previous lab. You may have already done similar things in your
lab 1; these tasks are focused on experimenting with parameters in kernel smoothers.
1. Plot a density estimate for the distribution of temperature over the whole dataset. Experiment with
different kernels and bandwidth. Explain your findings.
2. Choose a time of day and plot the temperature against the humidity for all nodes at that time for the
entire project period (hint: there are 288 measurements per day so measurements where epoch mod
288 is constant will all be at the same time of day). Add a loess smoother to the plot. Experiment
with bandwidth and the degree of the polynomials. Explain your findings.
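
As a starting point for both tasks, here is a minimal sketch in base R. It runs on a synthetic stand-in for the redwood data, so the data frame redwood and the columns epoch, node, temp and humidity are assumptions rather than the real variable names, and the particular kernel, bandwidth, span and degree values are arbitrary choices to vary.

```r
# Synthetic stand-in for the redwood data (column names are assumptions).
set.seed(215)
redwood <- expand.grid(epoch = 1:2880, node = 1:20)   # ten days, 20 nodes
day <- (redwood$epoch - 1) %/% 288
redwood$temp <- 15 + 5 * sin(2 * pi * redwood$epoch / 288) + 0.5 * day +
  rnorm(nrow(redwood))
redwood$humidity <- 90 - 2.5 * redwood$temp + rnorm(nrow(redwood), sd = 3)

# Task 1: the same data under different kernels and bandwidths.
d_gauss <- density(redwood$temp, kernel = "gaussian", bw = "nrd0")
d_epan  <- density(redwood$temp, kernel = "epanechnikov", bw = 0.2)
d_wide  <- density(redwood$temp, kernel = "gaussian", adjust = 4)
plot(d_gauss, main = "Temperature density")
lines(d_epan, lty = 2)
lines(d_wide, lty = 3)

# Task 2: keep one time of day, i.e. a constant value of epoch mod 288.
same_time <- redwood[redwood$epoch %% 288 == 100, ]

# Two loess fits that differ in span (bandwidth) and polynomial degree.
fit_smooth <- loess(humidity ~ temp, data = same_time, span = 0.75, degree = 2)
fit_wiggly <- loess(humidity ~ temp, data = same_time, span = 0.2, degree = 1)

grid <- data.frame(temp = seq(min(same_time$temp), max(same_time$temp),
                              length.out = 100))
plot(same_time$temp, same_time$humidity, xlab = "Temperature",
     ylab = "Humidity")
lines(grid$temp, predict(fit_smooth, grid), lty = 1)
lines(grid$temp, predict(fit_wiggly, grid), lty = 2)
```

In density(), a numeric bw sets the bandwidth directly, while adjust multiplies the default choice; in loess(), span controls the fraction of points used in each local fit.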

Linguistic Data

This section of the lab uses data from a Dialect Survey conducted by Bert Vaux. Some limited information
can be found at the original website http://www4.uwm.edu/FLL/linguistics/dialect/index.html. The
questions and answers can be found in the file question_data.Rdata (this information was found and
processed from http://dialect.redlog.net/index.html by an intrepid past STAT215 student). We
will focus on the questions that look at lexical differences as opposed to phonetic differences, which are
numbered 50-121. There are two data sets on bSpace. lingData contains the answers to the questions for
47,471 respondents across the United States. The dataset contains the variables ID, CITY, STATE, ZIP, Q50 through Q121 (a few questions in this range are left out), lat and long. ID is a number identifying the respondent.
CITY and STATE were self reported by respondents. Former GSIs found the latitude and longitude for the
center of each zipcode and added the lat and long variables based on the reported city and state. Note
that there are missing values. The variables starting with Q are the responses to the corresponding question
on the website. A value of 0 indicates no response. The other numbers should directly match the responses
on the website, i.e. a value of 1 should match a response of (a).
For the second data set, lingLocation, the same categorical responses were turned into binary responses.
Then the data was binned into one degree latitude by one degree longitude squares. Within each of these bins,
the binary response vectors were summed over individuals. Please note that the rows are not normalized.
For example, say John and Paul take this questionnaire for two questions. The first question has three answer
choices and the second question has four answer choices. If John answered A and D and Paul answered B
and D, then lingData would encode two vectors: (1, 4) and (2, 4). If they lived in the same longitude and
latitude box, then it would be encoded in lingLocation as one vector: (1, 1, 0, 0, 0, 0, 2).
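
The John and Paul example can be reproduced with a short sketch in base R. The helper encode_binary and the objects n_choices and answers are hypothetical names introduced only for this illustration; they are not part of the provided data files.

```r
# Answer choices per question: three for question 1, four for question 2.
n_choices <- c(3, 4)

# lingData-style rows: one categorical answer per question (A = 1, ..., D = 4).
answers <- rbind(John = c(1, 4), Paul = c(2, 4))

# Turn one respondent's answers into a single 0/1 vector with one slot
# per answer choice (hypothetical helper, not from the lab materials).
encode_binary <- function(row, n_choices) {
  unlist(lapply(seq_along(row), function(q) {
    indicator <- integer(n_choices[q])
    # A value of 0 means no response, so the indicator stays all zeros.
    if (row[q] > 0) indicator[row[q]] <- 1L
    indicator
  }))
}

binary <- t(apply(answers, 1, encode_binary, n_choices = n_choices))
binary
# John: 1 0 0 0 0 0 1
# Paul: 0 1 0 0 0 0 1

# Both respondents fall in the same one-degree box, so their rows are summed.
colSums(binary)
# 1 1 0 0 0 0 2
```

Because the rows are summed rather than averaged, boxes with more respondents have larger counts, which is what "not normalized" refers to.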

2.1 Your tasks

1. Have a look at the review papers Nerbonne and Kretzschmar [2003] and Nerbonne and Kretzschmar
[2006] (both are posted under Lab 2 in bSpace).
2. Pick two survey questions and investigate their relationship to each other and geography. You will need
to use maps and can experiment with interactivity, e.g. using linked brushing (see the crosstalk R
package: https://rstudio.github.io/crosstalk/). Do the answers to the two questions define any
distinct geographical groups? Does a response to one question help predict the other? Try to analyze
the categorical data for more than 2 questions.
3. Encode the data so that the response is binary instead of categorical. In the previous example of
John and Paul, the encoded binary vectors would be (1, 0, 0, 0, 0, 0, 1) for John and (0, 1, 0, 0, 0, 0, 1)
for Paul. (You might want to do this for the previous question as well.) This makes p = 468 and
n = 47,471. Experiment with dimension reduction techniques. What do you see? If you do not see
anything, change your projection. Does that make things look different?
4. Use the methods we learned in class for dimension reduction and clustering to try to gain insight into
the full dataset. Are there any groups? Do these groups relate to geography? What questions separate
the groups? Is there a continuum? From where to where? Which questions produce this continuum?
Does the mathematical model behind your dimension reduction strategy make sense for these clusters?
5. Choose one of your interesting findings. Analyze and discuss the robustness of the finding. What
happens when you perturb the data set? Different starting points? What can be generalized from this
finding?

References
John Nerbonne and William Kretzschmar. Introducing computational techniques in dialectometry. Computers and the Humanities, 37(3):245–255, 2003.
John Nerbonne and William Kretzschmar. Progress in dialectometry: toward explanation. Literary and
Linguistic Computing, 21(4):387–397, 2006.
