Lab2 Instructions
Lab 2 - Stat 215A, Fall 2017

Due: Thursday October 5, 9:00 PM

Push a folder called “lab2-0123456789” (where you replace 0123456789 with your student ID) to your stat215a GitHub repository. This folder, “lab2-0123456789”, should contain the files below:

• lab2.Rmd or lab2.Rnw: the raw report + code with your name
• lab2.pdf: the output of lab2.Rnw/lab2.Rmd. This output should not contain any code.
• lab2_blind.Rmd or lab2_blind.Rnw: exactly the same as lab2.Rmd/lab2.Rnw but with your student ID instead of your name
• lab2_blind.pdf: the output of lab2_blind.Rnw/lab2_blind.Rmd. This output should not contain any code.
• R/: a folder containing .R scripts (e.g. load.R and clean.R) that will be sourced in lab2.Rnw
• extra/: an optional folder containing external figures (e.g. images made in Adobe Illustrator or downloaded from the internet).

You should also have a local data/ folder, but please do not push this data/ folder. Do not push any other files that are not needed for your report. Do not push multiple versions of the lab. It must be very clear to an external viewer which folder contains your lab. Please make an effort to adhere to these filenames exactly; otherwise there is a chance that your lab will not be properly transferred for peer grading.

1 Kernel density plots and smoothing

These tasks use the redwood data from the previous lab. You may have already done similar things in your lab 1; these tasks are focused on experimenting with parameters in kernel smoothers.

1. Plot a density estimate for the distribution of temperature over the whole dataset. Experiment with different kernels and bandwidths. Explain your findings.
2. Choose a time of day and plot the temperature against the humidity for all nodes at that time for the entire project period (hint: there are 288 measurements per day, so measurements where epoch mod 288 is constant will all be at the same time of day). Add a loess smoother to the plot. Experiment with bandwidth and the degree of the polynomials. Explain your findings.

2 Linguistic Data

This section of the lab uses data from a Dialect Survey conducted by Bert Vaux. Some limited information can be found at the original website http://www4.uwm.edu/FLL/linguistics/dialect/index.html. The questions and answers can be found in the file question_data.Rdata (this information was found and processed from http://dialect.redlog.net/index.html by an intrepid past STAT215 student). We will focus on the questions that look at lexical differences (numbered 50-121) as opposed to phonetic differences.

There are two data sets on bSpace.

lingData contains the answers to the questions for 47,471 respondents across the United States. The dataset contains the variables ID, CITY, STATE, ZIP, Q50-Q121 (a few questions in this range are left out), lat and long. ID is a number identifying the respondent. CITY and STATE were self-reported by respondents. Former GSIs found the latitude and longitude for the center of each zipcode and added the lat and long variables based on the reported city and state. Note that there are missing values. The variables starting with Q are the responses to the corresponding question on the website. A value of 0 indicates no response. The other numbers should directly match the responses on the website, i.e. a value of 1 should match a response of (a).

For the second data set, lingLocation, the same categorical responses were turned into binary responses. Then the data was binned into one degree latitude by one degree longitude squares. Within each of these bins, the binary response vectors were summed over individuals. Please note that the rows are not normalized.

For example, say John and Paul take this questionnaire for two questions. The first question has three answer choices and the second question has four answer choices.
If John answered A and D and Paul answered B and D, then lingData would encode two vectors: (1, 4) and (2, 4). If they lived in the same longitude and latitude box, then it would be encoded in lingLocation as one vector: (1, 1, 0, 0, 0, 0, 2).

2.1 Your tasks

1. Have a look at the review papers Nerbonne and Kretzschmar [2003] and Nerbonne and Kretzschmar [2006] (both are posted under Lab 2 in bSpace).
2. Pick two survey questions and investigate their relationship to each other and to geography. You will need to use maps and can experiment with interactivity, e.g. using linked brushing (see the crosstalk R package: https://rstudio.github.io/crosstalk/). Do the answers to the two questions define any distinct geographical groups? Does a response to one question help predict the other? Try to analyze the categorical data for more than 2 questions.
3. Encode the data so that the response is binary instead of categorical. In the previous example of John and Paul, the encoded binary vectors would be (1, 0, 0, 0, 0, 0, 1) for John and (0, 1, 0, 0, 0, 0, 1) for Paul. (You might want to do this for the previous question as well.) This makes p = 468 and n = 47,471. Experiment with dimension reduction techniques. What do you see? If you do not see anything, change your projection. Does that make things look different?
4. Use the methods we learned in class for dimension reduction and clustering to try to gain insight into the full dataset. Are there any groups? Do these groups relate to geography? What questions separate the groups? Is there a continuum? From where to where? Which questions produce this continuum? Does the mathematical model behind your dimension reduction strategy make sense for these clusters?
5. Choose one of your interesting findings. Analyze and discuss the robustness of the finding. What happens when you perturb the data set? Different starting points? What can be generalized from this finding?

References

John Nerbonne and William Kretzschmar. Introducing computational techniques in dialectometry. Computers and the Humanities, 37(3):245–255, 2003.

John Nerbonne and William Kretzschmar. Progress in dialectometry: toward explanation. Literary and Linguistic Computing, 21(4):387–397, 2006.
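
As an illustration of the two encodings described for John and Paul in Section 2, here is a minimal sketch in Python (the lab itself expects R; Python is used only to keep the example self-contained). It is not how the data sets were actually built. The function names, the coordinates, the choice to key each one-degree box by (floor(lat), floor(long)), and the choice to encode a non-response (0) as an all-zero block are all assumptions made for this example.

```python
from collections import defaultdict
from math import floor


def one_hot_encode(answers, n_choices):
    """Turn categorical answers (1 = (a), 2 = (b), ...) into one
    concatenated binary vector, one block per question."""
    encoded = []
    for answer, k in zip(answers, n_choices):
        block = [0] * k
        # Assumption: 0 (no response) leaves the block all zeros.
        if answer != 0:
            block[answer - 1] = 1
        encoded.extend(block)
    return encoded


def sum_by_box(respondents, n_choices):
    """Sum binary vectors over respondents in the same one-degree box.

    Assumption: a box is identified by (floor(lat), floor(long)).
    The sums are raw counts, i.e. the rows are not normalized.
    """
    total_len = sum(n_choices)
    boxes = defaultdict(lambda: [0] * total_len)
    for lat, long, answers in respondents:
        key = (floor(lat), floor(long))
        vec = one_hot_encode(answers, n_choices)
        boxes[key] = [a + b for a, b in zip(boxes[key], vec)]
    return dict(boxes)


# Two questions: the first has 3 answer choices, the second has 4.
n_choices = [3, 4]

# Hypothetical coordinates placing both respondents in the same box.
john = (37.87, -122.27, [1, 4])  # answered A and D
paul = (37.80, -122.40, [2, 4])  # answered B and D

john_vec = one_hot_encode(john[2], n_choices)
paul_vec = one_hot_encode(paul[2], n_choices)
location_rows = sum_by_box([john, paul], n_choices)

print(john_vec)                    # [1, 0, 0, 0, 0, 0, 1]
print(paul_vec)                    # [0, 1, 0, 0, 0, 0, 1]
print(location_rows[(37, -123)])   # [1, 1, 0, 0, 0, 0, 2]
```

The per-respondent vectors correspond to the binary encoding asked for in task 3, and the summed vector corresponds to a single lingLocation row.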