Pre-Workshop Instructions
Pre-workshop instructions: In the upcoming workshop we will tackle challenges that fall into three buckets: (I) getting access to BigQuery and related matters, (II) setting up a programming environment that lets you analyze the data you obtain, and (III) the actual process of analysis. Ideally, we would take care of (I) and (II) before the workshop and focus on (III) during it. Even if we can't finish steps (I) and (II) for everyone before the workshop, we'll be able to address the sticking points. With that in mind, I created this document to lay out the challenges in front of us. It will help us get the most out of the workshop if you go through this document beforehand.

Section I: BigQuery-related points

1. Getting access to BigQuery:
   a. This step is important only if you need direct access to BigQuery. If a course team member (or I) will provide you with the downloaded files, you may not need to go through it.
   b. To get access to the datasets for your courses, you can go directly to Jon Daries, or work with us (we are still trying to figure out the best way to go about it). In either case you'll need a Gmail ID, and in the first case you will probably need to set up a payment method (credit card/free trial credit, if you are using Google Cloud for the first time).

2. Interacting with BigQuery and downloading raw data:
   a. If you can access edX/MITx course datasets in your repository and know how to download files, go to https://bigquery.cloud.google.com and download the following tables from any one course:
      i. person_course
      ii. problem_analysis
      iii. course_axis
      With these tables, you can perform analyses on your actual data in the workshop rather than working with the fake data I'll provide.
   b. One way to download these tables is to create a "bucket" in Google Cloud Storage (https://console.cloud.google.com/storage/) and export the tables listed above into that bucket. To do that:
      i. First make sure that you create/choose the right "Project", and create a new storage bucket, say "test_bucket".
      ii. Within "test_bucket", create a new folder, e.g., "test_folder".
      iii. From BigQuery, open a table and click on the "Export table" button.
         1. Choose "CSV" for tables without nested data (e.g., person_course) and "JSON" for those with nested data (e.g., course_axis, problem_analysis).
         2. There is no need to use compression unless the file is really large (>1 GB).
         3. In the address bar, after gs://, type "test_bucket/test_folder/test_file".
         4. Hit export, then refresh your folder in the Storage console to make sure the new file is there and its size matches what you expect from the BigQuery Table Details -> Table Size.
         5. Once satisfied, click on the file in Storage and it will be downloaded to your computer.
   If you do a lot of table exporting and downloading, you'd perhaps want to perform steps 1-5 via their API (a rough sketch of what that looks like in Python is given below). Let me know if that's what you'd want, and we can set you up for that process.
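As an illustration only (not something you need before the workshop), here is a minimal Python sketch of the same export using the google-cloud-bigquery client library. The project, dataset, bucket, and folder names are placeholders; substitute your own.

    from google.cloud import bigquery

    # Placeholder names -- assumes Google Cloud credentials are already set up
    # on this machine (e.g., via "gcloud auth application-default login").
    client = bigquery.Client(project="your-project-id")
    table_id = "your-project-id.your_course_dataset.course_axis"
    destination_uri = "gs://test_bucket/test_folder/course_axis.json"

    # Nested tables (course_axis, problem_analysis) need newline-delimited JSON;
    # a flat table like person_course can be left as the default CSV.
    job_config = bigquery.ExtractJobConfig()
    job_config.destination_format = bigquery.DestinationFormat.NEWLINE_DELIMITED_JSON

    extract_job = client.extract_table(table_id, destination_uri, job_config=job_config)
    extract_job.result()  # wait for the export job to finish

You would still download the exported file from the Storage console (or with "gsutil cp"), just as in step 5 above.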
3. Running SQL jobs in BigQuery:
   a. I'm no expert in SQL, and I avoid using this feature in BigQuery for a number of reasons:
      i. I'm no expert in SQL...
      ii. From a research point of view, having access to the raw data is more versatile: you can do more analyses without having to write new SQL queries.
      iii. Visualization is also easier, as you can use your favorite programs in many different ways.
      iv. It's easy to create expensive queries by mistake (it is possible to set a maximum cost per query to avoid running into such a scenario).
   b. However, direct SQL queries could make more sense if:
      i. you are planning on combing through, say, all the log files from one course for each student's activities, or
      ii. you are making a real-time app, and low latency is the goal, or
      iii. you want to look at a few specific numbers every day, and have little use for the raw data beyond those numbers.
   Let's figure out what kind of analysis you have in mind, and whether running SQL queries (on BigQuery or via the API) would make more sense for your project. If so, I can provide you with starter code written by Ike and Jon (and possibly others), and from there we can figure out the queries you need to write. A bare-bones example of running a query from Python appears below.
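For concreteness, here is a minimal, hypothetical sketch of running a query via the Python client and pulling the result into pandas. This is not the starter code mentioned above; the project, dataset, and column names are placeholders and may not match your person_course schema.

    from google.cloud import bigquery

    client = bigquery.Client(project="your-project-id")

    # Placeholder query -- adjust the table path and columns to your own course.
    query = """
        SELECT username, ndays_act, nproblem_check
        FROM `your-project-id.your_course_dataset.person_course`
        WHERE certified = TRUE
        LIMIT 100
    """

    # to_dataframe() waits for the job to finish; it needs pandas installed.
    df = client.query(query).to_dataframe()
    print(df.head())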
Section II: Python setup and related points

We are going to use python-based tools for this workshop. However, you are welcome to use other programming languages/tools that can perform similar tasks. Even in that case, hopefully the workshop will give you an idea of how to translate the basic analysis steps into your favorite language.

In python, we will need the following packages for the workshop:
For analysis: NumPy, SciPy, Pandas, matplotlib, and jupyter notebook (the last one is convenient, but not essential).
For generating plots, I use plotly. You can use Bokeh, Seaborn, or stick with matplotlib for the workshop if you don't want to use plotly.
If you have these packages (and/or know how to install them), you are good to go. Otherwise, please read on.
___________________________
Setting up a python ecosystem on your computer: When creating a python development environment for the first time, there are unfortunately a number of decisions to make regarding python version, package management system, etc.

TL;DR: Consider installing anaconda if you don't have it. On macOS you may already have it installed, and you can install conda on Windows as well. After installation it takes about 3 GB. I recommend python version >= 3.6. It's fine for the workshop if you are running python 2.7, but edX is moving to python 3, and it's a good time for all of us to do so!

The article below has step-by-step guidance on installing conda (unlike the article, I recommend you install full anaconda, not miniconda). This will hopefully clarify the landscape a bit.
https://medium.freecodecamp.org/why-you-need-python-environments-and-how-to-managethem-with-conda-85f155f4353c

For the masochists: If you do not want to install Anaconda, another option is to use pip (a python package installer). More here. As explained in this link, if you are going to take this route (and use python 2.7), it's strongly recommended that you also install something called virtualenv, which creates a separate installation of python in an isolated environment, so that you don't have to "Marie Kondo" your laptop at a future date. Here is a very quick summary of what virtualenv does. For python >= 3.3 you don't need to install virtualenv separately -- there is an equivalent built-in module called venv that you can use to create a virtual environment.

Here is a semi-high-level view of the process:
1. We create a virtualenv (or venv), which makes a copy of the system python installation inside it: "python3 -m venv /path/to/venv"
2. Whenever we want to use this isolated environment (which could be just one environment for all similar projects), we activate it with the following command (for the bash shell on macOS or Linux): "source /path/to/venv/bin/activate"
3. We check that the venv is activated (there will be a "(venv)" prefix in the terminal prompt). We then install the requisite python packages using "pip" -- the package installer for python. The command for installing a python package xyz is as simple as: "pip install xyz". We also run our programs in this environment (via scripts or in jupyter notebook).
4. Finally, we deactivate the virtual environment by simply typing "deactivate".
The process on Windows and in other shells is similar.

Section III: Actual analysis

If you are not super familiar with python, this 5-minute overview could be a good starting point. For the workshop, we'll use the Pandas package for analysis, which has its own set of "commands". I'll have our tasks available as a jupyter notebook and as standalone scripts (in case you don't have jupyter notebook installed but do have the other python packages). If you are itching to start with Pandas, this official "10-minute" guide is safe for me to recommend, although I doubt that all sections of the guide are equally useful, and it would be very impressive if you can make sense of the whole guide in 10 minutes! (A tiny sketch of loading the exported tables into pandas appears at the very end of this document.) I'll make the scripts/notebook and the data available on 4/10, before the workshop.

Questions/comments? Email anindyar@mit.edu
___________________________
The dropbox folder containing the notebook and the data is here:
https://www.dropbox.com/sh/caqjwx2sak7p8aw/AABiKGP37x83KhjJKTvgo-tQa?dl=0
If you don't have jupyter notebook installed, you can use Google Colaboratory. The equivalent notebook is at:
https://drive.google.com/file/d/1pR9KmjUfyCmXsk9PWAGqZFomsAFnz2H2/view?usp=sharing
Download this notebook, upload it to your own Google Drive, connect it to Colab, and then we'll see how to run it.
___________________________________________________________________________
Other resources:
1. Jolyon has written a package that can be used to catch errors in edX courses and create a pretty PDF outline of the course index: https://github.com/jolyonb/edx-xml-clean
2. Connor has a repo that can give you an idea of how to download datasets from BigQuery: https://github.com/connor-makowski/BQ_Data_Generic
3. Martin has code (in R) that can create pseudonymized username/user_id lists. Contact Martin directly if you need it.
4. Dan has a wealth of jupyter notebooks, one for each paper he has written (and that's a lot!). Ask him if you need more real-life examples of edX data analysis.
5. Here is the edx2bigquery repository, which generates the tables in BigQuery; the data reference for the tables is at: http://edx.readthedocs.io/projects/devdata/en/latest/internal_data_formats/
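Finally, as a taste of Section III, here is a minimal, illustrative sketch of loading the exported tables into pandas. This is not the workshop notebook; the file names are placeholders for whatever you downloaded from your bucket.

    import pandas as pd

    # person_course was exported as CSV
    person_course = pd.read_csv("person_course.csv")

    # course_axis and problem_analysis were exported as newline-delimited JSON
    course_axis = pd.read_json("course_axis.json", lines=True)
    problem_analysis = pd.read_json("problem_analysis.json", lines=True)

    # quick sanity checks on what was loaded
    print(person_course.shape)
    print(course_axis.columns.tolist())

If these lines run on your own exported files, you are in good shape for the workshop.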