Team01 EB5001 Stock Price Analytics Using Big Data Installation And Setup Guide
EB5001 Stock Analysis using Big Data Engineering for Analytics

TEAM 1
ANURAG CHATTERJEE (A0178373U)
BHUJBAL VAIBHAV SHIVAJI (A0178321H)
GOH CHUNG TAT KENRICK (A0080891Y)
LIM PIER (A0178254X)
LIU THEODORUS DAVID LEONARDI (A0178263X)
TEO WEI KIN DARREN (A0178197L)
TSAN YEE SOON (A0178316Y)

Contents
1 Executive summary ............................. 3
2 Big data technologies landscape ............... 3
3 Installations and set-up ...................... 4
  3.1 Installing JDK ............................ 4
  3.2 Installing HDFS ........................... 4
  3.3 Installing Spark .......................... 4
  3.4 Setting up Cassandra ...................... 5
  3.5 Installing Kafka and Zookeeper ............ 5
    3.5.1 Creating Kafka topics ................. 6
  3.6 Setting up and starting the real-time data producers ... 6
    3.6.1 Producer for IEX data ................. 7
    3.6.2 Producer for StockTwits data .......... 7
  3.7 Setting up Spark jobs ..................... 8
  3.8 Installing Redis .......................... 8
  3.9 Visualizing using Qlik Sense .............. 8

1 Executive summary

The objective of this document is to introduce the Big data landscape, describe the functionalities of the various components used in this project, and then lay down the steps that need to be performed to install and set up the components so that the solution that has been built can be realized.

2 Big data technologies landscape

The overall landscape, in terms of Big data technologies and the proposed functionalities, is shown below.

Figure 1 Big data technologies landscape

There are 2 real-time data producers which fetch data from 2 different REST APIs. The stock quote API provides updates on the real-time price of the stocks from IEX, and the StockTwits API provides tweets related to the stock. The producers continuously fetch responses from these APIs and push the retrieved JSON to Kafka topics. The data from the Kafka topics are then processed by Spark streaming jobs in real time. There are 2 categories of jobs: one performs real-time aggregations and visualizes them in a console, and the other pushes the data to Cassandra. The data at rest in Cassandra are then processed by 2 categories of Spark batch jobs. The first category performs aggregations on the static data, and the other category performs batch machine learning. The results from these batch jobs are saved in separate tables in Cassandra.
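The producer loop described above can be sketched in Python. This is a minimal illustration only: `fetch_quote` and `publish` are hypothetical stand-ins for the real REST call and the Kafka client used by the project's producers (iex_producer.py and the StockTwits JAR).

```python
import json
import time

def fetch_quote(symbol):
    # Stand-in for the REST call to the stock quote API (hypothetical payload).
    return {"symbol": symbol, "latestPrice": 174.23, "timestamp": int(time.time())}

def publish(topic, message, sink):
    # Stand-in for a Kafka producer send: serialize the response to JSON
    # and append it to the list held for that topic.
    sink.setdefault(topic, []).append(json.dumps(message))

def run_producer(symbol, topic, sink, iterations=3):
    # Continuously fetch API responses and push the retrieved JSON to a topic.
    for _ in range(iterations):
        quote = fetch_quote(symbol)
        publish(topic, quote, sink)

sink = {}
run_producer("AAPL", "stockquotes", sink)
print(len(sink["stockquotes"]))  # 3 messages pushed to the topic
```

In the real pipeline the `sink` is a Kafka topic and the loop runs indefinitely; the sketch only shows the fetch-serialize-push cycle.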
A separate batch job performs archival by routinely converting the data stored in Cassandra to Parquet files. Qlik Sense is used to visualize the results of the batch processing in a dashboard using the Cassandra connector. The sections below focus on setting up the various components to realize the above landscape end to end.

3 Installations and set-up

These installation steps were tested on a machine with 32 GB of memory running Ubuntu 18.04.

3.1 Installing JDK

$ sudo apt install openjdk-8-jdk

You should see the following message after you check the Java version.

$ java -version
openjdk version "1.8.0_191"
OpenJDK Runtime Environment (build 1.8.0_191-8u191-b12-2ubuntu0.18.10.1-b12)
OpenJDK 64-Bit Server VM (build 25.191-b12, mixed mode)

To set the path and the JAVA_HOME variable, add the following lines to the ~/.bashrc file (JAVA_HOME must point to the JDK installation directory, not the java binary):

export JAVA_HOME="/usr/lib/jvm/java-8-openjdk-amd64"
export PATH=$PATH:$JAVA_HOME/bin

3.2 Installing HDFS

Install Hadoop 3.1.1 by executing these commands:

$ cd /opt
$ sudo wget https://www-eu.apache.org/dist/hadoop/common/hadoop-3.1.1/hadoop-3.1.1.tar.gz
$ sudo tar -zxvf hadoop-3.1.1.tar.gz
$ sudo ln -s hadoop-3.1.1 hadoop

and add these entries to the ~/.bashrc file:

export HADOOP_HOME=/opt/hadoop
export PATH="$HADOOP_HOME/bin:$PATH"

Hadoop HDFS is used by the archival script to archive old data (older than 10 years, based on UNIX timestamp information) as Parquet files with Snappy compression.

3.3 Installing Spark

Install Spark 2.3.3 by executing these commands:

$ cd /opt
$ sudo wget https://www-us.apache.org/dist/spark/spark-2.3.3/spark-2.3.3-bin-hadoop2.7.tgz
$ sudo tar -zxvf spark-2.3.3-bin-hadoop2.7.tgz
$ sudo ln -s spark-2.3.3-bin-hadoop2.7 spark

and add these entries to the ~/.bashrc file:

export SPARK_HOME=/opt/spark
export PATH="$SPARK_HOME/bin:$PATH"

3.4 Setting up Cassandra

To run Cassandra 3.11.4, we need to install Docker on the Ubuntu machine based on this guideline: https://www.digitalocean.com/community/tutorials/how-to-install-and-use-docker-on-ubuntu-18-04.

You need to set the Cassandra broadcast IP address to your private IP address.

$ ip addr show

Example: Private IP = 172.30.0.172

export CASSANDRA_IP_ADDRESS=172.30.0.172

Then run the Cassandra Docker container by executing these bash commands.

$ mkdir -p /mnt/data/var/lib/cassandra
$ docker run --name cassandra-server --network cda -d \
    -e CASSANDRA_BROADCAST_ADDRESS="${CASSANDRA_IP_ADDRESS}" \
    -p 7000:7000 -p 9042:9042 \
    -v /mnt/data/var/lib/cassandra:/var/lib/cassandra \
    cassandra:3.11.4

3.5 Installing Kafka and Zookeeper

This section gives the instructions to install Apache Kafka and its required dependency, Zookeeper. The versions we have used are:

- ZooKeeper 3.4.6 (zookeeper-3.4.6.tar.gz)
- Apache Kafka 1.1.1 (kafka_2.11-1.1.1.tgz)

We download and install Zookeeper using the following commands.

$ cd /opt
$ sudo wget https://archive.apache.org/dist/zookeeper/zookeeper-3.4.6/zookeeper-3.4.6.tar.gz
$ sudo tar -zxvf zookeeper-3.4.6.tar.gz
$ sudo ln -s /opt/zookeeper-3.4.6 /opt/zookeeper
$ cd zookeeper
$ mkdir data

Upon installing Zookeeper, we create the configuration file for Zookeeper to initialize properly.

$ vim conf/zoo.cfg

tickTime=2000
dataDir=/path/to/zookeeper/data
clientPort=2181
initLimit=5
syncLimit=2

The following command initializes the Zookeeper server. If the initialization is successful, the output will be similar to the one shown below.

$ bin/zkServer.sh start
JMX enabled by default
Using config: /Users/../zookeeper/bin/../conf/zoo.cfg
Starting zookeeper ...
STARTED

To check whether Zookeeper is working properly, run the following command. If it is working, we will be able to see that it is connected as shown below.

$ bin/zkCli.sh
Connecting to localhost:2181
................
Welcome to ZooKeeper!
................
WATCHER::

WatchedEvent state:SyncConnected type:None path:null

[zk: localhost:2181(CONNECTED) 0]

After installing Zookeeper, we continue with installing Kafka. As mentioned, the version we have used for this project is Apache Kafka 1.1.1. The following commands download and install it.

$ cd /opt
$ sudo wget https://archive.apache.org/dist/kafka/1.1.1/kafka_2.11-1.1.1.tgz
$ sudo tar -zxf kafka_2.11-1.1.1.tgz
$ sudo ln -s /opt/kafka_2.11-1.1.1 /opt/kafka
$ cd kafka

You can start the Kafka server with the following command. If Kafka is running smoothly, output similar to the following will appear on the screen.

$ bin/kafka-server-start.sh config/server.properties
[2016-01-02 15:37:30,410] INFO KafkaConfig values:
    request.timeout.ms = 30000
    log.roll.hours = 168
    inter.broker.protocol.version = 0.9.0.X
    log.preallocate = false
    security.inter.broker.protocol = PLAINTEXT

3.5.1 Creating Kafka topics

The following commands create the topics for the Kafka producers. The first command creates the stockquotes topic; the second creates the stocktwits topic.
$ sudo $KAFKA_HOME/bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic stockquotes
$ sudo $KAFKA_HOME/bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic stocktwits

3.6 Setting up and starting the real-time data producers

We have 2 different producers for streaming the data from 2 different data sources to the 2 Kafka topics created as per section 3.5.1. A console producer can also be started to push messages to a topic manually:

$ sudo $KAFKA_HOME/bin/kafka-console-producer.sh --broker-list localhost:9092 --topic stockquotes

3.6.1 Producer for IEX data

The Python script named "iex_producer.py" acts as a producer and takes a date-time in epoch format as a command line parameter. The script requests data from the "https://api.iextrading.com/1.0/stock/AAPL/" URL and transmits the response to the "stockquotes" topic created as per section 3.5.1. The command for starting the producer is:

$ python iex_producer.py 20190415

The API response format transmitted to the "stockquotes" topic is shown below.

3.6.2 Producer for StockTwits data

To ingest StockTwits data, run Ingestion-StockTwits-Producer.jar with the following parameters:

- Topic name: stock-twits
- Stock ticker: "AAPL", "MSFT", "GOOG", etc.
- Redis host IP address: 127.0.0.1, since we host the Redis server locally

$ java -jar Ingestion-StockTwits-Producer.jar stock-twits AAPL 127.0.0.1
objc[14636]: Class JavaLaunchHelper is implemented in both /Library/Java/JavaVirtualMachines/jdk1.8.0_144.jdk/Contents/Home/bin/java (0x10369e4c0) and /Library/Java/JavaVirtualMachines/jdk1.8.0_144.jdk/Contents/Home/jre/lib/libinstrument.dylib (0x1037224e0). One of the two will be used. Which one is undefined.
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
Pull stock tweets from https://api.stocktwits.com/api/2/streams/symbol/AAPL.json?since=150025459
since ID: 150025574, max ID: 150027266
Publish AAPL stock tweets: id AAPL-150027266
Publish AAPL stock tweets: id AAPL-150027168
Publish AAPL stock tweets: id AAPL-150027146
Publish AAPL stock tweets: id AAPL-150026955
Publish AAPL stock tweets: id AAPL-150026876
Publish AAPL stock tweets: id AAPL-150026822
Publish AAPL stock tweets: id AAPL-150026445
Publish AAPL stock tweets: id AAPL-150026408
Publish AAPL stock tweets: id AAPL-150026303
Publish AAPL stock tweets: id AAPL-150026279
Publish AAPL stock tweets: id AAPL-150026257
Publish AAPL stock tweets: id AAPL-150026234
Publish AAPL stock tweets: id AAPL-150026212
Publish AAPL stock tweets: id AAPL-150026193
Publish AAPL stock tweets: id AAPL-150026190
Publish AAPL stock tweets: id AAPL-150026163
Publish AAPL stock tweets: id AAPL-150026103
Publish AAPL stock tweets: id AAPL-150026044
Publish AAPL stock tweets: id AAPL-150026021
Publish AAPL stock tweets: id AAPL-150025870
Publish AAPL stock tweets: id AAPL-150025839
Publish AAPL stock tweets: id AAPL-150025824
Publish AAPL stock tweets: id AAPL-150025720
Publish AAPL stock tweets: id AAPL-150025663
Publish AAPL stock tweets: id AAPL-150025647
Publish AAPL stock tweets: id AAPL-150025636
Publish AAPL stock tweets: id AAPL-150025632
Publish AAPL stock tweets: id AAPL-150025618
Publish AAPL stock tweets: id AAPL-150025577
Publish AAPL stock tweets: id AAPL-150025574
Message sent successfully

3.7 Setting up Spark jobs

A separate Scala project has been created for each task and packaged in "jar" format. The commands for executing the JARs are given below. All the JARs are uploaded to Google Drive and are available at this link: https://drive.google.com/drive/folders/1kYnweP0WGCPd1yesRgAtZUrtLXoCqq-g.
$ spark-submit Streaming-Spark-StockwithUnixTS.jar localhost:9092 group1 stockquotes

This is a consumer Spark streaming job which gets the messages from the "stock-twits" topic and inserts them into a Cassandra table.

$ spark-submit SparkStreaming.jar

This is a consumer Spark streaming job which gets messages from the "stockquotes" topic and computes a moving average over the messages received every 10 seconds.

$ spark-submit BatchML.jar 18.136.251.110 9042

This is a batch job which uses Spark machine learning to predict the marketAverage price.

$ spark-submit StockTwitAnalytics.jar

This is a batch job which performs batch aggregations on the stock-twits data.

$ spark-submit StockQuoteAggregatesBatch.jar

This is a batch job which performs batch aggregations on the IEX data.

3.8 Installing Redis

We run Redis 5.0.4 by executing these commands:

$ mkdir -p /mnt/data/var/lib/redis/data
$ docker run --name redis-server --network cda \
    -v /mnt/data/var/lib/redis/data:/data \
    -p 6379:6379 -d redis:5.0.4 redis-server --appendonly yes

3.9 Visualizing using Qlik Sense

1) Install Qlik Sense Desktop from the URL below: https://www.qlik.com/us/try-or-buy/download-qlik-sense
2) It is required to set up a free Qlik Sense account.
3) Once installed, paste the qvf file into the directory below.
4) Download the Qlik Sense & Cassandra connector: https://academy.datastax.com/quickdownloads
5) Set up the ODBC connector.
6) The below shows how the Qlik Sense ODBC was initially set up (not required if the qvf file is pasted into the app folder in step 3).
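The 10-second moving average computed by the SparkStreaming.jar job in section 3.7 can be illustrated with a simplified, single-process Python sketch. This is not the project's implementation: the real job runs on a Spark streaming window, and the tuple layout here is an assumption for illustration.

```python
def moving_average(events, window_seconds=10):
    """Average the prices of events whose UNIX timestamps fall within the
    last `window_seconds` relative to the newest event. `events` is a list
    of (timestamp, price) tuples (an assumed layout); returns None when
    there are no events."""
    if not events:
        return None
    newest = max(ts for ts, _ in events)
    window = [price for ts, price in events if newest - ts < window_seconds]
    return sum(window) / len(window)

# (timestamp, price) pairs; the first event is older than the 10 s window
events = [(95, 170.0), (105, 172.0), (108, 174.0), (109, 176.0)]
print(moving_average(events))  # averages 172.0, 174.0, 176.0 -> 174.0
```

The streaming job does the same aggregation continuously, recomputing the average as each new 10-second batch of messages arrives from the topic.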