Setup Guide: CentOS 7
Setup Resilient ML Research Platform

NOTE: These instructions are for demonstrating MLaaS. Adjustments are required for a production deployment; please use your own judgment.

Set up the OS:
Download CentOS 7.0 from http://www.centos.org/download/ (or use any Linux flavor) and burn a DVD if needed. Install the Gnome Desktop and LVM (optional). The scripts below are for CentOS only; please adjust accordingly if you are not using CentOS.

Example cluster for the demo (local 3-node cluster):
xx1.your.com (Hadoop/Spark master, HDFS name node, Django web server)
xx2.your.com (HDFS secondary name node/data node, Spark worker)
xx3.your.com (HDFS data node, Spark worker)

Note: There is no security setup for the Hadoop and Spark clusters in these instructions. If security is required, it is suggested to separate the Django web server from the Hadoop/Spark master and allow only the Django web server to access the Hadoop/Spark clusters.

The instructions below apply to ALL nodes unless a specific node is called out.

Setup Prerequisites:
# update yum
yum repolist
yum update

Add the user "hadoop" and install openssh*; this account is used to run Hadoop and Spark.
# add user
useradd hadoop
passwd hadoop
yum install openssh openssh-clients

Generate a public key pair on each node and copy the public key to the other nodes.
# impersonate hadoop
su hadoop
cd ~
# generate pub key
ssh-keygen -t rsa
# append the key to the file authorized_keys
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
# copy the key to the other nodes (example from xx1)
ssh-copy-id -i ~/.ssh/id_rsa.pub hadoop@xx2.your.com
ssh-copy-id -i ~/.ssh/id_rsa.pub hadoop@xx3.your.com
# make sure the permissions are correct
chmod 0600 ~/.ssh/authorized_keys
# verify ssh; you should be able to connect without a password
ssh localhost
ssh hadoop@xx1.your.com
ssh hadoop@xx2.your.com
ssh hadoop@xx3.your.com
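As an optional sanity check before continuing, the small sketch below (using the example hostnames from this guide) confirms that passwordless SSH works from the current node to every cluster node.

# optional sketch: verify passwordless SSH to all cluster nodes
# (hostnames are the example names used in this guide)
for host in xx1.your.com xx2.your.com xx3.your.com; do
    # BatchMode=yes makes ssh fail instead of prompting for a password
    if ssh -o BatchMode=yes -o ConnectTimeout=5 "hadoop@$host" hostname >/dev/null 2>&1; then
        echo "OK:   $host"
    else
        echo "FAIL: $host (key not installed or host unreachable)"
    fi
done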
Setup Hadoop on the master node (example hostname "xx1"):
Download Hadoop from http://www.apache.org/dyn/closer.cgi/hadoop/common/ (on the xx1 node only). Hadoop is assumed to run as the user "hadoop".
# un-tar in the download folder
tar xzf hadoop-2.n.m.tar.gz
# move the folder to the hadoop user's home folder
cp -r hadoop-2.n.m /home/hadoop/
# make sure the files are owned by hadoop
chown hadoop.hadoop /home/hadoop/* -R
# create folders for the data/name nodes, matching hdfs-site.xml
mkdir /home/hadoop/hadoopdata
mkdir /home/hadoop/hadoopdata/hdfs
mkdir /home/hadoop/hadoopdata/hdfs/datanode
mkdir /home/hadoop/hadoopdata/hdfs/namenode

Configure the Hadoop .xml files (on the xx1 node only):
# hdfs master is xx1; the data folders are defined under ~/hadoopdata
gedit hadoop-2.n.m/etc/hadoop/core-site.xml
# 1st namenode: xx1, 2nd namenode: xx2
gedit hadoop-2.n.m/etc/hadoop/hdfs-site.xml
# properties to set (the full hdfs-site.xml property list is reproduced at the end of this guide):
#   fs.default.name = hdfs://xx1.your.com:9000/
#   dfs.permissions = false
# list the data nodes
gedit hadoop-2.n.m/etc/hadoop/slaves
# add all data nodes here
xx2.your.com
xx3.your.com

Copy the Hadoop program folder from xx1 to the xx2 and xx3 nodes (from the Hadoop master xx1 only):
scp -r hadoop-2.n.m xx2.your.com:/home/hadoop
scp -r hadoop-2.n.m xx3.your.com:/home/hadoop

Create a symbolic link for $HADOOP_HOME on all nodes:
# create a soft link for the Hadoop folder
ln -s ~/hadoop-2.n.m ~/hadoop_latest

Create a folder for PID files and modify hadoop-env.sh:
# create pid folder
sudo mkdir /var/hadoop
sudo chown hadoop.hadoop /var/hadoop
# modify a variable in hadoop-env.sh
export HADOOP_PID_DIR=/var/hadoop

Set the environment for the Hadoop programs by modifying ~/.bashrc on all nodes:
# set env variables in the user hadoop's ~/.bashrc on all nodes
# please edit the Java home path accordingly
gedit ~/.bashrc and add:
export HADOOP_USER_NAME=hadoop
export HADOOP_HOME=/home/hadoop/hadoop_latest
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export YARN_HOME=$HADOOP_HOME
export JAVA_HOME=/usr/java/default
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin:$JAVA_HOME/bin:/home/hadoop/spark_latest/bin
export _JAVA_OPTIONS="-Xmx2g"
export THEANO_FLAGS=mode=FAST_RUN,floatX=float32
# for darknet; path to Intel MKL libs etc.
source /opt/intel/compilers_and_libraries/linux/bin/compilervars.sh intel64
# source it
source ~/.bashrc

Format the Hadoop Name Node (on the Hadoop master xx1 only):
# format the Namenode
hdfs namenode -format
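As an optional check after sourcing ~/.bashrc, a short sketch like the one below (paths are the example values used in this guide) confirms the environment variables and folders are in place before starting the cluster.

# optional sketch: sanity-check the Hadoop environment from ~/.bashrc
echo "HADOOP_HOME=$HADOOP_HOME"
echo "JAVA_HOME=$JAVA_HOME"
# both commands should print version information without errors
"$JAVA_HOME/bin/java" -version
"$HADOOP_HOME/bin/hdfs" version | head -1
# the data/name/pid folders created earlier should exist
ls -ld /home/hadoop/hadoopdata/hdfs/namenode /home/hadoop/hadoopdata/hdfs/datanode /var/hadoop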
Open ports on all nodes, or disable firewalld if possible:
# make sure the firewalld service is stopped; disable firewalld if possible
systemctl disable firewalld
systemctl stop firewalld
systemctl status firewalld
# FYI, for production specify all required ports
firewall-cmd --state
# need to find all hadoop ports...
firewall-cmd --permanent --zone=public --add-port=50070/tcp
firewall-cmd --permanent --zone=public --add-port=50075/tcp
etc…
firewall-cmd --reload

Start the Hadoop master and verify (on the Hadoop master xx1 only):
# start Hadoop
$HADOOP_HOME/sbin/start-dfs.sh
# verify namenode
http://xx1.your.com:50070/dfshealth.html#tab-overview
# verify cluster
http://xx1.your.com:8088/cluster
# verify 2nd namenode
http://xx2.your.com:50090/status.jsp
# verify datanode
http://xx1.your.com:50075/dataNodeHome.jsp
# FYI: stop Hadoop
sbin/stop-dfs.sh

Setup Spark on the master node "xx1":
Download and unzip "spark-n.n.n-bin-hadoop2.n.tgz" from https://spark.apache.org/downloads.html. Spark will be run as the user "hadoop".
# copy the unzipped Spark program folder to /home/hadoop
# change the owner of the Spark program folder to hadoop
cd /home/hadoop/spark-n.n.n-bin-hadoop2.n
# modify the config files:
gedit conf/spark-env.sh
# edit below based on your hardware specs (optional)
SPARK_MASTER_IP=xx1.your.com
SPARK_WORKER_CORES=4
SPARK_WORKER_MEMORY=32g
_JAVA_OPTIONS="-Xmx2g"
SPARK_PID_DIR=/var/hadoop
# modify the file conf/slaves (on the master only)
gedit conf/slaves
# add
xx2.your.com
xx3.your.com
# modify the file conf/log4j.properties to
log4j.rootCategory=ERROR, console
# modify the file conf/spark-defaults.conf to
spark.driver.maxResultSize 16g
spark.rpc.message.maxSize 1024
# copy the Spark programs to the other nodes
scp -r spark-n.n.n-bin-hadoop2.n xx2.your.com:/home/hadoop
scp -r spark-n.n.n-bin-hadoop2.n xx3.your.com:/home/hadoop
# you may need to open ports if the firewall is not disabled
# create the spark_latest link
ln -s ~/spark-n.n.n-bin-hadoop2.n ~/spark_latest

Start the Spark master at xx1:
# start the Spark master and workers (run on xx1)
/home/hadoop/spark_latest/sbin/start-all.sh
# verify at http://xx1.your.com:8080
# FYI, stop Spark
sbin/stop-all.sh
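As an optional check that the standalone cluster accepts jobs, the bundled SparkPi example can be submitted to the master; the sketch below assumes the default standalone master port 7077 and uses the spark_latest link created above.

# optional sketch: submit the bundled SparkPi example to the master
# (assumes the default standalone master port 7077; adjust if changed)
cd /home/hadoop/spark_latest
MASTER=spark://xx1.your.com:7077 ./bin/run-example SparkPi 10
# the driver output should contain a line like "Pi is roughly 3.14..."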
Install MongoDB:
http://docs.mongodb.org/manual/tutorial/install-mongodb-on-red-hat/

Install software packages for the machine learning framework:

Enable yum, pip and wget to work behind a proxy:
# to allow yum to work with a proxy server,
gedit /etc/yum.conf
# add the proxy to /etc/yum.conf:
proxy=http://yourproxy.your.com:1234/
# also set the proxy for wget; edit http_proxy=
gedit /etc/wgetrc
# set up the EPEL repository
yum install epel-release
yum repolist
# allow pip to work behind the proxy
export http_proxy=http://yourproxy.your.com:1234
export https_proxy=http://yourproxy.your.com:1234

Required packages:
# make sure python is 2.7+; by default CentOS 7 has 2.7.5
python -V
# needed packages
yum -y install python-setuptools python-setuptools-devel
# install pip
yum -y install python-pip
# packages for python
yum install python-argparse
yum -y install gcc gcc-c++ python-devel
yum install blas blas-devel lapack lapack-devel
pip install ujson
pip install numpy --upgrade
pip install scipy --upgrade
pip install distribute --upgrade
pip install setuptools
pip install python-dateutil
pip install pytz
pip install tornado
pip install pyparsing
pip install scikit-learn
yum install libpng-devel
yum install freetype-devel
pip install matplotlib --upgrade
pip install opencv-python
pip install Pillow
yum install python-imaging -y
yum install opencv -y
yum install opencv-devel -y
yum install cmake -y
yum install gcc gcc-c++ -y
yum install gtk2-devel -y
yum install libdc1394-devel -y
yum install libv4l-devel -y
yum install ffmpeg-devel -y
yum install gstreamer-plugins-base-devel -y
yum install libpng-devel libjpeg-turbo-devel jasper-devel openexr-devel -y
pip install sympy seaborn pyzmq pyxdg pycrypto psutil pickleshare pexpect joblib ipaddr ecdsa
# jupyter packages
pip install nbconvert nbformat jupyter jupyter-client jupyter-console jupyter-core jsonschema ipython
# may need this specific version; error from pandas/gtk-2.0? TBD
pip install pandas==0.17.1
yum install graphviz graphviz-devel graphviz-graphs graphviz-python
pip install requests isodate pydot
pip install pymongo
pip install py4j
pip install importlib
yum install boost boost-devel openssl-devel

Install the Django web framework and the SQLite3 database (on xx1 only):
Install SQLite before Django, and create the user "django", which will be the account that runs the web application.
# sqlite3 & django
yum install -y zlib-devel openssl-devel sqlite-devel bzip2-devel
pip install Django
# create the user "django"
useradd django
# cd to the folder /home/django
# django needs to be in the "hadoop" group for HDFS access
sudo usermod -a -G hadoop django
# allow the group to access the hadoop folder
chmod 750 /home/hadoop/hadoop-2.n.m

Start a web project and application, then copy the web files over:
# start a web project "myml"
django-admin startproject myml
cd myml
# start an application "atdml"
python manage.py startapp atdml
# copy files from GitHub into the folder "myml"
Copy the code from a source control site, overwriting the existing files.
# add entries to the user django's ~/.bashrc; please edit the Java home path accordingly
export HADOOP_USER_NAME=hadoop
export JAVA_HOME=/usr/java/default
# init the website
python manage.py migrate
# modify settings.py, atdml/settings.py and app.config for your server names etc.
# search for "?" to find the items that need to be edited
# init the app
python manage.py makemigrations atdml
python manage.py migrate
# create the web root admin account
python manage.py createsuperuser
# make sure the folders under "media" exist: upload, tmpdata, result, log etc.
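The exact folder set may vary by deployment; as a small sketch, the folders named above could be created like this (run as the django user from the project folder):

# sketch: create the folders named above under "media" (adjust the list as needed)
cd /home/django/myml
mkdir -p media/upload media/tmpdata media/result media/log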
# start the web server
python manage.py runserver 0.0.0.0:8000
# add 3 groups from http://xx1.your.com:8000/admin using the web root login:
#   1-reader
#   3-writer
#   5-developer
# add web users and assign each one to a group; make sure the web root account has a group assigned
# verify the homepage at http://xx1.your.com:8000/

The MLaaS software stack diagram and a screenshot of the Django web Admin page are included in the original document for quick reference.

For the HDFS cluster:
Log in to the Hadoop master as the user hadoop to stop or start the HDFS processes.
# start the hdfs daemon, as sudo; by default the service will run after boot-up
sudo systemctl start hdfs
# start all dfs processes on all nodes, as hadoop
/home/hadoop/hadoop_latest/sbin/start-dfs.sh
# stop all dfs processes, as hadoop
/home/hadoop/hadoop_latest/sbin/stop-dfs.sh

For the Spark cluster:
Log in to the Spark master as the user hadoop to stop or start the Spark processes.
# start/stop the spark daemon as sudo; by default the service will run after boot-up
sudo systemctl start|stop spark
# stop all spark processes on all nodes, as hadoop
/home/hadoop/spark_latest/sbin/stop-all.sh
# start all spark processes, as hadoop
/home/hadoop/spark_latest/sbin/start-all.sh

Restart the Django web:
Log in to the web master as the user django and start the web processes.
# start the web daemon; by default the service will run after boot-up
sudo systemctl start django_atdml
# start the web processes, as django
cd /home/django/myml
python manage.py runserver 0.0.0.0:8000
# OR run as a background process
nohup python manage.py runserver 0.0.0.0:8000 > /dev/null 2>&1 &
# OR run as a service
systemctl list-unit-files | grep django_atdml
# stop the web processes or daemon
Ctrl-C or kill the python process

SQLite3 DB: /home/django/myml/db.sqlite3

Information for the table atdml_user_profile (limits the upload count per user):
- count_upload: current upload count
- count_upload_max: max upload count for the current time window
- count_upload_date: starting date/time of the time window
- count_upload_period: time window length in hours
- acl_list: TBD
- user_id: key linking to the user table auth_user
To increase the upload count, increase count_upload_max or decrease count_upload_period.

Information for the table atdml_document (for all data); the field file_type defines the type of record:
- Dataset: data stored in HDFS or uploaded through the web
  - N-gram pattern
  - N-gram JSON
  - N-gram pattern gz
  - ATD Libsvm Format
  - Custom: special custom featuring module
- Classifier: no training dataset, only pretrained ML models
  - ensemble
  - image-inception
  - image-yolo
- Prediction: entry for prediction
  - predict
  - ensemble_predict
  - image_predict
- Emulation: entry for APK emulation [+ prediction]
  - emulate

For reference, the hdfs-site.xml properties used in the Hadoop configuration step:
- dfs.datanode.data.dir = file:/home/hadoop/hadoopdata/hdfs/datanode (final: true)
- dfs.namenode.name.dir = file:/home/hadoop/hadoopdata/hdfs/namenode (final: true)
- dfs.http.address = xx1.your.com:50070
- dfs.secondary.http.address = xx2.your.com:50090
- dfs.replication = 1
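Assembled into XML form, those properties might look like the sketch below; it is a minimal example written to a sample path via a heredoc, and the exact element layout (including the <final> flag) is an assumption to be checked against your Hadoop version.

# sketch: a minimal hdfs-site.xml assembled from the properties above
# (written to a sample path; review before copying into $HADOOP_CONF_DIR)
cat > /tmp/hdfs-site.xml.sample <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<configuration>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:/home/hadoop/hadoopdata/hdfs/datanode</value>
    <final>true</final>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:/home/hadoop/hadoopdata/hdfs/namenode</value>
    <final>true</final>
  </property>
  <property>
    <name>dfs.http.address</name>
    <value>xx1.your.com:50070</value>
  </property>
  <property>
    <name>dfs.secondary.http.address</name>
    <value>xx2.your.com:50090</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
EOF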