Babbel Bytes

Insights from the Babbel engineering team

Launch an AWS EMR cluster with Pyspark and Jupyter Notebook inside a VPC

Franziska Adler, Nicola Corda

When your data becomes massive and data analysts are eager to construct complex models, it might be a good time to boost processing power by using clusters in the cloud … and let their geek flag fly. For this we use AWS Elastic MapReduce (EMR), which lets you easily create clusters with Spark installed. Spark is a distributed processing framework for executing calculations in parallel. Our data analysts undertake analyses and machine learning tasks using Python 3 (with libraries such as pandas, scikit-learn, etc.) in Jupyter notebooks. To enable our data analysts to create clusters on demand without completely changing their programming routines, we chose Jupyter Notebook with PySpark (the Spark Python API) on top of EMR. We mostly followed the example of Tom Zeng in the AWS Big Data Blog post. For security reasons we run the Spark cluster inside a private subnet of a VPC, and to connect to the cluster we use a bastion host with SSH tunnelling, so all traffic between browser and cluster is encrypted.


Configuration

Network and Bastion Host

The configuration for our setup includes a virtual private cloud (VPC) with a public subnet and a private subnet. The cluster will run inside the private subnet and the bastion host inside the public subnet. The bastion host needs to meet the following conditions:

  • have an Elastic IP so it can be reached through the internet
  • have a security group (SG) that accepts traffic on port 22 from all IPs
  • be deployed inside a public (DMZ) subnet of the VPC
  • run a Linux OS

The cluster needs to meet the following conditions:

  • be deployed inside a private subnet of the VPC
  • have an additional master security group (AdditionalMasterSecurityGroups) that accepts ALL traffic from the bastion

More information about security groups and bastion hosts inside a VPC can be found in the AWS documentation.
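As a sketch, the bastion's security group could be set up with the AWS CLI; the VPC ID, group name and group ID below are placeholders, not values from our actual setup:

```shell
# Create a security group for the bastion inside your VPC (IDs are placeholders)
aws ec2 create-security-group \
    --group-name bastion-sg \
    --description "SSH access to the bastion host" \
    --vpc-id yourVpcId

# Allow inbound SSH (port 22) from all IPs, as required above
aws ec2 authorize-security-group-ingress \
    --group-id yourBastionSgId \
    --protocol tcp --port 22 --cidr 0.0.0.0/0
```

In a hardened setup you would narrow the `--cidr` range to your office or VPN IPs instead of 0.0.0.0/0.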

Bastion Host

To connect to the bastion we use SSH key-based authentication. We put the public keys of our data analysts into /home/ubuntu/.ssh/authorized_keys on the bastion host. They then add a ~/.ssh/config file like the one below on their local machines:

Host bastion
    HostName elastic.public.ip.compute.amazonaws.com
    Port 22
    User ubuntu
    IdentityFile ~/.ssh/dataAnalystPrivateKey.pem

If the public keys are deployed correctly, they will be able to SSH into the bastion by simply running: ssh bastion

Create-Cluster Command

To launch a cluster from the command line, the AWS CLI needs to be installed. The command is then aws emr create-cluster followed by parameter options. The example command below creates a cluster named 'Jupyter on EMR inside VPC' with EMR release 5.2.1 and Hadoop, Hive, Spark and Ganglia (an interesting tool to monitor your cluster) installed.

aws emr create-cluster --release-label emr-5.2.1 \
--name 'Jupyter on EMR inside VPC' \
--applications Name=Hadoop Name=Hive Name=Spark Name=Ganglia \
--ec2-attributes \
    KeyName=yourKeyName,InstanceProfile=EMR_EC2_DefaultRole,SubnetId=yourPrivateSubnetIdInsideVpc,AdditionalMasterSecurityGroups=yourSG \
--service-role EMR_DefaultRole \
--instance-groups \
    InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m4.xlarge \
    InstanceGroupType=CORE,InstanceCount=2,BidPrice=0.1,InstanceType=m4.xlarge \
--region yourRegion \
--log-uri s3://yourBucketForLogs \
--bootstrap-actions \
  Name='Install Jupyter',Path="s3://yourBootstrapScriptOnS3/bootstrap.sh"

With --instance-groups, the count and type of the machines you want are defined. For the workers we use spot instances with a bid price to save money. You pay less for these unused EC2 instances, but if demand pushes the spot price beyond your bid you might lose them. The last option, --bootstrap-actions, lists the location of the bootstrap script. Bootstrap actions run on your cluster machines before the cluster is ready for operation. They let you install and set up additional software or change configurations for the applications you installed with --applications.
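To pick a sensible bid price, it helps to look at the recent spot price history for the instance type first. A sketch with a placeholder region:

```shell
# Check current spot prices for m4.xlarge (Linux) before choosing a BidPrice
aws ec2 describe-spot-price-history \
    --instance-types m4.xlarge \
    --product-descriptions "Linux/UNIX" \
    --start-time "$(date -u +%Y-%m-%dT%H:%M:%S)" \
    --region yourRegion \
    --query 'SpotPriceHistory[*].[AvailabilityZone,SpotPrice]' \
    --output table
```

A bid somewhat above the recent price keeps the workers alive through small price spikes while still staying below the on-demand price.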

Bootstrap action

You can use the bootstrap action from this Gist as a reference. In the bootstrap script we undertake the following steps:

  1. Install conda and, with conda, install other needed libraries such as hdfs3, findspark, NumPy and UltraJSON on all instances. The first lines set up the user password for Jupyter and the S3 path where your notebooks should live. You can also pass them as parameters in the AWS command.

    # arguments can be set with create cluster
    JUPYTER_PASSWORD=${1:-"SomePassWord"}
    NOTEBOOK_DIR=${2:-"s3://YourBucketForNotebookCheckpoints/yourNotebooksFolder/"}
    	
    # mount home to /mnt
    if [ ! -d /mnt/home ]; then
      sudo mv /home/ /mnt/
      sudo ln -s /mnt/home /home
    fi
    	
    # Install conda
    wget https://repo.continuum.io/miniconda/Miniconda3-4.2.12-Linux-x86_64.sh -O /home/hadoop/miniconda.sh \
        && /bin/bash ~/miniconda.sh -b -p $HOME/conda
    echo -e '\nexport PATH=$HOME/conda/bin:$PATH' >> $HOME/.bashrc && source $HOME/.bashrc
    conda config --set always_yes yes --set changeps1 no
    	
    # Install additional libraries for all instances with conda
    conda install conda=4.2.13
    
    conda config -f --add channels conda-forge
    conda config -f --add channels defaults
    
    conda install hdfs3 findspark ujson jsonschema toolz boto3 py4j numpy pandas==0.19.2
    	
    # cleanup
    rm ~/miniconda.sh
    	
    echo bootstrap_conda.sh completed. PATH now: $PATH
    	
    # setup python 3.5 in the master and workers
    export PYSPARK_PYTHON="/home/hadoop/conda/bin/python3.5"
    

    Setting PYSPARK_PYTHON makes PySpark use Python 3.5 on the master and the workers.

  2. We want the notebooks to be saved on S3. Therefore we install s3fs-fuse on the master node and mount an S3 bucket into the file system. This prevents the data analysts from losing their notebooks after shutting down the cluster.
    # install dependencies for s3fs-fuse to access and store notebooks
    sudo yum install -y git
    sudo yum install -y libcurl libcurl-devel graphviz
    sudo yum install -y cyrus-sasl cyrus-sasl-devel readline readline-devel gnuplot
    	
    # extract BUCKET and FOLDER to mount from NOTEBOOK_DIR
    NOTEBOOK_DIR="${NOTEBOOK_DIR%/}/"
    BUCKET=$(python -c "print('$NOTEBOOK_DIR'.split('//')[1].split('/')[0])")
    FOLDER=$(python -c "print('/'.join('$NOTEBOOK_DIR'.split('//')[1].split('/')[1:-1]))")
    
    # install s3fs
    cd /mnt
    git clone https://github.com/s3fs-fuse/s3fs-fuse.git
    cd s3fs-fuse/
    ls -alrt
    ./autogen.sh
    ./configure
    make
    sudo make install
    sudo su -c 'echo user_allow_other >> /etc/fuse.conf'
    mkdir -p /mnt/s3fs-cache
    mkdir -p /mnt/$BUCKET
    /usr/local/bin/s3fs -o allow_other -o iam_role=auto -o umask=0 -o url=https://s3.amazonaws.com  -o no_check_certificate -o enable_noobj_cache -o use_cache=/mnt/s3fs-cache $BUCKET /mnt/$BUCKET
    
  3. On the master node, install Jupyter with conda and configure it. Here we also install scikit-learn and some visualisation libraries.

    # Install Jupyter Notebook and libraries on the master
    conda install jupyter
    conda install matplotlib plotly bokeh
    conda install --channel scikit-learn-contrib scikit-learn==0.18
    
    # jupyter configs
    mkdir -p ~/.jupyter
    touch ~/.jupyter/jupyter_notebook_config.py
    HASHED_PASSWORD=$(python -c "from notebook.auth import passwd; print(passwd('$JUPYTER_PASSWORD'))")
    echo "c.NotebookApp.password = u'$HASHED_PASSWORD'" >> ~/.jupyter/jupyter_notebook_config.py
    echo "c.NotebookApp.open_browser = False" >> ~/.jupyter/jupyter_notebook_config.py
    echo "c.NotebookApp.ip = '*'" >> ~/.jupyter/jupyter_notebook_config.py
    echo "c.NotebookApp.notebook_dir = '/mnt/$BUCKET/$FOLDER'" >> ~/.jupyter/jupyter_notebook_config.py
    echo "c.ContentsManager.checkpoints_kwargs = {'root_dir': '.checkpoints'}" >> ~/.jupyter/jupyter_notebook_config.py
    

    The default port for the notebooks is 8888.

  4. Create the Jupyter PySpark daemon with Upstart on the master and start it.

    cd ~
    sudo cat << EOF > /home/hadoop/jupyter.conf
    description "Jupyter"
    author      "babbel-data-eng"
     
    start on runlevel [2345]
    stop on runlevel [016]
     
    respawn
    respawn limit 0 10
     
    chdir /mnt/$BUCKET/$FOLDER
    	
    script
      sudo su - hadoop > /var/log/jupyter.log 2>&1 << BASH_SCRIPT
        export PYSPARK_DRIVER_PYTHON="/home/hadoop/conda/bin/jupyter"
        export PYSPARK_DRIVER_PYTHON_OPTS="notebook --log-level=INFO"
        export PYSPARK_PYTHON=/home/hadoop/conda/bin/python3.5
        export JAVA_HOME="/etc/alternatives/jre"
        pyspark
    BASH_SCRIPT
    end script
    EOF
    sudo mv /home/hadoop/jupyter.conf /etc/init/
    sudo chown root:root /etc/init/jupyter.conf
     
    # be sure that jupyter daemon is registered in initctl
    sudo initctl reload-configuration
     
    # start jupyter daemon
    sudo initctl start jupyter
    

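The BUCKET/FOLDER extraction in step 2 shells out to Python; the same parsing could also be done with bash parameter expansion alone. A sketch, assuming the same NOTEBOOK_DIR shape as in the bootstrap script:

```shell
#!/bin/bash
# Hypothetical notebook location, same shape as NOTEBOOK_DIR in the bootstrap script
NOTEBOOK_DIR="s3://YourBucketForNotebookCheckpoints/yourNotebooksFolder/"

# normalize to exactly one trailing slash
NOTEBOOK_DIR="${NOTEBOOK_DIR%/}/"

# strip the s3:// scheme, then split bucket and folder on the first slash
PATH_PART="${NOTEBOOK_DIR#*//}"   # YourBucketForNotebookCheckpoints/yourNotebooksFolder/
BUCKET="${PATH_PART%%/*}"         # everything before the first slash
FOLDER="${PATH_PART#*/}"          # everything after the first slash
FOLDER="${FOLDER%/}"              # drop the trailing slash

echo "BUCKET=$BUCKET"             # → BUCKET=YourBucketForNotebookCheckpoints
echo "FOLDER=$FOLDER"             # → FOLDER=yourNotebooksFolder
```

Using parameter expansion avoids spawning a Python process for every variable, though both approaches give the same result here.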
If everything runs correctly, your EMR cluster will be in a waiting (cluster ready) status, and the Jupyter Notebook will listen on port 8888 on the master node.
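You can also check the cluster state from your terminal with the AWS CLI; the cluster ID below is a placeholder:

```shell
# List clusters that are currently starting, running or waiting
aws emr list-clusters --active

# Show the state of a specific cluster (WAITING once it is ready)
aws emr describe-cluster \
    --cluster-id yourClusterId \
    --query 'Cluster.Status.State' \
    --output text
```

If the bootstrap script fails, the cluster terminates instead, and the reason can be found in the logs under the S3 location given with --log-uri.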

Connect to the Cluster

To connect to Jupyter we use a web proxy to redirect all traffic through the bastion host. We first set up an SSH tunnel to the bastion host using: ssh -ND 8157 bastion. This command opens port 8157 on your local computer as a SOCKS proxy. Next we set up the web proxy. We are using SwitchyOmega in Chrome, available in the Web Store. Configure the proxy as shown below:
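To verify the tunnel works before configuring the browser, you can send a test request through the SOCKS proxy with curl; the master's DNS name below is a placeholder:

```shell
# With `ssh -ND 8157 bastion` running in another terminal,
# route a HEAD request to the Jupyter port through the SOCKS proxy
curl --socks5-hostname localhost:8157 \
    -I http://yourMasterPrivateDns:8888
```

The --socks5-hostname option makes curl resolve the master's private DNS name on the far side of the tunnel, which is exactly what the browser proxy will do.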

Setup Switchy Omega

After activating the proxy as shown below you should be able to reach the master node of your cluster from your browser.

Activate Proxy

To find the IP of the master node go to AWS Console → EMR → Cluster list and retrieve the Master’s public DNS as shown below:

EMR Console

Following the example, you can reach your Jupyter Notebook at the master node's private IP address on port 8888 (something like http://10.1.234.56:8888) in the proxy-prepared browser.

Pyspark on Jupyter

If you can reach the URL, you will be prompted to enter a password, which we set by default to ‘jupyter’.

To create a new notebook, choose New → Python 3 as shown below:

New Notebook

Every notebook is a PySpark app where the Spark context (sc) as well as the sqlContext are already initialized, something you would usually do first when creating Spark applications. So we can directly start playing around with PySpark DataFrames:

Pyspark Example 1

Or easily read data from S3 and work with it:

Pyspark Example 2

Examples are available in this Gist.

Happy playing with Jupyter and Spark!
