How to Set Up a GPU-Enabled TensorFlow Instance on AWS

by Evheny ShaliovJanuary 28, 2016

This step-by-step tutorial explains how to create an instance, configure the environment, run a sample, and use TensorBoard.

Recently, Google announced the open-source release of its second-generation machine learning system, TensorFlow. Originally developed by the Google Brain Team, the library is used for numerical computations expressed as data flow graphs.

TensorFlow supports deployment to one or more CPUs or GPUs in a single machine. In this tutorial, we provide step-by-step instructions for installing the GPU version of TensorFlow on Amazon Elastic Compute Cloud.

Table of Contents

Installation options

Until now, the primary option for configuring GPU-enabled TensorFlow on AWS was to use Amazon Linux AMI with NVIDIA GRID GPU Driver and follow the steps of this tutorial. However, it might take a day or two before you get access to all necessary NVIDIA libraries and set up the image.

To avoid the hassle, we have created an Amazon AMI that minimizes efforts needed to configure the environment and provides users with GPU-enabled TensorFlow on AWS right away.

Nevertheless, if manual configuration is a preferred method for you, follow the instructions below.

Creating an instance on AWS

Select the Amazon Machine Image (AMI) named Amazon Linux AMI with NVIDIA GRID GPU Driver.

Choose the instance type. We use g2.2xlarge with 16 GB of SSD.

Configure a security group and add port 6006 to open inbound ports.

Configure other instance details and launch the instance.
Generate (or reuse) authenticators to access the instance.

Configuring an environment

Log in to the remote instance via SSH with the default username—ec2-user.
Check the availability of Python 2.7, pip, and CUDA on the instance. Commands:
- python –version
- pip –version
- nvcc –version
The chosen AMI contains:
- Python 2.7.10
- pip 6.1.1
- CUDA 6.5.12
Install the CUDA Toolkit 7.0 (7.5 is not valid):

wget http://developer.download.nvidia.com/compute/cuda/7_0/Prod/local_installers/cuda_7.0.28_linux.run
sh cuda_7.0.28_linux.run

1 2	wget http://developer.download.nvidia.com/compute/cuda/7_0/Prod/local_installers/cuda_7.0.28_linux.run sh cuda_7.0.28_linux.run

Follow the instructions.

Do you accept the previously read EULA? <strong>accept</strong>
You are attempting to install on an unsupported configuration. Do you wish to continue? <strong>y</strong>
Install NVIDIA Accelerated Graphics Driver for Linux-x86_64 346.46? <strong>n</strong>
Do you want to install the OpenGL libraries? ((y)es/(n)o/(q)uit)<strong> n</strong>
Install the CUDA 7.0 Toolkit? <strong>y</strong>
Enter Toolkit Location [ default is /usr/local/cuda-7.0 ]:
Do you wish to run the installation with 'sudo'? ((y)es/(n)o): <strong>y</strong>
Do you want to install a symbolic link at /usr/local/cuda? <strong>y</strong>
Install the CUDA 7.0 Samples? <strong>n</strong>

Do you accept the previously read EULA? accept

You are attempting to install on an unsupported configuration. Do you wish to continue? y

Install NVIDIA Accelerated Graphics Driver for Linux-x86_64 346.46? n

Do you want to install the OpenGL libraries? ((y)es/(n)o/(q)uit) n

Install the CUDA 7.0 Toolkit? y

Enter Toolkit Location [ default is /usr/local/cuda-7.0 ]:

Do you wish to run the installation with 'sudo'? ((y)es/(n)o): y

Do you want to install a symbolic link at /usr/local/cuda? y

Install the CUDA 7.0 Samples? n

Download cuDNN. Note: The cuDNN library from here is not valid for the current environment. You can download cuDNN from the archive. However, the mandatory registration might take 1–2 days.
Install cuDNN 6.5 v2 (this particular version is specified in the TensorFlow installation guidelines):

tar -zxf cudnn-6.5-linux-x64-v2.tgz
cd cudnn-6.5-linux-x64-v2
sudo cp -R lib* /usr/local/cuda/lib64/
sudo cp cudnn.h /usr/local/cuda/include/
cd

tar -zxf cudnn-6.5-linux-x64-v2.tgz

cd cudnn-6.5-linux-x64-v2

sudo cp -R lib* /usr/local/cuda/lib64/

sudo cp cudnn.h /usr/local/cuda/include/

Add the LD_LIBRARY_PATH and CUDA_HOME environment variables to ~/.bashrc:

export LD_LIBRARY_PATH="$LD_LIBRARY_PATH:/usr/local/cuda/lib64"
export CUDA_HOME=/usr/local/cuda

1 2	export LD_LIBRARY_PATH="$LD_LIBRARY_PATH:/usr/local/cuda/lib64" export CUDA_HOME=/usr/local/cuda

Install TensorFlow:

sudo pip install https://storage.googleapis.com/tensorflow/linux/gpu/tensorflow-0.6.0-cp27-none-linux_x86_64.whl

1	sudo pip install https://storage.googleapis.com/tensorflow/linux/gpu/tensorflow-0.6.0-cp27-none-linux_x86_64.whl

Response:

Expected result
Installing collected packages: six, numpy, tensorflow
  Found existing installation: six 1.8.0
    Uninstalling six-1.8.0:
      Successfully uninstalled six-1.8.0
  Running setup.py install for numpy
Successfully installed numpy-1.10.1 six-1.10.0 tensorflow-0.5.0

Expected result

Installing collected packages: six, numpy, tensorflow

Found existing installation: six 1.8.0

Uninstalling six-1.8.0:

Successfully uninstalled six-1.8.0

Running setup.py install for numpy

Successfully installed numpy-1.10.1 six-1.10.0 tensorflow-0.5.0

Check whether the configured environment is correct using a Python terminal:

$ python
>>> import tensorflow as tf
>>> hello = tf.constant('Hello, TensorFlow!')
>>> sess = tf.Session()
>>> print sess.run(hello)
Hello, TensorFlow!
>>> a = tf.constant(10)
>>> b = tf.constant(32)
>>> print sess.run(a+b)
42

$ python

>>> import tensorflow as tf

>>> hello = tf.constant('Hello, TensorFlow!')

>>> sess = tf.Session()

>>> print sess.run(hello)

Hello, TensorFlow!

>>> a = tf.constant(10)

>>> b = tf.constant(32)

>>> print sess.run(a+b)

Running a sample

Install Git:

sudo yum install git -y

1	sudo yum install git -y

Clone the project:

git clone --recurse-submodules https://github.com/tensorflow/tensorflow

1	git clone --recurse-submodules https://github.com/tensorflow/tensorflow

Run the TensorFlow neural net model:

python tensorflow/tensorflow/models/image/mnist/convolutional.py

1	python tensorflow/tensorflow/models/image/mnist/convolutional.py

Response:

Initialized!
Epoch 0.00
Minibatch loss: 12.054, learning rate: 0.010000
Minibatch error: 90.6%
Validation error: 84.6%
Epoch 0.12
Minibatch loss: 3.285, learning rate: 0.010000
Minibatch error: 6.2%
Validation error: 7.0%
Epoch 0.23
Minibatch loss: 3.473, learning rate: 0.010000
Minibatch error: 10.9%
Validation error: 3.7%
Epoch 0.35
Minibatch loss: 3.221, learning rate: 0.010000
Minibatch error: 4.7%
Validation error: 3.2%

Initialized!

Epoch 0.00

Minibatch loss: 12.054, learning rate: 0.010000

Minibatch error: 90.6%

Validation error: 84.6%

Epoch 0.12

Minibatch loss: 3.285, learning rate: 0.010000

Minibatch error: 6.2%

Validation error: 7.0%

Epoch 0.23

Minibatch loss: 3.473, learning rate: 0.010000

Minibatch error: 10.9%

Validation error: 3.7%

Epoch 0.35

Minibatch loss: 3.221, learning rate: 0.010000

Minibatch error: 4.7%

Validation error: 3.2%

Using TensorBoard

To enable TensorBoard, proceed with the following steps.

Add /usr/local/bin to PATH via .bashrc and then log in to the OS again:

PATH=$PATH:$HOME/bin:/usr/local/bin

1	PATH=$PATH:$HOME/bin:/usr/local/bin

Copy the TAG file:

sudo cp -R /home/ec2-user/tensorflow/tensorflow/tensorboard/TAG /usr/local/lib/python2.7/site-packages/tensorflow/tensorboard/TAG

1	sudo cp -R /home/ec2-user/tensorflow/tensorflow/tensorboard/TAG /usr/local/lib/python2.7/site-packages/tensorflow/tensorboard/TAG

Copy the favicon.ico file into the TensorBoard folder:

sudo cp favicon.ico /usr/local/lib/python2.7/site-packages/tensorflow/tensorboard/

1	sudo cp favicon.ico /usr/local/lib/python2.7/site-packages/tensorflow/tensorboard/

Check the AWS security group. The inbound TCP port 6006 must be openned.

Run TensorBoard on the server:

tensorboard --logdir /var/log

1	tensorboard --logdir /var/log

Open TensorBoard in a browser.

Following the provided instructions, we were able to launch a TensorFlow instance from Amazon Linux AMI with NVIDIA GRID GPU Driver and keep it running. The task, nevertheless, required going beyond the default AMI configuration.

So, if you do not need manual configuration, you can use our one-click AMI.

About the author

Evheny Shaliov is a software developer at Altoros with 5+ years of experience in architecture design and database development. He has strong skills in building microservices-based architectures that use NoSQL data stores. Evheny is a teaching assistant in Belarusian State University of Informatics and Radioelectronics; he’s also studying for a PhD in information security.

Posted In