Configure and Use JupyterHub on Remote Ubuntu Server

I recently re-configured my Ubuntu server with JupyterHub so that I can run a long-running Python task with a large dataset, which can not be handled very well using my MacBook Pro.

There are a number of steps I spent quite sometime to figure out, which I think I might re-use in the future and decide to write them down before I forget.

The task was a sentiment analysis of ~1.4 million hotel reviews using our fine-tuned model based on Erlangshen-Roberta-330M-Sentiment, which took about 10 days to complete (NOTE: if the repo is private, you must get a token from HuggingFace and run huggingface-cli login with token first and the code also needs to be changed to use token - see my inline comments for details below). I developed a simplified version to use in this tutorial.

The following are the key steps:

JupyterHub Setup and SFTP

Our system admin used this tutorial to setup the JupyterHub and added my account to the sudo group.

For small files, I just use JupyterLab interface to upload them to the server. For large files, I use FileZilla to upload.

Python Versions

Different python versions might be needed for different projects. I use pyenv to manage Python versions. The following is my note for setting up pyenv on Ubuntu server, refer to my Mac tutorial if needed.

ssh hjwang@128.175.21.xxx  # ssh to remote server
sudo apt update -y  # update apt
sudo apt install -y make build-essential libssl-dev zlib1g-dev libbz2-dev libreadline-dev libsqlite3-dev wget curl llvm libncurses5-dev libncursesw5-dev xz-utils tk-dev libffi-dev liblzma-dev python-openssl git  # install dependencies
git clone https://github.com/pyenv/pyenv.git ~/.pyenv  # clone the repo
echo 'export PYENV_ROOT="$HOME/.pyenv"' >> ~/.bashrc  # setup environment
echo 'export PATH="$PYENV_ROOT/bin:$PATH"' >> ~/.bashrc  # setup environment
echo -e 'if command -v pyenv 1>/dev/null 2>&1; then\n eval "$(pyenv init -)"\nfi' >> ~/.bashrc  # setup environment
exec "$SHELL"  # restart the shell
pyenv install --list  # available versions to install 
pyenv versions  # installed versions
pyenv install 3.9.1  # install a Python version
pyenv global 3.9.1  # set global python version to use

Now if I type python, I see Python 3.9.1:

Python 3.9.1 (default, Aug  5 2022, 16:07:24) 
[GCC 7.5.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> 

Virtual Environments

Setup and activate a virtual environment (here I put all virtual environments in the folder venv and this sample one is named ml):

python -m venv venv/ml
source venv/ml/bin/activate

Next, while the virtual environment is activated, we need to add it to ipykernel so that we can access it via JupyterLab:

pip install ipykernel
sudo env "PATH=$PATH VIRTUAL_ENV=$VIRTUAL_ENV" python -m ipykernel install --name=ml

Now, the new virtual environment (kernel) can be used when creating new notebook:

Long-running Python Tasks

pandas
torch==1.11.0
transformers
python-dotenv
sendgrid

then run pip install -r requirements.txt in the activated virtual environment.

SENDGRID_API_KEY='SG.t1o-XXuERs2VIMykY81w2A.2LokhGSW0b0H9bwX-Kwa6xxx'
FROM_EMAIL='hjwang@udel.edu'
TO_EMAIL='harryjwang@gmail.com'

The program will calculate the sentiment of each review and generate a new csv file as the result. In total, the small sample has 500 rows. The processed result is saved every 100 rows and an email notification is sent out every 200 rows.

I use this program to show:

To run the program in the background, you can ssh to the server or use the terminal from JupyterLab, activate the virtual environment, then execute:

nohup python sentiment-analysis.py &

& asks the terminal to return control to you immediately and allows the command to complete in the background (source).

The program will keep on running even if you close the terminal or close JupyterLab.

Two files will be generated:

INFO: 08/05/2022 06:44:49 data loaded
INFO: 08/05/2022 06:44:55 model loaded
...

Three email notifications will also be sent:

The logging and email notifications are very useful for long-running tasks so that I know the program is still working fine or where to restart if something went wrong.