May 1, 2024

How to Set Up a (Real) Self-Contained Python Repository

tutorial

Have you ever found some Python code on GitHub and been unable to run it locally because of missing data, missing packages, version conflicts, and so on?

In this post, I discuss what a real self-contained Python repository should have.

The quick checklist is as follows:

.gitignore with additional settings
requirements.txt with locked package versions and variations for development, CUDA, etc.
.env for managing environment variables
venv for virtual environment
data files or instructions on how to get the data files

I have created a sample real self-contained Python repo here, which has the following structure:

This repo includes two files for demonstration purposes:

openai-api-starter.ipynb: a Python notebook that shows how to call the OpenAI API.
train_mnist.py: a Python program that trains a diffusion model on the MNIST dataset.

`.gitignore`

Always create your repo with the following settings:

.gitignore specifies intentionally untracked files - see details here

GitHub's default .gitignore is pretty good: it ignores many common files and folders so that you don't push them to your repo, e.g., the default venv virtual environment folder and the .env environment variable file (both discussed below).

You often need to add folders and/or files to this file so that you don’t push:

large files
unnecessary files

In the sample repo, I added the following lines to the default GitHub file:

# Training data folder
data/

# large model files
*.pth

The data/ folder is created when you run train_mnist.py. These data files should not be stored in the repo.
*.pth matches the diffusion model files generated by the training script. They are too large to include in the repo.

`requirements.txt`

You should always include a requirements.txt that lists the packages the repo's Python programs need. Pin the specific versions that work with your code so that a newer release cannot break it; if no version is specified, pip installs the latest available version. Once the packages are installed, you can run pip freeze -r requirements.txt to list the installed versions, following the order of your requirements file.

openai==1.25.0
tqdm==4.66.2
torch==2.3.0
torchvision==0.18.0
python-dotenv==1.0.1
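
To check that an environment really matches a pinned file like the one above, a minimal sketch along these lines may help. The helper names parse_pins and find_mismatches are hypothetical and not part of the sample repo; the sketch only understands plain name==version lines and assumes requirements.txt sits in the current directory.

```python
from importlib.metadata import PackageNotFoundError, version


def parse_pins(text):
    """Return {package: version} for every `name==version` line in text."""
    pins = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if "==" not in line or line.startswith("-"):
            continue  # skip blank lines, options such as -r, and unpinned entries
        name, pinned = line.split("==", 1)
        pins[name.strip()] = pinned.strip()
    return pins


def find_mismatches(pins):
    """Return {package: (pinned, installed_or_None)} for every unsatisfied pin."""
    mismatches = {}
    for name, pinned in pins.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            installed = None  # package is not installed at all
        if installed != pinned:
            mismatches[name] = (pinned, installed)
    return mismatches


if __name__ == "__main__":
    # Assumption: the script is run from the repo root.
    try:
        with open("requirements.txt", encoding="utf-8") as f:
            pins = parse_pins(f.read())
    except FileNotFoundError:
        pins = {}
        print("requirements.txt not found")
    for name, (pinned, installed) in find_mismatches(pins).items():
        print(f"{name}: pinned {pinned}, installed {installed or 'nothing'}")
```

The comparison is a plain string match, so a build whose version carries a local suffix (for example a CUDA wheel reporting something like 2.3.0+cu121) would show up as a mismatch even though it satisfies the pin.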

You can also have requirements-dev.txt to include packages that are only needed for development but not for production.

As an example, I included requirements-cuda.txt and requirements-torch.txt in the sample repo to address a potential issue with CUDA. If you have CUDA 12.1 installed on Windows, you have to install a PyTorch build that is compatible with it.

You should run pip install -r requirements-cuda.txt, which also installs the packages in requirements-torch.txt.

requirements-cuda.txt:

-r requirements-torch.txt
openai==1.25.0
tqdm==4.66.2
python-dotenv==1.0.1

requirements-torch.txt:

--index-url https://download.pytorch.org/whl/cu121
torch==2.3.0
torchvision==0.18.0
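
After installing from requirements-cuda.txt, one way to confirm which PyTorch build ended up in the environment is a small check like the sketch below. The function name describe_torch_build is hypothetical; it relies only on torch.__version__, torch.version.cuda (which is None for CPU-only builds), and torch.cuda.is_available().

```python
def describe_torch_build():
    """Return a one-line summary of the installed PyTorch build, if any."""
    try:
        import torch  # imported lazily so the check also works before installation
    except ImportError:
        return "torch is not installed in this environment"
    return (
        f"torch {torch.__version__}, "
        f"built for CUDA {torch.version.cuda}, "
        f"GPU usable: {torch.cuda.is_available()}"
    )


if __name__ == "__main__":
    print(describe_torch_build())
```

If the summary reports a CUDA version of None, the environment most likely has the CPU-only build, and reinstalling from requirements-cuda.txt may be needed.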

`.env`

You should never include sensitive information, such as API keys, in your code. A good way to manage environment variables is to use a .env file with the python-dotenv package.

NOTE: .env is ignored by GitHub's default .gitignore file. Make sure you don't push this file to your repo.

See my other blog post on this topic: Manage Environment Variables in Python Projects

I include the OpenAI API key in the .env file:

OPENAI_API_KEY=sk-Klmtxxx

which can be loaded into the program using the following code snippet:

import openai
import os
from dotenv import load_dotenv

load_dotenv()  # take environment variables from .env
openai.api_key = os.getenv('OPENAI_API_KEY')
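
Because os.getenv returns None when a variable is missing, a forgotten .env file tends to surface later as a confusing authentication error. A minimal sketch of a fail-fast check follows; the helper name require_env is hypothetical and not part of python-dotenv or the sample repo.

```python
import os


def require_env(name):
    """Return the value of an environment variable, or raise if it is unset or empty."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(
            f"{name} is not set. Add it to your .env file or export it in your shell."
        )
    return value


if __name__ == "__main__":
    try:
        from dotenv import load_dotenv  # provided by the python-dotenv package
        load_dotenv()  # read variables from .env into the process environment
    except ImportError:
        print("python-dotenv is not installed; relying on the shell environment")
    try:
        key = require_env("OPENAI_API_KEY")
        # Print only the length so the key itself never lands in logs.
        print(f"OPENAI_API_KEY loaded ({len(key)} characters)")
    except RuntimeError as err:
        print(err)
```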

`venv`

You should always use a virtual environment to set up your local development environment for the repo. There are many articles on why you should do this, such as this one.

Use the following commands to set up a local virtual environment in the venv folder (which is gitignored by default) and install the packages into it:

$ python -m venv venv
$ source venv/bin/activate
$ pip install -r requirements.txt
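
If you are not sure whether the environment is actually active before installing packages, a quick check like this sketch can tell you. It compares sys.prefix with sys.base_prefix, which differ inside an environment created by venv; the function name in_virtual_env is hypothetical.

```python
import sys


def in_virtual_env():
    """Return True when the interpreter runs inside a venv-style virtual environment."""
    return sys.prefix != sys.base_prefix


if __name__ == "__main__":
    if in_virtual_env():
        print(f"Virtual environment active: {sys.prefix}")
    else:
        print("No virtual environment active; activate one before running pip install")
```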

You can then run the notebook file in VS Code and run the Python program from the terminal.

Data Files

“GitHub limits the size of files allowed in repositories. If you attempt to add or update a file that is larger than 50 MiB, you will receive a warning from Git….GitHub blocks files larger than 100 MiB.”

As a rule of thumb, you should not push any files larger than 25 MiB.

You can host those data files on Kaggle or Hugging Face and put detailed instructions in the README on how to download the data files and where to store them.
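
To spot files that cross the 25 MiB rule of thumb before you commit, something like the following sketch could be run from the repo root. The function name find_large_files and the list of skipped folders are assumptions for illustration; the script scans the working tree directly and does not read .gitignore, so already-ignored files such as those under data/ will still be listed.

```python
import os

LIMIT_BYTES = 25 * 1024 * 1024  # 25 MiB rule of thumb
SKIP_DIRS = {".git", "venv", ".venv"}  # assumed folders that are never committed


def find_large_files(root=".", limit=LIMIT_BYTES):
    """Return sorted (path, size_in_bytes) pairs for files under root larger than limit."""
    large = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune skipped directories in place so os.walk does not descend into them.
        dirnames[:] = [d for d in dirnames if d not in SKIP_DIRS]
        for filename in filenames:
            path = os.path.join(dirpath, filename)
            if os.path.islink(path):
                continue  # ignore symlinks, including broken ones
            try:
                size = os.path.getsize(path)
            except OSError:
                continue  # file vanished or is unreadable
            if size > limit:
                large.append((path, size))
    return sorted(large)


if __name__ == "__main__":
    for path, size in find_large_files():
        print(f"{size / 2**20:.1f} MiB  {path}")
```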

Python Version

Ideally, you should also specify the Python version that works with your program. See my other blog post on how to use pyenv to manage Python versions: How to Setup Mac for Python Development
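
Besides documenting the version, a program can also check it at startup. The sketch below is one possible way to do that; the minimum of 3.8 is a placeholder assumption, not a version taken from the sample repo, and python_is_supported is a hypothetical helper name.

```python
import sys

MIN_PYTHON = (3, 8)  # placeholder assumption; replace with the version you tested


def python_is_supported(current=None, minimum=MIN_PYTHON):
    """Return True if the (major, minor) version is at least the minimum."""
    if current is None:
        current = sys.version_info
    return tuple(current[:2]) >= tuple(minimum)


if __name__ == "__main__":
    if not python_is_supported():
        found = ".".join(map(str, sys.version_info[:2]))
        needed = ".".join(map(str, MIN_PYTHON))
        raise SystemExit(f"Python {needed}+ is required, found {found}")
    print("Python version OK")
```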

PS. The featured image for this post is generated using HiddenArt tool from Takin.ai.