Have you ever found some Python code on GitHub and could not easily run it locally due to issues like missing data, packages, versioning, etc.?
In this post, I discuss what a real self-contained Python repository should have.
The quick checklist is as follows:
.gitignore with additional settings
requirements.txt with locked package versions and variations for development, CUDA, etc.
.env for managing environment variables
venv for the virtual environment

I have created a sample real self-contained Python repo here, which has the following structure:
This repo includes two files for demonstration purposes:
openai-api-starter.ipynb: a Python notebook file to show how to call the OpenAI API.
train_mnist.py: a Python program to train a diffusion model using the MNIST dataset.

.gitignore
Always create your repo with the following settings:
.gitignore specifies intentionally untracked files - see details here
The GitHub default .gitignore is pretty good: it ignores many common files and folders so that you don’t push them into your repo, e.g., the default venv virtual environment folder and the .env environment variable file (discussed below).
You often need to add folders and/or files to this file so that you don’t push:
In the sample repo, I added the following lines to the default GitHub file:
# Training data folder
data/
# large model files
*.pth
The data/ folder will be created once you run the train_mnist.py file. These data files should not be stored in the repo.
The *.pth files are the diffusion model files generated by the training script. They are too large to be included in the repo.

requirements.txt
You should always have a requirements.txt file that lists the packages needed by the Python programs in the repo. It’s useful to pin the specific versions of the packages that work with your programs so that newer versions do not break them: if no version is specified, the latest version is installed by default. Once the packages are installed, you can use pip freeze -r requirements.txt to output the installed package versions.
openai==1.25.0
tqdm==4.66.2
torch==2.3.0
torchvision==0.18.0
python-dotenv==1.0.1
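To illustrate why pinning helps, here is a minimal sketch (not part of the sample repo) that compares the pinned versions in a requirements file with what is actually installed. The helper names parse_pins and find_mismatches are hypothetical, and the parser only understands simple name==version lines.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Dict, Optional, Tuple


def parse_pins(text: str) -> Dict[str, str]:
    """Return {package: version} for every simple 'name==version' line."""
    pins = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line or line.startswith("-"):  # skip blanks and options such as -r
            continue
        if "==" in line:
            name, pinned = line.split("==", 1)
            pins[name.strip()] = pinned.strip()
    return pins


def find_mismatches(pins: Dict[str, str]) -> Dict[str, Tuple[str, Optional[str]]]:
    """Map each package that is missing or differs to (pinned, installed)."""
    mismatches = {}
    for name, pinned in pins.items():
        try:
            # Ignore a local version suffix such as "+cu121" when comparing.
            installed = version(name).split("+", 1)[0]
        except PackageNotFoundError:
            installed = None  # package is not installed at all
        if installed != pinned:
            mismatches[name] = (pinned, installed)
    return mismatches
```

You could call find_mismatches(parse_pins(open("requirements.txt").read())) inside the activated environment; an empty dictionary means everything matches the pins.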
You can also have requirements-dev.txt to include packages that are only needed for development but not for production.
As an example, I included requirements-cuda.txt and requirements-torch.txt in the sample repo to address a potential issue with CUDA. If you have CUDA 12.1 installed on Windows, you have to install a special PyTorch build to be compatible.
You should run pip install -r requirements-cuda.txt, which also installs the packages in requirements-torch.txt.
requirements-cuda.txt:
-r requirements-torch.txt
openai==1.25.0
tqdm==4.66.2
python-dotenv==1.0.1
requirements-torch.txt:
--index-url https://download.pytorch.org/whl/cu121
torch==2.3.0
torchvision==0.18.0
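After installing from requirements-cuda.txt, it can help to confirm that PyTorch actually sees the GPU. The following is a rough sketch, not code from the sample repo; describe_torch_device is a hypothetical helper name, and it returns a message instead of failing when torch is not installed.

```python
import importlib.util


def describe_torch_device() -> str:
    """Summarize whether torch is installed and whether it can use CUDA."""
    if importlib.util.find_spec("torch") is None:
        return "torch is not installed"
    import torch  # imported lazily so the check also works without torch

    if torch.cuda.is_available():
        return f"torch {torch.__version__} with CUDA {torch.version.cuda}"
    return f"torch {torch.__version__} without CUDA (CPU only)"


print(describe_torch_device())
```

If the message reports CPU only on a machine with a CUDA GPU, the default CPU build was probably installed instead of the one from the cu121 index.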
.env
You should never include sensitive information in your code, such as API keys. A good way of managing environment variables is to use a .env file and the python-dotenv package.
NOTE: .env is gitignored by the default GitHub .gitignore file - make sure you don’t push this file to your repo.
See my other blog post on this topic: Manage Environment Variables in Python Projects
I include the OpenAI API key in the .env file:
OPENAI_API_KEY=sk-Klmtxxx
which can be loaded into the program using the following code snippet:
import openai
import os
from dotenv import load_dotenv
load_dotenv() # take environment variables from .env
openai.api_key = os.getenv('OPENAI_API_KEY')
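One caveat with os.getenv is that it silently returns None when the variable is missing, so the error only shows up later when the API is called. A small sketch of a fail-fast alternative follows; require_env is a hypothetical helper name and is not part of python-dotenv or the sample repo.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise if it is unset or empty."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(
            f"{name} is not set; add it to your .env file or shell environment"
        )
    return value


# Usage sketch, after load_dotenv() has run:
#   openai.api_key = require_env("OPENAI_API_KEY")
```

This way a missing or misspelled key in .env is reported at startup with a clear message.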
venv
You should always use a virtual environment to set up your local development environment for the repo. There are many articles on why you should do this, such as this one.
Use the following commands to set up a local virtual environment in the venv folder (which is gitignored by default) and install the packages in the virtual environment:
$ python -m venv venv
$ source venv/bin/activate
$ pip install -r requirements.txt
Then, you can run the notebook file in VS Code and execute the Python program from the terminal.
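If you are not sure whether the virtual environment is active, a quick check from Python can tell you. This is a minimal sketch that assumes the environment was created with python -m venv as above; in_virtual_env is a hypothetical helper name.

```python
import sys


def in_virtual_env() -> bool:
    """Return True when the interpreter is running inside a venv-style environment."""
    # Inside a virtual environment, sys.prefix points at the environment folder,
    # while sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


print("virtual environment active:", in_virtual_env())
print("interpreter prefix:", sys.prefix)
```

When the environment is active, the printed prefix should point at the venv folder inside the repo.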
“GitHub limits the size of files allowed in repositories. If you attempt to add or update a file that is larger than 50 MiB, you will receive a warning from Git….GitHub blocks files larger than 100 MiB.”
As a rule of thumb, you should not push any files larger than 25 MiB.
You can host those data files on Kaggle or Hugging Face and put detailed instructions in the README on how to download them and where to store them.
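To apply the 25 MiB rule of thumb before pushing, you could scan the working folder for large files. The sketch below is one way to do it and makes some assumptions: find_large_files is a hypothetical helper, it skips only the .git and venv folders, and it does not read .gitignore, so files that are already ignored may still be listed.

```python
import os
from pathlib import Path
from typing import List, Tuple

MIB = 1024 * 1024
SKIP_DIRS = {".git", "venv"}  # assumption: folders that are never pushed


def find_large_files(root: str = ".", limit_mib: float = 25) -> List[Tuple[str, float]]:
    """Return (path, size in MiB) for files above the limit, largest first."""
    results = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune skipped folders in place so os.walk does not descend into them.
        dirnames[:] = [d for d in dirnames if d not in SKIP_DIRS]
        for filename in filenames:
            path = Path(dirpath) / filename
            try:
                size = path.stat().st_size
            except OSError:
                continue  # unreadable or vanished file; ignore it
            if size > limit_mib * MIB:
                results.append((str(path), size / MIB))
    return sorted(results, key=lambda item: item[1], reverse=True)


for path, size_mib in find_large_files("."):
    print(f"{size_mib:8.1f} MiB  {path}")
```

Anything it lists is a candidate for .gitignore or for hosting elsewhere.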
Ideally, you should also specify the Python version that works for your program. See my other blog post on how to use pyenv to manage Python versions: How to Setup Mac for Python Development
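Besides documenting the Python version, a program can check it at startup. Here is a minimal sketch; the minimum of (3, 10) is only a placeholder assumption, not the version required by the sample repo, and check_python_version is a hypothetical helper name.

```python
import sys
from typing import Optional, Tuple

MIN_PYTHON = (3, 10)  # placeholder assumption; use the version you actually tested


def check_python_version(
    minimum: Tuple[int, int] = MIN_PYTHON,
    current: Optional[Tuple[int, int]] = None,
) -> None:
    """Raise RuntimeError when the running interpreter is older than `minimum`."""
    if current is None:
        current = (sys.version_info.major, sys.version_info.minor)
    if current < minimum:
        raise RuntimeError(
            f"Python {minimum[0]}.{minimum[1]}+ is required, "
            f"but this is Python {current[0]}.{current[1]}"
        )
```

Calling check_python_version() at the top of a script such as train_mnist.py would stop early with a clear message instead of failing later on an unsupported interpreter.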
PS. The featured image for this post is generated using HiddenArt tool from Takin.ai.