Python Project Structure
A clear project layout makes computational work easier to run, review, extend, and hand off. This lesson surveys common folders and setup files for scientific Python projects, with attention to where data, scripts, reusable package code, tests, environments, documentation, and collaboration files belong.
Section 1: Folder Structure
<project_name>
|
├── data/
| ├── raw/
| | └── <session_name>
| | ├── <session_file>.nlx
| | ├── <session_file>.dat
| | ├── <session_file>.xlsx
| | └── <session_name>.tif
| ├── preprocessed/
| | └── <session_name>
| | ├── <description>.npy
| | ├── <description>.h5
| | └── <description>.mat
| |
| ├── processed/
| | ├── <session_name1>.nix
| | └── <session_name2>.nix
| |
| └── final/
| └── <dataset_name>.parquet
|
├── reports/
| └── <report-group>/
| ├── <report>.png
| └── <report>.pdf
|
├── logs/
| └── <log-group>/
| └── <log>.txt
|
├── scripts/
| ├── <script>.py
| ├── <script>.r
| └── <script>.m
|
├── scratch/
| ├── <researcher1>
| | └── <notebook>.ipynb
| └── <researcher2>
| └── <notebook>.ipynb
|
├── notebooks/
| └── <notebook>.ipynb
|
├── dodo.py
├── Snakefile
├── Makefile
|
├── src/
| ├── <my_package>/
| | ├── __init__.py
| | └── <module>.py
| |
| └── <module>.py
├── tests/
| ├── conftest.py
| └── test_<group>.py
|
├── pyproject.toml
├── environment.yml
├── Dockerfile
├── compose.yml
├── Makefile
├── .github/
| └── workflows/
| └── <workflow>.yml
|
├── examples/
| ├── <example1>.ipynb
| └── <example2>.ipynb
|
├── docs/
| ├── <doc-section>.md
| └── <doc-section>.rst
|
├── README.md
├── LICENSE.txt
├── CONTRIBUTORS.txt
├── CODE_OF_CONDUCT.txt
└── datacite.xml
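The folder layout above can be created by hand, but a short script keeps it repeatable. The following is a minimal sketch using only the standard library; the `make_skeleton` function and the `my_project` name are hypothetical, and the folder list is simply copied from the tree, so trim it to what the project actually needs.

```python
from pathlib import Path
import tempfile

# Folder names taken from the tree above; adjust to the project's needs.
FOLDERS = [
    "data/raw",
    "data/preprocessed",
    "data/processed",
    "data/final",
    "reports",
    "logs",
    "scripts",
    "scratch",
    "notebooks",
    "src",
    "tests",
    "examples",
    "docs",
]


def make_skeleton(root: Path) -> list[Path]:
    """Create the project folders under `root` and return the created paths."""
    created = []
    for name in FOLDERS:
        folder = root / name
        # exist_ok=True makes the function safe to re-run on an existing project.
        folder.mkdir(parents=True, exist_ok=True)
        created.append(folder)
    return created


if __name__ == "__main__":
    # Demonstrate in a throwaway directory rather than the current folder.
    with tempfile.TemporaryDirectory() as tmp:
        for path in make_skeleton(Path(tmp) / "my_project"):
            print(path.relative_to(tmp))
```

Because existing folders are left alone, the same function can be pointed at a project that already has some of these folders.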
Data Files
Raw
Raw data is the original data; it doesn’t have to be pretty, just complete. Experimental raw data is organized by what was collected and when.
|
├── data/
| ├── raw/
| | └── <session_name>
| | ├── <session_file>.nlx
| | ├── <session_file>.dat
| | ├── <session_file>.xlsx
| | └── <session_name>.tif
| |
Preprocessed
Data is complex, and extracting variables out of raw data can be some work. The “preprocessed” section of data pipelines is where intermediate files can go; they tend to be focused on individual variables of each session and byproducts of third-party tools, stored in a way that makes the data easy to read in for later processing steps. Don’t worry if the folder organization here is fairly messy–data extraction is a messy business!
| |
| ├── preprocessed/
| | └── <session_name>
| | ├── <description>.npy
| | ├── <description>.h5
| | └── <description>.mat
| |
Processed
How do all these different variables relate to each other? “Processed” data includes the data’s schema and is meant to be complete, with as much of the data as possible accessed in the same way. Note that the data is still in a “records” format, organized by collection date; this makes it easy to add new processed data files without having to touch the old ones.
| |
| ├── processed/
| | ├── <session_name1>.nix
| | └── <session_name2>.nix
| |
Final
What data structure makes the data as easy to analyze as possible? These files group data for analysis: multiple sessions are combined, only specific variables are extracted, data and metadata may be duplicated in the files, and variables may appear in multiple files. The goal is to have files that someone can read straight into R, pandas, etc., and get started with statistics, data visualization, and machine learning!
This folder can also get complex, and that’s okay–data analysis is complex, and this folder is a representation of that data analysis. These files tend to be much smaller than the files from the previous steps.
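To make the "records to analysis table" step concrete, here is a minimal sketch of flattening per-session records into one row per trial, with the session metadata repeated on every row. The session contents, the field names, and the `to_rows` helper are all hypothetical, and CSV is used only to stay within the standard library; a real pipeline would more likely write Parquet into `data/final/`.

```python
import csv
import io

# Hypothetical per-session records, standing in for data loaded from processed/ files.
sessions = {
    "session1": {"subject": "A", "trials": [{"trial": 1, "rt": 0.42}, {"trial": 2, "rt": 0.39}]},
    "session2": {"subject": "B", "trials": [{"trial": 1, "rt": 0.51}]},
}


def to_rows(sessions: dict) -> list[dict]:
    """Flatten sessions into one row per trial, repeating session metadata on each row."""
    rows = []
    for name, session in sessions.items():
        for trial in session["trials"]:
            rows.append({"session": name, "subject": session["subject"], **trial})
    return rows


rows = to_rows(sessions)

# Written to an in-memory buffer here; a real pipeline would write into data/final/.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["session", "subject", "trial", "rt"])
writer.writeheader()
writer.writerows(rows)
print(buffer.getvalue())
```

Duplicating the subject on every row is deliberate: it is the kind of redundancy that makes the final table easy to filter and group without looking anything up elsewhere.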
| |
| └── final/
| └── <analysis_type1>.parquet
|
Code Files: Scripts
|
├── scripts/
| ├── <script>.py
| ├── <script>.r
| └── <script>.m
|
├── scratch/
| ├── <researcher1>
| | └── <notebook>.ipynb
| └── <researcher2>
| └── <notebook>.ipynb
|
├── notebooks/
| └── <notebook>.ipynb
|
├── dodo.py
├── Snakefile
├── Makefile
|
- scripts/ and notebooks/: Scripts that are meant to be run directly as a protocol belong together; their steps tend to be referenced in the methods section of a research paper. Sometimes people separate them by programming language (e.g. scripts_python/), but it’s usually not necessary.
- scratch/: Just playing around, and don’t want to worry about code quality or maintenance? Keep a scratch folder (alternatively, sometimes called sandbox or playground) for that! If working with multiple colleagues, each person can keep their own subfolder, as in the tree above.
- dodo.py, Snakefile, Makefile: In what order are these scripts supposed to be run? What inputs and outputs does each one need? Workflow management tools like doit, Snakemake, and Make are meant for describing these steps directly, and can run the full processing and analysis pipeline in order.
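As an illustration of how a workflow file records order, inputs, and outputs, here is a minimal `dodo.py` sketch. It assumes doit's convention of `task_*` functions that return dictionaries with `actions`, `file_dep`, and `targets` keys; the script names and data paths are hypothetical.

```python
# dodo.py: a sketch; the script and file names below are hypothetical.


def task_preprocess():
    """Extract variables from the raw session files."""
    return {
        "actions": ["python scripts/preprocess.py"],
        "file_dep": ["scripts/preprocess.py"],
        "targets": ["data/preprocessed/session1/spikes.npy"],
    }


def task_analyze():
    """Build the final dataset from the preprocessed files."""
    return {
        "actions": ["python scripts/analyze.py"],
        # Depending on the previous task's target tells doit to run that task first.
        "file_dep": ["scripts/analyze.py", "data/preprocessed/session1/spikes.npy"],
        "targets": ["data/final/dataset.parquet"],
    }
```

With doit installed, running `doit` in the project root should execute the tasks in dependency order and skip tasks whose dependencies have not changed.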
Code Files: Tests
├── tests/
| ├── conftest.py
| └── test_<group>.py
This is where your automated tests live. They are generally separated from the source code, which gives flexibility in how both the source and test files are organized and packaged.
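A test file can be as simple as a few functions whose names start with `test_`, which is the naming pattern pytest collects by default. The sketch below is hypothetical: the `zscore` function and the `my_package.stats` module it would normally be imported from are invented for illustration, and the function is defined inline only so the example is self-contained.

```python
# tests/test_stats.py: a sketch. In a real project the function under test would
# be imported from the package, e.g. `from my_package.stats import zscore`
# (hypothetical names); it is defined inline here so the example is self-contained.
import math


def zscore(values: list[float]) -> list[float]:
    """Standardize values to zero mean and unit (population) standard deviation."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / std for v in values]


def test_zscore_has_zero_mean():
    result = zscore([1.0, 2.0, 3.0])
    assert math.isclose(sum(result), 0.0, abs_tol=1e-12)


def test_zscore_has_unit_std():
    result = zscore([1.0, 2.0, 3.0])
    variance = sum(r ** 2 for r in result) / len(result)
    assert math.isclose(variance, 1.0)
```

Running `pytest` from the project root would discover and run both test functions.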
Code Files: Libraries (Functions, Classes, Constants, etc)
|
├── src/
| ├── <my_package>/
| | ├── __init__.py
| | └── <module>.py
| |
| └── <module>.py
|
├── pyproject.toml
|
This is where the custom project code that scripts reference lives. It comes with an installer file (pyproject.toml shown here, for Python projects), which installs the package into a location where your scripts can easily import it.
Pyproject.toml Minimal Example
[project]
name = "project-name"
version = "0.0.1"
requires-python = ">=3.10"
dependencies = ["matplotlib", "numpy>=1.26"]

| Command | Description |
|---|---|
| `pip install -e .` | Install the package and its dependencies into the current Python environment in editable mode, so changes to the source files take effect without reinstalling. |
| `pip uninstall project-name` | Remove this package from the current Python environment. Note: this won’t uninstall the dependencies. |
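After installing, it can be useful to confirm from Python that the package is registered in the current environment. This is a small sketch using the standard library's `importlib.metadata`; the `installed_version` helper is hypothetical, and `"project-name"` is the `name` from the minimal example above.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name: str):
    """Return the installed version of a distribution, or None if it is not installed."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


# "project-name" matches the `name` field in the minimal pyproject.toml above.
print(installed_version("project-name"))
```

A result of `None` suggests the install went into a different environment than the one running the script.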
Additional Fields
| Field in the [project] section | Description |
|---|---|
| `description = "A short description of the project's purpose."` | A short description; appears in `pip show`. |
| `authors = [{name = "Nicholas DG", email = "dg@email.com"}]` | The authors of the project. |
| `maintainers = [{name = "Nicholas DG", email = "dg@email.com"}]` | The people responsible for keeping the project going. |
| `readme = "README.md"` | Where to find the readme file. |
| `license = "MIT"` | What license the project uses. |
| `license = {file = "LICENSE.txt"}` | What license the project uses, if it’s found in a file. |

| Build Systems | Description |
|---|---|
| `[build-system]` with `requires = ["setuptools"]` and `build-backend = "setuptools.build_meta"` | Use setuptools (the default). |
| `[build-system]` with `requires = ["hatchling"]` and `build-backend = "hatchling.build"` | Use Hatch, a great modern builder. |
There is a lot more one can put into the file–more fields and explanations of the pyproject.toml format can be found at the official guide: https://packaging.python.org/en/latest/guides/writing-pyproject-toml/
Aside: What if I don’t want an installer file?
That’s okay, but you’ll need to tell your scripts how to find your library code somehow. Most scripting languages offer a way to do this inside your scripts by modifying their import search path, so they know what folders to search in. Here’s the relevant code for Python:
| Python Code | Description |
|---|---|
| `import os` | Add the src folder to the Python import command’s search path. |
| `import os` | View the operating system’s search path. |
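One possible way to extend the import search path from inside a script is sketched below. It uses `sys.path` and `pathlib`, and assumes the script lives in `<project>/scripts/` with the library code in `<project>/src/`; the `my_package` name in the comment is hypothetical.

```python
import sys
from pathlib import Path

# Assumes this script lives in <project>/scripts/, so the project root is one level up.
# In a notebook or interactive session `__file__` is undefined; fall back to the
# current working directory there.
try:
    project_root = Path(__file__).resolve().parent.parent
except NameError:
    project_root = Path.cwd()

src_dir = project_root / "src"

# Put src/ at the front of the import search path, once.
if str(src_dir) not in sys.path:
    sys.path.insert(0, str(src_dir))

# Modules inside src/ can now be imported by name, e.g. `import my_package`
# (hypothetical package name).
print(sys.path[0])
```

This has to be repeated at the top of every script that needs the library code, which is the main trade-off compared with installing the package once.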
Section 2: Computational Environment Setup Files
|
├── environment.yml
├── Dockerfile
├── compose.yml
├── project.sif
├── Makefile
├── .github/
| └── workflows/
| └── <workflow>.yml
|
These files are commonly placed in the root directory because they are used by software that helps set up the computational environment (installing libraries, setting up the operating system, downloading data, configuring environment variables, etc.) for the entire project.
- environment.yml: used by the Conda, Mamba, and Micromamba package managers:
  - Cheat sheet: https://docs.conda.io/projects/conda/en/4.6.0/_downloads/52a95608c49671267e40c689e0bc00ca/conda-cheatsheet.pdf
  - Download sources:
    - Miniforge, where you can download conda without needing a paid license: https://conda-forge.org/miniforge/
    - Mamba: https://mamba.readthedocs.io/en/latest/installation/mamba-installation.html
    - Micromamba: https://mamba.readthedocs.io/en/latest/installation/micromamba-installation.html
- Dockerfile and compose.yml are used by Docker, and .sif files are used by Singularity and Apptainer, which can additionally set up your project in its own sandboxed operating system (called a “container”).
- Makefile is used by Make, and can run any kind of recipe you give it: sometimes just installing things, but sometimes also running a script pipeline. It’s very generic and flexible that way.
Environment.yml Reference
Minimal Example:
##### environment.yml
dependencies: [python=3.11]

Useful conda terminal commands:
| Command | Description |
|---|---|
| `conda env create -f environment.yml` | Create an environment from a file. |
| `conda create -n <name>` | Create an environment without a file. |
| `conda env remove --name <name>` | Delete an environment. |
| `conda env export > environment-lock.yml` | Have conda tell you what it installed into the environment. |
Optional Fields:
| Field | Example Values | Description |
|---|---|---|
| `channels:` | `[defaults, conda-forge]` | Where conda should look to download dependencies. |
| `name:` | `my-env` | A name for activating the environment without knowing its path: `conda activate my-env`. |
| `prefix:` | `C:\Users\nickdg\miniconda3` | An absolute path specifying where on the computer to install the environment. Note: not great for cross-computer usage. It’s better to specify the path when building the environment, with `conda env create -f environment.yml -p ./env`, so that the path is chosen on the computer where the environment is built. |
Operating System-Level Package Managers
| Operating System | Package Manager | Search Command | Install Command |
|---|---|---|---|
| Windows | WinGet | `winget search <name>` | `winget install --id=<Id>` |
| Windows | Chocolatey | `choco search <name>` | `choco install <name>` |
| Mac | Homebrew | `brew search <name>` | `brew install <name>` |
| Linux | APT | `apt-cache search <name>` | `apt-get install <name>` |
| Linux | Yum | `yum search <name>` | `yum install <name>` |
Virtual Machines: Vagrant
| Command | Description |
|---|---|
| `vagrant init generic/ubuntu2204` | Make a Vagrantfile that specifies Ubuntu 22.04 as the virtual machine. |
| `vagrant up` | Create and start the virtual machine. |
| `vagrant ssh` | Log in to your virtual machine from the terminal. |
Section 3: Documentation
|
├── examples/
| ├── <example1>.ipynb
| └── <example2>.ipynb
|
├── docs/
| ├── <doc-section>.md
| └── <doc-section>.rst
|
├── README.md
├── LICENSE.txt
├── datacite.xml
|
These files are there to help others better understand how to use your project. Written explanations, interactive examples, references to licenses, etc., all help tell people about your project and how it is meant to relate to them.
Readme File: Essential Parts
A useful reference: https://www.makeareadme.com/
| Section | What goes here |
|---|---|
| `# <project name>` | The title. Put the name of the project here. |
| `## Installation` | How to install the project. It’s best to include copy-pastable code in code blocks. |
| `## Usage` | The main ways the project is run, and what to expect when it works properly. Include code blocks here, too. |