iBOTS Learn: Python Project Structure

Software Delivery for Scientific Python Projects

Python Project Structure

Author

Dr. Nicholas A. Del Grosso

A clear project layout makes computational work easier to run, review, extend, and hand off. This lesson surveys common folders and setup files for scientific Python projects, with attention to where data, scripts, reusable package code, tests, environments, documentation, and collaboration files belong.

Section 1: Folder Structure

<project_name>
|
├── data/
|   ├── raw/
|   |   └── <session_name>
|   |       ├── <session_file>.nlx
|   |       ├── <session_file>.dat
|   |       ├── <session_file>.xlsx
|   |       └── <session_name>.tif
|   ├── preprocessed/
|   |   └── <session_name>
|   |       ├── <description>.npy
|   |       ├── <description>.h5
|   |       └── <description>.mat
|   |
|   ├── processed/
|   |   ├── <session_name1>.nix
|   |   └── <session_name2>.nix
|   |
|   └── final/
|       └── <dataset_name>.parquet
|
├── reports/
|   └── <report-group>/
|       ├── <report>.png
|       └── <report>.pdf
|
├── logs/
|   └── <log-group>/
|       └── <log>.txt
|
├── scripts/
|   ├── <script>.py
|   ├── <script>.r
|   └── <script>.m
|
├── scratch/
|   ├── <researcher1>
|   |   └── <notebook>.ipynb
|   └── <researcher2>
|       └── <notebook>.ipynb
|
├── notebooks/
|   └── <notebook>.ipynb
|
├── dodo.py
├── Snakefile
├── Makefile
|
├── src/
|   ├── <my_package>/
|   |   ├── __init__.py
|   |   └── <module>.py
|   |
|   └── <module>.py
├── tests/
|   ├── conftest.py
|   └── test_<group>.py
|
├── pyproject.toml
├── environment.yml
├── Dockerfile
├── compose.yml
├── Makefile
├── .github/
|   └── workflows/
|       └── <workflow>.yml
|
├── examples/
|   ├── <example1>.ipynb
|   └── <example2>.ipynb
|    
├── docs/
|   ├── <doc-section>.md
|   └── <doc-section>.rst
|
├── README.md
├── LICENSE.txt
├── CONTRIBUTORS.txt
├── CODE_OF_CONDUCT.txt
└── datacite.xml
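
To make the layout concrete, here is a minimal sketch that creates the empty folder skeleton with Python's standard library. The function name `create_project_skeleton` and the constant `PROJECT_FOLDERS` are hypothetical; the folder names are taken from the tree above and can be trimmed to what a project actually needs.

```python
from pathlib import Path

# Folder names taken from the tree above; adjust to your project.
PROJECT_FOLDERS = [
    "data/raw",
    "data/preprocessed",
    "data/processed",
    "data/final",
    "reports",
    "logs",
    "scripts",
    "scratch",
    "notebooks",
    "src",
    "tests",
    "examples",
    "docs",
]


def create_project_skeleton(root):
    """Create the empty folder skeleton under `root` and return the folder paths."""
    root = Path(root)
    created = []
    for folder in PROJECT_FOLDERS:
        path = root / folder
        # exist_ok=True makes the function safe to re-run on an existing project.
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    return created
```

Calling `create_project_skeleton("my_project")` only creates folders; setup files such as pyproject.toml and README.md still have to be written by hand.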

Data Files

Raw

Raw data is the original data; it doesn’t have to be pretty, just complete. Experimental raw data is organized by what was collected and when, typically with one folder per recording session.

|
├── data/
|   ├── raw/
|   |   └── <session_name>
|   |       ├── <session_file>.nlx  
|   |       ├── <session_file>.dat
|   |       ├── <session_file>.xlsx
|   |       └── <session_name>.tif
|   |
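
Because raw data is grouped into one folder per session, a script can discover the sessions by listing the folders. A minimal sketch, assuming the layout in the tree above; the function name `list_raw_sessions` is hypothetical.

```python
from pathlib import Path


def list_raw_sessions(raw_dir):
    """Map each session folder name under `raw_dir` to its sorted file names."""
    raw_dir = Path(raw_dir)
    sessions = {}
    for session_dir in sorted(p for p in raw_dir.iterdir() if p.is_dir()):
        sessions[session_dir.name] = sorted(
            f.name for f in session_dir.iterdir() if f.is_file()
        )
    return sessions
```

The function only reads file names, which fits the idea that raw data is kept exactly as it was collected.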

Preprocessed

Data is complex, and extracting variables from raw data can take some work. The “preprocessed” part of a data pipeline is where intermediate files go. They tend to hold individual variables from each session, or byproducts of third-party tools, stored in a way that makes the data easy to read in for later processing steps. Don’t worry if the folder organization here is fairly messy: data extraction is a messy business!

|   |
|   ├── preprocessed/
|   |   └── <session_name>
|   |       ├── <description>.npy
|   |       ├── <description>.h5
|   |       └── <description>.mat
|   |

Processed

How do all these different variables relate to each other? “Processed” data includes the data’s schema and is meant to be complete, so that as much of the data as possible is accessed in the same way. Note that the data is still in a “records” format, organized by collection date; this makes it easy to add new processed data files without having to touch the old ones.

|   |
|   ├── processed/
|   |   ├── <session_name1>.nix
|   |   └── <session_name2>.nix
|   |
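
Because each session gets its own processed file, a pipeline can work out which sessions still need processing by comparing the folder names in raw/ with the file names in processed/. A minimal sketch, assuming (as in the trees above) that each processed file is named after its session; the function name `sessions_needing_processing` is hypothetical.

```python
from pathlib import Path


def sessions_needing_processing(raw_dir, processed_dir, suffix=".nix"):
    """Return the raw session names that have no processed file yet."""
    raw_names = {p.name for p in Path(raw_dir).iterdir() if p.is_dir()}
    # "session1.nix" counts as the processed output of the session "session1".
    done = {p.stem for p in Path(processed_dir).glob(f"*{suffix}")}
    return sorted(raw_names - done)
```

Existing processed files are never opened or rewritten here, so adding a new session leaves the old outputs untouched.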

Final

What data structure makes the data as easy to analyze as possible? These files contain data grouped in ways that make them easy to analyze: multiple sessions are combined, only specific variables are extracted, data and metadata may be duplicated in the files, and variables may appear in multiple files. The goal here is to have files that someone can just read into R, Pandas, etc., and get started with statistics, data visualization, and machine learning!

This folder can also get complex, and that’s okay–data analysis is complex, and this folder is a representation of that data analysis. These files tend to be much smaller in size than the previous steps.

|   |
|   └── final/
|       └── <analysis_type1>.parquet
|

Code Files: Scripts

|
├── scripts/
|   ├── <script>.py
|   ├── <script>.r
|   └── <script>.m
|
├── scratch/
|   ├── <researcher1>
|   |   └── <notebook>.ipynb
|   └── <researcher2>
|       └── <notebook>.ipynb
|
├── notebooks/
|   └── <notebook>.ipynb
|
├── dodo.py
├── Snakefile
├── Makefile
|

scripts/ and notebooks/: Scripts that are meant to be run directly as a protocol belong together; their steps tend to be referenced in the methods sections of a research paper. Sometimes people will separate them by programming language (e.g. scripts_python/), but it’s usually not necessary.
scratch/: Just playing around, and don’t want to worry about code quality or maintenance? Keep a scratch folder (alternatively, sometimes called sandbox or playground) for that! If working with multiple colleagues, give each researcher their own subfolder.
dodo.py, Snakefile, Makefile: What order are these scripts supposed to be run in? What inputs and outputs does each one need? Workflow management tools like doit, Snakemake, and Make are meant for describing these steps directly, and can run them in order to carry out the full processing and analysis pipeline.
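
As a rough illustration of how a workflow file records order, inputs, and outputs, here is a small dodo.py sketch for doit. The script and data file names are hypothetical. doit treats each function whose name starts with `task_` as a task, and a task whose `file_dep` lists another task's target is run after that task.

```python
# dodo.py (script and data file names are hypothetical)


def task_preprocess():
    """Extract a variable from the raw data."""
    return {
        "actions": ["python scripts/preprocess.py"],
        "file_dep": ["scripts/preprocess.py"],
        "targets": ["data/preprocessed/session1/speed.npy"],
    }


def task_analyze():
    """Build the final dataset from the preprocessed file."""
    return {
        "actions": ["python scripts/analyze.py"],
        # Depending on the target of task_preprocess puts this task after it.
        "file_dep": ["scripts/analyze.py", "data/preprocessed/session1/speed.npy"],
        "targets": ["data/final/speed.parquet"],
    }
```

With doit installed, running `doit` in the project root executes both tasks in dependency order and skips tasks whose inputs have not changed.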

Code Files: Tests


├── tests/
|   ├── conftest.py
|   └── test_<group>.py

This is where your automated tests live. They are generally separated from the source code, which gives flexibility in organizing and packaging both the source and test files.
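
For orientation, a test file might look like the sketch below. It assumes pytest as the test runner (conftest.py in the tree is pytest's file for shared fixtures); by default pytest collects files named `test_*.py` and functions named `test_*`. The `mean` function is a hypothetical stand-in, defined inline so the example is self-contained; in a real project it would be imported from the package under src/.

```python
# tests/test_stats.py (hypothetical file and function names)


def mean(values):
    """Stand-in for a function that would normally be imported from src/."""
    return sum(values) / len(values)


def test_mean_of_known_values():
    assert mean([1, 2, 3]) == 2


def test_mean_of_single_value():
    assert mean([5]) == 5
```

Running `pytest` from the project root would discover and run both test functions.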

Code Files: Libraries (Functions, Classes, Constants, etc)

|
├── src/
|   ├── <my_package>/
|   |   ├── __init__.py
|   |   └── <module>.py
|   |
|   └── <module>.py
|
├── pyproject.toml
|

This is where the custom project code that your scripts reference lives. It comes with an installer file (pyproject.toml shown here, for Python projects), which installs the packages into a location where your scripts can easily import them.

Pyproject.toml Minimal Example

[project]
name = "project-name"
version = "0.0.1"
requires-python = ">=3.10"
dependencies = ["matplotlib", "numpy>=1.26"]

Command	Description
`pip install -e .`	Install the package and its dependencies into the current Python environment, but keep the files easy to modify.
`pip uninstall project-name`	Remove this package from the current Python environment. Note: won’t uninstall the dependencies.
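
After installing, one way to confirm that scripts will be able to import the package is to ask Python whether it can find it. A minimal sketch; the helper name `is_importable` is hypothetical, and the name to check is the import name (the folder under src/), which may differ from the project name in pyproject.toml.

```python
import importlib.util


def is_importable(package_name):
    """Return True if `import package_name` would find a top-level package."""
    return importlib.util.find_spec(package_name) is not None
```

For example, `is_importable("my_package")` is expected to be true only once the project has been installed into the active environment.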

Additional Fields

In the `[project]` section:
`description = "A short description of the project's purpose."`	A short description; appears in `pip show`.
`authors = [{name = "Nicholas DG", email = "dg@email.com"}]`	The authors of the project.
`maintainers = [{name = "Nicholas DG", email = "dg@email.com"}]`	The people responsible for keeping the project going.
`readme = "README.md"`	Where to find the readme file.
`license = "MIT"`	What license the project uses.
`license = {file = "LICENSE.txt"}`	What license the project uses, if it’s found in a file.

Build Systems
Use `setuptools` (the default):

[build-system]
requires = ["setuptools >= 61.0"]
build-backend = "setuptools.build_meta"

Use `hatchling`, the build backend of the `hatch` project, a great modern builder:

[build-system]
requires = ["hatchling"]
build-backend = "hatchling.build"

There is a lot more one can put into the file–more fields and explanations of the pyproject.toml format can be found at the official guide: https://packaging.python.org/en/latest/guides/writing-pyproject-toml/

Aside: What if I don’t want an installer file?

That’s okay, but you’ll need to tell your scripts how to find your library code somehow. Most scripting languages offer a way to do this inside your scripts by modifying their import search path, so they know what folders to search in. Here’s the relevant code for Python:

Python Code	Description
`import sys; sys.path`	View the Python import command’s search path
`import sys; sys.path.append('../src')`	Add the `src` folder to the Python import command’s search path
`import os; os.environ['PATH']`	View the operating system’s search path
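
A relative path such as `'../src'` is resolved against the folder the script is launched from, so it only works when the script is started from inside scripts/. The sketch below anchors the path to the script file instead. It assumes the layout above, with scripts/ and src/ side by side in the project root; the helper name `add_src_to_path` is hypothetical.

```python
import sys
from pathlib import Path


def add_src_to_path(script_file):
    """Put the project's src/ folder (next to the script's folder) on sys.path."""
    src_dir = Path(script_file).resolve().parent.parent / "src"
    if str(src_dir) not in sys.path:
        # Insert at the front so project code wins over same-named modules.
        sys.path.insert(0, str(src_dir))
    return src_dir


# In scripts/<script>.py, call this before importing project code:
# add_src_to_path(__file__)
```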

Section 2: Computational Environment Setup Files

|
├── environment.yml
├── Dockerfile
├── compose.yml
├── project.sif
├── Makefile
├── .github/
|   └── workflows/
|       └── <workflow>.yml
|

These files are commonly placed in the root directory, because they are used by software that helps set up the computational environment (installing libraries, setting up the operating system, downloading data, configuring environment variables, etc.) for the entire project.

environment.yml: used by the Conda, Mamba, and Micromamba package managers:
- Cheat sheet: https://docs.conda.io/projects/conda/en/4.6.0/_downloads/52a95608c49671267e40c689e0bc00ca/conda-cheatsheet.pdf
- Download sources:
  - Miniforge, where you can download conda without needing a paid license: https://conda-forge.org/miniforge/
  - Mamba: https://mamba.readthedocs.io/en/latest/installation/mamba-installation.html
  - Micromamba: https://mamba.readthedocs.io/en/latest/installation/micromamba-installation.html
Dockerfile and compose.yml are used by Docker, and .sif files are used by Singularity and Apptainer. These tools can additionally set up your project inside its own sandboxed operating system (called a “container”).
Makefile is used by Make, and can run any kind of recipe you give it: sometimes just installing things, sometimes also running a script pipeline. It’s very generic and flexible that way.

Environment.yml Reference

Minimal Example:

# environment.yml
dependencies: [python=3.11]

Useful `conda` terminal commands:

Command	Description
`conda env create -f environment.yml`	Create an environment from a file.
`conda create -n <name>`	Create an environment without a file.
`conda env remove --name <name>`	Delete an environment.
`conda env export > environment-lock.yml`	Have conda tell you what it installed into the environment.

Optional Fields:

Field	Example Values	Description
`channels:`	`[defaults, conda-forge]`	Where `conda` should look to download dependencies
`name:`	`my-env`	A name to use to activate the environment, without knowing the path: `conda activate my-env`
`prefix:`	`C:\Users\nickdg\miniconda3`	An absolute path saying where on the computer to install the environment. Note: not great for cross-computer usage. It’s better to specify the path when building the env with `conda env create -f env.yml -p ./env`, where the computer can work out the path at runtime.

Operating System-Level Package Managers

Operating System	Package Manager	Search Command	Install Command
Windows	WinGet	`winget search <name>`	`winget install --id=<Id>`
Windows	Chocolatey	`choco search <name>`	`choco install <name>`
Mac	Homebrew	`brew search <name>`	`brew install <name>`
Linux	APT	`apt-cache search <name>`	`apt-get install <name>`
Linux	Yum	`yum search <name>`	`yum install <name>`
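
A setup script can check which of these package managers is present before trying to call one. A minimal sketch using only the standard library; the function name `available_package_managers` is hypothetical, and the executable names are the ones used in the table above.

```python
import shutil

# Executable names from the table above.
PACKAGE_MANAGERS = ["winget", "choco", "brew", "apt-get", "yum"]


def available_package_managers():
    """Return the package managers whose executable is found on the system PATH."""
    return [name for name in PACKAGE_MANAGERS if shutil.which(name)]
```

The result depends on the machine: an empty list simply means none of the listed tools is on the search path.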

Virtual Machines: Vagrant

Command	Description
`vagrant init generic/ubuntu2204`	Make a Vagrantfile that specifies Ubuntu 22.04 as the virtual machine.
`vagrant up`	Create the virtual machine.
`vagrant ssh`	Log in to your virtual machine on the terminal.

Section 3: Documentation

|
├── examples/
|   ├── <example1>.ipynb
|   └── <example2>.ipynb
|    
├── docs/
|   ├── <doc-section>.md
|   └── <doc-section>.rst
|
├── README.md
├── LICENSE.txt
├── datacite.xml
|

These files help others understand how to use your project. Written explanations, interactive examples, references to licenses, etc., all help tell people about your project and how it is meant to relate to them.

Readme File: Essential Parts

A useful reference: https://www.makeareadme.com/

Section	What goes here
`# <project name>`	The title. Put the name of the project there.
`## Installation`	How to install the project. Best to include copy-pastable code in code blocks.
`## Usage`	The main ways the project is run, and what to expect when it works properly. Include code blocks here, too.
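
As a small illustration, the essential sections can be checked automatically, for example in a test or a pre-release script. A minimal sketch, assuming Markdown headings written with leading `#` characters; the function name `missing_readme_sections` is hypothetical.

```python
import re


def missing_readme_sections(readme_text, required=("Installation", "Usage")):
    """Return the required section names that have no heading in the README text."""
    headings = {
        match.group(1).strip()
        for match in re.finditer(
            r"^#{1,6}[ \t]+(.+?)[ \t]*$", readme_text, flags=re.MULTILINE
        )
    }
    return [name for name in required if name not in headings]
```

An empty list means every required heading was found; the check says nothing about whether the sections are well written.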

Section 4: Collaboration

| 
├── CONTRIBUTORS.txt
└── CODE_OF_CONDUCT.txt

These files explain to other collaborators how to work on the project; they’re meant for your internal team.