Python Project Structure

Author
Dr. Nicholas A. Del Grosso

A clear project layout makes computational work easier to run, review, extend, and hand off. This lesson surveys common folders and setup files for scientific Python projects, with attention to where data, scripts, reusable package code, tests, environments, documentation, and collaboration files belong.

Section 1: Folder Structure

<project_name>
|
├── data/
|   ├── raw/
|   |   └── <session_name>
|   |       ├── <session_file>.nlx
|   |       ├── <session_file>.dat
|   |       ├── <session_file>.xlsx
|   |       └── <session_name>.tif
|   ├── preprocessed/
|   |   └── <session_name>
|   |       ├── <description>.npy
|   |       ├── <description>.h5
|   |       └── <description>.mat
|   |
|   ├── processed/
|   |   ├── <session_name1>.nix
|   |   └── <session_name2>.nix
|   |
|   └── final/
|       └── <dataset_name>.parquet
|
├── reports/
|   └── <report-group>/
|       ├── <report>.png
|       └── <report>.pdf
|
├── logs/
|   └── <log-group>/
|       └── <log>.txt
|
├── scripts/
|   ├── <script>.py
|   ├── <script>.r
|   └── <script>.m
|
├── scratch/
|   ├── <researcher1>
|   |   └── <notebook>.ipynb
|   └── <researcher2>
|       └── <notebook>.ipynb
|
├── notebooks/
|   └── <notebook>.ipynb
|
├── dodo.py
├── Snakefile
├── Makefile
|
├── src/
|   ├── <my_package>/
|   |   ├── __init__.py
|   |   └── <module>.py
|   |
|   └── <module>.py
├── tests/
|   ├── conftest.py
|   └── test_<group>.py
|
├── pyproject.toml
├── environment.yml
├── Dockerfile
├── compose.yml
├── Makefile
├── .github/
|   └── workflows/
|       └── <workflow>.yml
|
├── examples/
|   ├── <example1>.ipynb
|   └── <example2>.ipynb
|    
├── docs/
|   ├── <doc-section>.md
|   └── <doc-section>.rst
|
├── README.md
├── LICENSE.txt
├── CONTRIBUTORS.txt
├── CODE_OF_CONDUCT.txt
└── datacite.xml
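
The folders in the tree above can be created in one step. The following is a minimal sketch rather than a prescribed tool: the function name `scaffold_project` and the particular selection of folders are assumptions made for illustration, so trim or extend the list to match what your project actually needs.

```python
from pathlib import Path

# Folders taken from the tree above; trim or extend to fit your project.
PROJECT_FOLDERS = [
    "data/raw",
    "data/preprocessed",
    "data/processed",
    "data/final",
    "reports",
    "logs",
    "scripts",
    "scratch",
    "notebooks",
    "src",
    "tests",
    "examples",
    "docs",
]


def scaffold_project(root):
    """Create the project folders under `root` and return the created paths."""
    root = Path(root)
    created = []
    for folder in PROJECT_FOLDERS:
        path = root / folder
        # exist_ok=True makes the function safe to re-run on an existing project.
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    return created
```

Calling `scaffold_project("my_project")` would create the folders inside a (hypothetical) my_project directory; running it a second time changes nothing.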

  

Data Files

Raw

Raw data is the original data as it was collected; it doesn’t have to be pretty, just complete. Experimental raw data is organized by what was collected and when.

|
├── data/
|   ├── raw/
|   |   └── <session_name>
|   |       ├── <session_file>.nlx  
|   |       ├── <session_file>.dat
|   |       ├── <session_file>.xlsx
|   |       └── <session_name>.tif
|   |

Preprocessed

Data is complex, and extracting variables from raw data can take real work. The “preprocessed” folder is where intermediate files go: they tend to hold individual variables from each session, or byproducts of third-party tools, stored in a way that makes them easy to read in later processing steps. Don’t worry if the folder organization here is fairly messy; data extraction is a messy business!

|   |
|   ├── preprocessed/
|   |   └── <session_name>
|   |       ├── <description>.npy
|   |       ├── <description>.h5
|   |       └── <description>.mat
|   |

Processed

How do all these different variables relate to each other? “Processed” data includes the data’s schema and is meant to be complete, with as much of the data as possible accessed in the same way. Note that the data is still in a “records” format, organized by collection date; this makes it easy to add new processed data files without having to touch the old ones.

|   |
|   ├── processed/
|   |   ├── <session_name1>.nix
|   |   └── <session_name2>.nix
|   |
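
Because each session gets its own processed file, a pipeline only needs to handle the sessions that don’t have one yet. A minimal sketch, assuming the data/raw and data/processed layout shown above; the function name `sessions_to_process` is hypothetical, and the actual conversion to a .nix file is left out.

```python
from pathlib import Path


def sessions_to_process(data_dir):
    """Return names of raw sessions that have no processed .nix file yet."""
    data_dir = Path(data_dir)
    # Each subfolder of raw/ is one recording session.
    raw_sessions = sorted(p.name for p in (data_dir / "raw").iterdir() if p.is_dir())
    # A session counts as done when processed/<session_name>.nix exists.
    done = {p.stem for p in (data_dir / "processed").glob("*.nix")}
    return [name for name in raw_sessions if name not in done]
```

Old processed files are never touched: adding a new session folder under raw/ simply makes one more name show up in the returned list.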

Final

What data structure makes the data as easy to analyze as possible? These files contain data grouped in ways that make them easy to analyze: multiple sessions are combined, only specific variables are extracted, data and metadata may be duplicated in the files, and variables may appear in multiple files. The goal here is to have files that someone can just read into R, Pandas, etc., and get started with statistics, data visualization, and machine learning!

This folder can also get complex, and that’s okay; data analysis is complex, and this folder is a representation of that data analysis. These files tend to be much smaller than the files from the previous steps.

|   |
|   └── final/
|       └── <analysis_type1>.parquet
|
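
The idea of combining sessions and deliberately duplicating metadata can be sketched with plain Python data structures. This is an illustration only: the function name `combine_sessions`, the field names, and the values are made up, and a real project would more likely load such rows into a Pandas DataFrame and save them as a .parquet file.

```python
def combine_sessions(sessions):
    """Flatten per-session trial records into one list of rows.

    Session-level metadata is deliberately copied into every row, so the
    table can be analyzed without looking anything up elsewhere.
    """
    rows = []
    for session in sessions:
        for trial in session["trials"]:
            rows.append({
                "session": session["name"],
                "subject": session["subject"],
                **trial,
            })
    return rows


# Made-up example data, for illustration only.
sessions = [
    {"name": "session1", "subject": "A",
     "trials": [{"trial": 1, "rt": 0.42}, {"trial": 2, "rt": 0.39}]},
    {"name": "session2", "subject": "B",
     "trials": [{"trial": 1, "rt": 0.51}]},
]
rows = combine_sessions(sessions)
print(len(rows))  # 3
print(rows[0])    # {'session': 'session1', 'subject': 'A', 'trial': 1, 'rt': 0.42}
```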

Code Files: Scripts

|
├── scripts/
|   ├── <script>.py
|   ├── <script>.r
|   └── <script>.m
|
├── scratch/
|   ├── <researcher1>
|   |   └── <notebook>.ipynb
|   └── <researcher2>
|       └── <notebook>.ipynb
|
├── notebooks/
|   └── <notebook>.ipynb
|
├── dodo.py
├── Snakefile
├── Makefile
|
  • scripts/ and notebooks/: Scripts that are meant to be run directly as a protocol belong together; their steps tend to be referenced in the methods section of a research paper. Sometimes people separate them by programming language (e.g. scripts_python/), but it’s usually not necessary.

  • scratch/: Just playing around and don’t want to worry about code quality or maintenance? Keep a scratch folder (sometimes called sandbox or playground) for that! If working with multiple colleagues, give each researcher their own subfolder.

  • dodo.py, Snakefile, Makefile: In what order are these scripts supposed to run? What inputs and outputs does each one need? Workflow management tools like DoIt, Snakemake, and Make are meant for describing these steps directly, and can run the full processing and analysis pipeline in order.
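
As an illustration of how such a workflow file describes order, inputs, and outputs, here is a minimal dodo.py sketch for DoIt. The task functions return dictionaries with DoIt’s `actions`, `file_dep`, and `targets` keys; the script names and data file names are hypothetical and stand in for your own.

```python
# dodo.py: a sketch; the script and data file names below are hypothetical.

def task_preprocess():
    """Extract variables from the raw data."""
    return {
        "actions": ["python scripts/preprocess.py"],
        "file_dep": ["scripts/preprocess.py"],
        "targets": ["data/preprocessed/session1/spikes.npy"],
    }


def task_analyze():
    """Build the final dataset from the preprocessed files."""
    return {
        "actions": ["python scripts/analyze.py"],
        # Depending on a target of task_preprocess makes DoIt run that task first.
        "file_dep": ["scripts/analyze.py", "data/preprocessed/session1/spikes.npy"],
        "targets": ["data/final/dataset.parquet"],
    }
```

Running `doit` in the project root would then execute the tasks in dependency order and skip tasks whose file dependencies have not changed since the last run.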

Code Files: Tests


├── tests/
|   ├── conftest.py
|   └── test_<group>.py

This is where your automated tests live. They are generally separated from the source code, which gives flexibility in how both the source and the test files are organized and packaged.
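
A test file can be very small. The sketch below assumes pytest as the test runner (suggested by the conftest.py file in the tree); `processed_path` is a hypothetical helper defined inline to keep the example self-contained, whereas in a real project it would be imported from your package in src/.

```python
# tests/test_paths.py: a sketch with a hypothetical helper function.
from pathlib import Path


def processed_path(data_dir, session_name):
    """Return where the processed file for a session should be stored."""
    return Path(data_dir) / "processed" / f"{session_name}.nix"


def test_processed_path_uses_session_name(tmp_path):
    # tmp_path is a built-in pytest fixture: a fresh temporary folder per test.
    path = processed_path(tmp_path, "session1")
    assert path.name == "session1.nix"
    assert path.parent == tmp_path / "processed"
```

Running `pytest` from the project root collects every `test_*.py` file in tests/. Fixtures that several test files share would go in conftest.py.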

Code Files: Libraries (Functions, Classes, Constants, etc)

|
├── src/
|   ├── <my_package>/
|   |   ├── __init__.py
|   |   └── <module>.py
|   |
|   └── <module>.py
|
├── pyproject.toml
|

This is where the custom project code that your scripts reference lives. It comes with an installer file (pyproject.toml shown here, for Python projects), which installs the packages into a location where your scripts can easily import them.

Pyproject.toml Minimal Example

[project]
name = "project-name"
version = "0.0.1"
requires-python = ">=3.10"
dependencies = ["matplotlib", "numpy>=1.26"]

Command                      Description
pip install -e .             Install the package and its dependencies into the current Python environment, but keep the files easy to modify.
pip uninstall project-name   Remove this package from the current Python environment. Note: won’t uninstall the dependencies.

Additional Fields (in the [project] section)

Field                                                           Description
description = "A short description of the project's purpose."   A short description; appears in pip show.
authors = [{name="Nicholas DG", email="dg@email.com"}]          The authors of the project.
maintainers = [{name="Nicholas DG", email="dg@email.com"}]      The people responsible for keeping the project going.
readme = "README.md"                                            Where to find the readme file.
license = "MIT"                                                 What license the project uses.
license = {file = "LICENSE.txt"}                                What license the project uses, if it’s found in a file.


Build Systems

# Use setuptools (the default).
[build-system]
requires = ["setuptools >= 61.0"]
build-backend = "setuptools.build_meta"

# Use Hatchling (the build backend of Hatch), a great modern builder.
[build-system]
requires = ["hatchling"]
build-backend = "hatchling.build"

There is a lot more one can put into the file; more fields and explanations of the pyproject.toml format can be found in the official guide: https://packaging.python.org/en/latest/guides/writing-pyproject-toml/

Aside: What if I don’t want an installer file?

That’s okay, but you’ll need to tell your scripts how to find your library code somehow. Most scripting languages offer a way to do this inside your scripts by modifying their import search path, so they know what folders to search in. Here’s the relevant code for Python:

Python Code (descriptions as comments)

# View the Python import search path.
import sys
sys.path

# Add the src folder to the Python import search path.
import sys
sys.path.append('../src')

# View the operating system's search path.
import os
os.environ['PATH']
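
A relative entry such as '../src' is interpreted relative to the folder the script is run from, which may not be the folder the script is stored in. A slightly more robust variation, sketched below, builds the path from the script’s own location. The function name `add_src_to_path` is hypothetical, and the sketch assumes the script sits one level below the project root (for example in scripts/).

```python
import sys
from pathlib import Path


def add_src_to_path(script_file):
    """Put the project's src/ folder on the import search path.

    Assumes the script lives one level below the project root (e.g. in
    scripts/), so src/ is found at <script folder>/../src.
    """
    src_dir = Path(script_file).resolve().parent.parent / "src"
    if str(src_dir) not in sys.path:
        # Insert at the front so project code is found before anything else.
        sys.path.insert(0, str(src_dir))
    return src_dir


# In a real script, pass __file__; a made-up path is used here for illustration.
src_dir = add_src_to_path("my_project/scripts/analysis.py")
print(src_dir.name)  # src
```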

Section 2: Computational Environment Setup Files

|
├── environment.yml
├── Dockerfile
├── compose.yml
├── project.sif
├── Makefile
├── .github/
|   └── workflows/
|       └── <workflow>.yml
|

These files are commonly placed in the root directory, because they are used by software that helps set up the computational environment (installing libraries, setting up the operating system, downloading data, configuring environment variables, etc.) for the entire project.

Environment.yml Reference

Minimal Example:
# environment.yml
dependencies: [python=3.11]
Useful conda terminal commands:

Command                                   Description
conda env create -f environment.yml       Create an environment from a file.
conda create -n <name>                    Create an environment without a file.
conda env remove --name <name>            Delete an environment.
conda env export > environment-lock.yml   Have conda tell you what it installed into the environment.
Optional Fields:

Field       Example Values               Description
channels:   [defaults, conda-forge]      Where conda should look to download dependencies.
name:       my-env                       A name to use to activate the environment without knowing its path: conda activate my-env
prefix:     C:\Users\nickdg\miniconda3   An absolute path saying where on the computer to install the environment. Note: not great for cross-computer usage. It’s better to specify the path when building the environment, with conda env create -f environment.yml -p ./env, so that the path is resolved on whichever computer builds it.
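
When environment creation is part of a setup script, the same command can be assembled in Python with a project-relative install location. This is a sketch under assumptions: the function name `conda_create_command` and the default folder name "env" are hypothetical, and the example only builds and prints the command rather than running it.

```python
from pathlib import Path


def conda_create_command(project_dir, env_file="environment.yml", env_folder="env"):
    """Build a `conda env create` command that installs into <project_dir>/<env_folder>."""
    project_dir = Path(project_dir)
    return [
        "conda", "env", "create",
        "-f", str(project_dir / env_file),    # which environment file to read
        "-p", str(project_dir / env_folder),  # where to install the environment
    ]


command = conda_create_command(".")
print(" ".join(command))
```

To actually create the environment, the list could be passed to `subprocess.run(command, check=True)` on a computer where conda is installed.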

Operating System-Level Package Managers

Operating System   Package Manager   Search Command            Install Command
Windows            WinGet            winget search <name>      winget install --id=<Id>
Windows            Chocolatey        choco search <name>       choco install <name>
Mac                Homebrew          brew search <name>        brew install <name>
Linux              APT               apt-cache search <name>   apt-get install <name>
Linux              Yum               yum search <name>         yum install <name>
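
A setup script sometimes needs to know which of these package managers is present on the current computer. The following sketch checks for their executables on the search path with the standard-library `shutil.which`; the function name `available_package_managers` and the dictionary are illustrative assumptions, not part of any of these tools.

```python
import shutil

# Package manager executables from the table above, grouped by operating system.
PACKAGE_MANAGERS = {
    "Windows": ["winget", "choco"],
    "Mac": ["brew"],
    "Linux": ["apt-get", "yum"],
}


def available_package_managers():
    """Return the package managers from the table that are installed here."""
    found = []
    for managers in PACKAGE_MANAGERS.values():
        for name in managers:
            # shutil.which returns None when the executable is not on the PATH.
            if shutil.which(name) is not None:
                found.append(name)
    return found


print(available_package_managers())
```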

Virtual Machines: Vagrant

Command                           Description
vagrant init generic/ubuntu2204   Make a Vagrantfile that specifies Ubuntu 22.04 as the virtual machine.
vagrant up                        Create the virtual machine.
vagrant ssh                       Log in to your virtual machine from the terminal.

Section 3: Documentation

|
├── examples/
|   ├── <example1>.ipynb
|   └── <example2>.ipynb
|    
├── docs/
|   ├── <doc-section>.md
|   └── <doc-section>.rst
|
├── README.md
├── LICENSE.txt
├── datacite.xml
|

These files help others understand how to use your project. Written explanations, interactive examples, license references, etc. all tell people what your project does and how it is meant to relate to them.

Readme File: Essential Parts

A useful reference: https://www.makeareadme.com/

Section            What goes here
# <project name>   The title. Put the name of the project here.
## Installation    How to install the project. Best to include copy-pastable code in code blocks.
## Usage           The main ways the project is run, and what to expect when it works properly. Include code blocks here, too.

Section 4: Collaboration

| 
├── CONTRIBUTORS.txt
└── CODE_OF_CONDUCT.txt

These files explain to collaborators how to work on the project; they’re meant for your internal team.