Storing and Extracting Metadata From Files: Converting Lists and Dicts into JSON and YAML

Author
Dr. Nicholas A. Del Grosso

Extracting Metadata From Files: Exploring Different Widely-Used, Text-Based File Formats

In neuroscience, data rarely exists on its own. A recording, image, behavioural table, or analysis result usually needs extra information to remain interpretable: when it was collected, which subject it came from, what settings were used, how the data are arranged, and who generated it. If this metadata only lives in our memory, in a lab notebook, or in filenames, it becomes difficult for others — and for our future selves — to understand or reuse the data reliably. In this notebook, we will practise turning Python data structures into text files and reading them back again, using open formats that can be shared across tools, operating systems, and research groups.

That’s where standardized file formats come in. In this set of exercises, we’ll practice serializing data into a string that we can write into a text file and deserializing text into Python data structures, using two different text file formats:

  • JSON (JavaScript Object Notation)
  • YAML (YAML Ain’t Markup Language; originally “Yet Another Markup Language”)

Section 1: Python’s Most-Used Core Data Structures: Lists and Dicts

Before we store metadata in files, we need a clear way to organise it in Python. Lists and dictionaries are two of the most useful building blocks for this. Lists help us keep ordered collections, such as trials, subjects, filenames, or time points. Dictionaries help us attach names to values, such as an image width, a subject ID, or a microscope setting. These structures are simple, but they are powerful enough to represent much of the structured information we encounter in research workflows.

Code Reference: Comparing Syntax for Lists and Dicts

Operation                       | Lists            | Dicts
Initialize (Empty)              | l = []           | d = {}
Initialize (With Data)          | l = [10, 11, 12] | d = {'a': 10, 'b': 11, 'c': 12}
Index (first value)             | l[0]             | NA
Index (last value)              | l[-1]            | NA
Slice (first-to-third values)   | l[0:3]           | NA
Index (by key)                  | NA               | d['a']
Append Data                     | l.append(13)     | d['d'] = 13
Replace Data                    | l[0] = 14        | d['a'] = 14
Delete Data                     | del l[1]         | del d['b']
Get a list of all data values   | list(l)          | list(d.values())
Get a list of all data keys     | NA               | list(d.keys())
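
As a quick check of the reference table above, the following sketch runs each operation in turn on the same example list and dict. The variable names l and d are taken from the table; the expected results in the comments follow from applying the operations in this particular order.

```python
# Walk through the list and dict operations from the reference table.
l = [10, 11, 12]
d = {'a': 10, 'b': 11, 'c': 12}

first, last = l[0], l[-1]     # index from the front and from the back
first_three = l[0:3]          # slice: a new list with items 0, 1 and 2

l.append(13)                  # add to the end of the list
d['d'] = 13                   # add a new key to the dict

l[0] = 14                     # replace by position
d['a'] = 14                   # replace by key

del l[1]                      # delete the second list item (11)
del d['b']                    # delete the key 'b' and its value

print(l)                      # [14, 12, 13]
print(list(d.keys()))         # ['a', 'c', 'd']
print(list(d.values()))       # [14, 12, 13]
```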

Exercises

Example: Make a list of three names of European countries.

countries = ['Germany', 'France', 'England']
countries
['Germany', 'France', 'England']

Exercise: Make a list of three scientists.

Solution
scientists = ["Einstein", "Darwin", "Planck"]
scientists
['Einstein', 'Darwin', 'Planck']

Exercise: Display the second scientist in your list

Solution
scientists[1]
'Darwin'

Exercise: Replace the third scientist with "Wilhelm Wundt"

Solution
scientists[2] = "Wilhelm Wundt"
scientists
['Einstein', 'Darwin', 'Wilhelm Wundt']

Exercise: Use the list’s .append() method to put Max Planck onto the end of the list.

Solution
scientists.append('Max Planck')
scientists
['Einstein', 'Darwin', 'Wilhelm Wundt', 'Max Planck']

Dataset: The image dict below describes how researcher Tom’s recording is formatted:

image = {'height': 1920, 'width': 1080, 'format': 'RGB', 'order': 'F'}
image
{'height': 1920, 'width': 1080, 'format': 'RGB', 'order': 'F'}

Example: Write the code to print out the width of the image, by accessing the "width" key:

image['width']
1080

Exercise: What is the height of the image?

Solution
image['height']
1920

Exercise: How are the pixel data in the image formatted?

Solution
image['format']
'RGB'

Exercise: Add a new field: "model" and give it the value "basler"

Solution
image['model'] = "basler"

Section 2: File Objects in Python

Once data are organised in Python, the next question is how we keep them after Python stops running. Files give our data a life outside the current session: they can be inspected later, shared with colleagues, committed to a repository, or used by another program. Python file objects give us a direct way to read from and write to files. Working with them also makes an important idea visible: a file must be opened in the right mode, used deliberately, and then closed so that the operating system knows we are finished with it.

Code Reference

Code                     | Description
f = open('newfile', 'w') | Makes a file object linked to a newly opened file called 'newfile', ready for writing text. An existing file with that name is overwritten.
f = open('oldfile', 'r') | Makes a file object linked to an existing file called 'oldfile', ready for reading text. Note: the 'r' is optional, as it is the default “file mode” for the open() function.
f.write(text)            | Writes the string in text to the text file linked to f.
text = f.read()          | Reads all the text stored in the text file linked to f.
f.close()                | Closes the file that is linked to the file object f.
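
Forgetting to call f.close() is easy, especially when an error interrupts the code before that line is reached. A common alternative, sketched below, is Python's with statement, which closes the file automatically when the indented block ends. The filename note.txt and its contents are only an illustration.

```python
# A 'with' block closes the file for us, even if an error occurs inside it.
with open('note.txt', 'w') as f:
    f.write('Recorded on 2024-12-24')

print(f.closed)   # True: the file was closed when the block ended

with open('note.txt', 'r') as f:
    text = f.read()

print(text)
```

The exercises in this section use explicit open() and close() calls so that each step stays visible; both styles produce the same files.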

Exercises

Example: Write the text, “Hello, World!” into a text file called hello.txt, then close the file.

f = open('hello.txt', 'w')
f.write('Hello, World!')
f.close()

Exercise: Write the text “Goodbye, everyone.” into a text file called bye.txt, then close the file.

Solution
f = open('bye.txt', 'w')
f.write('Goodbye, everyone.')
f.close()

Exercise: Try writing the text “Does this work?” to the file object after it has already been closed. You should see an error (after all, the file is no longer open for writing!). What type of error do you get?

Solution
f.write('Does this work?')  # ValueError: I/O operation on closed file.

Exercise: Read the text from the file bye.txt into the string variable text.

Solution
f = open('bye.txt', 'r')
text = f.read()
f.close()
text
'Goodbye, everyone.'

Exercise: Write the string ‘Emma’ into the file “subj001.txt” in the “data/raw” folder:

Solution
import os
os.makedirs('data/raw', exist_ok=True)
f = open('data/raw/subj001.txt', 'w')
f.write('Emma')
f.close()

Exercise: Write the number 10 into the file ten.txt.

Solution
f = open('ten.txt', 'w')
f.write(str(10))
f.close()

Exercise: Read the number from ten.txt into the integer variable data.

Solution
f = open('ten.txt', 'r')
data = int(f.read())
f.close()
data
10

Section 3: Saving Metadata in Open, Human- and Machine-Readable File Formats: JSON and YAML

Plain text files are useful because they are easy to inspect, but unstructured text is hard for computers to interpret reliably. JSON and YAML give us a middle ground: they are text-based formats that people can read, but they also have clear rules that software can parse. This makes them useful for metadata, configuration files, and small structured datasets. By serialising Python dictionaries and lists into these formats, we can store research information in a way that is portable, reproducible, and less dependent on one specific Python session or custom script.

To help us read from and write to these formats, serialization libraries provide functions that convert between Python data structures and text. These functions usually come in pairs, one for writing and one for reading:

  • serialize() and deserialize()
  • unparse() and parse()
  • dump() and load()

Which pair of names is used depends on the library.

Here, we’ll try out YAML and JSON, two very popular text-based file formats, by writing nested dict and list data structures to files and reading them back. A handy reference sheet showing how YAML and JSON syntax relate to each other is available at https://quickref.me/yaml.html.

Code                                    | Description
Reading                                 |
data = json.load(f)                     | Read JSON data from the file object f.
data = yaml.load(f, Loader=yaml.Loader) | Read YAML data from the file object f.
Writing                                 |
json.dump(data, f, indent=3)            | Write data into a JSON text file. (indent=3 is optional; it makes the file easier to read in a text editor.)
yaml.dump(data, f)                      | Write data into a YAML text file.
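
The functions in the table work on file objects. For a quick look at the serialized text without touching a file, the json module also offers dumps() and loads() (note the trailing s, for “string”). The sketch below uses a made-up settings dict that is not part of the exercises:

```python
import json

# A small, made-up metadata dict used only for illustration.
settings = {'subject': 'x134', 'trials': [1, 2, 3], 'calibrated': True}

# Serialize: Python data structure -> JSON-formatted string
text = json.dumps(settings)
print(text)

# Deserialize: JSON-formatted string -> Python data structure
restored = json.loads(text)
print(restored == settings)   # True
```

In the printed text, Python's True appears as true: JSON has its own spelling for a few values, and the library translates in both directions.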

Exercises

Let’s try out the JSON and YAML formats to get a sense of how these text files map to Python’s data structures when written and read. Both formats are useful; after trying them both out, which one will you prefer?

# %pip install PyYAML
import json
import yaml

Dataset: Use the mdata data structure below:

mdata = {
    'metadata': {
        'height': 1080, 
        'width': 1920, 
        'order': 'RGB', 
        'date': '2024-12-24', 
        'subject': {
            'id': 'x134', 
            'name': 'Scratchy', 
            'sources': ['Cartoon', 'The Simpsons Lab, Springfield']
        }, 
        'researchers': ['Itchy', 'Bart', 'Lisa']
    }
} 
mdata
{'metadata': {'height': 1080,
  'width': 1920,
  'order': 'RGB',
  'date': '2024-12-24',
  'subject': {'id': 'x134',
   'name': 'Scratchy',
   'sources': ['Cartoon', 'The Simpsons Lab, Springfield']},
  'researchers': ['Itchy', 'Bart', 'Lisa']}}

Exercise: Write the metadata to a JSON file called recording1.json:

Solution
f = open('recording1.json', 'w')
json.dump(mdata, f, indent=3)
f.close()

Exercise: Read the recording1.json file with the general f.read() method and print() its text contents. How does the data look now? Pretty similar to the original Python code, right?

Solution
f = open('recording1.json', 'r')
print(f.read())
f.close()
{
   "metadata": {
      "height": 1080,
      "width": 1920,
      "order": "RGB",
      "date": "2024-12-24",
      "subject": {
         "id": "x134",
         "name": "Scratchy",
         "sources": [
            "Cartoon",
            "The Simpsons Lab, Springfield"
         ]
      },
      "researchers": [
         "Itchy",
         "Bart",
         "Lisa"
      ]
   }
}

Exercise: Finally, read the recording1.json file again, but this time json.load() the data back into dicts and lists. Was the data read back in correctly? It should look identical to the original mdata variable. (Note: this is sometimes called a “round-trip” test.)

Solution
f = open('recording1.json')
data_from_json = json.load(f)
f.close()
data_from_json
{'metadata': {'height': 1080,
  'width': 1920,
  'order': 'RGB',
  'date': '2024-12-24',
  'subject': {'id': 'x134',
   'name': 'Scratchy',
   'sources': ['Cartoon', 'The Simpsons Lab, Springfield']},
  'researchers': ['Itchy', 'Bart', 'Lisa']}}
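
Comparing two nested structures by eye is error-prone. Because dicts and lists can be compared with ==, a round-trip test can be written as a single comparison. The sketch below uses a shortened copy of the metadata and a hypothetical file name, roundtrip.json:

```python
import json

original = {'metadata': {'height': 1080, 'width': 1920,
                         'researchers': ['Itchy', 'Bart', 'Lisa']}}

# Write the data out...
f = open('roundtrip.json', 'w')
json.dump(original, f, indent=3)
f.close()

# ...read it back in...
f = open('roundtrip.json')
restored = json.load(f)
f.close()

# ...and let Python compare every key and value for us.
print(restored == original)   # True
```

The comparison passes here because the data contain only strings, numbers, lists, and dicts with string keys. Tuples, for example, come back from JSON as lists, so a structure containing them would not compare equal after a round trip.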

Exercise: Let’s try out YAML now: Write the metadata to a YAML file called recording1.yml:

Solution
f = open('recording1.yml', 'w')
yaml.dump(mdata, f)
f.close()

Exercise: Read the recording1.yml file with the general f.read() method and print() its text contents. How does the data look now? Pretty different from the original Python code, right?

Solution
f = open('recording1.yml', 'r')
print(f.read())
f.close()
metadata:
  date: '2024-12-24'
  height: 1080
  order: RGB
  researchers:
  - Itchy
  - Bart
  - Lisa
  subject:
    id: x134
    name: Scratchy
    sources:
    - Cartoon
    - The Simpsons Lab, Springfield
  width: 1920

Exercise: Finally, read the recording1.yml file again, but this time yaml.load() the data back into dicts and lists. Was the data read back in correctly? It should look identical to the original mdata variable. (Note: this is sometimes called a “round-trip” test.)

Solution
f = open('recording1.yml')
data = yaml.load(f, Loader=yaml.Loader)
f.close()
data 
{'metadata': {'date': '2024-12-24',
  'height': 1080,
  'order': 'RGB',
  'researchers': ['Itchy', 'Bart', 'Lisa'],
  'subject': {'id': 'x134',
   'name': 'Scratchy',
   'sources': ['Cartoon', 'The Simpsons Lab, Springfield']},
  'width': 1920}}
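
Two PyYAML details are worth knowing at this point. First, yaml.dump() sorts dict keys alphabetically by default, which is why date appears first in the file; passing sort_keys=False (available in PyYAML 5.1 and later) keeps the original order. Second, yaml.safe_load() is a stricter alternative to yaml.load() that only builds plain Python types, which is generally the safer choice for files received from other people. A minimal sketch, using a shortened metadata dict:

```python
import yaml

meta = {'width': 1920, 'height': 1080, 'date': '2024-12-24'}

# With no file object given, yaml.dump() returns the YAML text as a string.
text = yaml.dump(meta, sort_keys=False)
print(text)

# safe_load() accepts a string or an open file and returns plain Python data.
restored = yaml.safe_load(text)
print(restored == meta)            # True
print(list(restored.keys()))       # ['width', 'height', 'date']
```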

Section 4: Working with Data in Complex Schemas: Mixing Dicts and Lists

Real experimental metadata often has several connected parts: subjects, sessions, trials, stimuli, recording settings, analysis parameters, and file paths. Lists and dictionaries can represent this complexity, but the structure needs to be consistent. A schema is the organisational plan that tells us where each piece of information belongs. Choosing a schema is not only a coding decision; it affects how easy the data will be to enter, check, analyse, and share. In this section, we practise designing nested structures that make relationships explicit, so that later analysis code can rely on predictable patterns rather than fragile assumptions.

For example, weather data could be stored like:

weather = [
    {'date': '2024-11-20',
     'morning_condition': 'sunny',
     'afternoon_condition': 'rainy',
     'hourly_temperatures': [20, 21, 20, 18, 16, 14, 13],
    },
    {'date': '2024-11-21',
     'morning_condition': 'sunny',
     'afternoon_condition': 'cloudy',
     'hourly_temperatures': [18, 16, 20, 18, 16, 14, 13],
    },
]

Getting the third hourly temperature for the second record would then be done like this:

weather[1]['hourly_temperatures'][2]

Because the data are stored consistently, getting all the morning conditions for analysis can be done like this:

morning_conditions = [day['morning_condition'] for day in weather]

It’s worth taking the time to consider what organization schema you’ll choose, though, as it will affect the ease of both data entry and data analysis.
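
The same pattern extends to calculations. As a sketch, the loop below computes each day's mean temperature from the weather structure (repeated here in shortened form so that the example runs on its own). The variable name mean_temperatures is an illustration.

```python
weather = [
    {'date': '2024-11-20', 'hourly_temperatures': [20, 21, 20, 18, 16, 14, 13]},
    {'date': '2024-11-21', 'hourly_temperatures': [18, 16, 20, 18, 16, 14, 13]},
]

# Build a dict that maps each date to its mean temperature.
mean_temperatures = {}
for day in weather:
    temps = day['hourly_temperatures']
    mean_temperatures[day['date']] = sum(temps) / len(temps)

print(mean_temperatures)
```

The loop works only because every record uses the same key names. If one day had used 'temps' instead of 'hourly_temperatures', the code would stop with a KeyError, which is the practical cost of an inconsistent schema.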

Exercises

Exercise: Using any schema you like, organize the information from the following two sessions into a single data structure, in a variable called experiment:

  • Nov 13, 2024: Subject Jeff did 3 trials: first with a red circle stimulus on the left side, then a green square on the right, then a green circle on the right.
  • Nov 14, 2024: Subject Jane did 2 trials: first with a green square stimulus on the right side, then a red circle on the left.

Note: There is no one right answer here; every data structure has its advantages and disadvantages. Feel free to organize the data as you see fit.

Solution
experiment = {
    'stimuli': [
        {'color': 'red', 'side': 'left', 'shape': 'circle'},
        {'color': 'green', 'side': 'right', 'shape': 'square'},
        {'color': 'green', 'side': 'right', 'shape': 'circle'},
    ],
    'subjects': {
        'A12': {'name': 'Jeff'},
        'B32': {'name': 'Jane'},
    },
    'sessions': [
        { 
            'date': '2024-11-13',
            'subject': 'A12',
            'trials': [{'stimulus': 0}, {'stimulus': 1}, {'stimulus': 2}],
        },
        { 
            'date': '2024-11-14',
            'subject': 'B32',
            'trials': [{'stimulus': 1}, {'stimulus': 0}],
        }
    ]
}
experiment
{'stimuli': [{'color': 'red', 'side': 'left', 'shape': 'circle'},
  {'color': 'green', 'side': 'right', 'shape': 'square'},
  {'color': 'green', 'side': 'right', 'shape': 'circle'}],
 'subjects': {'A12': {'name': 'Jeff'}, 'B32': {'name': 'Jane'}},
 'sessions': [{'date': '2024-11-13',
   'subject': 'A12',
   'trials': [{'stimulus': 0}, {'stimulus': 1}, {'stimulus': 2}]},
  {'date': '2024-11-14',
   'subject': 'B32',
   'trials': [{'stimulus': 1}, {'stimulus': 0}]}]}

Exercise: Let’s check how one would go about extracting data from this experiment structure: Find out the name of the subject that did the second session.

Solution
experiment['subjects'][experiment['sessions'][1]['subject']]['name']
'Jane'

Exercise: Let’s do one more check: Get the 2nd trial’s stimulus color from the first session.

Solution
experiment['stimuli'][experiment['sessions'][0]['trials'][1]['stimulus']]['color']
'green'
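
Lookups like these get long because each trial stores only an index into the stimuli list, so every question needs two steps: find the reference, then follow it. A small helper function can hide the second step. The function name session_stimuli is an illustration, not part of the exercise, and the experiment dict below is a cut-down version of the schema used above:

```python
experiment = {
    'stimuli': [
        {'color': 'red', 'side': 'left', 'shape': 'circle'},
        {'color': 'green', 'side': 'right', 'shape': 'square'},
    ],
    'sessions': [
        {'date': '2024-11-13', 'trials': [{'stimulus': 0}, {'stimulus': 1}]},
    ],
}

def session_stimuli(experiment, session_index):
    """Return the full stimulus dict for every trial in one session."""
    session = experiment['sessions'][session_index]
    return [experiment['stimuli'][trial['stimulus']] for trial in session['trials']]

colors = [stim['color'] for stim in session_stimuli(experiment, 0)]
print(colors)   # ['red', 'green']
```

Storing each stimulus once and referring to it by index avoids repeating the same description in every trial, at the price of this extra lookup step. A schema that copies the full stimulus into each trial would make the opposite trade.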

Exercise: Let’s put it all together, making one more schema and saving it to a file: Translate the following sentence into a Python data structure and save it to either a JSON or YAML file called capture.json or capture.yml: “The image has a width of 1080 pixels, a height of 720 pixels, saved data in RGB format. The camera settings had an exposure time of 8 milliseconds, an aperture of 2.8 stops, and an ISO setting of 100.”

Solution
image = {
    'width': 1080,
    'height': 720,
    'order': 'RGB',
    'settings': {
        'exposure': 8,
        'aperture': 2.8,
        'iso': 100,
    }
}

f = open('capture.yml', 'w')
yaml.dump(image, f)
f.close()
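
The exercise allows either format. For comparison, here is a sketch of the JSON variant, followed by reading the file back to confirm that the nested settings survive:

```python
import json

image = {
    'width': 1080,
    'height': 720,
    'order': 'RGB',
    'settings': {'exposure': 8, 'aperture': 2.8, 'iso': 100},
}

# Write the structure to capture.json...
f = open('capture.json', 'w')
json.dump(image, f, indent=3)
f.close()

# ...and read it back to check a nested value.
f = open('capture.json')
capture = json.load(f)
f.close()

print(capture['settings']['aperture'])   # 2.8
```

The sentence gives units (pixels, milliseconds) that these keys do not record. One option, if the schema is yours to choose, is to put the unit in the key name, for example exposure_ms.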

(Optional) The XML File Format

The XML file format is also used to store data, and is extremely popular in data acquisition systems; it’s even used to store odML data, which is the metadata format used by NIX! Here, we’ll get a sense of what XML looks like, so that when we see richer metadata files, we can more easily grok what NIX and odML are doing.

Even though Python has an xml package included in its standard library, it can be quite complex to use. Here, we’re using the simpler xmltodict package to do basic reading and writing to xml.

Code                       | Description
f = open('file.xml', 'wb') | Open a writable file in “binary” mode.
f = open('file.xml', 'rb') | Open a readable file in “binary” mode.
xmltodict.unparse(data, f) | Write the data to the binary file linked to f.
data = xmltodict.parse(f)  | Read the data in the binary file linked to f.

# %pip install xmltodict
import xmltodict

Exercises

Dataset: Here is our metadata structure again:

mdata = {
    'metadata': {
        'height': 1080, 
        'width': 1920, 
        'order': 'RGB', 
        'date': '2024-12-24', 
        'subject': {
            'id': 'x134', 
            'name': 'Scratchy', 
            'sources': ['Cartoon', 'The Simpsons Lab, Springfield']
        }, 
        'researchers': ['Itchy', 'Bart', 'Lisa']
    }
} 
mdata
{'metadata': {'height': 1080,
  'width': 1920,
  'order': 'RGB',
  'date': '2024-12-24',
  'subject': {'id': 'x134',
   'name': 'Scratchy',
   'sources': ['Cartoon', 'The Simpsons Lab, Springfield']},
  'researchers': ['Itchy', 'Bart', 'Lisa']}}

Exercise: Write this mdata data structure to data.xml.

Solution
f = open('data.xml', 'wb')
xmltodict.unparse(mdata, f)
f.close()

Exercise: Read the data.xml file and, using the general f.read() method: print() its text contents. How does the data look now? Pretty different from the original python code, right?

Solution
f = open('data.xml')
print(f.read())
f.close()

Exercise: Read the data.xml file back into Python. Did it read correctly? (Note: the xmltodict.parse() function requires a “bytes” file, so use 'rb' as the file mode.)

Solution
f = open('data.xml', 'rb')
data = xmltodict.parse(f)
f.close()
data
{'metadata': {'height': '1080',
  'width': '1920',
  'order': 'RGB',
  'date': '2024-12-24',
  'subject': {'id': 'x134',
   'name': 'Scratchy',
   'sources': ['Cartoon', 'The Simpsons Lab, Springfield']},
  'researchers': ['Itchy', 'Bart', 'Lisa']}}
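
Note that this round trip was not perfect: height and width came back as the strings '1080' and '1920', because XML stores every value as text and xmltodict, by default, does not guess types. One way to handle this, sketched below, is to convert the fields we know to be numbers after parsing. The dict literal stands in for the output of xmltodict.parse(), so the example runs without the library:

```python
# Stand-in for the result of xmltodict.parse(): every value is a string.
data = {'metadata': {'height': '1080', 'width': '1920', 'order': 'RGB'}}

# Convert the fields that we know should be integers.
for key in ['height', 'width']:
    data['metadata'][key] = int(data['metadata'][key])

print(data)
# {'metadata': {'height': 1080, 'width': 1920, 'order': 'RGB'}}
```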

Exercise: XML also allows for more complex structures. For example, the text below is valid XML. Parse the text_xml string into a Python variable called dset, and get the side of the screen on which the “redrights” stimulus appears.

Note: This “@” syntax is something special that the xmltodict library uses; it’s not part of the XML syntax, and is just a way to make it easier to build a valid dict from the xml code. This is part of the work that always happens when gluing two technologies together, and every library will have a different solution.

text_xml = """
<root>
  <stimuli>
    <redleftc color="red" form="circle" side="left">Red Circle on the Left Side</redleftc>
    <redrights color="red" form="square" side="right">Red Square on the Right Side</redrights>
  </stimuli>
</root>
"""
{'root': {'stimuli': {'redleftc': {'@color': 'red',
    '@form': 'circle',
    '@side': 'left',
    '#text': 'Red Circle on the Left Side'},
   'redrights': {'@color': 'red',
    '@form': 'square',
    '@side': 'right',
    '#text': 'Red Square on the Right Side'}}}}
Solution
dset = xmltodict.parse(text_xml)
dset
{'root': {'stimuli': {'redleftc': {'@color': 'red',
    '@form': 'circle',
    '@side': 'left',
    '#text': 'Red Circle on the Left Side'},
   'redrights': {'@color': 'red',
    '@form': 'square',
    '@side': 'right',
    '#text': 'Red Square on the Right Side'}}}}
dset['root']['stimuli']['redrights']['@side']
'right'
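
For comparison, the side of the “redrights” stimulus can also be read with xml.etree.ElementTree from Python's standard library, which keeps attributes and text separate instead of marking them with “@” and “#text”. A minimal sketch, reusing the same XML text:

```python
import xml.etree.ElementTree as ET

text_xml = """
<root>
  <stimuli>
    <redleftc color="red" form="circle" side="left">Red Circle on the Left Side</redleftc>
    <redrights color="red" form="square" side="right">Red Square on the Right Side</redrights>
  </stimuli>
</root>
"""

root = ET.fromstring(text_xml.strip())      # parse the string into an element tree
stim = root.find('stimuli/redrights')       # locate the element by its path

print(stim.get('side'))   # right   (an attribute)
print(stim.text)          # Red Square on the Right Side   (the element's text)
```

Neither approach is more correct than the other. xmltodict gives plain dicts that fit the rest of this notebook, while ElementTree stays closer to how XML itself is organised.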