import sys
if not sys.warnoptions:
    import warnings
    warnings.simplefilter("ignore")
import os
import frictionless as fl
from dimcat.base import deserialize_json_file
CORPUS_PATH = os.path.abspath(os.path.join("..", "..", "unittest_metacorpus"))
assert os.path.isdir(CORPUS_PATH)
sweelinck_dir = os.path.join(CORPUS_PATH, "sweelinck_keyboard")
Data#
Resource#
A resource is a combination of a file and its descriptor. It allows for interacting with the file without having to “touch” it by interacting with its descriptor only. The descriptor comes in form of a dictionary and is typically stored next to the file in JSON or YAML format.
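The idea of working with a file through its descriptor can be sketched with the standard library alone. The following is a minimal illustration, not DiMCAT's implementation: the helper `write_descriptor`, the descriptor fields it writes, and the file name `example.mscx` are assumptions chosen for this example.

```python
import json
import os
import tempfile


def write_descriptor(basepath: str, filepath: str) -> str:
    """Store a small JSON descriptor next to the described file (hypothetical helper)."""
    stem, extension = os.path.splitext(os.path.basename(filepath))
    descriptor = {
        "name": os.path.basename(filepath).lower(),
        "path": filepath,  # relative to the directory containing the descriptor
        "format": extension.lstrip("."),
    }
    descriptor_path = os.path.join(basepath, f"{stem}.resource.json")
    with open(descriptor_path, "w", encoding="utf-8") as f:
        json.dump(descriptor, f, indent=2)
    return descriptor_path


with tempfile.TemporaryDirectory() as tmp_dir:
    # create a dummy file standing in for a score
    with open(os.path.join(tmp_dir, "example.mscx"), "w", encoding="utf-8") as f:
        f.write("<museScore/>")
    descriptor_path = write_descriptor(tmp_dir, "example.mscx")
    # from here on only the descriptor is read; the score itself is never opened
    with open(descriptor_path, encoding="utf-8") as f:
        loaded = json.load(f)
    described_file = os.path.join(os.path.dirname(descriptor_path), loaded["path"])
    described_file_exists = os.path.isfile(described_file)
```

The descriptor stores the path relative to its own directory, which is why the pair can be moved together without breaking the reference.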
DiMCAT follows the Frictionless specification for describing resources. There are two types of resources:
PathResource: Stands for a resource on local disk or on the web. PathResources can be instantiated from a single filepath using the constructors
.from_resource_path(), which takes the path to the resource file to be described
.from_descriptor_path(), which takes a filepath pointing to a JSON or YAML file containing a resource descriptor
DimcatResource: Both a resource in the above sense and a wrapped dataframe (see the section DimcatResource).
Let’s look at each in turn.
PathResource#
The sweelinck_keyboard repository contains a single MuseScore file (in the folder “MS3”) and several TSV files extracted from it.
Let’s create a PathResource for the MuseScore file:
from dimcat import resources
score_path = os.path.join(sweelinck_dir, "MS3", "SwWV258_fantasia_cromatica.mscx")
score_resource = resources.PathResource.from_resource_path(score_path)
score_resource.get_path_dict()
{'basepath': ~/checkouts/readthedocs.org/user_builds/dimcat/checkouts/stable/unittest_metacorpus/sweelinck_keyboard/MS3,
'filepath': 'SwWV258_fantasia_cromatica.mscx',
'innerpath': None,
'descriptor_filename': None,
'descriptor_path': '/home/docs/checkouts/readthedocs.org/user_builds/dimcat/checkouts/stable/unittest_metacorpus/sweelinck_keyboard/MS3/SwWV258_fantasia_cromatica.resource.json',
'normpath': '/home/docs/checkouts/readthedocs.org/user_builds/dimcat/checkouts/stable/unittest_metacorpus/sweelinck_keyboard/MS3/SwWV258_fantasia_cromatica.mscx'}
The dictionary returned by .get_path_dict() tells us everything we need to know to handle the resource physically:
basepath is an absolute directory
filepath is the filepath (which can include subfolders), relative to the basepath
normpath is the full path to the resource and defined as basepath/filepath (both need to be specified)
innerpath: when normpath points to a .zip file, innerpath is the relative filepath of the resource within the ZIP archive
descriptor_filename stores the name of a descriptor when it deviates from the default <resource_name>.resource.json. Cannot include subfolders since it is expected to be stored in basepath (otherwise, the relative filepath stored in the descriptor would resolve incorrectly)
descriptor_path: defined by basepath/descriptor_filename
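The relations between these components can be written down in a few lines. This sketch uses POSIX-style paths and derives the default descriptor name from the file name without its extension, which agrees with the output shown above but is not taken from DiMCAT's source. The helper `path_components` and the basepath `/data/sweelinck_keyboard/MS3` are hypothetical.

```python
import posixpath
from typing import Optional


def path_components(
    basepath: str, filepath: str, descriptor_filename: Optional[str] = None
) -> dict:
    """Derive normpath and descriptor_path from basepath and filepath (illustrative only)."""
    if descriptor_filename is not None and posixpath.dirname(descriptor_filename):
        # the descriptor is expected to live in basepath, so subfolders are rejected
        raise ValueError("descriptor_filename must not include subfolders")
    # assumed default: <resource_name>.resource.json, using the file name without extension
    stem = posixpath.splitext(posixpath.basename(filepath))[0]
    effective_name = descriptor_filename or f"{stem}.resource.json"
    return {
        "basepath": basepath,
        "filepath": filepath,
        "descriptor_filename": descriptor_filename,
        "descriptor_path": posixpath.join(basepath, effective_name),
        "normpath": posixpath.join(basepath, filepath),
    }


components = path_components(
    "/data/sweelinck_keyboard/MS3", "SwWV258_fantasia_cromatica.mscx"
)
print(components["normpath"])
print(components["descriptor_path"])
```

As in the dictionary above, descriptor_filename stays None as long as the default name is used.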
Here, the descriptor_path corresponds to the default, which does not currently point to an existing file:
score_resource.descriptor_exists
False
It can be created using .store_descriptor():
score_descriptor_path = score_resource.store_descriptor()
score_resource.descriptor_exists
True
To underline the functionality of the path resource, even the new descriptor can be treated as a resource:
resources.PathResource.from_resource_path(score_descriptor_path)
PathResource
============
{'dtype': 'PathResource',
'basepath': '/home/docs/checkouts/readthedocs.org/user_builds/dimcat/checkouts/stable/unittest_metacorpus/sweelinck_keyboard/MS3',
'descriptor_filename': None,
'resource': {'name': 'swwv258_fantasia_cromatica.resource.json',
'type': 'json',
'path': 'SwWV258_fantasia_cromatica.resource.json',
'scheme': 'file',
'format': 'json',
'mediatype': 'text/json'}}
This is different from re-creating the original PathResource from the stored descriptor:
resources.PathResource.from_descriptor_path(score_descriptor_path)
PathResource
============
{'dtype': 'PathResource',
'basepath': '/home/docs/checkouts/readthedocs.org/user_builds/dimcat/checkouts/stable/unittest_metacorpus/sweelinck_keyboard/MS3',
'descriptor_filename': 'SwWV258_fantasia_cromatica.resource.json',
'resource': {'name': 'swwv258_fantasia_cromatica.mscx',
'type': 'file',
'path': 'SwWV258_fantasia_cromatica.mscx',
'scheme': 'file',
'format': 'mscx',
'dtype': 'PathResource',
'basepath': '/home/docs/checkouts/readthedocs.org/user_builds/dimcat/checkouts/stable/unittest_metacorpus/sweelinck_keyboard/MS3',
'descriptor_filename': 'SwWV258_fantasia_cromatica.resource.json'}}
Note that the descriptor_filename is now set to keep track of the existing one the resource originates from.
By the way, the descriptors written to disk qualify as “normal” DimcatConfigs (see ???)…
deserialize_json_file(score_descriptor_path)
PathResource
============
{'dtype': 'PathResource',
'basepath': '/home/docs/checkouts/readthedocs.org/user_builds/dimcat/checkouts/stable/unittest_metacorpus/sweelinck_keyboard/MS3',
'descriptor_filename': 'SwWV258_fantasia_cromatica.resource.json',
'resource': {'name': 'swwv258_fantasia_cromatica.mscx',
'type': 'file',
'path': 'SwWV258_fantasia_cromatica.mscx',
'scheme': 'file',
'format': 'mscx'}}
… and at the same time as valid Frictionless descriptors, which can be validated using the Frictionless command-line tool or Python library:
fl.validate(score_descriptor_path)
{'valid': True,
'stats': {'tasks': 1, 'errors': 0, 'warnings': 0, 'seconds': 0.004},
'warnings': [],
'errors': [],
'tasks': [{'name': 'swwv258_fantasia_cromatica.mscx',
'type': 'file',
'valid': True,
'place': 'SwWV258_fantasia_cromatica.mscx',
'labels': [],
'stats': {'errors': 0, 'warnings': 0, 'seconds': 0.004},
'warnings': [],
'errors': []}]}
This is also what the property is_valid uses under the hood:
score_resource.is_valid
True
The status of a PathResource is always and unchangeably PATH_ONLY, with a value one above EMPTY:
score_resource.status
<ResourceStatus.PATH_ONLY: 1>
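The phrase “one above EMPTY” suggests that status values are ordered integers. A minimal sketch of such an ordered status, assuming an IntEnum; `MiniStatus` is a hypothetical stand-in containing only the two values mentioned here, whereas the real ResourceStatus has more members and may be implemented differently.

```python
from enum import IntEnum


class MiniStatus(IntEnum):
    """Two of the status values mentioned in the text (hypothetical stand-in)."""

    EMPTY = 0  # inferred from "PATH_ONLY is one above EMPTY"
    PATH_ONLY = 1


status = MiniStatus.PATH_ONLY
# IntEnum members compare like integers, so statuses can be ordered
is_beyond_empty = status > MiniStatus.EMPTY
print(repr(status), is_beyond_empty)
```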
The path components cannot be modified because doing so would invalidate their relations with the other path components:
base_path_level_up = os.path.dirname(score_resource.basepath)
score_resource.basepath = base_path_level_up
---------------------------------------------------------------------------
ResourceIsFrozenError Traceback (most recent call last)
Cell In[12], line 2
1 base_path_level_up = os.path.dirname(score_resource.basepath)
----> 2 score_resource.basepath = base_path_level_up
File ~/checkouts/readthedocs.org/user_builds/dimcat/envs/stable/lib/python3.10/site-packages/dimcat/data/resources/base.py:612, in Resource.basepath(self, basepath)
610 @basepath.setter
611 def basepath(self, basepath: str):
--> 612 self.set_basepath(
613 basepath=basepath,
614 reconcile=False,
615 )
File ~/checkouts/readthedocs.org/user_builds/dimcat/envs/stable/lib/python3.10/site-packages/dimcat/data/resources/base.py:1100, in Resource.set_basepath(self, basepath, reconcile)
1098 if self.basepath is not None and not reconcile and self.is_packaged:
1099 raise ResourceIsPackagedError(self.resource_name, basepath, "basepath")
-> 1100 return self._set_basepath(basepath, reconcile=reconcile)
File ~/checkouts/readthedocs.org/user_builds/dimcat/envs/stable/lib/python3.10/site-packages/dimcat/data/resources/base.py:1064, in Resource._set_basepath(self, basepath, reconcile)
1062 if self.is_frozen:
1063 if not reconcile:
-> 1064 raise ResourceIsFrozenError(self.name, self.basepath, basepath_arg)
1065 # reconcile the current basepath with the new one, which may involve adapting filepath
1066 if self.resource_exists:
ResourceIsFrozenError: Resource 'PathResource' is frozen, i.e. tied to data stored on disk at basepath ~/checkouts/readthedocs.org/user_builds/dimcat/checkouts/stable/unittest_metacorpus/sweelinck_keyboard/MS3. Changing it to ~/checkouts/readthedocs.org/user_builds/dimcat/checkouts/stable/unittest_metacorpus/sweelinck_keyboard would invalidate the relative paths. Consider using copy_to_new_location().
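The error above follows a common pattern: once an object is tied to files on disk, its setters refuse to change the path. A minimal sketch of that pattern follows. The names `FrozenPathError` and `MiniResource` and the paths are hypothetical; the code only mimics the behaviour shown and does not reproduce DiMCAT's implementation.

```python
import os


class FrozenPathError(RuntimeError):
    """Raised when a path component of a frozen resource is modified (hypothetical)."""


class MiniResource:
    def __init__(self, basepath: str, filepath: str, is_frozen: bool = True):
        self._basepath = basepath
        self.filepath = filepath
        self.is_frozen = is_frozen

    @property
    def basepath(self) -> str:
        return self._basepath

    @basepath.setter
    def basepath(self, new_basepath: str) -> None:
        if self.is_frozen:
            # a new basepath would make basepath/filepath point to a different location
            raise FrozenPathError(
                f"basepath {self._basepath!r} is frozen; cannot change it to {new_basepath!r}"
            )
        self._basepath = new_basepath

    @property
    def normpath(self) -> str:
        return os.path.join(self._basepath, self.filepath)


resource = MiniResource("/data/MS3", "score.mscx")
try:
    resource.basepath = os.path.dirname(resource.basepath)
    outcome = "changed"
except FrozenPathError:
    outcome = "refused"
print(outcome)
```

Raising in the setter keeps normpath consistent at all times; moving the data requires an explicit operation, such as the copy_to_new_location() suggested by the error message above.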
DimcatResource#
A DimcatResource is both a Resource in the above sense and a wrapped dataframe. Let’s create one from a TSV resource descriptor:
notes_descriptor_path = os.path.join(sweelinck_dir, "notes", "SwWV258_fantasia_cromatica.notes.resource.json")
notes_resource = resources.DimcatResource.from_descriptor_path(notes_descriptor_path)
notes_resource
DimcatResource
==============
{'dtype': 'DimcatResource',
'basepath': '/home/docs/checkouts/readthedocs.org/user_builds/dimcat/checkouts/stable/unittest_metacorpus/sweelinck_keyboard/notes',
'descriptor_filename': 'SwWV258_fantasia_cromatica.notes.resource.json',
'resource': {'name': 'swwv258_fantasia_cromatica.notes',
'type': 'table',
'path': 'SwWV258_fantasia_cromatica.notes.tsv',
'scheme': 'file',
'format': 'tsv',
'mediatype': 'text/tsv',
'encoding': 'utf-8',
'dialect': {'csv': {'delimiter': '\t'}},
'schema': 'https://raw.githubusercontent.com/DCMLab/frictionless_schemas/main/notes/VvF3LJXVnKvxHg.schema.yaml',
'creator': {'@context': 'https://schema.org/',
'@type': 'SoftwareApplication',
'@id': 'https://pypi.org/project/ms3/',
'name': 'ms3',
'description': 'A parser for MuseScore 3 files and data factory for annotated music corpora.',
'author': {'name': 'Johannes Hentschel', '@id': 'https://orcid.org/0000-0002-1986-9545'},
'softwareVersion': '2.4.0'},
'git_revision': '2a771e5884eace9d254394e2e91538facc533897',
'git_tag': 'v2.0-4-g2a771e5'},
'auto_validate': False,
'default_groupby': [],
'ResourceStatus': 'STANDALONE_NOT_LOADED'}
As the output shows, the status of the resource is STANDALONE_NOT_LOADED.
The resource is considered standalone, as opposed to packaged, because it has its own resource descriptor file.
And it is considered “not loaded” because the actual tabular data has not been loaded from the described TSV file into memory.
The latter is achieved through the property df (short for dataframe):
notes_resource.df
| i | mc | mn | quarterbeats | quarterbeats_all_endings | duration_qb | mc_onset | mn_onset | timesig | staff | voice | duration | nominal_duration | scalar | tied | tpc | midi | name | octave | chord_id |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 0 | 0 | 4.000000 | 0 | 0 | 4/4 | 2 | 1 | 1 | 1 | 1 | <NA> | 2 | 62 | D4 | 4 | 0 |
| 1 | 2 | 2 | 4 | 4 | 2.000000 | 0 | 0 | 4/4 | 2 | 1 | 1/2 | 1/2 | 1 | <NA> | 2 | 62 | D4 | 4 | 1 |
| 2 | 2 | 2 | 6 | 6 | 2.000000 | 1/2 | 1/2 | 4/4 | 2 | 1 | 1/2 | 1/2 | 1 | <NA> | 2 | 62 | D4 | 4 | 2 |
| 3 | 3 | 3 | 8 | 8 | 2.000000 | 0 | 0 | 4/4 | 2 | 1 | 1/2 | 1/2 | 1 | <NA> | 7 | 61 | C#4 | 4 | 3 |
| 4 | 3 | 3 | 10 | 10 | 2.000000 | 1/2 | 1/2 | 4/4 | 2 | 1 | 1/2 | 1/2 | 1 | <NA> | 0 | 60 | C4 | 4 | 4 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2669 | 195 | 195 | 4679/6 | 4679/6 | 0.166667 | 23/24 | 23/24 | 4/4 | 1 | 1 | 1/24 | 1/16 | 2/3 | <NA> | 4 | 64 | E4 | 4 | 2662 |
| 2670 | 196 | 196 | 780 | 780 | 4.000000 | 0 | 0 | 4/4 | 3 | 1 | 1 | 1 | 1 | <NA> | 2 | 38 | D2 | 2 | 2672 |
| 2671 | 196 | 196 | 780 | 780 | 4.000000 | 0 | 0 | 4/4 | 2 | 1 | 1 | 1 | 1 | <NA> | 3 | 57 | A3 | 3 | 2671 |
| 2672 | 196 | 196 | 780 | 780 | 4.000000 | 0 | 0 | 4/4 | 1 | 2 | 1 | 1 | 1 | <NA> | 2 | 62 | D4 | 4 | 2670 |
| 2673 | 196 | 196 | 780 | 780 | 4.000000 | 0 | 0 | 4/4 | 1 | 1 | 1 | 1 | 1 | <NA> | 6 | 66 | F#4 | 4 | 2669 |
2674 rows × 19 columns
… which changes the status to STANDALONE_LOADED:
notes_resource.status
<ResourceStatus.STANDALONE_LOADED: 6>
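The transition from “not loaded” to “loaded” is a lazy-loading pattern: the file is parsed only on first access to the data property. The sketch below illustrates this with the standard library's csv module instead of a dataframe. The class `LazyTable`, its string-valued status, and the sample data are assumptions made for the illustration and are not DiMCAT's API.

```python
import csv
import os
import tempfile


class LazyTable:
    """Parses a TSV file only when .rows is accessed for the first time."""

    def __init__(self, tsv_path: str):
        self.tsv_path = tsv_path
        self._rows = None  # nothing in memory yet

    @property
    def status(self) -> str:
        return "NOT_LOADED" if self._rows is None else "LOADED"

    @property
    def rows(self) -> list:
        if self._rows is None:
            with open(self.tsv_path, newline="", encoding="utf-8") as f:
                self._rows = list(csv.DictReader(f, delimiter="\t"))
        return self._rows


with tempfile.TemporaryDirectory() as tmp_dir:
    tsv_path = os.path.join(tmp_dir, "notes.tsv")
    with open(tsv_path, "w", newline="", encoding="utf-8") as f:
        f.write("mc\tmidi\tname\n1\t62\tD4\n2\t62\tD4\n")
    table = LazyTable(tsv_path)
    status_before = table.status
    first_row = table.rows[0]  # triggers the actual read
    status_after = table.status
```

Subsequent accesses to `rows` return the cached list without reading the file again.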
type(notes_resource)
dimcat.data.resources.dc.DimcatResource
Package#
A package, or DataPackage, is a collection of resources. Analogously to resources, there are two main types:
PathPackage for collecting PathResources, and
DimcatPackage for collecting DimcatResources.
Just like resources, packages have a basepath and may be stored as a Frictionless package descriptor.
For starters, let’s assemble a package from scratch:
from dimcat import packages
path_package = packages.PathPackage(package_name="scratch")
path_package
PathPackage
===========
{'name': 'scratch', 'resources': [], 'basepath': None}
The fields are mostly familiar from above:
basepath: Absolute path on disk where the descriptor and the ZIP file would be stored.
resources: Currently an empty list. Typically, all resources need to have the same basepath (if not, the package is ‘misaligned’).
name: As per the Frictionless specification every package needs a name. In DiMCAT, the relevant property is called package_name.
descriptor_filename: The name of the descriptor file if it deviates from the default <package_name>.datapackage.json.
auto_validate: If True, the package is automatically validated after it is stored to disk.
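What ‘misaligned’ means can be stated compactly: the resources of a package do not all share one basepath. A small sketch using plain strings in place of resource objects; `is_aligned` is a hypothetical helper and the paths are made up for the example.

```python
def is_aligned(resource_basepaths: list) -> bool:
    """Return True if all given basepaths are identical (illustrative only)."""
    # an empty package is treated as trivially aligned
    return len(set(resource_basepaths)) <= 1


aligned = is_aligned(["/data/corpus", "/data/corpus"])
misaligned = is_aligned(["/data/corpus/MS3", "/data/corpus/notes"])
print(aligned, misaligned)
```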
Now let’s add the path resource we have created above:
path_package.add_resource(score_resource)
path_package
PathPackage
===========
{'name': 'scratch', 'resources': ["'swwv258_fantasia_cromatica.mscx' (PathResource)"], 'basepath': None}
path_package.store_descriptor()
'/home/docs/dimcat_data/scratch.datapackage.json'
We can also create a package directly from a list of resources:
dimcat_package = packages.DimcatPackage.from_resources([notes_resource], package_name="pack")
dimcat_package
DimcatPackage
=============
{'name': 'pack', 'resources': ["'swwv258_fantasia_cromatica.notes' (DimcatResource)"], 'basepath': None}
score_resource.is_serialized
True
score_resource.status
<ResourceStatus.PATH_ONLY: 1>
score_resource.to_dict()
{'dtype': 'PathResource',
'basepath': '/home/docs/checkouts/readthedocs.org/user_builds/dimcat/checkouts/stable/unittest_metacorpus/sweelinck_keyboard/MS3',
'descriptor_filename': 'SwWV258_fantasia_cromatica.resource.json',
'resource': {'name': 'swwv258_fantasia_cromatica.mscx',
'type': 'file',
'path': 'SwWV258_fantasia_cromatica.mscx',
'scheme': 'file',
'format': 'mscx',
'encoding': 'utf-8'}}
score_resource.to_dict(pickle=True)
{'name': 'swwv258_fantasia_cromatica.mscx',
'type': 'file',
'path': 'SwWV258_fantasia_cromatica.mscx',
'scheme': 'file',
'format': 'mscx',
'encoding': 'utf-8',
'dtype': 'PathResource',
'basepath': '/home/docs/checkouts/readthedocs.org/user_builds/dimcat/checkouts/stable/unittest_metacorpus/sweelinck_keyboard/MS3',
'descriptor_filename': 'SwWV258_fantasia_cromatica.resource.json'}
score_resource.to_config().create()
PathResource
============
{'dtype': 'PathResource',
'basepath': '/home/docs/checkouts/readthedocs.org/user_builds/dimcat/checkouts/stable/unittest_metacorpus/sweelinck_keyboard/MS3',
'descriptor_filename': 'SwWV258_fantasia_cromatica.resource.json',
'resource': {'name': 'swwv258_fantasia_cromatica.mscx',
'type': 'file',
'path': 'SwWV258_fantasia_cromatica.mscx',
'scheme': 'file',
'format': 'mscx',
'encoding': 'utf-8'}}
notes_descriptor_path = os.path.join(sweelinck_dir, "notes", "SwWV258_fantasia_cromatica.notes.resource.json")
notes_path_resource = resources.Resource.from_descriptor_path(notes_descriptor_path)
notes_path_resource = resources.PathResource.from_descriptor_path(notes_descriptor_path)
notes_path_resource
PathResource
============
{'dtype': 'PathResource',
'basepath': '/home/docs/checkouts/readthedocs.org/user_builds/dimcat/checkouts/stable/unittest_metacorpus/sweelinck_keyboard/notes',
'descriptor_filename': 'SwWV258_fantasia_cromatica.notes.resource.json',
'resource': {'name': 'swwv258_fantasia_cromatica.notes',
'type': 'table',
'path': 'SwWV258_fantasia_cromatica.notes.tsv',
'scheme': 'file',
'format': 'tsv',
'mediatype': 'text/tsv',
'encoding': 'utf-8',
'dialect': {'csv': {'delimiter': '\t'}},
'schema': 'https://raw.githubusercontent.com/DCMLab/frictionless_schemas/main/notes/VvF3LJXVnKvxHg.schema.yaml',
'creator': {'@context': 'https://schema.org/',
'@type': 'SoftwareApplication',
'@id': 'https://pypi.org/project/ms3/',
'name': 'ms3',
'description': 'A parser for MuseScore 3 files and data factory for annotated music corpora.',
'author': {'name': 'Johannes Hentschel', '@id': 'https://orcid.org/0000-0002-1986-9545'},
'softwareVersion': '2.4.0'},
'git_revision': '2a771e5884eace9d254394e2e91538facc533897',
'git_tag': 'v2.0-4-g2a771e5'}}
notes_resource = resources.Resource.from_descriptor_path(notes_descriptor_path)
notes_resource
DimcatResource
==============
{'dtype': 'DimcatResource',
'basepath': '/home/docs/checkouts/readthedocs.org/user_builds/dimcat/checkouts/stable/unittest_metacorpus/sweelinck_keyboard/notes',
'descriptor_filename': 'SwWV258_fantasia_cromatica.notes.resource.json',
'resource': {'name': 'swwv258_fantasia_cromatica.notes',
'type': 'table',
'path': 'SwWV258_fantasia_cromatica.notes.tsv',
'scheme': 'file',
'format': 'tsv',
'mediatype': 'text/tsv',
'encoding': 'utf-8',
'dialect': {'csv': {'delimiter': '\t'}},
'schema': 'https://raw.githubusercontent.com/DCMLab/frictionless_schemas/main/notes/VvF3LJXVnKvxHg.schema.yaml',
'creator': {'@context': 'https://schema.org/',
'@type': 'SoftwareApplication',
'@id': 'https://pypi.org/project/ms3/',
'name': 'ms3',
'description': 'A parser for MuseScore 3 files and data factory for annotated music corpora.',
'author': {'name': 'Johannes Hentschel', '@id': 'https://orcid.org/0000-0002-1986-9545'},
'softwareVersion': '2.4.0'},
'git_revision': '2a771e5884eace9d254394e2e91538facc533897',
'git_tag': 'v2.0-4-g2a771e5'},
'auto_validate': False,
'default_groupby': [],
'ResourceStatus': 'STANDALONE_NOT_LOADED'}