import sys
if not sys.warnoptions:
    import warnings
    warnings.simplefilter("ignore")
import os
import frictionless as fl
from dimcat.base import deserialize_json_file
CORPUS_PATH = os.path.abspath(os.path.join("..", "..", "unittest_metacorpus"))
assert os.path.isdir(CORPUS_PATH)
sweelinck_dir = os.path.join(CORPUS_PATH, "sweelinck_keyboard")

Data#

Resource#

A resource is a combination of a file and its descriptor. It allows for interacting with the file without having to “touch” it by interacting with its descriptor only. The descriptor comes in form of a dictionary and is typically stored next to the file in JSON or YAML format.
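To make the idea concrete, here is a minimal sketch that uses only the Python standard library and is not DiMCAT’s own code. It writes a small JSON descriptor next to a file and then reads the descriptor back without opening the file itself. The helper `write_descriptor`, the example file name, and the choice of fields are assumptions made for illustration.

```python
import json
import os
import tempfile


def write_descriptor(resource_path: str) -> str:
    """Store a small JSON descriptor next to the file and return its path (hypothetical helper)."""
    basepath, filename = os.path.split(resource_path)
    stem, extension = os.path.splitext(filename)
    descriptor = {
        "name": filename.lower(),
        "path": filename,  # relative to the directory that holds the descriptor
        "format": extension.lstrip("."),
    }
    descriptor_path = os.path.join(basepath, f"{stem}.resource.json")
    with open(descriptor_path, "w", encoding="utf-8") as f:
        json.dump(descriptor, f, indent=2)
    return descriptor_path


with tempfile.TemporaryDirectory() as tmp_dir:
    # create a placeholder file to be described
    resource_path = os.path.join(tmp_dir, "Example_Score.mscx")
    with open(resource_path, "w", encoding="utf-8") as f:
        f.write("<museScore/>")
    descriptor_path = write_descriptor(resource_path)
    # from here on, only the descriptor is read, not the file it describes
    with open(descriptor_path, encoding="utf-8") as f:
        loaded = json.load(f)
    print(loaded)
```

Because the descriptor stores the path relative to its own directory, the pair of files can be moved together without the descriptor becoming stale.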

DiMCAT follows the Frictionless specification for describing resources. There are two types of resources, PathResource and DimcatResource, each of which is introduced below.

They can be instantiated from a single filepath using the constructors

  • .from_resource_path() which takes the path to the resource file to be described

  • .from_descriptor_path() which takes a filepath pointing to a JSON or YAML file containing a resource descriptor

Let’s look at an example of each, starting with the PathResource.

PathResource#

The sweelinck_keyboard repository contains a single MuseScore file (in the folder “MS3”) and several TSV files extracted from it. Let’s load the MuseScore file as a PathResource:

from dimcat import resources
score_path = os.path.join(sweelinck_dir, "MS3", "SwWV258_fantasia_cromatica.mscx")
score_resource = resources.PathResource.from_resource_path(score_path)
score_resource.get_path_dict()
{'basepath': ~/checkouts/readthedocs.org/user_builds/dimcat/checkouts/stable/unittest_metacorpus/sweelinck_keyboard/MS3,
 'filepath': 'SwWV258_fantasia_cromatica.mscx',
 'innerpath': None,
 'descriptor_filename': None,
 'descriptor_path': '/home/docs/checkouts/readthedocs.org/user_builds/dimcat/checkouts/stable/unittest_metacorpus/sweelinck_keyboard/MS3/SwWV258_fantasia_cromatica.resource.json',
 'normpath': '/home/docs/checkouts/readthedocs.org/user_builds/dimcat/checkouts/stable/unittest_metacorpus/sweelinck_keyboard/MS3/SwWV258_fantasia_cromatica.mscx'}

The dictionary returned by .get_path_dict() tells us everything we need to know to handle the resource physically:

  • basepath is an absolute directory

  • filepath is the filepath (which can include subfolders), relative to the basepath

  • normpath is the full path to the resource and defined as basepath/filepath (both need to be specified)

  • innerpath: when normpath points to a .zip file, innerpath is the relative filepath of the resource within the ZIP archive

  • descriptor_filename stores the name of the descriptor when it deviates from the default <resource_name>.resource.json. It cannot include subfolders because the descriptor is expected to be stored in basepath (otherwise, the relative filepath stored in the descriptor would resolve incorrectly)

  • descriptor_path: defined by basepath/descriptor_filename
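The relations between these components can be sketched in a few lines of standard-library Python. This is an illustration, not DiMCAT’s implementation: the function `build_path_dict` is hypothetical, innerpath is left out, and the rule that the resource name is the file name without its last extension is an assumption inferred from the output above.

```python
import os
from typing import Optional


def build_path_dict(basepath: str, filepath: str, descriptor_filename: Optional[str] = None) -> dict:
    """Derive normpath and descriptor_path from the other components (illustrative only)."""
    if descriptor_filename is None:
        # assumed default: <resource_name>.resource.json, where resource_name
        # is the file name without its last extension
        resource_name, _ = os.path.splitext(os.path.basename(filepath))
        effective_name = f"{resource_name}.resource.json"
    else:
        if os.path.basename(descriptor_filename) != descriptor_filename:
            raise ValueError("descriptor_filename must not include subfolders")
        effective_name = descriptor_filename
    return {
        "basepath": basepath,
        "filepath": filepath,
        "descriptor_filename": descriptor_filename,
        "descriptor_path": os.path.join(basepath, effective_name),
        "normpath": os.path.join(basepath, filepath),
    }


paths = build_path_dict("/data/sweelinck_keyboard/MS3", "SwWV258_fantasia_cromatica.mscx")
print(paths["normpath"])
print(paths["descriptor_path"])
```

Deriving normpath and descriptor_path instead of storing them independently means that the components cannot drift apart.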

Here, the descriptor_path corresponds to the default, which does not currently point to an existing file:

score_resource.descriptor_exists
False

It can be created using .store_descriptor():

score_descriptor_path = score_resource.store_descriptor()
score_resource.descriptor_exists
True

To underline the functionality of the path resource, even the new descriptor can be treated as a resource:

resources.PathResource.from_resource_path(score_descriptor_path)
PathResource
============
{'dtype': 'PathResource',
 'basepath': '/home/docs/checkouts/readthedocs.org/user_builds/dimcat/checkouts/stable/unittest_metacorpus/sweelinck_keyboard/MS3',
 'descriptor_filename': None,
 'resource': {'name': 'swwv258_fantasia_cromatica.resource.json',
              'type': 'json',
              'path': 'SwWV258_fantasia_cromatica.resource.json',
              'scheme': 'file',
              'format': 'json',
              'mediatype': 'text/json'}}

This is different from re-creating the original PathResource from the newly stored descriptor:

resources.PathResource.from_descriptor_path(score_descriptor_path)
PathResource
============
{'dtype': 'PathResource',
 'basepath': '/home/docs/checkouts/readthedocs.org/user_builds/dimcat/checkouts/stable/unittest_metacorpus/sweelinck_keyboard/MS3',
 'descriptor_filename': 'SwWV258_fantasia_cromatica.resource.json',
 'resource': {'name': 'swwv258_fantasia_cromatica.mscx',
              'type': 'file',
              'path': 'SwWV258_fantasia_cromatica.mscx',
              'scheme': 'file',
              'format': 'mscx',
              'dtype': 'PathResource',
              'basepath': '/home/docs/checkouts/readthedocs.org/user_builds/dimcat/checkouts/stable/unittest_metacorpus/sweelinck_keyboard/MS3',
              'descriptor_filename': 'SwWV258_fantasia_cromatica.resource.json'}}

Note that descriptor_filename is now set, which keeps track of the existing descriptor file that the resource originates from.

By the way, the descriptors written to disk qualify as “normal” DimcatConfigs (see ???)…

deserialize_json_file(score_descriptor_path)
PathResource
============
{'dtype': 'PathResource',
 'basepath': '/home/docs/checkouts/readthedocs.org/user_builds/dimcat/checkouts/stable/unittest_metacorpus/sweelinck_keyboard/MS3',
 'descriptor_filename': 'SwWV258_fantasia_cromatica.resource.json',
 'resource': {'name': 'swwv258_fantasia_cromatica.mscx',
              'type': 'file',
              'path': 'SwWV258_fantasia_cromatica.mscx',
              'scheme': 'file',
              'format': 'mscx'}}

… and at the same time as valid Frictionless descriptors that can be validated using the Frictionless command-line tool or Python library:

fl.validate(score_descriptor_path)
{'valid': True,
 'stats': {'tasks': 1, 'errors': 0, 'warnings': 0, 'seconds': 0.004},
 'warnings': [],
 'errors': [],
 'tasks': [{'name': 'swwv258_fantasia_cromatica.mscx',
            'type': 'file',
            'valid': True,
            'place': 'SwWV258_fantasia_cromatica.mscx',
            'labels': [],
            'stats': {'errors': 0, 'warnings': 0, 'seconds': 0.004},
            'warnings': [],
            'errors': []}]}

This is also what the property is_valid uses under the hood:

score_resource.is_valid
True

The status of a PathResource is always and unchangeably PATH_ONLY, with a value one above EMPTY:

score_resource.status
<ResourceStatus.PATH_ONLY: 1>
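The practical point of numeric status values is that statuses can be compared. The following sketch assumes an integer-backed enum and includes only members whose values can be read off this page (EMPTY is inferred to be 0 from the sentence above; STANDALONE_LOADED appears with the value 6 further down). The class name `ResourceStatusSketch` is hypothetical and the real enum has more members.

```python
from enum import IntEnum


class ResourceStatusSketch(IntEnum):
    """Partial, illustrative stand-in; not the real ResourceStatus enum."""

    EMPTY = 0  # inferred: PATH_ONLY is described as one above EMPTY
    PATH_ONLY = 1
    STANDALONE_LOADED = 6


status = ResourceStatusSketch.PATH_ONLY
# integer-backed members support ordering comparisons
print(status > ResourceStatusSketch.EMPTY)
print(status < ResourceStatusSketch.STANDALONE_LOADED)
```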

The path components cannot be modified because doing so would invalidate the relations between them:

base_path_level_up = os.path.dirname(score_resource.basepath)
score_resource.basepath = base_path_level_up
---------------------------------------------------------------------------
ResourceIsFrozenError                     Traceback (most recent call last)
Cell In[12], line 2
      1 base_path_level_up = os.path.dirname(score_resource.basepath)
----> 2 score_resource.basepath = base_path_level_up

File ~/checkouts/readthedocs.org/user_builds/dimcat/envs/stable/lib/python3.10/site-packages/dimcat/data/resources/base.py:612, in Resource.basepath(self, basepath)
    610 @basepath.setter
    611 def basepath(self, basepath: str):
--> 612     self.set_basepath(
    613         basepath=basepath,
    614         reconcile=False,
    615     )

File ~/checkouts/readthedocs.org/user_builds/dimcat/envs/stable/lib/python3.10/site-packages/dimcat/data/resources/base.py:1100, in Resource.set_basepath(self, basepath, reconcile)
   1098 if self.basepath is not None and not reconcile and self.is_packaged:
   1099     raise ResourceIsPackagedError(self.resource_name, basepath, "basepath")
-> 1100 return self._set_basepath(basepath, reconcile=reconcile)

File ~/checkouts/readthedocs.org/user_builds/dimcat/envs/stable/lib/python3.10/site-packages/dimcat/data/resources/base.py:1064, in Resource._set_basepath(self, basepath, reconcile)
   1062 if self.is_frozen:
   1063     if not reconcile:
-> 1064         raise ResourceIsFrozenError(self.name, self.basepath, basepath_arg)
   1065     # reconcile the current basepath with the new one, which may involve adapting filepath
   1066     if self.resource_exists:

ResourceIsFrozenError: Resource 'PathResource' is frozen, i.e. tied to data stored on disk at basepath ~/checkouts/readthedocs.org/user_builds/dimcat/checkouts/stable/unittest_metacorpus/sweelinck_keyboard/MS3. Changing it to ~/checkouts/readthedocs.org/user_builds/dimcat/checkouts/stable/unittest_metacorpus/sweelinck_keyboard would invalidate the relative paths. Consider using copy_to_new_location().
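The guarding pattern visible in the traceback can be reduced to a property setter that refuses to change a path component while the object is frozen. The sketch below is a toy illustration under that assumption; `FrozenError` and `PathHolder` are hypothetical names and do not reproduce DiMCAT’s classes or its reconcile logic.

```python
class FrozenError(Exception):
    """Raised when a path component of a frozen object is modified (hypothetical name)."""


class PathHolder:
    """Toy stand-in for a resource whose basepath is tied to a file on disk."""

    def __init__(self, basepath: str, is_frozen: bool = True):
        self._basepath = basepath
        self.is_frozen = is_frozen

    @property
    def basepath(self) -> str:
        return self._basepath

    @basepath.setter
    def basepath(self, new_basepath: str) -> None:
        if self.is_frozen:
            # changing basepath alone would make basepath/filepath point to a non-existent file
            raise FrozenError(
                f"Cannot change basepath {self._basepath!r} to {new_basepath!r}: the object is frozen."
            )
        self._basepath = new_basepath


holder = PathHolder("/data/sweelinck_keyboard/MS3")
try:
    holder.basepath = "/data/sweelinck_keyboard"
except FrozenError as error:
    print(error)
```

Raising in the setter keeps the object consistent: after the failed assignment, basepath still holds its original value.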

DimcatResource#

A DimcatResource is both a Resource in the above sense and a wrapped dataframe. Let’s create one from a TSV resource descriptor:

notes_descriptor_path = os.path.join(sweelinck_dir, "notes", "SwWV258_fantasia_cromatica.notes.resource.json")
notes_resource = resources.DimcatResource.from_descriptor_path(notes_descriptor_path)
notes_resource
DimcatResource
==============
{'dtype': 'DimcatResource',
 'basepath': '/home/docs/checkouts/readthedocs.org/user_builds/dimcat/checkouts/stable/unittest_metacorpus/sweelinck_keyboard/notes',
 'descriptor_filename': 'SwWV258_fantasia_cromatica.notes.resource.json',
 'resource': {'name': 'swwv258_fantasia_cromatica.notes',
              'type': 'table',
              'path': 'SwWV258_fantasia_cromatica.notes.tsv',
              'scheme': 'file',
              'format': 'tsv',
              'mediatype': 'text/tsv',
              'encoding': 'utf-8',
              'dialect': {'csv': {'delimiter': '\t'}},
              'schema': 'https://raw.githubusercontent.com/DCMLab/frictionless_schemas/main/notes/VvF3LJXVnKvxHg.schema.yaml',
              'creator': {'@context': 'https://schema.org/',
                          '@type': 'SoftwareApplication',
                          '@id': 'https://pypi.org/project/ms3/',
                          'name': 'ms3',
                          'description': 'A parser for MuseScore 3 files and data factory for annotated music corpora.',
                          'author': {'name': 'Johannes Hentschel', '@id': 'https://orcid.org/0000-0002-1986-9545'},
                          'softwareVersion': '2.4.0'},
              'git_revision': '2a771e5884eace9d254394e2e91538facc533897',
              'git_tag': 'v2.0-4-g2a771e5'},
 'auto_validate': False,
 'default_groupby': [],
 'ResourceStatus': 'STANDALONE_NOT_LOADED'}

As the output shows, the status of the resource is STANDALONE_NOT_LOADED. The resource is considered standalone, as opposed to packaged, because it has its own resource descriptor file. And it is considered “not loaded” because the actual tabular data has not been loaded from the described TSV file into memory. The latter is achieved through the property df (short for dataframe):

notes_resource.df
       mc   mn  quarterbeats  quarterbeats_all_endings  duration_qb  mc_onset  mn_onset  timesig  staff  voice  duration  nominal_duration  scalar  tied  tpc  midi  name  octave  chord_id
i
   0    1    1             0                         0     4.000000         0         0      4/4      2      1         1                 1       1  <NA>    2    62    D4       4         0
   1    2    2             4                         4     2.000000         0         0      4/4      2      1       1/2               1/2       1  <NA>    2    62    D4       4         1
   2    2    2             6                         6     2.000000       1/2       1/2      4/4      2      1       1/2               1/2       1  <NA>    2    62    D4       4         2
   3    3    3             8                         8     2.000000         0         0      4/4      2      1       1/2               1/2       1  <NA>    7    61   C#4       4         3
   4    3    3            10                        10     2.000000       1/2       1/2      4/4      2      1       1/2               1/2       1  <NA>    0    60    C4       4         4
 ...  ...  ...           ...                       ...          ...       ...       ...      ...    ...    ...       ...               ...     ...   ...  ...   ...   ...     ...       ...
2669  195  195        4679/6                    4679/6     0.166667     23/24     23/24      4/4      1      1      1/24              1/16     2/3  <NA>    4    64    E4       4      2662
2670  196  196           780                       780     4.000000         0         0      4/4      3      1         1                 1       1  <NA>    2    38    D2       2      2672
2671  196  196           780                       780     4.000000         0         0      4/4      2      1         1                 1       1  <NA>    3    57    A3       3      2671
2672  196  196           780                       780     4.000000         0         0      4/4      1      2         1                 1       1  <NA>    2    62    D4       4      2670
2673  196  196           780                       780     4.000000         0         0      4/4      1      1         1                 1       1  <NA>    6    66   F#4       4      2669

2674 rows × 19 columns

… which changes the status to STANDALONE_LOADED:

notes_resource.status
<ResourceStatus.STANDALONE_LOADED: 6>
type(notes_resource)
dimcat.data.resources.dc.DimcatResource
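The “not loaded” versus “loaded” distinction follows a common lazy-loading pattern: the file is read only when the data is first requested, and the result is cached. Below is a minimal standard-library sketch of that pattern, assuming a plain list of row dictionaries in place of a dataframe. The class `LazyTable`, its attributes, and the example file are hypothetical and not part of DiMCAT.

```python
import csv
import os
import tempfile


class LazyTable:
    """Reads the TSV file only when .rows is accessed for the first time."""

    def __init__(self, tsv_path: str):
        self.tsv_path = tsv_path
        self._rows = None  # nothing in memory yet

    @property
    def is_loaded(self) -> bool:
        return self._rows is not None

    @property
    def rows(self) -> list:
        if self._rows is None:
            with open(self.tsv_path, newline="", encoding="utf-8") as f:
                self._rows = list(csv.DictReader(f, delimiter="\t"))
        return self._rows


with tempfile.TemporaryDirectory() as tmp_dir:
    tsv_path = os.path.join(tmp_dir, "example.notes.tsv")
    with open(tsv_path, "w", newline="", encoding="utf-8") as f:
        f.write("mc\tmidi\tname\n1\t62\tD4\n2\t61\tC#4\n")
    table = LazyTable(tsv_path)
    loaded_before = table.is_loaded  # constructing the object does not read the file
    n_rows = len(table.rows)  # first access triggers the read
    loaded_after = table.is_loaded

print(loaded_before, n_rows, loaded_after)
```

Deferring the read keeps object creation cheap, which matters when many descriptors are inspected but only a few tables are actually needed.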

Package#

A package, or DataPackage, is a collection of resources. Analogously, there are two main types, PathPackage and DimcatPackage.

Just like resources, packages have a basepath and may be stored as a frictionless package descriptor.

For starters, let’s assemble a package from scratch:

from dimcat import packages
path_package = packages.PathPackage(package_name="scratch")
path_package
PathPackage
===========
{'name': 'scratch', 'resources': [], 'basepath': None}

The fields are mostly familiar from above:

  • basepath: Absolute path on disk where the descriptor and the ZIP file would be stored.

  • resources: Currently an empty list. Typically, all resources need to have the same basepath (if not, the package is ‘misaligned’).

  • name: As per the Frictionless specification every package needs a name. In DiMCAT, the relevant property is called package_name.

  • descriptor_filename: The name of the descriptor file if it deviates from the default <package_name>.datapackage.json.

  • auto_validate: If True, the package is automatically validated after it is stored to disk.
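The notion of a ‘misaligned’ package can be illustrated with plain dictionaries. The sketch below is not DiMCAT’s serialization: the helpers `make_package_descriptor` and `is_aligned`, the dictionary layout, and the example paths are assumptions chosen to show that alignment amounts to all resources sharing one basepath.

```python
def make_package_descriptor(package_name: str, members: list) -> dict:
    """Bundle resource descriptors under a package name (illustrative only)."""
    return {"name": package_name, "resources": [m["descriptor"] for m in members]}


def is_aligned(members: list) -> bool:
    """True if all members share a single basepath; an empty package counts as aligned here."""
    return len({m["basepath"] for m in members}) <= 1


members = [
    {"basepath": "/data/corpus", "descriptor": {"name": "score", "path": "MS3/score.mscx"}},
    {"basepath": "/data/corpus", "descriptor": {"name": "notes", "path": "notes/score.notes.tsv"}},
]
package = make_package_descriptor("scratch", members)
print(package["name"], len(package["resources"]))
print(is_aligned(members))

# a member stored elsewhere makes the package misaligned
stray = {"basepath": "/data/elsewhere", "descriptor": {"name": "stray", "path": "stray.tsv"}}
print(is_aligned(members + [stray]))
```

When all members share the package’s basepath, the relative paths in a single package descriptor resolve correctly for every resource.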

Now let’s add the path resource we have created above:

path_package.add_resource(score_resource)
path_package
PathPackage
===========
{'name': 'scratch', 'resources': ["'swwv258_fantasia_cromatica.mscx' (PathResource)"], 'basepath': None}
path_package.store_descriptor()
'/home/docs/dimcat_data/scratch.datapackage.json'

We can also create a package directly from a list of resources:

dimcat_package = packages.DimcatPackage.from_resources([notes_resource], package_name="pack")
dimcat_package
DimcatPackage
=============
{'name': 'pack', 'resources': ["'swwv258_fantasia_cromatica.notes' (DimcatResource)"], 'basepath': None}
score_resource.is_serialized
True
score_resource.status
<ResourceStatus.PATH_ONLY: 1>
score_resource.to_dict()
{'dtype': 'PathResource',
 'basepath': '/home/docs/checkouts/readthedocs.org/user_builds/dimcat/checkouts/stable/unittest_metacorpus/sweelinck_keyboard/MS3',
 'descriptor_filename': 'SwWV258_fantasia_cromatica.resource.json',
 'resource': {'name': 'swwv258_fantasia_cromatica.mscx',
  'type': 'file',
  'path': 'SwWV258_fantasia_cromatica.mscx',
  'scheme': 'file',
  'format': 'mscx',
  'encoding': 'utf-8'}}
score_resource.to_dict(pickle=True)
{'name': 'swwv258_fantasia_cromatica.mscx',
 'type': 'file',
 'path': 'SwWV258_fantasia_cromatica.mscx',
 'scheme': 'file',
 'format': 'mscx',
 'encoding': 'utf-8',
 'dtype': 'PathResource',
 'basepath': '/home/docs/checkouts/readthedocs.org/user_builds/dimcat/checkouts/stable/unittest_metacorpus/sweelinck_keyboard/MS3',
 'descriptor_filename': 'SwWV258_fantasia_cromatica.resource.json'}
score_resource.to_config().create()
PathResource
============
{'dtype': 'PathResource',
 'basepath': '/home/docs/checkouts/readthedocs.org/user_builds/dimcat/checkouts/stable/unittest_metacorpus/sweelinck_keyboard/MS3',
 'descriptor_filename': 'SwWV258_fantasia_cromatica.resource.json',
 'resource': {'name': 'swwv258_fantasia_cromatica.mscx',
              'type': 'file',
              'path': 'SwWV258_fantasia_cromatica.mscx',
              'scheme': 'file',
              'format': 'mscx',
              'encoding': 'utf-8'}}
notes_descriptor_path = os.path.join(sweelinck_dir, "notes", "SwWV258_fantasia_cromatica.notes.resource.json")
notes_path_resource = resources.Resource.from_descriptor_path(notes_descriptor_path)
notes_path_resource = resources.PathResource.from_descriptor_path(notes_descriptor_path)
notes_path_resource
PathResource
============
{'dtype': 'PathResource',
 'basepath': '/home/docs/checkouts/readthedocs.org/user_builds/dimcat/checkouts/stable/unittest_metacorpus/sweelinck_keyboard/notes',
 'descriptor_filename': 'SwWV258_fantasia_cromatica.notes.resource.json',
 'resource': {'name': 'swwv258_fantasia_cromatica.notes',
              'type': 'table',
              'path': 'SwWV258_fantasia_cromatica.notes.tsv',
              'scheme': 'file',
              'format': 'tsv',
              'mediatype': 'text/tsv',
              'encoding': 'utf-8',
              'dialect': {'csv': {'delimiter': '\t'}},
              'schema': 'https://raw.githubusercontent.com/DCMLab/frictionless_schemas/main/notes/VvF3LJXVnKvxHg.schema.yaml',
              'creator': {'@context': 'https://schema.org/',
                          '@type': 'SoftwareApplication',
                          '@id': 'https://pypi.org/project/ms3/',
                          'name': 'ms3',
                          'description': 'A parser for MuseScore 3 files and data factory for annotated music corpora.',
                          'author': {'name': 'Johannes Hentschel', '@id': 'https://orcid.org/0000-0002-1986-9545'},
                          'softwareVersion': '2.4.0'},
              'git_revision': '2a771e5884eace9d254394e2e91538facc533897',
              'git_tag': 'v2.0-4-g2a771e5'}}
notes_resource = resources.Resource.from_descriptor_path(notes_descriptor_path)
notes_resource
DimcatResource
==============
{'dtype': 'DimcatResource',
 'basepath': '/home/docs/checkouts/readthedocs.org/user_builds/dimcat/checkouts/stable/unittest_metacorpus/sweelinck_keyboard/notes',
 'descriptor_filename': 'SwWV258_fantasia_cromatica.notes.resource.json',
 'resource': {'name': 'swwv258_fantasia_cromatica.notes',
              'type': 'table',
              'path': 'SwWV258_fantasia_cromatica.notes.tsv',
              'scheme': 'file',
              'format': 'tsv',
              'mediatype': 'text/tsv',
              'encoding': 'utf-8',
              'dialect': {'csv': {'delimiter': '\t'}},
              'schema': 'https://raw.githubusercontent.com/DCMLab/frictionless_schemas/main/notes/VvF3LJXVnKvxHg.schema.yaml',
              'creator': {'@context': 'https://schema.org/',
                          '@type': 'SoftwareApplication',
                          '@id': 'https://pypi.org/project/ms3/',
                          'name': 'ms3',
                          'description': 'A parser for MuseScore 3 files and data factory for annotated music corpora.',
                          'author': {'name': 'Johannes Hentschel', '@id': 'https://orcid.org/0000-0002-1986-9545'},
                          'softwareVersion': '2.4.0'},
              'git_revision': '2a771e5884eace9d254394e2e91538facc533897',
              'git_tag': 'v2.0-4-g2a771e5'},
 'auto_validate': False,
 'default_groupby': [],
 'ResourceStatus': 'STANDALONE_NOT_LOADED'}