Developers guide#
This guide is an introduction to DiMCAT's code architecture. Users who want to contribute to DiMCAT are invited to refer to the contribution guidelines, which contain coding conventions and instructions on how to set up the development environment.
Introduction#
The library is called DiMCAT and has three high-level objects:
DimcatObject (“object”): the base class for all objects that manages object creation and serialization and subclass registration. The DimcatObject class has a class attribute called _registry that is a dictionary of all subclasses of DimcatObject. Each DimcatObject has a nested class called Schema that inherits from DimcatSchema.
DimcatSchema (“schema”): the base class for all nested Schema classes, inheriting from marshmallow.Schema. The Schema defines the valid value ranges for all attributes of the DimcatObject and how to serialize and deserialize them.
DimcatConfig (“config”): a DimcatObject that can represent a subset of the attributes of another DimcatObject and instantiate it using the create() method. It derives from MutableMapping and is used for communicating about and checking the compatibility of DimcatObjects.
The three classes are defined in the src/dimcat/base.py module.
Serializing DiMCAT objects#
The nested Schema corresponding to each DimcatObject is instantiated as a singleton and can be retrieved via the class attribute schema.
Using this Schema, a DimcatObject can be serialized to and deserialized from:
1. a dictionary using the to_dict() and from_dict() methods.
2. a DimcatConfig object using the to_config() and from_config() methods.
3. a JSON string using the to_json() and from_json() methods.
4. a JSON file using the to_json_file() and from_json_file() methods.
Under the hood, methods 2-4 use method 1. In addition, DiMCAT has standalone functions to deserialize serialized DimcatObjects, for example deserialize_dict() and deserialize_json_str(), both of which appear in the examples of this guide.
This is possible because each serialized object includes a value for the field dtype specifying the object’s class name, from which the schema can be retrieved thanks to the class attribute schema. Other functions that are relevant in this context are get_class() and get_schema().
Example#
import dimcat as dc
cfg = dc.DimcatConfig("DimcatObject")
obj = cfg.create()
print("This object is a", type(obj))
json_str = obj.to_json()
obj_copy = dc.base.DimcatObject.from_json(json_str)
another_copy = dc.deserialize_json_str(json_str)
print("The two deserialized objects are equivalent:", obj_copy == another_copy)
print("The two deserialized objects are identical:", obj_copy is another_copy)
obj_copy # DimcatObject.__repr__() uses .to_dict() under the hood
Implementation#
The implementation is centered on two methods of the respective object’s nested DimcatSchema which derives from marshmallow.Schema: schema.dump() and schema.load(). The former takes an object and returns a dictionary, whereas the latter takes a dictionary and returns an object. Correspondingly, DimcatObject.to_dict() and DimcatObject.from_dict() retrieve the relevant schema singleton from DimcatObject.schema to call these two methods respectively.
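The delegation described above can be sketched without marshmallow. The following is a simplified stand-in and not DiMCAT's code: MiniSchema, MiniObject and the field_names tuple are hypothetical names chosen for illustration. Only the idea of a per-class schema singleton offering dump() and load(), plus a dtype entry in the serialized dictionary, is taken from this guide.

```python
from typing import Any, Dict, Tuple


class MiniSchema:
    """Hypothetical stand-in for a marshmallow-style schema (not DimcatSchema)."""

    def __init__(self, cls: type, field_names: Tuple[str, ...]):
        self.cls = cls
        self.field_names = field_names

    def dump(self, obj: Any) -> Dict[str, Any]:
        # object -> dictionary; the class name is stored under "dtype"
        data = {name: getattr(obj, name) for name in self.field_names}
        data["dtype"] = type(obj).__name__
        return data

    def load(self, data: Dict[str, Any]) -> Any:
        # dictionary -> object; "dtype" is metadata, not an init argument
        kwargs = {k: v for k, v in data.items() if k != "dtype"}
        return self.cls(**kwargs)


class MiniObject:
    """Hypothetical object whose to_dict()/from_dict() delegate to its schema."""

    field_names: Tuple[str, ...] = ("name",)
    _schema = None  # created once per class, then reused (singleton)

    def __init__(self, name: str = "unnamed"):
        self.name = name

    @classmethod
    def get_schema(cls) -> MiniSchema:
        # look at this class's own namespace so subclasses get their own schema
        if cls.__dict__.get("_schema") is None:
            cls._schema = MiniSchema(cls, cls.field_names)
        return cls._schema

    def to_dict(self) -> Dict[str, Any]:
        return self.get_schema().dump(self)

    @classmethod
    def from_dict(cls, data: Dict[str, Any]) -> "MiniObject":
        return cls.get_schema().load(data)


original = MiniObject(name="example")
as_dict = original.to_dict()
copy = MiniObject.from_dict(as_dict)
print(as_dict)
print(copy.name == original.name)
```

In this sketch the schema is created lazily on first use; how and when DiMCAT instantiates its schema singletons is not shown here.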
Creating a new type of DimcatObject#
All objects in DiMCAT (except DimcatSchema) inherit from DimcatObject. Inheritance also concerns the nested schema class. Effectively, this means that if you subclass an existing object type without adding new initialization arguments, your new class can simply inherit its parent’s Schema class and serialization will just work as described above. However, if you add a property, meaning that you will also need to add the corresponding initialization argument, you also need to include a nested Schema class which inherits from the parent’s schema. Each property that is to be serialized needs to be declared as a marshmallow field corresponding to its datatype.
import dimcat as dc
from marshmallow import fields


class NewType(dc.base.DimcatObject):
    # the nested Schema inherits from the parent's Schema and declares the new field
    class Schema(dc.base.DimcatObject.Schema):
        new_property = fields.Str()

    def __init__(self, new_property: str, **kwargs):
        super().__init__(**kwargs)
        self.new_property = new_property


new_obj = NewType("some string value")
as_dict = new_obj.to_dict()
new_obj_copy = dc.deserialize_dict(as_dict)
new_obj_copy
In cases where an attribute should point to a DimcatObject (e.g. all Result objects referencing the analyzed DimcatResource via the analyzed_resource property), we can use the type DimcatObjectField in the schema.
The class registry#
Every DimcatObject comes with the attribute _registry, which is a dictionary mapping the names of all DimcatObjects to their classes.
It is implemented using __init_subclass__.
We don’t need to interact directly with the registry thanks to the convenience function get_class(), which takes the name of an object as a string and returns the respective class.
In the code, this would typically look like this:
Constructor = dc.get_class("FeatureExtractor")
feature_extractor = Constructor()
Schemas are not part of the registry. For retrieving a class’s schema we can use Constructor.schema (building on the example) or the convenience function get_schema().
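As an illustration of the mechanism, here is a minimal, self-contained sketch of a class registry built on Python's __init_subclass__ hook. It is not DiMCAT's implementation: the classes Registered, Step and Extractor are hypothetical, and the get_class() defined here is a re-implementation written for this sketch only.

```python
from typing import ClassVar, Dict, Type


class Registered:
    """Hypothetical base class; it plays the role that DimcatObject plays."""

    _registry: ClassVar[Dict[str, Type["Registered"]]] = {}

    def __init_subclass__(cls, **kwargs):
        # Python calls this hook automatically whenever a subclass is defined
        super().__init_subclass__(**kwargs)
        cls._registry[cls.__name__] = cls


class Step(Registered):
    pass


class Extractor(Step):
    pass


def get_class(name: str) -> Type[Registered]:
    """Look up a registered class by its name (sketch, not DiMCAT's function)."""
    try:
        return Registered._registry[name]
    except KeyError:
        raise KeyError(f"{name!r} is not a registered class name") from None


Constructor = get_class("Extractor")
extractor = Constructor()
print(type(extractor).__name__)
```

Because the dictionary lives on the base class and is never rebound, all subclasses write into the same registry, no matter how deep they sit in the inheritance tree. In this sketch the base class itself is not registered, since the hook only fires for subclasses.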
Public and private methods#
The DiMCAT project differentiates between private methods whose names begin with _ and public methods whose names don’t.
Semantically, public methods are those that users interact with and which therefore often perform additional checks, e.g. of user input;
then, the public method calls the private method of the same name which performs the actual job.
In most cases, subclasses override only private methods.
Example#
For example, compare the public PipelineStep.process_dataset() with its private counterpart:
def _process_dataset(self, dataset: Dataset) -> Dataset:
    """Apply this PipelineStep to a :class:`Dataset` and return a copy containing the output(s)."""
    new_dataset = self._make_new_dataset(dataset)
    self.fit_to_dataset(new_dataset)
    # this is where subclasses create a new package and add it to the dataset
    return new_dataset

def process_dataset(self, dataset: Dataset) -> Dataset:
    """Apply this PipelineStep to a :class:`Dataset` and return a copy containing the output(s)."""
    self.check_dataset(dataset)
    return self._process_dataset(dataset)
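To show what this convention means for subclasses, the following sketch uses hypothetical classes (BaseStep and Doubler, operating on plain lists instead of Datasets). It illustrates the pattern only and is not taken from DiMCAT: the subclass overrides the private method and inherits the input check of the public one.

```python
from typing import List


class BaseStep:
    """Hypothetical step; not DiMCAT's PipelineStep."""

    def check_input(self, values: List[int]) -> None:
        # validation of user input lives in the public layer
        if not values:
            raise ValueError("Expected a non-empty list of values.")

    def _process(self, values: List[int]) -> List[int]:
        # default job: return an unchanged copy
        return list(values)

    def process(self, values: List[int]) -> List[int]:
        # public entry point: validate first, then delegate to the private method
        self.check_input(values)
        return self._process(values)


class Doubler(BaseStep):
    def _process(self, values: List[int]) -> List[int]:
        # only the private method is overridden; the check is inherited
        return [2 * v for v in values]


print(Doubler().process([1, 2, 3]))
```

Calling Doubler().process([]) raises the ValueError defined in BaseStep, although Doubler never mentions the check.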
Two types of DimcatObjects#
All classes that are neither a schema nor a config inherit from one of the two following subclasses of DimcatObject:
Data: a DimcatObject that represents a dataset, a subset of a dataset, or an individual resource such as a dataframe.
PipelineStep: a DimcatObject that accepts a Data object as input and returns a Data object as output.
They are organized in two packages, dimcat.data and dimcat.steps. Objects defined in dimcat.steps operate on objects defined in dimcat.data and can import from it, but not the other way around.
In a few exceptional cases where Data objects need to actively use PipelineStep (which is the case, for example, for Dataset.extract_feature()), we circumvent circular imports by summoning them via get_class() (and not using type hints for the summoned object).
Data objects#
Data is organized into a hierarchy of four objects (from top to bottom):
Dataset, consisting of two catalogs, called inputs and outputs;
DimcatCatalog (“catalog”), a collection of packages;
DimcatPackage (“package”), a collection of resources;
DimcatResource (“resource”), a wrapper around a dataframe.
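The containment relations between the four levels can be sketched with plain dataclasses. All class names below (SketchResource, SketchPackage, SketchCatalog, SketchDataset) are hypothetical, and a list of dictionaries stands in for the dataframe. The sketch mirrors the nesting only, not any behaviour of the real classes.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class SketchResource:
    """Hypothetical stand-in for a resource: a named table (rows as dicts)."""
    name: str
    rows: List[Dict[str, object]] = field(default_factory=list)


@dataclass
class SketchPackage:
    """A collection of resources."""
    name: str
    resources: List[SketchResource] = field(default_factory=list)


@dataclass
class SketchCatalog:
    """A collection of packages."""
    packages: List[SketchPackage] = field(default_factory=list)

    def resource_names(self) -> List[str]:
        # walk the two lower levels: packages, then their resources
        return [f"{p.name}.{r.name}" for p in self.packages for r in p.resources]


@dataclass
class SketchDataset:
    """Two catalogs, one for loaded data and one for everything derived from it."""
    inputs: SketchCatalog = field(default_factory=SketchCatalog)
    outputs: SketchCatalog = field(default_factory=SketchCatalog)


notes = SketchResource("notes", rows=[{"pitch": 60}, {"pitch": 64}])
corpus = SketchPackage("my_corpus", resources=[notes])
dataset = SketchDataset(inputs=SketchCatalog(packages=[corpus]))
print(dataset.inputs.resource_names())
print(dataset.outputs.resource_names())
```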
There are three main types of resources, namely
Facet, a representation of score-related elements that represents the loaded data with minimal standardization (e.g. MuseScoreNotes);
Feature, some aspect extracted from a facet, standardized by DiMCAT (e.g., Notes);
Result, a result of applying a PipelineStep to a feature.
DiMCAT cannot process facets directly; the relevant features need to be extracted first. Many PipelineSteps extract the required feature automatically.
Dataset#
The principal Data object is the Dataset and is the one that users usually interact with the most. Its three principal properties are:
inputs, an InputsCatalog
outputs, an OutputsCatalog
pipeline, a Pipeline consisting of all previously applied PipelineSteps.
After applying a PipelineStep to a Dataset,
its outputs MUST correspond to the result of applying the pipeline to inputs.
A serialized Dataset is therefore suited for communicating results in a reproducible manner.
Any PipelineStep applied to a dataset is performed on all eligible resources that the packages in inputs contain and results in a new dataset containing the relevant output packages/resources under outputs.
Datasets are passive ‘by nature’, meaning that, in general, they are manipulated by PipelineSteps or by the user.
PipelineSteps process a Dataset by requesting one or several features using Dataset.get_feature(),
processing each Feature, and adding the processed Feature(s) to the Dataset’s OutputsCatalog.
However, in one case, the Dataset does play an active role, namely in the extraction of features from the InputsCatalog.
When prompted with .get_feature(F), where F is some specification of a Feature, the Dataset will
look up the feature in its OutputsCatalog and return it if present,
call Dataset.extract_feature() otherwise and return its output.
Since the actual extraction happens on the level of a single resource (a Facet which names the feature among its extractable_features), the latter case invokes the following call chain:
Dataset.extract_feature() calls InputsCatalog.extract_feature(), which calls Package.extract_feature(), which in turn calls the extraction on the individual resource.
The Dataset applies all previously applied PipelineSteps to the thus extracted Feature,
adds it to its OutputsCatalog and appends the FeatureExtractor to its Pipeline.
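The lookup-or-extract behaviour and the replay of previously applied steps can be condensed into a small sketch. Everything in it is a simplifying assumption: FeatureDataset is a hypothetical class, features are plain lists of integers, catalogs are dictionaries, and steps are functions. It shows the control flow described above and nothing of DiMCAT's actual data structures.

```python
from typing import Callable, Dict, List

FeatureValues = List[int]  # hypothetical: a feature is just a list of numbers
StepFn = Callable[[FeatureValues], FeatureValues]


class FeatureDataset:
    """Hypothetical dataset illustrating the lookup-or-extract logic only."""

    def __init__(self, inputs: Dict[str, FeatureValues]):
        self.inputs = inputs                          # stands in for the InputsCatalog
        self.outputs: Dict[str, FeatureValues] = {}   # stands in for the OutputsCatalog
        self.pipeline: List[str] = []                 # names of the applied steps
        self._steps: List[StepFn] = []                # the applied steps themselves

    def apply_step(self, name: str, step: StepFn) -> None:
        # process every feature already present in outputs and record the step
        self.outputs = {k: step(v) for k, v in self.outputs.items()}
        self.pipeline.append(name)
        self._steps.append(step)

    def extract_feature(self, name: str) -> FeatureValues:
        feature = list(self.inputs[name])             # "extraction" from the inputs
        for step in self._steps:                      # replay previously applied steps
            feature = step(feature)
        self.outputs[name] = feature
        self.pipeline.append(f"FeatureExtractor({name})")
        return feature

    def get_feature(self, name: str) -> FeatureValues:
        if name in self.outputs:                      # already available: return it
            return self.outputs[name]
        return self.extract_feature(name)             # otherwise extract it


dataset = FeatureDataset({"notes": [1, 2, 3], "durations": [4, 5]})
dataset.get_feature("notes")
dataset.apply_step("Doubler", lambda values: [2 * v for v in values])
print(dataset.get_feature("notes"))      # taken from outputs
print(dataset.get_feature("durations"))  # extracted now, Doubler is replayed
print(dataset.pipeline)
```

Replaying the earlier steps on a late-extracted feature keeps every entry in outputs in the same processing state, which is what makes the recorded pipeline meaningful for the whole dataset.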
DimcatCatalog#
As per the Frictionless Data specifications, a catalog is a collection of packages. In DiMCAT, catalogs appear only in a single place: every Dataset consists of two DimcatCatalogs, namely an InputsCatalog and an OutputsCatalog.
inputs includes all loaded datapackages, and outputs all (processed or unprocessed) features extracted from them, as well as all other
processing results.
DimcatPackage#
The preferred structure of a DimcatPackage is a .zip and a .datapackage.json file, where the former contains one or several .tsv files (resources) described in the latter.
The .datapackage.json file follows the Frictionless Data specifications for packages and allows DiMCAT to know what’s in a package without having to actually load the data from the .tsv files.
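To make the role of the descriptor concrete, the sketch below parses a small, made-up descriptor with Python's json module and lists the columns of each resource without opening any .tsv file. The package name, resource names, paths and columns are invented for illustration. Only the general layout (a resources list whose entries carry a name, a path and a schema with fields) follows the Frictionless Data Package conventions.

```python
import json

# Hypothetical descriptor content; real descriptors contain more metadata.
descriptor_json = """
{
  "name": "my_corpus",
  "resources": [
    {
      "name": "my_corpus.notes",
      "path": "my_corpus.notes.tsv",
      "schema": {"fields": [{"name": "piece", "type": "string"},
                            {"name": "midi", "type": "integer"}]}
    },
    {
      "name": "my_corpus.metadata",
      "path": "my_corpus.metadata.tsv",
      "schema": {"fields": [{"name": "piece", "type": "string"},
                            {"name": "composer", "type": "string"}]}
    }
  ]
}
"""

descriptor = json.loads(descriptor_json)

# Inspect the package without touching any .tsv file.
columns_by_resource = {
    resource["name"]: [f["name"] for f in resource["schema"]["fields"]]
    for resource in descriptor["resources"]
}
print(columns_by_resource)
```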
Since the data that DiMCAT transforms and analyzes comes from very heterogeneous sources, each original corpus is pre-processed and stored as a frictionless.Package. This task is achieved by loaders. When your aim is to enable DiMCAT to load a new type of dataset, you will have to implement a new Loader. Please approach the developers by creating an issue.
DimcatResource#
It follows from all this that the Dataset is mainly a structured container for DimcatResources, of which there are three main types:
Facets, i.e. the resources described in the original datapackage.json. They aim to stay as faithful as possible to the original data, applying only mild standardization and normalization. All Facet resources come with several columns that represent timestamps both in absolute and in musical time, allowing for the alignment of different corpora.
Features, i.e. resources derived from Facets explicitly (via Dataset.extract_feature()) or implicitly (by applying PipelineSteps which extract features automatically). They are standardized objects that are required by PipelineSteps to compute statistics and visualizations. To allow for straightforward serialization of the Dataset, all Feature resources are represented as a DimcatCatalog called outputs.
Results, i.e., resources that represent the results of applying an Analyzer to a Feature. They are equally stored in the outputs catalog and come with the appropriate methods for plotting the results.
Although Frictionless resources can be stored as individual .tsv files with their own .resource.json descriptors,
DiMCAT typically stores multiple resources as a datapackage, i.e., a .zip file containing one .tsv file per resource, accompanied by a single .datapackage.json file detailing the metadata for all of them.
In both cases, the descriptor specifies the column schema for each resource, allowing DiMCAT to “lazy-load” the data, meaning that the dataframes are loaded into memory only at the moment when they are actually needed (using DimcatResource.load()).
A DimcatResource can be instantiated in two different ways, either from a descriptor or from a dataframe.
At any given moment, the status attribute returns an Enum value reflecting the availability and state of the dataframe. This is relevant for keeping datapackages stored on disk up-to-date.
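Lazy loading combined with a status value can be sketched as follows. The names LazyResource and SketchStatus and the three states are assumptions made for this example. DiMCAT's own status values may differ, and a string of tab-separated text stands in for the .tsv file that a descriptor would point to.

```python
import csv
import io
from enum import Enum
from typing import Dict, List, Optional

Rows = List[Dict[str, str]]


class SketchStatus(Enum):
    """Hypothetical states; DiMCAT's own status values may differ."""
    EMPTY = "empty"
    NOT_LOADED = "not loaded"   # described, but data not in memory
    LOADED = "loaded"           # data in memory


class LazyResource:
    """Hypothetical resource that parses its data only when load() is called."""

    def __init__(self, tsv_source: Optional[str] = None, rows: Optional[Rows] = None):
        self._tsv_source = tsv_source  # stands in for a path given by a descriptor
        self._rows = rows              # stands in for a dataframe passed in directly

    @property
    def status(self) -> SketchStatus:
        if self._rows is not None:
            return SketchStatus.LOADED
        if self._tsv_source is not None:
            return SketchStatus.NOT_LOADED
        return SketchStatus.EMPTY

    def load(self) -> Rows:
        # parse the tab-separated data on first access, reuse it afterwards
        if self._rows is None:
            if self._tsv_source is None:
                raise ValueError("Nothing to load.")
            reader = csv.DictReader(io.StringIO(self._tsv_source), delimiter="\t")
            self._rows = list(reader)
        return self._rows


resource = LazyResource(tsv_source="piece\tmidi\nsonata\t60\nsonata\t64\n")
print(resource.status)
print(len(resource.load()))
print(resource.status)
```

The two constructor arguments mirror the two ways of instantiating a resource mentioned above: from a description of where the data lives, or from data that is already in memory.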
When a Dataset is serialized, all dataframes from the outputs catalog that haven’t been stored to disk yet are written into one or several .zip files so that they can be referenced by the updated descriptor(s).
If you want to create a new type of DimcatResource, please inherit from the relevant subclass and refer to the docstrings of DimcatResource in the module dc for understanding how to use the class variables.