dimcat.data.datasets package#

Submodules#

dimcat.data.datasets.base module#

The principal Data object is called Dataset, and it is the one that users interact with the most. The Dataset provides convenience methods that are equivalent to applying the corresponding PipelineStep. Every PipelineStep applied to it returns a new Dataset that can be serialized and deserialized to restart the pipeline from that point. To that end, every Dataset stores a serialization of the applied PipelineSteps and of the original Dataset that served as initial input.

This initial input is specified as a DimcatCatalog, which is a collection of DimcatPackages, each of which is a collection of DimcatResources, as defined by the Frictionless Data specifications. The preferred structure of a DimcatPackage is a .zip file and a datapackage.json file, where the former contains one or several .tsv files (resources) described in the latter. Because the data that DiMCAT transforms and analyzes comes from very heterogeneous sources, each original corpus is pre-processed and stored as a frictionless data package together with the metadata relevant for reproducing the pre-processing. It follows that the Dataset is mainly a container for DimcatResources.
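To make the package layout described above more concrete, the following sketch writes a .zip archive containing one .tsv resource, plus a JSON descriptor listing it, using only the standard library. It does not use DiMCAT or the frictionless library. The helper name write_toy_package, the file naming scheme, and the use of the innerpath key to point into the archive are assumptions made for illustration; the files that DiMCAT actually writes may differ.

```python
import csv
import io
import json
import tempfile
import zipfile
from pathlib import Path


def write_toy_package(directory, package_name, tables):
    """Write <package_name>.zip and <package_name>.datapackage.json (hypothetical helper).

    `tables` maps a resource name to a list of rows, the first row being the header.
    """
    directory = Path(directory)
    zip_path = directory / f"{package_name}.zip"
    descriptor_path = directory / f"{package_name}.datapackage.json"
    resources = []
    with zipfile.ZipFile(zip_path, "w") as archive:
        for resource_name, rows in tables.items():
            # Serialize the rows as tab-separated values in memory.
            buffer = io.StringIO()
            csv.writer(buffer, delimiter="\t", lineterminator="\n").writerows(rows)
            inner_name = f"{resource_name}.tsv"
            archive.writestr(inner_name, buffer.getvalue())
            resources.append(
                {
                    "name": resource_name,
                    "path": zip_path.name,
                    # Assumption: the archive member is addressed via "innerpath".
                    "innerpath": inner_name,
                    "format": "tsv",
                }
            )
    # The descriptor describes every resource stored in the archive.
    descriptor = {"name": package_name, "resources": resources}
    descriptor_path.write_text(json.dumps(descriptor, indent=2), encoding="utf-8")
    return zip_path, descriptor_path


with tempfile.TemporaryDirectory() as tmp:
    toy_tables = {"notes": [["piece", "midi"], ["p1", "60"]]}
    _, toy_descriptor = write_toy_package(tmp, "toy_corpus", toy_tables)
    print(toy_descriptor.read_text(encoding="utf-8"))
```

According to the load_package() documentation, the path to such a descriptor is what one would typically pass to the Dataset when loading a package.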

class dimcat.data.datasets.base.Dataset(basepath: Optional[str] = None, **kwargs)[source]#

Bases: Data

The central type of object: every PipelineStep processes a Dataset and returns a copy of it.

class PickleSchema(*, only: Optional[Union[Sequence[str], AbstractSet[str]]] = None, exclude: Union[Sequence[str], AbstractSet[str]] = (), many: Optional[bool] = None, load_only: Union[Sequence[str], AbstractSet[str]] = (), dump_only: Union[Sequence[str], AbstractSet[str]] = (), partial: Optional[Union[bool, Sequence[str], AbstractSet[str]]] = None, unknown: Optional[Literal['exclude', 'include', 'raise']] = None)[source]#

Bases: Schema

Dataset serialization schema.

dump_fields: dict[str, Field]#
exclude: set[Any] | MutableSet[Any]#
fields: dict[str, Field]#

Dictionary mapping field_names -> Field objects

init_object(data, **kwargs) Dataset[source]#

Once the data has been loaded, create the corresponding object.

load_fields: dict[str, Field]#
opts: Any = <marshmallow.schema.SchemaOpts object>#
unknown: types.UnknownOption#
class Schema(*, only: Optional[Union[Sequence[str], AbstractSet[str]]] = None, exclude: Union[Sequence[str], AbstractSet[str]] = (), many: Optional[bool] = None, load_only: Union[Sequence[str], AbstractSet[str]] = (), dump_only: Union[Sequence[str], AbstractSet[str]] = (), partial: Optional[Union[bool, Sequence[str], AbstractSet[str]]] = None, unknown: Optional[Literal['exclude', 'include', 'raise']] = None)[source]#

Bases: PickleSchema, Schema

dump_fields: dict[str, Field]#
exclude: set[Any] | MutableSet[Any]#
fields: dict[str, Field]#

Dictionary mapping field_names -> Field objects

load_fields: dict[str, Field]#
opts: Any = <marshmallow.schema.SchemaOpts object>#
unknown: types.UnknownOption#
add_output(resource: DimcatResource, package_name: Optional[str] = None) None[source]#

Adds a resource to the outputs catalog.

Parameters:
  • resource – Resource to be added.

  • package_name – Name of the package to add the resource to. If unspecified, the package is inferred from the resource type.

apply_step(step: StepSpecs | List | Tuple) Ds[source]#
apply_step(*step: StepSpecs) Ds

Applies one or several pipeline steps to this dataset. For backward compatibility, a single argument may also be a list or tuple of step specs.
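The two call styles of apply_step() can be pictured with a small stand-in class. This is a toy model, not DiMCAT's implementation: ToyDataset, normalize_steps, and the use of plain strings as step specs are all hypothetical. It only illustrates how a single list or tuple argument can be treated like several positional arguments, and how each call returns a new object with an extended step history while the original stays unchanged.

```python
from typing import Any, List


def normalize_steps(*step: Any) -> List[Any]:
    """Accept either several step specs or a single list/tuple of them."""
    if len(step) == 1 and isinstance(step[0], (list, tuple)):
        return list(step[0])
    return list(step)


class ToyDataset:
    """Hypothetical stand-in that only tracks the names of applied steps."""

    def __init__(self, pipeline=()):
        self._pipeline = tuple(pipeline)

    @property
    def pipeline(self) -> List[Any]:
        # Return a copy so that callers cannot alter the recorded history.
        return list(self._pipeline)

    def apply_step(self, *step: Any) -> "ToyDataset":
        # A new object is returned; `self` is left untouched.
        return ToyDataset(self._pipeline + tuple(normalize_steps(*step)))


original = ToyDataset()
via_varargs = original.apply_step("slicer", "grouper")
via_list = original.apply_step(["slicer", "grouper"])
print(via_varargs.pipeline == via_list.pipeline, original.pipeline)
```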

check_feature_availability(feature: Union[Feature, Type[Feature], DimcatConfig, MutableMapping, FeatureName, str]) bool[source]#

Checks whether the given feature specs are available from this Dataset.

Parameters:

feature – FeatureSpecs to be checked.

copy() Dataset[source]#

Returns a copy of this Dataset.

extract_feature(feature: Union[Feature, Type[Feature], DimcatConfig, MutableMapping, FeatureName, str], ignore_exceptions: bool = False) F[source]#

Extracts a feature from this Dataset’s input catalog, sends it through its pipeline, adds the result to the OutputsCatalog, and adds the corresponding FeatureExtractor to the dataset’s pipeline.

Parameters:
  • feature – FeatureSpecs to be extracted.

  • ignore_exceptions – By default (False), a feature that fails to pass through the Pipeline raises an exception and is not added to the outputs catalog. Set to True to ignore exceptions and return the extracted feature anyway, even though not all PipelineSteps may have been applied to it.

property extractable_features: Set[FeatureName]#

The dtypes of all features that can be extracted from the facet resources included in the input packages.

classmethod from_catalogs(inputs: DimcatCatalog | List[DimcatPackage], outputs: DimcatCatalog | List[DimcatPackage], pipeline: Optional[Pipeline] = None, basepath: Optional[str] = None, **kwargs) Dataset[source]#

Instantiate by copying existing catalogs.

classmethod from_dataset(dataset: Dataset, **kwargs) Dataset[source]#

Instantiate from the given Dataset by copying its fields, using empty fields otherwise.

classmethod from_loader(loader: Loader) Dataset[source]#
classmethod from_package(package: Union[Package, Package, str]) Dataset[source]#

Instantiate from a PackageSpecs by loading it into the inputs catalog.

get_feature(feature: Optional[Union[Feature, Type[Feature], DimcatConfig, MutableMapping, FeatureName, str]] = None) F[source]#

High-level method that first looks up a feature fitting the specs in the outputs catalog, and adds a FeatureExtractor to the dataset’s pipeline otherwise.
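The look-up-then-extract behaviour described for get_feature() resembles a simple cache. The sketch below is a toy model under stated assumptions: ToyFeatureStore, its dictionaries, and the sorted() call standing in for real feature extraction are hypothetical and do not reflect how DiMCAT stores or extracts features.

```python
class ToyFeatureStore:
    """Hypothetical stand-in for a dataset with inputs, outputs, and a step history."""

    def __init__(self, inputs):
        self.inputs = dict(inputs)  # raw material, keyed by feature name
        self.outputs = {}           # features extracted so far
        self.pipeline = []          # descriptions of the recorded extraction steps

    def extract_feature(self, name):
        if name not in self.inputs:
            raise KeyError(f"Feature {name!r} is not available from the inputs.")
        feature = sorted(self.inputs[name])  # stand-in for the real extraction
        self.outputs[name] = feature
        self.pipeline.append(f"FeatureExtractor({name})")
        return feature

    def get_feature(self, name):
        # First look in the outputs; extract only if nothing is found there.
        if name in self.outputs:
            return self.outputs[name]
        return self.extract_feature(name)


store = ToyFeatureStore({"notes": [64, 60, 67]})
first = store.get_feature("notes")
second = store.get_feature("notes")  # served from the outputs, no new step recorded
print(first, len(store.pipeline))
```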

get_last_step(step_specs: Optional[StepSpecs] = None, allow_subclasses: bool = True) PipelineStep[source]#

Returns the last step that matches the given specs.

Parameters:
  • step_specs – Specification that can be converted to a DimcatConfig describing a PipelineStep. If None, the last step is returned.

  • allow_subclasses – By default, matches the last applied PipelineStep of the type described by step_specs or one of its subclasses. Set to False to return the last step that matches exactly.

Returns:

PipelineStep object that matches the given specs.

Raises:

NoMatchingPipelineStepFoundError – If no matching step is found.

get_metadata(raw: bool = False) Metadata[source]#
get_steps(step_specs: Optional[StepSpecs] = None, allow_subclasses: bool = True) List[PipelineStep][source]#

Returns all steps that match the given specs.

Parameters:
  • step_specs – Specification that can be converted to a DimcatConfig describing a PipelineStep. If None, all steps are returned (equivalent to steps).

  • allow_subclasses – By default, matching subclasses of the PipelineStep described by step_specs are also included. Set to False to only return steps that match exactly.

Returns:

PipelineStep objects that match the given specs.
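The effect of allow_subclasses in get_steps() and get_last_step() can be illustrated with plain Python classes. This is a sketch under assumptions: the toy classes, the free functions, and ToyNoMatchError are hypothetical, and classes are used directly as specs, whereas the real methods accept specs that are converted to a DimcatConfig.

```python
class ToyStep:
    pass


class ToyGrouper(ToyStep):
    pass


class ToyYearGrouper(ToyGrouper):
    pass


class ToyAnalyzer(ToyStep):
    pass


class ToyNoMatchError(LookupError):
    """Hypothetical stand-in for NoMatchingPipelineStepFoundError."""


def get_steps(steps, step_type=None, allow_subclasses=True):
    if step_type is None:
        return list(steps)  # no specs given: return all steps
    if allow_subclasses:
        return [s for s in steps if isinstance(s, step_type)]
    # Exact match only: subclasses of step_type are excluded.
    return [s for s in steps if type(s) is step_type]


def get_last_step(steps, step_type=None, allow_subclasses=True):
    matches = get_steps(steps, step_type, allow_subclasses)
    if not matches:
        raise ToyNoMatchError(f"No step matching {step_type!r}.")
    return matches[-1]


applied = [ToyGrouper(), ToyAnalyzer(), ToyYearGrouper()]
print(type(get_last_step(applied, ToyGrouper)).__name__)
print(type(get_last_step(applied, ToyGrouper, allow_subclasses=False)).__name__)
```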

property inputs: InputsCatalog#

The inputs catalog.

iter_features(features: Optional[Union[Feature, Type[Feature], DimcatConfig, MutableMapping, FeatureName, str, Iterable[Union[Feature, Type[Feature], DimcatConfig, MutableMapping, FeatureName, str]]]] = None) Iterator[DimcatResource][source]#
load(package: Union[Package, Package, str])[source]#

High-level method that tries to infer what it is that you want to load.

load_feature(feature: Union[Feature, Type[Feature], DimcatConfig, MutableMapping, FeatureName, str]) F[source]#

ToDo: Harmonize with FeatureExtractor

load_package(package: Union[Package, Package, str], package_name: Optional[str] = None, **options)[source]#

Loads a package into the inputs catalog.

Parameters:
  • package – Typically a path to a datapackage.json descriptor.

  • package_name – If you want to assign a different name to the package than given in the descriptor. The package_name is relevant for addressing the package in the catalog.

  • **options

Returns:

property n_active_features: int#

The number of features extracted and stored in the outputs catalog.

property n_features_available: int#

The number of features (potentially) available from this Dataset.

property outputs: OutputsCatalog#

The outputs catalog.

property pipeline: Pipeline#

A copy of the pipeline representing the steps that have been applied to this Dataset so far. To add a PipelineStep to the pipeline of this Dataset, use apply_step().

reset_pipeline() None[source]#

Resets the pipeline by replacing it with an empty one.

summary_dict() dict[source]#

Returns a summary of the dataset.

dimcat.data.datasets.processed module#

This module contains subclasses of Dataset. They reflect a particular processing status in terms of the previously applied Slicers, Groupers, and Analyzers. Each of them yields a copied Dataset object exposing additional methods, which are defined in the relevant mixin classes.

class dimcat.data.datasets.processed.AnalyzedDataset(*args, **kwargs)[source]#

Bases: _AnalyzedMixin, Dataset

A Dataset subclass that has been analyzed.

classmethod from_dataset(dataset: Dataset, **kwargs)[source]#

Create a new AnalyzedDataset from a Dataset object.

class dimcat.data.datasets.processed.GroupedAnalyzedDataset(*args, **kwargs)[source]#

Bases: _GroupedMixin, _AnalyzedMixin, Dataset

A Dataset subclass that has been grouped and analyzed.

class dimcat.data.datasets.processed.GroupedDataset(basepath: Optional[str] = None, **kwargs)[source]#

Bases: _GroupedMixin, Dataset

A Dataset subclass that has been grouped.

classmethod from_dataset(dataset: Dataset, **kwargs)[source]#

Create a new GroupedDataset from a Dataset object.

class dimcat.data.datasets.processed.SlicedAnalyzedDataset(*args, **kwargs)[source]#

Bases: _SlicedMixin, _AnalyzedMixin, Dataset

A Dataset subclass that has been sliced and analyzed.

class dimcat.data.datasets.processed.SlicedDataset(basepath: Optional[str] = None, **kwargs)[source]#

Bases: _SlicedMixin, Dataset

A Dataset subclass that has been sliced.

classmethod from_dataset(dataset: Dataset, **kwargs)[source]#

Create a new SlicedDataset from a Dataset object.

class dimcat.data.datasets.processed.SlicedGroupedAnalyzedDataset(*args, **kwargs)[source]#

Bases: _SlicedMixin, _GroupedMixin, _AnalyzedMixin, Dataset

A Dataset subclass that has been sliced, grouped, and analyzed.

class dimcat.data.datasets.processed.SlicedGroupedDataset(basepath: Optional[str] = None, **kwargs)[source]#

Bases: _SlicedMixin, _GroupedMixin, Dataset

A Dataset subclass that has been sliced and grouped.
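The way the processed Dataset classes listed in this module combine mixins can be sketched with ordinary multiple inheritance. Every name below is hypothetical (the toy classes and the slice_info and group_info methods are not part of DiMCAT); the sketch only shows how each mixin in the list of bases contributes additional methods, and how a from_dataset classmethod can build the specialized object as a copy.

```python
class ToyDataset:
    def __init__(self, pipeline=()):
        self.pipeline = list(pipeline)

    @classmethod
    def from_dataset(cls, dataset):
        # Copy the state so that the new object is independent of the old one.
        return cls(pipeline=dataset.pipeline)


class _ToySlicedMixin:
    def slice_info(self):
        return "methods for sliced data"


class _ToyGroupedMixin:
    def group_info(self):
        return "methods for grouped data"


class ToySlicedDataset(_ToySlicedMixin, ToyDataset):
    pass


class ToySlicedGroupedDataset(_ToySlicedMixin, _ToyGroupedMixin, ToyDataset):
    pass


plain = ToyDataset(pipeline=["slicer", "grouper"])
upgraded = ToySlicedGroupedDataset.from_dataset(plain)
print([cls.__name__ for cls in type(upgraded).__mro__])
print(upgraded.slice_info(), "|", upgraded.group_info())
```

Listing the mixins before the base class means that, in Python's method resolution order, a mixin method takes precedence over a method of the same name on the base class.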

Module contents#