dimcat.data.datasets package#
Submodules#
dimcat.data.datasets.base module#
The principal Data object is called Dataset and is the one that users will interact with the most. The Dataset provides convenience methods that are equivalent to applying the corresponding PipelineStep. Every PipelineStep applied to it returns a new Dataset that can be serialized and deserialized to restart the pipeline from that point. To this end, every Dataset stores a serialization of the applied PipelineSteps and of the original Dataset that served as the initial input. This initial input is specified as a DimcatCatalog, which is a collection of DimcatPackages, each of which is a collection of DimcatResources, as defined by the Frictionless Data specifications. The preferred structure of a DimcatPackage is a .zip file and a datapackage.json file, where the former contains one or several .tsv files (resources) described in the latter. Since the data that DiMCAT transforms and analyzes comes from very heterogeneous sources, each original corpus is pre-processed and stored as a frictionless data package together with the metadata relevant for reproducing the pre-processing. It follows that the Dataset is mainly a container for DimcatResources.
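The idea that every step returns a new, serializable Dataset which remembers both its initial input and the steps applied so far can be illustrated with a small stand-alone sketch. `ToyDataset` and the step dictionaries are hypothetical stand-ins invented for this illustration; they are not DiMCAT classes and do not reflect how DiMCAT serializes its objects.

```python
from __future__ import annotations

import json
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyDataset:
    """Hypothetical stand-in: the initial inputs plus a record of applied steps."""

    inputs: tuple[str, ...]
    steps: tuple[dict, ...] = ()

    def apply_step(self, step: dict) -> "ToyDataset":
        # A new object is returned; the original dataset is left untouched.
        return ToyDataset(self.inputs, self.steps + (step,))

    def to_json(self) -> str:
        # Serialize the initial input together with the pipeline history.
        return json.dumps({"inputs": list(self.inputs), "steps": list(self.steps)})

    @classmethod
    def from_json(cls, text: str) -> "ToyDataset":
        # Deserialize to restart the pipeline from this point.
        data = json.loads(text)
        return cls(tuple(data["inputs"]), tuple(data["steps"]))


original = ToyDataset(("corpus.datapackage.json",))
grouped = original.apply_step({"dtype": "ToyGrouper"})
restored = ToyDataset.from_json(grouped.to_json())
```

Here `original` still has no steps, while `restored` compares equal to `grouped` and could be processed further.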
- class dimcat.data.datasets.base.Dataset(basepath: Optional[str] = None, **kwargs)[source]#
Bases: Data
The central type of object that all PipelineSteps process and return a copy of.
- class PickleSchema(*, only: Optional[Union[Sequence[str], AbstractSet[str]]] = None, exclude: Union[Sequence[str], AbstractSet[str]] = (), many: Optional[bool] = None, load_only: Union[Sequence[str], AbstractSet[str]] = (), dump_only: Union[Sequence[str], AbstractSet[str]] = (), partial: Optional[Union[bool, Sequence[str], AbstractSet[str]]] = None, unknown: Optional[Literal['exclude', 'include', 'raise']] = None)[source]#
Bases: Schema
Dataset serialization schema.
- exclude: set[Any] | MutableSet[Any]#
- init_object(data, **kwargs) Dataset[source]#
Once the data has been loaded, create the corresponding object.
- unknown: types.UnknownOption#
- class Schema(*, only: Optional[Union[Sequence[str], AbstractSet[str]]] = None, exclude: Union[Sequence[str], AbstractSet[str]] = (), many: Optional[bool] = None, load_only: Union[Sequence[str], AbstractSet[str]] = (), dump_only: Union[Sequence[str], AbstractSet[str]] = (), partial: Optional[Union[bool, Sequence[str], AbstractSet[str]]] = None, unknown: Optional[Literal['exclude', 'include', 'raise']] = None)[source]#
Bases: PickleSchema, Schema
- exclude: set[Any] | MutableSet[Any]#
- unknown: types.UnknownOption#
- add_output(resource: DimcatResource, package_name: Optional[str] = None) None[source]#
Adds a resource to the outputs catalog.
- Parameters:
resource – Resource to be added.
package_name – Name of the package to add the resource to. If unspecified, the package is inferred from the resource type.
- apply_step(step: StepSpecs | List | Tuple) Ds[source]#
- apply_step(*step: StepSpecs) Ds
Applies one or several pipeline steps to this dataset. For backward compatibility, when only a single argument is passed, it may also be a list or tuple of step specs.
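The two call styles accepted by apply_step (several positional step specs, or a single list or tuple of them) can be reduced to one list before processing. The helper below is a minimal sketch of that normalization; `normalize_steps` is a hypothetical name and not part of DiMCAT.

```python
from typing import Any, List


def normalize_steps(*step: Any) -> List[Any]:
    """Flatten both accepted call styles into a single list of step specs."""
    # A single list or tuple argument is unpacked (the backward-compatible style).
    if len(step) == 1 and isinstance(step[0], (list, tuple)):
        return list(step[0])
    # Otherwise every positional argument is treated as one step spec.
    return list(step)
```

With this helper, `normalize_steps("A", "B")` and `normalize_steps(["A", "B"])` produce the same list, and a single non-sequence spec yields a one-element list.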
- check_feature_availability(feature: Union[Feature, Type[Feature], DimcatConfig, MutableMapping, FeatureName, str]) bool[source]#
Checks whether the given feature specs are available from this Dataset.
- Parameters:
feature – FeatureSpecs to be checked.
- extract_feature(feature: Union[Feature, Type[Feature], DimcatConfig, MutableMapping, FeatureName, str], ignore_exceptions: bool = False) F[source]#
Extracts a feature from this Dataset’s input catalog, sends it through its pipeline, adds the result to the OutputsCatalog, and adds the corresponding FeatureExtractor to the dataset’s pipeline.
- Parameters:
feature – FeatureSpecs to be extracted.
ignore_exceptions – By default (False), a feature that cannot be passed through the Pipeline without errors raises an exception and is not added to the outputs catalog. Set to True to ignore exceptions and return the extracted feature even though not all PipelineSteps may have been applied to it.
- property extractable_features: Set[FeatureName]#
The dtypes of all features that can be extracted from the facet resources included in the input packages.
- classmethod from_catalogs(inputs: DimcatCatalog | List[DimcatPackage], outputs: DimcatCatalog | List[DimcatPackage], pipeline: Optional[Pipeline] = None, basepath: Optional[str] = None, **kwargs) Dataset[source]#
Instantiate by copying existing catalogs.
- classmethod from_dataset(dataset: Dataset, **kwargs) Dataset[source]#
Instantiate from this Dataset by copying its fields, empty fields otherwise.
- classmethod from_package(package: Union[Package, Package, str]) Dataset[source]#
Instantiate from a PackageSpecs by loading it into the inputs catalog.
- get_feature(feature: Optional[Union[Feature, Type[Feature], DimcatConfig, MutableMapping, FeatureName, str]] = None) F[source]#
High-level method that first looks up a feature fitting the specs in the outputs catalog, and adds a FeatureExtractor to the dataset’s pipeline otherwise.
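The relationship between get_feature and extract_feature described above is a look-up-or-extract pattern: return what is already in the outputs, and only extract (and record the extraction in the pipeline) when nothing matches. The following sketch illustrates the pattern with plain dictionaries and lists; `ToyFeatureStore` and its attributes are hypothetical and do not mirror DiMCAT's catalogs or FeatureExtractor.

```python
from typing import Callable, Dict, List


class ToyFeatureStore:
    """Hypothetical stand-in for a dataset with an outputs cache and a pipeline log."""

    def __init__(self, extractor: Callable[[str], List[int]]):
        self._extractor = extractor
        self.outputs: Dict[str, List[int]] = {}
        self.pipeline: List[str] = []

    def extract_feature(self, name: str) -> List[int]:
        # Always extracts, stores the result, and records the extraction step.
        result = self._extractor(name)
        self.outputs[name] = result
        self.pipeline.append(f"extract:{name}")
        return result

    def get_feature(self, name: str) -> List[int]:
        # Look up first; fall back to extraction only if nothing is stored yet.
        if name in self.outputs:
            return self.outputs[name]
        return self.extract_feature(name)


calls: List[str] = []


def fake_extractor(name: str) -> List[int]:
    calls.append(name)
    return [len(name)]


store = ToyFeatureStore(fake_extractor)
first = store.get_feature("notes")
second = store.get_feature("notes")
```

The second call is served from the stored outputs, so the extractor runs only once and the pipeline log contains a single entry.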
- get_last_step(step_specs: Optional[StepSpecs] = None, allow_subclasses: bool = True) PipelineStep[source]#
Returns the last step that matches the given specs.
- Parameters:
step_specs – Specification that can be converted to a DimcatConfig describing a PipelineStep. If None, the last step is returned.
allow_subclasses – By default, matches the last applied PipelineStep of the type described by step_specs or one of its subclasses. Set to False to return the last step that matches exactly.
- Returns:
PipelineStep object that matches the given specs.
- Raises:
NoMatchingPipelineStepFoundError – If no matching step is found.
- get_steps(step_specs: Optional[StepSpecs] = None, allow_subclasses: bool = True) List[PipelineStep][source]#
Returns all steps that match the given specs.
- Parameters:
step_specs – Specification that can be converted to a DimcatConfig describing a PipelineStep. If None, all steps are returned (equivalent to steps).
allow_subclasses – By default, matching subclasses of the PipelineStep described by step_specs are also included. Set to False to only return steps that match exactly.
- Returns:
PipelineStep objects that match the given specs.
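The effect of allow_subclasses in get_steps and get_last_step corresponds to the difference between an isinstance check and an exact type comparison. The sketch below shows that distinction on toy classes; the class names and both helper functions are hypothetical, matching here is done on a type rather than on step specs, and LookupError stands in for NoMatchingPipelineStepFoundError.

```python
from typing import List, Optional, Type


class ToyStep:
    """Hypothetical base class standing in for PipelineStep."""


class ToyGrouper(ToyStep):
    pass


class ToyYearGrouper(ToyGrouper):
    pass


def matching_steps(
    steps: List[ToyStep],
    step_type: Optional[Type[ToyStep]] = None,
    allow_subclasses: bool = True,
) -> List[ToyStep]:
    if step_type is None:
        return list(steps)  # no specs: all steps are returned
    if allow_subclasses:
        return [s for s in steps if isinstance(s, step_type)]
    return [s for s in steps if type(s) is step_type]  # exact matches only


def last_matching_step(
    steps: List[ToyStep],
    step_type: Optional[Type[ToyStep]] = None,
    allow_subclasses: bool = True,
) -> ToyStep:
    matches = matching_steps(steps, step_type, allow_subclasses)
    if not matches:
        # Stand-in for NoMatchingPipelineStepFoundError.
        raise LookupError(f"no step matching {step_type!r}")
    return matches[-1]


applied = [ToyGrouper(), ToyYearGrouper(), ToyStep()]
```

With `applied` as above, asking for `ToyGrouper` returns two steps by default but only the first one when `allow_subclasses=False`.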
- property inputs: InputsCatalog#
The inputs catalog.
- iter_features(features: Optional[Union[Feature, Type[Feature], DimcatConfig, MutableMapping, FeatureName, str, Iterable[Union[Feature, Type[Feature], DimcatConfig, MutableMapping, FeatureName, str]]]] = None) Iterator[DimcatResource][source]#
- load(package: Union[Package, Package, str])[source]#
High-level method that tries to infer what it is that you want to load.
- load_feature(feature: Union[Feature, Type[Feature], DimcatConfig, MutableMapping, FeatureName, str]) F[source]#
ToDo: Harmonize with FeatureExtractor
- load_package(package: Union[Package, Package, str], package_name: Optional[str] = None, **options)[source]#
Loads a package into the inputs catalog.
- Parameters:
package – Typically a path to a datapackage.json descriptor.
package_name – If you want to assign a different name to the package than given in the descriptor. The package_name is relevant for addressing the package in the catalog.
**options –
Returns:
- property n_active_features: int#
The number of features extracted and stored in the outputs catalog.
- property n_features_available: int#
The number of features (potentially) available from this Dataset.
- property outputs: OutputsCatalog#
The outputs catalog.
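Taken together, the members documented above suggest a workflow along the following lines. This is a sketch only: it uses nothing beyond the signatures listed in this reference (Dataset(), load_package, extractable_features, get_feature, n_active_features), it has not been verified against a particular DiMCAT release, and the function name `extract_from_package` as well as both of its parameters are hypothetical. The caller supplies the descriptor path and the feature name, since neither is prescribed here.

```python
def extract_from_package(descriptor_path: str, feature_name: str):
    """Load a package descriptor and return the dataset with one extracted feature."""
    # Imported lazily so that this snippet can be read and imported
    # without DiMCAT being installed.
    from dimcat.data.datasets.base import Dataset

    dataset = Dataset()
    # Typically a path to a datapackage.json descriptor (see load_package).
    dataset.load_package(descriptor_path)
    # Which features could be extracted from the loaded input packages?
    print(dataset.extractable_features)
    # Looks up the feature in the outputs catalog or extracts it if missing.
    feature = dataset.get_feature(feature_name)
    # Number of features extracted and stored in the outputs catalog.
    print(dataset.n_active_features)
    return dataset, feature
```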
dimcat.data.datasets.processed module#
This module contains subclasses of Dataset. They reflect a particular processing status in terms of the previously applied Slicers, Groupers, and Analyzers. Each of them yields a copied Dataset object exposing additional methods, which are defined in the relevant mixin classes.
- class dimcat.data.datasets.processed.AnalyzedDataset(*args, **kwargs)[source]#
Bases: _AnalyzedMixin, Dataset
A Dataset subclass that has been analyzed.
- class dimcat.data.datasets.processed.GroupedAnalyzedDataset(*args, **kwargs)[source]#
Bases: _GroupedMixin, _AnalyzedMixin, Dataset
A Dataset subclass that has been grouped and analyzed.
- class dimcat.data.datasets.processed.GroupedDataset(basepath: Optional[str] = None, **kwargs)[source]#
Bases: _GroupedMixin, Dataset
A Dataset subclass that has been grouped.
- class dimcat.data.datasets.processed.SlicedAnalyzedDataset(*args, **kwargs)[source]#
Bases: _SlicedMixin, _AnalyzedMixin, Dataset
A Dataset subclass that has been sliced and analyzed.
- class dimcat.data.datasets.processed.SlicedDataset(basepath: Optional[str] = None, **kwargs)[source]#
Bases: _SlicedMixin, Dataset
A Dataset subclass that has been sliced.
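The way these subclasses combine mixins with Dataset can be illustrated with a stand-alone sketch of the pattern. All names below (`ToyDataset`, the `_Toy...Mixin` classes, their methods, and `promote`) are hypothetical; the actual methods contributed by DiMCAT's mixins are not listed in this reference.

```python
class ToyDataset:
    """Hypothetical stand-in for Dataset: a container of resources."""

    def __init__(self, resources=()):
        self.resources = list(resources)  # copied, not shared


class _ToyGroupedMixin:
    def grouping_summary(self) -> str:
        return f"{len(self.resources)} resources, grouped"


class _ToyAnalyzedMixin:
    def analysis_summary(self) -> str:
        return f"{len(self.resources)} resources, analyzed"


class ToyGroupedDataset(_ToyGroupedMixin, ToyDataset):
    pass


class ToyGroupedAnalyzedDataset(_ToyGroupedMixin, _ToyAnalyzedMixin, ToyDataset):
    pass


def promote(dataset: ToyDataset, cls: type) -> ToyDataset:
    """Return a copy of `dataset` as an instance of the more specific class."""
    return cls(dataset.resources)


plain = ToyDataset(["a", "b"])
grouped = promote(plain, ToyGroupedDataset)
both = promote(grouped, ToyGroupedAnalyzedDataset)
```

Each promoted object is still a `ToyDataset`, but exposes the additional methods of the mixins reflecting its processing status, while the plain dataset remains unchanged.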