dimcat.steps package#

Subpackages#

Submodules#

dimcat.steps.base module#

class dimcat.steps.base.FeatureProcessingStep(features: Optional[Union[Feature, Type[Feature], DimcatConfig, MutableMapping, FeatureName, str, Iterable[Union[Feature, Type[Feature], DimcatConfig, MutableMapping, FeatureName, str]]]] = None, **kwargs)[source]#

Bases: PipelineStep

This class unites all PipelineSteps that work on one, several, or all features that can be or have been extracted from a Dataset. They can be instantiated with the features argument; how that argument is interpreted is defined by class variables.

class Schema(*, only: Optional[Union[Sequence[str], AbstractSet[str]]] = None, exclude: Union[Sequence[str], AbstractSet[str]] = (), many: Optional[bool] = None, load_only: Union[Sequence[str], AbstractSet[str]] = (), dump_only: Union[Sequence[str], AbstractSet[str]] = (), partial: Optional[Union[bool, Sequence[str], AbstractSet[str]]] = None, unknown: Optional[Literal['exclude', 'include', 'raise']] = None)[source]#

Bases: Schema

deal_with_single_item(data, **kwargs)[source]#

dump_fields: dict[str, Field]#

exclude: set[Any] | MutableSet[Any]#

fields: dict[str, Field]#: Dictionary mapping field_names -> Field objects

load_fields: dict[str, Field]#

opts: Any = <marshmallow.schema.SchemaOpts object>#

unknown: types.UnknownOption#

check_dataset(dataset: Dataset) → None[source]#

Check if the dataset is eligible for processing.

Raises:

TypeError – if the given dataset is not a Dataset
EmptyDatasetError – if applicable_to_empty_datasets is False and the given dataset is empty
NoFeaturesActiveError – if requires_at_least_one_feature is True and no features are active
FeatureUnavailableError – if any of the required features is not available in the dataset.
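
The checks listed above can be pictured as a sequence of guard clauses. The following is a minimal sketch under stated assumptions, not dimcat's implementation: ToyDataset, ToyStep, and the locally defined exception classes are hypothetical stand-ins that only borrow the names from the Raises list, and the order of the checks is assumed.

```python
class EmptyDatasetError(Exception):
    """Stand-in for the exception of the same name in the Raises list."""


class NoFeaturesActiveError(Exception):
    """Stand-in for the exception of the same name in the Raises list."""


class FeatureUnavailableError(Exception):
    """Stand-in for the exception of the same name in the Raises list."""


class ToyDataset:
    """Hypothetical, heavily simplified dataset (not dimcat's Dataset)."""

    def __init__(self, pieces=(), active_features=(), available_features=()):
        self.pieces = list(pieces)
        self.active_features = list(active_features)
        self.available_features = list(available_features)


class ToyStep:
    """Hypothetical step whose class variables steer the checks."""

    applicable_to_empty_datasets = False
    requires_at_least_one_feature = True

    def __init__(self, required_features=()):
        self.required_features = list(required_features)

    def check_dataset(self, dataset):
        # Each guard corresponds to one entry of the Raises list.
        if not isinstance(dataset, ToyDataset):
            raise TypeError(f"Expected a ToyDataset, got {type(dataset).__name__}")
        if not self.applicable_to_empty_datasets and not dataset.pieces:
            raise EmptyDatasetError("The dataset is empty.")
        if self.requires_at_least_one_feature and not dataset.active_features:
            raise NoFeaturesActiveError("No features are active.")
        missing = [
            name
            for name in self.required_features
            if name not in dataset.available_features
        ]
        if missing:
            raise FeatureUnavailableError(f"Unavailable features: {missing}")
```

Returning None on success and raising on failure lets a caller run the check as a precondition before any processing starts.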

check_resource(resource: DimcatResource) → None[source]#

Check if the resource is eligible for processing.

Raises:

TypeError – if the given resource is not a DimcatResource
EmptyResourceError – if the given resource is empty
FeatureNotProcessableError – if the given resource cannot be processed by this step

property features: List[DimcatConfig]#: The Feature objects you want this PipelineStep to process. If not specified, the step will try to process all features in a given Dataset’s Outputs catalog.

get_feature_specs() → List[DimcatConfig][source]#: Return a list of feature names required for this PipelineStep.

property is_transformation: bool#: True if this PipelineStep replaces the package named output_package_name in dataset.outputs rather than extending it. Currently, this is the case only if output_package_name is ‘features’ or None (which defaults to ‘features’).
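
The rule described for is_transformation can be written as a small predicate. This is a sketch only: replaces_features_package is a hypothetical helper name, and treating None as 'features' follows the description above rather than the library's source.

```python
from typing import Optional


def replaces_features_package(output_package_name: Optional[str]) -> bool:
    """Hypothetical helper: True if the step would replace the 'features' package."""
    # None falls back to the default package name 'features'.
    effective_name = "features" if output_package_name is None else output_package_name
    return effective_name == "features"
```

Under this reading, a step configured with any other package name (for example a hypothetical 'results') extends dataset.outputs instead of replacing the features package.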

class dimcat.steps.base.PipelineStep[source]#

Bases: DimcatObject

This base class unites all classes able to transform some data in a pre-defined way.

The initializer will set some parameters of the processing, and then the process() method is used to transform an input Data object, returning a copy.

class Schema#

Bases: Schema

PipelineSteps do not depend on previously serialized data, so their serialization can be validated by default after dumping them to a dict-like structure. For Data objects, this default is safe only for their PickleSchema, which PipelineSteps do not use.

dump_fields: dict[str, Field]#

exclude: set[Any] | MutableSet[Any]#

fields: dict[str, Field]#: Dictionary mapping field_names -> Field objects

load_fields: dict[str, Field]#

opts: Any = <marshmallow.schema.SchemaOpts object>#

unknown: types.UnknownOption#

validate_dump(data, **kwargs)[source]#: Make sure to never return invalid serialization data.

check_dataset(dataset: Dataset) → None[source]#

Check if the dataset is eligible for processing.

Raises:

TypeError – if the given dataset is not a Dataset
EmptyDatasetError – if applicable_to_empty_datasets is False and the given dataset is empty

check_resource(resource: Resource) → None[source]#

Check if the resource is eligible for processing.

Raises:

TypeError – if the given resource is not a DimcatResource
EmptyResourceError – if the given resource is empty

fit_to_dataset(dataset: Dataset) → None[source]#

Adjust the PipelineStep to the passed dataset.

Parameters: dataset – The dataset to adjust to.

property is_transformation: Literal[False]#: True if this PipelineStep transforms features, replacing the dataset.outputs[‘features’] package.

process(data: D) → D[source]#
process(data: Union[List[D], Tuple[D]]) → List[D]
process(*data: D) → List[D]: Same as process_data(), with the difference that arbitrarily many objects are accepted.
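
The three call forms of process() can be illustrated with a toy dispatcher. This is a minimal sketch, assuming that process() simply delegates each object to process_data(); ToyStep and its upper-casing transformation are hypothetical and not part of dimcat.

```python
from typing import Any


class ToyStep:
    """Hypothetical step used only to illustrate the three call forms."""

    def process_data(self, data: Any) -> Any:
        # Placeholder transformation: build a new, upper-cased string.
        return str(data).upper()

    def process(self, *data: Any) -> Any:
        if not data:
            raise ValueError("process() needs at least one object")
        if len(data) == 1:
            single = data[0]
            # A single list or tuple is processed element-wise and returned as a list.
            if isinstance(single, (list, tuple)):
                return [self.process_data(item) for item in single]
            # A single object comes back as a single object.
            return self.process_data(single)
        # Several positional arguments always come back as a list.
        return [self.process_data(item) for item in data]
```

With this sketch, ToyStep().process("a") returns a single object, whereas ToyStep().process(["a", "b"]) and ToyStep().process("a", "b") both return a list with one result per input.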

process_data(data: Dataset) → Dataset[source]#

process_data(data: DimcatResource) → DR

Perform a transformation on an input Data object. This should never alter the Data or its properties in place, instead returning a copy or view of the input.

Parameters: data – The data to be transformed. Must not be altered in place.
Returns: A copy of the input Data, potentially transformed or enhanced in some way defined by this PipelineStep.
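
The requirement that the input is never altered in place can be met by working on a copy. The sketch below assumes nothing about dimcat's internals: ToyData and DoublingStep are hypothetical, and copy.deepcopy is just one way of obtaining an independent copy.

```python
import copy
from dataclasses import dataclass, field
from typing import List


@dataclass
class ToyData:
    """Hypothetical stand-in for a Data object."""

    values: List[int] = field(default_factory=list)


class DoublingStep:
    """Hypothetical step that doubles every value without touching its input."""

    def process_data(self, data: ToyData) -> ToyData:
        # Copy first, then modify only the copy.
        result = copy.deepcopy(data)
        result.values = [value * 2 for value in result.values]
        return result


original = ToyData([1, 2, 3])
processed = DoublingStep().process_data(original)
```

After the call, original still holds its initial values and processed is a separate object, so the same input can safely be passed to further steps.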

process_dataset(dataset: Dataset) → Dataset[source]#: Apply this PipelineStep to a Dataset and return a copy containing the output(s).

process_resource(resource: Union[Resource, str, Path]) → DR[source]#

resource_name_factory(resource: DR) → str[source]#: Creates a unique name for the new resource based on the input resource.

class dimcat.steps.base.ResourceTransformation(features: Optional[Union[Feature, Type[Feature], DimcatConfig, MutableMapping, FeatureName, str, Iterable[Union[Feature, Type[Feature], DimcatConfig, MutableMapping, FeatureName, str]]]] = None, **kwargs)[source]#

Bases: FeatureProcessingStep

The subclasses either transform the features specified upon initialization, returning a Dataset containing only these, or, if no features are specified, transform all resources in the outputs catalog.
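
The two modes described above (transform only the specified features, or everything in the outputs catalog) can be sketched as follows. This is an illustration under stated assumptions, not dimcat's code: transform_outputs is a hypothetical function, the outputs catalog is modelled as a plain dict of lists, and the transformation is an arbitrary callable.

```python
from typing import Callable, Dict, List, Optional


def transform_outputs(
    outputs: Dict[str, List[int]],
    transform: Callable[[List[int]], List[int]],
    features: Optional[List[str]] = None,
) -> Dict[str, List[int]]:
    """Hypothetical sketch: transform the selected resources, or all if none are selected."""
    # No features specified: fall back to every resource in the catalog.
    selected = list(outputs) if not features else features
    # Build a new mapping so that the input catalog is left untouched.
    return {name: transform(outputs[name]) for name in selected}


catalog = {"notes": [1, 2], "harmonies": [3]}


def double(values: List[int]) -> List[int]:
    return [value * 2 for value in values]


only_notes = transform_outputs(catalog, double, features=["notes"])
everything = transform_outputs(catalog, double)
```

Here only_notes contains just the transformed 'notes' entry, while everything contains a transformed version of each entry in the catalog.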

transform_resource(resource: DimcatResource) → DataFrame[source]#: Apply the transformation to a Resource and return the transformed dataframe.

Module contents#