dimcat.steps package#

Subpackages#

Submodules#

dimcat.steps.base module#

class dimcat.steps.base.FeatureProcessingStep(features: Optional[Union[Feature, Type[Feature], DimcatConfig, MutableMapping, FeatureName, str, Iterable[Union[Feature, Type[Feature], DimcatConfig, MutableMapping, FeatureName, str]]]] = None, **kwargs)[source]#

Bases: PipelineStep

Base class for all PipelineSteps that operate on one or all of the features that can be, or have been, extracted from a Dataset. Such steps can be instantiated with the features argument; the resulting behaviour is defined by class variables.

class Schema(*, only: Optional[Union[Sequence[str], AbstractSet[str]]] = None, exclude: Union[Sequence[str], AbstractSet[str]] = (), many: Optional[bool] = None, load_only: Union[Sequence[str], AbstractSet[str]] = (), dump_only: Union[Sequence[str], AbstractSet[str]] = (), partial: Optional[Union[bool, Sequence[str], AbstractSet[str]]] = None, unknown: Optional[Literal['exclude', 'include', 'raise']] = None)[source]#

Bases: Schema

deal_with_single_item(data, **kwargs)[source]#
dump_fields: dict[str, Field]#
exclude: set[Any] | MutableSet[Any]#
fields: dict[str, Field]#

Dictionary mapping field_names -> Field objects

load_fields: dict[str, Field]#
opts: Any = <marshmallow.schema.SchemaOpts object>#
unknown: types.UnknownOption#
check_dataset(dataset: Dataset) → None[source]#

Check if the dataset is eligible for processing.

Raises:
check_resource(resource: DimcatResource) → None[source]#

Check if the resource is eligible for processing.

Raises:
property features: List[DimcatConfig]#

The Feature objects you want this PipelineStep to process. If not specified, the step will try to process all features in a given Dataset’s Outputs catalog.

get_feature_specs() → List[DimcatConfig][source]#

Return the specifications (as DimcatConfig objects) of the features required for this PipelineStep.

property is_transformation: bool#

True if this PipelineStep replaces the output_package_name in dataset.outputs rather than extending it. Currently, this is the case only if output_package_name is ‘features’ or None (which defaults to ‘features’).
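The rule described for is_transformation can be sketched in isolation. This is a minimal illustration under stated assumptions, not DiMCAT's implementation: the class SketchStep and its constructor are hypothetical, and only the attribute name output_package_name and the default ‘features’ are taken from the description above.

```python
from typing import Optional


class SketchStep:
    """Hypothetical stand-in for a feature-processing step (not part of dimcat)."""

    def __init__(self, output_package_name: Optional[str] = None):
        self.output_package_name = output_package_name

    @property
    def is_transformation(self) -> bool:
        # None falls back to "features", so both values mean "replace the package".
        name = "features" if self.output_package_name is None else self.output_package_name
        return name == "features"


print(SketchStep().is_transformation)            # True
print(SketchStep("features").is_transformation)  # True
print(SketchStep("results").is_transformation)   # False
```

Any other package name (here the made-up name "results") yields False, meaning the step would extend dataset.outputs instead of replacing a package.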

class dimcat.steps.base.PipelineStep[source]#

Bases: DimcatObject

This base class unites all classes able to transform some data in a pre-defined way.

The initializer sets the processing parameters; the process() method then transforms an input Data object and returns a copy.

class Schema(*, only: Optional[Union[Sequence[str], AbstractSet[str]]] = None, exclude: Union[Sequence[str], AbstractSet[str]] = (), many: Optional[bool] = None, load_only: Union[Sequence[str], AbstractSet[str]] = (), dump_only: Union[Sequence[str], AbstractSet[str]] = (), partial: Optional[Union[bool, Sequence[str], AbstractSet[str]]] = None, unknown: Optional[Literal['exclude', 'include', 'raise']] = None)[source]#

Bases: Schema

PipelineSteps do not depend on previously serialized data, so their serialization can be validated by default after dumping them to a dict-like structure. For Data objects, this default is safe only for their PickleSchema, which PipelineSteps do not use.

dump_fields: dict[str, Field]#
exclude: set[Any] | MutableSet[Any]#
fields: dict[str, Field]#

Dictionary mapping field_names -> Field objects

load_fields: dict[str, Field]#
opts: Any = <marshmallow.schema.SchemaOpts object>#
unknown: types.UnknownOption#
validate_dump(data, **kwargs)[source]#

Make sure to never return invalid serialization data.

check_dataset(dataset: Dataset) → None[source]#

Check if the dataset is eligible for processing.

Raises:
  • TypeError – if the given dataset is not a Dataset

  • EmptyDatasetError – if applicable_to_empty_datasets is False and the given dataset is empty
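The two documented failure modes of check_dataset() can be illustrated with a self-contained sketch. Everything below is hypothetical and only mirrors the names used in the Raises list: the Dataset and EmptyDatasetError classes are minimal stand-ins, the is_empty property is invented for the example, and modelling applicable_to_empty_datasets as a class attribute is an assumption.

```python
class EmptyDatasetError(ValueError):
    """Hypothetical stand-in for the exception named in the Raises list."""


class Dataset:
    """Hypothetical minimal dataset: just a list of resources."""

    def __init__(self, resources=()):
        self.resources = list(resources)

    @property
    def is_empty(self) -> bool:
        return len(self.resources) == 0


class SketchStep:
    # Assumption: the flag is a class attribute that subclasses may override.
    applicable_to_empty_datasets = False

    def check_dataset(self, dataset) -> None:
        if not isinstance(dataset, Dataset):
            raise TypeError(f"Expected a Dataset, got {type(dataset).__name__}")
        if not self.applicable_to_empty_datasets and dataset.is_empty:
            raise EmptyDatasetError("This step cannot process an empty dataset.")


step = SketchStep()
step.check_dataset(Dataset(["resource_a"]))  # passes silently

for bad_input in ("not a dataset", Dataset()):
    try:
        step.check_dataset(bad_input)
    except (TypeError, EmptyDatasetError) as error:
        print(type(error).__name__)
```

In this sketch, a step that sets applicable_to_empty_datasets = True would accept the empty Dataset and only the type check would remain.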

check_resource(resource: Resource) → None[source]#

Check if the resource is eligible for processing.

Raises:
fit_to_dataset(dataset: Dataset) → None[source]#

Adjust the PipelineStep to the passed dataset.

Parameters:

dataset – The dataset to adjust to.

property is_transformation: Literal[False]#

True if this PipelineStep transforms features, replacing the dataset.outputs[‘features’] package.

process(data: D) → D[source]#
process(data: Union[List[D], Tuple[D]]) → List[D]
process(*data: D) → List[D]

Same as process_data(), with the difference that arbitrarily many objects are accepted.
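The three call signatures of process() suggest a dispatch on the number and type of arguments. The sketch below shows one way such a dispatch could look; it is not taken from the DiMCAT source. UppercaseStep is a hypothetical step operating on plain strings, and raising ValueError for a call without arguments is an assumption.

```python
from typing import Any, List, Union


class UppercaseStep:
    """Hypothetical step (not part of dimcat) used to illustrate the dispatch."""

    def process_data(self, data: str) -> str:
        # Returns a new string; the input is left untouched.
        return data.upper()

    def process(self, *data: Any) -> Union[str, List[str]]:
        if not data:
            # Assumption: calling process() without arguments is an error.
            raise ValueError("process() expects at least one object")
        if len(data) == 1:
            single = data[0]
            if isinstance(single, (list, tuple)):
                # One list or tuple: process each element, return a list.
                return [self.process_data(d) for d in single]
            # One object: return one object.
            return self.process_data(single)
        # Several positional arguments: return a list.
        return [self.process_data(d) for d in data]


step = UppercaseStep()
print(step.process("a"))         # A
print(step.process(["a", "b"]))  # ['A', 'B']
print(step.process("a", "b"))    # ['A', 'B']
```

A single object yields a single result, while a list, a tuple, or several positional arguments yield a list, matching the three return annotations.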

process_data(data: Dataset) → Dataset[source]#
process_data(data: DimcatResource) → DR

Perform a transformation on an input Data object. This should never alter the Data or its properties in place, instead returning a copy or view of the input.

Parameters:

data – The data to be transformed. Must not be altered in place.

Returns:

A copy of the input Data, potentially transformed or enhanced in some way defined by this PipelineStep.
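The requirement that process_data() never alters its input in place can be made concrete with a small sketch. The classes SketchData and DoubleValues are hypothetical and unrelated to DiMCAT's own Data types; the example only demonstrates the copy-then-transform contract described above.

```python
import copy


class SketchData:
    """Hypothetical data container (not part of dimcat)."""

    def __init__(self, values):
        self.values = list(values)


class DoubleValues:
    """Hypothetical step that doubles every value."""

    def process_data(self, data: SketchData) -> SketchData:
        # Work on a deep copy so the caller's object stays unchanged.
        result = copy.deepcopy(data)
        result.values = [2 * v for v in result.values]
        return result


original = SketchData([1, 2, 3])
processed = DoubleValues().process_data(original)
print(original.values)        # [1, 2, 3]
print(processed.values)       # [2, 4, 6]
print(processed is original)  # False
```

Returning a copy keeps earlier pipeline results valid when several steps are applied to the same input.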

process_dataset(dataset: Dataset) → Dataset[source]#

Apply this PipelineStep to a Dataset and return a copy containing the output(s).

process_resource(resource: Union[Resource, str, Path]) → DR[source]#
resource_name_factory(resource: DR) → str[source]#

Creates a unique name for the new resource based on the input resource.

class dimcat.steps.base.ResourceTransformation(features: Optional[Union[Feature, Type[Feature], DimcatConfig, MutableMapping, FeatureName, str, Iterable[Union[Feature, Type[Feature], DimcatConfig, MutableMapping, FeatureName, str]]]] = None, **kwargs)[source]#

Bases: FeatureProcessingStep

The subclasses either transform the features specified upon initialization, returning a Dataset containing only these, or, if no features are specified, transform all resources in the outputs catalog.

transform_resource(resource: DimcatResource) → DataFrame[source]#

Apply the transformation to a Resource and return the transformed dataframe.

Module contents#