py4ai.data.model.ml module

Module for specifying data-models to be used in modelling.

class py4ai.data.model.ml.CachedDataset(*args, **kwargs)

Bases: DatasetUtilsMixin[FeatType, LabType], CachedIterable[Sample[FeatType, LabType]], DillSerialization

Class that represents a dataset cached in memory, derived from a cached iterable of samples.

Return an instance of a class to be used for implementing cached iterables.

Parameters

items – sequence or iterable of elements

cached_type

alias of CachedDataset

lazy_type

alias of LazyDataset

to_df() DataFrame

Reformat the Features and Labels as a DataFrame.

Returns

DataFrame containing both features and labels

union(other: TDatasetUtilsMixin) CachedDataset[FeatType, LabType]

Perform union on CachedDatasets.

Parameters

other – CachedDataset

Returns

union of current and other CachedDataset

class py4ai.data.model.ml.DatasetUtilsMixin(*args, **kwargs)

Bases: IterableUtilsMixin[Sample[FeatType, LabType], LazyDataset[FeatType, LabType], CachedDataset[FeatType, LabType]], Generic[FeatType, LabType], ABC

Base class for representing datasets as iterable over Samples.

Create a new instance of this class.

Parameters
  • cls – parent object class

  • args – passed to the super class __new__ method

  • kwargs – passed to the super class __new__ method

Raises

RuntimeError – if the cached and lazy versions were not defined before instantiating the class

Returns

an instance of this class

property asPandasDataset: PandasDataset[FeatType, LabType]

Cast object as a PandasDataset.

Returns

dataset

cached_type: Type[CachedIterableType]
static checkNames(x: Optional[Union[int, str, Any]]) Union[str, int]

Check that feature names comply with format and cast them to either string or int.

Parameters

x – feature name

Returns

name as int or str

Raises

AttributeError – if x is None
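
The check described above can be illustrated with a minimal, dependency-free sketch. The function name check_name and the exact casting rule (integers kept as they are, anything else cast to str) are assumptions made for illustration; only the AttributeError on None comes from the description above.

```python
from typing import Any, Optional, Union


def check_name(x: Optional[Union[int, str, Any]]) -> Union[str, int]:
    """Hypothetical stand-in for checkNames; the casting rule is an assumption."""
    if x is None:
        # Documented behaviour: a missing name raises AttributeError.
        raise AttributeError("sample name must not be None")
    # Assumed rule: keep integers unchanged, cast everything else to str.
    return x if isinstance(x, int) else str(x)
```

Under these assumptions, check_name(3) returns 3 and check_name(2.5) returns the string "2.5".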

getFeaturesAs(type: Literal['array']) ndarray[Any, dtype[Any]]
getFeaturesAs(type: Literal['pandas']) DataFrame
getFeaturesAs(type: Literal['dict']) Dict[Union[str, int], FeatType]
getFeaturesAs(type: Literal['list']) List[FeatType]
getFeaturesAs(type: Literal['lazy']) Iterator[FeatType]

Return object of the specified type containing the feature space.

Parameters

type – type of return. Can be one of “array”, “pandas”, “dict”, “list” or “lazy”

Returns

an object of the specified type containing the features

Raises

ValueError – if the provided type is not one of the allowed ones
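
The dispatch on the type argument might look roughly like the sketch below. It is not the library's implementation: samples are modelled as hypothetical (name, features) pairs, the dictionary is assumed to be keyed by sample name, and the “array” and “pandas” branches are left out so that the sketch does not depend on numpy or pandas.

```python
from typing import Any, List, Tuple, Union

# Hypothetical samples: (name, features) pairs standing in for Sample objects.
SamplePair = Tuple[Union[str, int], Any]


def get_features_as(samples: List[SamplePair], type: str) -> Any:
    """Return the features in the container selected by ``type``."""
    if type == "list":
        return [features for _, features in samples]
    if type == "dict":
        # Assumption: the dictionary is keyed by sample name.
        return {name: features for name, features in samples}
    if type == "lazy":
        # A generator expression, so nothing is materialised up front.
        return (features for _, features in samples)
    if type in ("array", "pandas"):
        # Omitted here to keep the sketch free of numpy/pandas.
        raise NotImplementedError(f"'{type}' is not covered by this sketch")
    # Documented behaviour: an unknown type raises ValueError.
    raise ValueError(f"Type {type!r} not allowed")
```

The same pattern would apply to getLabelsAs, with labels in place of features.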

getLabelsAs(type: Literal['array']) ndarray[Any, dtype[Any]]
getLabelsAs(type: Literal['pandas']) DataFrame
getLabelsAs(type: Literal['dict']) Dict[Union[str, int], LabType]
getLabelsAs(type: Literal['list']) List[LabType]
getLabelsAs(type: Literal['lazy']) Iterator[LabType]

Return an object of the specified type containing the labels.

Parameters

type – type of return. Can be one of “array”, “pandas”, “dict”, “list” or “lazy”

Returns

an object of the specified type containing the labels

Raises

ValueError – if the provided type is not one of the allowed ones

lazy_type: Type[LazyIterableType]
property type: Type[Sample[FeatType, LabType]]

Return the type of the objects in the Iterable.

Returns

type of the object of the iterable

abstract union(other: TDatasetUtilsMixin) DatasetUtilsMixin[FeatType, LabType]

Return a union of datasets.

Parameters

other – other dataset to join

Returns

union dataset

class py4ai.data.model.ml.LazyDataset(*args, **kwargs)

Bases: LazyIterable[Sample[FeatType, LabType]], DatasetUtilsMixin[FeatType, LabType]

Class that represents dataset derived by a lazy iterable of samples.

Return an instance of the class to be used for implementing lazy iterables.

Parameters

items – IterGenerator containing the generator of items

cached_type

alias of CachedDataset

features() Iterator[FeatType]

Return an iterator over sample features.

Returns

iterable of features

getFeaturesAs(type: Literal['array']) ndarray[Any, dtype[Any]]
getFeaturesAs(type: Literal['pandas']) DataFrame
getFeaturesAs(type: Literal['dict']) Dict[Union[str, int], FeatType]
getFeaturesAs(type: Literal['list']) List[FeatType]
getFeaturesAs(type: Literal['lazy']) Iterator[FeatType]

Return object of the specified type containing the feature space.

Parameters

type – type of return. Can be one of “array”, “pandas”, “dict”, “list” or “lazy”

Returns

an object of the specified type containing the features

getLabelsAs(type: Literal['array']) ndarray[Any, dtype[Any]]
getLabelsAs(type: Literal['pandas']) DataFrame
getLabelsAs(type: Literal['dict']) Dict[Union[str, int], LabType]
getLabelsAs(type: Literal['list']) List[LabType]
getLabelsAs(type: Literal['lazy']) Iterator[LabType]

Return an object of the specified type containing the labels.

Parameters

type – type of return. Can be one of “array”, “pandas”, “dict”, “list” or “lazy” (an iterator)

Returns

an object of the specified type containing the labels

labels() Iterator[LabType]

Return an iterator over sample labels.

Returns

iterable of labels

lazy_type

alias of LazyDataset

union(other: TDatasetUtilsMixin) LazyDataset[FeatType, LabType]

Perform union on LazyDatasets.

Parameters

other – LazyDataset

Returns

union of LazyDatasets
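
A lazy union can be sketched as the concatenation of two generator factories, so that no sample is read until the result is iterated. The names below (GeneratorFactory, lazy_union, make_a, make_b) are hypothetical; the factory is only a stand-in for the IterGenerator mentioned above, not the library's class.

```python
from typing import Callable, Iterator, List

# Hypothetical stand-in for an IterGenerator: a zero-argument callable that
# returns a fresh iterator each time it is called.
GeneratorFactory = Callable[[], Iterator[int]]


def lazy_union(first: GeneratorFactory, second: GeneratorFactory) -> GeneratorFactory:
    """Return a factory yielding all items of ``first``, then all of ``second``."""

    def generator() -> Iterator[int]:
        yield from first()
        yield from second()

    return generator


calls: List[str] = []


def make_a() -> Iterator[int]:
    calls.append("a")  # records when iteration actually starts
    yield from (1, 2)


def make_b() -> Iterator[int]:
    calls.append("b")
    yield from (3, 4)


# Building the union does not consume either source.
union = lazy_union(make_a, make_b)
```

Because the union is itself a factory, it can be iterated more than once, and each pass re-reads both sources.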

withLookback(lookback: int) LazyDataset[FeatType, LabType]

Create a LazyDataset in which the features of each sample are an array stacking the features of lookback consecutive samples.

Parameters

lookback – number of samples’ features to look at

Returns

LazyDataset with changed samples
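
The lookback transformation can be pictured as a sliding window over the samples. The sketch below is an illustration under stated assumptions, not the library's code: samples are hypothetical (features, label) pairs, windows are plain lists rather than arrays, and each window is assumed to take the label of its most recent sample.

```python
from collections import deque
from typing import Any, Deque, Iterable, Iterator, List, Tuple

# Hypothetical sample representation: a (features, label) pair.
Pair = Tuple[Any, Any]


def with_lookback(samples: Iterable[Pair], lookback: int) -> Iterator[Tuple[List[Any], Any]]:
    """Yield windows holding the features of ``lookback`` consecutive samples."""
    if lookback < 1:
        raise ValueError("lookback must be a positive integer")
    window: Deque[Any] = deque(maxlen=lookback)
    for features, label in samples:
        window.append(features)  # the oldest entry is dropped automatically
        if len(window) == lookback:
            # Assumption: the window takes the label of its most recent sample.
            yield list(window), label
```

With this sketch, a source of n samples yields n - lookback + 1 windows, and none when the source is shorter than lookback.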

class py4ai.data.model.ml.MultiFeatureSample(features: List[ndarray[Any, dtype[Any]]], label: Optional[LabType] = None, name: Optional[str] = None)

Bases: Sample[List[ndarray], LabType]

Class representing an observation defined by a nested list of arrays.

Object representing a single sample of a training or test set.

Parameters
  • features – features of the sample

  • label – labels of the sample (optional)

  • name – id of the sample (optional)

class py4ai.data.model.ml.PandasDataset(*args, **kwargs)

Bases: Generic[FeatType, LabType], DatasetUtilsMixin[FeatType, LabType], DillSerialization

Dataset represented via pandas DataFrames for features and labels.

Return a data structure built on top of pandas DataFrames.

PandasDataset packs features and labels together and returns them as a pandas DataFrame, a numpy array or a dictionary. For unsupervised learning tasks the labels are left as None.

Parameters
  • features – a dataframe or a series of features

  • labels – a dataframe or a series of labels. None in case no labels are present.

Raises

TypeError – if the labels or features are neither DataFrames nor Series

property cached: bool

Return whether the dataset is cached or not in memory.

Returns

boolean

cached_type

alias of PandasDataset

classmethod createObject(features: Union[DataFrame, Series], labels: Optional[Union[DataFrame, Series]]) TPandasDataset

Create a PandasDataset object.

Parameters
  • features – features as pandas dataframe/series

  • labels – labels as pandas dataframe/series

Returns

a PandasDataset object

dropna(**kwargs: Any) TPandasDataset

Drop NAs from features and labels.

Parameters

kwargs – keyword arguments passed to dropna

Returns

PandasDataset with features and labels without NAs

classmethod empty() TPandasDataset

Return empty object.

Returns

Empty instance of class

property features: DataFrame

Get features as pandas dataframe.

Returns

pd.DataFrame

classmethod from_sequence(datasets: Sequence[TPandasDataset]) TPandasDataset

Create a PandasDataset from a list of pandas datasets using pd.concat.

Parameters

datasets – list of PandasDatasets

Returns

PandasDataset

getFeaturesAs(type: Literal['array']) ndarray[Any, dtype[Any]]
getFeaturesAs(type: Literal['pandas']) DataFrame
getFeaturesAs(type: Literal['dict']) Dict[Union[str, int], FeatType]
getFeaturesAs(type: Literal['list']) List[FeatType]
getFeaturesAs(type: Literal['lazy']) Iterator[FeatType]

Get features as numpy array, pandas dataframe or dictionary.

Parameters

type – str, default is ‘array’; can be ‘array’, ‘pandas’ or ‘dict’

Returns

features according to the given type

Raises

ValueError – provided type not allowed

getLabelsAs(type: Literal['array']) ndarray[Any, dtype[Any]]
getLabelsAs(type: Literal['pandas']) DataFrame
getLabelsAs(type: Literal['dict']) Dict[Union[str, int], LabType]
getLabelsAs(type: Literal['list']) List[LabType]
getLabelsAs(type: Literal['lazy']) Iterator[LabType]

Get labels as numpy array, pandas dataframe or dictionary.

Parameters

type – str, default is ‘array’; can be ‘array’, ‘pandas’ or ‘dict’

Returns

labels according to the given type

Raises

ValueError – provided type not allowed

property index: Index

Get Dataset index.

Returns

pd.Index

intersection() TPandasDataset

Intersect features and labels indices.

Returns

PandasDataset with features and labels with intersected indices
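
The idea of intersecting indices can be shown without pandas by treating features and labels as dictionaries keyed by index. This is only a sketch: the function name intersect_indices is hypothetical, and keeping the order of the features index is an assumption.

```python
from typing import Any, Dict, Hashable, Tuple


def intersect_indices(
    features: Dict[Hashable, Any], labels: Dict[Hashable, Any]
) -> Tuple[Dict[Hashable, Any], Dict[Hashable, Any]]:
    """Keep only the entries whose index appears in both mappings."""
    # Assumption: the result follows the order of the features index.
    common = [idx for idx in features if idx in labels]
    return (
        {idx: features[idx] for idx in common},
        {idx: labels[idx] for idx in common},
    )
```

Rows that have features but no label, or a label but no features, are dropped from both sides.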

property items: Iterator[Sample[FeatType, LabType]]

Get the dataset items as an iterator of Samples.

Yields

Iterator of objects of py4ai.data.model.ml.Sample

property labels: DataFrame

Get labels as a pandas dataframe.

Returns

pd.DataFrame

lazy_type

alias of LazyDataset

loc(idx: List[Any]) TPandasDataset

Find given indices in features and labels.

Parameters

idx – input indices

Returns

PandasDataset with features and labels filtered on input indices

takeAsPandas(n: int) TPandasDataset

Return top n records as a PandasDataset.

Parameters

n – int specifying number of records to output

Returns

PandasDataset of length n

union(other: TPandasDataset) TPandasDataset

Return a union between PandasDatasets.

Parameters

other – Dataset to be merged

Returns

Dataset resulting from the merge

class py4ai.data.model.ml.PandasTimeIndexedDataset(*args, **kwargs)

Bases: PandasDataset[FeatType, LabType], Generic[FeatType, LabType]

Class to be used for datasets that have time-indexed samples.

Return a data structure built on top of pandas DataFrames that packs time-indexed features and labels.

Features and labels can be obtained as a pandas dataframe, numpy array or a dictionary. For unsupervised learning tasks the labels are left as None.

Parameters
  • features – pandas dataframe/series where index elements are dates in string format

  • labels – pandas dataframe/series where index elements are dates in string format

class py4ai.data.model.ml.Sample(features: FeatType, label: Optional[LabType] = None, name: Optional[Union[int, str, Any]] = None)

Bases: DillSerialization, Generic[FeatType, LabType]

Base class for representing a sample/observation.

Return an object representing a single sample of a training or test set.

Parameters
  • features – features of the sample

  • label – labels of the sample (optional)

  • name – id of the sample (optional)

py4ai.data.model.ml.features_and_labels_to_dataset(X: Union[DataFrame, Series], y: Optional[Union[DataFrame, Series]] = None) CachedDataset[Dict[Any, Any], int]

Pack features and labels into a CachedDataset.

Parameters
  • X – features which can be a pandas dataframe or a pandas series object

  • y – labels which can be a pandas dataframe or a pandas series object

Returns

an instance of py4ai.data.model.ml.CachedDataset
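
Conceptually, packing turns each row of X into one sample whose features are a dictionary, which matches the Dict[Any, Any] feature type in the signature. The sketch below is a dependency-free illustration under assumptions: SimpleSample and pack_samples are hypothetical names, tabular data is modelled as nested dictionaries instead of pandas objects, and using the row index as the sample name is assumed rather than confirmed.

```python
from typing import Any, Dict, Hashable, List, NamedTuple, Optional


class SimpleSample(NamedTuple):
    """Hypothetical stand-in for Sample: features, optional label, name."""

    features: Dict[Any, Any]
    label: Optional[int]
    name: Hashable


def pack_samples(
    X: Dict[Hashable, Dict[Any, Any]],
    y: Optional[Dict[Hashable, int]] = None,
) -> List[SimpleSample]:
    """Build one sample per row of ``X``, attaching the label from ``y`` if any."""
    return [
        SimpleSample(
            features=dict(row),
            # Labels stay None when y is not given (unsupervised case).
            label=y.get(idx) if y is not None else None,
            # Assumption: the row index is used as the sample name.
            name=idx,
        )
        for idx, row in X.items()
    ]
```

Calling pack_samples with only X gives samples whose label is None.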