py4ai.data.model.ml module
Module for specifying data-models to be used in modelling.
- class py4ai.data.model.ml.CachedDataset(*args, **kwargs)
Bases: DatasetUtilsMixin[FeatType, LabType], CachedIterable[Sample[FeatType, LabType]], DillSerialization
Class that represents a dataset cached in memory, derived from a cached iterable of samples.
Return an instance of a class to be used for implementing cached iterables.
- Parameters
items – sequence or iterable of elements
- cached_type
alias of CachedDataset
- lazy_type
alias of LazyDataset
- to_df() DataFrame
Reformat the Features and Labels as a DataFrame.
- Returns
DataFrame with features and labels
- union(other: TDatasetUtilsMixin) CachedDataset[FeatType, LabType]
Perform union on CachedDatasets.
- Parameters
other – CachedDataset
- Returns
union of current and other CachedDataset
- class py4ai.data.model.ml.DatasetUtilsMixin(*args, **kwargs)
Bases: IterableUtilsMixin[Sample[FeatType, LabType], LazyDataset[FeatType, LabType], CachedDataset[FeatType, LabType]], Generic[FeatType, LabType], ABC
Base class for representing datasets as iterable over Samples.
Create a new instance of this class.
- Parameters
cls – parent object class
args – passed to the super class __new__ method
kwargs – passed to the super class __new__ method
- Raises
RuntimeError – if the cached and lazy versions were not defined before instantiating the class
- Returns
an instance of this class
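The instantiation guard described above can be illustrated with a minimal sketch. It is not py4ai's implementation: `UtilsMixinSketch`, `Incomplete` and `Complete` are hypothetical names, and checking the attributes with `hasattr` is an assumption about how such a guard might work.

```python
from abc import ABC


class UtilsMixinSketch(ABC):
    """Hypothetical stand-in for a mixin that needs paired cached/lazy types."""

    def __new__(cls, *args, **kwargs):
        # Refuse to instantiate until both companion types have been declared.
        for attr in ("cached_type", "lazy_type"):
            if not hasattr(cls, attr):
                raise RuntimeError(
                    f"{cls.__name__} must define '{attr}' before instantiation"
                )
        return super().__new__(cls)


class Incomplete(UtilsMixinSketch):
    """Declares neither companion type, so instantiation fails."""


class Complete(UtilsMixinSketch):
    """Receives both companion types after class creation."""


# For brevity both attributes point at the class itself (an assumption of the sketch).
Complete.cached_type = Complete
Complete.lazy_type = Complete
```

Under these assumptions, `Complete()` succeeds while `Incomplete()` raises `RuntimeError`.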
- property asPandasDataset: PandasDataset[FeatType, LabType]
Cast object as a PandasDataset.
- Returns
dataset
- cached_type: Type[CachedIterableType]
- static checkNames(x: Optional[Union[int, str, Any]]) Union[str, int]
Check that feature names comply with format and cast them to either string or int.
- Parameters
x – feature name
- Returns
name as int or str
- Raises
AttributeError – if x is None
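A rough sketch of this kind of name normalisation follows. The function `check_name_sketch` is hypothetical, and the rule it applies (integers are kept, everything else is cast to `str`) is an assumption; the actual format checks in py4ai may differ.

```python
from typing import Any, Optional, Union


def check_name_sketch(x: Optional[Union[int, str, Any]]) -> Union[str, int]:
    """Hypothetical stand-in: normalise a feature name to int or str."""
    if x is None:
        # Mirrors the documented error for a missing name.
        raise AttributeError("feature name must not be None")
    if isinstance(x, int):
        return x
    # Assumption: any other value is represented by its string form.
    return str(x)
```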
- getFeaturesAs(type: Literal['array']) ndarray[Any, dtype[Any]]
- getFeaturesAs(type: Literal['pandas']) DataFrame
- getFeaturesAs(type: Literal['dict']) Dict[Union[str, int], FeatType]
- getFeaturesAs(type: Literal['list']) List[FeatType]
- getFeaturesAs(type: Literal['lazy']) Iterator[FeatType]
Return object of the specified type containing the feature space.
- Parameters
type – type of return. Can be one of “pandas”, “dict”, “list”, “array” or “lazy”
- Returns
an object of the specified type containing the features
- Raises
ValueError – if the provided type is not one of the allowed ones
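The dispatch on `type` can be sketched in plain Python. This is an illustration only: `get_features_as_sketch` is a hypothetical function, samples are modelled as `(name, features)` pairs by assumption, and the “array” and “pandas” branches are left out to keep the example free of third-party dependencies.

```python
from typing import Any, List, Tuple, Union

# Each sample is modelled as a (name, features) pair; an assumption of this sketch.
SamplePair = Tuple[Union[str, int], Any]


def get_features_as_sketch(samples: List[SamplePair], type: str):
    """Return the features of ``samples`` in the requested container."""
    if type == "list":
        return [features for _, features in samples]
    if type == "dict":
        # Features are keyed by sample name.
        return {name: features for name, features in samples}
    if type == "lazy":
        # A generator: nothing is materialised until it is iterated.
        return (features for _, features in samples)
    raise ValueError(f"Type {type!r} not allowed")


samples = [("a", [1.0, 2.0]), ("b", [3.0, 4.0])]
```

An unsupported value such as `"tensor"` falls through every branch and raises `ValueError`, matching the documented contract.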
- getLabelsAs(type: Literal['array']) ndarray[Any, dtype[Any]]
- getLabelsAs(type: Literal['pandas']) DataFrame
- getLabelsAs(type: Literal['dict']) Dict[Union[str, int], LabType]
- getLabelsAs(type: Literal['list']) List[LabType]
- getLabelsAs(type: Literal['lazy']) Iterator[LabType]
Return an object of the specified type containing the labels.
- Parameters
type – type of return. Can be one of “pandas”, “dict”, “list”, “array” or “lazy”
- Returns
an object of the specified type containing the labels
- Raises
ValueError – if the provided type is not one of the allowed ones
- lazy_type: Type[LazyIterableType]
- property type: Type[Sample[FeatType, LabType]]
Return the type of the objects in the Iterable.
- Returns
type of the object of the iterable
- abstract union(other: TDatasetUtilsMixin) DatasetUtilsMixin[FeatType, LabType]
Return a union of datasets.
- Parameters
other – other dataset to join
- Returns
union dataset
- class py4ai.data.model.ml.LazyDataset(*args, **kwargs)
Bases: LazyIterable[Sample[FeatType, LabType]], DatasetUtilsMixin[FeatType, LabType]
Class that represents a dataset derived from a lazy iterable of samples.
Return an instance of the class to be used for implementing lazy iterables.
- Parameters
items – IterGenerator containing the generator of items
- cached_type
alias of CachedDataset
- features() Iterator[FeatType]
Return an iterator over sample features.
- Returns
iterable of features
- getFeaturesAs(type: Literal['array']) ndarray[Any, dtype[Any]]
- getFeaturesAs(type: Literal['pandas']) DataFrame
- getFeaturesAs(type: Literal['dict']) Dict[Union[str, int], FeatType]
- getFeaturesAs(type: Literal['list']) List[FeatType]
- getFeaturesAs(type: Literal['lazy']) Iterator[FeatType]
Return object of the specified type containing the feature space.
- Parameters
type – type of return. Can be one of “pandas”, “dict”, “list”, “array” or “lazy”
- Returns
an object of the specified type containing the features
- getLabelsAs(type: Literal['array']) ndarray[Any, dtype[Any]]
- getLabelsAs(type: Literal['pandas']) DataFrame
- getLabelsAs(type: Literal['dict']) Dict[Union[str, int], LabType]
- getLabelsAs(type: Literal['list']) List[LabType]
- getLabelsAs(type: Literal['lazy']) Iterator[LabType]
Return an object of the specified type containing the labels.
- Parameters
type – type of return. Can be one of “pandas”, “dict”, “list”, “array” or “lazy” (an iterator)
- Returns
an object of the specified type containing the labels
- labels() Iterator[LabType]
Return an iterator over sample labels.
- Returns
iterable of labels
- lazy_type
alias of LazyDataset
- union(other: TDatasetUtilsMixin) LazyDataset[FeatType, LabType]
Perform union on LazyDatasets.
- Parameters
other – LazyDataset
- Returns
union of LazyDatasets
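One way to picture a union that stays lazy is to chain two generator factories. The sketch below is not py4ai's code: `LazySketch` is a hypothetical class, and storing a zero-argument factory is an assumption standing in for the documented IterGenerator.

```python
from itertools import chain
from typing import Callable, Iterable, Iterator


class LazySketch:
    """Hypothetical stand-in for a lazy dataset: stores a generator factory, not data."""

    def __init__(self, factory: Callable[[], Iterable]):
        self._factory = factory

    def __iter__(self) -> Iterator:
        # A fresh iterator is built on every pass, so the dataset can be re-read.
        return iter(self._factory())

    def union(self, other: "LazySketch") -> "LazySketch":
        # Nothing is consumed here; the two sources are only chained on iteration.
        return LazySketch(lambda: chain(self, other))


first = LazySketch(lambda: (i for i in range(3)))
second = LazySketch(lambda: (i for i in range(10, 12)))
combined = first.union(second)
```

Because the union holds factories rather than exhausted generators, `combined` can be iterated more than once.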
- withLookback(lookback: int) LazyDataset[FeatType, LabType]
Create a LazyDataset with features that are an array of lookback lists of samples’ features.
- Parameters
lookback – number of samples’ features to look at
- Returns
LazyDataset with changed samples
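The lookback transformation can be pictured as a sliding window over consecutive samples. The sketch below is only one plausible reading: `with_lookback_sketch` is a hypothetical function, samples are modelled as `(features, label)` pairs, and attaching the label of the newest sample to each window is an assumption that the source does not confirm.

```python
from collections import deque
from typing import Any, Iterable, Iterator, List, Tuple


def with_lookback_sketch(
    samples: Iterable[Tuple[Any, Any]], lookback: int
) -> Iterator[Tuple[List[Any], Any]]:
    """Yield (window_of_features, label) pairs from (features, label) pairs."""
    window: deque = deque(maxlen=lookback)
    for features, label in samples:
        window.append(features)
        # Emit only once the window is full; keeping the newest label is an assumption.
        if len(window) == lookback:
            yield list(window), label


series = [([1], "a"), ([2], "b"), ([3], "c"), ([4], "d")]
windows = list(with_lookback_sketch(series, lookback=2))
```

With four samples and `lookback=2`, the sketch produces three windows, each holding the features of two consecutive samples.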
- class py4ai.data.model.ml.MultiFeatureSample(features: List[ndarray[Any, dtype[Any]]], label: Optional[LabType] = None, name: Optional[str] = None)
Bases: Sample[List[ndarray], LabType]
Class representing an observation defined by a nested list of arrays.
Object representing a single sample of a training or test set.
- Parameters
features – features of the sample
label – labels of the sample (optional)
name – id of the sample (optional)
- class py4ai.data.model.ml.PandasDataset(*args, **kwargs)
Bases: Generic[FeatType, LabType], DatasetUtilsMixin[FeatType, LabType], DillSerialization
Dataset represented via pandas DataFrames for features and labels.
Return a data structure built on top of pandas DataFrames.
The PandasDataset packs features and labels together and returns them as a pandas DataFrame, a numpy array or a dictionary. For unsupervised learning tasks the labels are left as None.
- Parameters
features – a dataframe or a series of features
labels – a dataframe or a series of labels. None in case no labels are present.
- Raises
TypeError – if the labels or features are neither DataFrames nor Series
- property cached: bool
Return whether the dataset is cached in memory.
- Returns
boolean
- cached_type
alias of PandasDataset
- classmethod createObject(features: Union[DataFrame, Series], labels: Optional[Union[DataFrame, Series]]) TPandasDataset
Create a PandasDataset object.
- Parameters
features – features as pandas dataframe/series
labels – labels as pandas dataframe/series
- Returns
a PandasDataset object
- dropna(**kwargs: Any) TPandasDataset
Drop NAs from features and labels.
- Parameters
kwargs – keyword arguments passed to dropna
- Returns
PandasDataset with features and labels without NAs
- classmethod empty() TPandasDataset
Return empty object.
- Returns
Empty instance of class
- property features: DataFrame
Get features as pandas dataframe.
- Returns
pd.DataFrame
- classmethod from_sequence(datasets: Sequence[TPandasDataset]) TPandasDataset
Create a PandasDataset from a list of pandas datasets using pd.concat.
- Parameters
datasets – list of PandasDatasets
- Returns
PandasDataset
- getFeaturesAs(type: Literal['array']) ndarray[Any, dtype[Any]]
- getFeaturesAs(type: Literal['pandas']) DataFrame
- getFeaturesAs(type: Literal['dict']) Dict[Union[str, int], FeatType]
- getFeaturesAs(type: Literal['list']) List[FeatType]
- getFeaturesAs(type: Literal['lazy']) Iterator[FeatType]
Get features as numpy array, pandas dataframe or dictionary.
- Parameters
type – str, default is ‘array’, can be ‘array’, ‘pandas’, ‘dict’
- Returns
features according to the given type
- Raises
ValueError – provided type not allowed
- getLabelsAs(type: Literal['array']) ndarray[Any, dtype[Any]]
- getLabelsAs(type: Literal['pandas']) DataFrame
- getLabelsAs(type: Literal['dict']) Dict[Union[str, int], LabType]
- getLabelsAs(type: Literal['list']) List[LabType]
- getLabelsAs(type: Literal['lazy']) Iterator[LabType]
Get labels as numpy array, pandas dataframe or dictionary.
- Parameters
type – str, default is ‘array’, can be ‘array’, ‘pandas’, ‘dict’
- Returns
labels according to the given type
- Raises
ValueError – provided type not allowed
- property index: Index
Get Dataset index.
- Returns
pd.Index
- intersection() TPandasDataset
Intersect features and labels indices.
- Returns
PandasDataset with features and labels restricted to their shared indices
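The idea of intersecting indices can be shown without pandas by using dictionaries keyed by index. This is a sketch under stated assumptions: `intersect_sketch` is a hypothetical function, the dictionaries stand in for DataFrames, and keeping the order of the features index is a choice of the sketch.

```python
from typing import Any, Dict, Hashable, Tuple


def intersect_sketch(
    features: Dict[Hashable, Any], labels: Dict[Hashable, Any]
) -> Tuple[Dict[Hashable, Any], Dict[Hashable, Any]]:
    """Keep only the entries whose index appears in both mappings."""
    # Order follows the features index (an assumption of this sketch).
    shared = [idx for idx in features if idx in labels]
    return (
        {idx: features[idx] for idx in shared},
        {idx: labels[idx] for idx in shared},
    )


features = {"r1": [0.1], "r2": [0.2], "r3": [0.3]}
labels = {"r2": 1, "r3": 0, "r4": 1}
kept_features, kept_labels = intersect_sketch(features, labels)
```

Here `r1` has no label and `r4` has no features, so only `r2` and `r3` survive in both outputs.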
- property items: Iterator[Sample[FeatType, LabType]]
Get features as an iterator of Samples.
- Yield
Iterator of objects of py4ai.data.model.ml.Sample
- property labels: DataFrame
Get labels as a pandas dataframe.
- Returns
pd.DataFrame
- lazy_type
alias of LazyDataset
- loc(idx: List[Any]) TPandasDataset
Find given indices in features and labels.
- Parameters
idx – input indices
- Returns
PandasDataset with features and labels filtered on input indices
- takeAsPandas(n: int) TPandasDataset
Return top n records as a PandasDataset.
- Parameters
n – int specifying number of records to output
- Returns
PandasDataset of length n
- union(other: TPandasDataset) TPandasDataset
Return a union between PandasDatasets.
- Parameters
other – Dataset to be merged
- Returns
Dataset resulting from the merge
- class py4ai.data.model.ml.PandasTimeIndexedDataset(*args, **kwargs)
Bases: PandasDataset[FeatType, LabType], Generic[FeatType, LabType]
Class to be used for datasets that have time-indexed samples.
Return a data structure built on top of pandas DataFrames that packs time-indexed features and labels together.
Features and labels can be obtained as a pandas DataFrame, a numpy array or a dictionary. For unsupervised learning tasks the labels are left as None.
- Parameters
features – pandas dataframe/series where index elements are dates in string format
labels – pandas dataframe/series where index elements are dates in string format
- class py4ai.data.model.ml.Sample(features: FeatType, label: Optional[LabType] = None, name: Optional[Union[int, str, Any]] = None)
Bases: DillSerialization, Generic[FeatType, LabType]
Base class for representing a sample/observation.
Return an object representing a single sample of a training or test set.
- Parameters
features – features of the sample
label – labels of the sample (optional)
name – id of the sample (optional)
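The constructor signature above maps naturally onto a small generic container. The dataclass below is a hypothetical stand-in that mirrors only the documented parameters and defaults; it omits the serialization behaviour of the real class and is not taken from py4ai.

```python
from dataclasses import dataclass
from typing import Any, Generic, Optional, TypeVar, Union

FeatType = TypeVar("FeatType")
LabType = TypeVar("LabType")


@dataclass
class SampleSketch(Generic[FeatType, LabType]):
    """Hypothetical stand-in mirroring the documented constructor signature."""

    features: FeatType
    label: Optional[LabType] = None  # None for unlabelled samples
    name: Optional[Union[int, str, Any]] = None  # optional identifier


labelled = SampleSketch(features=[0.1, 0.2], label=1, name="row-0")
unlabelled = SampleSketch(features=[0.3, 0.4])
```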
- py4ai.data.model.ml.features_and_labels_to_dataset(X: Union[DataFrame, Series], y: Optional[Union[DataFrame, Series]] = None) CachedDataset[Dict[Any, Any], int]
Pack features and labels into a CachedDataset.
- Parameters
X – features which can be a pandas dataframe or a pandas series object
y – labels which can be a pandas dataframe or a pandas series object
- Returns
an instance of py4ai.data.model.ml.CachedDataset
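The packing step can be sketched without pandas by treating each row as a dictionary of feature values. Everything here is an assumption made for illustration: `pack_sketch` is a hypothetical function, plain dictionaries stand in for the DataFrame and Series inputs, and pairing rows with labels by shared index is a guess at the intended behaviour.

```python
from typing import Any, Dict, Hashable, List, Optional, Tuple

Row = Dict[Any, Any]


def pack_sketch(
    X: Dict[Hashable, Row], y: Optional[Dict[Hashable, int]] = None
) -> List[Tuple[Hashable, Row, Optional[int]]]:
    """Return one (name, features, label) triple per row of X."""
    # Rows without a matching label, or all rows when y is None, get label None.
    return [(idx, row, None if y is None else y.get(idx)) for idx, row in X.items()]


X = {0: {"height": 1.7, "weight": 65}, 1: {"height": 1.8, "weight": 80}}
y = {0: 1, 1: 0}
packed = pack_sketch(X, y)
unlabelled = pack_sketch(X)
```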