py4ai.data.model.text module
Module for providing abstraction and classes for handling NLP data.
- class py4ai.data.model.text.CachedDocuments(*args, **kwargs)
Bases: CachedIterable[Document[Any]], DocumentsUtilsMixin, DillSerialization
Class representing a collection of documents cached in memory.
Return instance of a class to be used for implementing cached iterables.
- Parameters
items – sequence or iterable of elements
- cached_type
alias of CachedDocuments
- lazy_type
alias of LazyDocuments
- to_df(fields: Optional[List[str]] = None) → DataFrame
Represent the corpus of documents as a table by unpacking provided fields as columns.
- Parameters
fields – names of the document properties to be unpacked as columns
- Returns
dataframe representing the corpus with the given fields
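The unpacking step described for to_df can be illustrated without the library. The following is a minimal sketch, not the py4ai implementation: the `Doc` alias and the `unpack_fields` helper are hypothetical names, a document is modelled as a plain (uuid, data) pair, and filling a missing field with None is an assumption (the library may raise instead).

```python
from typing import Any, Dict, Iterable, List, Optional, Tuple

# Hypothetical stand-in for a document: a (uuid, data) pair.
Doc = Tuple[str, Dict[str, Any]]


def unpack_fields(
    docs: Iterable[Doc], fields: Optional[List[str]] = None
) -> List[Dict[str, Any]]:
    """Build one row per document, with one column per requested field."""
    wanted = fields if fields is not None else []
    rows: List[Dict[str, Any]] = []
    for uuid, data in docs:
        row: Dict[str, Any] = {"uuid": uuid}
        # Assumption: a missing field becomes None; the library may raise instead.
        row.update({name: data.get(name) for name in wanted})
        rows.append(row)
    return rows


corpus = [
    ("a1", {"text": "hello world", "language": "en"}),
    ("b2", {"text": "ciao mondo", "language": "it", "author": "anon"}),
]
rows = unpack_fields(corpus, fields=["text", "language"])
for row in rows:
    print(row)
```

A list of dictionaries like `rows` can be passed directly to the `pandas.DataFrame` constructor to obtain the tabular form.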
- class py4ai.data.model.text.Document(uuid: K, data: Dict[str, Any])
Bases: Generic[K]
Document representation as a pair of uuid and dictionary of information.
Return instance of a document.
- Parameters
uuid – document id
data – document data as a dictionary
- addProperty(key: str, value: Any) → Document[K]
Generate new Document instance with given new data element.
- Parameters
key – key of the data element to add
value – value of the data element to add
- Returns
Document with new given data element
- property author: Optional[str]
Retrieve ‘author’ field.
- Returns
author data field value
- getOrThrow(key: str, default: Optional[Any] = None) → Optional[Any]
Retrieve the value associated with the given key, or return the default value if one is provided.
- Parameters
key – key to retrieve
default – default value to return
- Returns
retrieved element
- Raises
KeyError – if key not found and default not provided
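The lookup rule documented for getOrThrow (return the value, else the default, else raise KeyError) can be sketched as a standalone function. This is an illustration under assumptions, not the library code: `get_or_throw` is a hypothetical name, and it treats `None` as "no default provided", which follows from the documented signature.

```python
from typing import Any, Dict, Optional


def get_or_throw(data: Dict[str, Any], key: str, default: Optional[Any] = None) -> Any:
    """Return data[key], fall back to default, or raise KeyError."""
    try:
        return data[key]
    except KeyError:
        # Assumption: None means "no default provided".
        if default is not None:
            return default
        raise KeyError(f"Key {key!r} not found and no default provided")


data = {"text": "hello"}
print(get_or_throw(data, "text"))
print(get_or_throw(data, "author", default="unknown"))
```

One consequence of using `None` as the sentinel is that a caller cannot ask for `None` itself as the fallback value.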
- items() → Iterator[Tuple[str, Any]]
Yield data items.
- Yield
iterator with tuples of data properties names and values
- property language: Optional[str]
Retrieve ‘language’ field.
- Returns
language data field value
- property properties: Iterator[str]
Yield data properties names.
- Yield
iterator with data properties names
- removeProperty(key: str) → Document[K]
Generate new Document instance without given data element.
- Parameters
key – key of data element to remove
- Returns
Document without given data element
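Both addProperty and removeProperty are documented as generating a new Document rather than modifying the current one. A minimal sketch of that copy-on-change pattern follows; `SketchDocument` is a hypothetical stand-in written for illustration and is not the py4ai class.

```python
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass(frozen=True)
class SketchDocument:
    """Hypothetical stand-in for Document, for illustration only."""

    uuid: str
    data: Dict[str, Any] = field(default_factory=dict)

    def addProperty(self, key: str, value: Any) -> "SketchDocument":
        # Copy the data and add the new element; the original is left untouched.
        return SketchDocument(self.uuid, {**self.data, key: value})

    def removeProperty(self, key: str) -> "SketchDocument":
        # Copy every element except the one stored under the given key.
        return SketchDocument(
            self.uuid, {k: v for k, v in self.data.items() if k != key}
        )


doc = SketchDocument("a1", {"text": "hello"})
tagged = doc.addProperty("language", "en")
trimmed = tagged.removeProperty("text")
print(doc.data, tagged.data, trimmed.data)
```

Because each call returns a fresh instance, calls can be chained while earlier versions of the document remain valid.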
- setRandomUUID() → Document[bytes]
Generate new document instance with the same data as the current one but with random uuid.
- Returns
Document instance with the same data as the current one but with random uuid
- property text: Optional[str]
Retrieve ‘text’ field.
- Returns
text data field value
- class py4ai.data.model.text.DocumentsUtilsMixin(*args, **kwargs)
Bases: IterableUtilsMixin[Document[Any], LazyDocuments, CachedDocuments]
Utilities for Documents iterables.
Create a new instance of this class.
- Parameters
cls – parent object class
args – passed to the super class __new__ method
kwargs – passed to the super class __new__ method
- Raises
RuntimeError – if the cached and lazy versions were not defined before instantiating the class
- Returns
an instance of this class
- cached_type: Type[CachedIterableType]
- lazy_type: Type[LazyIterableType]
- class py4ai.data.model.text.LazyDocuments(*args, **kwargs)
Bases: LazyIterable[Document[Any]], DocumentsUtilsMixin
Class representing a collection of documents provided by a generator.
Return an instance of the class to be used for implementing lazy iterables.
- Parameters
items – IterGenerator containing the generator of items
- cached_type
alias of CachedDocuments
- lazy_type
alias of LazyDocuments
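The difference between a generator-backed collection and one cached in memory can be shown with two small stand-in classes. This sketch does not use py4ai: `SketchLazy`, `SketchCached`, `to_cached` and `read_docs` are hypothetical names, and the real classes may expose a different interface.

```python
from typing import Any, Callable, Dict, Iterator, List, Tuple

# Hypothetical stand-in for a document: a (uuid, data) pair.
Doc = Tuple[str, Dict[str, Any]]


class SketchCached:
    """Hypothetical cached collection: every item is held in a list."""

    def __init__(self, items: List[Doc]) -> None:
        self._items = list(items)

    def __iter__(self) -> Iterator[Doc]:
        return iter(self._items)

    def __len__(self) -> int:
        return len(self._items)


class SketchLazy:
    """Hypothetical lazy collection: holds a generator factory, not the items."""

    def __init__(self, factory: Callable[[], Iterator[Doc]]) -> None:
        self._factory = factory

    def __iter__(self) -> Iterator[Doc]:
        # A fresh generator is created on every pass over the collection.
        return self._factory()

    def to_cached(self) -> SketchCached:
        # Materialise the generator once and keep the result in memory.
        return SketchCached(list(self))


calls: List[str] = []


def read_docs() -> Iterator[Doc]:
    calls.append("read")  # records each time the source is actually read
    for i in range(3):
        yield (f"doc-{i}", {"text": f"text {i}"})


lazy = SketchLazy(read_docs)
first = [uuid for uuid, _ in lazy]
second = [uuid for uuid, _ in lazy]
cached = lazy.to_cached()
print(first, len(cached), len(calls))
```

In this sketch each pass over the lazy collection re-reads the source, while the cached collection can be iterated repeatedly without touching it again.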
- py4ai.data.model.text.generate_random_uuid() → bytes
Create a random number with 12 digits.
- Returns
uuid
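The description above states only that the result is a random 12-digit number returned as bytes. One way to produce such a value is sketched below. `sketch_random_uuid` is a hypothetical name, and the choice of decimal digits encoded as ASCII is an assumption; the library may derive and encode its digits differently.

```python
import secrets


def sketch_random_uuid() -> bytes:
    """Return a random 12-digit decimal number encoded as ASCII bytes."""
    # randbelow(10**12) yields an integer in [0, 10**12); zero-pad to 12 digits.
    return f"{secrets.randbelow(10**12):012d}".encode("ascii")


uid = sketch_random_uuid()
print(uid)
```

With 12 decimal digits there are 10^12 possible values, so collisions become likely well before that many documents; identifiers of this length suit small to medium corpora.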