py4ai.data.model.text module

Module for providing abstraction and classes for handling NLP data.

class py4ai.data.model.text.CachedDocuments(*args, **kwargs)

Bases: CachedIterable[Document[Any]], DocumentsUtilsMixin, DillSerialization

Class representing a collection of documents cached in memory.

Return an instance of the class to be used for implementing cached iterables.

Parameters

items – sequence or iterable of elements

cached_type

alias of CachedDocuments

lazy_type

alias of LazyDocuments

to_df(fields: Optional[List[str]] = None) DataFrame

Represent the corpus of documents as a table by unpacking provided fields as columns.

Parameters

fields – names of the document properties to be unpacked as columns

Returns

dataframe representing the corpus with the given fields
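As a rough illustration of what "unpacking fields as columns" means, the sketch below builds one row per document from plain dictionaries. It does not call py4ai: unpack_fields and the sample records are hypothetical stand-ins, and both the uuid column and the use of None for missing fields are assumptions rather than documented behaviour.

```python
from typing import Any, Dict, List, Optional, Tuple

# Hypothetical stand-in for a corpus: (uuid, data) pairs instead of Document objects.
corpus: List[Tuple[str, Dict[str, Any]]] = [
    ("doc-1", {"text": "first text", "language": "en", "author": "A"}),
    ("doc-2", {"text": "second text", "language": "it"}),
]


def unpack_fields(
    docs: List[Tuple[str, Dict[str, Any]]],
    fields: Optional[List[str]] = None,
) -> List[Dict[str, Any]]:
    """Return one row per document, keeping only the requested fields."""
    rows = []
    for uuid, data in docs:
        # With no field list, keep every property (assumed behaviour).
        selected = data if fields is None else {f: data.get(f) for f in fields}
        rows.append({"uuid": uuid, **selected})
    return rows


rows = unpack_fields(corpus, fields=["text", "author"])
```

Passing rows of this shape to pandas.DataFrame would give a table of the kind to_df is described as returning.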

class py4ai.data.model.text.Document(uuid: K, data: Dict[str, Any])

Bases: Generic[K]

Document representation as a pair of a uuid and a dictionary of information.

Return an instance of a document.

Parameters
  • uuid – document id

  • data – document data as a dictionary

addProperty(key: str, value: Any) Document[K]

Generate a new Document instance with the given data element added.

Parameters
  • key – key of the data element to add

  • value – value of the data element to add

Returns

Document with new given data element
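Because addProperty is documented as generating a new Document, the original instance is presumably left unchanged. The sketch below shows that pattern with a hypothetical SketchDocument class; it is a stand-in written for this example, not the py4ai implementation.

```python
from typing import Any, Dict


class SketchDocument:
    """Hypothetical stand-in for Document, used only to illustrate the pattern."""

    def __init__(self, uuid: Any, data: Dict[str, Any]) -> None:
        self.uuid = uuid
        self.data = data

    def addProperty(self, key: str, value: Any) -> "SketchDocument":
        # Build a fresh dictionary so the current instance is not mutated.
        return SketchDocument(self.uuid, {**self.data, key: value})


doc = SketchDocument("id-1", {"text": "hello"})
enriched = doc.addProperty("language", "en")
```

After the call, doc still holds only the "text" entry, while enriched carries both entries under the same uuid.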

property author: Optional[str]

Retrieve ‘author’ field.

Returns

author data field value

getOrThrow(key: str, default: Optional[Any] = None) Optional[Any]

Retrieve the value associated with the given key, or return the default value.

Parameters
  • key – key to retrieve

  • default – default value to return

Returns

retrieved element

Raises

KeyError – if key not found and default not provided
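The contract described above (return the value, fall back to the default, otherwise raise KeyError) can be sketched with a plain function. get_or_throw is a hypothetical helper, and treating a default of None as "not provided" is an assumption drawn from the signature, not a confirmed detail of the library.

```python
from typing import Any, Dict, Optional


def get_or_throw(data: Dict[str, Any], key: str, default: Optional[Any] = None) -> Any:
    """Hypothetical helper mirroring the documented contract."""
    if key in data:
        return data[key]
    if default is not None:
        # Assumption: None means that no default was supplied.
        return default
    raise KeyError(key)


data = {"text": "hello"}
found = get_or_throw(data, "text")
fallback = get_or_throw(data, "language", default="en")
```

Calling get_or_throw(data, "author") with no default raises KeyError in this sketch.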

items() Iterator[Tuple[str, Any]]

Yield data items.

Yield

iterator with tuples of data properties names and values

property language: Optional[str]

Retrieve ‘language’ field.

Returns

language data field value

property properties: Iterator[str]

Yield data properties names.

Yield

iterator with data properties names

removeProperty(key: str) Document[K]

Generate a new Document instance without the given data element.

Parameters

key – key of data element to remove

Returns

Document without given data element

setRandomUUID() Document[bytes]

Generate a new document instance with the same data as the current one but with a random uuid.

Returns

Document instance with the same data as the current one but with a random uuid

property text: Optional[str]

Retrieve ‘text’ field.

Returns

text data field value

class py4ai.data.model.text.DocumentsUtilsMixin(*args, **kwargs)

Bases: IterableUtilsMixin[Document[Any], LazyDocuments, CachedDocuments]

Utilities for Documents iterables.

Create a new instance of this class.

Parameters
  • cls – parent object class

  • args – passed to the super class __new__ method

  • kwargs – passed to the super class __new__ method

Raises

RuntimeError – if the cached and lazy versions were not defined before instantiating the class

Returns

an instance of this class

cached_type: Type[CachedIterableType]

lazy_type: Type[LazyIterableType]

property type: Type[Document[Any]]

Return the type of the objects in the Iterable.

Returns

Document class object

class py4ai.data.model.text.LazyDocuments(*args, **kwargs)

Bases: LazyIterable[Document[Any]], DocumentsUtilsMixin

Class representing a collection of documents provided by a generator.

Return an instance of the class to be used for implementing lazy iterables.

Parameters

items – IterGenerator containing the generator of items

cached_type

alias of CachedDocuments

lazy_type

alias of LazyDocuments
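The difference between a generator-backed collection and one cached in memory can be sketched without py4ai. LazySketch, CachedSketch and read_documents are hypothetical names introduced for this example; the sketch only assumes that a lazy collection re-runs its generator on every pass while a cached one stores the items once.

```python
from typing import Any, Callable, Dict, Iterable, Iterator, List, Tuple

Record = Tuple[str, Dict[str, Any]]
calls = {"count": 0}  # counts how many times the source is read


def read_documents() -> Iterator[Record]:
    """Hypothetical source, e.g. a file or database reader."""
    calls["count"] += 1
    yield ("doc-1", {"text": "first"})
    yield ("doc-2", {"text": "second"})


class LazySketch:
    """Re-creates the generator on every iteration."""

    def __init__(self, factory: Callable[[], Iterator[Record]]) -> None:
        self._factory = factory

    def __iter__(self) -> Iterator[Record]:
        return self._factory()


class CachedSketch:
    """Materialises the items once and iterates over the stored list."""

    def __init__(self, items: Iterable[Record]) -> None:
        self._items: List[Record] = list(items)

    def __iter__(self) -> Iterator[Record]:
        return iter(self._items)


lazy = LazySketch(read_documents)
first_pass = [uuid for uuid, _ in lazy]
second_pass = [uuid for uuid, _ in lazy]
calls_after_lazy = calls["count"]  # the source was read once per pass

cached = CachedSketch(lazy)  # reads the source one more time
cached_first = [uuid for uuid, _ in cached]
cached_second = [uuid for uuid, _ in cached]
calls_after_cached = calls["count"]  # further passes do not touch the source
```

The lazy form keeps memory use low for large corpora at the cost of re-reading the source; the cached form trades memory for repeatable, cheap iteration.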

py4ai.data.model.text.generate_random_uuid() bytes

Create a random number with 12 digits.

Returns

uuid
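One way to obtain a 12-character identifier as bytes is sketched below with the standard library. random_id_sketch is a hypothetical function, and reading "12 digits" as 12 hexadecimal characters is an assumption; the library may generate its identifiers differently.

```python
import secrets


def random_id_sketch() -> bytes:
    """Hypothetical helper: 12 hexadecimal characters encoded as bytes."""
    # token_hex(6) renders 6 random bytes as 12 hexadecimal characters.
    return secrets.token_hex(6).encode("utf-8")


new_id = random_id_sketch()
```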