finds.unstructured.vocab
Class to manage a vocabulary of words
Copyright 2022, Terence Lim
MIT License
- class finds.unstructured.vocab.Vocab(words: List = [], unk: str = '<UNK>')
Bases: object
Class for managing a vocabulary of words
- Parameters:
words – List of words to index
unk – String representation of the unknown word
- get_index(words: str | List) → int | List
Return the indexes of a word or list of words, optionally dropping unknown words
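As an illustration of the lookup behavior described for get_index, here is a minimal sketch. MiniVocab is a hypothetical stand-in written for this example, not the library's implementation. Reserving index 0 for the unknown token and the drop_unk flag name are both assumptions, since the signature above does not show how unknown words are dropped.

```python
from typing import Dict, List, Optional, Union


class MiniVocab:
    """Hypothetical stand-in for Vocab; not the finds implementation."""

    def __init__(self, words: Optional[List[str]] = None, unk: str = "<UNK>"):
        # Assumption: the unknown token takes index 0 and known words follow.
        self.unk = unk
        self.index: Dict[str, int] = {unk: 0}
        for word in words or []:
            # setdefault keeps the first index if a word is repeated.
            self.index.setdefault(word, len(self.index))

    def get_index(self, words: Union[str, List[str]],
                  drop_unk: bool = False) -> Union[int, List[int]]:
        """Map a single word or a list of words to integer indexes."""
        unk_id = self.index[self.unk]
        if isinstance(words, str):
            return self.index.get(words, unk_id)
        ids = [self.index.get(w, unk_id) for w in words]
        # drop_unk is an assumed name for "optionally drop unknown words".
        return [i for i in ids if i != unk_id] if drop_unk else ids


vocab = MiniVocab(["stock", "bond", "yield"])
print(vocab.get_index("bond"))                                      # 2
print(vocab.get_index(["bond", "equity", "yield"]))                 # [2, 0, 3]
print(vocab.get_index(["bond", "equity", "yield"], drop_unk=True))  # [2, 3]
```

In this sketch, a single string returns a single int and a list returns a list, mirroring the `str | List` to `int | List` shape of the documented signature.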
- set_embeddings(embeddings: DataFrame) → DataFrame
Relativize and index embeddings to words in vocab
- tokenize(text)
A default tokenizer that wraps nltk's RegexpTokenizer
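To give a rough idea of what a regular-expression tokenizer does, the sketch below uses the standard library's re module as a stand-in for nltk's RegexpTokenizer, so that it runs without extra dependencies. The `\w+` pattern and the lowercasing step are assumptions made for this example; the pattern the library actually passes to RegexpTokenizer is not documented here.

```python
import re
from typing import List

# Assumed pattern: runs of word characters (letters, digits, underscore).
WORD_PATTERN = re.compile(r"\w+")


def tokenize(text: str) -> List[str]:
    """Lowercase the text and return every match of WORD_PATTERN, in order."""
    return WORD_PATTERN.findall(text.lower())


tokens = tokenize("Rates rose; bond yields fell.")
print(tokens)  # ['rates', 'rose', 'bond', 'yields', 'fell']
```

Punctuation is discarded because it never matches the pattern. A different pattern would change what counts as a token, for example whether hyphenated words stay together.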
- property dim: int
Returns the dimensionality of the embedding vectors