finds.unstructured.vocab

Class to manage a vocabulary of words

Copyright 2022, Terence Lim

MIT License

class finds.unstructured.vocab.Vocab(words: List = [], unk: str = '<UNK>')

Bases: object

Class for managing a vocabulary of words

Parameters:
  • words – List of words to create index

  • unk – str representation of unknown word

__getitem__(item: str | int) → int | str

Return the index of a str item, or the word at an int index
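
The two-way lookup described above can be sketched as follows. This is a minimal illustration, not the library's implementation: the class name MiniVocab, its attributes, and the choice to place the unknown token at index 0 are all assumptions made for the example.

```python
from typing import Dict, List, Union


class MiniVocab:
    """Hypothetical stand-in for illustration; not the finds Vocab class."""

    def __init__(self, words: List[str], unk: str = "<UNK>"):
        # Assumption: the unknown token is stored first, at index 0
        self.unk = unk
        self.index_to_word: List[str] = [unk]
        self.word_to_index: Dict[str, int] = {unk: 0}
        for word in words:
            if word not in self.word_to_index:
                self.word_to_index[word] = len(self.index_to_word)
                self.index_to_word.append(word)

    def __getitem__(self, item: Union[str, int]) -> Union[int, str]:
        # A str is looked up as a word; anything else is treated as a position
        if isinstance(item, str):
            return self.word_to_index.get(item, self.word_to_index[self.unk])
        return self.index_to_word[item]


vocab = MiniVocab(["alpha", "beta"])
print(vocab["alpha"], vocab[2], vocab["gamma"])
```

Dispatching on the argument type lets one subscript operator serve both directions; in this sketch a word outside the vocabulary maps to the index of the unknown token.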

dump(filename: str) → Self

Dump vocab to file

get_embeddings(word: str | List) → array

Return the embedding vector(s) of a word or list of words

get_index(words: str | List) → int | List

Return the indexes of a word or list of words, optionally dropping unknown words

get_word(index: int | List) → str | List

Return words of indexes
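
A rough sketch of list-based lookup in both directions is shown below. The function names lookup_indexes and lookup_words, the sample words, and the drop_unk flag are hypothetical; the signature documented above does not show how the library exposes the option to drop unknown words.

```python
from typing import Dict, List

UNK = "<UNK>"
words: List[str] = [UNK, "bond", "equity", "yield"]           # index -> word
index: Dict[str, int] = {w: i for i, w in enumerate(words)}   # word -> index


def lookup_indexes(tokens: List[str], drop_unk: bool = False) -> List[int]:
    """Map tokens to indexes; drop_unk is a hypothetical flag."""
    found = [index.get(token, index[UNK]) for token in tokens]
    if drop_unk:
        # Remove positions that resolved to the unknown token
        found = [i for i in found if i != index[UNK]]
    return found


def lookup_words(indexes: List[int]) -> List[str]:
    """Map indexes back to words."""
    return [words[i] for i in indexes]


kept_all = lookup_indexes(["yield", "coupon", "bond"])
kept_known = lookup_indexes(["yield", "coupon", "bond"], drop_unk=True)
```

Here "coupon" is not in the sample vocabulary, so it resolves to the unknown index in kept_all and is removed from kept_known.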

load(filename: str) → Self

Load vocab from file
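
Because dump and load are both documented as returning Self, calls can be chained. The sketch below illustrates that pattern under stated assumptions: the class SavedVocab is hypothetical, and JSON is used only for the example, since the on-disk format used by the library is not documented here.

```python
import json
import os
import tempfile
from typing import List, Optional


class SavedVocab:
    """Hypothetical class for illustration; the JSON format is an assumption."""

    def __init__(self, words: Optional[List[str]] = None):
        self.words: List[str] = list(words or [])

    def dump(self, filename: str) -> "SavedVocab":
        with open(filename, "w", encoding="utf-8") as f:
            json.dump(self.words, f)
        return self  # returning self allows method chaining

    def load(self, filename: str) -> "SavedVocab":
        with open(filename, "r", encoding="utf-8") as f:
            self.words = json.load(f)
        return self


path = os.path.join(tempfile.mkdtemp(), "vocab.json")
SavedVocab(["rate", "spread"]).dump(path)
restored = SavedVocab().load(path)
```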

set_embeddings(embeddings: DataFrame) → DataFrame

Relativize and index embeddings to words in vocab

tokenize(text)

A default tokenizer that wraps nltk's RegexpTokenizer
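
For readers unfamiliar with regular-expression tokenizers, the sketch below shows the general idea using only the standard library re module rather than nltk. The function name simple_tokenize and the pattern are assumptions; the pattern the library actually passes to RegexpTokenizer is not stated in this reference.

```python
import re
from typing import List

# Assumed pattern: each run of letters, digits, or underscores is one token
TOKEN_PATTERN = re.compile(r"\w+")


def simple_tokenize(text: str) -> List[str]:
    """Return every non-overlapping match of the pattern, in order."""
    return TOKEN_PATTERN.findall(text)


tokens = simple_tokenize("Rates rose 25bp; bonds fell.")
print(tokens)
```

Punctuation and whitespace fall outside the pattern, so they separate tokens without appearing in the result. This sketch does not change case.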

update(words: List)

Update the vocab with words, converted to lower case
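
One way such an update could behave is sketched below. The helper add_words and the module-level dictionary are hypothetical, and skipping words that are already present is an assumption rather than documented behavior.

```python
from typing import Dict, List

word_to_index: Dict[str, int] = {"<UNK>": 0}


def add_words(new_words: List[str]) -> None:
    """Add lower-cased words, assigning the next free index to each new one."""
    for word in new_words:
        key = word.lower()
        if key not in word_to_index:  # assumption: duplicates are ignored
            word_to_index[key] = len(word_to_index)


add_words(["Bond", "bond", "Yield"])
```

Lower-casing before insertion means "Bond" and "bond" share a single entry in this sketch.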

property dim: int

Return the dimensionality of the embedding vectors