finds.unstructured.unstructured

Classes for unstructured and textual datasets

Copyright 2022, Terence Lim

MIT License

class finds.unstructured.unstructured.Unstructured(mongodb: MongoDB, database: str)[source]

Bases: object

Base class for unstructured datasets

Parameters:
  • mongodb – connection to MongoClient where data collection is stored

  • database – name of the database in MongoDB

Variables:

db – pymongo.database.Database connection

Examples:

>>> fomc = Unstructured(mongodb, 'fomc')  # connect to database named 'fomc'
>>> fomc.show()
>>> fomc.select('minutes', where_clause)
>>> fomc.delete('minutes', where_clause)
>>> fomc.insert('minutes', doc)
>>> fomc['minutes'].estimated_document_count() # count docs in collection
>>> fomc['minutes', 'field']

Notes:

  • sudo apt-get install -y mongodb-org  # install latest community version

  • sudo systemctl start mongod  # start and stop mongodb server

  • sudo systemctl status mongod

  • sudo systemctl restart mongod

  • sudo systemctl stop mongod

__getitem__(args: Tuple | Any) → Any[source]

Access a collection by name, or optionally by field
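
As a rough illustration of the two access forms shown in the class examples (fomc['minutes'] and fomc['minutes', 'field']), the sketch below shows how a __getitem__ can tell a bare name from a (name, field) tuple. The MiniStore class and its dict-backed storage are hypothetical stand-ins for illustration, not the library's implementation.

```python
from typing import Any, Dict, List, Tuple, Union


class MiniStore:
    """Hypothetical in-memory stand-in for a database of collections."""

    def __init__(self, collections: Dict[str, List[dict]]):
        self._collections = collections

    def get(self, collection: str, field: str) -> Any:
        # Value of `field` in the first doc that contains it, else None
        for doc in self._collections.get(collection, []):
            if field in doc:
                return doc[field]
        return None

    def __getitem__(self, args: Union[Tuple[str, str], str]) -> Any:
        # store['name', 'field'] arrives as a tuple; store['name'] as a str
        if isinstance(args, tuple):
            collection, field = args
            return self.get(collection, field)
        return self._collections[args]


store = MiniStore({"minutes": [{"date": 1}, {"date": 2, "text": "hello"}]})
print(store["minutes", "text"])  # field lookup across docs
print(len(store["minutes"]))     # the whole collection
```

Python passes a comma-separated subscript as a single tuple, which is why one method can serve both forms.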

delete(collection: str, where: str | Dict | List) → int[source]

Delete all docs in collection satisfying where clause

Parameters:
  • collection – name of collection in database to delete from

  • where – where clause describing documents to delete

Returns:

number of documents deleted, -1 if collection not in database

Notes:

The where clause may be given in one of three forms:

  • str filter (passed on directly to pymongo)

  • dict of {keys:values}

  • list of key names (to delete if key name $exists)
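
The three where-clause forms above could be normalized into a MongoDB-style query filter roughly as follows. This is a minimal sketch: build_filter is a hypothetical helper, and requiring every listed key to exist (rather than any one of them) is an assumption, since the source does not say how multiple key names combine.

```python
from typing import Any, Dict, List, Union


def build_filter(where: Union[str, Dict, List]) -> Any:
    """Hypothetical helper: normalize a where clause into a query filter."""
    if isinstance(where, list):
        # List of key names: match docs in which every listed key exists
        # (assumption: keys are combined with AND)
        return {key: {"$exists": True} for key in where}
    # dict of {key: value} and str filters are handed on unchanged
    return where


print(build_filter(["date", "text"]))
print(build_filter({"date": 20220126}))
```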

get(collection: str, field: str) → Any[source]

Return the value of a field from the first doc that contains that field

Parameters:
  • collection – name of collection in database to retrieve from

  • field – key field name

Returns:

value of key field of first document where key field name exists

insert(collection: str, doc: Dict, keys: List[str] = [])[source]

Insert one doc; optionally remove existing duplicate document first

Parameters:
  • collection – name of collection in database to insert into

  • doc – dict of {key:value} representing document

  • keys – list of field names; existing docs whose values match doc on these fields are deleted first

Returns:

number of existing documents (with same key values) deleted
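
The delete-then-insert behaviour described above can be pictured with plain lists and dicts. This is only a sketch under stated assumptions: insert_unique is a hypothetical function, the list stands in for a collection, and a doc counts as a duplicate only when it matches on every key in keys.

```python
from typing import Dict, List, Optional


def insert_unique(docs: List[dict], doc: Dict,
                  keys: Optional[List[str]] = None) -> int:
    """Append doc to docs, first dropping docs that match it on all keys.

    Returns the number of existing docs removed.
    """
    deleted = 0
    if keys:
        def is_duplicate(d: dict) -> bool:
            return all(k in d and d[k] == doc.get(k) for k in keys)

        kept = [d for d in docs if not is_duplicate(d)]
        deleted = len(docs) - len(kept)
        docs[:] = kept  # modify the "collection" in place
    docs.append(doc)
    return deleted


minutes = [{"date": 1, "text": "old"}, {"date": 2, "text": "other"}]
removed = insert_unique(minutes, {"date": 1, "text": "new"}, keys=["date"])
print(removed, minutes)
```

With keys left empty nothing is removed, so repeated inserts simply accumulate documents.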

load_dataframe(collection: str, df: DataFrame, keys: List[str] = [], update: bool = False)[source]

Insert many records, one per row of a DataFrame, into a collection

Parameters:
  • collection – Name of collection in database to insert into

  • df – Each row of DataFrame is a document, with column names as key fields

  • keys – Field names to update or replace on if values are the same

  • update – If key fields have same value, update the existing doc if True, else replace it
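
The difference between updating and replacing a matching document can be sketched in plain Python. The function upsert_rows below is hypothetical, and the semantics are an assumption modelled on MongoDB's distinction between updating fields and replacing a whole document: "update" merges the row's fields into the existing doc, while "replace" swaps the doc for the row. The rows list stands in for what df.to_dict('records') would produce from a DataFrame.

```python
from typing import Dict, List


def upsert_rows(docs: List[dict], rows: List[Dict],
                keys: List[str], update: bool = False) -> None:
    """Insert each row into docs; on a key match, update or replace."""
    for row in rows:
        match = None
        if keys:
            # First existing doc with the same values on all key fields
            match = next(
                (d for d in docs
                 if all(k in d and d[k] == row.get(k) for k in keys)),
                None,
            )
        if match is None:
            docs.append(dict(row))   # no duplicate: plain insert
        elif update:
            match.update(row)        # keep other fields, overwrite given ones
        else:
            match.clear()            # drop all old fields, then copy the row
            match.update(row)


rows = [{"date": 1, "a": 9}]  # stand-in for df.to_dict('records')
updated = [{"date": 1, "a": 1, "b": 2}]
replaced = [{"date": 1, "a": 1, "b": 2}]
upsert_rows(updated, rows, keys=["date"], update=True)
upsert_rows(replaced, rows, keys=["date"], update=False)
print(updated)
print(replaced)
```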

select(collection, where: str | List | Dict = [], include_id: bool = False) → List[source]

Retrieve docs in collection satisfying where clause

Parameters:
  • collection – Name of collection in database to retrieve from

  • where – Where clause describing documents to retrieve

  • include_id – If True, then include _id field in return

Returns:

Documents satisfying the where clause, as a list of dicts
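
To make the effect of include_id concrete, here is a minimal in-memory sketch. The select_docs function is hypothetical and handles only the dict form of the where clause with exact-match values; it is not how the library queries MongoDB.

```python
from typing import Dict, List, Optional


def select_docs(docs: List[dict], where: Optional[Dict] = None,
                include_id: bool = False) -> List[dict]:
    """Return copies of docs whose fields equal every {key: value} in where."""
    where = where or {}
    selected = []
    for doc in docs:
        if all(k in doc and doc[k] == v for k, v in where.items()):
            if include_id:
                selected.append(dict(doc))
            else:
                # Drop the _id field from the returned copy
                selected.append({k: v for k, v in doc.items() if k != "_id"})
    return selected


minutes = [
    {"_id": 101, "date": 1, "text": "first"},
    {"_id": 102, "date": 2, "text": "second"},
]
print(select_docs(minutes, {"date": 2}))
print(select_docs(minutes, {"date": 2}, include_id=True))
```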

show(collection: str = '')[source]

Return the list of collections in the database or, if a collection is given, the key names found across all docs in that collection