Business Textual Analysis#

"You shall know a word by the company it keeps" – J. R. Firth

Text mining techniques allow new insights to be extracted from unstructured text documents. We retrieve business description text from 10-K filings and use the spaCy NLP package for syntactic analysis, including part-of-speech tagging, named entity recognition, and dependency parsing. Additionally, we explore dimensionality reduction techniques to visualize and cluster companies based on the relationships between their business descriptions, represented as word embeddings, in a lower-dimensional space.

# By: Terence Lim, 2020-2025 (terence-lim.github.io)
import re
import numpy as np
from scipy import spatial
import pandas as pd
from pandas import DataFrame, Series
import matplotlib.pyplot as plt
import spacy
from sklearn import cluster
from sklearn.decomposition import PCA
from tqdm import tqdm
from finds.database import SQL, RedisDB
from finds.structured import CRSP, BusDay
from finds.unstructured import Edgar
from finds.utils import Store, Finder, ColorMap
from secret import credentials, paths
# %matplotlib qt
VERBOSE = 0
sql = SQL(**credentials['sql'], verbose=VERBOSE)
user = SQL(**credentials['user'], verbose=VERBOSE)
bd = BusDay(sql)
rdb = RedisDB(**credentials['redis'])
crsp = CRSP(sql, bd, rdb, verbose=VERBOSE)
ed = Edgar(paths['10X'], zipped=True, verbose=VERBOSE)
store = Store(paths['scratch'])
find = Finder(sql)

begdate, enddate = 20240101, 20241231

Retrieve the usual investment universe and retain only the largest size decile (based on NYSE market cap breakpoints).

# Retrieve universe of stocks
univ = crsp.get_universe(bd.endmo(begdate, -1))
comnam = crsp.build_lookup('permno', 'comnam', fillna="")  # company name
univ['comnam'] = comnam(univ.index)
ticker = crsp.build_lookup('permno', 'ticker', fillna="")  # tickers
univ['ticker'] = ticker(univ.index)

Extract Business Description text from 10-K filings.

# retrieve business descriptions from 10-K filings
item, form = 'bus10K', '10-K'
rows = DataFrame(ed.open(form=form, item=item))
found = rows[rows['permno'].isin(univ.index[univ.decile <= 1])  # largest decile only
             & rows['date'].between(begdate, enddate)]\
             .drop_duplicates(subset=['permno'], keep='last')\
             .set_index('permno')

Syntactic analysis#

Syntactic analysis examines the roles of words in sentences and how they combine to form phrases and larger linguistic structures. This process helps model relationships such as subject-verb-object dependencies, which are fundamental for NLP tasks like dependency and constituent parsing.

spaCy#

spaCy is a widely used open-source Python library for advanced NLP tasks, including POS tagging, named entity recognition (NER), and dependency parsing. It provides pre-trained models for various languages and domains, as well as customizable pipelines for processing text data.

# ! python -m spacy download en_core_web_lg
nlp = spacy.load("en_core_web_lg")

Lemmatization#

Lemmatization reduces a word to its base or dictionary form (lemma), representing its morphological root.

# Using spaCy pipeline to tokenize NVIDIA's 10-K business description text
nvidia = find('NVIDIA')['permno'].iloc[0]
doc = nlp(ed[found.loc[nvidia, 'pathname']][:nlp.max_length].lower())
tokens = DataFrame.from_records([{'text': token.text,
                                  'lemma': token.lemma_,
                                  'alpha': token.is_alpha,
                                  'stop': token.is_stop,
                                  'punct': token.is_punct}
                                 for token in doc], index=range(len(doc)))
tokens.head(30)
text lemma alpha stop punct
0 item item True False False
1 1 1 False False False
2 . . False False True
3 business business True False False
4 \n\n \n\n False False False
5 our our True True False
6 company company True False False
7 \n\n \n\n False False False
8 nvidia nvidia True False False
9 pioneered pioneer True False False
10 accelerated accelerate True False False
11 computing computing True False False
12 to to True True False
13 help help True False False
14 solve solve True False False
15 the the True True False
16 most most True True False
17 challenging challenging True False False
18 computational computational True False False
19 problems problem True False False
20 . . False False True
21 nvidia nvidia True False False
22 is be True True False
23 now now True True False
24 a a True True False
25 full full True True False
26 - - False False True
27 stack stack True False False
28 computing computing True False False
29 infrastructure infrastructure True False False

Part-of-speech#

Part-of-speech (POS) tagging assigns grammatical categories (e.g., noun, verb, adjective) to words in a text corpus. This aids in understanding sentence structure and extracting meaning by identifying the roles of words within sentences.

tags = DataFrame.from_records([{'text': token.text,
                                'pos': token.pos_,
                                'tag': token.tag_,
                                'dep': token.dep_}
                               for token in doc], index=range(len(doc)))
tags.head(30)
text pos tag dep
0 item NOUN NN ROOT
1 1 NUM CD nummod
2 . PUNCT . punct
3 business NOUN NN nsubj
4 \n\n SPACE _SP dep
5 our PRON PRP$ poss
6 company NOUN NN appos
7 \n\n SPACE _SP dep
8 nvidia PROPN NNP appos
9 pioneered VERB VBD ROOT
10 accelerated VERB VBD xcomp
11 computing NOUN NN dobj
12 to PART TO aux
13 help VERB VB advcl
14 solve VERB VB xcomp
15 the DET DT det
16 most ADV RBS advmod
17 challenging ADJ JJ amod
18 computational ADJ JJ amod
19 problems NOUN NNS dobj
20 . PUNCT . punct
21 nvidia PROPN NNP nsubj
22 is AUX VBZ ROOT
23 now ADV RB advmod
24 a DET DT det
25 full ADJ JJ amod
26 - PUNCT HYPH punct
27 stack NOUN NN compound
28 computing NOUN NN compound
29 infrastructure NOUN NN compound

Named entity recognition#

Named Entity Recognition (NER) identifies and categorizes named entities (e.g., people, organizations, locations, dates) in text. This process helps classify textual data into meaningful categories.

ents = DataFrame.from_records([{'text': ent.text,
                                'label': ent.label_,
                                'start': ent.start_char,
                                'end': ent.end_char}
                               for ent in doc.ents], index=range(len(doc.ents)))
ents.head(20)
text label start end
0 1 CARDINAL 5 6
1 nvidia PERSON 133 139
2 as well as hundreds CARDINAL 352 371
3 healthcare ORG 853 863
4 tens of thousands CARDINAL 1012 1029
5 gpu ORG 1033 1036
6 gpu ORG 1239 1242
7 today DATE 1347 1352
8 thousands CARDINAL 1498 1507
9 gpu ORG 1771 1774
10 thousands CARDINAL 1819 1828
11 gpus GPE 2594 2598
12 multi-billion-dollar MONEY 2708 2728
13 third ORDINAL 2849 2854
14 over 45.3 billion MONEY 3100 3117
15 gpu ORG 3248 3251
16 1999 DATE 3255 3259
17 2006 DATE 3391 3395
18 gpu ORG 3451 3454
19 2012 DATE 3557 3561
# Entity Visualizer
from spacy import displacy
displacy.render(doc[:300], style="ent", jupyter=True)
item 1 CARDINAL . business

our company

nvidia pioneered accelerated computing to help solve the most challenging computational problems. nvidia PERSON is now a full-stack computing infrastructure company with data-center-scale offerings that are reshaping industry.

our full-stack includes the foundational cuda programming model that runs on all nvidia gpus, as well as hundreds CARDINAL of domain-specific software libraries, software development kits, or sdks, and application programming interfaces, or apis. this deep and broad software stack accelerates the performance and eases the deployment of nvidia accelerated computing for computationally intensive workloads such as artificial intelligence, or ai, model training and inference, data analytics, scientific computing, and 3d graphics, with vertical-specific optimizations to address industries ranging from healthcare ORG and telecom to automotive and manufacturing.

our data-center-scale offerings are comprised of compute and networking solutions that can scale to tens of thousands CARDINAL of gpu ORG -accelerated servers interconnected to function as a single giant computer; this type of data center architecture and scale is needed for the development and deployment of modern ai applications.

the gpu ORG was initially used to simulate human imagination, enabling the virtual worlds of video games and films. today DATE , it also simulates human intelligence, enabling a deeper understanding of the physical world. its parallel processing capabilities, supported by thousands CARDINAL of computing cores, are essential for deep learning algorithms. this form of ai, in which software writes itself by learning from large amounts of data, can serve as the brain of computers, robots and self-driving cars that can perceive and understand the

Dependency parsing#

Dependency parsing determines grammatical relationships between words in a sentence, representing these relationships as a tree structure where each word (except the root) depends on another word (its head). This technique helps identify syntactic roles, such as subjects, objects, and modifiers.
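For instance, the dependency labels can be read off the parse to pull out subject-verb-object triples. Below is a minimal sketch; the svo helper is illustrative rather than part of spaCy, and the exact triples extracted depend on the model:

# Sketch: extract (subject, verb, object) triples by walking dependency labels
def svo(doc):
    triples = []
    for tok in doc:
        if tok.pos_ == "VERB":
            subjects = [c for c in tok.children if c.dep_ in ("nsubj", "nsubjpass")]
            objects = [c for c in tok.children if c.dep_ in ("dobj", "obj")]
            if subjects and objects:
                triples.append((subjects[0].text, tok.text, objects[0].text))
    return triples

svo(nlp("nvidia designs chips and sells software"))
# e.g. [('nvidia', 'designs', 'chips')]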

Transition-based parsing algorithms use a set of transition operations (e.g., shift, reduce) to incrementally build a dependency tree from an input sentence.
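To make the mechanics concrete, here is a minimal sketch of the arc-standard system executing a hand-specified transition sequence on a toy sentence (illustrative only; a trained parser like spaCy's learns which transition to take at each step):

# Arc-standard transitions: SHIFT moves the next word onto the stack;
# LEFT-ARC / RIGHT-ARC attach one of the top two stack words to the other
def transition_parse(words, transitions):
    stack, buffer, arcs = [], list(range(len(words))), []
    for t in transitions:
        if t == "SHIFT":
            stack.append(buffer.pop(0))
        elif t == "LEFT-ARC":     # top of stack heads the word beneath it
            arcs.append((stack[-1], stack.pop(-2)))
        elif t == "RIGHT-ARC":    # word beneath heads the top of stack
            arcs.append((stack[-2], stack.pop()))
    return arcs                   # list of (head index, dependent index)

transition_parse(["nvidia", "pioneered", "computing"],
                 ["SHIFT", "SHIFT", "LEFT-ARC", "SHIFT", "RIGHT-ARC"])
# [(1, 0), (1, 2)]: 'pioneered' heads both 'nvidia' and 'computing'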

Unlike dependency parsing, constituent parsing focuses on identifying and representing the hierarchical structure of phrases in a sentence based on formal grammar rules. It groups words into nested syntactic units (e.g., noun phrases and verb phrases) and represents them in a tree structure.

The CKY (Cocke-Kasami-Younger) algorithm is a dynamic programming technique used for parsing sentences and constructing parse trees.

Probabilistic Context-Free Grammar (PCFG) extends standard Context-Free Grammar (CFG) by assigning probabilities to production rules, indicating the likelihood of specific grammatical structures. Each rule defines how non-terminal symbols (e.g., NP for noun phrase) expand into words or other non-terminals, guiding sentence generation and parsing.
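A minimal probabilistic CKY sketch over a toy PCFG in Chomsky normal form (the grammar, probabilities, and sentence are invented for illustration):

from collections import defaultdict

# toy PCFG in Chomsky normal form: unary lexical rules and binary rules
lexicon = {('NP', 'nvidia'): 0.5, ('VB', 'designs'): 1.0,
           ('DT', 'the'): 1.0, ('NN', 'chips'): 0.5, ('NN', 'company'): 0.5}
binary = {('S', 'NP', 'VP'): 1.0, ('NP', 'DT', 'NN'): 0.5, ('VP', 'VB', 'NP'): 1.0}

def cky(words):
    """Dynamic program: best[(i, j, A)] = max probability that A derives words[i:j]"""
    n = len(words)
    best = defaultdict(float)
    for i, w in enumerate(words):            # initialize length-1 spans from the lexicon
        for (A, word), p in lexicon.items():
            if word == w:
                best[(i, i + 1, A)] = max(best[(i, i + 1, A)], p)
    for span in range(2, n + 1):             # combine shorter spans bottom-up
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):        # try every split point
                for (A, B, C), p in binary.items():
                    q = p * best[(i, k, B)] * best[(k, j, C)]
                    best[(i, j, A)] = max(best[(i, j, A)], q)
    return best[(0, n, 'S')]

cky("nvidia designs the chips".split())      # 0.125, the probability of the best parse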

Models for automatic tagging and parsing rely on labeled datasets known as treebanks, which contain syntactically annotated sentences. The Penn Treebank is a widely used treebank for English, providing annotations for POS tags and parse trees.

sentence_spans = list(doc.sents)
displacy.render(sentence_spans[2:4], style="dep", jupyter=True, 
                options=dict(compact=False, distance=175))
[displaCy dependency parse of two sentences from NVIDIA's business description: arrows link each token to its head, labeled with relations such as nsubj, advmod, amod, compound, and dobj.]

Semantic similarity#

Word vectors#

Word vectors are numerical representations of words in a multidimensional space, learned from their co-occurrence patterns in large text corpora. Words with similar syntactic and semantic meanings tend to have vector representations that are close together in this space.

For example, spaCy’s en_core_web_lg model represents over 500,000 words using 300-dimensional vectors.
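A quick sketch of what the model exposes (shapes and similarity scores depend on the installed model version):

# static word vectors: one 300-dimensional vector per vocabulary entry
words = nlp("semiconductor software retail")
print(nlp.vocab.vectors.shape)          # (number of vectors, 300)
print(words[0].vector.shape)            # (300,)
print(words[0].similarity(words[1]),    # pairwise cosine similarities between tokens
      words[0].similarity(words[2]))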

Extract lemmatized noun forms from business descriptions using spaCy’s POS tagger:

# Extract nouns
bus = {}
for permno in tqdm(found.index):
    doc = nlp(ed[found.loc[permno, 'pathname']][:nlp.max_length].lower())
    nouns = " ".join([re.sub("[^a-zA-Z]+", "", token.lemma_) for token in doc
                      if token.pos_ in ['NOUN'] and len(token.lemma_) > 2])
    if len(nouns) > 100:
        bus[permno] = nouns        
store['business'] = bus
100%|██████████| 192/192 [03:12<00:00,  1.00s/it]
bus = store.load('business')
permnos = list(bus.keys())
tickers = univ.loc[permnos, 'ticker'].to_list()

Compute the average word vector for NVIDIA’s business description text:

# example of word vector
vec1 = nlp(bus[nvidia]).vector
vec1
array([-0.5495549 ,  0.0522468 , -0.70095205,  1.1134154 ,  2.659689  ,
        0.29627848,  1.3154105 ,  3.8272932 , -2.232087  , -1.3053178 ,
        6.069174  ,  2.0604212 , -4.542866  ,  2.3177896 , -1.1287518 ,
        2.3917935 ,  3.1968606 ,  1.5996909 , -2.3438275 ,  0.03434967,
        0.22686806,  1.7824569 , -2.384547  ,  0.8608239 , -1.2319311 ,
       -1.774604  , -1.8425854 , -1.7403452 , -0.7102895 ,  1.0869901 ,
        1.2046682 ,  1.2530138 , -1.1417824 , -0.4984767 ,  0.34321743,
       -0.37546915,  1.4804035 ,  0.8114897 ,  1.3119912 ,  0.38791072,
        0.25189775, -0.1770816 ,  0.1785395 ,  1.0146813 , -1.4704382 ,
        1.6199547 ,  2.021769  , -1.9505422 ,  0.4602281 , -1.281002  ,
        0.07107421,  2.3507724 , -0.18837918, -3.80177   , -0.54604673,
        0.5786306 , -1.7812697 ,  1.5003949 ,  0.40720284, -1.5742034 ,
        2.2474368 ,  1.2563457 , -2.2915537 , -1.2388986 ,  2.408017  ,
        1.9807013 , -2.2583349 , -3.3942797 ,  0.5241013 ,  3.0477126 ,
       -0.97571445,  0.7010974 , -1.2003129 ,  0.38448045, -0.30499592,
        1.3931054 , -1.3777359 ,  0.99693114, -1.8231292 , -0.1508515 ,
       -2.595402  , -0.6554817 ,  0.96918464,  1.4329529 , -0.2420673 ,
       -0.08297056, -1.2713333 , -1.8489853 ,  0.77105236, -0.235054  ,
       -1.2015461 ,  1.0783287 ,  1.5740684 , -2.2493582 ,  0.43573543,
       -0.56152374,  0.58803385, -0.5924455 ,  0.8756682 ,  1.2502667 ,
        3.0838132 ,  0.3415912 ,  1.887193  ,  1.8639498 ,  0.2804779 ,
        4.218261  , -1.1121116 , -1.805579  , -0.22270828, -2.56832   ,
        2.4099169 , -0.22422643, -1.1543158 ,  0.06765282,  0.83664316,
        1.7835015 , -2.6237123 , -1.3570241 , -0.46247697, -2.476458  ,
       -1.9124763 , -2.6474092 ,  0.44025388,  1.1519567 , -0.42653838,
       -2.8462713 ,  0.2980864 , -2.9798195 ,  2.9588706 , -1.819657  ,
       -2.464192  ,  0.32522804,  3.3010955 ,  0.7937097 , -0.15216802,
       -0.42828757, -0.9942988 , -0.44921628,  1.9312432 ,  0.28458852,
       -0.6386989 , -0.87969756, -0.13129689,  0.85792863,  1.7823339 ,
        0.16759728, -3.5592077 , -0.2112106 ,  0.06164274,  3.368905  ,
       -0.3132938 ,  1.2434231 ,  0.21065742,  1.2389091 , -0.66589624,
        0.5398899 ,  2.8091311 ,  1.7358444 , -0.91453594, -2.6195803 ,
       -0.9402572 , -1.2330531 ,  0.71588945,  1.699086  , -1.72446   ,
       -1.1087135 , -2.044233  ,  0.5116665 ,  0.7511035 , -1.510701  ,
       -1.5667175 , -0.22543517, -0.37966043,  0.95050013,  2.100138  ,
        1.966492  ,  0.6108724 , -0.31808442, -1.9397378 , -1.2919445 ,
       -1.7783433 ,  1.7884035 ,  1.1175009 , -1.9068832 , -1.0304377 ,
        0.6533309 , -0.94094557, -1.4831365 ,  1.5252017 ,  1.73394   ,
        0.05441582, -1.1678139 ,  0.10588835, -1.8394533 ,  1.820474  ,
        0.70712596, -2.8049684 , -0.1084957 ,  0.5009497 , -0.05614741,
       -0.74415725, -1.0815455 , -0.28300503, -1.0312603 ,  3.5577624 ,
        0.81285644, -3.4548876 ,  1.3007482 , -0.0527516 , -1.5823797 ,
        0.94375974,  0.01696876, -1.1761798 ,  2.141124  ,  0.8728857 ,
        1.7190704 ,  2.6591263 , -4.227103  , -0.5748424 ,  0.36368647,
       -1.838823  ,  1.3353976 , -1.5363356 , -0.98404247, -0.64337295,
       -2.6795921 ,  0.4494206 ,  2.0296626 ,  1.1337993 , -0.15482494,
        2.2946403 , -2.6738138 , -1.2725773 ,  1.763216  ,  2.8063855 ,
        0.46778396, -0.36578366, -0.26262623,  0.9403954 ,  1.0015386 ,
       -2.0167701 , -1.1006184 , -0.1618742 ,  0.9444055 , -0.27051136,
        0.33339593, -1.7367964 ,  1.3408182 ,  0.32765946,  1.1382856 ,
        0.6616231 , -1.8980618 , -3.5863128 , -2.049999  ,  0.11113743,
       -2.0100012 ,  0.91014284, -1.1377766 ,  0.30864185,  0.8823487 ,
       -1.5351313 ,  4.9388237 ,  1.6379622 ,  1.251249  ,  1.8976227 ,
       -0.77638566, -0.17837228,  1.8998926 , -0.8797592 , -0.77389985,
       -0.19354557, -0.14190447,  0.15318388, -1.0070832 ,  0.3741701 ,
       -2.8941076 ,  0.9670155 , -1.7984045 , -1.1186454 ,  1.0722593 ,
        3.4040382 ,  0.38004097,  1.4921545 , -0.0391783 ,  3.8353264 ,
        0.2084171 ,  1.266672  ,  1.8079621 , -2.3702457 , -0.04794558,
        0.3776366 , -0.42777508, -0.809167  ,  0.7592459 , -1.5167016 ,
        0.2553154 ,  1.1878173 , -2.2171407 , -0.76328766,  2.2422533 ],
      dtype=float32)
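For models that ship static vectors, Doc.vector defaults to the average of the token vectors; a quick sanity check (sketch):

# verify the document vector equals the mean of its token vectors
d = nlp(bus[nvidia])
assert np.allclose(d.vector, np.mean([t.vector for t in d], axis=0), atol=1e-4)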

Compute the average word vector for all companies’ business descriptions:

# Compute document vectors (spaCy averages the word vectors)
vecs = np.array([nlp(bus[permno]).vector for permno in bus.keys()])
store['vectors'] = vecs
vecs = store['vectors']
# Distance matrix
n = len(bus)
distances = np.zeros((n, n))
for row in range(n):
    for col in range(row, n):
        distances[row, col] = spatial.distance.cosine(vecs[row], vecs[col])
        distances[col, row] = distances[row, col] 
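The nested loop is equivalent to scipy's vectorized pairwise routine, where the cosine distance between two vectors u and v is 1 - cos(u, v); a quick check (sketch):

# vectorized equivalent: condensed pairwise cosine distances expanded to a square matrix
assert np.allclose(distances,
                   spatial.distance.squareform(spatial.distance.pdist(vecs, metric='cosine')),
                   atol=1e-6)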

Identify companies with the most similar business descriptions:

def most_similar(p):
    """Return the universe row of the company with the closest business description"""
    dist = distances[permnos.index(p)].copy()   # copy to avoid mutating the matrix
    dist[permnos.index(p)] = max(dist)          # mask own distance
    return univ.loc[permnos[np.argmin(dist)]]
for name in ['NVIDIA', 'APPLE COMPUTER', 'JNJ', 'EXXON MOBIL', 'AMERICAN EXPRESS']:
    p = find(name)['permno'].iloc[-1]
    print(f"'{most_similar(p)['comnam']}' is most similar to '{name}'")
'QUALCOMM INC' is most similar to 'NVIDIA'
'SALESFORCE INC' is most similar to 'APPLE COMPUTER'
'PFIZER INC' is most similar to 'JNJ'
'PIONEER NATURAL RESOURCES CO' is most similar to 'EXXON MOBIL'
'U S BANCORP DEL' is most similar to 'AMERICAN EXPRESS'

Dimensionality reduction#

t-SNE visualization#

T-distributed Stochastic Neighbor Embedding (t-SNE) visualizes high-dimensional data by converting similarities between points into joint probabilities and minimizing the Kullback-Leibler divergence between the high-dimensional and lower-dimensional representations. t-SNE preserves local structures, making it effective for clustering and uncovering hidden patterns in business descriptions.
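Formally, if $p_{ij}$ are pairwise similarities in the original space and $q_{ij}$ their counterparts in the embedding (computed with a Student-t kernel), t-SNE minimizes the divergence

$$KL(P \,\|\, Q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}}$$

which penalizes placing points far apart in the embedding when they were close in the original space.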

t-SNE in scikit-learn

from sklearn.manifold import TSNE
Z = TSNE(n_components=2, perplexity=10, random_state=42)\
    .fit_transform(vecs)

Reduce business description vectors to 2D using t-SNE and label points with ticker symbols:

fig, ax = plt.subplots(figsize=(10, 8))
ax.scatter(Z[:, 0], Z[:, 1], color="C0", alpha=.3)
for text, x, y in zip(tickers, Z[:, 0], Z[:, 1]):
    ax.annotate(text=text, xy=(x, y), fontsize='small')
ax.set_title(f"t-SNE visualization of largest decile stocks ({enddate//10000})")
plt.tight_layout()
[Figure: t-SNE visualization of largest decile stocks (2024), points labeled by ticker]

DBSCAN clustering#

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is an unsupervised clustering algorithm that detects clusters of varying densities and identifies outliers. Unlike k-means, it does not require a predefined number of clusters. Instead, it uses two parameters: epsilon (ε) and the minimum number of points required to form a dense region.
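A common heuristic for choosing ε is to plot each point's distance to its k-th nearest neighbor in sorted order and look for an elbow; a minimal sketch (the choice k=5 is an assumption for illustration, not from the original analysis):

# sorted k-th nearest neighbor distances: eps near the elbow of this curve
from sklearn.neighbors import NearestNeighbors
k = 5                                          # assumed value to probe
knn_dist, _ = NearestNeighbors(n_neighbors=k + 1).fit(Z).kneighbors(Z)
plt.figure()
plt.plot(np.sort(knn_dist[:, -1]))             # column 0 is the point itself
plt.ylabel(f"distance to {k}-th nearest neighbor")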

# eps is the most important parameter for DBSCAN
eps = 4    # larger than sklearn's default of 0.5, to match the scale of the t-SNE embedding
db = cluster.DBSCAN(eps=eps)

db.fit(Z)
n_clusters = len(set(db.labels_).difference({-1}))
n_noise = np.sum(db.labels_ == -1)
DataFrame(dict(clusters=n_clusters, noise=n_noise, eps=eps),  index=['DBSCAN'])
clusters noise eps
DBSCAN 12 19 4

Visualize DBSCAN clusters in 2D space. Display outlier ticker symbols with larger font sizes:

cmap = ColorMap(n_clusters)
fig, ax = plt.subplots(figsize=(10, 8))
# plot core samples with larger marker size
ax.scatter(Z[db.core_sample_indices_, 0],
           Z[db.core_sample_indices_, 1],
           c=cmap[db.labels_[db.core_sample_indices_]],
           alpha=.1, s=100, edgecolors=None)
# plot non-core samples with smaller marker size
non_core = np.ones_like(db.labels_, dtype=bool)
non_core[db.core_sample_indices_] = False
non_core[db.labels_ < 0] = False
ax.scatter(Z[non_core, 0], Z[non_core, 1], c=cmap[db.labels_[non_core]],
           alpha=.1, s=20, edgecolors=None)
# plot noise samples 
ax.scatter(Z[db.labels_ < 0, 0], Z[db.labels_ < 0, 1], c="darkgrey",
           alpha=.5, s=20, edgecolors=None)

# annotate tickers: core samples in small font, non-core and noise in larger font
for i, (t, c, xy) in enumerate(zip(tickers, db.labels_, Z)):
    if i in db.core_sample_indices_:
        ax.annotate(text=t, xy=xy+.5, color=cmap[c], fontsize='xx-small')
    elif c == -1:
        ax.annotate(text=t, xy=xy+.5, color='black', fontsize='medium')
    else:
        ax.annotate(text=t, xy=xy+.5, color=cmap[c], fontsize='medium')
ax.set_title(f"Largest decile stocks ({enddate//10000})")
plt.tight_layout()
[Figure: DBSCAN clusters of largest decile stocks (2024); noise tickers shown in larger black font]

List companies tagged as noisy samples:

print("Samples tagged as noise:")
univ.loc[np.array(permnos)[db.labels_ < 0]].sort_values('naics')
Samples tagged as noise:
cap capco decile nyse siccd prc naics comnam ticker
permno
75241 5.246653e+07 5.246653e+07 1 True 1311 224.88 211120 PIONEER NATURAL RESOURCES CO PXD
21207 4.770164e+07 4.770164e+07 1 True 1041 41.39 212220 NEWMONT CORP NEM
81774 6.104440e+07 6.104440e+07 1 True 1021 42.57 212230 FREEPORT MCMORAN INC FCX
82800 6.654158e+07 6.654158e+07 1 True 1021 86.07 212230 SOUTHERN COPPER CORP SCCO
69796 4.440053e+07 4.440053e+07 1 True 2084 241.75 312130 CONSTELLATION BRANDS INC STZ
11850 4.005332e+08 4.005332e+08 1 True 2911 99.98 324110 EXXON MOBIL CORP XOM
78975 1.749684e+08 1.749684e+08 1 False 7370 625.03 513210 INTUIT INC INTU
26403 1.652592e+08 1.652592e+08 1 True 4833 90.29 516120 DISNEY WALT CO DIS
44644 9.568078e+07 9.568078e+07 1 False 7374 232.97 518210 AUTOMATIC DATA PROCESSING INC ADP
47896 4.917605e+08 4.917605e+08 1 True 6021 170.10 522110 JPMORGAN CHASE & CO JPM
90993 7.350871e+07 7.350871e+07 1 True 6231 128.43 523210 INTERCONTINENTALEXCHANGE GRP INC ICE
89626 7.565405e+07 7.565405e+07 1 False 6200 210.60 523210 C M E GROUP INC CME
17478 1.395567e+08 1.395567e+08 1 True 6282 440.52 523930 S & P GLOBAL INC SPGI
61621 4.285840e+07 4.285840e+07 1 False 8700 119.11 541219 PAYCHEX INC PAYX
13628 5.769654e+07 5.769654e+07 1 False 7372 276.06 541511 WORKDAY INC WDAY
48506 7.147248e+07 7.147248e+07 1 True 7323 390.56 561450 MOODYS CORP MCO
92402 4.473782e+07 4.473782e+07 1 True 7389 565.65 561499 M S C I INC MSCI
85913 6.551968e+07 6.551968e+07 1 False 7011 225.51 721110 MARRIOTT INTERNATIONAL INC NEW MAR
14338 4.669516e+07 4.669516e+07 1 True 7011 182.09 721110 HILTON WORLDWIDE HOLDINGS INC HLT

UMAP visualization#

UMAP (Uniform Manifold Approximation and Projection) is a dimensionality reduction technique that constructs a high-dimensional graph of data points and optimizes a lower-dimensional representation while preserving essential relationships. Compared to t-SNE, UMAP is faster, scales better for large datasets, and retains more global structure.

UMAP Documentation

import umap
Z = umap.UMAP(n_components=2, n_jobs=1, min_dist=0.0, random_state=42)\
        .fit_transform(vecs)
fig, ax = plt.subplots(figsize=(10, 8))
ax.scatter(Z[:, 0], Z[:, 1], color="C0", alpha=.3)
for text, x, y in zip(tickers, Z[:, 0], Z[:, 1]):
    ax.annotate(text=text, xy=(x, y), fontsize='small')
ax.set_title(f"UMAP visualization of largest decile stocks ({enddate//10000})")
plt.tight_layout()
[Figure: UMAP visualization of largest decile stocks (2024), points labeled by ticker]

HDBSCAN clustering#

HDBSCAN (Hierarchical Density-Based Spatial Clustering of Applications with Noise) extends DBSCAN by varying the epsilon parameter and optimizing cluster stability. This makes it more robust to variations in density and parameter selection.

# HDBSCAN does not require the eps parameter; it varies the density threshold internally
hdb = cluster.HDBSCAN()
hdb.fit(Z)
n_clusters = len(set(hdb.labels_).difference({-1}))
n_noise = np.sum(hdb.labels_ == -1)
DataFrame(dict(clusters=n_clusters, noise=n_noise),  index=['HDBSCAN'])
clusters noise
HDBSCAN 14 32
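Beyond hard labels, scikit-learn's HDBSCAN also exposes a soft membership strength per sample in its probabilities_ attribute (1.0 at the heart of a cluster, 0.0 for noise); a quick look, as a sketch:

# soft cluster-membership strength of each company, indexed by permno
strength = Series(hdb.probabilities_, index=permnos)
strength.sort_values().head()      # weakest members sit near cluster boundaries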

Visualize HDBSCAN clusters in 2D space. Display outlier ticker symbols with larger font sizes.

cmap = ColorMap(n_clusters)
fig, ax = plt.subplots(figsize=(10, 8))
# plot core samples with larger marker size
ax.scatter(Z[hdb.labels_ >= 0, 0],
           Z[hdb.labels_ >= 0, 1],
           c=cmap[hdb.labels_[hdb.labels_ >= 0]],
           alpha=.1, s=100, edgecolors=None)
# plot noise samples 
ax.scatter(Z[hdb.labels_ < 0, 0], Z[hdb.labels_ < 0, 1], c="darkgrey",
           alpha=.5, s=20, edgecolors=None)

# annotate tickers: clustered samples in small font, noise in larger black font
for i, (t, c, xy) in enumerate(zip(tickers, hdb.labels_, Z)):
    if c >= 0:
        ax.annotate(text=t, xy=xy+.01, color=cmap[c], fontsize='xx-small')
    else:
        ax.annotate(text=t, xy=xy+.01, color="black", fontsize='medium')
ax.set_title(f"Largest decile stocks ({enddate//10000})")
plt.tight_layout()
[Figure: HDBSCAN clusters of largest decile stocks (2024); noise tickers shown in larger black font]

List companies tagged as noisy samples:

print("Samples tagged as noise:")
univ.loc[np.array(permnos)[hdb.labels_ < 0]].sort_values('naics')
Samples tagged as noise:
cap capco decile nyse siccd prc naics comnam ticker
permno
69796 4.440053e+07 4.440053e+07 1 True 2084 241.75 312130 CONSTELLATION BRANDS INC STZ
11850 4.005332e+08 4.005332e+08 1 True 2911 99.98 324110 EXXON MOBIL CORP XOM
36468 7.983580e+07 7.983580e+07 1 True 2851 311.90 325510 SHERWIN WILLIAMS CO SHW
70578 5.655752e+07 5.655752e+07 1 True 2841 198.35 325611 ECOLAB INC ECL
18163 3.453781e+08 3.453781e+08 1 True 2844 146.54 325620 PROCTER & GAMBLE CO PG
18729 6.563098e+07 6.563098e+07 1 True 2844 79.71 325620 COLGATE PALMOLIVE CO CL
22103 5.548783e+07 5.548783e+07 1 True 3491 97.33 332911 EMERSON ELECTRIC CO EMR
19350 1.120656e+08 1.120656e+08 1 True 3523 399.87 333111 DEERE & CO DE
14702 1.345954e+08 1.345954e+08 1 False 3550 162.07 333248 APPLIED MATERIALS INC AMAT
41355 5.918889e+07 5.918889e+07 1 True 3593 460.70 333995 PARKER HANNIFIN CORP PH
56573 7.881408e+07 7.881408e+07 1 True 3569 261.94 333999 ILLINOIS TOOL WORKS INC ITW
12490 1.493406e+08 1.493406e+08 1 True 3571 163.55 334111 INTERNATIONAL BUSINESS MACHS COR IBM
77338 5.827867e+07 5.827867e+07 1 False 3823 545.17 334513 ROPER TECHNOLOGIES INC ROP
22592 6.037929e+07 6.037929e+07 1 True 3841 109.32 339112 3M CO MMM
66181 3.449080e+08 3.449080e+08 1 True 5211 346.55 444110 HOME DEPOT INC HD
84788 1.556169e+09 1.556169e+09 1 False 7370 151.94 454110 AMAZON COM INC AMZN
48725 1.497292e+08 1.497292e+08 1 True 4011 245.62 482111 UNION PACIFIC CORP UNP
64311 5.345403e+07 5.345403e+07 1 True 4731 236.38 488510 NORFOLK SOUTHERN CORP NSC
60628 6.321543e+07 6.321543e+07 1 True 4513 252.97 492110 FEDEX CORP FDX
26403 1.652592e+08 1.652592e+08 1 True 4833 90.29 516120 DISNEY WALT CO DIS
47896 4.917605e+08 4.917605e+08 1 True 6021 170.10 522110 JPMORGAN CHASE & CO JPM
92108 9.302455e+07 9.302455e+07 1 True 6282 130.92 523940 BLACKSTONE INC BX
87842 4.894876e+07 4.894876e+07 1 True 6311 66.13 524113 METLIFE INC MET
57904 4.821135e+07 4.821135e+07 1 True 6321 82.50 524114 AFLAC INC AFL
59459 4.350773e+07 4.350773e+07 1 True 6331 190.49 524126 TRAVELERS COMPANIES INC TRV
64390 9.318533e+07 9.318533e+07 1 True 6331 159.28 524126 PROGRESSIVE CORP OH PGR
66800 4.756321e+07 4.756321e+07 1 True 6331 67.75 524126 AMERICAN INTERNATIONAL GROUP INC AIG
38093 4.855159e+07 4.855159e+07 1 True 6411 224.88 524210 GALLAGHER ARTHUR J & CO AJG
45751 9.342235e+07 9.342235e+07 1 True 6411 189.47 524210 MARSH & MCLENNAN COS INC MMC
89393 2.107022e+08 2.107022e+08 1 False 7841 486.88 532282 NETFLIX INC NFLX
13511 9.297566e+07 9.297566e+07 1 False 7371 294.88 541511 PALO ALTO NETWORKS INC PANW
11955 7.213700e+07 7.213700e+07 1 True 4953 179.10 562219 WASTE MANAGEMENT INC DEL WM

References:

Greg Durrett, 2023, "CS388 Natural Language Processing course materials", retrieved from https://www.cs.utexas.edu/~gdurrett/courses/online-course/materials.html

Gerard Hoberg and Gordon Phillips, 2016, "Text-Based Network Industries and Endogenous Product Differentiation", Journal of Political Economy 124(5), 1423-1465.

Gerard Hoberg and Gordon Phillips, 2010, "Product Market Synergies and Competition in Mergers and Acquisitions: A Text-Based Analysis", Review of Financial Studies 23(10), 3773-3811.