Business Textual Analysis#

"You shall know a word by the company it keeps" – J. R. Firth

Text mining techniques allow new insights to be extracted from unstructured text documents. We retrieve business description text from 10-K filings and use the spaCy NLP package for syntactic analysis, including part-of-speech tagging, named entity recognition, and dependency parsing. Additionally, we explore dimensionality reduction techniques to visualize and cluster companies based on the relationships between their business descriptions, represented as word embeddings, in a lower-dimensional space.

# By: Terence Lim, 2020-2025 (terence-lim.github.io)
import re
import numpy as np
from scipy import spatial
import pandas as pd
from pandas import DataFrame, Series
import matplotlib.pyplot as plt
import spacy
from sklearn import cluster
from sklearn.decomposition import PCA
from tqdm import tqdm
from finds.database import SQL, RedisDB
from finds.structured import CRSP, BusDay
from finds.unstructured import Edgar
from finds.utils import Store, Finder, ColorMap
from secret import credentials, paths
# %matplotlib qt
VERBOSE = 0
sql = SQL(**credentials['sql'], verbose=VERBOSE)
user = SQL(**credentials['user'], verbose=VERBOSE)
bd = BusDay(sql)
rdb = RedisDB(**credentials['redis'])
crsp = CRSP(sql, bd, rdb, verbose=VERBOSE)
ed = Edgar(paths['10X'], zipped=True, verbose=VERBOSE)
store = Store(paths['scratch'])
find = Finder(sql)

begdate, enddate = 20240101, 20241231

Retrieve the usual investment universe and retain only the largest size decile (based on NYSE market cap breakpoints).

# Retrieve universe of stocks
univ = crsp.get_universe(bd.endmo(begdate, -1))
comnam = crsp.build_lookup('permno', 'comnam', fillna="")  # company name
univ['comnam'] = comnam(univ.index)
ticker = crsp.build_lookup('permno', 'ticker', fillna="")  # tickers
univ['ticker'] = ticker(univ.index)

Extract Business Description text from 10-K filings.

# retrieve business descriptions from 10-K filings
item, form = 'bus10K', '10-K'
rows = DataFrame(ed.open(form=form, item=item))
found = rows[rows['permno'].isin(univ.index[univ.decile <= 1])  # largest decile only
             & rows['date'].between(begdate, enddate)]\
             .drop_duplicates(subset=['permno'], keep='last')\
             .set_index('permno')

Syntactic analysis#

Syntactic analysis examines the roles of words in sentences and how they combine to form phrases and larger linguistic structures. This process helps model relationships such as subject-verb-object dependencies, which are fundamental for NLP tasks like dependency and constituent parsing.

spaCy#

spaCy is a widely used open-source Python library for advanced NLP tasks, including POS tagging, named entity recognition (NER), and dependency parsing. It provides pre-trained models for various languages and domains, as well as customizable pipelines for processing text data.

# ! python -m spacy download en_core_web_lg
nlp = spacy.load("en_core_web_lg")

Lemmatization#

Lemmatization reduces a word to its base or dictionary form (lemma), representing its morphological root.

# Using spaCy pipeline to tokenize NVIDIA's 10-K business description text
nvidia = find('NVIDIA')['permno'].iloc[0]
doc = nlp(ed[found.loc[nvidia, 'pathname']][:nlp.max_length].lower())
tokens = DataFrame.from_records([{'text': token.text,
                                  'lemma': token.lemma_,
                                  'alpha': token.is_alpha,
                                  'stop': token.is_stop,
                                  'punct': token.is_punct}
                                 for token in doc], index=range(len(doc)))
tokens.head(30)
text lemma alpha stop punct
0 item item True False False
1 1 1 False False False
2 . . False False True
3 business business True False False
4 \n\n \n\n False False False
5 our our True True False
6 company company True False False
7 \n\n \n\n False False False
8 nvidia nvidia True False False
9 pioneered pioneer True False False
10 accelerated accelerate True False False
11 computing computing True False False
12 to to True True False
13 help help True False False
14 solve solve True False False
15 the the True True False
16 most most True True False
17 challenging challenging True False False
18 computational computational True False False
19 problems problem True False False
20 . . False False True
21 nvidia nvidia True False False
22 is be True True False
23 now now True True False
24 a a True True False
25 full full True True False
26 - - False False True
27 stack stack True False False
28 computing computing True False False
29 infrastructure infrastructure True False False

Part-of-speech#

Part-of-speech (POS) tagging assigns grammatical categories (e.g., noun, verb, adjective) to words in a text corpus. This aids in understanding sentence structure and extracting meaning by identifying the roles of words within sentences.

tags = DataFrame.from_records([{'text': token.text,
                                'pos': token.pos_,
                                'tag': token.tag_,
                                'dep': token.dep_}
                               for token in doc], index=range(len(doc)))
tags.head(30)
text pos tag dep
0 item NOUN NN ROOT
1 1 NUM CD nummod
2 . PUNCT . punct
3 business NOUN NN nsubj
4 \n\n SPACE _SP dep
5 our PRON PRP$ poss
6 company NOUN NN appos
7 \n\n SPACE _SP dep
8 nvidia PROPN NNP appos
9 pioneered VERB VBD ROOT
10 accelerated VERB VBD xcomp
11 computing NOUN NN dobj
12 to PART TO aux
13 help VERB VB advcl
14 solve VERB VB xcomp
15 the DET DT det
16 most ADV RBS advmod
17 challenging ADJ JJ amod
18 computational ADJ JJ amod
19 problems NOUN NNS dobj
20 . PUNCT . punct
21 nvidia PROPN NNP nsubj
22 is AUX VBZ ROOT
23 now ADV RB advmod
24 a DET DT det
25 full ADJ JJ amod
26 - PUNCT HYPH punct
27 stack NOUN NN compound
28 computing NOUN NN compound
29 infrastructure NOUN NN compound

Named entity recognition#

Named Entity Recognition (NER) identifies and categorizes named entities (e.g., people, organizations, locations, dates) in text. This process helps classify textual data into meaningful categories.

ents = DataFrame.from_records([{'text': ent.text,
                                'label': ent.label_,
                                'start': ent.start_char,
                                'end': ent.end_char}
                               for ent in doc.ents], index=range(len(doc.ents)))
ents.head(20)
text label start end
0 1 CARDINAL 5 6
1 nvidia PERSON 133 139
2 as well as hundreds CARDINAL 352 371
3 healthcare ORG 853 863
4 tens of thousands CARDINAL 1012 1029
5 gpu ORG 1033 1036
6 gpu ORG 1239 1242
7 today DATE 1347 1352
8 thousands CARDINAL 1498 1507
9 gpu ORG 1771 1774
10 thousands CARDINAL 1819 1828
11 gpus GPE 2594 2598
12 multi-billion-dollar MONEY 2708 2728
13 third ORDINAL 2849 2854
14 over 45.3 billion MONEY 3100 3117
15 gpu ORG 3248 3251
16 1999 DATE 3255 3259
17 2006 DATE 3391 3395
18 gpu ORG 3451 3454
19 2012 DATE 3557 3561
# Entity Visualizer
from spacy import displacy
displacy.render(doc[:300], style="ent", jupyter=True)
item 1 CARDINAL . business

our company

nvidia pioneered accelerated computing to help solve the most challenging computational problems. nvidia PERSON is now a full-stack computing infrastructure company with data-center-scale offerings that are reshaping industry.

our full-stack includes the foundational cuda programming model that runs on all nvidia gpus, as well as hundreds CARDINAL of domain-specific software libraries, software development kits, or sdks, and application programming interfaces, or apis. this deep and broad software stack accelerates the performance and eases the deployment of nvidia accelerated computing for computationally intensive workloads such as artificial intelligence, or ai, model training and inference, data analytics, scientific computing, and 3d graphics, with vertical-specific optimizations to address industries ranging from healthcare ORG and telecom to automotive and manufacturing.

our data-center-scale offerings are comprised of compute and networking solutions that can scale to tens of thousands CARDINAL of gpu ORG -accelerated servers interconnected to function as a single giant computer; this type of data center architecture and scale is needed for the development and deployment of modern ai applications.

the gpu ORG was initially used to simulate human imagination, enabling the virtual worlds of video games and films. today DATE , it also simulates human intelligence, enabling a deeper understanding of the physical world. its parallel processing capabilities, supported by thousands CARDINAL of computing cores, are essential for deep learning algorithms. this form of ai, in which software writes itself by learning from large amounts of data, can serve as the brain of computers, robots and self-driving cars that can perceive and understand the

Dependency parsing#

Dependency parsing determines grammatical relationships between words in a sentence, representing these relationships as a tree structure where each word (except the root) depends on another word (its head). This technique helps identify syntactic roles, such as subjects, objects, and modifiers.
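For instance, the dependency labels can be read off the parse to pull out subject-verb-object triples. Below is a minimal sketch; the svo helper is illustrative rather than part of spaCy, and the exact triples extracted depend on the model:

# Sketch: extract (subject, verb, object) triples by walking dependency labels
def svo(doc):
    triples = []
    for tok in doc:
        if tok.pos_ == "VERB":
            subjects = [c for c in tok.children if c.dep_ in ("nsubj", "nsubjpass")]
            objects = [c for c in tok.children if c.dep_ in ("dobj", "obj")]
            if subjects and objects:
                triples.append((subjects[0].text, tok.text, objects[0].text))
    return triples

svo(nlp("nvidia designs chips and sells software"))
# e.g. [('nvidia', 'designs', 'chips')]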

Transition-based parsing algorithms use a set of transition operations (e.g., shift, reduce) to incrementally build a dependency tree from an input sentence.
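To make the mechanics concrete, here is a minimal sketch of the arc-standard system executing a hand-specified transition sequence on a toy sentence (illustrative only; a trained parser like spaCy's learns which transition to take at each step):

# Arc-standard transitions: SHIFT moves the next word onto the stack;
# LEFT-ARC / RIGHT-ARC attach one of the top two stack words to the other
def transition_parse(words, transitions):
    stack, buffer, arcs = [], list(range(len(words))), []
    for t in transitions:
        if t == "SHIFT":
            stack.append(buffer.pop(0))
        elif t == "LEFT-ARC":     # top of stack heads the word beneath it
            arcs.append((stack[-1], stack.pop(-2)))
        elif t == "RIGHT-ARC":    # word beneath heads the top of stack
            arcs.append((stack[-2], stack.pop()))
    return arcs                   # list of (head index, dependent index)

transition_parse(["nvidia", "pioneered", "computing"],
                 ["SHIFT", "SHIFT", "LEFT-ARC", "SHIFT", "RIGHT-ARC"])
# [(1, 0), (1, 2)]: 'pioneered' heads both 'nvidia' and 'computing'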

Unlike dependency parsing, constituent parsing focuses on identifying and representing the hierarchical structure of phrases in a sentence based on formal grammar rules. It groups words into nested syntactic units (e.g., noun phrases and verb phrases) and represents them in a tree structure.

The CKY (Cocke-Kasami-Younger) algorithm is a dynamic programming technique used for parsing sentences and constructing parse trees.

Probabilistic Context-Free Grammar (PCFG) extends standard Context-Free Grammar (CFG) by assigning probabilities to production rules, indicating the likelihood of specific grammatical structures. Each rule defines how non-terminal symbols (e.g., NP for noun phrase) expand into words or other non-terminals, guiding sentence generation and parsing.
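A minimal probabilistic CKY sketch over a toy PCFG in Chomsky normal form (the grammar, probabilities, and sentence are invented for illustration):

from collections import defaultdict

# toy PCFG in Chomsky normal form: unary lexical rules and binary rules
lexicon = {('NP', 'nvidia'): 0.5, ('VB', 'designs'): 1.0,
           ('DT', 'the'): 1.0, ('NN', 'chips'): 0.5, ('NN', 'company'): 0.5}
binary = {('S', 'NP', 'VP'): 1.0, ('NP', 'DT', 'NN'): 0.5, ('VP', 'VB', 'NP'): 1.0}

def cky(words):
    """Dynamic program: best[(i, j, A)] = max probability that A derives words[i:j]"""
    n = len(words)
    best = defaultdict(float)
    for i, w in enumerate(words):            # initialize length-1 spans from the lexicon
        for (A, word), p in lexicon.items():
            if word == w:
                best[(i, i + 1, A)] = max(best[(i, i + 1, A)], p)
    for span in range(2, n + 1):             # combine shorter spans bottom-up
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):        # try every split point
                for (A, B, C), p in binary.items():
                    q = p * best[(i, k, B)] * best[(k, j, C)]
                    best[(i, j, A)] = max(best[(i, j, A)], q)
    return best[(0, n, 'S')]

cky("nvidia designs the chips".split())      # 0.125, the probability of the best parse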

Models for automatic tagging and parsing rely on labeled datasets known as treebanks, which contain syntactically annotated sentences. The Penn Treebank is a widely used treebank for English, providing annotations for POS tags and parse trees.

sentence_spans = list(doc.sents)
displacy.render(sentence_spans[2:4], style="dep", jupyter=True, 
                options=dict(compact=False, distance=175))
[displaCy dependency parse of two sentences from NVIDIA's business description: arrows link each token to its head, labeled with relations such as nsubj, advmod, amod, compound, and dobj.]

Semantic similarity#

Word vectors#

Word vectors are numerical representations of words in a multidimensional space, learned from their co-occurrence patterns in large text corpora. Words with similar syntactic and semantic meanings tend to have vector representations that are close together in this space.

For example, spaCy’s en_core_web_lg model represents over 500,000 words using 300-dimensional vectors.
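A quick sketch of what the model exposes (shapes and similarity scores depend on the installed model version):

# static word vectors: one 300-dimensional vector per vocabulary entry
words = nlp("semiconductor software retail")
print(nlp.vocab.vectors.shape)          # (number of vectors, 300)
print(words[0].vector.shape)            # (300,)
print(words[0].similarity(words[1]),    # pairwise cosine similarities between tokens
      words[0].similarity(words[2]))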

Extract lemmatized noun forms from business descriptions using spaCy’s POS tagger:

# Extract nouns
bus = {}
for permno in tqdm(found.index):
    doc = nlp(ed[found.loc[permno, 'pathname']][:nlp.max_length].lower())
    nouns = " ".join([re.sub("[^a-zA-Z]+", "", token.lemma_) for token in doc
                      if token.pos_ in ['NOUN'] and len(token.lemma_) > 2])
    if len(nouns) > 100:
        bus[permno] = nouns        
store['business'] = bus
100%|██████████| 192/192 [03:12<00:00,  1.00s/it]
bus = store.load('business')
permnos = list(bus.keys())
tickers = univ.loc[permnos, 'ticker'].to_list()

Compute the average word vector for NVIDIA’s business description text:

# example of word vector
vec1 = nlp(bus[nvidia]).vector
vec1
array([-0.5495549 ,  0.0522468 , -0.70095205,  1.1134154 ,  2.659689  ,
        0.29627848,  1.3154105 ,  3.8272932 , -2.232087  , -1.3053178 ,
        6.069174  ,  2.0604212 , -4.542866  ,  2.3177896 , -1.1287518 ,
        2.3917935 ,  3.1968606 ,  1.5996909 , -2.3438275 ,  0.03434967,
        0.22686806,  1.7824569 , -2.384547  ,  0.8608239 , -1.2319311 ,
       -1.774604  , -1.8425854 , -1.7403452 , -0.7102895 ,  1.0869901 ,
        1.2046682 ,  1.2530138 , -1.1417824 , -0.4984767 ,  0.34321743,
       -0.37546915,  1.4804035 ,  0.8114897 ,  1.3119912 ,  0.38791072,
        0.25189775, -0.1770816 ,  0.1785395 ,  1.0146813 , -1.4704382 ,
        1.6199547 ,  2.021769  , -1.9505422 ,  0.4602281 , -1.281002  ,
        0.07107421,  2.3507724 , -0.18837918, -3.80177   , -0.54604673,
        0.5786306 , -1.7812697 ,  1.5003949 ,  0.40720284, -1.5742034 ,
        2.2474368 ,  1.2563457 , -2.2915537 , -1.2388986 ,  2.408017  ,
        1.9807013 , -2.2583349 , -3.3942797 ,  0.5241013 ,  3.0477126 ,
       -0.97571445,  0.7010974 , -1.2003129 ,  0.38448045, -0.30499592,
        1.3931054 , -1.3777359 ,  0.99693114, -1.8231292 , -0.1508515 ,
       -2.595402  , -0.6554817 ,  0.96918464,  1.4329529 , -0.2420673 ,
       -0.08297056, -1.2713333 , -1.8489853 ,  0.77105236, -0.235054  ,
       -1.2015461 ,  1.0783287 ,  1.5740684 , -2.2493582 ,  0.43573543,
       -0.56152374,  0.58803385, -0.5924455 ,  0.8756682 ,  1.2502667 ,
        3.0838132 ,  0.3415912 ,  1.887193  ,  1.8639498 ,  0.2804779 ,
        4.218261  , -1.1121116 , -1.805579  , -0.22270828, -2.56832   ,
        2.4099169 , -0.22422643, -1.1543158 ,  0.06765282,  0.83664316,
        1.7835015 , -2.6237123 , -1.3570241 , -0.46247697, -2.476458  ,
       -1.9124763 , -2.6474092 ,  0.44025388,  1.1519567 , -0.42653838,
       -2.8462713 ,  0.2980864 , -2.9798195 ,  2.9588706 , -1.819657  ,
       -2.464192  ,  0.32522804,  3.3010955 ,  0.7937097 , -0.15216802,
       -0.42828757, -0.9942988 , -0.44921628,  1.9312432 ,  0.28458852,
       -0.6386989 , -0.87969756, -0.13129689,  0.85792863,  1.7823339 ,
        0.16759728, -3.5592077 , -0.2112106 ,  0.06164274,  3.368905  ,
       -0.3132938 ,  1.2434231 ,  0.21065742,  1.2389091 , -0.66589624,
        0.5398899 ,  2.8091311 ,  1.7358444 , -0.91453594, -2.6195803 ,
       -0.9402572 , -1.2330531 ,  0.71588945,  1.699086  , -1.72446   ,
       -1.1087135 , -2.044233  ,  0.5116665 ,  0.7511035 , -1.510701  ,
       -1.5667175 , -0.22543517, -0.37966043,  0.95050013,  2.100138  ,
        1.966492  ,  0.6108724 , -0.31808442, -1.9397378 , -1.2919445 ,
       -1.7783433 ,  1.7884035 ,  1.1175009 , -1.9068832 , -1.0304377 ,
        0.6533309 , -0.94094557, -1.4831365 ,  1.5252017 ,  1.73394   ,
        0.05441582, -1.1678139 ,  0.10588835, -1.8394533 ,  1.820474  ,
        0.70712596, -2.8049684 , -0.1084957 ,  0.5009497 , -0.05614741,
       -0.74415725, -1.0815455 , -0.28300503, -1.0312603 ,  3.5577624 ,
        0.81285644, -3.4548876 ,  1.3007482 , -0.0527516 , -1.5823797 ,
        0.94375974,  0.01696876, -1.1761798 ,  2.141124  ,  0.8728857 ,
        1.7190704 ,  2.6591263 , -4.227103  , -0.5748424 ,  0.36368647,
       -1.838823  ,  1.3353976 , -1.5363356 , -0.98404247, -0.64337295,
       -2.6795921 ,  0.4494206 ,  2.0296626 ,  1.1337993 , -0.15482494,
        2.2946403 , -2.6738138 , -1.2725773 ,  1.763216  ,  2.8063855 ,
        0.46778396, -0.36578366, -0.26262623,  0.9403954 ,  1.0015386 ,
       -2.0167701 , -1.1006184 , -0.1618742 ,  0.9444055 , -0.27051136,
        0.33339593, -1.7367964 ,  1.3408182 ,  0.32765946,  1.1382856 ,
        0.6616231 , -1.8980618 , -3.5863128 , -2.049999  ,  0.11113743,
       -2.0100012 ,  0.91014284, -1.1377766 ,  0.30864185,  0.8823487 ,
       -1.5351313 ,  4.9388237 ,  1.6379622 ,  1.251249  ,  1.8976227 ,
       -0.77638566, -0.17837228,  1.8998926 , -0.8797592 , -0.77389985,
       -0.19354557, -0.14190447,  0.15318388, -1.0070832 ,  0.3741701 ,
       -2.8941076 ,  0.9670155 , -1.7984045 , -1.1186454 ,  1.0722593 ,
        3.4040382 ,  0.38004097,  1.4921545 , -0.0391783 ,  3.8353264 ,
        0.2084171 ,  1.266672  ,  1.8079621 , -2.3702457 , -0.04794558,
        0.3776366 , -0.42777508, -0.809167  ,  0.7592459 , -1.5167016 ,
        0.2553154 ,  1.1878173 , -2.2171407 , -0.76328766,  2.2422533 ],
      dtype=float32)
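For models that ship static vectors, Doc.vector defaults to the average of the token vectors; a quick sanity check (sketch):

# verify the document vector equals the mean of its token vectors
d = nlp(bus[nvidia])
assert np.allclose(d.vector, np.mean([t.vector for t in d], axis=0), atol=1e-4)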

Compute the average word vector for all companies’ business descriptions:

# Compute document vectors (spaCy averages the word vectors)
vecs = np.array([nlp(bus[permno]).vector for permno in bus.keys()])
store['vectors'] = vecs
vecs = store['vectors']
# Distance matrix
n = len(bus)
distances = np.zeros((n, n))
for row in range(n):
    for col in range(row, n):
        distances[row, col] = spatial.distance.cosine(vecs[row], vecs[col])
        distances[col, row] = distances[row, col] 
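The nested loop is equivalent to scipy's vectorized pairwise routine, where the cosine distance between two vectors u and v is 1 - cos(u, v); a quick check (sketch):

# vectorized equivalent: condensed pairwise cosine distances expanded to a square matrix
assert np.allclose(distances,
                   spatial.distance.squareform(spatial.distance.pdist(vecs, metric='cosine')),
                   atol=1e-6)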

Identify companies with the most similar business descriptions:

def most_similar(p):
    """Return the universe row of the company with the closest business description"""
    dist = distances[permnos.index(p)].copy()   # copy to avoid mutating the matrix
    dist[permnos.index(p)] = max(dist)          # mask own distance
    return univ.loc[permnos[np.argmin(dist)]]
for name in ['NVIDIA', 'APPLE COMPUTER', 'JNJ', 'EXXON MOBIL', 'AMERICAN EXPRESS']:
    p = find(name)['permno'].iloc[-1]
    print(f"'{most_similar(p)['comnam']}' is most similar to '{name}'")
'QUALCOMM INC' is most similar to 'NVIDIA'
'SALESFORCE INC' is most similar to 'APPLE COMPUTER'
'PFIZER INC' is most similar to 'JNJ'
'PIONEER NATURAL RESOURCES CO' is most similar to 'EXXON MOBIL'
'U S BANCORP DEL' is most similar to 'AMERICAN EXPRESS'

Dimensionality reduction#

t-SNE visualization#

T-distributed Stochastic Neighbor Embedding (t-SNE) visualizes high-dimensional data by converting similarities between points into joint probabilities and minimizing the Kullback-Leibler divergence between the high-dimensional and lower-dimensional representations. t-SNE preserves local structures, making it effective for clustering and uncovering hidden patterns in business descriptions.
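Formally, if $p_{ij}$ are pairwise similarities in the original space and $q_{ij}$ their counterparts in the embedding (computed with a Student-t kernel), t-SNE minimizes the divergence

$$KL(P \,\|\, Q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}}$$

which penalizes placing points far apart in the embedding when they were close in the original space.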

t-SNE in scikit-learn

from sklearn.manifold import TSNE
Z = TSNE(n_components=2, perplexity=10, random_state=42)\
    .fit_transform(vecs)

Reduce business description vectors to 2D using t-SNE and label points with ticker symbols:

fig, ax = plt.subplots(figsize=(10, 8))
ax.scatter(Z[:, 0], Z[:, 1], color="C0", alpha=.3)
for text, x, y in zip(tickers, Z[:, 0], Z[:, 1]):
    ax.annotate(text=text, xy=(x, y), fontsize='small')
ax.set_title(f"t-SNE visualization of largest decile stocks ({enddate//10000})")
plt.tight_layout()
[Figure: t-SNE visualization of largest decile stocks (2024), points labeled by ticker]

DBSCAN clustering#

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is an unsupervised clustering algorithm that detects clusters of varying densities and identifies outliers. Unlike k-means, it does not require a predefined number of clusters. Instead, it uses two parameters: epsilon (ε) and the minimum number of points required to form a dense region.
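A common heuristic for choosing ε is to plot each point's distance to its k-th nearest neighbor in sorted order and look for an elbow; a minimal sketch (the choice k=5 is an assumption for illustration, not from the original analysis):

# sorted k-th nearest neighbor distances: eps near the elbow of this curve
from sklearn.neighbors import NearestNeighbors
k = 5                                          # assumed value to probe
knn_dist, _ = NearestNeighbors(n_neighbors=k + 1).fit(Z).kneighbors(Z)
plt.figure()
plt.plot(np.sort(knn_dist[:, -1]))             # column 0 is the point itself
plt.ylabel(f"distance to {k}-th nearest neighbor")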

# eps is the most important parameter for DBSCAN
eps = 4    # larger than sklearn's default of 0.5, to match the scale of the t-SNE embedding
db = cluster.DBSCAN(eps=eps)

db.fit(Z)
n_clusters = len(set(db.labels_).difference({-1}))
n_noise = np.sum(db.labels_ == -1)
DataFrame(dict(clusters=n_clusters, noise=n_noise, eps=eps),  index=['DBSCAN'])
clusters noise eps
DBSCAN 12 19 4

Visualize DBSCAN clusters in 2D space. Display outlier ticker symbols with larger font sizes:

cmap = ColorMap(n_clusters)
fig, ax = plt.subplots(figsize=(10, 8))
# plot core samples with larger marker size
ax.scatter(Z[db.core_sample_indices_, 0],
           Z[db.core_sample_indices_, 1],
           c=cmap[db.labels_[db.core_sample_indices_]],
           alpha=.1, s=100, edgecolors=None)
# plot non-core samples with smaller marker size
non_core = np.ones_like(db.labels_, dtype=bool)
non_core[db.core_sample_indices_] = False
non_core[db.labels_ < 0] = False
ax.scatter(Z[non_core, 0], Z[non_core, 1], c=cmap[db.labels_[non_core]],
           alpha=.1, s=20, edgecolors=None)
# plot noise samples 
ax.scatter(Z[db.labels_ < 0, 0], Z[db.labels_ < 0, 1], c="darkgrey",
           alpha=.5, s=20, edgecolors=None)

# annotate tickers: core samples in small font, non-core and noise in larger font
for i, (t, c, xy) in enumerate(zip(tickers, db.labels_, Z)):
    if i in db.core_sample_indices_:
        ax.annotate(text=t, xy=xy+.5, color=cmap[c], fontsize='xx-small')
    elif c == -1:
        ax.annotate(text=t, xy=xy+.5, color='black', fontsize='medium')
    else:
        ax.annotate(text=t, xy=xy+.5, color=cmap[c], fontsize='medium')
ax.set_title(f"Largest decile stocks ({enddate//10000})")
plt.tight_layout()
[Figure: DBSCAN clusters of largest decile stocks (2024); noise tickers shown in larger black font]

List companies tagged as noisy samples:

print("Samples tagged as noise:")
univ.loc[np.array(permnos)[db.labels_ < 0]].sort_values('naics')
Samples tagged as noise:
cap capco decile nyse siccd prc naics comnam ticker
permno
75241 5.246653e+07 5.246653e+07 1 True 1311 224.88 211120 PIONEER NATURAL RESOURCES CO PXD
21207 4.770164e+07 4.770164e+07 1 True 1041 41.39 212220 NEWMONT CORP NEM
81774 6.104440e+07 6.104440e+07 1 True 1021 42.57 212230 FREEPORT MCMORAN INC FCX
82800 6.654158e+07 6.654158e+07 1 True 1021 86.07 212230 SOUTHERN COPPER CORP SCCO
69796 4.440053e+07 4.440053e+07 1 True 2084 241.75 312130 CONSTELLATION BRANDS INC STZ
11850 4.005332e+08 4.005332e+08 1 True 2911 99.98 324110 EXXON MOBIL CORP XOM
78975 1.749684e+08 1.749684e+08 1 False 7370 625.03 513210 INTUIT INC INTU
26403 1.652592e+08 1.652592e+08 1 True 4833 90.29 516120 DISNEY WALT CO DIS
44644 9.568078e+07 9.568078e+07 1 False 7374 232.97 518210 AUTOMATIC DATA PROCESSING INC ADP
47896 4.917605e+08 4.917605e+08 1 True 6021 170.10 522110 JPMORGAN CHASE & CO JPM
90993 7.350871e+07 7.350871e+07 1 True 6231 128.43 523210 INTERCONTINENTALEXCHANGE GRP INC ICE
89626 7.565405e+07 7.565405e+07 1 False 6200 210.60 523210 C M E GROUP INC CME
17478 1.395567e+08 1.395567e+08 1 True 6282 440.52 523930 S & P GLOBAL INC SPGI
61621 4.285840e+07 4.285840e+07 1 False 8700 119.11 541219 PAYCHEX INC PAYX
13628 5.769654e+07 5.769654e+07 1 False 7372 276.06 541511 WORKDAY INC WDAY
48506 7.147248e+07 7.147248e+07 1 True 7323 390.56 561450 MOODYS CORP MCO
92402 4.473782e+07 4.473782e+07 1 True 7389 565.65 561499 M S C I INC MSCI
85913 6.551968e+07 6.551968e+07 1 False 7011 225.51 721110 MARRIOTT INTERNATIONAL INC NEW MAR
14338 4.669516e+07 4.669516e+07 1 True 7011 182.09 721110 HILTON WORLDWIDE HOLDINGS INC HLT

UMAP visualization#

UMAP (Uniform Manifold Approximation and Projection) is a dimensionality reduction technique that constructs a high-dimensional graph of data points and optimizes a lower-dimensional representation while preserving essential relationships. Compared to t-SNE, UMAP is faster, scales better for large datasets, and retains more global structure.

UMAP Documentation

import umap
Z = umap.UMAP(n_components=2, n_jobs=1, min_dist=0.0, random_state=42)\
        .fit_transform(vecs)
fig, ax = plt.subplots(figsize=(10, 8))
ax.scatter(Z[:, 0], Z[:, 1], color="C0", alpha=.3)
for text, x, y in zip(tickers, Z[:, 0], Z[:, 1]):
    ax.annotate(text=text, xy=(x, y), fontsize='small')
ax.set_title(f"UMAP visualization of largest decile stocks ({enddate//10000})")
plt.tight_layout()
[Figure: UMAP visualization of largest decile stocks (2024), points labeled by ticker]

HDBSCAN clustering#

HDBSCAN (Hierarchical Density-Based Spatial Clustering of Applications with Noise) extends DBSCAN by varying the epsilon parameter and optimizing cluster stability. This makes it more robust to variations in density and parameter selection.

# HDBSCAN does not require the eps parameter; it varies the density threshold internally
hdb = cluster.HDBSCAN()
hdb.fit(Z)
n_clusters = len(set(hdb.labels_).difference({-1}))
n_noise = np.sum(hdb.labels_ == -1)
DataFrame(dict(clusters=n_clusters, noise=n_noise),  index=['HDBSCAN'])
clusters noise
HDBSCAN 14 32
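Beyond hard labels, scikit-learn's HDBSCAN also exposes a soft membership strength per sample in its probabilities_ attribute (1.0 at the heart of a cluster, 0.0 for noise); a quick look, as a sketch:

# soft cluster-membership strength of each company, indexed by permno
strength = Series(hdb.probabilities_, index=permnos)
strength.sort_values().head()      # weakest members sit near cluster boundaries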

Visualize HDBSCAN clusters in 2D space. Display outlier ticker symbols with larger font sizes.

cmap = ColorMap(n_clusters)
fig, ax = plt.subplots(figsize=(10, 8))
# plot core samples with larger marker size
ax.scatter(Z[hdb.labels_ >= 0, 0],
           Z[hdb.labels_ >= 0, 1],
           c=cmap[hdb.labels_[hdb.labels_ >= 0]],
           alpha=.1, s=100, edgecolors=None)
# plot noise samples 
ax.scatter(Z[hdb.labels_ < 0, 0], Z[hdb.labels_ < 0, 1], c="darkgrey",
           alpha=.5, s=20, edgecolors=None)

# annotate tickers: clustered samples in small font, noise in larger black font
for i, (t, c, xy) in enumerate(zip(tickers, hdb.labels_, Z)):
    if c >= 0:
        ax.annotate(text=t, xy=xy+.01, color=cmap[c], fontsize='xx-small')
    else:
        ax.annotate(text=t, xy=xy+.01, color="black", fontsize='medium')
ax.set_title(f"Largest decile stocks ({enddate//10000})")
plt.tight_layout()
[Figure: HDBSCAN clusters of largest decile stocks (2024); noise tickers shown in larger black font]

List companies tagged as noisy samples:

print("Samples tagged as noise:")
univ.loc[np.array(permnos)[hdb.labels_ < 0]].sort_values('naics')
Samples tagged as noise:
cap capco decile nyse siccd prc naics comnam ticker
permno
69796 4.440053e+07 4.440053e+07 1 True 2084 241.75 312130 CONSTELLATION BRANDS INC STZ
11850 4.005332e+08 4.005332e+08 1 True 2911 99.98 324110 EXXON MOBIL CORP XOM
36468 7.983580e+07 7.983580e+07 1 True 2851 311.90 325510 SHERWIN WILLIAMS CO SHW
70578 5.655752e+07 5.655752e+07 1 True 2841 198.35 325611 ECOLAB INC ECL
18163 3.453781e+08 3.453781e+08 1 True 2844 146.54 325620 PROCTER & GAMBLE CO PG
18729 6.563098e+07 6.563098e+07 1 True 2844 79.71 325620 COLGATE PALMOLIVE CO CL
22103 5.548783e+07 5.548783e+07 1 True 3491 97.33 332911 EMERSON ELECTRIC CO EMR
19350 1.120656e+08 1.120656e+08 1 True 3523 399.87 333111 DEERE & CO DE
14702 1.345954e+08 1.345954e+08 1 False 3550 162.07 333248 APPLIED MATERIALS INC AMAT
41355 5.918889e+07 5.918889e+07 1 True 3593 460.70 333995 PARKER HANNIFIN CORP PH
56573 7.881408e+07 7.881408e+07 1 True 3569 261.94 333999 ILLINOIS TOOL WORKS INC ITW
12490 1.493406e+08 1.493406e+08 1 True 3571 163.55 334111 INTERNATIONAL BUSINESS MACHS COR IBM
77338 5.827867e+07 5.827867e+07 1 False 3823 545.17 334513 ROPER TECHNOLOGIES INC ROP
22592 6.037929e+07 6.037929e+07 1 True 3841 109.32 339112 3M CO MMM
66181 3.449080e+08 3.449080e+08 1 True 5211 346.55 444110 HOME DEPOT INC HD
84788 1.556169e+09 1.556169e+09 1 False 7370 151.94 454110 AMAZON COM INC AMZN
48725 1.497292e+08 1.497292e+08 1 True 4011 245.62 482111 UNION PACIFIC CORP UNP
64311 5.345403e+07 5.345403e+07 1 True 4731 236.38 488510 NORFOLK SOUTHERN CORP NSC
60628 6.321543e+07 6.321543e+07 1 True 4513 252.97 492110 FEDEX CORP FDX
26403 1.652592e+08 1.652592e+08 1 True 4833 90.29 516120 DISNEY WALT CO DIS
47896 4.917605e+08 4.917605e+08 1 True 6021 170.10 522110 JPMORGAN CHASE & CO JPM
92108 9.302455e+07 9.302455e+07 1 True 6282 130.92 523940 BLACKSTONE INC BX
87842 4.894876e+07 4.894876e+07 1 True 6311 66.13 524113 METLIFE INC MET
57904 4.821135e+07 4.821135e+07 1 True 6321 82.50 524114 AFLAC INC AFL
59459 4.350773e+07 4.350773e+07 1 True 6331 190.49 524126 TRAVELERS COMPANIES INC TRV
64390 9.318533e+07 9.318533e+07 1 True 6331 159.28 524126 PROGRESSIVE CORP OH PGR
66800 4.756321e+07 4.756321e+07 1 True 6331 67.75 524126 AMERICAN INTERNATIONAL GROUP INC AIG
38093 4.855159e+07 4.855159e+07 1 True 6411 224.88 524210 GALLAGHER ARTHUR J & CO AJG
45751 9.342235e+07 9.342235e+07 1 True 6411 189.47 524210 MARSH & MCLENNAN COS INC MMC
89393 2.107022e+08 2.107022e+08 1 False 7841 486.88 532282 NETFLIX INC NFLX
13511 9.297566e+07 9.297566e+07 1 False 7371 294.88 541511 PALO ALTO NETWORKS INC PANW
11955 7.213700e+07 7.213700e+07 1 True 4953 179.10 562219 WASTE MANAGEMENT INC DEL WM

References:

Greg Durrett, 2023, "CS388 Natural Language Processing course materials", retrieved from https://www.cs.utexas.edu/~gdurrett/courses/online-course/materials.html

Gerard Hoberg and Gordon Phillips, 2016, "Text-Based Network Industries and Endogenous Product Differentiation", Journal of Political Economy 124(5), 1423-1465.

Gerard Hoberg and Gordon Phillips, 2010, "Product Market Synergies and Competition in Mergers and Acquisitions: A Text-Based Analysis", Review of Financial Studies 23(10), 3773-3811.