Industry Community Detection#

Realize that everything connects to everything else - Leonardo da Vinci

Traditional industry classification systems, such as SIC and NAICS, group firms based on production processes or product similarities. Natural language processing techniques can be leveraged to analyze product descriptions and capture dynamic changes in industry structures over time, as proposed by Hoberg and Phillips (2016). Industry communities can be detected through network analysis, where firms are modeled as nodes in a graph, and connections between them are determined by similarities in their product and market descriptions.

# By: Terence Lim, 2020-2025 (terence-lim.github.io)
import zipfile
import io
from itertools import chain
import numpy as np
import pandas as pd
from pandas import DataFrame, Series
import matplotlib.pyplot as plt
import seaborn as sns
import networkx as nx
from finds.database import SQL
from finds.readers import requests_get, Sectoring
from finds.structured import BusDay, PSTAT
from finds.recipes import graph_info
from secret import credentials
# %matplotlib qt
VERBOSE = 0
sql = SQL(**credentials['sql'], verbose=VERBOSE)
bd = BusDay(sql)
pstat = PSTAT(sql, bd, verbose=VERBOSE)

Industry taxonomy#

Industry classification, or industry taxonomy, organizes companies into groups based on shared characteristics such as production processes, product offerings, or financial market behaviors.

Text-based industry classification#

Hoberg and Phillips (2016) developed a text-based measure of firm similarity by analyzing product descriptions in 10-K filings. They construct firm-by-firm similarity scores using word vectors, filtering out common words and focusing on nouns and proper nouns, while excluding geographic terms. The similarity between firms is quantified using cosine similarity, creating a pairwise similarity matrix across firms and years.
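
In vector form, with $V_i$ denoting firm $i$'s filtered word-occurrence vector, the pairwise score is the standard cosine similarity of the normalized vectors:

$$S_{ij} = \frac{V_i \cdot V_j}{\lVert V_i \rVert \, \lVert V_j \rVert}$$

Since word counts are non-negative, $S_{ij}$ lies in $[0, 1]$, with higher values indicating greater overlap in product vocabulary.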

Since Item 101 of Regulation S-K mandates that firms accurately describe their key products in their 10-K filings, the TNIC scheme, based on textual similarity, provides a dynamic classification system that evolves with market changes. This method offers a more flexible alternative to traditional classification systems, capturing shifts in product markets over time.

Source: Hoberg and Phillips Industry Classification

The TNIC pairwise firm similarities are retrieved from the Hoberg and Phillips website:

# Retrieve TNIC scheme from Hoberg and Phillips website
tnic_scheme = 'tnic3'
root = 'https://hobergphillips.tuck.dartmouth.edu/idata/'   
source = root + tnic_scheme + '_data.zip'
if source.startswith('http'):
    response = requests_get(source)
    source = io.BytesIO(response.content)

# extract the csv file from zip archive
with zipfile.ZipFile(source).open(tnic_scheme + "_data.txt") as f:
    tnic_data = pd.read_csv(f, sep=r'\s+')

# extract latest year of tnic as data frame
year = max(tnic_data['year'])  # latest year; or substitute an earlier year, e.g. 1989, 1999, 2009, 2019
tnic = tnic_data[tnic_data['year'] == year].dropna()
tnic
year gvkey1 gvkey2 score
26307358 2023 1004 1823 0.0127
26307359 2023 1004 4091 0.0087
26307360 2023 1004 5567 0.0063
26307361 2023 1004 9698 0.0075
26307362 2023 1004 10519 0.0191
... ... ... ... ...
26973403 2023 351038 329141 0.0684
26973404 2023 351038 331856 0.0769
26973405 2023 351038 332115 0.1036
26973406 2023 351038 347007 0.0731
26973407 2023 351038 349972 0.0871

666050 rows × 4 columns

Industry classification#

Industry classification systems such as SIC and NAICS follow hierarchical structures to categorize firms based on their economic activities:

  • Standard Industrial Classification (SIC): Uses a 2-digit, 3-digit, and 4-digit hierarchy to classify industries.

  • North American Industry Classification System (NAICS): Expands classification granularity from 2-digit to 6-digit levels. In both systems, coarser levels are prefixes of the finer codes, as sketched below.
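
A minimal sketch of this prefix structure (the specific codes are illustrative; the notebook instead maps codes with the Sectoring class below):

# illustrative only: derive coarser SIC/NAICS levels by truncating digits
sic4 = 3845                                # 4-digit SIC: electromedical apparatus
sic2, sic3 = sic4 // 100, sic4 // 10       # 2-digit: 38, 3-digit: 384
naics6 = '334510'                          # 6-digit NAICS code, as a string
naics2, naics3 = naics6[:2], naics6[:3]    # 2-digit: '33', 3-digit: '334'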

# populate dataframe of nodes with gvkey (as index), permno, sic and naics
nodes = DataFrame(index=sorted(set(tnic['gvkey1']).union(tnic['gvkey2'])))\
        .rename_axis(index='gvkey')
for code in ['lpermno', 'sic', 'naics']:
    lookup = pstat.build_lookup('gvkey', code, fillna=0)
    nodes[code] = lookup(nodes.index)
Series(np.sum(nodes > 0, axis=0)).rename('Non-missing').to_frame().T
lpermno sic naics
Non-missing 3829 3829 3827
# supplement naics and sic with crosswalks in Sectoring class
naics = Sectoring(sql, 'naics', fillna=0)   # supplement from crosswalk
sic = Sectoring(sql, 'sic', fillna=0)
nodes['naics'] = nodes['naics'].where(nodes['naics'] > 0, naics[nodes['sic']])
nodes['sic'] = nodes['sic'].where(nodes['sic'] > 0, sic[nodes['naics']])
Series(np.sum(nodes > 0, axis=0)).rename('Non-missing').to_frame().T 
lpermno sic naics
Non-missing 3829 3829 3829

Sector groups#

Industry taxonomies group detailed classifications into broader sectors for economic analysis:

  • Fama and French aggregate 4-digit SIC codes into industry groups consisting of 5, 10, 12, 17, 30, 38, 48, or 49 sectors.

  • The Bureau of Economic Analysis (BEA) consolidates 6-digit NAICS codes into summary-level industry groups, with updates in 1947, 1963, and 1997.

# include sectoring schemes
codes = {'sic': ([f"codes{c}" for c in [5, 10, 12, 17, 30, 38, 48, 49]]
                 + ['sic2', 'sic3']),
         'naics': ['bea1947', 'bea1963', 'bea1997']}
sectorings = {}   # store Sectoring objects
for key, schemes in codes.items():
    for scheme in schemes:
        if scheme not in sectorings:

            # missing value is integer 0 for the sic2 and sic3 schemes, else string ''
            fillna = 0 if scheme.startswith('sic') else ''

            # load the sectoring class from SQL
            sectorings[scheme] = Sectoring(sql, scheme, fillna=fillna)

            # apply the sectoring scheme to partition the nodes
            nodes[scheme] = sectorings[scheme][nodes[key]]

        # keep nodes with non-missing data
        nodes = nodes[nodes[scheme].ne(sectorings[scheme].fillna)]
        print(len(nodes), scheme)
nodes
3845 codes5
3845 codes10
3845 codes12
3845 codes17
3845 codes30
3845 codes38
3845 codes48
3845 codes49
3829 sic2
3829 sic3
3561 bea1947
3561 bea1963
3561 bea1997
lpermno sic naics codes5 codes10 codes12 codes17 codes30 codes38 codes48 codes49 sic2 sic3 bea1947 bea1963 bea1997
gvkey
1004 54594 5080 423860 Cnsmr Shops Shops Machn Whlsl Whlsl Whlsl Whlsl 50 508 42 42 42
1045 21020 4512 481111 Other Durbl Durbl Trans Trans Trans Trans Trans 45 451 48 481 481
1050 11499 3564 333413 Manuf Manuf Manuf Machn FabPr Machn Mach Mach 35 356 333 333 333
1076 10517 6141 522220 Other Other Money Finan Fin Money Banks Banks 61 614 52 521CI 521CI
1078 20482 3845 334510 Hlth Hlth Hlth Other Hlth Instr MedEq MedEq 38 384 334 334 334
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
345980 20333 5961 455110 Cnsmr Shops Shops Rtail Rtail Rtail Rtail Rtail 59 596 44RT 44RT 4A0
347007 15533 2836 325414 Hlth Hlth Hlth Other Hlth Chems Drugs Drugs 28 283 325 325 325
349337 20867 3845 334510 Hlth Hlth Hlth Other Hlth Instr MedEq MedEq 38 384 334 334 334
349972 15642 2836 325414 Hlth Hlth Hlth Other Hlth Chems Drugs Drugs 28 283 325 325 325
351038 16161 2834 325412 Hlth Hlth Hlth Cnsum Hlth Chems Drugs Drugs 28 283 325 325 325

3561 rows × 16 columns

Community structure#

In network analysis, community structure refers to the clustering of nodes (firms) into partitions (groups) based on connectivity patterns. Identifying these communities helps reveal hidden industry relationships and competitive dynamics.

# populate undirected graph with tnic edges
edges = tnic[tnic['gvkey1'].isin(nodes.index) & tnic['gvkey2'].isin(nodes.index)]
edges = list(
    edges[['gvkey1', 'gvkey2', 'score']].itertuples(index=False, name=None))
G = nx.Graph()
G.add_weighted_edges_from(edges)
G.remove_edges_from(nx.selfloop_edges(G))  # drop any self-loops (TNIC pairs should contain none)
Series(graph_info(G, fast=True)).rename(year).to_frame()
2023
transitivity 0.877035
average_clustering 0.575643
connected False
connected_components 9
size_largest_component 3523
directed False
weighted True
negatively_weighted False
edges 320352
nodes 3541
selfloops 0
density 0.051113

Measuring partitions#

The quality of graph partitions can be evaluated with modularity, which measures the strength of a community structure by comparing the weight of connections observed within clusters against the weight expected under a degree-preserving random null model.
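
Formally, for a graph with adjacency (edge-weight) matrix $A$, total edge weight $m = \frac{1}{2}\sum_{ij} A_{ij}$, node strengths $k_i = \sum_j A_{ij}$, and community assignments $c_i$, the Newman-Girvan modularity is

$$Q = \frac{1}{2m}\sum_{ij}\left[A_{ij} - \frac{k_i k_j}{2m}\right]\delta(c_i, c_j)$$

where $\delta(c_i, c_j) = 1$ if nodes $i$ and $j$ share a community and 0 otherwise. $Q$ near zero means the partition captures no more within-community weight than a degree-preserving random graph would; networks with strong community structure typically score roughly between 0.3 and 0.7.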

# evaluate modularity of sectoring schemes on TNIC graph
def community_quality(G, communities):
    """helper to measure quality of partitions"""
    out = {'communities': len(communities)}
    out['modularity'] = nx.community.modularity(G, communities)
    (out['coverage'],
     out['performance']) = nx.community.partition_quality(G, communities)    
    return out
modularity = {}   # to collect measurements of each scheme
for scheme in sorted(chain(*codes.values())):
    communities = nodes.loc[list(G.nodes), scheme]\
                       .reset_index()\
                       .groupby(scheme)['gvkey']\
                       .apply(list)\
                       .to_list()    # list of lists of node labels
    modularity[scheme] = community_quality(G, communities)
df = DataFrame.from_dict(modularity, orient='index').sort_index()
print(f"Quality of sectoring schemes on TNIC graph ({year})")
df
Quality of sectoring schemes on TNIC graph (2023)
communities modularity coverage performance
bea1947 40 0.330481 0.779187 0.925859
bea1963 58 0.324246 0.734745 0.948296
bea1997 61 0.324169 0.734514 0.948689
codes10 10 0.335843 0.940503 0.850069
codes12 12 0.336655 0.938187 0.878268
codes17 17 0.285847 0.766719 0.794675
codes30 30 0.335544 0.934385 0.899115
codes38 36 0.333800 0.793237 0.890785
codes48 48 0.331168 0.752610 0.944559
codes49 49 0.331003 0.751526 0.951096
codes5 5 0.337074 0.945045 0.818984
sic2 67 0.327541 0.743694 0.942476
sic3 226 0.288389 0.690952 0.958297

Detecting partitions#

Community detection in graphs can be performed using various algorithms, including:

  • Label Propagation Algorithm: This method assigns an initial label to each node and iteratively updates each node's label to the majority label among its neighbors, allowing communities to form dynamically. It is fast and works well for large networks, but its randomness means different runs may produce different results (a minimal sketch follows this list).

  • Louvain Method: This hierarchical clustering algorithm optimizes modularity by iteratively merging small communities into larger ones, maximizing intra-community connections while minimizing inter-community edges.

  • Greedy Algorithm: This algorithm builds communities by iteratively merging pairs of nodes or groups that result in the largest modularity gain, prioritizing locally optimal choices.
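
To make the label propagation dynamics concrete, here is a minimal from-scratch sketch of the asynchronous update rule. This is an illustration only: networkx's label_propagation_communities, used below, differs in implementation details such as update schedule and tie-breaking.

# minimal sketch of asynchronous label propagation (illustration only)
import random
from collections import Counter

def label_propagation_sketch(G, max_iter=100, seed=0):
    """Repeatedly assign each node the most common label among its neighbors"""
    rng = random.Random(seed)
    labels = {node: node for node in G}       # initialize every node with a unique label
    for _ in range(max_iter):
        changed = False
        order = list(G)
        rng.shuffle(order)                    # visit nodes in random order
        for node in order:
            counts = Counter(labels[nbr] for nbr in G[node])
            if not counts:                    # isolated node keeps its own label
                continue
            top = max(counts.values())
            new = rng.choice([lab for lab, n in counts.items() if n == top])
            if new != labels[node]:           # adopt majority label, breaking ties at random
                labels[node], changed = new, True
        if not changed:                       # converged: no label changed this pass
            break
    return labels                             # map of node -> community label

Grouping nodes by their final label yields the detected communities.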

# Run community detection algorithms
def community_detection(G):
    """Helper to run community detection algorithms on an undirected graph"""
    out = {}
    out['label'] = nx.community.label_propagation_communities(G)
    out['louvain'] = nx.community.louvain_communities(G, resolution=1)
    out['greedy'] = nx.community.greedy_modularity_communities(G, resolution=1)
    return out
communities = community_detection(G)
quality = {}
for key, community in communities.items():
    quality[key] = community_quality(G, community)
df = DataFrame.from_dict(quality, orient='index').sort_index()
print(f"Modularity of community detection algorithms on TNIC graph ({year})")
df
Modularity of community detection algorithms on TNIC graph (2023)
communities modularity coverage performance
greedy 51 0.323848 0.989711 0.689093
label 101 0.347795 0.990485 0.909486
louvain 19 0.354818 0.824855 0.838868
# Visualize Fama-French 49-industries in the detected communities
key = 'codes49'
for ifig, detection in enumerate(communities.keys()):

    # count industries represented in each partition
    industry = []
    communities_sequence = sorted(communities[detection], key=len, reverse=True)    
    for i, community in enumerate(communities_sequence):
        industry.append(nodes[key][list(community)].value_counts().rename(i+1))
    names = sectorings[key].sectors['name'].drop_duplicates(keep='first')
    df = pd.concat(industry, axis=1)\
           .dropna(axis=0, how='all')\
           .fillna(0)\
           .astype(int)\
           .reindex(names)

    # display as heatmap
    fig, ax = plt.subplots(num=ifig+1, clear=True, figsize=(6, 8))
    sns.heatmap(df.iloc[:,:10],
                square=False,
                linewidth=.5,
                ax=ax,
                yticklabels=1,
                cmap="YlGnBu",
                robust=True)
    if key.startswith('bea'):
        ax.set_yticklabels(Sectoring._bea_industry[df.index], size=10)
    else:
        ax.set_yticklabels(df.index, size=10)
    ax.set_title(f'{detection.capitalize()} Community Detection {year}')
    ax.set_xlabel(f"Industry representation in communities")
    ax.set_ylabel(f"{key} industry")
    fig.subplots_adjust(left=0.4)
    plt.tight_layout(pad=0)
[Three heatmaps, one per detection algorithm (label propagation, Louvain, greedy), showing Fama-French 49-industry representation within the ten largest detected communities, 2023.]

References#

Gerard Hoberg and Gordon Phillips, 2016, Text-Based Network Industries and Endogenous Product Differentiation. Journal of Political Economy 124 (5), 1423-1465.

Gerard Hoberg and Gordon Phillips, 2010, Product Market Synergies and Competition in Mergers and Acquisitions: A Text-Based Analysis. Review of Financial Studies 23 (10), 3773-3811.