Quant Factors

Quant Factors#

Quants do it with models - Anonymous

Factor investing is a systematic approach to asset pricing and portfolio management based on the premise that various risk factors drive asset returns. Factor-based investing recognizes that besides broad market exposure, additional systematic risks—such as value, momentum, and volatility—play a crucial role in determining returns. This framework has evolved over time, beginning with early models like the Capital Asset Pricing Model (CAPM) and expanding to multifactor models, behavioral theories, and adaptive market perspectives. This analysis explores the empirical performance of different factor strategies, and the methodologies used to evaluate them. We also examine historical backtests and employ clustering techniques to group similar investment strategies, seeking to identify style factors and construct effective benchmarks.

from pandas import DataFrame, Series
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.cm as cm
from sklearn.cluster import AgglomerativeClustering, KMeans
from scipy.cluster.hierarchy import dendrogram
from sklearn.metrics import silhouette_samples, silhouette_score
from tqdm import tqdm
import warnings
from datetime import datetime
from typing import List, Tuple
from finds.database import SQL, RedisDB
from finds.structured import (BusDay, Stocks, Benchmarks, Signals, SignalsFrame,
                              CRSP, PSTAT, IBES, CRSPBuffer)
from finds.backtesting import BackTest, univariate_sorts
from secret import credentials, paths, CRSP_DATE
VERBOSE = 0
if not VERBOSE:
    warnings.simplefilter(action='ignore', category=FutureWarning)


#%matplotlib qt

LAST_DATE = CRSP_DATE
# open connections
imgdir = paths['images']
sql = SQL(**credentials['sql'], verbose=VERBOSE)
user = SQL(**credentials['user'], verbose=VERBOSE)
rdb = RedisDB(**credentials['redis'])
bd = BusDay(sql, verbose=VERBOSE)
crsp = CRSP(sql, bd, rdb=rdb, verbose=VERBOSE)
pstat = PSTAT(sql, bd, verbose=VERBOSE)
bench = Benchmarks(sql, bd, verbose=VERBOSE)
signals = Signals(user, verbose=VERBOSE)
ibes = IBES(sql, bd, verbose=VERBOSE)
backtest = BackTest(user, bench, 'RF', LAST_DATE, verbose=VERBOSE)
outdir = paths['scratch'] / 'output'

Factor investing#

Factor investing posits that asset returns are driven by exposure to specific risk factors, which determine their risk premiums. The market itself is an investable factor, as described by the Capital Asset Pricing Model (CAPM), which asserts that market exposure is the sole driver of asset returns. However, additional factors such as interest rates, value-growth investing, low volatility strategies, and momentum portfolios have been widely recognized. Macroeconomic factors, including inflation and economic growth, also influence asset returns. Assets exhibit varying degrees of exposure to these risk factors, with greater exposure leading to higher risk premiums. Essentially, assets can be viewed as bundles of different factor exposures.

Early multifactor models include Stephen Ross’s (1976) Arbitrage Pricing Theory (APT), which argues that risk factors cannot be arbitraged or diversified away, and Robert Merton’s Intertemporal Capital Asset Pricing Model (ICAPM), which accounts for investors hedging risky positions over multiple time periods. Additionally, behavioral finance theories suggest that factor premiums arise due to investor biases, such as overreaction, underreaction, and bounded rationality.

Adaptive Markets Hypothesis#

Andrew Lo’s (2004) Adaptive Markets Hypothesis proposes that financial markets are shaped by principles of evolutionary biology rather than fixed physical laws. This perspective suggests that investment performance fluctuates as the financial ecosystem and market conditions evolve. Lo advocates studying financial markets by analyzing different “species” of investors—individuals and institutions that share common traits—and tracking their size, growth, interactions, and behavioral tendencies.

Factor Zoo#

John Cochrane (2011) coined the term Factor Zoo to highlight the rapid proliferation of newly identified factors in academic research. In response, Green, Hand, and Zhang (2017) systematically examined nearly 100 firm characteristic factors, addressing issues such as microcap stock overweighting and data snooping biases. Their study assessed the predictive power of these factors across different time periods.

# preload monthly stocks data
monthly = CRSPBuffer(stocks=crsp, dataset='monthly', fields=['ret', 'retx', 'prc'],
                     beg=19251201, end=CRSP_DATE)

# signals to flip signs when forming spread portfolios
leverage = {'mom1m':-1, 'mom36m':-1, 'pricedelay':-1, 'absacc':-1, 'acc':-1,
            'agr':-1, 'chcsho':-1, 'egr':-1, 'mve_ia':-1, 'pctacc':-1,
            'aeavol':-1, 'disp':-1, 'stdacc':-1, 'stdcf':-1, 'secured':-1,
            'maxret':-1, 'ill':-1, 'zerotrade':-1, 'cashpr':-1, 'chinv':-1,
            'invest':-1, 'cinvest':-1, 'idiovol':-1, 'retvol':-1}

Helper functions

# to lag yearly characteristics
def as_lags(df, var, key, nlags):
    """Return dataframe with {nlags} of column {var}, same {key} value in row"""
    out = df[[var]].rename(columns={var: 0})      # first col: not shifted
    for i in range(1, nlags):
        prev = df[[key, var]].shift(i, fill_value=0) # next col: shifted i+1
        prev.loc[prev[key] != df[key], :] = np.nan   # require same {key} value
        out.insert(i, i, prev[var])
    return out

# rolling window of returns
def as_rolling(df, other, width=0, dropna=True):
    """join next dataframe to a sliding window with fixed number of columns"""
    df = df.join(other, how='outer', sort=True, rsuffix='r')
    if width and len(df.columns) > width:          # if wider than width
        df = df.iloc[:, (len(df.columns)-width):]  #    then drop first cols
    if dropna:                                     # drop empty rows
        df = df[df.count(axis=1) > 0]
    df.columns = list(range(len(df.columns)))
    return df

# pipeline to run backtest
def backtest_pipeline(backtest: BackTest,
                      stocks: Stocks,
                      holdings: DataFrame,
                      label: str,
                      benchnames: List[str],
                      suffix: str = '',
                      overlap: int = 0,
                      outdir: str ='',
                      num: int = None) -> DataFrame:
    """wrapper to run a backtest pipeline

    Args:
      backtest: To compute backtest results
      stocks: Where securities returns can be retrieved from (e.g. CRSP)
      holdings: dict (key int date) of Series holdings (key permno)
      label: Label of signal to backtest
      benchnames: Names of benchmarks to attribute portfolio performance
      overlap: Number of overlapping holdings to smooth
      num: Figure num to plot to

    Returns:
      DataFrame of performance returns in rows

    Notes:
      graph and summary statistics are output to jpg and (appended) html
      backtest object updated with performance and attribution data
    """
    summary = backtest(stocks, holdings, label, overlap=overlap)
    excess = backtest.fit(benchnames)
    backtest.write(label)
    backtest.plot(num=num, label=label + suffix)
    if VERBOSE:
        print(pd.Series(backtest.annualized, name=label + suffix)\
              .to_frame().T.round(3).to_string())
    if outdir:
        # performance metrics from backtest to output
        sub = ['alpha', 'excess', 'appraisal', 'sharpe', 'welch-t', 'welch-p']
        with open(outdir / 'index.html', 'at') as f:
            f.write(f"<p><hr><h2>{label + suffix}</h2>\n<pre>\n")
            f.write("{}-{} {}\n".format(min(backtest.excess.index),
                                        max(backtest.excess.index),
                                        benchnames))
            f.write("{:12s} ".format("Annualized"))
            f.write("".join(f"{k:>10s}" for k in sub) + "\n")
            f.write("{:12s} ".format(label + ":"))
            f.write("".join(f"{backtest.annualized[k]:10.4f}" for k in sub))
            f.write(f"\n</pre><p>{datetime.now()}\n")
    return summary

Past prices#

Momentum and dividend yield data are sourced from CRSP monthly records

beg, end = 19251231, LAST_DATE
intervals = {'mom12m': (2,12),
             'mom36m': (13,36),
             'mom6m': (2,6),
             'mom1m': (1,1)}
for label, past in tqdm(intervals.items(), total=len(intervals)):
    out = []
    rebaldates = bd.date_range(bd.endmo(beg, past[1]), end, 'endmo')
    for rebaldate in rebaldates:
        start = bd.endmo(rebaldate, -past[1])
        beg1 = bd.offset(start, 1)
        end1 = bd.endmo(rebaldate, 1-past[0])
        df = crsp.get_universe(end1)
        # require data available as of start month and end month (universe)
        df['start'] = monthly.get_section(dataset='monthly',
                                          fields=['ret'],
                                          date_field='date',
                                          date=start).reindex(df.index)
        df[label] = monthly.get_ret(beg1, end1).reindex(df.index)
        df['permno'] = df.index
        df['rebaldate'] = rebaldate
        df = df.dropna(subset=['start'])
        out.append(df[['rebaldate', 'permno', label]]) # append rows
    out = pd.concat(out, axis=0, ignore_index=True)
    n = signals.write(out, label, overwrite=True)

    beg, end = 19270101, LAST_DATE
    columns = ['chmom', 'divyld', 'indmom']
    out = []
    for rebaldate in bd.date_range(beg, end, 'endmo'):
        start = bd.endmo(rebaldate, -12)
        beg1 = bd.offset(start, 1)
        end1 = bd.endmo(rebaldate, -6)
        beg2 = bd.offset(end1, 1)
        end2 = bd.endmo(rebaldate)
        df = crsp.get_universe(end1)
        df['start'] = monthly.get_section(dataset='monthly',
                                          fields=['ret'],
                                          date_field='date',
                                          date=start).reindex(df.index)
        df['end2'] = monthly.get_section(dataset='monthly',
                                         fields=['ret'],
                                         date_field='date',
                                         date=end2).reindex(df.index)
        df['mom2'] = monthly.get_ret(beg2, end2).reindex(df.index)
        df['mom1'] = monthly.get_ret(beg1, end1).reindex(df.index)
        df['divyld'] = crsp.get_divamt(beg1, end2)\
                           .reindex(df.index)['divamt']\
                           .div(df['cap'])\
                           .fillna(0)
        df['chmom'] = df['mom1'] - df['mom2']

        # 6-month two-digit sic industry momentum (group means of 'mom1')
        df['sic2'] = df['siccd'] // 100
        df = df.join(DataFrame(df.groupby(['sic2'])['mom1'].mean())\
                     .rename(columns={'mom1': 'indmom'}),
                     on='sic2', how='left')
        df['permno'] = df.index
        df['rebaldate'] = rebaldate
        out.append(df.dropna(subset=['start','end2'])\
                   [['rebaldate', 'permno'] + columns])
    out = pd.concat(out, axis=0, ignore_index=True)
    for label in columns:   # save signal values to sql
        n = signals.write(out, label, overwrite=True)

100%|██████████| 4/4 [47:03<00:00, 705.93s/it]

benchnames = ['Mkt-RF(mo)']
rebalbeg, rebalend = 19260101, LAST_DATE
columns = ['mom12m', 'mom6m', 'chmom', 'indmom', 'divyld', 'mom1m', 'mom36m']
for label in tqdm(columns):
    holdings = univariate_sorts(monthly,
                                label,
                                SignalsFrame(signals.read(label)),
                                rebalbeg,
                                rebalend,
                                window=1,
                                months=[],
                                maxdecile=8,
                                minprc=1.0,
                                pct=(10.0, 90.0),
                                leverage=leverage.get(label, 1))
    excess = backtest_pipeline(backtest,
                               monthly,
                               holdings,
                               label,
                               benchnames,
                               overlap=0,
                               outdir=outdir,
                               suffix=(leverage.get(label, 1) < 0)*'(-)')

100%|██████████| 7/7 [22:01<00:00, 188.74s/it]

_images/5b0b292dc148897991d47373ee77759ed533537a63f1969fbbaa1a849ede1990.png

_images/893b5b8c9e6464fcde8940c41b9221d82818db36a316c8998318702bca67821b.png

_images/11078450ca7695ccf2d4f4cf39598b82e8723ecbc7cd7a647fecb4b77c7e3d00.png

_images/3c0a4f50844a517110a8e4d1af8cd9c5f1b4dc52d0e1a02b4eac85c596fda165.png

_images/c2d6cc544f63e59362a66461d13891ac89c7b18c0f414403966186d77647e2d8.png

_images/d53de64fda97f090854c9b8558689399254e396cba64aa6f7eed970e5858dac6.png

_images/959c3c586608efe35f76c107dc50af5faf94f03f1f3a19fd7ac65862d54f695a.png

# helper to calculate beta, idiovol and price delay from weekly returns
def regress(x: np.array, y: np.array) -> Tuple[float, float, float]:
    """helper method to calculate beta, idiovol and price delay

    Args:
      x: equal-weighted market returns (in ascending time order)
      y: stock returns (in ascending time order).  NaN's will be discarded.

    Returns:
      beta: slope from regression on market returns and intercept
      idiovol: mean squared error of residuals
      pricedelay: increase of adjusted Rsq with four market lags over without
    """
    v = np.logical_not(np.isnan(y))
    y = y[v]
    x = x[v]
    n0 = len(y)
    A0 = np.vstack([x, np.ones(len(y))]).T
    b0 = np.linalg.inv(A0.T.dot(A0)).dot(A0.T.dot(y))   # univariate coeffs
    sse0 = np.mean((y - A0.dot(b0))**2)
    sst0 = np.mean((y - np.mean(y))**2)
    if (sst0>0 and sse0>0):
        R0 = (1 - ((sse0 / (n0 - 2)) / (sst0 / (n0 - 1))))
    else:
        R0 = 0
    y4 = y[4:]
    n4 = len(y4)         
    A4 = np.vstack([x[0:-4], x[1:-3], x[2:-2], x[3:-1], x[4:],
                    np.ones(n4)]).T
    b4 = np.linalg.inv(A4.T.dot(A4)).dot(A4.T.dot(y4))  # four lagged coeffs
    sse4 = np.mean((y4 - A4.dot(b4))**2)
    sst4 = np.mean((y4 - np.mean(y4))**2)
    if sst4 > 0 and sse4 > 0:
        R4 = (1 - ((sse4 / (n4 - 6)) / (sst4 / (n4 - 1))))
    else:
        R4 = 0    
    return [b0[0],
            sse0 or np.nan,
            (1 -(R0 / R4)) if R0>0 and R4>0 else np.nan]

Weekly price responses derived from CRSP daily records.

beg, end = 19260101, LAST_DATE
columns  = ['beta', 'idiovol', 'pricedelay']
wd = BusDay(sql, endweek='Wed')  # custom weekly trading day calendar

width    = 3*52+1           # up to 3 years of weekly returns
minvalid = 52               # at least 52 weeks required to compute beta
weekly   = DataFrame()      # rolling window of weekly stock returns
mkt      = DataFrame()      # to queue equal-weighted market returns
out      = []               # to accumulate final calculations

for date in tqdm(wd.date_range(beg, end, 'weekly')):
    df = crsp.get_ret(wd.begwk(date), date)
    mkt = as_rolling(mkt,       # rolling window of weekly mkt returns
                     DataFrame(data=[df.mean()], columns=[date]),
                     width=width)
    weekly = as_rolling(weekly, # rolling window of weekly stock returns
                        df.rename(date),
                        width=width) 
    valid = weekly.count(axis=1) >= minvalid  # require min number weeks
    if valid.any():
        result = DataFrame([regress(mkt.values[0], y)
                            for y in weekly.loc[valid].values],
                           columns=columns)
        result['permno'] = weekly.index[valid].values
        result['rebaldate'] = date
        if wd.ismonthend(date): # signal value from last week of month
            out.append(result)
out = pd.concat(out, axis=0, ignore_index=True)
for label in columns:
    signals.write(out, label, overwrite=True)

100%|██████████| 5165/5165 [31:49<00:00,  2.70it/s]

benchnames = ['Mkt-RF(mo)']
rebalbeg, rebalend = 19290601, LAST_DATE
for label in tqdm(columns):
    holdings = univariate_sorts(monthly,
                                label,
                                SignalsFrame(signals.read(label)),
                                rebalbeg,
                                rebalend,
                                window=1,
                                months=[],
                                maxdecile=8,
                                pct=(10., 90.),
                                leverage=leverage.get(label, 1))
    excess = backtest_pipeline(backtest,
                               monthly,
                               holdings,
                               label,
                               benchnames,
                               overlap=0,
                               outdir=outdir,
                               suffix=(leverage.get(label, 1) < 0)*'(-)')

100%|██████████| 3/3 [07:48<00:00, 156.06s/it]

_images/0be45ea8caf1d2b5b2a7f5cdaceea3c4f3824dec297031bf07233d9f6f7c905d.png

_images/e1f02c250658de47f7b9e13fba6d9a8972a0785b9d010ff30bcb4862efe8fcf0.png

_images/80ae54a88abe90cdd26ca9bd59a603d8d38aa8dc271b96b1880dec71b39a8316.png

Liquidity#

Liquidity signals are derived from daily stock return data.

beg, end = 19830601, LAST_DATE    # nasdaq/volume from after 1982
columns = ['ill', 'maxret', 'retvol', 'baspread', 'std_dolvol',
           'zerotrade', 'std_turn', 'turn']

out = []
dolvol = []
turn = DataFrame()    # to average turn signal over rolling 3-months
dt = bd.date_range(bd.begmo(beg,-3), end, 'endmo') # monthly rebalances
chunksize = 12        # each chunk is 12 months (1 year)
chunks = [dt[i:(i+chunksize)] for i in range(0, len(dt), chunksize)]
for chunk in tqdm(chunks):
    q = (f"SELECT permno, date, ret, askhi, bidlo, prc, vol, shrout "
         f" FROM {crsp['daily'].key}"
         f" WHERE date>={bd.begmo(chunk[0])}"
         f"   AND date<={chunk[-1]}")        # retrieve a chunk
    f = crsp.sql.read_dataframe(q).sort_values(['permno', 'date'])
    f['baspread'] = ((f['askhi'] - f['bidlo']) / ((f['askhi'] + f['bidlo']) / 2))
    f['dolvol'] = f['prc'].abs() * f['vol']
    f['turn1'] = f['vol'] / f['shrout']
    f.loc[f['dolvol']>0, 'ldv'] = np.log(f.loc[f['dolvol']>0, 'dolvol'])
    f['ill'] = 1000000 * f['ret'].abs() / f['dolvol']

    for rebaldate in chunk:            # for each rebaldate in the chunk
        grouped = f[f['date'].ge(bd.begmo(rebaldate))
                    & f['date'].le(rebaldate)].groupby('permno')
        df = grouped[['ret']].max().rename(columns={'ret': 'maxret'})
        df['retvol'] = grouped['ret'].std()
        df['baspread'] = grouped['baspread'].mean()
        df['std_dolvol'] = grouped['ldv'].std()
        df['ill'] = grouped['ill'].mean()
        dv = grouped['dolvol'].sum()
        df.loc[dv > 0, 'dolvol'] = np.log(dv[dv > 0])
        df['turn1'] = grouped['turn1'].sum()
        df['std_turn'] = grouped['turn1'].std()
        df['countzero'] = grouped['vol'].apply(lambda v: sum(v==0))
        df['ndays'] = grouped['prc'].count()
            
        turn = as_rolling(turn, df[['turn1']], width=3)
        df['turn'] = turn.reindex(df.index).mean(axis=1, skipna=False)
        df.loc[df['turn1'].le(0), 'turn1'] = 0
        df.loc[df['ndays'].le(0), 'ndays'] = 0
        df['zerotrade'] = ((df['countzero'] + ((1/df['turn1'])/480000))
                           * 21/df['ndays'])
    
        df['rebaldate'] = rebaldate
        df = df.reset_index()
        out.append(df[['permno', 'rebaldate'] + columns])
        if rebaldate < bd.endmo(end):
            df['rebaldate'] = bd.endmo(rebaldate, 1)
            dolvol.append(df[['permno','rebaldate','dolvol']])
out = pd.concat(out, axis=0, ignore_index=True)
dolvol = pd.concat(dolvol, axis=0, ignore_index=True)            

100%|██████████| 42/42 [25:56<00:00, 37.05s/it]

for label in columns:
    n = signals.write(out, label, overwrite=True)
n = signals.write(dolvol, 'dolvol', overwrite=True)

rebalbeg, rebalend = 19830601, LAST_DATE
benchnames = ['Mkt-RF(mo)']
for label in tqdm(columns + ['dolvol']):
    holdings = univariate_sorts(monthly,
                                label,
                                SignalsFrame(signals.read(label)),
                                rebalbeg,
                                rebalend,
                                window=1,
                                months=[],
                                maxdecile=8,
                                pct=(10., 90.),
                                leverage=leverage.get(label, 1))
    excess = backtest_pipeline(backtest,
                               monthly,
                               holdings,
                               label,
                               benchnames,
                               overlap=0,
                               outdir=outdir,
                               suffix=(leverage.get(label, 1) < 0)*'(-)')

100%|██████████| 9/9 [09:46<00:00, 65.17s/it]

_images/db19c58a365d2089fd6b417ca5a92de19d770e4cd87c4bcb3a2c85c75b7075a4.png

_images/2684ab47dbb557c2e9b98a942b83fb34d38d40017d783b49aed729044d6429b1.png

_images/ad29d61dad29ce16c18c6aadf44e3b0e5c496330ca513d094d13f0bc8c83358d.png

_images/4f6fe8afbc511a7b2c8e132e7c1ce87cc188cb1b362fc475a2ecc9d4b31a37ec.png

_images/022a942fd59f36f56bdd5ad3af99bd4554e8cc5b15d5bebe5200ce6ca8099db9.png

_images/95487e36cc6ef21477b62ee9e395bafbd76fc151d3640424d3e9c7794f356a81.png

_images/115752759814e0023dc786f04fcb3b86095bded9aa38858af0cd7c9eabaaaa55.png

_images/c0fa17a560a6174ede981d553986dc3bce6dd84f7abc70dd7d588955135a808d.png

_images/cbfe283b6234fa44d00eabf00a59293baab5e5bba12c4d88de9b19c86dd62c2c.png

Fundamentals#

Fundamental signals are collected from Compustat, using both annual and quarterly datasets.

columns = ['absacc', 'acc', 'agr', 'bm', 'cashpr', 'cfp', 'chcsho',
           'chinv', 'depr', 'dy', 'egr', 'ep', 'gma', 'grcapx',
           'grltnoa', 'hire', 'invest', 'lev', 'lgr' ,
           'pchdepr', 'pchgm_pchsale', 'pchquick',
           'pchsale_pchinvt', 'pchsale_pchrect', 'pchsale_pchxsga',
           'pchsaleinv', 'pctacc', 'quick', 'rd_sale', 'rd_mve',
           'realestate', 'salecash', 'salerec', 'saleinv', 'secured',
           'sgr', 'sp', 'tang', 'bm_ia', 'cfp_ia', 'chatoia' , 'chpmia',
           'pchcapx_ia', 'chempia', 'mve_ia']
numlag = 6       # number of months to lag data for rebalance
end = LAST_DATE   # last data date

# retrieve annual, keep [permno, datadate] with non null prccq if any
fields = ['sic', 'fyear', 'ib', 'oancf', 'at', 'act', 'che', 'lct',
          'dlc', 'dltt', 'prcc_f', 'csho', 'invt', 'dp', 'ppent',
          'dvt', 'ceq', 'txp', 'revt', 'cogs', 'rect', 'aco', 'intan',
          'ao', 'ap', 'lco', 'lo', 'capx', 'emp', 'ppegt', 'lt',
          'sale', 'xsga', 'xrd', 'fatb', 'fatl', 'dm']
df = pstat.get_linked(dataset='annual',
                      fields=fields,
                      date_field='datadate',
                      where=(f"indfmt = 'INDL' "
                             f"  AND datafmt = 'STD'"
                             f"  AND curcd = 'USD' "
                             f"  AND popsrc = 'D'"
                             f"  AND consol = 'C'"
                             f"  AND datadate <= {end//100}31"))
fund = df.sort_values(['permno', 'datadate', 'ib'])\
         .drop_duplicates(['permno', 'datadate'])\
         .dropna(subset=['ib'])
fund.index = list(zip(fund['permno'], fund['datadate']))  # multi-index
fund['rebaldate'] = bd.endmo(fund.datadate, numlag)

# precompute, and lag common metrics: mve_f avg_at sic2
fund['sic2'] = np.where(fund['sic'].notna(), fund['sic'] // 100, 0)
fund['fyear'] = fund['datadate'] // 10000      # can delete this
fund['mve_f'] = fund['prcc_f'] * fund['csho']

lag = fund.shift(1, fill_value=0)
lag.loc[lag['permno'] != fund['permno'], fields] = np.nan
fund['avg_at'] = (fund['at'] + lag['at']) / 2

lag2 = fund.shift(2, fill_value=0)
lag2.loc[lag2['permno'] != fund['permno'], fields] = np.nan
lag['avg_at'] = (lag['at'] + lag2['at']) / 2

fund['bm'] = fund['ceq'] / fund['mve_f']
fund['cashpr'] = (fund['mve_f'] + fund['dltt'] - fund['at'])/fund['che']
fund['depr'] = fund['dp'] / fund['ppent']
fund['dy'] = fund['dvt'] / fund['mve_f']
fund['ep'] = fund['ib'] / fund['mve_f']
fund['lev'] = fund['lt'] / fund['mve_f']
fund['quick'] = (fund['act'] - fund['invt']) / fund['lct']
fund['rd_sale'] = fund['xrd'] / fund['sale']
fund['rd_mve'] = fund['xrd'] / fund['mve_f']
fund['realestate'] = ((fund['fatb'] + fund['fatl']) /
                      np.where(fund['ppegt'].notna(),
                               fund['ppegt'], fund['ppent']))
fund['salecash'] = fund['sale'] / fund['che']
fund['salerec'] = fund['sale'] / fund['rect']
fund['saleinv'] = fund['sale'] / fund['invt']
fund['secured'] = fund['dm'] / fund['dltt']
fund['sp'] = fund['sale'] / fund['mve_f']
fund['tang'] = (fund['che'] + fund['rect'] * 0.715 + fund['invt'] * 0.547
                + fund['ppent'] * 0.535) / fund['at']

# changes: agr chcsho chinv egr gma egr grcapx grltnoa emp invest lgr
fund['agr'] = (fund['at'] / lag['at'])
fund['chcsho'] = (fund['csho'] / lag['csho'])
fund['chinv'] = ((fund['invt'] - lag['invt']) / fund['avg_at'])
fund['egr'] = (fund['ceq'] / lag['ceq'])
fund['gma'] = ((fund['revt'] - fund['cogs']) / lag['at'])
fund['grcapx'] = (fund['capx'] / lag2['capx'])
fund['grltnoa'] =  (((fund['rect']
                      + fund['invt']
                      + fund['ppent']
                      + fund['aco']
                      + fund['intan']
                      + fund['ao']
                      - fund['ap']
                      - fund['lco']
                      - fund['lo'])
                     / (lag['rect']
                        + lag['invt']
                        + lag['ppent']
                        + lag['aco']
                        + lag['intan']
                        + lag['ao']
                        - lag['ap']
                        - lag['lco']
                        - lag['lo']))
                    - ((fund['rect']
                        + fund['invt']
                        + fund['aco']
                        - fund['ap']
                        - fund['lco'])
                       - (lag['rect']
                          + lag['invt']
                          + lag['aco']
                          - lag['ap']
                          - lag['lco']))) / fund['avg_at']
fund['hire'] = ((fund['emp'] / lag['emp']) - 1).fillna(0)
fund['invest'] = (((fund['ppegt'] - lag['ppegt'])
                   + (fund['invt'] - lag['invt'])) / lag['at'])
fund['invest'] = fund['invest'].where(fund['invest'].notna(),
                                      ((fund['ppent'] - lag['ppent'])
                                       + (fund['invt'] - lag['invt'])) / lag['at'])
fund['lgr'] = (fund['lt'] / lag['lt'])
fund['pchdepr'] = ((fund['dp'] / fund['ppent']) / (lag['dp'] / lag['ppent']))
fund['pchgm_pchsale'] = (((fund['sale'] - fund['cogs']) / (lag['sale'] - lag['cogs']))
                         - (fund['sale'] / lag['sale']))
fund['pchquick'] = (((fund['act'] - fund['invt']) / fund['lct'])
                    / ((lag['act'] - lag['invt']) / lag['lct']))
fund['pchsale_pchinvt'] = ((fund['sale'] / lag['sale']) - (fund['invt'] / lag['invt']))
fund['pchsale_pchrect'] = ((fund['sale'] / lag['sale']) - (fund['rect'] / lag['rect']))
fund['pchsale_pchxsga'] = ((fund['sale'] / lag['sale']) - (fund['xsga'] / lag['xsga']))
fund['pchsaleinv'] = ((fund['sale'] / fund['invt']) / (lag['sale'] / lag['invt']))
fund['sgr'] = (fund['sale'] / lag['sale'])

fund['chato'] = ((fund['sale'] / fund['avg_at']) - (lag['sale'] / lag['avg_at']))
fund['chpm'] = (fund['ib'] / fund['sale']) - (lag['ib'] / lag['sale'])
fund['pchcapx'] = fund['capx'] / lag['capx']

# compute signals with alternative definitions: acc absacc cfp
fund['_acc'] = (((fund['act'] - lag['act']) - (fund['che'] - lag['che']))
                - ((fund['lct'] - lag['lct']) - (fund['dlc'] - lag['dlc'])
                   - (fund['txp'] - lag['txp']) - fund['dp']))
fund['cfp'] = ((fund['ib'] - (((fund['act'] - lag['act']) - (fund['che'] - lag['che']))
                              - ((fund['lct'] - lag['lct'])
                                 - (fund['dlc'] - lag['dlc'])
                                 - (fund['txp'] - lag['txp'])
                                 - fund['dp']))) / fund['mve_f'])
g = fund['oancf'].notnull()
fund.loc[g, 'cfp'] = fund.loc[g, 'oancf'] / fund.loc[g, 'mve_f']
fund.loc[g, '_acc'] = fund.loc[g, 'ib'] - fund.loc[g, 'oancf']
fund['acc'] = fund['_acc'] / fund['avg_at']
fund['absacc'] = abs(fund['_acc']) / fund['avg_at']
fund['pctacc'] = fund['_acc'] / abs(fund['ib'])
h = (fund['ib'].abs() <= 0.01)
fund.loc[h, 'pctacc'] = fund.loc[h, '_acc'] / 0.01

# industry-adjusted
cols = {'bm_ia': 'bm', 'cfp_ia': 'cfp', 'chatoia': 'chato',
        'chpmia': 'chpm', 'pchcapx_ia': 'pchcapx',
        'chempia': 'hire', 'mve_ia': 'mve_f'}
group = fund.groupby(['sic2', 'fyear'])
for k,v in cols.items():
    fund[k] = fund[v] - group[v].transform('mean')

for label in columns:
    signals.write(fund, label, overwrite=True)

rebalbeg, rebalend = 19700101, LAST_DATE
benchnames = ['Mkt-RF(mo)'] #['Mom']  #['ST_Rev(mo)']   #
for label in tqdm(columns):
    holdings = univariate_sorts(monthly,
                                label,
                                SignalsFrame(signals.read(label)),
                                rebalbeg,
                                rebalend,
                                window=12,
                                months=[6],
                                maxdecile=8,
                                pct=(10., 90.),
                                leverage=leverage.get(label, 1))
    excess = backtest_pipeline(backtest,
                               monthly,
                               holdings,
                               label,
                               benchnames,
                               overlap=0,
                               outdir=outdir,
                               suffix=(leverage.get(label, 1) < 0)*'(-)')

 44%|████▍     | 20/45 [29:40<39:56, 95.88s/it] /home/terence/Dropbox/github/data-science-notebooks/finds/backtesting/backtest.py:310: RuntimeWarning: More than 20 figures have been opened. Figures created through the pyplot interface (`matplotlib.pyplot.figure`) are retained until explicitly closed and may consume too much memory. (To control this warning, see the rcParam `figure.max_open_warning`). Consider using `matplotlib.pyplot.close()`.
  fig, (ax1, ax2) = plt.subplots(2, 1, sharex=True, clear=True,
100%|██████████| 45/45 [1:09:08<00:00, 92.19s/it]

_images/794cb8a4544ce3bfb6a676d8e6c169d9e20228e54916e7281910a0cd3771489c.png

_images/e1a24a36fea3104f13b92834950f5131159ce896d7c0539afb34125a82366128.png

_images/14c282c826a5d2a69deefc024576328977e3c246ad839483c452df05b2ad52d3.png

_images/18ca627b6c3d11d1229219b64f1fbaa5a154aec9994fe65ac8c6fb6f9bb3871b.png

_images/f70c48f066076aca109ef9bfa9d2ce037361ebd38f28e0d6cc3c298484172d4a.png

_images/d5c0a8ea055b5eb226abddd31dbff5a16d3b594b46daaa0af850226e4ff3ac94.png

_images/17a0609da2b61b5994ecc581e66005898cbaacb007a46ea0920157e4eae59eed.png

_images/546d1fa8fa904ec51b3c34ca39527cf7f9a4fea7c9dc860b4396a6bb885fe673.png

_images/93919e504ca9c9a01126c851788d4ef11d307c94e1e2c7df0e5e1d6ed0504465.png

_images/43d876eed86f09e369fa863c14caa2ba5140041324365c28c98b95fa76a30cbd.png

_images/ffe866fa6712a1863c212e471b76d525980287ba0b74f573f46c8a714a4a7c42.png

_images/5282e107dae45a9f04f313e69d22cac5ccf47ae8aeffe7fe4521b6c28e326523.png

_images/c6d882d796a088aba1a5d144853aef997ad8ab707164faefe92f3e545bab195d.png

_images/de406809bcbc68fdbe70d6eaaa81200301467fde6777f25384fa583ee02f0f0a.png

_images/4e53f18118d07316f33ea767b8b4c16888a72b6f66239ff762a4e76e86c6e9e5.png

_images/c1277747d19446a42d5ed4ecaea21087fbf4d0f09688dc53ec090ab335a28f2e.png

_images/7d68cff93c951105aaac7756a78255aff5ae071e3a7172b52226b66137252608.png

_images/7932dfb9bee40434697ab65879929728381006df8042d2f3ca9cd03b8da6e984.png

_images/6bbfd9766043054c861cd12f51f2c7a5c154dc76fd5fd28d00742e2f7024da88.png

_images/ad01880a780bb6cf5751a2e17da28d90b7d407b3f89f3936d215f3f3c695e651.png

_images/0ee8b8487538fbd4304a6404d23a92b1a8fdf08068caf893731f358676e5ccca.png

_images/61346c0537fd699123c5cedbc5270fc50fb4de75dac4477f93213c93a787e284.png

_images/a4e859c3dbbfa9b3b1c2a1d3415c4d8481eae2a802a4f78712c5b84f843d1fff.png

_images/f05a17045d595721804140dfa2e6289f3eceb75d4e2e1fd8e86e3ba34bd149a0.png

_images/d07b2397bd78cc3c084f9d0d38886103f6a8eec3ff7c79238d042e46ef9fbd90.png

_images/9a365222ca10561844d7acf8185d45899332d612afbd8a43b1e322063ec44356.png

_images/5d0ae404d6dc129d9eaf9a6bd370825b3ebc6f4681451cdeb29e65e1b6c52def.png

_images/b8ff51acdf74ba9a7083d4cdd83fe9ac11eaeb10b0230a3088d4294f520b5538.png

_images/75f301ec6c8571c25d98becb0a7166149cb6cb32b2bddc38753620218892acc7.png

_images/0ff08dc3c9a2f1c49bb93f842bd4834be684fcf6ed407ade03e75b494e21d0c0.png

_images/df348a1a3856a5a1127288b756eb75f2f1735c85e87e6b71750d24b8bf76cadf.png

_images/f71566649a9d00ae8f405088423613eac36a27f3199f7c5d06c4e3fb78850285.png

_images/89743a8c51946f728d6ea01a3fba8352852b22a24a5cbcff9c0c47df011b9a3e.png

_images/0572457fabe4ebcea1a790a7afab3edc86acb3b8ddbc2bd63eec2d134f5e8656.png

_images/8a57b6cf9f4b4d75bdd72ea065f4e546606b53f9a2ac6afc2f7247e31b746c5d.png

_images/90ecf240d0b95409da9b5c2121a089b84cec544e4e81e5f167c967621365c54b.png

_images/4e83f63f3f0bfa2d5a41135fae0dbebfa751b30002d92ae63c6eea541d0aa734.png

_images/113a73a9cfc9726cdebb03df2b94166065d32445e126ba538df8ebd118a632b5.png

_images/cc93147c2e860afd1b201d7282b2c0e71379763b3ebcee1e55709d3e168a9719.png

_images/c3494b34abb4a2114237739cd5776887b9175759747b4b75e2ce35ac0bf5d86b.png

_images/3557e390e26b5dd77f42365ef4a24ba0f33a26d91538fe38d9fbeb259a583112.png

_images/2d6126d179c2dc89d33750d39cad72fb3f8b85763b902b7dec1016a1784d9413.png

_images/df518679f17eaaf444e7473cda87fd3214b5a30626d0263d907fcdf84aabe21c.png

_images/12f724bd4ca6bf251b1ca05423137a422bbf2ce36379b4a04de5a272bd62daaa.png

_images/f663e36a3883444521e67c5762eb5973e8ffd473f48eee01fae52dc89cfb9fca.png

Fundamental signals from Compustat Quarterly

columns = ['stdacc', 'stdcf', 'roavol', 'sgrvol', 'cinvest', 'chtx',
           'rsup', 'roaq', 'cash', 'nincr']
numlag = 4       # require 4 month lag of fiscal data
end = LAST_DATE

# retrieve quarterly, keep [permno, datadate] key with non null prccq
fields = ['ibq', 'actq', 'cheq', 'lctq', 'dlcq', 'saleq', 'prccq',
          'cshoq', 'atq', 'txtq', 'ppentq']
df = pstat.get_linked(dataset='quarterly',
                      fields=fields,
                      date_field='datadate',
                      where=(f"datadate > 0 "
                             f"and datadate <= {end//100}31"))
fund = df.sort_values(['permno','datadate', 'ibq'])\
         .drop_duplicates(['permno', 'datadate'])\
         .dropna(subset=['ibq'])
fund.index = list(zip(fund['permno'], fund['datadate']))
rebaldate = bd.endmo(fund.datadate, numlag)

# compute current and lagged: scf sacc roaq nincr cinvest cash rsup chtx
lag = fund.shift(1, fill_value=0)
lag.loc[lag['permno'] != fund['permno'], fields] = np.nan
fund['_saleq'] = fund['saleq']
fund.loc[fund['_saleq'].lt(0.01), '_saleq'] = 0.01

fund['sacc'] = (((fund['actq'] - lag['actq']) - (fund['cheq'] - lag['cheq']))
                - ((fund['lctq'] - lag['lctq'])
                   - (fund['dlcq'] - lag['dlcq']))) / fund['_saleq']
fund['cinvest'] = (fund['ppentq'] - lag['ppentq']) / fund['_saleq']
fund['nincr'] = (fund['ibq'] > lag['ibq']).astype(int)
fund['scf']  = (fund['ibq'] / fund['_saleq']) - fund['sacc']
fund['roaq'] = (fund['ibq'] / lag['atq'])
fund['cash'] = (fund['cheq'] / fund['atq'])

lag4 = fund.shift(4, fill_value=0)
lag4.loc[lag4['permno'] != fund['permno'], fields] = np.nan
fund['rsup'] = ((fund['saleq'] - lag4['saleq'])
                / (fund['prccq'].abs() * fund['cshoq'].abs()))
fund['chtx'] = (fund['txtq'] - lag4['txtq']) / lag4['atq']

# for each var: make dataframe of 15 lags (column names=[0,...,15])
lags = {col : as_lags(fund, var=col, key='permno', nlags=16)
        for col in ['sacc', 'scf', 'roaq', 'rsup', 'cinvest', 'nincr']}
for i in range(1, 16):                      # lags[ninrc][i]=1 iff ibq
    lags['nincr'][i] *= lags['nincr'][i-1]  # increasing all prior qtrs

    # compute signals from the 15 lags
    fund['rebaldate'] = rebaldate
    fund['stdacc'] = lags['sacc'].std(axis=1, skipna=False)
    fund['stdcf'] = lags['scf'].std(axis=1, skipna=False)
    fund['roavol'] = lags['roaq'].std(axis=1, skipna=False)
    fund['sgrvol'] = lags['rsup'].std(axis=1, skipna=False)
    fund['cinvest'] = (fund['cinvest'] -
                       lags['cinvest'][[1, 2, 3, 4]].mean(axis=1, skipna=False))
    
    # count number of consecutive increasing quarters
    fund['nincr'] = lags['nincr'][np.arange(8)].sum(axis=1)

for label in columns:
    signals.write(fund, label, overwrite=True)

rebalbeg, rebalend = 19700101, LAST_DATE
benchnames = ['Mkt-RF(mo)']
for label in tqdm(columns):
    holdings = univariate_sorts(monthly,
                                label,
                                SignalsFrame(signals.read(label)),
                                rebalbeg,
                                rebalend,
                                window=3,
                                months=[],
                                maxdecile=8,
                                pct=(10., 90.),
                                leverage=leverage.get(label, 1))
    excess = backtest_pipeline(backtest,
                               monthly,
                               holdings,
                               label,
                               benchnames,
                               overlap=0,
                               outdir=outdir,
                               suffix='(-)'*(leverage.get(label, 1) < 0))

100%|██████████| 10/10 [11:53<00:00, 71.38s/it]

_images/5468bfd94ea93a4ad924754fc8ec3a794169d57834742444686b6b5566699a52.png

_images/670de384250a0b3efac54dfbd1dcdf1f43e18f2ba9c90ec2fddd5079ce565c80.png

_images/cb232fa5e0c6c3e902650095e14fccabb6d8cb9a7ee859dc0dacf3a986adbb34.png

_images/4ce2854fc980ffc0750c1d9a95d350787e5681b3b1f678a8e8afd06b00762869.png

_images/a102d7b0c700880d3f477edd766540c3514be477dc3ffcb19ce130abdd1cea83.png

_images/e9553fad78aa848eb0f3cad09e7e7c218df1d3f597b0e3aff1ec3e1ec125fb24.png

_images/8131e8f80a062536879bd17f43c65ef46a1024492fd46c9bdeec661e09435e99.png

_images/8447adfad8ef451e158be9cc2ae248d5d2955a9ccc9921321baa7e868b7460a3.png

_images/97344d07bd1e95deee826c71d43947b2f3fc23f59b07ff91a6708c4c06fccc8e.png

_images/023f350e65cb6171e88825ea4054629624be699752dbf4608a69bacb8b8bb738.png

Earnings Estimates#

Earnings estimate signals are drawn from IBES data, including fiscal year 1 projections, long-term growth forecasts, and announcement dates linked to CRSP daily prices and Compustat quarterly.

columns = ['chfeps', 'chnanalyst', 'disp']

df = ibes.get_linked(dataset='statsum',
                     fields=['fpedats', 'meanest', 'medest',
                             'stdev', 'numest'],
                     date_field = 'statpers',
                     where=("meanest IS NOT NULL "
                            "  AND fpedats IS NOT NULL "
                            "  AND statpers IS NOT NULL"
                            "  AND fpi = '1'"))
out = df.sort_values(['permno', 'statpers', 'fpedats', 'meanest'])\
        .drop_duplicates(['permno', 'statpers', 'fpedats'])
out['rebaldate'] = bd.endmo(out['statpers'])

out['disp'] = out['stdev'] / abs(out['meanest'])
out.loc[abs(out['meanest']) < 0, 'disp'] = out['stdev'] / 0.01

lag1 = out.shift(1, fill_value=0)
f1 = (lag1['permno'] == out['permno'])        
out.loc[f1, 'chfeps'] = out.loc[f1, 'meanest'] - lag1.loc[f1, 'meanest']

lag3 = out.shift(3, fill_value=0)
f3 = (lag3['permno'] == out['permno'])
out.loc[f3, 'chnanalyst'] = out.loc[f3, 'numest']-lag3.loc[f3, 'numest']

for label in columns:
    signals.write(out, label, overwrite=True)

rebalbeg, rebalend = 19760101, LAST_DATE
benchnames = ['Mkt-RF(mo)']
for label in tqdm(columns):
    holdings = univariate_sorts(monthly,
                                label,
                                SignalsFrame(signals.read(label)),
                                rebalbeg,
                                rebalend,
                                window=3,
                                months=[],
                                maxdecile=8,
                                pct=(10., 90.),
                                leverage=leverage.get(label, 1))
    excess = backtest_pipeline(backtest,
                               monthly,
                               holdings,
                               label,
                               benchnames,
                               overlap=0,
                               outdir=outdir,
                               suffix=(leverage.get(label, 1) < 0)*'(-)')

100%|██████████| 3/3 [03:39<00:00, 73.25s/it]

_images/71f8cbf860911b691a51cd1c398451089fa53223d07908bc489a15ac2e5e7d29.png

_images/9853623f04ced746515416a771c29baecb43f096c92eaaec5fa2ffd4474263a9.png

_images/51dd81ac127e3e49a36447bfe50ad774919b215336997b92f2f6be6ced71935f.png

IBES Long-term Growth signals

columns = ['fgr5yr']

df = ibes.get_linked(dataset='statsum',
                     fields = ['meanest'],
                     date_field = 'statpers',
                     where=("meanest IS NOT NULL "
                            "  AND fpi = '0'"
                            "  AND statpers IS NOT NULL"))
out = df.sort_values(['permno', 'statpers', 'meanest'])\
        .drop_duplicates(['permno', 'statpers'])\
        .dropna()
out['rebaldate'] = bd.endmo(out['statpers'])
out['fgr5yr'] = out['meanest']
signals.write(out, 'fgr5yr', overwrite=True)

rebalbeg, rebalend = 19760101, LAST_DATE
benchnames = ['Mkt-RF(mo)']
for label in tqdm(columns):
    holdings = univariate_sorts(monthly,
                                label,
                                SignalsFrame(signals.read(label)),
                                rebalbeg,
                                rebalend,
                                window=3,
                                months=[],
                                maxdecile=8,
                                pct=(10., 90.),
                                leverage=leverage.get(label, 1))
    excess = backtest_pipeline(backtest,
                               monthly,
                               holdings,
                               label,
                               benchnames,
                               overlap=0,
                               outdir=outdir,
                               suffix=(leverage.get(label, 1) < 0)*'(-)')

100%|██████████| 1/1 [01:06<00:00, 66.27s/it]

_images/50fd4183d154b98beab6ed647c4eb5a92f23950669e8fc51cdb59d981739f153.png

Announcement date in Quarterly, linked to CRSP daily

columns = ['ear', 'aeavol']

# retrieve rdq, and set rebalance date to at least one month delay
df = pstat.get_linked(dataset='quarterly',
                      fields=['rdq'],
                      date_field='datadate',
                      where=('rdq > 0'))
fund = df.sort_values(['permno', 'rdq', 'datadate'])\
         .drop_duplicates(['permno', 'rdq'])\
         .dropna()
fund['rebaldate'] = bd.offset(fund['rdq'], 2)

# ear is compounded return around 3-day window
out = crsp.get_window(dataset='daily',
                      field='ret',
                      date_field='date',
                      permnos=fund['permno'],
                      dates=fund['rdq'],
                      left=-1,
                      right=1)
fund['ear'] = (1 + out).prod(axis = 1).values

# aeavol is avg volume in 3-day window over 20-day average ten-days prior
actual = crsp.get_window(dataset='daily',
                         field='vol',
                         date_field='date',
                         permnos=fund['permno'],
                         dates=fund['rdq'],
                         left=-1,
                         right=1)
normal = crsp.get_window(dataset='daily',
                         field='vol',
                         date_field='date',
                         permnos=fund['permno'],
                         dates=fund['rdq'],
                         left=-30,
                         right=-11,
                         avg=True)
fund['aeavol'] = normal['vol'].values

signals.write(fund, 'ear', overwrite=True)
signals.write(fund, 'aeavol', overwrite=True)

rebalbeg, rebalend = 19700101, LAST_DATE
benchnames = ['Mkt-RF(mo)']
for label in tqdm(columns):
    holdings = univariate_sorts(monthly,
                                label,
                                SignalsFrame(signals.read(label)),
                                rebalbeg,
                                rebalend,
                                window=3,
                                months=[],
                                maxdecile=8,
                                pct=(10., 90.),
                                leverage=leverage.get(label, 1))
    excess = backtest_pipeline(backtest,
                               monthly,
                               holdings,
                               label,
                               benchnames,
                               overlap=0,
                               outdir=outdir,
                               suffix=(leverage.get(label, 1) < 0)*'(-)')

100%|██████████| 2/2 [02:07<00:00, 63.88s/it]

_images/30206dda368fdc8a5e0942f32c5ac6b80f3f4b2b88ad143c453acc2c175071ce.png

_images/02d109701773dfbfda6efe337e3559d6b1b8ef05bfa984cc64ff781cd24710c4.png

IBES Fiscal Year 1 linked to Quarterly PSTAT

beg, end = 19760101, LAST_DATE
monthnum = lambda d: ((d//10000)-1900)*12 + ((d//100)%100) - 1

df = pstat.get_linked(dataset='quarterly',
                      fields=['prccq'],
                      date_field='datadate')
df = df.dropna()\
       .sort_values(['permno', 'datadate'])\
       .drop_duplicates(['permno', 'datadate'])

out = ibes.get_linked(dataset='statsum',
                      fields=['fpedats', 'meanest'],
                      date_field='statpers',
                      where="fpi='1'")
out = out.dropna()\
         .sort_values(['permno', 'statpers', 'fpedats'])\
         .drop_duplicates(['permno', 'statpers'])
out['monthnum'] = monthnum(out['statpers'])
out = out.set_index(['permno', 'monthnum'], drop=False)
out['sfeq'] = np.nan

for num in range(4): # match ibes statpers to any datadate in last 4 mos
    df['monthnum'] = monthnum(df['datadate']) - num
    df = df.set_index(['permno', 'monthnum'], drop=False)
    out = out.join(df[['prccq']], how='left')
    out['sfeq'] = out['sfeq'].where(out['sfeq'].notna(),
                                    out['meanest'] / out['prccq'].abs())
    out = out.drop(columns=['prccq'])

    out['rebaldate'] = bd.endmo(out['statpers'])
    n = signals.write(out.reset_index(drop=True), 'sfeq', overwrite=True)

IBES Fiscal Year 1 linked to IBES price history

beg, end = 19760101, LAST_DATE

# retrieve monthly price history
df = ibes.get_linked(dataset='actpsum',
                     fields=['price'],
                     date_field='statpers')
hist = df.dropna()\
         .sort_values(['permno', 'statpers'])\
         .drop_duplicates(['permno', 'statpers'], keep='last')\
         .set_index(['permno', 'statpers'])

# retrieve monthly FY1 mean estimate
df = ibes.get_linked(dataset='statsum',
                     fields=['fpedats', 'meanest'],
                     date_field='statpers',
                     where="fpi='1' AND statpers <= fpedats")
out = df.dropna()\
        .sort_values(['permno', 'statpers', 'fpedats'])\
        .drop_duplicates(['permno', 'statpers'])\
        .set_index(['permno', 'statpers'])

# join on [permno, statpers], and reindex on [permno, rebaldate]
out = out.join(hist[['price']], how='left').reset_index()
out['rebaldate'] = bd.endmo(out['statpers'])
out = out.set_index(['permno', 'rebaldate'])
out['sfe'] = out['meanest'].div(out['price'].abs())
n = signals.write(out.reset_index(), 'sfe', overwrite=True)

rebalbeg, rebalend = 19760101, LAST_DATE
benchnames = ['Mkt-RF(mo)']
label = 'sfe'
holdings = univariate_sorts(monthly,
                            label,
                            SignalsFrame(signals.read(label)),
                            rebalbeg,
                            rebalend,
                            window=3,
                            months=[],
                            maxdecile=8,
                            pct=(10., 90.),
                            leverage=leverage.get(label, 1))
excess = backtest_pipeline(backtest,
                           monthly,
                           holdings,
                           label,
                           benchnames,
                           overlap=0,
                           outdir=outdir, 
                           suffix=(leverage.get(label, 1) < 0)*'(-)')

_images/69dc96f57e12324ec546ff9a1255a962eaea476b74f73286f8caaee1ef20c45c.png

IBES Fiscal Quarter 1, linked to Quarterly

columns = ['sue']
numlag = 4
end = LAST_DATE

# retrieve quarterly, keep [permno, datadate] key with non null prccq
df = pstat.get_linked(dataset='quarterly',
                      fields=['prccq', 'cshoq', 'ibq'],
                      date_field='datadate',
                      where=f"datadate <= {end//100}31")
fund = df.dropna(subset=['ibq'])\
         .sort_values(['permno', 'datadate', 'cshoq'])\
         .drop_duplicates(['permno', 'datadate'])
fund['rebaldate'] = bd.endmo(fund['datadate'], numlag)
fund = fund.set_index(['permno', 'rebaldate'], drop=False)

# retrieve ibes Q1 where forecast period <= fiscal date, keep latest
df = ibes.get_linked(dataset='statsum',
                     fields=['fpedats', 'medest', 'actual'],
                     date_field='statpers',
                     where=" fpi = '6' AND statpers <= fpedats")
summ = df.dropna()\
         .sort_values(['permno', 'fpedats', 'statpers'])\
         .drop_duplicates(['permno', 'fpedats'], keep='last')
summ['rebaldate'] = bd.endmo(summ['fpedats'], numlag)
summ = summ.set_index(['permno', 'statpers'])

# retrieve ibes price, then left join
df = ibes.get_linked(dataset='actpsum',
                     fields=['price'],
                     date_field='statpers')
hist = df.dropna()\
         .sort_values(['permno', 'statpers'])\
         .drop_duplicates(['permno', 'statpers'], keep='last')
hist = hist.set_index(['permno', 'statpers'])
summ = summ.join(hist[['price']], how='left')
summ = summ.reset_index()\
           .set_index(['permno', 'rebaldate'])\
           .reindex(fund.index)

# sue with ibes surprise and price
fund['sue'] = (summ['actual'] - summ['medest']) / summ['price'].abs()

# sue with ibes surprice and compustat quarterly price
fund['sue'] = fund['sue'].where(
    fund['sue'].notna(), (summ['actual'] - summ['medest']) / fund['prccq'].abs())

# sue with lag(4) difference in compustat quarterly and price
lag = fund.shift(4, fill_value=0)
fund['sue'] = fund['sue'].where(
    fund['sue'].notna() | (lag['permno'] != fund['permno']),
    ((fund['ibq'] - lag['ibq']) / (fund['prccq'] * fund['cshoq']).abs()))

signals.write(fund.reset_index(drop=True), 'sue', overwrite=True)

rebalbeg, rebalend = 19760101, LAST_DATE
benchnames = ['Mkt-RF(mo)']
for label in tqdm(columns):
    holdings = univariate_sorts(monthly,
                                label,
                                SignalsFrame(signals.read(label)),
                                rebalbeg,
                                rebalend,
                                window=3,
                                months=[],
                                maxdecile=8,
                                pct=(10., 90.),
                                leverage=leverage.get(label, 1))
    excess = backtest_pipeline(backtest,
                               monthly,
                               holdings,
                               label,
                               benchnames,
                               overlap=0,
                               outdir=outdir,
                               suffix=(leverage.get(label, 1) < 0)*'(-)')

100%|██████████| 1/1 [01:14<00:00, 74.37s/it]

_images/917ae314241f7a788e9998e0219b1fe51b86c6173efa139540b36f71757ac701.png

Backtests#

All backtests are conducted on univariate, dollar-neutral spreads between the top and bottom deciles of a given characteristic. Stocks are value-weighted within each decile, excluding the smallest quintile of NYSE-listed firms by market capitalization. Spread portfolios are rebalanced monthly, with fundamental data lagged in the standard manner—six months for annual data and four months for quarterly data.

Each backtest generates a time series of cumulative spread portfolio and market index returns, along with metrics such as monthly turnover rates and the number of long and short positions. Results are ranked based on Welch’s t-statistic, which tests for differences in mean returns before 2002 and after 2003. This aligns with findings from Green et al. (2017) and others, who observed a significant decline in the predictive power of many factors after this period.

Additionally, we compute the maximum drawdown for each strategy, defined as the largest peak-to-trough loss in cumulative returns before a new peak is reached. This measure provides insight into the worst-case historical performance of each strategy.

def maximum_drawdown(x: Series, is_price_level: bool = False) -> Series:
    """Compute max drawdown (max loss from previous high) period and returns"""
    cumsum = np.log(1 + x).cumsum()
    cummax = cumsum.cummax()
    end = (cummax - cumsum).idxmax()
    beg = cumsum[cumsum.index <= end].idxmax()
    dd = cumsum.loc[[beg, end]]
    return np.exp(dd)

zoo = backtest.read().sort_values(['begret', 'permno'])
r = []
rets = []
for label in zoo.index:
    perf = backtest.read(label)
    rets.append(perf['ret'].rename(label))
    excess = {'ret': backtest.fit(['Mkt-RF(mo)'])}
    excess['annualized'] = backtest.annualized
    excess['dd'] = maximum_drawdown(backtest.perf['excess'])
    post = {'ret': backtest.fit(['Mkt-RF(mo)'],
                                beg=20020101).copy()}
    post['annualized'] = backtest.annualized.copy()
    s = label + ('(-)' if leverage.get(label, 1) < 0 else '')
    r.append(DataFrame({
        #            'Start': excess['ret'].index[0],
        'Sharpe Ratio': excess['annualized']['sharpe'],
        'Alpha': excess['annualized']['alpha'],
        'Appraisal Ratio': excess['annualized']['appraisal'],
        'Avg Ret': excess['ret']['excess'].mean(),
        'Vol': excess['ret']['excess'].std(ddof=0),
        'Welch-t': excess['annualized']['welch-t'],
        'Appraisal2002': post['annualized']['appraisal'],
        'Ret2002': post['ret']['excess'].mean(),
        'Drawdown': (excess['dd'].iloc[1]/excess['dd'].iloc[0]) - 1,
    }, index=[s]))
df = pd.concat(r, axis=0).round(4).sort_values('Welch-t')

pd.set_option("display.max_colwidth", None, 'display.max_rows', None)
df #.sort_values('Sharpe Ratio', ascending=False)

	Sharpe Ratio	Alpha	Appraisal Ratio	Avg Ret	Vol	Welch-t	Appraisal2002	Ret2002	Drawdown
pchsale_pchrect	-0.1169	-0.0121	-0.1368	-0.0009	0.0256	-3.1904	-0.6200	-0.0047	-0.8591
mom1m(-)	0.3820	0.0568	0.3132	0.0061	0.0543	-2.7253	-0.3205	-0.0017	-0.7606
bm	0.2233	0.0339	0.2141	0.0029	0.0457	-2.3167	-0.3007	-0.0019	-0.7651
bm_ia	0.2413	0.0216	0.1611	0.0027	0.0393	-2.2583	-0.2118	-0.0012	-0.7011
agr(-)	0.2493	0.0480	0.4320	0.0025	0.0341	-2.1174	0.0200	-0.0008	-0.5362
pchsale_pchinvt	0.2434	0.0229	0.2392	0.0019	0.0276	-2.1010	-0.1343	-0.0008	-0.4944
chmom	0.4007	0.0509	0.3114	0.0058	0.0494	-1.9158	-0.0634	0.0010	-0.6159
pchsaleinv	0.2446	0.0197	0.2238	0.0018	0.0254	-1.7332	-0.1297	-0.0003	-0.4843
lev	0.0261	-0.0002	-0.0012	0.0004	0.0488	-1.6920	-0.3338	-0.0035	-0.8698
absacc(-)	0.0380	0.0212	0.1520	0.0005	0.0414	-1.6578	-0.1800	-0.0027	-0.7840
ep	0.2831	0.0639	0.4231	0.0037	0.0452	-1.4698	0.2254	0.0007	-0.6637
chcsho(-)	0.5415	0.0725	0.7490	0.0047	0.0298	-1.4605	0.5437	0.0027	-0.3253
invest(-)	0.2479	0.0450	0.3776	0.0025	0.0356	-1.3936	0.1366	0.0001	-0.4268
acc(-)	0.3358	0.0377	0.3500	0.0030	0.0312	-1.3486	0.0828	0.0011	-0.3966
indmom	0.3518	0.0643	0.3708	0.0053	0.0512	-1.3305	0.2007	0.0020	-0.6892
sp	0.2303	0.0329	0.2220	0.0028	0.0428	-1.2746	-0.0372	0.0004	-0.5726
pchdepr	0.0617	0.0048	0.0506	0.0005	0.0273	-1.2651	-0.1436	-0.0011	-0.6979
egr(-)	0.3157	0.0539	0.5072	0.0030	0.0327	-1.2025	0.3026	0.0013	-0.3384
sgrvol	0.1918	0.0082	0.0562	0.0024	0.0440	-1.1672	-0.2804	0.0001	-0.6935
mom12m	0.4699	0.1532	0.6550	0.0101	0.0732	-1.1488	0.5282	0.0058	-0.9670
realestate	0.1084	0.0508	0.3445	0.0014	0.0458	-1.1485	0.2426	-0.0007	-0.6426
mom36m(-)	0.1778	0.0191	0.0953	0.0031	0.0601	-1.0582	-0.0673	-0.0000	-0.7280
cashpr(-)	0.2962	0.0475	0.3843	0.0031	0.0363	-1.0456	0.0969	0.0014	-0.4235
divyld	-0.0470	0.0377	0.2455	-0.0008	0.0552	-1.0071	-0.1090	-0.0032	-0.9650
aeavol(-)	0.2160	0.0381	0.3568	0.0020	0.0320	-0.9899	0.2019	0.0006	-0.5583
ill(-)	0.3323	0.0504	0.4209	0.0034	0.0350	-0.8358	0.4068	0.0022	-0.6150
dy	-0.0473	0.0382	0.2262	-0.0008	0.0569	-0.7118	-0.0399	-0.0025	-0.9190
chatoia	0.2567	0.0292	0.2994	0.0021	0.0282	-0.6699	0.1768	0.0012	-0.5082
chinv(-)	0.3601	0.0506	0.4775	0.0033	0.0314	-0.6658	0.2930	0.0023	-0.2747
retvol(-)	0.1306	0.1269	0.6067	0.0029	0.0772	-0.6457	0.5100	0.0009	-0.7990
maxret(-)	0.1807	0.1144	0.6383	0.0034	0.0646	-0.6188	0.5778	0.0017	-0.7173
pricedelay(-)	0.0819	0.0030	0.0302	0.0007	0.0292	-0.5810	-0.1441	-0.0002	-0.6155
chfeps	0.4741	0.0673	0.5567	0.0048	0.0353	-0.5669	0.5971	0.0040	-0.2969
stdcf(-)	0.2132	0.0613	0.5020	0.0024	0.0391	-0.5562	0.4330	0.0015	-0.4787
idiovol(-)	-0.0126	0.0514	0.2623	-0.0003	0.0697	-0.4946	0.3285	-0.0021	-0.9875
mve_ia(-)	0.1372	0.0025	0.0244	0.0012	0.0302	-0.4777	-0.0749	0.0005	-0.5605
chpmia	0.1598	0.0266	0.2003	0.0018	0.0385	-0.4676	0.1088	0.0009	-0.6303
disp(-)	0.3267	0.1025	0.6640	0.0048	0.0508	-0.3850	0.7927	0.0039	-0.5300
cfp	0.2466	0.0511	0.3691	0.0029	0.0412	-0.3045	0.2216	0.0024	-0.5708
stdacc(-)	0.1414	0.0431	0.3801	0.0014	0.0354	-0.2576	0.3357	0.0011	-0.5056
pchsale_pchxsga	-0.0415	-0.0053	-0.0480	-0.0004	0.0320	-0.2270	-0.1208	-0.0007	-0.7740
grltnoa	0.1595	0.0208	0.1793	0.0015	0.0335	-0.2164	0.1620	0.0012	-0.5122
sfe	0.2535	0.0763	0.4340	0.0039	0.0532	-0.1952	0.3435	0.0034	-0.7417
pchgm_pchsale	0.0106	-0.0035	-0.0337	0.0001	0.0297	-0.1935	0.0227	-0.0002	-0.5895
cfp_ia	-0.1814	-0.0331	-0.2501	-0.0020	0.0386	-0.1337	-0.3944	-0.0022	-0.8613
mom6m	0.3185	0.1085	0.5001	0.0064	0.0682	-0.1285	0.5223	0.0060	-0.9885
beta	0.0807	-0.0733	-0.3806	0.0022	0.0906	-0.1225	-0.4818	0.0016	-0.9586
dolvol	0.1131	0.0017	0.0161	0.0010	0.0315	-0.1030	0.1331	0.0009	-0.7624
nincr	0.3410	0.0242	0.2728	0.0026	0.0259	-0.0929	0.3654	0.0024	-0.4061
cinvest(-)	-0.4026	-0.0447	-0.3815	-0.0039	0.0339	0.0387	-0.3699	-0.0039	-0.9587
cash	0.2751	0.0143	0.1021	0.0035	0.0438	0.0479	0.0796	0.0036	-0.7076
chtx	0.2162	0.0211	0.1716	0.0022	0.0358	0.1783	0.3262	0.0025	-0.4779
salecash	0.0170	0.0199	0.1773	0.0002	0.0342	0.2132	0.2326	0.0005	-0.7463
rd_mve	0.1412	0.0114	0.0716	0.0019	0.0463	0.2541	0.0853	0.0024	-0.7868
chnanalyst	0.0875	0.0021	0.0240	0.0006	0.0255	0.2806	0.1425	0.0010	-0.5443
fgr5yr	0.0537	-0.0485	-0.2487	0.0010	0.0640	0.3620	-0.0030	0.0020	-0.8761
sue	0.4709	0.0557	0.5087	0.0043	0.0317	0.3800	0.6849	0.0048	-0.4062
pchcapx_ia	-0.1942	-0.0262	-0.2113	-0.0020	0.0359	0.4287	-0.0762	-0.0013	-0.8750
roavol	0.0768	-0.0063	-0.0425	0.0010	0.0440	0.4640	0.0964	0.0019	-0.7812
baspread	-0.0851	-0.1231	-0.5790	-0.0020	0.0804	0.4727	-0.5203	-0.0005	-0.9539
pchquick	-0.0755	-0.0085	-0.0918	-0.0006	0.0268	0.5811	-0.0879	0.0001	-0.7586
rd_sale	-0.0530	-0.0146	-0.0945	-0.0007	0.0446	0.6114	-0.0313	0.0005	-0.8706
pctacc(-)	0.1174	0.0248	0.2364	0.0011	0.0312	0.6117	0.2568	0.0019	-0.3803
secured(-)	0.0364	0.0337	0.2793	0.0004	0.0376	0.6222	0.3922	0.0014	-0.6406
std_turn	0.0539	-0.0490	-0.2884	0.0009	0.0582	0.6391	-0.2285	0.0024	-0.8064
std_dolvol	-0.0176	0.0007	0.0068	-0.0002	0.0315	0.7879	0.0782	0.0008	-0.5301
rsup	0.1644	0.0237	0.1859	0.0018	0.0369	0.8444	0.4582	0.0032	-0.5353
depr	0.0696	-0.0201	-0.1534	0.0008	0.0422	0.8697	0.0616	0.0024	-0.6522
zerotrade(-)	0.1663	-0.0351	-0.1939	0.0031	0.0644	0.8812	-0.0779	0.0054	-0.8344
sgr	0.0225	-0.0138	-0.1111	0.0002	0.0372	0.9168	0.0682	0.0018	-0.7062
salerec	0.2028	0.0405	0.3172	0.0022	0.0378	0.9547	0.5194	0.0040	-0.6761
roaq	0.2858	0.0610	0.4358	0.0035	0.0422	0.9856	0.7546	0.0054	-0.4532
turn	0.1420	-0.0412	-0.2184	0.0027	0.0669	0.9945	-0.0831	0.0054	-0.8291
ear	0.1909	-0.0019	-0.0185	0.0018	0.0329	1.0453	0.2661	0.0033	-0.5423
quick	0.0603	-0.0241	-0.1670	0.0008	0.0467	1.2014	0.0018	0.0033	-0.7923
gma	0.1117	0.0201	0.1363	0.0014	0.0426	1.2529	0.4330	0.0039	-0.6364
hire	-0.1215	-0.0324	-0.2790	-0.0012	0.0353	1.4328	0.0264	0.0011	-0.8262
tang	0.2328	0.0149	0.1259	0.0024	0.0351	1.4843	0.3043	0.0047	-0.7361
grcapx	-0.2991	-0.0421	-0.4188	-0.0026	0.0298	1.5208	-0.1320	-0.0004	-0.8714
saleinv	0.1401	0.0364	0.3711	0.0013	0.0312	1.9087	0.6251	0.0040	-0.4900
chempia	0.0698	-0.0084	-0.0768	0.0007	0.0331	2.0683	0.3022	0.0038	-0.7319
lgr	-0.0573	-0.0162	-0.1720	-0.0005	0.0279	2.2177	0.2295	0.0024	-0.8286

X = pd.concat(rets, join='outer', axis=1).dropna()
X = X/X.std()   # standardize to unit variance
corr = X.corr()
dist = np.sqrt(1 - corr)

Cluster analysis#

We use historical backtest returns as input features for cluster analysis, enabling the identification of peer benchmarks for different investment strategies. Strategies that exhibit similar return patterns tend to load on the same “style” factors and should be evaluated against one another.

The correlation between two return series is closely related to their Euclidean distance. When returns are standardized to have unit variance, the squared Euclidean norm between two series \(x\) and \(y\) can be expressed as:

\[ d^2 = \sum_i (x_i - y_i)^2 = \sum_i x_i^2 + \sum_i y_i^2 - 2 \sum_i x_i y_i = n + n - 2n\rho \]

since \(\sum_i x_i^2 = \sum_i y_i^2 = n\), and \(\rho = \sum_i x_i y_i / n\), where \( \rho \) represents the correlation between the two series. This relationship allows correlation to be rewritten as:

\[ \rho = 1 - \frac{d^2}{2n} \]

Hiearchical clustering#

Hierarchical clustering builds a tree-like structure by iteratively merging clusters based on their similarity. Different linkage methods determine how clusters are merged:

Single Linkage: Merges clusters based on the closest points, leading to elongated clusters.
Complete Linkage: Uses the farthest points between clusters, producing compact, spherical clusters.
Average Linkage: Merges clusters based on the average distance between all points, balancing between single and complete linkage.
Ward’s Method: Minimizes within-cluster variance, resulting in clusters of similar size and shape.

model = AgglomerativeClustering(metric='precomputed', linkage='average',
                                distance_threshold=0, n_clusters=None).fit(dist)

fig, ax = plt.subplots(figsize=(10, 12))
plt.title("Hierarchical Clustering Dendrogram")
# Create linkage matrix and then plot the dendrogram
# scikit-learn: "Plot Hierarchical Clustering Dendrogram"
counts = np.zeros(model.children_.shape[0])
n_samples = len(model.labels_)
# create the counts of samples under each node
for i, merge in enumerate(model.children_):
    current_count = 0
    for child_idx in merge:
        if child_idx < n_samples:
            current_count += 1  # leaf node
        else:
            current_count += counts[child_idx - n_samples]
    counts[i] = current_count

linkage_matrix = np.column_stack([model.children_, model.distances_, counts])\
                   .astype(float)
# Plot the corresponding dendrogram
dendrogram(linkage_matrix, orientation='left', labels=dist.columns, leaf_font_size=10)
plt.tight_layout()

_images/fbd63dc53bc45a747c5a840a813384eeadd47923d00c07f72500be16457f748a.png

K-means clustering#

K-means is a widely used unsupervised learning algorithm that partitions data into \( K \) non-overlapping clusters. The process involves:

Selecting an initial number of clusters \( K \).
Randomly assigning initial cluster centroids.
Iteratively assigning each data point to the nearest centroid.
Recalculating centroids based on the mean of points in each cluster.
Repeating until convergence.

K-means minimizes the within-cluster sum of squared distances, making it well-suited for datasets where clusters are roughly spherical and of similar size.

The Elbow Method is commonly used to determine the optimal number of clusters by plotting the within-cluster sum of squares for different values of \( K \). The optimal \( K \) is where the graph forms an “elbow,” indicating diminishing returns from adding more clusters.

# Selecting number of centers with elbow method
inertias = []
n_clusters = list(range(2, 32))
for n_cluster in n_clusters:
    kmeans = KMeans(n_clusters=n_cluster, random_state=0, n_init="auto").fit(X.T)
    inertias.append(kmeans.inertia_)
plt.figure(figsize=(10, 6))
plt.plot(n_clusters, inertias, 'bx-')
plt.xlabel('Number of clusters')
plt.ylabel('Inertia')
plt.title('Elbow Method using Inertia')
plt.show()

_images/af37c84b01415cf48f4b37f8985f7951d724caf9523c70cadcbdf407d115be06.png

Silhouette Analysis provides an alternative evaluation method by measuring how well-separated clusters are. The silhouette score compares cohesion (similarity within a cluster) to separation (difference from other clusters), with values ranging from -1 to 1. Higher scores indicate better-defined clusters, and the number of clusters maximizing the average silhouette score is considered optimal.

# scikit-learn: "Selecting the number of clusters with silhouette analysis on KMeans clustering"
scores = {}
fig, ax = plt.subplots(ncols=4, nrows=int(np.round(len(n_clusters)/4)), figsize=(10, 15))
ax = ax.flatten()
for n_cluster, ax1 in zip(n_clusters, ax):
    clusterer = KMeans(n_clusters=n_cluster, n_init='auto', random_state=10)
    cluster_labels = clusterer.fit_predict(X)

    # The silhouette_score gives the average value for all the samples.
    # This gives a perspective into the density and separation of the formed
    # clusters
    silhouette_avg = silhouette_score(X, cluster_labels)
    print("For n_clusters =", n_cluster,
          "The average silhouette_score is :", silhouette_avg)
    scores[n_cluster] = silhouette_avg
    # Compute the silhouette scores for each sample
    sample_silhouette_values = silhouette_samples(X, cluster_labels)

    ax1.set_title(f"{n_cluster} clusters: {silhouette_avg:.4f}")    
    ax1.set_xlim([-0.1, .4])
    ax1.set_ylim([0, len(X) + (n_cluster + 1) * 10])

    y_lower = 10
    for i in range(n_cluster):
        # Aggregate the silhouette scores for samples belonging to
        # cluster i, and sort them
        ith_cluster_silhouette_values = sample_silhouette_values[cluster_labels == i]

        ith_cluster_silhouette_values.sort()

        size_cluster_i = ith_cluster_silhouette_values.shape[0]
        y_upper = y_lower + size_cluster_i

        color = cm.nipy_spectral(float(i) / n_cluster)
        ax1.fill_betweenx(
            np.arange(y_lower, y_upper),
            0,
            ith_cluster_silhouette_values,
            facecolor=color,
            edgecolor=color,
            alpha=0.7,
        )

        # Label the silhouette plots with their cluster numbers at the middle
        ax1.text(-0.05, y_lower + 0.5 * size_cluster_i, str(i))

        # Compute the new y_lower for next plot
        y_lower = y_upper + 10  # 10 for the 0 samples
    ax1.set_ylabel("Cluster label")

    # The vertical line for average silhouette score of all the values
    ax1.axvline(x=silhouette_avg, color="red", linestyle="--")

    ax1.set_yticks([])  # Clear the yaxis labels / ticks
#    ax1.set_xticks([-0.1, 0, 0.2, 0.4, 0.6, 0.8, 1])
    plt.suptitle("Average silouette scores by number of clusters", fontweight="bold")
    plt.tight_layout()
plt.show()

For n_clusters = 2 The average silhouette_score is : 0.12182520066378914
For n_clusters = 3 The average silhouette_score is : 0.12298455415373394
For n_clusters = 4 The average silhouette_score is : 0.06615920430551211
For n_clusters = 5 The average silhouette_score is : 0.062263203571522374
For n_clusters = 6 The average silhouette_score is : 0.05360073591339586
For n_clusters = 7 The average silhouette_score is : 0.043399891992264766
For n_clusters = 8 The average silhouette_score is : 0.01865524677198136
For n_clusters = 9 The average silhouette_score is : 0.013274893057161832
For n_clusters = 10 The average silhouette_score is : 0.014055548382794032
For n_clusters = 11 The average silhouette_score is : 0.013559353746117705
For n_clusters = 12 The average silhouette_score is : 0.019298192911994215
For n_clusters = 13 The average silhouette_score is : 0.029625503519344714
For n_clusters = 14 The average silhouette_score is : 0.011439750949169072
For n_clusters = 15 The average silhouette_score is : 0.010659851615184082
For n_clusters = 16 The average silhouette_score is : 0.010407061841623768
For n_clusters = 17 The average silhouette_score is : 0.017270693194767258
For n_clusters = 18 The average silhouette_score is : 0.01369810719820829
For n_clusters = 19 The average silhouette_score is : 0.016630208862778856
For n_clusters = 20 The average silhouette_score is : 0.019144464436508506
For n_clusters = 21 The average silhouette_score is : 0.005561006651677671
For n_clusters = 22 The average silhouette_score is : 0.006354452315559386
For n_clusters = 23 The average silhouette_score is : 0.005722873193904617
For n_clusters = 24 The average silhouette_score is : 0.006108458004688574
For n_clusters = 25 The average silhouette_score is : 0.004737653906771348
For n_clusters = 26 The average silhouette_score is : 0.007702186499322102
For n_clusters = 27 The average silhouette_score is : 0.009876913034750879
For n_clusters = 28 The average silhouette_score is : 0.008628670013359775
For n_clusters = 29 The average silhouette_score is : 0.00908060151674441
For n_clusters = 30 The average silhouette_score is : 0.01149725260958716
For n_clusters = 31 The average silhouette_score is : 0.009230744450581462

_images/1ce45fbe6f285e7ac92503a9d0682ebae43f755c6ef49a749b36884ef2a5036e.png

Series(scores).plot.bar(title='Average Silhouette Score by Number of Clusters')

<Axes: title={'center': 'Average Silhouette Score by Number of Clusters'}>

_images/943cd8872d80924a8726279c8e1f7232e3aa44e28917cb98e1674e2d650e98a1.png

Heatmap of strategy correlations

A heatmap visualizes the distances between strategy returns (y-axis) and their respective cluster centers (x-axis), with clusters determined using the K-Means algorithm (K=20). The darkest-colored values appear along the jagged diagonal, indicating that strategy returns are closest to the centers of their assigned clusters. This visualization aids in identifying common style factors and risk exposures across different investment strategies.

kmeans = KMeans(n_clusters=13, random_state=0, n_init="auto").fit(X.T)
Z = DataFrame(kmeans.transform(X.T), index=X.columns)
Z['distance'] = np.min(Z.iloc[:, :kmeans.n_clusters], axis=1)
Z['cluster'] = np.argmin(Z.iloc[:, :kmeans.n_clusters], axis=1)
Z = Z.sort_values(['cluster', 'distance'])  # group by cluster and distance to center

fig, ax = plt.subplots(figsize=(10, 9))
plt.title('Distance of Factor Returns to Cluster Centers')
plt.imshow(Z.iloc[:, :kmeans.n_clusters], cmap='hot', aspect=1/3)
plt.yticks(range(len(Z)), labels=Z.index, fontsize=8)
plt.xticks(range(kmeans.n_clusters), labels=range(kmeans.n_clusters), fontsize=8)
plt.xlabel('Cluster')
plt.colorbar(label='distance from cluster centers')
plt.tight_layout()
plt.show()

_images/ef90d4509cc79fd13dd53bfb18a360239ef01ad389d94b02322d5309da76f2fa.png

References:

Stephen Ross, 1976, “The arbitrage theory of capital asset pricing”, Journal of Economic Theory, Volume 13, Issue 3, December 1976, Pages 341-360

Robert C. Merton, 1973, “An Intertemporal Capital Asset Pricing Model”, Econometrica, Vol. 41, No. 5 (Sep., 1973), pp. 867-887

Jeremiah Green, John R. M. Hand, X. Frank Zhang, 2017, “The Characteristics that Provide Independent Information about Average U.S. Monthly Stock Returns”, The Review of Financial Studies, Volume 30, Issue 12, December 2017, Pages 4389–4436

John H. Cochrane, 2011, “Presidential Address: Discount Rates”, The Journal of Finance, Volume 66, Issue 4, August 2011, Pages 1047-1108

Lo, Andrew W. (2004). “The Adaptive Markets Hypothesis: Market Efficiency from an Evolutionary Perspective”. Journal of Portfolio Management. 5. 30: 15–29

Andrew Lo, 2017, “Adaptive Markets: Financial Evolution at the Speed of Thought”.

Andrew Ang, 2014, “Asset Management: A Systematic Approach to Factor Investing”

FRM Part II Exam Book Investment Management Ch. 1-3.

Wharton Research Data Services.