FINANCIAL DATA SCIENCE


As financial markets produce vast volumes of structured and unstructured data, extracting insights and building predictive models from that data has become increasingly important. The Financial Data Science Python Notebooks are a practical guide for analysts, researchers, and data scientists who want to apply Python and its ecosystem of libraries and tools to financial analysis, econometrics, and machine learning.

Designed to support financial data science workflows, the companion FinDS Python package demonstrates how to use SQL databases, Redis, and MongoDB to manage and access large datasets, including:

  • Core financial databases such as CRSP, Compustat, IBES, and TAQ

  • Public economic data APIs from sources like FRED and the Bureau of Economic Analysis (BEA)

  • Structured and unstructured data from academic and research websites
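As a rough illustration of the SQL-backed storage pattern described above, the sketch below loads a few monthly returns into a database and queries them by date range. It does not use the FinDS API: Python's built-in `sqlite3` stands in for a SQL engine, the table name `monthly_returns`, its columns, and the helper functions are hypothetical, and the values are toy numbers rather than real CRSP data.

```python
import sqlite3

# Hypothetical schema and toy values, for illustration only (not real CRSP data).
rows = [
    ("AAA", "2024-01-31", 0.021),
    ("AAA", "2024-02-29", -0.013),
    ("BBB", "2024-01-31", 0.008),
    ("BBB", "2024-02-29", 0.017),
]


def load_returns(conn, rows):
    """Create a small monthly-returns table and bulk-insert the rows."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS monthly_returns ("
        "ticker TEXT, date TEXT, ret REAL, PRIMARY KEY (ticker, date))"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO monthly_returns VALUES (?, ?, ?)", rows
    )
    conn.commit()


def returns_between(conn, ticker, start, end):
    """Return (date, ret) pairs for one ticker within an inclusive date range."""
    cur = conn.execute(
        "SELECT date, ret FROM monthly_returns "
        "WHERE ticker = ? AND date BETWEEN ? AND ? ORDER BY date",
        (ticker, start, end),
    )
    return cur.fetchall()


conn = sqlite3.connect(":memory:")  # in-memory database; a file path would persist it
load_returns(conn, rows)
print(returns_between(conn, "AAA", "2024-01-01", "2024-12-31"))
```

Storing dates as ISO-8601 strings keeps the `BETWEEN` filter and the `ORDER BY` correct under plain string comparison, which is one simple way to handle dates in SQLite.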

In addition to data access, the package provides practical examples and templates for applying:

  • Financial econometrics and time series modeling

  • Graph analytics, event studies, and backtesting strategies

  • Machine learning for predictive analytics

  • Natural language processing (NLP) to extract insights from financial text

  • Neural networks and large language models (LLMs) for advanced decision-making
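To give a flavor of the time series modeling item in the list above, here is a minimal sketch that simulates a first-order autoregressive process and recovers its coefficient by ordinary least squares. It is not taken from the notebooks: the function names `simulate_ar1` and `fit_ar1`, the coefficient 0.6, and the sample size are assumptions chosen for illustration, and the data are simulated.

```python
import random


def simulate_ar1(phi, n, seed=0):
    """Simulate x[t] = phi * x[t-1] + e[t] with standard normal shocks."""
    rng = random.Random(seed)
    x = [0.0]
    for _ in range(n - 1):
        x.append(phi * x[-1] + rng.gauss(0.0, 1.0))
    return x


def fit_ar1(x):
    """OLS slope from regressing x[t] on x[t-1] with an intercept."""
    y, lag = x[1:], x[:-1]
    n = len(y)
    mean_y = sum(y) / n
    mean_lag = sum(lag) / n
    cov = sum((a - mean_lag) * (b - mean_y) for a, b in zip(lag, y))
    var = sum((a - mean_lag) ** 2 for a in lag)
    return cov / var


# Simulated data only; phi=0.6 is an arbitrary illustrative choice.
series = simulate_ar1(phi=0.6, n=5000)
phi_hat = fit_ar1(series)
print(f"estimated AR(1) coefficient: {phi_hat:.3f}")
```

With a few thousand observations the estimate should land close to the true coefficient; in practice a library such as statsmodels would typically be used instead of hand-rolled sums.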

March 2025: Updated with data through early 2025 and incorporated the latest LLMs: Microsoft Phi-4-multimodal (released February 2025), Google Gemma-3-12B (March 2025), DeepSeek-R1-14B (January 2025), Meta Llama-3.1-8B (July 2024), and OpenAI GPT-4o-mini (July 2024).


Topics

| Notebook | Financial | Data | Science |
|---|---|---|---|
| 1.1_stock_prices | Stock price properties | CRSP stocks | Statistical moments |
| 1.2_jegadeesh_titman | Price momentum | CRSP stocks | Hypothesis testing, Newey-West estimator |
| 1.3_fama_french | Value and size | CRSP stocks, Compustat | Linear regression |
| 1.4_fama_macbeth | CAPM | Fama-French | Non-linear regression, quadratic optimization |
| 1.5_contrarian_trading | Mean reversion, implementation shortfall | CRSP stocks | Structural breaks |
| 1.6_quant_factors | Factor investing, backtesting | CRSP stocks, Compustat, IBES | Cluster analysis |
| 1.7_event_study | Event studies | S&P key developments | Multiple testing, Fourier transforms |
| 2.1_economic_indicators | Economic data revisions, employment payrolls | ALFRED | Outlier detection |
| 2.2_regression_diagnostics | Consumer and producer prices | FRED | Linear regression diagnostics |
| 2.3_time_series | Industrial production and inflation | FRED | Time series analysis |
| 2.4_approximate_factors | Approximate factor models | FRED-MD | Unit root test, EM algorithm |
| 2.5_economic_states | State space models | FRED-MD | Gaussian mixture, hidden Markov models |
| 3.1_term_structure | Interest rates | FRED yield curve | Low-rank approximation |
| 3.2_bond_returns | Bond risk factors | FRED bond returns | Principal component analysis |
| 3.3_options_pricing | Binomial tree, Black-Scholes-Merton | Simulated | Monte Carlo simulations |
| 3.4_value_at_risk | Value-at-risk | FRED cryptocurrencies | Conditional volatility |
| 3.5_covariance_matrix | Portfolio risk | Fama-French industries | Covariance matrix estimation |
| 3.6_market_microstructure | Market liquidity | TAQ tick data | High-frequency volatility |
| 3.7_event_risk | Earnings expectations | IBES | Poisson regression, generalized linear model |
| 4.1_network_graphs | Supply chain | Compustat principal customers | Network graphs |
| 4.2_community_detection | Industry taxonomy | Hoberg-Phillips | Community detection |
| 4.3_graph_centrality | Input-output uses | Bureau of Economic Analysis | Graph centrality |
| 4.4_link_prediction | Product markets | Hoberg-Phillips | Link prediction |
| 4.5_spatial_regression | Earnings surprises | IBES, Hoberg-Phillips | Spatial regression |
| 5.1_fomc_topics | FOMC meetings | Federal Reserve | Topic modeling |
| 5.2_management_sentiment | Management discussions | SEC EDGAR, Loughran-McDonald | Sentiment analysis |
| 5.3_business_textual | Business descriptions | SEC EDGAR | Part-of-speech, density-based clustering |
| 6.1_classification_models | Industry classification | SEC EDGAR | Classification |
| 6.2_regression_models | Macroeconomic forecasts | FRED-MD | Regression |
| 6.3_deep_learning | Industry classification | SEC EDGAR | Neural networks, word embeddings |
| 6.4_convolutional_net | Macroeconomic forecasts | FRED-MD | Convolutional neural nets, vector autoregression |
| 6.5_recurrent_net | Macroeconomic forecasts | FRED-MD | Recurrent neural nets, dynamic factor models |
| 6.6_reinforcement_learning | Retirement spending | SBBI | Reinforcement learning |
| 6.7_language_modeling | Fedspeak | Federal Reserve | Language modeling, Transformers |
| 7.1_large_language_models | Market risk disclosures | SEC EDGAR | Text summarization |
| 7.2_llm_finetuning | Industry classification | SEC EDGAR | LLM fine-tuning |
| 7.3_llm_prompting | Financial news sentiment | Kaggle | Prompt engineering |
| 7.4_llm_agents | Corporate philanthropy | MVCP textbook | Multi-agents, chatbots, retrieval-augmented generation |

Documentation

GitHub repos

Contact

https://terence-lim.github.io