FINANCIAL DATA SCIENCE
As financial markets produce vast volumes of structured and unstructured data, the ability to extract insights and build predictive models from that data has become increasingly important. Financial Data Science Python Notebooks is a practical guide for analysts, researchers, and data scientists who want to apply Python and its ecosystem of libraries, tools, frameworks, and community resources to financial analysis, econometrics, and machine learning.
Designed to support financial data science workflows, the companion FinDS Python package demonstrates how to use SQL databases, Redis, and MongoDB to manage and access large datasets, including:
- Core financial databases such as CRSP, Compustat, IBES, and TAQ
- Public economic data APIs from sources like FRED and the Bureau of Economic Analysis (BEA)
- Structured and unstructured data from academic and research websites
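As a rough illustration of the SQL side of such a workflow, the sketch below stores a few price records in a relational table and queries the most recent one. It uses Python's built-in `sqlite3` module as a stand-in for a production database. The `prices` table, the tickers, the values, and the helper names `load_prices` and `last_price` are all hypothetical and are not part of the FinDS package.

```python
import sqlite3

# Toy rows (ticker, date, price). The values are made up for illustration only.
ROWS = [
    ("AAA", "2025-01-02", 10.0),
    ("AAA", "2025-01-03", 11.0),
    ("BBB", "2025-01-02", 20.0),
    ("BBB", "2025-01-03", 19.0),
]


def load_prices(rows):
    """Create an in-memory SQLite database with a hypothetical `prices` table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE prices ("
        "ticker TEXT, date TEXT, price REAL, "
        "PRIMARY KEY (ticker, date))"
    )
    conn.executemany("INSERT INTO prices VALUES (?, ?, ?)", rows)
    conn.commit()
    return conn


def last_price(conn, ticker):
    """Return the most recent (date, price) for a ticker, or None if absent."""
    cur = conn.execute(
        "SELECT date, price FROM prices "
        "WHERE ticker = ? ORDER BY date DESC LIMIT 1",
        (ticker,),  # parameterized query: never build SQL by string concatenation
    )
    return cur.fetchone()


conn = load_prices(ROWS)
print(last_price(conn, "AAA"))  # ('2025-01-03', 11.0)
```

ISO-formatted date strings sort chronologically, which is why `ORDER BY date DESC` works here without a dedicated date type.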
In addition to data access, it provides practical examples and templates for applying:
- Financial econometrics and time series modeling
- Graph analytics, event studies, and backtesting strategies
- Machine learning for predictive analytics
- Natural language processing (NLP) to extract insights from financial text
- Neural networks and large language models (LLMs) for advanced decision-making
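To give a flavor of the econometrics item above, here is a minimal sketch of a market-model regression that estimates a stock's beta by ordinary least squares. It is not taken from the notebooks: the returns are simulated, the "true" beta of 1.2 is an assumption chosen for the demonstration, and the function name `ols_beta` is hypothetical.

```python
import random


def ols_beta(x, y):
    """Fit y = alpha + beta * x by ordinary least squares; return (alpha, beta)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of squared deviations of x, and sum of cross-deviations of x and y
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    beta = sxy / sxx
    alpha = mean_y - beta * mean_x
    return alpha, beta


# Simulated daily returns (assumption: not real market data).
rng = random.Random(0)  # fixed seed so the run is reproducible
market = [rng.gauss(0.0005, 0.01) for _ in range(2000)]
# The stock return is 1.2 times the market return plus idiosyncratic noise.
stock = [0.0001 + 1.2 * m + rng.gauss(0.0, 0.005) for m in market]

alpha, beta = ols_beta(market, stock)
print(f"alpha={alpha:.5f}, beta={beta:.3f}")
```

With 2,000 simulated observations the estimated beta should land close to the assumed value of 1.2. A real analysis would also report standard errors and check the regression diagnostics.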
March 2025: Updated with data through early 2025 and incorporated the latest LLMs – Microsoft Phi-4-multimodal (February 2025), Google Gemma-3-12B (March 2025), DeepSeek-R1-14B (January 2025), Meta Llama-3.1-8B (July 2024), and OpenAI GPT-4o-mini (July 2024).
Topics
Financial | Data | Science
---|---|---
Stock price properties | CRSP stocks | Statistical moments
Price momentum | CRSP stocks | Hypothesis testing,
Value and size | CRSP stocks, | Linear regression
CAPM | Fama-French | Non-linear regression,
Mean reversion, | CRSP stocks | Structural breaks
Factor investing, | CRSP stocks, | Cluster analysis
Event studies | S&P key developments | Multiple testing, Fourier transforms
Economic data revisions, | ALFRED | Outlier detection
Consumer and | FRED | Linear regression diagnostics
Industrial production | FRED | Time series analysis
Approximate factor models | FRED-MD | Unit root test,
State space models | FRED-MD | Gaussian mixture,
Interest rates | FRED yield curve | Low-rank approximation
Bonds risk factors | FRED bond returns | Principal component analysis
Binomial tree, | simulated | Monte Carlo simulations
Value-at-risk | FRED crypto-currencies | Conditional volatility
Portfolio risk | Fama-French industries | Covariance matrix estimation
Market liquidity | TAQ tick data | High frequency volatility
Earnings expectations | IBES | Poisson regression,
Supply chain | Compustat principal customers | Network graphs
Industry taxonomy | Hoberg-Phillips | Community detection
Input-output uses | Bureau of Economic Analysis | Graph centrality
Product markets | Hoberg-Phillips | Link prediction
Earnings surprises | IBES, Hoberg-Phillips | Spatial regression
FOMC meetings | Federal Reserve | Topic modeling
Management discussions | SEC Edgar, | Sentiment analysis
Business descriptions | SEC Edgar | Part-of-speech,
Industry classification | SEC Edgar | Classification
Macroeconomic forecasts | FRED-MD | Regression
Industry classification | SEC Edgar | Neural networks,
Macroeconomic forecasts | FRED-MD | Convolutional neural nets,
Macroeconomic forecasts | FRED-MD | Recurrent neural nets,
Retirement spending | SBBI | Reinforcement learning
Fedspeak | Federal Reserve | Language modeling,
Market risk disclosures | SEC Edgar | Text summarization
Industry classification | SEC Edgar | LLM fine-tuning
Financial news sentiment | Kaggle | Prompt engineering
Corporate philanthropy | Multi-agents, chatbots, |