Financial ETL Pipeline
A robust data engineering foundation that automates the Extraction, Transformation, and Loading (ETL) of SEC financial reports into actionable investment intelligence.
Engineered a highly modular Python-based ETL pipeline utilizing Design Patterns to automate the processing of complex SEC financial data. This system eliminated manual data entry for 50+ financial indicators and served as the foundational domain exploration for the 2025 LZStock Enterprise Architecture.

The Challenge & Impact
- The Challenge: Financial analysis of SEC reports (10-K/10-Q) is notoriously labor-intensive. Inconsistent data formats across different companies and the high volatility of financial API schemas meant that traditional, hard-coded scraping scripts would constantly break.
- The Objective: To build a resilient data pipeline—treating it as an enterprise software product rather than a simple script—capable of seamlessly swapping data sources (e.g., AlphaVantage, Yahoo Finance, and local files) and investment strategies without rewriting the core engine.
- The Impact: Transitioned unstructured financial data processing into a strictly systematized ETL pipeline, laying a robust architectural foundation for future scalability. Most importantly, this project served as the critical domain exploration phase, providing the deep financial logic required to accurately map the 15 Bounded Contexts in the 2025 LZStock microservices architecture.

Architecture & Execution
Tech Stack
[Python][Pandas/NumPy][Google Cloud API][Design Patterns][ETL]
Engineering Design
To ensure high extensibility, I avoided hard-coded procedural logic and instead implemented a suite of Design Patterns, treating the pipeline as a scalable enterprise software product. The architecture is strictly divided into three distinct phases:
Extract (Data Ingestion)
Ingests raw financial data from hybrid sources, bridging local Excel files and external APIs seamlessly.
- Template Method: Defines the standard data loading skeleton in `InputTemplate`, ensuring all data sources follow the same ingestion contract.
- Mediator: The `APIMediator` coordinates complex interactions and rate-limiting between different external API services.
- Command: Encapsulates API requests into autonomous command objects, completely decoupling the execution logic from the invoker.
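The Template Method contract above can be sketched as follows. `InputTemplate` is the class name from this project; the method names (`load`, `fetch`, `normalize`) and the `LocalFileInput` subclass are hypothetical, shown only to illustrate how the fixed skeleton forces every source through the same ingestion contract.

```python
from abc import ABC, abstractmethod


class InputTemplate(ABC):
    """Template Method: a fixed ingestion skeleton; subclasses supply the steps."""

    def load(self) -> dict:
        raw = self.fetch()          # source-specific step (file, API, ...)
        return self.normalize(raw)  # shared step applied to every source

    @abstractmethod
    def fetch(self) -> dict:
        """Return raw indicator -> value pairs from one concrete source."""

    def normalize(self, raw: dict) -> dict:
        # Shared post-processing: strip whitespace from keys, drop empty values.
        return {k.strip(): v for k, v in raw.items() if v is not None}


class LocalFileInput(InputTemplate):
    """Hypothetical concrete source backed by an in-memory dict."""

    def __init__(self, rows: dict):
        self.rows = rows

    def fetch(self) -> dict:
        return self.rows


data = LocalFileInput({" revenue ": 100, "eps": None}).load()
# data == {"revenue": 100}
```

Because `load()` is defined once in the base class, swapping AlphaVantage for a local file only requires a new `fetch()` implementation; the normalization contract never changes.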
Transform (Data Processing & Analysis)
Cleans raw data, calculates financial indicators, and executes valuation models.
- Chain of Responsibility: A `PipelineHandler` creates a sequential processing pipeline for data cleaning (e.g., stripping whitespace → type conversion → missing value interpolation).
- Abstract Factory: The `TableAbstractFactory` provides an interface to generate families of related financial tables (e.g., Price, Score, Decision) without specifying their concrete classes.
- Builder: Constructs complex, multi-layered analysis objects step-by-step (e.g., `ParsTableBuilder`).
- Strategy: Encapsulates interchangeable algorithms for valuation scoring and buy/sell decisions (`ScoreTableStrategy`, `BuyDecisionTableStrategy`), allowing new financial models to be "plugged in" without altering the core engine.
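A minimal sketch of the cleaning chain described above. `PipelineHandler` is the name used in this project; its internals here (a callable step plus a `set_next` link) are an assumption, illustrating how cleaning stages chain without knowing about each other.

```python
from typing import Callable, Optional


class PipelineHandler:
    """Chain of Responsibility: each handler applies one cleaning step,
    then forwards the result to the next handler in the chain."""

    def __init__(self, step: Callable):
        self.step = step
        self.next: Optional["PipelineHandler"] = None

    def set_next(self, handler: "PipelineHandler") -> "PipelineHandler":
        self.next = handler
        return handler  # returning the handler allows fluent chaining

    def handle(self, value):
        value = self.step(value)
        return self.next.handle(value) if self.next else value


# Build the chain from the example in the text: strip -> type conversion.
strip = PipelineHandler(lambda v: v.strip())
to_float = PipelineHandler(float)
strip.set_next(to_float)

cleaned = strip.handle("  42.5 ")
# cleaned == 42.5
```

Inserting a new stage (say, missing-value interpolation) is one `set_next` call; no existing handler is modified, which is the point of the pattern.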
Load (Data Output)
Dispatches the final analysis results to various destination systems.
- Observer: The `OutputSubject` actively notifies subscribed observers (such as Google Sheets APIs, SQL databases, or local CSV writers) when new analysis data is ready. This strictly decouples the analysis engine from the reporting layer.
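The Observer wiring can be sketched like this. `OutputSubject` is the project's name; the `attach`/`notify` methods and the `CsvWriter` stand-in are hypothetical, shown to make the decoupling concrete: the subject never knows which writers are listening.

```python
class OutputSubject:
    """Observer pattern: the analysis engine publishes finished tables;
    output writers subscribe and react independently."""

    def __init__(self):
        self._observers = []

    def attach(self, observer) -> None:
        self._observers.append(observer)

    def notify(self, table: dict) -> None:
        for obs in self._observers:
            obs.update(table)


class CsvWriter:
    """Stand-in observer; a real one would write a CSV row per update."""

    def __init__(self):
        self.received = []

    def update(self, table: dict) -> None:
        self.received.append(table)


subject = OutputSubject()
writer = CsvWriter()
subject.attach(writer)
subject.notify({"ticker": "AAPL", "score": 8})
# writer.received == [{"ticker": "AAPL", "score": 8}]
```

Adding a Google Sheets or SQL destination is just another `attach()`; the analysis engine's code path is untouched.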
Advanced Mechanics 1: Data Transformation Strategy
⚙️ View Details: Vectorized Operations
Leveraged Pandas & NumPy for heavy data manipulation, replacing slow Python iterative loops with vectorized operations to process multi-year financial statements efficiently.
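A small illustration of the vectorized style described above, on a hypothetical multi-year statement (the column names are invented for the example): whole-column arithmetic and `np.where` replace per-row Python loops and if/else branches.

```python
import numpy as np
import pandas as pd

# Hypothetical multi-year financial statement: one row per fiscal year.
df = pd.DataFrame({
    "revenue":    [100.0, 120.0, 150.0],
    "net_income": [10.0, 18.0, 30.0],
})

# Vectorized: each line operates on whole columns at once.
df["net_margin"] = df["net_income"] / df["revenue"]   # 0.10, 0.15, 0.20
df["revenue_yoy"] = df["revenue"].pct_change()        # year-over-year growth

# np.where replaces a per-row if/else branch.
df["profitable"] = np.where(df["net_income"] > 0, True, False)
```

The same computation with `iterrows()` would call Python-level code once per row; the vectorized form dispatches to compiled NumPy kernels, which is where the speedup over iterative loops comes from.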
Advanced Mechanics 2: Project Structure
⚙️ View Details: Codebase Organization
Utilized Abstract Base Classes (ABCs) to enforce strict interface contracts across the pipeline.
src/company_screener/
├── API/ # Data fetching layer (Command & Mediator Patterns)
├── CreateTables/ # Table generation logic (Abstract Factory Pattern)
├── Input/ # Data ingestion strategies (Template Method)
├── Output/ # Data export handlers (Observer Pattern)
└── mainFactory.py # Dependency Injection root (The brain of the system)
🔗 View Full Source on GitHub
🌟 Present-Day (2025) Retrospective
The Evolution of Complexity: In 2020, this pipeline was a monolithic Python application. While highly modularized through Design Patterns, it faced scaling limitations regarding real-time data ingestion and distributed processing.
The "Bridge" to 2025: The rigid synchronous nature of this pipeline (Request-Response) is exactly what led me to choose NATS and Event-Driven Architecture for the 2025 LZStock project.