
Core Order Service & Migration

Architecting a new TypeScript Node.js order engine from scratch to replace a legacy monolith, transforming it into an Event-Driven Architecture, unblocking international expansion, and supporting uninterrupted high-growth traffic across 4 countries.

TL;DR

Spearheaded the end-to-end technical execution of the most complex core system migration in the company's history. Rebuilt the core order engine from scratch in TypeScript, achieving 90% Jest test coverage, scaling to 100M+ monthly requests with P95 latency under 100ms, and delivering the accompanying React-based UI module. Led rigorous database optimizations that delivered 15x faster responsiveness, seamlessly supporting 24k+ merchants and 6.5k daily orders across international markets.

The Challenge & Impact

  • The Challenge: The legacy "Old Order Service" had grown into a massive monolithic bottleneck that severely hindered international expansion. It tightly coupled heavy, CPU-intensive dispatching and trip logic with core order management. This monolithic processing made the system highly vulnerable to cascading failures during peak traffic and crippled high-frequency API orchestration.
  • The Objective: To execute a strategic transformation towards an Event-Driven Architecture (EDA). The goal was to decouple the monolith into a refined "New Core Order Service" (focusing strictly on Order, Parcel, and Waypoint data) and isolated "Last-leg" services. The New Core Order Service needed to act as a Read-Oriented System Hub, synchronizing downstream modules with absolute zero migration downtime.
  • The Impact: Successfully executed a 4-phase Soft Launch migration without a single minute of user disruption. The new TypeScript Core Order Service seamlessly powered market expansion across 4 countries, scaling to 100M+ monthly requests (1.2k peak RPS) with P95 latency under 100ms. Furthermore, led rigorous database query optimizations as the new dataset scaled, cutting redundant infrastructure costs and delivering 15x faster responsiveness with 50% lower database resource consumption.

Architecture & Execution

Tech Stack

TypeScript (Node.js), React (Hooks), PostgreSQL, NATS JetStream, Jest, AWS Performance Insights

1. Strategic Transformation & Quality Engineering

Moving away from the legacy monolithic processing, I architected the new Core Order Service to guarantee strict boundary isolation, type safety, and maintainability:

  • Refined Domain Scope: Delegated complex dispatching and trip calculation logic to external worker services. Narrowed the New Core Order Service's scope to focus exclusively on core entities: Order, Parcel, Waypoint, and an Append-only Tracking Timeline.
  • Read-Oriented System Hub: Optimized the service for high-frequency I/O queries (e.g., getOrder, getOrders). Leveraged durable messaging to orchestrate downstream modules (Wallet, Warehouse, and Outsource) both synchronously and asynchronously.
  • TypeScript & 90% Test Coverage: Built the core order engine entirely from scratch using TypeScript, elevating codebase reliability. Enforced strict testing standards, achieving over 90% unit and integration test coverage with Jest.
  • Full-Stack Ownership (React): Owned and delivered the corresponding React-based order detail module. Leveraged modern React Hook state management to integrate directly with the new Node APIs.
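The narrowed domain scope above can be sketched as a handful of TypeScript interfaces. This is a minimal illustration; the field names are assumptions for this sketch, not the production schema:

```typescript
// Illustrative sketch of the narrowed domain: Order, Parcel, Waypoint,
// and an append-only tracking timeline. Field names are assumptions.
interface Waypoint {
  id: string;
  orderId: string;
  sequence: number; // pickup first, drop-offs follow
  address: string;
}

interface Parcel {
  id: string;
  orderId: string;
  weightGrams: number;
}

// Append-only timeline entry: rows are inserted, never updated.
interface TimelineEvent {
  orderId: string;
  event: string; // e.g. "ORDER_CREATED"
  occurredAt: Date;
}

interface Order {
  id: string;
  status: string;
  parcels: Parcel[];
  waypoints: Waypoint[];
}

// The read-oriented hub exposes high-frequency lookup endpoints;
// a Map stands in for the real data layer here.
function getOrder(store: Map<string, Order>, id: string): Order | undefined {
  return store.get(id);
}
```

Dispatching and trip calculation deliberately have no place in these types; they live in external worker services.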

2. Why Event-Driven Architecture?

Transitioning from a synchronous legacy monolith to an Event-Driven Architecture (via NATS JetStream) was not merely a modern tech-stack upgrade; it was a strategic necessity to handle our 100M+ monthly requests and strict latency requirements.

  • Elastic Scaling & Load Leveling: By completely decoupling event production from consumption, downstream domain services (such as Wallet and Warehouse) can process events at their own capacity. During 1.2k RPS peak surges, the message queue acts as a critical buffer, preventing the core Backbone engine from being bottlenecked by slower downstream systems.
  • Optimizing the Critical Path (Order Traceability): In the legacy system, writing to the order lifecycle timeline was a synchronous operation that severely degraded API performance. By delegating timeline population to a dedicated asynchronous consumer, we stripped non-essential database writes from the critical path. This architectural shift was the primary driver in achieving our strict P95 latency of under 100ms.

3. The 4-Phase Zero-Downtime Migration (Soft Launch)

To guarantee system stability during the transition, I took full ownership of executing a strict 4-phase soft launch strategy. I engineered the critical backward-compatible adapters that allowed seamless data flow between the legacy and new systems without disrupting ongoing operations:

Four Phases Migration

  1. Phase 1 (Baseline Verification): Implemented Dual-write to both legacy and new systems, while continuing to read from Legacy to verify data consistency.
  2. Phase 2 (Async Sync): Shifted primary writes to the New Service, asynchronously syncing data back to Legacy as a fallback.
  3. Phase 3 (Canary Read Switch): The most critical phase. Primary writes remained on the New Service, and we strategically switched read traffic to the New Service to perform canary testing.
  4. Phase 4 (Full Cutover): Complete deprecation of the legacy order system.

Advanced Mechanics 1: Performance Engineering & Scalability


Given the I/O-bound, Read-Oriented nature of the New Core Order Service, the data layer was completely overhauled. My query optimization initiative yielded a 15x speedup and halved our resource footprint:

  • Bottleneck Identification: Utilized AWS Performance Insights to pinpoint high-latency SQL queries and analyze execution plans under heavy load.
  • Indexing Strategy: Optimized query paths by implementing targeted indexing (Composite, Covering/INCLUDE, Partial/WHERE, and GIN indexes), strictly aiming for Index Only Scans and eliminating expensive Sequential Scans.
  • Load Distribution: Effectively leveraged RDS Read Replicas (1 Master, 2 Replicas) to distribute the massive read volume generated by high-frequency operations.
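The read/write split across the 1-master, 2-replica topology can be sketched as a small router. This is an illustrative stand-in, not the production connection layer; real pools would also handle replica lag and failover:

```typescript
// Minimal round-robin router: writes always go to the master,
// reads rotate across replicas. Connection handles are stand-ins
// for real connection pools.
class ReplicaRouter<Conn> {
  private next = 0;

  constructor(private master: Conn, private replicas: Conn[]) {}

  forWrite(): Conn {
    return this.master;
  }

  forRead(): Conn {
    const conn = this.replicas[this.next];
    this.next = (this.next + 1) % this.replicas.length;
    return conn;
  }
}
```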

Advanced Mechanics 2: Event-Driven Resilience

  • Elastic Scaling & Load Leveling: Decoupled consumption allows downstream services (Wallet, Warehouse) to handle traffic spikes at their own pace. Heavy CPU dispatching tasks are offloaded, ensuring the New Core Order Service remains responsive and immune to cascading bottlenecks.
  • Data Integrity & Traceability: Engineered an append-only event stream timeline. This asynchronous population minimizes latency on the critical path while providing a naturally immutable, single source of truth for the entire order lifecycle, significantly simplifying complex dispute resolutions.

🌟 Engineering Retrospective (Scaling the Core Order Service)

Navigating the maturation and scaling phases of the newly built Core Order Service provided invaluable engineering lessons regarding database modeling and distributed trade-offs:

1. The EAV Model Trap vs. JSONB

Initially within the New Core Order Service, we implemented the Entity-Attribute-Value (EAV) model for secondary order attributes to maximize schema flexibility. The Reality: It led to extreme vertical table bloat (10+ rows per order), destroyed index efficiency, and stripped database-level constraints.

-- EAV pattern
CREATE TABLE eav_values (
  id SERIAL PRIMARY KEY,
  ...
  key VARCHAR(20),
  value TEXT,
  ...
);

The Fix: We reverted to fixed columns for core data and leveraged PostgreSQL JSONB with GIN indexing for flexible attributes, striking the perfect balance between schema-less flexibility and query performance.

-- JSONB
CREATE TABLE products (
  id SERIAL PRIMARY KEY,
  ...
  attributes JSONB NOT NULL DEFAULT '{}'
);

2. Strict Normalization vs. Read-Scale Denormalization

To ensure strict ACID compliance, the New Core Order Service's order data was spread across 6 highly normalized tables. As the dataset matured, these 6-way joins crippled read performance.

The Fix: We adopted a pragmatic modeling approach. For fast-growing, read-heavy data, we shifted to Denormalized Read Models, sacrificing a small amount of storage efficiency for massive gains in read latency.
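The trade-off can be illustrated with a minimal sketch: normalized writes stay authoritative, and every write path finishes by refreshing a pre-joined read row. The names and shapes here are assumptions for illustration, not the production tables:

```typescript
// Normalized side: order header and parcels live in separate "tables".
const orders = new Map<string, { id: string; status: string }>();
const parcels = new Map<string, { orderId: string; weightGrams: number }>();

// Denormalized read model: one pre-joined row per order, refreshed on write,
// so hot reads are a single lookup instead of a multi-way join.
const orderReadModel = new Map<
  string,
  { id: string; status: string; parcelCount: number; totalWeightGrams: number }
>();

function refreshReadModel(orderId: string): void {
  const order = orders.get(orderId);
  if (!order) return;
  const own = [...parcels.values()].filter((p) => p.orderId === orderId);
  orderReadModel.set(orderId, {
    id: orderId,
    status: order.status,
    parcelCount: own.length,
    totalWeightGrams: own.reduce((sum, p) => sum + p.weightGrams, 0),
  });
}

// Every write path ends by refreshing the read model.
function addParcel(id: string, orderId: string, weightGrams: number): void {
  parcels.set(id, { orderId, weightGrams });
  refreshReadModel(orderId);
}
```

The storage cost is one redundant row per order; the payoff is that the read path never touches the normalized tables.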

3. Idealistic Event Sourcing vs. Materialized State

We initially calculated order statuses on-the-fly by querying the latest record from the New Core Order Service's event store (ORDER BY created_at DESC LIMIT 1). As the table grew to millions of rows, this became prohibitively expensive.

CREATE TYPE event_type AS ENUM (
  'ORDER_CREATED',
  'PAYMENT_SUCCESS',
  ...
);

CREATE TABLE timelines (
  id INT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
  event event_type NOT NULL,
  ...
);

The Fix: We shifted the workload from Read-time to Write-time by persisting the latest status to a dedicated field in the main orders table, reducing an expensive $O(\log N)$ search to an instant $O(1)$ lookup.

CREATE TABLE orders (
  ...
  status event_type NOT NULL,
  ...
);

Order-Status-Retrieval
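The write-time materialization can be sketched in a few lines. The in-memory structures below stand in for the `timelines` and `orders` tables, under the assumption that every status change flows through a single append function:

```typescript
type EventType = "ORDER_CREATED" | "PAYMENT_SUCCESS";

const timeline: { orderId: string; event: EventType; seq: number }[] = [];
const orderStatus = new Map<string, EventType>(); // materialized status column

// Write-time: append the event AND persist the latest status in one step,
// so reads never scan the timeline.
function appendEvent(orderId: string, event: EventType): void {
  timeline.push({ orderId, event, seq: timeline.length });
  orderStatus.set(orderId, event); // replaces ORDER BY created_at DESC LIMIT 1
}

// Read-time: a single keyed lookup on the orders table.
function getStatus(orderId: string): EventType | undefined {
  return orderStatus.get(orderId);
}
```

The timeline stays append-only and fully auditable; only the derived "latest status" is duplicated onto the order row.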

4. Distributed Transaction Resilience (The Saga Pattern)

While NATS JetStream (Raft-based) guaranteed message persistence across our new EDA, it couldn't resolve downstream business logic failures (e.g., inventory shortage after an order is created).

The Evolution: I advocated for a multi-tiered resilience strategy: Retry/Dead Letter Queues (DLQ) for transient network errors, and transitioning toward a Choreography-based Saga Pattern to manage compensating transactions for permanent business failures.
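A choreography-based saga can be sketched as services reacting to each other's events, with a permanent failure triggering a compensating transaction. This is a deliberately simplified model (single SKU, synchronous loop in place of the broker); the event names are illustrative:

```typescript
type SagaEvent =
  | { type: "ORDER_CREATED"; orderId: string }
  | { type: "INVENTORY_RESERVED"; orderId: string }
  | { type: "INVENTORY_SHORTAGE"; orderId: string }
  | { type: "ORDER_CANCELLED"; orderId: string };

const orders = new Map<string, string>();
let stockLeft = 0; // stand-in for warehouse inventory

// Each branch plays the role of one service's consumer.
function handle(e: SagaEvent): SagaEvent | null {
  switch (e.type) {
    case "ORDER_CREATED": // Warehouse: try to reserve stock
      if (stockLeft > 0) {
        stockLeft -= 1;
        return { type: "INVENTORY_RESERVED", orderId: e.orderId };
      }
      return { type: "INVENTORY_SHORTAGE", orderId: e.orderId };
    case "INVENTORY_RESERVED": // Order service: confirm
      orders.set(e.orderId, "CONFIRMED");
      return null;
    case "INVENTORY_SHORTAGE": // Compensating transaction: cancel the order
      orders.set(e.orderId, "CANCELLED");
      return { type: "ORDER_CANCELLED", orderId: e.orderId };
    default:
      return null;
  }
}

// Drive the choreography until no service reacts further.
function run(first: SagaEvent): void {
  let next: SagaEvent | null = first;
  while (next) next = handle(next);
}
```

There is no central orchestrator: the compensation is triggered purely by the shortage event, which is what distinguishes choreography from an orchestrated saga.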

5. Decoupling the "Fat Controller"

The Monolithic Controller: A single controller was heavily coupled, handling everything from gRPC communication and ORM mapping to complex domain business logic and downstream service calls.

The Pain Points: This made the code extremely brittle, nearly impossible to unit test effectively without massive mocking, and created a high maintenance overhead for onboarding new features.

The Shift to Clean Architecture: To resolve this, I advocated for a paradigm shift towards Clean Architecture. By strictly separating core business logic from the infrastructure layer—moving logic into dedicated Use Cases (Orchestrators) and relying heavily on Dependency Injection for databases and gRPC clients—we successfully isolated each component. This shift from simply "scripting logic" to "architecting services" was a major turning point for our team's engineering standards.
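The separation can be sketched as a use case depending only on ports, with infrastructure injected. The interface and class names are illustrative, not the production code:

```typescript
// Ports: the use case depends on interfaces, never on an ORM or gRPC client.
interface OrderRepository {
  save(id: string, status: string): void;
}
interface DispatchClient {
  requestDispatch(orderId: string): void;
}

// Use Case (Orchestrator): pure business flow, trivially unit-testable
// by injecting fakes instead of wiring up heavy mocks.
class CreateOrderUseCase {
  constructor(
    private repo: OrderRepository,
    private dispatch: DispatchClient,
  ) {}

  execute(orderId: string): void {
    this.repo.save(orderId, "CREATED");
    this.dispatch.requestDispatch(orderId);
  }
}
```

The gRPC controller shrinks to translating a request into `execute(orderId)`, while the database and downstream adapters implement the ports at the composition root.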

🚀 The Birth of LZStock

The architectural bottlenecks I encountered and resolved during this Backbone migration became the primary catalyst for my next major initiative. I took these Clean Architecture principles and applied them from day one to architect a high-concurrency financial ecosystem from scratch: LZStock.

Clean Architecture

👉 Explore how I implemented strict Clean Architecture for LZStock