Infrastructure & Internal Tools

Architecting foundational internal developer platform (IDP) tooling and real-time infrastructure to accelerate engineering velocity and ensure high availability across a distributed microservices ecosystem.

TL;DR

I engineered a high-performance Feature Toggle SDK ecosystem (Node.js/Go/React) to enable low-risk product experiments. To support its real-time sync requirements without jeopardizing core system reliability, I architected an independent WebSocket Gateway that absorbs stateful connection churn. I also authored a distributed tracing library based on aspect-oriented programming (AOP), cutting issue identification time from 20 minutes to 2 minutes.

The Challenge & Impact

  • The Challenge: To accelerate product iteration, the business required a robust Feature Toggle system for instant A/B testing. However, pushing real-time toggle updates required persistent WebSockets. Integrating these stateful connections directly into our core stateless REST APIs introduced severe connection churn and memory leak risks during deployments, threatening overall system reliability. In addition, trace IDs were frequently lost across asynchronous boundaries in the growing set of microservices, which slowed debugging and drove up Mean Time To Recovery (MTTR).
  • The Objective: To build a high-performance feature-flagging ecosystem powered by a dedicated, standalone WebSocket infrastructure, completely decoupling stateful connections from stateless business logic. Simultaneously, to guarantee trace context persistence across all asynchronous boundaries.
  • The Impact: Empowered product teams to run instant, low-risk A/B tests across 30+ microservices in production. Protected core REST services from connection drops and memory bloat during deployments. Improved system observability, reducing the time needed to track down critical bugs by 90% (from 20 minutes to 2 minutes).

Architecture & Execution

Node.js, Express, React (Hooks/Context API), NATS JetStream, WebSockets, Proxy/Reflect, AsyncLocalStorage

1. Independent WebSocket Gateway (Real-Time Infrastructure)

To support the real-time push requirements of the Feature Toggle system without compromising the stability of our core REST APIs, I architected a standalone, Express-based WebSocket gateway.

  • State/Stateless Decoupling: By terminating WebSockets at this dedicated gateway, the core business services remained purely stateless. This eliminated connection churn during frequent API deployments.
  • Event Stream Integration: Seamlessly integrated the gateway with NATS JetStream, allowing backend microservices to broadcast real-time configuration changes to connected clients efficiently.
  • Declarative Configuration: Engineered a flexible routing configuration using predicate functions to handle dynamic grouping, payload validation, and data transformation before broadcasting.
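As a rough illustration of the declarative, predicate-based configuration described above, the sketch below models one routing rule as plain data plus functions. The rule shape (`match`, `validate`, `transform`, `groupBy`), the subject name, the payload fields, and the `routeMessage` helper are all hypothetical; the gateway's actual configuration format is not shown in this write-up.

```javascript
// Hypothetical routing rules: each rule is plain data plus predicate functions.
const routes = [
  {
    // match: which incoming events this rule handles
    match: (event) => event.subject === 'feature-toggle.updated',
    // validate: reject malformed payloads before they reach clients
    validate: (payload) =>
      typeof payload.key === 'string' && typeof payload.enabled === 'boolean',
    // transform: shape the message that clients actually receive
    transform: (payload) => ({
      type: 'TOGGLE_UPDATED',
      key: payload.key,
      enabled: payload.enabled,
    }),
    // groupBy: which group of connected clients should receive it
    groupBy: (payload) => `project:${payload.projectId}`,
  },
];

// Resolve an event against the rules; returns null when nothing should be sent.
function routeMessage(rules, event) {
  const rule = rules.find((r) => r.match(event));
  if (!rule || !rule.validate(event.payload)) return null;
  return {
    group: rule.groupBy(event.payload),
    message: rule.transform(event.payload),
  };
}
```

Keeping the rules as data means a new event type can be supported by adding an entry to the array rather than touching the broadcast loop.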

2. Feature Toggle Ecosystem (Canary Releases)

Built a comprehensive Feature Toggle system acting as an instant switch, enabling product managers to turn features on or off without requiring a new deployment.

  • High-Performance SDKs: Authored client libraries for React, Golang, and Node.js. To keep network latency off the critical path, I designed the SDKs to resolve flags from local state: configurations are cached in memory during initialization, so checking a feature flag is an O(1) in-memory lookup rather than a blocking API call.
  • Optimized Payload: Implemented tag-based filtering (e.g., 'release', 'developing') to strictly limit the query scope and reduce the initialization payload size.
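The local-resolution idea can be sketched as follows. This is a minimal illustration, not the SDK's real API: the `FeatureToggleClient` class, its method names, and the `fetchToggles` callback are assumptions made for the example.

```javascript
// Minimal sketch of local flag resolution (all names are hypothetical).
class FeatureToggleClient {
  constructor(fetchToggles, tags = []) {
    this.fetchToggles = fetchToggles; // async (tags) => [{ key, enabled }]
    this.tags = tags;                 // e.g. ['release'] to narrow the query
    this.cache = new Map();
  }

  // One network round trip at startup; the tags limit what is fetched.
  async init() {
    const toggles = await this.fetchToggles(this.tags);
    this.cache = new Map(toggles.map((t) => [t.key, t.enabled]));
  }

  // Hot path: a synchronous Map lookup with no I/O.
  // Flags that were never loaded fall back to the supplied default.
  isEnabled(key, defaultValue = false) {
    return this.cache.has(key) ? this.cache.get(key) : defaultValue;
  }
}
```

Because `isEnabled` never awaits anything, it can be called inside request handlers or render paths without adding latency; the trade-off is that the cache must be refreshed by some other mechanism.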

3. Aspect-Oriented Distributed Tracing Library

In a distributed environment, losing the Trace ID across async boundaries is a major pain point.

  • Asynchronous Context Tracking: Leveraged Node.js AsyncLocalStorage (from the async_hooks module) so that trace metadata follows each request across asynchronous boundaries automatically, eliminating the need to pass context objects manually through every function ("prop-drilling").
  • Aspect-Oriented Interception: Utilized JavaScript's Proxy and Reflect APIs to build a high-level function tracing wrapper. This allowed us to inject logging and performance monitoring transparently at the service layer without modifying any existing business logic.
Code example:
const dayjs = require('dayjs')

/** @typedef {import('async_hooks').AsyncLocalStorage<{
 *   path: string,
 *   pack: string,
 *   service: string,
 *   method: string,
 *   isSkipPayload: boolean,
 *   callStack: string[]
 * }>} ClsService */

/**
 * Builds a wrapper that traces calls to a function through a Proxy.
 * @template T
 * @param {ClsService} clsService
 * @returns {(func: T, functionName?: string) => T}
 */
module.exports.GenerateWithLog = (clsService) => {
  return (func, functionName) => {
    // generateFunctionLog is defined elsewhere in the library (not shown here)
    const log = generateFunctionLog(clsService)
    const handler = {
      // The apply trap runs on every call of the proxied function
      apply: (target, thisArg, args) => {
        const startedAt = dayjs()
        // log start here
        try {
          const result = Reflect.apply(target, thisArg, args)
          // Detect promises by shape, so that promise-returning functions
          // not declared with `async` are traced as well
          if (result && typeof result.then === 'function') {
            return result
              .then((value) => {
                // async log end here
                return value
              })
              .catch((error) => {
                // async log error here
                throw error
              })
          }
          // log end here
          return result
        } catch (error) {
          // log error here (synchronous throw)
          throw error
        }
      },
    }
    return new Proxy(func, handler)
  }
}
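The asynchronous context tracking described in this section can be illustrated with a small, self-contained sketch. The helper names (`withTrace`, `currentTraceId`, `loadUser`) and the shape of the stored object are hypothetical; only `AsyncLocalStorage` and its `run`/`getStore` methods come from Node.js itself.

```javascript
const { AsyncLocalStorage } = require('node:async_hooks');
const { randomUUID } = require('node:crypto');

const traceStorage = new AsyncLocalStorage();

// Run a unit of work (for example, one request) inside its own trace context.
function withTrace(work, traceId = randomUUID()) {
  return traceStorage.run({ traceId }, work);
}

// Callable from any depth without receiving the context as an argument.
function currentTraceId() {
  const store = traceStorage.getStore();
  return store ? store.traceId : undefined;
}

// A function several layers down that never receives the trace ID explicitly.
async function loadUser() {
  await new Promise((resolve) => setTimeout(resolve, 10)); // async boundary
  return { id: 1, traceId: currentTraceId() };
}
```

Calling `withTrace(loadUser, 'trace-123')` resolves to an object whose `traceId` is `'trace-123'`, even though `loadUser` takes no arguments. Concurrent calls each see their own ID, and code running outside any `withTrace` call sees `undefined`.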

Advanced Mechanics 1: WebSocket State & Resilience

Handling persistent connections at scale requires rigorous memory management and resilience against network instability:

  • O(1) Connection Tracking: Managed session state in an ES6 Map keyed by the socket object, giving constant-time lookup. Because a Map holds strong references to its keys, entries must be removed when a socket closes so that the socket can be garbage-collected.
  • Zombie Connection Pruning: Implemented a 30-second ping/pong heartbeat. Unlike OS-level TCP keep-alives, this application-level check confirms that the peer still responds at the WebSocket layer, so "half-open" ghost connections are detected and closed quickly instead of leaking memory.
  • Resilient React Client: On the frontend, I provided a custom hook backed by a Context Provider. The connect function is memoized with useCallback, with sessionToken as its dependency, so the app shares a single WebSocket instance that is re-created only when the token changes, rather than paying for a new handshake on every component re-render.
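The connection tracking and heartbeat pruning described above might be combined along these lines. This is a dependency-free sketch: each `socket` is assumed to expose `ping()` and `terminate()` (as sockets from the `ws` package do), and the function names and the `isAlive` flag are hypothetical.

```javascript
const HEARTBEAT_INTERVAL_MS = 30_000;
const connections = new Map(); // socket -> { isAlive, sessionToken }

function track(socket, sessionToken) {
  connections.set(socket, { isAlive: true, sessionToken });
}

// Call from the socket's 'pong' handler.
function markAlive(socket) {
  const state = connections.get(socket);
  if (state) state.isAlive = true;
}

// Call from the socket's 'close' handler so the Map stops referencing it.
function untrack(socket) {
  connections.delete(socket);
}

// One heartbeat round: drop sockets that never answered the previous ping,
// then ping the rest and wait for their pong before the next round.
function sweep() {
  for (const [socket, state] of connections) {
    if (!state.isAlive) {
      socket.terminate();
      connections.delete(socket); // deleting during Map iteration is safe
      continue;
    }
    state.isAlive = false;
    socket.ping();
  }
}

// In a running gateway: setInterval(sweep, HEARTBEAT_INTERVAL_MS);
```

With a 30-second interval, a silent connection is terminated on the sweep after the one that pinged it, which is 30 to 60 seconds after it last answered.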

Advanced Mechanics 2: Feature Toggle SDK Hybrid Sync & Resilience

To guarantee that 30+ microservices and thousands of frontend clients remain synchronized without overwhelming the configuration database, I architected a multi-layered sync and resilience strategy:

React SDK Global State

Engineered the frontend React SDK utilizing the native Context API and custom hooks, providing a seamless, globally accessible state management solution for evaluating feature flags across the UI.

Hybrid Sync Strategy

To ensure local caches remain up-to-date in real-time while a user is active, I implemented a hybrid approach:

  • Polling-First Baseline: Provides a fallback to guarantee eventual consistency.
  • Real-Time Push: Layered a push channel on top of polling, powered by NATS JetStream and the standalone WebSocket Gateway. Clients listen passively and apply state updates as soon as they are published.
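One way the two paths can feed the same cache is sketched below. The `ToggleCache` class is hypothetical, and the per-flag `version` field is an assumption introduced here so that a late polling response cannot overwrite a newer pushed value; the original SDK may reconcile updates differently.

```javascript
// Sketch: polling and push both write to one cache (hypothetical names).
class ToggleCache {
  constructor() {
    this.flags = new Map(); // key -> { enabled, version }
  }

  // Shared by both paths; an older version never overwrites a newer one.
  apply({ key, enabled, version }) {
    const current = this.flags.get(key);
    if (current && current.version >= version) return false;
    this.flags.set(key, { enabled, version });
    return true;
  }

  // Polling path: reconcile against a full snapshot from the API.
  applySnapshot(toggles) {
    toggles.forEach((toggle) => this.apply(toggle));
  }

  isEnabled(key) {
    const flag = this.flags.get(key);
    return flag ? flag.enabled : false;
  }
}
```

In this arrangement the push handler calls `apply` for each incoming message while the polling timer calls `applySnapshot`, so a dropped WebSocket only delays updates until the next poll.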

Resilience & Thundering Herd Prevention

  • gRPC Retries: Implemented custom gRPC retry mechanisms within the Go and Node.js SDKs to increase internal backend reliability.
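  • Jitter + Exponential Backoff: If all microservices restart simultaneously, their synchronized retries would hit the Admin API like a DDoS attack (the "Thundering Herd" problem). To disperse the load, each retry waits for an exponentially growing delay plus random jitter. The exponential term is capped before the jitter is added, so clients that have reached the cap still retry at different moments:
// Exponential backoff with jitter: spreads retries out over time
const exponential = Math.min(BASE_DELAY * Math.pow(2, retryCount), MAX_RETRY_DELAY);
// Jitter is added after the cap, so the delay can exceed MAX_RETRY_DELAY by up to 1 s
const delay = exponential + Math.random() * 1000;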
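For context, a delay calculation of this kind typically sits inside a retry loop like the sketch below. The constant values, the `retryDelay` and `retryWithBackoff` names, and the injectable `sleep` option are assumptions for illustration. This variant caps the exponential term before adding jitter, so that callers sitting at the maximum delay do not all retry at the same instant.

```javascript
// Assumed values for illustration; the real constants are not given.
const BASE_DELAY = 500;         // ms
const MAX_RETRY_DELAY = 30_000; // ms
const MAX_JITTER = 1000;        // ms

// Cap the exponential term first, then add jitter on top of it.
function retryDelay(retryCount, random = Math.random) {
  const exponential = Math.min(BASE_DELAY * 2 ** retryCount, MAX_RETRY_DELAY);
  return exponential + random() * MAX_JITTER;
}

function defaultSleep(ms) {
  return new Promise((resolve) => setTimeout(resolve, ms));
}

// Retry an async operation until it succeeds or the retries run out.
async function retryWithBackoff(operation, { maxRetries = 5, sleep = defaultSleep } = {}) {
  for (let retryCount = 0; ; retryCount++) {
    try {
      return await operation();
    } catch (error) {
      if (retryCount >= maxRetries) throw error; // give up, surface last error
      await sleep(retryDelay(retryCount));
    }
  }
}
```

Passing `sleep` as an option keeps the loop testable without real timers; production callers can omit it and get the `setTimeout`-based default.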

🌟 Engineering Retrospective (Infrastructure Evolution)

Building infrastructure components requires balancing immediate control with future scalability. Here are the key architectural reflections:

  1. Scaling Stateful Connections (The Multi-Instance Challenge): Using an ES6 Map on a single Express server was highly efficient for our initial scale. However, as horizontal scaling becomes necessary, connection states will become fragmented (users connected to different instances). To solve this in the next iteration, I evaluated transitioning to a distributed tracking model using Redis Pub/Sub to manage user_id to server instance subscriptions, or utilizing ZooKeeper with Consistent Hashing for deterministic routing.

  2. Raw WebSockets vs. Abstractions (Socket.io): We deliberately chose raw WebSockets over feature-rich libraries like Socket.io. While this required us to manually engineer complex retry logic and application-level heartbeats, it granted us absolute granular control over connection lifecycles, reduced library integration overhead for non-Node.js clients, and aligned better with our strict memory management goals.

  3. The Overhead of Proxy in Tracing: While Proxy and Reflect provided beautiful, non-intrusive Aspect-Oriented Programming (AOP) capabilities, they introduced a slight performance overhead. I learned that such dynamic interception should not be applied blindly across all layers. I optimized its usage by applying it selectively—primarily at the domain service layer and for complex, deeply nested new features—achieving maximum observability with minimal performance degradation.