Lecture 12: Implementation Constraints & Challenges

Learning Objective: Recognize the technical and operational realities of distributed systems. Learn how to design integrations that gracefully handle network failures, data inconsistencies, and latency using modern resiliency patterns.

1. The Fallacies of Distributed Computing

When transitioning from a single monolithic application to an integrated, distributed system (via APIs, ESB, or iPaaS), developers often make fatal assumptions. In 1994, L. Peter Deutsch and colleagues at Sun Microsystems formulated what are now known as the "8 Fallacies of Distributed Computing" (Deutsch listed the first seven; James Gosling later added the eighth). Believing these fallacies leads to fragile integrations:

  1. The network is reliable.
  2. Latency is zero.
  3. Bandwidth is infinite.
  4. The network is secure.
  5. Topology doesn't change.
  6. There is one administrator.
  7. Transport cost is zero.
  8. The network is homogeneous.

Because the network is not reliable, our integration architectures must be designed for failure.
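Designing for failure starts with never assuming a single call will succeed. One common defensive idiom is retrying with exponential backoff and jitter. The helper below is an illustrative sketch, not any specific library's API:

```python
import random
import time

def call_with_retries(operation, max_attempts=4, base_delay=0.1):
    """Retry a flaky network call with exponential backoff and jitter.

    `operation` is any zero-argument callable that raises on failure
    (here modeled with ConnectionError).
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except ConnectionError:
            if attempt == max_attempts:
                raise  # give up after the final attempt
            # Back off exponentially (0.1s, 0.2s, 0.4s, ...) plus random
            # jitter so many clients do not retry in lockstep.
            time.sleep(base_delay * 2 ** (attempt - 1) + random.uniform(0, 0.05))
```

Note that retries alone can make an overload worse; that is exactly the problem the Circuit Breaker Pattern in Section 3 addresses.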

2. Data Consistency: ACID vs. BASE

In a monolithic system, data is stored in a single database. We rely on ACID transactions (Atomicity, Consistency, Isolation, Durability). If a bank transfer fails halfway, the database rolls everything back atomically, as if the transfer had never started.

In a distributed integration, Service A (Order) and Service B (Billing) have separate databases. We cannot use ACID across the internet. Instead, we rely on BASE (Basically Available, Soft state, Eventual consistency). The system might be temporarily inconsistent (e.g., the order is placed, but billing hasn't deducted the funds yet), but it will *eventually* become consistent.

To handle distributed transactions safely without ACID, architects use the Saga Pattern (a sequence of local transactions where a failure triggers compensating transactions to undo the previous steps).
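The core of the Saga Pattern fits in a short sketch: run each local transaction in order, remember its compensation, and unwind completed steps in reverse on failure. This is an illustrative toy under simplified assumptions (synchronous steps, no persisted saga state):

```python
def run_saga(steps):
    """Run a saga: `steps` is a list of (action, compensation) pairs
    of zero-argument callables.

    Each action is a local transaction. If one fails, the already
    completed steps are undone by running their compensating
    transactions in reverse order, restoring consistency.
    """
    completed = []
    for action, compensate in steps:
        try:
            action()
            completed.append(compensate)
        except Exception:
            for undo in reversed(completed):
                undo()  # compensating transaction for a finished step
            raise
```

For an order/billing saga, the pairs might be (create order, cancel order) and (charge card, refund charge): if charging fails, the order is cancelled, and the system ends consistent even though it was briefly not.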

3. Handling System Failures: The Circuit Breaker Pattern

If Service A calls Service B, and Service B is completely down, Service A will get a fast "Connection Refused" error. That is easy to handle.

However, what if Service B is not down, but is experiencing a CPU spike and taking 30 seconds to respond? Service A will wait. If 1,000 users make requests, 1,000 threads in Service A will be stuck waiting for Service B. Soon, Service A runs out of memory and crashes too. This is called a Cascading Failure.

To prevent this, we use the Circuit Breaker Pattern.

[State machine: CLOSED (requests flow normally) → OPEN (requests blocked, fast fail) when the failure rate exceeds 50%; OPEN → HALF-OPEN (a few test requests allowed) when the timeout expires; HALF-OPEN → CLOSED when the test requests succeed; HALF-OPEN → OPEN when they fail.]

Figure 1: The Circuit Breaker State Machine

4. How the Circuit Breaker Works

Modeled after an electrical circuit breaker in your house, this software pattern prevents catastrophic meltdowns:

  1. CLOSED (Normal): Everything is working fine. Service A sends requests to Service B.
  2. OPEN (Broken): If Service B fails or times out too many times (e.g., 50% of requests fail within 10 seconds), the circuit "trips" and opens. Service A immediately stops sending requests; any new request fails instantly with a "circuit open" exception (in Resilience4j, a CallNotPermittedException). This gives Service B time to recover instead of hammering it with traffic.
  3. HALF-OPEN (Testing): After a timeout (e.g., 30 seconds), the circuit allows a few "test" requests through. If they succeed, it closes the circuit (normal). If they fail, it trips wide open again.
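The three states above can be sketched in a few dozen lines. The class below is an illustrative toy, not Resilience4j's implementation: for brevity it trips on a count of consecutive failures rather than a sliding-window failure rate, and the names and thresholds are assumptions.

```python
import time

class CircuitBreakerOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""

class CircuitBreaker:
    """Toy circuit breaker: trips OPEN after `max_failures` consecutive
    failures, waits `reset_timeout` seconds, then lets a HALF-OPEN probe
    through; a successful probe closes the circuit again."""

    def __init__(self, max_failures=5, reset_timeout=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_timeout = reset_timeout
        self.clock = clock          # injectable clock for testing
        self.failures = 0
        self.state = "CLOSED"
        self.opened_at = None

    def call(self, operation):
        if self.state == "OPEN":
            if self.clock() - self.opened_at >= self.reset_timeout:
                self.state = "HALF_OPEN"   # allow one probe request through
            else:
                raise CircuitBreakerOpenError("fast fail: circuit is open")
        try:
            result = operation()
        except Exception:
            self.failures += 1
            if self.state == "HALF_OPEN" or self.failures >= self.max_failures:
                self.state = "OPEN"        # trip (or re-trip) the breaker
                self.opened_at = self.clock()
            raise
        self.failures = 0
        self.state = "CLOSED"              # probe (or normal call) succeeded
        return result
```

The key property is the fast fail: while OPEN, `call` raises immediately without touching the remote service, so Service A's threads are never stuck waiting.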

Code Example: Circuit Breaker Configuration (YAML)

With modern libraries such as Resilience4j, developers do not code this logic from scratch; they declare the thresholds in configuration and let the library wrap their API calls.

# Resilience4j Circuit Breaker Configuration Example
resilience4j.circuitbreaker:
  instances:
    billingServiceBackend:
      registerHealthIndicator: true
      slidingWindowSize: 100               # Evaluate the last 100 calls
      failureRateThreshold: 50             # Trip if 50% of calls fail
      slowCallRateThreshold: 50            # Trip if 50% of calls are too slow
      slowCallDurationThreshold: 2000ms    # "Slow" means taking longer than 2 seconds
      permittedNumberOfCallsInHalfOpenState: 10  # Let 10 requests through to test recovery
      waitDurationInOpenState: 30000ms     # Wait 30 seconds before testing (Half-Open)

Discussion Prompt for Students: Imagine an e-commerce checkout process. The user submits their cart, but the integration with the external Payment Gateway times out. The Circuit Breaker trips to "OPEN". What should the application do? Should it show an error page? Should it save the order as "Pending" and try again later? Discuss the concept of a "Fallback Method" and how business requirements dictate technical error handling.
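One common answer is a fallback that records the order as "Pending" instead of failing the checkout. A minimal sketch, assuming hypothetical `charge` and `save_pending` callables (stand-ins for a payment-gateway client and an internal order store, not any specific library's API):

```python
def checkout_with_fallback(charge, save_pending, order):
    """Attempt payment; on failure (e.g., an open circuit breaker or a
    timeout), degrade gracefully instead of showing the user a raw error.

    `charge` and `save_pending` are hypothetical callables supplied by
    the caller.
    """
    try:
        charge(order)
        return "CONFIRMED"
    except Exception:
        # Fallback: accept the order as "Pending" and retry billing later
        # (e.g., via a background job). Whether this is acceptable is a
        # business decision, not a purely technical one.
        save_pending(order)
        return "PENDING"
```

The fallback keeps the user experience intact while the circuit breaker gives the payment gateway time to recover.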