How PhonePe and Paytm Avoid Lost Money on System Crashes

How UPI apps like PhonePe/Paytm keep payments safe using idempotency, sagas, and state machines.

Kodetra Technologies
5 min read
Mar 3, 2026

Why “Money Never Gets Lost” Is a System Design Problem

When you hit “Pay” on PhonePe, Paytm, or any UPI app, you’re orchestrating a distributed transaction across at least four independent systems: the app’s backend (PSP), NPCI’s UPI switch, and two banks (payer and payee).

Any of these can crash, time out, or glitch mid-flow. Yet we almost never see money permanently stuck in limbo. That guarantee is not luck; it’s the result of explicit design: durable state machines, idempotent requests, and compensating actions (sagas), glued together with strong observability.

In this post, we’ll walk through how modern UPI apps ensure that “no money is lost” even when everything fails at the worst possible moment.


The UPI Ecosystem and Failure Surface

Before diving into reliability patterns, it helps to understand who is involved when you pay via UPI.

At a high level, a UPI payment involves:

  • TPAP/PSP: The PhonePe/Paytm backend that accepts your intent and talks to NPCI.
  • NPCI Switch: The central switch that validates metadata and routes requests between banks.
  • Issuer Bank: Your bank (the remitter bank in UPI terminology), which debits your account.
  • Acquirer Bank: The receiver’s bank (the beneficiary bank in UPI terminology), which credits their account.

Failures can occur at many points: your app crashes, your PSP times out on a bank API, the NPCI switch has a momentary issue, the issuer or acquirer bank is slow or down, or the network drops packets.

The trick is to treat every payment as a first-class entity with an explicit lifecycle and to make every external call safe to repeat.


Transaction State Machine: Payments as First-Class Entities

The most fundamental building block is a transaction state machine in the PSP backend.

A simplified state diagram looks like this:

  • INITIATED – user tapped pay, transaction record created.
  • PROCESSING – PSP has sent request to NPCI/issuer, waiting for response.
  • DEBITED – issuer bank has successfully debited payer account.
  • CREDITED – acquirer bank has successfully credited beneficiary account.
  • SUCCESS – final success, both sides consistent.
  • FAILED – terminal failure (e.g., insufficient funds, hard error from bank).
  • REFUND_PENDING / REFUNDED – debit succeeded but credit failed, compensating flow executed.

Each state is persisted durably in the PSP’s database (with timestamps and metadata). If the PSP crashes or restarts, it reloads the persisted state and continues the saga from there instead of starting from scratch.

Example:

  • Crash after DEBITED but before SUCCESS: on recovery, the system sees “DEBITED, no credit event” and either retries credit or triggers a refund saga, depending on NPCI/bank status.

This state machine is why “money cannot be lost”: after any crash, the system can recover the last durably recorded state and decide what to do next from there.
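
The lifecycle above can be written down as an explicit transition table. The following is a minimal sketch, not how any particular PSP implements it: the set of allowed transitions and the `recoveryAction` mapping are assumptions inferred from the state list in this section.

```typescript
// Sketch only: the allowed transitions are an assumption inferred from the
// state list above, not a published PSP or NPCI specification.
type TxState =
  | "INITIATED" | "PROCESSING" | "DEBITED" | "CREDITED"
  | "SUCCESS" | "FAILED" | "REFUND_PENDING" | "REFUNDED";

const ALLOWED: Record<TxState, readonly TxState[]> = {
  INITIATED: ["PROCESSING", "FAILED"],
  PROCESSING: ["DEBITED", "FAILED"],
  DEBITED: ["CREDITED", "REFUND_PENDING"],
  CREDITED: ["SUCCESS"],
  REFUND_PENDING: ["REFUNDED"],
  SUCCESS: [],   // terminal
  FAILED: [],    // terminal
  REFUNDED: [],  // terminal
};

function canTransition(from: TxState, to: TxState): boolean {
  return ALLOWED[from].indexOf(to) !== -1;
}

// Returns the new state, or throws if the move is not in the table.
// Re-applying the current state is a no-op, so replayed events are harmless.
function transition(from: TxState, to: TxState): TxState {
  if (from === to) return from;
  if (!canTransition(from, to)) {
    throw new Error("Illegal transition " + from + " -> " + to);
  }
  return to;
}

type RecoveryAction =
  | "none" | "query_status" | "credit_or_refund" | "finalize" | "retry_refund";

// What a recovery worker would do for a row found in `state` after a restart.
function recoveryAction(state: TxState): RecoveryAction {
  switch (state) {
    case "PROCESSING": return "query_status";      // outcome of the debit is unknown
    case "DEBITED": return "credit_or_refund";     // the crash case described above
    case "CREDITED": return "finalize";            // only the SUCCESS write is missing
    case "REFUND_PENDING": return "retry_refund";
    case "INITIATED":                              // nothing was sent to the network yet
    case "SUCCESS":
    case "FAILED":
    case "REFUNDED":
      return "none";
  }
}
```

Keeping the table as data means an illegal jump (for example FAILED to SUCCESS) fails loudly instead of silently corrupting the record.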


Idempotency and Unique Reference IDs

The second pillar is idempotency: every payment operation can be retried without double-debiting or double-crediting.

UPI introduces a UTR (Unique Transaction Reference) at the network/bank layer, while PSPs have their internal transaction IDs and sometimes a global idempotency key per user action.

Patterns used here:

  • PSP generates a globally unique transaction_id per payment intent and persists it with state INITIATED.
  • All calls to NPCI and banks carry that same transaction_id/UTR.
  • If the PSP doesn’t receive a response (timeout, crash, network issue), it can safely retry the same request with the same IDs.
  • Issuer/acquirer banks treat repeated requests with the same UTR as idempotent: either they return the original result or indicate that the transaction is already processed.

This ensures that multiple attempts to “finish” a payment converge to one actual debit–credit pair.
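
To make the bank-side behaviour concrete, here is a small in-memory sketch of an idempotent debit. `IssuerLedger`, its method names, and the account handle are hypothetical; a real bank would keep the reference lookup and the balance update inside a single database transaction with a unique constraint on the reference.

```typescript
// Hypothetical in-memory ledger showing an idempotent debit keyed by reference.
interface DebitResult {
  status: "DEBITED" | "REJECTED";
  balanceAfter: number; // integer paise
}

class IssuerLedger {
  private balances: Map<string, number>;
  // First result recorded for each transaction reference (UTR / transaction_id).
  private results = new Map<string, DebitResult>();

  constructor(openingBalances: Array<[string, number]>) {
    this.balances = new Map(openingBalances);
  }

  balanceOf(account: string): number {
    return this.balances.get(account) ?? 0;
  }

  debit(reference: string, account: string, amountPaise: number): DebitResult {
    // A repeated reference returns the original result without touching the balance.
    const previous = this.results.get(reference);
    if (previous !== undefined) return previous;

    const balance = this.balanceOf(account);
    const result: DebitResult = balance >= amountPaise
      ? { status: "DEBITED", balanceAfter: balance - amountPaise }
      : { status: "REJECTED", balanceAfter: balance };

    this.balances.set(account, result.balanceAfter);
    this.results.set(reference, result);
    return result;
  }
}

// The PSP times out and retries the same request with the same reference.
const ledger = new IssuerLedger([["payer@bank", 10000]]);
const first = ledger.debit("ref-001", "payer@bank", 2500);
const retry = ledger.debit("ref-001", "payer@bank", 2500);
```

However many times the retry runs, the payer is debited once, which is the convergence property described above.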


Sagas, Not 2PC: Compensating Actions at Scale

You might be tempted to think of distributed transactions with two-phase commit (2PC): lock both accounts, prepare, then commit. UPI does not use 2PC because holding locks across independently operated banks does not scale to the transaction volumes UPI handles, and a single slow participant would block everyone else.

Instead, UPI behaves like a saga: a sequence of local transactions with compensating actions.

Simplified saga steps:

  1. Local transaction T1 at issuer bank: debit payer account.
  2. Local transaction T2 at acquirer bank: credit receiver account.
  3. If T2 fails irrecoverably after T1 succeeded, run compensating transaction C1: refund payer (reverse the debit).

From the PSP’s perspective:

  • It drives the saga through state transitions (DEBITED → CREDITED → SUCCESS).
  • It subscribes to callbacks or polls NPCI/banks to know which step completed.
  • If the saga ends in a partial state (e.g., debit-only), it schedules compensating flows until both ledgers are consistent again.

This models “eventual atomicity” across independently owned systems without global locks.
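
The T1/T2/C1 steps above can be sketched as a tiny saga runner. This is an illustrative, synchronous, in-memory version: `runSaga`, `SagaStep`, and `simulateTransfer` are made-up names, and in a real system each step would be an asynchronous network call whose progress is persisted between steps.

```typescript
// Hypothetical saga runner: run steps in order, compensate completed ones on failure.
interface SagaStep {
  name: string;
  run: () => boolean;        // true on success, false on irrecoverable failure
  compensate?: () => void;   // undoes the effect of a successful run
}

interface SagaOutcome {
  ok: boolean;
  log: string[];
}

function runSaga(steps: SagaStep[]): SagaOutcome {
  const completed: SagaStep[] = [];
  const log: string[] = [];

  for (const step of steps) {
    if (step.run()) {
      completed.push(step);
      log.push("done:" + step.name);
      continue;
    }
    log.push("failed:" + step.name);
    // Undo already-completed steps in reverse order.
    for (const prior of completed.slice().reverse()) {
      if (prior.compensate) {
        prior.compensate();
        log.push("compensated:" + prior.name);
      }
    }
    return { ok: false, log };
  }
  return { ok: true, log };
}

// T1 debits the payer, T2 credits the payee, C1 (T1's compensate) refunds the payer.
function simulateTransfer(amount: number, creditSucceeds: boolean) {
  const balances = { payer: 1000, payee: 0 };
  const outcome = runSaga([
    {
      name: "T1-debit",
      run: () => { balances.payer -= amount; return true; },
      compensate: () => { balances.payer += amount; },
    },
    {
      name: "T2-credit",
      run: () => {
        if (!creditSucceeds) return false;
        balances.payee += amount;
        return true;
      },
    },
  ]);
  return { balances, outcome };
}
```

Calling `simulateTransfer(300, false)` leaves both balances where they started, which is the "eventual atomicity" the section describes: either both legs happen or the debit is reversed.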


End-to-End Flow with Crash Scenarios

Let’s walk a realistic flow and inject failures.

Baseline UPI Payment Flow

  1. User taps “Pay” in PhonePe/Paytm.
  2. PSP backend creates transaction row with state INITIATED and generates transaction_id.
  3. After authentication, PSP sends a “debit payer, credit payee” request to NPCI with transaction_id.
  4. NPCI routes to issuer, which debits payer and returns success with a UTR.
  5. NPCI routes to acquirer, which credits receiver and returns success.
  6. PSP receives final status and updates state to SUCCESS, then notifies user/merchant.

Failure 1: PSP Crashes After Sending to Issuer

  • PSP sent the debit request, but before processing issuer’s response, the server crashes.
  • Issuer may still have debited the account; user’s bank balance is reduced.

On restart:

  • A background reconciler scans transactions stuck in PROCESSING for longer than a threshold.
  • It queries NPCI/issuer using the same transaction_id/UTR: “What’s the final status?”
  • If issuer reports debit succeeded but no final network status, PSP transitions to DEBITED and continues saga (credit or refund).
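
One pass of such a reconciler might look like the sketch below. The status names returned by the status check (`DEBIT_ONLY`, `UNKNOWN`, and so on), the `TxRow` shape, and the `checkStatus` callback are assumptions made for illustration; real NPCI and bank status APIs have their own response codes.

```typescript
// Hypothetical single reconciliation pass over PSP transaction rows.
type PspState = "PROCESSING" | "DEBITED" | "SUCCESS" | "FAILED";
type ReportedStatus = "DEBIT_ONLY" | "SUCCESS" | "FAILED" | "UNKNOWN";

interface TxRow {
  id: string;
  state: PspState;
  updatedAtMs: number; // epoch milliseconds of the last state change
}

// Map what the network reports to the PSP state; null means "no answer yet".
function resolvedState(reported: ReportedStatus): PspState | null {
  switch (reported) {
    case "DEBIT_ONLY": return "DEBITED"; // debit happened, saga must continue
    case "SUCCESS": return "SUCCESS";
    case "FAILED": return "FAILED";
    case "UNKNOWN": return null;
  }
}

function reconcileOnce(
  rows: TxRow[],
  nowMs: number,
  stuckAfterMs: number,
  checkStatus: (txId: string) => ReportedStatus,
): TxRow[] {
  return rows.map((row) => {
    const stuck =
      row.state === "PROCESSING" && nowMs - row.updatedAtMs > stuckAfterMs;
    if (!stuck) return row; // fresh or already resolved: do not query

    const next = resolvedState(checkStatus(row.id));
    // Leave the row untouched when the network has no answer; a later pass retries.
    return next === null ? row : { id: row.id, state: next, updatedAtMs: nowMs };
  });
}
```

A row that comes back as DEBITED is then picked up by the saga, which either completes the credit or schedules the refund.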

Failure 2: App Crashes on User’s Phone

  • The mobile app is a thin client; it doesn’t own the state machine.
  • If the app crashes, the PSP and banks continue the saga independently.
  • When the user reopens the app, it fetches latest transaction states from PSP; UI shows SUCCESS, FAILED, or “pending – will auto-reconcile”.

This is why a crash on your phone rarely correlates with “lost” money; your device is not the source of truth.


Recoverability: Polling, Callbacks, and Reconciliation

To guard against intermediate network and system failures, PSPs implement multiple layers of recovery:

  • Callbacks (push): banks/NPCI call the PSP’s callback URL when a transaction reaches a final state.
  • Status polling (pull): the PSP periodically calls “check-status” APIs on NPCI/issuer/acquirer for transactions stuck in PROCESSING/DEBITED.
  • Reconciliation jobs: batch jobs compare PSP’s ledger vs bank/NPCI reports to catch any out-of-sync transactions and fix them via manual or automated compensations.

This layered approach means even if one notification path fails (e.g., callback lost due to network blip), others eventually converge the state.
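
Because callbacks and polls can deliver the same status twice or out of order, the PSP needs an update rule that converges regardless of arrival order. The sketch below shows one possible rule plus a capped exponential polling schedule; the status set, the rank values, and the function names are assumptions for illustration only.

```typescript
// Hypothetical merge rule for status updates arriving via callback or polling.
type Status = "PROCESSING" | "DEBITED" | "SUCCESS" | "FAILED" | "REFUNDED";

// Higher rank means further along the lifecycle; rank 2 is terminal.
const RANK: Record<Status, number> = {
  PROCESSING: 0,
  DEBITED: 1,
  SUCCESS: 2,
  FAILED: 2,
  REFUNDED: 2,
};

function isTerminal(status: Status): boolean {
  return RANK[status] === 2;
}

// Duplicates and stale updates are ignored; terminal states never change here.
// A conflicting terminal update (e.g. FAILED after SUCCESS) should be flagged
// for the reconciliation job rather than applied.
function applyUpdate(current: Status, incoming: Status): Status {
  if (isTerminal(current)) return current;
  return RANK[incoming] > RANK[current] ? incoming : current;
}

// Capped exponential backoff for status polling: base, 2x, 4x, ... up to capMs.
function pollDelaysMs(baseMs: number, attempts: number, capMs: number): number[] {
  const delays: number[] = [];
  for (let i = 0; i < attempts; i++) {
    delays.push(Math.min(baseMs * Math.pow(2, i), capMs));
  }
  return delays;
}
```

With this rule, a late "PROCESSING" poll result cannot overwrite a "SUCCESS" that already arrived by callback.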


UX Layer: Making Failure Non-Scary

The technical guarantees only matter if users don’t panic when something hangs. Modern PSPs invest a lot in communicating “we’ve got this”:

  • Real-time banners when a bank or UPI rail is unhealthy and payments may be delayed.
  • Clear statuses like “Processing, don’t retry,” “Failed, no money debited,” or “Debited, will be auto-refunded by X time.”
  • Automatic reattempts on alternate rails or processors when possible.

This is less about core consistency and more about protecting the user from double-paying or unnecessary anxiety, but it relies on the same robust backend state.
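
As a rough illustration of how the UI can be derived purely from backend state, the sketch below maps each state to a status line and decides whether a new attempt should be offered. The wording for the success and refunded cases and the `canStartNewAttempt` rule are assumptions, not copy taken from any real app.

```typescript
// Hypothetical mapping from backend state to what the app shows.
type UiState =
  | "INITIATED" | "PROCESSING" | "DEBITED" | "CREDITED"
  | "SUCCESS" | "FAILED" | "REFUND_PENDING" | "REFUNDED";

function userMessage(state: UiState, refundBy: string): string {
  switch (state) {
    case "INITIATED":
    case "PROCESSING":
    case "DEBITED":
    case "CREDITED":
      return "Processing, don't retry";
    case "SUCCESS":
      return "Payment successful";
    case "FAILED":
      return "Failed, no money debited";
    case "REFUND_PENDING":
      return "Debited, will be auto-refunded by " + refundBy;
    case "REFUNDED":
      return "Refunded to your account";
  }
}

// Only offer "Pay again" once the previous attempt can no longer move money,
// which is what protects the user from double-paying.
function canStartNewAttempt(state: UiState): boolean {
  return state === "FAILED" || state === "REFUNDED";
}
```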


A Minimal Backend Pattern (Pseudo-Example)

Here’s a sketch of how you might design a simplified “UPI-like” transaction service in, say, a Node.js/NestJS backend:

enum TxState {
  INITIATED = 'INITIATED',
  PROCESSING = 'PROCESSING',
  DEBITED = 'DEBITED',
  CREDITED = 'CREDITED',
  SUCCESS = 'SUCCESS',
  FAILED = 'FAILED',
  REFUND_PENDING = 'REFUND_PENDING',
  REFUNDED = 'REFUNDED',
}

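// Shape of the client's request; `key` is the client-supplied idempotency key.
interface PaymentIntent {
  key: string;
  payer: string;
  payee: string;
  amount: number;
}

interface Transaction {
  id: string;             // PSP transaction_id
  idempotencyKey: string; // one key per user action
  utr?: string;           // network/bank reference
  payerAccount: string;
  payeeAccount: string;
  amount: number;         // integer minor units (paise), not fractional rupees
  state: TxState;
  createdAt: Date;
  updatedAt: Date;
}

// Idempotent create-or-return. The key column should also have a unique
// constraint so two concurrent requests cannot both insert a row.
async function createTransaction(intent: PaymentIntent): Promise<Transaction> {
  const existing = await txRepo.findByClientIdempotencyKey(intent.key);
  if (existing) return existing;

  const now = new Date();
  return txRepo.insert({
    id: uuid(),
    idempotencyKey: intent.key, // persisted so the lookup above can find it
    payerAccount: intent.payer,
    payeeAccount: intent.payee,
    amount: intent.amount,
    state: TxState.INITIATED,
    createdAt: now,
    updatedAt: now,
  });
}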

async function startPayment(tx: Transaction) {
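  // Idempotent guard. A real system would do this as an atomic compare-and-set
  // in the database, since `tx` may be a stale in-memory copy.
  if (tx.state !== TxState.INITIATED) return;

  // Persist PROCESSING before calling out: if the debit call times out or the
  // process crashes, the reconciler finds this row and queries the final status.
  await txRepo.updateState(tx.id, TxState.PROCESSING);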

  const debitResult = await issuerApi.debit({
    txId: tx.id,
    amount: tx.amount,
  });

  if (!debitResult.success) {
    await txRepo.updateState(tx.id, TxState.FAILED);
    return;
  }

  await txRepo.updateState(tx.id, TxState.DEBITED);

  const creditResult = await acquirerApi.credit({
    txId: tx.id,
    amount: tx.amount,
  });

  if (creditResult.success) {
    await txRepo.updateState(tx.id, TxState.CREDITED);
    await txRepo.updateState(tx.id, TxState.SUCCESS);
  } else {
    await txRepo.updateState(tx.id, TxState.REFUND_PENDING);
    await scheduleRefund(tx.id); // awaited so a scheduling failure is not silently dropped
  }
}

async function scheduleRefund(txId: string) {
  // background worker picks this up
}

This is vastly simplified compared to real UPI rails, but it illustrates:

  • Explicit states (no “magic booleans”).
  • Idempotent operations keyed by transaction_id.
  • Saga-like compensation via refund.

Key Patterns in One Table

| Concern | Pattern Used | How It Prevents Lost Money |
| --- | --- | --- |
| Crash recovery | Durable state machine in DB | Restart resumes from last safe state instead of re-running blindly. |
| Double debit | Idempotent APIs + UTR/IDs | Retries return same result; no duplicate debits/credits. |
| Partial success | Saga with compensating refund | Debit-only states are eventually refunded or completed. |
| Network issues | Polling + callbacks + reconciliation | Multiple paths converge PSP state with bank reality. |
| User anxiety | Clear UX + status messaging | Users don’t re-pay or assume money is lost. |

Conclusion: Design for “Money Safety” First, Everything Else Second

UPI apps like PhonePe and Paytm aren’t magical; they are carefully engineered distributed systems where safety of funds is the primary invariant. That invariant is enforced by:

  • Explicit, durable transaction state machines.
  • Idempotent APIs keyed by unique transaction IDs and UTRs.
  • Saga-style orchestration with compensating refunds.
  • Multi-channel reconciliation loops between PSP, NPCI, and banks.

If you’re designing any payment or wallet system—whether it’s UPI, card, or in-app tokens—start by modeling your transaction lifecycle and failure modes as rigorously as these apps do. Then add retries, idempotency, and compensations around that model. Once you guarantee “no money is ever lost,” everything else (latency, UX, rewards, growth loops) becomes much easier to evolve without breaking user trust.

Kodetra Technologies


Kodetra Technologies is a software development company that specializes in creating custom software solutions, mobile apps, and websites that help businesses achieve their goals.
