System Design Interview

By PAUL MASIBO WABWAYI NGOME WORLD STREET NEWSTIME Saturday, March 30, 2024

Alex Xu published a highly acclaimed book titled "System Design Interview: An Insider's Guide: Volume 2." It has quickly become the best-selling computer science book on Amazon in the United States, and it's well-deserved.

This book presents 13 comprehensive and original systems design challenges that are not found elsewhere. These challenges cover a wide range of topics, including developing a proximity service, implementing a nearby friends feature, designing a Google Maps-like system, building a distributed messaging queue, creating a metrics monitoring and alerting system, constructing a hotel reservation system, designing a real-time gaming leaderboard, developing a digital wallet, building a stock exchange, and many more intriguing topics.

Each chapter of the book follows a systematic four-step process:

Understand the problem and establish the design scope: The author thoroughly explains the problem statement and sets the boundaries for the design.
Propose high-level design and get buy-in: The book guides readers in proposing a high-level design solution and obtaining approval or consensus from stakeholders.
Design deep-dive: This step delves into the nitty-gritty details of the design, exploring various components, algorithms, and trade-offs.
Wrap-up: The final step concludes the chapter, summarizing the key points and insights gained from the design exercise.

This structured approach ensures a comprehensive understanding of the systems design process and equips readers with the necessary skills to excel in systems design interviews. The book's popularity and its unique set of challenges make it a valuable resource for computer science professionals and anyone preparing for systems design interviews.

I have read the book, focusing specifically on the Payment System chapter. As someone who has worked on payment systems for several years at a startup company, I found this chapter particularly relevant. Building payment systems requires a good understanding of the payment service, payment executor, PSP (Payment Service Provider), ledger, and wallet.

While it's impossible for a book to cover every real-life scenario encountered when building a payment system, this chapter balances conciseness with sufficient depth. In my experience, I had to learn most of what Alex covers in the book through hands-on work on a payments system, seeking advice from others, and learning from trial and error.

This article focuses on the following topics covered in the original chapter.

Step 1: understand the problem

Functional requirements
Non-functional requirements
Back-of-the-envelope estimations

Step 2: high-level design

Payment service
Payment executor
Payment service provider
Card schemes
Ledger
Wallet
Double-entry ledger system

Step 3: design deep dive

PSP integration
Reconciliation
Handling processing delays
Handling failed payments
Exactly-once delivery
Consistency
Payment Security

Step 4: wrap up

If you find this chapter helpful, get System Design Interview: Volume 2.

Payment System

In this section, we focus on the design of a payment system, considering the tremendous growth of e-commerce worldwide. Behind every successful transaction lies a robust, scalable, and adaptable payment system. It plays a crucial role in securely and efficiently exchanging monetary value.

So, what exactly is a payment system? Wikipedia says it is "any system utilized to settle financial transactions by transferring monetary value. This encompasses the institutions, instruments, individuals, rules, procedures, standards, and technologies that enable such exchanges" [1].

While the concept of a payment system may seem straightforward, it can be daunting for many developers. Even a minor error could lead to substantial financial losses and harm users' trust. However, there's no need to be alarmed! This section aims to demystify payment systems and clarify their inner workings.

Step 1 - Understand the Problem and Establish the Design Scope

A payment system can mean very different things to different people. Some may think it’s a digital wallet like Apple Pay or Google Pay. Others may think it’s a backend system that handles payments such as PayPal or Stripe. It is very important to determine the exact requirements at the beginning of the interview. These are some questions you can ask the interviewer:

Candidate: What kind of payment system are we building?

Interviewer: Assume you are building a payment backend for an e-commerce application like Amazon.com. When a customer places an order on Amazon.com, the payment system handles everything related to money movement.

Candidate: What payment options are supported? Credit cards, PayPal, bank cards, etc?

Interviewer: The payment system should support all of these options in real life. However, in this interview, we can use credit card payment as an example.

Candidate: Do we handle credit card payment processing ourselves?

Interviewer: No, we use third-party payment processors, such as Stripe, Braintree, Square, etc.

Candidate: Do we store credit card data in our system?

Interviewer: Due to extremely high security and compliance requirements, we do not store card numbers directly in our system. We rely on third-party payment processors to handle sensitive credit card data.

Candidate: Is the application global? Do we need to support different currencies and international payments?

Interviewer: Great question. Yes, the application would be global but we assume only one currency is used in this interview.

Candidate: How many payment transactions per day?

Interviewer: 1 million transactions per day.

Candidate: Do we need to support the pay-out flow, which an e-commerce site like Amazon uses to pay sellers every month?

Interviewer: Yes, we need to support that.

Candidate: I think I have gathered all the requirements. Is there anything else I should pay attention to?

Interviewer: Yes. A payment system interacts with a lot of internal services (accounting, analytics, etc.) and external services (payment service providers). When a service fails, we may see inconsistent states among services. Therefore, we need to perform reconciliation and fix any inconsistencies. This is also a requirement.

With these questions, we get a clear picture of both the functional and non-functional requirements. In this interview, we focus on designing a payment system that supports the following.

Functional requirements

Pay-in flow: payment system receives money from customers on behalf of sellers.
Pay-out flow: payment system sends money to sellers around the world.

Non-functional requirements

Reliability and fault tolerance. Failed payments need to be carefully handled.
A reconciliation process between internal services (payment systems, accounting systems) and external services (payment service providers) is required. The process asynchronously verifies that the payment information across these systems is consistent.

Back-of-the-envelope estimation

The system needs to process 1 million transactions per day. A day has 86,400 seconds, which we round up to 10^5, so the load is about 1,000,000 transactions / 10^5 seconds = 10 transactions per second (TPS). 10 TPS is not a big number for a typical database, which means the focus of this system design interview is on how to correctly handle payment transactions, rather than aiming for high throughput.
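The estimate above can be checked with a few lines of Python. This is only a sketch of the arithmetic; the function and constant names are illustrative and do not come from the book.

```python
SECONDS_PER_DAY = 24 * 60 * 60        # 86,400 seconds in a day
ROUNDED_SECONDS_PER_DAY = 10 ** 5     # the rounding used in the estimate


def transactions_per_second(per_day: int, seconds_per_day: int) -> float:
    """Average load, assuming traffic is spread evenly over the day."""
    return per_day / seconds_per_day


exact_tps = transactions_per_second(1_000_000, SECONDS_PER_DAY)
rounded_tps = transactions_per_second(1_000_000, ROUNDED_SECONDS_PER_DAY)
print(f"exact: {exact_tps:.1f} TPS, rounded: {rounded_tps:.0f} TPS")
```

Either way the average load is on the order of 10 TPS, so the rounding does not change the conclusion.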

Step 2 - Propose High-Level Design and Get Buy-In

To outline the payment flow, we can divide it into two main steps that align with how money moves:

Pay-in Flow: This step involves the process of money being received. Let's consider the example of an e-commerce site like Amazon. When a buyer places an order, the money flows into Amazon's bank account. However, it's important to note that Amazon doesn't fully own this money. A significant portion belongs to the seller. In this scenario, Amazon acts as the custodian of the money, charging a fee for its services.
Pay-out Flow: Once the products are delivered and the transaction is completed, the remaining balance after deducting fees is transferred from Amazon's bank account to the seller's bank account. This step represents the movement of money from the platform to the rightful owner, the seller.

Figure 1 provides a simplified visualization of the pay-in and pay-out flows.

Pay-in flow

The high-level design diagram for the pay-in flow is shown in Figure 2. Let’s take a look at each component of the system.

Payment service

The payment service plays a crucial role in accepting payment events from users and managing the payment process. One of the initial steps performed by the payment service is conducting a risk check to ensure compliance with regulations like Anti-Money Laundering and Combating the Financing of Terrorism (AML/CFT). The purpose of this risk check is to identify any signs of criminal activity, such as money laundering or terrorism financing. Payments that pass this risk check are further processed by the payment service.

Typically, the risk check service relies on a third-party provider for its implementation. This is because performing a comprehensive risk check involves intricate processes and specialized knowledge. By leveraging the expertise of a third-party provider, the risk check service can effectively assess the compliance and security aspects of payments.

Payment executor

The payment executor is responsible for carrying out individual payment orders through a Payment Service Provider (PSP). Each payment event can encompass multiple payment orders, and it is the role of the payment executor to execute them accordingly. By utilizing a Payment Service Provider, the payment executor ensures secure and reliable processing of the payment orders within the given payment event.

Payment Service Provider (PSP)

A PSP moves money from account A to account B. In this simplified example, the PSP moves the money out of the buyer’s credit card account.

Card schemes

Card schemes are the organizations that process credit card operations. Well-known card schemes are Visa, MasterCard, Discover, etc. The card scheme ecosystem is very complex [3].

Ledger

The ledger keeps a financial record of the payment transaction. For example, when a user pays the seller $1, we record it as a debit of $1 from the user and a credit of $1 to the seller. The ledger system is very important in post-payment analysis, such as calculating the total revenue of the e-commerce website or forecasting future revenue.

Wallet

The wallet keeps the account balance of the merchant. It may also record how much a given user has paid in total.

As shown in Figure 2, a typical pay-in flow works like this:

When a user clicks the “place order” button, a payment event is generated and sent to the payment service.
The payment service stores the payment event in the database.
Sometimes, a single payment event may contain several payment orders. For example, you may select products from multiple sellers in a single checkout process. If the e-commerce website splits the checkout into multiple payment orders, the payment service calls the payment executor for each payment order.
The payment executor stores the payment order in the database.
The payment executor calls an external PSP to process the credit card payment.
After the payment executor has successfully processed the payment, the payment service updates the wallet to record how much money a given seller has.
The wallet service stores the updated balance information in the database.
After the wallet service has successfully updated the seller’s balance information, the payment service calls the ledger to update it.
The ledger service appends the new ledger information to the database.
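The order of calls in the flow above can be sketched as follows. This is a simplified, in-memory illustration rather than the book's implementation: the class and method names are hypothetical, the databases are plain Python containers, and the PSP call is a placeholder that always succeeds.

```python
from dataclasses import dataclass, field
from decimal import Decimal


@dataclass
class PaymentOrder:
    payment_order_id: str
    seller_account: str
    amount: Decimal


@dataclass
class PayInFlow:
    """In-memory stand-ins for the databases behind each service."""
    event_store: list = field(default_factory=list)   # payment service DB
    order_store: list = field(default_factory=list)   # payment executor DB
    wallet: dict = field(default_factory=dict)         # seller -> balance
    ledger: list = field(default_factory=list)         # append-only records

    def call_psp(self, order: PaymentOrder) -> bool:
        # Placeholder for the external PSP call; this sketch always succeeds.
        return True

    def handle_payment_event(self, checkout_id: str, buyer: str, orders: list) -> None:
        self.event_store.append(checkout_id)               # store the payment event
        for order in orders:
            self.order_store.append(order.payment_order_id)  # store the payment order
            if not self.call_psp(order):
                continue  # failure handling is out of scope for this sketch
            balance = self.wallet.get(order.seller_account, Decimal("0"))
            self.wallet[order.seller_account] = balance + order.amount  # update wallet
            self.ledger.append((buyer, order.seller_account, order.amount))  # then ledger


flow = PayInFlow()
flow.handle_payment_event(
    "checkout-1",
    "buyer-1",
    [
        PaymentOrder("po-1", "seller-a", Decimal("10.00")),
        PaymentOrder("po-2", "seller-b", Decimal("5.50")),
    ],
)
```

One checkout with two sellers produces two payment orders, two wallet updates, and two ledger records, in that order.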

APIs for payment service

We use the RESTful API design convention for the payment service.

POST /v1/payments

This endpoint executes a payment event. As mentioned above, a single payment event may contain multiple payment orders. The request parameters are listed below:

The payment_orders look like this:

Please note that the "payment_order_id" is a globally unique identifier. When the payment executor sends a payment request to a third-party Payment Service Provider (PSP), the payment_order_id serves as the deduplication ID or idempotency key used by the PSP.
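To illustrate how an idempotency key prevents double charging, here is a minimal sketch. FakePsp is a hypothetical in-memory stand-in, not a real PSP client; real PSPs define their own request formats and deduplication rules.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class FakePsp:
    """Hypothetical stand-in for a PSP that deduplicates by idempotency key."""
    charges: dict = field(default_factory=dict)

    def charge(self, idempotency_key: str, amount: str, currency: str) -> dict:
        # A repeated key returns the stored result instead of charging again.
        if idempotency_key in self.charges:
            return self.charges[idempotency_key]
        result = {
            "charge_id": str(uuid.uuid4()),
            "amount": amount,
            "currency": currency,
            "status": "SUCCESS",
        }
        self.charges[idempotency_key] = result
        return result


psp = FakePsp()
payment_order_id = str(uuid.uuid4())   # globally unique, reused on every retry
first = psp.charge(payment_order_id, "19.99", "USD")
retry = psp.charge(payment_order_id, "19.99", "USD")
```

Because the retry carries the same payment_order_id, it returns the original charge and the buyer is charged once.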

It is worth mentioning that the "amount" field in the data structure is represented as a "string" rather than a "double" data type. This choice is deliberate due to several reasons:

Serialization and Deserialization Precision: Different protocols, software systems, and hardware configurations may support varying levels of numeric precision during serialization and deserialization. Using a "double" data type could lead to unintended rounding errors or inconsistencies across different platforms.
Handling Large or Small Numbers: The "amount" field could potentially represent extremely large values (for example, Japan’s GDP is around 5 x 10^14 yen for the calendar year 2020) or extremely small values (for example, a satoshi of Bitcoin is 10^-8). Representing such values accurately with a "double" data type may pose challenges.

To address these concerns, it is recommended to keep numeric values in string format during transmission and storage. The string representation allows for precise preservation of the original value. Conversion to numeric types is typically done only when performing calculations or displaying the data.
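The following sketch shows why a string amount is safer than a double. It uses Python's standard decimal module; the payload fields and values are hypothetical examples, not the book's request schema.

```python
import json
from decimal import Decimal

# Binary floating point cannot represent most decimal fractions exactly.
float_sum = 0.1 + 0.2
decimal_sum = Decimal("0.1") + Decimal("0.2")
print(float_sum == 0.3, decimal_sum == Decimal("0.3"))

# Hypothetical payment order payload: the amount travels as a string.
raw = '{"payment_order_id": "po-1", "amount": "0.00000001", "currency": "BTC"}'
order = json.loads(raw)

# Convert to a numeric type only when a calculation is needed.
amount = Decimal(order["amount"])
total = amount * 3

# Convert back to a plain decimal string before storing or sending.
serialized = json.dumps({**order, "amount": format(total, "f")})
```

The string survives serialization unchanged, and Decimal keeps the arithmetic exact when a calculation is required.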

Additionally, the API includes a GET endpoint ("GET /v1/payments/{:id}") that provides the execution status of a specific payment order based on the corresponding payment_order_id.

It's worth noting that the payment API described shares similarities with the APIs of well-known Payment Service Providers (PSPs). For a more comprehensive understanding of payment APIs, you can explore Stripe's API documentation which offers detailed information on the subject.

The data model for payment service

For the payment service, we require two tables: payment event and payment order. When selecting a storage solution for a payment system, performance is not the primary consideration. Instead, we prioritize the following factors:

Proven Stability: We look for a storage system that has been successfully used by other prominent financial firms for an extended period, typically more than 5 years, and has received positive feedback regarding its reliability and stability.
Rich Supporting Tools: It is essential to assess the availability of supporting tools, such as monitoring and investigation tools, that can enhance the management and maintenance of the storage system.
Maturity of the Database Administrator (DBA) Job Market: The availability of experienced DBAs is a critical factor to consider. A well-established market of skilled DBAs ensures proper administration and maintenance of the chosen storage solution.

In general, our preference leans towards a traditional relational database that provides ACID transaction support, rather than NoSQL or NewSQL solutions.

The payment event table stores detailed payment event information.

The payment order table stores the execution status of each payment order. This is what it looks like:

Before we explore the tables, let's provide some background information:

Foreign Key: The checkout_id serves as a foreign key. It links a single checkout to a payment event that can contain multiple payment orders.
Pay-In and Pay-Out Process: When we utilize a third-party Payment Service Provider (PSP) to deduct money from a buyer's credit card, the funds are not immediately transferred to the seller. Instead, the money is first transferred to the e-commerce website's bank account in a process known as pay-in. The actual transfer to the seller's bank account, called pay-out, occurs when certain conditions are met, such as the delivery of products. Consequently, during the pay-in flow, only the buyer's card information is required, not the seller's bank account information.
Payment Order Table: In Table 4 (payment order table), the payment_order_status is an enumerated type (enum) that indicates the execution status of the payment order. The possible statuses are NOT_STARTED, EXECUTING, SUCCESS, and FAILED. The update logic for payment_order_status is as follows:

The initial status is NOT_STARTED.
When the payment service sends the payment order to the payment executor, the status changes to EXECUTING.
The payment service updates the status to SUCCESS or FAILED based on the response from the payment executor.
Once the status is SUCCESS, the payment service invokes the wallet service to update the seller's balance and sets the wallet_updated field to TRUE. For simplicity, we assume that wallet updates always succeed in this design.
Subsequently, the payment service calls the ledger service to update the ledger database, marking the ledger_updated field as TRUE.

Payment Event Table: When all payment orders associated with the same checkout_id are successfully processed, the payment service sets the is_payment_done field to TRUE in the payment event table. A scheduled job typically runs at fixed intervals to monitor the status of ongoing payment orders. If a payment order exceeds a specified threshold without completing, an alert is triggered to notify engineers, who can then investigate the issue.

By providing this background information, we establish the context for understanding the subsequent discussion of the tables and their related fields.
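The status update logic described above can be sketched as a small state machine. The status names come from the text; treating SUCCESS and FAILED as terminal, and the helper names, are assumptions made for this illustration.

```python
from enum import Enum


class PaymentOrderStatus(Enum):
    NOT_STARTED = "NOT_STARTED"
    EXECUTING = "EXECUTING"
    SUCCESS = "SUCCESS"
    FAILED = "FAILED"


# Allowed transitions. Assumption: SUCCESS and FAILED are terminal here;
# a real system might allow FAILED -> EXECUTING for retries.
ALLOWED = {
    PaymentOrderStatus.NOT_STARTED: {PaymentOrderStatus.EXECUTING},
    PaymentOrderStatus.EXECUTING: {PaymentOrderStatus.SUCCESS, PaymentOrderStatus.FAILED},
    PaymentOrderStatus.SUCCESS: set(),
    PaymentOrderStatus.FAILED: set(),
}


def transition(current: PaymentOrderStatus, new: PaymentOrderStatus) -> PaymentOrderStatus:
    """Return the new status, or raise if the transition is not allowed."""
    if new not in ALLOWED[current]:
        raise ValueError(f"illegal transition {current.name} -> {new.name}")
    return new


def is_payment_done(order_statuses: list) -> bool:
    """A payment event is done when every one of its orders has succeeded."""
    return bool(order_statuses) and all(
        status is PaymentOrderStatus.SUCCESS for status in order_statuses
    )


status = transition(PaymentOrderStatus.NOT_STARTED, PaymentOrderStatus.EXECUTING)
status = transition(status, PaymentOrderStatus.SUCCESS)
```

Rejecting illegal transitions makes it harder for a late or duplicated response to overwrite a final status.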

Double-entry ledger system

The double-entry principle, also known as double-entry accounting/bookkeeping, is a crucial design principle in the ledger system. It is essential for accurate bookkeeping and is fundamental to any payment system. Each payment transaction is recorded in two separate ledger accounts: one account is debited and the other is credited with the identical amount (Table 5).

According to the double-entry system, the total sum of all transaction entries should always be zero. This principle ensures that if one cent is lost, someone else gains exactly one cent. It offers end-to-end traceability and maintains consistency throughout the payment cycle. For further information on implementing the double-entry system, you can refer to Square's engineering blog, which covers the topic of an immutable double-entry accounting database service.
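A minimal sketch of the zero-sum property follows. The sign convention (debits negative, credits positive) and all names are assumptions for illustration; production ledgers are typically immutable, persistent stores.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class LedgerEntry:
    account: str
    amount: Decimal   # assumed convention: negative = debit, positive = credit


def record_transfer(ledger: list, debit_account: str, credit_account: str, amount: Decimal) -> None:
    """Append one debit and one credit entry of the same amount."""
    ledger.append(LedgerEntry(debit_account, -amount))
    ledger.append(LedgerEntry(credit_account, amount))


def ledger_total(ledger: list) -> Decimal:
    """Sum of all entries; the double-entry principle requires this to be zero."""
    return sum((entry.amount for entry in ledger), Decimal("0"))


ledger = []
record_transfer(ledger, "buyer", "seller", Decimal("1.00"))
record_transfer(ledger, "buyer", "seller", Decimal("2.50"))
```

Since every transfer appends a matching debit and credit, the ledger total stays at zero no matter how many transfers are recorded.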

Hosted payment page

Many companies choose not to store credit card information internally because of the complexity of complying with standards like the Payment Card Industry Data Security Standard (PCI DSS) [8]. To avoid handling credit card information, companies use hosted credit card pages offered by Payment Service Providers (PSPs). For websites, these pages come in the form of a widget or an iframe; for mobile applications, they may be a pre-built page from the payment SDK. Figure 3 provides an example of the checkout process with PayPal integration. The important aspect here is that the PSP supplies a hosted payment page that captures the customer's card information directly, rather than relying on our own payment service.

Pay-out flow

The pay-out flow shares many similarities with the pay-in flow in terms of its components. However, there is a notable difference: instead of relying on a Payment Service Provider (PSP) to transfer funds from the buyer's credit card to the e-commerce website's bank account, the pay-out flow involves a third-party pay-out provider that facilitates the transfer of funds from the e-commerce website's bank account to the seller's bank account.

Typically, payment systems employ third-party accounts payable providers like Tipalti [9] to handle pay-outs. Pay-outs come with their own set of bookkeeping and regulatory requirements that need to be fulfilled.

Step 3 - Design Deep Dive

This section is dedicated to enhancing the system's speed, robustness, and security. In a distributed system, errors and failures are not only expected but also frequent. For instance, what occurs if a customer accidentally clicks the "pay" button multiple times? Will they be charged multiple times? How should we handle payment failures resulting from unstable network connections? This section delves into several crucial subjects to address these concerns comprehensively.

PSP integration
Reconciliation
Handling payment processing delays
Communication among internal services
Handling failed payments
Exactly-once delivery
Consistency
Security

PSP integration

In situations where a payment system has the capability to directly connect with banks or card schemes like Visa or MasterCard, it is possible to conduct payments without relying on a Payment Service Provider (PSP). However, these direct connections are rare and require specialized expertise. Typically, they are utilized by large companies that can justify the investment required. For the majority of companies, integrating with a PSP is the preferred approach, which can be done in two ways:

If a company can securely store sensitive payment information and chooses to do so, they can integrate the PSP using an API. In this case, the company is responsible for developing the payment web pages, collecting and storing the sensitive payment information. The PSP, on the other hand, handles the connection to banks or card schemes.
If a company opts not to store sensitive payment information due to complex regulations and security concerns, the PSP provides a hosted payment page. This page is used to collect the card payment details from customers and securely store them within the PSP's system. This is the more commonly adopted approach by most companies.

Figure 4 shows in detail how the hosted payment page works.

For the sake of simplicity, Figure 4 excludes the payment executor, ledger, and wallet. The payment service acts as the orchestrator of the entire payment process.

1. The payment process begins when the user clicks the "checkout" button in the client browser. The client then sends the payment order information to the payment service.

2. Upon receiving the payment order information, the payment service sends a payment registration request to the Payment Service Provider (PSP). This registration request contains relevant payment details such as the amount, currency, expiration date of the payment request, and the redirect URL. To ensure that a payment order is registered only once, the request includes a UUID field known as the nonce [10]. Typically, this UUID is the ID of the payment order.

3. The PSP returns a token to the payment service. The token uniquely identifies the payment registration on the PSP side and can be used later to look up the payment registration and the payment execution status.

4. The payment service stores the token in the database before calling the PSP-hosted payment page.

5. Once the token is saved, the client displays the PSP-hosted payment page. Mobile applications commonly use the PSP's SDK integration for this. As an example, consider Stripe's web integration (Figure 5). Stripe offers a JavaScript library that presents the payment user interface (UI), collects sensitive payment information, and communicates directly with the PSP to complete the payment. Because Stripe collects the sensitive payment information, it never reaches our payment system. The hosted payment page typically requires two pieces of information:

The token stored in step 4: The PSP's JavaScript code uses this token to retrieve detailed information about the payment request from the PSP's backend. One crucial piece of information is the amount to be collected.
The redirect URL: This is the web page URL that is called when the payment is complete. When the PSP's JavaScript finishes the payment, it redirects the browser to the redirect URL. Usually, the redirect URL is an e-commerce web page that shows the checkout status. Note that the redirect URL is different from the webhook [11] URL in step 9.

6. The user enters the payment information, including the credit card number, cardholder's name, expiration date, and other relevant details, on the PSP's web page, then clicks the pay button. The PSP starts processing the payment.
7. The PSP returns the payment status.
8. The web page is redirected to the redirect URL. The payment status received in step 7 is typically appended to the URL. For instance, the complete redirect URL might look like [12]:

https://your-company.com/?tokenID=JIOUIQ123NSF&payResult=X324FSa

9. Asynchronously, the PSP sends the payment status to the payment service through a webhook. The webhook is a URL on the payment system's side that was registered with the PSP during the initial setup. When the payment system receives payment events through the webhook, it extracts the payment status and updates the "payment_order_status" field in the Payment Order database table.
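The webhook handling in the last step might look roughly like the sketch below. The payload fields ("token", "status"), the HTTP-style return codes, and the in-memory table are all assumptions; each PSP defines its own webhook format, and a real handler would also verify that the request genuinely came from the PSP.

```python
import json

# In-memory stand-in for the payment order table, keyed by the PSP token.
payment_orders = {
    "JIOUIQ123NSF": {"payment_order_id": "po-1", "payment_order_status": "EXECUTING"},
}

FINAL_STATUSES = {"SUCCESS", "FAILED"}


def handle_webhook(body: str) -> int:
    """Apply a PSP status notification and return an HTTP-style status code."""
    event = json.loads(body)
    order = payment_orders.get(event.get("token"))
    if order is None:
        return 404                      # unknown payment registration
    status = event.get("status")
    if status not in FINAL_STATUSES:
        return 400                      # payload we do not understand
    if order["payment_order_status"] in FINAL_STATUSES:
        return 200                      # duplicate delivery: keep the first result
    order["payment_order_status"] = status
    return 200


code = handle_webhook('{"token": "JIOUIQ123NSF", "status": "SUCCESS"}')
duplicate_code = handle_webhook('{"token": "JIOUIQ123NSF", "status": "FAILED"}')
```

Ignoring notifications for an order that already has a final status is one simple way to tolerate repeated webhook deliveries.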

Up until now, we have discussed the ideal scenario of the hosted payment page. However, in reality, there can be instances of unreliable network connections and potential failures in any of the nine steps mentioned above. To address such failure cases in a systematic manner, reconciliation comes into play.

Reconciliation

In scenarios where system components communicate asynchronously, there is no guarantee of message delivery or immediate response. Asynchronous communication is commonly employed in the payment industry to enhance system performance. External systems like Payment Service Providers (PSPs) and banks also prefer this type of communication. However, ensuring correctness in such cases requires the practice of reconciliation.

Reconciliation is a process that involves periodically comparing the states of related services to verify their agreement. It serves as the final line of defense in the payment system.

Each night, PSPs or banks send a settlement file to their clients. This file includes the bank account balance and all the transactions that occurred on that account during the day. The reconciliation system parses the settlement file and compares its details with the ledger system. Figure 6 provides an illustration of where the reconciliation process fits within the overall system.

Reconciliation is also used to verify that the payment system is internally consistent. For example, the states recorded in the ledger and the wallet might diverge, and the reconciliation process can detect the discrepancy.

When mismatches are detected during reconciliation, the finance team is typically responsible for making manual adjustments to resolve them. These mismatches and adjustments can generally be classified into three categories:

Classifiable mismatch with automatable adjustment: If the cause of the mismatch is known and it is cost-effective to develop a program to automate the adjustment, engineers can create an automated solution. This involves both classifying the mismatch and automating the adjustment process.
Classifiable mismatch with unautomatable adjustment: In cases where the cause of the mismatch is known, but the cost of developing an automated adjustment program is prohibitively high, the mismatch is added to a job queue. The finance team then manually addresses and resolves the mismatch.
Unclassifiable mismatch: When the cause of the mismatch cannot be determined, it is considered unclassifiable. These mismatches are placed in a special job queue, and the finance team conducts a manual investigation to identify the underlying cause and resolve the discrepancy.
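The comparison between the settlement file and the ledger can be sketched as follows. The record layout (payment order ID mapped to an amount) and the mismatch labels are assumptions for illustration; real settlement files have PSP-specific formats.

```python
from decimal import Decimal


def reconcile(settlement: dict, ledger: dict) -> list:
    """Compare two mappings of payment_order_id -> amount and list mismatches."""
    mismatches = []
    for order_id in sorted(settlement.keys() | ledger.keys()):
        settled = settlement.get(order_id)
        recorded = ledger.get(order_id)
        if settled is None:
            mismatches.append((order_id, "missing_in_settlement"))
        elif recorded is None:
            mismatches.append((order_id, "missing_in_ledger"))
        elif settled != recorded:
            mismatches.append((order_id, "amount_mismatch"))
    return mismatches


settlement_file = {"po-1": Decimal("10.00"), "po-2": Decimal("5.00"), "po-3": Decimal("7.25")}
ledger_records = {"po-1": Decimal("10.00"), "po-2": Decimal("5.50"), "po-4": Decimal("1.00")}
mismatches = reconcile(settlement_file, ledger_records)
```

Each mismatch would then be routed to an automated fix or to one of the finance team's job queues, depending on which of the three categories it falls into.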

Handling payment processing delays

As previously mentioned, the payment request process involves various components and parties, both internal and external. While most payment requests are completed within seconds, there are scenarios where a payment request may experience delays, taking hours or even days to be finalized or rejected. Here are a few examples of situations that can cause longer processing times for payment requests:

The Payment Service Provider (PSP) identifies a payment request as high risk and requires manual review by a human.
A credit card used for the payment requires additional protection, such as 3D Secure Authentication [13], which involves gathering extra details from the cardholder to verify the purchase.

The payment service must be able to handle these long-running payment requests. When the payment page is hosted by an external PSP, which is common nowadays, the PSP handles them in the following ways:

The PSP returns a pending status to our client, which displays it to the user. The client also provides a dedicated page for customers to check the current payment status.
The PSP monitors the pending payment on behalf of the payment service and notifies it of any status updates through a webhook that the payment service has registered with the PSP.
When the payment request is finally completed, the PSP calls the registered webhook mentioned above. The payment service updates its internal system accordingly and proceeds with the shipment to the customer.

Alternatively, some PSPs may require the payment service to actively poll the PSP for status updates on any pending payment requests, instead of relying on webhooks to receive updates.
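For the polling alternative, a minimal sketch is shown below. The "PENDING" status string, the function name, and the exponential backoff between attempts are assumptions for illustration; the actual status values and recommended polling intervals depend on the PSP.

```python
import time
from typing import Callable


def poll_until_final(
    fetch_status: Callable[[], str],
    max_attempts: int = 5,
    delay_seconds: float = 0.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll until the status is no longer PENDING or attempts run out."""
    for attempt in range(max_attempts):
        status = fetch_status()
        if status != "PENDING":
            return status
        if attempt < max_attempts - 1:
            # Assumed policy: double the wait after every pending response.
            sleep(delay_seconds * (2 ** attempt))
    return "PENDING"


# Simulated PSP responses standing in for a real status API call.
responses = iter(["PENDING", "PENDING", "SUCCESS"])
result = poll_until_final(lambda: next(responses))
```

If the payment is still pending after the last attempt, the caller can schedule another round later or alert an engineer.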

Communication among internal services

Internal services use two types of communication patterns: synchronous and asynchronous. Both are explained below.

Synchronous communication

As systems scale up, the limitations of synchronous communication, such as HTTP, become more apparent. While suitable for small-scale systems, it exhibits drawbacks as the scale increases. These drawbacks include:

Low performance: The performance of the entire system is dependent on the performance of each service in the chain. If any of the services underperform, it can impact the overall system performance.
Poor failure isolation: In a synchronous communication model, if any of the Payment Service Providers (PSPs) or other services fail, the client will not receive a response. This lack of failure isolation can lead to disruptions in the system.
Tight coupling: Synchronous communication requires the request sender to have knowledge of and directly communicate with the intended recipient. This tight coupling between services can make the system less flexible and hinder modularity.
Difficulty in scaling: Without using a queue or buffer as a mechanism, scaling the system to handle sudden spikes in traffic becomes challenging. Synchronous communication lacks the ability to easily accommodate increased traffic without additional measures in place.

Overall, as the scale of the system increases, the limitations of synchronous communication become more evident, necessitating alternative approaches for improved performance, failure handling, loose coupling, and scalability.

Asynchronous communication

Asynchronous communication can be divided into two categories:

Single receiver: each request (message) is processed by one receiver or service. It’s usually implemented via a shared message queue. The message queue can have multiple subscribers, but once a message is processed, it gets removed from the queue. Let’s take a look at a concrete example. In Figure 9, service A and service B both subscribe to a shared message queue. When m1 and m2 are consumed by service A and service B respectively, both messages are removed from the queue as shown in Figure 10.

Multiple receivers: each request (message) is processed by multiple receivers or services. Kafka works well here. Kafka retains messages even after they are consumed, so different services can process the same message. This model maps well to payment systems, where a single request may trigger multiple side effects, such as sending push notifications, updating financial reporting, and performing analytics.

Figure 11 illustrates an example of this model. Payment events are published to Kafka and subsequently consumed by various services, including the payment system itself, an analytics service, and a billing service. This design enables the seamless distribution of payment-related information across multiple services, ensuring that each service can perform its specific tasks and process the payment events accordingly.
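The difference between the two models can be sketched in a few lines of Python. This is a toy in-memory illustration only, not Kafka's API or any real broker: the class names SharedQueue and EventLog, the consumer names, and the message names are all hypothetical.

```python
from collections import deque


class SharedQueue:
    """Single receiver: a message is removed once any consumer takes it."""

    def __init__(self):
        self._messages = deque()

    def publish(self, message):
        self._messages.append(message)

    def consume(self):
        # Return the oldest message and remove it, or None if the queue is empty.
        return self._messages.popleft() if self._messages else None


class EventLog:
    """Multiple receivers: messages are retained and each consumer keeps its own offset."""

    def __init__(self):
        self._messages = []
        self._offsets = {}

    def publish(self, message):
        self._messages.append(message)

    def consume(self, consumer):
        offset = self._offsets.get(consumer, 0)
        if offset >= len(self._messages):
            return None
        self._offsets[consumer] = offset + 1
        return self._messages[offset]


queue = SharedQueue()
queue.publish("m1")
queue.publish("m2")
service_a_got = queue.consume()   # service A takes m1
service_b_got = queue.consume()   # service B takes m2
queue_leftover = queue.consume()  # both messages are gone

log = EventLog()
log.publish("payment_succeeded")
analytics_got = log.consume("analytics")  # the event stays in the log
billing_got = log.consume("billing")      # so billing sees it too
```

In the queue, each message reaches exactly one service. In the log, every subscribed service reads the same payment event at its own pace.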

In general, synchronous communication is characterized by its simplicity in design, but it lacks the ability to allow services to operate autonomously. As the dependency graph expands, the overall performance of the system tends to decline. On the other hand, asynchronous communication prioritizes scalability and failure resilience, albeit at the cost of design simplicity and consistency.

For a large-scale payment system that involves intricate business logic and relies on numerous third-party dependencies, opting for asynchronous communication is a more favorable choice. The complexity of such a system necessitates the ability to handle concurrent tasks, distribute workloads efficiently, and withstand failures without impacting the overall performance. By embracing asynchronous communication, the system can achieve the scalability and resilience required to effectively handle the demands and challenges of a large-scale payment ecosystem.

Handling failed payments

Every payment system has to handle failed transactions. Reliability and fault tolerance are key requirements. We review some of the techniques for tackling those challenges.

Tracking payment state

Having a definitive payment state at any stage of the payment cycle is crucial. Whenever a failure happens, we can determine the current state of a payment transaction and decide whether a retry or refund is needed. The payment state can be persisted in an append-only database table.
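As a minimal sketch of the append-only idea, the snippet below uses an in-memory SQLite table. The table name, column names, and state values (NOT_STARTED, EXECUTING, FAILED) are assumptions made for illustration; the text does not prescribe a schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE payment_state_log (
        seq INTEGER PRIMARY KEY AUTOINCREMENT,
        payment_order_id TEXT NOT NULL,
        state TEXT NOT NULL
    )
    """
)


def record_state(payment_order_id, state):
    # Append only: existing rows are never updated or deleted.
    conn.execute(
        "INSERT INTO payment_state_log (payment_order_id, state) VALUES (?, ?)",
        (payment_order_id, state),
    )
    conn.commit()


def current_state(payment_order_id):
    # The current state is simply the most recently appended row.
    row = conn.execute(
        "SELECT state FROM payment_state_log "
        "WHERE payment_order_id = ? ORDER BY seq DESC LIMIT 1",
        (payment_order_id,),
    ).fetchone()
    return row[0] if row else None


record_state("order-1", "NOT_STARTED")
record_state("order-1", "EXECUTING")
record_state("order-1", "FAILED")
```

Because nothing is overwritten, the full history of a payment stays available for debugging, while current_state("order-1") tells us whether a retry or refund is needed.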

Retry queue and dead letter queue

To gracefully handle failures, we utilize the retry queue and dead letter queue, as shown in Figure 12.

Retry queue: retryable errors such as transient errors are routed to a retry queue.
Dead letter queue [14]: if a message fails repeatedly, it eventually lands in the dead letter queue. A dead letter queue is useful for debugging and isolating problematic messages for inspection to determine why they were not processed successfully.

Check whether the failure is retryable.

Retryable failures are routed to a retry queue.
For non-retryable failures such as invalid input, errors are stored in a database.

The payment system consumes events from the retry queue and retries failed payment transactions.
If the payment transaction fails again:

If the retry count doesn’t exceed the threshold, the event is routed to the retry queue.
If the retry count exceeds the threshold, the event is put in the dead letter queue. Those failed events might need to be investigated.
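The routing steps above can be sketched as follows. This is an in-memory illustration under stated assumptions: the queues are plain Python deques rather than a real message broker, and MAX_RETRIES, the exception classes, and the event fields are hypothetical names.

```python
from collections import deque

MAX_RETRIES = 3  # assumed threshold

retry_queue = deque()
dead_letter_queue = deque()
error_store = []  # stands in for the database of non-retryable errors


class RetryableError(Exception):
    """Transient failure, such as a timeout."""


class NonRetryableError(Exception):
    """Permanent failure, such as invalid input."""


def handle_failure(event, error):
    if isinstance(error, NonRetryableError):
        error_store.append((event["id"], str(error)))
    elif event["retry_count"] < MAX_RETRIES:
        event["retry_count"] += 1
        retry_queue.append(event)
    else:
        # Retries are exhausted: park the event for investigation.
        dead_letter_queue.append(event)


def process(event, execute_payment):
    try:
        execute_payment(event)
        return True
    except (RetryableError, NonRetryableError) as error:
        handle_failure(event, error)
        return False


def drain_retry_queue(execute_payment):
    while retry_queue:
        process(retry_queue.popleft(), execute_payment)


def always_times_out(event):
    raise RetryableError("PSP timeout")


def rejects_input(event):
    raise NonRetryableError("invalid card number")


process({"id": "pay-1", "retry_count": 0}, always_times_out)
drain_retry_queue(always_times_out)
process({"id": "pay-2", "retry_count": 0}, rejects_input)
```

Here pay-1 is retried three times and then lands in the dead letter queue, while pay-2 is recorded in the error store without any retry.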

If you are interested in a real-world example of using those queues, take a look at Uber’s payment system that utilizes Kafka to meet the reliability and fault-tolerance requirements [15].

Exactly-once delivery

One of the most critical issues that a payment system can encounter is double-charging a customer. Therefore, it is crucial to ensure that the design of the payment system guarantees the execution of a payment order exactly once [16].

At first glance, achieving exactly-once delivery may seem challenging, but breaking down the problem into two parts makes it more manageable. Mathematically, an operation is considered to be executed exactly once if:

It is executed at least once.
Simultaneously, it is executed at most once.

We will now explain how to implement the "at-least-once" aspect using retry mechanisms, and the "at-most-once" aspect using idempotency checks.

Implementing the "at-least-once" execution can be achieved by employing retry strategies. If a payment operation fails or encounters an error, it can be retried a certain number of times until it succeeds. By ensuring that the system retries failed operations, we can guarantee that the payment order will be executed at least once.

To address the "at-most-once" execution, we utilize idempotency checks. Idempotency ensures that if the same payment order is received multiple times (due to retries or other factors), the system will recognize and handle it as a duplicate request. This prevents the system from processing the same payment order more than once, eliminating the risk of double-charging the customer.

By combining retry mechanisms for at-least-once execution and implementing idempotency checks for at-most-once execution, we can create a robust and reliable payment system that ensures the execution of payment orders exactly once, mitigating the risk of double-charging customers.

Retry

In certain cases, it becomes necessary to retry a payment transaction as a result of network errors or timeouts. Retry functionality ensures an "at-least-once" guarantee for the transaction. Figure 13 illustrates an example where a client attempts to make a $10 payment, but the payment request fails repeatedly due to a poor network connection. However, as the network eventually stabilizes, the request succeeds on the fourth attempt. This retry mechanism allows for resilient payment processing, ensuring that the transaction eventually goes through despite temporary network disruptions.

Determining the most suitable retry strategy involves careful consideration. Here are some common retry strategies:

Immediate retry: The client promptly resends the request after a failure occurs.
Fixed intervals: A fixed amount of time is waited between the failed payment and subsequent retry attempts.
Incremental intervals: The client initially waits for a short period before the first retry and then gradually increases the waiting time for subsequent retries.
Exponential backoff [17]: The waiting time between retries is doubled after each failed attempt. For instance, if the request fails the first time, a retry is attempted after 1 second. If it fails again, the next retry is attempted after 2 seconds, and so on.
Cancel: The client has the option to cancel the request, especially when the failure is permanent or further retries are unlikely to succeed.

Selecting the appropriate retry strategy can be challenging, as there is no one-size-fits-all solution. As a general guideline, use exponential backoff when the network issue is unlikely to be resolved quickly. Avoid overly aggressive retry strategies, since they waste computing resources and can overload the service. When a service rejects a request that may be retried later, it is good practice to return an error code together with a Retry-After header so the client knows how long to wait.
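To make the strategies concrete, here is a small sketch that computes the wait time before each retry attempt. The function names, the one-second base, and the 60-second cap are assumptions chosen for illustration, and the Retry-After value is assumed to be given in seconds.

```python
def fixed_interval(attempt, base=1.0):
    # The same wait before every retry.
    return base


def incremental_interval(attempt, base=1.0, step=1.0):
    # The wait grows by a constant step: 1s, 2s, 3s, ...
    return base + step * (attempt - 1)


def exponential_backoff(attempt, base=1.0, cap=60.0):
    # The wait doubles after each failed attempt: 1s, 2s, 4s, ... up to the cap.
    return min(cap, base * 2 ** (attempt - 1))


def retry_delay(attempt, retry_after=None):
    # Prefer the server's Retry-After hint when one is provided.
    if retry_after is not None:
        return float(retry_after)
    return exponential_backoff(attempt)


delays = [exponential_backoff(attempt) for attempt in range(1, 6)]
```

For the first five retries, exponential_backoff yields waits of 1, 2, 4, 8, and 16 seconds. The cap keeps the wait from growing without bound when a failure persists.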

One potential problem with retries is double payment. The following two scenarios illustrate the issue and highlight the need for idempotency in payment systems:

Scenario 1: The payment system integrates with a Payment Service Provider (PSP) using a hosted payment page, and the client unintentionally clicks the pay button twice. This can result in the possibility of the payment being processed twice.

Scenario 2: The payment is successfully processed by the PSP, but the response confirming the payment fails to reach our payment system due to network errors or communication issues. As a result, the user may click the "pay" button again or the client may retry the payment, potentially leading to duplicate payments.

To prevent the occurrence of double payments in such situations, it is crucial to ensure that payments are executed in an at-most-once manner. This means that even if multiple requests or retries are made, the system should only process the payment once. This guarantee of at-most-once execution is commonly referred to as idempotency. By implementing idempotency measures, payment systems can avoid duplicate charges and maintain accurate transaction records.

Idempotency

Idempotency plays a crucial role in ensuring the at-most-once guarantee, particularly in the context of payment systems. In mathematics and computer science, idempotence is the property of certain operations whereby they can be applied multiple times without changing the result beyond the initial application [18]. From an API perspective, it means that clients can make the same call repeatedly and consistently obtain the same outcome.

An idempotency key is commonly employed to facilitate communication between clients (web and mobile applications) and servers. This key is a unique value generated by the client and typically has an expiration period. Many tech companies, including Stripe [19] and PayPal [20], recommend using UUIDs as idempotency keys. In an idempotent payment request, the idempotency key is added to the HTTP header, typically as follows:

<idempotency-key: key_value>.

Now that we have a grasp of the fundamental concept of idempotency, let's explore how it helps address the issue of double payments mentioned earlier. With idempotency in place, a payment request that is inadvertently duplicated or repeated does not result in multiple payments being processed. The idempotency key uniquely identifies each payment request, allowing the system to recognize duplicates and respond accordingly.
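On the client side, attaching the key might look like the sketch below. The endpoint path, the request shape, and the helper names are hypothetical; only the use of a UUID carried in an idempotency-key HTTP header comes from the text.

```python
import uuid


def new_idempotency_key():
    # A random UUID generated by the client, one per payment intent.
    return str(uuid.uuid4())


def build_payment_request(amount, currency, idempotency_key):
    return {
        "method": "POST",
        "path": "/v1/payments",  # hypothetical endpoint
        "headers": {
            "Content-Type": "application/json",
            "Idempotency-Key": idempotency_key,
        },
        "body": {"amount": amount, "currency": currency},
    }


key = new_idempotency_key()
first_attempt = build_payment_request("10.00", "USD", key)
# A retry of the same payment must reuse the same key, not generate a new one.
retry_attempt = build_payment_request("10.00", "USD", key)
```

The important design point is that the key is created once per payment and reused on every retry, so the server can tell a retry apart from a genuinely new payment.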

Scenario 1: what if a customer clicks the “pay” button quickly twice?

Figure 14 demonstrates the process where, upon a user clicking "pay," an idempotency key is included in the HTTP request sent to the payment system. In the context of an e-commerce website, this idempotency key typically corresponds to the ID of the shopping cart just before the checkout.

A second request is treated as a retry because the payment system has already seen its idempotency key. When a request header carries a previously specified idempotency key, the payment system returns the latest status of the previous request instead of executing the payment again.

If multiple concurrent requests are detected with the same idempotency key, only one request is processed and the others receive the “429 Too Many Requests” status code.
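One way to picture this behavior is the in-memory sketch below. The function names, the use of a lock-protected dictionary, and the choice of returning 200 with the stored result for a completed request are assumptions for illustration; a production system would keep this state in a shared store.

```python
import threading

IN_PROGRESS = object()  # marker for a request that is still executing

_records = {}
_lock = threading.Lock()


def begin_request(idempotency_key):
    """Return None if the caller should execute the payment,
    otherwise an (HTTP status, body) pair to send back immediately."""
    with _lock:
        if idempotency_key not in _records:
            _records[idempotency_key] = IN_PROGRESS
            return None
        record = _records[idempotency_key]
    if record is IN_PROGRESS:
        return 429, {"error": "Too Many Requests"}
    return 200, record


def finish_request(idempotency_key, result):
    with _lock:
        _records[idempotency_key] = result


first = begin_request("cart-42")        # None: go ahead and execute
concurrent = begin_request("cart-42")   # arrives while the first is in flight
finish_request("cart-42", {"status": "SUCCESS"})
later_retry = begin_request("cart-42")  # receives the stored status
```

Only the first request executes. The concurrent duplicate is rejected with 429, and a later retry receives the recorded outcome.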

To support idempotency, we can use the database's unique key constraint. For instance, the primary key of the database table can serve as the idempotency key. The process works as follows:

Upon receiving a payment, the payment system attempts to insert a new row into the corresponding database table.
If the insertion is successful, it indicates that the payment request is unique and has not been encountered previously.
However, if the insertion fails due to an existing primary key (idempotency key) already being present, it signifies that the payment request has been seen before. As a result, the second request is not processed or executed, preventing duplicate actions.

By utilizing the unique key constraint in the database, the system ensures that only one instance of a payment request with a specific idempotency key is processed. This approach maintains consistency and prevents unintended duplication of actions in the system.
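The insert-and-check steps can be sketched with SQLite's primary key constraint. The table and column names and the NOT_STARTED status are assumptions; the same idea applies to any database that enforces unique keys.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE payment_order ("
    " idempotency_key TEXT PRIMARY KEY,"
    " amount TEXT NOT NULL,"
    " status TEXT NOT NULL)"
)


def accept_payment(idempotency_key, amount):
    """Return True if the request is new, False if it is a duplicate."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO payment_order VALUES (?, ?, ?)",
                (idempotency_key, amount, "NOT_STARTED"),
            )
        return True
    except sqlite3.IntegrityError:
        # The primary key already exists, so this request was seen before.
        return False


first = accept_payment("cart-42", "10.00")
second = accept_payment("cart-42", "10.00")
```

The first call inserts the row and returns True. The second violates the primary key constraint and returns False, so the payment is not executed again. Letting the database enforce uniqueness avoids a separate check-then-insert step that could race under concurrency.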

Scenario 2: The payment is successfully processed by the PSP, but the response fails to reach our payment system due to network errors. Then the user clicks the “pay” button again.

As shown in Figure 4 (step 2 and step 3), the PSP can process a payment successfully while its response is lost to a network error before reaching our payment system. If the user then clicks the "pay" button again, the second request reaches the payment system, which does not know that the initial payment actually succeeded.

The payment system must recognize the second request as a duplicate so that the customer is not charged twice. An idempotency key makes this possible: because the retry carries the same key as the original request, the payment system (and the PSP, if it supports idempotency) can determine that the payment has already been processed and return the existing result instead of executing it again.
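The lost-response case can be simulated with a toy PSP that remembers charges by idempotency key. FakePSP, LostResponse, and the pay helper are hypothetical stand-ins written for this illustration, not a real PSP client.

```python
class LostResponse(Exception):
    """The PSP processed the request, but its response never arrived."""


class FakePSP:
    def __init__(self):
        self.charges = {}   # idempotency key -> charge record
        self.executed = 0   # how many times money actually moved

    def charge(self, idempotency_key, amount):
        if idempotency_key not in self.charges:
            self.executed += 1
            self.charges[idempotency_key] = {"amount": amount, "status": "SUCCESS"}
        # A repeated key returns the original result without charging again.
        return self.charges[idempotency_key]


def pay(psp, idempotency_key, amount, drop_response=False):
    result = psp.charge(idempotency_key, amount)
    if drop_response:
        raise LostResponse("response from PSP never arrived")
    return result


psp = FakePSP()
try:
    pay(psp, "order-7", "10.00", drop_response=True)  # charged, response lost
except LostResponse:
    outcome = pay(psp, "order-7", "10.00")  # retry with the same key
```

The retry supplies the at-least-once half and the key check supplies the at-most-once half: the customer is charged once, and the caller eventually learns the result.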

Consistency

Several stateful services are called in a payment execution:

The payment service keeps payment-related data such as nonce, token, payment order, execution status, etc.
The ledger keeps all accounting data.
The wallet keeps the account balance of the merchant.
The PSP keeps the payment execution status.
Data might be replicated among different database replicas to increase reliability.

In a distributed environment, the communication between any two services can fail, causing data inconsistency. Let’s take a look at some techniques to resolve data inconsistency in a payment system.

To maintain data consistency between internal services, ensuring exactly-once processing is very important.

To maintain data consistency between the internal service and the external service (PSP), we usually rely on idempotency and reconciliation. If the external service supports idempotency, we should use the same idempotency key for payment retry operations. Even if an external service supports an idempotent API, reconciliation is still needed because we shouldn’t assume the external system is always right.
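Reconciliation boils down to comparing two sets of records and flagging the differences. The sketch below assumes both sides can be reduced to a mapping from payment ID to amount; the function name, the record shape, and the sample data are hypothetical.

```python
from decimal import Decimal


def reconcile(ledger, settlement):
    """Compare internal ledger entries with the PSP's settlement records.

    Both arguments map a payment ID to an amount.
    """
    mismatches = []
    for payment_id in sorted(set(ledger) | set(settlement)):
        ours = ledger.get(payment_id)
        theirs = settlement.get(payment_id)
        if ours is None:
            mismatches.append((payment_id, "missing in ledger"))
        elif theirs is None:
            mismatches.append((payment_id, "missing in settlement"))
        elif ours != theirs:
            mismatches.append((payment_id, "amount differs"))
    return mismatches


ledger = {"p1": Decimal("10.00"), "p2": Decimal("5.00"), "p3": Decimal("7.50")}
settlement = {"p1": Decimal("10.00"), "p2": Decimal("5.50"), "p4": Decimal("1.00")}
mismatches = reconcile(ledger, settlement)
```

Each mismatch then needs a follow-up decision, such as an automatic adjustment or a manual investigation, which is outside the scope of this sketch.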

If data is replicated, replication lag could cause inconsistent data between the primary database and the replicas. There are generally two options to solve this:

Serve both reads and writes from the primary database only. This approach is easy to set up, but the obvious drawback is scalability. Replicas are used to ensure data reliability, but they don’t serve any traffic, which wastes resources.
Ensure all replicas are always in-sync. We could use consensus algorithms such as Paxos [21] and Raft [22], or use consensus-based distributed databases such as YugabyteDB [23] or CockroachDB [24].

Payment security

Payment security is very important. In the final part of this system design, we briefly cover a few techniques for combating cyberattacks and card thefts.

Step 4 - Wrap Up

In this topic, we investigated the pay-in flow and pay-out flow. We went into great depth about retry, idempotency, and consistency. Payment error handling and security are also covered at the end of the chapter.

A payment system is extremely complex. Even though we have covered many topics, there are still more worth mentioning. The following is a representative but not an exhaustive list of relevant topics.

Monitoring. Monitoring key metrics is a critical part of any modern application. With extensive monitoring, we can answer questions like “What is the average acceptance rate for a specific payment method?”, “What is the CPU usage of our servers?”, etc. We can create and display those metrics on a dashboard.
Alerting. When something abnormal occurs, it is important to alert on-call developers so they respond promptly.
Debugging tools. “Why does a payment fail?” is a common question. To make debugging easier for engineers and for customer support, it is important to develop tools that allow staff to review the transaction status, processing server history, PSP records, etc. of a payment transaction.
Currency exchange. Currency exchange is an important consideration when designing a payment system for an international user base.
Geography. Different regions might have completely different sets of payment methods.
Cash payment. Cash payment is very common in Egypt, India, Brazil, and some other countries. Uber [28] and Airbnb [29] wrote detailed engineering blogs about how they handled cash-based payments.
Google/Apple pay integration. Please read [30] if interested.

Congratulations on getting this far! Now give yourself a pat on the back. Good job!

Reference Materials

[1] Payment system: https://en.wikipedia.org/wiki/Payment_system

[2] AML/CFT: https://en.wikipedia.org/wiki/Money_laundering

[3] Card scheme: https://en.wikipedia.org/wiki/Card_scheme

[4] ISO 4217: https://en.wikipedia.org/wiki/ISO_4217

[5] Stripe API Reference: https://stripe.com/docs/api

[6] Double-entry bookkeeping: https://en.wikipedia.org/wiki/Double-entry_bookkeeping

[7] Books, an immutable double-entry accounting database service: https://developer.squareup.com/blog/books-an-immutable-double-entry-accounting-database-service/

[8] Payment Card Industry Data Security Standard: https://en.wikipedia.org/wiki/Payment_Card_Industry_Data_Security_Standard

[9] Tipalti: https://tipalti.com

[10] Nonce: https://en.wikipedia.org/wiki/Cryptographic_nonce

[11] Webhooks: https://stripe.com/docs/webhooks

[12] Customize your success page: https://stripe.com/docs/payments/checkout/custom-success-page

[13] 3D Secure: https://en.wikipedia.org/wiki/3-D_Secure

[14] Kafka Connect Deep Dive – Error Handling and Dead Letter Queues: https://www.confluent.io/blog/kafka-connect-deep-dive-error-handling-dead-letter-queues/

[15] Reliable Processing in a Streaming Payment System:

[16] Chain Services with Exactly-Once Guarantees: https://www.confluent.io/blog/chain-services-exactly-guarantees/

[17] Exponential backoff: https://en.wikipedia.org/wiki/Exponential_backoff

[18] Idempotence: https://en.wikipedia.org/wiki/Idempotence

[19] Stripe idempotent requests: https://stripe.com/docs/api/idempotent_requests

[20] Idempotency: https://developer.paypal.com/docs/platforms/develop/idempotency/

[21] Paxos: https://en.wikipedia.org/wiki/Paxos_(computer_science)

[22] Raft: https://raft.github.io

[23] YugabyteDB: https://www.yugabyte.com

[24] CockroachDB: https://www.cockroachlabs.com

[25] What is DDoS attack: https://www.cloudflare.com/learning/ddos/what-is-a-ddos-attack/

[26] How Payment Gateways Can Detect and Prevent Online Fraud: https://www.chargebee.com/blog/optimize-online-billing-stop-online-fraud/

[27] Advanced Technologies for Detecting and Preventing Fraud at Uber: https://eng.uber.com/advanced-technologies-detecting-preventing-fraud-uber/

[28] Re-Architecting Cash and Digital Wallet Payments for India with Uber Engineering: https://eng.uber.com/india-payments/

[29] Scaling Airbnb’s Payment Platform: https://medium.com/airbnb-engineering/scaling-airbnbs-payment-platform-43ebfc99b324
