Productionizing a Real-Time Fraud Detection Model

How I led an initiative to replace a costly 3rd-party vendor, saving €XX,000/year in chargebacks and unlocking €YY M/year in GMV by reducing customer friction.

Partha Mandal

End-to-End Project Lifecycle

Business Case

Perf. Metrics
Financial Impact

DS/ML Solution

Data Discovery & EDA
Feature Engineering
Modelling & Validation

Production

Data Map & Feature Store
Deploy & A/B Test
Monitoring

My Role: Driving the Initiative End-to-End

I Drove...

The initial data discovery and EDA to validate project feasibility.

The entire feature engineering process, from aggregations to network features.

I Owned...

The business case and definition of success metrics (KPIs).

The final model selection and the A/B test analysis and recommendation.

I Led...

The overall DS/ML project strategy from conception to post-launch.

The critical data mapping process and the design of production monitoring.

The Business Case: A Three-Fold Opportunity

Cost Savings

Replace a 3rd-party vendor contract costing €XM annually.

Performance Uplift

Beat the vendor's black-box model to reduce chargebacks & customer friction.

Agility & Speed

Gain the ability to rapidly retrain and deploy, adapting to new fraud patterns in days, not months.

Measuring Success: KPIs and Financial Impact

ML Performance KPIs

Precision

"Are we right?" Of all transactions we flagged, what % were actually fraud? High precision reduces customer friction.

Recall

"Do we miss anything?" Of all actual fraud, what % did we catch? High recall reduces chargebacks.

F1-Score

The harmonic mean of Precision and Recall. A great single metric for imbalanced problems.
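As a quick illustration of how these three relate, here is a minimal sketch using scikit-learn (labels and predictions are made up):

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Illustrative labels: 1 = fraud (chargeback), 0 = legitimate order.
y_true = [0, 0, 1, 1, 1, 0, 1, 0]   # actual outcomes once chargebacks mature
y_pred = [0, 1, 1, 1, 0, 0, 1, 0]   # model decisions at a chosen threshold

precision = precision_score(y_true, y_pred)  # of flagged orders, share truly fraud
recall = recall_score(y_true, y_pred)        # of all fraud, share we caught
f1 = f1_score(y_true, y_pred)                # harmonic mean: 2*P*R / (P + R)
print(f"P={precision:.2f} R={recall:.2f} F1={f1:.2f}")
```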

Financial Impact Framework

The Bottom Line

The Total Impact provides the full business picture by combining the model's operational gains with the €XM annual contract savings.

*The 11% 3DS drop rate, used to calculate friction cost, was determined from prior, isolated A/B tests on challenged transactions.
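For illustration only, here is one plausible way to wire the framework together in code. Every monetary figure below is a placeholder, and the exact decomposition used in the project differed in detail:

```python
# Hypothetical total-impact sketch; all monetary figures are placeholders.
chb_saved_eur = 100_000          # additional chargebacks prevented vs. the vendor model
challenged_gmv_eur = 5_000_000   # GMV of legitimate orders the model sends to 3DS
drop_rate_3ds = 0.11             # share of challenged customers who abandon (prior A/B tests)
margin = 0.10                    # assumed contribution margin on lost GMV
contract_savings_eur = 1_000_000 # annual cost of the replaced 3rd-party contract

friction_cost_eur = challenged_gmv_eur * drop_rate_3ds * margin
operational_impact_eur = chb_saved_eur - friction_cost_eur
total_impact_eur = operational_impact_eur + contract_savings_eur
print(f"Operational: €{operational_impact_eur:,.0f} | Total: €{total_impact_eur:,.0f}")
```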

Project Acceptance Criteria

Green

Positive operational impact alone. Clear Go.

Yellow

Marginal operational impact. Still a Go.

Orange

Positive only after including contract savings. Go, but needs monitoring.

Red

Negative total financial impact. No-Go.

EDA: Key Insights That Shaped Our Strategy

Pareto Principle in Fraud & GMV

75% of chargebacks came from just 3 countries, while ~92% of GMV came from the top 8 countries.

Implication: We needed to pay close attention to model performance in these key segments, not just the global average.

Fraudulent Order Value

The average order value for transactions that resulted in a chargeback was significantly higher than for legitimate orders.

Implication: This confirmed that `amount` would be a powerful feature and that focusing on high-value transactions was critical.

Chargeback Maturity Time

Waiting 90 days for chargeback data to mature captured 91% of all final chargebacks.

Implication: I decided this was the optimal trade-off between data completeness and speed, allowing us to retrain models quarterly without waiting the full 180 days.
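A minimal sketch of the maturity analysis behind that decision (column names and dates are illustrative):

```python
import pandas as pd

# For each chargeback, how many days after the order did it arrive?
chb = pd.DataFrame({
    "order_date": pd.to_datetime(["2023-01-05", "2023-01-09", "2023-02-01"]),
    "chb_date":   pd.to_datetime(["2023-02-10", "2023-04-20", "2023-03-15"]),
})
chb["maturity_days"] = (chb["chb_date"] - chb["order_date"]).dt.days

# Share of final chargebacks already visible after each waiting window.
for window in (30, 60, 90, 180):
    share = (chb["maturity_days"] <= window).mean()
    print(f"{window:>3} days: {share:.0%} of chargebacks matured")
```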

Performance Segmentation

Based on these insights, I established a process to monitor performance across crucial segments.

Implication: We tracked F1, Precision, and Recall for top countries and for new vs. existing users to ensure the model was fair and effective for everyone.
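The segment report itself is only a few lines of pandas and scikit-learn; a hypothetical sketch with assumed column names:

```python
import pandas as pd
from sklearn.metrics import f1_score, precision_score, recall_score

# Hypothetical scored orders with a segment column.
df = pd.DataFrame({
    "country": ["DE", "DE", "FR", "FR", "FR", "IT"],
    "y_true":  [1, 0, 1, 0, 1, 0],
    "y_pred":  [1, 0, 0, 0, 1, 1],
})

# Per-segment metrics (the same loop runs for new vs. existing users).
for country, g in df.groupby("country"):
    p = precision_score(g.y_true, g.y_pred, zero_division=0)
    r = recall_score(g.y_true, g.y_pred, zero_division=0)
    f = f1_score(g.y_true, g.y_pred, zero_division=0)
    print(f"{country}: P={p:.2f} R={r:.2f} F1={f:.2f}")
```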

Feature Engineering at Scale

We engineered a rich set of features from ~100 raw data points per order.

Activity & User

  • Order/Card Counts
  • Amount (Sum/Avg)
  • Account/Card Age
  • Payment Type
  • Domestic/Foreign Card

Time Horizon Aggs

  • Aggregates over 3h, 1d, 7d, 30d, 90d
  • User, Device & Phone level
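Below is a minimal sketch of how such time-windowed aggregates can be computed with pandas; column names and windows are illustrative, and the production versions were computed in the feature store rather than in pandas:

```python
import pandas as pd

# Hypothetical order log, one row per order.
orders = pd.DataFrame({
    "user_id": [1, 1, 2, 1],
    "ts": pd.to_datetime(["2023-05-01 10:00", "2023-05-01 12:00",
                          "2023-05-02 08:00", "2023-05-06 09:00"]),
    "amount": [20.0, 35.0, 80.0, 15.0],
}).sort_values("ts")

# Per-user rolling aggregates over a 7-day window; the same pattern
# applies to 3h / 1d / 30d / 90d and to device- or phone-level keys.
agg_7d = (
    orders.set_index("ts")
          .groupby("user_id")["amount"]
          .rolling("7D")
          .agg(["count", "sum", "mean"])
)
print(agg_7d)
```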

Graph & Network

  • Network Size
  • Associated Device Counts
  • Associated Card Counts
  • Associated Email Counts
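Network size can be derived by linking entities that share a card, device, or email and taking connected-component sizes; a hypothetical sketch with networkx:

```python
import networkx as nx

# Hypothetical links between users and the cards/devices they used.
G = nx.Graph()
G.add_edges_from([
    ("user_1", "card_A"), ("user_2", "card_A"),      # two users share a card
    ("user_2", "device_X"), ("user_3", "device_X"),  # one of them shares a device
    ("user_4", "card_B"),                            # isolated, legitimate-looking user
])

# "Network size" = number of users reachable through shared cards/devices/emails.
network_size = {}
for component in nx.connected_components(G):
    users = [n for n in component if n.startswith("user_")]
    for u in users:
        network_size[u] = len(users)
print(network_size)   # user_1..user_3 form one ring of size 3; user_4 stands alone
```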

Advanced Behavioral

  • Email Domain
  • Phone Model, OS
  • Haversine Distance
  • Order Time/Day Patterns
  • Refund Count Rate
  • Cash Unpaid Rate
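The Haversine distance (e.g. between two locations observed for the same order; the exact coordinate pair used is an assumption here) is a few lines of plain Python:

```python
from math import asin, cos, radians, sin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    dlat, dlon = lat2 - lat1, lon2 - lon1
    a = sin(dlat / 2) ** 2 + cos(lat1) * cos(lat2) * sin(dlon / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(a))   # mean Earth radius ~ 6371 km

print(round(haversine_km(52.52, 13.40, 48.14, 11.58), 1))  # Berlin -> Munich, ~504 km
```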

DS/ML Solution: Validation & Modeling

Overall Validation Strategy

Training Data

Nov - Mar

Validation Set

Apr

Test Set

May

CHB Maturation

Jun - Aug

A strict time-based split was crucial to prevent data leakage and accurately simulate how the model would perform on future, unseen data.
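A minimal sketch of that split (illustrative year and column names):

```python
import pandas as pd

# Tiny stand-in for the scored order table; in practice this comes from the feature store.
df = pd.DataFrame({
    "order_ts": pd.to_datetime(["2022-12-10", "2023-02-03", "2023-04-15", "2023-05-20"]),
    "label":    [0, 1, 0, 1],
})

train = df[df["order_ts"] < "2023-04-01"]                                        # Nov - Mar
valid = df[(df["order_ts"] >= "2023-04-01") & (df["order_ts"] < "2023-05-01")]   # Apr
test  = df[(df["order_ts"] >= "2023-05-01") & (df["order_ts"] < "2023-06-01")]   # May
# Jun - Aug is left untouched so that May labels have ~90 days to mature before evaluation.
```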

Algorithms Explored

We benchmarked several gradient-boosting models and ruled out deep learning approaches due to strict real-time latency requirements.

XGBoost
CatBoost
LightGBM (Winner)

LightGBM provided the best balance of prediction performance and low-latency inference speed for our production environment.

DS/ML Solution: Optimization & Interpretability

Training & Optimization Workflow

1. Undersampling

Undersampled the majority class in the training set (testing 1%, 5%, 10%, 20% ratios). This was chosen for faster training iterations with minimal performance trade-off compared to oversampling. Steps 1-4 are sketched in code after step 5.

2. CV & Optimization

Ran 5-Fold Stratified CV on the sampled data. Used Optuna to optimize for PR-AUC, an ideal metric for imbalanced classification as it is not sensitive to a single decision threshold.

3. Final Training

Trained the final LightGBM model on the full, undersampled training data using the best hyperparameters discovered by Optuna and the most important features as ranked by SHAP.

4. Thresholding

Used the trained model to predict on the full validation set. Optimized decision thresholds to maximize financial impact for key segments (global, country, new/existing users).

5. Verification

Applied the chosen thresholds to the Test Set to verify performance generalization over time before deployment.
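The following is a condensed, hypothetical sketch of steps 1-4 on synthetic data; the real pipeline used the engineered features, the time-based split above, and a financial-impact objective for thresholding rather than the F1 proxy shown here:

```python
import lightgbm as lgb
import numpy as np
import optuna
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.metrics import average_precision_score, f1_score
from sklearn.model_selection import StratifiedKFold, train_test_split

# Synthetic stand-in for the engineered feature matrix (~1% fraud).
X, y = make_classification(n_samples=20_000, n_features=20, weights=[0.99], random_state=0)
X, y = pd.DataFrame(X), pd.Series(y)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.3, stratify=y, random_state=0)

# 1. Undersample the legitimate majority (here fraud becomes ~5% of training rows).
def undersample(X, y, target_fraud_rate=0.05, seed=42):
    rng = np.random.default_rng(seed)
    fraud, legit = np.flatnonzero(y == 1), np.flatnonzero(y == 0)
    n_legit = int(len(fraud) * (1 - target_fraud_rate) / target_fraud_rate)
    keep = np.concatenate([fraud, rng.choice(legit, min(n_legit, len(legit)), replace=False)])
    return X.iloc[keep], y.iloc[keep]

X_s, y_s = undersample(X_train, y_train)

# 2. 5-fold stratified CV with Optuna maximising PR-AUC (average precision).
def objective(trial):
    params = {
        "learning_rate": trial.suggest_float("learning_rate", 0.01, 0.3, log=True),
        "num_leaves": trial.suggest_int("num_leaves", 16, 256),
        "min_child_samples": trial.suggest_int("min_child_samples", 20, 500),
    }
    scores = []
    for tr, va in StratifiedKFold(5, shuffle=True, random_state=42).split(X_s, y_s):
        model = lgb.LGBMClassifier(n_estimators=300, verbosity=-1, **params)
        model.fit(X_s.iloc[tr], y_s.iloc[tr])
        scores.append(average_precision_score(y_s.iloc[va], model.predict_proba(X_s.iloc[va])[:, 1]))
    return float(np.mean(scores))

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=20)

# 3. Retrain on the full undersampled training data with the best hyperparameters.
final_model = lgb.LGBMClassifier(n_estimators=300, verbosity=-1, **study.best_params).fit(X_s, y_s)

# 4. Sweep decision thresholds on the validation set; production used estimated
#    financial impact per segment, here F1 stands in as a simple proxy.
probs = final_model.predict_proba(X_valid)[:, 1]
best_threshold = max(np.linspace(0.05, 0.95, 19), key=lambda t: f1_score(y_valid, probs >= t))
print("Best PR-AUC:", study.best_value, "| chosen threshold:", round(best_threshold, 2))
```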

Model Interpretability

To build trust and understand model behavior, I used two key techniques:

SHAP (SHapley Additive exPlanations): To explain global feature importance and debug individual predictions, ensuring the model was making decisions for the right reasons.
ALE (Accumulated Local Effects) Plots: To safely analyze how risk changed with feature values. This is more reliable than Partial Dependence Plots (PDPs) when features are correlated.
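For SHAP, the global summary is only a few lines; this assumes the trained LightGBM model and validation frame from the sketch above (ALE plots came from a separate library and are not shown):

```python
import shap

# Explain the tree ensemble and plot global feature importance / direction of effect.
explainer = shap.TreeExplainer(final_model)
shap_values = explainer.shap_values(X_valid)
shap.summary_plot(shap_values, X_valid)
```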

My Path to Production: From Silo to System

1. Data Mapping
2. Feature Store
3. Deployment
4. A/B Test Analysis
5. Monitoring

Impact & Results: A Clear Success (Green)

Additional CHBs Saved

€XX K

annually

Additional GMV Freed

€YY M

annually from 3DS

Production F1-Score

XX%

stable in production

Learnings & Future Work

Key Learnings

  • Business Context is King: The best features came from understanding fraudster behavior, not just algorithms.
  • Infrastructure First: Investing in a feature store early was a force multiplier that prevented major issues.
  • Collaboration is Non-Negotiable: Success depended on the tight loop between Data Science, Engineering, and Business.

What I Would Do Differently

  • Include LTV loss from false positives in the financial impact: false positives mean genuine customers face unnecessary friction, which might turn them away from the platform.
  • Explore Graph Neural Networks (GNNs): For a future iteration, I would explore using GNNs to more directly model the complex relationships between users, cards, and devices, potentially capturing sophisticated fraud rings that are harder to detect with traditional feature engineering.

Thank You

Q&A