Statistical flags indicate unusual patterns — not proof of fraud or wrongdoing. Read our methodology

ML scores are now integrated into the Unified Risk Watchlist

ML fraud similarity scores have been combined with our 9 statistical tests into a single unified risk system. Providers are ranked by a combination of statistical flags and ML scores into unified tiers: Critical, High, Elevated, and ML Flag. View the Risk Watchlist →

ML Methodology

How our random forest model works: it is trained on 514 confirmed-excluded providers from the OIG LEIE database, and it scores 594K active Medicaid providers for fraud similarity.

Model: Random Forest (ensemble classifier)
AUC Score: 0.7762 (5-fold cross-validation)
Providers Scored: 594K (active Medicaid providers)
Training Labels: 514 (OIG-excluded providers)
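The setup described above (514 positive labels, 5-fold cross-validated AUC) can be sketched as follows. The feature matrix here is random placeholder data, and the scikit-learn hyperparameters (`n_estimators`, `class_weight`) are illustrative assumptions, not the production configuration:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Placeholder feature matrix: rows = providers, columns = billing features
# (payments per month, total payments, etc.). Real values come from claims data.
n_pos, n_neg, n_features = 514, 10_000, 14
X = rng.normal(size=(n_pos + n_neg, n_features))
y = np.concatenate([np.ones(n_pos), np.zeros(n_neg)])  # 1 = OIG-excluded

clf = RandomForestClassifier(n_estimators=200, class_weight="balanced",
                             random_state=0)
# 5-fold cross-validated AUC, the metric reported above
auc_scores = cross_val_score(clf, X, y, cv=5, scoring="roc_auc")
print(f"mean CV AUC: {auc_scores.mean():.4f}")
```

On the real features this procedure yields the 0.7762 reported above; on random placeholder data it will hover near 0.5.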

Two Complementary Approaches to Fraud Detection

Statistical Tests

9 rule-based tests that flag specific, explainable anomalies in billing behavior.

  • Identifies exact codes, ratios, and dollar amounts
  • Human-readable explanations for every flag
  • Code-specific benchmarks (9,578 codes)
  • Catches billing swings, outlier pricing, new entrants

ML Model

Pattern matching against 514 confirmed fraud cases from the OIG exclusion list.

  • Learns complex multi-feature fraud signatures
  • Catches patterns humans might miss
  • Scores every provider on a 0–100% scale
  • Validated via full-dataset cross-validation

Why both matter: Statistical tests are precise and explainable — they tell you exactly what's unusual. ML captures subtler patterns across multiple features simultaneously. A provider flagged by both methods is significantly more likely to warrant investigation. The unified Risk Watchlist combines both signals into a single ranked view.
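One way to picture the combination is a simple tiering rule over the two signals. Everything below (the function name, flag counts, and score cutoffs) is a hypothetical sketch, not the published watchlist logic; the cutoffs loosely echo the score percentiles reported elsewhere on this page:

```python
def unified_tier(stat_flags: int, ml_score: float) -> str:
    """Map a statistical flag count and an ML score (0-1) to a risk tier.
    All thresholds are hypothetical placeholders."""
    if stat_flags >= 3 and ml_score >= 0.69:   # strong agreement between methods
        return "Critical"
    if stat_flags >= 2 or ml_score >= 0.69:
        return "High"
    if stat_flags >= 1 or ml_score >= 0.50:
        return "Elevated"
    if ml_score >= 0.42:                       # ML signal only
        return "ML Flag"
    return "None"

print(unified_tier(3, 0.75))  # prints "Critical"
```

The key property this illustrates: a provider flagged strongly by both methods outranks one flagged by either alone.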

Feature Importance

How much each feature contributes to the model's fraud-similarity predictions. Importance values are derived from the trained random forest's Gini impurity decrease across all decision trees.
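A minimal sketch of how these numbers are read off a fitted scikit-learn forest; the toy data and feature names below are invented for illustration:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(1)
feature_names = ["payments_per_month", "total_payments", "claims_per_month"]
X = rng.normal(size=(500, len(feature_names)))
# Make the first feature genuinely predictive so it earns most of the importance.
y = (X[:, 0] + 0.1 * rng.normal(size=500) > 0).astype(int)

rf = RandomForestClassifier(n_estimators=100, random_state=1).fit(X, y)

# feature_importances_ is the mean decrease in Gini impurity attributed to
# each feature across all trees, normalized to sum to 1.
for name, imp in sorted(zip(feature_names, rf.feature_importances_),
                        key=lambda t: -t[1]):
    print(f"{name}: {imp:.1%}")
```

Because the importances are normalized, the percentages in the table below sum to 100%.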

Payments Per Month: 14.2%
Total Payments: 12.8%
Claims Per Month: 11.2%
Total Claims: 9.8%
Cost Per Claim: 9.1%
Cost Per Beneficiary: 8.2%
Total Beneficiaries: 7.1%
Claims Per Beneficiary: 6.3%
Top Code Concentration: 5.6%
Active Months: 4.8%
Unique Procedure Codes: 3.9%
Self-Billing Ratio: 3.1%
Short Burst Billing: 2.2%
Low Codes / High Spend: 1.7%

What the Top Features Mean

Understanding why these features matter for fraud detection:

Payments Per Month

How much a provider bills per active month. Fraudulent providers often bill at extremely high monthly rates because they're trying to extract maximum money before detection.

Total Payments

The total amount of Medicaid money received. While large legitimate organizations bill high amounts, an outsized total combined with other red flags is a strong signal.

Claims Per Month

The volume of claims filed each month. Unusually high claim velocity — especially combined with few unique codes — suggests automated or fabricated billing.

Cost Per Claim

The average charge per individual claim. Legitimate providers cluster around their specialty's median. Far above that suggests upcoding or inflated billing.

Cost Per Beneficiary

How much is billed per individual patient. Fraud schemes often bill enormous amounts per patient — sometimes for patients who never received services.

Top Code Concentration

What fraction of billing goes to a single procedure code. Legitimate practices bill diverse codes; 'fraud mills' repeatedly bill one lucrative code.

Score Distribution

Most providers score very low. Only the top percentiles show patterns consistent with known fraud.

Median (p50): 10% (typical provider)
p90: 42% (top 10%)
p95: 50% (top 5%)
p99: 69% (top 1%)
p99.9: 82% (top 0.1%)
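These cut points are just order statistics of the score vector. A sketch with synthetic right-skewed scores standing in for the real model output:

```python
import numpy as np

rng = np.random.default_rng(2)
# Placeholder: a right-skewed distribution in [0, 1], mimicking the shape
# described above (most providers low, a thin high-scoring tail).
scores = rng.beta(1.5, 10, size=100_000)

for p in (50, 90, 95, 99, 99.9):
    print(f"p{p}: {np.percentile(scores, p):.0%}")
```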

Cross-Validation: Full-Dataset Training

To validate our approach, we trained three models on Google Colab (12GB RAM) using all 594,235 providers that met the minimum billing thresholds for ML scoring. The results confirm our subsampled model as the strongest performer.

Random Forest (Full): 0.7656 AUC (594K training samples)
Gradient Boosting: 0.6815 AUC (594K training samples)
Logistic Regression: 0.6812 AUC (594K training samples)

Key finding: Our production model (subsampled, AUC 0.7762) outperforms the full-dataset Random Forest (AUC 0.7656). This is because strategic subsampling — using 10K negative samples instead of 593K — reduces noise from the massive legitimate-provider class, allowing the model to better learn fraud patterns. The top-ranked providers are nearly identical across both models, confirming the robustness of our scoring.
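The subsampling step itself is straightforward: keep every positive label and draw a random subset of negatives. The counts below match the text; the data is synthetic:

```python
import numpy as np

rng = np.random.default_rng(3)
y_full = np.zeros(594_235, dtype=int)
y_full[:514] = 1  # OIG-excluded labels

pos_idx = np.flatnonzero(y_full == 1)                 # keep all 514 positives
neg_idx = rng.choice(np.flatnonzero(y_full == 0),
                     size=10_000, replace=False)      # 10K of ~593K negatives
train_idx = np.concatenate([pos_idx, neg_idx])

print(len(train_idx))            # 10514 training rows
print(y_full[train_idx].mean())  # positive rate ~4.9%, vs ~0.09% in the full data
```

This rebalancing raises the positive rate roughly fifty-fold, letting the trees spend their splits on fraud signal rather than on modeling the bulk of legitimate providers.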

Model Performance & Limitations

AUC of 0.7762 indicates moderate discriminative ability. The model is better than random chance but should be considered a screening tool, not definitive evidence. We are working to improve model performance through additional features and refined training data.

At a 0.5 classification threshold, the model favors recall over precision — it casts a wide net to avoid missing potentially anomalous providers, at the cost of more false positives. In practice, this means many flagged providers will be legitimate upon closer review.
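The precision/recall trade-off can be made concrete on synthetic scores; the prevalence and score distribution here are invented for illustration:

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

rng = np.random.default_rng(4)
y_true = rng.random(5_000) < 0.01   # ~1% true positives
# Scores loosely correlated with the label, standing in for model output.
scores = np.clip(0.4 * y_true + rng.beta(2, 6, size=5_000), 0, 1)

y_pred = scores >= 0.5  # the 0.5 classification threshold discussed above
print(f"recall:    {recall_score(y_true, y_pred):.2f}")
print(f"precision: {precision_score(y_true, y_pred):.2f}")
```

At this cutoff most true positives are caught (high recall) while many flagged cases turn out to be false positives (low precision), mirroring the behavior described above.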

Key Limitations

  • Training labels are based on OIG exclusions, which include non-fraud reasons (e.g., student loan default, license revocation) — this introduces label noise.
  • No temporal validation yet — the model has not been tested on held-out future time periods to confirm it generalizes beyond the training window.
  • Feature set is limited to billing aggregates; clinical context and audit outcomes are not yet incorporated.

Important Disclaimer

ML scores identify statistical patterns similar to known fraud cases. A high score is not evidence of fraud. Many legitimate providers may score highly due to unusual but lawful billing patterns (e.g., specialized practices, government entities, high-volume home care). These scores should be used as one input among many in any investigation.