Our Methodology

Explore our rigorous methodologies for benchmarking and testing AI-driven identity verification solutions.

Benchmarking Methodology

Our benchmarking approach systematically combines the scale and efficiency of advanced AI with the indispensable nuance and accuracy of expert human oversight through a rigorous, four-phase process.

Phase 1: Defining the Competitive Arena

Company Selection

We identify the 20 leading IDV providers in the United States, including established market leaders, high-growth challengers, and significant incumbents.

Framework Architecture

We evaluate each company across seven critical domains:

Company Overview

Corporate health, leadership stability, financial transparency

Product Offering

Breadth, depth, and integration capabilities

Core Technology Performance

AI/ML models, biometric accuracy, system architecture

Product Experience

User experience, platform usability, developer-friendliness

Additional Services

Fraud intelligence, consortium data, value-add services

Go-To-Market Strategy

Sales effectiveness, pricing models, partnerships

Deep Fraud Detection

Capability against deepfakes, injection attacks, synthetic identities

Each domain contains 7-27 granular subcategories, creating 150+ distinct evaluation points per company.

Phase 2: High-Volume Data Aggregation via Proprietary AI

Engineered Prompting

Meticulously designed prompts for each of the 150+ subcategories

Source Integrity

AI synthesizes from curated high-integrity sources (SEC filings, financial reports, Gartner, G2, NIST publications)

Structured Output

Captures findings, complete URL lists, and confidence scores with cross-reference validation
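To make the structured output concrete, here is a minimal sketch of what a per-subcategory record could look like; the field names and example values are illustrative assumptions, not the production schema.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class SubcategoryFinding:
    """Hypothetical structured record for one of the 150+ evaluation subcategories."""
    company: str
    domain: str                 # e.g. "Core Technology Performance"
    subcategory: str            # e.g. "Biometric accuracy" (illustrative)
    finding: str                # synthesized answer text
    sources: List[str] = field(default_factory=list)  # complete URL list
    confidence: str = "medium"  # "high" | "medium" | "low"

# Illustrative example only; the provider, URL, and finding text are placeholders.
record = SubcategoryFinding(
    company="ExampleIDV Inc.",
    domain="Core Technology Performance",
    subcategory="Biometric accuracy",
    finding="Vendor publishes accuracy figures benchmarked against NIST evaluations.",
    sources=["https://example.com/vendor-accuracy-report"],
    confidence="high",
)
print(record.confidence, len(record.sources))
```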

Phase 3: Expert Human Verification & Curation

Line-by-Line Validation

Industry analysts with deep domain expertise conduct complete audits of AI-generated datasets

Error Correction

Team corrects factual errors, reconciles conflicting data, and adds contextual insights AI cannot provide

Phase 4: Multi-Point Scoring & Confidence-Weighted Analysis

Granular Scoring Rubric

Multi-point system translates quantitative and qualitative attributes into normalized scores

Confidence-Weighted Model

High Confidence (x 1.0): SEC filings and NIST reports receive full weight

Medium Confidence (x 0.8): single news articles receive adjusted weight

Low Confidence (x 0.5): tangential or outdated sources receive reduced weight

Final Output

Confidence-adjusted scores are aggregated into a 0-to-5 score for each domain, and a weighted average across domains produces the overall benchmark score.
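As an illustration of how the confidence-weighted model could be applied, the sketch below aggregates rubric scores into domain and overall scores; the weight multipliers mirror the model above, while the domain names, scores, and domain weights are hypothetical.

```python
from typing import Dict, List, Tuple

# Confidence multipliers from the model above.
CONFIDENCE_WEIGHTS = {"high": 1.0, "medium": 0.8, "low": 0.5}

def domain_score(findings: List[Tuple[float, str]]) -> float:
    """Aggregate (rubric_score, confidence) pairs for one domain into a 0-to-5 score."""
    weighted = [(score * CONFIDENCE_WEIGHTS[conf], CONFIDENCE_WEIGHTS[conf])
                for score, conf in findings]
    total_weight = sum(w for _, w in weighted)
    return sum(s for s, _ in weighted) / total_weight if total_weight else 0.0

def overall_benchmark(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted average of per-domain scores; the domain weights here are hypothetical."""
    return sum(scores[d] * w for d, w in weights.items()) / sum(weights.values())

# Hypothetical inputs for a single provider.
scores = {
    "Core Technology Performance": domain_score([(4.5, "high"), (4.0, "medium")]),
    "Deep Fraud Detection": domain_score([(3.5, "high"), (3.0, "low")]),
}
weights = {"Core Technology Performance": 0.6, "Deep Fraud Detection": 0.4}
print(round(overall_benchmark(scores, weights), 2))  # 3.9
```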

Testing Sample Methodology

Our testing methodology determines the minimum sample sizes required to estimate algorithm performance with statistical confidence, accounting for real-world data collection complexities.

Core Objective

We evaluate image categorization algorithm performance using confusion matrix analysis, focusing on True Positive Rate (TPR) and True Negative Rate (TNR) estimation with specified confidence levels.
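For reference, TPR and TNR follow directly from confusion matrix counts; the sketch below applies the standard definitions to hypothetical counts.

```python
def rates_from_confusion_matrix(tp: int, fn: int, tn: int, fp: int) -> tuple:
    """True Positive Rate (sensitivity) and True Negative Rate (specificity)."""
    tpr = tp / (tp + fn)  # share of genuine samples correctly accepted
    tnr = tn / (tn + fp)  # share of non-genuine samples correctly rejected
    return tpr, tnr

# Hypothetical counts from a single evaluation run.
tpr, tnr = rates_from_confusion_matrix(tp=990, fn=10, tn=985, fp=15)
print(f"TPR = {tpr:.3f}, TNR = {tnr:.3f}")  # TPR = 0.990, TNR = 0.985
```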

Key Statistical Parameters

Confidence Level: 95% (α = 0.05)

Target Precision: ±1% (full confidence interval width W = 0.02)

Expected Accuracy Range: 95-99.5%

Sample Size Calculation Framework

Single Observation Analysis

Formula

n ≈ 4 × z_{α/2}² × p(1-p) / W²

where z_{α/2} ≈ 1.96 at the 95% confidence level, p is the expected accuracy, and W is the full confidence interval width (0.02).

Sample Size Examples

99.5% expected accuracy: ~191 samples needed

99% expected accuracy: ~380 samples needed

95% expected accuracy: ~1,825 samples needed

Key Insight: Sample size is highly sensitive to expected algorithm accuracy.
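The sample sizes above can be reproduced from the formula; a minimal sketch, assuming the normal-approximation formula shown earlier and using Python's standard library for the critical value:

```python
from statistics import NormalDist

def sample_size(p: float, width: float = 0.02, alpha: float = 0.05) -> float:
    """Approximate samples needed to estimate a rate p within a CI of full width `width`."""
    z = NormalDist().inv_cdf(1 - alpha / 2)  # ≈ 1.96 at the 95% confidence level
    return 4 * z**2 * p * (1 - p) / width**2

for accuracy in (0.995, 0.99, 0.95):
    print(f"p = {accuracy}: ~{sample_size(accuracy):.0f} samples")  # ~191, ~380, ~1825
```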

Multi-Level Structure Analysis

Real-World Consideration: Multiple data points per user (different angles, devices, times) create a clustered data structure.

Design Effect Formula

DE = 1 + (m-1)ρ
n = n_effective × DE
m: Number of images per individual (typically 5)
ρ: Intraclass correlation coefficient (0.5 to 0.95 range)
n_effective: Sample size from the single-observation formula above

Practical Examples with 5 images per user (assuming 99% expected accuracy, so n_effective ≈ 380)

ρ = 0.5: 1,140 total samples (~228 users needed)

ρ = 0.7: 1,445 total samples (~289 users needed)

ρ = 0.95: 1,825 total samples (~365 users needed)
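A minimal sketch of the design-effect adjustment, assuming m = 5 images per user and n_effective ≈ 380 (the 99% accuracy case); small differences from the figures above are due to rounding.

```python
import math

def clustered_sample_size(n_effective: float, m: int, rho: float) -> tuple:
    """Inflate a single-observation sample size for clustered (per-user) data."""
    design_effect = 1 + (m - 1) * rho        # DE = 1 + (m-1)ρ
    total_images = n_effective * design_effect
    users_needed = math.ceil(total_images / m)
    return round(total_images), users_needed

for rho in (0.5, 0.7, 0.95):
    print(rho, clustered_sample_size(380, 5, rho))
    # prints (1140, 228), (1444, 289), (1824, 365); within rounding of the examples above
```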

Synthetic Data Evaluation

Current Limitations

Key Finding: Synthetic data cannot reliably substitute for real user data in algorithm evaluation.

  • Synthetic datasets are not designed to closely emulate real-world data distributions
  • Algorithm sensitivity to distributional shifts varies significantly
  • Limited validation of synthetic-to-real performance correlation

Recommendation

Primary Approach

Prioritize real data collection for evaluation

Secondary Option

Use synthetic data for comparative analysis only, keeping cost investment in it minimal