AI Bias in Medical Coding: Risks, Regulations & Safeguards

When a healthcare organization deploys an AI system for medical coding, it inherits every bias embedded in that system's training data, model architecture, and optimization objectives. If the training data contains historical coding patterns where certain demographics were systematically undercoded — receiving lower-complexity E/M codes despite equivalent clinical presentations — the AI will learn to replicate those patterns. If payer policies that the model learned from applied different standards to different populations, the AI will perpetuate those standards. The AI does not introduce bias through malice. It introduces bias through mathematics — by learning patterns from data that was itself shaped by decades of inequity in healthcare delivery, documentation, and reimbursement.

This is not a hypothetical concern. Documented examples of algorithmic bias in healthcare have demonstrated that AI systems can produce systematically different outcomes for different patient populations — differences that correlate with race, ethnicity, socioeconomic status, gender, and geography. While most documented cases involve clinical AI rather than coding AI specifically, the mechanisms by which bias enters coding models are well understood and the regulatory framework for addressing them is rapidly evolving.

This guide examines how bias enters AI coding systems, what the regulatory landscape requires, how organizations can test for and mitigate bias, and what questions to ask AI vendors about their bias prevention practices.

How Bias Enters AI Coding Models

Training Data Skew

AI coding models learn from historical claims data, clinical documentation, and coding decisions. Every form of bias present in this historical data is available for the model to learn.

Documentation disparities: Research has consistently shown that clinical documentation quality and specificity vary by patient demographics. Studies published in journals including JAMA and Health Affairs have found that clinical notes for Black patients, patients with limited English proficiency, and patients with lower socioeconomic status tend to be shorter, less detailed, and contain fewer specific clinical findings than notes for white, English-speaking, and higher-income patients. An AI coding model trained on this documentation will tend to assign lower-complexity codes for these populations. The cause is not that the patients are less complex. It is that the documentation supporting the code is less detailed.

Historical coding patterns: If human coders historically assigned lower E/M levels to certain patient populations due to unconscious bias or documentation disparities, an AI model trained on those coding decisions will learn to assign lower levels for similar patients. The model treats historical coding decisions as ground truth, even when those decisions were systematically biased.

Payer mix effects: Training data that overrepresents certain payer types (for example, commercial insurance) and underrepresents others (for example, Medicaid or uninsured) can create models that perform better for patient populations associated with overrepresented payers. If a model is primarily trained on coding data from commercially insured patients, it may perform poorly when coding encounters for Medicaid populations, whose clinical presentations, documentation patterns, and applicable coding rules may differ.

Demographic Patterns in Historical Claims

Geographic bias: Coding patterns vary significantly by geography, reflecting regional differences in clinical practice, documentation norms, and payer requirements. An AI model trained predominantly on data from urban academic medical centers may perform poorly when applied to rural primary care practices — and vice versa.

Specialty bias: If training data is dominated by certain specialties, the model may perform less accurately for underrepresented specialties. When those underrepresented specialties disproportionately serve particular patient populations, the coding patterns of those populations are underrepresented as well, and the accuracy gap falls on them.

Age-related bias: Coding complexity for geriatric patients differs from coding for younger populations. If the training data does not adequately represent the full spectrum of age-related clinical complexity, the model may systematically undercode or overcode for certain age groups.

Payer Policy Bias

Different payers have different coverage policies, medical necessity criteria, and coding requirements. An AI model trained on data from one payer mix may apply coding logic that does not accurately reflect the requirements of other payers. This is not bias in the traditional demographic sense, but it can produce systematically different outcomes for patient populations that are associated with specific payer types — and in the United States, payer type correlates strongly with race, ethnicity, and socioeconomic status.

Optimization Objective Bias

The objectives used to train and optimize AI coding models can introduce bias even when the training data is balanced. If the model is optimized to maximize revenue, it may learn to upcode for populations where historical data shows upcoding was accepted by payers and to code conservatively for populations where claims were more frequently denied. If optimized for denial avoidance, the model may learn to avoid codes that are frequently denied — even when those codes are clinically appropriate — creating systematic undercoding for services associated with certain patient populations.

Documented Examples of AI Bias in Healthcare

While the specific body of research on AI bias in medical coding is still emerging, analogous examples from clinical AI and healthcare algorithms provide clear evidence that algorithmic bias in healthcare is real and consequential.

The Optum Algorithm

In a landmark 2019 study published in Science, researchers at UC Berkeley and collaborating institutions found that a widely used commercial algorithm for identifying patients who need extra care systematically underestimated the health needs of Black patients. The algorithm used healthcare costs as a proxy for health needs. Because Black patients historically had less access to healthcare and therefore incurred lower costs, the algorithm rated Black patients as healthier than equally sick white patients: at a given risk score, Black patients were significantly sicker than white patients with the same score. The authors noted that algorithms of this type are applied to an estimated 200 million people in the United States each year.

The relevance to coding AI: algorithms that use historical cost data or utilization patterns as features can perpetuate the same type of proxy discrimination. A coding model that learns associations between patient characteristics and historical billing patterns may reproduce cost-based disparities.

Dermatology AI Diagnostic Bias

Multiple studies have documented that AI dermatology diagnostic systems perform significantly worse on darker skin tones because their training data overrepresented lighter skin tones. While this is a clinical diagnostic issue rather than a coding issue, it illustrates how training data imbalance directly translates to differential performance across demographics.

Pulse Oximetry Bias

Research published in the New England Journal of Medicine found that pulse oximeters overestimate blood oxygen levels more often in patients with darker skin pigmentation, which can delay recognition of hypoxemia. Medical devices that produce biased clinical measurements generate biased clinical documentation, which in turn feeds biased data to AI coding systems.

Natural Language Processing Bias

Research has shown that NLP models — the same category of technology used in many AI coding systems — can exhibit bias based on the language patterns in their training data. If clinical documentation uses different language to describe similar conditions in different patient populations (which studies have shown it does), NLP-based coding systems may produce different coding outputs for clinically equivalent presentations.

Regulatory Landscape

Federal Regulatory Framework

FDA AI/ML Action Plan: The FDA's Artificial Intelligence/Machine Learning-Based Software as a Medical Device (SaMD) Action Plan, first published in 2021 and updated through 2025, addresses algorithmic bias in clinical AI systems. While most coding AI does not qualify as a medical device under FDA jurisdiction, the FDA's approach to evaluating algorithmic bias, including its emphasis on diverse training data, performance testing across subpopulations, and post-market monitoring, provides a useful template for coding AI evaluation.

HHS Section 1557 Nondiscrimination: Section 1557 of the Affordable Care Act prohibits discrimination on the basis of race, color, national origin, sex, age, or disability in health programs receiving federal financial assistance. In 2024, HHS finalized a rule clarifying that Section 1557 applies to the use of AI and other technologies in covered health programs. If an AI coding system produces systematically different outcomes for protected populations that result in differential access to healthcare services or insurance coverage, this may constitute a Section 1557 violation.

Executive Order on AI Safety (October 2023): Executive Order 14110 directed HHS to establish an AI safety program for healthcare and to develop a strategy for regulating AI in health and human services. The order created a policy framework rather than specific regulations, and it was rescinded in January 2025, so it no longer carries legal force. It remains useful as an indicator of how federal agencies have approached AI oversight in healthcare, but organizations should track current federal directives rather than treat it as settled policy.

NIST AI Risk Management Framework: The NIST AI RMF includes fairness and bias mitigation as core components of AI risk management. While not legally binding, the framework is referenced by federal agencies as a recommended standard for AI governance.

State, Local, and International Regulatory Framework

Colorado AI Act (SB 24-205): Effective in 2026, the Colorado AI Act is the first state law specifically requiring bias testing and impact assessments for high-risk AI systems, including those used in healthcare and insurance. Developers must provide documentation about training data, known limitations, and bias testing results. Deployers must conduct impact assessments evaluating the risk of algorithmic discrimination.

New York City Local Law 144: While focused on employment, NYC Local Law 144 established the precedent of requiring bias audits for automated decision-making systems. Healthcare organizations should anticipate that similar requirements will be extended to healthcare AI, particularly AI systems that affect access to care or insurance coverage.

Illinois Artificial Intelligence Video Interview Act and Analogues: Several states have enacted laws requiring transparency and bias testing for AI systems used in specific contexts. While these laws do not directly address medical coding, they establish a regulatory pattern that is expanding to healthcare applications.

EU AI Act Healthcare Provisions: The EU AI Act, which entered into force in 2024 and applies in phases over the following years, classifies many AI systems used in healthcare as high-risk and imposes requirements for bias testing, transparency, and human oversight on them. While directly applicable only to EU markets, the EU AI Act influences global regulatory trends and sets expectations that multinational healthcare organizations and AI vendors must meet.

Testing for Bias in Coding Accuracy

Stratified Performance Analysis

The most direct method for detecting bias in an AI coding system is to measure its accuracy across demographic subgroups and compare the results. This requires:

Step 1: Define subgroups. At minimum, evaluate accuracy across race/ethnicity, age, gender, payer type, geographic location, and primary language. Other relevant subgroups may include disability status, income level (using ZIP code-based proxies), and clinical complexity.

Step 2: Establish accuracy baselines. For each subgroup, measure the AI system's coding accuracy using the same methodology applied to overall accuracy assessment. This means comparing AI-generated codes against codes assigned by expert human reviewers.

Step 3: Compare accuracy across subgroups. Statistically compare accuracy rates between subgroups. Key metrics include:

Overall accuracy rate by subgroup
Upcoding rate by subgroup (frequency with which AI assigns a higher code than the expert reviewer)
Undercoding rate by subgroup (frequency with which AI assigns a lower code)
Revenue impact by subgroup (average revenue difference between AI-coded and expert-coded encounters)

Step 4: Evaluate clinical significance. Statistically significant differences between subgroups warrant attention, but clinical and financial significance matter more. A 1-percentage-point accuracy difference between subgroups may be statistically significant in a large sample yet clinically and financially trivial. A 10-percentage-point difference in upcoding rate between racial groups would typically be both statistically and substantively significant.
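The four steps above can be sketched in a few lines of Python. This is a minimal illustration, not a validated audit tool: the record layout (subgroup label, AI-assigned E/M level, expert-assigned E/M level), the function names, and the sample data are all assumptions made for the example, and the data is synthetic.

```python
from collections import defaultdict
from math import erf, sqrt

def subgroup_rates(records):
    """records: iterable of (subgroup, ai_level, expert_level), levels as integers."""
    counts = defaultdict(lambda: {"n": 0, "match": 0, "up": 0, "under": 0})
    for group, ai_level, expert_level in records:
        c = counts[group]
        c["n"] += 1
        if ai_level == expert_level:
            c["match"] += 1
        elif ai_level > expert_level:
            c["up"] += 1       # AI assigned a higher level than the expert
        else:
            c["under"] += 1    # AI assigned a lower level than the expert
    return {
        g: {
            "n": c["n"],
            "undercoded": c["under"],
            "accuracy": c["match"] / c["n"],
            "upcoding_rate": c["up"] / c["n"],
            "undercoding_rate": c["under"] / c["n"],
        }
        for g, c in counts.items()
    }

def two_proportion_z_test(x1, n1, x2, n2):
    """Two-sided z-test for a difference in proportions, using a pooled standard error."""
    pooled = (x1 + x2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    if se == 0:
        return 0.0, 1.0
    z = (x1 / n1 - x2 / n2) / se
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return z, p_value

# Synthetic example: two hypothetical subgroups, 100 expert-reviewed encounters each.
records = (
    [("A", 4, 4)] * 90 + [("A", 3, 4)] * 5 + [("A", 5, 4)] * 5
    + [("B", 4, 4)] * 75 + [("B", 3, 4)] * 20 + [("B", 5, 4)] * 5
)
rates = subgroup_rates(records)
z, p = two_proportion_z_test(
    rates["A"]["undercoded"], rates["A"]["n"],
    rates["B"]["undercoded"], rates["B"]["n"],
)
print(rates["A"]["undercoding_rate"], rates["B"]["undercoding_rate"], round(p, 4))
```

The z-test covers only Step 3. Whether a gap of this size matters clinically or financially (Step 4) remains a judgment the organization has to make against its own thresholds.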

Disparity Impact Analysis

Beyond accuracy, evaluate the downstream financial impact of any identified disparities:

Revenue per encounter by subgroup: Do certain patient populations generate systematically lower revenue per encounter after AI coding, controlling for clinical complexity?
Denial rate by subgroup: Are AI-coded claims for certain populations denied at higher rates?
Modifier application by subgroup: Are modifiers applied at different rates across subgroups?
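As a rough sketch of the first two measures, the example below compares revenue between two subgroups inside each clinical-complexity tier and then averages the per-tier gaps, which is one simple way to control for complexity. The field names (`group`, `tier`, `revenue`, `denied`) and the figures are hypothetical, and a production analysis would more likely use regression adjustment on real claims data.

```python
from collections import defaultdict
from statistics import mean

def revenue_gap_within_tiers(encounters, group_a, group_b):
    """Mean revenue difference (group_a minus group_b), computed within each
    complexity tier and averaged with weights equal to tier size."""
    by_tier = defaultdict(lambda: defaultdict(list))
    for e in encounters:
        by_tier[e["tier"]][e["group"]].append(e["revenue"])
    weighted_sum, weight = 0.0, 0
    for groups in by_tier.values():
        a, b = groups[group_a], groups[group_b]
        if a and b:  # skip tiers where one group has no encounters
            n = len(a) + len(b)
            weighted_sum += n * (mean(a) - mean(b))
            weight += n
    return weighted_sum / weight if weight else None

def denial_rate_by_group(encounters):
    """Share of AI-coded claims denied, per subgroup."""
    totals, denied = defaultdict(int), defaultdict(int)
    for e in encounters:
        totals[e["group"]] += 1
        denied[e["group"]] += int(e["denied"])
    return {g: denied[g] / totals[g] for g in totals}

# Synthetic encounters for two hypothetical subgroups.
encounters = [
    {"group": "A", "tier": "low", "revenue": 100, "denied": False},
    {"group": "A", "tier": "low", "revenue": 100, "denied": False},
    {"group": "B", "tier": "low", "revenue": 90, "denied": True},
    {"group": "B", "tier": "low", "revenue": 90, "denied": False},
    {"group": "A", "tier": "high", "revenue": 200, "denied": True},
    {"group": "B", "tier": "high", "revenue": 180, "denied": True},
]
gap = revenue_gap_within_tiers(encounters, "A", "B")
denials = denial_rate_by_group(encounters)
print(round(gap, 2), denials)
```

Comparing within tiers matters because a raw revenue gap can simply reflect different case mixes. A gap that persists among encounters of similar complexity is harder to explain away.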

Longitudinal Monitoring

Bias testing is not a one-time activity. AI models can develop or amplify bias over time as they are updated or as the patient population changes. Implement ongoing monitoring that tracks accuracy and outcome metrics by demographic subgroup at regular intervals.
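One way to operationalize this is a simple drift check that raises an alert when the accuracy gap between the best- and worst-served subgroups stays above a threshold for several reporting periods in a row. The threshold of 0.05, the two-period rule, and the quarterly figures below are illustrative assumptions, not recommended values.

```python
def flag_disparity_drift(history, threshold=0.05, consecutive=2):
    """history: chronological list of (period, {subgroup: accuracy}).
    Returns (period, gap) for each period in which the max-min accuracy gap
    has exceeded `threshold` for at least `consecutive` periods running."""
    alerts, streak = [], 0
    for period, accuracy_by_group in history:
        values = accuracy_by_group.values()
        gap = max(values) - min(values)
        streak = streak + 1 if gap > threshold else 0
        if streak >= consecutive:
            alerts.append((period, round(gap, 4)))
    return alerts

# Synthetic quarterly accuracy figures for two hypothetical subgroups.
history = [
    ("Q1", {"A": 0.92, "B": 0.91}),
    ("Q2", {"A": 0.92, "B": 0.86}),
    ("Q3", {"A": 0.93, "B": 0.85}),
    ("Q4", {"A": 0.92, "B": 0.90}),
]
alerts = flag_disparity_drift(history)
print(alerts)
```

Requiring consecutive breaches reduces noise from a single unusual period, at the cost of reacting one period later. Logging the model version alongside each period makes it easier to tell whether a shift followed a vendor update or a change in patient mix.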

Health Equity Implications

AI bias in medical coding has direct health equity implications beyond the billing process itself. Systematic undercoding for certain patient populations can lead to:

Reduced reimbursement for safety-net providers: Hospitals and clinics that serve disproportionately underserved populations rely on accurate coding to capture the full clinical complexity of their patient encounters. If AI systems systematically undercode for these populations, the providers that serve them receive lower reimbursement — reducing their capacity to provide care to the populations that need it most.

Inaccurate risk adjustment: Risk adjustment models used by Medicare Advantage, Medicaid managed care, and commercial health plans depend on accurate diagnosis coding. If AI systems undercode diagnoses for certain populations, those patients appear healthier than they are in risk adjustment calculations. Health plans receive lower capitation payments, which may result in fewer resources allocated to care for those populations.

Research and population health distortion: Health systems increasingly use billing data for population health analytics, quality measurement, and research. If AI coding bias produces systematically different coding patterns for different populations, downstream analytics based on that data will reflect the bias — potentially leading to misallocation of public health resources.

Widening existing disparities: Healthcare disparities in the United States are already well documented. AI systems that perpetuate or amplify these disparities through biased coding exacerbate a problem that the healthcare system is actively trying to address. This is not just a compliance concern — it is a public health concern.

Bias Mitigation Strategies

Training Data Interventions

Balanced representation: Ensure training data includes adequate representation of all relevant demographic groups. This may require oversampling underrepresented populations or supplementing training data with data from safety-net providers and community health centers.
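Where duplicating records is undesirable, a closely related alternative to oversampling is to reweight training examples so that each group contributes equally to the training loss. The sketch below computes such weights; the function name and group labels are hypothetical, and whether the weights can be used depends on the training framework accepting per-sample weights.

```python
from collections import Counter

def balancing_weights(groups):
    """Per-record weights that give every group the same total weight.
    Each record in group g gets n / (k * count_g), where n is the number of
    records and k is the number of distinct groups."""
    counts = Counter(groups)
    n, k = len(groups), len(counts)
    return [n / (k * counts[g]) for g in groups]

# Synthetic example: group B is underrepresented four to one.
groups = ["A"] * 8 + ["B"] * 2
weights = balancing_weights(groups)
total_a = sum(w for g, w in zip(groups, weights) if g == "A")
total_b = sum(w for g, w in zip(groups, weights) if g == "B")
print(weights[0], weights[-1], total_a, total_b)
```

Reweighting does not add information about the underrepresented group. With very small groups the weights become large and the model can overfit to a handful of records, which is the case where supplementing with additional data is the better option.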

Data quality equalization: Address documentation quality disparities in training data by normalizing documentation detail across subgroups or by using expert-reviewed coding (rather than historically submitted codes) as training labels.

Temporal debiasing: If historical coding practices have improved over time (for example, if documentation equity initiatives have reduced disparities in recent years), weight more recent data more heavily in training.
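A common way to weight recent data more heavily is exponential decay, sketched below. The two-year half-life is an arbitrary assumption for illustration; an appropriate value depends on how quickly documentation practices have actually changed.

```python
def recency_weight(age_years, half_life_years=2.0):
    """Training weight that halves for every `half_life_years` of record age."""
    return 0.5 ** (age_years / half_life_years)

# Weights for records that are 0, 2, and 4 years old.
weights = [recency_weight(age) for age in (0, 2, 4)]
print(weights)
```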

Model Architecture Interventions

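Fairness constraints: Incorporate fairness constraints into the model training process that penalize differential performance across demographic subgroups. Techniques include equalized odds (requiring similar true positive and false positive rates across groups), demographic parity (requiring similar rates of a given prediction across groups), and calibration (requiring that predicted probabilities match observed outcomes equally well in every group). These criteria generally cannot all be satisfied at once when base rates differ between groups, so the choice of metric should be deliberate and documented.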
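Before any of these criteria can be enforced in training, they have to be measured. The sketch below computes simple gap versions of the three for a binary framing, assumed here for illustration: the positive class is "the expert reviewer assigned a high-complexity code," `y_true` is the expert label, `y_pred` is the AI's decision, and `probs` is the AI's predicted probability. The function names and data are hypothetical.

```python
def group_rates(y_true, y_pred):
    """True positive rate, false positive rate, and positive prediction rate."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    pos = sum(1 for t in y_true if t)
    neg = len(y_true) - pos
    return {
        "tpr": tp / pos if pos else float("nan"),
        "fpr": fp / neg if neg else float("nan"),
        "positive_rate": sum(1 for p in y_pred if p) / len(y_pred),
    }

def fairness_gaps(data):
    """data: {group: (y_true, y_pred)}. Returns max-min gaps across groups."""
    rates = {g: group_rates(t, p) for g, (t, p) in data.items()}

    def gap(key):
        values = [r[key] for r in rates.values()]
        return max(values) - min(values)

    return {
        # Equalized odds compares both error-rate components across groups.
        "equalized_odds_gap": max(gap("tpr"), gap("fpr")),
        # Demographic parity compares how often the positive class is predicted.
        "demographic_parity_gap": gap("positive_rate"),
    }

def calibration_gap(probs, y_true):
    """Coarse calibration check for one group: mean predicted probability
    versus the observed positive rate."""
    return abs(sum(probs) / len(probs) - sum(y_true) / len(y_true))

# Synthetic labels for two hypothetical subgroups with identical true complexity.
data = {
    "A": ([1, 1, 1, 1, 0, 0, 0, 0], [1, 1, 1, 1, 0, 0, 0, 0]),
    "B": ([1, 1, 1, 1, 0, 0, 0, 0], [1, 1, 0, 0, 0, 0, 0, 0]),
}
gaps = fairness_gaps(data)
print(gaps)
```

In this synthetic case the model misses half of group B's truly high-complexity encounters, so the equalized odds gap (0.5) exposes the undercoding directly. The calibration check shown is the coarsest form, and a fuller audit would bin predictions by probability range.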

Demographic-blind features: Evaluate whether the model uses features that serve as proxies for demographic characteristics (ZIP code, payer type, language preference). If these features contribute to bias without improving overall accuracy, consider removing them.
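A first screening step is to measure how strongly a candidate feature is associated with a demographic attribute. The sketch below uses Cramér's V, a standard association measure for categorical variables that ranges from 0 (no association) to 1 (the feature fully determines the attribute). The feature values and group labels are synthetic, and a high value shows only that the feature can act as a proxy, not that it contributes to biased output.

```python
from collections import Counter
from math import sqrt

def cramers_v(xs, ys):
    """Cramér's V between two categorical sequences of equal length."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    x_counts, y_counts = Counter(xs), Counter(ys)
    chi2 = 0.0
    for x, nx in x_counts.items():
        for y, ny in y_counts.items():
            expected = nx * ny / n
            chi2 += (joint[(x, y)] - expected) ** 2 / expected
    k = min(len(x_counts), len(y_counts)) - 1
    return sqrt(chi2 / (n * k)) if k > 0 else 0.0

# Synthetic example: ZIP region perfectly separates the two groups,
# while visit type is unrelated to group membership.
group = ["A"] * 5 + ["B"] * 5
zip_region = ["z1"] * 5 + ["z2"] * 5
visit_type = ["new", "est", "new", "est", "new", "est", "new", "est", "new", "est"]
v_zip = cramers_v(zip_region, group)
v_visit = cramers_v(["new", "new", "est", "est"], ["A", "B", "A", "B"])
print(round(v_zip, 3), round(v_visit, 3), round(cramers_v(visit_type, group), 3))
```

Features that score high here are candidates for an ablation test: retrain or re-score without the feature and compare both overall accuracy and the subgroup gaps before deciding to remove it.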

Ensemble approaches: Use multiple models with different architectures and training data, and aggregate their outputs. Ensemble methods can reduce individual model biases if the biases are uncorrelated across models.
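The simplest aggregation rule is a majority vote, sketched below with ties escalated to a human coder rather than broken arbitrarily. The escalation rule is an assumption made for this example, and real ensembles often weight models by validated accuracy instead.

```python
from collections import Counter

def majority_vote(predictions):
    """predictions: codes proposed by different models for one encounter.
    Returns the most common code, or None on a tie (route to human review)."""
    ranked = Counter(predictions).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None
    return ranked[0][0]

print(majority_vote(["99214", "99214", "99213"]))
print(majority_vote(["99213", "99214"]))
```

Voting helps only when the models err in different places. If every model was trained on the same skewed historical labels, they will tend to agree on the biased answer, and the ensemble will reproduce it with more apparent confidence.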

Operational Interventions

Differential human review: If bias testing reveals that AI accuracy is lower for certain subgroups, implement enhanced human review for encounters involving those subgroups until the disparity is resolved.
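In a coding workflow this can be a small routing rule placed ahead of claim submission. The sketch below assumes encounters are represented as dictionaries of attributes and that bias testing has produced a set of (attribute, value) pairs with measured accuracy gaps. Both the representation and the names are hypothetical.

```python
def review_tier(encounter, flagged):
    """encounter: {attribute: value}. flagged: set of (attribute, value) pairs
    for which bias testing found lower AI accuracy.
    Returns the review tier and the attributes that triggered it."""
    hits = [(k, v) for k, v in encounter.items() if (k, v) in flagged]
    return ("enhanced", hits) if hits else ("standard", [])

# Hypothetical findings from a prior bias audit.
flagged = {("payer_type", "medicaid"), ("primary_language", "non_english")}

tier, reasons = review_tier(
    {"payer_type": "medicaid", "primary_language": "english"}, flagged
)
print(tier, reasons)
```

Recording the triggering attributes alongside the tier gives reviewers context and produces the audit trail needed to show when a disparity has closed and the enhanced review can be retired.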

Feedback loops: Establish mechanisms for coders to flag potential bias in AI suggestions. Track these flags by demographic category to identify patterns.
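Raw flag counts can mislead when subgroups differ in volume, so it helps to normalize them. The sketch below reports flags per 1,000 AI-coded encounters for each subgroup. The input shapes and the numbers are assumptions for illustration.

```python
from collections import Counter

def flags_per_1000(flags, encounter_counts):
    """flags: one subgroup label per coder-raised flag.
    encounter_counts: {subgroup: total AI-coded encounters in the period}."""
    flagged = Counter(flags)
    return {g: 1000 * flagged[g] / n for g, n in encounter_counts.items() if n}

# Synthetic period: group B has fewer encounters but more flags.
flags = ["A"] * 3 + ["B"] * 6
rates = flags_per_1000(flags, {"A": 3000, "B": 2000})
print(rates)
```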

Regular retraining: Periodically retrain models on updated data that reflects current documentation practices and coding standards, reducing the influence of historical biases.

Vendor Questions About Bias Testing

When evaluating AI coding vendors, healthcare organizations should ask specific questions about the vendor's approach to bias testing and mitigation:

Training Data Questions

What is the demographic composition of your training data (race/ethnicity, age, gender, payer mix, geography)?
What steps have you taken to ensure balanced representation in training data?
Were training labels based on historically submitted codes or expert-reviewed codes?
How frequently is training data updated, and how is representativeness maintained over time?

Testing and Validation Questions

Do you conduct stratified accuracy testing across demographic subgroups?
What fairness metrics do you use (equalized odds, demographic parity, calibration)?
Can you provide bias audit results showing accuracy rates by race/ethnicity, age, gender, and payer type?
What is your threshold for acceptable disparity between subgroups?
Do you engage independent third parties to conduct bias audits?

Mitigation Questions

What bias mitigation techniques are incorporated into your model training process?
How do you address identified disparities when they are detected?
Do you provide tools for customers to conduct their own bias testing using their patient population?
How does your model handle encounters with limited or lower-quality documentation?

Transparency Questions

Will you provide a model card or documentation describing known limitations and performance characteristics across subgroups?
Do you comply with the transparency requirements of the Colorado AI Act or comparable regulations?
How do you communicate model updates that may affect performance across subgroups?

Accountability Questions

What is your process for responding to customer-identified bias issues?
Do your terms of service or contracts include commitments regarding fairness and non-discrimination?
Do you maintain a bias incident log, and can customers access it?
Have you received any complaints, regulatory inquiries, or litigation related to algorithmic bias?

Building an Organizational Bias Prevention Framework

Governance

Designate a responsible individual or committee for AI fairness oversight. This may be the compliance officer, a health equity officer, or a cross-functional committee. Establish clear accountability for bias monitoring, reporting, and remediation.

Policy

Develop a written AI fairness policy that articulates the organization's commitment to equitable AI coding, defines unacceptable disparity thresholds, establishes testing and monitoring requirements, and specifies corrective action procedures.

Measurement

Implement the stratified performance analysis and disparity impact analysis described above. Report results to leadership at defined intervals. Track trends over time and across AI model updates.

Remediation

When bias is identified, take corrective action proportional to the severity of the disparity. This may range from enhanced human review for affected subgroups to system configuration changes, vendor engagement for model retraining, or system replacement if the vendor cannot or will not address the issue.

Transparency

Communicate the organization's AI fairness practices to patients, staff, and regulators. Transparency about bias testing and mitigation builds trust and demonstrates good-faith compliance with evolving regulatory requirements.

The Road Ahead

AI bias in medical coding is a problem that exists at the intersection of technology, healthcare delivery, and structural inequity. No AI system will be perfectly unbiased, because the data it learns from reflects an imperfect system. But the goal is not perfection — it is measurable, continuous improvement toward equitable performance, accompanied by transparency about known limitations and active mitigation of identified disparities.

The regulatory environment is moving decisively toward requiring this kind of accountability. The Colorado AI Act, Section 1557 enforcement, and the broader federal AI governance framework signal that organizations deploying AI in healthcare will be expected to demonstrate that they have tested for bias, measured its impact, and taken reasonable steps to mitigate it. Organizations that build these capabilities now — before regulations compel them — will be better positioned for compliance, better equipped to serve all patient populations equitably, and better aligned with the fundamental purpose of healthcare.