Data Quality

From The Foundation for Best Practices in Machine Learning


Objective
To ensure Data Quality and to prevent unintended effects, changes and/or deviations in Product and Model outputs caused by poor-quality Product data.


12.1. Exploration

Objective
To determine whether the quality of the data is sufficient, or can be made sufficient, to achieve the Product Definitions.
Item nr. | Item Name and Page | Control | Aim
12.1.1. Data Definitions

Document all data dimensions and ensure that every subtlety of their definitions is clear, including but not limited to gathering methods, allowed values and collection frequency. If a definition is not clear, acquire the missing knowledge or discard the dimension.

To (a) assess and prevent unjustified assumptions about the meaning of a data dimension or its values; and (b) highlight associated risks that might occur in the Product Lifecycle.
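
One way to make such definitions checkable is to hold them in a machine-readable data dictionary. The following is a minimal sketch, not a prescribed format: the class `DimensionDefinition`, the dictionary `DATA_DICTIONARY`, the function `check_record` and the two example dimensions are all hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class DimensionDefinition:
    """One documented data dimension (all fields are illustrative)."""
    description: str
    gathering_method: str
    collection_frequency: str
    is_allowed: Callable[[Any], bool]  # encodes the allowed values


# Hypothetical data dictionary covering two dimensions.
DATA_DICTIONARY = {
    "age_years": DimensionDefinition(
        description="Age in whole years at the time of application",
        gathering_method="self-reported on intake form",
        collection_frequency="once, at onboarding",
        is_allowed=lambda v: isinstance(v, int) and 0 <= v <= 120,
    ),
    "employment_status": DimensionDefinition(
        description="Employment status at the time of application",
        gathering_method="self-reported on intake form",
        collection_frequency="once, at onboarding",
        is_allowed=lambda v: v in {"employed", "unemployed", "retired"},
    ),
}


def check_record(record: dict) -> list:
    """Return a list of problems; an empty list means the record conforms."""
    problems = []
    for field, value in record.items():
        definition = DATA_DICTIONARY.get(field)
        if definition is None:
            # A dimension without a definition invites unjustified assumptions.
            problems.append(f"{field}: no documented definition")
        elif not definition.is_allowed(value):
            problems.append(f"{field}: value {value!r} outside allowed values")
    return problems


print(check_record({"age_years": 34, "shoe_size": 42}))
```

Dimensions that appear in the data but not in the dictionary are reported rather than silently accepted, which mirrors the "acquire such knowledge, or discard the dimension" rule in the Control.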

12.1.2. Data Modeling

Document and ensure all relationships between (the fields of) different datasets are clear, in the light of their Data Definitions. (See Section 12.1.1. - Data Definitions for further information.) If this "Data Model" is not clear or available, create it, or discard the datasets.

To (a) prevent the creation and/or combination of invalid datasets; and (b) highlight associated risks that might occur in the Product Lifecycle.
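
Before two datasets are combined, the assumed relationship between their key fields can be checked directly. The sketch below assumes a one-to-many relationship between a "parent" and a "child" dataset held as lists of dictionaries; the function `check_relationship` and the customer/order example are hypothetical.

```python
from collections import Counter


def check_relationship(parent_rows, child_rows, parent_key, child_key):
    """Summarise how child rows reference parent rows through a key field."""
    parent_counts = Counter(row[parent_key] for row in parent_rows)
    child_keys = {row[child_key] for row in child_rows}
    return {
        # A parent key that occurs more than once breaks a one-to-many
        # assumption and would duplicate child rows in a join.
        "duplicate_parent_keys": sorted(
            k for k, n in parent_counts.items() if n > 1
        ),
        # Child rows that reference no parent would be dropped or left
        # with empty parent fields by a join.
        "orphaned_child_keys": sorted(child_keys - set(parent_counts)),
        # Parents that no child refers to.
        "parents_without_children": sorted(set(parent_counts) - child_keys),
    }


customers = [{"customer_id": 1}, {"customer_id": 2}, {"customer_id": 2}]
orders = [{"order_id": 10, "customer_id": 1}, {"order_id": 11, "customer_id": 4}]
print(check_relationship(customers, orders, "customer_id", "customer_id"))
```

Any non-empty entry in the report is a prompt to revisit the Data Model before the datasets are combined.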

12.1.3. Missing and Bad Data Assessment

Document and assess (a) the occurrence rates and (b) the co-variances of missing values and nonsensical values throughout the Model data. If either is significant, investigate the causes and consider either discarding the affected data dimension(s) or committing dedicated research and development to mitigating measures for them. (See Section 12.3.1. - Live Data Quality for further information.)

To (a) assess the risk of low-quality data introducing bias to Model data and/or Outcomes; (b) assess whether the Model dataset(s) quality is sufficient for Product Definitions; and (c) highlight associated risks that might occur in the Product Lifecycle.
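
The occurrence rates and co-variances named in this Control can be computed from per-field "bad value" indicators. The sketch below is an illustration under stated assumptions: rows are dictionaries, a value counts as bad when it is `None` or outside a documented numeric range, and the names `is_bad`, `bad_data_report` and `valid_ranges` are hypothetical.

```python
from itertools import combinations


def is_bad(value, field, valid_ranges):
    """A value is 'bad' if it is missing or outside its documented range."""
    if value is None:
        return True
    low, high = valid_ranges.get(field, (float("-inf"), float("inf")))
    return not (low <= value <= high)


def bad_data_report(rows, fields, valid_ranges):
    """Return per-field bad-value rates and pairwise indicator covariances."""
    n = len(rows)
    flags = {f: [is_bad(r.get(f), f, valid_ranges) for r in rows] for f in fields}
    rates = {f: sum(flags[f]) / n for f in fields}
    covariances = {}
    for a, b in combinations(fields, 2):
        joint = sum(x and y for x, y in zip(flags[a], flags[b])) / n
        # Covariance of two 0/1 indicators: E[XY] - E[X]E[Y].
        # A clearly positive value means bad values tend to occur together.
        covariances[(a, b)] = joint - rates[a] * rates[b]
    return rates, covariances


rows = [
    {"income": 50000, "age": 30},
    {"income": None, "age": None},
    {"income": None, "age": -5},
    {"income": 42000, "age": 51},
]
ranges = {"age": (0, 120), "income": (0, 10_000_000)}
print(bad_data_report(rows, ["income", "age"], ranges))
```

What counts as "significant" is a judgment for the Product team; the code only surfaces the numbers.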

12.1.4. Data Veracity Uncertainty & Precision

Document and assess the veracity and precision of data. If compromised, uncertain and/or unknown, document and assess (i) the causes and sources thereof and (ii) statistical accuracy. Incorporate appropriate statistical handling procedures, such as calibration, and appropriate control mechanisms in the Model, or discard the data dimension.

To (a) assess the risk of low-quality data introducing bias to Model data and/or Outcomes; (b) assess, a priori, the plausibly achievable performance; (c) assess whether the Model dataset(s) quality is sufficient for Product Definitions; and (d) highlight associated risks that might occur in the Product Lifecycle.
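
Where a trusted reference sample exists for a data dimension, its veracity and precision can be estimated by comparing measured values with reference values. The sketch below assumes such a paired sample is available and uses a constant-offset correction as a deliberately simple stand-in for calibration; `assess_measurement` and `calibrate` are hypothetical names, and real calibration procedures may be considerably more involved.

```python
from statistics import mean, stdev


def assess_measurement(measured, reference):
    """Compare measurements with trusted reference values for the same items."""
    errors = [m - r for m, r in zip(measured, reference)]
    return {
        "bias": mean(errors),        # systematic offset (veracity)
        "precision": stdev(errors),  # sample standard deviation of the error
    }


def calibrate(values, bias):
    """Remove a constant systematic offset from a list of values."""
    return [v - bias for v in values]


measured = [10.5, 19.5, 31.0, 40.0]
reference = [10.0, 20.0, 30.0, 40.0]
report = assess_measurement(measured, reference)
print(report)
print(calibrate(measured, report["bias"]))
```

The bias can be corrected, but the spread cannot; it gives a rough indication of the noise that remains in the dimension after calibration.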

12.2. Development

Objective
To determine if Model performance is affected or biased due to data quality issues.
Item nr. | Item Name and Page | Control | Aim
12.2.1. Missing and Bad Data Handling

Document and assess how missing and nonsensical data (a) are handled in the Model, through datapoint exclusion or data imputation; (b) affect the Selection Function through datapoint removal; and (c) affect Model performance and Fairness for (Sub)populations through data imputation. If (Sub)populations are unequally affected, take additional measures to increase data quality and/or improve Model resilience. Consult Domain experts during assessment and mitigation.

To (a) prevent introducing bias to Model Outcomes due to low quality data; and (b) highlight associated risks that might occur in the Product Lifecycle.
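
The effect of datapoint exclusion on the Selection Function can be made visible by comparing each (Sub)population's share of the data before and after incomplete rows are dropped. This is a minimal sketch under the assumption that rows are dictionaries with a group label and that missing values are `None`; `exclusion_impact` and the field names are hypothetical.

```python
from collections import Counter


def exclusion_impact(rows, group_field, required_fields):
    """Compare subgroup shares before and after dropping incomplete rows."""
    total = Counter(r[group_field] for r in rows)
    kept = Counter(
        r[group_field]
        for r in rows
        if all(r.get(f) is not None for f in required_fields)
    )
    n_total, n_kept = sum(total.values()), sum(kept.values())
    return {
        group: {
            "share_before": total[group] / n_total,
            "share_after": kept[group] / n_kept if n_kept else 0.0,
            # Fraction of this subgroup removed by the exclusion rule.
            "dropped_rate": 1 - kept[group] / total[group],
        }
        for group in total
    }


rows = (
    [{"group": "A", "income": 1000}] * 4
    + [{"group": "B", "income": 1000}] * 2
    + [{"group": "B", "income": None}] * 2
)
print(exclusion_impact(rows, "group", ["income"]))
```

A large gap between `share_before` and `share_after`, or between the `dropped_rate` of different groups, indicates that exclusion affects (Sub)populations unequally. The same comparison can be repeated with an imputation flag in place of exclusion.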

12.2.2. Error - Quality Correlation

Document and assess whether low-quality datapoints (those with low-confidence, uncertain, nonsensical, missing and/or imputed attributes) correlate with high (rates of) error, and how this affects (Sub)populations. If so, take additional measures to increase data quality and/or improve Model performance for specific (Sub)populations.

To (a) prevent introducing bias to Model Outcomes due to low-quality data; (b) assess whether the Model dataset(s) quality is sufficient for Product Definition(s); and (c) highlight associated risks that might occur in the Product Lifecycle.
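
A simple first check for this Control is to tabulate the error rate by (Sub)population and by a low-quality flag. The sketch below assumes each evaluated datapoint carries a group label, a boolean `low_quality` flag and a boolean `correct` outcome; these field names and the function `error_rates_by_quality` are hypothetical.

```python
from collections import defaultdict


def error_rates_by_quality(records):
    """Return the error rate for each (group, low_quality) combination."""
    tallies = defaultdict(lambda: [0, 0])  # key -> [errors, total]
    for r in records:
        key = (r["group"], r["low_quality"])
        tallies[key][0] += 0 if r["correct"] else 1
        tallies[key][1] += 1
    return {key: errors / total for key, (errors, total) in tallies.items()}


records = [
    {"group": "A", "low_quality": False, "correct": True},
    {"group": "A", "low_quality": False, "correct": True},
    {"group": "A", "low_quality": True, "correct": True},
    {"group": "A", "low_quality": True, "correct": False},
    {"group": "B", "low_quality": True, "correct": False},
    {"group": "B", "low_quality": True, "correct": False},
]
print(error_rates_by_quality(records))
```

If the low-quality rows show a markedly higher error rate, and if one group holds most of the low-quality rows, that group is likely to bear a disproportionate share of the errors.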

12.3. Production

Objective
To ensure the quality of incoming data to the Product during operations.
Item nr. | Item Name and Page | Control | Aim
12.3.1. Live Data Quality

Document and assess whether low-quality live incoming data (with low-confidence, uncertain, nonsensical, missing and/or imputed attributes) can be handled appropriately by the Model at the per-Data Subject level. If not, implement additional measures, and/or re-assess the validity of the Product Definition(s) in view of their non-applicability to low-quality live subsets.

To (a) assess and control that all Product Subjects can be supported appropriately by the live Product; and (b) highlight associated risks that might occur in the Product Lifecycle.
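
In production, a per-record quality gate can decide whether an incoming record is passed to the Model or routed to a fallback path. The sketch below is one possible shape for such a gate, not a prescribed design: the function `assess_live_record`, the `source_confidence` field, the threshold of 0.8 and the "score" / "fallback" decisions are all assumptions made for illustration.

```python
def assess_live_record(record, required_fields, valid_ranges, min_confidence=0.8):
    """Return a routing decision and the reasons for one incoming record."""
    reasons = []
    for field in required_fields:
        if record.get(field) is None:
            reasons.append(f"missing:{field}")
    for field, (low, high) in valid_ranges.items():
        value = record.get(field)
        if value is not None and not (low <= value <= high):
            reasons.append(f"out_of_range:{field}")
    # Hypothetical upstream confidence score attached to the record.
    confidence = record.get("source_confidence")
    if confidence is not None and confidence < min_confidence:
        reasons.append("low_confidence")
    decision = "fallback" if reasons else "score"
    return decision, reasons


record = {"age": 150, "income": None, "source_confidence": 0.5}
print(assess_live_record(record, ["age", "income"], {"age": (0, 120)}))
```

Logging the reasons alongside the decision makes it possible to monitor which Data Subjects are routed to the fallback path and how often.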