Data Quality
12.1. Exploration
Item nr. | Item Name | Control | Aim |
---|---|---|---|
12.1.1. | Data Definitions | Document and ensure that all subtleties of the definitions of all data dimensions are clear, including but not limited to gathering methods, allowed values, and collection frequency. If they are not, acquire such knowledge or discard the dimension. | To (a) assess and prevent unjustified assumptions about the meaning of a data dimension or its values; and (b) highlight associated risks that might occur in the Product Lifecycle. |
12.1.2. | Data Modeling | Document and ensure that all relationships between (the fields of) different datasets are clear in light of their Data Definitions. (See Section 12.1.1. - Data Definitions for further information.) If this "Data Model" is not clear or available, create it or discard the datasets. | To (a) prevent the creation and/or combination of invalid datasets; and (b) highlight associated risks that might occur in the Product Lifecycle. |
12.1.3. | Missing and Bad Data Assessment | Document and assess (a) the occurrence rates and (b) the covariances of missing values and nonsensical values throughout the Model data. If either is significant, investigate the causes and consider either discarding the affected data dimension(s) or committing dedicated research and development to mitigating measures for them. (See Section 12.3.1. - Live Data Quality for further information.) | To (a) assess the risk of low-quality data introducing bias to Model data and/or Outcomes; (b) assess whether the quality of the Model dataset(s) is sufficient for Product Definitions; and (c) highlight associated risks that might occur in the Product Lifecycle. |
12.1.4. | Data Veracity Uncertainty & Precision | Document and assess the veracity and precision of data. If these are compromised, uncertain and/or unknown, document and assess (i) the causes and sources thereof and (ii) the statistical accuracy. Incorporate appropriate statistical handling procedures, such as calibration, and appropriate control mechanisms in the Model, or discard the data dimension. | To (a) assess the risk of low-quality data introducing bias to Model data and/or Outcomes; (b) assess a priori the plausibly achievable performance; (c) assess whether the quality of the Model dataset(s) is sufficient for Product Definitions; and (d) highlight associated risks that might occur in the Product Lifecycle. |
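
As an illustration of Item 12.1.3, the occurrence rates and covariances of missing or nonsensical values could be estimated roughly as follows. This is a minimal sketch, not a prescribed method: the dimension names (`age`, `income`), the validity rules in `VALID`, and the helper functions are all hypothetical, and what counts as a "significant" rate or covariance remains a judgment for the team and its Domain experts.

```python
from itertools import combinations
from statistics import mean

# Hypothetical validity rules per data dimension (assumptions for illustration).
VALID = {
    "age": lambda v: isinstance(v, (int, float)) and 0 <= v <= 120,
    "income": lambda v: isinstance(v, (int, float)) and v >= 0,
}

def bad_indicators(rows, rules):
    """Map each dimension to a list of flags: 1 if missing or nonsensical, else 0."""
    return {
        dim: [0 if (row.get(dim) is not None and ok(row[dim])) else 1 for row in rows]
        for dim, ok in rules.items()
    }

def bad_rates(indicators):
    """Occurrence rate of missing or nonsensical values per dimension."""
    return {dim: mean(flags) for dim, flags in indicators.items()}

def bad_covariances(indicators):
    """Population covariance between the bad-value flags of each pair of dimensions."""
    result = {}
    for a, b in combinations(sorted(indicators), 2):
        xa, xb = indicators[a], indicators[b]
        ma, mb = mean(xa), mean(xb)
        result[(a, b)] = mean([(x - ma) * (y - mb) for x, y in zip(xa, xb)])
    return result

# Toy data: None marks a missing value, -5 is a nonsensical age.
rows = [
    {"age": 34, "income": 52000},
    {"age": None, "income": None},
    {"age": -5, "income": 31000},
    {"age": 61, "income": 48000},
]
indicators = bad_indicators(rows, VALID)
rates = bad_rates(indicators)
covariances = bad_covariances(indicators)
```

A clearly positive covariance between two dimensions suggests that their bad values tend to occur together, which may point to a shared cause worth investigating before deciding whether to discard or mitigate.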
12.2. Development
Item nr. | Item Name | Control | Aim |
---|---|---|---|
12.2.1. | Missing and Bad Data Handling | Document and assess how missing and nonsensical data (a) are handled in the Model, through datapoint exclusion or data imputation; (b) affect the Selection Function through datapoint removal; and (c) affect Model performance and Fairness for (Sub)populations through data imputation. If (Sub)populations are unequally affected, take additional measures to increase data quality and/or improve Model resilience. Consult Domain experts during assessment and mitigation. | To (a) prevent introducing bias to Model Outcomes due to low-quality data; and (b) highlight associated risks that might occur in the Product Lifecycle. |
12.2.2. | Error - Quality Correlation | Document and assess whether low-quality datapoints (those with low-confidence, uncertain, nonsensical, missing and/or imputed attributes) correlate with high (rates of) error, and how this affects (Sub)populations. If so, take additional measures to increase data quality and/or improve Model performance for specific (Sub)populations. | To (a) prevent introducing bias to Model Outcomes due to low-quality data; (b) assess whether the quality of the Model dataset(s) is sufficient for Product Definition(s); and (c) highlight associated risks that might occur in the Product Lifecycle. |
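
One way to approach the assessment in Item 12.2.2 is to compare error rates between low-quality and other datapoints within each (Sub)population. The sketch below assumes a hypothetical record layout (`group`, `low_quality`, `label`, `prediction`) and a simple classification error; neither is specified by this section, and a real assessment would also need to account for sample size and statistical uncertainty.

```python
from collections import defaultdict

def error_rates_by_quality(records):
    """Error rate for each (group, low_quality) combination."""
    counts = defaultdict(lambda: [0, 0])  # key -> [errors, total]
    for r in records:
        key = (r["group"], r["low_quality"])
        counts[key][0] += int(r["prediction"] != r["label"])
        counts[key][1] += 1
    return {key: errors / total for key, (errors, total) in counts.items()}

def quality_error_gap(error_rates):
    """Per group: error rate on low-quality datapoints minus error rate on the rest."""
    groups = {group for group, _ in error_rates}
    return {
        group: error_rates[(group, True)] - error_rates[(group, False)]
        for group in sorted(groups)
        if (group, True) in error_rates and (group, False) in error_rates
    }

# Toy evaluation records (hypothetical field names and values).
records = [
    {"group": "A", "low_quality": False, "label": 1, "prediction": 1},
    {"group": "A", "low_quality": False, "label": 0, "prediction": 0},
    {"group": "A", "low_quality": True, "label": 1, "prediction": 0},
    {"group": "A", "low_quality": True, "label": 0, "prediction": 0},
    {"group": "B", "low_quality": False, "label": 1, "prediction": 1},
    {"group": "B", "low_quality": True, "label": 0, "prediction": 1},
    {"group": "B", "low_quality": True, "label": 1, "prediction": 0},
]
error_rates = error_rates_by_quality(records)
gaps = quality_error_gap(error_rates)
```

A gap that is large for one (Sub)population but small for another is the kind of unequal effect the Control asks to document and mitigate.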
12.3. Production
Item nr. | Item Name | Control | Aim |
---|---|---|---|
12.3.1. | Live Data Quality | Document and assess whether low-quality live incoming data (with low-confidence, uncertain, nonsensical, missing and/or imputed attributes) can be handled appropriately by the Model at the per-Data Subject level. If not, implement additional measures and/or re-assess the validity of the Product Definition(s) in view of their non-applicability to low-quality live subsets. | To (a) assess and control that all Product Subjects can be supported appropriately by the live Product; and (b) highlight associated risks that might occur in the Product Lifecycle. |
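
A per-Data Subject check for Item 12.3.1 might look like the following sketch, which inspects each incoming record and routes low-quality ones away from the Model. The rules in `RULES`, the `QualityReport` type, and the `"model"` / `"fallback"` routing labels are hypothetical; the appropriate handling of a low-quality record (imputation, manual review, refusal to score) depends on the Product Definition(s).

```python
from dataclasses import dataclass, field

# Hypothetical validity rules for live records (assumptions for illustration).
RULES = {
    "age": lambda v: isinstance(v, (int, float)) and 0 <= v <= 120,
    "income": lambda v: isinstance(v, (int, float)) and v >= 0,
}

@dataclass
class QualityReport:
    missing: list = field(default_factory=list)
    nonsensical: list = field(default_factory=list)

    @property
    def low_quality(self):
        return bool(self.missing or self.nonsensical)

def check_record(record, rules):
    """Collect the missing and nonsensical dimensions of a single record."""
    report = QualityReport()
    for dim, is_valid in rules.items():
        value = record.get(dim)
        if value is None:
            report.missing.append(dim)
        elif not is_valid(value):
            report.nonsensical.append(dim)
    return report

def route(record, rules):
    """Return ("model" or "fallback", report) for one incoming record."""
    report = check_record(record, rules)
    return ("fallback" if report.low_quality else "model"), report

good_target, good_report = route({"age": 40, "income": 1000}, RULES)
bad_target, bad_report = route({"age": 200}, RULES)
```

Logging each `QualityReport` alongside the routing decision also provides the documentation trail the Control asks for, and shows which live subsets the Product cannot currently support.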