Performance Robustness

From The Foundation for Best Practices in Machine Learning




Objective
To warrant Model Outcomes and prevent unintentional Model behaviour a priori under operational conditions as far as is reasonably practical.


14.1. Product Definition(s)

Objective
To prevent performance loss due to Product Definition changes.
Item nr. Item Name and Page Control Aim
14.1.1. Product Definition(s) Stability

Document and assess the stability of historic and prospective Product Definition(s) and Product Aim(s). If unstable, take measures to redefine or, failing that, to correct for or mitigate as much as is reasonably practical.

To (a) ensure that Product Definition(s) and Models remain stable and up-to-date in light of Product Domain Stability; and (b) highlight associated risks that might occur in the Product Lifecycle.

14.1.2. Product Domain Stability

Document and assess the stability of historic and prospective Product Domain(s). If unstable, revise Product Definition(s) accordingly to ensure Product consistency and stability.

To (a) ensure that Product Definition(s) and Models remain stable and up-to-date in light of Product Domain Stability; and (b) highlight associated risks that might occur in the Product Lifecycle.

14.2. Exploration

Objective
To prevent performance loss due to (a) data and/or data definition instability; (b) volatile data elements; and/or (c) prospective increases in scale.
Item nr. Item Name and Page Control Aim
14.2.1. Data Drift Assessment

Document and assess historic and prospective changes in data distribution, inclusive of missing and nonsensical data. If data drift is apparent and/or expected in the future, implement mitigating measures as much as is reasonably practical.

To (a) assess and promote the stability of data distributions (data drift); (b) determine the need for data distribution monitoring, risk-based mitigation strategies and responses, drift resistance and adaptation simulations and optimization, and data distribution calibration; and (c) highlight associated risks that might occur in the Product Lifecycle.
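One way the drift assessment in 14.2.1 might be quantified is with a Population Stability Index (PSI) between a historic reference sample and a more recent sample of the same feature. The sketch below is illustrative only: PSI is one of several possible drift measures and is not prescribed by this control, and the function name, bin count, and synthetic samples are assumptions. The 0.1 and 0.25 levels mentioned in the comments are a commonly quoted rule of thumb, not a standard.

```python
import math
import random


def population_stability_index(reference, current, n_bins=10, eps=1e-6):
    """PSI of one numeric feature between a reference and a current sample.

    Bins are equal-width over the reference range (an assumption of this sketch).
    """
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / n_bins or 1.0  # avoid zero width for constant features

    def bin_fractions(values):
        counts = [0] * n_bins
        for v in values:
            # Values outside the reference range are clamped into the edge bins.
            idx = int((v - lo) / width)
            counts[max(0, min(n_bins - 1, idx))] += 1
        # eps keeps empty bins from producing log(0) or division by zero.
        return [max(c / len(values), eps) for c in counts]

    ref_frac = bin_fractions(reference)
    cur_frac = bin_fractions(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref_frac, cur_frac))


# Synthetic data, for illustration only.
rng = random.Random(0)
reference = [rng.gauss(0.0, 1.0) for _ in range(5000)]
same = [rng.gauss(0.0, 1.0) for _ in range(5000)]
shifted = [rng.gauss(1.0, 1.0) for _ in range(5000)]

psi_same = population_stability_index(reference, same)        # expected to be small
psi_shifted = population_stability_index(reference, shifted)  # expected to be large
# Rule of thumb (an assumption): < 0.1 stable, > 0.25 material drift.
```

Missing and nonsensical values would need their own treatment, for example by tracking their rate per period as a separate series.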

14.2.2. Data Definition Temporal Stability

Document and assess - both technically and conceptually - historic and prospective changes of each data dimension definition. If unstable, consider refining Product Definitions and/or limiting usage of unstable data dimensions.

To (a) assess and control for the need for Model design adaptation based on data definition stability; and (b) highlight associated risks that might occur in the Product Lifecycle.

14.2.3. Outlier Occurrence Rates

Document and assess outliers, their causes, and occurrence rates as a function of their location in data space. If numerous and persistent, include mitigating measures in Model design accordingly.

To (a) identify outliers and assess the need for Model design adaptation; and (b) highlight associated risks that might occur in the Product Lifecycle.
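As a rough illustration of 14.2.3, outlier occurrence rates can be tabulated per region of data space. The sketch below assumes outliers are flagged with Tukey's interquartile-range fences and that "location in data space" is approximated by a categorical segment label; both choices, and all names and values, are hypothetical.

```python
import statistics
from collections import defaultdict


def iqr_fences(values, k=1.5):
    """Lower and upper Tukey fences for a list of numeric values."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    spread = q3 - q1
    return q1 - k * spread, q3 + k * spread


def outlier_rate_by_segment(records, k=1.5):
    """records: iterable of (segment, value). Returns {segment: outlier fraction}."""
    records = list(records)
    lo, hi = iqr_fences([value for _, value in records], k)
    totals, flagged = defaultdict(int), defaultdict(int)
    for segment, value in records:
        totals[segment] += 1
        if value < lo or value > hi:
            flagged[segment] += 1
    return {segment: flagged[segment] / totals[segment] for segment in totals}


# Toy records, for illustration only.
records = [("north", v) for v in [10, 11, 12, 11, 10, 12, 11, 10]]
records += [("south", v) for v in [10, 12, 11, 95, 11, 10, 120, 12]]
rates = outlier_rate_by_segment(records)
# rates -> {"north": 0.0, "south": 0.25}
```

A rate that is persistently concentrated in one segment, as in the toy data, would point towards a cause worth documenting rather than a purely random occurrence.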

14.2.4. Selection Function Temporal Stability

Document and assess the historic and prospective behaviour of Selection Function(s) of Model data. (See Section 13.2.4. - Selection Function for more information.) If unstable, take measures to account for past and future changes, and/or promote the consistency and representativeness of Model datasets and data gathering as much as is reasonably practical.

To (a) assess and control for hard-to-measure changes to the relation between Model datasets and Product Domain(s); (b) identify the risk of hard-to-diagnose Model performance degradation and bias throughout the Product Lifecycle (to be controlled by 14.3.6. - Model Drift & Model Robustness Simulations); and (c) highlight associated risks that might occur in the Product Lifecycle.

14.2.5. Data Generating Process Temporal Stability

Document and assess the historic and prospective behaviour of data generating processes, and their influence on the Selection Function. If unstable, take measures to account for past and future changes and/or promote the stability and consistency of data generation processes as much as is reasonably practical.

To (a) assess and control for hard-to-measure changes to the relation between Model datasets and Product Domain(s); (b) identify the risk of hard-to-diagnose Model performance degradation and bias throughout the Product Lifecycle (to be controlled by 14.3.6. - Model Drift & Model Robustness Simulations); and (c) highlight associated risks that might occur in the Product Lifecycle.

14.3. Development

Objective
To characterize, determine and control for Model performance variation, risks and robustness under live conditions a priori and throughout the Product Lifecycle.
Item nr. Item Name and Page Control Aim
14.3.1. Target Feature Definition Stability

Document and assess - both technically and conceptually - the historic and prospective stability of the Target Feature definition. If unstable, consider refining Product Definitions and/or choosing a different Target Feature.

To (a) assess the need for Model design and Product Definition adaptation based on Target Feature definition stability; and (b) highlight associated risks that might occur in the Product Lifecycle.

14.3.2. Blind Performance Validation

Document and validate that Model performance can always be reproduced on never-before-seen hold-out data-subsets, and demonstrate that these hold-out data-subsets are never used, for example through comparisons of Model performance on the hold-out dataset, to guide Model and Product design choices. If performance cannot be reproduced on never-before-seen hold-out data-subsets, take measures to improve robustness and Model fitting as much as is reasonably practical.

To (a) ensure Model performance robustness against insufficient generalization capabilities on live data (such as overfitting); and (b) highlight associated risks that might occur in the Product Lifecycle.
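One possible way to support the evidence asked for in 14.3.2 is to route every hold-out evaluation through a wrapper that logs and caps access, so that repeated use for design choices is both prevented and auditable. The class `HoldOutVault`, the helper `reproduces`, the tolerance, and the toy data below are hypothetical names and values chosen for this sketch.

```python
class HoldOutVault:
    """Wraps hold-out data so that every evaluation is logged and capped."""

    def __init__(self, features, labels, max_evaluations=1):
        self._features = list(features)
        self._labels = list(labels)
        self._max_evaluations = max_evaluations
        self.access_log = []  # audit trail of (model name, accuracy)

    def evaluate(self, model_name, predict):
        if len(self.access_log) >= self._max_evaluations:
            raise RuntimeError("hold-out set already used; refusing re-evaluation")
        hits = sum(predict(x) == y for x, y in zip(self._features, self._labels))
        accuracy = hits / len(self._labels)
        self.access_log.append((model_name, accuracy))
        return accuracy


def reproduces(validation_score, holdout_score, tolerance=0.05):
    """True if hold-out performance is not worse than validation by more than tolerance."""
    return validation_score - holdout_score <= tolerance


# Toy hold-out data and model, for illustration only.
vault = HoldOutVault([0.1, 0.4, 0.6, 0.9, 0.2, 0.8], [0, 0, 1, 1, 0, 1])
holdout_accuracy = vault.evaluate("threshold-v1", lambda x: int(x > 0.5))
# The validation score of 0.95 is an assumed figure from earlier development.
ok = reproduces(validation_score=0.95, holdout_score=holdout_accuracy)
```

A second call to `vault.evaluate` raises `RuntimeError`, and `vault.access_log` can be attached to the documentation as a record of when the hold-out data was touched.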

14.3.3. Error Distributions

Document and assess error and/or residual distributions along as many dimensions and/or subsets as is practically feasible. If distributions are too broad and/or too unequal between subsets, improve Model(s).

To (a) assess and control for the performance influence of data points and/or groups; (b) assess and control for the influence of the distribution of errors on (i) performance robustness as a function of data drift, (ii) the systematic performance of minority data-subsets, and (iii) the risks of unacceptable errors and/or catastrophic failure; and (c) highlight associated risks that might occur in the Product Lifecycle.
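The per-subset comparison described in 14.3.3 could start from a simple residual summary per subset. The sketch below assumes a regression setting with signed residuals (prediction minus truth); the function name, the chosen statistics, and the toy rows are illustrative assumptions.

```python
import statistics
from collections import defaultdict


def residual_summary_by_subset(rows):
    """rows: iterable of (subset, y_true, y_pred). Returns per-subset residual stats."""
    residuals = defaultdict(list)
    for subset, y_true, y_pred in rows:
        residuals[subset].append(y_pred - y_true)
    return {
        subset: {
            "n": len(r),
            "mean": statistics.fmean(r),     # systematic bias of the subset
            "stdev": statistics.pstdev(r),   # breadth of the error distribution
            "max_abs": max(abs(e) for e in r),
        }
        for subset, r in residuals.items()
    }


# Toy rows, for illustration only.
rows = [("A", 10, 11), ("A", 20, 19), ("B", 10, 14), ("B", 20, 26)]
summary = residual_summary_by_subset(rows)
# Subset "B" shows a mean residual of 5.0 against 0.0 for "A": unequal error distributions.
```

In practice the same summary would be repeated along as many dimensions as is feasible, with the subset definitions themselves documented.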

14.3.4. Output Edge Cases

Document and assess the causes, occurrence probabilities, and overall performance impact of Edge Cases output by Model(s), inclusive of their impact on Model training and design. If their influence is significant, improve Model design. If occurrence is high, increase Model, code and data quality control.

To (a) assess and control for the impact of Output Edge Cases on Model design, bugs and performance; and (b) highlight associated risks that might occur in the Product Lifecycle.

14.3.5. Performance Root Cause Analysis

Document and assess Model performance Root Cause Analysis as well as its testing method. If Root Cause Analysis is ineffective, simplify Model and/or increase diagnostics like logging and tracking.

To (a) assess and control for Model performance changes and assist in Model design, development, and debugging; and (b) highlight associated risks that might occur in the Product Lifecycle.

14.3.6. Model Drift & Model Robustness Simulations

Document and perform simulations of Model training and retraining cycles, using historic and synthetic data. Document and assess the effects of temporal changes to, amongst other things, the Selection Function, Data Generating Process and Data Drift on the drift in performance and error distributions of said simulations. If Model drift is apparent, document and perform further simulations for Model drift response optimization, and/or consider refining Product Definitions.

To (a) assess and control for Model propensity for Model drift; (b) determine the robustness of Model performance as a function of data changes; (c) determine appropriate Product response to drift; and (d) highlight associated risks that might occur in the Product Lifecycle.
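A very reduced sketch of the retraining simulations in 14.3.6 is given below. It assumes a purely synthetic target whose mean drifts linearly per period and a trivial "Model" that predicts the training mean; real simulations would replay historic data and the actual training pipeline. The function name, drift rate, and retraining cadence are hypothetical.

```python
import random
import statistics


def simulate_drift(n_periods=12, drift_per_period=0.5, retrain_every=None, seed=0):
    """Mean absolute error per period for a mean-predictor under linear drift."""
    rng = random.Random(seed)

    def sample(period, size=200):
        # Synthetic data whose mean moves by drift_per_period each period.
        return [rng.gauss(drift_per_period * period, 1.0) for _ in range(size)]

    model = statistics.fmean(sample(0))  # initial training on period 0
    errors = []
    for period in range(1, n_periods + 1):
        live = sample(period)
        errors.append(statistics.fmean(abs(y - model) for y in live))
        if retrain_every and period % retrain_every == 0:
            model = statistics.fmean(live)  # retrain on the latest period only
    return errors


static_errors = simulate_drift(retrain_every=None)  # never retrained
retrained_errors = simulate_drift(retrain_every=3)  # retrained every third period
```

Comparing the two error series across candidate cadences is one way to inform the drift response optimization mentioned in the control.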

14.3.7. Catastrophic Failures

Document and assess the prevalence of predictions with High Confidence Values, but large Evaluation Errors. If apparent, improve Model to avoid these, and/or implement processes to mitigate these as much as is reasonably practical.

To (a) assess the propensity of the Model for catastrophic failures; and (b) highlight associated risks that might occur in the Product Lifecycle.
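The prevalence described in 14.3.7 can be estimated by counting predictions that are both confident and badly wrong. The sketch below assumes each prediction carries a confidence value in [0, 1]; the function name, both thresholds, and the toy records are illustrative assumptions that would need to be set per Product.

```python
def catastrophic_failures(records, min_confidence=0.9, max_error=0.5):
    """records: iterable of (confidence, y_true, y_pred) tuples.

    Returns (rate among confident predictions, list of offending records).
    """
    confident = [r for r in records if r[0] >= min_confidence]
    offenders = [r for r in confident if abs(r[2] - r[1]) > max_error]
    rate = len(offenders) / len(confident) if confident else 0.0
    return rate, offenders


# Toy binary-classification records, for illustration only.
records = [
    (0.99, 1, 1),
    (0.97, 0, 1),  # confident and wrong
    (0.95, 1, 1),
    (0.92, 0, 0),
    (0.60, 1, 0),  # wrong, but the Model signalled low confidence
    (0.55, 0, 0),
]
rate, offenders = catastrophic_failures(records)
# rate -> 0.25, offenders -> [(0.97, 0, 1)]
```

Returning the offending records alongside the rate keeps them available for root cause analysis.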

14.3.8. Performance Uncertainty and Sensitivity Analysis

Document and assess the probability distribution of the model performance using cross-validation, statistical and simulation techniques under - (a) the assumption that the distribution of training and validation data is representative of the distribution of live data; and (b) multiple realistic variations to the Model data due to both statistical and contextual causes. If Model performance variation is high, improve Model and/or take measures to mitigate performance variation impact.

To (a) assess and control for the range of expected values of Model performance under both constant and changing conditions; (b) assess and control for whether trained model performance is consistent with these ranges; (c) identify main sources of uncertainty and variation for further control; and (d) highlight associated risks that might occur in the Product Lifecycle.
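For part (a) of 14.3.8, a percentile bootstrap is one simple simulation technique for estimating the spread of a performance metric, under the stated assumption that the evaluation data is representative of live data. The sketch below uses accuracy as the metric; the function name, resample count, and toy labels are assumptions, and part (b) would additionally require perturbing the data in realistic ways.

```python
import random


def bootstrap_accuracy_interval(y_true, y_pred, n_resamples=1000, alpha=0.05, seed=0):
    """Point accuracy and a percentile bootstrap interval at level 1 - alpha."""
    rng = random.Random(seed)
    correct = [int(t == p) for t, p in zip(y_true, y_pred)]
    n = len(correct)
    # Resample per-example correctness with replacement and recompute accuracy.
    scores = sorted(sum(rng.choices(correct, k=n)) / n for _ in range(n_resamples))
    lower = scores[int((alpha / 2) * n_resamples)]
    upper = scores[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(correct) / n, lower, upper


# Toy labels with 80% accuracy, for illustration only.
y_true = [1] * 100 + [0] * 100
y_pred = [1] * 80 + [0] * 20 + [0] * 80 + [1] * 20
point, lower, upper = bootstrap_accuracy_interval(y_true, y_pred)
```

A wide interval relative to the performance required by the Product Definition would be one indication that performance variation is high.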

14.3.9. Outlier Handling

Document and assess the effect of various outlier handling procedures on (a) Performance Robustness and (b) Representativeness & Specification. Ensure that only procedures that positively affect both are implemented.

To (a) ensure that outlier removal is not used to heedlessly improve test-time performance only; and (b) highlight associated risks that might occur in the Product Lifecycle.

14.4. Production

Objective
To ensure the future satisfaction of Product Definition(s) through the technical and functional implementation of the Product Model(s) and systems.
Item nr. Item Name and Page Control Aim
14.4.1. Real World Robustness

Document and assess potential future change in the applied effects of the Product, such as through diminishing returns and/or psychological effects. If significant change or decrease is expected, consider refining Product Definitions and/or develop procedures for mitigation.

To (a) assess and control for the variation in applied effects of the Product on Product Definition(s) and performance; and (b) highlight associated risks that might occur in the Product Lifecycle.

14.4.2. Performance Stress Testing

Perform and document experiments designed to attempt to induce failures in the Product and/or Model, for example, but not limited to, by supplying large quantities of data, or unusual data, to the training or inference phases.

To identify and control for risks associated with operational scenarios outside of regimes encountered during Model development.
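A stress test of the kind described in 14.4.2 can be as simple as a harness that feeds deliberately unusual inputs to the inference entry point and records what breaks. In the sketch below, `toy_predict` is a hypothetical stand-in for a real Model, and the case names and payloads are assumptions chosen to provoke typical failure modes.

```python
import math


def stress_test(predict, cases):
    """Run predict on each named case; collect exceptions and non-finite outputs."""
    failures = {}
    for name, payload in cases.items():
        try:
            result = predict(payload)
        except Exception as exc:  # broad on purpose: any crash is a finding
            failures[name] = f"raised {type(exc).__name__}"
            continue
        if not isinstance(result, (int, float)) or not math.isfinite(result):
            failures[name] = f"non-finite output: {result!r}"
    return failures


def toy_predict(features):
    # Hypothetical Model: the mean of the input features.
    return sum(features) / len(features)


cases = {
    "typical": [1.0, 2.0, 3.0],
    "empty": [],
    "missing value": [1.0, None, 3.0],
    "not a number": [1.0, float("nan")],
    "extreme magnitude": [1e308, 1e308],
    "large volume": [0.5] * 1_000_000,
}
failures = stress_test(toy_predict, cases)
```

Here the typical and large-volume cases pass, while the other four are reported as failures; each would be documented together with the chosen mitigation.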