Analysing Overfitting

Risk of Overfitting

Overview

Like all systematic trading strategies, AiLA strategies face the risk of being based on overfitted results.
Even with the best intentions and following the best prevailing practices, there is no guarantee that overfitting is avoided, e.g. from being unintentionally influenced by past experience.
To actively minimize the impact of overfitting, AiLA strategy development strictly follows a set of principles.

Outline

Types: summary of commonly considered sources of overfitting w.r.t. different strategy R&D steps.
Risk: highlight main steps in the AiLA strategy process w.r.t. risk that overfitting could occur.
Mitigation: describe the AiLA principles followed in order to decrease the risk of overfitting.

Illustration of overfitted strategy results; for details see perspective [Overfitting Risk].
A strategy improvement designed and evaluated using the same data (2017 to 2021) shows better performance than the default strategy.
The same improved strategy evaluated on different data (2011 to 2016) indicates that the hypothesis has no impact.

Types of Overfitting

Overfitting is often considered in different stages of the R&D process, with two common types below.
The terminology is common within machine learning (ML). However, the problems are not ML specific, but apply to any exercise where data is consulted.

Training (design) stage

This refers to overfitting that occurs during the strategy training or design phase.
Here overfitting is to some extent inevitable, and different methods are used to address the problem of fitting a relationship to noise in the in-sample data.

Testing stage

This refers to overfitting that occurs during the strategy test phase, where no overfitting is intended.
This often results from multiple-testing scenarios, such as “test many and pick the best” or “test, then change and test again”, where the selection that flatters the results is overlooked (often unintentionally) in the final evaluation.
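The “test many and pick the best” effect can be illustrated with a small simulation. This is a minimal sketch, not part of the AiLA process: the function name `simulate_pick_best`, the number of strategies, the sample length and the noise level are all assumptions chosen for illustration. Every simulated strategy has zero true edge, yet the best one in-sample still looks profitable.

```python
import random
import statistics

def simulate_pick_best(n_strategies=50, n_days=250, seed=0):
    """Pick the best of many skill-less strategies in-sample and return
    its mean daily return in-sample and out-of-sample."""
    rng = random.Random(seed)
    best_in, best_out = float("-inf"), 0.0
    for _ in range(n_strategies):
        # Hypothetical strategies: returns are pure noise with zero mean.
        in_sample = [rng.gauss(0.0, 0.01) for _ in range(n_days)]
        out_sample = [rng.gauss(0.0, 0.01) for _ in range(n_days)]
        mean_in = statistics.fmean(in_sample)
        if mean_in > best_in:
            # Selection is made on in-sample results only.
            best_in, best_out = mean_in, statistics.fmean(out_sample)
    return best_in, best_out

best_in, best_out = simulate_pick_best()
print(f"best in-sample mean: {best_in:.5f}")
print(f"same strategy out-of-sample mean: {best_out:.5f}")
```

Averaged over many seeds, the selected in-sample mean is positive while the out-of-sample mean of the same strategy is close to zero, which is the flattering effect described above.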

Example of measures to address overfitting

Training (design) stage

Cross validation techniques are used to assess the ability to reproduce (generalize) results on out-of-sample data.
Regularization techniques are used to reduce model complexity as well as sensitivity to the specific data sample.
Ensemble models can be used to reduce the variance of the model results, making them more robust w.r.t. overfitting.
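As an illustration of how cross validation can expose overfitting in the training stage, the sketch below compares a simple linear fit with a model that memorises its training data. It is a generic example under stated assumptions, not the AiLA implementation: the synthetic data, the helper names (`fit_linear`, `fit_nearest`, `k_fold_mse`) and the choice of five contiguous folds are all hypothetical.

```python
import random
import statistics

def fit_linear(xs, ys):
    """Ordinary least squares for y = a + b*x; returns a predictor."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    a = my - b * mx
    return lambda x: a + b * x

def fit_nearest(xs, ys):
    """1-nearest-neighbour: memorises the training data (prone to overfit)."""
    pairs = list(zip(xs, ys))
    return lambda x: min(pairs, key=lambda p: abs(p[0] - x))[1]

def mse(model, xs, ys):
    """Mean squared prediction error of a fitted model."""
    return statistics.fmean([(model(x) - y) ** 2 for x, y in zip(xs, ys)])

def k_fold_mse(fit, xs, ys, k=5):
    """Average out-of-fold mean squared error over k contiguous folds."""
    n = len(xs)
    bounds = [round(i * n / k) for i in range(k + 1)]
    scores = []
    for lo, hi in zip(bounds, bounds[1:]):
        # Fit on everything outside the fold, score on the fold itself.
        model = fit(xs[:lo] + xs[hi:], ys[:lo] + ys[hi:])
        scores.append(mse(model, xs[lo:hi], ys[lo:hi]))
    return statistics.fmean(scores)

# Synthetic data (assumption): weak linear signal, strong noise.
rng = random.Random(1)
xs = [rng.uniform(-1, 1) for _ in range(200)]
ys = [0.5 * x + rng.gauss(0, 1) for x in xs]

for name, fit in [("linear", fit_linear), ("nearest", fit_nearest)]:
    in_sample = mse(fit(xs, ys), xs, ys)
    cross_val = k_fold_mse(fit, xs, ys)
    print(f"{name}: in-sample MSE={in_sample:.3f}, cross-validated MSE={cross_val:.3f}")
```

The memorising model has a perfect in-sample fit but a worse cross-validated error than the linear model. For time-ordered data, a walk-forward split that only trains on earlier observations would typically replace the plain folds used here.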

Testing stage

Keep track of all tests conducted and use methods to correct for or estimate their impact, e.g. familywise error corrections or deflated performance measures.
Enforce a strict policy of pre-registering test details and using different (“blind”) data.
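One common familywise correction is the Holm-Bonferroni step-down procedure. The sketch below is a generic illustration and makes no claim about which correction AiLA uses; the function name `holm_bonferroni` and the p-values are hypothetical.

```python
def holm_bonferroni(p_values, alpha=0.05):
    """Return a list of booleans, True where the hypothesis is rejected
    while controlling the familywise error rate at `alpha` (Holm's method)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        # The k-th smallest p-value (k = rank + 1) is compared
        # with alpha / (m - k + 1).
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # all larger p-values fail as well
    return reject

# Hypothetical p-values from four tested strategy variants.
p_values = [0.04, 0.001, 0.03, 0.20]
print(holm_bonferroni(p_values))  # [False, True, False, False]
```

Taken individually at the 5% level, three of the four variants would appear significant. After correcting for the number of tests, only one remains.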

The overfitting in the illustration above [Overfitting Risk] was easily revealed using cross validation. However, (unintentional) overfitting during testing is typically much more difficult to quantify.

Risk Areas

Throughout the AiLA strategy development process there are several opportunities for overfitting to occur.

Areas with particular risk

Choices regarding the micro/macro data and feature definitions. These concern how best to allow the data to expose important factors without implying how it should be used by the strategy.
Training and selection of the asset models. Here the techniques mentioned above are used to address overfitting, and all model evaluation and hyper-parameter tuning are performed using a predetermined training/validation dataset.
The final upstream allocation method is also optimized w.r.t. its parameters using the same training/validation dataset.
The downstream index construction combines different asset allocation models into indices of different types. This process is based on a fixed method, with a focus on achieving a diversified risk across the portfolio rather than on asset performance. However, this step includes parameter choices and index asset selection, where introducing a selection bias must be avoided.

Mitigation Principles

To reduce the impact of overfitting, a number of principles are strictly followed.
However, it should be noted that all approaches have shortcomings, and that caution is required w.r.t. all decisions.

Principles followed

All feature data should be available for the foreseeable future and, together with the feature definitions, is not changed after the asset model is frozen. The data and feature definition choices are based on domain knowledge obtained from practicing fundamental/discretionary commodity trading, which could be reviewed over time. However, given the focus on stationary factors, the principle used is to also freeze the data aspects to avoid potential overfitting through such a review/update process.
Training and selection of the asset models are strictly based on data available until the end of 2016, which also coincides with the point after which the asset allocation models for most of the liquid commodities were frozen and available in live trading. For this reason, the same period is also used for training/validation of new research and asset models.
Optimizing the parameters in-sample for the final asset allocation method could potentially result in a selection bias. However, the value range is bounded, so it is not expected to have a dramatic impact on the results, and this is supported by the absence of significant overfitting in the out-of-sample results.
Testing of the entire upstream asset allocation model requires that all details are frozen before the out-of-sample test results are obtained using hold-out data from 1 Jan 2017. The strict practice of freezing the method and using the hold-out data sets for the final upstream evaluation, which is common within the ML community, is chosen as the primary measure to reduce the risk of underestimating, and/or accidentally including, a flattering effect on the results from a multiple test bias. For the same reason, most liquid asset allocation models were frozen from 1 Jan 2017 and have not been changed since then, i.e. yielding live out-of-sample results.
The downstream index construction involves parameters that mostly have a modest impact on performance, such as volatility target and allocation limits, and where the decisions are generally based on practical objectives rather than on index performance. However, care must be taken, in particular w.r.t. the index asset selection, which has a larger influence on performance. Here the general principle is to consider assets on an equivalent basis and only introduce discrimination between assets if motivated in the training stage and with support from the out-of-sample testing results.
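The hold-out principle above, with training/validation data until the end of 2016 and out-of-sample data from 1 Jan 2017, amounts to a fixed date split. A minimal sketch is given below; the function name `split_hold_out` and the sample returns are hypothetical, and only the cutoff date is taken from the text.

```python
from datetime import date

HOLD_OUT_START = date(2017, 1, 1)  # cutoff date stated in the principles

def split_hold_out(observations, cutoff=HOLD_OUT_START):
    """Split (date, value) pairs into training/validation and hold-out parts."""
    train = [(d, v) for d, v in observations if d < cutoff]
    hold_out = [(d, v) for d, v in observations if d >= cutoff]
    return train, hold_out

# Hypothetical daily returns around the cutoff.
observations = [
    (date(2016, 12, 29), 0.004),
    (date(2016, 12, 30), -0.002),
    (date(2017, 1, 2), 0.001),
    (date(2017, 1, 3), 0.003),
]
train, hold_out = split_hold_out(observations)
print(len(train), len(hold_out))  # 2 2
```

Keeping the cutoff as a single constant makes it harder to accidentally let hold-out observations leak into model training or parameter tuning.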