Covid-19 Forecasting

The following graphs show a forecast of daily new Covid-19 cases and daily new deaths in . The forecasts are computed by a machine learning model that uses all of the Covid-19 case data provided by Oxford University [1,2]. No time-of-year information is included in the model; each feature record contains only localized daily historical trends. This prevents day-biasing of the results, such as the model predicting that Covid outbreaks occur more often on a Monday rather than after 26 days of a specific statistical trend. The model relies on these statistical trends to forecast Covid-19 morbidity and mortality.

The immunity factor (IF), shown in the graphs as a black line starting from the lower left, measures how much exposure the population has had. The IF is computed as (ConfirmedCases + VaccinationDoses - ConfirmedDeaths) divided by the population. Reaching an IF of 1.5 seems to be a critical step in controlling the morbidity of Covid-19. This model does not look at individual vaccinations; it considers only the number of vaccine doses administered. One dose is good and two are better, so the notion of "fully vaccinated" is not used in this learning model.
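As a minimal sketch of the IF formula above (the function name and the sample numbers are hypothetical, chosen only for illustration):

```python
def immunity_factor(confirmed_cases, vaccination_doses, confirmed_deaths, population):
    """IF = (ConfirmedCases + VaccinationDoses - ConfirmedDeaths) / population."""
    if population <= 0:
        raise ValueError("population must be positive")
    return (confirmed_cases + vaccination_doses - confirmed_deaths) / population


# Hypothetical region: 1,000,000 people, 120,000 cases,
# 1,400,000 doses administered, 2,000 deaths.
example_if = immunity_factor(120_000, 1_400_000, 2_000, 1_000_000)
print(round(example_if, 3))  # 1.518
```

Because doses are counted rather than vaccinated people, the IF can exceed 1.0, which is why a threshold such as 1.5 is reachable.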

The Forecast

This graph shows the daily new cases at the top, the daily new deaths at the bottom, and the vaccinations with the immunity factor in the middle. The green lines are the forecast and the red lines are the historical truth. This is raw data: no smoothing or manipulation is performed on the data for training or analysis. The PCA model does all of the smoothing we need for this forecast.

Daily Predictions

These two graphs show the new cases (left) and new deaths (right) on a per-day basis. The green dots are the predictions and the red dots are the known (historical) values. The date runs along the horizontal (X) axis of each graph. The R2 value shown in the title of each graph is the explained variance (wikipedia, scikit) of the predicted values compared to the known values, where known. We know the truth starting on 8/1/2021, up until yesterday. The forecast also starts on 8/1/2021 so that its predictive performance can be measured against that truth.
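For readers who want to see what the explained variance measures, here is a minimal standard-library sketch of the usual definition, 1 - Var(truth - prediction) / Var(truth). The function name and the sample values are hypothetical, and the site's own computation may differ in detail.

```python
from statistics import pvariance


def explained_variance(truth, predicted):
    """Explained variance: 1 - Var(truth - predicted) / Var(truth)."""
    if len(truth) != len(predicted):
        raise ValueError("truth and predicted must have the same length")
    var_truth = pvariance(truth)
    if var_truth == 0:
        raise ValueError("explained variance is undefined for constant truth")
    residuals = [t - p for t, p in zip(truth, predicted)]
    return 1.0 - pvariance(residuals) / var_truth


# Hypothetical daily counts, for illustration only.
truth = [10, 12, 15, 11, 9]
predicted = [11, 12, 14, 12, 8]
print(round(explained_variance(truth, predicted), 3))  # 0.811
```

A value of 1.0 means the residuals have no variance at all. The score becomes negative when the residuals vary more than the truth itself. Note that this formula does not penalize a constant offset between prediction and truth.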

Validation of Learning Model Matching Historical Data

The validation graphs, again with new cases on the left and new deaths on the right, show how well the learning model can reproduce the historical data. Only 80% of the historical data was used for training the model. The hold-out data is selected randomly from the ordered vector of inputs. The training data is the complement of that hold-out set, and is shuffled before being used in training. What you are seeing here is the result of training the model on randomized n-day windows of data and then feeding all of the data back into that model. The R2 (explained variance) is shown in the legends, multiplied by 10,000 for readability.
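The split described above can be sketched as follows. This is an illustration under assumptions, not the site's actual code: the function name, the seed, and the use of plain Python lists are all hypothetical.

```python
import random


def split_holdout(records, holdout_fraction=0.2, seed=0):
    """Randomly hold out a fraction of the ordered records.

    Returns (training, holdout). The hold-out keeps its original order;
    the training set is the complement and is shuffled.
    """
    rng = random.Random(seed)  # fixed seed only to make the sketch repeatable
    n_holdout = round(len(records) * holdout_fraction)
    holdout_idx = set(rng.sample(range(len(records)), n_holdout))
    holdout = [r for i, r in enumerate(records) if i in holdout_idx]
    training = [r for i, r in enumerate(records) if i not in holdout_idx]
    rng.shuffle(training)
    return training, holdout


# Stand-in for 100 ordered feature records.
records = list(range(100))
training, holdout = split_holdout(records)
print(len(training), len(holdout))  # 80 20
```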

The MAE is the Mean Absolute Error. In these graphs, the MAE Pct is the measure of the residual as a ratio to the MAE of the prediction. Small values mean good agreement with the truth, and large values mean divergence.
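For reference, a minimal sketch of the plain MAE follows. The function name and sample values are hypothetical. The exact normalization behind the MAE Pct figure is not spelled out here, so it is deliberately not reproduced.

```python
def mean_absolute_error(truth, predicted):
    """Average of |truth - predicted| over all days."""
    if len(truth) != len(predicted) or not truth:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(t - p) for t, p in zip(truth, predicted)) / len(truth)


# Hypothetical daily counts, for illustration only.
print(mean_absolute_error([10, 12, 15, 11, 9], [11, 12, 14, 12, 8]))  # 0.8
```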

Forecast Validation

The forecast must be validated against the historical truth. The following graphs show the current forecast superimposed over the forecast from 10/12/2021. The faded graph is the past prediction; the current forecast and the truth are bright. The vertical dotted lines mark the "today" date of each forecast. Use these graphs to decide for yourself how accurate the forecast has become.

In some of the graphs, the scale no longer matches that of the historical forecast. In these cases you will have to use your innate mental abilities to rescale the historical forecast. The scaling mismatch occurs when the current forecast has larger predictions than the historical forecast.

The following graph compares the truth data from 10/12/2021 to the prediction that started on 10/12/2021. The top of the graph shows New Daily Cases, and the bottom shows New Daily Deaths. The middle graph shows the relative cumulative cases, both truth and predicted. Green depicts the prediction and red depicts the truth. The cumulative counts are relative, starting from zero on 10/12/2021. The 14-day zero-lag exponential moving average (wikipedia) is shown as the black line in the top and bottom graphs.
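The exact moving-average implementation is not given here. The sketch below uses the common zero-lag formulation: each value is first "de-lagged" by adding its difference from the value (period - 1) // 2 days earlier, and an ordinary exponential moving average with alpha = 2 / (period + 1) is then applied. The function name and the start-up handling are assumptions.

```python
def zero_lag_ema(values, period=14):
    """Zero-lag EMA: an EMA applied to de-lagged input values."""
    lag = (period - 1) // 2       # 6 days for a 14-day period
    alpha = 2.0 / (period + 1)    # standard EMA smoothing factor
    smoothed = []
    ema = None
    for i, x in enumerate(values):
        # Until `lag` earlier samples exist, fall back to the raw value.
        delagged = x + (x - values[i - lag]) if i >= lag else x
        ema = delagged if ema is None else alpha * delagged + (1 - alpha) * ema
        smoothed.append(ema)
    return smoothed


# On a steadily rising series the zero-lag average stays close to the
# latest value, where a plain EMA would trail several days behind.
ramp = list(range(200))
print(round(zero_lag_ema(ramp)[-1], 1))
```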

The lines in the middle graph are the daily running explained variance of the prediction. The first few days of the explained variance are undefined because there isn't enough data to compute a good variance, so the lines start 4 days after 10/12/2021. Look at how the lines diverge when the prediction pulls away from the truth, and then converge back toward zero when the prediction aligns with the truth. Ideally the explained variance would tend towards 1.0 (a perfect match), but the truth data can vary widely (Monday reporting trends, for example). This makes the prediction very wrong on some days and very good on others, so the variance becomes balanced by the negative and positive differences from the truth. When there is a large negative divergence, the scale of the explained-variance axis visually compresses the [0,1] range.
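A running explained variance of this kind can be sketched as below. The assumption is that the score for each day is computed over all days from the forecast start up to and including that day, and is left undefined (None) until 4 days of data exist. The function name and the sample series are hypothetical.

```python
from statistics import pvariance


def running_explained_variance(truth, predicted, min_days=4):
    """Explained variance over days [0..k] for each day k; None when undefined."""
    scores = []
    for day in range(1, len(truth) + 1):
        t, p = truth[:day], predicted[:day]
        if day < min_days or pvariance(t) == 0:
            scores.append(None)  # too little data, or constant truth
            continue
        residuals = [a - b for a, b in zip(t, p)]
        scores.append(1.0 - pvariance(residuals) / pvariance(t))
    return scores


# Hypothetical series: the prediction swings far more than the truth,
# so the running score goes strongly negative.
truth = [10, 11, 10, 11, 10]
predicted = [10, 20, 0, 25, -5]
print(running_explained_variance(truth, predicted))
```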

The middle region of this graph is a stacked graph. The predicted confirmed cases are stacked atop the truth, as if the truth were the zero line. This can be confusing to read and may not convey the importance of this comparison. Even if the prediction does not match the truth perfectly, if the two have the same variance then the stacked areas should be identical in size over any given period. The running daily R2 indicates when the graphs are similar and when they are not. As the R2 diverges negatively, the green area should differ from the red area. As the R2 approaches 1.0, the difference between the prediction and the truth converges to zero.

A Final Note About Machine Learning Models

"Even a stopped clock is right twice a day" - Marie von Ebner-Eschenbach

It's easy to get excited about the numbers when they match. Looking at the model closely, you can easily see where the prediction diverges from the truth. In many regions the model over-predicts the case load, which is largely due to the lack of emigration information in the model.

The true value of this model is its ability to forecast future trends. Look for upward movement and peaks; these tell you where the forecast expects the infection to go. There are instabilities in the model, which are apparent where sharp up-ticks occur and give rise to "super waves." These predictions are incorrect and should be ignored. The instabilities are often caused by model confusion, such as poor reporting of truth data, or new states that were not considered in the model.

At any given moment in this Covid forecast model there is a horizon of validity equal to twice the model period, followed by a protracted period of instability. For the 29-day model, for instance, the validity period is 58 days. For the 11-day model, the validity period is 22 days. Even though the forecast predicts out to April 2022, only consider the forecast within the horizon of validity.
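The rule of thumb above is simple arithmetic, sketched here. The helper names are hypothetical, and the sketch assumes the horizon is counted from the forecast's "today" date, with 10/12/2021 read as October 12, 2021.

```python
from datetime import date, timedelta


def validity_horizon_days(model_period_days):
    """Horizon of validity: twice the model period, in days."""
    return 2 * model_period_days


def last_valid_date(forecast_today, model_period_days):
    """Last date inside the horizon (assumes counting from `forecast_today`)."""
    return forecast_today + timedelta(days=validity_horizon_days(model_period_days))


print(validity_horizon_days(29))                  # 58
print(validity_horizon_days(11))                  # 22
print(last_valid_date(date(2021, 10, 12), 29))    # 2021-12-09
```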