Analyzing the vaccination performance in Germany.

This post will discuss how to analyze the vaccination performance against Covid-19 based on the data provided by the German Roland-Koch-Institut (RKI). Since the RKI publishes daily numbers of the vaccination progress, I was thinking about harvesting these numbers and using the to display them visually on my web site.

While the daily numbers are very interesting in understanding the current state of the country, I became more interested in understanding when Germany will reach herd immunity. Initially, politicians have floated numbers of 75% or 70% of the population being vaccinated, however, as of the last couple of days, it was even circulated that 60% might enough.

One of the challenging aspects of these approaches, for me as a son, parent, and citizen, is that they only reflect a very narrow message that our politicians want to convey. Typically these messages are simplified to make sure people don’t loose hope and follow the necessary restrictions. However, I believe, that it’s very important for everyone to understand that being able to understand the full picture, and this means that the problem will not simply go away in one or two months.

To provide some of this transparency for my own sake, I built this Jupyter Notebook to use the data that I’m scraping daily from the RKI to perform som analysis of the vaccination progress in Germany.

Feel free to use this Notebook and the provided data to reproduce my results or gather your own findings.

%pylab inline
%config InlineBackend.figure_format = 'retina'

import matplotlib.pyplot as plt
import numpy as np'fivethirtyeight')
Populating the interactive namespace from numpy and matplotlib
plt.rcParams['font.size'] = 10
plt.rcParams['axes.labelsize'] = 10
plt.rcParams['axes.labelweight'] = 'bold'
plt.rcParams['xtick.labelsize'] = 8
plt.rcParams['ytick.labelsize'] = 8
plt.rcParams['legend.fontsize'] = 10
plt.rcParams['figure.titlesize'] = 12
plt.rcParams['lines.markersize'] = 10
plt.rcParams["figure.figsize"] = [14, 7]

Loading the Data

The data as provided by the RKI is in Microsoft Excel format, and I have built a Google Cloud function to process the data and transform it into a format that is directly usable by my website. Since we’re going to need the data in a slightly different format, we will have to transform the time series data a bit to make it usable for Pandas and Matplotlib.

import requests
request_data = requests.get("").json()
from IPython.display import JSON
<IPython.core.display.JSON object>

Originally, the data was produced to be dropped directly into the datasets of a ChartJS chart, grouped by federal state. The first step is to perform the aggregation to the country level.

labels = request_data["labels"]
from collections import defaultdict
result = defaultdict(int)

for i, v in enumerate(labels):
    for d in request_data['datasets']:
        result[v] += d['data'][i]

            {'2020-12-29': 41962,
             '2020-12-30': 78109,
             '2020-12-31': 131626,
             '2021-01-01': 165575,
             '2021-01-02': 188553,
             '2021-01-03': 238809,
             '2021-01-04': 265986,
             '2021-01-05': 316962,
             '2021-01-06': 367331,
             '2021-01-07': 417060,
             '2021-01-08': 476959,
             '2021-01-09': 532878,
             '2021-01-11': 613347,
             '2021-01-12': 688782})
list_data = list(map(lambda x: result[x], result))

The data contains the absolute number of vaccinations and not the change to the previous day. To extract the delta to the previous day, I simply use numpy.diff to build a list of deltas to the previous day, with the first value being 0 as the origin.

delta = np.diff([0,] + list_data)
array([41962, 36147, 53517, 33949, 22978, 50256, 27177, 50976, 50369,
       49729, 59899, 55919, 80469, 75435])

Vaccination progress in Germany

This data allows us to plot the vaccination progress in Germany along two axis. First, the total number of vaccinations given and the number of vaccinations per day.

fig, ax = plt.subplots()
ax.set_ylabel("Total Vaccinations")
ax.set_xlabel("Time"), list_data, color="#008fd5", linewidth=2)
ax.set_title("Vaccination progress in Germany")

ax2 = ax.twinx()
ax2.plot(labels, delta, color="#e5ae38", marker="o", linewidth=2)
ax2.set_ylabel("Vaccinations per Day")

plt.setp(ax.get_xticklabels(), rotation=-45)


Predicting the vaccination progress in Germany

So far, we have only observed historical values and have not yet tried to look into the future. As mentioned initially, one of the goals of this article is to provide transparency when Germany will reach herd immunity. To be able to do that, we have to at least try to build a simple model leveraging the number of vaccinations per day. The assumption is that we have not yet reached full capacity when it comes to vaccinations and the number of shots per day will still grow. It is either possible to use curve fitting directly from numpy or using a simple linear regression from sklearn.

from sklearn.linear_model import LinearRegression
xvals = np.array(range(len(labels))).reshape(-1,1)
model = LinearRegression(), delta)
model.score(xvals, delta)

Of course the score is not as good as we want it to be since we only have a couple of data points and due to the new years holidays the data is not as consistent as it could be. To visualize the quality of our model, let’s plot the number of vaccinations per day and the prediction in one chart.

array([30857.97142857, 33679.62417582, 36501.27692308, 39322.92967033,
       42144.58241758, 44966.23516484, 47787.88791209, 50609.54065934,
       53431.19340659, 56252.84615385, 59074.4989011 , 61896.15164835,
       64717.8043956 , 67539.45714286])
fig, ax = plt.subplots()
ax.set_ylabel("Impfungen pro Tag")
ax.set_title("Impfungen in Deutschland")
ax.plot(labels, delta, color="#008fd5", marker="o", linewidth=2, label="Impfungen pro Tag")

ax.plot(labels, model.predict(xvals), ":o", color="black", label="Trend", linewidth=2)
plt.setp(ax.get_xticklabels(), rotation=-45)



When will Germany reach herd immunity?

Satisfied with our linear model, we will use it to predict the number of vaccinations per day for the next 600 days. However, there is a slight caveat. At some point in time, we will no longer be able to vaccination more people per day and the progress will flatten out. Since I’m not a medical doctor, I can only guess some of these values and provide different alternative scenarios.

The total throughput heavily depends on the availability and kind of vaccine, the easier it is to handle, the higher the likely throughput. Potentially, we can use again some data from the RKI to get a reasonable upper bound for an “easy” vaccine like against influenza. The link above mentions that roughly 36% of the population receive influenza vaccines. In addition, we can assume that typically these vaccines are given in a three months period from September to November. I know this is a lot of guess work.

POP_GERMANY = 83 * 1000 * 1000
# 36% receive the vaccine
# 3 months with 4.5 weeks and 5 days of vaccination, assuming business days only
influenza_per_day = int((POP_GERMANY * INFLUENZA_RATE) / (3 * 4.5 * 5))

Of course, this number might be relatively conservative since there is not an immensely high pressure to perform these vaccinations so, let’s assume that the actual capacity would be roughly 50% higher, given the stakes at play.

influenza_per_day = 1.5 * influenza_per_day

With this “guessed” upper bound in play, we will now build the forecast for the next 600 days.

max_days = 600
vacc_per_day = model.predict(np.array(range(max_days)).reshape(-1,1))
def vacc_per_day_with_max(period, max_value):
    """Helper function that is used to pick the minimum
    of the value in the period or the maximum value."""
    return [min(x, max_value) for x in period]
import datetime
base = datetime.datetime.strptime("2020-12-29", "%Y-%m-%d")
all_labels = [(base + datetime.timedelta(days=x)) for x in range(600)]

Using these helpers, we can now build a simple Pandas data frame containing the predicted values given an upper bound.

import pandas as pd
vacc_data_rows = np.transpose([
    vacc_per_day_with_max(vacc_per_day, 200*1000),
    vacc_per_day_with_max(vacc_per_day, 300*1000),
    vacc_per_day_with_max(vacc_per_day, 500*1000),
    vacc_per_day_with_max(vacc_per_day, influenza_per_day)
column_labels = [200000, 300000, 500000, influenza_per_day]
df = pd.DataFrame(vacc_data_rows, index=all_labels, columns=column_labels).round(4)
200000.0 300000.0 500000.0 663999.0
2020-12-29 30857.9714 30857.9714 30857.9714 30857.9714
2020-12-30 33679.6242 33679.6242 33679.6242 33679.6242
2020-12-31 36501.2769 36501.2769 36501.2769 36501.2769
2021-01-01 39322.9297 39322.9297 39322.9297 39322.9297
2021-01-02 42144.5824 42144.5824 42144.5824 42144.5824

Out of curiosity, we’re interested in when exactly we will reach the cross over point of when we will no longer be able to vaccinate more people per day. For all columns, we select the first row, where the number of vaccinations has reached the maximum as indicated by the colum label.

cross_over_vals = []
for c in df.columns:
[Timestamp('2021-02-27 00:00:00'),
 Timestamp('2021-04-04 00:00:00'),
 Timestamp('2021-06-14 00:00:00'),
 Timestamp('2021-08-11 00:00:00')]

Herd Immunity

To calculate herd immunity, we have to translate from vaccinations per day to the cumulative sum of vaccinations to get a total. In addition, we transform the total to a percentage value based on the population of Germany and of course since most of the vaccines require two shots, we have to double the amount of shots given to account for that.

cum_vacc = (df.cumsum() / (POP_GERMANY * 2)).round(4)
cum_vacc_filter = cum_vacc < 1
cum_vacc = cum_vacc.where(cum_vacc_filter) * 100
200000.0 300000.0 500000.0 663999.0
2020-12-29 0.02 0.02 0.02 0.02
2020-12-30 0.04 0.04 0.04 0.04
2020-12-31 0.06 0.06 0.06 0.06
2021-01-01 0.08 0.08 0.08 0.08
2021-01-02 0.11 0.11 0.11 0.11
... ... ... ... ...
2022-08-16 68.70 99.90 NaN NaN
2022-08-17 68.82 NaN NaN NaN
2022-08-18 68.94 NaN NaN NaN
2022-08-19 69.06 NaN NaN NaN
2022-08-20 69.18 NaN NaN NaN

600 rows × 4 columns

For easier visualization, we want to annotate the points when the herd immunity is reached for 70% and 60% of the population. This can be done using a scatter plot of the values for respective herd immunity. The following steps simply build new data frames for these herd immunity assumptions.

herd_vals_70 = []
for k in cum_vacc.columns:
    flt = cum_vacc[cum_vacc[k] >= 70]
    if not flt.empty:
        herd_vals_70.append((70, flt.head(1)[k].index[0]))
herd_vals_70 = np.array(herd_vals_70)

df70 = pd.DataFrame(herd_vals_70, columns=["y", "x"])
y x
0 70 2022-03-04
1 70 2021-11-04
2 70 2021-10-07
herd_vals_60 = []
for k in cum_vacc.columns:
    flt = cum_vacc[cum_vacc[k] >= 60]
    if not flt.empty:
        herd_vals_60.append((60, flt.head(1)[k].index[0]))
herd_vals_60 = np.array(herd_vals_60)
df60 = pd.DataFrame(herd_vals_60, columns=["y", "x"])
y x
0 60 2022-06-05
1 60 2022-01-08
2 60 2021-10-02
3 60 2021-09-12

The final step is now to put all the values together in a final plot.

# Create a new plot that we use as a base. Since we're modifying the plot
# directly, we generate a new subplot.
fig, ax = plt.subplots()

today =
today_offset = today + datetime.timedelta(days=4)
colors = ["#feb24c", "#fd8d3c", "#fc4e2a", "#e31a1c", "#bd0026", "#800026"]

# First, we're setting some text annotations and horizontal / vertical lines
# that we use as visual guidlines with regard to reaching herd immunity.
ax.text(today_offset, 71, "Herd Immunity 70%")
ax.text(today_offset, 61, "Herd Immunity 60%")
ax.text(today_offset, 20, f"Today {today.strftime('%m-%d-%Y')}", rotation=90)
ax.text("2021-12-31", 2, "1 year since vaccinations", rotation=90)
ax.axvline(today, linestyle=":", linewidth=2, color="#800026")
ax.axvline(datetime.datetime(2021, 12, 29), linestyle=":", linewidth=2, color="#800026")
ax.axhline(70, linestyle=":", linewidth=2, color="#800026")
ax.axhline(60, linestyle=":", linewidth=2, color="#800026")

# First, we plot the actual prediction of the different curves
plotted = cum_vacc.plot(ax=ax, linewidth=2, color=colors)
ax.legend(["200k vacc/day", "300k vacc/day", "500k vacc/day", "1.5 x influenza vacc/day"])
# Next, we scatter plot the points at which the herd immunity is reached.
df70.plot("x", "y", ax=ax, kind='scatter', s=60, color="#000000", zorder=3)
df60.plot("x", "y", ax=ax, kind='scatter', s=60, color="#000000", zorder=3)

# Lastly, we annotate each point with the exact date.
for k,v in df70.iterrows():
    yoff = 10
    if k > 0:
        yoff = 15 * (k)-5

               (v["x"], v["y"]),
               textcoords='offset points',
               fontsize=12, zorder=3,

for k,v in df60.iterrows():
    yoff = 20
    if k > 0:
        yoff = 20 * (k)-5

               (v["x"], v["y"]),
               textcoords='offset points',
               fontsize=12, zorder=3,



Looking at the above graph makes me slightly optimistic that there is a way to deal with the pandemic over the course of the year and that we will have a way to return to a certain degree of normalcy. But of course, in the above model, I assumed that the vaccines will always be available and the only thing stopping our growth is the capacity of the process not the resource.

However, I think, as I will update the above graph over the next couple of months, we will see how “right” or “wrong” my predictions have been.