🤔 Problem

Resource usage on Renkulab.io is very peaky. Some peaks can be anticipated because we know ahead of time when users for certain courses will come online. For those courses, it would be beneficial to pre-warm the nodes, especially when GPUs are involved, because doing so would greatly reduce waiting times.


Renkulab.io is hosted on an Azure Kubernetes Service deployment with node auto-scaling enabled.

In order to minimise the costs of running Renkulab.io, the size of the user session node pools is kept to a minimum.

A key disadvantage is that session startup can be significantly delayed. Launching a user session can initiate a node-scaling operation, and the user session pod cannot be scheduled until the new node is fully operational.

One way that has been identified to minimise the impact of node-scaling operations on user session scheduling is to use pause (over-provisioning) pods. These pods act as simple low-priority placeholders. A newly scheduled user session pod preempts an over-provisioning pod and takes its place. The evicted over-provisioning pod is then rescheduled on a different node or, if no node has room for it, triggers a node-scaling operation.
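As a rough illustration of the placeholder mechanism described above, the following sketch builds the two Kubernetes objects such a setup typically needs: a low-priority `PriorityClass` and a `Deployment` of pause containers that reserve capacity. The object name `overprovisioning`, the priority value, the image tag, and the resource figures are assumptions made for this example, not Renkulab.io's actual configuration.

```python
import json
from typing import Any


def placeholder_manifests(
    replicas: int, cpu: str, memory: str, priority: int = -10
) -> list[dict[str, Any]]:
    """Build a PriorityClass and a Deployment for over-provisioning pods.

    All names and default values here are hypothetical. The only hard
    requirement is that `priority` is lower than the priority of user
    session pods, so that sessions can preempt the placeholders.
    """
    name = "overprovisioning"  # hypothetical name
    priority_class = {
        "apiVersion": "scheduling.k8s.io/v1",
        "kind": "PriorityClass",
        "metadata": {"name": name},
        "value": priority,
        "globalDefault": False,
        "description": "Low-priority placeholder pods that reserve capacity.",
    }
    deployment = {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name},
        "spec": {
            "replicas": replicas,
            "selector": {"matchLabels": {"app": name}},
            "template": {
                "metadata": {"labels": {"app": name}},
                "spec": {
                    "priorityClassName": name,
                    "containers": [
                        {
                            "name": "pause",
                            # The pause image does nothing; the pod only
                            # holds on to its resource requests.
                            "image": "registry.k8s.io/pause:3.9",
                            "resources": {
                                "requests": {"cpu": cpu, "memory": memory}
                            },
                        }
                    ],
                },
            },
        },
    }
    return [priority_class, deployment]


if __name__ == "__main__":
    # Example: five placeholders, each sized like one (assumed) user session.
    for manifest in placeholder_manifests(replicas=5, cpu="1", memory="2Gi"):
        print(json.dumps(manifest, indent=2))
```

Sizing each placeholder's requests to match a typical user session means that one preempted placeholder frees room for exactly one session, which keeps the capacity reasoning simple.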

This solution works well for most sessions on Renkulab.io. However, there are circumstances in which over-provisioning pods are less useful.

Renkulab.io is used in courses, which can include up to 100 participants. Course instructors might want participants to be able to start their own sessions on Renkulab.io in order to complete activities during course occurrences. Considering that Renkulab.io hosts about 50 user sessions at any given time, a sudden demand for 100 additional user sessions represents a large increase in the compute resources required by the platform.

Renkulab’s standard over-provisioning pods provide little benefit here, as they would only speed up startup for the first 5–10 user sessions. Permanently raising the over-provisioning pod count to 100 pods would be wasteful. As a result, course participants still face long delays when starting their sessions during lessons.
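The trade-off above can be made concrete with a small back-of-the-envelope model. It assumes that one over-provisioning pod reserves exactly enough capacity for one user session, and the two-hour course duration is a hypothetical figure chosen only for illustration.

```python
def immediate_and_delayed(burst_sessions: int, placeholder_pods: int) -> tuple[int, int]:
    """Split a burst of session requests into those that can start at once
    (by replacing a placeholder) and those that must wait for new nodes.

    Assumes one placeholder pod frees room for exactly one user session.
    """
    immediate = min(burst_sessions, placeholder_pods)
    return immediate, burst_sessions - immediate


def reserved_pod_hours(placeholder_pods: int, hours: float) -> float:
    """Capacity held by placeholders, expressed in pod-hours."""
    return placeholder_pods * hours


# A course of 100 participants against 10 standing placeholders.
print(immediate_and_delayed(100, 10))  # (10, 90)

# Keeping 100 placeholders for a whole week (168 h) versus only for a
# hypothetical 2-hour course occurrence.
print(reserved_pod_hours(100, 168))  # 16800
print(reserved_pod_hours(100, 2))    # 200
```

Under these assumptions, 90 of the 100 participants would still wait for node scaling, while holding 100 placeholders all week would reserve far more capacity than holding them only for the duration of the course.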

Existing exploratory solutions

Instructors are requested to give advance notice about upcoming course occurrences, so we can know how many course participants will attend and what computing resources they will need. This helps us prepare, but comes with its own operational challenges.

Hands-on scaling

The first approach was straightforward but demanding: manually increasing the size of a user session node pool shortly before course occurrences began. While this did speed up session startup times for course participants, it required someone on the team to constantly be on alert, performing timely hands-on operations for every single course occurrence.

CalendarOps