🤔 Problem

Different services in Renku use different logging formats, which makes problems hard to debug and troubleshoot. It is even harder for people who just want to deploy and run Renku, because they do not know the intricacies that someone on the Renku team knows.

🍴 Appetite

4-5 weeks

🎯 Solution

Add a single, centralized flag in the values file to control the logging format. We should support only two formats: a human-readable one and a compact JSON one. JSON should be the default because it is easier for log aggregators such as Loki to parse.
Make the data services (and all the related services that are defined in the same repo) use the same format.
Show the request ID in all relevant log entries. Ideally it is included implicitly so that we do not have to log it explicitly every time. I am not sure whether this is possible, so it is worth checking. If it is not, explicitly including the request ID in every log call is fine.
Change the gateway to conform to this log format. The request ID field should be named and formatted the same way in the gateway and the data services.
Make the UI show the request ID when a 500 error message is shown.
Update documentation on how to search logs and troubleshoot errors when a 500 error occurs after all the above changes are implemented.
Update Amalthea to conform to this format.
Update the UI and UI server logs to conform to this format.
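
The first and third items above (one format switch, request ID included implicitly) can be sketched with the Python standard library alone. This is a minimal illustration, not the data services' actual setup: the `request_id_var` context variable, the field names in the JSON payload, the logger name `renku.example`, and the `"json"`/`"plain"` values are all assumptions made for the example. The idea is that a middleware sets the context variable once per request and a logging filter copies it onto every record, so individual log calls never mention it.

```python
import contextvars
import json
import logging

# Hypothetical context variable; a web middleware would set it once per request.
request_id_var = contextvars.ContextVar("request_id", default=None)


class RequestIdFilter(logging.Filter):
    """Copy the current request ID onto every record that reaches the handler."""

    def filter(self, record):
        record.request_id = request_id_var.get()
        return True  # never drop the record, only annotate it


class JsonFormatter(logging.Formatter):
    """Render each record as one compact JSON object per line."""

    def format(self, record):
        payload = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            "request_id": getattr(record, "request_id", None),
        }
        return json.dumps(payload, separators=(",", ":"))


def configure_logging(log_format="json", stream=None):
    """Return a logger using the "json" (default) or "plain" format."""
    handler = logging.StreamHandler(stream)
    handler.addFilter(RequestIdFilter())
    if log_format == "json":
        handler.setFormatter(JsonFormatter())
    else:
        handler.setFormatter(
            logging.Formatter(
                "%(asctime)s %(levelname)s [%(request_id)s] %(name)s: %(message)s"
            )
        )
    logger = logging.getLogger("renku.example")  # assumed logger name
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger
```

In a real service the `log_format` argument would come from the centralized values-file flag (for example through an environment variable), and the context variable would be set from the incoming request header.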

I did some more digging and it seems that this is more or less what OpenTelemetry is trying to support; see https://opentelemetry.io/docs/languages/python/ for details. We may be able to use OpenTelemetry for all of this instead of coming up with our own custom JSON-like format. In addition, the OpenTelemetry packages for Python/Go/JS generate and inject span/trace IDs so that requests can be followed between services. This seems like a much more powerful and standard approach than relying on request IDs that we define ourselves. The instrumentation packages monkey patch the request libraries (for example httpx) so that when one service sends a request to another, the headers carrying the trace/span IDs are propagated properly. The nice thing is that Keycloak, for example, supports OpenTelemetry, which means that we can get tracing and visibility not just for our own components but also for external ones like Keycloak, Redis and Postgres. Just connecting the gateway and data services with OpenTelemetry would be a great start.
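
To make the header propagation concrete, here is a standard-library sketch of the W3C Trace Context `traceparent` header, which is the format OpenTelemetry propagates by default. It is only an illustration of the mechanism: the function names are hypothetical, and a real service would rely on the OpenTelemetry instrumentation packages rather than hand-rolled code like this. The point is that every hop keeps the same trace ID and mints a new span ID.

```python
import re
import secrets

# traceparent layout: version "00", 32-hex trace ID, 16-hex span ID, 2-hex flags.
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_traceparent():
    """Start a new trace, as the first service receiving a request would."""
    trace_id = secrets.token_hex(16)  # 16 random bytes -> 32 hex characters
    span_id = secrets.token_hex(8)    # 8 random bytes -> 16 hex characters
    return f"00-{trace_id}-{span_id}-01"  # flags 01 = sampled


def child_traceparent(incoming):
    """Keep the incoming trace ID but mint a fresh span ID for this hop."""
    match = TRACEPARENT_RE.match(incoming or "")
    if match is None:
        # Missing or malformed header: start a new trace instead.
        return new_traceparent()
    trace_id, _parent_span_id, flags = match.groups()
    return f"00-{trace_id}-{secrets.token_hex(8)}-{flags}"


def outgoing_headers(incoming_headers):
    """Headers to attach to a downstream call (hypothetical helper)."""
    return {"traceparent": child_traceparent(incoming_headers.get("traceparent"))}
```

For example, a gateway would call `outgoing_headers(request_headers)` before forwarding a request to the data services, and logging the 32-character trace ID in both places is what lets the two sets of logs be joined. The sketch skips details of the specification, such as rejecting all-zero IDs and the `tracestate` header.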

🚞 User stories / journeys

When a 500 status code is received in the UI, I want to see the request ID so that I can track it through all the different services in the logs.

When an error occurs somewhere and I know the request ID, I want to easily filter and show all relevant log lines, especially in services like Loki/Grafana.
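
As a rough picture of the second story, once every service emits one JSON object per log line, filtering by request ID becomes a simple field match. The sketch below does this in plain Python over raw log lines; the `request_id` field name is an assumption, and in practice the same match would be written as a query in Loki/Grafana rather than in a script.

```python
import json


def filter_by_request_id(lines, request_id):
    """Yield parsed log entries whose "request_id" field matches."""
    for line in lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            # Skip non-JSON lines, e.g. from services not yet migrated.
            continue
        if isinstance(entry, dict) and entry.get("request_id") == request_id:
            yield entry
```

Called as `list(filter_by_request_id(open("combined.log"), "abc-123"))`, where `combined.log` is a hypothetical file holding the merged output of several services, it returns only the entries that belong to that request.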

🐰 Rabbit Holes