Operating autonomous robots on city streets is a huge software engineering challenge. Some of this software runs on the bot itself, but much of it runs in the backend: remote control, route finding, matching bots to customers, fleet health management, and also interactions with customers and merchants. All of this has to run 24/7, without interruption, and scale dynamically to match the workload.
Starship’s SRE team is responsible for providing the cloud infrastructure and platform services to run these backend services. We have standardized on Kubernetes for our microservices and run it on top of AWS. MongoDB is the primary database for most backend services, but we also like PostgreSQL, especially where strong typing and transactional guarantees are required. For asynchronous messaging, Kafka is the platform of choice and we use it for almost everything aside from shipping video streams from the bots. For observability we rely on Prometheus, Grafana, Loki, ELK and Jaeger. CI/CD is handled by Jenkins.
A significant portion of SRE time is spent maintaining and optimizing our Kubernetes infrastructure. Kubernetes is our main deployment platform and there is always something to improve, whether that’s tuning autoscaling settings, adding Pod disruption policies, or improving Spot instance usage. Sometimes it’s like laying bricks – simply installing a Helm chart to provide a certain piece of functionality. But often the ‘brick’ has to be chosen and evaluated carefully (is Loki good for log management, is a service mesh a thing, etc.), and sometimes the functionality doesn’t exist in the world and has to be written from scratch. When this happens, we usually reach for Python and Golang, but also Rust and C when needed.
Another big part of the infrastructure that SRE is responsible for is data and databases. Starship started out with a single monolithic MongoDB – a strategy that has worked well so far. However, as the business grows, we need to revisit this architecture and start thinking about supporting bots by the thousands. Apache Kafka is part of the scaling story, but we also need to figure out sharding, regional clustering, and microservice database architecture. On top of that, we are constantly developing tools and automation to manage the existing database infrastructure. Examples: add MongoDB observability with a custom sidecar proxy to analyze database traffic, enable PITR support for databases, automate regular failover and recovery tests, collect metrics for Kafka re-partitioning, and enable data retention.
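The sharding direction can be illustrated with a minimal sketch of hash-based routing: a bot ID is hashed to pick one of N database shards. The shard names and helper below are made up for illustration – MongoDB’s hashed shard keys do this natively on the server side.

```python
import hashlib

# Hypothetical shard names, purely for illustration.
SHARDS = ["shard-eu-1", "shard-eu-2", "shard-us-1", "shard-us-2"]

def shard_for(bot_id: str) -> str:
    """Deterministically map a bot ID to one of the shards.

    Hashing spreads IDs evenly across shards while always sending the
    same bot's data to the same place.
    """
    digest = hashlib.sha256(bot_id.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(SHARDS)
    return SHARDS[index]
```

The same ID always lands on the same shard, so reads and writes for one bot stay together; rebalancing when shards are added is a separate problem (consistent hashing or MongoDB’s balancer).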
Finally, one of the most important goals of Site Reliability Engineering is to reduce Starship production downtime. While SRE is sometimes called upon to deal with infrastructure outages, the most impactful work is done to prevent outages and ensure that we can recover quickly. This can be a very broad topic, from owning a solid K8s infrastructure to engineering practices and business operations. There are great opportunities to make an impact!
A day in the life of SRE
Arrive at work some time between 9 and 10 (sometimes working remotely). Grab a cup of coffee, check Slack messages and emails. Review the alerts that fired during the night and see if there’s anything interesting there.
There is: MongoDB connection latency spiked during the night. Digging into the Prometheus metrics with Grafana, I find that this happens while backups are running. Why is this suddenly a problem – we’ve been running those backups for ages? It turns out we compress the backups very aggressively to save on network and storage costs, and this consumes all the available CPU. The load on the database seems to have grown just enough to make this noticeable. It happens on the standby node, so it doesn’t affect production, but it’s still a problem should the primary fail. Add a Jira item to fix this.
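The underlying trade-off is easy to reproduce: a higher compression level shrinks the backup but burns more CPU. A self-contained illustration using Python’s zlib – the actual backup tooling is not shown here, so treat this purely as a demonstration of the effect:

```python
import time
import zlib

# A few MB of compressible stand-in data; a real backup dump would be
# far larger and less uniform.
payload = b"bot telemetry record " * 200_000

for level in (1, 9):  # light vs aggressive compression
    start = time.process_time()
    compressed = zlib.compress(payload, level)
    cpu = time.process_time() - start
    print(f"level={level}  size={len(compressed)}  cpu={cpu:.3f}s")
```

Level 9 never produces a larger output than level 1, but the CPU cost grows much faster than the space savings – which is exactly what starves the standby node during backups.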
Next, change the MongoDB prober code (Golang) to add more histogram buckets to get a better understanding of the latency distribution. Run the Jenkins pipeline to roll the new prober out to production.
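To see what extra buckets buy, here is a hypothetical pure-Python sketch of Prometheus-style cumulative (`le`) histogram bucketing. The real prober is Go code using the Prometheus client library; the bucket boundaries below are invented for illustration:

```python
import bisect

# Invented bucket boundaries, in seconds.
coarse = [0.1, 1.0, 10.0]
fine = [0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0]

def bucket_counts(samples, bounds):
    """Cumulative counts per upper bound, Prometheus-style.

    A sample falls into the first bucket whose bound is >= the sample
    (the `le` semantics); the final slot is the implicit +Inf bucket.
    """
    counts = [0] * (len(bounds) + 1)
    for s in samples:
        counts[bisect.bisect_left(bounds, s)] += 1
    for i in range(1, len(counts)):   # make cumulative, as Prometheus does
        counts[i] += counts[i - 1]
    return counts

samples = [0.07, 0.3, 1.2]
print(bucket_counts(samples, coarse))
print(bucket_counts(samples, fine))
```

With the coarse boundaries all three samples blur into “somewhere under 10s”; the fine boundaries separate the fast probe from the slow outlier, which is what makes the backup-window latency visible.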
At 10 AM there’s the standup meeting; share your updates with the team and learn what others are up to – setting up monitoring for a VPN server, instrumenting a Python app with Prometheus, setting up ServiceMonitors for external services, debugging MongoDB connectivity issues, and piloting canary deployments with Flagger.
After the meeting, resume the work planned for the day. One of the things I planned to do today was to set up an additional Kafka cluster in a test environment. We’re running Kafka on Kubernetes, so it should be straightforward to take the existing cluster YAML files and adapt them for the new cluster. Or, on second thought, should we use Helm instead, or maybe there’s a good Kafka operator available now? No, I’m not going there – too much magic; I want more explicit control over my statefulsets. Raw YAML it is. An hour and a half later a new cluster is up and running. The setup was fairly straightforward; only the init containers that register Kafka brokers in DNS needed a configuration change. Generating credentials for the applications required a small bash script to set up the accounts on Zookeeper. One bit that was left dangling was setting up Kafka Connect to capture database change log events – it turns out the test databases aren’t running in ReplicaSet mode and Debezium can’t get an oplog from them. Backlog this and move on.
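Under the hood, creating such an account means storing derived SCRAM credentials (Kafka supports SCRAM-SHA-256/512) rather than the password itself. A hedged sketch of the RFC 5802 derivation; the function and its defaults are illustrative, not the exact script used:

```python
import base64
import hashlib
import hmac
import os

def scram_sha256_credentials(password: str, iterations: int = 4096) -> dict:
    """Derive the SCRAM-SHA-256 values (RFC 5802) stored per Kafka user."""
    salt = os.urandom(16)
    # SaltedPassword = PBKDF2-HMAC-SHA256(password, salt, iterations)
    salted = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, iterations)
    client_key = hmac.new(salted, b"Client Key", hashlib.sha256).digest()
    stored_key = hashlib.sha256(client_key).digest()   # what the broker checks
    server_key = hmac.new(salted, b"Server Key", hashlib.sha256).digest()
    return {
        "salt": base64.b64encode(salt).decode(),
        "stored_key": base64.b64encode(stored_key).decode(),
        "server_key": base64.b64encode(server_key).decode(),
        "iterations": iterations,
    }
```

The broker only ever sees the salt, iteration count, and the derived keys, which is why these credentials are safe(r) to keep in Zookeeper than plaintext passwords.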
Now it’s time to prepare a scenario for the Wheel of Misfortune exercise. At Starship we run these to improve our understanding of the systems and to share troubleshooting techniques. It works by breaking some part of the system (usually in test) and having some unfortunate person try to troubleshoot and mitigate the problem. In this case I’ll set up a load test with hey to overload the microservice for route calculations. Deploy this as a Kubernetes job called ‘haymaker’ and hide it well enough that it doesn’t immediately show up in the Linkerd service mesh (yes, evil). Later, run the ‘Wheel’ exercise and take note of any gaps we have in playbooks, metrics, alerts, etc.
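For a self-contained flavor of what the load generator does, here is a toy Python stand-in for hey: a thread pool firing GET requests and counting successful responses. The throwaway local server exists only to make the sketch runnable; in the exercise the load would be pointed at the real service instead.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import urlopen

class OkHandler(BaseHTTPRequestHandler):
    """Throwaway target that answers 200 to everything."""
    def do_GET(self):
        self.send_response(200)
        self.end_headers()
        self.wfile.write(b"ok")

    def log_message(self, *args):
        pass  # keep the demo output quiet

def load_test(url: str, requests: int = 40, concurrency: int = 4) -> int:
    """Fire `requests` GETs from a small pool; return the count of 200s."""
    def hit(_):
        with urlopen(url) as resp:
            return resp.status
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        return sum(1 for status in pool.map(hit, range(requests)) if status == 200)

server = HTTPServer(("127.0.0.1", 0), OkHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
ok = load_test(f"http://127.0.0.1:{server.server_port}/")
server.shutdown()
print(f"{ok} responses OK")
```

The real hey adds rate limiting, latency histograms, and HTTP/2 support, but the principle – concurrent workers hammering one endpoint – is the same.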
For the last few hours of the day, block all interruptions and try to get some coding done. I’ve reimplemented the Mongoproxy BSON parser as an asynchronous streamer (Rust + Tokio) and want to see how well it works with real data. Turns out there’s a bug somewhere in the guts of the parser and I need to add some deep logging to figure it out. Find a wonderful tracing library for Tokio and get carried away with it…
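The framing step of a streaming BSON parser is simple to sketch, because every BSON document starts with a little-endian int32 holding its own total length. A pure-Python illustration of splitting complete documents out of an arbitrary byte feed – the implementation described above is Rust + Tokio, so this is a sketch of the idea, not that code:

```python
import struct

class BsonFramer:
    """Buffer incoming bytes and emit each complete BSON document."""

    def __init__(self):
        self._buf = bytearray()

    def feed(self, chunk: bytes) -> list:
        """Consume a chunk; return any documents completed by it."""
        self._buf.extend(chunk)
        docs = []
        while len(self._buf) >= 4:
            # Leading little-endian int32 counts the whole document,
            # including the 4 length bytes and the trailing 0x00.
            (doclen,) = struct.unpack_from("<i", self._buf, 0)
            if doclen < 5:
                raise ValueError(f"invalid BSON length prefix: {doclen}")
            if len(self._buf) < doclen:
                break  # incomplete document: wait for more bytes
            docs.append(bytes(self._buf[:doclen]))
            del self._buf[:doclen]
        return docs
```

Feeding the minimal five-byte empty document (`b"\x05\x00\x00\x00\x00"`) split across chunks shows that documents are only emitted once complete – the property an async streamer has to preserve when bytes arrive in arbitrary pieces.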
Disclaimer: the events described here are based on a true story. Not all of it happened on the same day. Some meetings and interactions with co-workers have been edited out. We’re hiring!