Gigabytes of data for a bag of groceries – that’s what a single autonomous delivery generates. It’s a huge amount of data, especially when you repeat it over a million times, as we have.
But the rabbit hole goes deeper. The data is also incredibly diverse: bot sensor and image data, user interactions with our apps, transactional data from orders, and much more. And the use cases are just as diverse, from training deep neural networks to creating polished visualizations for our business partners, and everything in between.
So far, we’ve been able to handle all this complexity with our central data team. However, our continued exponential growth has led us to search for new ways of working to keep up with the pace.
We have found that the data mesh model is the best way forward. I’ll describe Starship’s approach to the data mesh below, but first, let’s look at a brief summary of the approach and why we decided to go with it.
What is a data mesh?
The data mesh paradigm was first described by Zhamak Dehghani. The model is built on four core concepts: data products, data domains, the data platform, and data governance.
The main objective of the data mesh is to help large organizations eliminate bottlenecks in their data architecture and deal with complexity. So it addresses many details relevant to setting up an organization, ranging from data quality, architecture, and security to governance and organizational structure. As it stands, only a couple of companies have publicly announced a commitment to the data mesh model – all of them big multi-billion dollar companies. Despite this, we believe it can be successfully applied in smaller companies as well.
Starship’s data mesh
Do data work close to the people who produce or consume the information
To run robotic delivery at scale in markets around the world, we need to transform a diverse set of data into products of value. The data comes from bots (such as telemetry, routing decisions, and ETAs), merchants and customers (with their applications, orders, offerings, etc.), and all operational aspects of the business (from teleoperator tasks to the global logistics of spare parts and robots).
The diversity of use cases is the main reason we were drawn to the data mesh approach – we want to do data work close to the people who produce or consume the information. By following data mesh principles, we hope to meet the diverse data needs of our teams while keeping central oversight reasonably light.
Since Starship is not enterprise-scale yet, it is not practical for us to implement every aspect of the data mesh. Instead, we’ve settled on a simplified approach that makes sense for us now and puts us on the right path for the future.
Define your data products – each with an owner, interface, and users
Applying product thinking to our data is the foundation of the entire approach. We treat anything that exposes data to other users or processes as a data product. A data product can expose its data in any form: a BI dashboard, a Kafka topic, a data warehouse view, a response from a predictive microservice, and so on.
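To make the product framing concrete, here is a minimal sketch of how a data product’s three defining attributes – owner, interface, and users – might be captured as metadata. All names, fields, and values are hypothetical illustrations, not Starship’s actual tooling:

```python
# Hypothetical sketch: modeling a data product's metadata so that every
# product has an explicit owner, interface type, and list of users.
from dataclasses import dataclass, field

@dataclass
class DataProduct:
    name: str
    owner: str                  # the data scientist/engineer accountable for it
    interface: str              # e.g. "bi_dashboard", "kafka_topic", "warehouse_view"
    users: list = field(default_factory=list)

# Example: a dashboard product exposed to merchant partners.
merchant_dashboard = DataProduct(
    name="merchant_turnover_dashboard",
    owner="commerce_data_scientist",
    interface="bi_dashboard",
    users=["merchant_partners"],
)
```

Even this much metadata is enough to answer the two questions the whole approach hinges on: who owns a product, and who depends on it.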
A simple example of a Starship data product is a business intelligence dashboard that lets our merchant partners keep track of their site’s turnover. A more involved example is the self-service pipeline that lets bot engineers send any kind of driving data from the bots to our data lake.
Notably, we do not treat our data warehouse (actually a Databricks lakehouse) as a single product, but as a platform that supports a number of interconnected products. These granular products are usually owned by the data scientists/engineers who build and maintain them, not by dedicated product managers.
The product owner is expected to know who their users are and what needs the product solves for them – and, based on this, to set and meet quality expectations for the product. Partly as a result, we are starting to pay more upfront attention to interfaces – the components that are critical for ease of use but tedious to modify later.
Most importantly, understanding users and the value each product creates for them makes it easy to prioritize ideas. This is critical in a startup context where you need to move quickly and don’t have the time to make everything perfect.
Group your data products into domains that reflect the company’s organizational structure
Before we came across the data mesh model, we had been using a lightly embedded data scientist format at Starship for a while. Effectively, some major teams had a data team member working with them part-time – whatever that meant on any given team.
We set out to define data domains in line with our organizational structure, and this time we made sure to cover every part of the company. After assigning data products to domains, we assigned a member of the data team to take care of each domain. This person is responsible for the entire range of data products in the domain – some owned by themselves, some owned by other engineers in the domain team, and some even by other data team members (e.g. for resourcing reasons).
There are a number of things we love about our domain setup. First of all, every area of the company now has someone looking after its data engineering. Given the subtleties inherent in each domain, this is only possible because we have divided up the work.
Putting structure on our data products and interfaces has also helped us better understand our data landscape. For example, since there are more domains than data team members (currently 19 vs. 7), we now do a better job of making sure that each of us works on a coherent set of topics. We also understand that to alleviate growing pains, we must reduce the number of interfaces used across domain boundaries.
Finally, there’s a more subtle bonus to using data domains: we now feel we have a recipe for dealing with all kinds of new situations. When a new initiative emerges, it is clear to everyone where it belongs and who should run it.
There are also some open questions. While some domains naturally lean toward exposing source data and others toward consuming and transforming it, some domains contain a fair amount of both. Should we split them up when they grow too large? Or should we create subdomains within the larger domains? We will need to make these decisions in the future.
Empower the people who build your data products with decentralized standardization
The goal of the Starship data platform is straightforward: to make it possible for a single data person (usually a data scientist) to take care of a domain end-to-end – that is, to keep the central data platform team out of the day-to-day business. This requires providing domain engineers and data scientists with good tools and standard building blocks for their data products.
Does that mean you need a full data platform team for a data mesh approach? Not quite. Our data platform team consists of one data platform engineer, who in parallel spends half of their time embedded in a domain. The main reason we can stay so lean on data platform engineering is that we chose Spark + Databricks as the foundation of our data platform. Our earlier, more traditional data warehouse architecture placed a significant burden on data engineering due to the diversity of our data.
We have found it useful to make a clear distinction in the data stack between the components that are part of the platform and everything else. Some examples of what we provide to domain teams as part of our data platform:
- Databricks + Spark as a versatile computing platform and work environment;
- one-line functions for ingesting data, for example from Mongo collections or Kafka topics;
- an Airflow instance for scheduling data pipelines;
- templates for building and deploying predictive models as microservices;
- cost tracking for data products;
- business intelligence and visualization tools.
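As a sketch of what a “one-line ingestion function” could look like, here is a plain-Python illustration. The function names, the table-naming convention, and the job fields are all assumptions made for illustration – a real platform would wrap Spark/Databricks readers underneath:

```python
# Hypothetical sketch of one-line ingestion helpers: a domain engineer supplies
# only the source name and domain; the platform fills in sensible defaults.
from dataclasses import dataclass

@dataclass(frozen=True)
class IngestionJob:
    source_kind: str        # "mongo" or "kafka"
    source_name: str        # collection or topic name
    target_table: str       # destination table in the lakehouse
    schedule: str = "@daily"

def ingest_mongo(collection: str, domain: str) -> IngestionJob:
    """One-line Mongo ingestion: the target table is derived from the domain."""
    return IngestionJob("mongo", collection, f"{domain}.raw_{collection}")

def ingest_kafka(topic: str, domain: str) -> IngestionJob:
    """One-line Kafka ingestion: topics land as continuously updated raw tables."""
    return IngestionJob("kafka", topic, f"{domain}.raw_{topic}", schedule="streaming")

# A domain engineer's pipeline definition could then be a few such lines:
jobs = [
    ingest_mongo("orders", domain="commerce"),
    ingest_kafka("bot_telemetry", domain="bots"),
]
```

The point of such a facade is that the defaults live in the platform, so domain teams write one line per source instead of a bespoke pipeline.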
As a general approach, our goal is to standardize as much as makes sense in our current context – even the parts we know won’t stay standardized forever. As long as it helps productivity right now and doesn’t box us into any part of the process, we’re happy. And of course, some elements are still completely missing from the platform. For example, data quality assurance tooling, data discovery, and data lineage are things we have left for the future.
Strong personal ownership backed by feedback loops
Being small in people and teams is an asset in some aspects of governance – for example, it is much easier to make decisions. On the other hand, our key governance challenge is also a direct result of our size. With one data person per domain, they cannot be expected to be experts in every potential technical aspect, yet they are the only person with a detailed understanding of their domain. How do we maximize their chances of making good choices within it?
Our answer: Through a culture of ownership, discussion and feedback within the team. We’ve borrowed liberally from Netflix’s management philosophy and cultivated the following:
- personal responsibility for outcomes (for products and domains);
- seeking out different opinions before making decisions, especially those that affect other domains;
- soliciting feedback and code reviews as both a quality mechanism and an opportunity for personal growth.
We’ve also made some specific agreements about how we handle quality, written down our best practices (including naming conventions), and so on. But we believe that good feedback loops are the key ingredient in turning guidelines into reality.
These principles also apply beyond the “building” work of our data team – which was the focus of this blog post. Clearly, there is much more to how our data scientists create value in the company than just providing data products.
One last thought on governance: we will continue to iterate on our ways of working. There will never be a single “best” way to do things, and we know we need to adapt over time.
That’s it! These were the four basic data mesh concepts as applied at Starship. As you can see, we’ve found a data mesh approach that works for us as a growth-stage company. If it sounds appealing in your context, I hope reading about our experience was helpful.
If you would like to take part in our work, see our careers page for a list of open positions. Or check out our YouTube channel to learn more about the world’s leading robotic delivery service.
Contact me if you have any questions or ideas and let’s learn from each other!