How to Tame Your Service APIs: Evolving Airbnb’s Architecture
Jessica Tai discusses the challenges scaling to hundreds of services, how to simplify APIs, the trade-offs in API design, and how to test and operate flexible aggregator APIs and service blocks.
Jessica Tai is an Engineering Manager of Homes Platform Infrastructure at Airbnb. Previously a staff engineer on Airbnb’s Core Services infra team, Jessica has given multiple talks at QCon about the technical design and scaling challenges with the migration to service-oriented architecture. Now, she spends more time thinking about how to grow and scale the humans of Airbnb.
Tai: In the beginning of Airbnb, our architecture was really simple. It was a single Ruby on Rails monolithic application known as monorail. Let's represent our architecture as a rope. If you imagine it, it would be a single rope, easy to follow and easy to detangle. However, as developers added code to monorail over many years, that simple rope soon turned into a complicated knot. Monorail became hard to understand and hard to be productive in. Engineers would groan when they had to make a change and deploy monorail. It was clear we needed to figure out how to untangle this spaghetti mess. The solution we came to was migrating to service-oriented architecture, or SOA. SOA organized this massive monorail knot into separate encapsulated services, shown here as different pieces of rope. Our first version of SOA looked very similar to the model-view-controller layers of monorail, and helped get us through the growing pains at the time. However, new challenges eventually appeared once we had developed hundreds of services. What used to look modular and clean is now beginning to have a complicated series of twists and knots once more.
This is where my team comes in. My name is Jessica. I'm a tech lead manager of the user's infrastructure team. I've been at Airbnb for over six years, and I've seen the various evolutions of our architecture and services. My team is responsible for helping to design that next generation of architecture, focusing on the user entity and domain. I gave a talk at QCon SF two years ago about our migration from monolith to services. This talk will be different. It will be adding on the next chapter of our service architectural journey.
I'll open by describing some of the scaling challenges that we were facing with service oriented architecture. I'll then move into some of the design principles that we created to help us with our second iteration of SOA. I'll then dive into some of the technical details of our abstraction APIs, focused on a central data aggregator as well as service blocks. Then I'll discuss some of the ways that we operate our APIs and the tools that we use to build and maintain these new API patterns.
We originally migrated to services to address the scaling problems with monorail, but having hundreds of services created different types of scaling challenges as well. The scale and challenges were really motivated by growth. This is what our CTO told our homes architecture working group, when we were figuring out how to continue expanding our business and expanding our engineering team. Growth has been a driving factor and motivator for our architecture migrations. From this graph, we can see that our service count and our engineering team are both growing. In 2016, when we began the migration from monolith into services, we had around 500 engineers. At the end of 2016, we had fewer than 100 services. In 2020, we had over 500 services and over 2000 engineers. The needs of our team and the product are changing. Unfortunately, our services were changing as well, and they were beginning to create a dependency graph that was really hard to reason about. Our developer velocity was slowing down once again. It was taking engineers longer to build features and add new data.
Keep this picture in mind and compare it to a snapshot of our service dependencies. As you see here, we have many services connecting to many other services. Having dozens of cascading dependencies means a single request has to go many layers deep. For example, a request to an endpoint on our product description page, where you view a home, hits one of our homes data services over 11 times in addition to calling 29 other services. It's difficult to debug and triage when the service dependency graph is so entangled.
Having that deep, complex call stack was really impacting the way that we were working. Many different integration points were required to make a change, which slowed down developer velocity. With services owned by different teams, there was also a higher collaboration overhead to get a feature done. Our data and our business logic began to be fragmented across multiple services. We were beginning to see repeated patterns in different services. For example, a lot of services were made to load different parts of data, and that data loading code was boilerplate that looked very similar from service to service. We had multiple services, owned by different teams, with similar functionality.
This brought us back to the drawing board. How could we improve upon our service-oriented architecture? It distilled down to simplification. Most obviously, we needed to simplify our dependency graph. We wanted to reduce the number of service dependencies, remove circular dependencies, and make services that were more modular and more clearly defined. We also wanted to simplify the developer experience. Engineers should be able to focus on the business logic and product requirements, and not so much on the boilerplate of wiring data upstream and downstream to different services. We also wanted to simplify the way that we were accessing our data and our endpoints. Providing finer-grained control over the visibility of data fields and services would better protect our services, and would also give us better control of that dependency graph.
Thus we've created what we call SOA v2. This is something that we started at the end of last year, and have been working this year to get the initial pilots up and running. The building blocks for SOA v2 focus on abstraction. We have internal services, which we call presentation services, that are responsible for aggregating the data and providing it to a view that our user-facing product then uses. Instead of having the presentation services take on a lot of the business logic and fetching of data, we've pushed that down to a layer beneath, into what we call a data aggregator. This service introduced new APIs and new API design patterns that allow the aggregation of data to be centralized in a single place, so that our presentation services don't need to duplicate this functionality. Our data aggregation service is able to fetch from multiple data entities, some of which are powered by service blocks. Service blocks also introduce new API design patterns.
Let's walk through an example of how a particular product, such as looking at your home reservation, might look in SOA v1, and then we'll look at it from the v2 tech stack. In the reservation page, you might want to know what your reservation check-in date is. There will also be information about the guest and host user photos, and also the host username. Putting this into SOA v1 resulted in a lot of tedious data loading and relationships between these entities. The presentation service would call the reservation service to get that check-in date. It would also need to fetch the host user ID and guest user ID for this particular reservation. Then it could fan out to the user and user photo services in parallel with those IDs. The relationships between these entities, such as reservation to user, or user to user photo, created a graph-like structure. However, it was not natural to represent this type of multi-entity query with a single Thrift query. All our services in SOA v1 were using Thrift APIs. Instead, we had many presentation services calling many of the data entities in different combinations. Really, it was a lot of boilerplate to fetch this data. There was some business logic applied on top of the data fetched, but we were recognizing that our presentation services had a lot of similar patterns, and similar query patterns where they wanted information from multiple entities.
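To make the SOA v1 loading pattern described above concrete, here is a minimal sketch in Python. It is not Airbnb's code: the three in-memory "services", their function names, and the sample data are all hypothetical stand-ins for Thrift calls. It only shows the shape of the boilerplate: load the reservation first, then fan out in parallel with the IDs it returned.

```python
import asyncio

# Hypothetical in-memory stand-ins for three SOA v1 data services.
RESERVATIONS = {"r1": {"check_in": "2021-07-01", "host_id": "u1", "guest_id": "u2"}}
USERS = {"u1": {"name": "Hana"}, "u2": {"name": "Gus"}}
PHOTOS = {"u1": "hana.jpg", "u2": "gus.jpg"}


async def get_reservation(reservation_id):
    return RESERVATIONS[reservation_id]


async def get_user(user_id):
    return USERS[user_id]


async def get_user_photo(user_id):
    return PHOTOS[user_id]


async def reservation_page(reservation_id):
    # Step 1: the reservation must load first, because it carries the user IDs.
    reservation = await get_reservation(reservation_id)
    host_id, guest_id = reservation["host_id"], reservation["guest_id"]
    # Step 2: fan out to the user and user photo services in parallel.
    host, host_photo, guest_photo = await asyncio.gather(
        get_user(host_id), get_user_photo(host_id), get_user_photo(guest_id)
    )
    return {
        "check_in": reservation["check_in"],
        "host_name": host["name"],
        "host_photo": host_photo,
        "guest_photo": guest_photo,
    }


page = asyncio.run(reservation_page("r1"))
```

Every presentation service that needs a slightly different combination of these entities ends up repeating a variant of `reservation_page`, which is the duplication the talk describes.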
This was the motivation for creating our SOA v2 aggregator. We decided to create this data aggregator by using a GraphQL interface. GraphQL offers a more expressive query language for APIs and a way to express the data modeling and ecosystem. It allows for a runtime population of the query with existing data, and allows for these query structures to be populated in the shape of the desired response. If we take our GraphQL query for the reservation information, asking about the check-in and some basic guest data, we send it to the aggregation service, which has resolvers that know where to fetch these fields from. The resolvers also give us a place to put lightweight business logic. We've added some optimizations on the resolvers. If we recognize that multiple resolvers are fetching from the same data source, we'll batch those together and make a single query to the underlying service. This helps with performance and scalability. It reduces the number of calls to our data service and creates fewer downstream requests. However, remember that there were hundreds of services. This means that the data aggregation service would still need to talk to hundreds of other services to fetch the data and to know where the business logic lives. We were wondering, how could we further simplify the graph?
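The resolver batching described above can be sketched as follows. This is an assumption-laden illustration, not the aggregator's implementation: `BatchLoader`, `fetch_users`, and the two resolver functions are hypothetical names, and the batching rule used here (group every key requested in the same event-loop tick) is one common way to do it.

```python
import asyncio


class BatchLoader:
    """Groups keys requested in the same event-loop tick into one batch call."""

    def __init__(self, batch_fn):
        self.batch_fn = batch_fn
        self.pending = {}  # key -> Future awaiting that key's result
        self.task = None   # dispatch task for the current batch, if any

    def load(self, key):
        loop = asyncio.get_running_loop()
        if key not in self.pending:
            self.pending[key] = loop.create_future()
        if self.task is None:
            # Runs only after the currently scheduled resolvers have had
            # a chance to register their keys.
            self.task = loop.create_task(self._dispatch())
        return self.pending[key]

    async def _dispatch(self):
        batch, self.pending, self.task = self.pending, {}, None
        results = await self.batch_fn(list(batch))
        for key, future in batch.items():
            future.set_result(results[key])


calls = []  # records each request sent to the underlying service


async def fetch_users(user_ids):
    # Stand-in for a single batched request to a hypothetical user service.
    calls.append(sorted(user_ids))
    return {uid: {"id": uid, "name": uid.upper()} for uid in user_ids}


async def resolve_host_name(loader):
    return (await loader.load("u1"))["name"]


async def resolve_guest_name(loader):
    return (await loader.load("u2"))["name"]


async def main():
    loader = BatchLoader(fetch_users)
    return await asyncio.gather(resolve_host_name(loader), resolve_guest_name(loader))


names = asyncio.run(main())
```

Two resolvers ask for two different users, but `calls` ends up with a single entry, so the underlying service sees one request instead of two.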
To do this, we created an entity known as service blocks. We may have a user service block, which would be responsible for having information about the user's first name and their picture. We're going to have a home service block and a reservation service block. What is a service block? It's a logical grouping of services focused on a single core entity, providing a cohesive domain around the business logic for that particular entity. We have a facade service, which exposes a unified API and schema. Everything beneath that service is considered a black box to the client. This helps to simplify the developer experience. Now our client services just need to query this one facade service API. They don't care about the underlying internals of how the services may actually work. The client gets a holistic picture of the entity, such as a user entity, while the facade service is responsible for fanning out the request and fetching the data from our existing services. These internal services are abstracted away and encapsulated.
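A facade of this kind can be sketched in a few lines. The example below is illustrative only: `profile_service`, `photo_service`, `FIELD_OWNERS`, and `user_facade` are hypothetical names, and the field-to-service map is an assumed way of routing a request to the internal services of a user block.

```python
import asyncio


# Hypothetical internal services hidden inside the user block.
async def profile_service(user_id):
    return {"first_name": "Hana"}


async def photo_service(user_id):
    return {"picture_url": f"https://example.invalid/{user_id}.jpg"}


# Which internal service owns which field; clients never see this map.
FIELD_OWNERS = {"first_name": profile_service, "picture_url": photo_service}


async def user_facade(user_id, fields):
    """Single entry point: fan out only to the services the query needs."""
    services = {FIELD_OWNERS[field] for field in fields}
    parts = await asyncio.gather(*(service(user_id) for service in services))
    merged = {"id": user_id}
    for part in parts:
        merged.update(part)
    # Return only what the client asked for, plus the entity ID.
    return {k: v for k, v in merged.items() if k == "id" or k in fields}


user = asyncio.run(user_facade("u1", ["first_name", "picture_url"]))
```

Because clients depend only on `user_facade`, the services behind it could be merged or split without any client changing, which is the encapsulation the talk is after.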
A fair question is: doesn't this introduce an additional layer and network hop? The collection of services that are focused on the same core entity often have data that's queried together. By putting them together behind a single facade, we're able to take that single query and fan it out in an optimal way, similar to the batching that the data aggregator service provides. This also allows us to look at the query patterns and further optimize the internal services, perhaps refactoring or consolidating them in a way that's abstracted away from the client. We can optimize the internal blocks without impacting all the hundreds of clients that may be depending on this facade service.
In our [inaudible 00:14:21] channels, we often get the question of, does this particular field belong to service A or service B? We originally had these data fields in different services instead of a single large user service or a single large home service, because these different fields have different SLA and isolation requirements. However, they interact closely since they are focused on that same core entity. We want to maintain the separation of code, and thus keep them in separate services with separate development iterations, but still expose them through the single facade service API.
Why separate these service blocks from the data aggregator if they're performing similar functionality? There are two reasons. One is that we have core entities that we wanted to keep as separate code bases, as they were critical for the business. We wanted to reduce the chances of having a single point of failure, and having service blocks separately allows teams to develop and iterate on each entity without directly impacting each other. Another motivation was that we wanted to simplify the dependency graph by organizing blocks into these larger directed acyclic graphs. We wanted to remove the cycles in the existing dependency graph, and reduce the complexity of our current SOA web, by having these larger encapsulated blocks. With the hierarchy of our blocks, we only allow certain online calls from a particular block to another block. That other block cannot make an online call in the reverse direction, because that would create a circular dependency. The simplification of our dependency graph was a critical motivator for our design of SOA v2.
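The rule that blocks must form a directed acyclic graph is easy to check mechanically. The sketch below assumes a hypothetical `ALLOWED_CALLS` map (the actual block hierarchy is not given in the talk) and uses a standard depth-first search to report a cycle if one exists.

```python
# Hypothetical block-level dependency map: block -> blocks it may call online.
ALLOWED_CALLS = {
    "reservation": ["home", "user"],
    "home": ["user"],
    "user": [],
}


def find_cycle(graph):
    """Return one dependency cycle as a list of blocks, or None if acyclic."""
    unvisited, in_progress, done = 0, 1, 2
    state = {node: unvisited for node in graph}
    path = []  # blocks on the current depth-first search path

    def visit(node):
        state[node] = in_progress
        path.append(node)
        for dep in graph.get(node, []):
            if state.get(dep, unvisited) == in_progress:
                # dep is already on the current path: that is a cycle.
                return path[path.index(dep):] + [dep]
            if state.get(dep, unvisited) == unvisited:
                found = visit(dep)
                if found:
                    return found
        path.pop()
        state[node] = done
        return None

    for node in graph:
        if state[node] == unvisited:
            found = visit(node)
            if found:
                return found
    return None


# A reverse call from the user block back up to the reservation block
# is exactly the kind of edge the hierarchy forbids.
with_reverse_call = {**ALLOWED_CALLS, "user": ["reservation"]}
```

Here `find_cycle(ALLOWED_CALLS)` returns `None`, while `find_cycle(with_reverse_call)` returns the offending loop. A check like this could run in CI whenever a new block-to-block call is proposed, though whether Airbnb enforces the rule this way is not stated in the talk.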
These abstraction APIs help us by hiding the complexity of underlying services and providing interfaces to these black boxes. With the abstraction APIs, the services that call them don't need to know all the internal services, schemas, and endpoints. We're able to better control what the public API and the internal services can access. We can also define clear schema boundaries for these service blocks. An example of this is what we define as scope. Different levels of our SOA v2 have different scopes and different clients that are accessing them. For example, our data aggregation service may have a service or endpoint that has an access level exposed to our public API. However, we might not want our blocks to be returning public API access data, so we can require that a specific access level, such as user block, be provided in access level headers sent to that user block facade. Combining these access levels with the scope directive that's annotated on our GraphQL schema allows us to indicate the different access levels per field and per entity type.
For example, we're able to provide a scope directive here, saying that this SomeExample type has the scope level of public API. With a similar annotation but a different GraphQL type for user, we might annotate it with a user-block scope.
How can we further simplify the developer experience? We looked at where our engineers were spending a lot of their developer time, and we saw that there were often separate schemas for each of the services. If some service A needed to get service B's fields, then service A would copy over the relevant fields from service B's schema into a new service A namespace. Sometimes the name of the field would even change slightly. This resulted in each service having its own schema and a lot of duplication across the various parts of the stack. Developers needed to handwrite mappings to convert between the different schemas, even though they were representing the same pieces of data. We've started an effort to make a central schema representing a single source of truth for the various data entities at Airbnb. Our central schema is then paired with the scope directive. We're able to annotate the different services with the scope of the schema that they want to auto-load when the service starts. For example, when the user block service starts up, we've said that it has all the fields within the user block access level. It will go into the GraphQL schema, find all the fields with that scope annotation, load that schema into the user block, and expose it as part of the interface.
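The slide with the annotated schema is not part of the transcript, so the following is only a guess at the general shape. The directive syntax `@scope(level: "...")`, the level names, and the `load_schema_for_scope` helper are all assumptions made for illustration. The regex is a shortcut to keep the sketch self-contained; a real service would use a proper GraphQL parser.

```python
import re

# Hypothetical SDL; the directive name and argument shape are assumptions.
SCHEMA = '''
type SomeExample @scope(level: "public-api") {
  id: ID
}

type User @scope(level: "user-block") {
  firstName: String
  pictureUrl: String
}
'''

# Matches: type <Name> @scope(level: "<level>") { <body> }
TYPE_PATTERN = re.compile(
    r'type\s+(\w+)\s+@scope\(level:\s*"([^"]+)"\)\s*\{([^}]*)\}'
)


def load_schema_for_scope(sdl, scope):
    """Return {type name: [field names]} for types annotated with `scope`."""
    loaded = {}
    for name, level, body in TYPE_PATTERN.findall(sdl):
        if level == scope:
            loaded[name] = [
                line.split(":")[0].strip() for line in body.strip().splitlines()
            ]
    return loaded


# What a user block service might load from the central schema at startup.
user_block_schema = load_schema_for_scope(SCHEMA, "user-block")
```

In this sketch the user block service would expose only the `User` type, and `SomeExample` would be left to services that load the public API scope.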
However, the idea of centralizing into the aggregator service and the unified service blocks does ring alarm bells about consolidating back into a monolith. To avoid returning to monorail, we wanted to make sure that code had very clear ownership and boundaries for the various entities. Each change to that unified schema must be reviewed by the team that owns the core entity and has the most product context. Monorail evolved into a dumping ground for some of the core models. For example, the user.rb file had a lot of different attributes that had a user ID but were not necessarily related to the user entity. Now, with our unified schema, we've added annotations that provide more oversight, so that a team is really a product owner of the data going into the unified schema.
An example of this is our owner directive. We're able to annotate the owners of a particular field. This is an easy way to document ownership inline, but it also gives us a way to auto-generate alerts. With many parts of the schema being owned by different teams, we didn't want a single team to have to be responsible for maintaining the whole unified schema. Having the owners annotated here allows us to auto-generate alerts and page teams should any of the fields from their schemas be failing.
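As a rough illustration of how ownership annotations could drive alerting, the sketch below groups failing fields by owning team. The team names, the `FIELD_OWNERS` map (standing in for what parsing the owner annotations might yield), and the fallback team are all hypothetical.

```python
# Hypothetical ownership metadata, as parsing the owner annotations
# might yield: (type name, field name) -> owning team.
FIELD_OWNERS = {
    ("User", "firstName"): "user-infra",
    ("User", "pictureUrl"): "media",
    ("Reservation", "checkIn"): "reservations",
}


def teams_to_page(failing_fields, owners=FIELD_OWNERS):
    """Group failing fields by owning team so each team gets one page."""
    pages = {}
    for field in failing_fields:
        # The fallback team for unowned fields is an assumption of this sketch.
        team = owners.get(field, "schema-maintainers")
        pages.setdefault(team, []).append(field)
    return pages


pages = teams_to_page([("User", "pictureUrl"), ("Reservation", "checkIn")])
```

Only the teams whose fields are failing appear in `pages`, so no single team has to watch the whole unified schema.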
To test whether our code is working or failing before we get into prod, we developed various tools to help us operate our APIs. For our schema, we've developed a browser-based IDE, which allows us to edit GraphQL schemas, mock JSON, and open a GitHub pull request, all from the browser. This is important because engineers from different parts of the stack, including our frontend engineers, are making changes to the schema. However, the frontend engineers work in a code base that's different from the code base that the unified schema is stored in, which is used more often by our backend engineers. This tool is really powerful because it enables any engineer to modify the schema, have the change be easily understood, and get the proper code reviews. There's live GraphQL validation through the browser's enhanced IDE-like experience, which surfaces syntax errors as well as suggestions for different types. The suggestions are important because they help us realize when we might be creating a duplicate field. If we want to name something but something with a similar name already exists, the tool will show where that exists, and perhaps we don't need to create that new field.
We're able to quickly make these changes in line and make a pull request with a given branch name and commit message all from the browser. This browser IDE also gives us the ability to mock data. This is important because it allows us an easy way to generate example responses against different versions of our schema with correctly typed sample data. Putting this together with our GraphQL Explorer, we're able to simplify the developer experience by validating the schema and JSON changes and querying against various environments, including the staging environment, or even our local development environments.
It may seem like a lot of different architectural decisions and tools are required to get this migration started. They are big changes. Investments in tooling and infrastructure are costly. When is the right time to introduce new architectural components and embark on a long migration? This goes back to growth. Growth is really an important motivator, and it's been Airbnb's number one priority and the driver for our first migration to SOA and our continued migration work in SOA v2. It's natural to outgrow your system architecture, but it's important to design system architecture to address your current growth needs. It's important to not over-engineer. Business and technical needs will continue to evolve over time, so it's important to build a system that focuses on the challenges of today.
When we migrated out of our monorail, we did not have a dependency call graph that looked like this. Our needs at the time were to help with developer productivity and create a way for us to quickly and incrementally migrate out of a monolith. Now, however, our needs are to simplify our services and dependency call graph. If we look back at that example of the home reservation page, which has information about the check-in date and the photos of the guest and host, and we put it into our SOA v2 stack, that presentation service is reduced in scope, and most of the functionality is put into the data aggregator. The data aggregator will fan out and get information from the reservation block and the user block, all behind the scenes. The user and reservation blocks may call out to various internal services, but the data aggregator only calls their facade services.
Some of the progress that we've seen so far in our migration to SOA v2 is that our data aggregation service is serving over 10% of user-facing traffic. We've seen performance wins from that. Our user and home service blocks have both piloted production traffic, from that data aggregation service and the tech stack as well as from other internal services. In addition to having a unified schema up and running, we're working on field-level privacy frameworks as additional simple GraphQL annotations.
There's still a lot of work ahead for us, but we're excited to continue the migration to SOA v2. The road is not quite paved yet, but we have an idea of where we want to go, and the future looks promising.
Jun 11, 2021