Multi Data Centre Bridge: building the world’s most powerful globally distributed API management platform

What does it mean to build a world-leading API management platform? At Tyk, it means innovating. Not for the sake of innovation itself, but to make things better. That might mean making something more user-friendly, more lightweight or simply faster. Ideally all three at once!

Over the years, Tyk has rolled out plenty of innovative new features, many of which began as little lightbulbs in our CEO Martin’s brain. So we sat Martin down to explore one of those features – and the journey to creating it – in detail. Cue Martin to talk about Tyk’s Multi Data Centre Bridge (MDCB)…

Let’s start right at the beginning. What does Multi Data Centre Bridge mean, and how does it relate to building the world’s most powerful globally distributed API management platform?

Martin: Let’s say you’re running some software in the cloud. You’re a UK start-up and you’re working with the UK market – a calendar app, a dating app, or something like that. You might have some APIs to power your apps.

You decide, “OK, we’ll host this with Amazon Web Services, in their London region, or perhaps in Ireland. That means we’re closer to our customers.” Essentially that means that when customers access the application, they are physically as close as possible to the server that’s doing the work, so they get a slightly faster connection.

But then that same company might say, “OK, but what happens when something goes wrong with AWS in London? Maybe we’ll duplicate our whole architecture in the Ireland region. We’re still close to where our customers are, and if one fails the other can take over, with some technical wizardry to switch between the two.”

That’s a simple example of using more than one data centre – it’s essentially disaster recovery, or “high availability.”

If you’re talking about standard software that you’re putting out there, the biggest problem you usually have is migrating database data between the two systems.

Take that a bit further; say the company expands into Australia. If a customer in Australia needs to use the app, they’re making an internet request across the world, to the London data centre. So the latency is actually quite high.

I live in Auckland, and living on the other side of the world, I can tell you that if you access a website that’s only hosted in one location, with no kind of edge network, you really feel that!

So that’s why you have these multiple regions for APIs – to be closer to your customers and to provide resilience. There are lots of other good reasons too.

It’s not only about using multiple data centres to support a single application. If you’re a larger company, you may have multiple teams, or one team working across multiple development cycles – the people making the software, the people testing it and the people who push it into production. You may want to provide centralised governance while still ensuring separation of gateways, APIs and data.

The people making the software will, at some point, want to “smoke test” what they do. That means they don’t want to run it on their local laptop, they want to try it in a production-like environment. That would be in the cloud, but obviously they can’t push it into the live system. They need to put it into an account for software that’s still being worked on.

So you end up with your work-in-progress in one environment. Then it goes to the quality team, who may put it into another, a kind of near-production environment. That may need to be across multiple data centres, because they need to test all the high availability stuff.

After that, the DevOps team pushes it live. So, within an organisation, you have these silos between teams – multiple data centres and multiple environments, you could call them. And then you have the actual production environment, which is even bigger.

The problem is, if you’re managing APIs, that adds an extra layer of complexity.

When you’re only managing a piece of software, you just put that software in front of the user. But if the thing you’ve built has components that you don’t want to expose as an API, or you need to change things in some way – such as doing some mediation or access control – you then have something else you need to manage across all of these different environments.

The Tyk gateway sits in front of all of this, managing the traffic coming in. It’s like a valve in the pipe of traffic. You install this valve in each of your environments and then calibrate it. That calibration may be slightly different, as there may be different requirements for each region. As an example, in your Australian data centre, you may need to use a different authentication system to log users in and keep data within Australian borders.

What Multi Data Centre Bridge (MDCB) solves is this: instead of having to configure each of these systems independently, in each region – doing it four, six or eight times – you do it once, in a central location. Then your configuration changes propagate across all the different regions.

MDCB enables a high availability API management system, across multiple regions and across different configurations. You manage it all in a central place.

It keeps the system highly available because all the gateways sitting out there behave in a way that’s ephemeral. They can go down, and they can be replaced. You can kill one, and a new one spins up – stuff like that.

You’re solving a really complex environmental problem with some software that cleverly manages all of those regions for you. 

What’s the normal use case for this? Is it specifically for large companies working across multiple countries?

Martin: It is usually companies that have a large geographic presence. For example, HotelBeds, one of our case study customers. They need to be highly available, with very low latency. Everything needs to be very fast, so they need to be close to their customers. They have their system in Europe, and on both coasts of the US – three different data centres, because they operate globally.

Another example is an organisation that has commercial interests across the world – companies that evolve to be a global enterprise, as lots of businesses do.

Another use case for MDCB is for internal sectioning of a complex development stack. Both large and small enterprises can have very complex internal operations. Take the example of a bank with development teams in New York and Shenzhen and a QA team in India. It needs to be able to coordinate between those three teams. It’s fine when everything goes into production, but before that they need to have their own silos. This gives them an internal organisation problem.

With extra large companies, you usually have both internal organisation problems AND external organisation problems! MDCB is basically great for any company that has complex inter-regional issues around how they manage their software stack.

What sparked the original idea for MDCB?

Martin: The software was born because Tyk’s original cloud platform – the first version I wrote, which has been deprecated now – only ran on the east coast of the US. We couldn’t properly replicate it and make it a global system. So when we were selling to customers in Australia or Europe, and they didn’t want to pay for the infrastructure to run it themselves, they couldn’t choose our cloud.

So we set out to still give those customers the cloud, and say that all they needed to run locally was a small gateway. Instead of running the whole stack, you’d just run a tiny component locally. That became our hybrid offering and it was very successful; we had lots of people buying hybrid because it was great for managing traffic locally alongside a cloud platform. It was simple to set up and effective.

Then we worked out it was capable of solving a problem our enterprise users had. So we took the piece of software that was handling the bridging, and developed MDCB from there.

MDCB came from a real use case that we had to solve. I was on a call with a bank talking about the issue and I had a real “eureka” moment; I realised that I already had a piece of software to solve it. It was rough around the edges, but it would work.

I didn’t tell them that, obviously! But that was how MDCB was born.

What were the hard parts of creating this? Where were the dead ends and places you felt stuck? Or was it all smooth sailing?

Martin: It certainly wasn’t all smooth sailing. Throughout the project we’ve had scaling issues. It needed to be fast and it needed to be secure. I picked a Remote Procedure Call (RPC) library to handle the comms between the gateway and the central system, because in the language we use to program the gateway there’s a really clever thing you can do called an “interface.”

It’s not like a user interface – it’s kind of like “duck typing.” The idea is that if it walks like a duck, talks like a duck and looks like a duck, it’s a duck! So if you write a class, a piece of code that has a certain signature, that has certain types of functions, types of input that are named a certain way, and has the same kind of outputs as something else – as this interface defines – then you can swap the two out.

So I can write an implementation of an interface that says, “here, talk to my database, get data,” or I can write another implementation of it that says, “talk to this remote server.” The bigger software stack just uses the interface because it knows that if it walks like a duck, talks like a duck and looks like a duck, it’s a duck! It knows what it’s going to get back, so it’s a really good way of swapping out back ends and components that you might want to commoditise.
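To make that “duck typing” idea concrete, here is a minimal Go sketch – Go being the language the gateway is written in. The names (KeyStore, LocalStore, RPCStore) are illustrative rather than Tyk’s actual interfaces; the point is simply that the caller only depends on the interface, so a local database back end and a remote RPC back end are interchangeable:

```go
package main

import "fmt"

// KeyStore is an illustrative interface: anything that can fetch and store
// keys "walks like a duck" as far as the rest of the gateway is concerned.
type KeyStore interface {
	GetKey(id string) (string, error)
	SetKey(id, value string) error
}

// LocalStore talks to a local database (here just an in-memory map).
type LocalStore struct{ data map[string]string }

func (s *LocalStore) GetKey(id string) (string, error) {
	v, ok := s.data[id]
	if !ok {
		return "", fmt.Errorf("key %q not found", id)
	}
	return v, nil
}

func (s *LocalStore) SetKey(id, value string) error {
	s.data[id] = value
	return nil
}

// RPCStore would talk to a remote server instead; the rest of the code
// neither knows nor cares which implementation it has been given.
type RPCStore struct{ endpoint string }

func (s *RPCStore) GetKey(id string) (string, error) {
	// In a real implementation this would be an RPC call to s.endpoint.
	return "", fmt.Errorf("remote lookup of %q not implemented in this sketch", id)
}

func (s *RPCStore) SetKey(id, value string) error {
	return fmt.Errorf("remote write not implemented in this sketch")
}

// authenticate only depends on the interface, so the back end can be swapped.
func authenticate(store KeyStore, token string) bool {
	_, err := store.GetKey(token)
	return err == nil
}

func main() {
	local := &LocalStore{data: map[string]string{"abc123": "session-data"}}
	fmt.Println(authenticate(local, "abc123"))                                        // true
	fmt.Println(authenticate(&RPCStore{endpoint: "mdcb.example.com:9091"}, "abc123")) // false in this sketch
}
```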

In the very early days, it was a case of swapping out the back end for the Remote Procedure Call library. But, as we scaled this thing up, the RPC library itself had certain quirks. For example, it would automatically set up 50 different connections per client, to our server!

It also had a couple of quirks around how TLS worked – how encryption worked – where it might crash the server if you did it wrong. We had to handle lots of little quirky things like that as we scaled up.

At one point we had to run so many different systems – actual servers – to maintain this huge connection pool. We’ve now thankfully resolved that.

I don’t think we ever had a real crisis moment. It was more of an ongoing “scaling this is a bit of a nightmare” moment.

That nightmare was mostly in our own cloud, because we ran the largest version of this thing. We had large numbers of international clients all using the software and connecting to our backend. And each gateway they ran was creating 50 connections! It overwhelmed our backend. So there were some issues there, but we solved them – thankfully.

This was before there were other solutions for stuff like this. If I had the choice to go back, I probably would have written the RPC component in something called gRPC, which is on the hotlist at the moment. It’s very good; it’s very fast.

So, yes, the big struggle was scaling. Getting the thing to be really efficient, not leak or blow up! It doesn’t just handle basic data like configuration. The gateways also bring analytics back – and with large deployments that’s a huge amount of data. It’s funnelled up into the main application and has to be processed. That was also a bit of a problem.

Were there any other growing pains?

Martin: Absolutely! In each one of these regions, it would be fine if there was only one gateway – then it’s just one thing you’re talking to, and that’s fine. But that’s not usually the case.

Usually, in each one of these regions, you’re running multiple gateways. Again, you’re looking at high availability, so that if one server fails, the others can pick up the slack. So when you send an update to that cluster of gateways, you only want one of those to pick up the update, not all of them.

The way it works is the gateway caches data, so when a request goes through, and it says, “authenticate this API token,” the gateway will say, “OK, let me look this token up.” If it can’t find it locally in its cache, it goes “OK, let me go and talk to the master server.” Then the MDCB system will pull it out from the backend and say, “OK, here you go.”

The local system then caches that data so it doesn’t have to go and look it up again. The idea is that if the master system dies, all that data is still local, so the local system will continue to work.

That’s great, but what happens when you delete one of those tokens, so that something doesn’t have access anymore? In that instance, MDCB will send an update down saying, “OK, delete this token from the cache please.” And it will.

Updates work the same way: the token gets removed from the cache, and the next time that request appears, the local system looks it up again and gets the updated token.
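As a rough illustration of that lookup-and-invalidate flow – a minimal Go sketch with hypothetical names, not Tyk’s actual code – the cache tries locally first, falls back to the master, caches the answer and drops an entry when told to:

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

var errNotFound = errors.New("token not found")

// masterLookup stands in for the MDCB/master lookup; in reality this would
// be an RPC call back to the central system.
func masterLookup(token string) (string, error) {
	if token == "abc123" {
		return "session-for-abc123", nil
	}
	return "", errNotFound
}

// TokenCache is a minimal local cache with the "look up locally, fall back
// to the master, then cache the answer" behaviour described above.
type TokenCache struct {
	mu    sync.Mutex
	items map[string]string
}

func NewTokenCache() *TokenCache {
	return &TokenCache{items: map[string]string{}}
}

func (c *TokenCache) Get(token string) (string, error) {
	c.mu.Lock()
	if v, ok := c.items[token]; ok {
		c.mu.Unlock()
		return v, nil // served locally, even if the master is unreachable
	}
	c.mu.Unlock()

	v, err := masterLookup(token) // ask the central system
	if err != nil {
		return "", err
	}

	c.mu.Lock()
	c.items[token] = v // cache it so the next request stays local
	c.mu.Unlock()
	return v, nil
}

// Invalidate is what a "delete this token from the cache" signal would trigger.
func (c *TokenCache) Invalidate(token string) {
	c.mu.Lock()
	delete(c.items, token)
	c.mu.Unlock()
}

func main() {
	cache := NewTokenCache()
	fmt.Println(cache.Get("abc123")) // first call goes to the master
	cache.Invalidate("abc123")       // update/delete signal from MDCB
	fmt.Println(cache.Get("abc123")) // re-fetched and re-cached
}
```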

But there’s a race condition when you have multiple gateways all receiving the signal and you do an update. Let’s say that one of them gets the delete command first. It deletes the token, sees a new request come in and pulls a fresh copy – but then the next gateway processes its command, deletes the token again, and the whole lookup has to happen again. All of a sudden, you’re doing the same work five times – and you can end up in a situation where you’re struggling to track quotas and limits – all the things the gateway is supposed to be doing.

This ends up with everything going into a hot load loop. That’s horrible. You really don’t want that! You want each cluster to work independently, and you want one gateway to implement the change.

So that was a fun challenge! Implementing that was a bit difficult, because you have to identify which cluster each gateway belongs to, and then say “OK, how do I get them to only do this once?”

That was a big problem and we solved it with a really complicated bit of code. My engineers hate me for it because they think there are better ways of doing it! But it works really well.

Was there a “lightbulb moment” when you knew you’d cracked it, and that it was all going to be fine?

Martin: Yes, I think so. With all of this stuff, the biggest problem is that you have to maintain backwards compatibility. That’s a big constraint. You can’t change the interface too much, because if you change it too much then it can break things for older clients trying to connect.

The lightbulb moment was coming up with the idea of using a command queue. The dashboard that we make all the changes in sends signals out. The MDCB system picks up those signals, and instead of immediately relaying them, it queues them up. It creates a stack of commands to run.

It requires the first gateway that receives the command to execute it; that gateway is then responsible for executing the whole stack. The local gateways also all talk to each other.
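A minimal Go sketch of that “first gateway claims the batch, then runs the whole stack” idea might look like the following. In a real cluster the claim would be an atomic operation against shared storage rather than the in-process stand-in used here, and none of these names are Tyk’s:

```go
package main

import (
	"fmt"
	"sync"
)

// Command is one queued configuration change, e.g. "delete token X".
type Command struct{ Name string }

// Claimer decides which gateway in a cluster gets to run a given batch.
// In a real system this would be an atomic "set if not exists" against a
// shared store; here it is simulated in-process.
type Claimer struct {
	mu      sync.Mutex
	claimed map[string]string // batch ID -> gateway ID
}

func (c *Claimer) Claim(batchID, gatewayID string) bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	if _, taken := c.claimed[batchID]; taken {
		return false // another gateway got there first
	}
	c.claimed[batchID] = gatewayID
	return true
}

// runBatch executes the whole stack of commands, but only on the gateway
// that successfully claimed the batch.
func runBatch(claimer *Claimer, gatewayID, batchID string, batch []Command) {
	if !claimer.Claim(batchID, gatewayID) {
		fmt.Printf("%s: batch %s already handled, skipping\n", gatewayID, batchID)
		return
	}
	for _, cmd := range batch {
		fmt.Printf("%s: executing %s\n", gatewayID, cmd.Name)
	}
}

func main() {
	claimer := &Claimer{claimed: map[string]string{}}
	batch := []Command{{"invalidate-token abc123"}, {"reload-api-definitions"}}

	var wg sync.WaitGroup
	for _, gw := range []string{"gw-1", "gw-2", "gw-3"} {
		wg.Add(1)
		go func(id string) {
			defer wg.Done()
			runBatch(claimer, id, "batch-42", batch)
		}(gw)
	}
	wg.Wait()
}
```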

That was cool, I felt rather smart when I did it. Yes, there may be better ways of doing it, things we could improve, but it solved a tricky problem early on.

Did you have any of the Tyk user community working with you on this, or were you flying solo?

Martin: When it first kicked off, it was very much just me running it. My team then inherited a lot of my code. I built a lot of the stack, so if you talk to some of the team, they will probably say, “Oh yes, I remember inheriting this hot mess!”

It’s obviously not that bad, but if you’re talking about really clean architecture and code, the team were the ones to turn it from my “hobby project” into this really hardcore piece of work.

The users were involved, but this is very much a backend solution – it’s meant to transparently let you transport a local configuration to a remote one.

The one thing the users did get heavily involved in is how the analytics are handled. There’s this big data stream that goes from your gateways, all the way up to your system. There’s the ability to split that.

For example, talking about HotelBeds again: they have very high traffic, which generates a huge amount of analytics data. If you have multiple regions, one way that cloud providers make money is by billing you for traffic between them. So if you’re sending huge amounts of analytics data from cloud region A to cloud region B, and you’re doing that three or four times, then your data bill goes up quite significantly.

So for cases like this, we had to look at how we could “split the target.” Instead of sending the data to the main system in the middle, you send it somewhere else where you can work with it. That was an interesting problem to solve. We’ve solved it now, and it’s really flexible in terms of where you want to send the data.

In terms of getting it all to work, obviously clients were vocal about problems and bugs. They certainly battle tested it. It had already been battle tested in our cloud, but we had some real high-energy users where weird little problems only showed up in really high traffic situations.

In terms of Tyk’s competitors, has anyone else created this kind of solution, and if not, why not?

Martin: A lot have tried, and they’ve tried to solve it in different ways. They all say they do multi data centre and multi region stuff. But they tend to do it based on the data plane. That means they replicate the database across all the regions, or they do a single configuration, push that out, and that’s it.

We’re talking about quite static setups – not as dynamic as ours, or as intuitive to understand. They also don’t do some of the cool stuff that Tyk does.

One of our differentiating features is that you can split the gateways into groups within a cluster. Imagine you have the London cluster or the New York cluster. You can split them so that in the New York cluster, group A will run one configuration, while group B will run another. You can really segment it out.

Our new Tyk cloud is powered almost entirely by MDCB. So when we’re deploying something, it’s an MDCB deployment. If a client decides they need a gateway in Australia, they deploy a hybrid gateway there, and it connects back to the MDCB system.

Each one of the deployments across the world is tagged, so they’d all be tagged as say “AcmeEdge.” So if you go into the API you’re making and give it a tag of “AcmeEdge,” all the gateways will pick it up. But if I only want to target London, I can just target London – or I could put in two tags, for London and Australia, and those gateways would pick it up.
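As a rough sketch of that tag-matching behaviour – illustrative Go structures, not Tyk’s real API definition schema – a gateway loads an API definition if the two share at least one tag, so tagging an API “AcmeEdge” reaches every edge gateway, while tagging it only “london” targets just London:

```go
package main

import "fmt"

// APIDefinition and Gateway carry tags; a gateway loads an API if they
// share at least one tag. (Illustrative structures, not Tyk's real schema.)
type APIDefinition struct {
	Name string
	Tags []string
}

type Gateway struct {
	Region string
	Tags   []string
}

func (g Gateway) shouldLoad(api APIDefinition) bool {
	for _, apiTag := range api.Tags {
		for _, gwTag := range g.Tags {
			if apiTag == gwTag {
				return true
			}
		}
	}
	return false
}

func main() {
	gateways := []Gateway{
		{Region: "London", Tags: []string{"AcmeEdge", "london"}},
		{Region: "New York", Tags: []string{"AcmeEdge", "newyork"}},
		{Region: "Sydney", Tags: []string{"AcmeEdge", "australia"}},
	}

	everywhere := APIDefinition{Name: "global-api", Tags: []string{"AcmeEdge"}}
	londonOnly := APIDefinition{Name: "uk-api", Tags: []string{"london"}}

	for _, gw := range gateways {
		fmt.Printf("%-8s loads %s: %v, %s: %v\n", gw.Region,
			everywhere.Name, gw.shouldLoad(everywhere),
			londonOnly.Name, gw.shouldLoad(londonOnly))
	}
}
```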

That’s quite unique to Tyk – really segmenting traffic and managing an edge location. We were the first to market with it. We were very much the first to market with a hybrid gateway solution, and we were also the first to market with this MDCB solution as a commercial offering.

If you go and ask the analysts and experts in the space, looking for something that’s multi region, they’ll say that Tyk has the best offering.

Others have ways of tackling the same problems; they’re just not as good, generally because they do it in a different way. A lot of our competitors don’t own the whole stack. Tyk’s stack was built from the ground up by Tyk, so everything from the thing that moves the bits around on the network, right up to the user interface, is all part of our own codebase.

With our competitors, they tend to sit on top of somebody else’s platform. They’ll use something like NGINX or Envoy or Apache – all web servers that have a way to customise them. Then they build on top, customise the hell out of them to make their gateways work. That means that there’s a core layer that they can’t really change or infiltrate without taking a big technical risk.

With Tyk, we don’t have that problem. If we’re in a situation where we want to change how something asks for data, we can. That allows us to do a whole load of cool stuff.

It all sounds very innovative. What do you think is the most innovative part?

Martin: I think what I like most is how resilient it is. When you’re running a hybrid gateway, it does clever stuff in that hybrid mode. For example, the gateway will go, “I’m in a client state,” go to MDCB and say, “give me my configuration.”

It takes all that core configuration data, encrypts it and then stores it in the local cache as well. The gateways do that all the time, so there’s always a last known good configuration.

The worst thing that could happen is that your multi data centre master goes down, so there’s no communication any more with the master. And then all the gateways fail at the same time, meaning you have to bring up a new one.

But this new one has no other gateways to talk to. It doesn’t have a master to talk to. It’s a blank slate. When it starts up, in the absence of MDCB or another gateway to talk to, it checks for a last known good configuration. It goes to the cache, loads up that configuration, and fires up in a sort of “cold mode.”

What that means is, so long as the cache is OK (which is a separate layer, a data layer – you can have a different resiliency configuration for that and back that up in whichever way you wish, using automated tooling) then when that gateway starts up, it will have whatever keys or tokens it’s recently seen. It will have the latest configuration to keep itself up and running. And it will be able to proxy traffic again.
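A hedged sketch of that startup decision, in Go with hypothetical function names: try the master first, fall back to the last known good configuration in the local cache, and only start empty if neither is available.

```go
package main

import (
	"errors"
	"fmt"
)

// Config stands in for the gateway's API and policy configuration.
type Config struct{ Source string }

// fetchFromMDCB simulates asking the master for configuration; here it
// always fails, as in the disaster scenario described above.
func fetchFromMDCB() (Config, error) {
	return Config{}, errors.New("MDCB unreachable")
}

// lastKnownGood simulates reading the encrypted copy the gateway stored in
// its local cache the last time it successfully synced.
func lastKnownGood() (Config, bool) {
	return Config{Source: "local cache (last known good)"}, true
}

// startGateway prefers a live configuration, but falls back to "cold mode"
// with the cached copy so traffic keeps flowing.
func startGateway() Config {
	if cfg, err := fetchFromMDCB(); err == nil {
		return cfg
	}
	if cfg, ok := lastKnownGood(); ok {
		fmt.Println("starting in cold mode from cached configuration")
		return cfg
	}
	fmt.Println("no configuration available; starting empty")
	return Config{Source: "blank slate"}
}

func main() {
	cfg := startGateway()
	fmt.Println("proxying traffic with configuration from:", cfg.Source)
}
```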

It can basically truck along in the worst possible scenario to keep your traffic going. It will then snap out of that mode once the connection is restored and reconfigure itself.

That resiliency model is very cool. We’re really happy with that, and it’s a very small footprint. We’re looking to make it even smaller, so it’s even faster to spin up these little gateways at the edge.

Another thing the team did was to build a push mechanism. When I first built this, it was very much a pull-based update. It would pull the configurations down for the things it needed. That meant that a gateway in a fail state wouldn’t be very good at handling a new situation – a new token it hadn’t seen before.

That could happen if, for example, you had a whole data centre fail, and you start moving everything to a different one. The keys cached over here may not have been cached over there. If the master has failed, they may not have seen them, so there’s a novel situation to deal with.

This is why the team built a push mechanism. Gateways now constantly receive a stream of new updates, new keys, so they can cache them up front, if they’ve never seen them. So that was quite cool. It’s not life-changingly innovative, but it’s still a really clever system.
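A minimal sketch of that push model, with a Go channel standing in for the network stream from the central system (again, illustrative names rather than Tyk’s implementation): the gateway consumes pushed key updates and caches them before any request for them ever arrives.

```go
package main

import (
	"fmt"
	"sync"
)

// KeyUpdate is one pushed key: something created or changed centrally that a
// gateway may never have seen in its own traffic yet.
type KeyUpdate struct {
	ID      string
	Session string
}

// warmCache consumes the push stream and caches keys up front, so a gateway
// promoted during a failover already knows about them.
func warmCache(updates <-chan KeyUpdate, cache map[string]string, mu *sync.Mutex) {
	for u := range updates {
		mu.Lock()
		cache[u.ID] = u.Session
		mu.Unlock()
		fmt.Println("pre-cached key:", u.ID)
	}
}

func main() {
	var mu sync.Mutex
	cache := map[string]string{}

	// The channel stands in for the network stream from the central system.
	updates := make(chan KeyUpdate)
	var wg sync.WaitGroup
	wg.Add(1)
	go func() {
		defer wg.Done()
		warmCache(updates, cache, &mu)
	}()

	for _, u := range []KeyUpdate{{"tok-1", "s1"}, {"tok-2", "s2"}} {
		updates <- u
	}
	close(updates)
	wg.Wait()

	mu.Lock()
	fmt.Println("keys known before any request arrives:", len(cache))
	mu.Unlock()
}
```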

We’re very proud of it because it powers some really core infrastructure for us. Our new cloud lets you do some amazing things architecturally, with your team, your organisation, and how you configure your gateways. And it’s all with a single click. It feels like it magically works, and that’s cool! It’s become a really core component of our stack.

How long did it take between having the idea in your head, and Tyk having a fully operational MDCB solution?

Martin: The initial version for the original cloud only took around two weeks! It was a very experimental proof of concept. I was the only person working on it to begin with, and then it went to the team. It didn’t need much babysitting until we had to work with it at real global scale with some of the biggest brands on the planet.

It’s evolved over years but the actual development time was quite short. It’s essentially quite a simple idea: connecting the databases to an RPC layer and making them available for high demand.

While the initial idea was pretty quick, turning it into a proper product took a month or two. It was quick because, as an application, it doesn’t need a UI. We didn’t have to worry about how it looked. It’s headless: you configure it, set it up and run it. Those applications are always faster to build because the primary focus is on how resilient the code is.

The early versions didn’t support much functionality – they were very basic. But now it’s fully featured and handles a lot of really complex scenarios.

What have been your own key learnings from this project?

Martin: Trust your team! There are very smart people around me. I’m glad I’m no longer managing it because the stuff the team has done is amazing. They’ve done optimisations, changes and tweaks in ways I’d never have thought possible. It’s great.

I wish I’d had them earlier in the process. Due to how our start-up grew, in several cases we’ve started with something I wrote, which the team has now turned into something amazing. It’s hard not to think that if I’d had that team at the beginning, we could have made it amazing sooner.

That’s the main thing – I wish I had some of those very smart people around me from the start – instead of it all being my problem!

I need to get more used to having them there. When our new cloud launched, the controller for that was – again – something I built and handed over to a team. It was like “Martin, not again! Why are you doing this? Please stop!”

As a company, we scaled very quickly. We took what we had, organically grew it, and took it to production. Now we’re able to start with something rough, rebuild it properly, and then take that to market.

What difference does MDCB make for Tyk’s customers, now it’s complete?

Martin: It’s enabled some really complicated setups.

Tyk is very flexible – you can configure it in 1,000 different ways, with different types of topologies to make it work. MDCB fits into that – it’s just another topology of Tyk. It’s complicated, but it’s so flexible with all the different components we’ve got.

As I said with the analytics, you can move your analytics to different regions and control where you send the data. You can also do that at both ends – at the hybrid end and at the MDCB end. You can tell MDCB to send its data to other BI systems.

With this, you can create really complicated management scenarios. It’s enabling these global clouds, global systems – and it works across clouds too.

You’re no longer tied to a vendor; you can use anything you like. You can have it in your local data centre on bare metal, you can run it in the cloud – even run it on your laptop – and still configure it via your main interface.

The centralised management approach allows large, complex teams to do large, complex work.

A lot of solutions that come to market tend to look at the greenfield ideal. So they’ll say “well, you should be doing it this way, so we’ll design it for what you should do.”

That’s stupid because it’s not the reality. The reality is that companies are messy, especially large ones. They grow through acquisition, they go out there and they buy company A, they buy company B, and they merge with company C. All of a sudden, they have six or seven different software teams, different stacks, different software environments, different clouds. That’s the real mess of the modern enterprise.

The way MDCB works is that it lets you put a little governance on top of that mess and take control of it by accepting the chaos.

One thing you do find with enterprises is that they like to put things in boxes – compartmentalising. You have teams, and those teams have structures, permissions. You have to build for that – that beehive of compartments. It’s never uniform. But it’s always fun.

Thanks, Martin, for sharing your Tyk MDCB journey with us!