How we did it: Making an API gateway extensible in any language

Making an API gateway extensible in any language is no easy task. Thankfully, the Tyk team loves a challenge! Below, CEO and Founder Martin Buhr sits down with Matías Insaurralde, Senior Go Developer, to walk us through the process.

Tyk makes an API gateway that is extensible in any language. What does that mean, and what problem does it solve?

Martin: When I first wrote the gateway it was pretty static – it just “did its thing.” But right in the early days I received a phone call from one of our competitors who was scoping us out.

They were asking lots of different questions about the gateway, and they said, “What about plugins? Do you have any kind of plugins?” And I hadn’t really thought about doing that yet, with it being a static system.

It spurred me on because I was quite excited by being scoped out by this competitor. I started investigating ways of making the gateway more extensible. This was before Go had native module capability.

Go itself is a language that’s statically linked. That means that when you turn the source code into something that runs on a computer, everything that’s needed to run the source code is included in the file that’s made into the binary.

In other languages, you’re able to compile your code in such a way that all the different libraries – the bits that make up the code – get compiled into individual object files. When the program runs, it loads them from the disk into memory dynamically. That means that you can take one of those object files and replace it with another one, thereby changing the functionality of that module.

But because Go is statically linked and everything lives in the same binary, you can’t really introduce new code – new compiled code. Other languages like Python and JavaScript – dynamic languages – can do this because they’re not compiled, they run through an interpreter. C and C++ programs can be dynamically linked, so you can compile and run them in a way where you can load more code at runtime.

NGINX is generally the most popular baseline for a proxy these days. What that has is something called a Lua interpreter. Lua is a fast, “just in time” compiled, interpreted language. This means you can write some code, it goes through the interpreter, and it’s compiled just in time to produce a fast output.

It’s very good. NGINX uses Lua to do scripting, and therefore that’s how they do their plugins.

With Go – and with Tyk – my initial solution was that I found a good JavaScript interpreter. JavaScript is a very popular language, and the entire interpreter was written in Go. This meant I could put the interpreter in, and it would take the files and run the JavaScript. It wouldn’t be as fast as something in C or C++, but at least it allowed us to extend certain components of the gateway.

That was the very first plugin architecture. It was… a mess! But it worked – we had plugins.

Some time passed, and Matías joined the team. He was “employee zero,” perhaps before James (our co-founder) joined – before or just after. We brought him to London, and one of his first tasks was to look at how we could do plugins better.

Matías is an extremely good engineer, and he’s very, very smart. He came up with this whole system to make it work. I can’t even explain it properly because he’s the one who designed it! He basically managed to extend out from the JavaScript component that we had, to start using Python and Lua – two extra languages that we could run locally.

He then added something called a GRPC engine. GRPC is a Remote Procedure Call library. A Remote Procedure Call is kind of like an API. It’s very fast at moving binary messages around.

Instead of having code running in the gateway, you have code running in a separate process for your plugin. But because the protocol is so fast, the gateway connects to the plugin – via a UNIX or TCP socket – and you have similar performance to JavaScript or something like that. But GRPC can be programmed in any language.
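As a rough illustration of the out-of-process idea Martin describes, here is a minimal sketch in Go. It uses the standard library's net/rpc package as a stand-in for GRPC, so it runs without extra dependencies. The HookRequest, HookResponse and Plugin types and the X-Plugin-Seen header are hypothetical names invented for this example; they are not Tyk's actual plugin protocol.

```go
package main

import (
	"fmt"
	"log"
	"net"
	"net/rpc"
)

// HookRequest and HookResponse are hypothetical message types for this sketch.
type HookRequest struct {
	Path    string
	Headers map[string]string
}

type HookResponse struct {
	Headers map[string]string
}

// Plugin is the service that the separate plugin process exposes.
type Plugin struct{}

// PreRequest copies the incoming headers and adds one of its own.
func (p *Plugin) PreRequest(req HookRequest, resp *HookResponse) error {
	resp.Headers = make(map[string]string, len(req.Headers)+1)
	for k, v := range req.Headers {
		resp.Headers[k] = v
	}
	resp.Headers["X-Plugin-Seen"] = req.Path
	return nil
}

// servePlugin is the plugin side: listen on a socket and answer RPC calls.
func servePlugin(network, addr string) (net.Listener, error) {
	srv := rpc.NewServer()
	if err := srv.Register(new(Plugin)); err != nil {
		return nil, err
	}
	ln, err := net.Listen(network, addr)
	if err != nil {
		return nil, err
	}
	go srv.Accept(ln) // returns once the listener is closed
	return ln, nil
}

// callPlugin is the gateway side: connect, send the request, read the reply.
func callPlugin(network, addr string, req HookRequest) (HookResponse, error) {
	client, err := rpc.Dial(network, addr)
	if err != nil {
		return HookResponse{}, err
	}
	defer client.Close()
	var resp HookResponse
	err = client.Call("Plugin.PreRequest", req, &resp)
	return resp, err
}

func main() {
	// "tcp" could equally be "unix" with a socket file path as the address.
	ln, err := servePlugin("tcp", "127.0.0.1:0")
	if err != nil {
		log.Fatal(err)
	}
	defer ln.Close()

	resp, err := callPlugin("tcp", ln.Addr().String(), HookRequest{Path: "/orders"})
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(resp.Headers["X-Plugin-Seen"])
}
```

Here both sides live in one program for brevity. In the setup described above, the plugin side would be its own process, written in whatever language the team prefers.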

It’s just a protocol, and it has lots of different implementations in Java, .NET etc. That means that if you really want to, you can use whatever you already have in-house. You don’t need to hire new engineers – or train up engineers – to write plugins and start maintaining them. That overhead goes away.

We have languages that run in the binary – those are Python, Lua, JavaScript, and now Go, because they introduced modules that allow you to dynamically load code (it’s tricky, but you can do it!).
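For readers curious what dynamically loading Go code can look like, here is a minimal sketch using the standard library's plugin package. The file name myplugin.so and the exported function name Transform are assumptions made up for this example, and this is not necessarily how Tyk loads its Go plugins.

```go
package main

import (
	"fmt"
	"plugin"
)

// loadHook opens a shared object built with `go build -buildmode=plugin`
// and looks up an exported function with the signature func(string) string.
func loadHook(path, symbol string) (func(string) string, error) {
	p, err := plugin.Open(path)
	if err != nil {
		return nil, fmt.Errorf("open plugin: %w", err)
	}
	sym, err := p.Lookup(symbol)
	if err != nil {
		return nil, fmt.Errorf("lookup %s: %w", symbol, err)
	}
	hook, ok := sym.(func(string) string)
	if !ok {
		return nil, fmt.Errorf("%s has unexpected type %T", symbol, sym)
	}
	return hook, nil
}

func main() {
	// Hypothetical plugin file and symbol name.
	hook, err := loadHook("./myplugin.so", "Transform")
	if err != nil {
		fmt.Println("plugin not loaded:", err)
		return
	}
	fmt.Println(hook("hello"))
}
```

Part of the "tricky" side is that the plugin package is only supported on some platforms, and the plugin generally has to be built with the same Go toolchain version and the same versions of any shared dependencies as the host binary.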

Matías was the original architect of the initial “co-process” project, where we had Python, Lua and GRPC. I’ll let Matías describe the actual implementation, because he was very much the owner of that.

Great! Matías, can you walk us through the implementation, and explain what the original solution looked like?

Matías: It was certainly complicated! In the Go world there was limited information about this kind of implementation. Go provides some interfaces and functions to allow you to call or include libraries.

In our case, we wanted to include Python. There are many interpreters for that language, but the most popular is CPython. Most things in the Python world run on top of CPython. We wanted to work out how to embed that interpreter into Tyk.

This required us to dig deep into the Python code itself. There was a lot of research and a lot of testing. On the Go side, this wasn’t something common – it was quite a new scenario.

What did all of this complex work mean for Tyk’s users?

Martin: Well, the gateway does things in a certain way. It’s opinionated in how it will move a proxy request from A to B, do some authentication, some transforms, etc.

But imagine in a typical company, somebody creates a service everybody is dependent on. It was created ten years ago, and the person no longer works there. This happens WAY more often than you might think! It might run on a laptop under a desk with a big sticker on it saying: “Do not switch off!”

Nobody understands how to use this thing, but it keeps ticking over. The company is dependent on it, but they can’t change it. It could be an authentication service or a data service – but it’s not something standard or recognised.

As a company, we can’t take something like that on and say we’ll build support for it. So we need to be able to take all the good stuff we have, and then create some kind of exception in the flow for when we arrive at this “unique” system and it does what it was created to do, such as validate a request.

That’s the value-add for the user. We can make these legacy systems fit into everything else we do. It’s different for each customer – they all have their own things that they want and need for their specific use cases.

It’s a different design philosophy to most of our competitors, especially in the open core/open source world. You would usually have a proxy API gateway that doesn’t really do anything apart from getting a request from A to B.

Instead of having all the functionality – like transforms, timing, tracing, authentication – built in, you have to use plugins. You essentially build your entire application out of plugins. It’s along the lines of how WordPress works.

With Tyk, we put as much as we can into the system so you can be guaranteed that it works. We go through a whole testing process, so you get something that’s highly functional out of the box. The plugins are for the exceptions to the rule – for when you have to go and integrate with something weird in the infrastructure. The focus is different.

So users can make big changes to how requests work by putting some logic in the gateway.

A good example is something really short term: one of our older clients had a big switchover, where all their websites needed to enforce HTTPS instead of HTTP. They were a big content provider – a newspaper. For all of the content they had in the database, all of the images were HTTP links instead of HTTPS. They needed to change those, but the database team wasn’t available for another week – not in time for the deadline they needed to hit.

So instead of updating the database, they went into the gateway and set up a plugin that would analyse the content that came from the server, modify all the URLs it found, and then send that back. It’s the kind of thing you’d only do for a short stopgap; you wouldn’t do it in long-term production. But that’s the kind of stuff we’re talking about.

Another example: a customer could have their own authentication system using something we don’t support. That customer could use a plugin to work around that.

If we’d said that we only support Go-based plugins, the customer would need somebody who knows Go well enough to write and support a plugin. The same applies to Lua. We picked JavaScript and Python because they’re super popular languages, and most people know them.

But then we also added the GRPC component, thinking that if you’re a Java shop, that can be quite a closed universe. Like .NET shops, they don’t tend to run other applications. You need to help those people out as well, and that’s where the GRPC support comes in.

People can write a plugin in the way they know, maintain it in the way they know, and the gateway keeps on trucking.

Is the way Tyk handles this the same now, or have things evolved?

Martin: It’s moved a long way from what we started out with. Matías has taken it really far. It’s now with the wider team. A lot of work could still be done, but it’s highly functional.

Sometimes you can do things in one language that you can’t do in another. It can be really annoying for a plugin developer if you say “Well, you can’t do that in JavaScript, but you can do it in Python.” We’ve ironed out most of that now, but it’s still a factor!

Have there been any bumps in the road that you’ve not been able to get around?

Matías: Performance issues have been difficult, especially around Python. We always try to build the most high-performance code possible, but Python has some limitations. Go, also, can have performance issues when interacting with it.

We’ve had a few customers report related issues over the past five years. Thankfully we were able to work with them to optimise things. In some cases, we could help to correct their plugin code, as we’d seen similar issues before.

Martin: Yes, it’s been around performance generally. For a while, in order to make the Python code work, we needed to compile Go with a very specific version, using specific flags. We had to include something called cgo – the part of the Go toolchain that lets Go code call C code, acting as a foreign function interface.

I’m not sure if we still do it – I think we were able to fix it with a more recent version. But that caused SO many headaches. People were running it in systems where they couldn’t support that, or it caused bugs in the Python version they were running.

If, for example, you ran Tyk, and Tyk was expecting version 3.6, but your operating system came bundled with version 3.4, then it wouldn’t work, or it would create weird issues. But because it was on a secure OS, you couldn’t just get 3.6 running by installing and replacing it. You’d have to go through a rigmarole to replace it, because the distributor might have a policy where they would only use two versions back. In case of bugs, they’d want it to be battle-tested.

It makes sense, it’s conservative security, but it was a huge headache for us. I think Matías solved it with some very clever work.

Matías: Yes, we added a module called Dynamic Loader. It’s much better. We do have some tickets open related to really new versions. We have to test it with each new version, and we had a run where nothing broke, but there are some issues with one version – 3.8, I think – that we are working to fix.

I spent a lot of time working on the Dynamic Loader, but it was worth it because we don’t have to ship our own Python version for each Tyk version. This is much easier for us at Tyk, and easier for the customer, too. They can install two or three Python versions and choose which to use.

Martin: It’s one of those things that’s so small and deep in the code. But once it’s fixed it relieves so much pressure. Sometimes you make changes that seem so geeky, but they actually make customers’ lives much easier.

It’s testament to what the team are doing when they’re building the gateway. Some of the problems are just crazy because they’re so specific – but solving them truly does make people’s lives more straightforward.

This was definitely one of those “hallelujah” moments. It felt like we were finally free of a curse – and it wasn’t easy to get there.

Matías is definitely the plugin guru in the company. It’s very much his world.

Matías: Now, because we’re growing so fast, I’m impressed when I see more people at Tyk working on plugins. More people are now comfortable with modifying plugins and getting involved.

Were there any other “lightbulb moments” that made it all feel worth it?

Martin: There was one customer where having a plugin closed the deal.

We moved Hotelbeds from Mashery. Mashery had a bespoke request signing algorithm. Because they’re a cloud-based provider, they had a very efficient way of calculating the signatures.

The system needed a big clock-skew – a time period during which signatures needed to be valid. That’s a lot of calculation to do during a hash… it’s a cryptographic function, and it’s CPU intensive.

They had a distributed way of doing this, which meant it was quite efficient. When we came in to try to get the deal, we had to do it in real time – and we had to do it with a plugin.

We did a proof of concept – I think it was in JavaScript. It kind of worked, but it was hugely inefficient. And then we implemented it in Go. That’s the reason why we have so many options. It’s easy to prototype in JavaScript, but when you want high performance, you have to write it in Go.

So we built a plugin in Go that did the same thing. All of a sudden, all the latency dropped, and it could do all the calculations and verify the signatures in real time.
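To show why a large clock-skew window makes signature checking CPU intensive, here is a minimal sketch in Go. The signing scheme (an HMAC-SHA256 over an API key and a Unix timestamp) is an assumption invented for this example; the interview does not describe the real algorithm, and this is not the plugin Tyk built.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strconv"
)

// sign computes a hex-encoded HMAC-SHA256 over the API key and a Unix
// timestamp. This scheme is hypothetical.
func sign(secret, apiKey string, ts int64) string {
	mac := hmac.New(sha256.New, []byte(secret))
	mac.Write([]byte(apiKey))
	mac.Write([]byte(strconv.FormatInt(ts, 10)))
	return hex.EncodeToString(mac.Sum(nil))
}

// verify accepts sig if it matches the signature for any second within
// skew seconds either side of now. The worst case costs 2*skew+1 hashes
// per request, which is why a large window is expensive.
func verify(secret, apiKey, sig string, now, skew int64) bool {
	for ts := now - skew; ts <= now+skew; ts++ {
		if hmac.Equal([]byte(sig), []byte(sign(secret, apiKey, ts))) {
			return true
		}
	}
	return false
}

func main() {
	const now = int64(1700000000)
	// The client's clock is two minutes behind the gateway's.
	sig := sign("s3cret", "key-123", now-120)

	fmt.Println(verify("s3cret", "key-123", sig, now, 300)) // prints true
	fmt.Println(verify("s3cret", "key-123", sig, now, 60))  // prints false
}
```

With a five-minute window in each direction, a request with a bad signature costs 601 hash computations before it is rejected, which gives a feel for why the choice of language mattered.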

That was really cool; it closed the deal for us because we were able to do it. When our competitors were trying to do it, they had to modify the memory model for the database – all kinds of crazy stuff that broke the installation. That meant that it wasn’t scalable for the future and that upgrading would have always meant trouble.

For us, it was just a plugin. And that’s where the plugins really shine.

That was probably my favourite “plugins saved the world” moment. But there are many others. I’m always talking to customers and working out, “we’ll do this with a plugin, that with a plugin.” There are all kinds of different use cases, and they’re usually very bespoke to each customer – unique like snowflakes.

How have your competitors dealt with these issues?

Martin: Kong introduced a Go plugin capability recently. They’ve always been Lua-based. There are also lots of small API management solutions coming out based on the Envoy gateway.

They’re using something called Wasm, which is WebAssembly. It’s a compact binary format that runs in the same engines as JavaScript. It’s very cool. JavaScript in a browser can be slow, but Wasm is optimised to such an extent that you can take a game like Quake 3 and run it in your browser.

It runs very efficiently, and it’s become very popular. So Envoy allows Wasm plugins. The thing with Wasm is that you can compile any language into Wasm. For example, you can write something in C and run it through a special compiler, which spits out Wasm at the end. Then you can use the Wasm module in your gateway.

It’s super powerful, and it’s pretty much what we did; they’re just using Wasm as the baseline for it instead of building the compatibility layer. We did it with GRPC. Wasm is actually next on our list, and one of our guys, Geofrey, has already implemented a proof of concept for a Wasm Go-based plugin system that works. It’s very cool.

Matías: Yes, I’ve seen it. Wasm is gaining a lot of traction.

What would you say is the most innovative part of how Tyk has made an API gateway that’s extensible in any language?

Martin: It’s just very cool to be able to say, “you want to run Python? You can run Python. You want to run JavaScript? You can run JavaScript. You want to run Go? You can run Go.”

In terms of innovation, we’ve taken something that’s pretty common practice, and made it more interesting and more extensible for people.

It’s a good approach, and there’s more we can do with it too. We’re looking at doing plugin support for our frontend. We have a proof of concept for that now and it looks really good. It will allow you to modify our UI.

And then we’re going to add plugin capability to the backend of our UI. You’ll be able to have a full user interface with specialised stuff in the backend to completely customise it.

Take the whole package: the gateway, the dashboard… you can customise the gateway using the plugins, and then you can customise the dashboard using plugins, and you can get the two to work together. In the end you can take the whole solution and make it perfect for, say, the open banking sector.

You can do it all using a combination of plugins. We may not be the ones doing that, but our partners might – our vendors might. That’s where it becomes really powerful, and it becomes a very “OEM-able” solution.

We can give specialists in the healthcare industry, for example, a way to customise our stack. We work with them on it and let them go out and sell it for us – to a vertical where we may not be yet.

That’s amazing, and it’s building on what Matías originally built – extending that out. That’s where it’s headed – whether it’s innovative or not, I don’t know! It is like what WordPress can do – you can take a blog, use a bunch of plugins and create a dating site, for example.

How long did it take from you having the initial idea to having clients make use of it in production?

Martin: Well, Matías is still working on it!

Matías: (Laughs) I joined about five years ago. I guess it was two or three years working on it almost full-time – going back and forth to clients, making improvements.

Martin: When we first created the feature, Matías spent a good six to eight months getting the initial part working and polished. It was working, but shaky.

Now it’s at a stable point where we can be confident in selling it, demonstrating it and using it as part of our sales process. People are still working on it, and it’s still being tweaked and extended. For it to become a stable stalwart of our product took a good couple of years, definitely.

Matías: You have to take into account that we didn’t have a big testing team or anything at the start. Now we can delegate more and do things quicker.

What has been your key learning from working on this and delivering it?

Martin: Well, I learned it was possible to do it at all, which was amazing!

Matías: Even though I was already working with Go, I learned a huge amount about the low-level workings of it.

Martin: The low-level work required to do this stuff is really hard. I’d never have been able to build this stuff on my own. I needed somebody like Matías in order to make it work. It goes to show that you have to hire people who are better than you! Matías is somebody who understands this stuff on a truly deep level.

Some engineers are happy to just work on the layer they understand, but others get deep into the internals. It’s a different mindset. Matías was the right person in the right place at the right time to build this.

So, what’s next?

Martin: Wasm. Wasm is next.

Matías: Yes, WebAssembly is the next thing. We also have some minor fixes in progress on the plugins. From my side I’m also working on building more of the team’s understanding of this part of the product.

Martin: Yes, we need to do that as we scale. As we hire more developers there’s more opportunity for that.

Thank you both for sharing so much detail about what it took to build an API gateway that’s extensible in any language. We look forward to hearing more about your latest developments in due course.