Harnessing the potential of fully API-based data access and analysis

REGION  Global
SECTOR  Healthcare
PRODUCT  Open Source, Self-Managed

CanDIG and Tyk

CanDIG – Canada’s Distributed Infrastructure for Genomics – is a national federated health research data platform, connecting distributed national datasets, and connecting researchers to a platform where they can discover and explore data.  As a driver project of the Global Alliance for Genomics and Health (GA4GH), and a part of the Canada/EU/Africa CINECA project, it is beginning to connect researchers and data across the globe.

CanDIG has relied on Tyk’s open source API Gateway to secure access to its APIs and data since 2018. Tyk helps CanDIG deliver an innovative platform for the discovery, exploration, and analysis of health data, enabling national-scale analysis while keeping this sensitive private data under local control.

Who Is CanDIG?

CanDIG is Canada’s solution to health data analysis. It facilitates national, distributed analysis of locally controlled private genomic data. The CanDIG platform enables clinical researchers at distributed sites to query and analyze quality data, with no central infrastructure to trust, maintain, or secure. By supporting the sharing of research-quality, consented health research data, CanDIG is facilitating the efficient and effective diagnosis, treatment, and follow-up of health conditions including COVID-19, cancer, and rare diseases.

CanDIG connects researchers at Montreal’s McGill University and Montreal University Health Centre; Toronto’s University Health Network (UHN), and, shortly, Hospital for Sick Children; and Canada’s Michael Smith Genome Sciences Centre. The platform is a peer-to-peer federation, directly connecting the centres to each other, with no centralized infrastructure. Coordination happens at the level of software, policy, and data standards development. CanDIG ensures that the sites control their own data, which translates into distributed authentication and authorization decision-making, informed by platform-level information; the concern for local control of data extends to ongoing interest in privacy-preserving methods and privacy by design. All data access is API-based – even local data. This enables fine-grained logging and auditing, as well as the potential for fine-grained authorization. It also allows for the abstraction of back-end data stores.

Because CanDIG is developed by a small team, it relies wherever possible on standards for reusability and interoperability, best practices, and open-source software. “CanDIG is building an open-source, standards-based infrastructure to power truly national-scale Canadian genomics health research projects,” comments Amanjeev Sethi, Senior Application Developer at UHN. “The data supported by the CanDIG platform is part of multiple national projects, whose use is each governed by the sites involved in each of the projects. Our partnership with various sites allows for users across the country to analyse national-scale data while maximising privacy and keeping it under local control. This lets Canadian-scale research programs expand, and makes it easier for new projects to begin.”

Why did CanDIG need an API gateway, and why Tyk?

The initial implementation of the CanDIG software stack was a monolith.  With only one “service”, it controlled the authentication and authorization of all components.

While this greatly simplified early implementation and prototyping, it was clear that it would not scale as needs grew. Onboarding new developers to the codebase was challenging, and requirements for new capabilities were arriving faster than development on a monolith could support.

The team started looking for an API gateway to provide a consistent external interface while the code base was being refactored. “Tyk played an important role in our move away from our original monolith architecture”, says Jimmy Li, Research Programmer at the Genome Sciences Centre. “The ability to implement virtual endpoints in its middleware let us refactor our original authentication flow in a way that we could start to support additional services, with uniform authentication. And Tyk’s open source gateway was the only API gateway on our long list of candidates that supported multiple OpenID Connect identity providers out of the box.”

Tyk’s full support for OpenID Connect has enabled CanDIG teams to use multiple ID providers simultaneously – an important feature behind the decision to use Tyk, according to Sethi. As he points out, “CanDIG has some pretty particular authentication and authorization requirements. For our fully distributed federation, with each site self-governing, we need to accept authentication credentials and claims about those identities from multiple institutions.” The CanDIG architecture uses the OpenID Connect protocol to encode its authorization information.

The middleware and OpenID Connect support weren’t the only reasons for choosing Tyk. “As a ground-up peer-to-peer federation, we have to be able to deploy our software stack in the existing variety of environments we find at our sites,” explained Shaikh Farhan Rashid, Senior Application Development Specialist at UHN. “That includes bare-metal environments with little automation for infrastructure, and OpenStack environments, and commercial clouds for development or demo purposes. Tyk’s cloud-native deployment and configuration makes that straightforward.”

How is CanDIG benefitting from using Tyk?

As an API gateway, Tyk routes requests to the correct services, but that isn’t all it does for CanDIG. Tyk also acts as the relying party for the OAuth2/OIDC protocol, thus serving as an authenticating reverse proxy. It checks authentication and identity tokens for CanDIG’s controlled-access endpoints before beginning a session and passing on requests.
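As a rough illustration of the checks a relying party applies when several identity providers are trusted at once, the sketch below validates the claims of a token payload whose signature is assumed to have been verified already. It is not Tyk’s implementation: the issuer URLs, the audience value, and the function name `check_claims` are all hypothetical.

```python
import time

# Hypothetical issuer URLs standing in for each site's identity provider.
TRUSTED_ISSUERS = {
    "https://idp.site-a.example/realms/candig",
    "https://idp.site-b.example/realms/candig",
}
# Hypothetical client ID the gateway expects in the "aud" claim.
EXPECTED_AUDIENCE = "candig-gateway"


def check_claims(claims, now=None):
    """Return (ok, reason) for an already signature-verified token payload."""
    now = time.time() if now is None else now

    # Accept tokens only from identity providers the site has chosen to trust.
    if claims.get("iss") not in TRUSTED_ISSUERS:
        return False, "untrusted issuer"

    # "aud" may be a single string or a list of strings.
    aud = claims.get("aud")
    audiences = aud if isinstance(aud, list) else [aud]
    if EXPECTED_AUDIENCE not in audiences:
        return False, "wrong audience"

    # Reject tokens with a missing or past expiry time.
    exp = claims.get("exp")
    if not isinstance(exp, (int, float)) or exp <= now:
        return False, "expired"

    return True, "ok"
```

Signature verification against each provider’s published keys is deliberately left out here; in a real deployment it has to happen before any of these claim checks are trusted.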

As CanDIG started implementing additional services, it needed a uniform way of authorizing requests being processed locally across an increasing number of APIs.  It developed an approach based on a rules-based policy engine, Open Policy Agent (OPA).

“As in many healthcare environments, when we consider data access and authorization decisions as they apply to research projects, the level of access varies across users, or in our case, researchers. The benefit, but also complexity, that is embedded within CanDIG is the autonomy granted to each site to make decisions on who is permitted to access what data. As a nation-wide platform, it is important for us to be able to uniformly enforce data access policies. Collecting entitlements and then evaluating specific requirements at the policy engine allows us to do just that.”

Samantha Palmer, Health Data Policy Specialist for CanDIG, at UHN.

But that policy engine required consistent delivery of local and platform-level authorization-relevant information with each request. Building on its experience with Tyk’s middleware, the CanDIG team further extended Tyk to perform claims marshalling when a session begins. The middleware looks up entitlement claims in Vault for local entitlement information and in REMS, the Data Access Committee portal tool, for platform-wide information, and passes them along with the request, where OPA can use them to make authorization decisions.
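The marshalling step can be pictured as merging two lookups into a single document that travels with the request. The sketch below is only an illustration under stated assumptions: the two dictionaries stand in for Vault and REMS, and the field names and the functions `marshal_entitlements` and `build_policy_input` are hypothetical, not CanDIG’s actual middleware.

```python
# Hypothetical in-memory stand-ins for the two entitlement sources.
LOCAL_ENTITLEMENTS = {"alice@site-a.example": ["site-a-cohort-1"]}      # role played by Vault
PLATFORM_ENTITLEMENTS = {"alice@site-a.example": ["national-study-x"]}  # role played by REMS


def marshal_entitlements(user_id, local=LOCAL_ENTITLEMENTS, platform=PLATFORM_ENTITLEMENTS):
    """Collect local and platform-wide entitlements for one user."""
    return {
        "user": user_id,
        "local_datasets": sorted(local.get(user_id, [])),
        "platform_datasets": sorted(platform.get(user_id, [])),
    }


def build_policy_input(method, path, user_id):
    """Bundle the request details and marshalled entitlements for a policy engine."""
    return {
        "method": method,
        "path": path,
        "entitlements": marshal_entitlements(user_id),
    }


if __name__ == "__main__":
    print(build_policy_input("GET", "/datasets/national-study-x", "alice@site-a.example"))
```

Keeping the lookups in the gateway means every downstream service receives the same entitlement document, whichever API the request is routed to.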

How is Tyk working with CanDIG?

CanDIG has interacted with Tyk’s support team in the past, and submitted pull requests for documentation, and the team looks forward to working with Tyk more closely.  “I’m really excited about Tyk’s future, and Tyk’s renewed commitment to its open-source customers”, said Jonathan Dursi, Staff Scientist II at UHN and lead for CanDIG.  “We’re moving towards an API-driven, API-first world, with technologies like GraphQL changing how developers interact with data, and with Tyk’s support of its open source gateway product and the user community, more teams are going to be able to move there faster.”

Where did the challenge lie in this particular use case?

CanDIG needed its platform to be fully distributed. That meant no central identity or central authorization authority. Authorization needed to be made locally, based on local policies and informed by platform-wide CanDIG services. Any site that needed to provide data had to be able to verify any such remotely provided information.

This made for a more complicated internal structure than that of similar sharing platforms. However, with the right tools in place (and yes, we mean Tyk!), the structure delivers far greater flexibility as a result of being designed in this way.

Not only that, but CanDIG also needed to address the question of how much data a user could see, as well as which datasets.

“Datasets that a user does not have row-level authorization for might still be queryable for aggregated results or for computations such as training models. In CanDIG, we have been building out infrastructure since the beginning of the project to authorize differentially-private aggregations to data to allow data custodians to make some datasets accessible for calculations without necessarily exposing the data directly to researchers”, explains CanDIG’s Amanjeev Sethi.
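To make the idea of a differentially-private aggregation concrete, here is a minimal sketch of the textbook Laplace mechanism applied to a count query. It is a generic illustration, not CanDIG’s implementation; the function name `private_count` and its parameters are hypothetical.

```python
import random


def private_count(records, predicate, epsilon, rng=None):
    """Return a noisy count of matching records using the Laplace mechanism.

    A count changes by at most 1 when one record is added or removed, so its
    sensitivity is 1 and the noise scale is 1 / epsilon.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    rng = rng or random.Random()

    true_count = sum(1 for record in records if predicate(record))

    # The difference of two independent exponential draws with rate epsilon
    # follows a Laplace distribution with mean 0 and scale 1 / epsilon.
    noise = rng.expovariate(epsilon) - rng.expovariate(epsilon)
    return true_count + noise


if __name__ == "__main__":
    ages = [34, 51, 67, 72, 45]
    print(private_count(ages, lambda age: age >= 50, epsilon=0.5))
```

Smaller values of `epsilon` add more noise and give stronger privacy; the researcher sees only the noisy aggregate, never the underlying rows.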

CanDIG – Canada’s Distributed Infrastructure for Genomics – is a decentralized federation across multiple healthcare and health research institutions across Canada.  CanDIG connects health research genomics data from cross-Canada projects, and allows researchers to access and explore national distributed consented health data sets.  Tyk’s open source gateway provides the entrypoint to the stack.

This diagram shows the CanDIGv2 AuthN/Z stack. CanDIG chose to rely on existing, well-tested open source packages (with commercial support available) to implement the stack, implementing only the pieces that CanDIG specifically needs.

The CanDIGv2 AuthN/Z stack relies on each site’s Keycloak instance to provide uniform OIDC/OAuth2 to the site’s existing identity management; Vault to securely store local entitlement information; Tyk to be the OAuth2 relying party and to marshal entitlements; and Open Policy Agent to evaluate a request against site-provided policies and the marshalled entitlements.
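The final step in this stack, evaluating a request against site policy and the marshalled entitlements, might look roughly like the sketch below. Real OPA policies are written in Rego; this Python stand-in only illustrates the shape of the decision. The input fields, the `aggregate_only` policy key, and the three outcomes are assumptions made for the example.

```python
def decide(policy_input, site_policy):
    """Return 'rows', 'aggregate', or 'deny' for the requested dataset.

    policy_input: the request's dataset plus the user's marshalled entitlements.
    site_policy:  site-provided rules; here just a list of datasets that may be
                  queried for aggregate results without row-level entitlement.
    """
    dataset = policy_input["dataset"]
    entitlements = policy_input.get("entitlements", {})

    # A user is entitled if either the local or the platform-wide source says so.
    entitled = set(entitlements.get("local_datasets", [])) | set(
        entitlements.get("platform_datasets", [])
    )
    if dataset in entitled:
        return "rows"

    # Without row-level entitlement, the site may still allow aggregate queries.
    if dataset in site_policy.get("aggregate_only", []):
        return "aggregate"

    return "deny"


if __name__ == "__main__":
    request = {
        "dataset": "site-a-cohort-2",
        "entitlements": {"local_datasets": ["site-a-cohort-1"], "platform_datasets": []},
    }
    print(decide(request, {"aggregate_only": ["site-a-cohort-2"]}))
```

Because the policy is supplied by the site, each institution keeps the final say over its own data even though the entitlement format is shared across the federation.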

What does the future hold for CanDIG and Tyk?

As the number of data services grows, and the queries performed across them become more sophisticated, CanDIG is planning to use GraphQL queries across services. “GraphQL is perfect for allowing complex queries, while not over-returning data unnecessarily,” said University of Toronto MScAC student Siyue Wang, who is prototyping machine language queries across multiple CanDIG data services. The team is examining Tyk’s Universal Data Graph as a promising way to expose those queries uniformly across APIs and sites.

In addition, the differentially-private federated learning methods that the team is examining will require remote procedure call support rather than ReST or GraphQL APIs. “These more tightly coupled calculations will require a different approach than the ReSTful approach CanDIG has relied on in the past,” said Rishabh Sambare, University of Waterloo B.CS. co-op student and part of the UHN CanDIG team. “We’ll need to move to remote procedure calls to support analysis of the data in this way”. Tyk’s gRPC proxying support would allow a single API gateway to support all three of these methods – essential for a small team.

Finally, the CanDIG team has already extended Tyk using its polyglot middleware support. The team has written its middleware plugins in JavaScript but is now looking at gRPC, which should offer better performance and allow middleware plugins to be written in any language.