Times Higher Education

Data pipeline using AWS

For me, it is not just about the cost, which everyone always focuses on; it is about getting the right expertise at the right time and level to support your team.

Freddie Quek, CTO

Outcomes Delivered

  • A streamlined process with vastly improved data traceability

  • A scalable, configurable pipeline that can be extended with additional source and target data stores

  • Improved data governance – a standardised model for ‘reference data’, acting as a formal, collaborative technical contract between data science and engineering

The Challenge

THE brought Tech Amigos (TA) on board to help standardise how certain data is published to their internal database portal. To begin with, TA have focussed on delivering a data pipeline solution for ‘reference data’, comprising tabular data for educational subjects and locations. Challenges included:

  • Lack of formalised release, error handling and data validation within the process.

  • Data route-to-live was handled at a very low level by various ad-hoc SQL and Jenkins jobs.

  • Data sets were cut at different times rather than promoted through environments.

  • Limited traceability of data.

THE highlighted that they wanted a streamlined process where they could trigger the update for a single data type, e.g. subjects, with a single action and minimal intervention. A key issue was that updating data for a single data type involved cross-communication between the data science and engineering teams in a manual, informal process.

Technologies

  • AWS Glue jobs with Step Functions for orchestration
  • Lambda functions
  • Terraform
  • PostgreSQL RDS

The Solution

Cloud Design for Data Pipelines

At the top level, the data pipeline is managed by triggering a state machine built using AWS Step Functions. This overarching workflow is symbolised by the red arrow in the diagram, which flows through the different cloud environments, each hosted in a separate AWS account.
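
As a rough illustration, a single-action trigger of this kind could be a short script that starts an execution of the state machine through the AWS SDK; the state machine ARN and input payload below are placeholders rather than THE’s actual configuration.

    import json
    import uuid

    import boto3

    # Placeholder ARN; the real pipeline's state machine and input schema may differ.
    STATE_MACHINE_ARN = "arn:aws:states:eu-west-1:111111111111:stateMachine:reference-data-pipeline"

    sfn = boto3.client("stepfunctions")

    def trigger_pipeline(data_type: str) -> str:
        """Start one pipeline run for a single data type, e.g. 'subjects'."""
        response = sfn.start_execution(
            stateMachineArn=STATE_MACHINE_ARN,
            # Execution names must be unique, so append a random suffix.
            name=f"{data_type}-refresh-{uuid.uuid4()}",
            input=json.dumps({"dataType": data_type}),
        )
        return response["executionArn"]

    if __name__ == "__main__":
        print(trigger_pipeline("subjects"))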

Here’s a simplified overview of the process:

Development Environment:

Data is sourced from PostgreSQL by an AWS Glue crawler, which catalogues it in the AWS Glue Data Catalog.

Glue jobs then shift the catalogued data to an S3 bucket and migrate it to Staging.
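
A minimal sketch of a Glue job of this kind, written as a PySpark Glue script, is shown below; the catalog database, table and bucket names are placeholders and not THE’s actual configuration.

    import sys

    from awsglue.context import GlueContext
    from awsglue.job import Job
    from awsglue.utils import getResolvedOptions
    from pyspark.context import SparkContext

    args = getResolvedOptions(sys.argv, ["JOB_NAME"])
    glue_context = GlueContext(SparkContext())
    job = Job(glue_context)
    job.init(args["JOB_NAME"], args)

    # Read the table that the crawler catalogued from the PostgreSQL source.
    source = glue_context.create_dynamic_frame.from_catalog(
        database="reference_data_dev",   # placeholder catalog database
        table_name="subjects",           # placeholder catalogued table
    )

    # Write it to an S3 bucket from which it can be promoted to Staging.
    glue_context.write_dynamic_frame.from_options(
        frame=source,
        connection_type="s3",
        connection_options={"path": "s3://reference-data-handover/subjects/"},
        format="parquet",
    )

    job.commit()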

Staging Environment:

Data is sourced from S3 by an AWS Glue crawler, which catalogues it in the AWS Glue Data Catalog.

A Glue job shifts the catalogued data to an AWS RDS (Relational Database Service) PostgreSQL instance.
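
The load into PostgreSQL could look roughly like the sketch below, which follows the same pattern as the earlier Glue snippet; the Glue connection, database and table names are placeholders, and it assumes a Glue connection to the RDS instance has already been defined.

    import sys

    from awsglue.context import GlueContext
    from awsglue.job import Job
    from awsglue.utils import getResolvedOptions
    from pyspark.context import SparkContext

    args = getResolvedOptions(sys.argv, ["JOB_NAME"])
    glue_context = GlueContext(SparkContext())
    job = Job(glue_context)
    job.init(args["JOB_NAME"], args)

    # Read the S3 data that the Staging crawler catalogued.
    staged = glue_context.create_dynamic_frame.from_catalog(
        database="reference_data_staging",   # placeholder catalog database
        table_name="subjects",               # placeholder catalogued table
    )

    # Load it into the Staging PostgreSQL database via a pre-defined Glue connection.
    glue_context.write_dynamic_frame.from_jdbc_conf(
        frame=staged,
        catalog_connection="staging-postgres",   # placeholder Glue connection name
        connection_options={"database": "reference_data", "dbtable": "public.subjects"},
    )

    job.commit()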

At this point, an AWS Lambda function is triggered, which sends an email notification to subscribed users that the data is available in the Staging database for review, with a prompt for approval.

The Staging process is then replicated almost exactly for the Production environment, as per the diagram. The process is finalised by an email notification that the data is available in Production.
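
For illustration, the notification step could be implemented as a small Lambda function that publishes to an SNS topic whose email subscribers receive the message; the topic ARN and message wording below are hypothetical, and the approval mechanism itself may be handled differently (for example, via a Step Functions task token).

    import os

    import boto3

    sns = boto3.client("sns")

    # Hypothetical topic; subscribed users receive the email notification.
    TOPIC_ARN = os.environ.get(
        "NOTIFICATION_TOPIC_ARN",
        "arn:aws:sns:eu-west-1:111111111111:reference-data-approvals",
    )

    def handler(event, context):
        """Notify subscribers that a data set is ready for review in Staging."""
        data_type = event.get("dataType", "reference data")
        sns.publish(
            TopicArn=TOPIC_ARN,
            Subject=f"{data_type} is ready for review in Staging",
            Message=(
                f"The latest '{data_type}' data has been loaded into the Staging "
                "database. Please review it and approve promotion to Production."
            ),
        )
        return {"notified": True, "dataType": data_type}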

Pipeline Orchestration

TA have used Terraform, an open-source Infrastructure as Code tool that can deploy resources to different cloud environments from coded blueprints. This gave TA the flexibility to easily provision all the necessary AWS components, including the AWS IAM (Identity and Access Management) roles and policies that support cross-account access for the pipeline. It also made it easy to test and experiment with the pipeline in sandbox accounts before applying the same Terraform configuration to the client’s AWS accounts.
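
As a rough illustration of the cross-account access those IAM roles enable, a pipeline component in one account would typically assume a role in the target account via AWS STS; the role ARN and crawler name below are placeholders, not THE’s actual setup.

    import boto3

    # Placeholder role in the target (e.g. Staging) account; the role and its
    # trust policy would be provisioned by Terraform alongside the pipeline.
    TARGET_ROLE_ARN = "arn:aws:iam::222222222222:role/reference-data-pipeline-access"

    def client_in_target_account(service: str):
        """Return a boto3 client that acts in the target account by assuming
        the cross-account role provisioned for the pipeline."""
        sts = boto3.client("sts")
        creds = sts.assume_role(
            RoleArn=TARGET_ROLE_ARN,
            RoleSessionName="reference-data-pipeline",
        )["Credentials"]
        return boto3.client(
            service,
            aws_access_key_id=creds["AccessKeyId"],
            aws_secret_access_key=creds["SecretAccessKey"],
            aws_session_token=creds["SessionToken"],
        )

    # Example usage: start a crawler in the Staging account from another account.
    # client_in_target_account("glue").start_crawler(Name="staging-reference-data-crawler")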

     

THE provides trusted performance data on universities across the globe and ranks them in a league table.