Genomics Analysis Service


🎯 Overview

A fully operational software-as-a-service (SaaS) for genomics analysis.

Tech Stack

Diagram

🔨 Building Process

1️⃣ Building the Annotator API

Objectives

Build an API for accessing a cloud-hosted computing service (AWS).

What is annotation?

Annotation analyzes a sequenced genome sample to pinpoint the locations of genes and coding regions within the genome. Its primary objective is to determine the functions and characteristics of these genes, providing insight into their roles and activities.

The science behind genome annotation

Link

AnnTools

I use AnnTools (specifically, this modified version), one of the many available open-source annotation tools, to perform the analysis.

Flask framework

An application server that can accept HTTP requests and serve responses. I use the Flask Python web microframework to implement the annotator API.

REST APIs (Sample endpoints)

Implement a service with an API that allows users to run annotation jobs and check on their status.

  • GET

    • Endpoint: http://{DOMAIN}/annotations/{job_id}

    • Response data:
      • code (HTTP response code, integer) and status (“success” or “error”)
      • job_id (UUID, string)
      • job_status (string)
      • log (contents of log file for completed job, string)

    • Endpoint: http://{DOMAIN}/annotations

    • Response data:
      • code (HTTP response code, integer) and status (“success” or “error”)
      • jobs (list of jobs)
        • job_id (UUID, string)
        • job_details (URL to get job status via endpoint above)
  • POST
    • Endpoint: http://{DOMAIN}/annotations

    • Request body: input_file (string)
    • Response data:
      • code (HTTP response code, integer) and status (“success” or “error”)
      • job_id (UUID, string)
      • input_file (string)
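
The response shapes listed above can be sketched as plain Python functions. This is a minimal illustration rather than the service's actual code: the in-memory `JOBS` store, the function names, the HTTP codes 201 and 404, the initial `RUNNING` status, and the `message` field on errors are all assumptions made for the example. In the real service, each function would sit behind a Flask route.

```python
import uuid

# Hypothetical in-memory job store; the real service persists job data elsewhere.
JOBS = {}


def submit_job(input_file):
    """POST /annotations: register a job and return its ID."""
    job_id = str(uuid.uuid4())
    JOBS[job_id] = {"input_file": input_file, "job_status": "RUNNING", "log": ""}
    return {"code": 201, "status": "success",
            "job_id": job_id, "input_file": input_file}


def get_job(job_id):
    """GET /annotations/{job_id}: report the job status and its log."""
    job = JOBS.get(job_id)
    if job is None:
        # The error payload shape is an assumption for this sketch.
        return {"code": 404, "status": "error", "message": "job not found"}
    return {"code": 200, "status": "success", "job_id": job_id,
            "job_status": job["job_status"], "log": job["log"]}


def list_jobs(base_url):
    """GET /annotations: list jobs with a link to each job's detail endpoint."""
    jobs = [{"job_id": jid, "job_details": f"{base_url}/annotations/{jid}"}
            for jid in JOBS]
    return {"code": 200, "status": "success", "jobs": jobs}
```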
Running Environment

Both AnnTools and the API server run on AWS EC2 instances.

Test the API

Utilize Postman to submit request data and view responses.

2️⃣ Uploading Data to Object Storage: Amazon S3

Objectives

To become familiar with object storage and authenticated calls (signed requests) to the cloud API.

Approach

Utilize 2 Amazon S3 buckets to store user input files and annotated result files.

Working with large files

S3 provides a way for developers to upload large files directly from a web browser. It requires that we create a signed request and POST that request to the URL of our S3 bucket. The S3 service then takes over and uploads the file directly via the browser, bypassing our app server.

Protect your credentials
Signed requests

AWS supports two types of signatures for signed requests. When you refer to documentation on working with signed requests, make sure it pertains to Version 4 signatures. To learn more about signed requests, see the AWS documentation.

Instance profile

Launch instances with a special type of permission attached, and boto3 will automatically find and use the credentials on the instance. This is a more secure approach because no long-lived keys are stored on the instance, and AWS automatically refreshes (i.e. expires and renews) the temporary keys as needed. To enable this, I attach an instance profile when launching each instance; the profile includes permissions in the form of an instance role that allows applications running on the instance to access other AWS resources on my behalf.

Add new endpoints (I will not specify all the endpoints in future steps)
  • POST
    • Endpoint: http://{DOMAIN}/annotate

    • Purpose: User will navigate to this endpoint, browse and select a file, and click the button to upload it to the S3 bucket.
    • Important: The policy document of the signed request must contain an expiration timestamp. This is a critical part of a signed request: the upload must be completed before the specified time; otherwise the request is rendered invalid. Furthermore, the policy document must, at minimum, contain a condition for the ACL, which is critical for security.
  • GET
    • Endpoint: http://{DOMAIN}/annotate/files

    • Purpose: Gets a list of objects from the S3 input bucket and returns them in a JSON object.
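
A signed POST for browser uploads can be sketched with boto3's `generate_presigned_post`. The helper name, the `private` ACL, the key prefix layout, and the 300-second default expiry are assumptions for illustration; `s3_client` is expected to be a boto3 S3 client, for example `boto3.client("s3")`.

```python
def build_upload_form(s3_client, bucket, key_prefix, redirect_url, expires_in=300):
    """Return the URL and form fields for a browser-based S3 upload."""
    # Each field sent by the browser must be matched by a policy condition.
    fields = {"acl": "private", "success_action_redirect": redirect_url}
    conditions = [{"acl": "private"}, {"success_action_redirect": redirect_url}]
    return s3_client.generate_presigned_post(
        Bucket=bucket,
        # S3 substitutes ${filename} with the name of the uploaded file.
        Key=key_prefix + "${filename}",
        Fields=fields,
        Conditions=conditions,
        ExpiresIn=expires_in,  # seconds until the signed policy expires
    )
```

The returned dictionary has a `url` and a `fields` mapping. The upload page would render each entry of `fields` as a hidden form input and POST the form, together with the file, to `url`.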

3️⃣ Application Decoupling - Front/Back-End Separation

Objectives

Build out the distributed infrastructure for this genome annotation service.

Diagram
Approach

Utilize two EC2 instances: one running the web application server and the other running the annotation server.

  • Separate the code that serves the user interface from the code that implements the API and runs AnnTools.
  • annotator.py: it downloads the input file from S3 and saves it to the AnnTools instance (into a file structure that ensures uniqueness for multiple jobs running simultaneously).
  • run.py: after completion, it copies the annotated results file and the associated log from the local volume on your annotator instance to the S3 results bucket. After copying, the local files on the AnnTools instance must be deleted (otherwise the EBS volume attached to your instance would eventually run out of space).
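
The per-job file layout and the cleanup step can be sketched as follows. The function names and the directory scheme (one directory per job ID) are assumptions for illustration; `s3_client` is expected to be a boto3 S3 client, whose `upload_file(Filename, Bucket, Key)` method is used here.

```python
import os
import shutil


def job_directory(base_dir, job_id):
    """Create and return a working directory that is unique to one job."""
    path = os.path.join(base_dir, job_id)
    os.makedirs(path, exist_ok=True)
    return path


def publish_results(s3_client, job_dir, filenames, results_bucket, key_prefix):
    """Copy the named files from job_dir to S3, then delete the local job files."""
    uploaded = []
    for name in filenames:
        key = key_prefix + name
        s3_client.upload_file(os.path.join(job_dir, name), results_bucket, key)
        uploaded.append(key)
    # Reached only if every upload succeeded; frees space on the local volume.
    shutil.rmtree(job_dir)
    return uploaded
```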
The application flow till now
  • The user requests an annotation by uploading a VCF file.
  • After the input file is uploaded to S3, the response is redirected to the annotator instance (and the flow continues there instead of on our web server).
  • The annotator then downloads the input file from the input S3 bucket, spawns the annotation job and returns the job_id and file name to the user.
  • Once the annotation is complete, the results file and the log file are moved to the S3 results bucket.

4️⃣ Persisting Data in a Key-Value Store

Objectives

Add persistence that enables distributed services to operate with greater durability.

Diagram
Approach

We are adding a key-value store (KVS) to persist annotation job information and allow both services to access/update a job as its status changes. We will use DynamoDB for this purpose. We’re using a KVS because we assert that annotation data will be irregular and hence better suited to a schema-less database.

The application flow till now
  • User selects an input file which is uploaded to S3 via a signed POST request.
  • S3 sends a request to the redirect URL which must now be handled by our web app, and not by the annotator.
  • The web app creates an item in the database; sets status to “PENDING”.
  • The web app POSTs a request to the annotator.
  • The annotator downloads the input file from S3.
  • The annotator spawns the annotation and updates the job’s status to “RUNNING”.
  • The annotator copies the result and log files to S3.
  • The annotator again updates the job’s status in the database to “COMPLETED”.
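
The status transitions in this flow can be sketched as a conditional DynamoDB update. The table name, the `job_id` key, and the `job_status` attribute are assumptions for illustration, and the condition expression is an added safeguard that is not described above: it makes the update fail if the job is not in the expected state.

```python
def status_update_kwargs(table_name, job_id, expected_status, new_status):
    """Build arguments for a DynamoDB update_item call that changes a job's status.

    The update applies only if the current status equals expected_status.
    """
    return {
        "TableName": table_name,
        "Key": {"job_id": {"S": job_id}},
        "UpdateExpression": "SET job_status = :new",
        "ConditionExpression": "job_status = :expected",
        "ExpressionAttributeValues": {
            ":new": {"S": new_status},
            ":expected": {"S": expected_status},
        },
    }
```

With a boto3 DynamoDB client, the annotator could call `client.update_item(**status_update_kwargs("annotations", job_id, "PENDING", "RUNNING"))` before spawning the job, and repeat the call with `"RUNNING"` and `"COMPLETED"` when the job finishes. The table name `annotations` is hypothetical.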

5️⃣ Application Decoupling Using Message Queues

Objectives

Use inter-process communication services to allow services to operate asynchronously (thereby increasing system availability) and scale independently.

Diagram
Approach

In step 4️⃣, we added a key-value store as the persistence layer for the service. This allowed the web server and the annotation server to operate independently of each other. Now we will add components that further decouple the services and allow each of them to scale independently of the other.

Message Queue: AWS SQS

A message queue that will act as a “buffer” between the web app and the annotator. New annotation requests will be posted to the message queue and the annotator will retrieve them independently. Thus, requests can be successfully accepted by this service, even if the annotator service is not available—an important requirement for increasing the availability of a distributed system.

Notification Topic: AWS SNS

A notification topic that accepts messages (i.e. annotation requests) from the web app. When a notification is sent to the topic, a message will be created in the message queue.

The application flow till now
  • User selects an input file which is uploaded to S3 via a signed POST request.
  • S3 sends a request to the redirect URL in our web app.
  • The web app posts a notification message containing the request data to the SNS topic.
  • The SQS queue receives the notification and persists a message containing the request.
  • The annotator reads the message from the SQS queue.
  • The annotator extracts the input file name from the message and downloads it from S3.
  • The annotator updates the job’s status in the database and spawns the annotation.
  • The annotator copies the result and log files to S3.
  • The annotator again updates the job’s status in the database.
Important

In the current approach, a Python script uses long polling to read messages from the queue. In a later step, we will replace polling with a webhook that is notified when a message arrives.
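
One long-polling iteration can be sketched as below. The function name and the `handle_job` callback are assumptions for illustration; `sqs_client` is expected to be a boto3 SQS client. The sketch also assumes that raw message delivery is disabled on the SNS subscription, so each SQS message body is an SNS envelope whose `Message` field holds the JSON published by the web app.

```python
import json


def poll_once(sqs_client, queue_url, handle_job, wait_seconds=20):
    """Long-poll the queue once and process any job requests received."""
    response = sqs_client.receive_message(
        QueueUrl=queue_url,
        MaxNumberOfMessages=10,
        WaitTimeSeconds=wait_seconds,  # long polling: block until messages arrive
    )
    processed = 0
    for message in response.get("Messages", []):
        # Unwrap the SNS envelope to get the original request data.
        envelope = json.loads(message["Body"])
        job = json.loads(envelope["Message"])
        handle_job(job)
        # Delete only after handling, so a failed job is redelivered later.
        sqs_client.delete_message(
            QueueUrl=queue_url, ReceiptHandle=message["ReceiptHandle"]
        )
        processed += 1
    return processed
```

The annotator script would call `poll_once` in a `while True` loop.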

6️⃣ Integrate Third-party Services and other system components

Objectives

Make the system more functionally complete, integrate with an external cloud service for payments processing (Stripe), and enable automated scaling. Furthermore, apply asynchronous communication to another part of the application by implementing a notifications service in upcoming steps.

Add Key functions
  • Log in (via Globus Auth) to use this service: Some aspects of the service are available only to registered users. Two classes of users will be supported: Free and Premium. Premium users will have access to additional functionality, beyond that available to Free users.
  • Submit an annotation job: Free users may only submit jobs of up to a certain size. Premium users may submit any size job. If a Free user submits an oversized job, the system will refuse it and will prompt the user to convert to a Premium user.
  • Upgrade from a Free to a Premium user: Premium users will be required to provide a credit card for payment of the service subscription. This service will integrate with Stripe for credit card payment processing.
  • Receive notifications when annotation jobs finish: When their annotation request is complete, this service will send users an email that includes a link where they can view the log file and download the results file.
  • Browse jobs and download annotation results: This service will store annotation results for later retrieval. Users may view a list of their jobs (completed and running), and the log file for completed jobs.
  • Restrict data access for Free users: Free users may download their results file for a limited time after their job completes; thereafter the results file is archived, and only available to them if they convert to a Premium user. Premium users will always have all their data available for download.
Add and migrate system components
  • Globus Auth: Globus Auth is an identity and access management service that allows users to access our application using an existing identity from thousands of identity providers, including UChicago.
  • uWSGI: Replace the default Flask WSGI server with uWSGI, a well-tested, multithreaded WSGI server suitable for production use. The application now listens on port 4433 instead of 5000.
  • HTTPS: The web server will now respond only to HTTPS requests.
  • AWS S3 Glacier: A low cost, highly-durable object store for archiving the data of Free users.
  • PostgreSQL: A relational database for user account information.
  • Bootstrap: Styled web pages for the web app.

7️⃣ Upgrading UX/UI, Implementing Webhooks & User Notifications

Objectives

Enhance UX/UI, implement a more robust approach for processing messages from a queue and apply asynchronous communication to the application by implementing a notifications service.

Demo Page
Approach
User Interface
  • Add a route handler that displays a list of all the jobs submitted by a user.
  • Add a route handler that displays the details of a requested job. The page should be rendered in response to a request.
  • Provide links for users to download the results file and view the log file for a job.
Implement webhooks
  • Why?
  • Requiring the annotator to continuously poll the job requests queue is not a scalable application design. A better way is to use a webhook: an HTTP endpoint that is called by the producer when an event of interest to the consumer occurs (in our case, when a new job is available). Fortunately, SNS allows us to send a notification to an HTTP endpoint (in addition to SQS queues and other subscribers). Check out the AWS documentation.

  • How?
  • Convert annotator.py to a Flask app with a route handler that acts as a webhook. The annotator will no longer continuously poll for job request messages; instead, it will do its work only when it receives a notification from SNS that a job was added to the job request queue. This requires adding the webhook as a subscriber to the SNS job request topic. When a new job is submitted, SNS will push the notification as a POST request to the webhook, triggering the annotation. Check out the difference between data polling and a webhook.
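
The core of such a webhook can be sketched as a function that dispatches on the SNS message type. The function name and the two callbacks are assumptions for illustration, and signature verification of incoming messages is omitted. In this simplified version the job data is taken from the notification itself; the handler could instead read the pending message from the SQS queue.

```python
import json


def handle_sns_request(body, confirm_subscription, start_annotation):
    """Dispatch the body of an SNS HTTP(S) POST by its message type.

    Returns the HTTP status code the route handler should send back.
    """
    payload = json.loads(body)
    message_type = payload.get("Type")
    if message_type == "SubscriptionConfirmation":
        # Sent once when the endpoint is subscribed to the topic; the
        # subscription becomes active after SubscribeURL is visited.
        confirm_subscription(payload["SubscribeURL"])
        return 200
    if message_type == "Notification":
        # "Message" carries the JSON string published to the topic.
        start_annotation(json.loads(payload["Message"]))
        return 200
    return 400
```

In a Flask route, `body` would come from `request.get_data()`. SNS posts its JSON with a `text/plain` content type, so `request.get_json()` would need `force=True` to parse it.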

User Notifications

When a job completes, the annotator publishes a notification to an SNS topic (with a subscribed SQS queue, as we have for job requests), and a separate Python script sends emails to users. This way, the annotator can continue to process jobs while notifications for completed jobs are sent out-of-band. To send email, we use the AWS SES service.
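
The email itself can be sketched as a helper that builds the arguments for the SES `send_email` call. The function name, subject line, and body wording are assumptions for illustration.

```python
def completion_email(sender, recipient, job_id, details_url):
    """Build the arguments for an SES send_email call announcing a finished job."""
    return {
        "Source": sender,
        "Destination": {"ToAddresses": [recipient]},
        "Message": {
            "Subject": {"Data": f"Annotation job {job_id} is complete"},
            "Body": {
                "Text": {
                    "Data": "Your annotation job has finished. View the log "
                            f"and download the results here: {details_url}"
                }
            },
        },
    }
```

The notification script would pass the result to a boto3 SES client, for example `ses_client.send_email(**completion_email(...))`, after reading a job-completed message from its queue.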

8️⃣ Data Archival: AWS S3 Glacier

Objectives

Understand the complexities of implementing scheduled/background tasks in distributed systems.

Approach

The policy for this service is that Free users can only download their results file for up to 5 minutes after completion of their annotation job. After 5 minutes elapse, a free user’s results file (not the log file) will be archived to a Glacier vault. This allows the service to retain user data at relatively low cost, and to restore it in the event that the user decides to upgrade to a Premium user.

Integrating AWS Step Functions and AWS Lambda
  • I use a Step Functions state machine that waits for 5 minutes after the file is annotated and then triggers a Lambda function.
  • The Step Functions workflow is defined in step_function.json.
  • After the 5 minutes elapse, the Lambda function publishes a message to an SNS topic.
  • Then the /archive endpoint will accept the POST requests from my SNS topic and then retrieve and process an archival message from my message queue.
  • Then, the logic in archive_app.py will determine if the user is a premium user or not.
  • If the user is a premium user, the file will not be archived.
  • Otherwise, the file will be archived.
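
A wait-then-notify workflow of this kind can be expressed in the Amazon States Language with a `Wait` state followed by a `Task` state. The sketch below is an illustration only and is not the contents of the project's step_function.json; the state names and the function name are hypothetical, and the Lambda ARN must be supplied by the caller.

```python
import json


def archive_state_machine(lambda_arn, wait_seconds=300):
    """Return a state machine definition: wait, then invoke a Lambda function."""
    return {
        "Comment": "Wait after a job completes, then trigger the archival message",
        "StartAt": "WaitBeforeArchive",
        "States": {
            "WaitBeforeArchive": {
                "Type": "Wait",
                "Seconds": wait_seconds,  # 300 s = the 5-minute free download window
                "Next": "PublishArchiveMessage",
            },
            "PublishArchiveMessage": {
                "Type": "Task",
                "Resource": lambda_arn,  # the Lambda function that publishes to SNS
                "End": True,
            },
        },
    }


def to_definition_json(definition):
    """Serialize the definition to the JSON text that Step Functions expects."""
    return json.dumps(definition, indent=2)
```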
My rationale of this approach
  • Using Lambda rather than EC2
    • With Lambda, I only pay for the execution time of my function. In this case, the function is triggered only after the 5-minute wait state has elapsed, and all it does is publish a message to the SNS topic. Therefore, I am billed only for the short execution time of the Lambda function, not for the 5-minute wait.
    • If I use an EC2 instance or any other compute service to perform the wait state, I would need to pay for the entire duration of the wait time, which could be costly depending on the duration of the wait state. Additionally, I would need to manage and maintain the compute infrastructure, which requires additional resources and effort.
    • Another benefit of using Lambda is scalability. AWS Lambda automatically scales my function to handle incoming requests, so I don't need to worry about provisioning and scaling infrastructure. This allows me to handle sudden spikes in traffic or increased workload without affecting the performance of my application.
  • Using Step Functions to trigger the Lambda function
    • The main purpose is workflow coordination: using Step Functions to trigger a Lambda function allows me to create a coordinated workflow that involves multiple steps. In this case, I can use Step Functions to orchestrate the entire process of archiving the file, including the wait state, sending messages to SNS, and triggering the archive process. This ensures that the steps are executed in the correct order.
  • Using SNS to notify the /archive endpoint
    • Using an SNS topic to notify the /archive endpoint to run the archive_free_user_data function provides a loosely coupled architecture, which is a key principle for scalability. Decoupling the archiving process from the annotation process ensures that changes or scaling issues in one process do not affect the other.
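
The archival decision can be sketched as below. This is a guess at the shape of archive_free_user_data, not its actual code: the parameter list and the `job` dictionary keys are hypothetical. `s3_client` and `glacier_client` are expected to be boto3 clients for S3 and Glacier.

```python
def archive_free_user_data(job, is_premium, s3_client, glacier_client, vault_name):
    """Move a job's results file from S3 to Glacier unless the user is Premium.

    Returns the Glacier archive ID, or None if nothing was archived.
    """
    if is_premium:
        return None  # Premium users keep their results in S3.
    bucket, key = job["s3_results_bucket"], job["s3_key_result_file"]
    obj = s3_client.get_object(Bucket=bucket, Key=key)
    archive = glacier_client.upload_archive(
        vaultName=vault_name, body=obj["Body"].read()
    )
    # Remove the S3 copy only after the archive upload has succeeded.
    s3_client.delete_object(Bucket=bucket, Key=key)
    return archive["archiveId"]
```

The caller would store the returned archive ID with the job record so that the file can be restored later.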

9️⃣ Subscription Upgrade (Stripe Integration) & Data Restoration

Objectives

Explore the requirements of integrating with a third-party SaaS system, understand the complexities of working with cloud archival systems, and experiment with serverless computing.

Approach
Stripe API

I integrate the Stripe service to manage all subscription and billing functions for this service. One of the main reasons is that Stripe has some of the best API documentation of any SaaS.

Thawing / Restoring Workflow
  • Thaw Webhook: I configure a thaw webhook to receive a request for restoring data from Glacier. This webhook will initiate the job to retrieve the archive from Glacier.
  • Retrieval Job: Once the thaw webhook receives the request, it triggers a process to initiate a retrieval job from Glacier using the archive ID. This job retrieves the archived data from Glacier.
  • SNS Notification: When the retrieval job is successfully completed, Glacier sends a notification to an SNS topic.
  • Lambda Function: Set up a Lambda function that is triggered by the SNS topic when it receives the success message.
  • Lambda Execution: In the Lambda function, parse the message received from the SNS topic to extract relevant information such as the archive ID, job ID, and S3 key file path.
  • Retrieve Output Job: Use the job ID obtained from the message to fetch the output of the retrieval job. This output contains the restored data.
  • Store Data in S3: Upload the restored data to the designated S3 result bucket using the S3 key file path.
  • Delete Archive: Delete the archive from Glacier to clean up the resources and prevent unnecessary storage costs.
  • DynamoDB Update: Remove the attribute named "results_file_archive_id" from the corresponding DynamoDB entry to reflect the completion of the restoration process.
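
The Lambda side of this workflow can be sketched as below. Several details are assumptions for illustration: the function name and parameters are hypothetical, the thaw webhook is assumed to have stored the job ID and S3 key as JSON in the retrieval job's description, and the Glacier notification is assumed to carry `StatusCode`, `JobId`, `ArchiveId`, and `JobDescription` fields. The three clients are expected to be boto3 clients for Glacier, S3, and DynamoDB.

```python
import json


def restore_from_glacier(event, glacier_client, s3_client, dynamodb_client,
                         vault_name, results_bucket, table_name):
    """Copy a thawed archive back to S3 and clear its archive ID in DynamoDB."""
    # Lambda receives the SNS notification as a JSON string inside the event.
    message = json.loads(event["Records"][0]["Sns"]["Message"])
    if message.get("StatusCode") != "Succeeded":
        return None
    # Assumption: the job description holds {"job_id": ..., "s3_key": ...}.
    details = json.loads(message["JobDescription"])
    output = glacier_client.get_job_output(
        vaultName=vault_name, jobId=message["JobId"]
    )
    s3_client.put_object(
        Bucket=results_bucket, Key=details["s3_key"], Body=output["body"].read()
    )
    glacier_client.delete_archive(
        vaultName=vault_name, archiveId=message["ArchiveId"]
    )
    dynamodb_client.update_item(
        TableName=table_name,
        Key={"job_id": {"S": details["job_id"]}},
        UpdateExpression="REMOVE results_file_archive_id",
    )
    return details["s3_key"]
```

A Lambda handler would create the clients once at module level and call this function with the incoming `event`.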

🔟 Scaling the Web Server: ELB & Auto Scaling

Objectives

Experiment with automated provisioning and elasticity in a cloud computing environment.

Diagram
Services
  • Elastic Load Balancer (ELB): This service allows HTTP(S) requests to be distributed among multiple instances in an Auto Scaling group. Check out the AWS documentation.
  • EC2 Auto Scaling: This service allows us to define standard configuration templates and use them to launch multiple instances as needed, based on user-definable rules. Check out the AWS documentation.
Approach
  • Create a load balancer and associated target group: An EC2 load balancer will receive all requests to this service and distribute them among multiple EC2 instances running the web server.
  • Create a Launch Template: A launch template provides a means to save the standard EC2 instance launch procedure so that it can be used by the auto scaler when launching new instances.
  • Create an Auto Scaling Group: Auto Scaling groups encapsulate the rules that determine how our application scales out (when demand increases) and in (when demand drops).

🆚 Different Approach

Webhook vs. Data Polling

  • Polling: After sending the payment request to the Payment Service Provider (PSP), the payment service keeps asking the PSP about the payment status. After several rounds, the PSP finally returns the status.
    • Drawbacks:
      • Constant polling of the status requires resources from the payment service.
      • The external service communicates directly with the payment service, creating security vulnerabilities.
  • Webhooks: We can register a webhook with the external service. It means: call me back at a certain URL when you have updates on the request. When the PSP has completed the processing, it will invoke the HTTP request to update the payment status. In this way, the programming paradigm is changed, and the payment service doesn’t need to waste resources to poll the payment status anymore.

📝 Key Features

✅ Delivered complete and detailed annotation results through an easy-to-use panel.

✅ Optimized scaling and load balancing for high-throughput genomics annotation tasks.

✅ Offered subscription tiers with varying features and performance.

💻 Source Code

Requires GitLab permission:

⚠️ Disclaimer

This is a UChicago MPCS Cloud Computing Capstone Project. © 2023 Vas Vasiliadis, All rights reserved.