Build an API for accessing a cloud-hosted computing service (AWS).
What is annotation?
Annotation is a crucial procedure that involves the analysis of a
sequenced genome sample to pinpoint the precise locations of genes
and coding regions within the genome. Its primary objective is to
unravel the functions and characteristics of these genes,
providing invaluable insights into their roles and activities.
Utilize
Postman to
submit request data and view responses.
2️⃣ Uploading Data to Object Storage: Amazon S3
Objectives
To become familiar with object storage and authenticated calls
(signed requests) to the cloud API.
Approach
Utilize two Amazon S3 buckets: one for user input files and one for
annotated result files.
Working with large files
S3 provides a way for developers to upload large files directly
from a web browser. We create a signed request and POST it to the
URL of our S3 bucket; S3 then accepts the file directly from the
browser, bypassing our app server.
Protect your credentials
Signed requests
For signed requests, AWS supports two types of signatures. When
you refer to documentation on working with signed requests, make
sure it pertains to Version 4 signatures. To learn more about
signed requests, see the
AWS documentation here.
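As a small illustration, boto3 can be pinned to Signature Version 4 explicitly when creating the S3 client (a minimal sketch; the region is an assumption, and newer SDK versions may already default to SigV4):

```python
# Sketch: force boto3 to sign S3 requests with Signature Version 4
import boto3
from botocore.client import Config

s3 = boto3.client('s3', region_name='us-east-1', config=Config(signature_version='s3v4'))
```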
Instance profile
Launch instances with a special type of permission attached, and
boto3 will automagically find and use your credentials on the
instance. This is a more secure approach, because keys are not
physically stored on the instance and AWS automatically refreshes
(i.e. expires and renews) these temporary keys as needed. In order
to enable this, I launch instances and attach an
instance profile
to the instance; this profile includes permissions in the form of
an instance role that will allow applications running on the
instance to access other AWS resources on your behalf.
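As a hedged illustration, code running on such an instance can create clients without any explicit keys; boto3 resolves temporary credentials from the instance metadata service:

```python
# No access keys in code or config files: boto3 falls back to the
# credentials provided by the attached instance profile
import boto3

s3 = boto3.client('s3')                                  # credentials come from the instance role
print([b['Name'] for b in s3.list_buckets()['Buckets']])  # authorized via that role
```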
Add new endpoints (I will not specify all the endpoints in future
steps)
POST
Endpoint: http://{DOMAIN}/annotate
Purpose:
The user navigates to this endpoint, browses and selects a file,
and clicks the button to upload it to the S3 bucket.
Important:
Must contain an expiration timestamp. This is a critical part
of a signed request: the request must be completed before the
specified time otherwise it is rendered invalid. Furthermore,
the policy document must, at minimum, contain conditions for
the ACL, which is critical for security.
GET
Endpoint: http://{DOMAIN}/annotate/files
Purpose:
Gets a list of objects from your S3 input bucket and returns
them in a JSON object:
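Below is a minimal Flask sketch of both handlers. The bucket name, key convention, template name, and response shape are illustrative assumptions, not the project's exact code:

```python
import uuid
import boto3
from botocore.client import Config
from flask import Flask, jsonify, render_template

app = Flask(__name__)
s3 = boto3.client('s3', config=Config(signature_version='s3v4'))

@app.route('/annotate', methods=['GET'])
def annotate():
    # Signed POST policy: expires in 5 minutes and constrains the object ACL
    key = f"userX/{uuid.uuid4()}~input.vcf"            # assumed key convention
    presigned = s3.generate_presigned_post(
        Bucket='gas-inputs',                           # assumed input bucket name
        Key=key,
        Fields={'acl': 'private'},
        Conditions=[{'acl': 'private'}],
        ExpiresIn=300,
    )
    # The form in annotate.html posts the file (plus presigned['fields']) to presigned['url']
    return render_template('annotate.html', s3_post=presigned)

@app.route('/annotate/files', methods=['GET'])
def annotate_files():
    # List the objects in the input bucket and return them as JSON
    response = s3.list_objects_v2(Bucket='gas-inputs')
    files = [obj['Key'] for obj in response.get('Contents', [])]
    return jsonify({'code': 200, 'data': {'files': files}})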
3️⃣ Building the Distributed Infrastructure
Objectives
Build out the distributed infrastructure for this genome
annotation service.
Diagram
Approach
Utilize
two EC2 instances: one running the
web application server and the other running the annotation
server.
Separate the code that serves the user interface from the code
that implements the API and runs AnnTools.
annotator.py: it downloads the input file from S3 and
saves it to the AnnTools instance (into a file structure that
ensures uniqueness for multiple jobs running simultaneously).
run.py: after completion, it copies the annotated results
file and the associated log from the local volume on your
annotator instance to the S3 results bucket. After copying, the
local files on the AnnTools instance must be deleted (otherwise
the EBS volume attached to your instance would eventually run
out of space).
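An illustrative sketch of that file handling follows; the bucket names, paths, file names, and job ID are assumptions:

```python
import os
import boto3

s3 = boto3.resource('s3')
job_id = 'job-1234'                                    # hypothetical job ID
workdir = os.path.join('/home/ubuntu/jobs', job_id)    # unique per-job directory
os.makedirs(workdir, exist_ok=True)

# annotator.py: download the input file from the inputs bucket
input_path = os.path.join(workdir, 'input.vcf')
s3.Bucket('gas-inputs').download_file('userX/input.vcf', input_path)

# run.py (after AnnTools completes): copy results/log to the results bucket, then clean up
for name in ('input.annot.vcf', 'input.vcf.count.log'):
    local_path = os.path.join(workdir, name)
    s3.Bucket('gas-results').upload_file(local_path, f'userX/{job_id}/{name}')
    os.remove(local_path)                               # keep the EBS volume from filling up
```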
The application flow so far
The user requests an annotation by uploading a VCF file.
After the input file is uploaded to S3, the response is
redirected to the annotator instance (and the flow continues
there instead of on our web server).
The annotator then downloads the input file from the input S3
bucket, spawns the annotation job and returns the job_id and
file name to the user.
Once the annotation is complete, the results file and the log
file are moved to the S3 results bucket.
4️⃣ Persisting Data in a Key-Value Store
Objectives
Add persistence that enables distributed services to operate with
greater durability.
Diagram
Approach
We are adding a key-value store (KVS) to persist annotation job
information and allow both services to access/update a job as its
status changes. We will use
DynamoDB
for this purpose. We're using a KVS because we assert that
annotation data will be irregular and hence better suited to a
schema-less database.
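A hedged sketch of the DynamoDB calls (the table name and attribute names are assumptions consistent with the flow described below):

```python
import boto3

dynamodb = boto3.resource('dynamodb', region_name='us-east-1')
table = dynamodb.Table('annotations')                   # assumed table name

# Web app: create the job item when the upload completes
table.put_item(Item={
    'job_id': 'job-1234',
    'input_file_name': 'input.vcf',
    's3_inputs_bucket': 'gas-inputs',
    's3_key_input_file': 'userX/job-1234~input.vcf',
    'job_status': 'PENDING',
})

# Annotator: update the status as the job progresses (PENDING -> RUNNING -> COMPLETED)
table.update_item(
    Key={'job_id': 'job-1234'},
    UpdateExpression='SET job_status = :s',
    ExpressionAttributeValues={':s': 'RUNNING'},
)
```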
The application flow so far
User selects an input file which is uploaded to S3 via a signed
POST request.
S3 sends a request to the redirect URL which must now be handled
by our web app, and not by the annotator.
The web app creates an item in the database and sets its status
to "PENDING".
The web app POSTs a request to the annotator.
The annotator downloads the input file from S3.
The annotator spawns the annotation and updates the job's status
to "RUNNING".
The annotator copies the result and log files to S3.
The annotator again updates the job's status in the database to
"COMPLETED".
5️⃣ Application Decoupling Using Message Queues
Objectives
Use inter-process communication services to allow services to
operate asynchronously (thereby increasing system availability)
and scale independently.
Diagram
Approach
In the previous step (4️⃣), we added a key-value store as the
persistence layer for the service. This allowed the web server and
the annotation server to operate independently of each other. Now
we will add components that further decouple the services and
allow them each to scale independently of the other.
A message queue that will act as a "buffer" between the web app
and the annotator. New annotation requests will be posted to the
message queue and the annotator will retrieve them independently.
Thus, requests can be successfully accepted by this service even
if the annotator service is not available, which is an important
requirement for increasing the availability of a distributed system.
A notification topic that accepts messages (i.e. annotation
requests) from the web app. When a notification is sent to the
topic, a message will be created in the message queue.
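A short sketch of how the web app might publish a request; the topic ARN and message fields are placeholders:

```python
import json
import boto3

sns = boto3.client('sns', region_name='us-east-1')
sns.publish(
    TopicArn='arn:aws:sns:us-east-1:123456789012:job_requests',   # placeholder ARN
    Message=json.dumps({
        'job_id': 'job-1234',
        's3_inputs_bucket': 'gas-inputs',
        's3_key_input_file': 'userX/job-1234~input.vcf',
    }),
)
```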
The application flow so far
User selects an input file which is uploaded to S3 via a signed
POST request.
S3 sends a request to the redirect URL in our web app.
The web app posts a notification message containing the request
data to the SNS topic.
The SQS queue receives the notification and persists a message
containing the request.
The annotator reads the message from the SQS queue.
The annotator extracts the input file name from the message and
downloads it from S3.
The annotator updates the job's status in the database and
spawns the annotation.
The annotator copies the result and log files to S3.
The annotator again updates the job's status in the database.
Important
For the current approach, we use a Python script and
long polling
to read messages from the queue. In a future step, we will use
webhooks to get messages from the queue.
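A minimal sketch of that long-polling loop; the queue URL and message layout are assumptions:

```python
import json
import boto3

sqs = boto3.client('sqs', region_name='us-east-1')
queue_url = 'https://sqs.us-east-1.amazonaws.com/123456789012/job_requests'   # placeholder

while True:
    # WaitTimeSeconds > 0 enables long polling: the call blocks until a message arrives
    resp = sqs.receive_message(QueueUrl=queue_url, MaxNumberOfMessages=1, WaitTimeSeconds=20)
    for msg in resp.get('Messages', []):
        envelope = json.loads(msg['Body'])              # SNS envelope wraps the request data
        job = json.loads(envelope['Message'])
        # ... download the input file and spawn the annotation ...
        sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=msg['ReceiptHandle'])
```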
6️⃣ Integrate Third-Party Services and Other System Components
Objectives
Make the system more functionally complete, integrate with an
external cloud service for payment processing (Stripe), and enable automated scaling. Furthermore, apply asynchronous
communication to another part of the application by implementing a
notifications service in upcoming steps.
Add key functions
Log in (via Globus Auth) to use this service: Some aspects of the service are available only to registered
users. Two classes of users will be supported: Free and
Premium. Premium users will have access to additional
functionality, beyond that available to Free users.
Submit an annotation job: Free users may only
submit jobs of up to a certain size. Premium users may submit
any size job. If a Free user submits an oversized job, the
system will refuse it and will prompt the user to convert to a
Premium user.
Upgrade from a Free to a Premium user: Premium
users will be required to provide a credit card for payment of
the service subscription. This service will integrate with
Stripe
for credit card payment processing.
Receive notifications when annotation jobs finish
: When their annotation request is complete, this service will
send users an email that includes a link where they can view the
log file and download the results file.
Browse jobs and download annotation results:
This service will store annotation results for later retrieval.
Users may view a list of their jobs (completed and running), and
the log file for completed jobs.
Restrict data access for Free users: Free users
may download their results file for a limited time after their
job completes; thereafter the results file is archived, and only
available to them if they convert to a Premium user. Premium
users will always have all their data available for download.
Add and migrate system components
Globus Auth: Globus Auth is an identity and access management service that
allows users to access our application using an existing
identity from thousands of identity providers, including
UChicago.
uWSGI: Substitute the default Flask WSGI server with the uWSGI
server, which is a well-tested, multithreaded WSGI server
suitable for production use. Furthermore, our application will
now listen on port 4433 instead of 5000.
HTTPS: The web server will now respond only to HTTPS requests.
Amazon S3 Glacier: A low-cost, highly durable object store for archiving the
data of Free users.
PostgreSQL: A relational database for user account information.
7️⃣ Upgrading UX/UI, Implementing Webhooks & User Notifications
Objectives
Enhance UX/UI, implement a more robust approach for processing
messages from a queue and apply asynchronous communication to the
application by implementing a notifications service.
Demo Page
Approach
User Interface
Add a route handler that displays a list of all the jobs
submitted by a user.
Add a route handler that displays the details of a requested
job. The page should be rendered in response to a request.
Provide links for users to download the results file and view
the log file for a job.
Implement webhooks
Why?
Requiring the annotator to continuously poll the job requests
queue is not a scalable application design. A better way is
to use a webhook: an HTTP endpoint that is called by the
producer when an event of interest to the consumer occurs (in
our case, when a new job is available). Fortunately, SNS allows
us to send a notification to an HTTP endpoint (in addition to
SQS queues and other subscribers).
Check out the AWS documentation.
How?
Convert annotator.py to a Flask app with a route handler that
acts as a webhook. The annotator will no longer continuously
poll for job request messages; instead, it will do its work only
when it receives a notification from SNS that a job was added to
the job request queue. This requires adding the webhook as a
subscriber to your SNS job request topic. When a new job is
submitted, SNS will push the notification as a POST request to
the webhook, triggering the annotation.
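A hedged sketch of the webhook handler (the route name is hypothetical); note that SNS first sends a SubscriptionConfirmation request that must be confirmed before notifications are delivered:

```python
import json
import requests
from flask import Flask, request

app = Flask(__name__)

@app.route('/process-job-request', methods=['POST'])    # hypothetical route
def process_job_request():
    # SNS posts JSON with a text/plain content type, so parse the raw body
    msg = json.loads(request.data)
    if msg.get('Type') == 'SubscriptionConfirmation':
        requests.get(msg['SubscribeURL'])                # confirm the SNS subscription
    elif msg.get('Type') == 'Notification':
        # Read the pending job request(s) from the SQS queue and spawn the annotation
        pass
    return '', 200
```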
Check out the difference between data polling and a webhook.
User Notifications
We publish a notification to an SNS topic (with a subscribed SQS
queue, as we have for job requests) and run a separate Python
script that sends emails to users. This way, the annotator can
continue to process jobs while notifications for completed jobs
are sent out-of-band. To send email, we will use the
AWS SES service.
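A short SES sketch for that notification script; the sender and recipient addresses are placeholders and must be verified identities in SES:

```python
import boto3

ses = boto3.client('ses', region_name='us-east-1')
ses.send_email(
    Source='no-reply@example.com',
    Destination={'ToAddresses': ['user@example.com']},
    Message={
        'Subject': {'Data': 'Your annotation job is complete'},
        'Body': {'Text': {'Data': 'View the log and download your results at the job details page.'}},
    },
)
```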
8️⃣ Archiving Free User Data
Objectives
Understand the complexities of implementing scheduled/background
tasks in distributed systems.
Approach
The policy for this service is that Free users can only download
their results file for up to 5 minutes after completion of their
annotation job. After 5 minutes elapse, a Free user's results file
(not the log file) will be archived to a Glacier vault. This
allows the service to retain user data at relatively low cost, and
to restore it in the event that the user decides to upgrade to a
Premium user.
I used a Step Functions workflow to trigger a Lambda function that
waits for 5 minutes after the file is annotated.
The Step Functions workflow is defined in
step_function.json:
After 5 minutes, the Lambda function will publish a message to
the SNS topic.
The /archive endpoint then accepts the POST request from my
SNS topic and retrieves and processes an archival message from
my message queue.
The logic in archive_app.py then determines whether the user is
a Premium user (see the sketch below).
If the user is a Premium user, the file will not be archived;
otherwise, it will be archived.
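A hedged sketch of that decision in archive_app.py; the bucket, vault, table, and role names are assumptions:

```python
import boto3

def archive_if_free_user(user_role, s3_key_result_file):
    if user_role == 'premium_user':
        return None                                      # Premium users' data stays in S3
    s3 = boto3.resource('s3')
    glacier = boto3.client('glacier')
    # Read the results file from S3 and push it into the Glacier vault
    body = s3.Object('gas-results', s3_key_result_file).get()['Body'].read()
    archive = glacier.upload_archive(vaultName='gas-archive', body=body)
    s3.Object('gas-results', s3_key_result_file).delete()
    return archive['archiveId']                          # persisted as results_file_archive_id
```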
My rationale for this approach
Using Lambda rather than EC2
With Lambda, I only need to pay for the execution time of my
function. In this case, the function is only triggered after
the file is annotated, and it waits for 5 minutes before
publishing a message to the SNS topic. Therefore, I am only
billed for the execution time of the Lambda function, which
is typically a few milliseconds.
If I used an EC2 instance or any other compute service to
implement the wait state, I would need to pay for the entire
duration of the wait time, which could be costly depending
on the duration of the wait state. Additionally, I would
need to manage and maintain the compute infrastructure,
which requires additional resources and effort.
Another benefit of using Lambda is scalability. AWS Lambda
automatically scales my function to handle incoming
requests, so I don't need to worry about provisioning and
scaling infrastructure. This allows me to handle sudden
spikes in traffic or increased workload without affecting
the performance of my application.
Using Step Functions to trigger the Lambda function
The main purpose is workflow coordination:
Using Step Functions to trigger a Lambda function allows me
to create a coordinated workflow that involves multiple
steps. In this case, I can use Step Functions to orchestrate
the entire process of archiving the file, including the wait
state, sending messages to SNS, and triggering the archive
process. This ensures that each step is executed in the
correct order, and the entire workflow is completed
successfully.
Using SNS to notify the /archive endpoint
Using an SNS topic to notify the /archive endpoint to run the
archive_free_user_data function provides a loosely coupled
architecture, which is a key principle for scalability.
Decoupling the process of archiving from the annotation
process ensures that any changes or scaling issues with one
process do not affect the other.
9️⃣ Subscription Upgrade (Stripe Integration) & Data Restoration
Objectives
Explore the requirements of integrating with a third-party SaaS
system, understand the complexities of working with cloud
archival systems, and experiment with serverless computing.
Approach
Stripe API
I integrate the
Stripe service
to manage all subscription and billing functions for this service.
One of the main reasons is that Stripe has some of the best
API documentation
of any SaaS.
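For illustration, a minimal upgrade sketch with the Stripe Python library; the API key, card token, and price ID are placeholders:

```python
import stripe

stripe.api_key = 'sk_test_placeholder'                   # placeholder test key

# Create a customer from the card token collected on the upgrade page
customer = stripe.Customer.create(
    email='user@example.com',
    source='tok_visa',                                   # Stripe test card token
)

# Start the premium subscription for that customer
subscription = stripe.Subscription.create(
    customer=customer.id,
    items=[{'price': 'price_premium_monthly'}],          # placeholder price ID
)
```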
Thawing / Restoring Workflow
Thaw Webhook: I configure a thaw webhook to
receive a request for restoring data from Glacier. This webhook
will initiate the job to retrieve the archive from Glacier.
Retrieval Job: Once the thaw webhook receives
the request, it triggers a process to initiate a retrieval job
from Glacier using the archive ID. This job retrieves the
archived data from Glacier.
SNS Notification: When the retrieval job is
successfully completed, Glacier sends a notification to an SNS
topic.
Lambda Function: Set up a Lambda function that
is triggered by the SNS topic when it receives the success
message.
Lambda Execution: In the Lambda function, parse
the message received from the SNS topic to extract relevant
information such as the archive ID, job ID, and S3 key file
path.
Retrieve Output Job: Use the job ID obtained
from the message to fetch the output of the retrieval job. This
output contains the restored data.
Store Data in S3: Upload the restored data to
the designated S3 result bucket using the S3 key file path.
Delete Archive: Delete the restored archive
from Glacier to clean up resources and prevent unnecessary
storage costs.
DynamoDB Update: Remove the attribute named
"results_file_archive_id" from the corresponding DynamoDB entry
to reflect the completion of the restoration process.
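A hedged sketch of that Lambda function. It assumes the retrieval job was initiated with a JSON JobDescription carrying the job ID and S3 key; the vault, bucket, and table names are illustrative:

```python
import json
import boto3

def lambda_handler(event, context):
    # The SNS record wraps the Glacier job-completion message
    msg = json.loads(event['Records'][0]['Sns']['Message'])
    desc = json.loads(msg['JobDescription'])             # assumed: set when initiating the job
    glacier = boto3.client('glacier')

    # Fetch the retrieval job output and restore it to the results bucket
    output = glacier.get_job_output(vaultName='gas-archive', jobId=msg['JobId'])
    boto3.client('s3').put_object(
        Bucket='gas-results',
        Key=desc['s3_key_result_file'],
        Body=output['body'].read(),
    )

    # Delete the archive and clear the archive pointer in DynamoDB
    glacier.delete_archive(vaultName='gas-archive', archiveId=msg['ArchiveId'])
    boto3.resource('dynamodb').Table('annotations').update_item(
        Key={'job_id': desc['job_id']},
        UpdateExpression='REMOVE results_file_archive_id',
    )
    return {'statusCode': 200}
```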
🔟 Scaling the Web Server: ELB & Auto Scaling
Objectives
Experiment with automated provisioning and elasticity in a cloud
computing environment.
Diagram
Services
Elastic Load Balancer (ELB): This service
allows HTTP(S) requests to be distributed among multiple
instances in an Auto Scaling group. Check out the
AWS documentation.
EC2 Auto Scaling: This service allows us to
define standard configuration templates and use them to launch
multiple instances as needed, based on user-definable rules.
Check out the
AWS documentation.
Approach
Create a load balancer and associated target group:
The Elastic Load Balancer will receive all requests to this service
and distribute them among multiple EC2 instances running the web
server.
Create a Launch Template: Launch templates
provide a means to save your standard EC2 instance launch
configuration so that it can be used by the auto scaler when
launching new instances.
Create an Auto Scaling Group: Auto Scaling
groups encapsulate the rules that determine how our application
scales out (when demand increases) and in (when demand drops).
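As a small illustration, such rules can be expressed as a target-tracking scaling policy; the group name, policy name, and target value are assumptions, and in practice these can also be configured in the console:

```python
import boto3

autoscaling = boto3.client('autoscaling', region_name='us-east-1')
autoscaling.put_scaling_policy(
    AutoScalingGroupName='gas-web-asg',                  # placeholder group name
    PolicyName='cpu-target-tracking',
    PolicyType='TargetTrackingScaling',
    TargetTrackingConfiguration={
        'PredefinedMetricSpecification': {'PredefinedMetricType': 'ASGAverageCPUUtilization'},
        'TargetValue': 60.0,                             # scale out above ~60% average CPU
    },
)
```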
Different Approach
Webhook vs. Data Polling
Polling: After sending the payment request to
the Payment Service Provider (PSP), the payment service keeps
asking the PSP about the payment status. After several rounds,
the PSP finally returns with the status.
Drawbacks:
Constant polling of the status requires resources from
the payment service.
The external service communicates directly with the
payment service, creating security vulnerabilities.
Webhooks: We can register a webhook with the
external service. It means: call me back at a certain URL when
you have updates on the request. When the PSP has completed the
processing, it will call that HTTP endpoint to update the
payment status. In this way, the programming paradigm is
changed, and the payment service no longer needs to waste
resources polling the payment status.