As Aurora continues to grow and prepare for the launch of Aurora Horizon and Aurora Connect—our autonomous trucking and ride-hailing products—reliable and scalable computing power only becomes more important.
Batch API is designed to efficiently and flexibly manage the massive amounts of computing power we require, allowing our teams to run millions of tasks every day. But to maintain optimal performance, the owners of these tasks also need to be able to analyze the effectiveness of all of this computing power. And to do that, they need quick and reliable access to Batch API data.
In the third and final part of our series on Aurora’s Supercomputer, Batch API, we will discuss the challenges of logging user data and debugging on a massive scale, and how we provide users with tools to evaluate the performance of their jobs via access to historical analytics data. We’ll also talk about the future of our platform and how Batch API is scaling to a billion tasks a day.
System analytics and debugging
Every time a user creates a job with Batch API, the system captures a “log,” or report, of that job and everything that happened while that job was executed. Our engineering teams use these logs to understand how the Aurora Driver is performing on the road, determine what capabilities we need to refine or build, and create simulated scenarios to further test and train our technology. They are also critical to debugging, allowing users to uncover why a particular job failed or produced incorrect results.
Batch API tasks produce several terabytes of log data per month. All of these logs must be stored and made easily accessible in case a user ever needs to review them.
To facilitate log availability, Batch API captures user log data twice: logs are streamed to CloudWatch in real time and then stored in Amazon S3 at task completion.
Logging millions of tasks every day
Streaming logs to CloudWatch means they can be displayed in Batch API’s user interface while a task is progressing. It also means that logs are available even if the node suddenly disappears from the cluster, perhaps due to spot termination or an unexpected crash. This is not possible with Kubernetes pod logs alone, since Kubernetes log data lives on the machine from which it was produced.
Per-account API quota limits in CloudWatch make it difficult to provide reliable multi-tenant programmatic access, which users need in order to directly access and process the log output of their jobs with their own tools. So we created a custom Kubernetes DaemonSet that allows users to configure Batch API to pipe their logs, at task completion, to an S3 bucket and prefix of their own choosing. This makes storage tier, retention, and permissions user-level concerns rather than administrative ones.
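To illustrate the shape of that hand-off, here is a minimal sketch in Go of shipping a completed task's log file to a user-supplied S3 bucket and prefix with aws-sdk-go-v2. The LogSink struct, field names, bucket, and paths are hypothetical stand-ins, not Batch API's actual configuration schema:

```go
package main

import (
	"context"
	"fmt"
	"log"
	"os"
	"path/filepath"

	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/feature/s3/manager"
	"github.com/aws/aws-sdk-go-v2/service/s3"
)

// LogSink is a hypothetical per-task setting: users point their logs at an
// S3 bucket and prefix they own. Batch API's real schema may differ.
type LogSink struct {
	Bucket string
	Prefix string
}

// shipLogs uploads a completed task's log file to the user-chosen bucket/prefix.
func shipLogs(ctx context.Context, sink LogSink, taskID, logPath string) error {
	cfg, err := config.LoadDefaultConfig(ctx)
	if err != nil {
		return err
	}
	f, err := os.Open(logPath)
	if err != nil {
		return err
	}
	defer f.Close()

	uploader := manager.NewUploader(s3.NewFromConfig(cfg))
	key := filepath.Join(sink.Prefix, taskID, filepath.Base(logPath))
	_, err = uploader.Upload(ctx, &s3.PutObjectInput{
		Bucket: &sink.Bucket,
		Key:    &key,
		Body:   f,
	})
	return err
}

func main() {
	// Example values only; in practice these come from the user's job configuration.
	sink := LogSink{Bucket: "my-team-logs", Prefix: "batch-api/jobs"}
	if err := shipLogs(context.Background(), sink, "task-1234", "/var/log/task.log"); err != nil {
		log.Fatal(err)
	}
	fmt.Println("logs shipped")
}
```

Because the destination bucket belongs to the user, lifecycle rules, storage class, and access policy on the uploaded logs remain entirely in the user's hands.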
There are many off-the-shelf products available for capturing and storing Kubernetes logs to external storage mediums, but they typically require:
an unscalable per-node Kubernetes Watch on the Pod API,
an unscalable periodic Pod API list query,
reliance on the unstable Kubelet API, or
some combination of the above three.
Our custom Kubernetes DaemonSet, written in Go and creatively named “Aurora Cluster Logging Daemon,” or ACLD for short, uses the file system itself to discover when a pod has landed on its node. The ACLD then uses an efficient GET operation to the Pod API to discover the pod’s metadata for log-augmentation purposes and creates hard links to the log files it is tailing. This way, the lifetime of the log is decoupled from the lifetime of the container, allowing Batch API to aggressively delete pods and their associated containers without losing the logs, and giving us the ability to quickly scale our Kubernetes clusters.
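The Go sketch below illustrates the general technique under a few assumptions: it is not the ACLD source, the /var/lib/acld destination directory is invented, and a production daemon would also handle log files that appear after the pod directory does (and the requirement that hard links stay on a single filesystem). It watches the kubelet's /var/log/pods directory to discover new pods, issues a single GET for each pod's metadata, and hard-links the container log files so they survive pod deletion:

```go
package main

import (
	"context"
	"log"
	"os"
	"path/filepath"
	"strings"

	"github.com/fsnotify/fsnotify"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

const (
	podLogRoot = "/var/log/pods" // kubelet writes container logs here as <namespace>_<name>_<uid>/...
	destDir    = "/var/lib/acld" // hypothetical node-local directory owned by the daemon
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}
	if err := os.MkdirAll(destDir, 0o755); err != nil {
		log.Fatal(err)
	}

	// Discover pods via the filesystem rather than a Watch or periodic List.
	watcher, err := fsnotify.NewWatcher()
	if err != nil {
		log.Fatal(err)
	}
	defer watcher.Close()
	if err := watcher.Add(podLogRoot); err != nil {
		log.Fatal(err)
	}

	for event := range watcher.Events {
		if event.Op&fsnotify.Create == 0 {
			continue
		}
		// Directory names encode the pod's identity: <namespace>_<name>_<uid>.
		parts := strings.SplitN(filepath.Base(event.Name), "_", 3)
		if len(parts) != 3 {
			continue
		}
		ns, name := parts[0], parts[1]

		// One targeted GET for the pod's metadata, used to augment the captured logs.
		pod, err := client.CoreV1().Pods(ns).Get(context.Background(), name, metav1.GetOptions{})
		if err != nil {
			log.Printf("pod metadata lookup failed: %v", err)
			continue
		}
		log.Printf("tailing logs for pod %s/%s (labels: %v)", ns, pod.Name, pod.Labels)

		// Hard-link each container log so it outlives the pod and its containers.
		logFiles, _ := filepath.Glob(filepath.Join(event.Name, "*", "*.log"))
		for _, f := range logFiles {
			dst := filepath.Join(destDir, ns+"_"+name+"_"+filepath.Base(f))
			if err := os.Link(f, dst); err != nil {
				log.Printf("hard-link failed: %v", err)
			}
		}
	}
}
```

Because discovery is driven by the node's own filesystem, a daemon built this way avoids per-node Watches, periodic list queries, and the Kubelet API entirely.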
Historical task data
Because of its ubiquity at Aurora, Batch API’s historical data is very nearly an audit record of all batch computation for most organizations within the company. This historical data is useful for many purposes (tracking job performance, examining usage patterns, etc.), but there are challenges to making it available in a form suitable for general consumption.
Batch API uses a single SQL database with one read-write instance and one read-only standby instance. This one database serves as both a persistence engine and a message-passing engine between the highly available API Gateway and the Batch Runner backend. This single backend would be a bottleneck for the resource-intensive online analytical processing (OLAP) queries that modern data science takes for granted.
Another issue is that the schema of the backend database is heavily optimized for:
running tasks, and
allowing users to create and monitor tasks.
This optimization is necessary to handle the scale expected of Batch API, but it makes analytical queries directly on the SQL backend clunky and opaque. Certain highly nested structures are stored in encoded form, making them completely inaccessible to SQL queries.
To empower teams to access this data, Batch API has a built-in capability to export all of the configuration and status contents of every API element of a job (jobs, workers, tasks, and task attempt statuses) to S3 in Parquet format. The Parquet data is then indexed into Amazon Athena using an AWS Glue crawler.
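Once the crawler has registered the exported tables, any Athena client can query them. The Go sketch below starts such a query with aws-sdk-go-v2; the database, table, and column names are hypothetical placeholders for whatever schema the export actually produces:

```go
package main

import (
	"context"
	"fmt"
	"log"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/athena"
	"github.com/aws/aws-sdk-go-v2/service/athena/types"
)

func main() {
	ctx := context.Background()
	cfg, err := config.LoadDefaultConfig(ctx)
	if err != nil {
		log.Fatal(err)
	}
	client := athena.NewFromConfig(cfg)

	// Hypothetical database, table, and column names standing in for the
	// crawled schema of the exported Parquet data.
	query := `
	    SELECT job_id,
	           COUNT(*)             AS attempts,
	           AVG(runtime_seconds) AS avg_runtime_s
	    FROM batch_api_history.task_attempts
	    WHERE date = '2022-06-01'
	    GROUP BY job_id
	    ORDER BY avg_runtime_s DESC
	    LIMIT 20`

	out, err := client.StartQueryExecution(ctx, &athena.StartQueryExecutionInput{
		QueryString: aws.String(query),
		QueryExecutionContext: &types.QueryExecutionContext{
			Database: aws.String("batch_api_history"),
		},
		ResultConfiguration: &types.ResultConfiguration{
			OutputLocation: aws.String("s3://my-athena-results/batch-api/"),
		},
	})
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("query started:", aws.ToString(out.QueryExecutionId))
}
```

With the data in Parquet behind Athena, this kind of aggregate question runs against S3 directly and never touches the operational SQL backend.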