The content of each partition is independent of other partitions in the table. Independent partitions enable at-scale distributed processing and writing since multiple machines in a compute cluster can process and write partitions in parallel.
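To make the parallelism concrete, here is a minimal sketch (not the library's actual API) of how independent partitions allow workers to write without coordinating; the in-memory dict stands in for a blob store, and the partition layout and key scheme are invented for illustration:

```python
from concurrent.futures import ThreadPoolExecutor

blob_store = {}  # stands in for s3://... object storage

def write_partition(partition_id, rows):
    # Because partitions are independent, each worker serializes and
    # "uploads" its partition with no locks or cross-partition coordination.
    key = f"table1/partition={partition_id}/part-0"
    blob_store[key] = "\n".join(map(str, rows))
    return key

partitions = {0: [1, 2, 3], 1: [4, 5], 2: [6]}

with ThreadPoolExecutor(max_workers=3) as pool:
    keys = list(pool.map(write_partition, partitions.keys(), partitions.values()))
```

In a real cluster the workers would be separate machines, but the property that makes this safe is the same: no partition's bytes depend on any other partition's.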
Reading a single row
When training or evaluating a model, or simply exploring the data, a user needs to read one or more fields from a row. Suppose a user is training a model that consumes lidar and radar data but does not need camera images. First, we load the table index; then we create a row-loader object to access data from the blob store; finally, we load all lidar and radar fields from the row identified by its position (the row_pos argument of the get_row function). The library automatically chooses which column-groups to load from the blob store, based on its knowledge of the table’s physical layout and statistics about data sizes (the same field may be stored in more than one column-group to reduce the number of storage read requests):
index = tt.read_index("s3://.../table1")
row_loader = tt.row_loader(index)
# Read a row
row = row_loader.get_row(row_pos, columns=["lidar.*", "radar.*"])
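One plausible way to implement the column-group choice is a greedy cover over the requested fields. The sketch below is an assumption, not the library's actual logic, and the group names and field layout are invented; note that because "lidar.points" and "lidar.intensity" also appear in a combined lidar+radar group, the request above can be served with a single storage read:

```python
# Hypothetical physical layout: each column-group stores a set of fields,
# and a field may be duplicated across groups to reduce read requests.
column_groups = {
    "cg_camera": {"camera.front", "camera.rear"},
    "cg_lidar": {"lidar.points", "lidar.intensity"},
    "cg_lidar_radar": {"lidar.points", "lidar.intensity", "radar.returns"},
}

def plan_reads(requested_fields):
    """Greedily pick column-groups until every requested field is covered."""
    remaining = set(requested_fields)
    chosen = []
    while remaining:
        # Pick the group that covers the most still-missing fields.
        best = max(column_groups, key=lambda g: len(column_groups[g] & remaining))
        if not column_groups[best] & remaining:
            raise KeyError(f"no column-group stores {sorted(remaining)}")
        chosen.append(best)
        remaining -= column_groups[best]
    return chosen

plan_reads({"lidar.points", "lidar.intensity", "radar.returns"})
# -> ["cg_lidar_radar"]: one read request instead of two
```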
Reading consecutive rows
If a model makes use of temporal information in sensor data, it will take a set of consecutive frames as input. In this case, we can use the row_loader object’s get_rows method, which reads data from multiple consecutive rows with a single input-output (IO) request. Here, the offsets=range(-10, 0) argument requests that a “history” of ten rows be included in the result.
rows = row_loader.get_rows(row_pos, columns=["lidar.*", "radar.*"], offsets=range(-10, 0))
Joining two tables
Sometimes we want to join two tables (e.g., joining an existing sensor table with a new labels table lets us upgrade the labels while preserving the same sensor data). To do this, we mimic the pandas API with a tt.merge(...) function. The rest of the code can use the merged index transparently and perform any table operation on it (including further tt.merge calls).
sensors_index = tt.read_index("s3://.../sensors")
labels_index = tt.read_index("s3://.../labels_v2")
merged_index = tt.merge(sensors_index, labels_index, on=["log_id", "timestamp"])
row_loader = tt.row_loader(merged_index)
row = row_loader.get_row(row_pos, columns=["lidar.*", "radar.*", "labels.*"])
Figure 7 illustrates the “secret sauce” behind the tt.merge(...) implementation: the ability to merge the indexes’ “service” fields in a way that lets the merged index reference data from multiple blob stores.
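The idea can be sketched in a few lines. The field names below ("blob_uri", "row_pos", the "sensors."/"labels." prefixes) are assumptions for illustration, not the library's actual index schema; the point is that each index row carries service fields saying where its payload lives, so after the join a single logical row references bytes in both blob stores without copying any sensor data:

```python
# Toy index tables: key columns plus per-table "service" fields.
sensors_index = [
    {"log_id": "a", "timestamp": 1, "blob_uri": "s3://sensors/p0", "row_pos": 0},
    {"log_id": "a", "timestamp": 2, "blob_uri": "s3://sensors/p0", "row_pos": 1},
]
labels_index = [
    {"log_id": "a", "timestamp": 1, "blob_uri": "s3://labels_v2/p0", "row_pos": 0},
    {"log_id": "a", "timestamp": 2, "blob_uri": "s3://labels_v2/p0", "row_pos": 1},
]

def merge_indexes(left, right, on):
    """Inner-join two index tables on the key columns, prefixing each
    table's service fields so both blob-store references survive."""
    right_by_key = {tuple(r[k] for k in on): r for r in right}
    merged = []
    for lrow in left:
        rrow = right_by_key.get(tuple(lrow[k] for k in on))
        if rrow is None:
            continue  # inner join: keep only keys present in both indexes
        row = {k: lrow[k] for k in on}
        row.update({f"sensors.{k}": v for k, v in lrow.items() if k not in on})
        row.update({f"labels.{k}": v for k, v in rrow.items() if k not in on})
        merged.append(row)
    return merged

merged = merge_indexes(sensors_index, labels_index, on=["log_id", "timestamp"])
```

A row loader built on such a merged index resolves "lidar.*" fields through the sensors service fields and "labels.*" fields through the labels service fields, issuing reads against whichever store holds each column-group.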