REVIEW: EVA - AI-Relational Database System
Deep dive into a database that brings deep learning models and unstructured data into SQL
https://github.com/georgia-tech-db/eva: v0.2.4 dated May 18th 2023.
EVA is a database system for building AI applications. It employs the AI-relational approach by supporting deep learning models for structured and unstructured data. It comes with a wide range of models for analyzing unstructured data, such as image classification, object detection, OCR, and face detection. The built-in collection of optimizations for sampling, caching and filtering speeds up queries on large datasets and saves money spent on model inference.
The system offers EVAQL, a declarative query language derived from SQL that enables analytics of unstructured data such as text and media. One of the prominent features of EVAQL is its support for invoking ML models via user-defined functions (UDFs). It allows constructing complex pipelines of models by simply nesting UDF calls.
EVA operates on top of a conventional relational database system and stores unstructured data in the so-called media storage system and embeddings in an optional vector database system.
Let’s jump right into EVA’s codebase to see how exactly the key components are implemented.
Server
EVA offers a TCP server written in Python using asyncio. The choice of asyncio seems reasonable as the system integrates various distributed components and performs coordination which mostly relies on IO.
eva/eva_server.py#L35
Here we see that EVA initializes its built-in UDFs and indefinitely runs the server task. Although `asyncio.run` here looks quite ordinary and expected, keep it in mind as we will get back to the topic of asyncio loops soon.
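One property of `asyncio.run` worth keeping in mind is that every call creates a fresh event loop and closes it on exit. The snippet below is a small standalone illustration of that behavior, not code from EVA; `get_loop` is a name made up for this example.

```python
import asyncio


async def get_loop():
    # Return the event loop that is currently driving this coroutine.
    return asyncio.get_running_loop()


# Each asyncio.run call sets up its own loop and tears it down afterwards.
first = asyncio.run(get_loop())
second = asyncio.run(get_loop())

print(first is second)    # False: two calls, two distinct loop objects
print(first.is_closed())  # True: the loop is closed once asyncio.run returns
```

Anything that captured a reference to the first loop can no longer schedule work on it after `asyncio.run` has returned.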
Let’s dive into details of the server implementation.
eva/server/server.py#L23
The server keeps a dict of tasks handling accepted client connections. This is needed to avoid unwanted garbage collection and cancellation of those tasks. Though perfectly fine, `add_done_callback` can be considered too low-level, and the logic above does not guarantee graceful termination of tasks. There are tools that bring some elements of structured concurrency to asyncio, such as `aiotools.PersistentTaskGroup` and potentially `asyncio.Supervisor`, that can be particularly useful here. Additionally, direct usage of `asyncio.Task` is discouraged in favor of `asyncio.create_task`.
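The bookkeeping discussed above can be sketched with the standard library alone. This is a minimal illustration rather than EVA's code: `ConnectionTracker`, `spawn` and `shutdown` are hypothetical names, a set stands in for the dict the server uses, and the explicit `shutdown` step is one assumed way to get the graceful termination that a bare done-callback does not provide.

```python
import asyncio


class ConnectionTracker:
    """Hypothetical helper that holds strong references to in-flight tasks."""

    def __init__(self):
        self._tasks = set()

    def __len__(self):
        return len(self._tasks)

    def spawn(self, coro):
        # asyncio.create_task is the recommended way to schedule a coroutine.
        task = asyncio.create_task(coro)
        # The event loop keeps only weak references to tasks, so hold one here.
        self._tasks.add(task)
        # Drop the reference once the task finishes, whatever the outcome.
        task.add_done_callback(self._tasks.discard)
        return task

    async def shutdown(self):
        # Cancel whatever is still running and wait for it to unwind.
        pending = list(self._tasks)
        for task in pending:
            task.cancel()
        await asyncio.gather(*pending, return_exceptions=True)
```

On Python 3.11 and later, `asyncio.TaskGroup` offers a similar guarantee for tasks whose lifetime fits inside a single `async with` block. A long-lived server that accepts connections indefinitely is the case where a persistent group, like the aiotools one mentioned above, tends to fit better.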
Next is `handle_client`, where it gets more interesting.
eva/server/server.py#L69
The server speaks a specialized text-based protocol. It expects raw SQL-like expressions separated by newlines. `EXIT;` and `QUIT;` cause the connection to be closed.
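To make the framing of this protocol concrete, here is a rough sketch of the reading side only. It is not EVA's implementation: `read_statements` is a hypothetical helper, and the UTF-8 decoding, whitespace stripping, skipping of blank lines and exact matching of the two terminators are all assumptions. It deliberately leaves out how each expression is dispatched for execution.

```python
import asyncio

# Per the protocol described above, either statement closes the connection.
TERMINATORS = {"EXIT;", "QUIT;"}


async def read_statements(reader):
    """Yield newline-separated statements until EOF or a terminator.

    `reader` is anything with an awaitable readline() returning bytes,
    for example an asyncio.StreamReader.
    """
    while True:
        raw = await reader.readline()
        if not raw:
            # Empty bytes means the peer closed the connection.
            return
        # Assumption: statements are UTF-8 and surrounding whitespace is noise.
        statement = raw.decode("utf-8").strip()
        if not statement:
            # Assumption: blank lines are ignored.
            continue
        if statement in TERMINATORS:
            return
        yield statement
```

A connection callback passed to `asyncio.start_server` could consume this with `async for statement in read_statements(reader)` and close the writer once the loop ends.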
What’s peculiar about `handle_client` is that, for some reason, it spawns a new task per expression, this time using `asyncio.create_task`. It does not keep track of these newly created tasks, however, which has some implications: