Introducing StarfishQL

We are pleased to introduce StarfishQL to the Rust community today. StarfishQL is a graph database and query engine that enables graph analysis and visualization on the web. It is an experimental project whose primary purpose is to explore the dependency network of Rust crates published on crates.io.

Motivation

StarfishQL is a framework for providing a graph database and a graph query engine that interacts with it.

We use a concrete example, Freeport, built on the crate dependency graph of crates.io, to show StarfishQL in action.

At the end of the day, we're interested in performing graph analysis, that is, extracting meaningful information from plain graph data. We believe visualization is a crucial aid in achieving that.

StarfishQL's query engine is designed to support different forms of visualization through a flexible query language. So far, however, development has centred on the following two use cases, as showcased in our demo apps.

Traverse the dependency graph in the normal direction starting from the N most connected nodes.

Traverse the dependency tree in both forward and reverse directions starting from a particular node.
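To make the two traversals concrete, here is a minimal sketch in Rust. It is not StarfishQL's implementation: the `DepGraph` type, the `reachable` method, and the sample edges are all hypothetical, and the graph lives in memory rather than in a database. It shows a breadth-first walk that can follow edges forward (dependencies) or in reverse (dependants), limited to a given depth.

```rust
use std::collections::{BTreeSet, HashMap, VecDeque};

/// Hypothetical in-memory dependency graph (illustration only).
struct DepGraph {
    forward: HashMap<String, Vec<String>>, // crate -> crates it depends on
    reverse: HashMap<String, Vec<String>>, // crate -> crates that depend on it
}

impl DepGraph {
    /// Builds both adjacency maps from `(dependant, dependency)` pairs.
    fn from_edges(edges: &[(&str, &str)]) -> Self {
        let mut forward: HashMap<String, Vec<String>> = HashMap::new();
        let mut reverse: HashMap<String, Vec<String>> = HashMap::new();
        for &(from, to) in edges {
            forward.entry(from.to_string()).or_default().push(to.to_string());
            reverse.entry(to.to_string()).or_default().push(from.to_string());
        }
        DepGraph { forward, reverse }
    }

    /// Breadth-first walk from `root`, following at most `max_depth` hops.
    /// With `reverse = true` the walk follows dependants instead of dependencies.
    fn reachable(&self, root: &str, max_depth: usize, reverse: bool) -> BTreeSet<String> {
        let adjacency = if reverse { &self.reverse } else { &self.forward };
        let mut seen = BTreeSet::new();
        let mut queue = VecDeque::new();
        queue.push_back((root.to_string(), 0usize));
        while let Some((node, depth)) = queue.pop_front() {
            if depth == max_depth {
                continue; // do not expand nodes at the depth limit
            }
            if let Some(neighbours) = adjacency.get(&node) {
                for next in neighbours {
                    if next != root && seen.insert(next.clone()) {
                        queue.push_back((next.clone(), depth + 1));
                    }
                }
            }
        }
        seen
    }
}

fn main() {
    // Made-up sample edges; not data from crates.io.
    let graph = DepGraph::from_edges(&[
        ("tokio", "bytes"),
        ("reqwest", "tokio"),
        ("reqwest", "serde"),
        ("serde_json", "serde"),
    ]);
    println!("dependencies of reqwest: {:?}", graph.reachable("reqwest", 2, false));
    println!("dependants of serde: {:?}", graph.reachable("serde", 1, true));
}
```

Keeping a separate reverse adjacency map trades memory for speed: walking dependants then costs the same as walking dependencies.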

Design

In general, a query engine takes input queries written in a specific query language (e.g. SQL statements), performs the necessary operations in the database, and then outputs the data of interest to the user application. You may also view a query engine as an abstraction layer such that the user can design queries simply in the supported query language and let the query engine do the rest.

In the case of a graph query engine, the output data is a graph (wiki).

Graph query engine overview

In the case of StarfishQL, the query language is a custom language we defined in the JSON format, which enables the engine to be highly accessible and portable.

Implementation

In the example of Freeport, StarfishQL consists of the following three components.

Graph Query Engine

As a core component of StarfishQL, the graph query engine is a Rust backend application powered by the Rocket web framework and the SeaQL ecosystem.

The engine listens at the following endpoints for the corresponding operation:

You could also invoke the endpoints above programmatically.

Graph data are stored in a relational database:

Metadata - Definition of each entity and relation, e.g. attributes of crates and dependency
Node Data - An instance of an entity, e.g. crate name and version number
Edge Data - An instance of a relation, e.g. one crate depends on another
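The three layers above can be pictured with a few plain Rust types. This is only a sketch under our own assumptions: the struct names, field names, and the two validation helpers are hypothetical and do not reflect StarfishQL's actual schema. It shows how metadata (definitions) constrains node and edge data (instances).

```rust
use std::collections::HashMap;

/// Metadata: definition of an entity, e.g. "crate" with a "version" attribute.
struct EntityDef {
    name: String,
    attributes: Vec<String>,
}

/// Metadata: definition of a relation between two entities, e.g. "depends".
struct RelationDef {
    name: String,
    from_entity: String,
    to_entity: String,
}

/// Node data: one instance of an entity, e.g. a particular crate.
struct Node {
    entity: String,
    name: String,
    attributes: HashMap<String, String>,
}

/// Edge data: one instance of a relation, e.g. one crate depends on another.
struct Edge {
    relation: String,
    from_node: String,
    to_node: String,
}

/// A node is valid if it names the entity and only uses declared attributes.
fn node_matches_def(node: &Node, def: &EntityDef) -> bool {
    node.entity == def.name && node.attributes.keys().all(|k| def.attributes.contains(k))
}

/// An edge is valid if it names the relation and both endpoints exist
/// as nodes of the entity types the relation expects.
fn edge_is_valid(edge: &Edge, rel: &RelationDef, nodes: &[Node]) -> bool {
    let has = |name: &str, entity: &str| nodes.iter().any(|n| n.name == name && n.entity == entity);
    edge.relation == rel.name
        && has(&edge.from_node, &rel.from_entity)
        && has(&edge.to_node, &rel.to_entity)
}

/// Helper to build a crate node with a version attribute (sample data only).
fn crate_node(name: &str, version: &str) -> Node {
    Node {
        entity: "crate".to_string(),
        name: name.to_string(),
        attributes: HashMap::from([("version".to_string(), version.to_string())]),
    }
}

fn main() {
    let crate_def = EntityDef { name: "crate".to_string(), attributes: vec!["version".to_string()] };
    let depends = RelationDef {
        name: "depends".to_string(),
        from_entity: "crate".to_string(),
        to_entity: "crate".to_string(),
    };
    // Made-up version numbers, for illustration only.
    let nodes = vec![crate_node("serde_json", "1.0.0"), crate_node("serde", "1.0.0")];
    let edge = Edge {
        relation: "depends".to_string(),
        from_node: "serde_json".to_string(),
        to_node: "serde".to_string(),
    };
    println!("node ok: {}", node_matches_def(&nodes[0], &crate_def));
    println!("edge ok: {}", edge_is_valid(&edge, &depends, &nodes));
}
```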

crates.io Crawler

To obtain the crate data to insert into the database, we used a fast, non-disruptive crawler on a local clone of the public index repo of crates.io.

Graph Visualization

We used d3.js to create force-directed graphs to display the results. The two colourful graphs above are such products.

Findings

Here are some interesting findings we made during the process.

List of top 10 crates ordered by different decay modes.

Decay Mode: Immediate / Simple Connectivity
crate	connectivity
serde	17,441
serde_json	10,528
log	9,220
clap	6,323
thiserror	5,547
rand	5,340
futures	5,263
lazy_static	5,211
tokio	5,168
chrono	4,794

Decay Mode: Medium (.5) / Complex Connectivity
crate	connectivity
quote	4,126
syn	4,069
pure-rust-locales	4,067
reqwest	3,950
proc-macro2	3,743
num_threads	3,555
value-bag	3,506
futures-macro	3,455
time-macros	3,450
thiserror-impl	3,416

Decay Mode: None / Compound Connectivity
crate	connectivity
unicode-xid	54,982
proc-macro2	54,949
quote	54,910
syn	54,744
rustc-std-workspace-core	51,650
libc	51,645
serde_derive	51,056
serde	51,054
jobserver	50,567
cc	50,566

If we look at Decay Mode: Immediate, where the connectivity is simply the number of immediate dependants, we can see that serde and serde_json are at the top. I guess that supports our decision to define the query language in JSON.

Decay Mode: None tells another interesting story: when the connectivity counts the entire tree of dependants, we are looking at the truly core crates that are nested deep inside the largest number of crates. In other words, these are the ones that are built along with the most crates. Under this setting, the utility crates that interact with the low-level, more fundamental aspects of Rust rank higher, like quote with syntax trees, proc-macro2 with procedural macros, and unicode-xid with Unicode checking.
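The two extremes described above can be sketched in Rust as a weighted count of dependants. The post does not spell out how the decay is applied, so the formula here is purely an assumption for illustration: a dependant found `d` hops beyond the immediate level contributes `decay^d`, giving 0.0 for Immediate, 1.0 for None, and 0.5 for Medium. The `connectivity` function and the sample graph are hypothetical, and this sketch is not expected to reproduce the figures in the tables.

```rust
use std::collections::{HashMap, HashSet, VecDeque};

/// Weighted count of the dependants of `root` (assumed formula, see above).
/// `dependants` maps a crate to the crates that depend on it directly.
/// Each dependant is counted once, at its shortest distance from `root`.
fn connectivity(dependants: &HashMap<&str, Vec<&str>>, root: &str, decay: f64) -> f64 {
    let mut seen: HashSet<&str> = HashSet::new();
    seen.insert(root);
    // `depth` is the number of hops beyond the immediate dependants.
    let mut queue = VecDeque::from([(root, 0i32)]);
    let mut score = 0.0;
    while let Some((node, depth)) = queue.pop_front() {
        for &next in dependants.get(node).into_iter().flatten() {
            if seen.insert(next) {
                score += decay.powi(depth); // immediate dependants weigh 1.0
                queue.push_back((next, depth + 1));
            }
        }
    }
    score
}

fn main() {
    // Made-up reverse-dependency edges; not data from crates.io.
    let dependants = HashMap::from([
        ("unicode-xid", vec!["proc-macro2"]),
        ("proc-macro2", vec!["quote", "syn"]),
        ("quote", vec!["syn"]),
    ]);
    for (label, decay) in [("immediate", 0.0), ("medium", 0.5), ("none", 1.0)] {
        println!("{label}: {}", connectivity(&dependants, "unicode-xid", decay));
    }
}
```

With a decay of 0.0 only direct dependants count, while a decay of 1.0 counts every crate in the dependant tree equally, which is why deeply nested crates rise to the top in that mode.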

19,369 out of 79,972 crates, or 24% of the crates, do not depend on any crates.

e.g. a, a-, a0, ..., zyx_test, zz-buffer, z_table

In other words, about 76% of the crates are standing on the shoulders of giants! 💪

53,910 out of 79,972 crates, or 67% of the crates, have no dependants, i.e. no other crates depend on them.

e.g. a, a-, a-bot, ..., zzp-tools, zzz, z_table

We imagine many of those crates are binaries/executables, if only we could figure out a way to check that... 🤔
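For readers who want to check the two counts, here is a small Rust sketch. The function names and the toy edge list are hypothetical; only the totals passed to `percent` in `main` come from the figures above. Crates that never appear as the source of an edge have no dependencies, and crates that never appear as the target have no dependants.

```rust
use std::collections::HashSet;

/// Returns (crates with no dependencies, crates with no dependants).
/// `edges` holds `(dependant, dependency)` pairs.
fn isolated_counts(crates: &[&str], edges: &[(&str, &str)]) -> (usize, usize) {
    let has_deps: HashSet<&str> = edges.iter().map(|&(from, _)| from).collect();
    let has_dependants: HashSet<&str> = edges.iter().map(|&(_, to)| to).collect();
    let no_deps = crates.iter().filter(|&&c| !has_deps.contains(c)).count();
    let no_dependants = crates.iter().filter(|&&c| !has_dependants.contains(c)).count();
    (no_deps, no_dependants)
}

/// Share of `part` in `total`, rounded to the nearest whole percent.
fn percent(part: usize, total: usize) -> u32 {
    (100.0 * part as f64 / total as f64).round() as u32
}

fn main() {
    // Toy graph: a -> b -> c, with d unconnected.
    let (no_deps, no_dependants) = isolated_counts(&["a", "b", "c", "d"], &[("a", "b"), ("b", "c")]);
    println!("toy graph: {no_deps} without dependencies, {no_dependants} without dependants");

    // Totals quoted in the findings.
    println!("no dependencies: {}%", percent(19_369, 79_972));
    println!("no dependants: {}%", percent(53_910, 79_972));
}
```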

As of March 30, 2022

Conclusion

StarfishQL allows flexible and portable definition, manipulation, retrieval, and visualization of graph data.

The graph query engine, built in Rust, provides a nice interface for any web application to access data in the relational graph database with stable performance and memory safety.

Admittedly, StarfishQL is still in its infancy, so every detail in the design and implementation is subject to change. The upside is that, as with other open-source projects developed by brilliant Rust developers, you can contribute to it if you find the concept interesting. With its addition to the SeaQL ecosystem, together we are one step closer to the vision of Rust for data engineering.

People

StarfishQL was created by the following SeaQL team members:

Chris Tsang

Billy Chan

Sanford Pun

Contributing

We are super excited to be selected as a Google Summer of Code 2022 mentor organization!

StarfishQL is one of the GSoC project ideas open for development proposals. Join us in GSoC 2022 by following the instructions in the GSoC Contributing Guide.
