In Hawkular Inventory, we use the Tinkerpop API (version 2 for the time being)
to store our inventory model in a graph database. We chose Titan as the storage
engine, configured to store the data in the Cassandra cluster that also backs
Hawkular Metrics and Alerts. This blog post will guide you through some of the
performance-related lessons with Titan that we have learned so far.
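Just to illustrate the setup, opening a Titan graph backed by Cassandra looks
roughly like the following - a minimal sketch from the Gremlin 2.x console, not
the actual Hawkular configuration, with a placeholder hostname:

import org.apache.commons.configuration.BaseConfiguration
import com.thinkaurelius.titan.core.TitanFactory

// storage.backend and storage.hostname are standard Titan storage options;
// the hostname is just a placeholder here
conf = new BaseConfiguration()
conf.setProperty("storage.backend", "cassandra")
conf.setProperty("storage.hostname", "127.0.0.1")

// the returned graph speaks the Tinkerpop 2 Blueprints and Gremlin APIs
g = TitanFactory.open(conf)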
Inventory is under heavy development with a lot of redesign and refactoring
going on between releases, so we took quite a naive approach to storing data
in and querying data from the graph database. That is, we store entities from
our model as vertices in the graph and the relationships between the entities
as edges in the graph. Quite simple, and a textbook example of how it should
look.
We did declare a couple of indices in the database on the read-only aspects
of the vertices (i.e. the "type" of the entity the vertex corresponds to), but
we didn’t actually pay too much attention to performance. We wanted to get the
model right first.
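For the record, an index on the "type" property can be declared through the
generic Blueprints key-index API roughly as sketched below (Titan also offers
its own, richer index-management API whose exact calls differ between
versions); with Titan, this has to happen before the property is first used:

import com.tinkerpop.blueprints.Vertex

// declare a key index over the "type" property of vertices
g.createKeyIndex("type", Vertex.class)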
Fast forward a couple of months and, of course, performance started to be a
real problem. The Hawkular agent for WildFly inserts a non-trivial number of
entities, and both inserting them and querying them saw a huge performance
degradation compared to the simple examples we were unit testing with (due to
the number of vertices and edges stored).
The time has come to think about how to squeeze some performance out of Titan
as well as how to store the data and query it more intelligently.
So what did we do, you ask, to gain an order of magnitude speed up?
There are actually two aspects that needed our attention - insert performance
and query performance, the latter being where we are an order of magnitude
faster now. I will focus only on query performance in this post.
As a model example, let’s consider the following query: find me all resources
in a certain feed that have a certain resource type.
For illustration purposes, we will be querying the following fabricated graph.
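To make the example reproducible, here is a rough Gremlin 2.x console sketch
that builds such a graph in memory (an in-memory TinkerGraph stands in for
Titan), following the naive storage scheme described above; the entity helper
closure is just a convenience invented for this sketch:

import com.tinkerpop.blueprints.impls.tg.TinkerGraph

g = new TinkerGraph()

// an entity becomes a vertex with a "type" and a "name" property
entity = { type, name ->
    def v = g.addVertex(null)
    v.setProperty("type", type)
    v.setProperty("name", name)
    v
}

tenant = entity("tenant", "Red Hat, Inc.")
env    = entity("environment", "staging")
feed   = entity("feed", "test.redhat.com")
eap1   = entity("resource", "eap1-test.redhat.com")
eap2   = entity("resource", "eap2-test.redhat.com")
rt     = entity("resourceType", "JBoss EAP")

// relationships between entities become labeled edges
g.addEdge(null, tenant, env, "contains")
g.addEdge(null, env, feed, "contains")
g.addEdge(null, feed, eap1, "contains")
g.addEdge(null, feed, eap2, "contains")
g.addEdge(null, rt, eap1, "defines")
g.addEdge(null, rt, eap2, "defines")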
In the Gremlin query language, our example query, without any of the
optimizations that we are going to describe later in the post, would be
expressed like this:
g.V() (1)
.has("name", "Red Hat, Inc.")
.has("type", "tenant") (2)
.out("contains") (3)
.has("name", "staging")
.has("type", "environment") (4)
.out("contains") (5)
.has("name", "test.redhat.com")
.has("type", "feed") (6)
.out("contains") (7)
.has("type", "resource") (8)
.as("result") (9)
.in("defines") (10)
.has("type", "resourceType")
.has("name", "JBoss EAP") (11)
.back("result"); (12)
1. For all vertices in the graph
2. choose those that have the type "tenant" and the name "Red Hat, Inc."
3. go out from them, following the "contains" edges, to the target vertices
4. choose those that have the type "environment" and the name "staging"
5. go out, following "contains" edges
6. choose vertices with the type "feed" and the name "test.redhat.com"
7. go out, following "contains" edges
8. choose vertices with the type "resource"
9. mark that position in the traversal as "result"
10. follow "defines" edges that point to the "result" vertices
11. out of the source vertices of those edges, choose the ones with the type
    "resourceType" and the name "JBoss EAP"
12. if the above yields a vertex, go back to the "result" and use that instead
So out of that pipeline, as they call it, come the vertices representing the
resources with the desired resource type that live under the given feed. E.g.
in the example above, the query will return eap1-test.redhat.com and
eap2-test.redhat.com.
So what’s wrong with that, you ask?