Currently, the schema for JanusGraph is basically only a list of allowed
labels (for vertices and edges) and available property keys. What's missing in
my opinion is the option to specify which vertex and edge labels can have which
property keys, and which edge labels can connect which vertex labels.
Just to give an idea of what I mean, here are two examples for the Graph of the Gods:
- Gods can have the property keys name and age, whereas locations only
have a name (no age allowed).
- The edge label brother can connect gods, but not a god with a
location.
This is of course only a toy graph, but I suspect that most real-world data
models contain similar constraints.
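To make the two examples concrete, here is a minimal Python sketch of what such
constraints could look like as data, together with the two checks. This is not
JanusGraph API: StrictSchema, SchemaViolation, check_property and check_edge are
names made up for illustration, and the vertex labels god and location are
assumed from the examples above.

```python
from dataclasses import dataclass, field


class SchemaViolation(Exception):
    """Raised when data does not comply with the strict schema (hypothetical)."""


@dataclass
class StrictSchema:
    # label -> property keys allowed on elements with that label
    properties: dict[str, set[str]] = field(default_factory=dict)
    # edge label -> allowed (outgoing vertex label, incoming vertex label) pairs
    connections: dict[str, set[tuple[str, str]]] = field(default_factory=dict)

    def check_property(self, label: str, key: str) -> None:
        """Reject a property key that is not declared for the given label."""
        if key not in self.properties.get(label, set()):
            raise SchemaViolation(f"'{label}' may not have property key '{key}'")

    def check_edge(self, edge_label: str, out_label: str, in_label: str) -> None:
        """Reject an edge whose endpoint labels are not declared for its label."""
        if (out_label, in_label) not in self.connections.get(edge_label, set()):
            raise SchemaViolation(
                f"'{edge_label}' may not connect '{out_label}' to '{in_label}'"
            )


# The two constraints from the examples above, expressed as data.
gods_schema = StrictSchema(
    properties={"god": {"name", "age"}, "location": {"name"}},
    connections={"brother": {("god", "god")}},
)

gods_schema.check_property("god", "age")         # allowed, returns silently
gods_schema.check_edge("brother", "god", "god")  # allowed, returns silently
```

With this sketch, check_property("location", "age") and
check_edge("brother", "god", "location") would both raise SchemaViolation.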
If JanusGraph lets users enforce those constraints, they can be sure that no
client of their database can insert data that violates them (e.g., a brother
edge that connects a god with a location). A strict schema therefore ensures
that the graph stays in a consistent state with respect to those
constraints.[1]
In schema-less databases, this schema is often encoded implicitly in the
client applications, since those applications need to know how to access the
data. So even if the database is schema-less, there is still an implicit
schema. Leaving the schema out of the database therefore doesn't really make it
easier to change: it still has to be updated in every client application.
Having this schema explicitly defined in JanusGraph also makes it easy
to tell new users what kind of data they can expect, e.g., they know that a
location can't have an age, but a god can. This would also allow tools to fetch
the schema from a JanusGraph instance to visualize it. Such a visualization
makes it much easier to reason about the schema, as it provides an
easy-to-understand representation of it.
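As a rough sketch of what such a tool could do, the following Python function
turns a schema description into Graphviz DOT text. It assumes the schema has
already been fetched into two plain dictionaries. How that fetch would work is
left open here, and schema_to_dot is a hypothetical helper, not an existing
tool. The lives edge label is taken from the Graph of the Gods data set.

```python
def schema_to_dot(properties, connections):
    """Render a schema description as Graphviz DOT text.

    properties:  vertex label -> set of allowed property keys
    connections: edge label -> set of (out vertex label, in vertex label) pairs
    """
    lines = ["digraph schema {"]
    # One node per vertex label, annotated with its allowed property keys.
    for label in sorted(properties):
        keys = ", ".join(sorted(properties[label]))
        lines.append(f'  "{label}" [label="{label}\\n({keys})"];')
    # One arrow per allowed connection, annotated with the edge label.
    for edge_label in sorted(connections):
        for out_label, in_label in sorted(connections[edge_label]):
            lines.append(f'  "{out_label}" -> "{in_label}" [label="{edge_label}"];')
    lines.append("}")
    return "\n".join(lines)


dot_text = schema_to_dot(
    {"god": {"name", "age"}, "location": {"name"}},
    {"brother": {("god", "god")}, "lives": {("god", "location")}},
)
print(dot_text)
```

The resulting text can be passed to any DOT renderer to get a picture of the
vertex labels, their property keys and the allowed connections.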
Finally, an explicit schema would also allow OGM (object-graph mapper)
tools to fetch the schema from JanusGraph and translate it into entity classes,
so that the schema only has to be defined in one place (DRY principle).
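A minimal sketch of that idea in Python, assuming the schema is available as a
mapping from vertex label to allowed property keys: the entity classes are
generated with the standard library's dataclasses.make_dataclass. The function
entity_classes is hypothetical, and property data types are ignored here to keep
the example short.

```python
from dataclasses import field, make_dataclass
from typing import Optional


def entity_classes(vertex_properties):
    """Build one dataclass per vertex label from the allowed property keys."""
    classes = {}
    for label, keys in vertex_properties.items():
        # Every allowed property key becomes an optional field of the class.
        class_fields = [
            (key, Optional[object], field(default=None)) for key in sorted(keys)
        ]
        classes[label] = make_dataclass(label.capitalize(), class_fields)
    return classes


entities = entity_classes({"god": {"name", "age"}, "location": {"name"}})
God, Location = entities["god"], entities["location"]

jupiter = God(name="jupiter", age=5000)
sky = Location(name="sky")
```

Because Location is generated without an age field, Location(name="sky", age=1)
fails with a TypeError, so the client code and the schema cannot drift apart.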
So, in short, I propose that JanusGraph gets a strict schema, either as
the only option or as an additional option for backwards compatibility with
existing deployments and their data models.
Regards,
Florian
[1] We actually ran into this problem: our JanusGraph database contained data
that shouldn’t have been possible. Our schema models the network traffic of
malware samples, so we have edge labels like SampleToDomain or SampleToIp that
connect samples with the domains or IP addresses they contacted. At some point
we found edges in our graph that connected samples with domains but had the
edge label SampleToIp. That is a problem because our applications expect an IP
address when they follow a SampleToIp edge.
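Without enforcement in the database, such edges can only be found after the
fact. A small Python sketch of that kind of audit, assuming the edges have been
exported as (edge label, out vertex label, in vertex label) triples. The vertex
labels sample, domain and ip as well as the names ALLOWED and find_violations
are assumptions for illustration, not our actual code.

```python
# Hypothetical connection constraints for the data model described above.
ALLOWED = {
    "SampleToDomain": {("sample", "domain")},
    "SampleToIp": {("sample", "ip")},
}


def find_violations(edges, allowed=ALLOWED):
    """Yield every edge triple whose endpoints are not permitted for its label."""
    for edge_label, out_label, in_label in edges:
        if (out_label, in_label) not in allowed.get(edge_label, set()):
            yield edge_label, out_label, in_label


edges = [
    ("SampleToDomain", "sample", "domain"),
    ("SampleToIp", "sample", "ip"),
    ("SampleToIp", "sample", "domain"),  # the kind of edge we found
]
violations = list(find_violations(edges))
```

With a strict schema, the third edge would be rejected at insert time instead
of showing up in an audit later.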