JanusGraph evaluation/POC with large semiconductor measurement data: advice needed


Hi all,

I am working on a proof of concept to find out whether JanusGraph could be used for measurement data in the semiconductor industry, and I have reached the point where I need some advice. So far I have compared it, partly in theory and partly in practice, with other storage solutions (MongoDB, Cassandra, HBase, Elasticsearch, TimescaleDB, MariaDB, ...) in our context and for our use cases. We know that handling measurement data in graph databases is not that common, but we want to try it out. The goal is to handle about 30 TB of measurement data in the future (e.g. Process Control Monitor data).

We are considering a graph mainly because of two different use cases. The first is that we want to query data from up to 25 related measurements. Each measurement captures a different kind and number of parameters (e.g. 2500 double and boolean values). The second is that we want to query one parameter over a given time range across all existing data (wherever a measurement includes it). The problem is that each measurement (or group of 25 measurements) can include completely different parameters. Just imagine you perform a measurement such as a breakdown voltage, and the next time this information is not required because the process looks different, so the measurement is not performed (the parameter won't exist in that measurement). Anyway, the graph now lets us query quite cool stuff, e.g. we can traverse the graph and count all measurements, process modules, or test structures where most parameters violate limits, and so on. This is really impressive.
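To make the two use cases concrete, here is a minimal sketch in plain Python. It is not JanusGraph code and not our real schema: the record layout, the parameter names, and the helpers `by_group` and `parameter_over_time` are all made up for illustration. It only shows the two access patterns and that a parameter may be missing from a measurement.

```python
from statistics import mean

# Hypothetical in-memory stand-in for the graph: each measurement belongs to a
# group (e.g. a lot), has a timestamp, and stores only the parameters that were
# actually measured. All names and values here are invented for illustration.
measurements = [
    {"group": "lot-A", "time": 1, "params": {"breakdown_voltage": 41.5, "leakage": 0.02}},
    {"group": "lot-A", "time": 2, "params": {"leakage": 0.03, "passed": True}},
    {"group": "lot-B", "time": 3, "params": {"breakdown_voltage": 43.5}},
    {"group": "lot-B", "time": 4, "params": {"threshold_voltage": 0.7}},
]


def by_group(group):
    """Use case 1: fetch all related measurements of one group."""
    return [m for m in measurements if m["group"] == group]


def parameter_over_time(name, start, end):
    """Use case 2: one parameter over a time range, skipping measurements
    that do not contain it."""
    return [
        (m["time"], m["params"][name])
        for m in measurements
        if start <= m["time"] <= end and name in m["params"]
    ]


if __name__ == "__main__":
    related = by_group("lot-A")
    series = parameter_over_time("breakdown_voltage", 1, 4)
    print(len(related), "related measurements in lot-A")
    print("breakdown_voltage series:", series)
    print("mean:", mean(value for _, value in series))
```

In the graph, the first pattern corresponds to traversing from a lot to its measurements, and the second to selecting all value vertices with a given parameter name.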

It is fast at, for example, calculating a mean or standard deviation over all values of a given parameter name. But, as I somehow already expected, JanusGraph is not fast when fetching a lot of data, e.g.:

gremlin> lots=g.V().hasLabel('Lot').has('name',"abc").out('lotFile').out('thxxFilePValue').valueMap('name','parvalue').profile()
  optimization                                                                                 0.027
  backend-query                                                    15540                      10.713
NoOpBarrierStep(2500)                                              15540       15540          32.170     0.25
PropertyMapStep([name, parvalue],value)                            15540       15540       13011.496    99.38
                                            >TOTAL                     -           -       13092.833 

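I am using JanusGraph with its default settings and a Cassandra and Elasticsearch backend. In the profile above, nearly all of the roughly 13 seconds (99.38%) is spent in PropertyMapStep, i.e. reading the properties of the 15,540 vertices. It seems that the Cassandra backend is too slow at handling that many queries, right? How could this be improved? Should I install Hadoop/Spark and use SparkGraphComputer?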

Thank you,
