JanusGraph evaluation/POC with large semiconductor measurement data: advice needed


eric.neufeld@...
 

Hi all,

I am working on a proof of concept to find out whether JanusGraph could be used for measurement data in the semiconductor industry. Now I'm at the point where I need some advice. What I did was compare it, theoretically and in some cases practically, with other NoSQL solutions (MongoDB, Cassandra, HBase, Elasticsearch, TimescaleDB, MariaDB...) in our context and use cases. We know that handling measurement data in graph databases is not that common, but we just want to try it out. The goal is to eventually handle about 30 TB of measurement data (e.g. Process Control Monitor data).

One reason for going with graphs is two particular use cases. The first is that we want to query data from up to 25 related measurements. Each measurement captures a different kind and number of parameters (e.g. 2500 double and boolean values, whatever). The second use case is that we want to query one parameter over a given time range across all existing data (as long as some measurement includes it). The problem is that each measurement (or group of 25 measurements) could include totally different parameters. Just imagine you perform a measurement like a breakdown voltage, and the next time that information is not required because the process looks different, so the measurement is not performed (the parameter won't exist in that measurement). Anyway... the graph now allows us to query quite cool stuff, e.g. we can traverse the graph counting all measurements, process modules, or test structures where the most parameters violate limits, and so on. This is really impressive.
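To make the second use case concrete, the traversal currently looks roughly like this (the label, edge name, property keys and parameter name here are just placeholders from my simplified POC schema):

```groovy
// Sketch: collect one parameter over a time range across all measurements
// that happen to contain it. 'Measurement', 'timestamp', 'measuredValue'
// and 'breakdown_voltage' are illustrative names, not the real schema.
g.V().hasLabel('Measurement').
  has('timestamp', P.between(tStart, tEnd)).
  out('measuredValue').
  has('name', 'breakdown_voltage').
  valueMap('name', 'parvalue')
```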

It's fast, e.g. calculating the mean or standard deviation over all values for a given parameter name. But as I somehow already expected, JanusGraph does not perform fast when fetching a lot of data, e.g.:


gremlin> lots=g.V().hasLabel('Lot').has('name',"abc").out('lotFile').out('thxxFilePValue').valueMap('name','parvalue').profile()
 
...
  optimization                                                                                 0.027
  backend-query                                                    15540                      10.713
    \_query=thxxFilePValue:SliceQuery[0x74E0,0x74E1)
NoOpBarrierStep(2500)                                              15540       15540          32.170     0.25
PropertyMapStep([name, parvalue],value)                            15540       15540       13011.496    99.38
                                            >TOTAL                     -           -       13092.833 


I am using the JanusGraph defaults with a Cassandra and ES backend. It seems the Cassandra backend is too slow handling that many queries, right? How could this be improved? Should I install Hadoop/Spark and use SparkGraphComputer?

Thank you,
Eric


eric.neufeld@...
 

I forgot:

In that example, parvalue contains 5 double values (a list property) for each parameter. Might be a bit confusing. In any case, that PropertyMapStep is slow. When I put similar data into e.g. MongoDB, I can query it into a pandas DataFrame in less than 1 s, or even half a second. But in JanusGraph it can take up to 60 s.

I run this with JanusGraph 0.6.1 on an old simulation server (32 CPUs, 64 GB of memory, or something like that).

Greetings, Eric


Boxuan Li
 

Hi Eric,

Could you try the following query instead?

lots=g.V().hasLabel('Lot').has('name',"abc").out('lotFile').out('thxxFilePValue').values('name','parvalue')

Make sure you also enable the `query.batch` option in your config. FYI, valueMap() is executed serially, while values() can be executed concurrently (we have an open issue for that: https://github.com/JanusGraph/janusgraph/issues/2444).
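For reference, in your properties file that is just one line; the backend settings below simply stand in for whatever you already have configured:

```properties
# existing backend settings stay as they are
storage.backend=cql
index.search.backend=elasticsearch
# issue backend slice queries in parallel batches
query.batch=true
```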

Please let me know if you still have any performance problems after applying the above trick. It's likely that your use case could be tuned further.

Best,
Boxuan

On Apr 14, 2022, at 3:18 AM, eric.neufeld via lists.lfaidata.foundation <eric.neufeld=xfab.com@...> wrote:



eric.neufeld@...
 

Hi Boxuan,

using values('name','parvalue') instead of valueMap() is much faster. With valueMap() I got some runs that took 46 s. Now it's about 800 ms.
Thank you a lot.
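One thing worth noting for others: the two steps return differently shaped results, so the client-side handling changes slightly. valueMap() gives one map per element, while values() emits each property value as its own result. Roughly, using my query from above:

```groovy
// valueMap(): one map per parameter vertex,
//   e.g. [name:[P1], parvalue:[v1, v2, v3, v4, v5]]
g.V().hasLabel('Lot').has('name',"abc").
  out('lotFile').out('thxxFilePValue').
  valueMap('name','parvalue')

// values(): the raw property values as a flat stream,
//   e.g. P1, v1, v2, ... -- each value is a separate traverser
g.V().hasLabel('Lot').has('name',"abc").
  out('lotFile').out('thxxFilePValue').
  values('name','parvalue')
```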

Yes, I also think there are more things that could be tuned. For a POC (proof of concept) it's not that important to have a perfect schema and so on... as long as the query time is acceptable for the mentioned use case or query, it's absolutely fine (which is about 1 s, more or less). There is another use case that takes up to 1 min or even more in other DB solutions, and in the current JanusGraph POC it takes at most 3.5 s (some runs even 300 ms, depending on the queried data).

Greetings,
Eric