
Re: OLAP issues - gremlin query optimization

Abhay Pandit <abha...@...>
 

Hi,

My understanding is that you won't be able to go more than one level deep using SparkGraphComputer alone.
But yes, you can do it using a VertexProgram.
You can try to write your own custom vertex program, or you can use ShortestPathVertexProgram if it suits you.

https://github.com/apache/tinkerpop/blob/6a705098b5176086c667b94fa9e13aa2f122bb59/gremlin-core/src/main/java/org/apache/tinkerpop/gremlin/process/computer/search/path/ShortestPathVertexProgram.java
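
For reference, a minimal sketch of running that program through SparkGraphComputer (Gremlin Console / Groovy, assuming the standard TinkerPop 3.4 builder API; the label and edge names are placeholders taken from this thread, not a tested solution):

// Hedged sketch: invoke ShortestPathVertexProgram directly on a SparkGraphComputer.
// "companyName", "has_propertyA", "has_propertyB" and `configuration` come from the thread below.
import org.apache.tinkerpop.gremlin.process.computer.search.path.ShortestPathVertexProgram
import static org.apache.tinkerpop.gremlin.process.traversal.dsl.graph.__.*

graph = GraphFactory.open(configuration)
program = ShortestPathVertexProgram.build().
        source(has("companyName", "company")).                  // filter for the start vertices
        edgeTraversal(bothE("has_propertyA", "has_propertyB")). // which edges may be traversed
        maxDistance(2).                                          // bound the search depth
        create(graph)
result = graph.compute(SparkGraphComputer).program(program).submit().get()
// the resulting paths can then be read back from the ComputerResult (memory / halted traversers)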

Thanks,
Abhay



Re: OLAP issues - gremlin query optimization

bobo...@...
 

Hi Evgeniy,

Thanks a lot for your answer.

The depth-first hint is a good one; I had indeed completely forgotten about that.
I was hoping to get around writing a custom VertexProgram, but maybe that is something I need to start considering.

I will also try to isolate the query further and see if I can improve it. I will keep you (and others) posted here.

Regards
Claire



Re: OLAP issues - gremlin query optimization

Evgeniy Ignatiev <yevgeniy...@...>
 

Hello,

The Gremlin traversal machine is depth-first, while GraphComputer is Pregel-based and therefore essentially breadth-first; that alone can make a significant difference.
Also see https://tinkerpop.apache.org/docs/current/reference/#traversalvertexprogram - personally, I don't remember traversal-based path computation ever working well for our use case either (we had something around shortest-path counting/grouping). Performance for that case was atrocious compared with Neo4j (as long as the graph fit in machine RAM; TinkerPop never went OOM and was actually incredibly light on resources, but its performance was also significantly lower).

Possibly it makes sense to test the first half of the query in isolation first? Or the first 30%, then 60%, etc. I have suspicions about the efficiency of path traversal in GraphComputer, e.g.:

.repeat(bothE()
    .filter(hasLabel("has_propertyA", "has_propertyB"))
    .otherV()
    .simplePath())
.times(2)

Maybe consider custom VertexProgram with the same logic?
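
For comparison, a hedged sketch of the same idea expressed with the shortestPath() step (added in TinkerPop 3.4 and designed to run as a VertexProgram under withComputer(); the option keys and names are my assumptions and have not been tested on this graph):

// Hedged sketch: shortestPath() is executed as a VertexProgram when run withComputer().
// "company", "companyName", "has_propertyA", "has_propertyB" are the names used in this thread.
import org.apache.tinkerpop.gremlin.process.traversal.step.map.ShortestPath

g.withComputer(SparkGraphComputer).
  V().has("company", "companyName", "company").
  shortestPath().
    with(ShortestPath.edges, __.bothE("has_propertyA", "has_propertyB")).
    with(ShortestPath.maxDistance, 2).
  limit(10)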

Best regards,
Evgenii Ignatev.



OLAP issues - gremlin query optimization

bobo...@...
 

[attachment: links.png]

Hi

I am currently experimenting with the OLAP functionality of Gremlin (on JanusGraph). While I can get simple queries to run like a charm, I am running into issues with a slightly more complex case. I have the feeling the devil hides somewhere in the details, but I am currently unable to pinpoint it and am hoping to get some help here ...

Setup
- Managed Spark (Dataproc, Google Cloud Platform)
- 1 master, up to 5 worker nodes (autoscaling enabled; each node has roughly 40 GB of memory available for Spark jobs, but I am pretty flexible here and can easily configure more RAM, more CPUs, more cores)
- (embedded) JanusGraph 0.5.2 backed by ScyllaDB
- TinkerPop / Spark Gremlin 3.4.7
- Spark job is a Java application
- Current graph has roughly 22 million vertices and 24 million edges, ~65 GB of storage on ScyllaDB


Use Case

I want to calculate shortcut edges between two "company" vertices. Two companies are linked via 4 edges (see image below; the red link is the shortcut edge I want to calculate).




Issue

Some simple queries (like g.V().count() or g.E().count()) work like a charm, so my setup is working in general. However, my shortcut-edge calculation is not.
The (simplified) query is as follows (it is supposed to return the top 10 related companies):


g = GraphFactory.open(configuration).traversal().withComputer(SparkGraphComputer.class)

g.V().has("company", "companyName", "company")
    .as("start")
    .in("has_company")
    .has("deleted", false)
    .repeat(bothE()
        .filter(hasLabel("has_propertyA", "has_propertyB"))
        .otherV()
        .simplePath())
    .times(2)
    .has("product", "site", "de")
    .has("deleted", false)
    .dedup()
    .out("has_company")
    .where(P.neq("start"))
    .groupCount().by()
    .order(Scope.local).by(select(Column.values)).limit(Scope.local, 100)
    .unfold()
    .project("company", "tf", "tf-idf")
    .by(select(Column.keys).values("companyName"))
    .by(select(Column.values)) // tf
    .by(project("tf", "df") // idf
        .by(select(Column.values))
        .by(select(Column.keys).in("has_company").count())
        .math("tf * log10(2000000/df)")) // todo fix the hardcoded 2m
    .fold()
    .order(Scope.local).by(select("tf-idf"), Order.desc).limit(Scope.local, 10)



This runs into timeouts, memory issues, or other odd failures, even if I give an insanely high amount of memory to the executors (or the driver), or it simply stops after some time because some executors failed 10 times.

To pinpoint the issue, I executed the following, much more limited query, first on our "normal" OLTP JanusGraph server and then using the SparkGraphComputer.

While the OLTP version returns a result within seconds, the SparkGraphComputer version does not. I don't expect a response within seconds, because bringing up the whole Spark context takes its time, but I wouldn't expect hours either.
What is wrong? I assume I am using some construct somewhere that should not be used in that mode, but I don't see which one.

g.V().has("company", "companyName", "company")
               
.as("start")
               
.in("has_company")
               
.has("deleted", false)
               
.repeat(
                    bothE
()
                       
.filter(hasLabel("has_propertyA", "has_propertyB"))
                       
.otherV()
                       
.simplePath())
               
.times(2)
               
.has("jobOffer", "site", "de")
               
.has("deleted", false)
               
.limit(1000)
               
.out("has_company")
               
.where(P.neq("start"))
               
.groupCount().by()
               
.order(Scope.local).by(select(Column.values)).limit(Scope.local, 100)
               
.unfold()
               
.project("company", "tf", "tf-idf")
               
.by(select(Column.keys).values("companyName"))
               
.by(select(Column.values)) // tf
               
.by(project("tf", "df") // idf
                   
.by(select(Column.values))
                   
.by(select(Column.keys).in("has_company").count())
                   
.math("tf * log10(2000000/df)")) // todo fix the 2m
               
.fold()
               
.order(Scope.local).by(select("tf-idf"), Order.desc).limit(Scope.local, 10)
                   
.toList();




Janusgraph.conf

For the sake of completeness, here is my complete configuration:


# Hadoop Graph Configuration
gremlin.graph=org.apache.tinkerpop.gremlin.hadoop.structure.HadoopGraph
gremlin.hadoop.graphReader=org.janusgraph.hadoop.formats.cql.CqlInputFormat
gremlin.hadoop.graphWriter=org.apache.hadoop.mapreduce.lib.output.NullOutputFormat
gremlin.hadoop.jarsInDistributedCache=true
gremlin.hadoop.inputLocation=none
gremlin.hadoop.outputLocation=output
gremlin.spark.persistContext=true
gremlin.spark.graphStorageLevel=MEMORY_AND_DISK
# Scylla
janusgraphmr.ioformat.conf.storage.backend=cql
janusgraphmr.ioformat.conf.storage.hostname=<ip>
janusgraphmr.ioformat.conf.storage.port=<port>
janusgraphmr.ioformat.conf.index.search.backend=lucene
janusgraphmr.ioformat.conf.index.search.directory=/tmp/
janusgraphmr.ioformat.conf.index.search.hostname=127.0.0.1
cassandra.input.partitioner.class=org.apache.cassandra.dht.Murmur3Partitioner
cassandra.input.widerows=true
# Spark
spark.master=yarn
spark.submit.deployMode=client
spark.serializer=org.apache.spark.serializer.KryoSerializer
spark.kryo.registrator=org.janusgraph.hadoop.serialize.JanusGraphKryoRegistrator
### I tried different memory settings here, up to 50g per executor
spark.executor.memory=16g
spark.executor.memoryOverhead=4g
spark.driver.memory=16g
spark.driver.memoryOverhead=4g
spark.eventLog.enabled=true
spark.network.timeout=300s




Any hints, thoughts, comments, or feedback are greatly appreciated.

Cheers,
Claire


Re: Configuring Transaction Log feature

"anj...@gmail.com" <anjani...@...>
 

Thanks Sandeep, yes it works as you mentioned. 
We are using Cassandra as the backend and the log tables are created in it. The data in them is stored as a blob type.

I was trying to read the blob data from Cassandra but got the error below:
"InvalidRequest: Error from server: code=2200 [Invalid query] message="In call to function system.blobastext, value 0xc00000000000000000f38503 is not a valid binary representation for type text"

My query: select blobastext(column1) from ulog_test;

How can we read data stored in ulog tables in Cassandra?

Thanks,
Anjani



On Saturday, 11 July 2020 at 13:49:02 UTC+5:30 sa...@... wrote:
Hi,

If you are using the same identifier to start the logProcessor, there is no need to explicitly set a previous start time.
The logProcessor keeps a marker of the last record read, so it should be able to recover from that point.
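
For example, a minimal sketch (based on the openTransactionLog() API shown further down this thread; the identifier and the processor body are only illustrative) of restarting a processor so that it resumes from its stored marker:

// Hedged sketch: reusing the same processor identifier, without setStartTime(),
// should let the processor resume from its last read marker after a restart.
logProcessor = JanusGraphFactory.openTransactionLog(graph)
logProcessor.addLogProcessor("addedPerson").
        setProcessorIdentifier("addedPersonCounter").   // same identifier as before the restart
        addProcessor(new ChangeProcessor() {
            public void process(JanusGraphTransaction tx, TransactionId txId, ChangeState changeState) {
                // handle the changes recorded while the processor was down
            }
        }).
        build()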

Do check again.

Regards,
Sandeep

On Thursday, July 9, 2020 at 9:25:17 PM UTC+8, anj...@... wrote:
Hi All,

We are using JanusGraph with Cassandra. I am able to capture events using the logProcessor and can see the table created in Cassandra.

I was trying to figure out: if for some reason the logProcessor stops, how do we get the changes that were made after it stopped?
I tried to start the logProcessor by passing a previous time, thinking it would return all events after that point, but it did not give the previous changes.


Thanks,
Anjani



On Sunday, 25 February 2018 at 16:01:06 UTC+5:30 sa...@... wrote:
Yeah Jason. I never bothered to look in the JanusGraph table, expecting a new table to be created.
I can find the new column family in my setup too.

Thanks and Regards,
Sandeep


On Wednesday, February 21, 2018 at 12:09:14 AM UTC+8, Jason Plurad wrote:
I suppose it could be just confusion on the terminology:

Cassandra -> Keyspace -> Table
HBase -> Table -> Column Family

On Tuesday, February 20, 2018 at 11:05:10 AM UTC-5, Jason Plurad wrote:
Not sure what else to tell you. I just tried the same script from before against HBase 1.3.1, and it created the column family 'ulog_addedPerson' right after the logProcessor.addLogProcessor("addedPerson")...build() command was issued.

hbase(main):001:0> describe 'janusgraph'
Table janusgraph is ENABLED
janusgraph
COLUMN FAMILIES DESCRIPTION
{NAME => 'e', BLOOMFILTER => 'ROW', VERSIONS => '1', IN_MEMORY => 'false', KEEP_DELETED_CELLS => 'FALSE', DATA_BLOCK_ENCODING => 'NONE', TTL => 'FOREVER', COMPRESSION => 'GZ', MIN_VERSIONS => '0', BLOCKCACHE => 'true', BLOCKSIZE => '65536', REPLICATION_SCOPE => '0'}
{NAME => 'f', BLOOMFILTER => 'ROW', VERSIONS => '1', IN_MEMORY => 'false', KEEP_DELETED_CELLS => 'FALSE', DATA_BLOCK_ENCODING => 'NONE', TTL => 'FOREVER', COMPRESSION => 'GZ', MIN_VERSIONS => '0', BLOCKCACHE => 'true', BLOCKSIZE => '65536', REPLICATION_SCOPE => '0'}
{NAME => 'g', BLOOMFILTER => 'ROW', VERSIONS => '1', IN_MEMORY => 'false', KEEP_DELETED_CELLS => 'FALSE', DATA_BLOCK_ENCODING => 'NONE', TTL => 'FOREVER', COMPRESSION => 'GZ', MIN_VERSIONS => '0', BLOCKCACHE => 'true', BLOCKSIZE => '65536', REPLICATION_SCOPE => '0'}
{NAME => 'h', BLOOMFILTER => 'ROW', VERSIONS => '1', IN_MEMORY => 'false', KEEP_DELETED_CELLS => 'FALSE', DATA_BLOCK_ENCODING => 'NONE', TTL => 'FOREVER', COMPRESSION => 'GZ', MIN_VERSIONS => '0', BLOCKCACHE => 'true', BLOCKSIZE => '65536', REPLICATION_SCOPE => '0'}
{NAME => 'i', BLOOMFILTER => 'ROW', VERSIONS => '1', IN_MEMORY => 'false', KEEP_DELETED_CELLS => 'FALSE', DATA_BLOCK_ENCODING => 'NONE', TTL => 'FOREVER', COMPRESSION => 'GZ', MIN_VERSIONS => '0', BLOCKCACHE => 'true', BLOCKSIZE => '65536', REPLICATION_SCOPE => '0'}
{NAME => 'l', BLOOMFILTER => 'ROW', VERSIONS => '1', IN_MEMORY => 'false', KEEP_DELETED_CELLS => 'FALSE', DATA_BLOCK_ENCODING => 'NONE', TTL => '604800 SECONDS (7 DAYS)', COMPRESSION => 'GZ', MIN_VERSIONS => '0', BLOCKCACHE => 'true', BLOCKSIZE => '65536', REPLICATION_SCOPE => '0'}
{NAME => 'm', BLOOMFILTER => 'ROW', VERSIONS => '1', IN_MEMORY => 'false', KEEP_DELETED_CELLS => 'FALSE', DATA_BLOCK_ENCODING => 'NONE', TTL => 'FOREVER', COMPRESSION => 'GZ', MIN_VERSIONS => '0', BLOCKCACHE => 'true', BLOCKSIZE => '65536', REPLICATION_SCOPE => '0'}
{NAME => 's', BLOOMFILTER => 'ROW', VERSIONS => '1', IN_MEMORY => 'false', KEEP_DELETED_CELLS => 'FALSE', DATA_BLOCK_ENCODING => 'NONE', TTL => 'FOREVER', COMPRESSION => 'GZ', MIN_VERSIONS => '0', BLOCKCACHE => 'true', BLOCKSIZE => '65536', REPLICATION_SCOPE => '0'}
{NAME => 't', BLOOMFILTER => 'ROW', VERSIONS => '1', IN_MEMORY => 'false', KEEP_DELETED_CELLS => 'FALSE', DATA_BLOCK_ENCODING => 'NONE', TTL => 'FOREVER', COMPRESSION => 'GZ', MIN_VERSIONS => '0', BLOCKCACHE => 'true', BLOCKSIZE => '65536', REPLICATION_SCOPE => '0'}
{NAME => 'ulog_addedPerson', BLOOMFILTER => 'ROW', VERSIONS => '1', IN_MEMORY => 'false', KEEP_DELETED_CELLS => 'FALSE', DATA_BLOCK_ENCODING => 'NONE', TTL => 'FOREVER', COMPRESSION => 'GZ', MIN_VERSIONS => '0', BLOCKCACHE => 'true', BLOCKSIZE => '65536', REPLICATION_SCOPE => '0'}
10 row(s) in 0.0230 seconds




On Tuesday, February 20, 2018 at 9:46:26 AM UTC-5, Sandeep Mishra wrote:
Hi Jason,

I tried it with the HBase backend, and control is now getting passed to the change processor.
Appreciate your help; on careful checking I noticed that the mutation was happening under the default transaction initiated by JanusGraph, hence the issue.

However, the problem right now is that I am unable to locate a table for the data.
I took a snapshot of the tables in HBase using the HBase shell before and after processing, but no new table was created.
Any idea what could be wrong? Is there a possibility that it is saving the log data in the JanusGraph table meant for the actual data?

Thanks and Regards,
Sandeep



On Sunday, February 18, 2018 at 11:56:49 PM UTC+8, Sandeep Mishra wrote:
Both the Groovy and the Java code work with berkeleyje as the backend. Tomorrow in the office I will try with HBase as the backend.
Noted on your point.

Thanks and Regards,
Sandeep

On Sunday, February 18, 2018 at 11:14:48 PM UTC+8, Jason Plurad wrote:
You can use the same exact code in a simple Java program and prove that it works.
I'd think the main thing to watch out for is that your mutations are on a transaction that has the log identifier on it.
Is the Gremlin Server involved in your scenario?

tx = graph.buildTransaction().logIdentifier("addedPerson").start();


On Sunday, February 18, 2018 at 1:00:08 AM UTC-5, Sandeep Mishra wrote:
Hi Jason,

Thanks for a prompt reply.
The sample code attached below works well when executed from the Gremlin Console.
However, executing the Java version doesn't trigger the callback. Probably something is wrong with my code.
Unfortunately, I can't copy code from my office machine.
I will check it again and keep you posted.

Regards,
Sandeep 

On Wednesday, February 7, 2018 at 10:58:41 PM UTC+8, Jason Plurad wrote:
It means that it will use the 'storage.backend' value as the storage. See the code in GraphDatabaseConfiguration.java. It looks like your only choice is 'default', and it seems like the option is there for the future possibility to use a different backend.

The code in the docs seemed to work ok, other than a minor change in the setStartTime() parameters. You can cut and paste this code into the Gremlin Console to use with the prepackaged distribution.

import java.util.concurrent.atomic.*;
import org.janusgraph.core.log.*;
import java.util.concurrent.*;

graph = JanusGraphFactory.open('conf/janusgraph-cassandra-es.properties');

totalHumansAdded = new AtomicInteger(0);
totalGodsAdded = new AtomicInteger(0);
logProcessor = JanusGraphFactory.openTransactionLog(graph);
logProcessor.addLogProcessor("addedPerson").
        setProcessorIdentifier("addedPersonCounter").
        setStartTime(Instant.now()).
        addProcessor(new ChangeProcessor() {
            public void process(JanusGraphTransaction tx, TransactionId txId, ChangeState changeState) {
                for (v in changeState.getVertices(Change.ADDED)) {
                    if (v.label().equals("human")) totalHumansAdded.incrementAndGet();
                    System.out.println("total humans = " + totalHumansAdded);
                }
            }
        }).
        addProcessor(new ChangeProcessor() {
            public void process(JanusGraphTransaction tx, TransactionId txId, ChangeState changeState) {
                for (v in changeState.getVertices(Change.ADDED)) {
                    if (v.label().equals("god")) totalGodsAdded.incrementAndGet();
                    System.out.println("total gods = " + totalGodsAdded);
                }
            }
        }).
        build()

tx = graph.buildTransaction().logIdentifier("addedPerson").start();
u = tx.addVertex(T.label, "human");
u.property("name", "proteros");
u.property("age", 36);
tx.commit();

If you inspect the keyspace in Cassandra afterwards, you'll see that a separate table is created for "ulog_addedPerson".

Did you have some example code of what you are attempting?


On Wednesday, February 7, 2018 at 5:55:58 AM UTC-5, Sandeep Mishra wrote:
Hi Guys,

We are trying to use the transaction log feature of JanusGraph, which is not working as expected. No callback is received at
public void process(JanusGraphTransaction janusGraphTransaction, TransactionId transactionId, ChangeState changeState) {

The JanusGraph documentation says the value for log.[X].backend is 'default'.
I am not sure what exactly that means. Does it mean HBase, which is being used as the backend for the data?

Please let me know if anyone has configured it.

Thanks and Regards,
Sandeep Mishra


Re: Janusgraph Authentication cannot create users

sparshneel chanchlani <sparshneel...@...>
 

Thanks, it worked for me.

-Sparshneel



Re: The use of vertex-centric index

Leah <18834...@...>
 

Hi Marc,
After testing, the has() step condition is necessary, and this solution is very effective. Thank you for all your assistance.

Warm regards,
Leah


Re: The use of vertex-centric index

HadoopMarc <bi...@...>
 

g.V().has('address', 'from_addr_00').outE('TRANSFER_TO').has('amount', gte(6000)).limit(500).valueMap().toList()

I am not sure the has() step is even necessary, or maybe just has('amount') is sufficient to trigger the - already sorted - index.
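
A quick way to check whether the index is actually used is to look at the per-step metrics (a hedged sketch; profile() is standard Gremlin and only reports what each step does):

// Hedged sketch: profile() shows the traverser counts and timings per step, which should
// reveal whether the outE()/has() part is served cheaply by the vertex-centric index.
g.V().has('address', 'from_addr_00').
  outE('TRANSFER_TO').has('amount', gte(6000)).
  limit(500).profile()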

Best wishes,

Marc



Re: The use of vertex-centric index

Leah <18834...@...>
 


How can I use the index to get the top 500 edges by descending amount faster?

On Sunday, 26 July 2020 at 15:19:56 UTC+8, HadoopMarc wrote:

You already specified the vertex-centric index on the amount key to be ordered while creating the index. By explicitly reordering the results in the traversal, the index cannot take effect, because the reordering requires all vertices to be retrieved instead of just the first 500.

HTH,    Marc

On Saturday, 25 July 2020 at 20:52:37 UTC+2, 18...@... wrote:

Hi JanusGraph team,

I have created vertex-centric indexes, as shown below. Now I want to use the index to get the top 500 edges in descending order of amount. However, I find that the execution time is the same as without the vertex-centric index. How can I use the index to sort faster and extract the first 500 edges more quickly?

Here's the graph I've built:

graph = JanusGraphFactory.open('janusgraph-cql-es-server-test2.properties')
mgmt = graph.openManagement()
mgmt.makeVertexLabel('VirtualAddress').make()
addr = mgmt.makePropertyKey('address').dataType(String.class).cardinality(SINGLE).make()
token_addr = mgmt.makePropertyKey('token_addr').dataType(String.class).cardinality(SINGLE).make()
transfer_to = mgmt.makeEdgeLabel('TRANSFER_TO').multiplicity(MULTI).make()
amount = mgmt.makePropertyKey('amount').dataType(Double.class).cardinality(SINGLE).make()
tx_hash = mgmt.makePropertyKey('tx_hash').dataType(String.class).cardinality(SINGLE).make()
tx_index = mgmt.makePropertyKey('tx_index').dataType(Integer.class).cardinality(SINGLE).make()
created_time = mgmt.makePropertyKey('created_time').dataType(Date.class).cardinality(SINGLE).make()
updated_time = mgmt.makePropertyKey('updated_time').dataType(Date.class).cardinality(SINGLE).make()
mgmt.buildIndex('addressComposite', Vertex.class).addKey(addr).buildCompositeIndex()
mgmt.buildIndex('addressTokenUniqComposite', Vertex.class).addKey(addr).addKey(token_addr).unique().buildCompositeIndex()
mgmt.buildEdgeIndex(transfer_to, "transferOutAmountTs", Direction.OUT, Order.desc, amount, created_time)
mgmt.buildEdgeIndex(transfer_to, "transferOutTs", Direction.OUT, Order.desc, created_time)
mgmt.commit()

Here's the data I inserted, building a starting point, a million edges associated with it, and 100 endpoints,

graph_conf = 'janusgraph-cql-es-server-test2.properties'
graph = JanusGraphFactory.open(graph_conf)
g = graph.traversal()
String line = "5244613,tx_hash_00,token_addr_00,from_addr_00,to_addr_00,,,6000,19305.57174337591,72,1520896044"
int start_value = 1
int end_value = 1000000

line = "tx_hash_00,token_addr_00,from_addr_00,to_addr_00,6000,72,1520896044"
columns = line.split(',', -1)
(tx_hash, token_addr, from_addr, to_addr, amount, log_index, timestamp) = columns
from_addr_node = g.addV('VirtualAddress').property('address', from_addr).property('token_addr', token_addr).next()
from_id = from_addr_node.id()
amount = amount.toBigDecimal()
tx_index = log_index.toInteger()
for (int i = start_value; i <= end_value; i++) {
    to_addr_node = g.addV('VirtualAddress').property('address', to_addr + String.valueOf(i)).property('token_addr', token_addr).next()
    to_id = to_addr_node.id()
    Date ts = new Date((timestamp.toLong() - i) * 1000)
    g.addE('TRANSFER_TO').from(g.V(from_id)).to(g.V(to_id))
            .property('amount', amount + i)
            .property('tx_hash', tx_hash)
            .property('tx_index', tx_index + i)
            .property('created_time', ts)
            .next()


    if (i % 20000 == 0) {
        println("[total:${i}]")
        System.sleep(500)
        g.tx().commit()
        graph.close()
        System.sleep(5000)


        graph = JanusGraphFactory.open(graph_conf)
        g = graph.traversal()
        System.sleep(5000)
    }
    g.tx().commit()
}

graph.close()
 

Here are my query criteria:

g.V().has('address', 'from_addr_00').outE('TRANSFER_TO').order().by('amount', desc).limit(500).valueMap().toList()


Re: Janusgraph Authentication cannot create users

HadoopMarc <bi...@...>
 

See the section "Credentials graph DSL" in:
So, you instantiate the CredentialsDB GraphTraversalSource using:

credentials = graph.traversal(CredentialTraversalSource.class)

where graph is the JanusGraph instance holding your CredentialsDb (the TinkerPop ref docs refer to TinkerGraph which is not applicable here).
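
From there, creating and managing users goes through the DSL steps; a minimal sketch, assuming the method names from the TinkerPop 3.4 "Credentials graph DSL" section (untested here, and the password is a placeholder):

credentials = graph.traversal(CredentialTraversalSource.class)
credentials.user('sparshneel', 'some-password')   // create (or update) a user
credentials.users('sparshneel').valueMap()        // inspect the stored user
credentials.users().count()                       // number of users in the credentials graph
credentials.users('sparshneel').drop()            // remove the user again

Outside the Gremlin Console these traversals need an explicit terminal step such as iterate() or next().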

HTH,    Marc


On Monday, July 20, 2020 at 12:04:20 UTC+2, spars...@... wrote:

Also, are there any other ways of creating users?


On Monday, July 20, 2020 at 3:29:54 PM UTC+5:30, sparshneel chanchlani wrote:
Hi, 
I am trying to add authentication to JanusGraph. I am referring to the link below.
Below is my credentials DB config:

gremlin.graph=org.janusgraph.core.JanusGraphFactory
storage.backend=cql
storage.hostname= 10.xx.xx.xx
storage.port= 9042
storage.username= cassandrauser
storage.password= cassandrapwd
storage.cql.keyspace=creds_db
storage.read-time=50000
storage.cql.write-consistency-level=LOCAL_QUORUM
storage.cql.read-consistency-level=LOCAL_QUORUM
cluster.max-partitions=32
storage.lock.wait-time=5000
storage.lock.retries=5
ids.block-size=100000

When I start the gremlin-server, the creds_db and the graph DB are created successfully. The issue is that I am not able to create the credentials using the Credentials(graph) groovy script; I am trying it through the gremlin console, see below.

 g1 = JanusGraphFactory.open('conf/gremlin-server/janusgraph-server-credentials.properties')
==>standardjanusgraph[cql:[]]
gremlin> creds = Credentials(g1)
No signature of method: groovysh_evaluate.Credentials() is applicable for argument types: (org.janusgraph.graphdb.database.StandardJanusGraph) values: [standardjanusgraph[cql:[]

Actually, the groovysh_evaluate script does not support a standard graph as parameter. What should my credentials.properties for cassandra be?



Thanks,
Sparshneel


Re: when i use Janusgraph in Apache Atlas, i found an error

HadoopMarc <bi...@...>
 

Hi Pavel,

I do not recognize the way you want to register classes for serialization by JanusGraph towards gremlin driver, but this may be due to my limited knowledge on this issue. JanusGraph itself registers the additional classes it has defined in the following way:

So, this would involve defining your own IoRegistry class and configuring it for gremlin server (and optionally for the remote-objects.yaml for gremlin driver).
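
A minimal sketch of such a registry (the class name is hypothetical; only the Gryo registration is shown, and GraphSON output would additionally need a Jackson module registered against GraphSONIo.class):

import org.apache.atlas.typesystem.types.DataTypes
import org.apache.tinkerpop.gremlin.structure.io.AbstractIoRegistry
import org.apache.tinkerpop.gremlin.structure.io.gryo.GryoIo

class AtlasIoRegistry extends AbstractIoRegistry {
    private static final AtlasIoRegistry INSTANCE = new AtlasIoRegistry()

    private AtlasIoRegistry() {
        // Registering the class with a null serializer lets Gryo fall back to its default serializers.
        register(GryoIo.class, DataTypes.TypeCategory.class, null)
    }

    static AtlasIoRegistry instance() { return INSTANCE }
}

The compiled registry then has to be on the server classpath and listed under ioRegistries for the serializers in gremlin-server.yaml (and in remote-objects.yaml on the driver side).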

HTH,   Marc

On Wednesday, July 22, 2020 at 17:21:15 UTC+2, pav...@... wrote:

Hello,

I've got the same issue with the latest version of JanusGraph and Atlas from the master branch. Did you manage to get an appropriate type/serializer registration working to produce GraphSON output? I'd like to visualise the graph via Cytoscape or Graphexp. Thanks for any advice!

I've tried already - gremlin config (using Scylla and ES):
attributes.custom.attribute1.attribute-class=org.apache.atlas.typesystem.types.DataTypes.TypeCategory
attributes.custom.attribute1.serializer-class=org.apache.atlas.repository.graphdb.janus.serializer.TypeCategorySerializer
attributes.custom.attribute2.attribute-class=java.util.ArrayList
attributes.custom.attribute2.serializer-class=org.janusgraph.graphdb.database.serialize.attribute.SerializableSerializer
attributes.custom.attribute3.attribute-class=java.math.BigInteger
attributes.custom.attribute3.serializer-class=org.apache.atlas.repository.graphdb.janus.serializer.BigIntegerSerializer
attributes.custom.attribute4.attribute-class=java.math.BigDecimal
attributes.custom.attribute4.serializer-class=org.apache.atlas.repository.graphdb.janus.serializer.BigDecimalSerializer

then from gremlin cli:
graph.io(IoCore.graphson()).writeGraph("/atlas.json")

resulting into:
org.apache.tinkerpop.shaded.jackson.databind.JsonMappingException: Could not find a type identifier for the class : class org.apache.atlas.typesystem.types.DataTypes$TypeCategory. Make sure the value to serialize has a type identifier registered for its class.
On Wednesday, January 15, 2020 at 14:31:09 UTC+1, mar...@... wrote:
Hi,

See a similar question on:


HTH,   Marc

On Wednesday, January 15, 2020 at 14:11:14 UTC+1, qi...@... wrote:
Hello, I am new to JanusGraph. When I use JanusGraph in Apache Atlas, I get the following error:
Could not find a type identifier for the class : class org.apache.atlas.typesystem.types.DataTypes$TypeCategory. Make sure the value to serialize has a type identifier registered for its class.
org.apache.tinkerpop.shaded.jackson.databind.JsonMappingException: Could not find a type identifier for the class : class org.apache.atlas.typesystem.types.DataTypes$TypeCategory. Make sure the value to serialize has a type identifier registered for its class.
    at org.apache.tinkerpop.shaded.jackson.databind.ser.DefaultSerializerProvider._wrapAsIOE(DefaultSerializerProvider.java:509)
    at org.apache.tinkerpop.shaded.jackson.databind.ser.DefaultSerializerProvider._serialize(DefaultSerializerProvider.java:482)
    at org.apache.tinkerpop.shaded.jackson.databind.ser.DefaultSerializerProvider.serializeValue(DefaultSerializerProvider.java:319)
    at org.apache.tinkerpop.shaded.jackson.databind.ObjectMapper._configAndWriteValue(ObjectMapper.java:3893)
    at org.apache.tinkerpop.shaded.jackson.databind.ObjectMapper.writeValue(ObjectMapper.java:3164)
    at org.apache.tinkerpop.gremlin.structure.io.graphson.GraphSONWriter.writeVertex(GraphSONWriter.java:82)
    at org.apache.tinkerpop.gremlin.structure.io.graphson.GraphSONWriter.writeVertices(GraphSONWriter.java:110)
    at org.apache.tinkerpop.gremlin.structure.io.graphson.GraphSONWriter.writeGraph(GraphSONWriter.java:71)
    at org.apache.tinkerpop.gremlin.structure.io.graphson.GraphSONIo.writeGraph(GraphSONIo.java:83)
    at org.apache.tinkerpop.gremlin.structure.io.Io$writeGraph.call(Unknown Source)
    at org.codehaus.groovy.runtime.callsite.CallSiteArray.defaultCall(CallSiteArray.java:47)
    at org.codehaus.groovy.runtime.callsite.AbstractCallSite.call(AbstractCallSite.java:116)
    at org.codehaus.groovy.runtime.callsite.AbstractCallSite.call(AbstractCallSite.java:128)
    at groovysh_evaluate.run(groovysh_evaluate:3)
    at org.codehaus.groovy.vmplugin.v7.IndyInterface.selectMethod(IndyInterface.java:236)
    at org.codehaus.groovy.tools.shell.Interpreter.evaluate(Interpreter.groovy:71)
    at org.codehaus.groovy.tools.shell.Groovysh.execute(Groovysh.groovy:196)
    at org.apache.tinkerpop.gremlin.console.GremlinGroovysh.super$3$execute(GremlinGroovysh.groovy)
    at sun.reflect.GeneratedMethodAccessor14.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.codehaus.groovy.reflection.CachedMethod.invoke(CachedMethod.java:98)
    at groovy.lang.MetaMethod.doMethodInvoke(MetaMethod.java:325)
    at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1225)
    at org.codehaus.groovy.runtime.ScriptBytecodeAdapter.invokeMethodOnSuperN(ScriptBytecodeAdapter.java:145)
    at org.apache.tinkerpop.gremlin.console.GremlinGroovysh.execute(GremlinGroovysh.groovy:72)
    at org.codehaus.groovy.tools.shell.Shell.leftShift(Shell.groovy:122)
    at org.codehaus.groovy.tools.shell.ShellRunner.work(ShellRunner.groovy:95)
    at org.codehaus.groovy.tools.shell.InteractiveShellRunner.super$2$work(InteractiveShellRunner.groovy)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.codehaus.groovy.reflection.CachedMethod.invoke(CachedMethod.java:98)
    at groovy.lang.MetaMethod.doMethodInvoke(MetaMethod.java:325)
    at groovy.lang.MetaClassImpl.invokeMethod(MetaClassImpl.java:1225)
    at org.codehaus.groovy.runtime.ScriptBytecodeAdapter.invokeMethodOnSuperN(ScriptBytecodeAdapter.java:145)
    at org.codehaus.groovy.runtime.ScriptBytecodeAdapter.invokeMethodOnSuper0(ScriptBytecodeAdapter.java:165)
    at org.codehaus.groovy.tools.shell.InteractiveShellRunner.work(InteractiveShellRunner.groovy:130)
    at org.codehaus.groovy.tools.shell.ShellRunner.run(ShellRunner.groovy:59)
    at org.codehaus.groovy.tools.shell.InteractiveShellRunner.super$2$run(InteractiveShellRunner.groovy)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.

How can I solve it? Thank you very much.


Re: Java APIs for Create/Open/Drop graph using ConfiguredGraphFactory - janusgraph

HadoopMarc <bi...@...>
 

Hi Anumodh,

Interesting question. Using ConfiguredGraphFactory without gremlin server is relevant when you build your own REST endpoints for a graph application.

While the ref docs may not address this use case, the javadocs for ConfiguredGraphFactory seem pretty self-explanatory. Did you check out the following example graph properties file:
including the line:
gremlin.graph=org.janusgraph.core.ConfiguredGraphFactory
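
Once a ConfigurationManagementGraph is available in the JVM (gremlin server bootstraps it from such a properties file; a standalone Java application has to do the equivalent setup itself), the static methods of ConfiguredGraphFactory can be called directly. A rough, unverified sketch with made-up connection values:

import org.apache.commons.configuration.MapConfiguration
import org.janusgraph.core.ConfiguredGraphFactory

conf = new MapConfiguration([
    'storage.backend' : 'cql',
    'storage.hostname': '127.0.0.1',
    'graph.graphname' : 'graph1'
])
ConfiguredGraphFactory.createConfiguration(conf)   // store a configuration keyed by graph.graphname
g1 = ConfiguredGraphFactory.open('graph1')         // open the graph defined by that configuration
ConfiguredGraphFactory.drop('graph1')              // drop the graph database again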

What was the point where things became unclear?

Best wishes,     Marc

On Saturday, July 25, 2020 at 01:08:07 UTC+2, a...@... wrote:

Hi JanusGraph team,

We are exploring the possibility of using JanusGraph in our company. We are planning to use dynamic graph creation/deletion with ConfiguredGraphFactory. We have deployed 2 JanusGraph instances. Our plan is to write a Java library for creation, deletion and other management activities like index creation. I haven't yet found a way to create/delete a graph through the Java APIs of ConfiguredGraphFactory.

From my investigation, the only way to do so is by connecting to the gremlin server and sending string commands like
 - ConfiguredGraphFactory.create(graphName)
 - ConfiguredGraphFactory.drop(graphName)
Is this the only way to create/delete a graph? Please advise.

Thanks,
anumodh 







The use of vertex-centric index

张雅南 <18834...@...>
 

Hi JanusGraph team,

I have created vertex-centric indexes as shown below. Now I want to use the index to get the top 500 edges in descending order of amount. However, I find that the execution time is the same as without the vertex-centric index. How can I use the index to sort faster and retrieve the information of the first 500 edges more quickly?

Here's the graph I've built:

graph = JanusGraphFactory.open('janusgraph-cql-es-server-test2.properties')
mgmt = graph.openManagement()
mgmt.makeVertexLabel('VirtualAddress').make()
addr = mgmt.makePropertyKey('address').dataType(String.class).cardinality(SINGLE).make()
token_addr = mgmt.makePropertyKey('token_addr').dataType(String.class).cardinality(SINGLE).make()
transfer_to = mgmt.makeEdgeLabel('TRANSFER_TO').multiplicity(MULTI).make()
amount = mgmt.makePropertyKey('amount').dataType(Double.class).cardinality(SINGLE).make()
tx_hash = mgmt.makePropertyKey('tx_hash').dataType(String.class).cardinality(SINGLE).make()
tx_index = mgmt.makePropertyKey('tx_index').dataType(Integer.class).cardinality(SINGLE).make()
created_time = mgmt.makePropertyKey('created_time').dataType(Date.class).cardinality(SINGLE).make()
updated_time = mgmt.makePropertyKey('updated_time').dataType(Date.class).cardinality(SINGLE).make()
mgmt.buildIndex('addressComposite', Vertex.class).addKey(addr).buildCompositeIndex()
mgmt.buildIndex('addressTokenUniqComposite', Vertex.class).addKey(addr).addKey(token_addr).unique().buildCompositeIndex()
mgmt.buildEdgeIndex(transfer_to, "transferOutAmountTs", Direction.OUT, Order.desc, amount, created_time)
mgmt.buildEdgeIndex(transfer_to, "transferOutTs", Direction.OUT, Order.desc, created_time)
mgmt.commit()

Here's the data I inserted: one starting vertex, a million edges attached to it, and the corresponding endpoint vertices.

graph_conf = 'janusgraph-cql-es-server-test2.properties'
graph = JanusGraphFactory.open(graph_conf)
g = graph.traversal()
String line = "5244613,tx_hash_00,token_addr_00,from_addr_00,to_addr_00,,,6000,19305.57174337591,72,1520896044"
int start_value = 1
int end_value = 1000000

line = "tx_hash_00,token_addr_00,from_addr_00,to_addr_00,6000,72,1520896044"
columns = line.split(',', -1)
(tx_hash, token_addr, from_addr, to_addr, amount, log_index, timestamp) = columns
from_addr_node = g.addV('VirtualAddress').property('address', from_addr).property('token_addr', token_addr).next()
from_id = from_addr_node.id()
amount = amount.toBigDecimal()
tx_index = log_index.toInteger()
for (int i = start_value; i <= end_value; i++) {
    to_addr_node = g.addV('VirtualAddress').property('address', to_addr + String.valueOf(i)).property('token_addr', token_addr).next()
    to_id = to_addr_node.id()
    Date ts = new Date((timestamp.toLong() - i) * 1000)
    g.addE('TRANSFER_TO').from(g.V(from_id)).to(g.V(to_id))
            .property('amount', amount + i)
            .property('tx_hash', tx_hash)
            .property('tx_index', tx_index + i)
            .property('created_time', ts)
            .next()


    if (i % 20000 == 0) {
        println("[total:${i}]")
        System.sleep(500)
        g.tx().commit()
        graph.close()
        System.sleep(5000)


        graph = JanusGraphFactory.open(graph_conf)
        g = graph.traversal()
        System.sleep(5000)
    }
    g.tx().commit()
}

graph.close()
 

Here are my query criteria:

g.V().has('address', 'from_addr_00').outE('TRANSFER_TO').order().by('amount', desc).limit(500).valueMap().toList()


Java APIs for Create/Open/Drop graph using ConfiguredGraphFactory - janusgraph

"Anumodh N.K" <anu...@...>
 

Hi JanusGraph team,

We are exploring the possibility of using JanusGraph in our company. We are planning to use dynamic graph creation/deletion with ConfiguredGraphFactory. We have deployed 2 JanusGraph instances. Our plan is to write a Java library for creation, deletion and other management activities like index creation. I haven't yet found a way to create/delete a graph through the Java APIs of ConfiguredGraphFactory.

From my investigation, the only way to do so is by connecting to the gremlin server and sending string commands like
 - ConfiguredGraphFactory.create(graphName)
 - ConfiguredGraphFactory.drop(graphName)
Is this the only way to create/delete a graph? Please advise.

Thanks,
anumodh 





Problem with 2 property keys (conflict with property keys ?)

ALEX SEYMOUR <alex....@...>
 


Hi everyone,

I have a problem with 2 property keys, named:
"ex_value" String
"feed_name"String

I don't know why, but on every vertex I create, the ex_value property name is replaced by feed_name.

>g.addV("test").property("name", "example").property("ex_value", "test1")
=>3686404256
>g.V(3686404256).valueMap(true)
=>[id:3686404256,label:test_example, name:[test1], feed_name:[test1]]

The feed_name property appears in the result, but I didn't use this property; I used ex_value.

>mgmt.openManagement()
>mgmt.getPropertyKey("ex_value")
=>feed_name
the result is wrong


>mgmt.printPropertyKeys()
there are 2 feed_name properties; ex_value doesn't appear


I don't know how I can resolve this problem.

  
 


Re: BulkLoad error--can bulkload imports vertex without edges

姜明 <mingji...@...>
 

[Attachment: QQ screenshot 20200724084537.jpg]
On Thursday, July 23, 2020 at 21:18:51 UTC+8, 姜明 wrote:

I use TinkerPop's Hadoop-Gremlin to import data.
Taking grateful-dead.txt as an example, I want to import the vertices without their edges.

But I get an error like this: "java.util.NoSuchElementException".


Re: Unable to drop Remote JanusGraph

Nicolas Trangosi <nicolas...@...>
 

Hi,
Have you tried, instead of dropping the graph, doing a g.V().drop().iterate()?
Please note that in this case the JanusGraph schema would still exist and, depending on the backend, this solution can take much longer.
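
A minimal sketch of that first option over a sessionless remote connection (the properties file name is just a placeholder for your driver configuration):

g = traversal().withRemote('conf/remote-graph.properties')
g.V().drop().iterate()   // removes every vertex and its incident edges; the schema stays in place
g.close()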

Another solution would be to drop the graph and then replace, in the global bindings, the graph and the g references.
The global bindings are set in org.apache.tinkerpop.gremlin.groovy.engine.GremlinExecutor and are initialized in org.apache.tinkerpop.gremlin.server.util.DefaultGraphManager.getAsBindings(). As the implementation of GraphManager is configurable, it should be possible to develop an extension that lets you retrieve the global bindings and then replace the graph and the g references.

On Tuesday, July 21, 2020 at 17:39:38 UTC+2, d...@... wrote:
Hi,

I was using a remote server with the inmemory backend. Even after dropping the graph I was still able to query the data - it didn't get removed, and I had to restart the server.


On Tuesday, July 21, 2020 at 2:50:19 AM UTC-4, Nicolas Trangosi wrote:
Hi,
> How do we know if the graph has been dropped?   
As I am using Cassandra and ES as backends, I can see that the data has been dropped by querying these components directly.

>   Is there any way of clearing all the data, the schema without restarting the server. 
I have not found one yet, but I did not look into the gremlin server code. I do not have this requirement, so I did not search for it. Maybe using a proxy design pattern on the graph object would do the trick.

Nicolas

On Thursday, July 16, 2020 at 14:53:03 UTC+2, d...@... wrote:
Hi Nicolas,

How do we know if the graph has been dropped? You said you have to restart the server after doing that. I am doing the same, but I don't want to. Is there any way of clearing all the data and the schema without restarting the server?

Also, JanusGraphFactory.drop(getJanusGraph()) is the same as JanusGraphFactory.drop(graph), because that function passes the same object, and it still doesn't work. I don't want to restart the server as it's a remote server. How do I achieve that?


On Wednesday, July 15, 2020 at 3:43:13 PM UTC-4, Nicolas Trangosi wrote:
Hi,
I am able to drop the graph remotely using the script:
JanusGraphFactory.drop(graph);

After the script, I need to restart JanusGraph in order to re-create the graph.
Could you send the janusgraph logs with the stack trace?

Also, instead of JanusGraphFactory.drop(getJanusGraph());, could you try with JanusGraphFactory.drop(graph);

Kind regards,
Nicolas
On Wednesday, July 15, 2020 at 00:44:36 UTC+2, d...@... wrote:
Hi,

I'm looking to drop a graph, but JanusGraphFactory.drop(graph) doesn't work for me. I'm using Java to connect to the remote server and load/remove data.

Using the below properties file for remote-properties:
gremlin.remote.remoteConnectionClass=org.apache.tinkerpop.gremlin.driver.remote.DriverRemoteConnection
gremlin.remote.driver.clusterFile=conf/remote-objects.yaml
gremlin.remote.driver.sourceName=g

Below file for remote-objects - 

hosts: [hostname]
port: 8182
serializer: {
className: org.apache.tinkerpop.gremlin.driver.ser.GryoMessageSerializerV3d0,
config: {
ioRegistries: [org.janusgraph.graphdb.tinkerpop.JanusGraphIoRegistry]
}
}
Using this to open the graph - 
conf = new PropertiesConfiguration(propFileName);

// using the remote graph for queries
graph = JanusGraphFactory.open("inmemory");
g = graph.traversal().withRemote(conf);
return g;

Using this to drop the graph:
if (graph != null) {
LOGGER.info("Dropping janusgraph function");
JanusGraphFactory.drop(getJanusGraph());
}

Any help would be appreciated.
Thanks!




Re: Janusgraph with YARN and HBASE

Fábio Dapper <fda...@...>
 

Perfect!!!
That's it!
Thank you, very much!!!

On Thursday, July 23, 2020 at 05:20, Petr Stentor <kiri...@...> wrote:


Hi!

Try this 
spark.io.compression.codec=snappy

On Thursday, July 23, 2020 at 1:57:38 UTC+3, Fábio Dapper wrote:
Hello, we have a Cluster with CLOUDERA CDH 6.3.2 and I'm trying to run Janusgraph on the Cluster with YARN and HBASE, but without success.
(it's OK with SPARK Local)

Version SPARK 2.4.2
HBASE: 2.1.0-cdh6.3.2
Janusgraph (v 0.5.2 and v0.4.1)

I did a lot of searching, but I didn't find any recent references, and they all use older versions of SPARK and Janusgraph.

Some examples:

According to these references, I followed the following steps:

  1. Copy the following files to the Janusgraph "lib" directory:
    1. spark-yarn-2.11-2.4.0.jar
    2. scala-reflect-2.10.5.jar
    3. hadoop-yarn-server-web-proxy-2.7.2.jar
    4. guice-servlet-3.0.jar
  2. Generate a "/tmp/spark-gremlin-0.5.2.zip" file containing all the .jar files from "janusgraph / lib /".
  3. Create a configuration file called 'test.properties' from conf/hadoop-graph/read-hbase-standalone-cluster.properties by adding (or modifying) the properties below:

        janusgraphmr.ioformat.conf.storage.hostname=XXX.XXX.XXX.XXX 
spark.master= yarn
#spark.deploy-mode=client
spark.submit.deployMode=client
spark.executor.memory=1g
spark.yarn.dist.jars=/tmp/spark-gremlin-0-5-2.zip

spark.yarn.archive=/tmp/spark-gremlin-0-5-2.zip
spark.yarn.appMasterEnv.CLASSPATH=./__spark_libs__/*:[hadoop_conf_dir]
spark.executor.extraClassPath=./__spark_libs__/*:/[hadoop_conf_dir]
spark.driver.extraLibraryPath=/opt/cloudera/parcels/CDH/lib/hadoop/lib/native 



Then I ran the following commands:
    graph = GraphFactory.open('conf/hadoop-graph/test.properties')
    g = graph.traversal().withComputer(SparkGraphComputer)
    g.V().count()
Can someone help me?
a) Are these problems related to version incompatibility?
b) Has anyone successfully used similar infrastructure?
c) Would anyone know how to determine a correct version of the necessary libraries?
d) Any suggestion?


Thank you all !!!

 Below is a copy of the Yarn Log from my last attempt.

ERROR org.apache.spark.scheduler.TaskSetManager  - Task 0 in stage 0.0 failed 4 times; aborting job
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3, [SERVER_NAME], executor 1): java.lang.NoSuchMethodError: net.jpountz.lz4.LZ4BlockInputStream.<init>(Ljava/io/InputStream;Z)V
    at org.apache.spark.io.LZ4CompressionCodec.compressedInputStream(CompressionCodec.scala:122)
    at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$6.apply(TorrentBroadcast.scala:304)
    at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$6.apply(TorrentBroadcast.scala:304)
    at scala.Option.map(Option.scala:146)
    at org.apache.spark.broadcast.TorrentBroadcast$.unBlockifyObject(TorrentBroadcast.scala:304)
    at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1$$anonfun$apply$2.apply(TorrentBroadcast.scala:235)
    at scala.Option.getOrElse(Option.scala:121)
    at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1.apply(TorrentBroadcast.scala:211)
    at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1326)
    at org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:207)
    at org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:66)
    at org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:66)
    at org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:96)
    at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:89)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
    at org.apache.spark.scheduler.Task.run(Task.scala:121)
    at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:402)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:408)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

Thank you!!



Re: Janusgraph with YARN and HBASE

Petr Stentor <kiri...@...>
 


Hi!

Try this 
spark.io.compression.codec=snappy
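
The java.lang.NoSuchMethodError on net.jpountz.lz4.LZ4BlockInputStream in the Yarn log is typical of conflicting LZ4 jars between the CDH Spark installation and the jars shipped in the JanusGraph lib zip; switching the codec to snappy sidesteps the LZ4 code path. In the test.properties from the question it is simply added next to the other spark.* settings:

spark.io.compression.codec=snappy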

On Thursday, July 23, 2020 at 1:57:38 UTC+3, Fábio Dapper wrote:

Hello, we have a Cluster with CLOUDERA CDH 6.3.2 and I'm trying to run Janusgraph on the Cluster with YARN and HBASE, but without success.
(it's OK with SPARK Local)

Version SPARK 2.4.2
HBASE: 2.1.0-cdh6.3.2
Janusgraph (v 0.5.2 and v0.4.1)

I did a lot of searching, but I didn't find any recent references, and they all use older versions of SPARK and Janusgraph.

Some examples:

According to these references, I followed the following steps:

  1. Copy the following files to the Janusgraph "lib" directory:
    1. spark-yarn-2.11-2.4.0.jar
    2. scala-reflect-2.10.5.jar
    3. hadoop-yarn-server-web-proxy-2.7.2.jar
    4. guice-servlet-3.0.jar
  2. Generate a "/tmp/spark-gremlin-0.5.2.zip" file containing all the .jar files from "janusgraph / lib /".
  3. Create a configuration file called 'test.properties' from conf/hadoop-graph/read-hbase-standalone-cluster.properties by adding (or modifying) the properties below:

        janusgraphmr.ioformat.conf.storage.hostname=XXX.XXX.XXX.XXX 
spark.master= yarn
#spark.deploy-mode=client
spark.submit.deployMode=client
spark.executor.memory=1g
spark.yarn.dist.jars=/tmp/spark-gremlin-0-5-2.zip

spark.yarn.archive=/tmp/spark-gremlin-0-5-2.zip
spark.yarn.appMasterEnv.CLASSPATH=./__spark_libs__/*:[hadoop_conf_dir]
spark.executor.extraClassPath=./__spark_libs__/*:/[hadoop_conf_dir]
spark.driver.extraLibraryPath=/opt/cloudera/parcels/CDH/lib/hadoop/lib/native 



Then I ran the following commands:
    graph = GraphFactory.open('conf/hadoop-graph/test.properties')
    g = graph.traversal().withComputer(SparkGraphComputer)
    g.V().count()
Can someone help me?
a) Are these problems related to version incompatibility?
b) Has anyone successfully used similar infrastructure?
c) Would anyone know how to determine a correct version of the necessary libraries?
d) Any suggestion?


Thank you all !!!

 Below is a copy of the Yarn Log from my last attempt.

ERROR org.apache.spark.scheduler.TaskSetManager  - Task 0 in stage 0.0 failed 4 times; aborting job
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3, [SERVER_NAME], executor 1): java.lang.NoSuchMethodError: net.jpountz.lz4.LZ4BlockInputStream.<init>(Ljava/io/InputStream;Z)V
    at org.apache.spark.io.LZ4CompressionCodec.compressedInputStream(CompressionCodec.scala:122)
    at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$6.apply(TorrentBroadcast.scala:304)
    at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$6.apply(TorrentBroadcast.scala:304)
    at scala.Option.map(Option.scala:146)
    at org.apache.spark.broadcast.TorrentBroadcast$.unBlockifyObject(TorrentBroadcast.scala:304)
    at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1$$anonfun$apply$2.apply(TorrentBroadcast.scala:235)
    at scala.Option.getOrElse(Option.scala:121)
    at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1.apply(TorrentBroadcast.scala:211)
    at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1326)
    at org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:207)
    at org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:66)
    at org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:66)
    at org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:96)
    at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:89)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
    at org.apache.spark.scheduler.Task.run(Task.scala:121)
    at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:402)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:408)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

Thank you!!
