
Re: olap connection with spark standalone cluster

Abhay Pandit <abha...@...>
 

Hi Lilly,

SparkGraphComputer does not support direct Gremlin queries from Java programs.
You can try something like the following instead.

String query = "g.V().count()";
ComputerResult result = graph.compute(SparkGraphComputer.class)
        .result(GraphComputer.ResultGraph.NEW)
        .persist(GraphComputer.Persist.EDGES)
        .program(TraversalVertexProgram.build()
                .traversal(
                        graph.traversal().withComputer(SparkGraphComputer.class),
                        "gremlin-groovy",
                        query)
                .create(graph))
        .submit()
        .get();
System.out.println(result.memory().get("gremlin.traversalVertexProgram.haltedTraversers"));
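If you want to read the count itself back, something like the below should work (a sketch; it assumes the memory value is the TraverserSet that TraversalVertexProgram leaves behind under that key):

import org.apache.tinkerpop.gremlin.process.traversal.traverser.util.TraverserSet;

// the halted traversers carry the traversal's output; for g.V().count() that is a single Long
TraverserSet<Object> halted = result.memory().get("gremlin.traversalVertexProgram.haltedTraversers");
halted.forEach(t -> System.out.println(t.get()));   // prints the vertex count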


Join my facebook group: https://www.facebook.com/groups/Janusgraph/

Thanks,
Abhay

On Tue, 15 Oct 2019 at 19:25, <marc.d...@...> wrote:
Hi Lilly,

This error says that there are somehow two versions of the TinkerPop jars in your project. If you use Maven, you can check this with the dependency plugin.

If other problems appear, also be sure that the spark cluster is doing fine by running one of the examples from the spark distribution with spark-submit.

HTH,    Marc

On Tuesday, 15 October 2019 at 09:38:08 UTC+2, Lilly wrote:
Hi everyone,

I downloaded a fresh spark binary release (spark-2.4.0-hadoop2.7) and set the master to spark://127.0.0.1:7077. I then started all services via $SPARK_HOME/sbin/start-all.sh.
I checked that spark works with the provided example programs.

I am further using the janusgraph-0.4.0-hadoop2 binary.

Now I configured the read-cassandra-3.properties as follows:
gremlin.graph=org.apache.tinkerpop.gremlin.hadoop.structure.HadoopGraph
gremlin.hadoop.graphReader=org.janusgraph.hadoop.formats.cassandra.Cassandra3InputFormat
gremlin.hadoop.graphWriter=org.apache.tinkerpop.gremlin.hadoop.structure.io.gryo.GryoOutputFormat
gremlin.hadoop.jarsInDistributedCache=true
gremlin.hadoop.inputLocation=none
gremlin.hadoop.outputLocation=output
gremlin.spark.persistContext=true
janusgraphmr.ioformat.conf.storage.backend=cassandra
janusgraphmr.ioformat.conf.storage.hostname=127.0.0.1
janusgraphmr.ioformat.conf.storage.port=9160
janusgraphmr.ioformat.conf.storage.cassandra.keyspace=janusgraph
cassandra.input.partitioner.class=org.apache.cassandra.dht.Murmur3Partitioner
spark.master=spark://127.0.0.1:7077
spark.executor.memory=8g
spark.executor.extraClassPath=/home/janusgraph-0.4.0-hadoop2/lib/*
spark.serializer=org.apache.spark.serializer.KryoSerializer
spark.kryo.registrator=org.apache.tinkerpop.gremlin.spark.structure.io.gryo.GryoRegistrator

where the janusgraph libraries are stored in /home/janusgraph-0.4.0-hadoop2/lib/*

In my java application I now tried
Graph graph = GraphFactory.open("...");
GraphTraversalSource g = graph.traversal().withComputer(SparkGraphComputer.class);
and then g.V().count().next()
I get the error message:
ERROR org.apache.spark.scheduler.TaskSetManager - Task 3 in stage 0.0 failed 4 times; aborting job
Exception in thread "main" java.lang.IllegalStateException: org.apache.spark.SparkException: Job aborted due to stage failure: Task 3 in stage 0.0 failed 4 times, most recent failure: Lost task 3.3 in stage 0.0 (TID 15, 192.168.178.32, executor 0): java.io.InvalidClassException: org.apache.tinkerpop.gremlin.process.traversal.util.DefaultTraversal; local class incompatible: stream classdesc serialVersionUID = -3191185630641472442, local class serialVersionUID = 6523257080464450267

Any ideas as to what might be the problem?
Thanks!
Lilly




Issue creating vertex with a List property having large number of elements

Aswin Karthik P <zas...@...>
 

There is a small change in the last line of the Python code.

Updated code

from gremlin_python.driver import client

client = client.Client('ws://localhost:8182/gremlin', 'g')

mgmtScript = "mgmt = graph.openManagement()\n" + \
"if (mgmt.getPropertyKey('name') != null) return false\n" + \
"mgmt.makePropertyKey('name').dataType(String.class).make()\n" + \
"mgmt.makePropertyKey('vl_prop').dataType(Float.class).cardinality(LIST).make()\n" +\
"mgmt.commit()\n" + \
"return true";

client.submit(mgmtScript).next()

f = open("only_vertex.txt", "r")

create_query = f.read()

client.submit(create_query).next()


Issue creating vertex with a List property having large number of elements

Aswin Karthik P <zas...@...>
 

Hi,
For a use case, I'm trying to create a vertex with some list properties that contain a large number of elements, using gremlin-python.
But the server crashes and I get a java.lang.StackOverflowError in gremlin-server.log.

Since the query is too big, I have attached it as a txt file. Along with it, I have attached the gremlin-server.yaml file for reference, where I have tried manipulating the content size etc.

Server initiation
The default one with Cassandra as backend storage

$JANUSHOME/bin/janusgraph.sh start

Python Code

from gremlin_python.driver import client
 
client =  client.Client('ws://localhost:8182/gremlin', 'g')

mgmtScript = "mgmt = graph.openManagement()\n" + \
"if (mgmt.getPropertyKey('name') != null) return false\n" + \
"mgmt.makePropertyKey('name').dataType(String.class).make()\n" + \
"mgmt.makePropertyKey('vl_prop').dataType(Float.class).cardinality(LIST).make()\n" +\
"mgmt.commit()\n" + \
"return true";

client.submit(mgmtScript).next()

f = open("/home/aswin/Desktop/only_vertex.txt", "r")

create_query = f.read()

client.submit(the_text).next()

The Python code is just a glimpse; I have to create 5 such properties for each node, and there will be a few thousand nodes in the graph.

I'm not sure whether this is a shortcoming of JanusGraph or of Gremlin Server, and whether it is even feasible to have such a graph model in JanusGraph.

I would also like to know if there is an easier/crisper way to create LIST properties than repeating the same property name with different values.
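For what it's worth, one untested idea (a sketch using gremlin-python's bindings support, so the values do not have to be inlined into one giant script; the names n and vals are illustrative, and the server-side Groovy engine plus the default graph binding are assumed):

from gremlin_python.driver import client

c = client.Client('ws://localhost:8182/gremlin', 'g')

# the script stays tiny; 'n' and 'vals' arrive as bindings
script = (
    "v = graph.addVertex()\n"
    "v.property('name', n)\n"
    "vals.each { v.property('vl_prop', (float) it) }\n"
    "graph.tx().commit()\n"
    "v.id()"
)

vl_prop_values = [0.1 * i for i in range(10000)]  # illustrative large list
print(c.submit(script, bindings={'n': 'vertex1', 'vals': vl_prop_values}).all().result())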


Re: olap connection with spark standalone cluster

marc.d...@...
 

Hi Lilly,

This error says that there are somehow two versions of the TinkerPop jars in your project. If you use Maven, you can check this with the dependency plugin.
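For example, a generic invocation (not specific to your project) that lists every TinkerPop artifact and version on the classpath:

mvn dependency:tree -Dincludes=org.apache.tinkerpop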

If other problems appear, also be sure that the spark cluster is doing fine by running one of the examples from the spark distribution with spark-submit.
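A quick smoke test along those lines (the examples jar name below assumes the spark-2.4.0 distribution mentioned in this thread; adjust it to your build):

$SPARK_HOME/bin/spark-submit --master spark://127.0.0.1:7077 \
    --class org.apache.spark.examples.SparkPi \
    $SPARK_HOME/examples/jars/spark-examples_2.11-2.4.0.jar 100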

HTH,    Marc

On Tuesday, 15 October 2019 at 09:38:08 UTC+2, Lilly wrote:

Hi everyone,

I downloaded a fresh spark binary release (spark-2.4.0-hadoop2.7) and set the master to spark://127.0.0.1:7077. I then started all services via $SPARK_HOME/sbin/start-all.sh.
I checked that spark works with the provided example programs.

I am further using the janusgraph-0.4.0-hadoop2 binary.

Now I configured the read-cassandra-3.properties as follows:
gremlin.graph=org.apache.tinkerpop.gremlin.hadoop.structure.HadoopGraph
gremlin.hadoop.graphReader=org.janusgraph.hadoop.formats.cassandra.Cassandra3InputFormat
gremlin.hadoop.graphWriter=org.apache.tinkerpop.gremlin.hadoop.structure.io.gryo.GryoOutputFormat
gremlin.hadoop.jarsInDistributedCache=true
gremlin.hadoop.inputLocation=none
gremlin.hadoop.outputLocation=output
gremlin.spark.persistContext=true
janusgraphmr.ioformat.conf.storage.backend=cassandra
janusgraphmr.ioformat.conf.storage.hostname=127.0.0.1
janusgraphmr.ioformat.conf.storage.port=9160
janusgraphmr.ioformat.conf.storage.cassandra.keyspace=janusgraph
cassandra.input.partitioner.class=org.apache.cassandra.dht.Murmur3Partitioner
spark.master=spark://127.0.0.1:7077
spark.executor.memory=8g
spark.executor.extraClassPath=/home/janusgraph-0.4.0-hadoop2/lib/*
spark.serializer=org.apache.spark.serializer.KryoSerializer
spark.kryo.registrator=org.apache.tinkerpop.gremlin.spark.structure.io.gryo.GryoRegistrator

where the janusgraph libraries are stored in /home/janusgraph-0.4.0-hadoop2/lib/*

In my java application I now tried
Graph graph = GraphFactory.open('...')
GraphTraversalSource g = graph.traversal().withComputer(SparkGraphComputer.class);
and then g.V().count().next()
I get the error message:
ERROR org.apache.spark.scheduler.TaskSetManager - Task 3 in stage 0.0 failed 4 times; aborting job
Exception in thread "main" java.lang.IllegalStateException: org.apache.spark.SparkException: Job aborted due to stage failure: Task 3 in stage 0.0 failed 4 times, most recent failure: Lost task 3.3 in stage 0.0 (TID 15, 192.168.178.32, executor 0): java.io.InvalidClassException: org.apache.tinkerpop.gremlin.process.traversal.util.DefaultTraversal; local class incompatible: stream classdesc serialVersionUID = -3191185630641472442, local class serialVersionUID = 6523257080464450267

Any ideas as to what might be the problem?
Thanks!
Lilly



How roll back works in janus graph, will it roll back the storage write in one transaction

Lighter <yangch...@...>
 

Hi, consider the sample code below. The storage backend is HBase and "name" is an indexed property, so a single commit involves at least two row updates. What if the index update succeeds while the vertex update fails (throws an exception)? When we call rollback, will it roll back the index write to storage?

try {
    user = graph.addVertex()
    user.property("name", name)
    graph.tx().commit()
} catch (Exception e) {
    // Recover, retry, or return error message
    println(e.getMessage())
    graph.tx().rollback()   // <------- Added line
}


olap connection with spark standalone cluster

Lilly <lfie...@...>
 

Hi everyone,

I downloaded a fresh spark binary release (spark-2.4.0-hadoop2.7) and set the master to spark://127.0.0.1:7077. I then started all services via $SPARK_HOME/sbin/start-all.sh.
I checked that spark works with the provided example programs.

I am further using the janusgraph-0.4.0-hadoop2 binary.

Now I configured the read-cassandra-3.properties as follows:
gremlin.graph=org.apache.tinkerpop.gremlin.hadoop.structure.HadoopGraph
gremlin.hadoop.graphReader=org.janusgraph.hadoop.formats.cassandra.Cassandra3InputFormat
gremlin.hadoop.graphWriter=org.apache.tinkerpop.gremlin.hadoop.structure.io.gryo.GryoOutputFormat
gremlin.hadoop.jarsInDistributedCache=true
gremlin.hadoop.inputLocation=none
gremlin.hadoop.outputLocation=output
gremlin.spark.persistContext=true
janusgraphmr.ioformat.conf.storage.backend=cassandra
janusgraphmr.ioformat.conf.storage.hostname=127.0.0.1
janusgraphmr.ioformat.conf.storage.port=9160
janusgraphmr.ioformat.conf.storage.cassandra.keyspace=janusgraph
cassandra.input.partitioner.class=org.apache.cassandra.dht.Murmur3Partitioner
spark.master=spark://127.0.0.1:7077
spark.executor.memory=8g
spark.executor.extraClassPath=/home/janusgraph-0.4.0-hadoop2/lib/*
spark.serializer=org.apache.spark.serializer.KryoSerializer
spark.kryo.registrator=org.apache.tinkerpop.gremlin.spark.structure.io.gryo.GryoRegistrator

where the janusgraph libraries are stored in /home/janusgraph-0.4.0-hadoop2/lib/*

In my java application I now tried
Graph graph = GraphFactory.open('...')
GraphTraversalSource g = graph.traversal().withComputer(SparkGraphComputer.class);
and then g.V().count().next()
I get the error message:
ERROR org.apache.spark.scheduler.TaskSetManager - Task 3 in stage 0.0 failed 4 times; aborting job
Exception in thread "main" java.lang.IllegalStateException: org.apache.spark.SparkException: Job aborted due to stage failure: Task 3 in stage 0.0 failed 4 times, most recent failure: Lost task 3.3 in stage 0.0 (TID 15, 192.168.178.32, executor 0): java.io.InvalidClassException: org.apache.tinkerpop.gremlin.process.traversal.util.DefaultTraversal; local class incompatible: stream classdesc serialVersionUID = -3191185630641472442, local class serialVersionUID = 6523257080464450267

Any ideas as to what might be the problem?
Thanks!
Lilly



New committer: Dmitry Kovalev

"Florian Hockmann" <f...@...>
 

On behalf of the JanusGraph Technical Steering Committee (TSC), I'm pleased to welcome a new committer on the project!

Dmitry Kovalev made a major contribution with the production-ready in-memory backend. He is quite responsive and patient during the review process and he also contributed to development decisions.


Congratulations, Dmitry!


Re: [QUESTION] Usage of the cassandraembedded

Lilly <lfie...@...>
 

I have now experimented with many settings for the cql connection and timed how long each run took.
My observations are the following:
- Embedded with bulk loading took 16 min.
- CQL without bulk loading is extremely slow (> 2 h).
- CQL with bulk loading (same settings as for embedded for the parameters storage.batch-loading, ids.block-size, ids.renew-timeout, cache.db-cache, cache.db-cache-clean-wait, cache.db-cache-time, cache.db-cache-size) took 27 min and used considerable amounts of my RAM (not the case in embedded mode).
- CQL as above, but additionally with storage.cql.batch-statement-size = 500 and storage.batch-loading = true, took 24 min and not quite as much RAM.

I honestly do not know what else might be the issue.
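For reference, the combined settings discussed here look roughly like the sketch below; only ids.block-size, ids.renew-timeout, storage.batch-loading and storage.cql.batch-statement-size appear explicitly in this thread, the cache.* values are illustrative placeholders:

storage.backend=cql
storage.batch-loading=true
ids.block-size=100000
ids.renew-timeout=1000000 ms
storage.cql.batch-statement-size=500
# illustrative cache settings; tune for your heap
cache.db-cache=true
cache.db-cache-clean-wait=20
cache.db-cache-time=180000
cache.db-cache-size=0.25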

On Wednesday, 9 October 2019 at 08:17:13 UTC+2, fa...@... wrote:

For "violation of unique key" it could be the case that CQL checks ids for uniqueness (JanusGraph could run out of ids in batch loading mode), but I'm not sure what the embedded backend does.

I have never used the batch loading mode; see also: https://docs.janusgraph.org/advanced-topics/bulk-loading/.


On Tuesday, 8 October 2019 at 17:50:23 UTC+2, Lilly wrote:
Hi Jan,

So I tried it again. First of all, I remembered that for cql I need to commit after each step; otherwise I get "violation of unique key" errors, even though I am actually not violating one. Is this supposed to be the case (having to commit each time)?
Now, committing after each function call, I found that with the adapted properties configuration (see my last reply) it is really super slow. If I use the "default" configuration for cql, it is a bit faster, but still much slower than in the embedded case.

I also tried it with another graph  which I persisted like this:
public void persist(Map<Integer, Map<String,Object>> nodes, Map<Integer,Integer> edges,
                    Map<Integer, Map<String,String>> names) {
    g = graph.traversal();

    // vertices: one vertex per taxId, plus its name properties
    int counter = 0;
    for (Map.Entry<Integer, Map<String,Object>> e : nodes.entrySet()) {
        Vertex v = g.addV().property("taxId", e.getKey())
                .property("rank", e.getValue().get("rank"))
                .property("divId", e.getValue().get("divId"))
                .property("genId", e.getValue().get("genId")).next();
        g.tx().commit();
        Map<String,String> n = names.get(e.getKey());
        if (n != null) {
            for (Map.Entry<String,String> vals : n.entrySet()) {
                g.V(v).property(vals.getKey(), vals.getValue()).iterate();
                g.tx().commit();
            }
        }
        if (counter % BULK_CHOP_SIZE == 0) {
            System.out.println(counter);
        }
        counter++;
    }

    // edges: child taxId -> parent taxId
    counter = 0;
    for (Map.Entry<Integer,Integer> e : edges.entrySet()) {
        g.V().has("taxId", e.getKey()).as("v1")
         .V().has("taxId", e.getValue()).as("v2")
         .addE("has_parent").from("v1").to("v2").iterate();
        g.tx().commit();
        if (counter % BULK_CHOP_SIZE == 0) {
            System.out.println(counter);
        }
        counter++;
    }

    // drop the self-loop on the root node (taxId 1)
    g.V().has("taxId", 1).as("v").outE().filter(__.inV().where(P.eq("v"))).drop().iterate();
    g.tx().commit();
    System.out.println("Done with persistence");
}

And had the same problem in either case.

I am probably using the cql backend wrong somehow and would appreciate any help on what else to do!
Thanks,
Lilly

On Tuesday, 8 October 2019 at 09:05:56 UTC+2, Lilly wrote:
Hi Jan,
Ok then I probably screwed up somewhere. I kind of thought this was to be expected, which is why I did not check it more thoroughly.
Maybe the way I persisted is not working well for cql.
I will try to create a test scenario where I do not have to persist all my data and see how it performs with cql again.

In principle, what I do is call this function :
public void updateEdges(String kmer, int pos, boolean strand, int record,
                        List<SequenceParser.Feature> features) {
    if (features == null) {
        features = Arrays.asList();
    }

    // get-or-create the prefix vertex (all but the last char) and the suffix vertex
    // (all but the first char), then connect them with a suffix_edge:
    // one edge per feature, or a single plain edge if there are no features
    g.withSideEffect("features", features)
     .V().has("prefix", kmer.substring(0, kmer.length() - 1)).fold()
     .coalesce(__.unfold(),
               __.addV("prefix_node").property("prefix", kmer.substring(0, kmer.length() - 1))).as("v1")
     .coalesce(__.V().has("prefix", kmer.substring(1, kmer.length())),
               __.addV("prefix_node").property("prefix", kmer.substring(1, kmer.length()))).as("v2")
     .sideEffect(__.choose(__.select("features").unfold().count().is(P.eq(0)),
                           __.addE("suffix_edge").property("record", record)
                             .property("strand", strand).property("pos", pos)
                             .from("v1").to("v2"))
                   .select("features").unfold()
                   .addE("suffix_edge").property("record", record)
                   .property("strand", strand).property("pos", pos)
                   .property(__.map(t -> ((SequenceParser.Feature) t.get()).category),
                             __.map(t -> ((SequenceParser.Feature) t.get()).feature))
                   .from("v1").to("v2"))
     .iterate();
}
and roughly every 50,000 calls I do a commit. As a side remark, all of the above properties have indices. And Feature is a simple class with two attributes, category and feature.

Also I adapted the configuration file in the following way:
storage.batch-loading = true
ids.block-size = 100000
ids.authority.wait-time = 2000 ms
ids.renew-timeout = 1000000 ms

I tried the same with cql and embedded.

I will get back to you once I have tested it once again. But maybe you already spot an issue?
Thanks
Lilly
On Monday, 7 October 2019 at 20:14:29 UTC+2, fa...@... wrote:
We don't see this problem on persistence.
It would be good to know what takes longer. Would you like to give some more information?

Jan



Re: index not used for query

Anatoly Belikov <awbe...@...>
 

index.search.backend=elasticsearch
index.search.hostname=127.0.0.1
index.search.elasticsearch.client-only=true

Do you think it is due to Elasticsearch?


On Wednesday, 2 October 2019 14:06:01 UTC+3, arnab kumar pan wrote:
Facing the same issue while creating a mixed index; can you share your Elasticsearch configuration?

On Tuesday, September 24, 2019 at 7:26:43 PM UTC+5:30, aw...@... wrote:
Hello

I have made an index for the vertex property "id". The index is enabled, but it is still not used for the query, according to the profiler. Please advise me on how to make the index work.

gremlin> vindex = mgmt.getGraphIndex("byId")
gremlin> vindex.fieldKeys
==>id
gremlin> mgmt.awaitGraphIndexStatus(graph, vindex.name()).status(SchemaStatus.ENABLED).call()
==>GraphIndexStatusReport[success=true, indexName='byId', targetStatus=[ENABLED], notConverged={}, converged={id=ENABLED}, elapsed=PT0.001S]

gremlin> g.V().has('id', '-9032656531829342390').profile()
==>Traversal Metrics
Step                                                               Count  Traversers       Time (ms)    % Dur
=============================================================================================================
JanusGraphStep([],[id.eq(-9032656531829342390)])                       1           1        2230.851   100.00
    \_condition=(id = -9032656531829342390)
    \_isFitted=false
    \_query=[]
    \_orders=[]
    \_isOrdered=true
  optimization                                                                                          0.005
  optimization                                                                                          0.026
  scan                                                                                                  0.000
    \_condition=VERTEX
    \_query=[]
    \_fullscan=true
                                            >TOTAL                     -           -        2230.851
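One thing to rule out (an assumption, since the definition of the 'id' key is not shown above): a JanusGraph composite index only serves equality lookups whose value matches the property key's data type exactly, so if 'id' were defined as Long, the quoted String would force the full scan seen here:

gremlin> g.V().has('id', -9032656531829342390L).profile()   // Long literal: can use 'byId' if the key type is Long
gremlin> g.V().has('id', '-9032656531829342390').profile()  // String literal against a Long key: full scan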



Re: [QUESTION] Usage of the cassandraembedded

faro...@...
 

For "violation of unique key"  it could be the case that cql checks id's to be unique (JanusGraph could run out of id's in the batch loading mode) but i'm not sure what the embedded backend is doing.


I never used the batch loading mode, see also here: https://docs.janusgraph.org/advanced-topics/bulk-loading/.


On Tuesday, 8 October 2019 at 17:50:23 UTC+2, Lilly wrote:

Hi Jan,

So I tried it again. First of all, I remembered that for cql I need to commit after each step; otherwise I get "violation of unique key" errors, even though I am actually not violating one. Is this supposed to be the case (having to commit each time)?
Now, committing after each function call, I found that with the adapted properties configuration (see my last reply) it is really super slow. If I use the "default" configuration for cql, it is a bit faster, but still much slower than in the embedded case.

I also tried it with another graph  which I persisted like this:
public void persist(Map<Integer, Map<String,Object>> nodes, Map<Integer,Integer> edges,
                    Map<Integer, Map<String,String>> names) {
    g = graph.traversal();

    // vertices: one vertex per taxId, plus its name properties
    int counter = 0;
    for (Map.Entry<Integer, Map<String,Object>> e : nodes.entrySet()) {
        Vertex v = g.addV().property("taxId", e.getKey())
                .property("rank", e.getValue().get("rank"))
                .property("divId", e.getValue().get("divId"))
                .property("genId", e.getValue().get("genId")).next();
        g.tx().commit();
        Map<String,String> n = names.get(e.getKey());
        if (n != null) {
            for (Map.Entry<String,String> vals : n.entrySet()) {
                g.V(v).property(vals.getKey(), vals.getValue()).iterate();
                g.tx().commit();
            }
        }
        if (counter % BULK_CHOP_SIZE == 0) {
            System.out.println(counter);
        }
        counter++;
    }

    // edges: child taxId -> parent taxId
    counter = 0;
    for (Map.Entry<Integer,Integer> e : edges.entrySet()) {
        g.V().has("taxId", e.getKey()).as("v1")
         .V().has("taxId", e.getValue()).as("v2")
         .addE("has_parent").from("v1").to("v2").iterate();
        g.tx().commit();
        if (counter % BULK_CHOP_SIZE == 0) {
            System.out.println(counter);
        }
        counter++;
    }

    // drop the self-loop on the root node (taxId 1)
    g.V().has("taxId", 1).as("v").outE().filter(__.inV().where(P.eq("v"))).drop().iterate();
    g.tx().commit();
    System.out.println("Done with persistence");
}

And had the same problem in either case.

I am probably using the cql backend wrong somehow and would appreciate any help on what else to do!
Thanks,
Lilly

On Tuesday, 8 October 2019 at 09:05:56 UTC+2, Lilly wrote:
Hi Jan,
Ok then I probably screwed up somewhere. I kind of thought this was to be expected, which is why I did not check it more thoroughly.
Maybe the way I persisted is not working well for cql.
I will try to create a test scenario where I do not have to persist all my data and see how it performs with cql again.

In principle, what I do is call this function :
public void updateEdges(String kmer, int pos, boolean strand, int record,
                        List<SequenceParser.Feature> features) {
    if (features == null) {
        features = Arrays.asList();
    }

    // get-or-create the prefix vertex (all but the last char) and the suffix vertex
    // (all but the first char), then connect them with a suffix_edge:
    // one edge per feature, or a single plain edge if there are no features
    g.withSideEffect("features", features)
     .V().has("prefix", kmer.substring(0, kmer.length() - 1)).fold()
     .coalesce(__.unfold(),
               __.addV("prefix_node").property("prefix", kmer.substring(0, kmer.length() - 1))).as("v1")
     .coalesce(__.V().has("prefix", kmer.substring(1, kmer.length())),
               __.addV("prefix_node").property("prefix", kmer.substring(1, kmer.length()))).as("v2")
     .sideEffect(__.choose(__.select("features").unfold().count().is(P.eq(0)),
                           __.addE("suffix_edge").property("record", record)
                             .property("strand", strand).property("pos", pos)
                             .from("v1").to("v2"))
                   .select("features").unfold()
                   .addE("suffix_edge").property("record", record)
                   .property("strand", strand).property("pos", pos)
                   .property(__.map(t -> ((SequenceParser.Feature) t.get()).category),
                             __.map(t -> ((SequenceParser.Feature) t.get()).feature))
                   .from("v1").to("v2"))
     .iterate();
}
and roughly every 50,000 calls I do a commit. As a side remark, all of the above properties have indices. And Feature is a simple class with two attributes, category and feature.

Also I adapted the configuration file in the following way:
storage.batch-loading = true
ids.block-size = 100000
ids.authority.wait-time = 2000 ms
ids.renew-timeout = 1000000 ms

I tried the same with cql and embedded.

I will get back to you once I have tested it once again. But maybe you already spot an issue?
Thanks
Lilly
On Monday, 7 October 2019 at 20:14:29 UTC+2, fa...@... wrote:
We don't see this problem on persistence.
It would be good to know what takes longer. Would you like to give some more information?

Jan



Re: [QUESTION] Usage of the cassandraembedded

faro...@...
 

Your block-size should be large in this example, see Id Creation: https://www.experoinc.com/post/janusgraph-nuts-and-bolts-part-1-write-performance

On Tuesday, 8 October 2019 at 09:05:56 UTC+2, Lilly wrote:

Hi Jan,
Ok then I probably screwed up somewhere. I kind of thought this was to be expected, which is why I did not check it more thoroughly.
Maybe the way I persisted is not working well for cql.
I will try to create a test scenario where I do not have to persist all my data and see how it performs with cql again.

In principle, what I do is call this function :
public void updateEdges(String kmer, int pos, boolean strand, int record,
                        List<SequenceParser.Feature> features) {
    if (features == null) {
        features = Arrays.asList();
    }

    // get-or-create the prefix vertex (all but the last char) and the suffix vertex
    // (all but the first char), then connect them with a suffix_edge:
    // one edge per feature, or a single plain edge if there are no features
    g.withSideEffect("features", features)
     .V().has("prefix", kmer.substring(0, kmer.length() - 1)).fold()
     .coalesce(__.unfold(),
               __.addV("prefix_node").property("prefix", kmer.substring(0, kmer.length() - 1))).as("v1")
     .coalesce(__.V().has("prefix", kmer.substring(1, kmer.length())),
               __.addV("prefix_node").property("prefix", kmer.substring(1, kmer.length()))).as("v2")
     .sideEffect(__.choose(__.select("features").unfold().count().is(P.eq(0)),
                           __.addE("suffix_edge").property("record", record)
                             .property("strand", strand).property("pos", pos)
                             .from("v1").to("v2"))
                   .select("features").unfold()
                   .addE("suffix_edge").property("record", record)
                   .property("strand", strand).property("pos", pos)
                   .property(__.map(t -> ((SequenceParser.Feature) t.get()).category),
                             __.map(t -> ((SequenceParser.Feature) t.get()).feature))
                   .from("v1").to("v2"))
     .iterate();
}
and roughly every 50,000 calls I do a commit. As a side remark, all of the above properties have indices. And Feature is a simple class with two attributes, category and feature.

Also I adapted the configuration file in the following way:
storage.batch-loading = true
ids.block-size = 100000
ids.authority.wait-time = 2000 ms
ids.renew-timeout = 1000000 ms

I tried the same with cql and embedded.

I will get back to you once I have tested it once again. But maybe you already spot an issue?
Thanks
Lilly
On Monday, 7 October 2019 at 20:14:29 UTC+2, fa...@... wrote:
We don't see this problem on persistence.
It would be good to know what takes longer. Would you like to give some more information?

Jan



JanusGraph sessions at Scylla Summit

Peter Corless <pe...@...>
 

Hello everyone! Though I generally just lurk and absorb everyone's collective wisdom, today I wanted to let you know we'll have a pair of JanusGraph practitioners speaking at Scylla Summit this year, November 5-6 in San Francisco:
  • Brian Hall of Expero
  • Ryan Stauffer of Enharmonic
We published a blog today regarding their upcoming talks.
JanusGraph has been a perennial topic at Scylla Summit since 2016, so I could not be more pleased to continue the tradition of showcasing its capabilities and use cases with our audience.

Further forgive me for sounding all-too-marketing-y, but if anyone on the list would be interested in attending these sessions at Scylla Summit, feel free to use the discount code JANUSGRAPHUSERS25 for 25% off.

With that, I'll let you get back to the heart of your technical discussions. Enjoy the day!

-Peter.

--
Peter Corless
Technical Marketing Manager
650-906-3134


Re: [QUESTION] Usage of the cassandraembedded

Lilly <lfie...@...>
 

Hi Jan,

So I tried it again. First of all, I remembered that for cql I need to commit after each step; otherwise I get "violation of unique key" errors, even though I am actually not violating one. Is this supposed to be the case (having to commit each time)?
Now, committing after each function call, I found that with the adapted properties configuration (see my last reply) it is really super slow. If I use the "default" configuration for cql, it is a bit faster, but still much slower than in the embedded case.

I also tried it with another graph  which I persisted like this:
public void persist(Map<Integer, Map<String,Object>> nodes, Map<Integer,Integer> edges,
                    Map<Integer, Map<String,String>> names) {
    g = graph.traversal();

    // vertices: one vertex per taxId, plus its name properties
    int counter = 0;
    for (Map.Entry<Integer, Map<String,Object>> e : nodes.entrySet()) {
        Vertex v = g.addV().property("taxId", e.getKey())
                .property("rank", e.getValue().get("rank"))
                .property("divId", e.getValue().get("divId"))
                .property("genId", e.getValue().get("genId")).next();
        g.tx().commit();
        Map<String,String> n = names.get(e.getKey());
        if (n != null) {
            for (Map.Entry<String,String> vals : n.entrySet()) {
                g.V(v).property(vals.getKey(), vals.getValue()).iterate();
                g.tx().commit();
            }
        }
        if (counter % BULK_CHOP_SIZE == 0) {
            System.out.println(counter);
        }
        counter++;
    }

    // edges: child taxId -> parent taxId
    counter = 0;
    for (Map.Entry<Integer,Integer> e : edges.entrySet()) {
        g.V().has("taxId", e.getKey()).as("v1")
         .V().has("taxId", e.getValue()).as("v2")
         .addE("has_parent").from("v1").to("v2").iterate();
        g.tx().commit();
        if (counter % BULK_CHOP_SIZE == 0) {
            System.out.println(counter);
        }
        counter++;
    }

    // drop the self-loop on the root node (taxId 1)
    g.V().has("taxId", 1).as("v").outE().filter(__.inV().where(P.eq("v"))).drop().iterate();
    g.tx().commit();
    System.out.println("Done with persistence");
}

And had the same problem in either case.

I am probably using the cql backend wrong somehow and would appreciate any help on what else to do!
Thanks,
Lilly

On Tuesday, 8 October 2019 at 09:05:56 UTC+2, Lilly wrote:

Hi Jan,
Ok then I probably screwed up somewhere. I kind of thought this was to be expected, which is why I did not check it more thoroughly.
Maybe the way I persisted is not working well for cql.
I will try to create a test scenario where I do not have to persist all my data and see how it performs with cql again.

In principle, what I do is call this function :
public void updateEdges(String kmer, int pos, boolean strand, int record,
                        List<SequenceParser.Feature> features) {
    if (features == null) {
        features = Arrays.asList();
    }

    // get-or-create the prefix vertex (all but the last char) and the suffix vertex
    // (all but the first char), then connect them with a suffix_edge:
    // one edge per feature, or a single plain edge if there are no features
    g.withSideEffect("features", features)
     .V().has("prefix", kmer.substring(0, kmer.length() - 1)).fold()
     .coalesce(__.unfold(),
               __.addV("prefix_node").property("prefix", kmer.substring(0, kmer.length() - 1))).as("v1")
     .coalesce(__.V().has("prefix", kmer.substring(1, kmer.length())),
               __.addV("prefix_node").property("prefix", kmer.substring(1, kmer.length()))).as("v2")
     .sideEffect(__.choose(__.select("features").unfold().count().is(P.eq(0)),
                           __.addE("suffix_edge").property("record", record)
                             .property("strand", strand).property("pos", pos)
                             .from("v1").to("v2"))
                   .select("features").unfold()
                   .addE("suffix_edge").property("record", record)
                   .property("strand", strand).property("pos", pos)
                   .property(__.map(t -> ((SequenceParser.Feature) t.get()).category),
                             __.map(t -> ((SequenceParser.Feature) t.get()).feature))
                   .from("v1").to("v2"))
     .iterate();
}
and roughly every 50,000 calls I do a commit. As a side remark, all of the above properties have indices. And Feature is a simple class with two attributes, category and feature.

Also I adapted the configuration file in the following way:
storage.batch-loading = true
ids.block-size = 100000
ids.authority.wait-time = 2000 ms
ids.renew-timeout = 1000000 ms

I tried the same with cql and embedded.

I will get back to you once I have tested it once again. But maybe you already spot an issue?
Thanks
Lilly
On Monday, 7 October 2019 at 20:14:29 UTC+2, fa...@... wrote:
We don't see this problem on persistence.
It would be good to know what takes longer. Would you like to give some more information?

Jan



Re: [QUESTION] Usage of the cassandraembedded

Lilly <lfie...@...>
 

Hi Jan,
Ok then I probably screwed up somewhere. I kind of thought this was to be expected, which is why I did not check it more thoroughly.
Maybe the way I persisted is not working well for cql.
I will try to create a test scenario where I do not have to persist all my data and see how it performs with cql again.

In principle, what I do is call this function :
public void updateEdges(String kmer, int pos, boolean strand, int record,
                        List<SequenceParser.Feature> features) {
    if (features == null) {
        features = Arrays.asList();
    }

    // get-or-create the prefix vertex (all but the last char) and the suffix vertex
    // (all but the first char), then connect them with a suffix_edge:
    // one edge per feature, or a single plain edge if there are no features
    g.withSideEffect("features", features)
     .V().has("prefix", kmer.substring(0, kmer.length() - 1)).fold()
     .coalesce(__.unfold(),
               __.addV("prefix_node").property("prefix", kmer.substring(0, kmer.length() - 1))).as("v1")
     .coalesce(__.V().has("prefix", kmer.substring(1, kmer.length())),
               __.addV("prefix_node").property("prefix", kmer.substring(1, kmer.length()))).as("v2")
     .sideEffect(__.choose(__.select("features").unfold().count().is(P.eq(0)),
                           __.addE("suffix_edge").property("record", record)
                             .property("strand", strand).property("pos", pos)
                             .from("v1").to("v2"))
                   .select("features").unfold()
                   .addE("suffix_edge").property("record", record)
                   .property("strand", strand).property("pos", pos)
                   .property(__.map(t -> ((SequenceParser.Feature) t.get()).category),
                             __.map(t -> ((SequenceParser.Feature) t.get()).feature))
                   .from("v1").to("v2"))
     .iterate();
}
and roughly every 50,000 calls I do a commit. As a side remark, all of the above properties have indices. And Feature is a simple class with two attributes, category and feature.
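For concreteness, the batching described above looks roughly like the sketch below (KmerRecord and records are hypothetical stand-ins for the parser output; 50,000 is the batch size mentioned here):

int calls = 0;
for (KmerRecord r : records) {   // hypothetical input iterable from the parser
    updateEdges(r.kmer, r.pos, r.strand, r.record, r.features);
    if (++calls % 50_000 == 0) {
        g.tx().commit();         // commit a batch so each transaction stays small
    }
}
g.tx().commit();                 // commit the remainder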

Also I adapted the configuration file in the following way:
storage.batch-loading = true
ids.block-size = 100000
ids.authority.wait-time = 2000 ms
ids.renew-timeout = 1000000 ms

I tried the same with cql and embedded.

I will get back to you once I have tested it once again. But maybe you already spot an issue?
Thanks
Lilly
On Monday, 7 October 2019 at 20:14:29 UTC+2, fa...@... wrote:

We don't see this problem on persistence.
It would be good to know what takes longer. Would you like to give some more information?

Jan



Re: [QUESTION] Usage of the cassandraembedded

nicolas...@...
 

hi,
I think that embedded Cassandra can lead to classpath hell, so this option should at least not be possible in a default installation.

I have a project where, in a first version, I put my library into JanusGraph. When I had to parse CSV, I found that JG embedded some old CSV libraries, and updating them might have had unexpected effects on JG, so I used the old version even though it was imperfect. In a second version, I use a Spring Boot application with a remote connection to JG to prevent such issues.

Elasticsearch has disabled embedded mode for this reason (see https://www.elastic.co/blog/elasticsearch-the-server).

regards,
Nicolas


Re: [QUESTION] Usage of the cassandraembedded

faro...@...
 

We don't see this problem on persistence.
It would be good to know what takes longer. Would you like to give some more information?

Jan



Re: Persistence of graph view

marc.d...@...
 

Hi Lilly,

Thanks for explaining; I already feared that I had missed something. I think each type of query has its optimal treatment. When you have two properties to select on, you would have these cases:
  • small result set (let us say smaller than 1000 vertices). This is served well by the default CompositeIndex or MixedIndex on these two property keys
  • large result set. Here it is probably more efficient to work with stored vertex ids. However, you now store the ids as a dictionary with the values of p2 as keys, so your query becomes g.V(ids[p2_value]) (see the sketch below).
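A minimal sketch of that dictionary approach (p1, p2 and x as in the example below; the map layout is just one possibility):

import java.util.*;
import org.apache.tinkerpop.gremlin.structure.T;

// build the dictionary once: p2 value -> ids of the vertices that satisfy p1 = x
Map<Object, List<Object>> idsByP2 = new HashMap<>();
g.V().has("p1", "x").project("id", "p2").by(T.id).by("p2")
 .toStream()
 .forEach(m -> idsByP2.computeIfAbsent(m.get("p2"), k -> new ArrayList<>()).add(m.get("id")));

// later: start the traversal directly from the stored ids, no index lookup needed
Object p2Value = "y";   // illustrative
List<Object> ids = idsByP2.getOrDefault(p2Value, Collections.emptyList());
long n = g.V(ids.toArray()).count().next();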
If you can make sense of what works best, it would be interesting to read about your results in a blog!

Btw, your use case of getting a large set of vertices as the start of a traversal query is possibly better served by postgresql or some other linearly scalable SQL store. JanusGraph shines at longer traversals from a small number of starting vertices.

Best wishes,      Marc

On Monday, 7 October 2019 at 15:58:36 UTC+2, Lilly wrote:

Hi Marc,

I guess I did not explain my issue very well.
What I meant to say is this. Suppose these ids correspond to some filtering criterion. Now having these ids I can create the subgraph.
However, if on this subgraph I want to use another index (not the one related to the filtering criterion) "property", this will not be used.

A (hopefully) simple example.
Say I have a graph with properties p1 and p2 and indices on both. Now I get the ids of all vertices that have p1 = x and store them in ids.
Now doing g.V(ids).has(p2, ...) will not make use of the index on p2. At least it does not show up in the profile step.

Is it clear now what I mean? Or am I mistaken?

Thanks,
Lilly

On Monday, 7 October 2019 at 15:49:01 UTC+2, ma...@... wrote:
Hi Lily,

When you have the vertex id, you do not need any index. The index is a lookup table from property value to vertex id.

Cheers,    Marc


On Monday, 7 October 2019 at 08:15:50 UTC+2, Lilly wrote:
Hi Marc,

Thanks for your reply!

Your suggestions would fetch the subgraph efficiently. However, on this subgraph I could no longer use any of my other indices.
Say I have an index on "property". Then g.V(ids).has("property", ...) would no longer make use of the index on "property" (only g.V().has("property", ...) does).
Yet especially if the subgraph is still rather large, using the index would be desirable.
Any thoughts on how to achieve this?

Thanks
Lilly

On Sunday, 6 October 2019 at 09:47:25 UTC+2, ma...@... wrote:
Hi Lilly,

Interesting question. For the JanusGraph backends to lookup the vertices of the subgraph efficiently, they need the id's of the vertices. The traversal is then g.V(ids) . There are different ways to get these id's:
  • store the id's on ingestion
  • query the id's once and store them
  • give the subgraph vertices a specific property and run an index on that property. I doubt, however, that this will be efficient for large subgraphs. @Anyone ever tried?
  • maybe the JanusGraph IDPlacementStrategy could provide a way to query only the subgraph vertices without knowing their explicit ids. Seems complicated compared to the first two options.
Cheers,    Marc

On Friday, 4 October 2019 at 17:48:52 UTC+2, Lilly wrote:
Hi,

I persisted a janusgraph g1 (with a Cassandra backend, if that is relevant). Now I would like to persist a "view" of this graph g1, i.e. a subgraph g2 of g1 which contains only some of the nodes and edges of g1. This subgraph should also possess all the indices of the affected nodes and edges.

I am aware of the SubgraphStrategy, which can create such a view at runtime. Is it possible to persist this view? I would like to avoid having to create this view all over again each time. Also, with this view created at runtime, I can no longer exploit other indices.
If this is not possible, is there another way to achieve this?

Thanks a lot!!
Lilly




Re: Titan to Janus - Change in behavior for properties with the same name but different datatypes.

Bharat Dighe <bdi...@...>
 

Thanks Abhay and Marc.

It came as a bit of a surprise due to the existing behavior in Titan.
This is quite restrictive given the nature of my app: other than a few fixed properties defined by the system, the rest of the properties are stamped by external sources.
I will need to redesign the app given this finding.

Bharat


On Sunday, October 6, 2019 at 9:00:57 AM UTC-7, Abhay Pandit wrote:
Hi Bharat,

JanusGraph, being more consistent, stores only one PropertyKey per name, with a single data type throughout the graph.
So in your case "status" can't have two data types in one graph.

Thanks,
Abhay


On Sun, 6 Oct 2019 at 01:03, <ma...@...> wrote:
Hi Bharat,

I understand your annoyance while porting your application, but to me the JanusGraph behaviour seems to be more consistent (by the way, I did not check the difference in behaviour you report, I just took your observation for granted). If you want the old Titan behaviour you can simply typecast your variable-type properties to their common denominator (like String, Long, Double, Object, whatever does the job) before you pass them to JanusGraph.
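To illustrate the typecast idea (just a sketch; v stands for any vertex as in the examples below, and the incoming values are hypothetical):

// coerce dynamic values to one declared type (String here) before writing,
// so the single PropertyKey data type fits values from any source
Object size = 3000000000L;                     // could arrive as Integer, Long, String, ...
v.property("size", String.valueOf(size));      // stored uniformly as a String
Object status = "connected";                   // sometimes an Integer, sometimes a String
v.property("status", String.valueOf(status));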

HTH,    Marc


On Saturday, 5 October 2019 at 07:31:56 UTC+2, Bharat Dighe wrote:
There is a significant difference in the way Titan and Janus handle properties with the same name whose values have different datatypes.
Titan allows it but Janus does not.
I am in the process of porting my app from Titan to Janus, and this is causing a major issue. In my app the properties are added dynamically; other than a few fixed properties, there is no predictability as to which properties will appear or what their datatypes will be.

Is there a way Janus can be made to behave same as Titan?

here is an example of the difference of behavior between Titan and Janus

Titan
=====
gremlin> v1=graph.addVertex();
==>v[4144]
gremlin> v2=graph.addVertex();
==>v[4096]
gremlin> v1.property("status", 1);
==>vp[status->1]
gremlin> v2.property("status","connected");
==>vp[status->connected]
gremlin> v1.property("size", 2000000)
==>vp[size->2000000]
gremlin> v2.property("size", 3000000000);
==>vp[size->3000000000]
gremlin> v1.property("status").value().getClass();
==>class java.lang.Integer
gremlin> v2.property("status").value().getClass();
==>class java.lang.String
gremlin> v1.property("size").value().getClass();
==>class java.lang.Integer
gremlin> v2.property("size").value().getClass();
==>class java.lang.Long

Janus
=====
gremlin> v1=graph.addVertex();
==>v[4104]
gremlin> v2=graph.addVertex();
==>v[4176]
gremlin> v1.property("status", 1);
==>vp[status->1]
gremlin> graph.tx().commit();
==>null
gremlin> v2.property("status","connected");
Value [connected] is not an instance of the expected data type for property key [status] and cannot be converted. Expected: class java.lang.Integer, found: class java.lang.String
Type ':help' or ':h' for help.
Display stack trace? [yN]n
gremlin> v1.property("size", 2000000)
==>vp[size->2000000]
gremlin> v2.property("size", 3000000000);
Value [3000000000] is not an instance of the expected data type for property key [size] and cannot be converted. Expected: class java.lang.Integer, found: class java.lang.Long
Type ':help' or ':h' for help.
Display stack trace? [yN]n



Re: Persistence of graph view

Lilly <lfie...@...>
 

Hi Marc,

I guess I did not explain my issue very well.
What I meant to say is this. Suppose these ids correspond to some filtering criterion. Now having these ids I can create the subgraph.
However, if on this subgraph I want to use another index (not the one related to the filtering criterion) "property", this will not be used.

A (hopefully) simple example.
Say I have a graph with properties p1 and p2 and indices on both. Now I get the ids of all vertices that have p1 = x and store them in ids.
Now doing g.V(ids).has(p2, ...) will not make use of the index on p2. At least it does not show up in the profile step.

Is it clear now what I mean? Or am I mistaken?

Thanks,
Lilly

On Monday, 7 October 2019 at 15:49:01 UTC+2, ma...@... wrote:

Hi Lily,

When you have the vertex id, you do not need any index. The index is a lookup table from property value to vertex id.

Cheers,    Marc


On Monday, 7 October 2019 at 08:15:50 UTC+2, Lilly wrote:
Hi Marc,

Thanks for your reply!

Your suggestions would fetch the subgraph efficiently. However, on this subgraph I could no longer use any of my other indices.
Say I have an index on "property". Then g.V(ids).has("property", ...) would no longer make use of the index on "property" (only g.V().has("property", ...) does).
Yet especially if the subgraph is still rather large, using the index would be desirable.
Any thoughts on how to achieve this?

Thanks
Lilly

On Sunday, 6 October 2019 at 09:47:25 UTC+2, ma...@... wrote:
Hi Lilly,

Interesting question. For the JanusGraph backends to lookup the vertices of the subgraph efficiently, they need the id's of the vertices. The traversal is then g.V(ids) . There are different ways to get these id's:
  • store the id's on ingestion
  • query the id's once and store them
  • give the subgraph vertices a specific property and run an index on that property. I doubt, however, that this will be efficient for large subgraphs. @Anyone ever tried?
  • maybe the JanusGraph IDPlacementStrategy could provide a way to query only the subgraph vertices without knowing their explicit ids. Seems complicated compared to the first two options.
Cheers,    Marc

On Friday, 4 October 2019 at 17:48:52 UTC+2, Lilly wrote:
Hi,

I persisted a janusgraph g1 (with a Cassandra backend, if that is relevant). Now I would like to persist a "view" of this graph g1, i.e. a subgraph g2 of g1 which contains only some of the nodes and edges of g1. This subgraph should also possess all the indices of the affected nodes and edges.

I am aware of the SubgraphStrategy, which can create such a view at runtime. Is it possible to persist this view? I would like to avoid having to create this view all over again each time. Also, with this view created at runtime, I can no longer exploit other indices.
If this is not possible, is there another way to achieve this?

Thanks a lot!!
Lilly




Re: Persistence of graph view

marc.d...@...
 

Hi Lily,

When you have the vertex id, you do not need any index. The index is a lookup table from property value to vertex id.

Cheers,    Marc


On Monday, 7 October 2019 at 08:15:50 UTC+2, Lilly wrote:

Hi Marc,

Thanks for your reply!

Your suggestions would fetch the subgraph efficiently. However, on this subgraph I could no longer use any of my other indices.
Say I have an index on "property". Then g.V(ids).has("property", ...) would no longer make use of the index on "property" (only g.V().has("property", ...) does).
Yet especially if the subgraph is still rather large, using the index would be desirable.
Any thoughts on how to achieve this?

Thanks
Lilly

On Sunday, 6 October 2019 at 09:47:25 UTC+2, ma...@... wrote:
Hi Lilly,

Interesting question. For the JanusGraph backends to lookup the vertices of the subgraph efficiently, they need the id's of the vertices. The traversal is then g.V(ids) . There are different ways to get these id's:
  • store the id's on ingestion
  • query the id's once and store them
  • give the subgraph vertices a specific property and run an index on that property. I doubt, however, that this will be efficient for large subgraphs. @Anyone ever tried?
  • maybe the JanusGraph IDPlacementStrategy could provide a way to query only the subgraph vertices without knowing their explicit ids. Seems complicated compared to the first two options.
Cheers,    Marc

On Friday, 4 October 2019 at 17:48:52 UTC+2, Lilly wrote:
Hi,

I persisted a janusgraph g1 (with a Cassandra backend, if that is relevant). Now I would like to persist a "view" of this graph g1, i.e. a subgraph g2 of g1 which contains only some of the nodes and edges of g1. This subgraph should also possess all the indices of the affected nodes and edges.

I am aware of the SubgraphStrategy, which can create such a view at runtime. Is it possible to persist this view? I would like to avoid having to create this view all over again each time. Also, with this view created at runtime, I can no longer exploit other indices.
If this is not possible, is there another way to achieve this?

Thanks a lot!!
Lilly


