
Re: Could not alias [g] to [g] as [g]

HadoopMarc <bi...@...>
 

OK, I understand now what you mean by "I see this error only on my server". What happens if you try to open a websocket connection with a test websocket client, e.g. for Firefox:

https://addons.mozilla.org/en-US/firefox/addon/simple-websocket-client

It might be that the server does not have port 8182 open. For less wild guesses, I really need more information (client stack trace, server logs, etc.).
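
For a first check that the port is reachable at all (before worrying about the websocket handshake), a plain TCP connect attempt suffices. A minimal sketch, where "my-server" is a placeholder for your actual hostname:

import java.net.InetSocketAddress;
import java.net.Socket;

public class PortCheck {
    public static void main(String[] args) throws Exception {
        try (Socket socket = new Socket()) {
            // "my-server" is a placeholder; 3000 ms connect timeout
            socket.connect(new InetSocketAddress("my-server", 8182), 3000);
            System.out.println("port 8182 is reachable");
        }
    }
}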

Best wishes,     Marc
On Tuesday, January 5, 2021 at 12:38:27 PM UTC+1 ya...@... wrote:

I'm using the Docker version and I haven't changed anything. Where do I find the yaml config? I've tested it on two PCs (Ubuntu and Windows) and it works without a problem. It only fails on a server (Ubuntu).

On Monday, January 4, 2021 at 5:37:37 PM UTC+1 HadoopMarc wrote:
Hi,

Can you show the Gremlin Server yaml config file (part of JanusGraph) as well as the groovy script file it refers to? The groovy script should define the GraphTraversalSource g that you want to bind to.

Best wishes,     Marc

On Sunday, January 3, 2021 at 11:56:20 PM UTC+1 ya...@... wrote:
Hi, I have JanusGraph with Scylla and Elasticsearch in Docker. I'm connecting to JanusGraph from my backend using Gremlin.

I see this error only on my server and not on my local machine. What does it mean and, more importantly, how do I fix it? I'm a noob when it comes to backend and database stuff, so please be kind. Thank you.



Re: Could not alias [g] to [g] as [g]

Yamiteru XYZ <yamit...@...>
 

I'm using the Docker version and I haven't changed anything. Where do I find the yaml config? I've tested it on two PCs (Ubuntu and Windows) and it works without a problem. It only fails on a server (Ubuntu).


On Monday, January 4, 2021 at 5:37:37 PM UTC+1 HadoopMarc wrote:
Hi,

Can you show the Gremlin Server yaml config file (part of JanusGraph) as well as the groovy script file it refers to? The groovy script should define the GraphTraversalSource g that you want to bind to.

Best wishes,     Marc

On Sunday, January 3, 2021 at 11:56:20 PM UTC+1 ya...@... wrote:
Hi, I have JanusGraph with Scylla and Elasticsearch in Docker. I'm connecting to JanusGraph from my backend using Gremlin.

I see this error only on my server and not on my local machine. What does it mean and, more importantly, how do I fix it? I'm a noob when it comes to backend and database stuff, so please be kind. Thank you.



docker base tests for scylladb stopped working

Israel Fruchter <fr...@...>
 

Recently the Docker-based tests for ScyllaDB stopped working.
The last commit confirmed to be working (in our CI) was c97e84ef401d5a17c4c0b37c1af5fdad06db06fd.

How can I figure out what the issue is?

$ java -version
java version "1.8.0_201"
Java(TM) SE Runtime Environment (build 1.8.0_201-b09)
Java HotSpot(TM) 64-Bit Server VM (build 25.201-b09, mixed mode)

$ mvn clean install -pl janusgraph-cql -Pscylladb

[ERROR] Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 0.085 s <<< FAILURE! - in org.janusgraph.diskstorage.cql.CQLDistributedStoreManagerTest                                              
[ERROR] org.janusgraph.diskstorage.cql.CQLDistributedStoreManagerTest  Time elapsed: 0.085 s  <<< ERROR!                                                                                                     
java.lang.ExceptionInInitializerError                                                                                                                                                                        
        at org.janusgraph.diskstorage.cql.CQLDistributedStoreManagerTest.<clinit>(CQLDistributedStoreManagerTest.java:37)


Re: Could not alias [g] to [g] as [g]

HadoopMarc <bi...@...>
 

Hi,

Can you show the Gremlin Server yaml config file (part of JanusGraph) as well as the groovy script file it refers to? The groovy script should define the GraphTraversalSource g that you want to bind to.
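
For reference, the binding typically happens in the init script listed in the yaml (e.g. scripts/empty-sample.groovy in the standard distribution); a minimal sketch of the relevant lines, assuming the graph is configured under the name "graph" in the yaml's graphs map:

// sketch of the server-side init script; this binding is what makes "g"
// available to remote clients under that alias
def globals = [:]
globals << [g : graph.traversal()]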

Best wishes,     Marc

On Sunday, January 3, 2021 at 11:56:20 PM UTC+1 ya...@... wrote:

Hi, I have JanusGraph with Scylla and Elasticsearch in Docker. I'm connecting to JanusGraph from my backend using Gremlin.

I see this error only on my server and not on my local machine. What does it mean and, more importantly, how do I fix it? I'm a noob when it comes to backend and database stuff, so please be kind. Thank you.



Re: Database Level Caching

Nicolas Trangosi <nicolas...@...>
 

Hi Boxuan,

I have configured janusgraph with:

cache.db-cache-time: 600000  
cache.db-cache: true  
cache.db-cache-size: 50000000  
index.search.elasticsearch.create.ext.number_of_replicas: 0
storage.buffer-size: 1024
index.search.elasticsearch.create.ext.number_of_shards: 1
cache.cache.db-cache-time: 0
index.search.index-name: dcbrain
index.search.backend: elasticsearch
storage.port: 9042
ids.block-size: 1000000
schema.default: logging
storage.cql.batch-statement-size: 50
index.search.hostname: dfe-elasticsearch
storage.backend: cql
storage.hostname: dfe-cassandra
storage.cql.local-max-requests-per-connection: 4096
index.search.port: 9200


I loaded some data into the graph and dumped the heap.
When I open this dump with jvisualVM, the retained size of ExpirationKCVSCache is 257 MB, while the limit should be 50 MB.
[image: jvisualVM screenshot of the retained sizes]

Regards,
Nicolas

On Mon, Jan 4, 2021 at 1:11 PM BO XUAN LI <libo...@...> wrote:
Hi Nicolas,

Can you provide your configurations and the memory usage you observed?

Regards,
Boxuan

On Jan 4, 2021, at 3:44 PM, Nicolas Trangosi <nicolas...@...> wrote:

Hi,
I am trying to use Database Level Caching as described in https://docs.janusgraph.org/basics/cache/, but it seems to use more memory than the configured threshold (cache.db-cache-size). Does anyone use this feature? Is it production-ready?

Regards,
Nicolas








Re: Database Level Caching

BO XUAN LI <libo...@...>
 

Hi Nicolas,

Can you provide your configurations and the memory usage you observed?

Regards,
Boxuan

On Jan 4, 2021, at 3:44 PM, Nicolas Trangosi <nicolas...@...> wrote:

Hi,
I am trying to use Database Level Caching as described in https://docs.janusgraph.org/basics/cache/, but it seems to use more memory than the configured threshold (cache.db-cache-size). Does anyone use this feature? Is it production-ready?

Regards,
Nicolas





How to see the ns (name space) values of janus graph

Arpan Jain <arpan...@...>
 

I have RDF data in Turtle or XML format. The vertex property names are in the form of URLs, so when I upload that data to JanusGraph it converts those URL values to ns1, ns2, and so on for each unique URL.

My first question is how JanusGraph does that, and how I can get those ns values.

Second, I need to define my own schema for JanusGraph, because I need to upload bulk data, and for bulk uploads manual schema creation is required for fast upload speed. How can I create a schema for those URLs? During schema creation we have to declare the vertex properties, and in my case the properties are URLs, so what is the best way to achieve this?
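
To illustrate the schema part: a JanusGraph property key can be named after the full predicate URL, so a manually created schema can declare the URLs directly. A minimal sketch (the URI and index name here are hypothetical):

import org.apache.tinkerpop.gremlin.structure.Vertex;
import org.janusgraph.core.JanusGraph;
import org.janusgraph.core.PropertyKey;
import org.janusgraph.core.schema.JanusGraphManagement;

public class UriSchemaExample {
    public static void defineSchema(JanusGraph graph) {
        JanusGraphManagement m = graph.openManagement();
        // the full predicate URI is used verbatim as the property key name
        PropertyKey label = m.makePropertyKey("http://example.org/ontology#label")
                .dataType(String.class).make();
        m.buildIndex("byExampleLabel", Vertex.class).addKey(label).buildCompositeIndex();
        m.commit();
    }
}

Queries then use the same URL string as the key, e.g. g.V().has("http://example.org/ontology#label", "foo").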


Database Level Caching

Nicolas Trangosi <nicolas...@...>
 

Hi,
I am trying to use Database Level Caching as described in https://docs.janusgraph.org/basics/cache/, but it seems to use more memory than the configured threshold (cache.db-cache-size). Does anyone use this feature? Is it production-ready?

Regards,
Nicolas




Re: Failed to find all paths between 2 vertices upon the graph having 100 million vertices and 100 million edges using SparkGraphComputer

Roy Yu <7604...@...>
 

Hi Marc

My graph has 100 million edges, not 100 edges. Sorry for my mistake. From your advice I think I need to do two things. Firstly, I need to dig into ConnectedComponentVertexProgram and work out how to write my own VertexProgram. Secondly, I need to implement the path-finding logic of the VertexProgram, about which I have no idea, as the number of paths between 2 vertices on a graph containing 100 million edges could easily explode, and I have no memory or even disk to store all the results. Could you give your solution in detail?

On Saturday, January 2, 2021 at 6:21:06 PM UTC+8 HadoopMarc wrote:
Hi Roy,

Nice to see you back here, still going strong!

I guess the TraversalVertexProgram used for OLAP traversals is not well suited to your use case. You must realize that 200 stages in an OLAP traversal is fairly extreme. I assume your edge count is 100 million and not 100. So, the number of paths between two vertices could easily explode and the storage of the associated java objects (Traversers in the stacktrace) could grow beyond 80 GB.

It would be relatively easy to write your own VertexProgram for this simple traversal (you can take the ConnectedComponentVertexProgram as an example). See also the explanation in the corresponding recipe. This will give you far more control over data structures and their memory usage.

Best wishes,    Marc

On Saturday, January 2, 2021 at 6:53:08 AM UTC+1 Roy Yu wrote:
The graph has 100 million vertices and 100 edges
Graph data is saved at HBase Table: MyHBaseTable.

The size of MyHBaseTable is 16.2GB:
root@~$ hdfs dfs -du -h /apps/hbase/data/data/default/
16.2 G   32.4 G   /apps/hbase/data/data/default/MyHBaseTable

MyHBaseTable has 190 regions. The edge data (HBase column family e) of every region is less than 100 MB: one Spark task processes one region, so in order to avoid Spark OOM while loading region data, I use HBaseAdmin to split HBase regions until the edge data (HBase column family e) of every region is less than 100 MB. Below, the size of region 077288f4be4c439443bb45b0c2369d5b is more than 100 MB because it also contains index data.
root@~$ hdfs dfs -du -h /apps/hbase/data/data/default/MyHBaseTable
3.8 K    7.6 K    /apps/hbase/data/data/default/MyHBaseTable/.tabledesc
0        0        /apps/hbase/data/data/default/MyHBaseTable/.tmp
78.3 M   156.7 M  /apps/hbase/data/data/default/MyHBaseTable/007e9dbf74f5d35862b68d6434f1d6f2
92.2 M   184.3 M  /apps/hbase/data/data/default/MyHBaseTable/077288f4be4c439443bb45b0c2369d5b
102.4 M  204.8 M  /apps/hbase/data/data/default/MyHBaseTable/0782782071e4a7f2d17800d4a0989a7f
50.6 M   101.3 M  /apps/hbase/data/data/default/MyHBaseTable/07e795022e56a969ede48c9c23fbbc7c
50.6 M   101.3 M  /apps/hbase/data/data/default/MyHBaseTable/084e54e61bbcfc2decd14dcbac55bc50
99.7 M   199.4 M  /apps/hbase/data/data/default/MyHBaseTable/0a85ae356b19c605d9a32b9bf513bcbb
431.3 M  862.6 M  /apps/hbase/data/data/default/MyHBaseTable/0b024c812acfa6efaa40e1cca232e192
5.0 K    10.1 K   /apps/hbase/data/data/default/MyHBaseTable/0c2d8e3a6daaa8ab30c399783e343890
...


the properties of the graph:
gremlin.graph=org.janusgraph.core.JanusGraphFactory
cluster.max-partitions = 16
storage.backend=hbase
storage.hbase.table=MyHBaseTable
storage.hbase.ext.zookeeper.znode.parent=/hbase-unsecure
schema.default=none

storage.hostname=master001,master002,master003
storage.port=2181
storage.hbase.region-count=64
storage.write-time=1000000
storage.read-time=100000

ids.block-size=200000
ids.renew-timeout=600000
ids.renew-percentage=0.4
ids.authority.conflict-avoidance-mode=GLOBAL_AUTO

index.search.backend=elasticsearch
index.search.hostname=es001,es002,es003
index.search.elasticsearch.create.ext.index.number_of_shards=15
index.search.elasticsearch.create.ext.index.refresh_interval=-1
index.search.elasticsearch.create.ext.index.translog.sync_interval=5000s
index.search.elasticsearch.create.ext.index.translog.durability=async
index.search.elasticsearch.create.ext.index.number_of_replicas=0
index.search.elasticsearch.create.ext.index.shard.check_on_startup=false


the schema of the graph:
def defineSchema(graph) {
    m = graph.openManagement()

        node = m.makeVertexLabel("node").make()

        relation = m.makeEdgeLabel("relation").make()
        obj_type_value = m.makePropertyKey("obj_type_value").dataType(String.class).make()

    // edge props
        start_time = m.makePropertyKey("start_time").dataType(Date.class).make()
        end_time = m.makePropertyKey("end_time").dataType(Date.class).make()
        count = m.makePropertyKey("count").dataType(Integer.class).make()
        rel_type = m.makePropertyKey("rel_type").dataType(String.class).make()
    //index
        m.buildIndex("MyHBaseTable_obj_type_value_Index", Vertex.class).addKey(obj_type_value).unique().buildCompositeIndex()
        m.buildIndex("MyHBaseTable_rel_type_index", Edge.class).addKey(rel_type).buildCompositeIndex()
        m.buildIndex("MyHBaseTable_count_index", Edge.class).addKey(count).buildMixedIndex("search")
        m.buildIndex("MyHBaseTable_start_time_index", Edge.class).addKey(start_time).buildMixedIndex("search")
        m.buildIndex("MyHBaseTable_end_time_index", Edge.class).addKey(end_time).buildMixedIndex("search")
    m.commit()
}

the Gremlin I use to find all paths between 2 vertices:

import org.apache.tinkerpop.gremlin.spark.process.computer.SparkGraphComputer;
import org.apache.tinkerpop.gremlin.process.traversal.dsl.graph.__;
import org.apache.tinkerpop.gremlin.process.traversal.P;
def executeScript(graph){
    traversal = graph.traversal().withComputer(SparkGraphComputer.class);
    return traversal.V(624453904).repeat(__.both().simplePath()).until(__.hasId(192204064).or().loops().is(200)).hasId(192204064).path().dedup().limit(1000).toList()
    //return traversal.V().where(__.outE().count().is(P.gte(50000))).id().toList()
};

The OLAP spark graph conf:
gremlin.graph=org.apache.tinkerpop.gremlin.hadoop.structure.HadoopGraph
gremlin.hadoop.graphReader=org.janusgraph.hadoop.formats.hbase.HBaseInputFormat
gremlin.hadoop.graphWriter=org.apache.tinkerpop.gremlin.hadoop.structure.io.gryo.GryoOutputFormat

gremlin.hadoop.jarsInDistributedCache=true
gremlin.hadoop.inputLocation=none
gremlin.hadoop.outputLocation=output
gremlin.spark.graphStorageLevel=DISK_ONLY
gremlin.spark.persistStorageLevel=DISK_ONLY
####################################
# JanusGraph HBase InputFormat configuration
####################################
janusgraphmr.ioformat.conf.storage.backend=hbase
janusgraphmr.ioformat.conf.storage.hostname=master002,master003,master001
janusgraphmr.ioformat.conf.storage.hbase.table=MyHBaseTable
janusgraphmr.ioformat.conf.storage.hbase.ext.zookeeper.znode.parent=/hbase-unsecure

####################################
# SparkGraphComputer Configuration #
####################################
spark.master=yarn
spark.submit.deployMode=client
spark.yarn.jars=hdfs://GRAPHOLAP/user/spark/jars/*.jar

# the Spark YARN ApplicationManager needs this to resolve classpath it sends to the executors
spark.yarn.appMasterEnv.JAVA_HOME=/usr/local/jdk1.8.0_191/
spark.yarn.appMasterEnv.HADOOP_CONF_DIR=/usr/hdp/3.1.4.0-315/hadoop/conf
spark.yarn.am.extraJavaOptions=-Diop.version=3.1.4.0-315 -Djava.library.path=/usr/hdp/current/hadoop-client/lib/native
spark.executor.memoryOverhead=5G
spark.driver.extraJavaOptions=-Diop.version=3.1.4.0-315 -Djava.library.path=/usr/hdp/current/hadoop-client/lib/native

# the Spark Executors (on the work nodes) needs this to resolve classpath to run Spark tasks
spark.executorEnv.JAVA_HOME=/usr/local/jdk1.8.0_191/
#spark.executorEnv.HADOOP_CONF_DIR=/usr/hdp/3.1.4.0-315/hadoop/conf
spark.executor.extraJavaOptions=-XX:+UseG1GC -XX:MaxGCPauseMillis=500 -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:/mnt/data_1/log/spark2/gc-spark%p.log
spark.executor.cores=1
spark.executor.memory=80G
spark.executor.instances=3
spark.executor.extraClassPath=/etc/hadoop/conf:/usr/spark/jars:/usr/hdp/current/hbase-client/lib:/usr/janusgraph/0.4.0/lib

spark.serializer=org.apache.spark.serializer.KryoSerializer
spark.network.timeout=1000000
spark.rpc.askTimeout=1000000
spark.shuffle.service.enabled=true
spark.shuffle.service.port=7447
spark.maxRemoteBlockSizeFetchToMem=10485760
spark.memory.useLegacyMode=true
spark.shuffle.memoryFraction=0.1
spark.storage.memoryFraction=0.1
spark.memory.fraction=0.1
spark.memory.storageFraction=0.1
spark.shuffle.accurateBlockThreshold=1048576


The spark job failed at stage 50 :
20/12/30 01:53:00 ERROR executor.Executor: Exception in task 40.0 in stage 50.0 (TID 192084)
java.lang.OutOfMemoryError: Java heap space
        at sun.reflect.generics.repository.ClassRepository.getSuperInterfaces(ClassRepository.java:114)
        at java.lang.Class.getGenericInterfaces(Class.java:913)
        at java.util.HashMap.comparableClassFor(HashMap.java:351)
        at java.util.HashMap$TreeNode.treeify(HashMap.java:1932)
        at java.util.HashMap.treeifyBin(HashMap.java:772)
        at java.util.HashMap.putVal(HashMap.java:644)
        at java.util.HashMap.put(HashMap.java:612)
        at java.util.Collections$SynchronizedMap.put(Collections.java:2588)
        at org.apache.tinkerpop.gremlin.process.traversal.traverser.util.TraverserSet.add(TraverserSet.java:90)
        at org.apache.tinkerpop.gremlin.process.computer.traversal.WorkerExecutor.lambda$drainStep$4(WorkerExecutor.java:232)
        at org.apache.tinkerpop.gremlin.process.computer.traversal.WorkerExecutor$$Lambda$86/877696627.accept(Unknown Source)
        at java.util.Iterator.forEachRemaining(Iterator.java:116)
        at org.apache.tinkerpop.gremlin.process.computer.traversal.WorkerExecutor.drainStep(WorkerExecutor.java:221)
        at org.apache.tinkerpop.gremlin.process.computer.traversal.WorkerExecutor.execute(WorkerExecutor.java:151)
        at org.apache.tinkerpop.gremlin.process.computer.traversal.TraversalVertexProgram.execute(TraversalVertexProgram.java:307)
        at org.apache.tinkerpop.gremlin.spark.process.computer.SparkExecutor.lambda$null$4(SparkExecutor.java:118)
        at org.apache.tinkerpop.gremlin.spark.process.computer.SparkExecutor$$Lambda$72/1209554928.apply(Unknown Source)
        at org.apache.tinkerpop.gremlin.util.iterator.IteratorUtils$3.next(IteratorUtils.java:247)
        at scala.collection.convert.Wrappers$JIteratorWrapper.next(Wrappers.scala:43)
        at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:463)
        at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)


From the log it seems there is so much data that even the 80 GB executor heap is not enough.
Can anybody help me? Does anybody have an idea how to find all the paths between 2 vertices on a large graph?


Re: Slow convert to Java object

BO XUAN LI <libo...@...>
 

On Dec 31, 2020, at 7:29 PM, Maxim Milovanov <milov...@...> wrote:

Hi!

I am trying to write a select query, and performance is very slow when I call the toList() method.
How can I improve this performance?


My example:

long start;
long end;

try (GraphTraversalSource g = graph.traversal()) {
    start = System.currentTimeMillis();
    GraphTraversal<Vertex, Vertex> list2 = g.V().hasLabel("Entity");
    end = System.currentTimeMillis();
    System.out.printf("getGremlinTime: %d ms\n", (end - start));

    start = end;
    List<Vertex> res2 = list2.toList();
    end = System.currentTimeMillis();
    System.out.printf("toList: %d ms\n gremlin count: %d\n", (end - start), res2.size());
}


Debug log:

getGremlinTime: 13 ms
2020-12-31 14:19:01.336  WARN 14144 --- [           main] o.j.g.transaction.StandardJanusGraphTx   : Query requires iterating over all vertices [(~label = Entity)]. For better performance, use indexes
toList: 12025 ms
 gremlin count: 105  



Re: How to interact with Janusgraph using NodeJs

Yamiteru XYZ <yamit...@...>
 

How did you fix it?


On Monday, October 29, 2018 at 11:51:46 AM UTC+1 m...@... wrote:
I checked my gremlin-server log; there was a silly mistake in my configuration file due to which Gremlin Server was not starting properly. Fixed it :)


On Monday, October 29, 2018 at 3:02:31 PM UTC+5:30, m...@... wrote:
Hi,

I am new to JanusGraph and want to try interacting with it using Node.js. Do I need to configure my Gremlin Server to allow remote connections? I am not sure how exactly that is to be done. Please help.

I get the following error when I try to run my Node.js code:

const gremlin = require('gremlin');
const Graph = gremlin.structure.Graph;
const DriverRemoteConnection = gremlin.driver.DriverRemoteConnection;
const graph = new Graph();
const g = graph.traversal().withRemote(new DriverRemoteConnection('ws://localhost:8182/gremlin'));
g.V().has('lead', 'leadLeadId', '30087713').values('leadLeadId').toList().then(function(res){console.log(res)},function(err){console.log('Hi');console.log(err)});
Promise {
  <pending>,
  domain:
   Domain {
     domain: null,
     _events: { error: [Function: debugDomainError] },
     _eventsCount: 1,
     _maxListeners: undefined,
     members: [] } }
> Hi
Error: Server error: The traversal source [g] for alias [g] is not configured on the server. (499)
    at DriverRemoteConnection._handleMessage (/home/mohit/node_modules/gremlin/lib/driver/driver-remote-connection.js:180:9)
    at WebSocket.DriverRemoteConnection._ws.on.data (/home/mohit/node_modules/gremlin/lib/driver/driver-remote-connection.js:72:41)
    at emitOne (events.js:116:13)
    at WebSocket.emit (events.js:211:7)
    at Receiver._receiver.onmessage (/home/mohit/node_modules/gremlin/node_modules/ws/lib/WebSocket.js:141:47)
    at Receiver.dataMessage (/home/mohit/node_modules/gremlin/node_modules/ws/lib/Receiver.js:380:14)
    at Receiver.getData (/home/mohit/node_modules/gremlin/node_modules/ws/lib/Receiver.js:330:12)
    at Receiver.startLoop (/home/mohit/node_modules/gremlin/node_modules/ws/lib/Receiver.js:165:16)
    at Receiver.add (/home/mohit/node_modules/gremlin/node_modules/ws/lib/Receiver.js:139:10)
    at Socket._ultron.on (/home/mohit/node_modules/gremlin/node_modules/ws/lib/WebSocket.js:138:22)




Could not alias [g] to [g] as [g]

Yamiteru XYZ <yamit...@...>
 

Hi, I have JanusGraph with Scylla and Elasticsearch in Docker. I'm connecting to JanusGraph from my backend using Gremlin.

I see this error only on my server and not on my local machine. What does it mean and, more importantly, how do I fix it? I'm a noob when it comes to backend and database stuff, so please be kind. Thank you.



Re: Failed to find all paths between 2 vertices upon the graph having 100 million vertices and 100 million edges using SparkGraphComputer

HadoopMarc <bi...@...>
 

Hi Roy,

Nice to see you back here, still going strong!

I guess the TraversalVertexProgram used for OLAP traversals is not well suited to your use case. You must realize that 200 stages in an OLAP traversal is fairly extreme. I assume your edge count is 100 million and not 100. So, the number of paths between two vertices could easily explode and the storage of the associated java objects (Traversers in the stacktrace) could grow beyond 80 GB.

It would be relatively easy to write your own VertexProgram for this simple traversal (you can take the ConnectedComponentVertexProgram as an example). See also the explanation in the corresponding recipe. This will give you far more control over data structures and their memory usage.
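
To make that concrete, below is a minimal skeleton of such a custom VertexProgram. It is only a sketch, not a full solution: instead of materialising every path the way TraversalVertexProgram does, it propagates a bounded hop counter from the source vertex, so the state per vertex stays constant. The source id 624453904 and the 200-iteration bound are taken from the traversal in this thread.

import java.util.Collections;
import java.util.Iterator;
import java.util.Set;

import org.apache.tinkerpop.gremlin.process.computer.GraphComputer;
import org.apache.tinkerpop.gremlin.process.computer.Memory;
import org.apache.tinkerpop.gremlin.process.computer.MessageScope;
import org.apache.tinkerpop.gremlin.process.computer.Messenger;
import org.apache.tinkerpop.gremlin.process.computer.VertexComputeKey;
import org.apache.tinkerpop.gremlin.process.computer.VertexProgram;
import org.apache.tinkerpop.gremlin.process.traversal.dsl.graph.__;
import org.apache.tinkerpop.gremlin.structure.Vertex;

// Sketch: per-vertex state is a single hop counter, so memory per vertex stays
// constant, unlike TraversalVertexProgram which accumulates Traverser objects.
public class BoundedBfsVertexProgram implements VertexProgram<Integer> {

    public static final String HOPS = "bfs.hops";       // per-vertex compute key
    private static final Object SOURCE_ID = 624453904L; // source vertex from the post
    private static final int MAX_HOPS = 200;            // loop bound from the post

    private final MessageScope.Local<Integer> scope = MessageScope.Local.of(__::bothE);

    @Override
    public void setup(final Memory memory) {
    }

    @Override
    public void execute(final Vertex vertex, final Messenger<Integer> messenger, final Memory memory) {
        if (memory.isInitialIteration()) {
            if (SOURCE_ID.equals(vertex.id())) {
                vertex.property(HOPS, 0);
                messenger.sendMessage(scope, 1);
            }
        } else if (!vertex.property(HOPS).isPresent()) {
            // keep only the smallest hop count received; everything else is dropped
            Integer min = Integer.MAX_VALUE;
            final Iterator<Integer> messages = messenger.receiveMessages();
            while (messages.hasNext()) {
                min = Math.min(min, messages.next());
            }
            if (min < Integer.MAX_VALUE) {
                vertex.property(HOPS, min);
                messenger.sendMessage(scope, min + 1);
            }
        }
    }

    @Override
    public boolean terminate(final Memory memory) {
        // a production version would also halt early once no messages were sent
        return memory.getIteration() >= MAX_HOPS;
    }

    @Override
    public Set<MessageScope> getMessageScopes(final Memory memory) {
        return Collections.singleton(scope);
    }

    @Override
    public Set<VertexComputeKey> getVertexComputeKeys() {
        return Collections.singleton(VertexComputeKey.of(HOPS, false));
    }

    @Override
    public GraphComputer.ResultGraph getPreferredResultGraph() {
        return GraphComputer.ResultGraph.NEW;
    }

    @Override
    public GraphComputer.Persist getPreferredPersist() {
        return GraphComputer.Persist.VERTEX_PROPERTIES;
    }

    @Override
    public BoundedBfsVertexProgram clone() {
        return this; // stateless apart from final fields, safe to share
    }
}

It would be submitted along the lines of graph.compute(SparkGraphComputer.class).program(new BoundedBfsVertexProgram()).submit().get().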

Best wishes,    Marc

On Saturday, January 2, 2021 at 6:53:08 AM UTC+1 Roy Yu wrote:

The graph has 100 million vertices and 100 edges
Graph data is saved at HBase Table: MyHBaseTable.

The size of MyHBaseTable is 16.2GB:
root@~$ hdfs dfs -du -h /apps/hbase/data/data/default/
16.2 G   32.4 G   /apps/hbase/data/data/default/MyHBaseTable

MyHBaseTable has 190 regions. The edge data (HBase column family e) of every region is less than 100 MB: one Spark task processes one region, so in order to avoid Spark OOM while loading region data, I use HBaseAdmin to split HBase regions until the edge data (HBase column family e) of every region is less than 100 MB. Below, the size of region 077288f4be4c439443bb45b0c2369d5b is more than 100 MB because it also contains index data.
root@~$ hdfs dfs -du -h /apps/hbase/data/data/default/MyHBaseTable
3.8 K    7.6 K    /apps/hbase/data/data/default/MyHBaseTable/.tabledesc
0        0        /apps/hbase/data/data/default/MyHBaseTable/.tmp
78.3 M   156.7 M  /apps/hbase/data/data/default/MyHBaseTable/007e9dbf74f5d35862b68d6434f1d6f2
92.2 M   184.3 M  /apps/hbase/data/data/default/MyHBaseTable/077288f4be4c439443bb45b0c2369d5b
102.4 M  204.8 M  /apps/hbase/data/data/default/MyHBaseTable/0782782071e4a7f2d17800d4a0989a7f
50.6 M   101.3 M  /apps/hbase/data/data/default/MyHBaseTable/07e795022e56a969ede48c9c23fbbc7c
50.6 M   101.3 M  /apps/hbase/data/data/default/MyHBaseTable/084e54e61bbcfc2decd14dcbac55bc50
99.7 M   199.4 M  /apps/hbase/data/data/default/MyHBaseTable/0a85ae356b19c605d9a32b9bf513bcbb
431.3 M  862.6 M  /apps/hbase/data/data/default/MyHBaseTable/0b024c812acfa6efaa40e1cca232e192
5.0 K    10.1 K   /apps/hbase/data/data/default/MyHBaseTable/0c2d8e3a6daaa8ab30c399783e343890
...


the properties of the graph:
gremlin.graph=org.janusgraph.core.JanusGraphFactory
cluster.max-partitions = 16
storage.backend=hbase
storage.hbase.table=MyHBaseTable
storage.hbase.ext.zookeeper.znode.parent=/hbase-unsecure
schema.default=none

storage.hostname=master001,master002,master003
storage.port=2181
storage.hbase.region-count=64
storage.write-time=1000000
storage.read-time=100000

ids.block-size=200000
ids.renew-timeout=600000
ids.renew-percentage=0.4
ids.authority.conflict-avoidance-mode=GLOBAL_AUTO

index.search.backend=elasticsearch
index.search.hostname=es001,es002,es003
index.search.elasticsearch.create.ext.index.number_of_shards=15
index.search.elasticsearch.create.ext.index.refresh_interval=-1
index.search.elasticsearch.create.ext.index.translog.sync_interval=5000s
index.search.elasticsearch.create.ext.index.translog.durability=async
index.search.elasticsearch.create.ext.index.number_of_replicas=0
index.search.elasticsearch.create.ext.index.shard.check_on_startup=false


the schema of the graph:
def defineSchema(graph) {
    m = graph.openManagement()

        node = m.makeVertexLabel("node").make()

        relation = m.makeEdgeLabel("relation").make()
        obj_type_value = m.makePropertyKey("obj_type_value").dataType(String.class).make()

    // edge props
        start_time = m.makePropertyKey("start_time").dataType(Date.class).make()
        end_time = m.makePropertyKey("end_time").dataType(Date.class).make()
        count = m.makePropertyKey("count").dataType(Integer.class).make()
        rel_type = m.makePropertyKey("rel_type").dataType(String.class).make()
    //index
        m.buildIndex("MyHBaseTable_obj_type_value_Index", Vertex.class).addKey(obj_type_value).unique().buildCompositeIndex()
        m.buildIndex("MyHBaseTable_rel_type_index", Edge.class).addKey(rel_type).buildCompositeIndex()
        m.buildIndex("MyHBaseTable_count_index", Edge.class).addKey(count).buildMixedIndex("search")
        m.buildIndex("MyHBaseTable_start_time_index", Edge.class).addKey(start_time).buildMixedIndex("search")
        m.buildIndex("MyHBaseTable_end_time_index", Edge.class).addKey(end_time).buildMixedIndex("search")
    m.commit()
}

the Gremlin I use to find all paths between 2 vertices:

import org.apache.tinkerpop.gremlin.spark.process.computer.SparkGraphComputer;
import org.apache.tinkerpop.gremlin.process.traversal.dsl.graph.__;
import org.apache.tinkerpop.gremlin.process.traversal.P;
def executeScript(graph){
    traversal = graph.traversal().withComputer(SparkGraphComputer.class);
    return traversal.V(624453904).repeat(__.both().simplePath()).until(__.hasId(192204064).or().loops().is(200)).hasId(192204064).path().dedup().limit(1000).toList()
    //return traversal.V().where(__.outE().count().is(P.gte(50000))).id().toList()
};

The OLAP spark graph conf:
gremlin.graph=org.apache.tinkerpop.gremlin.hadoop.structure.HadoopGraph
gremlin.hadoop.graphReader=org.janusgraph.hadoop.formats.hbase.HBaseInputFormat
gremlin.hadoop.graphWriter=org.apache.tinkerpop.gremlin.hadoop.structure.io.gryo.GryoOutputFormat

gremlin.hadoop.jarsInDistributedCache=true
gremlin.hadoop.inputLocation=none
gremlin.hadoop.outputLocation=output
gremlin.spark.graphStorageLevel=DISK_ONLY
gremlin.spark.persistStorageLevel=DISK_ONLY
####################################
# JanusGraph HBase InputFormat configuration
####################################
janusgraphmr.ioformat.conf.storage.backend=hbase
janusgraphmr.ioformat.conf.storage.hostname=master002,master003,master001
janusgraphmr.ioformat.conf.storage.hbase.table=MyHBaseTable
janusgraphmr.ioformat.conf.storage.hbase.ext.zookeeper.znode.parent=/hbase-unsecure

####################################
# SparkGraphComputer Configuration #
####################################
spark.master=yarn
spark.submit.deployMode=client
spark.yarn.jars=hdfs://GRAPHOLAP/user/spark/jars/*.jar

# the Spark YARN ApplicationManager needs this to resolve classpath it sends to the executors
spark.yarn.appMasterEnv.JAVA_HOME=/usr/local/jdk1.8.0_191/
spark.yarn.appMasterEnv.HADOOP_CONF_DIR=/usr/hdp/3.1.4.0-315/hadoop/conf
spark.yarn.am.extraJavaOptions=-Diop.version=3.1.4.0-315 -Djava.library.path=/usr/hdp/current/hadoop-client/lib/native
spark.executor.memoryOverhead=5G
spark.driver.extraJavaOptions=-Diop.version=3.1.4.0-315 -Djava.library.path=/usr/hdp/current/hadoop-client/lib/native

# the Spark Executors (on the work nodes) needs this to resolve classpath to run Spark tasks
spark.executorEnv.JAVA_HOME=/usr/local/jdk1.8.0_191/
#spark.executorEnv.HADOOP_CONF_DIR=/usr/hdp/3.1.4.0-315/hadoop/conf
spark.executor.extraJavaOptions=-XX:+UseG1GC -XX:MaxGCPauseMillis=500 -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:/mnt/data_1/log/spark2/gc-spark%p.log
spark.executor.cores=1
spark.executor.memory=80G
spark.executor.instances=3
spark.executor.extraClassPath=/etc/hadoop/conf:/usr/spark/jars:/usr/hdp/current/hbase-client/lib:/usr/janusgraph/0.4.0/lib

spark.serializer=org.apache.spark.serializer.KryoSerializer
spark.network.timeout=1000000
spark.rpc.askTimeout=1000000
spark.shuffle.service.enabled=true
spark.shuffle.service.port=7447
spark.maxRemoteBlockSizeFetchToMem=10485760
spark.memory.useLegacyMode=true
spark.shuffle.memoryFraction=0.1
spark.storage.memoryFraction=0.1
spark.memory.fraction=0.1
spark.memory.storageFraction=0.1
spark.shuffle.accurateBlockThreshold=1048576


The spark job failed at stage 50 :
20/12/30 01:53:00 ERROR executor.Executor: Exception in task 40.0 in stage 50.0 (TID 192084)
java.lang.OutOfMemoryError: Java heap space
        at sun.reflect.generics.repository.ClassRepository.getSuperInterfaces(ClassRepository.java:114)
        at java.lang.Class.getGenericInterfaces(Class.java:913)
        at java.util.HashMap.comparableClassFor(HashMap.java:351)
        at java.util.HashMap$TreeNode.treeify(HashMap.java:1932)
        at java.util.HashMap.treeifyBin(HashMap.java:772)
        at java.util.HashMap.putVal(HashMap.java:644)
        at java.util.HashMap.put(HashMap.java:612)
        at java.util.Collections$SynchronizedMap.put(Collections.java:2588)
        at org.apache.tinkerpop.gremlin.process.traversal.traverser.util.TraverserSet.add(TraverserSet.java:90)
        at org.apache.tinkerpop.gremlin.process.computer.traversal.WorkerExecutor.lambda$drainStep$4(WorkerExecutor.java:232)
        at org.apache.tinkerpop.gremlin.process.computer.traversal.WorkerExecutor$$Lambda$86/877696627.accept(Unknown Source)
        at java.util.Iterator.forEachRemaining(Iterator.java:116)
        at org.apache.tinkerpop.gremlin.process.computer.traversal.WorkerExecutor.drainStep(WorkerExecutor.java:221)
        at org.apache.tinkerpop.gremlin.process.computer.traversal.WorkerExecutor.execute(WorkerExecutor.java:151)
        at org.apache.tinkerpop.gremlin.process.computer.traversal.TraversalVertexProgram.execute(TraversalVertexProgram.java:307)
        at org.apache.tinkerpop.gremlin.spark.process.computer.SparkExecutor.lambda$null$4(SparkExecutor.java:118)
        at org.apache.tinkerpop.gremlin.spark.process.computer.SparkExecutor$$Lambda$72/1209554928.apply(Unknown Source)
        at org.apache.tinkerpop.gremlin.util.iterator.IteratorUtils$3.next(IteratorUtils.java:247)
        at scala.collection.convert.Wrappers$JIteratorWrapper.next(Wrappers.scala:43)
        at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:463)
        at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)


From the log it seems there is so much data that even the 80 GB executor heap is not enough.
Can anybody help me? Does anybody have an idea how to find all the paths between 2 vertices on a large graph?


Failed to find all paths between 2 vertices upon the graph having 100 million vertices and 100 million edges using SparkGraphComputer

Roy Yu <7604...@...>
 

The graph has 100 million vertices and 100 edges
Graph data is saved at HBase Table: MyHBaseTable.

The size of MyHBaseTable is 16.2GB:
root@~$ hdfs dfs -du -h /apps/hbase/data/data/default/
16.2 G   32.4 G   /apps/hbase/data/data/default/MyHBaseTable

MyHBaseTable has 190 regions. The edge data (HBase column family e) of every region is less than 100 MB: one Spark task processes one region, so in order to avoid Spark OOM while loading region data, I use HBaseAdmin to split HBase regions until the edge data (HBase column family e) of every region is less than 100 MB. Below, the size of region 077288f4be4c439443bb45b0c2369d5b is more than 100 MB because it also contains index data.
root@~$ hdfs dfs -du -h /apps/hbase/data/data/default/MyHBaseTable
3.8 K    7.6 K    /apps/hbase/data/data/default/MyHBaseTable/.tabledesc
0        0        /apps/hbase/data/data/default/MyHBaseTable/.tmp
78.3 M   156.7 M  /apps/hbase/data/data/default/MyHBaseTable/007e9dbf74f5d35862b68d6434f1d6f2
92.2 M   184.3 M  /apps/hbase/data/data/default/MyHBaseTable/077288f4be4c439443bb45b0c2369d5b
102.4 M  204.8 M  /apps/hbase/data/data/default/MyHBaseTable/0782782071e4a7f2d17800d4a0989a7f
50.6 M   101.3 M  /apps/hbase/data/data/default/MyHBaseTable/07e795022e56a969ede48c9c23fbbc7c
50.6 M   101.3 M  /apps/hbase/data/data/default/MyHBaseTable/084e54e61bbcfc2decd14dcbac55bc50
99.7 M   199.4 M  /apps/hbase/data/data/default/MyHBaseTable/0a85ae356b19c605d9a32b9bf513bcbb
431.3 M  862.6 M  /apps/hbase/data/data/default/MyHBaseTable/0b024c812acfa6efaa40e1cca232e192
5.0 K    10.1 K   /apps/hbase/data/data/default/MyHBaseTable/0c2d8e3a6daaa8ab30c399783e343890
...


the properties of the graph:
gremlin.graph=org.janusgraph.core.JanusGraphFactory
cluster.max-partitions = 16
storage.backend=hbase
storage.hbase.table=MyHBaseTable
storage.hbase.ext.zookeeper.znode.parent=/hbase-unsecure
schema.default=none

storage.hostname=master001,master002,master003
storage.port=2181
storage.hbase.region-count=64
storage.write-time=1000000
storage.read-time=100000

ids.block-size=200000
ids.renew-timeout=600000
ids.renew-percentage=0.4
ids.authority.conflict-avoidance-mode=GLOBAL_AUTO

index.search.backend=elasticsearch
index.search.hostname=es001,es002,es003
index.search.elasticsearch.create.ext.index.number_of_shards=15
index.search.elasticsearch.create.ext.index.refresh_interval=-1
index.search.elasticsearch.create.ext.index.translog.sync_interval=5000s
index.search.elasticsearch.create.ext.index.translog.durability=async
index.search.elasticsearch.create.ext.index.number_of_replicas=0
index.search.elasticsearch.create.ext.index.shard.check_on_startup=false


the schema of the graph:
def defineSchema(graph) {
    m = graph.openManagement()

        node = m.makeVertexLabel("node").make()

        relation = m.makeEdgeLabel("relation").make()
        obj_type_value = m.makePropertyKey("obj_type_value").dataType(String.class).make()

    // edge props
        start_time = m.makePropertyKey("start_time").dataType(Date.class).make()
        end_time = m.makePropertyKey("end_time").dataType(Date.class).make()
        count = m.makePropertyKey("count").dataType(Integer.class).make()
        rel_type = m.makePropertyKey("rel_type").dataType(String.class).make()
    //index
        m.buildIndex("MyHBaseTable_obj_type_value_Index", Vertex.class).addKey(obj_type_value).unique().buildCompositeIndex()
        m.buildIndex("MyHBaseTable_rel_type_index", Edge.class).addKey(rel_type).buildCompositeIndex()
        m.buildIndex("MyHBaseTable_count_index", Edge.class).addKey(count).buildMixedIndex("search")
        m.buildIndex("MyHBaseTable_start_time_index", Edge.class).addKey(start_time).buildMixedIndex("search")
        m.buildIndex("MyHBaseTable_end_time_index", Edge.class).addKey(end_time).buildMixedIndex("search")
    m.commit()
}

the Gremlin I use to find all paths between 2 vertices:

import org.apache.tinkerpop.gremlin.spark.process.computer.SparkGraphComputer;
import org.apache.tinkerpop.gremlin.process.traversal.dsl.graph.__;
import org.apache.tinkerpop.gremlin.process.traversal.P;
def executeScript(graph){
    traversal = graph.traversal().withComputer(SparkGraphComputer.class);
    return traversal.V(624453904).repeat(__.both().simplePath()).until(__.hasId(192204064).or().loops().is(200)).hasId(192204064).path().dedup().limit(1000).toList()
    //return traversal.V().where(__.outE().count().is(P.gte(50000))).id().toList()
};

The OLAP spark graph conf:
gremlin.graph=org.apache.tinkerpop.gremlin.hadoop.structure.HadoopGraph
gremlin.hadoop.graphReader=org.janusgraph.hadoop.formats.hbase.HBaseInputFormat
gremlin.hadoop.graphWriter=org.apache.tinkerpop.gremlin.hadoop.structure.io.gryo.GryoOutputFormat

gremlin.hadoop.jarsInDistributedCache=true
gremlin.hadoop.inputLocation=none
gremlin.hadoop.outputLocation=output
gremlin.spark.graphStorageLevel=DISK_ONLY
gremlin.spark.persistStorageLevel=DISK_ONLY
####################################
# JanusGraph HBase InputFormat configuration
####################################
janusgraphmr.ioformat.conf.storage.backend=hbase
janusgraphmr.ioformat.conf.storage.hostname=master002,master003,master001
janusgraphmr.ioformat.conf.storage.hbase.table=MyHBaseTable
janusgraphmr.ioformat.conf.storage.hbase.ext.zookeeper.znode.parent=/hbase-unsecure

####################################
# SparkGraphComputer Configuration #
####################################
spark.master=yarn
spark.submit.deployMode=client
spark.yarn.jars=hdfs://GRAPHOLAP/user/spark/jars/*.jar

# the Spark YARN ApplicationManager needs this to resolve classpath it sends to the executors
spark.yarn.appMasterEnv.JAVA_HOME=/usr/local/jdk1.8.0_191/
spark.yarn.appMasterEnv.HADOOP_CONF_DIR=/usr/hdp/3.1.4.0-315/hadoop/conf
spark.yarn.am.extraJavaOptions=-Diop.version=3.1.4.0-315 -Djava.library.path=/usr/hdp/current/hadoop-client/lib/native
spark.executor.memoryOverhead=5G
spark.driver.extraJavaOptions=-Diop.version=3.1.4.0-315 -Djava.library.path=/usr/hdp/current/hadoop-client/lib/native

# the Spark Executors (on the work nodes) needs this to resolve classpath to run Spark tasks
spark.executorEnv.JAVA_HOME=/usr/local/jdk1.8.0_191/
#spark.executorEnv.HADOOP_CONF_DIR=/usr/hdp/3.1.4.0-315/hadoop/conf
spark.executor.extraJavaOptions=-XX:+UseG1GC -XX:MaxGCPauseMillis=500 -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:/mnt/data_1/log/spark2/gc-spark%p.log
spark.executor.cores=1
spark.executor.memory=80G
spark.executor.instances=3
spark.executor.extraClassPath=/etc/hadoop/conf:/usr/spark/jars:/usr/hdp/current/hbase-client/lib:/usr/janusgraph/0.4.0/lib

spark.serializer=org.apache.spark.serializer.KryoSerializer
spark.network.timeout=1000000
spark.rpc.askTimeout=1000000
spark.shuffle.service.enabled=true
spark.shuffle.service.port=7447
spark.maxRemoteBlockSizeFetchToMem=10485760
spark.memory.useLegacyMode=true
spark.shuffle.memoryFraction=0.1
spark.storage.memoryFraction=0.1
spark.memory.fraction=0.1
spark.memory.storageFraction=0.1
spark.shuffle.accurateBlockThreshold=1048576


The spark job failed at stage 50 :
20/12/30 01:53:00 ERROR executor.Executor: Exception in task 40.0 in stage 50.0 (TID 192084)
java.lang.OutOfMemoryError: Java heap space
        at sun.reflect.generics.repository.ClassRepository.getSuperInterfaces(ClassRepository.java:114)
        at java.lang.Class.getGenericInterfaces(Class.java:913)
        at java.util.HashMap.comparableClassFor(HashMap.java:351)
        at java.util.HashMap$TreeNode.treeify(HashMap.java:1932)
        at java.util.HashMap.treeifyBin(HashMap.java:772)
        at java.util.HashMap.putVal(HashMap.java:644)
        at java.util.HashMap.put(HashMap.java:612)
        at java.util.Collections$SynchronizedMap.put(Collections.java:2588)
        at org.apache.tinkerpop.gremlin.process.traversal.traverser.util.TraverserSet.add(TraverserSet.java:90)
        at org.apache.tinkerpop.gremlin.process.computer.traversal.WorkerExecutor.lambda$drainStep$4(WorkerExecutor.java:232)
        at org.apache.tinkerpop.gremlin.process.computer.traversal.WorkerExecutor$$Lambda$86/877696627.accept(Unknown Source)
        at java.util.Iterator.forEachRemaining(Iterator.java:116)
        at org.apache.tinkerpop.gremlin.process.computer.traversal.WorkerExecutor.drainStep(WorkerExecutor.java:221)
        at org.apache.tinkerpop.gremlin.process.computer.traversal.WorkerExecutor.execute(WorkerExecutor.java:151)
        at org.apache.tinkerpop.gremlin.process.computer.traversal.TraversalVertexProgram.execute(TraversalVertexProgram.java:307)
        at org.apache.tinkerpop.gremlin.spark.process.computer.SparkExecutor.lambda$null$4(SparkExecutor.java:118)
        at org.apache.tinkerpop.gremlin.spark.process.computer.SparkExecutor$$Lambda$72/1209554928.apply(Unknown Source)
        at org.apache.tinkerpop.gremlin.util.iterator.IteratorUtils$3.next(IteratorUtils.java:247)
        at scala.collection.convert.Wrappers$JIteratorWrapper.next(Wrappers.scala:43)
        at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:463)
        at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)


From the log it seems there is so much data that even the 80 GB executor heap is not enough.
Can anybody help me? Does anybody have an idea how to find all the paths between 2 vertices on a large graph?


Re: Running OLAP on HBase with SparkGraphComputer fails with Error Container killed by YARN for exceeding memory limits

Roy Yu <7604...@...>
 

Thanks  Evgenii  


On Tuesday, December 15, 2020 at 8:24:11 PM UTC+8 yevg...@... wrote:

Oh, I recall that we once tried to debug the same issue with JanusGraph-HBase; we had clear supernodes in the graph. No attempts at repartitioning were successful, including analyzing the code of SparkGraphComputer and tinkering with it to make it work for partitioned vertices. Apparently, using Cassandra (the latest 3.x version at the time) didn't lead to OOM, but it was noticeably slower than HBase when we used it with smaller graphs.

Best regards,
Evgenii Ignatev.

On 15.12.2020 07:07, Roy Yu wrote:
Thanks Marc

On Friday, December 11, 2020 at 3:40:25 PM UTC+8 HadoopMarc wrote:
Hi Roy,

I think I would first check whether the skew is absent if you count the rows reading the HBase table directly from Spark (so, without using JanusGraph), e.g.:
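
A sketch of such a direct count (a reconstruction of the missing snippet, assuming the table and ZooKeeper settings from the graph properties in this thread), counting rows per partition so skew shows up immediately:

import java.util.Collections;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableInputFormat;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class HBaseRowCount {
    public static void main(String[] args) {
        try (JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("hbase-row-count"))) {
            Configuration conf = HBaseConfiguration.create();
            conf.set(TableInputFormat.INPUT_TABLE, "MyHBaseTable");              // table from the post
            conf.set("hbase.zookeeper.quorum", "master001,master002,master003"); // hosts from the post
            conf.set("zookeeper.znode.parent", "/hbase-unsecure");
            JavaPairRDD<ImmutableBytesWritable, Result> rows = sc.newAPIHadoopRDD(
                    conf, TableInputFormat.class, ImmutableBytesWritable.class, Result.class);
            // one count per partition: a skewed distribution shows up directly
            rows.mapPartitions(it -> {
                long n = 0;
                while (it.hasNext()) { it.next(); n++; }
                return Collections.singletonList(n).iterator();
            }).collect().forEach(System.out::println);
        }
    }
}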


If this works all right, then you know that somehow in the JanusGraph HBaseInputFormat the mappers do not get the right key ranges to read from.

I also thought about the storage.hbase.region-count property of janusgraph-hbase. If you specified this as 40 while creating the graph, janusgraph-hbase would create many small regions that would be compacted by HBase later on. But maybe this creates a different structure in the row keys that can be leveraged by hbase.mapreduce.tableinput.mappers.per.region.

Best wishes,     Marc


Op woensdag 9 december 2020 om 17:16:35 UTC+1 schreef Roy Yu:
Hi Marc, 

The parameter hbase.mapreduce.tableinput.mappers.per.region is effective: I set it to 40, and there are now 40 tasks processing every region. But here comes the new problem: data skew. I use g.E().count() to count all the edges of the graph. While counting one region, one Spark task holds all 2.6 GB of data while the other 39 tasks hold no data, and the task failed again. I checked my data: there are some vertices which have more than 1 million incident edges. So I tried to solve this problem using vertex cut (https://docs.janusgraph.org/advanced-topics/partitioning/); my graph schema is something like [mgmt.makeVertexLabel('product').partition().make()]. But when I use MR to load data into the new graph, it takes more than 10 times as long as the attempt without partition(). From the HBase table detail page, I found the data loading process was busy reading data from and writing data to the first region; the first region became the hot spot. I guess it relates to vertex ids. Could you help me again?

On Tuesday, December 8, 2020 at 3:13:42 PM UTC+8 HadoopMarc wrote:
Hi Roy,

As I mentioned, I did not keep up with possibly new janusgraph-hbase features. From the HBase source, I see that HBase now has a "hbase.mapreduce.tableinput.mappers.per.region" config parameter.


It should not be too difficult to adapt the janusgraph HBaseInputFormat to leverage this feature (or maybe it even works without change???).
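
If it did work without change, passing the setting through would presumably be a one-line addition to the OLAP properties file, since properties of the Hadoop graph end up in the job configuration (an untested assumption):

# hypothetical pass-through of the HBase setting via the OLAP graph properties
hbase.mapreduce.tableinput.mappers.per.region=40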

Best wishes,

Marc

Op dinsdag 8 december 2020 om 04:21:19 UTC+1 schreef Roy Yu:
you seem to run on cloud infra that reduces your requested 40 Gb to 33 Gb (see https://databricks.com/session_na20/running-apache-spark-on-kubernetes-best-practices-and-pitfalls). Fact of life. 
---------------------
Sorry Marc, I misled you. The error message was generated when I set spark.executor.memory to 30G; when that failed, I increased spark.executor.memory to 40G, and it failed as well. I felt desperate and came here to ask for help.
On Tuesday, December 8, 2020 at 10:35:19 AM UTC+8 Roy Yu wrote:
Hi Marc

Thanks for your immediate response.
I've tried setting spark.yarn.executor.memoryOverhead=10G and re-running the task, and it still failed. From the Spark task UI, I saw that 80% of the processing time was Full GC time. As you said, the 2.6 GB (GZ compressed) region exploding is my root cause. Now I'm trying to reduce my region size to 1 GB; if that still fails, I'm going to configure the HBase HFiles to not use a compressed format.
This was my first time running JanusGraph OLAP, and I think this is a common problem, as an HBase region size of 2.6 GB (compressed) is not large; 20 GB is very common in our production. If the community does not solve this problem, the JanusGraph HBase-based OLAP solution cannot be adopted by other companies either.

On Tuesday, December 8, 2020 at 12:40:40 AM UTC+8 HadoopMarc wrote:
Hi Roy,

There seem to be three things bothering you here:
  1. you did not specify spark.yarn.executor.memoryOverhead, as the exception message says. Easily solved.
  2. you seem to run on cloud infra that reduces your requested 40 Gb to 33 Gb (see https://databricks.com/session_na20/running-apache-spark-on-kubernetes-best-practices-and-pitfalls). Fact of life.
  3. the janusgraph HBaseInputFormat uses entire HBase regions as hadoop partitions, which are fed into spark tasks. The 2.6 GB region size is for compressed binary data, which explodes when expanded into java objects. This is your real problem.
I did not follow the latest status of janusgraph-hbase features for the HBaseInputFormat, but you have to somehow use spark with smaller partitions than an entire HBase region.
A long time ago, I had success with skipping the HBaseInputFormat and have spark executors connect to JanusGraph themselves. That is not a quick solution, though.

Best wishes,

Marc

On Monday, December 7, 2020 at 2:10:55 PM UTC+1 Roy Yu wrote:
Error message:
ExecutorLostFailure (executor 1 exited caused by one of the running tasks) Reason: Container killed by YARN for exceeding memory limits. 33.1 GB of 33 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead or disabling yarn.nodemanager.vmem-check-enabled because of YARN-4714. 

 graph conifg:
spark.executor.extraJavaOptions=-XX:+UseG1GC -XX:MaxGCPauseMillis=500 -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:/mnt/data_1/log/spark2/gc-spark%p.log
spark.executor.cores=1
spark.executor.memory=40960m
spark.executor.instances=3

Region info:
hdfs dfs -du -h /apps/hbase/data/data/default/ky415/f069fafb3ee51d6a2e5bc2377b468bcc
67     134    /apps/hbase/data/data/default/ky415/f069fafb3ee51d6a2e5bc2377b468bcc/.regioninfo
0      0      /apps/hbase/data/data/default/ky415/f069fafb3ee51d6a2e5bc2377b468bcc/.tmp
2.6 G  5.1 G  /apps/hbase/data/data/default/ky415/f069fafb3ee51d6a2e5bc2377b468bcc/e
0      0      /apps/hbase/data/data/default/ky415/f069fafb3ee51d6a2e5bc2377b468bcc/f
0      0      /apps/hbase/data/data/default/ky415/f069fafb3ee51d6a2e5bc2377b468bcc/g
0      0      /apps/hbase/data/data/default/ky415/f069fafb3ee51d6a2e5bc2377b468bcc/h
0      0      /apps/hbase/data/data/default/ky415/f069fafb3ee51d6a2e5bc2377b468bcc/i
0      0      /apps/hbase/data/data/default/ky415/f069fafb3ee51d6a2e5bc2377b468bcc/l
0      0      /apps/hbase/data/data/default/ky415/f069fafb3ee51d6a2e5bc2377b468bcc/m
0      0      /apps/hbase/data/data/default/ky415/f069fafb3ee51d6a2e5bc2377b468bcc/recovered.edits
0      0      /apps/hbase/data/data/default/ky415/f069fafb3ee51d6a2e5bc2377b468bcc/s
0      0      /apps/hbase/data/data/default/ky415/f069fafb3ee51d6a2e5bc2377b468bcc/t
root@~$

Can anybody help me?


Re: Remote Traversal with Java

HadoopMarc <bi...@...>
 

Hi Peter,

This seems more relevant:
https://docs.janusgraph.org/basics/configured-graph-factory/#graph-and-traversal-bindings

So, some JanusGraph and GraphTraversalSource objects are created remotely with names following a convention. You cannot assign the instances locally.
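
Concretely, under that convention the Java side would look something like this sketch (assuming the graph was created as "test", so the server binds "test" and "test_traversal"):

import org.apache.tinkerpop.gremlin.driver.Client;
import org.apache.tinkerpop.gremlin.driver.Cluster;
import org.apache.tinkerpop.gremlin.driver.remote.DriverRemoteConnection;
import org.apache.tinkerpop.gremlin.process.traversal.AnonymousTraversalSource;
import org.apache.tinkerpop.gremlin.process.traversal.dsl.graph.GraphTraversalSource;

public class RemoteTraversalExample {
    public static void main(String[] args) throws Exception {
        Cluster cluster = Cluster.build("localhost").port(8182).create();
        Client client = cluster.connect();
        // "test_traversal" is the traversal binding the server created for graph "test"
        GraphTraversalSource g = AnonymousTraversalSource.traversal()
                .withRemote(DriverRemoteConnection.using(client, "test_traversal"));
        System.out.println(g.V().count().next());
        g.close();
        cluster.close();
    }
}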

Best wishes,    Marc

On Thursday, December 31, 2020 at 12:25:10 PM UTC+1 HadoopMarc wrote:

Hi Peter,

Have you tried the suggestions from an earlier thread:

https://groups.google.com/g/janusgraph-users/c/w2_qMchATnw/m/zoMSlMO5BwAJ


Best wishes,    Marc

On Wednesday, December 30, 2020 at 8:06:36 PM UTC+1 Peter Borissow wrote:
Dear All,
    I have installed/configured a single node JanusGraph Server with a Berkeley database backend and ConfigurationManagementGraph support so that I can create/manage multiple graphs on the server. 

In a Gremlin console on my desktop I can connect to the remote server, create graphs, create vertices, etc.

In Java code on my desktop, I can connect to the remote server and issue commands via Client.submit() method. However, I cannot figure out how to open a specific graph on the server and get a traversal. In the Gremlin console it is as simple as this:

gremlin> :remote connect tinkerpop.server conf/remote.yaml session 
gremlin> :remote console  
gremlin> graph = ConfiguredGraphFactory.open("test"); 
gremlin> g = graph.traversal();  

In Java, once I connect to the server/cluster and create a client connection, I think it should be as simple as this:

DriverRemoteConnection conn = DriverRemoteConnection.using(client, name);
GraphTraversalSource g = AnonymousTraversalSource.traversal().withRemote(conn);  

More info here:

Any help/guidance would be greatly appreciated!

Thanks,
Peter


Slow convert to Java object

Maxim Milovanov <milov...@...>
 

Hi!

I am trying to write a select query, and performance is very slow when I call the toList() method.
How can I improve this performance?


My example:

long start;
long end;

try (GraphTraversalSource g = graph.traversal()) {
    start = System.currentTimeMillis();
    GraphTraversal<Vertex, Vertex> list2 = g.V().hasLabel("Entity");
    end = System.currentTimeMillis();
    System.out.printf("getGremlinTime: %d ms\n", (end - start));

    start = end;
    List<Vertex> res2 = list2.toList();
    end = System.currentTimeMillis();
    System.out.printf("toList: %d ms\n gremlin count: %d\n", (end - start), res2.size());
}


Debug log:

getGremlinTime: 13 ms
2020-12-31 14:19:01.336  WARN 14144 --- [           main] o.j.g.transaction.StandardJanusGraphTx   : Query requires iterating over all vertices [(~label = Entity)]. For better performance, use indexes
toList: 12025 ms
 gremlin count: 105  
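Note that building a traversal is lazy: the 13 ms "getGremlinTime" only measures construction of the GraphTraversal object, while the 12 s is the full vertex scan the warning points at, which runs when toList() is called, even though only 105 vertices match. JanusGraph cannot use a composite index for a label-only filter, so the usual fix is to filter on an indexed property. A minimal sketch, assuming a String property "name" and an index name "byName" chosen purely for illustration:

import org.apache.tinkerpop.gremlin.structure.Vertex;
import org.janusgraph.core.PropertyKey;
import org.janusgraph.core.schema.JanusGraphManagement;

// A composite index lets g.V().has("name", ...) avoid the full scan.
JanusGraphManagement mgmt = graph.openManagement();
PropertyKey name = mgmt.makePropertyKey("name").dataType(String.class).make();
mgmt.buildIndex("byName", Vertex.class).addKey(name).buildCompositeIndex();
mgmt.commit();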


Re: Remote Traversal with Java

HadoopMarc <bi...@...>
 

Hi Peter,

Have you tried the suggestions from an earlier thread:

https://groups.google.com/g/janusgraph-users/c/w2_qMchATnw/m/zoMSlMO5BwAJ

Best wishes,    Marc

Op woensdag 30 december 2020 om 20:06:36 UTC+1 schreef Peter Borissow:

Dear All,
    I have installed/configured a single node JanusGraph Server with a Berkeley database backend and ConfigurationManagementGraph support so that I can create/manage multiple graphs on the server. 

In a Gremlin console on my desktop I can connect to the remote server, create graphs, create vertices, etc.

In Java code on my desktop, I can connect to the remote server and issue commands via Client.submit() method. However, I cannot figure out how to open a specific graph on the server and get a traversal. In the Gremlin console it is as simple as this:

gremlin> :remote connect tinkerpop.server conf/remote.yaml session 
gremlin> :remote console  
gremlin> graph = ConfiguredGraphFactory.open("test"); 
gremlin> g = graph.traversal();  

In Java, once I connect to the server/cluster and create a client connection, I think it should be as simple as this:

DriverRemoteConnection conn = DriverRemoteConnection.using(client, name);
GraphTraversalSource g = AnonymousTraversalSource.traversal().withRemote(conn);  

More info here:

Any help/guidance would be greatly appreciated!

Thanks,
Peter


Re: Degree-Centrality Filtering & Search – Scalable Strategies for OLTP

BO XUAN LI <libo...@...>
 

Hi Zach,

If you want to run the query in a multi-threaded manner, try enabling “query.batch” (ref: https://docs.janusgraph.org/basics/configuration-reference/#query).

Since you are using Cassandra, which does not support batch reading natively, JanusGraph will use a thread pool to fire the backend queries. This should reduce the latency of this single query but might impact overall application performance if your application is already handling heavy workloads.
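The option can be set in the graph's properties file or, as a minimal sketch, when opening the graph programmatically (the backend settings here are illustrative):

import org.janusgraph.core.JanusGraph;
import org.janusgraph.core.JanusGraphFactory;

// Open the graph with batched backend reads enabled.
JanusGraph graph = JanusGraphFactory.build()
        .set("storage.backend", "cql")
        .set("storage.hostname", "127.0.0.1")
        .set("query.batch", true)
        .open();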

Best regards,
Boxuan

On Dec 31, 2020, at 8:51 AM, zblu...@gmail.com <zblu...@...> wrote:

Hi Marc, Boxuan,

Thank you for the discussion. I have been experimenting with different queries, including your id() suggestion, Marc. In line with Boxuan’s feedback, the where() step performs about the same (maybe slightly slower) when adding the .id() step.

My bigger concern for my use case is that this type of operation scales roughly linearly with sample size, i.e.:

g.V().limit(10).where(inE().count().is(gt(6))).profile()     => ~30 ms
g.V().limit(100).where(inE().count().is(gt(6))).profile()    => ~147 ms
g.V().limit(1000).where(inE().count().is(gt(6))).profile()   => ~1284 ms
g.V().limit(10000).where(inE().count().is(gt(6))).profile()  => ~13779 ms
g.V().limit(100000).where(inE().count().is(gt(6))).profile() => ? > 120000 ms (timeout)

This behavior makes sense when I think about it, and also when I inspect the profile (an example profile of the limit(10) traversal is below).

I know the above traversal seems a bit funky, but I am trying to consistently analyze the effect of sample size on the edge count portion of the query.

Looking at the profile, it seems JG needs to perform a sliceQuery operation on each vertex sequentially, which isn’t well optimized for my use case. I know that if centrality properties were included in a mixed index, the query could be configured for scalable performance. However, going back to the original post, I am not sure that is the best/only way. Are there other configurations that could be optimized to make this operation more scalable without adding an additional index property?

In case it is relevant, I am using JanusGraph v 0.5.2 with Cassandra-CQL backend v3.11.

Thank you,

Zach

Example Profile

gremlin> g.V().limit(10).where(inE().count().is(gt(6))).profile()
==>Traversal Metrics
Step                                                               Count  Traversers       Time (ms)    % Dur
=============================================================================================================
JanusGraphStep(vertex,[])                                             10          10           8.684    28.71
    \_condition=()
    \_orders=[]
    \_limit=10
    \_isFitted=false
    \_isOrdered=true
    \_query=[]
  optimization                                                                                 0.005
  optimization                                                                                 0.001
  scan                                                                                         0.000
    \_query=[]
    \_fullscan=true
    \_condition=VERTEX
TraversalFilterStep([JanusGraphVertexStep(IN,ed...                                            21.564    71.29
  JanusGraphVertexStep(IN,edge)                                       13          13          21.350
    \_condition=(EDGE AND visibility:normal)
    \_orders=[]
    \_limit=7
    \_isFitted=false
    \_isOrdered=true
    \_query=org.janusgraph.diskstorage.keycolumnvalue.SliceQuery@9c76d
    \_vertices=1
    optimization                                                                               0.003
    backend-query                                                      3                       4.434
    \_query=org.janusgraph.diskstorage.keycolumnvalue.SliceQuery@9c76d
    \_limit=14
    optimization                                                                               0.001
    backend-query                                                      1                       1.291
    \_query=org.janusgraph.diskstorage.keycolumnvalue.SliceQuery@9c76d
    \_limit=14
    optimization                                                                               0.001
    backend-query                                                      2                       1.311
    \_query=org.janusgraph.diskstorage.keycolumnvalue.SliceQuery@9c76d
    \_limit=14
    optimization                                                                               0.001
    backend-query                                                      1                       2.483
    \_query=org.janusgraph.diskstorage.keycolumnvalue.SliceQuery@9c76d
    \_limit=14
    optimization                                                                               0.001
    backend-query                                                      2                       1.310
    \_query=org.janusgraph.diskstorage.keycolumnvalue.SliceQuery@9c76d
    \_limit=14
    optimization                                                                               0.001
    backend-query                                                      2                       1.313
    \_query=org.janusgraph.diskstorage.keycolumnvalue.SliceQuery@9c76d
    \_limit=14
    optimization                                                                               0.001
    backend-query                                                      2                       1.192
    \_query=org.janusgraph.diskstorage.keycolumnvalue.SliceQuery@9c76d
    \_limit=14
    optimization                                                                               0.001
    backend-query                                                      4                       1.287
    \_query=org.janusgraph.diskstorage.keycolumnvalue.SliceQuery@9c76d
    \_limit=14
    optimization                                                                               0.001
    backend-query                                                      3                       1.231
    \_query=org.janusgraph.diskstorage.keycolumnvalue.SliceQuery@9c76d
    \_limit=14
    optimization                                                                               0.001
    backend-query                                                      2                       3.546
    \_query=org.janusgraph.diskstorage.keycolumnvalue.SliceQuery@9c76d
    \_limit=14
  RangeGlobalStep(0,7)                                                13          13           0.037
  CountGlobalStep                                                     10          10           0.041
  IsStep(gt(6))                                                                                0.022
                                            >TOTAL                     -           -          30.249        -


On Wednesday, December 30, 2020 at 4:59:20 AM UTC-5 libo...@connect.hku.hk wrote:
Hi Marc,

I think it will be just as slow as the initial one, if not slower. If I recall correctly, JanusGraph always fetches the whole edge (column + value) even if you only need the count (in which case neither column nor value is really needed) or only the edge id (in which case only the column is needed). I created https://github.com/JanusGraph/janusgraph/discussions/2315 to discuss this potential optimization. Btw, even if we assume this optimization is in place, I don’t expect a significant performance boost for Zach’s use case.

Best regards,
Boxuan


On Dec 30, 2020, at 4:44 PM, HadoopMarc <b...@...> wrote:

Hi Zach, Boxuan,

There is one thing I do not understand. According to the JanusGraph data model, the outE relationIdentifiers are stored in the vertex. So retrieving all outE() relationIdentifiers along with the vertex and counting them should not take long, even if there are hundreds of thousands of them. What happens if you do:

g.V()
  .has("someProperty", eq("someValue"))
  .where(outE().id().count().is(gt(10)));

If this does not work, it should be possible to configure/modify JanusGraph such that it does not fetch edge properties that are not needed for the count.

Best wishes,   Marc

Op woensdag 30 december 2020 om 04:15:46 UTC+1 schreef libo...@connect.hku.hk:
Hi Zach,

I have some concerns over concurrency and consistency issues, but this might still be a nice feature to have. I think you could open a new discussion on https://github.com/JanusGraph/janusgraph/discussions; that would be a better place for brainstorming. It would be awesome if you could share more context on why you think this is a very common business requirement.

Cheers,
Boxuan

On Dec 30, 2020, at 4:42 AM, zblu...@gmail.com <zb...@...> wrote:

Thank you Boxuan,

I was using the term “job” pretty loosely. Your inference about doing these things within the ingest/deletion process makes sense.

I know there is a lot on the community’s plate now, but if my above solution is truly optimal for the current state, I wonder if a JG feature addition may help tackle this problem more consistently. Something like an additional, third index type (in addition to “graph” and “vertex-centric” indices), i.e. an “edge-connection” or “degree-centrality” index. The feature would require a mixed indexing backend and, minimally, a mechanism to choose the vertex and edge label combinations for which IN, OUT, and/or BOTH degree centrality is counted.

I am not sure what the level of effort or implementation details would be, but this is a very common business requirement for graph-based search. If JanusGraph had native/tested support for it, it would be even easier to champion.

😊

Best,

Zach


On Tuesday, December 29, 2020 at 3:19:46 AM UTC-5 libo...@connect.hku.hk wrote:
Hi Zach,

Personally I think your workaround is the optimal one. JanusGraph does not store the number of edges as metadata in the vertex (there are pros and cons to doing / not doing this).

Btw, do you have to run a separate job for the centrality calculation? If your application is built on top of JanusGraph, then you can probably maintain the “outDegree” property when inserting/deleting edges (see the sketch after this message).

Best regards,
Boxuan
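A minimal sketch of that suggestion (the edge label "knows", the vertex ids, and the read-modify-write counter are illustrative; concurrent writers would need locking or another consistency strategy):

import org.apache.tinkerpop.gremlin.structure.Vertex;

// Add an edge and bump the denormalized out-degree counter in one transaction.
Vertex from = g.V(fromId).next();   // fromId/toId are hypothetical vertex ids
Vertex to = g.V(toId).next();
from.addEdge("knows", to);
long current = from.<Long>property("outDegree").orElse(0L);
from.property("outDegree", current + 1);
graph.tx().commit();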

On Dec 29, 2020, at 6:49 AM, zblu...@gmail.com <zb...@...> wrote:

Hello all,

Curious about best approaches/practices for scalable degree-centrality search filters on large (millions to billions of nodes) JanusGraphs, i.e. something like:

g.V()
  .has("someProperty", eq("someValue"))
  .where(outE().count().is(gt(10)));

Suppose the has-step narrows down to a large number of vertices (hundreds of thousands); performing that form of count on that many vertices then results in timeouts and inefficiencies (at least in my experience). My workaround has been to pre-calculate centrality in another job and write it to a vertex property that can subsequently be included in a mixed index, so we can do:

g.V()
  .has("someProperty", eq("someValue"))
  .has("outDegree", gt(10))

This works, but it is yet another calculation we must maintain in our pipeline, and while it suffices, it seems more of a workaround than a great solution. I was hoping there was a more optimal approach/strategy. Please let me know (a sketch of the index setup follows below).

Thank you,

Zach
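For concreteness, a sketch of the index setup behind the workaround above (the property and index names follow the post; the backing index name "search" is an assumption that must match the configured indexing backend):

import org.apache.tinkerpop.gremlin.structure.Vertex;
import org.janusgraph.core.PropertyKey;
import org.janusgraph.core.schema.JanusGraphManagement;

// Mixed index so has("outDegree", gt(10)) is answered by the index backend.
JanusGraphManagement mgmt = graph.openManagement();
PropertyKey outDegree = mgmt.makePropertyKey("outDegree").dataType(Long.class).make();
mgmt.buildIndex("byOutDegree", Vertex.class).addKey(outDegree).buildMixedIndex("search");
mgmt.commit();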




Re: Degree-Centrality Filtering & Search – Scalable Strategies for OLTP

"zb...@gmail.com" <zblu...@...>
 

Hi Marc, Boxuan,

Thank you for the discussion. I have been experimenting with different queries, including your id() suggestion, Marc. In line with Boxuan’s feedback, the where() step performs about the same (maybe slightly slower) when adding the .id() step.

My bigger concern for my use case is that this type of operation scales roughly linearly with sample size, i.e.:

g.V().limit(10).where(inE().count().is(gt(6))).profile()     => ~30 ms
g.V().limit(100).where(inE().count().is(gt(6))).profile()    => ~147 ms
g.V().limit(1000).where(inE().count().is(gt(6))).profile()   => ~1284 ms
g.V().limit(10000).where(inE().count().is(gt(6))).profile()  => ~13779 ms
g.V().limit(100000).where(inE().count().is(gt(6))).profile() => ? > 120000 ms (timeout)

This behavior makes sense when I think about it, and also when I inspect the profile (an example profile of the limit(10) traversal is below).

I know the above traversal seems a bit funky, but I am trying to consistently analyze the effect of sample size on the edge count portion of the query.

Looking at the profile, it seems JG needs to perform a sliceQuery operation on each vertex sequentially, which isn’t well optimized for my use case. I know that if centrality properties were included in a mixed index, the query could be configured for scalable performance. However, going back to the original post, I am not sure that is the best/only way. Are there other configurations that could be optimized to make this operation more scalable without adding an additional index property?

In case it is relevant, I am using JanusGraph v 0.5.2 with Cassandra-CQL backend v3.11.

Thank you,

Zach

Example Profile

gremlin> g.V().limit(10).where(inE().count().is(gt(6))).profile()
==>Traversal Metrics
Step                                                               Count  Traversers       Time (ms)    % Dur
=============================================================================================================
JanusGraphStep(vertex,[])                                             10          10           8.684    28.71
    \_condition=()
    \_orders=[]
    \_limit=10
    \_isFitted=false
    \_isOrdered=true
    \_query=[]
  optimization                                                                                 0.005
  optimization                                                                                 0.001
  scan                                                                                         0.000
    \_query=[]
    \_fullscan=true
    \_condition=VERTEX
TraversalFilterStep([JanusGraphVertexStep(IN,ed...                                            21.564    71.29
  JanusGraphVertexStep(IN,edge)                                       13          13          21.350
    \_condition=(EDGE AND visibility:normal)
    \_orders=[]
    \_limit=7
    \_isFitted=false
    \_isOrdered=true
    \_query=org.janusgraph.diskstorage.keycolumnvalue.SliceQuery@9c76d
    \_vertices=1
    optimization                                                                               0.003
    backend-query                                                      3                       4.434
    \_query=org.janusgraph.diskstorage.keycolumnvalue.SliceQuery@9c76d
    \_limit=14
    optimization                                                                               0.001
    backend-query                                                      1                       1.291
    \_query=org.janusgraph.diskstorage.keycolumnvalue.SliceQuery@9c76d
    \_limit=14
    optimization                                                                               0.001
    backend-query                                                      2                       1.311
    \_query=org.janusgraph.diskstorage.keycolumnvalue.SliceQuery@9c76d
    \_limit=14
    optimization                                                                               0.001
    backend-query                                                      1                       2.483
    \_query=org.janusgraph.diskstorage.keycolumnvalue.SliceQuery@9c76d
    \_limit=14
    optimization                                                                               0.001
    backend-query                                                      2                       1.310
    \_query=org.janusgraph.diskstorage.keycolumnvalue.SliceQuery@9c76d
    \_limit=14
    optimization                                                                               0.001
    backend-query                                                      2                       1.313
    \_query=org.janusgraph.diskstorage.keycolumnvalue.SliceQuery@9c76d
    \_limit=14
    optimization                                                                               0.001
    backend-query                                                      2                       1.192
    \_query=org.janusgraph.diskstorage.keycolumnvalue.SliceQuery@9c76d
    \_limit=14
    optimization                                                                               0.001
    backend-query                                                      4                       1.287
    \_query=org.janusgraph.diskstorage.keycolumnvalue.SliceQuery@9c76d
    \_limit=14
    optimization                                                                               0.001
    backend-query                                                      3                       1.231
    \_query=org.janusgraph.diskstorage.keycolumnvalue.SliceQuery@9c76d
    \_limit=14
    optimization                                                                               0.001
    backend-query                                                      2                       3.546
    \_query=org.janusgraph.diskstorage.keycolumnvalue.SliceQuery@9c76d
    \_limit=14
  RangeGlobalStep(0,7)                                                13          13           0.037
  CountGlobalStep                                                     10          10           0.041
  IsStep(gt(6))                                                                                0.022
                                            >TOTAL                     -           -          30.249        -


On Wednesday, December 30, 2020 at 4:59:20 AM UTC-5 li...@... wrote:
Hi Marc,

I think it will be just as slow as the initial one, if not slower. If I recall correctly, JanusGraph always fetches the whole edge (column + value) even if you only need the count (in which case neither column nor value is really needed) or only the edge id (in which case only the column is needed). I created https://github.com/JanusGraph/janusgraph/discussions/2315 to discuss this potential optimization. Btw, even if we assume this optimization is in place, I don’t expect a significant performance boost for Zach’s use case.

Best regards,
Boxuan


On Dec 30, 2020, at 4:44 PM, HadoopMarc <b...@...> wrote:

Hi Zach, Boxuan,

There is one thing I do not understand. According to the JanusGraph data model, the outE relationIdentifiers are stored in the vertex. So retrieving all outE() relationIdentifiers along with the vertex and counting them should not take long, even if there are hundreds of thousands of them. What happens if you do:

g.V()
  .has("someProperty", eq("someValue"))
  .where(outE().id().count().is(gt(10)));

If this does not work, it should be possible to configure/modify JanusGraph such that it does not fetch edge properties that are not needed for the count.

Best wishes,   Marc

Op woensdag 30 december 2020 om 04:15:46 UTC+1 schreef libo...@connect.hku.hk:
Hi Zach,

I have some concerns over concurrency and consistency issues, but this might still be a nice feature to have. I think you could open a new discussion on https://github.com/JanusGraph/janusgraph/discussions; that would be a better place for brainstorming. It would be awesome if you could share more context on why you think this is a very common business requirement.

Cheers,
Boxuan

On Dec 30, 2020, at 4:42 AM, zblu...@gmail.com <zb...@...> wrote:

Thank you Boxuan,

I was using the term “job” pretty loosely. Your inference about doing these things within the ingest/deletion process makes sense.

I know there is a lot on the community’s plate now, but if my above solution is truly optimal for the current state, I wonder if a JG feature addition may help tackle this problem more consistently. Something like an additional, third index type (in addition to “graph” and “vertex-centric” indices), i.e. an “edge-connection” or “degree-centrality” index. The feature would require a mixed indexing backend and, minimally, a mechanism to choose the vertex and edge label combinations for which IN, OUT, and/or BOTH degree centrality is counted.

I am not sure what the level of effort or implementation details would be, but this is a very common business requirement for graph-based search. If JanusGraph had native/tested support for it, it would be even easier to champion.

😊

Best,

Zach


On Tuesday, December 29, 2020 at 3:19:46 AM UTC-5 libo...@connect.hku.hk wrote:
Hi Zach,

Personally I think your workaround is the optimal one. JanusGraph does not store the number of edges as metadata in the vertex (there are pros and cons to doing / not doing this).

Btw, do you have to run a separate job for the centrality calculation? If your application is built on top of JanusGraph, then you can probably maintain the “outDegree” property when inserting/deleting edges.

Best regards,
Boxuan

On Dec 29, 2020, at 6:49 AM, zblu...@gmail.com <zb...@...> wrote:

Hello all,

Curious about best approaches/practices for scalable degree-centrality search filters on large (millions to billions of nodes) JanusGraphs, i.e. something like:

g.V()
  .has("someProperty", eq("someValue"))
  .where(outE().count().is(gt(10)));

Suppose the has-step narrows down to a large number of vertices (hundreds of thousands); performing that form of count on that many vertices then results in timeouts and inefficiencies (at least in my experience). My workaround has been to pre-calculate centrality in another job and write it to a vertex property that can subsequently be included in a mixed index, so we can do:

g.V()
  .has("someProperty", eq("someValue"))
  .has("outDegree", gt(10))

This works, but it is yet another calculation we must maintain in our pipeline, and while it suffices, it seems more of a workaround than a great solution. I was hoping there was a more optimal approach/strategy. Please let me know.

Thank you,

Zach


