Re: Error when running JanusGraph with YARN and CQL
Varun Ganesh <operatio...@...>
Thanks a lot for responding, Marc.
Yes, I had initially tried setting spark.yarn.archive with the path to spark-gremlin.zip. However, with this approach the containers were failing with the message "Error: Could not find or load main class org.apache.spark.deploy.yarn.ExecutorLauncher". I have yet to understand the differences between the spark.yarn.archive and HADOOP_GREMLIN_LIBS approaches, and will update this thread as I find out more.

Thank you,
Varun

On Friday, December 11, 2020 at 2:05:35 AM UTC-5 HadoopMarc wrote:
Re: How to improve traversal query performance
HadoopMarc <bi...@...>
Hi Manabu,

Yes, providing an example graph works much better for exploring the problem space. I am afraid, though, that I did not find much that will help you out.
So, concluding, there does not seem to be much you can do about the query: you simply want a large result set from a traversal with multiple steps. Depending on the size of your graph, you could hold the graph in memory using the inmemory backend, or you could replace the cassandra backend with cql and put it on infrastructure with SSD storage. Of course, you could also precompute and store results, or split up the query with repeat().times(1), repeat().times(2), etc. for faster intermediate results.

Best wishes,
Marc

On Tuesday, December 8, 2020 at 08:56:03 UTC+1, Manabu Kotani wrote:
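A minimal sketch of the inmemory and split-up-repeat suggestions above (the 'serial' property and 'assembled' edge label are only illustrative, borrowed from another thread in this digest, and may not match Manabu's actual schema):

graph = JanusGraphFactory.build().
    set('storage.backend', 'inmemory').
    open()
g = graph.traversal()
// after loading the data, ask for intermediate results per depth
// instead of one large multi-step traversal
g.V().has('serial', '1654145144').repeat(__.out('assembled')).times(1).valueMap()
g.V().has('serial', '1654145144').repeat(__.out('assembled')).times(2).valueMap()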
Re: Profile() seems inconsistent with System.currentTimeMillis
HadoopMarc <bi...@...>
In the meantime I found that the difference between profile() and currentTimeMillis can be much larger. Apparently, the profile() step takes into account that for real queries, vertices are not present in the database cache, and assumes some time duration to retrieve a vertex or its properties from the backend. Is there any documentation on these assumptions?

Best wishes,
Marc

On Friday, December 11, 2020 at 09:58:21 UTC+1, HadoopMarc wrote:
Profile() seems inconsistent with System.currentTimeMillis
HadoopMarc <bi...@...>
Hi,

Can anyone explain why the total duration displayed by the profile() step is more than twice as large as the time difference clocked with System.currentTimeMillis? See below. For those who wonder: the query without profile() also takes about 300 msec.

Thanks, Marc

gremlin> start = System.currentTimeMillis()
==>1607676127027
gremlin> g.V().has('serial', within('1654145144','1648418968','1652445288','1654952168','1653379120', '1654325440','1653383216','1658298568','1649680536','1649819672','1654964456','1649729552', '1656103144','1655460032','1656111336','1654669360')).inE('assembled').outV().profile()
==>Traversal Metrics
Step                                                   Count  Traversers   Time (ms)    % Dur
=============================================================================================
JanusGraphStep([],[serial.within([1654145144, 1...        16          16       0,486    59,26
    \_condition=((serial = 1654145144 OR serial = 1648418968 OR serial = 1652445288 OR
      serial = 1654952168 OR serial = 1653379120 OR serial = 1654325440 OR serial = 1653383216 OR
      serial = 1658298568 OR serial = 1649680536 OR serial = 1649819672 OR serial = 1654964456 OR
      serial = 1649729552 OR serial = 1656103144 OR serial = 1655460032 OR serial = 1656111336 OR
      serial = 1654669360))
    \_orders=[]
    \_isFitted=true
    \_isOrdered=true
    \_query=multiKSQ[16]@2000
    \_index=bySerial
  optimization                                                              0,009
  optimization                                                              0,267
JanusGraphVertexStep(IN,[assembled],vertex)               73          73       0,334    40,74
    \_condition=type[assembled]
    \_orders=[]
    \_isFitted=true
    \_isOrdered=true
    \_query=org.janusgraph.diskstorage.keycolumnvalue.SliceQuery@812bd43d
    \_vertices=1
  optimization                                                              0,037
  optimization                                                              0,008
  optimization                                                              0,005
  optimization                                                              0,004
  optimization                                                              0,004
  optimization                                                              0,004
  optimization                                                              0,017
  optimization                                                              0,004
  optimization                                                              0,004
  optimization                                                              0,004
  optimization                                                              0,004
  optimization                                                              0,004
  optimization                                                              0,004
  optimization                                                              0,004
  optimization                                                              0,004
  optimization                                                              0,004
                                        >TOTAL            -           -       0,820        -
gremlin> System.currentTimeMillis() - start
==>322
Re: Running OLAP on HBase with SparkGraphComputer fails with Error Container killed by YARN for exceeding memory limits
HadoopMarc <bi...@...>
Hi Roy,

I think I would first check whether the skew is absent if you count the rows by reading the HBase table directly from Spark (so, without using JanusGraph), e.g.: https://stackoverflow.com/questions/42019905/how-to-use-newapihadooprdd-spark-in-java-to-read-hbase-data

If this works all right, then you know that somehow in the JanusGraph HBaseInputFormat the mappers do not get the right key ranges to read from.

Best wishes,
Marc

On Wednesday, December 9, 2020 at 17:16:35 UTC+1, Roy Yu wrote:
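A minimal sketch of such a direct read, loosely following the linked answer (the table name 'janusgraph' is an assumption, and any aggregate that exposes per-task input sizes will do):

import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.spark.SparkConf
import org.apache.spark.api.java.JavaSparkContext

conf = HBaseConfiguration.create()
conf.set(TableInputFormat.INPUT_TABLE, 'janusgraph')   // the HBase table backing the graph (assumption)
sc = new JavaSparkContext(new SparkConf().setAppName('hbase-row-count'))
rows = sc.newAPIHadoopRDD(conf, TableInputFormat, ImmutableBytesWritable, Result)
println rows.count()   // compare the per-task input sizes in the Spark UI to spot skew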
Re: Error when running JanusGraph with YARN and CQL
HadoopMarc <bi...@...>
Hi Varun,

Good job. However, your last solution will only work with everything running on a single machine. So, indeed, there is something wrong with the contents of spark-gremlin.zip or with the way it is put in the executor's local working directory. Note that you already put /Users/my_comp/Downloads/janusgraph-0.5.2/lib/janusgraph-cql-0.5.2.jar explicitly on the executor classpath, while it should already have been available through ./spark-gremlin.zip/*.

Oh, I think I see now what is different: you have used spark.yarn.dist.archives, while the TinkerPop recipes use spark.yarn.archive. They behave differently with respect to whether or not the jars are extracted from the zip. I guess either can be used, provided it is done consistently. You can use the Environment tab in the Spark web UI to inspect how things are picked up by Spark.

Best wishes,
Marc

On Thursday, December 10, 2020 at 20:23:32 UTC+1, Varun Ganesh wrote:
Re: Janusgraph Hadoop Spark standalone cluster - Janusgraph job always creates a constant number of 513 Spark tasks
Varun Ganesh <operatio...@...>
Thank you, Marc. I was able to reduce the number of tasks by adjusting the `num_tokens` setting on Cassandra. I am still unsure about why each task takes so long, though. Hoping that this is a per-task overhead that stays the same as we process larger datasets.
On Saturday, December 5, 2020 at 3:20:17 PM UTC-5 HadoopMarc wrote:
Re: Error when running JanusGraph with YARN and CQL
Varun Ganesh <operatio...@...>
Answering my own question: I was able to fix the above error and successfully run the count job after explicitly adding /Users/my_comp/Downloads/janusgraph-0.5.2/lib/* to spark.executor.extraClassPath.
But I am not yet sure why that was needed. I had assumed that adding spark-gremlin.zip to the path would have provided the required dependencies.

On Thursday, December 10, 2020 at 1:00:24 PM UTC-5 Varun Ganesh wrote:
Re: Error when running JanusGraph with YARN and CQL
Varun Ganesh <operatio...@...>
An update on this: I tried setting the env var below:
export HADOOP_GREMLIN_LIBS=$GREMLIN_HOME/lib

After doing this I was able to successfully run the tinkerpop-modern.kryo example from the Recipes documentation (though the guide at http://yaaics.blogspot.com/2017/07/configuring-janusgraph-for-spark-yarn.html explicitly asks us to ignore this). Unfortunately, it is still not working with CQL, but the error is now different. Please see below:

12:46:33 ERROR org.apache.spark.scheduler.TaskSetManager - Task 3 in stage 0.0 failed 4 times; aborting job
org.apache.spark.SparkException: Job aborted due to stage failure: Task 3 in stage 0.0 failed 4 times, most recent failure: Lost task 3.3 in stage 0.0 (TID 9, 192.168.1.160, executor 2): java.lang.NoClassDefFoundError: org/janusgraph/hadoop/formats/util/HadoopInputFormat
    at java.lang.ClassLoader.defineClass1(Native Method)
    at java.lang.ClassLoader.defineClass(ClassLoader.java:756)
    at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
    at java.net.URLClassLoader.defineClass(URLClassLoader.java:468)
    at java.net.URLClassLoader.access$100(URLClassLoader.java:74)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:369)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:363)
    at java.security.AccessController.doPrivileged(Native Method)
    ... (skipping)
Caused by: java.lang.ClassNotFoundException: org.janusgraph.hadoop.formats.util.HadoopInputFormat
    at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
    ... 130 more

Is there some additional dependency that I may need to add?

Thanks in advance!

On Wednesday, December 9, 2020 at 11:49:29 PM UTC-5 Varun Ganesh wrote:
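One quick way to narrow this down (a sketch, not from the original thread): check on the driver which local jar provides the class the executors cannot load; that jar (presumably janusgraph-hadoop-*.jar) then also has to end up in spark-gremlin.zip or on spark.executor.extraClassPath.

clazz = Class.forName('org.janusgraph.hadoop.formats.util.HadoopInputFormat')
println clazz.protectionDomain.codeSource.location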
Re: How to open the same graph multiple times and not get the same object?
BO XUAN LI <libo...@...>
Thanks for sharing! I personally only use MapReduce and am not sure if there is an existing solution for Spark.
> if there is any danger in opening multiple separate graph instances and using them to modify the graph

Opening multiple graph instances on the same JVM seems atypical, but I don't see any problem. It would be great if you can share back in case you see any issue.

Best regards,
Boxuan
Re: OLAP, Hadoop, Spark and Cassandra
Mladen Marović <mladen...@...>
A slight correction and clarification of my previous post: the total number of partitions/splits is exactly equal to total_number_of_tokens + 1. In a 3-node Cassandra cluster where each node has 256 tokens (the default), this results in a total of 769 partitions; in a single-node cluster this would be 257, etc. There is no "1 task that collects results, or something similar".
This makes sense when you consider that Cassandra partitions data using 64-bit row key hashes, that the total range of 64-bit integer hash values is equal to [-2^63, 2^63 - 1], and that tokens are simply 64-bit integer values used to determine what data partitions a node gets. Splitting that range with n different tokens always gives n + 1 subsets. A log excerpt from a 1-node Cassandra cluster with 16 tokens confirms this:

18720 [Executor task launch worker for task 0] INFO org.apache.spark.rdd.NewHadoopRDD - Input split: ColumnFamilySplit((-4815577940669380240, '-2942172956248108515] @[master])
18720 [Executor task launch worker for task 1] INFO org.apache.spark.rdd.NewHadoopRDD - Input split: ColumnFamilySplit((7326109958794842850, '7391123213565411179] @[master])
18721 [Executor task launch worker for task 3] INFO org.apache.spark.rdd.NewHadoopRDD - Input split: ColumnFamilySplit((-2942172956248108515, '-2847854446434006096] @[master])
18740 [Executor task launch worker for task 2] INFO org.apache.spark.rdd.NewHadoopRDD - Input split: ColumnFamilySplit((-9223372036854775808, '-8839354777455528291] @[master])
28369 [Executor task launch worker for task 4] INFO org.apache.spark.rdd.NewHadoopRDD - Input split: ColumnFamilySplit((4104296217363716109, '7326109958794842850] @[master])
28651 [Executor task launch worker for task 5] INFO org.apache.spark.rdd.NewHadoopRDD - Input split: ColumnFamilySplit((8156279557766590813, '-9223372036854775808] @[master])
34467 [Executor task launch worker for task 6] INFO org.apache.spark.rdd.NewHadoopRDD - Input split: ColumnFamilySplit((-6978843450179888845, '-5467974851507832526] @[master])
54235 [Executor task launch worker for task 7] INFO org.apache.spark.rdd.NewHadoopRDD - Input split: ColumnFamilySplit((2164465249293820494, '3738744141825711063] @[master])
56122 [Executor task launch worker for task 8] INFO org.apache.spark.rdd.NewHadoopRDD - Input split: ColumnFamilySplit((-2847854446434006096, '180444324727144184] @[master])
60564 [Executor task launch worker for task 9] INFO org.apache.spark.rdd.NewHadoopRDD - Input split: ColumnFamilySplit((180444324727144184, '720824306927062455] @[master])
74783 [Executor task launch worker for task 10] INFO org.apache.spark.rdd.NewHadoopRDD - Input split: ColumnFamilySplit((-8839354777455528291, '-7732322859452179159] @[master])
78171 [Executor task launch worker for task 11] INFO org.apache.spark.rdd.NewHadoopRDD - Input split: ColumnFamilySplit((-7732322859452179159, '-6978843450179888845] @[master])
79362 [Executor task launch worker for task 12] INFO org.apache.spark.rdd.NewHadoopRDD - Input split: ColumnFamilySplit((3738744141825711063, '4104296217363716109] @[master])
91036 [Executor task launch worker for task 13] INFO org.apache.spark.rdd.NewHadoopRDD - Input split: ColumnFamilySplit((-5467974851507832526, '-4815577940669380240] @[master])
92250 [Executor task launch worker for task 14] INFO org.apache.spark.rdd.NewHadoopRDD - Input split: ColumnFamilySplit((1437322944493769078, '2164465249293820494] @[master])
92363 [Executor task launch worker for task 15] INFO org.apache.spark.rdd.NewHadoopRDD - Input split: ColumnFamilySplit((720824306927062455, '1437322944493769078] @[master])
94339 [Executor task launch worker for task 16] INFO org.apache.spark.rdd.NewHadoopRDD - Input split: ColumnFamilySplit((7391123213565411179, '8156279557766590813] @[master])

Best regards,

Mladen

On Tuesday, December 1, 2020 at 8:05:19 AM UTC+1 HadoopMarc wrote:
Error when running JanusGraph with YARN and CQL
Varun Ganesh <operatio...@...>
Hello,
I am trying to run SparkGraphComputer on a JanusGraph backed by Cassandra and Elasticsearch. I have previously verified that I am able to run SparkGraphComputer on a local Spark standalone cluster. I am now trying to run it on YARN. I have a local YARN cluster running and I have verified that it can run Spark jobs. I followed the following links:

http://yaaics.blogspot.com/2017/07/configuring-janusgraph-for-spark-yarn.html
http://tinkerpop.apache.org/docs/3.4.6/recipes/#olap-spark-yarn

And here is my read-cql-yarn.properties file:

gremlin.graph=org.apache.tinkerpop.gremlin.hadoop.structure.HadoopGraph
gremlin.hadoop.graphReader=org.janusgraph.hadoop.formats.cql.CqlInputFormat
gremlin.hadoop.graphWriter=org.apache.hadoop.mapreduce.lib.output.NullOutputFormat
gremlin.hadoop.jarsInDistributedCache=true
gremlin.hadoop.inputLocation=none
gremlin.hadoop.outputLocation=output
gremlin.spark.persistContext=true

#
# JanusGraph Cassandra InputFormat configuration
#
# These properties define the connection properties which were used while writing data to JanusGraph.
janusgraphmr.ioformat.conf.storage.backend=cql
# This specifies the hostname & port for the Cassandra data store.
janusgraphmr.ioformat.conf.storage.hostname=127.0.0.1
janusgraphmr.ioformat.conf.storage.port=9042
# This specifies the keyspace where data is stored.
janusgraphmr.ioformat.conf.storage.cql.keyspace=janusgraph
# This defines the indexing backend configuration used while writing data to JanusGraph.
janusgraphmr.ioformat.conf.index.search.backend=elasticsearch
janusgraphmr.ioformat.conf.index.search.hostname=127.0.0.1
# Use the appropriate properties for the backend when using a different storage backend (HBase) or indexing backend (Solr).

#
# Apache Cassandra InputFormat configuration
#
cassandra.input.partitioner.class=org.apache.cassandra.dht.Murmur3Partitioner
cassandra.input.widerows=true

#
# SparkGraphComputer Configuration
#
spark.master=yarn
spark.submit.deployMode=client
spark.executor.memory=1g
spark.yarn.dist.archives=/tmp/spark-gremlin.zip
spark.yarn.dist.files=/Users/my_comp/Downloads/janusgraph-0.5.2/lib/janusgraph-cql-0.5.2.jar
spark.yarn.appMasterEnv.CLASSPATH=/Users/my_comp/Downloads/hadoop-2.7.2/etc/hadoop:./spark-gremlin.zip/*
spark.executor.extraClassPath=/Users/my_comp/Downloads/hadoop-2.7.2/etc/hadoop:/Users/my_comp/Downloads/janusgraph-0.5.2/lib/janusgraph-cql-0.5.2.jar:./spark-gremlin.zip/*
spark.driver.extraLibraryPath=/Users/my_comp/Downloads/hadoop-2.7.2/lib/native:/Users/my_comp/Downloads/hadoop-2.7.2/lib/native/Linux-amd64-64
spark.executor.extraLibraryPath=/Users/my_comp/Downloads/hadoop-2.7.2/lib/native:/Users/my_comp/Downloads/hadoop-2.7.2/lib/native/Linux-amd64-64
spark.serializer=org.apache.spark.serializer.KryoSerializer
spark.kryo.registrator=org.janusgraph.hadoop.serialize.JanusGraphKryoRegistrator

After a bunch of trial and error, I was able to get it to a point where I see containers starting up on my YARN Resource Manager UI (port 8088). Here is the code I am running (it's a simple count):

gremlin> graph = GraphFactory.open('conf/hadoop-graph/read-cql-yarn.properties')
==>hadoopgraph[cqlinputformat->nulloutputformat]
gremlin> g = graph.traversal().withComputer(SparkGraphComputer)
==>graphtraversalsource[hadoopgraph[cqlinputformat->nulloutputformat], sparkgraphcomputer]
gremlin> g.V().count()
18:49:03 ERROR org.apache.spark.scheduler.TaskSetManager - Task 2 in stage 0.0 failed 4 times; aborting job
org.apache.spark.SparkException: Job aborted due to stage failure: Task 2 in stage 0.0 failed 4 times, most recent failure: Lost task 2.3 in stage 0.0 (TID 10, 192.168.1.160, executor 1): java.lang.IllegalStateException: unread block data
    at java.io.ObjectInputStream$BlockDataInputStream.setBlockDataMode(ObjectInputStream.java:2862)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1682)
    at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2366)
    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2290)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2148)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1647)
    at java.io.ObjectInputStream.readObject(ObjectInputStream.java:483)
    at java.io.ObjectInputStream.readObject(ObjectInputStream.java:441)
    at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:75)
    at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:114)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:370)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

Would really appreciate it if someone could shed some light on this error and advise on next steps!

Thank you!
Centric Indexes failing to support all conditions for better performance.
chrism <cmil...@...>
The JanusGraph documentation (https://docs.janusgraph.org/index-management/index-performance/) describes the usage of a vertex-centric index [edge=battled + properties=(rating,time)]:

g.V(h).outE('battled').has('rating', 5.0).has('time', inside(10, 50)).inV()

From my understanding, profile() of the above reports \_isFitted=true to indicate that the backend query delivered all results matching the conditions:

\_condition=(rating = 0.5 AND time > 10 AND time < 50 AND type[battled])

Two things are obvious from the above: a centric index supports multiple property keys, as well as equality and range/interval constraints. However, isFitted is false for all kinds of conditions or combinations which do not really break the above rules and are still range constraints:

a) g.V(h).outE('battled').has('rating',lt(5.0)).has('time', inside(10, 50)).inV()   // P.lt used for the first key
b) g.V(h).outE('battled').has('rating',gt(5.0))   // P.gt used
c) g.V(h).outE('battled').or( hasNot('rating'), has('rating',eq(5.0)) )   // OrStep() used

(Even b) can be made "fitted" by rewriting it as has('rating',inside(5.0,Long.MAX_VALUE)).) All of that is very confusing and probably not working as expected. What am I doing wrong? From my experience, only one property key can be used in the query conditions against the index; the second is ignored. Having isFitted=false is not really improving performance, from my understanding, when only one condition is used to fetch most of my edges and the rest are filtered in memory, as stated by the implementation of BasicVertexCentricQueryBuilder.java.

Are there limitations not described in the JanusGraph documentation? Is it a glitch? Can you offer an explanation of how to utilize centric indexes for edges with full support?

Christopher
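For reference, a minimal sketch of how such a vertex-centric index is declared, loosely following the index-performance page cited above (on an existing schema you would look the keys and label up with mgmt.getPropertyKey()/getEdgeLabel() instead of creating them):

mgmt = graph.openManagement()
rating = mgmt.makePropertyKey('rating').dataType(Double.class).make()
time = mgmt.makePropertyKey('time').dataType(Integer.class).make()
battled = mgmt.makeEdgeLabel('battled').multiplicity(MULTI).make()
mgmt.buildEdgeIndex(battled, 'battlesByRatingAndTime', Direction.OUT, Order.desc, rating, time)
mgmt.commit()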
Re: How to open the same graph multiple times and not get the same object?
Mladen Marović <mladen...@...>
Hello Boxuan,

I need to support reindexing very large graphs. To my knowledge, the only feasible way that's supported is via the `MapReduceIndexManagement` class. This is not ideal for me as I'd like to utilise an existing Apache Spark cluster to run this job, and `MapReduceIndexManagement` is a Hadoop/MapReduce implementation. Therefore, I started writing a `SparkIndexManagement` class that's supposed to be a drop-in replacement that offers Spark support. The basic structure of the code that processes a single partition should be something like this:

public ScanMetrics processPartition(Iterator<Tuple2<NullWritable, VertexWritable>> vertices) {
    if (vertices.hasNext()) {
        // open the graph
        JanusGraph graph = JanusGraphFactory.open(getGraphConfiguration());

        // prepare for partition processing
        job.workerIterationStart(graph, getJobConfiguration(), metrics);

        // find and process each vertex
        vertices.forEachRemaining(
            tuple -> {
                ...
                JanusGraphVertex vertex = ... // load the vertex
                job.process(vertex, metrics);
                ...
            }
        );

        // finish processing the partition
        job.workerIterationEnd(metrics);
    }
    ...
}

At first everything seemed quite straightforward, so I implemented a quick-and-dirty solution as a proof of concept. However, after running the first buildable solution, I came upon an unexpected error: "java.lang.IllegalArgumentException: The transaction has already been closed". The confusing part was that the implementation worked when I ran the local Spark cluster as "local[1]" (which spawns only one worker thread), but when running it as "local[*]" (which spawns multiple worker threads, one per core), the error would always appear, although not always on the same task.

After some digging, I seem to have found the main cause. Loading the graph data by using `org.janusgraph.hadoop.formats.cql.CqlInputFormat` in the `SparkContext.newAPIHadoopRDD()` call returns a `JavaPairRDD<NullWritable, VertexWritable>` with several partitions, as expected. The graph used to read vertices in this input format is opened via `JanusGraphFactory.open()`. After iterating through all vertices returned by the partition, the underlying graph is closed in a final `release()` call for that partition. This makes sense because that partition is done with reading. However, when processing that partition, I need to open a graph to pass to `IndexRepairJob.workerIterationStart()`, and also create a separate read-only transaction (from that same graph) to fetch the vertex properly and pass it to `IndexRepairJob.process()`. `IndexRepairJob` also creates a write transaction to make some changes to the graph.

This would all work fine in MapReduce because there, the first `map()` step is run in its entirety first, which means that reindexing per vertex is done only after ALL partitions have been read and the `CqlInputFormat` has finished its part. I don't have much experience in MapReduce, but that's how I understand it to work: a single map() result is first written to disk, and then that result is read from disk to be the input to the subsequent map() call. On the other hand, Spark optimizes the map-reduce paradigm by chaining subsequent map() calls to keep objects in memory as much as possible.

So, when this runs on a "local[*]" cluster, or a Spark executor with multiple cores, and the graph is opened via JanusGraphFactory.open(), all threads in that executor share the graph object. Each thread runs on a different RDD partition, but they can be at different phases of the reindexing process (different map() steps) at the same time. When one thread closes the graph for whatever reason (e.g. when `CqlInputFormat` finishes reading a partition), other threads simply blow up. For example, if I have partitions/tasks with 300, 600 and 900 vertices and they all run on a single 3-core Spark executor, they'll be processed in parallel by three separate threads. The first thread will process 300 vertices and, upon iterating the final vertex, will close the underlying graph (as part of the `CqlInputFormat` implementation, from what I gathered). Closing the graph immediately closes all opened transactions. However, the same graph is used in other threads as well, in parallel. The second thread might have only finished processing 350 vertices at the time the first closed the graph, so the next time it tries to write something, it crashes because it uses a transaction that's already closed.

The ideal solution would be to open separate graph instances of the same graph, one in `CqlInputFormat` and another that is passed to `IndexRepairJob.workerIterationStart()`, for each task. In that case, if one graph is closed, no other tasks or processing phases would be affected. I tried that out today by opening the graph using the `StandardJanusGraph` constructor (at least in my part of the code) and so far that worked well, because most of my test runs completed successfully. The runs that failed occurred during debugging, when the execution was stuck on a breakpoint for a while, so maybe there were some timeouts involved or something. This remains to be tested. I also strongly suspect that the problem still remains, at least in theory, because `CqlInputFormat` still uses the `JanusGraphFactory.open()` call, but the probability is reduced, at least in the environment and on the data I'm currently testing on. I haven't analyzed the `CqlInputFormat` code fully to understand how it behaves in that case yet.

Admittedly, I could provide my own InputFormat class, or at least subclass it and try to hack and slash and make it work somehow, but that seriously complicates everything and defeats the purpose of what I'm trying to do here. Another workaround would be to limit each Spark executor to use only one core, but that seems wasteful and is definitely something I would try to avoid.

I probably missed a lot of details, but that's the general idea and my conclusions so far. Feel free to correct me if I missed anything or wrote anything wrong, as well as to point me in the right direction if such an implementation already exists and I just didn't come across it.

Best regards,

Mladen

PS: An additional question here would be whether there is any danger in opening multiple separate graph instances and using them to modify the graph, but as this is already done in the current MapReduce implementation anyway, and all my transactions are opened as read-only, I'm guessing that shouldn't pose a problem here.

On Wednesday, December 9, 2020 at 4:32:10 PM UTC+1 li...@... wrote:
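For context, the MapReduce-based reindexing that the post sets out to replace is normally invoked roughly as follows (a sketch following the JanusGraph reindex documentation; 'byName' is an illustrative index name):

import org.janusgraph.hadoop.MapReduceIndexManagement

mgmt = graph.openManagement()
mr = new MapReduceIndexManagement(graph)
mr.updateIndex(mgmt.getGraphIndex('byName'), SchemaAction.REINDEX).get()
mgmt.commit()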
Re: Running OLAP on HBase with SparkGraphComputer fails with Error Container killed by YARN for exceeding memory limits
Roy Yu <7604...@...>
Hi Marc,
The parameter hbase.mapreduce.tableinput.mappers.per.region can be effective. I set it to 40, and there are 40 tasks processing every region. But here comes the new problem: data skew. I use g.E().count() to count all the edges of the graph. While counting one region, one Spark task contained all 2.6 GB of data, while the other 39 tasks contained no data, and the task failed again. I checked my data: there are some vertices which have more than 1 million incident edges. So I tried to solve this problem using vertex cut (https://docs.janusgraph.org/advanced-topics/partitioning/); my graph schema is something like [mgmt.makeVertexLabel('product').partition().make()]. But when I used MR to load data into the new graph, it took more than 10 times as long as the attempt without partition(), and from the HBase table detail page I found the data loading process was busy reading data from and writing data to the first region. The first region became a hot spot. I guess it relates to vertex ids. Could you help me again?

On Tuesday, December 8, 2020 at 3:13:42 PM UTC+8 HadoopMarc wrote:
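For reference, a partitioned vertex label is typically combined with the cluster.max-partitions setting, which has to be in place when the graph is first created; whether this addresses the hot first region above is an open question, and the values below are only illustrative:

graph = JanusGraphFactory.build().
    set('storage.backend', 'hbase').
    set('storage.hostname', '127.0.0.1').
    set('cluster.max-partitions', 64).   // number of virtual partitions used by partition().make() labels
    open()
mgmt = graph.openManagement()
product = mgmt.makeVertexLabel('product').partition().make()
mgmt.commit()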
Re: How to open the same graph multiple times and not get the same object?
Boxuan Li <libo...@...>
Hi Mladen,
Agree with Marc, that's something you could try. If possible, could you share the reason why you have to open the same graph multiple times with different graph objects? If there is no other solution to your problem then this can be a feature request.

Best regards,
Boxuan

On Wednesday, December 9, 2020 at 2:50:48 PM UTC+8 HadoopMarc wrote:
SimplePath query is slower in 6 node vs 3 node Cassandra cluster
Varun Ganesh <operatio...@...>
Hello,

I am currently using JanusGraph version 0.5.2. I have a graph with about 18 million vertices and 25 million edges. I have two versions of this graph, one backed by a 3-node Cassandra cluster and another backed by 6 Cassandra nodes (both with 3x replication factor). I am running the below query on both of them:

g.V().hasLabel('label_A').has('some_id', 123).has('data.name', 'value1').repeat(both('sample_edge').simplePath()).until(has('data.name', 'value2')).path().by('data.name').next()

The issue is that this query takes ~130ms on the 3-node cluster whereas it takes ~400ms on the 6-node cluster. I have tried running ".profile()" on both versions and the outputs are almost identical in terms of the steps and time taken.

g.V().hasLabel('label_A').has('some_id', 123).has('data.name', 'value1').repeat(both('sample_edge').simplePath()).until(has('data.name', 'value2')).path().by('data.name').limit(1).profile()
==>Traversal Metrics
Step                                                   Count  Traversers   Time (ms)    % Dur
=============================================================================================
JanusGraphStep([],[~label.eq(label_A), o...                1           1       4.582     0.39
    \_condition=(~label = label_A AND some_id = 123 AND data.name = value1)
    \_orders=[]
    \_isFitted=true
    \_isOrdered=true
    \_query=multiKSQ[1]@8000
    \_index=someVertexByNameComposite
  optimization                                                              0.028
  optimization                                                              0.907
  backend-query                                            1                3.012
    \_query=someVertexByNameComposite:multiKSQ[1]@8000
    \_limit=8000
RepeatStep([JanusGraphVertexStep(BOTH,[...                 2           2    1167.493    99.45
  HasStep([data.name.eq(...                                               803.247
  JanusGraphVertexStep(BOTH,[...                       12934       12934   334.095
    \_condition=type[sample_edge]
    \_orders=[]
    \_isFitted=true
    \_isOrdered=true
    \_query=org.janusgraph.diskstorage.keycolumnvalue.SliceQuery@812d311c
    \_multi=true
    \_vertices=264
  optimization                                                              0.073
  backend-query                                          266                5.640
    \_query=org.janusgraph.diskstorage.keycolumnvalue.SliceQuery@812d311c
  optimization                                                              0.028
  backend-query                                        12689              312.544
    \_query=org.janusgraph.diskstorage.keycolumnvalue.SliceQuery@812d311c
  PathFilterStep(simple)                               12441       12441    10.980
  JanusGraphMultiQueryStep(RepeatEndStep)               1187        1187    11.825
  RepeatEndStep                                            2           2   810.468
RangeGlobalStep(0,1)                                       1           1       0.419     0.04
PathStep([value(data.name)])                               1           1       1.474     0.13
                                        >TOTAL            -           -    1173.969        -

I'd really appreciate some input on figuring out why the query is 3x slower on 6 nodes. I realise that you may require more context. Happy to provide more information as required!

(I had previously posted this on the forum: https://groups.google.com/g/janusgraph-users/c/nkNFaFzdr4I, but I was hoping that I might get a bit more traction through the mailing list.)

Thank you!
Re: Configuring Transaction Log feature
Pawan Shriwas <shriwa...@...>
Hi Sandeep,
I think I have already added the line below to indicate that the processor should pull the details from now onwards. Is it not working?

setStartTimeNow()

Has anyone else faced the same thing in their Java code?

Thanks,
Pawan

On Friday, 4 December 2020 at 16:22:51 UTC+5:30 sa...@... wrote:
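For comparison, the documented transaction-log pattern looks roughly like this (a sketch based on the JanusGraph transaction-log docs; the 'addedPerson' log identifier and the empty processor body are illustrative):

import org.janusgraph.core.JanusGraphTransaction
import org.janusgraph.core.log.ChangeProcessor
import org.janusgraph.core.log.ChangeState
import org.janusgraph.core.log.TransactionId

logProcessor = JanusGraphFactory.openTransactionLog(graph)
logProcessor.addLogProcessor('addedPerson').
    setProcessorIdentifier('addedPersonCounter').
    setStartTimeNow().
    addProcessor(new ChangeProcessor() {
        @Override
        void process(JanusGraphTransaction tx, TransactionId txId, ChangeState changeState) {
            // only changes written to the 'addedPerson' log after "now" arrive here
        }
    }).
    build()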
Re: How to run groovy script in background?
HadoopMarc <bi...@...>
You could end your script with:

System.exit(0)

HTH, Marc

On Wednesday, December 9, 2020 at 04:16:43 UTC+1, Phate wrote:
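A minimal sketch of what that looks like at the end of a script run with gremlin.sh -e (the properties file and query are illustrative):

graph = JanusGraphFactory.open('conf/janusgraph-cql.properties')
g = graph.traversal()
println g.V().count().next()
graph.close()
System.exit(0)   // without this, lingering non-daemon threads can keep the JVM (and the background job) alive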
Re: How to open the same graph multiple times and not get the same object?
HadoopMarc <bi...@...>
Hi Mladen,

The constructor of StandardJanusGraph seems worth a try: https://github.com/JanusGraph/janusgraph/blob/master/janusgraph-core/src/main/java/org/janusgraph/graphdb/database/StandardJanusGraph.java

HTH, Marc

On Tuesday, December 8, 2020 at 19:34:55 UTC+1, Mladen Marović wrote:
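A rough sketch of that approach for 0.5.x (class names from memory and to be checked against the JanusGraph version in use); each call yields an independent graph instance on top of the same storage backend:

import org.apache.commons.configuration.PropertiesConfiguration
import org.janusgraph.diskstorage.configuration.backend.CommonsConfiguration
import org.janusgraph.graphdb.configuration.builder.GraphDatabaseConfigurationBuilder
import org.janusgraph.graphdb.database.StandardJanusGraph

readConfig = new CommonsConfiguration(new PropertiesConfiguration('conf/janusgraph-cql.properties'))
graph1 = new StandardJanusGraph(new GraphDatabaseConfigurationBuilder().build(readConfig))
graph2 = new StandardJanusGraph(new GraphDatabaseConfigurationBuilder().build(readConfig))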