Re: Janus Graph Performance with Cassandra vs BigTable
hadoopmarc@...
Hi Vishal,
Your question is very general. What is most important to you: write performance, simple queries, complex queries? Do you mean a comparison between managed Cassandra and managed Bigtable in terms of euros needed for a specific workload? I am not aware of independent benchmark results for the JanusGraph storage backends, and vendors can be skimpy about the circumstances of the benchmarks they present. Some general notions:
Re: a problem about elasticsearch
Is it ES (the software) that is bottlenecking, or could it be the HW you have it running on? If the HW isn't the issue, have you been able to trace where the problem is in ES? If not, I'd be remiss not to put in a plug for Scylla as a better-performing option for a JanusGraph data store. Hope you get it resolved!
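For reference, JanusGraph talks to Scylla through its regular CQL backend, so trying it is mostly a configuration change (hostnames below are placeholders):

    storage.backend=cql
    storage.hostname=scylla-node1,scylla-node2,scylla-node3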
Re: a problem about elasticsearch
anjanisingh22@...
Hi Anshul,
I am facing the same issue. Did you get any solution for it? Thanks, Anjani
Janus Graph Performance with Cassandra vs BigTable
Vishal Gupta <vgupta@...>
Hi Community/Team,
I see that JanusGraph can be integrated with multiple storage backends like Cassandra and BigTable. I am trying to evaluate which storage backend is more performant for JanusGraph and want to see if people have any recommendations here. Has anyone done a performance comparison of JanusGraph + BigTable vs JanusGraph + Cassandra? Thanks, Vishal
Transaction Recovery and Bulk Loading
madams@...
Hi all, Thanks,
Issues while iterating over self-loop edges in Apache Spark
Mladen Marović
Hello, while debugging some Apache Spark jobs that process data from a JanusGraph graph, I noticed some issues with self-loop edges (edges that connect a vertex to itself). The data is read using:
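A minimal sketch of this kind of read, assuming the janusgraph-hadoop SparkGraphComputer setup and a hypothetical read-graph.properties file that configures the JanusGraph input format:

    import org.apache.tinkerpop.gremlin.process.traversal.dsl.graph.GraphTraversalSource;
    import org.apache.tinkerpop.gremlin.spark.process.computer.SparkGraphComputer;
    import org.apache.tinkerpop.gremlin.structure.Graph;
    import org.apache.tinkerpop.gremlin.structure.util.GraphFactory;

    // Open a HadoopGraph backed by the JanusGraph input format and
    // run traversals over it as Spark jobs.
    Graph readGraph = GraphFactory.open("read-graph.properties");
    GraphTraversalSource g = readGraph.traversal().withComputer(SparkGraphComputer.class);
    long edgeCount = g.E().count().next();  // executes as a Spark job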
When I try to process all outbound edges of a single vertex (e.g. by iterating vertex.edges(Direction.OUT)), and that vertex has multiple self-loop edges with the same edge label, the iterator always returns only one such edge. Edges that are not self-loops are all returned as expected. To give a specific example: if I have a vertex V0 with edges E1, E2, E3, E4, E5 that lead to vertices V1, V2, V3, V4, V5, the call returns all five edges; but if several of those edges are instead self-loops on V0 with the same label, only one of them is returned.

After further analysis, I came upon this commit: https://github.com/JanusGraph/janusgraph/commit/d3006dc939c1b640bb263806abd3fd6bee630d12, which explicitly added code that skips deserializing multiple self-loop edges. The code from the linked commit is still present in org.janusgraph:janusgraph-hadoop:0.5.3 and seems to be the cause of this unexpected behavior. My questions are as follows:
Kind regards, Mladen Marović
Re: Difference Between JanusGraph Server and Embedded JanusGraph in Java
hadoopmarc@...
Hi Zach,
1. For building an API service you do not need Gremlin Server. Gremlin Server has all kinds of features, though, that might (slightly) relieve the complexity of your service (with the complexity of maintaining Gremlin Server added). The main driver for using Gremlin Server is the support for Gremlin Language Variants, which you do not need. Resource usage should not differ very much for similar workloads and comparable settings; Gremlin Server requires an additional JVM, but might be more optimized than what you build in-house.

2. First check connecting to Gremlin Server from the Gremlin Console (see the session sketched below). If that works, please report more details about what visualization tool you use.

Best wishes, Marc
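A minimal console session for this check, assuming a remote.yaml that points at the server:

    gremlin> :remote connect tinkerpop.server conf/remote.yaml
    gremlin> :remote console
    gremlin> g.V().count()

If the count is zero here but non-zero in the embedded setup, the server is most likely opening a different graph than the embedded configuration does.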
Difference Between JanusGraph Server and Embedded JanusGraph in Java
Zach B.
I've seen a lot of discussion about the benefits of both implementations, but I was wondering: is there a big difference in terms of resource usage? I'm building an API service that will be deployed to a low-resource virtual machine, so the memory usage of the two implementations matters to me.
On an unrelated note, I have been developing using the embedded implementation with HBase as a storage backend. I wanted to use a visualization tool to see if my graph is appearing the way I want; however, all the tools I see require gremlin-server. So I started up the server using the exact same HBase configuration as the embedded setup, but it displays an empty graph. Does anyone know why that is the case? Thank you in advance.
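For reference, "embedded" here means opening the graph in-process, roughly like this (a sketch; the hostname is a placeholder):

    import org.apache.tinkerpop.gremlin.process.traversal.dsl.graph.GraphTraversalSource;
    import org.janusgraph.core.JanusGraph;
    import org.janusgraph.core.JanusGraphFactory;

    // Open JanusGraph in the same JVM as the application, backed by HBase.
    JanusGraph graph = JanusGraphFactory.build()
            .set("storage.backend", "hbase")
            .set("storage.hostname", "zookeeper-host")
            .open();
    GraphTraversalSource g = graph.traversal();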
Re: Getting org.janusgraph.graphdb.database.idassigner.IDPoolExhaustedException consistently
hadoopmarc@...
Hi,
There does not seem to be much that helps in finding a root cause (no similar questions or issues in history). The most helpful thing I found is the following javadoc: https://javadoc.io/doc/org.janusgraph/janusgraph-core/latest/org/janusgraph/graphdb/database/idassigner/placement/SimpleBulkPlacementStrategy.html

Assuming that you use this default SimpleBulkPlacementStrategy, what value do you use for ids.num-partitions? The default number might be too small. At the beginning of a Spark job, the tasks can be more or less synchronized, that is, they finish after about the same amount of time and then cause congestion (task number 349 ...). If this is the case, other configs could help too (see the sketch after this list):

- ids.renew-percentage: if you increase this value, congestion is avoided a bit, but this cannot have a high impact.
- ids.flush: I assume you did not change the default "true" value.
- ids.authority.conflict-avoidance-mode: undocumented, but talks about contention during ID block reservation.

Best wishes, Marc
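For illustration, a sketch of applying these settings when opening the graph on each executor (the values are made up, not recommendations):

    import org.janusgraph.core.JanusGraph;
    import org.janusgraph.core.JanusGraphFactory;

    // Tune ID allocation to reduce contention between synchronized Spark tasks.
    JanusGraph graph = JanusGraphFactory.build()
            .set("storage.backend", "cql")
            .set("storage.hostname", "127.0.0.1")
            .set("ids.num-partitions", 32)     // more partitions spread ID block acquisition
            .set("ids.renew-percentage", 0.3)  // renew ID blocks earlier to soften congestion
            .open();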
Getting org.janusgraph.graphdb.database.idassigner.IDPoolExhaustedException consistently
Hi
I am getting the below exception while ingesting data into an existing graph:
The value of `ids.block-size` is set to 5000000 (5M) and I am using Spark for data loading (around 300 executors per run). Could you please suggest a configuration that can fix this issue? Thanks
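For reference, this option lives in the graph properties file; with the value reported above it would read:

    ids.block-size=5000000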
Re: Backend data model deserialization
Elliot Block <eblock@...>
Awesome, thank you all for the great info and recent presentations! We are prototyping bulk export + deserialization from Cloud Bigtable over approximately the next week and will try to report back if we can produce something useful to share. Thanks again, -Elliot
Re: ID block allocation exception while creating edge
hadoopmarc@...
Hi Anjani,
One thing that does not feel good is that you create and commit a transaction for every row of your dataframe. Although I do not see how this would interfere with ID allocation, best practice is to have partitions of about 10,000 vertices/edges and commit these as one batch, as sketched below. In case of an exception, you roll back the transaction and raise your own exception. After that, Spark will retry the partition and your job will still succeed. It is worth a try. Best wishes, Marc
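A minimal sketch of this per-partition pattern (GraphHolder is a hypothetical helper returning the per-JVM JanusGraph instance, since graph objects are not serializable):

    import org.apache.spark.api.java.JavaPairRDD;
    import org.janusgraph.core.JanusGraphTransaction;

    void load(JavaPairRDD<String, String> pairRDD) {
        pairRDD.foreachPartition(partIterator -> {
            // One transaction per partition instead of per row.
            JanusGraphTransaction tx = GraphHolder.get().newTransaction();
            try {
                partIterator.forEachRemaining(pair -> {
                    // create vertices/edges on tx here, e.g. tx.addVertex(...)
                });
                tx.commit();   // single commit for the whole partition
            } catch (Exception e) {
                tx.rollback(); // rethrow so Spark retries the partition
                throw e;
            }
        });
    }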
Re: Making janus graph client to not use QUORUM
anjanisingh22@...
Thanks Marc, I will try that option.
Re: ID block allocation exception while creating edge
anjanisingh22@...
Sharing details on how I am creating nodes/edges, to make sure nothing is wrong there that could be causing the ID allocation failures.
I am creating one static JanusGraph instance on each Spark worker box and using it to create and commit multiple transactions:

    pairRDD.foreachPartition(partIterator -> {

In the createNodeAndEdge() method I am creating a GraphTraversalSource from the static JanusGraph instance, creating the node and edge, committing, and then closing the GraphTraversalSource, as shown below in pseudocode:

    createNodeAndEdge(Tuple2<K, V> pair, JanusGraph janusGraph) {
        GraphTraversalSource g = janusGraph.buildTransaction().start().traversal();
        try {
            // create node; create edge;
            g.tx().commit();
        } catch (Exception e) {
            g.tx().rollback();
        } finally {
            g.close();
        }
    }
Thanks,
Re: ID block allocation exception while creating edge
anjanisingh22@...
Thanks for the response, Marc. Yes, I also think the changes are not getting picked up for some reason, but I am not able to figure out why. In the code I have only one method that creates the JanusGraph instance, and the same instance is passed to the method for node/edge creation.
Re: ID block allocation exception while creating edge
hadoopmarc@...
Hi Anjani,
It is still most likely that the modified value of "ids.block-size" somehow does not come through. So, are you sure that
Re: ID block allocation exception while creating edge
anjanisingh22@...
Hi Marc,
I tried setting ids.num-partitions = number of executors through code, not directly in the JanusGraph global config files, but no luck. I added the below properties, but they didn't help:

    configProps.set("ids.renew-timeout", "240000");

Thanks, Anjani
Re: MapReduce reindexing with authentication
Boxuan Li
Hi Marc,

That is an interesting solution. I was not aware of the mapreduce.application.classpath property. It is not well documented, but from what I understand, this option is used primarily to distribute the MapReduce framework rather than user files. Glad to know it can be used for user files as well. I am not 100% sure, but it seems to require you to upload the file to HDFS first (if you are using a YARN cluster). The ToolRunner, however, can add a file from the local filesystem too; a sketch follows below. We prefer not to store keytab files on HDFS permanently. This difference is subtle, though. Also, we don't use the Gremlin Console anyway, so not being able to do this via the Gremlin Console is not a drawback for us.

Agree with you that the documentation can be enhanced. Right now it simply says "The class starts a Hadoop MapReduce job using the Hadoop configuration and jars on the classpath.", which is too brief and assumes users have a good knowledge of Hadoop MapReduce.

> One could even think of putting the mapreduce properties in the graph properties file and pass on properties of this namespace to the mapreduce client.

Not sure if it's possible, but if someone implements it, it would be very helpful for users to do a quick start without worrying about the cumbersome Hadoop configs.

Best regards, Boxuan
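A minimal sketch of the ToolRunner route (ReindexTool is hypothetical; ToolRunner's option parsing handles generic options such as -files /local/path/user.keytab and ships those files to the workers):

    import org.apache.hadoop.conf.Configured;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.util.Tool;
    import org.apache.hadoop.util.ToolRunner;

    public class ReindexTool extends Configured implements Tool {
        @Override
        public int run(String[] args) throws Exception {
            // Generic options (-files, -libjars, -D...) were already parsed
            // by ToolRunner and merged into getConf().
            Job job = Job.getInstance(getConf(), "janusgraph-reindex");
            // ... configure mappers/reducers for the reindex job here ...
            return job.waitForCompletion(true) ? 0 : 1;
        }

        public static void main(String[] args) throws Exception {
            System.exit(ToolRunner.run(new ReindexTool(), args));
        }
    }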
Re: MapReduce reindexing with authentication
hadoopmarc@...
Hi Boxuan,
Yes, you are right, I mixed things up by wrongly interpreting GENERIC_OPTIONS as an env variable. I did some additional experiments, though, bringing in new information.

1. It is possible to put a mapred-site.xml file on the JanusGraph classpath that is automatically loaded by the mapreduce client. When using the file below during mapreduce reindexing, I get the following exception (on purpose):

    gremlin> mr.updateIndex(i, SchemaAction.REINDEX).get()
    java.io.FileNotFoundException: File file:/tera/lib/janusgraph-full-0.5.3/hi.tgz does not exist

The mapreduce config parameters are listed in https://hadoop.apache.org/docs/r2.7.3/hadoop-mapreduce-client/hadoop-mapreduce-client-core/mapred-default.xml. The description for mapreduce.application.framework.path suggests that you can pass additional files to the mapreduce workers using this option (without any changes to JanusGraph).

    <?xml version="1.0" encoding="UTF-8"?>
    <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
    <configuration>
      <property>
        <name>mapreduce.framework.name</name>
        <value>local</value>
      </property>
      <property>
        <name>mapreduce.application.classpath</name>
        <value>dummy</value>
      </property>
      <property>
        <name>mapreduce.application.framework.path</name>
        <value>hi.tgz</value>
      </property>
      <property>
        <name>mapred.map.tasks</name>
        <value>2</value>
      </property>
      <property>
        <name>mapred.reduce.tasks</name>
        <value>2</value>
      </property>
    </configuration>

2. When using mapreduce reindexing in the documented way, it already issues the following warning:

    08:49:55 WARN org.apache.hadoop.mapreduce.JobResourceUploader - Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.

If you resolve your keytab issue by modifying the JanusGraph code and calling the Hadoop ToolRunner, you have the additional advantage of getting rid of this warning. This would not work from the Gremlin Console, though, unless gremlin.sh passes the additional command-line options to the java command line (ugly). So, I think I would prefer the option with mapred-site.xml. It would not hurt to slightly extend the mapreduce reindexing documentation, anyway:
Re: Making janus graph client to not use QUORUM
hadoopmarc@...
Hi Anjani,
To see what exactly happens with local configurations, I did the following:
    conf = graph.getConfiguration().getLocalConfiguration()
    ks = conf.getKeys(); null;
    while (ks.hasNext()) {
        k = ks.next()
        System.out.print(String.format("%30s: %s\n", k, conf.getProperty(k)))
    }

With printed output:

    storage.hostname: 127.0.0.1
    storage.cql.read-consistency-level: LOCAL_ONE
    cache.db-cache: true
    storage.cql.keyspace: janusgraph
    storage.backend: cql
    index.search.hostname: 127.0.0.1
    cache.db-cache-size: 0.25
    gremlin.graph: org.janusgraph.core.JanusGraphFactory

Can you do the same printing of configurations on the client that shows the exception about the QUORUM? In this way, we can check whether the problem is in your code or in JanusGraph not properly passing the local configurations.

Best wishes, Marc