Re: JanusGraph and future versions of ES
Assaf Schwartz
Thank you very much everyone
Re: JanusGraph and future versions of ES
It would only affect the use of JanusGraph+Elasticsearch if you were offering it in an -as-a-Service (-aaS) form, like a commercial DBaaS. In that case the SSPL would require you to contribute your modifications, UI, etc. back to the code base. We did a side-by-side analysis of SSPL vs. AGPL when it was first announced. You can read that blog here: The most germane language is this: SSPL requires that if you offer software as a service you must make public, open source, practically almost everything related to your service: “including, without limitation, management software, user interfaces, application program interfaces, automation software, monitoring software, backup software, storage software and hosting software, all such that a user could run an instance of the service using the Service Source Code you make available.”

But if you are just using it internally, or deep within another product (one that an end user doesn't have direct access to), you have no worries.

Disclaimer: this is my opinion; I am not a lawyer. YMMV. Caveat emptor. Void where prohibited by law.

On Fri, Jan 22, 2021, 6:49 AM BO XUAN LI <liboxuan@...> wrote:
If I am understanding correctly, this does not affect the usage of Elasticsearch in JanusGraph. Regarding AWS's decision, it seems no additional config or code changes are needed if someone wants to use the AWS fork (assuming Amazon would not introduce breaking changes causing client incompatibility).
Re: Janusgraph traversal time
hadoopmarc@...
Sounds normal to me. Note that setting query.batch=true might help a bit. Also note that the graph structure, in particular the number of repeat steps, is important. Finally, note that caching in the storage backend (in addition to caching inside JanusGraph) plays a role, see:
http://yaaics.blogspot.com/2018/04/understanding-reponse-times-for-single.html

Best wishes, Marc
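For readers who want to try this, here is a minimal sketch (not part of the original reply) of enabling query.batch when opening the graph programmatically; the storage settings shown are placeholders for your own backend. The same option can equally be set in the graph's properties file as query.batch=true.

import org.janusgraph.core.JanusGraph;
import org.janusgraph.core.JanusGraphFactory;

public class OpenGraphWithBatching {
    public static void main(String[] args) {
        // Placeholder storage settings; query.batch=true lets JanusGraph batch
        // backend lookups instead of issuing them one vertex at a time.
        JanusGraph graph = JanusGraphFactory.build()
                .set("storage.backend", "cql")
                .set("storage.hostname", "127.0.0.1")
                .set("query.batch", true)
                .open();
        // ... run traversals such as g.V(parentId).repeat(out()).emit() here ...
        graph.close();
    }
}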
Re: JanusGraph and future versions of ES
Boxuan Li
If I am understanding correctly, this does not affect the usage of Elasticsearch in JanusGraph. Regarding AWS's decision, it seems no additional config or code changes are needed if someone wants to use the AWS fork (assuming Amazon would not introduce breaking changes causing client incompatibility).
Janusgraph traversal time
faizez@...
Hi All,
I have a multi-level graph with a total of 800 vertices. When I run g.V(<id of parent>).repeat(out()).emit() for the first time, it takes 300ms. In profile(), I can see that RepeatStep takes most of the time (almost 296ms). I can also see that it executes as many backend-query operations as there are vertices. But when I run the same query again, it completes in 4ms. I know that the second run uses the cache.
My question is whether traversing 800 vertices without the cache in 300ms is normal or not?
Note: I'm using JG 0.5.2 with out-of-the-box JanusGraph configurations.
Re: JanusGraph and future versions of ES
hadoopmarc@...
Please also read the stance of Elasticsearch itself (because the AWS article did not link to it):
https://www.elastic.co/blog/license-change-clarification
JanusGraph and future versions of ES
Assaf Schwartz
Hi all,
Since ES seems to have changed its licensing, what does this entail for the future usage of ES as a JanusGraph index backend? https://aws.amazon.com/blogs/opensource/stepping-up-for-a-truly-open-source-elasticsearch/

Thanks!
Re: Janusgraph query execution performance
hadoopmarc@...
Analytical queries require a full table scan. Some people succeed in speeding up analytical queries on JanusGraph using OLAP; check the older questions on OLAP and SparkGraphComputer and
https://docs.janusgraph.org/advanced-topics/hadoop/

A special case that occurs very frequently is counting the number of vertices for each label (you say: concept). Speeding this up is listed in the known issues: https://github.com/JanusGraph/janusgraph/issues/926

Best wishes, Marc
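As an illustration of the OLAP route mentioned above, here is a rough sketch (not from the original answer) of counting vertices per label in a single SparkGraphComputer pass. The properties file name is a placeholder for a HadoopGraph configuration of the kind described in the hadoop docs linked above.

import org.apache.tinkerpop.gremlin.process.traversal.dsl.graph.GraphTraversalSource;
import org.apache.tinkerpop.gremlin.spark.process.computer.SparkGraphComputer;
import org.apache.tinkerpop.gremlin.structure.Graph;
import org.apache.tinkerpop.gremlin.structure.T;
import org.apache.tinkerpop.gremlin.structure.util.GraphFactory;

import java.util.Map;

public class OlapLabelCount {
    public static void main(String[] args) {
        // "read-cql.properties" is a placeholder for a HadoopGraph config that points
        // at the JanusGraph storage backend via one of the provided InputFormats.
        Graph graph = GraphFactory.open("read-cql.properties");
        GraphTraversalSource g = graph.traversal().withComputer(SparkGraphComputer.class);
        // Count vertices per label in one full scan instead of one count() per label.
        Map<Object, Long> countsByLabel = g.V().groupCount().by(T.label).next();
        System.out.println(countsByLabel);
    }
}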
Janusgraph query execution performance
lalwani.ritu2609@...
Hi,
I have used the https://github.com/IBM/expressive-reasoning-graph-store project to import a Turtle file containing around 400,000 (4 lakh) concepts, and this project uses JanusGraph 0.4.0. After importing, I am able to run queries. But the problem I am facing is that some queries, which access a small number of nodes, are quite fast, while other queries, like counting the number of concepts in the graph (which access a large number of nodes), are very slow. Please note that I have already used indexing. So is this due to the version of JanusGraph, which is 0.4.0 (quite an old version)? Or is this simply the expected performance for JanusGraph? Any help will be highly appreciated. Thanks!!
Re: OLAP Spark
hadoopmarc@...
Hi Vinayak,
JanusGraph has defined Hadoop InputFormats for its storage backends to do OLAP queries, see https://docs.janusgraph.org/advanced-topics/hadoop/ However, these InputFormats have several problems regarding performance (see the old questions on this list), so your approach could be worthwhile:

1. It is best to create these IDs on ingestion of data into JanusGraph and add them as a vertex property. If you create an index on this property, it is possible to use these id properties for retrieval during OLAP queries.
2. Spark does this automatically if you call rdd.mapPartitions on the RDD with ids.
3. Here is the disadvantage of this approach: you simply run the Gremlin query per partition of ids, but you have to merge the results per partition afterwards outside Gremlin. The merge logic differs per type of query. (A rough sketch of points 2 and 3 follows below.)

Best wishes, Marc
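The following is a minimal sketch of that idea, not code from the original answer: vertex ids are spread over Spark partitions, each partition opens its own OLTP connection, runs a Gremlin query, and returns partial results that are merged afterwards. The property name "myId" and the storage settings are hypothetical.

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.tinkerpop.gremlin.process.traversal.dsl.graph.GraphTraversalSource;
import org.janusgraph.core.JanusGraph;
import org.janusgraph.core.JanusGraphFactory;

import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

public class PerPartitionQuery {

    // Distribute the collected ids over partitions and query them in parallel.
    public static long totalOutDegree(JavaSparkContext sc, List<Object> ids) {
        JavaRDD<Object> idRdd = sc.parallelize(ids, 8);   // 8 partitions, placeholder
        JavaRDD<Long> partial = idRdd.mapPartitions(PerPartitionQuery::queryPartition);
        // Merge step outside Gremlin; for a count query the merge is a simple sum.
        return partial.reduce(Long::sum);
    }

    // Each executor opens its own OLTP connection and handles only its slice of ids.
    private static Iterator<Long> queryPartition(Iterator<Object> ids) {
        JanusGraph graph = JanusGraphFactory.build()
                .set("storage.backend", "cql")            // placeholder settings
                .set("storage.hostname", "127.0.0.1")
                .open();
        GraphTraversalSource g = graph.traversal();
        List<Long> results = new ArrayList<>();
        while (ids.hasNext()) {
            // "myId" is the indexed id property added at ingestion time (point 1).
            results.add(g.V().has("myId", ids.next()).out().count().next());
        }
        graph.close();
        return results.iterator();
    }
}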
Re: Janusgraph spark on yarn error
hadoopmarc@...
The path of the BulkLoaderVertexProgram might be doable, but I cannot help you on that one. In the stack trace above, the YARN application master from spark-yarn apparently tries to communicate with HBase but finds that various libraries do not match. This failure arises because the JanusGraph distribution does not include spark-yarn and thus is not handcrafted to work with spark-yarn.

For the path without BulkLoaderVertexProgram you inevitably need a JVM language (Java, Scala, Groovy). In this case, a Spark executor is unaware of any other executors running and is simply passed a callable (function) to execute (through RDD.mapPartitions() or through a spark-sql UDF). This callable can be part of a class that establishes its own JanusGraph instances in the OLTP way. Now you only have to deal with the executor CLASSPATH, which does not need spark-yarn, and the libs from the JanusGraph distribution suffice. Some example code can be found at: https://nitinpoddar.medium.com/bulk-loading-data-into-janusgraph-part-2-ca946db26582

Best wishes, Marc
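To make the executor-side pattern concrete, here is a hedged sketch (not Marc's code, and much simpler than the blog post linked above) of loading records the OLTP way from within Spark partitions. The input row type, the "device"/"serial" label and property names, and the storage settings are all hypothetical.

import org.apache.spark.api.java.JavaRDD;
import org.apache.tinkerpop.gremlin.process.traversal.dsl.graph.GraphTraversalSource;
import org.janusgraph.core.JanusGraph;
import org.janusgraph.core.JanusGraphFactory;

import java.util.Iterator;

public class OltpBulkLoader {

    // rows: e.g. a JavaRDD<String[]> parsed from the part files on HDFS.
    public static void load(JavaRDD<String[]> rows) {
        rows.foreachPartition(OltpBulkLoader::loadPartition);
    }

    // Each executor opens a plain OLTP connection; no spark-yarn or InputFormats needed.
    private static void loadPartition(Iterator<String[]> rows) {
        JanusGraph graph = JanusGraphFactory.build()
                .set("storage.backend", "hbase")          // placeholder settings
                .set("storage.hostname", "192.168.1.11")
                .open();
        GraphTraversalSource g = graph.traversal();
        int batch = 0;
        while (rows.hasNext()) {
            String[] row = rows.next();
            // "device" and "serial" are hypothetical label/property names.
            g.addV("device").property("serial", row[0]).iterate();
            if (++batch % 1000 == 0) {
                g.tx().commit();                          // commit in batches
            }
        }
        g.tx().commit();
        graph.close();
    }
}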
Re: reindex job is very slow on ElasticSearch and BigTable
hadoopmarc@...
I mean: what happens if you try to run MapReduceIndexManagement on BigTable? Apparently you get the error message "MapReduceIndexManagement is not supported for BigTable", but I would like to see the full stack trace leading to this error message, to see where this incompatibility stems from. E.g. the code in
https://github.com/JanusGraph/janusgraph/blob/d954ea02035d8d54b4e1bd5863d1f903e6d57844/janusgraph-hadoop/src/main/java/org/janusgraph/hadoop/MapReduceIndexManagement.java reads:

HadoopStoreManager storeManager = (HadoopStoreManager) graph.getBackend().getStoreManager().getHadoopManager();

But this is not what you see.

Best wishes, Marc
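For context, invoking a MapReduce reindex normally looks roughly like the sketch below (based on the JanusGraph reindexing docs, not on this thread; "byNameMixed" and the properties file name are hypothetical). Capturing the full stack trace thrown by this call is what is being asked for here.

import org.janusgraph.core.JanusGraph;
import org.janusgraph.core.JanusGraphFactory;
import org.janusgraph.core.schema.JanusGraphManagement;
import org.janusgraph.core.schema.SchemaAction;
import org.janusgraph.hadoop.MapReduceIndexManagement;

public class MapReduceReindex {
    public static void main(String[] args) throws Exception {
        JanusGraph graph = JanusGraphFactory.open("janusgraph.properties"); // placeholder
        JanusGraphManagement mgmt = graph.openManagement();
        MapReduceIndexManagement mr = new MapReduceIndexManagement(graph);
        // Kicks off a Hadoop MapReduce job to reindex the named mixed index.
        mr.updateIndex(mgmt.getGraphIndex("byNameMixed"), SchemaAction.REINDEX).get();
        mgmt.commit();
        graph.close();
    }
}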
OLAP Spark
Vinayak Bali
Hi All,
I am working on OLAP using Spark and Hadoop. I have a couple of questions:
1. How to execute a filter step on the driver and create an RDD of internal ids?
2. How to distribute the collected ids to multiple Spark executors?
3. How to execute Gremlin in parallel?

Thanks & Regards, Vinayak
Re: Database Level Caching
Boxuan Li
Thanks Nicolas, I am able to reproduce it using your configs & script. Created an issue at https://github.com/JanusGraph/janusgraph/issues/2369
Looks like a bug with calculating cache entries' size.
Re: Janusgraph spark on yarn error
Thank you for the response!

I am using BulkLoaderVertexProgram from the console. Sometimes it works correctly. The error still exists when I am running the read-from-HBase Spark job.

My read-hbase.properties:

gremlin.graph=org.apache.tinkerpop.gremlin.hadoop.structure.HadoopGraph
gremlin.hadoop.graphReader=org.janusgraph.hadoop.formats.hbase.HBaseInputFormat
gremlin.hadoop.graphWriter=org.apache.tinkerpop.gremlin.hadoop.structure.io.gryo.GryoOutputFormat
gremlin.hadoop.jarsInDistributedCache=false
gremlin.hadoop.inputLocation=none
gremlin.hadoop.outputLocation=output
janusgraphmr.ioformat.conf.storage.backend=hbase
janusgraphmr.ioformat.conf.storage.hostname=192.168.1.11,192.168.1.12,192.168.1.13,192.168.1.14
janusgraphmr.ioformat.conf.storage.hbase.table=testTable
spark.master=yarn
spark.submit.deployMode=client
spark.yarn.archive=/usr/local/janusgraph/janusgraph_libs.zip
spark.executor.instances=2
spark.driver.memory=8g
spark.driver.cores=4
spark.executor.cores=5
spark.executor.memory=19g
spark.executor.extraClassPath=/usr/local/janusgraph/lib:/usr/local/hadoop/etc/hadoop/conf
spark.executor.extraJavaOptions=-Djava.library.path=/usr/local/hadoop/lib/native
spark.yarn.am.extraJavaOptions=-Djava.library.path=/usr/local/hadoop/lib/native
spark.yarn.appMasterEnv.CLASSPATH=/usr/local/janusgraph/lib:/usr/local/hadoop/etc/hadoop/conf
spark.driver.extraLibraryPath=/usr/local/hadoop/lib/native
spark.executor.extraLibraryPath=/usr/local/hadoop/lib/native
spark.dynamicAllocation.enabled=false
spark.io.compression.codec=snappy
spark.serializer=org.apache.spark.serializer.KryoSerializer
spark.kryo.registrator=org.janusgraph.hadoop.serialize.JanusGraphKryoRegistrator

Can you provide some example code for a Spark application loading data the OLTP way? Which programming language can I use? (I would prefer Python, if possible.)
Re: reindex job is very slow on ElasticSearch and BigTable
vamsi.lingala@...
Thanks a lot for your reply.
I don't get any error message. The REINDEX step is very slow (2000/s) for the mixed index and fails after running for a few days. Every time indexing fails or the cluster is restarted, the shard and replica settings declared for the indices are reset to 1 (the default values for ES 7), and thus I have to recreate new indices after disabling the old ones. Is there any better way to reindex faster?
Re: Janusgraph spark on yarn error
hadoopmarc@...
#Private reply from OP:
Yes, I am running a bulk load from HDFS (GraphSON) into janusgraph-hbase. Yes, I have GraphSON part files from a Spark job with a structure like the grateful-dead.json example. But when the application master starts on a certain (the third) Hadoop node, it works well. All nodes have identical configuration.

#Answer HadoopMarc
You do not need to use HadoopGraph for this. Indeed, there used to be a BulkLoaderVertexProgram in Apache TinkerPop, but it could not be maintained and kept working reliably for the various versions of the various graph systems. Until now, JanusGraph has not developed its own BulkLoaderVertexProgram. Also note that while there does exist an HBaseInputFormat for loading a janusgraph-hbase graph into a HadoopGraph, there does not exist an HBaseOutputFormat to write a HadoopGraph into janusgraph-hbase.

This being said, nothing is lost. You can simply write a Spark application that has individual Spark executors connect to JanusGraph in the usual (OLTP) way and load data with the usual graph.traversal() API, that is, using the addV(), addE() and properties() traversal steps. Of course, you could also try and copy the old code of the BulkLoaderVertexProgram into your project, but I believe the way I sketched is conceptually simpler and less error prone.

I tend to remember that there exist some blog series about using JanusGraph at scale, but I do not have them at hand and will look for them later on. If you find these blogs yourself, please post the links!

Best wishes, Marc
Re: reindex job is very slow on ElasticSearch and BigTable
hadoopmarc@...
Thanks for reposting your issue on the janusgraph-users list!
Can you please show the entire stack trace leading to your error message? Note that your issue might be related to: https://github.com/JanusGraph/janusgraph/issues/2201

Marc
reindex job is very slow on ElasticSearch and BigTable
vamsi.lingala@...
We have imported around 4 billion vertices into JanusGraph.
We are using Bigtable and Elasticsearch. Reindexing speed is very slow, around 2000 records per second. Is there any way to speed it up? MapReduceIndexManagement is not supported for BigTable.
Re: Janusgraph spark on yarn error
hadoopmarc@...
Hi
OK, do I understand right that you want to bulk load data from HDFS into janusgraph-hbase? Nothing wrong with that requirement; I do not know how to ask this in a more friendly way! Is your input data really in GraphSON format? (It is difficult to get this right!) With that established, we can see further, because this is a broad subject.

Marc