[Meetup] JanusGraph Meetup May 18 covering JG OLAP approaches
Ted Wilmes
Hello,

We will be hosting a community meetup next week on Tuesday, May 18th at 9:30 central / 10:30 eastern. We have a great set of speakers who will be discussing all things JanusGraph OLAP:

* Hadoop Marc, who has helped many of us on the mailing list and in JG issues
* Saurabh Verma, principal engineer at Zeotap
* Bruno Berriso, engineer at Expero

If you're interested in signing up, here's the link: https://www.experoinc.com/get/janusgraph-user-group.

Thanks, Ted
|
|
Re: MapReduce reindexing with authentication
Boxuan Li
Hi Marc, you are right, we are indeed using this -files option :)
|
|
Re: MapReduce reindexing with authentication
hadoopmarc@...
Hi Boxuan,
Using existing mechanisms for configuring MapReduce would be nicer, indeed. Reading about the hadoop command, I see a GENERIC_OPTIONS set of options read by the MapReduce client, which can include a -files option. Maybe it is possible to include a JAAS file that points to the (already installed?) keytab file on the workers? Best wishes, Marc
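For readers following along, a hypothetical invocation along those lines could look like this (the jar, job class, and file paths are made up for illustration; -files is one of Hadoop's documented generic options, parsed when the job uses ToolRunner):

```shell
# Sketch only: ship a JAAS config and a keytab to every worker's task
# working directory via the -files generic option. All names below are
# illustrative, not taken from this thread.
hadoop jar janusgraph-reindex.jar com.example.ReindexJob \
    -files /etc/security/jaas.conf,/etc/security/es-client.keytab \
    <other-job-args>
```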
|
|
MapReduce reindexing with authentication
Boxuan Li
We have been using a yarn cluster to run MapReduce reindexing (Cassandra + Elasticsearch) for a long time. Recently, we introduced Kerberos-based authentication to the Elasticsearch cluster, meaning that worker nodes need to authenticate via a keytab file.
We managed to achieve this by using a hadoop command to include the keytab file when submitting the MapReduce job. Hadoop automatically copies this file and distributes it to the working directory of all worker nodes. This works well for us, except that we have to make changes to the MapReduceIndexManagement class so that it accepts an org.apache.hadoop.conf.Configuration object (created by org.apache.hadoop.util.ToolRunner) rather than instantiating one by itself. We are happy to submit a PR for this, but I would like to hear if there is a better way of handling this. Cheers, Boxuan
|
|
Multiple entries with same key on mixed index
toom@...
Hello, I haven't identified the root cause and I don't know how to reproduce it. I would like to find a solution to repair the inconsistency without losing my data. The exposed API of JanusGraphManagement doesn't seem to be helpful: a mixed index can't be removed, and an IllegalArgumentException is raised when I try to retrieve the index status. Removing the index in Lucene or unconfiguring the index backend doesn't help either. So I've tried to find a solution using the internal API: ((ManagementSystem) mgmt).getWrappedTx(). Is it safe to delete the Lucene data files and remove the schema vertex related to the mixed index? Is there a reason not to permit removing a mixed index? Best wishes, Toom.
|
|
Re: Support for DB cache for Multi Node Janus Server Setup
pasansumanathilake@...
On Thu, May 6, 2021 at 11:36 PM, <hadoopmarc@...> wrote:
Hi Marc, Thanks for the reply. Yes, it's true that a multi-node setup uses the same storage backend. However, what I am referring to here is the JanusGraph database cache - https://docs.janusgraph.org/basics/cache/#database-level-caching - which uses JanusGraph heap memory.
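For context, the database-level cache under discussion is configured in each server's properties file; a minimal sketch (values are illustrative, option names are from the JanusGraph configuration reference):

```properties
# Database-level cache: held in each JanusGraph JVM's heap and not
# shared between servers, so in a multi-node setup each node caches
# independently and may serve stale data after another node writes.
cache.db-cache = true
cache.db-cache-time = 180000
cache.db-cache-size = 0.25
cache.db-cache-clean-wait = 20
```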
|
|
Storing and reading connected component RDD through OutputFormatRDD & InputFormatRDD
Hi All, I see that in the TinkerPop library we have OutputFormatRDD to save data. I tried outputFormatRDD.writeGraphRDD(graphComputerConfiguration, uniqueRDD); but it throws a ClassCastException, because the connected component vertex program's output RDD value is a list, which cannot be cast to VertexWritable.
outputFormatRDD.writeMemoryRDD(graphComputerConfiguration, "memoryKey", uniqueRDD); saves the RDD by creating a folder named after the memory key at the output location.
Thanks,
|
|
Re: ID block allocation exception while creating edge
anjanisingh22@...
On Tue, May 11, 2021 at 04:56 PM, <hadoopmarc@...> wrote:
Thanks for the help, Marc. I will try setting ids.num-partitions to the number of executors.
|
|
Re: ID block allocation exception while creating edge
hadoopmarc@...
Hi Anjani,
It is a while ago that I did this myself. I interpret ids.num-partitions as a stock of reserved id blocks that can be delegated to a JanusGraph instance. It does not have a large value, so as not to waste id space. Actually, the number of parallel tasks is not the number we want: we want ids.num-partitions to be equal to the number of JanusGraph instances, because initially all JanusGraph instances ask for an id block at the same time. Note that the cores in a Spark executor can share the same JanusGraph instance if you use a singleton object for that. So, if you have 50 executors with 5 cores each (and use a singleton JanusGraph instance), I would try ids.num-partitions = 50. Best wishes, Marc
|
|
Re: ID block allocation exception while creating edge
anjanisingh22@...
We have 250 parallel Spark tasks running for creating nodes/edges.
I didn't understand "parallel tasks" (for setting ids.num-partitions). Could you please help me with that?
|
|
Re: ID block allocation exception while creating edge
hadoopmarc@...
What is the number of parallel tasks? (for setting ids.num-partitions)
You still have ids.authority.wait-time on its default value of 300 ms, so that seems worthwhile experimenting with. Best wishes, Marc
|
|
Re: ID block allocation exception while creating edge
anjanisingh22@...
On Tue, May 11, 2021 at 11:54 AM, <hadoopmarc@...> wrote:
https://docs.janusgraph.org/advanced-topics/bulk-loading/#optimizing-id-allocation

Thanks for the response, Marc. Below is the method I am using to create the Janus connection:

public JanusGraph createJanusConnection(HashMap<String, Object> janusConfig) {
    JanusGraphFactory.Builder configProps = JanusGraphFactory.build();
    configProps.set(GREMLIN_GRAPH, "org.janusgraph.core.JanusGraphFactory");
    configProps.set(STORAGE_BACKEND, "cql");
    configProps.set(STORAGE_HOSTNAME, janusConfig.get("storage.hostname"));
    configProps.set(STORAGE_CQL_KEYSPACE, janusConfig.get("storage.keyspace"));
    configProps.set(CACHE_DB_CACHE, "false");
    configProps.set(CACHE_DB_CACHE_SIZE, "0.5");
    configProps.set(CACHE_DB_CACHE_TIME, "180000");
    configProps.set(CACHE_DB_CACHE_CLEAN_WAIT, "20");
    configProps.set(STORAGE_CQL_LOCAL_DATACENTER, janusConfig.get("local-datacenter"));
    configProps.set(STORAGE_CQL_WRITE_CONSISTENCY_LEVEL, "LOCAL_ONE");
    configProps.set(STORAGE_CQL_READ_CONSISTENCY_LEVEL, "LOCAL_ONE");
    configProps.set(STORAGE_CQL_SSL_ENABLED, janusConfig.get("cql.ssl.enabled"));
    configProps.set(STORAGE_CQL_SSL_TRUSTSTORE_LOCATION, janusConfig.get("truststore.location"));
    configProps.set(STORAGE_CQL_SSL_TRUSTSTORE_PASSWORD, janusConfig.get("truststore.password"));
    configProps.set(STORAGE_USERNAME, janusConfig.get("cassandra.username"));
    configProps.set(STORAGE_PASSWORD, janusConfig.get("cassandra.password"));
    configProps.set("storage.read-time", "120000");
    configProps.set("storage.write-time", "120000");
    configProps.set("storage.connection-timeout", "120000");

    // added to fix ID block allocation exceptions
    configProps.set("renew-timeout", "240000");
    configProps.set("write-time", "1000");
    configProps.set("read-time", "100");
    configProps.set("renew-percentage", "0.4");

    configProps.set(METRICS_ENABLED, "true");
    configProps.set(METRICS_JMX_ENABLED, "true");
    configProps.set(INDEX_SEARCH_BACKEND, "elasticsearch");
    configProps.set(INDEX_SEARCH_HOSTNAME, janusConfig.get("elasticsearch.hostname"));
    configProps.set(INDEX_SEARCH_ELASTICSEARCH_HTTP_AUTH_TYPE, "basic");
    configProps.set(INDEX_SEARCH_ELASTICSEARCH_HTTP_AUTH_BASIC_USERNAME, janusConfig.get("elasticsearch.username"));
    configProps.set(INDEX_SEARCH_ELASTICSEARCH_HTTP_AUTH_BASIC_PASSWORD, janusConfig.get("elasticsearch.password"));
    configProps.set(INDEX_SEARCH_ELASTICSEARCH_SSL_ENABLED, janusConfig.get("elasticsearch.ssl.enabled"));
    configProps.set(IDS_BLOCK_SIZE, "1000000000");
    configProps.set(IDS_RENEW_PERCENTAGE, "0.3");
    logger.info("JanusGraph config initialization!!");
    return configProps.open();
}
|
|
Re: ID block allocation exception while creating edge
hadoopmarc@...
Hi Anjani,
Please show the properties file you use to open janusgraph. I assume you also saw the other recommendations in https://docs.janusgraph.org/advanced-topics/bulk-loading/#optimizing-id-allocation Best wishes, Marc
|
|
ID block allocation exception while creating edge
anjanisingh22@...
Hi All,
I tried increasing the value of "ids.block-size", but still no luck; I even set it to 1B for testing purposes and still get the above error. I am creating around 4-5M nodes per hour.
Could you please share some pointers to fix it? I appreciate your help and time.
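For anyone hitting the same exception, the id-allocation knobs discussed later in this thread all live in the graph's properties file; a sketch with the values mentioned here (these are the values under discussion, not recommendations):

```properties
# Id allocation tuning for bulk loading. Option names are from the
# JanusGraph configuration reference; values are illustrative.
ids.block-size = 1000000000
ids.renew-percentage = 0.3
# one reserved-block partition per JanusGraph instance:
ids.num-partitions = 50
# default is 300 ms; raising it gives the id authority more slack:
ids.authority.wait-time = 1000 ms
```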
|
|
Re: Query Optimisation
hadoopmarc@...
Hi Vinayak,
Actually, query 4 was easier to rework. It could read somewhat like:

g.V().has('property1', 'vertex1').as('v1').outE().has('property1', 'edge1').limit(100).as('e').inV().has('property1', 'vertex1').as('v2').
  select('v1','e','v2').by(valueMap().by(unfold())).aggregate('x').fold().
V().has('property1', 'vertex1').as('v1').outE().has('property1', 'edge2').limit(100).as('e').inV().has('property1', 'vertex2').as('v2').
  select('v1','e','v2').by(valueMap().by(unfold())).aggregate('x').fold().
V().has('property1', 'vertex3').as('v1').outE().has('property1', 'edge3').limit(100).as('e').inV().has('property1', 'vertex2').as('v2').
  select('v1','e','v2').by(valueMap().by(unfold())).aggregate('x').fold().
V().has('property1', 'vertex3').as('v1').outE().has('property1', 'Component_Of').limit(100).as('e').inV().has('property1', 'vertex1').as('v2').
  select('v1','e','v2').by(valueMap().by(unfold())).aggregate('x').fold().
cap('x')

Best wishes, Marc
|
|
Re: Query Optimisation
Vinayak Bali
Hi Marc, Thank you for your reply. I will try to report this issue on the JanusGraph repository. Regarding the workaround you suggested, if possible please share the updated query with the workaround applied to query 1; that will help me replicate it. Thanks & Regards, Vinayak
On Mon, 10 May 2021, 6:03 pm , <hadoopmarc@...> wrote: Hi Vinayak,
|
|
Re: Query Optimisation
hadoopmarc@...
Hi Vinayak,
If you can demonstrate this behavior with a reproducible, generated graph, you can report it as an issue on GitHub. For now, you can only look for workarounds:
- combine the four clauses outside of Gremlin
- try g.V()......................fold().V()......................fold().V().......................fold().V()......................... instead of the union, although I am not sure JanusGraph will use the index for the repeated V() steps. The fold() steps ensure that the V() steps are run exactly once.
Best wishes, Marc
|
|
Re: Query Optimisation
Vinayak Bali
Hi Marc, This query takes 18 sec to run after changing as to aggregate and select to project. But still, 99% of the time is taken to compute the union. There is no memory issue; it is already set to 8g.

g.inject(1).union(
  V().has('property1', 'vertex1').aggregate('v1').
    union(outE().has('property1', 'edge1').aggregate('e').inV().has('property1', 'vertex1'),
          outE().has('property1', 'edge2').aggregate('e').inV().has('property1', 'vertex2')).aggregate('v2'),
  V().has('property1', 'vertex3').aggregate('v1').
    union(outE().has('property1', 'edge3').aggregate('e').inV().has('property1', 'vertex2'),
          outE().has('property1', 'Component_Of').aggregate('e').inV().has('property1', 'vertex1')).aggregate('v2')).
  limit(100).project('v1','e','v2').by(valueMap().by(unfold()))

Also, splitting the inner union step into separate clauses has the same effect. Thanks & Regards, Vinayak
On Mon, May 10, 2021 at 11:45 AM <hadoopmarc@...> wrote: Hi Vinayak,
|
|
Re: Query Optimisation
hadoopmarc@...
Hi Vinayak,
Your last remark explains it well: it seems that in JanusGraph a union of multiple clauses can take much longer than the sum of the individual clauses. There are still two things that we have not ruled out:
|
|
Re: Query Optimisation
Vinayak Bali
Hi Marc, That works as expected. Union also works as expected as in Query 1, but when I add a limit to all the edges, the performance degrades. Thanks
On Sat, 8 May 2021, 8:16 pm , <hadoopmarc@...> wrote: Hi Vinayak,
|
|