
Re: Storing and reading connected component RDD through OutputFormatRDD & InputFormatRDD

hadoopmarc@...
 

Hi Anjani,

The following section of the TinkerPop ref docs gives an example of how to reuse the output RDD of one job in a follow-up gremlin OLAP job.
https://tinkerpop.apache.org/docs/3.4.10/reference/#interacting-with-spark
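
To make that concrete, here is a minimal sketch in java (the class name and the rdd name are made up for the example) of the mechanism described there: the first job persists its graphRDD in the SparkContext via PersistedOutputRDD, and the follow-up job reads it back via PersistedInputRDD instead of recomputing it:

import org.apache.commons.configuration.BaseConfiguration;
import org.apache.commons.configuration.Configuration;

public class PersistedRddSketch {

    // writer side: keep the SparkContext alive and persist the job's graphRDD
    // under the given name instead of writing it out to storage
    public static Configuration outputConfig(String rddName) {
        Configuration conf = new BaseConfiguration();
        conf.setProperty("gremlin.hadoop.graphWriter",
                "org.apache.tinkerpop.gremlin.spark.structure.io.PersistedOutputRDD");
        conf.setProperty("gremlin.hadoop.outputLocation", rddName);
        conf.setProperty("gremlin.spark.persistContext", true);
        return conf;
    }

    // reader side: the follow-up OLAP job picks up the persisted RDD again
    public static Configuration inputConfig(String rddName) {
        Configuration conf = new BaseConfiguration();
        conf.setProperty("gremlin.hadoop.graphReader",
                "org.apache.tinkerpop.gremlin.spark.structure.io.PersistedInputRDD");
        conf.setProperty("gremlin.hadoop.inputLocation", rddName);
        conf.setProperty("gremlin.spark.persistContext", true);
        return conf;
    }
}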

Best wishes,   Marc


Re: MapReduce reindexing with authentication

hadoopmarc@...
 

Hi Boxuan,

Yes, I did not finish my argument. What I tried to suggest: if the hadoop CLI command checks the GENERIC_OPTIONS env variable, then maybe the mapreduce java client called by JanusGraph checks it as well.

The (old) blog below suggests, however, that this behavior is not present by default but requires the janusgraph code to run hadoop's ToolRunner. So, just see if this is any better than what you had in mind to implement.
https://hadoopi.wordpress.com/2013/06/05/hadoop-implementing-the-tool-interface-for-mapreduce-driver/
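
For illustration, a minimal sketch of the Tool/ToolRunner pattern from that blog (the driver class name and the job body are hypothetical):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class ReindexDriver extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {
        // getConf() already reflects the generic options (-D, -files, -libjars)
        // that ToolRunner's GenericOptionsParser consumed from the command line
        Configuration conf = getConf();
        // ... configure and submit the mapreduce reindex job with conf here ...
        return 0;
    }

    public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new Configuration(), new ReindexDriver(), args));
    }
}

Such a driver would then be called like any hadoop job, e.g. hadoop jar <jar> ReindexDriver -files /path/to/jaas.conf <further args>.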

Best wishes,    Marc


[Meetup] JanusGraph Meetup May 18 covering JG OLAP approaches

Ted Wilmes
 

Hello,
We will be hosting a community meetup next week on Tuesday, May 18th at 9:30 central/10:30 eastern. We have a great set of speakers who will be discussing all things JanusGraph OLAP:

* Hadoop Marc who has helped many of us on the mailing list and in JG issues
* Saurabh Verma, principal engineer at Zeotap
* Bruno Berriso, engineer at Expero

If you're interested in signing up, here's the link: https://www.experoinc.com/get/janusgraph-user-group.

Thanks,
Ted


Re: MapReduce reindexing with authentication

Boxuan Li
 

Hi Marc, you are right, we are indeed using this -files option :)

On May 14, 2021, at 8:06 PM, hadoopmarc@... wrote:

Hi Boxuan,

Using existing mechanisms for configuring mapreduce would be nicer, indeed.

Upon reading this hadoop command, I see a GENERIC_OPTIONS env variable read by the mapreduce client, which can have a -files option. Maybe it is possible to include a jaas file that points to the (already installed?) keytab file on the workers?

Best wishes,     Marc


Re: MapReduce reindexing with authentication

hadoopmarc@...
 

Hi Boxuan,

Using existing mechanisms for configuring mapreduce would be nicer, indeed.

Upon reading this hadoop command, I see a GENERIC_OPTIONS env variable read by the mapreduce client, which can have a -files option. Maybe it is possible to include a jaas file that points to the (already installed?) keytab file on the workers?

Best wishes,     Marc


MapReduce reindexing with authentication

Boxuan Li
 

We have been using a yarn cluster to run MapReduce reindexing (Cassandra + Elasticsearch) for a long time. Recently, we introduced Kerberos-based authentication to the Elasticsearch cluster, meaning that worker nodes need to authenticate via a keytab file.

We managed to achieve this by using a hadoop command to include the keytab file when submitting the MapReduce job. Hadoop automatically copies this file and distributes it to the working directory of all worker nodes. This works well for us, except that we have to make changes to the MapReduceIndexManagement class so that it accepts an org.apache.hadoop.conf.Configuration object (which is created by org.apache.hadoop.util.ToolRunner) rather than instantiating one by itself. We are happy to submit a PR for this, but I would like to hear if there is any better way of handling this.

Cheers,
Boxuan


Multiple entries with same key on mixed index

toom@...
 

Hello,

I encounter the error described in issue #1916 (https://github.com/JanusGraph/janusgraph/issues/1916) on a mixed index (lucene). When I list the property keys of the index, they are all duplicated.

I haven't identified the root cause and I don't know how to reproduce it.

I would like to find a solution to repair the inconsistency, without losing my data.

The exposed API of JanusGraphManagement doesn't seem to be helpful: a mixed index can't be removed, and an IllegalArgumentException is raised when I try to retrieve the index status. Removing the index in Lucene or unconfiguring the index backend doesn't help either.

So I've tried to find a solution using the internal API. Is it safe to delete the Lucene data files and remove the schema vertex related to the mixed index?

((ManagementSystem)mgmt).getWrappedTx()
  .getSchemaVertex(JanusGraphSchemaCategory.GRAPHINDEX.getSchemaName("indexName"))
  .remove()

Is there a reason not to permit removing a mixed index?

Best wishes

Toom.


Re: Support for DB cache for Multi Node Janus Server Setup

pasansumanathilake@...
 

On Thu, May 6, 2021 at 11:36 PM, <hadoopmarc@...> wrote:
Marc
Hi Marc,

Thanks for the reply. Yeah, it's true that a multi-node setup uses the same storage backend. However, what I am referring to here is the JanusGraph database cache - https://docs.janusgraph.org/basics/cache/#database-level-caching - which uses JanusGraph heap memory.
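
For context, these are the settings I mean (a minimal sketch in java; the class name and the values are only illustrative):

import org.janusgraph.core.JanusGraph;
import org.janusgraph.core.JanusGraphFactory;

public class DbCacheSettings {

    // opens a graph with the database-level cache enabled; this cache lives
    // on the heap of each JanusGraph node separately, so in a multi-node
    // setup the nodes do not see each other's cached (possibly stale) entries
    public static JanusGraph open(String hostname) {
        return JanusGraphFactory.build()
                .set("storage.backend", "cql")
                .set("storage.hostname", hostname)
                .set("cache.db-cache", true)
                .set("cache.db-cache-size", 0.25)      // fraction of the heap
                .set("cache.db-cache-time", 180000)    // ms before entries expire
                .set("cache.db-cache-clean-wait", 50)  // ms to wait after a write
                .open();
    }
}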


Storing and reading connected component RDD through OutputFormatRDD & InputFormatRDD

anjanisingh22@...
 

Hi All,

I am using the connected component vertex program to find all the connected nodes in the graph and then use that RDD for further processing. I want to store the RDD at some output location so that I can re-use it and don't have to re-run the connected component vertex program, which is time-consuming.

I see the TinkerPop library has OutputFormatRDD to save data. I tried:

outputFormatRDD.writeGraphRDD(graphComputerConfiguration, uniqueRDD);  // throws a ClassCastException, because the values of the connected component vertex program's output RDD are lists, which cannot be cast to VertexWritable

outputFormatRDD.writeMemoryRDD(graphComputerConfiguration, "memoryKey", uniqueRDD);  // saves the RDD by creating a folder named after the memory key at the output location

However, I am not able to read the RDD back through InputFormatRDD.readMemoryRDD(), as it looks for data files in the layout that the SequenceFileInputFormat class expects.

Am I missing anything? Please let me know if you have tried these methods. I want to check whether we can use the out-of-the-box methods before implementing our own.

Thanks,
Anjani


Re: ID block allocation exception while creating edge

anjanisingh22@...
 

On Tue, May 11, 2021 at 04:56 PM, <hadoopmarc@...> wrote:
ids.num-partitions
Thanks for the help Marc, I will try setting ids.num-partitions to the number of executors.


Re: ID block allocation exception while creating edge

hadoopmarc@...
 

Hi Anjani,

It has been a while since I did this myself. I interpret ids.num-partitions as a stock of reserved id blocks that can be delegated to a janusgraph instance. It does not need a large value, so as not to waste id space.

Actually, parallel tasks is not the number we want. We want ids.num-partitions to be equal to the number of janusgraph instances, because initially all janusgraph instances ask for an id block at the same time. Note that the cores in a spark executor can share the same janusgraph instance if you use a singleton object for that, as in the sketch below.

So, if you have 50 executors with 5 cores each (and using a singleton janusgraph instance), I would try ids.num-partitions = 50.
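
Such a singleton could look like this (janusgraph does not ship one; the class name and the properties-file argument are made up):

import org.janusgraph.core.JanusGraph;
import org.janusgraph.core.JanusGraphFactory;

public final class JanusGraphSingleton {

    private static volatile JanusGraph instance;

    private JanusGraphSingleton() {}

    // all cores (tasks) of a spark executor share this single instance, so
    // the executor claims one id block instead of one per core
    public static JanusGraph get(String propertiesFile) {
        if (instance == null) {
            synchronized (JanusGraphSingleton.class) {
                if (instance == null) {
                    instance = JanusGraphFactory.open(propertiesFile);
                }
            }
        }
        return instance;
    }
}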

Best wishes,    Marc


Re: ID block allocation exception while creating edge

anjanisingh22@...
 

We have 250 parallel spark tasks running for creating nodes/edges.
I didn't get how the number of parallel tasks relates to setting ids.num-partitions - could you please help me with that?


Re: ID block allocation exception while creating edge

hadoopmarc@...
 

What is the number of parallel tasks?  (for setting ids.num-partitions)

You have the ids.authority.wait-time still on its default value of 300 ms, so that seems worthwhile experimenting with.

Best wishes,    Marc


Re: ID block allocation exception while creating edge

anjanisingh22@...
 

On Tue, May 11, 2021 at 11:54 AM, <hadoopmarc@...> wrote:
https://docs.janusgraph.org/advanced-topics/bulk-loading/#optimizing-id-allocation

Thanks for the response, Marc. Below is the method I am using to create the janusgraph connection:

public JanusGraph createJanusConnection(HashMap<String, Object> janusConfig) {
    JanusGraphFactory.Builder configProps = JanusGraphFactory.build();
    configProps.set(GREMLIN_GRAPH, "org.janusgraph.core.JanusGraphFactory");
    configProps.set(STORAGE_BACKEND, "cql");
    configProps.set(STORAGE_HOSTNAME, janusConfig.get("storage.hostname"));
    configProps.set(STORAGE_CQL_KEYSPACE, janusConfig.get("storage.keyspace"));
    configProps.set(CACHE_DB_CACHE, "false");
    configProps.set(CACHE_DB_CACHE_SIZE, "0.5");
    configProps.set(CACHE_DB_CACHE_TIME, "180000");
    configProps.set(CACHE_DB_CACHE_CLEAN_WAIT, "20");
    configProps.set(STORAGE_CQL_LOCAL_DATACENTER, janusConfig.get("local-datacenter"));
    configProps.set(STORAGE_CQL_WRITE_CONSISTENCY_LEVEL, "LOCAL_ONE");
    configProps.set(STORAGE_CQL_READ_CONSISTENCY_LEVEL, "LOCAL_ONE");
    configProps.set(STORAGE_CQL_SSL_ENABLED, janusConfig.get("cql.ssl.enabled"));
    configProps.set(STORAGE_CQL_SSL_TRUSTSTORE_LOCATION, janusConfig.get("truststore.location"));
    configProps.set(STORAGE_CQL_SSL_TRUSTSTORE_PASSWORD, janusConfig.get("truststore.password"));
    configProps.set(STORAGE_USERNAME, janusConfig.get("cassandra.username"));
    configProps.set(STORAGE_PASSWORD, janusConfig.get("cassandra.password"));
    configProps.set("storage.read-time", "120000");
    configProps.set("storage.write-time", "120000");
    configProps.set("storage.connection-timeout", "120000");

    // added to fix ID block allocation exceptions
    configProps.set("renew-timeout", "240000");
    configProps.set("write-time", "1000");
    configProps.set("read-time", "100");
    configProps.set("renew-percentage", "0.4");

    configProps.set(METRICS_ENABLED, "true");
    configProps.set(METRICS_JMX_ENABLED, "true");
    configProps.set(INDEX_SEARCH_BACKEND, "elasticsearch");
    configProps.set(INDEX_SEARCH_HOSTNAME, janusConfig.get("elasticsearch.hostname"));
    configProps.set(INDEX_SEARCH_ELASTICSEARCH_HTTP_AUTH_TYPE, "basic");
    configProps.set(INDEX_SEARCH_ELASTICSEARCH_HTTP_AUTH_BASIC_USERNAME, janusConfig.get("elasticsearch.username"));
    configProps.set(INDEX_SEARCH_ELASTICSEARCH_HTTP_AUTH_BASIC_PASSWORD, janusConfig.get("elasticsearch.password"));
    configProps.set(INDEX_SEARCH_ELASTICSEARCH_SSL_ENABLED, janusConfig.get("elasticsearch.ssl.enabled"));
    configProps.set(IDS_BLOCK_SIZE, "1000000000");
    configProps.set(IDS_RENEW_PERCENTAGE, "0.3");
    logger.info("JanusGraph config initialization!!");
    return configProps.open();
}


Re: ID block allocation exception while creating edge

hadoopmarc@...
 

Hi Anjani,

Please show the properties file you use to open janusgraph.
I assume you also saw the other recommendations in https://docs.janusgraph.org/advanced-topics/bulk-loading/#optimizing-id-allocation
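
For example (a sketch in java; the values are only a starting point for experiments):

import org.janusgraph.core.JanusGraphFactory;

public class IdAllocationTuning {

    // applies the id-allocation settings discussed in the bulk-loading docs
    public static JanusGraphFactory.Builder tune(JanusGraphFactory.Builder builder) {
        builder.set("ids.block-size", "1000000000");    // fewer, larger block claims
        builder.set("ids.authority.wait-time", "1000"); // ms, default is 300
        builder.set("ids.renew-timeout", "240000");     // ms before giving up on a block claim
        return builder;
    }
}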

Best wishes,   Marc


ID block allocation exception while creating edge

anjanisingh22@...
 

Hi All,

I am creating vertices and edges in bulk and getting the error below while creating an edge. This is the exception log:

Cluster edge creation failed between guest node 20369408030929128 and identifier node 16891904008515712. Exception : org.janusgraph.core.JanusGraphException: ID block allocation on partition(29)-namespace(3) failed with an exception in 12.23 ms

I tried increasing the value of "ids.block-size", even setting it to 1B for testing purposes, but still no luck: I keep getting the above error. I am creating around 4-5M nodes per hour.

 

Could you please share some pointers to fix it? Appreciate your help and time.

Thanks,
Anjani


Re: Query Optimisation

hadoopmarc@...
 

Hi Vinayak,

Actually, query 4 was easier to rework. It could read somewhat like:
g.V().has('property1', 'vertex1').as('v1').outE().has('property1', 'edge1').limit(100).as('e').inV().has('property1', 'vertex1').as('v2').
    select('v1','e','v2').by(valueMap().by(unfold())).aggregate('x').fold().
  V().has('property1', 'vertex1').as('v1').outE().has('property1', 'edge2').limit(100).as('e').inV().has('property1', 'vertex2').as('v2').
    select('v1','e','v2').by(valueMap().by(unfold())).aggregate('x').fold().
  V().has('property1', 'vertex3').as('v1').outE().has('property1', 'edge3').limit(100).as('e').inV().has('property1', 'vertex2').as('v2').
    select('v1','e','v2').by(valueMap().by(unfold())).aggregate('x').fold().
  V().has('property1', 'vertex3').as('v1').outE().has('property1', 'Component_Of').limit(100).as('e').inV().has('property1', 'vertex1').as('v2').
    select('v1','e','v2').by(valueMap().by(unfold())).aggregate('x').fold().
  cap('x')

Best wishes,    Marc


Re: Query Optimisation

Vinayak Bali
 

Hi Marc,

Thank you for your reply. I will try to report this issue on the janusgraph repository. Regarding the workaround you suggested, if possible please share the updated query with the workaround for query 1. That will be helpful for me to replicate the same.

Thanks & Regards,

Vinayak

On Mon, 10 May 2021, 6:03 pm , <hadoopmarc@...> wrote:
Hi Vinayak,

If you can demonstrate this behavior with a reproducible, generated graph, you can report it as an issue on github.

For now, you can only look for workarounds:
 - combine the four clauses outside of gremlin
 - try g.V()......................fold().V()......................fold().V().......................fold().V()......................... instead of the union, although I am not sure janusgraph will use the index for the repeated V() steps. The fold() steps ensure that the V() steps are run exactly once.

Best wishes,    Marc


Re: Query Optimisation

hadoopmarc@...
 

Hi Vinayak,

If you can demonstrate this behavior with a reproducible, generated graph, you can report it as an issue on github.

For now, you can only look for workarounds:
 - combine the four clauses outside of gremlin
 - try g.V()......................fold().V()......................fold().V().......................fold().V()......................... instead of the union, although I am not sure janusgraph will use the index for the repeated V() steps. The fold() steps ensure that the V() steps are run exactly once.

Best wishes,    Marc


Re: Query Optimisation

Vinayak Bali
 

Hi Marc, 

This query takes 18 sec to run after changing as() to aggregate() and select() to project(). But still, 99% of the time is taken to compute the union. There is no memory issue; it is already set to 8g.

g.inject(1).union(V().has('property1', 'vertex1').aggregate('v1').union(outE().has('property1', 'edge1').aggregate('e').inV().has('property1', 'vertex1'),outE().has('property1', 'edge2').aggregate('e').inV().has('property1', 'vertex2')).aggregate('v2'),V().has('property1', 'vertex3').aggregate('v1').union(outE().has('property1', 'edge3').aggregate('e').inV().has('property1', 'vertex2'),outE().has('property1', 'Component_Of').aggregate('e').inV().has('property1', 'vertex1')).aggregate('v2')).limit(100).project('v1','e','v2').by(valueMap().by(unfold()))

Also, splitting the inner union step into separate ones has the same effect.

Thanks & Regards,
Vinayak

On Mon, May 10, 2021 at 11:45 AM <hadoopmarc@...> wrote:
Hi Vinayak,

Your last remark explains it well: it seems that in JanusGraph a union of multiple clauses can take much longer than the sum of the individual clauses. There are still two things that we have not ruled out:

  • the repetition of as('v1') is unusual. Can you try what happens if you use the aggregate('v1')..............cap('v1', 'e', 'v2') mechanism instead? Or, simpler, what happens if you use neither the as() nor the aggregate() steps, omitting the formatting of the output?
  • are you sure there are no memory constraints, even if this seems unlikely given the limit(100) steps applied. You can check by increasing memory for gremlin console:
    export JAVA_OPTIONS="-Xmx4g"
Best wishes,    Marc
