
Re: P.neq() predicate uses wrong ES mapping

hadoopmarc@...
 

Hi Sergey,

The mere mortals skimming over the questions in this forum often need very explicit examples to fully grasp a point. The transcript below, expanding on the earlier one above, shows the exact consequence of your statement: 'problem is that Janusgraph uses tokenised field for "neq" comparisons and non tokenised for "eq"'.

According to the ref docs, the eq(), neq(), textPrefix(), textRegex() and textFuzzy() predicates should apply to STRING search (that is, to the non-tokenized field).
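For context, here is a sketch of the kind of mixed-index definition these predicates apply to; the key name and the 'search' backend name are assumptions on my part, not from your setup. The TEXTSTRING mapping is what creates both a tokenized and a non-tokenized field in ES:

// a sketch, not the actual schema of this thread:
mgmt = graph.openManagement()
x = mgmt.makePropertyKey('x').dataType(String.class).make()
mgmt.buildIndex('byX', Vertex.class).
     addKey(x, Mapping.TEXTSTRING.asParameter()).
     buildMixedIndex('search')
mgmt.commit()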

gremlin> g.addV('Some').property('x','watch the dog')
==>v[4192]
gremlin> g.tx().commit()
==>null
gremlin> g.V().elementMap()
10:03:40 WARN  org.janusgraph.graphdb.transaction.StandardJanusGraphTx  - Query requires iterating over all vertices [()]. For better performance, use indexes
==>[id:4192,label:Some,x:watch the dog]
==>[id:4264,label:Some,x:x2,y:??]
==>[id:4224,label:Some,x:x1,y:y1]

gremlin> g.V().has('x', eq('watch')).elementMap()
gremlin>
gremlin> g.V().has('x', eq('watch the dog')).elementMap()
==>[id:4192,label:Some,x:watch the dog]

gremlin> g.V().has('x', neq('watch the dog')).elementMap()
==>[id:4264,label:Some,x:x2,y:??]
==>[id:4224,label:Some,x:x1,y:y1]

gremlin> g.V().has('x', neq('watch')).elementMap()
==>[id:4264,label:Some,x:x2,y:??]
==>[id:4224,label:Some,x:x1,y:y1]
// Here, ==>[id:4192,label:Some,x:watch the dog] is missing, supporting Sergey's issue!!!

Related to this, there exists no negation of the textContains() predicate for full TEXT search. Using the generic TinkerPop predicate TextP.notContaining() causes JanusGraph to not use the index.
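For illustration, the generic predicate would be used like this (a sketch; expect the "Query requires iterating over all vertices" warning, because the index is bypassed):

// falls back to a full scan in JanusGraph:
g.V().has('x', TextP.notContaining('watch')).elementMap()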

I will post an issue on GitHub referring to this thread.

Best wishes,   Marc


Re: Union Query Optimization

AMIYA KUMAR SAHOO
 

Hi Vinayak,

I am not sure how to improve this query further through Gremlin.

The query can be made faster through the data model. A vertex-centric index (VCI) will be helpful if you are applying any other filter along with hasLabel and your edge selectivity is low compared to the total degree of those vertices.

If this query is very frequent and there is a need to improve it further, you can make the inV title property part of the edge and enable a VCI on that edge property, as sketched below.
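For illustration, the schema change could look roughly like this (a sketch; the edge property name 'inTitle' is hypothetical, and the inV title value has to be copied onto the edge at write time):

// hypothetical edge property 'inTitle' holding the in-vertex title
mgmt = graph.openManagement()
inTitle = mgmt.makePropertyKey('inTitle').dataType(String.class).make()
e1 = mgmt.getEdgeLabel('E1')
mgmt.buildEdgeIndex(e1, 'e1ByInTitle', Direction.OUT, inTitle)
mgmt.commit()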

Other than that not sure if any configuration can be done to further improve. Someone else might comment on this front.


Regards,
Amiya




Mapreduce index repair job fails in Kerberos+SSL enabled cluster

shivainfotech12@...
 

Hi All,

I'm trying to run an index repair job through MapReduce in a Kerberos+SSL enabled cluster.
I have added all the required HBase and Hadoop configurations, but I am getting the exception below in the MapReduce logs.

2021-04-22 20:19:55,112 DEBUG [hconnection-0x3bbf9027-metaLookup-shared--pool2-t1] org.apache.hadoop.hbase.security.HBaseSaslRpcClient: Creating SASL GSSAPI client. Server's Kerberos principal name is sudarshan/inedccpe101.informatica.com@...
2021-04-22 20:19:55,113 DEBUG [hconnection-0x3bbf9027-metaLookup-shared--pool2-t1] org.apache.hadoop.security.UserGroupInformation: PrivilegedActionException as:sudarshan (auth:SIMPLE) cause:javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]
2021-04-22 20:19:55,113 DEBUG [hconnection-0x3bbf9027-metaLookup-shared--pool2-t1] org.apache.hadoop.security.UserGroupInformation: PrivilegedAction as:sudarshan (auth:SIMPLE) from:org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection.handleSaslConnectionFailure(RpcClientImpl.java:643)
2021-04-22 20:19:55,113 WARN [hconnection-0x3bbf9027-metaLookup-shared--pool2-t1] org.apache.hadoop.hbase.ipc.RpcClientImpl: Exception encountered while connecting to the server : javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]
2021-04-22 20:19:55,113 ERROR [hconnection-0x3bbf9027-metaLookup-shared--pool2-t1] org.apache.hadoop.hbase.ipc.RpcClientImpl: SASL authentication failed. The most likely cause is missing or invalid credentials. Consider 'kinit'.
javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]
        at com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:220)
        at org.apache.hadoop.hbase.security.HBaseSaslRpcClient.saslConnect(HBaseSaslRpcClient.java:179)
        at org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection.setupSaslConnection(RpcClientImpl.java:617)
        at org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection.access$700(RpcClientImpl.java:162)
        at org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection$2.run(RpcClientImpl.java:743)
        at org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection$2.run(RpcClientImpl.java:740)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
        at org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection.setupIOstreams(RpcClientImpl.java:740)
        at org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection.writeRequest(RpcClientImpl.java:909)
        at org.apache.hadoop.hbase.ipc.RpcClientImpl$Connection.tracedWriteRequest(RpcClientImpl.java:873)
        at org.apache.hadoop.hbase.ipc.RpcClientImpl.call(RpcClientImpl.java:1244)
        at org.apache.hadoop.hbase.ipc.AbstractRpcClient.callBlockingMethod(AbstractRpcClient.java:227)
        at org.apache.hadoop.hbase.ipc.AbstractRpcClient$BlockingRpcChannelImplementation.callBlockingMethod(AbstractRpcClient.java:336)
        at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$BlockingStub.scan(ClientProtos.java:35396)
        at org.apache.hadoop.hbase.client.ClientSmallReversedScanner$SmallReversedScannerCallable.call(ClientSmallReversedScanner.java:298)
        at org.apache.hadoop.hbase.client.ClientSmallReversedScanner$SmallReversedScannerCallable.call(ClientSmallReversedScanner.java:276)
        at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithoutRetries(RpcRetryingCaller.java:212)
        at org.apache.hadoop.hbase.client.ScannerCallableWithReplicas$RetryingRPC.call(ScannerCallableWithReplicas.java:364)
        at org.apache.hadoop.hbase.client.ScannerCallableWithReplicas$RetryingRPC.call(ScannerCallableWithReplicas.java:338)
        at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithRetries(RpcRetryingCaller.java:137)
        at org.apache.hadoop.hbase.client.ResultBoundedCompletionService$QueueingFuture.run(ResultBoundedCompletionService.java:65)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
Caused by: GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)
        at sun.security.jgss.krb5.Krb5InitCredential.getInstance(Krb5InitCredential.java:162)
        at sun.security.jgss.krb5.Krb5MechFactory.getCredentialElement(Krb5MechFactory.java:122)
        at sun.security.jgss.krb5.Krb5MechFactory.getMechanismContext(Krb5MechFactory.java:189)
        at sun.security.jgss.GSSManagerImpl.getMechanismContext(GSSManagerImpl.java:224)
        at sun.security.jgss.GSSContextImpl.initSecContext(GSSContextImpl.java:212)
        at sun.security.jgss.GSSContextImpl.initSecContext(GSSContextImpl.java:179)
        at com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:201)
        ... 25 more
 

Can anyone please help with this issue?

Thanks
Shiva


Re: Strange behaviors for Janusgraph 0.5.3 on AWS EMR

hadoopmarc@...
 

Hi Alessandro,

The executors tab of the Spark UI shows the product of spark.executor.instances times spark.executor.cores. I guess spark.executor.instances defaults to one, and EMR might limit the number of executor cores?

It also won't hurt to explicitly specify spark.submit.deployMode=client, assuming EMR allows it. I am not sure whether the Gremlin Console needs client mode to have the count results returned. And with a "zero" result in the Gremlin Console, did you mean 0 or just ==> ?

To have output written to "output", you have to configure distributed storage, so that "output" is a path on hadoop-hdfs (each executor writes its output to a partition on the distributed storage, so you would have 768 partitions in the output directory). Be aware that TinkerPop uses somewhat strange naming in the output directory.
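For example (a sketch, assuming hadoop-hdfs is the default filesystem for the executors; TinkerPop then writes the graph under a subdirectory of "output" with a name like ~g):

gremlin.hadoop.graphWriter=org.apache.tinkerpop.gremlin.hadoop.structure.io.gryo.GryoOutputFormat
gremlin.hadoop.outputLocation=output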

Best wishes,      Marc



Re: Strange behaviors for Janusgraph 0.5.3 on AWS EMR

asivieri@...
 

By the way, if you have any properties file or running example of OLAP that you would like to share, I'd be happy to see something working and compare it to what I am trying to do!

Best regards,
Alessandro


Re: Strange behaviors for Janusgraph 0.5.3 on AWS EMR

asivieri@...
 

Hi,

here are the properties that I am setting so far (plus the same ones that are set in the TinkerPop example, such as the classpath for the executors and the driver):
gremlin.graph=org.apache.tinkerpop.gremlin.hadoop.structure.HadoopGraph
gremlin.hadoop.graphReader=org.janusgraph.hadoop.formats.cql.CqlInputFormat
gremlin.hadoop.graphWriter=org.apache.hadoop.mapreduce.lib.output.NullOutputFormat
gremlin.hadoop.jarsInDistributedCache=true
 
gremlin.hadoop.inputLocation=none
gremlin.hadoop.outputLocation=output
 
schema.default=none
 
janusgraphmr.ioformat.conf.storage.backend=cql
janusgraphmr.ioformat.conf.storage.batch-loading=true
janusgraphmr.ioformat.conf.storage.buffer-size=10000
janusgraphmr.ioformat.conf.storage.cql.keyspace=...
 
janusgraphmr.ioformat.conf.storage.hostname=...
janusgraphmr.ioformat.conf.storage.port=9042
janusgraphmr.ioformat.conf.storage.username=...
janusgraphmr.ioformat.conf.storage.password=...
cassandra.output.native.port=9042
 
cassandra.input.partitioner.class=org.apache.cassandra.dht.Murmur3Partitioner
cassandra.input.widerows=true
 
spark.master=yarn
spark.executor.memory=20g
spark.executor.cores=4
spark.driver.memory=20g
spark.driver.cores=8
spark.serializer=org.apache.spark.serializer.KryoSerializer
spark.kryo.registrator=org.janusgraph.hadoop.serialize.JanusGraphKryoRegistrator
gremlin.spark.graphStorageLevel=MEMORY_AND_DISK
gremlin.spark.persistContext=true
gremlin.spark.persistStorageLevel=MEMORY_AND_DISK
spark.default.parallelism=1000
On the Spark UI I can see that the number of tasks for the first job matches the number of tokens of our Scylla cluster (256 tokens per node * 3 nodes), but only two executors are spawned, even though I tried on a cluster with 96 cores and 768 GB of RAM which, given the driver and executor configuration in the properties above, should allow far more than 2.
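For reference, the two properties I have not set so far, and which I would expect to control this, are the following (an assumption on my side, they are not in my file):

spark.executor.instances=24
spark.dynamicAllocation.enabled=false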

Moreover, I wrote a dedicated Java application that replicates the first step of the SparkGraphComputer, the step where the entire vertex list is read into an RDD: basically, I skipped the Gremlin Console entirely, started a "normal" Spark session as we do in our applications, and then read the entire vertex list from Scylla. In this case the job has the same number of tasks as before, but the number of executors is the one I expected, so it seems that something in the Spark context creation performed by Gremlin is limiting this number; maybe I am missing a configuration.
The problem of empty results, however, remained: in this test the output RDD is completely empty, even though the DEBUG logs show that it is connecting to the correct keyspace, where data is present. There are no exceptions, so I am not sure why we are not reading anything. Am I missing some properties, in your opinion/experience?

Best regards,
Alessandro


Re: Strange behaviors for Janusgraph 0.5.3 on AWS EMR

hadoopmarc@...
 

Hi Alessandro,

Yes, please include the properties file.

To be clear, you see in the spark web UI:
spark.master=yarn
spark.executor.instances=12

and only two executors show up for 700+ tasks, while other jobs using the same EMR account spawn tens of executors? Is there any YARN queue you have to specify to get more resources from YARN? It sounds like some limit in the YARN ResourceManager.

Best wishes,   Marc


Re: Union Query Optimization

Vinayak Bali
 

Hi Amiya, 

Thank you for the query. It also increased the performance, but it is still 35 seconds. Is there any other way to optimize it further? Only 10 records are returned by the query.
The counts are as follows:
V1: 187K, V2: 40, V3: 50, V4: 447K

Thanks & Regards,
Vinayak



Re: Union Query Optimization

AMIYA KUMAR SAHOO
 

Hi Vinayak,

You can try the query below; it can use the index and combine as many traversals as you want.

g.inject(1).
   union ( 
      V().has('title', 'V1'). outE().hasLabel('E1').inV().has('title', 'V2'),
    
       V().has('title', 'V3'). outE().hasLabel('E3').inV().has('title', 'V4'))....

Regards,
Amiya





Re: Union Query Optimization

Vinayak Bali
 

Hi, cmilowka,

The property title has a composite index created on it. I further modified the query as follows:

g.V().has('title', within('V1','V2')).
    union(
        has('title', 'V1').as('v1').outE().hasLabel('E1').as('e').inV().has('title', 'V2').as('v2'),
        has('title', 'V2').as('v1').
            union(
                outE().hasLabel('E2').as('e').inV().has('title', 'V2'),
                outE().hasLabel('E3').as('e').inV().has('title', 'V3')).as('v2')).
    select('v1','e','v2').by(valueMap().by(unfold()))

The only change is adding has('title', within('V1','V2')) at the start of the query. The warning is gone now and performance has also improved.
Earlier the time taken was around 3.5 minutes; now it's 55 seconds to return only 44 records.
The problem is that my source changes, and I need to consider that. For example:
v1 - e1 - v2
v3 - e2 - v4
I want both in a single query. The query for this will now be as follows:

g.V().has('title', within('V1','V3')).
    union(
        has('title','V1').as('v1').outE().has('title','E1').as('e').inV().has('title','V2').as('v2'),
        has('title','V3').as('v1').outE().has('title','E2').as('e').inV().has('title','V4').as('v2')).
    select('v1','e','v2').by(valueMap().by(unfold()))

I request all of you to provide feedback to improve it further.

Thanks & Regards,
Vinayak



Re: Union Query Optimization

cmilowka
 

I guess building a composite index for the 'title' property will do the job of accessing title(V1) and title(V2) fast, without the full scan of the DB that it currently does.
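Something like this (a sketch; if the graph already contains data, you also need to run a reindex job before the index becomes ENABLED):

mgmt = graph.openManagement()
title = mgmt.getPropertyKey('title')
mgmt.buildIndex('byTitleComposite', Vertex.class).addKey(title).buildCompositeIndex()
mgmt.commit()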

cheers, CM


Re: Strange behaviors for Janusgraph 0.5.3 on AWS EMR

kndoan94@...
 

Hi Alessandro,

I'm also working through a similar use case with AWS EMR, but I'm running into some Hadoop class errors. What version of EMR are you using?

Additionally, if you could pass along the configuration details in your .properties file, that would be extremely helpful :) 

Thank you!
Ben


Re: Strange behaviors for Janusgraph 0.5.3 on AWS EMR

asivieri@...
 

Hi Marc,

The TinkerPop example works correctly. We are actually using Scylla, and with 256 tokens per node I am getting 768 tasks in the Spark job (which I can see correctly listed in the UI). The problems I have are that a) only 2 executors are spawned, which does not make much sense since I have configured executor cores and memory in the properties file and the cluster has resources for more than 2, and b) no data is transmitted back from the cluster, even though performing similar (limited) queries without Spark produces results.

Best regards,
Alessandro


Union Query Optimization

Vinayak Bali
 

Hi All, 

I need to select multiple nodes and edges and display the content in v1 - e - v2 format. The query generated is as follows:

g.V().
    union(
        has('title', 'V1').as('v1').outE().hasLabel('E1').as('e').inV().has('title', 'V2').as('v2'),
        has('title', 'V2').as('v1').
            union(
                outE().hasLabel('E2').as('e').inV().has('title', 'V2'),
                outE().hasLabel('E3').as('e').inV().has('title', 'V3')).as('v2')).
    select('v1','e','v2').by(valueMap().by(unfold()))

It throws the warning:
05:20:21 WARN  org.janusgraph.graphdb.transaction.StandardJanusGraphTx  - Query requires iterating over all vertices [()]. For better performance, use indexes

How can we optimize the query, maybe without the union step?

Thanks & Regards,
Vinayak


Re: CLA & DCO Signing

rngcntr
 

Hi Fredrick!

First of all, thanks for your contribution! How did you try to squash the commits? In the PR I still see a total of eight commits, so it appears your squash didn't work for some reason.
You can either use `git rebase -i HEAD~8` and then replace `pick` with `squash` or `s` for seven of the commits you see there. Another option would be to use `git reset --soft HEAD~8` followed by `git commit -s`. In either case, please verify that the code still contains all the changes you made before using `git push -f` to publish them.
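For example, the reset variant could look like this (a sketch, assuming all eight commits are yours and the branch is checked out locally):

git reset --soft HEAD~8      # undo the commits, keep all changes staged
git commit -s                # one new commit with a Signed-off-by line
git push -f                  # rewrite the branch on your fork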

Best regards,
Florian


Re: Janusgraph - OLAP using Dataproc

sauverma
 

Hi Claire

In our earlier attempts to work with JanusGraph OLAP, we hit a bottleneck: parallelism was bounded by the Scylla virtual token ring count (it appears that is the unit of parallelism for the bulk export).

Can you please share your experience running the bulk export / OLAP on JanusGraph directly?

thanks


Re: CLA & DCO Signing

fredrick.eisele@...
 

Also, https://github.com/JanusGraph/janusgraph/blob/master/CONTRIBUTING.md is probably out of date.
It recommends getting help from...
janusgraph-cla@...
...but that group appears to no longer exist.
I presume that this forum is the correct place for such questions?


CLA & DCO Signing

fredrick.eisele@...
 

I have a pull request which contains commits that have not been properly signed.
I tried to squash the commits and made things worse.
Here is a link to the pull request:
https://github.com/JanusGraph/janusgraph-docker/pull/87


Re: Strange behaviors for Janusgraph 0.5.3 on AWS EMR

hadoopmarc@...
 

Hi Alessandro,

I assume Amazon EMR uses hadoop-yarn, so you need to specify spark.master=yarn, see:
https://tinkerpop.apache.org/docs/current/recipes/#olap-spark-yarn

Once you can run the TinkerPop example, you can try and switch to JanusGraph. You have to realize that JanusGraph does not (yet) do a good job of partitioning the input data from a storage backend: basically, when using cql, you get the partitions as used by Cassandra. So with 1 or 2 Spark partitions, there is no need to fire up 90 executors.
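From memory, the recipe boils down to properties along these lines (a sketch; the archive path is a placeholder for whatever you upload to the cluster):

spark.master=yarn
spark.submit.deployMode=client
spark.yarn.archive=/tmp/spark-gremlin.zip
spark.serializer=org.apache.spark.serializer.KryoSerializer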

Best wishes,    Marc


Strange behaviors for Janusgraph 0.5.3 on AWS EMR

asivieri@...
 

Hello everyone,

Is there anyone with experience of running OLAP on an AWS EMR cluster?

I am currently trying to do so, but strange things are happening.
The first one is that the application is not running on the entire cluster, even though I specified both driver and executor parameters in the properties file. Regardless of what I write there, only 2 executors are spawned, while the cluster on which I tried could support at least 90. I can see the jobs on the Hadoop and Spark UIs of the cluster, and other properties (such as the default parallelism) are correctly read and used in the jobs.
Moreover, I seem to have problems getting the correct output: I started from the example properties file that uses CQL, but I do not receive any meaningful answer to queries that I run in the Gremlin Console (the data is there, because I am able to query it without Spark). The classic vertex count returns zero, and trying to extract a certain set of properties does not return anything. I saw that the conf shows a NullOutputFormat as GraphWriter, so I tried to set the Gryo one there, but nothing changed, and I am not sure it is supported by the rest of the configuration.

Thank you for your help,
Alessandro
