
Re: No results returned with duplicate Has steps in a vertex-search traversal

Boxuan Li
 

Hi,

Can you provide more info on how the fields in your example are indexed? E.g., composite or mixed, and what are all the indexes involving any of these fields?
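For reference, the distinction matters because textRegex predicates can only be served by a mixed (Elasticsearch) index, while within can be served by a composite index in the storage backend or by a mixed index. The two kinds are declared along these lines (a hypothetical sketch, not your actual schema):

mgmt = graph.openManagement()
name = mgmt.getPropertyKey('FirstName')
group = mgmt.getPropertyKey('DomainGroup')
// Composite index: exact-match lookups (eq/within) served by the storage backend.
mgmt.buildIndex('byDomainGroup', Vertex.class).addKey(group).buildCompositeIndex()
// Mixed index: full-text/regex lookups served by the index backend ('search').
mgmt.buildIndex('personSearch', Vertex.class).addKey(name).addKey(group).buildMixedIndex('search')
mgmt.commit()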

Patrick Streifel <prstreifel@...> wrote on Tuesday, January 26, 2021 at 4:56 AM:

We are running into a JanusGraph bug where a traversal that should return a list of vertices is returning an empty list.

 

Here is some background info:

Using a JanusGraph Server with ConfiguredGraphFactory, running v0.5.2.

Storage: Cassandra v. 3.11.9

Index: Elasticsearch v. 6.7.2

Connecting to the server via the Java Gremlin driver.

 

Our use case is this:

 

We are searching for vertices in the graph based on various property filters (e.g., give me people named "Patrick" with a last name matching the regex "Str.*el"). When we just do this, there are no issues, of course.

 

The tricky part is that we are adding extra filters on a property called DomainGroup, which essentially allows us to filter results per search user based on what they are interested in seeing. The user running the query provides a list of Domains they are interested in, and there has to be some overlap between the user's Domains and the list of DomainGroups on a vertex for that vertex to be returned. In short, we put in extra "has" steps that filter out vertices in certain groups from the results.

 

Another important note: these "has" steps filtering on DomainGroup occur after every other step in the query. That may not be a great idea for this use case, but we have others where we need it. We have logic that automatically groups together a set of has steps based on user requests. Sometimes this automated process duplicates certain property filters when constructing the traversal, and that is hard to avoid in certain cases. We could work to deduplicate them, but this still seems like a genuine bug in JanusGraph, albeit for an unusual use case.

 

Here is an example of one of our DomainGroup "has" steps:

has(DomainGroup, within([GROUP_A, GROUP_B])),

 

We combed through our DEBUG-level logs on the JanusGraph Server.

We noticed that JG was querying the Elasticsearch index for results, as expected. Elasticsearch was actually returning the expected vertex (or vertices), but the JG Server was not returning anything after that.

 

Here are some additional conditions we noticed:

  1. This appears to happen only when there are multiple duplicate "has" steps in the traversal.
    a. When we run a traversal with only one property search ( has(FirstName, textRegex(Patric.*)) ) and one DomainGroup filter ( has(DomainGroup, within([GROUP_A, GROUP_B])) ), we get the expected results.
    b. When we provide two property searches, and thus two (duplicate) DomainGroup filters, we get no results. This leads us to believe there is an issue with duplicate "has" steps, or specifically duplicate "has" steps with "within" filters.

 

Example of a traversal that we get empty results with:

args={gremlin=[[], [V(),
has(FirstName, textRegex(Patric.*)),
has(DomainGroup, within([GROUP_A, GROUP_B])),
has(PersonSurName, textRegex(Str.*el)),
has(DomainGroup, within([GROUP_A, GROUP_B])),
limit(5), valueMap(), with(~tinkerpop.valueMap.tokens)]], aliases={g=my_graph_traversal}}
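Written out as console input, the two cases look roughly like this (a sketch; the property keys and groups are the ones from our example, with textRegex from org.janusgraph.core.attribute.Text and within from TinkerPop's P):

// Works: one property search plus one DomainGroup filter.
g.V().has('FirstName', textRegex('Patric.*')).
      has('DomainGroup', within('GROUP_A', 'GROUP_B')).
      limit(5).valueMap().toList()

// Returns nothing: two property searches, each followed by the same (duplicate) DomainGroup filter.
g.V().has('FirstName', textRegex('Patric.*')).
      has('DomainGroup', within('GROUP_A', 'GROUP_B')).
      has('PersonSurName', textRegex('Str.*el')).
      has('DomainGroup', within('GROUP_A', 'GROUP_B')).
      limit(5).valueMap().toList()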

 

Logs show the Elasticsearch scroll request returning a document with the correct id, but they also show JG ultimately sending an empty response to our API. Something is lost in between.

 

Just wanted to bring this to your attention. We are figuring out workarounds on our side, but this seems like a JG bug.


Issues with controlling partitions when using Apache Spark

Mladen Marović
 

Hey there!

I've recently been working on some Apache Spark jobs for JanusGraph via hadoop-gremlin (as described at https://docs.janusgraph.org/advanced-topics/hadoop/) and encountered several issues. Generally, I kept having memory issues because the partitions were too big to be loaded into my Spark executors (which I scaled up to 16 GB per executor).

After analysing the code, I found two parameters that could be used to further subsplit the partitions: cassandra.input.split.size and cassandra.input.split.size_mb. However, when trying to use these parameters, and debugging when the memory issues persisted, I noticed several bugs in the underlying org.apache.cassandra.hadoop.cql3.CqlInputFormat used to load the data. I posted the question on the DataStax community forums (see https://community.datastax.com/questions/10153/how-to-control-partition-size-when-reading-data-wi.html). There it was ultimately suggested that I migrate to the spark-cassandra-connector, because the issues I encountered were probably bugs in the CqlInputFormat, which is legacy code (and probably not maintained anymore).
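For reference, these were passed through the hadoop-gremlin properties file along these lines (values are illustrative, not a recommendation):

# Illustrative values; both keys are read by the Cassandra Hadoop input code.
cassandra.input.split.size=131072
cassandra.input.split.size_mb=64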

In the meantime, I reimplemented the InputFormat classes in my app to fix the issues, and testing so far showed that this now works as intended. However, I was wondering the following:

1. Does anyone else have experience with using Apache Spark, JanusGraph, and graphs too big to fit into memory without subsplitting? Did you also encounter this issue? If so, how did you deal with it?
2. Is there an "official" solution to this issue?
3. Are there any plans to migrate to the spark-cassandra-connector for this use case?

Thanks,

Mladen


Re: JanusGraph and future versions of ES

Assaf Schwartz
 

Thank you very much everyone


Re: JanusGraph and future versions of ES

Peter Corless
 

It would only affect the use of JanusGraph+Elasticsearch if you were offering it in an as-a-Service (-aaS) offering, like a commercial DBaaS. In that case the SSPL would require you to contribute back your modifications to the code base, UI, etc.

We did a side-by-side analysis of SSPL vs. AGPL when it was first announced. You can read that blog here:


The most germane language is this:

SSPL requires that if you offer the software as a service, you must make public, as open source, practically everything related to your service:

“including, without limitation, management software, user interfaces, application program interfaces, automation software, monitoring software, backup software, storage software and hosting software, all such that a user could run an instance of the service using the Service Source Code you make available.”

But if you are just using it internally, or deep within another product (one that an end user doesn't have direct access to), you have no worries.

Disclaimer: this is my opinion; I am not a lawyer. YMMV. Caveat emptor. Void where prohibited by law.

On Fri, Jan 22, 2021, 6:49 AM BO XUAN LI <liboxuan@...> wrote:
If I am understanding correctly, this does not affect the usage of Elasticsearch in JanusGraph. Regarding AWS's decision, it seems no additional config/code changes are needed if someone wants to use the AWS fork (assuming Amazon would not introduce breaking changes causing client incompatibility).


Re: Janusgraph traversal time

hadoopmarc@...
 

Sounds normal to me. Note that setting query.batch=true might help a bit. Also note that the graph structure, in particular the number of repeat steps, is important. Finally, note that caching in the storage backend (in addition to caching inside JanusGraph) plays a role; see:
http://yaaics.blogspot.com/2018/04/understanding-reponse-times-for-single.html
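For reference, that option is a line in the graph's properties file:

# enable batched backend lookups for traversal queries
query.batch=true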

Best wishes,    Marc


Re: JanusGraph and future versions of ES

Boxuan Li
 

If I am understanding correctly, this does not affect the usage of Elasticsearch in JanusGraph. Regarding AWS's decision, it seems no additional config/code changes are needed if someone wants to use the AWS fork (assuming Amazon would not introduce breaking changes causing client incompatibility).


Janusgraph traversal time

faizez@...
 

Hi All,
 
I have a multi-level graph with 800 vertices in total. When I run g.V(<id of parent>).repeat(out()).emit() for the first time, it takes 300 ms. In profile() I can see that RepeatStep takes most of the time (almost 296 ms). I can also see that it executes one backend-query per vertex. But when I run the same query again, it completes in 4 ms. I know that the second run uses the cache.
My question is whether traversing 800 vertices without the cache in 300 ms is normal or not.
Note: I'm using JG 0.5.2 with out-of-the-box JanusGraph configuration.


Re: JanusGraph and future versions of ES

hadoopmarc@...
 

Please also read the stance of Elastic itself (because the AWS article did not link to it):

https://www.elastic.co/blog/license-change-clarification


JanusGraph and future versions of ES

Assaf Schwartz
 

Hi all,
Since ES seems to have changed its licensing, what does this mean for the future use of ES as a JanusGraph index backend?
https://aws.amazon.com/blogs/opensource/stepping-up-for-a-truly-open-source-elasticsearch/

Thanks!


Re: Janusgraph query execution performance

hadoopmarc@...
 

Analytical queries require a full table scan. Some people succeed in speeding up analytical queries on JanusGraph using OLAP; check the older questions on OLAP and SparkGraphComputer, and see
https://docs.janusgraph.org/advanced-topics/hadoop/

A special case that occurs very frequently is counting the number of vertices for each label (what you call a concept). Speeding this up is listed in the known issues:
https://github.com/JanusGraph/janusgraph/issues/926
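A sketch of such a count as an OLAP job (assuming a Hadoop/Spark setup as in the docs linked above; the properties file name is illustrative):

graph = GraphFactory.open('conf/hadoop-graph/read-cql.properties')
g = graph.traversal().withComputer(SparkGraphComputer)
g.V().hasLabel('concept').count().next()   // full scan, executed by Spark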

Best wishes,   Marc


Janusgraph query execution performance

lalwani.ritu2609@...
 

Hi,

I have used the https://github.com/IBM/expressive-reasoning-graph-store project to import a Turtle file with around 4 lakh (400,000) concepts, and this project uses JanusGraph 0.4.0.
Now, after importing, I am able to run queries.
The problem I am facing is that some queries, which access a small number of nodes, are quite fast, but some queries, like counting the number of concepts in the graph (which access a large number of nodes), are very slow. Please note that I have already used indexing.

So is this due to the version of JanusGraph, 0.4.0 (quite an old version)?
Or is this the expected performance for JanusGraph?

Any help will be highly appreciated.

Thanks!!


Re: OLAP Spark

hadoopmarc@...
 

Hi Vinayak,

JanusGraph has defined Hadoop InputFormats for its storage backends to do OLAP queries; see https://docs.janusgraph.org/advanced-topics/hadoop/

However, these InputFormats have several problems regarding performance (see the old questions on this list), so your approach could be worthwhile:

1. It is best to create these ids on ingestion of data into JanusGraph and add them as a vertex property. If you create an index on this property, it is possible to use these id properties for retrieval during OLAP queries.
2. Spark does this automatically if you call rdd.mapPartitions on the RDD with ids.
3. Here is the disadvantage of this approach: you simply run the gremlin query per partition with ids, but you have to merge the per-partition results afterwards outside gremlin. The merge logic differs per type of query; see the sketch below.
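As an illustration, the per-partition function of points 2 and 3 could look like this (a sketch; the properties file and the query are illustrative, and each executor opens its own JanusGraph instance the OLTP way):

import org.janusgraph.core.JanusGraphFactory

def queryPartition = { Iterator ids ->
    def graph = JanusGraphFactory.open('conf/janusgraph-cql.properties')
    def g = graph.traversal()
    def partial = []
    ids.each { id -> partial << g.V(id).out().count().next() }
    graph.close()
    partial.iterator()   // partial results, to be merged on the driver
}
// e.g. rdd.mapPartitions(queryPartition), then merge the partial results outside gremlin.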

Best wishes,     Marc


Re: Janusgraph spark on yarn error

hadoopmarc@...
 

The path of the BulkLoaderVertexProgram might be doable, but I cannot help you on that one. In the stack trace above, the YARN application master from spark-yarn apparently tries to communicate with HBase but finds that various libraries do not match. This failure arises because the JanusGraph distribution does not include spark-yarn and thus is not handcrafted to work with spark-yarn.

For the path without BulkLoaderVertexProgram you inevitably need a JVM language (Java, Scala, Groovy). In this case, a Spark executor is unaware of any other executors running and is simply passed a callable (function) to execute (through RDD.mapPartitions() or through a spark-sql UDF). This callable can be part of a class that establishes its own JanusGraph instances in the OLTP way, as sketched below. Now you only have to deal with the executor CLASSPATH, which does not need spark-yarn; the libs from the JanusGraph distribution suffice.
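As an illustration, such a callable could look like this (a sketch with made-up labels and file names; commit batching and error handling omitted):

import org.janusgraph.core.JanusGraphFactory

def loadPartition = { Iterator<String> names ->
    def graph = JanusGraphFactory.open('conf/janusgraph-hbase.properties')
    def g = graph.traversal()
    names.each { n -> g.addV('person').property('name', n).iterate() }
    g.tx().commit()
    graph.close()
    [].iterator()   // RDD.mapPartitions() expects an iterator back
}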

Some example code can be found at:
https://nitinpoddar.medium.com/bulk-loading-data-into-janusgraph-part-2-ca946db26582

Best wishes,    Marc


Re: reindex job is very slow on ElasticSearch and BigTable

hadoopmarc@...
 

I mean, what happens if you try to run MapReduceIndexManagement on BigTable? Apparently you get the error message "MapReduceIndexManagement is not supported for BigTable", but I would like to see the full stack trace leading to this error message, to see where this incompatibility stems from. E.g., the code in:

https://github.com/JanusGraph/janusgraph/blob/d954ea02035d8d54b4e1bd5863d1f903e6d57844/janusgraph-hadoop/src/main/java/org/janusgraph/hadoop/MapReduceIndexManagement.java

reads:
HadoopStoreManager storeManager = (HadoopStoreManager) graph.getBackend().getStoreManager().getHadoopManager();
if (storeManager == null) {
    throw new IllegalArgumentException("Store manager class " + graph.getBackend().getStoreManagerClass() + "is not supported");
}

But this is not what you see.

Best wishes,    Marc
 


OLAP Spark

Vinayak Bali
 

Hi All,

I am working on OLAP using Spark and Hadoop. I have a couple of questions.
1. How can I execute a filter step on the driver and create an RDD of internal ids?
2. How can I distribute the collected ids to multiple Spark executors?
3. How can I execute Gremlin in parallel?

Thanks & Regards,
Vinayak


Re: Database Level Caching

Boxuan Li
 

Thanks Nicolas, I am able to reproduce it using your configs & script. Created an issue at https://github.com/JanusGraph/janusgraph/issues/2369

Looks like a bug with calculating cache entries' size.

On Jan 15, 2021, at 11:53 PM, Nicolas Trangosi <nicolas.trangosi@...> wrote:

Hi Boxuan,
The issue seems to occur when edge properties are retrieved: the cache has the expected size with g.V().outE().id() and not when I do g.V().outE().valueMap().

I am able to reproduce it with the following Groovy script:
  • gremlin console (launched with JAVA_OPTS="-Xmx1G -Xms1G" ./bin/gremlin.sh) on JG 0.5.3
  • conf/janusgraph-cache.properties:
gremlin.graph=org.janusgraph.core.JanusGraphFactory

storage.backend=cql
storage.hostname=127.0.0.1
storage.port=9042
schema.default=logging

cache.db-cache: true
cache.db-cache-size: 50000000
cache.db-cache-time: 6000000

  • groovy script:

graph = JanusGraphFactory.open('conf/janusgraph-cache.properties')
g = graph.traversal()


// Schema creation
graph.tx().rollback()
mgmt = g.getGraph().openManagement()

try {
    deviceLabel = mgmt.makeVertexLabel('device').make()
    nameProperty = mgmt.makePropertyKey("name").dataType(java.lang.String).cardinality(org.janusgraph.core.Cardinality.SINGLE).make()
    mgmt.addProperties(deviceLabel, nameProperty)

    measurementLabel = mgmt.makeEdgeLabel('measurement').unidirected().make()
    deviceNameProperty = mgmt.makePropertyKey("deviceName").dataType(java.lang.String).cardinality(org.janusgraph.core.Cardinality.SINGLE).make()
    physicalQuantityProperty = mgmt.makePropertyKey("physicalQuantity").dataType(java.lang.String).cardinality(org.janusgraph.core.Cardinality.SINGLE).make()
    valueProperty = mgmt.makePropertyKey("value").dataType(java.lang.Double).cardinality(org.janusgraph.core.Cardinality.SINGLE).make()
    timestampProperty = mgmt.makePropertyKey("timestamp").dataType(java.util.Date).cardinality(org.janusgraph.core.Cardinality.SINGLE).make()

    mgmt.addProperties(measurementLabel, deviceNameProperty, physicalQuantityProperty, valueProperty, timestampProperty)
    mgmt.buildIndex("deviceByName", Vertex.class).indexOnly(deviceLabel).addKey(nameProperty).buildCompositeIndex();

    //mgmt.buildEdgeIndex(measurementLabel, 'measurementByTimestamp', Direction.OUT, Order.decr, timestampProperty);

    mgmt.commit()
} catch (Exception e) {
    mgmt.rollback();
    throw e;
}

// Load data
random = new Random();
startTs = System.currentTimeMillis();
for (i = 0; i < 100; i++) {
   deviceId = g.addV("device").property("name", "device-" + i).id().next();
   for (k = 0; k < 5000; k++) {
       g.V(deviceId).addE("measurement").
           property("deviceName",  "device-" + i).
           property("physicalQuantity", "physicalQuantity-" + random.nextInt(10)).
           property("value", random.nextDouble()).
           property("timestamp", new Date(startTs + k * 1000)).
           iterate();
       if (k % 1000 == 0) {
           g.tx().commit();
       }
   }
   log.info("Done i={}",i);
}
g.tx().commit();

// Request data 
for (i = 0; i < 100; i++) {
   measurementsList = g.V().has("device", "name", "device-" + i).outE().valueMap().toList();
   log.info("Got {} measurements for {}", measurementsList.size(), i);
}
g.tx().commit();



Le sam. 9 janv. 2021 à 05:21, BO XUAN LI <liboxuan@...> a écrit :
Hi Nicolas,

Looks interesting. Your configs look fine and I couldn’t reproduce your problem. Could you provide some sample code to reproduce it?

Best regards,
Boxuan


On Jan 4, 2021, at 10:20 PM, Nicolas Trangosi <nicolas.trangosi@...> wrote:

Hi Boxuan,

I have configured janusgraph with:

cache.db-cache-time: 600000  
cache.db-cache: true  
cache.db-cache-size: 50000000  
index.search.elasticsearch.create.ext.number_of_replicas: 0
storage.buffer-size: 1024
index.search.elasticsearch.create.ext.number_of_shards: 1
cache.cache.db-cache-time: 0
index.search.index-name: dcbrain
index.search.backend: elasticsearch
storage.port: 9042
ids.block-size: 1000000
schema.default: logging
storage.cql.batch-statement-size: 50
index.search.hostname: dfe-elasticsearch
storage.backend: cql
storage.hostname: dfe-cassandra
storage.cql.local-max-requests-per-connection: 4096
index.search.port: 9200


I loaded some data into the graph and dumped the memory.
When I import this dump into VisualVM, the retained size for ExpirationKCVSCache is 257 MB, while the limit should be 50 MB.

Regards,
Nicolas

Le lun. 4 janv. 2021 à 13:11, BO XUAN LI <liboxuan@...> a écrit :
Hi Nicolas,

Can you provide your configurations and the memory usage you observed?

Regards,
Boxuan

On Jan 4, 2021, at 3:44 PM, Nicolas Trangosi <nicolas.trangosi@...> wrote:

Hi,
I am trying to use Database Level Caching as described in https://docs.janusgraph.org/basics/cache/, but it seems to use more memory than the configured threshold (cache.db-cache-size). Does anyone use this feature? Is it production ready?

Regards,
Nicolas




Re: Janusgraph spark on yarn error

j2kupper@...
 
Edited

Thank you for the response!

I am using BulkLoaderVertexProgram from the console. Sometimes it works correctly.
This error still exists when I run the read-from-HBase Spark job.

My read-hbase.properties:

gremlin.graph=org.apache.tinkerpop.gremlin.hadoop.structure.HadoopGraph
gremlin.hadoop.graphReader=org.janusgraph.hadoop.formats.hbase.HBaseInputFormat
gremlin.hadoop.graphWriter=org.apache.tinkerpop.gremlin.hadoop.structure.io.gryo.GryoOutputFormat

gremlin.hadoop.jarsInDistributedCache=false
gremlin.hadoop.inputLocation=none
gremlin.hadoop.outputLocation=output

janusgraphmr.ioformat.conf.storage.backend=hbase
janusgraphmr.ioformat.conf.storage.hostname=192.168.1.11,192.168.1.12,192.168.1.13,192.168.1.14
janusgraphmr.ioformat.conf.storage.hbase.table=testTable


spark.master=yarn
spark.submit.deployMode=client
spark.yarn.archive=/usr/local/janusgraph/janusgraph_libs.zip
spark.executor.instances=2
spark.driver.memory=8g
spark.driver.cores=4
spark.executor.cores=5
spark.executor.memory=19g

spark.executor.extraClassPath=/usr/local/janusgraph/lib:/usr/local/hadoop/etc/hadoop/conf
spark.executor.extraJavaOptions=-Djava.library.path=/usr/local/hadoop/lib/native
spark.yarn.am.extraJavaOptions=-Djava.library.path=/usr/local/hadoop/lib/native
spark.yarn.appMasterEnv.CLASSPATH=/usr/local/janusgraph/lib:/usr/local/hadoop/etc/hadoop/conf


spark.driver.extraLibraryPath=/usr/local/hadoop/lib/native
spark.executor.extraLibraryPath=/usr/local/hadoop/lib/native

spark.dynamicAllocation.enabled=false
spark.io.compression.codec=snappy
spark.serializer=org.apache.spark.serializer.KryoSerializer
spark.kryo.registrator=org.janusgraph.hadoop.serialize.JanusGraphKryoRegistrator

Can you provide some example code for a Spark application loading data the OLTP way?
Which programming language can I use? (I want Python, if possible.)


Re: reindex job is very slow on ElasticSearch and BigTable

vamsi.lingala@...
 

Thanks a lot for your reply.
I don't get any error message.

The REINDEX step is very slow (2000/s) for the mixed index and fails after running for a few days.
Every time indexing fails or the cluster is restarted, the shard and replica settings declared for the indices are reset to 1 (the default values for ES 7), and thus I have to recreate new indices after disabling the old ones.

Is there any better way to reindex fast?


Re: Janusgraph spark on yarn error

hadoopmarc@...
 

#Private reply from OP:
Yes, I am running a bulk load from HDFS (GraphSON) into janusgraph-hbase.
Yes, I have GraphSON part files from a Spark job with a structure like the grateful-dead.json example.

However, if the application master starts on a certain (the third) Hadoop node, everything works well.
All nodes have identical configuration.

#Answer HadoopMarc
You do not need to use HadoopGraph for this. Indeed, there used to be a BulkLoaderVertexProgram in Apache TinkerPop, but it could not be maintained to keep working reliably across the various versions of the various graph systems. Until now, JanusGraph has not developed its own BulkLoaderVertexProgram. Also note that while there does exist an HBaseInputFormat for loading a janusgraph-hbase graph into a HadoopGraph, there does not exist an HBaseOutputFormat to write a HadoopGraph into janusgraph-hbase.

This being said, nothing is lost. You can simply write a Spark application in which individual Spark executors connect to JanusGraph in the usual (OLTP) way and load data with the usual graph.traversal() API, that is, using the addV(), addE() and property() traversal steps. Of course, you could also try to copy the old code for the BulkLoaderVertexProgram into your project, but I believe the way I sketched is conceptually simpler and less error-prone.

I seem to remember that there exist some blog series about using JanusGraph at scale, but I do not have them at hand and will look for them later on. If you find these blogs yourself, please post the links!

Best wishes,      Marc
