
Re: Issues with controlling partitions when using Apache Spark

hadoopmarc@...
 

Hi Mladen,

Having answered several questions about the JanusGraph InputFormats, I can confirm that many users run into problems with the size of the input splits. This is particularly the case for the HBaseInputFormat, where input splits are equal to HBase regions, and HBase requires regions to have a size on the order of 10GB (compressed binary data!). Users could only work around this by manually and temporarily splitting the HBase regions. For the CassandraInputFormat, problems surface less often because a default of about 500 partitions is used, so you need a lot of data before partition size becomes a limitation.

So, I also encourage you to contribute, if possible!

Also note that there is a fundamental problem to OLAP in graphs: traversing a graph implies shuffling between partitions and this is only efficient if the entire graph fits in the cluster memory. So, where the scalability of JanusGraph OLTP queries is limited by disk space and the performance of the indexing backend, the scalability of OLAP queries is limited by cluster memory.

Best wishes,    Marc


Re: Janusgraph 0.5.3 potential memory leak

sergeymetallic@...
 

After some research I figured out that rolling back this PR https://github.com/JanusGraph/janusgraph/pull/2080/files# helps.


Janusgraph 0.5.3 potential memory leak

sergeymetallic@...
 

JG 0.5.3 (same on 0.5.2); cannot be reproduced on JG 0.3.2.
Backend: ScyllaDB
Indexing backend: Elasticsearch

Steps to reproduce:
1) Create a node with a composite index on the field "X"
2) Create another kind of node (Y) and fill the graph with a lot of them (several million nodes)
3) Create edges with the label L between node X and all the Y nodes
4) Execute the following query in Gremlin: g.V().has("X","value").out("Y").valueMap(true)
5) The query should time out after some time
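
A hedged Gremlin Console sketch of the schema from the steps above (property, index, and label names are illustrative):

```groovy
// Steps 1-3: schema sketch (names are illustrative)
mgmt = graph.openManagement()
x = mgmt.makePropertyKey('X').dataType(String.class).make()
mgmt.buildIndex('byX', Vertex.class).addKey(x).buildCompositeIndex()
mgmt.makeEdgeLabel('L').make()
mgmt.commit()

// ... load one X vertex, several million Y vertices,
// and edges labeled L between them (loading loop omitted) ...

// Step 4: the query that eventually times out on 0.5.x
// (the original post writes out("Y"))
g.V().has('X', 'value').out('L').valueMap(true)
```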

The main point is that only Scylla/Cassandra is involved in the query.

Expected result: Janusgraph operates normally

Observed result: JanusGraph starts consuming all the allocated memory, and one of the CPU cores is loaded at 100%; another execution will load another core, and so on until none are available. The CPU load and memory consumption continue even if there is no further interaction with the system. In the end, JG becomes unresponsive.

The flame chart looks like this: (flame chart image omitted)


Re: Issues with controlling partitions when using Apache Spark

Evgenii Ignatev
 

Hello Mladen,

Yes, we have experienced this issue as well, although we weren't able to fix it.

Your solution sounds very interesting. Could you share your enhancement as a PR (even an unfinished one)?
We did some analysis of the source code back then, so I might be able to help with the PR/tests. Feel free to contact me.

Best regards,
Evgenii Ignatev.

On 26.01.2021 16:34, Florian Hockmann wrote:



Re: Issues with controlling partitions when using Apache Spark

Florian Hockmann
 

Hi Mladen,

 

I wasn’t aware that the CqlInputFormat we’re using is considered legacy. It looks like we should migrate to spark-cassandra-connector, then. Could you please create an issue on GitHub for this?

And if you already have an implementation ready for this, then it would of course be really great if you could contribute it with a PR.

 

Regards,

Florian

 

From: janusgraph-users@... <janusgraph-users@...> On behalf of Mladen Marovic
Sent: Monday, 25 January 2021 17:34
To: janusgraph-users@...
Subject: [janusgraph-users] Issues with controlling partitions when using Apache Spark

 



Re: No results returned with duplicate Has steps in a vertex-search traversal

Boxuan Li
 

Hi,

Can you provide more info on how the fields in your example are indexed? E.g., are the indexes composite or mixed, and what are all the indexes involving any of these fields?
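
The index definitions can be listed from the management API, e.g. (a sketch for the Gremlin Console):

```groovy
// List all graph indexes and whether they are composite or mixed
mgmt = graph.openManagement()
mgmt.getGraphIndexes(Vertex.class).each { idx ->
    println "${idx.name()}: composite=${idx.isCompositeIndex()}, " +
            "mixed=${idx.isMixedIndex()}, " +
            "keys=${idx.getFieldKeys().collect { it.name() }}"
}
mgmt.rollback()
```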

On Tuesday, January 26, 2021, at 4:56 AM, Patrick Streifel <prstreifel@...> wrote:



No results returned with duplicate Has steps in a vertex-search traversal

Patrick Streifel <prstreifel@...>
 

We are running into a JanusGraph bug where a traversal that should return a list of vertices is returning an empty list.

 

Here is some background info:

Using a JanusGraph Server with ConfiguredGraphFactory, running v0.5.2.

Storage: Cassandra v3.11.9

Index: Elasticsearch v6.7.2

Connecting to the server via the Java Gremlin driver.

 

Our use case is this:

 

We are searching for vertices in the graph based on various property filters (e.g. Give me people named "Patrick" with a last name matching the regex "Str.*el"). When we just do this, there are no issues, of course.

 

The tricky part is that we are adding extra filters on a property called DomainGroup, which essentially allows us to filter out results per search user based on what they are interested in seeing. The user running the query provides a list of Domains they are interested in, and there has to be some overlap between the user's Domains and the list of DomainGroups on the vertices for those vertices to be returned. In short, we put in extra "has" steps that filter out vertices in certain groups from the results.

 

Another important note: these "has" steps that filter on Domain occur after each of the other steps in the query. That may not be a great idea for this use case, but we have others where we need it. We have logic that groups together a set of has statements automatically based on user requests. Sometimes this automated process will duplicate certain property searches when constructing the traversal, and it is hard to avoid in certain cases. We could work to deduplicate, but this still seems like a true bug in JanusGraph, albeit for a weird use case.
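
Since the duplicated filters come from an automated query builder, one possible client-side workaround (a sketch, not part of the original post) is to drop exact-duplicate clauses before building the traversal, keyed on their string form:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;

public class DedupSteps {
    // Remove exact-duplicate filter clauses, keeping first-seen order
    static List<String> dedup(List<String> clauses) {
        return new ArrayList<>(new LinkedHashSet<>(clauses));
    }

    public static void main(String[] args) {
        List<String> clauses = Arrays.asList(
            "has(FirstName, textRegex(Patric.*))",
            "has(DomainGroup, within([GROUP_A, GROUP_B]))",
            "has(PersonSurName, textRegex(Str.*el))",
            "has(DomainGroup, within([GROUP_A, GROUP_B]))");
        // The second DomainGroup clause is dropped
        System.out.println(dedup(clauses));
    }
}
```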

 

An example of one of our DomainGroup "has" steps is here:

has(DomainGroup, within([GROUP_A, GROUP_B])),

 

We combed through our DEBUG level logs in the JG Server.

We noticed that JG was querying the Elasticsearch index for results, as expected. Elasticsearch was actually returning the expected vertex(es), but the JG Server was not returning anything after that.

 

Here are some additional conditions we noticed:

  1. This appears only to happen when there are multiple duplicate "has" steps in the traversal.
    a. When we run a traversal with only one property search (has(FirstName, textRegex(Patric.*))) and one DomainGroup filter (has(DomainGroup, within([GROUP_A, GROUP_B]))), we get the expected results.
    b. When we provide two property searches, and thus two (duplicate) DomainGroup filters, we get no results. This leads us to believe there is an issue with duplicate "has" steps, or specifically duplicate "has" steps with "within" filters.
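
In Gremlin Console terms, the two cases above look roughly like this (property names as in the examples):

```groovy
// a. Works: one property search plus one DomainGroup filter
g.V().has('FirstName', textRegex('Patric.*')).
      has('DomainGroup', within('GROUP_A', 'GROUP_B')).
      limit(5).valueMap(true)

// b. Returns empty: two property searches, hence two duplicate
//    DomainGroup within() filters
g.V().has('FirstName', textRegex('Patric.*')).
      has('DomainGroup', within('GROUP_A', 'GROUP_B')).
      has('PersonSurName', textRegex('Str.*el')).
      has('DomainGroup', within('GROUP_A', 'GROUP_B')).
      limit(5).valueMap(true)
```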

 

Example of a traversal that we get empty results with:

args={gremlin=[[], [V(),
  has(FirstName, textRegex(Patric.*)),
  has(DomainGroup, within([GROUP_A, GROUP_B])),
  has(PersonSurName, textRegex(Str.*el)),
  has(DomainGroup, within([GROUP_A, GROUP_B])),
  limit(5), valueMap(), with(~tinkerpop.valueMap.tokens)]], aliases={g=my_graph_traversal}}

 

Logs show the Elasticsearch scroll request returning a document with the correct id, but they also show JG ultimately sending an empty response to our API. Something is lost in between.

 

Just wanted to bring this to your attention. We are figuring out workarounds on our side, but this seems like a JG bug.


Issues with controlling partitions when using Apache Spark

Mladen Marović
 

Hey there!

I've recently been working on some Apache Spark jobs for JanusGraph via hadoop-gremlin (as described at https://docs.janusgraph.org/advanced-topics/hadoop/) and encountered several issues. Generally, I kept having memory issues because the partitions were too big to be loaded into my Spark executors (which I increased up to 16GB per executor).

After analysing the code, I found two parameters that could be used to further subsplit the partitions: cassandra.input.split.size and cassandra.input.split.size_mb. However, when I tried to use these parameters and debugged the memory issues that persisted, I noticed several bugs in the underlying org.apache.cassandra.hadoop.cql3.CqlInputFormat used to load the data. I posted the question on the DataStax community forums (see https://community.datastax.com/questions/10153/how-to-control-partition-size-when-reading-data-wi.html). There it was ultimately suggested that I migrate to the spark-cassandra-connector, because the issues I encountered were probably real bugs, but in legacy code (and probably no longer maintained).
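
For context, a minimal sketch of the Hadoop graph properties involved, assuming the standard CQL setup from the JanusGraph Hadoop documentation (host name, keyspace, and split-size values are illustrative):

```properties
# Read JanusGraph data from Cassandra/CQL via hadoop-gremlin
gremlin.graph=org.apache.tinkerpop.gremlin.hadoop.structure.HadoopGraph
gremlin.hadoop.graphReader=org.janusgraph.hadoop.formats.cql.CqlInputFormat
gremlin.hadoop.graphWriter=org.apache.hadoop.mapreduce.lib.output.NullOutputFormat

janusgraphmr.ioformat.conf.storage.backend=cql
janusgraphmr.ioformat.conf.storage.hostname=127.0.0.1
janusgraphmr.ioformat.conf.storage.cql.keyspace=janusgraph

# Partitioner used by the cluster; required by the input format
cassandra.input.partitioner.class=org.apache.cassandra.dht.Murmur3Partitioner

# The two split-size knobs mentioned above (illustrative values)
cassandra.input.split.size=1024
cassandra.input.split.size_mb=64
```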

In the meantime, I reimplemented the InputFormat classes in my app to fix the issues, and testing so far showed that this now works as intended. However, I was wondering the following:

1. Does anyone else have any experience with using Apache Spark, Janusgraph, and graphs too big to fit into memory without subsplitting? Did you also encounter this issue? If so, how did you deal with it?
2. Is there an "official" solution to this issue?
3. Are there any plans to migrate to the spark-cassandra connector for this use case?

Thanks,

Mladen


Re: JanusGraph and future versions of ES

Assaf Schwartz
 

Thank you very much everyone


Re: JanusGraph and future versions of ES

Peter Corless
 

It would only affect the use of JanusGraph+Elasticsearch if you were offering it in an as-a-Service (-aaS) fashion, like a commercial DBaaS. In that case, the SSPL would require you to publish your modifications to the code base, UI, etc.

We did a side-by-side analysis of SSPL vs. AGPL when it was first announced. You can read that blog here:


The most germane language is this:

SSPL requires that if you offer the software as a service, you must make public, as open source, practically everything related to your service:

“including, without limitation, management software, user interfaces, application program interfaces, automation software, monitoring software, backup software, storage software and hosting software, all such that a user could run an instance of the service using the Service Source Code you make available.”

But if you are just using it internally, or deep within another product (one that an end user doesn't have direct access to) you have no worries.

Disclaimer: this is my opinion; I am not a lawyer. YMMV. Caveat emptor. Void where prohibited by law.

On Fri, Jan 22, 2021, 6:49 AM BO XUAN LI <liboxuan@...> wrote:


Re: Janusgraph traversal time

hadoopmarc@...
 

Sounds normal to me. Note that setting query.batch=true might help a bit. Also note that the graph structure, in particular the number of repeat steps, is important. Finally, note that caching in the storage backend (in addition to caching inside JanusGraph) plays a role, see:
http://yaaics.blogspot.com/2018/04/understanding-reponse-times-for-single.html
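
For reference, enabling batched backend queries is a single setting in the JanusGraph properties file:

```properties
# Fetch data from the storage backend in batches instead of
# issuing one backend call per vertex
query.batch=true
```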

Best wishes,    Marc


Re: JanusGraph and future versions of ES

Boxuan Li
 

If I am understanding correctly, this does not affect the usage of Elasticsearch in JanusGraph. Regarding AWS's decision, it seems no additional configs or code changes are needed if someone wants to use the AWS fork (assuming Amazon does not introduce breaking changes that cause client incompatibility).


Janusgraph traversal time

faizez@...
 

Hi All,
 
I have a multi-level graph with a total of 800 vertices. When I run g.V(<id of parent>).repeat(out()).emit() for the first time, it takes 300ms. In profile(), I can see that the RepeatStep takes most of the time (almost 296ms). I can also see that it executes one backend query per vertex. But when I run the same query again, it completes in 4ms. I know that the second run uses the cache.
My question is: is traversing 800 vertices without the cache in 300ms normal or not?
Note: I'm using JG 0.5.2 with out-of-the-box JanusGraph configuration.


Re: JanusGraph and future versions of ES

hadoopmarc@...
 

Please also read the stance of Elasticsearch itself (because the AWS article did not link to it):

https://www.elastic.co/blog/license-change-clarification


JanusGraph and future versions of ES

Assaf Schwartz
 

Hi all,
Since ES seems to have changed their licensing, what does this entail for the future usage of ES as a JanusGraph index backend?
https://aws.amazon.com/blogs/opensource/stepping-up-for-a-truly-open-source-elasticsearch/

Thanks!


Re: Janusgraph query execution performance

hadoopmarc@...
 

Analytical queries require a full table scan. Some people have succeeded in speeding up analytical queries on JanusGraph using OLAP; check the older questions on OLAP and SparkGraphComputer, and
https://docs.janusgraph.org/advanced-topics/hadoop/

A special case that occurs very frequently is counting the number of vertices per label (you say: concept). Speeding this up is listed in the known issues:
https://github.com/JanusGraph/janusgraph/issues/926

Best wishes,   Marc


Janusgraph query execution performance

lalwani.ritu2609@...
 

Hi,

I have used the https://github.com/IBM/expressive-reasoning-graph-store project to import a Turtle file having around 400,000 (4 lakh) concepts, and this project uses JanusGraph 0.4.0.
Now, after importing, I am able to run queries.
But the problem I am facing is that some queries, which access a small number of nodes, are quite fast, while some queries, like counting the number of concepts in the graph (which access a large number of nodes), are very, very slow. Please note that I have already used indexing.

So is this due to the version of JanusGraph, 0.4.0, which is quite old?
Or will the performance be like this for JanusGraph in general?

Any help will highly be appreciated.

Thanks!!


Re: OLAP Spark

hadoopmarc@...
 

Hi Vinayak,

JanusGraph has defined hadoop InputFormats for its storage backends to do OLAP queries, see https://docs.janusgraph.org/advanced-topics/hadoop/

However, these InputFormats have several performance problems (see the old questions on this list), so your approach could be worthwhile:

1. It is best to create these IDs on ingestion of data into JanusGraph and add them as a vertex property. If you create an index on this property, it is possible to use these id properties for retrieval during OLAP queries.
2. Spark does this automatically if you call rdd.mapPartitions on the RDD with ids.
3. Here is the disadvantage of this approach: you simply run the gremlin query per partition of ids, but you have to merge the per-partition results afterwards, outside gremlin. The merge logic differs per type of query.

Best wishes,     Marc


Re: Janusgraph spark on yarn error

hadoopmarc@...
 

The path of the BulkLoaderVertexProgram might be doable, but I cannot help you with that one. In the stack trace above, the YARN application master from spark-yarn apparently tries to communicate with HBase but finds that various libraries do not match. This failure arises because the JanusGraph distribution does not include spark-yarn and thus is not handcrafted to work with it.

For the path without the BulkLoaderVertexProgram, you inevitably need a JVM language (Java, Scala, Groovy). In this case, a Spark executor is unaware of any other executors running and is simply passed a callable (function) to execute (through RDD.mapPartitions() or through a Spark SQL UDF). This callable can be part of a class that establishes its own JanusGraph instance in the OLTP way. Now you only have to deal with the executor CLASSPATH, which does not need spark-yarn; the libs from the JanusGraph distribution suffice.
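
A rough Java sketch of that pattern (not runnable as-is: it assumes the JanusGraph and Spark dependencies on the executor classpath, and the properties file name and traversal are illustrative):

```java
// Each partition opens its own JanusGraph instance (plain OLTP) and
// runs a traversal per vertex id in the partition.
JavaRDD<Long> vertexIds = ...;  // ids collected at ingestion time

JavaRDD<Map<Object, Object>> results = vertexIds.mapPartitions(ids -> {
    JanusGraph graph = JanusGraphFactory.open("janusgraph-cql.properties");
    GraphTraversalSource g = graph.traversal();
    List<Map<Object, Object>> out = new ArrayList<>();
    ids.forEachRemaining(id -> out.add(g.V(id).valueMap().next()));
    graph.close();
    return out.iterator();
});
// Results must still be merged per partition afterwards,
// outside Gremlin, as described above.
```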

Some example code can be found at:
https://nitinpoddar.medium.com/bulk-loading-data-into-janusgraph-part-2-ca946db26582

Best wishes,    Marc


Re: reindex job is very slow on ElasticSearch and BigTable

hadoopmarc@...
 

I mean: what happens if you try to run MapReduceIndexManagement on BigTable? Apparently you get the error message "MapReduceIndexManagement is not supported for BigTable", but I would like to see the full stack trace leading to this message, to see where the incompatibility stems from. E.g., the code in:

https://github.com/JanusGraph/janusgraph/blob/d954ea02035d8d54b4e1bd5863d1f903e6d57844/janusgraph-hadoop/src/main/java/org/janusgraph/hadoop/MapReduceIndexManagement.java

reads:
HadoopStoreManager storeManager = (HadoopStoreManager) graph.getBackend().getStoreManager().getHadoopManager();
if (storeManager == null) {
    throw new IllegalArgumentException("Store manager class " + graph.getBackend().getStoreManagerClass() + "is not supported");
}

But this is not what you see.

Best wishes,    Marc
 
