Re: Create new node for each group of connected nodes

anjanisingh22@...
 

On Sat, Jan 30, 2021 at 10:42 PM, <hadoopmarc@...> wrote:
Thanks Marc for the quick response.
Regards,
Anjani


Re: Create new node for each group of connected nodes

hadoopmarc@...
 

Hi Anjani,

Your use case obviously comes down to an OLAP query. While JanusGraph provides InputFormat classes for using TinkerPop's SparkGraphComputer and HadoopGraph, many users have experienced problems with them; see e.g. the latest thread:
https://lists.lfaidata.foundation/g/janusgraph-users/topic/issues_with_controlling/80107845?p=,,,20,0,0,0::recentpostdate%2Fsticky,,,20,2,0,80107845

If you could get OLAP with SparkGraphComputer running on your graph with sufficient performance, an additional advantage would be that you could apply TinkerPop's ConnectedComponentVertexProgram.
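
A minimal sketch of that idea, assuming a working HadoopGraph read configuration (the path conf/hadoop-graph/read-cql.properties and the property name "component" are just examples, not something from your setup):

import org.apache.tinkerpop.gremlin.process.traversal.dsl.graph.GraphTraversalSource;
import org.apache.tinkerpop.gremlin.process.traversal.step.map.ConnectedComponent;
import org.apache.tinkerpop.gremlin.spark.process.computer.SparkGraphComputer;
import org.apache.tinkerpop.gremlin.structure.Graph;
import org.apache.tinkerpop.gremlin.structure.T;
import org.apache.tinkerpop.gremlin.structure.util.GraphFactory;
import java.util.Map;

Graph hadoopGraph = GraphFactory.open("conf/hadoop-graph/read-cql.properties");
GraphTraversalSource g = hadoopGraph.traversal().withComputer(SparkGraphComputer.class);
// label every vertex with its component id, then collect the vertex ids per component
Map<Object, Object> components = g.V().connectedComponent()
        .with(ConnectedComponent.propertyName, "component")
        .group().by("component").by(T.id)
        .next();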

A safer way to go, not depending on the JanusGraph InputFormats, would be:
  • run an OLTP query that writes all vertex ids to a file (see the sketch after this list). This may take days, but it also gives you a baseline of how long a full table scan takes and what parallelism you need to get a reasonable running time. Be sure to iterate the traversal and not keep all ids in memory.
  • Use the file with ids as input to a Spark job that runs, for each vertex, the Gremlin query to get all connected vertices. If the starting vertex has the lowest id, add the additional required vertex and edges (relying on detecting the new vertex is not safe on an eventually consistent backend). Each Spark executor can instantiate its own embedded JanusGraph instance, and queries inside an executor are done in an OLTP way.
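
A minimal sketch of the first step, assuming an embedded JanusGraph instance opened from a properties file (the file names janusgraph-cql-es.properties and vertex-ids.txt are just examples):

import org.apache.tinkerpop.gremlin.process.traversal.dsl.graph.GraphTraversal;
import org.apache.tinkerpop.gremlin.process.traversal.dsl.graph.GraphTraversalSource;
import org.apache.tinkerpop.gremlin.structure.Vertex;
import org.janusgraph.core.JanusGraph;
import org.janusgraph.core.JanusGraphFactory;
import java.io.BufferedWriter;
import java.nio.file.Files;
import java.nio.file.Paths;

JanusGraph graph = JanusGraphFactory.open("conf/janusgraph-cql-es.properties");
GraphTraversalSource g = graph.traversal();
try (BufferedWriter writer = Files.newBufferedWriter(Paths.get("vertex-ids.txt"))) {
    GraphTraversal<Vertex, Object> ids = g.V().id();
    while (ids.hasNext()) {            // stream the ids; never collect all 700M+ of them in memory
        writer.write(ids.next().toString());
        writer.newLine();
    }
}
graph.close();
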
Best wishes,    Marc


Create new node for each group of connected nodes

anjanisingh22@...
 

Hi All,

We are using JanusGraph 0.5.2 with Cassandra as the storage backend and Elasticsearch as the search engine. We have 700M+ nodes.
The nodes are already connected by edges.

We have a use case to add one more node for each group of connected nodes and then create edges between the newly created node and the existing nodes.
For example, say

nodes A and B are connected by an edge.
nodes C, D and E are connected by edges.

then,
create one node for A and B, and create edges between the newly created node and the existing nodes
create one node for C, D and E, and create edges between the newly created node and the existing nodes

I would appreciate suggestions on how to achieve this, considering our huge graph size.

 

Thanks in advance.

Regards,
Anjani


Re: Recommended way to perform Schema / Data migration

hadoopmarc@...
 

Hi Nick,

What do you want to migrate? Vertex labels? Property keys? Did you take a look at:

https://docs.janusgraph.org/basics/schema/#changing-schema-elements
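
For illustration, renaming a schema element as described on that page could look like the following, assuming an open JanusGraph instance `graph` (the key name "place" and the new name "location" are just examples):

import org.janusgraph.core.PropertyKey;
import org.janusgraph.core.schema.JanusGraphManagement;

JanusGraphManagement mgmt = graph.openManagement();
PropertyKey place = mgmt.getPropertyKey("place");
mgmt.changeName(place, "location");   // rename the property key "place" to "location"
mgmt.commit();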

Best wishes,    Marc


Re: reindex job is very slow on ElasticSearch and BigTable

hadoopmarc@...
 

OK, thanks for confirming the stacktrace. You can report this behavior as an issue on https://github.com/JanusGraph/janusgraph/issues, referring to this thread. It is still not clear to me how this exception can occur, because the BigTable compatibility layer reuses the HBase backend, for which graph.getBackend().getStoreManager().getHadoopManager() is available.

So, I am afraid there is no quick fix for your issue, unless you start debugging MapReduceIndexManagement for BigTable yourself. Maybe, simply reloading the graph is an option.
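
For reference while debugging: the invocation path that ends up in that check looks roughly like the sketch below (the index name "byNameMixed" is just an example):

import org.janusgraph.core.schema.JanusGraphManagement;
import org.janusgraph.core.schema.SchemaAction;
import org.janusgraph.hadoop.MapReduceIndexManagement;

JanusGraphManagement mgmt = graph.openManagement();
MapReduceIndexManagement mr = new MapReduceIndexManagement(graph);
mr.updateIndex(mgmt.getGraphIndex("byNameMixed"), SchemaAction.REINDEX).get();
mgmt.commit();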


Best wishes,    Marc


Re: Connecting to Multiple Schemas using Java

hadoopmarc@...
 

Hi Vinayak,

The TinkerPop reference docs give the following code fragment for connecting with the cluster method:

Cluster cluster = Cluster.open();
GraphTraversalSource g1 = traversal().withRemote(DriverRemoteConnection.using(cluster, "g1"));
GraphTraversalSource g2 = traversal().withRemote(DriverRemoteConnection.using(cluster, "g2"));
Is this what you tried (I do not see it in your question)?
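
If so, a query against one of the sources would then look like this (just a sketch; the traversal itself is an arbitrary example):

List<Object> ids = g1.V().limit(10).id().toList();   // executes remotely against the traversal source bound as "g1" on the server
cluster.close();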

Best wishes,    Marc


Re: Issues with controlling partitions when using Apache Spark

Mladen Marović
 

Thanks for the responses.

I'll create a github issue for this then, and also create a PR with the changes that fixed this issue for me, in case anyone finds it useful.

I'm also interested in doing the spark-cassandra-connector implementation; however, it might take a while until I get around to it.


Re: Problem with index never becoming ENABLED.

vamsi.lingala@...
 

Register the index before you enable it.
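
A minimal sketch of that sequence with the management API (the index name "byNameComposite" is just an example):

import org.janusgraph.core.schema.JanusGraphManagement;
import org.janusgraph.core.schema.SchemaAction;
import org.janusgraph.core.schema.SchemaStatus;
import org.janusgraph.graphdb.database.management.ManagementSystem;

JanusGraphManagement mgmt = graph.openManagement();
mgmt.updateIndex(mgmt.getGraphIndex("byNameComposite"), SchemaAction.REGISTER_INDEX).get();
mgmt.commit();
// wait until the index is REGISTERED on all instances before enabling it
ManagementSystem.awaitGraphIndexStatus(graph, "byNameComposite").status(SchemaStatus.REGISTERED).call();
mgmt = graph.openManagement();
mgmt.updateIndex(mgmt.getGraphIndex("byNameComposite"), SchemaAction.ENABLE_INDEX).get();
mgmt.commit();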


Re: Janusgraph 0.5.3 potential memory leak

Oleksandr Porunov
 

What exactly do you mean by that? Do you mean to change the implementation of `ofStaticBuffer`?
I mean that possibly we need to change the logic back to using `StaticArrayEntryList.of(Iterable<E> ... ...)` instead of `StaticArrayEntryList.of(Iterator<E> ... ...)`. If so, we may need to use `Lazy.of` again, but then we need to think about what exactly it returns (previously it returned an ArrayList, but that is again additional computation which would be better to avoid).
We may also think about improving `StaticArrayEntryList.of(Iterator<E> ... ...)` so that it does not cause memory problems, but I haven't looked deeply into that logic yet.
The first thing I'm thinking about: maybe we could change `StaticArrayEntryList.of(Iterator<E> ... ...)` to have the same logic as `StaticArrayEntryList.of(Iterable<E> ... ...)`. Of course, we can't use the iterator twice, but we could store the intermediate elements in some singly linked list. I guess something like:
class SinglyLinkedList<E> {
    E value;
    SinglyLinkedList<E> nextElement;
}
That said, I didn't compare the space and time complexity of `StaticArrayEntryList.of(Iterable<E> ... ...)` vs `StaticArrayEntryList.of(Iterator<E> ... ...)`.


Connecting to Multiple Schemas using Java

Vinayak Bali
 

Hi,
I am trying to connect to multiple schemas through Java using the Cluster method. The properties files are as follows:

gremlin-server.yaml
# Copyright 2019 JanusGraph Authors
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#      http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

host: 0.0.0.0
port: 8182
scriptEvaluationTimeout: 30000
channelizer: org.apache.tinkerpop.gremlin.server.channel.WsAndHttpChannelizer
graphManager: org.janusgraph.graphdb.management.JanusGraphManager
graphs: {
  ConfigurationManagementGraph: conf/janusgraph-cql-configurationgraph.properties,
  graph1: conf/graph1.properties,
  graph2: conf/graph2.properties
}
scriptEngines: {
  gremlin-groovy: {
    plugins: { org.janusgraph.graphdb.tinkerpop.plugin.JanusGraphGremlinPlugin: {},
               org.apache.tinkerpop.gremlin.server.jsr223.GremlinServerGremlinPlugin: {},
               org.apache.tinkerpop.gremlin.tinkergraph.jsr223.TinkerGraphGremlinPlugin: {},
               org.apache.tinkerpop.gremlin.jsr223.ImportGremlinPlugin: {classImports: [java.lang.Math], methodImports: [java.lang.Math#*]},
               org.apache.tinkerpop.gremlin.jsr223.ScriptFileGremlinPlugin: {files: []}}}}
serializers:
  - { className: org.apache.tinkerpop.gremlin.driver.ser.GryoMessageSerializerV3d0, config: { ioRegistries: [org.janusgraph.graphdb.tinkerpop.JanusGraphIoRegistry] }}
  - { className: org.apache.tinkerpop.gremlin.driver.ser.GryoMessageSerializerV3d0, config: { serializeResultToString: true }}
  - { className: org.apache.tinkerpop.gremlin.driver.ser.GraphSONMessageSerializerV3d0, config: { ioRegistries: [org.janusgraph.graphdb.tinkerpop.JanusGraphIoRegistry] }}
  # Older serialization versions for backwards compatibility:
  - { className: org.apache.tinkerpop.gremlin.driver.ser.GryoMessageSerializerV1d0, config: { ioRegistries: [org.janusgraph.graphdb.tinkerpop.JanusGraphIoRegistry] }}
  - { className: org.apache.tinkerpop.gremlin.driver.ser.GryoLiteMessageSerializerV1d0, config: {ioRegistries: [org.janusgraph.graphdb.tinkerpop.JanusGraphIoRegistry] }}
  - { className: org.apache.tinkerpop.gremlin.driver.ser.GryoMessageSerializerV1d0, config: { serializeResultToString: true }}
  - { className: org.apache.tinkerpop.gremlin.driver.ser.GraphSONMessageSerializerGremlinV2d0, config: { ioRegistries: [org.janusgraph.graphdb.tinkerpop.JanusGraphIoRegistry] }}
  - { className: org.apache.tinkerpop.gremlin.driver.ser.GraphSONMessageSerializerGremlinV1d0, config: { ioRegistries: [org.janusgraph.graphdb.tinkerpop.JanusGraphIoRegistryV1d0] }}
  - { className: org.apache.tinkerpop.gremlin.driver.ser.GraphSONMessageSerializerV1d0, config: { ioRegistries: [org.janusgraph.graphdb.tinkerpop.JanusGraphIoRegistryV1d0] }}
processors:
  - { className: org.apache.tinkerpop.gremlin.server.op.session.SessionOpProcessor, config: { sessionTimeout: 28800000 }}
  - { className: org.apache.tinkerpop.gremlin.server.op.traversal.TraversalOpProcessor, config: { cacheExpirationTime: 600000, cacheMaxSize: 1000 }}
metrics: {
  consoleReporter: {enabled: true, interval: 180000},
  csvReporter: {enabled: true, interval: 180000, fileName: /tmp/gremlin-server-metrics.csv},
  jmxReporter: {enabled: true},
  slf4jReporter: {enabled: true, interval: 180000},
  gangliaReporter: {enabled: false, interval: 180000, addressingMode: MULTICAST},
  graphiteReporter: {enabled: false, interval: 180000}}
maxInitialLineLength: 4096
maxHeaderSize: 8192
maxChunkSize: 8192
maxContentLength: 65536
maxAccumulationBufferComponents: 1024
resultIterationBatchSize: 64
writeBufferLowWaterMark: 32768
writeBufferHighWaterMark: 65536

graph1.properties

# Copyright 2019 JanusGraph Authors
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#      http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# JanusGraph configuration sample: Cassandra & Elasticsearch over sockets
#
# This file connects to Cassandra and Elasticsearch services running
# on localhost over the CQL API and the Elasticsearch native
# "Transport" API on their respective default ports.  The Cassandra
# and Elasticsearch services must already be running before starting
# JanusGraph with this file.

# The implementation of graph factory that will be used by gremlin server
#
# Default:    org.janusgraph.core.JanusGraphFactory
# Data Type:  String
# Mutability: LOCAL
gremlin.graph=org.janusgraph.core.JanusGraphFactory

# The primary persistence provider used by JanusGraph.  This is required.
# It should be set one of JanusGraph's built-in shorthand names for its
# standard storage backends (shorthands: berkeleyje, cassandrathrift,
# cassandra, astyanax, embeddedcassandra, cql, hbase, inmemory) or to the
# full package and classname of a custom/third-party StoreManager
# implementation.
#
# Default:    (no default value)
# Data Type:  String
# Mutability: LOCAL
storage.backend=cql

# The hostname or comma-separated list of hostnames of storage backend
# servers.  This is only applicable to some storage backends, such as
# cassandra and hbase.
#
# Default:    127.0.0.1
# Data Type:  class java.lang.String[]
# Mutability: LOCAL
storage.hostname=127.0.0.1

# The name of JanusGraph's keyspace.  It will be created if it does not
# exist.
#
# Default:    janusgraph
# Data Type:  String
# Mutability: LOCAL
storage.cql.keyspace=graph1

# Whether to enable JanusGraph's database-level cache, which is shared
# across all transactions. Enabling this option speeds up traversals by
# holding hot graph elements in memory, but also increases the likelihood
# of reading stale data.  Disabling it forces each transaction to
# independently fetch graph elements from storage before reading/writing
# them.
#
# Default:    false
# Data Type:  Boolean
# Mutability: MASKABLE
cache.db-cache = true

# How long, in milliseconds, database-level cache will keep entries after
# flushing them.  This option is only useful on distributed storage
# backends that are capable of acknowledging writes without necessarily
# making them immediately visible.
#
# Default:    50
# Data Type:  Integer
# Mutability: GLOBAL_OFFLINE
#
# Settings with mutability GLOBAL_OFFLINE are centrally managed in
# JanusGraph's storage backend.  After starting the database for the first
# time, this file's copy of this setting is ignored.  Use JanusGraph's
# Management System to read or modify this value after bootstrapping.
cache.db-cache-clean-wait = 20

# Default expiration time, in milliseconds, for entries in the
# database-level cache. Entries are evicted when they reach this age even
# if the cache has room to spare. Set to 0 to disable expiration (cache
# entries live forever or until memory pressure triggers eviction when set
# to 0).
#
# Default:    10000
# Data Type:  Long
# Settings with mutability GLOBAL_OFFLINE are centrally managed in
# JanusGraph's storage backend.  After starting the database for the first
# time, this file's copy of this setting is ignored.  Use JanusGraph's
# Management System to read or modify this value after bootstrapping.
cache.db-cache-time = 180000

# Size of JanusGraph's database level cache.  Values between 0 and 1 are
# interpreted as a percentage of VM heap, while larger values are
# interpreted as an absolute size in bytes.
#
# Default:    0.3
# Data Type:  Double
# Mutability: MASKABLE
cache.db-cache-size = 0.25

# Connect to an already-running ES instance on localhost

# The indexing backend used to extend and optimize JanusGraph's query
# functionality. This setting is optional.  JanusGraph can use multiple
# heterogeneous index backends.  Hence, this option can appear more than
# once, so long as the user-defined name between "index" and "backend" is
# unique among appearances. Similar to the storage backend, this should be
# set to one of JanusGraph's built-in shorthand names for its standard
# index backends (shorthands: lucene, elasticsearch, es, solr) or to the
# full package and classname of a custom/third-party IndexProvider
# implementation.
#
# Default:    elasticsearch
# Data Type:  String
# Mutability: GLOBAL_OFFLINE
#
# Settings with mutability GLOBAL_OFFLINE are centrally managed in
# JanusGraph's storage backend.  After starting the database for the first
# time, this file's copy of this setting is ignored.  Use JanusGraph's
# Management System to read or modify this value after bootstrapping.
index.search.backend=elasticsearch

# The hostname or comma-separated list of hostnames of index backend
# servers.  This is only applicable to some index backends, such as
# elasticsearch and solr.
#
# Default:    127.0.0.1
# Data Type:  class java.lang.String[]
# Mutability: MASKABLE
index.search.hostname=127.0.0.1

graph2.properties is the same as graph1.properties; the only change is the keyspace name.

empty-sample.groovy

// Copyright 2019 JanusGraph Authors
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
//      http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.

// an init script that returns a Map allows explicit setting of global bindings.
def globals = [:]

// defines a sample LifeCycleHook that prints some output to the Gremlin Server console.
// note that the name of the key in the "global" map is unimportant.
globals << [hook : [
        onStartUp: { ctx ->
            ctx.logger.info("Executed once at startup of Gremlin Server.")
        },
        onShutDown: { ctx ->
            ctx.logger.info("Executed once at shutdown of Gremlin Server.")
        }
] as LifeCycleHook]

// define the default TraversalSource to bind queries to - this one will be named "g".
graph1=JanusGraphFactory.open('conf/graph1.properties')
graph2=JanusGraphFactory.open('conf/graph2.properties')
globals << [ g1 : graph1.traversal() , g2 : graph2.traversal()]

When I run a Gremlin query with g1 or g2, I get the error that g1/g2 is not defined.
But if we use g, it uses graph2 and returns the result.
How can we connect to different schemas using traversals?

Thanks & Regards,
Vinayak


Re: Janusgraph 0.5.3 potential memory leak

rngcntr
 

What @porunov mentions looks quite interesting. When I made the change in the code, I didn't actually notice that I changed the signature that is used for `ofStaticBuffer`. But as you mentioned, it now looks like the reason to use `Lazy.of` is gone in the newer version using the `Iterator`, and thus looping twice cannot be the issue.

I think, we can change the old solution to return an Iterable as well but don't call `iterator` for resultSet 2 times.
What exactly do you mean by that? Do you mean to change the implementation of `ofStaticBuffer`?

One thing I've found is that `StaticArrayEntryList.of(Iterator<E> ... ...)` repeatedly calls a self-implemented method called `ensureSpace`, which allocates a new array twice as large as the old one and copies the entries over to the new one. Although the JVM should GC the old (and now unused) array, this behavior seems prone to causing memory problems if the unused arrays are not dropped correctly. This method is not used in the `StaticArrayEntryList.of(Iterable<E> ... ...)` implementation.
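
For readers following along, the growth pattern described here looks roughly like the following (an illustration only, not the actual JanusGraph code):

// illustration of an ensureSpace-style doubling strategy
static byte[] ensureSpace(byte[] data, int offset, int needed) {
    if (offset + needed <= data.length) return data;
    byte[] larger = new byte[Math.max(data.length * 2, offset + needed)];
    System.arraycopy(data, 0, larger, 0, offset);
    return larger;   // the old array becomes garbage; it must not stay reachable anywhere
}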


Re: Transaction Management

rngcntr
 

We've had another post about memory leaks just the other day, here: https://lists.lfaidata.foundation/g/janusgraph-users/message/5544
Do you think what you encountered is a duplicate of that problem or is it something different?


Re: reindex job is very slow on ElasticSearch and BigTable

vamsi@...
 

Got the same error:

throw new IllegalArgumentException("Store manager class " + graph.getBackend().getStoreManagerClass() + "is not supported");


Re: Janusgraph 0.5.3 potential memory leak

Oleksandr Porunov
 

One more thing I noticed is that previously we were passing an `Iterable` to `ofStaticBuffer` and right now we are passing an `Iterator`, and those methods are actually computed differently.
Here are the first and the second methods:
private static <E,D> EntryList of(Iterable<E> elements, StaticArrayEntry.GetColVal<E,D> getter, StaticArrayEntry.DataHandler<D> dataHandler)
private static <E,D> EntryList of(Iterator<E> elements, StaticArrayEntry.GetColVal<E,D> getter, StaticArrayEntry.DataHandler<D> dataHandler)

If we check the code, their implementations are slightly different. The first method iterates over `elements` twice and computes something on the first pass, whereas the second method iterates over `elements` only once.

I do understand now why we used `Lazy.of` previously: it's just because we were looping over `elements` twice instead of once.
I guess the main problem in the previous model was that we were adding all elements into an ArrayList inside the `Lazy.of` code. I think we can change the old solution to return an Iterable as well, but not call `iterator` on the resultSet twice.
That said, these are just some quick observations; I didn't go deep into the logic.


Transaction Management

ryssavage@...
 

Hello. I have been having memory leaks while using JanusGraph and found that there were a few places where I was not explicitly closing transactions, which I thought might be the culprit. I am now closing all transactions explicitly, but I still get out-of-memory errors. Someone suggested that I look at the txlog table in my Cassandra database to see if there are any stray transactions there. I see 14 rows in that table. Before I go and mess with that table, I would like to understand what it is for and what the rows inside it indicate. Are they always indicative of stale transactions?

I have done some local tests: I ran a whole bunch of queries where I closed the transactions and never saw anything pop up in this table. Then I ran a whole bunch of queries where I didn't close the transactions, and one row appeared in this table.

Can someone please explain what is going on here?


Re: Janusgraph 0.5.3 potential memory leak

sergeymetallic@...
 

I did not figure out the reason for the problem, but what is interesting is that the CPU does not recover even after an hour or two, and the process continues reading from Scylla at a pretty high speed. It looks like the reading process was not interrupted properly. The root cause of the issue is not obvious and may require deeper profiling.


Re: Janusgraph 0.5.3 potential memory leak

Oleksandr Porunov
 

Thank you for reporting this bug!

That's interesting. The one difference I see is that now the code performs `rs.iterator()` immediately (and not lazily as it did previously). That said, I didn't check whether that's the root cause of the problem or not.
`rs.iterator()` may cause some issues with memory management in that place (line 328 in the PR), but it should be verified. I guess we need to check whether `rs.iterator()` adds any memory pressure during iterator construction.
My point is that `Lazy.of` (which was removed in the PR) memoizes the computation. Thus, repeated calls to `lazyList.get()` will always return the same object, whereas repeated calls to `rs.iterator()` create new, different iterators.
That said, it's just a spontaneous guess and the problem might be with something else.


Recommended way to perform Schema / Data migration

nick.ood17@...
 

Hello,

I would like to ask the following:

Is there any recommended way to perform Schema/Data migration?

Thanks in advance,
Nick.


Re: Janusgraph 0.5.3 potential memory leak

rngcntr
 

Hi there!

What you describe looks very interesting and is definitely not intended behavior. You mentioned my PR which seems to cause these troubles. That's quite interesting because this PR was actually merged to *improve* memory handling, not *worsen* it :P

Since the PR is rather small and you have probably already had a look at the changes it made: did you find anything that looks suspicious right away? I would be happy to find and fix this bug, and it would be great if you shared everything you have already found out.


Re: Issues with controlling partitions when using Apache Spark

hadoopmarc@...
 

Hi Mladen,

Having answered several questions about the JanusGraph InputFormats, I can confirm that many users encounter problems with the size of the input splits. This is the case in particular for the HBaseInputFormat, where input splits are equal to HBase regions and HBase requires regions to have a size on the order of 10 GB (compressed binary data!). Users could only work around this by manually and temporarily splitting the HBase regions. For the CassandraInputFormat, problems surface less often because a default of about 500 partitions is used, so you need a lot of data before partition size becomes a limitation.

So, I also encourage you to contribute, if possible!

Also note that there is a fundamental problem with OLAP on graphs: traversing a graph implies shuffling between partitions, and this is only efficient if the entire graph fits in the cluster memory. So, where the scalability of JanusGraph OLTP queries is limited by disk space and the performance of the indexing backend, the scalability of OLAP queries is limited by cluster memory.

Best wishes,    Marc
