GraphTraversal Thread Stuck


Sujay Bothe <ssbothe3@...>
 

Hello,


JanusGraph version - 0.5.3
Cassandra version - 3.11.4


I am facing an issue where a GraphTraversal hasNext() call got stuck.
The thread that invoked the traversal is still stuck; its stack trace is below.


"MYTHREAD" #80 prio=5 os_prio=0 tid=0x00007f74f82f9000 nid=0x1ab9 in Object.wait() [0x00007f746a8ec000]
   java.lang.Thread.State: WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        at java.lang.Object.wait(Object.java:502)
        at io.vavr.concurrent.FutureImpl$$Lambda$226/250860313.run(Unknown Source)
        at io.vavr.control.Try.run(Try.java:105)
        at io.vavr.concurrent.FutureImpl.await(FutureImpl.java:114)
        - locked <0x00000000c42f2e60> (a java.lang.Object)
        at org.janusgraph.diskstorage.cql.CQLKeyColumnValueStore.interruptibleWait(CQLKeyColumnValueStore.java:308)
        at org.janusgraph.diskstorage.cql.CQLKeyColumnValueStore.getSlice(CQLKeyColumnValueStore.java:289)
        at org.janusgraph.diskstorage.keycolumnvalue.KCVSProxy.getSlice(KCVSProxy.java:76)
        at org.janusgraph.diskstorage.configuration.backend.KCVSConfiguration$1.call(KCVSConfiguration.java:97)
        at org.janusgraph.diskstorage.configuration.backend.KCVSConfiguration$1.call(KCVSConfiguration.java:94)
        at org.janusgraph.diskstorage.util.BackendOperation.execute(BackendOperation.java:147)
        at org.janusgraph.diskstorage.util.BackendOperation$1.call(BackendOperation.java:161)
        at org.janusgraph.diskstorage.util.BackendOperation.executeDirect(BackendOperation.java:68)
        at org.janusgraph.diskstorage.util.BackendOperation.execute(BackendOperation.java:54)
        at org.janusgraph.diskstorage.util.BackendOperation.execute(BackendOperation.java:158)
        at org.janusgraph.diskstorage.configuration.backend.KCVSConfiguration.get(KCVSConfiguration.java:94)
        at org.janusgraph.graphdb.tinkerpop.JanusGraphVariables.get(JanusGraphVariables.java:46)
        at MyObjectStrategy.apply(MyObjectStrategy.java:411)
        at org.apache.tinkerpop.gremlin.process.traversal.util.DefaultTraversalStrategies.applyStrategies(DefaultTraversalStrategies.java:88)
        at org.apache.tinkerpop.gremlin.process.traversal.util.DefaultTraversal.applyStrategies(DefaultTraversal.java:124)
        at org.apache.tinkerpop.gremlin.process.traversal.util.DefaultTraversal.hasNext(DefaultTraversal.java:196)
        at MyTraversal.get(MyTraversal.java:300)


As you can see above, the CQLKeyColumnValueStore is waiting on the future object. 
It has been stuck for a couple of days now.

I went through the code of CQLKeyColumnValueStore.java in JanusGraph.
It executes the slice query on the executorService thread pool of CQLStoreManager.java.
The names of the threads in that pool start with 'CQLStoreManager'.


I checked the state of all 20 CQLStoreManager threads in my Java process, and all of them are in the WAITING (parking) state:

"CQLStoreManager[00]" #191278 daemon prio=5 os_prio=0 tid=0x00007f73f837e000 nid=0x3f0a waiting on condition [0x00007f73cd46c000]
   java.lang.Thread.State: WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for  <0x00000000c3c26ff8> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
        at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
        at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
        at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1074)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1134)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
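
The same check can also be made from inside the JVM instead of reading a full thread dump. Below is a minimal sketch that groups live threads by state for a given name prefix. The class and method names are made up for illustration and are not part of JanusGraph; only the 'CQLStoreManager' prefix comes from this thread.

```java
import java.util.Map;
import java.util.TreeMap;

class PoolThreadStates {
    // Counts live threads whose name starts with namePrefix, grouped by Thread.State.
    static Map<Thread.State, Integer> countStates(String namePrefix) {
        Map<Thread.State, Integer> counts = new TreeMap<>();
        for (Thread t : Thread.getAllStackTraces().keySet()) {
            if (t.getName().startsWith(namePrefix)) {
                counts.merge(t.getState(), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Prefix taken from the thread names reported above.
        System.out.println(countStates("CQLStoreManager"));
    }
}
```

Pool workers parked in LinkedBlockingQueue.take(), as in the dump above, are idle and waiting for a new task, so a result showing every worker as WAITING suggests that none of them is still running the slice query.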


So I cannot understand what happened to the CQLStoreManager thread that was supposed to complete the future object on which
CQLKeyColumnValueStore.interruptibleWait(CQLKeyColumnValueStore.java:308) is waiting.
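
To illustrate the hazard: a wait without a timeout never returns if the future is never completed, whereas a bounded wait turns the hang into a visible failure. The sketch below uses the JDK's CompletableFuture, not the vavr Future seen in the stack trace, and it is not JanusGraph's code. The class name, method name, and fallback value are assumptions made for illustration.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

class BoundedWait {
    // Waits up to timeoutMillis for the future; returns fallback if it is not completed in time.
    static <T> T getOrFallback(CompletableFuture<T> future, long timeoutMillis, T fallback) {
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            return fallback;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt flag for the caller
            return fallback;
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        }
    }

    public static void main(String[] args) {
        // A future that nobody completes, standing in for a lost backend response.
        CompletableFuture<String> neverCompleted = new CompletableFuture<>();
        System.out.println(getOrFallback(neverCompleted, 100, "timed out"));
        System.out.println(getOrFallback(CompletableFuture.completedFuture("row"), 100, "timed out"));
    }
}
```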

One more piece of information: this happened around the time one of the Cassandra nodes was disconnected
and the traversal was failing with TemporaryBackendException (not enough replicas available).
Afterwards the node rejoined the cluster and quorum was reached, but the thread remained stuck.

Can someone please help here?
Do let me know if any additional information is needed.


Thanks,
Sujay Bothe


Boxuan Li
 

Hi Sujay,

I am not sure about the root cause (it might be a JanusGraph bug or a DataStax CQL driver bug), but you could try JanusGraph 0.6.0 and disable the `storage.cql.executor-service.enabled` option (https://docs.janusgraph.org/configs/configuration-reference/#storagecqlexecutor-service). With that option disabled, 0.6.0 does not use an internal thread pool for CQLStoreManager, unlike 0.5.3. If the problem still exists, I would argue it is more likely to be a bug in the DataStax CQL driver.
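
As a rough sketch of what that configuration could look like, the snippet below assembles the relevant keys with plain java.util.Properties. The hostname is a placeholder and the class and method names are hypothetical; only `storage.cql.executor-service.enabled` is taken from the suggestion above, and the sketch assumes JanusGraph 0.6.0 with the CQL backend.

```java
import java.util.Properties;

class CqlNoExecutorConfig {
    // Builds JanusGraph-style properties with the CQL executor service switched off.
    static Properties build(String hostname) {
        Properties p = new Properties();
        p.setProperty("storage.backend", "cql");
        p.setProperty("storage.hostname", hostname); // placeholder host, replace with your own
        p.setProperty("storage.cql.executor-service.enabled", "false");
        return p;
    }

    public static void main(String[] args) {
        // Prints the key/value pairs so they can be copied into a properties file.
        build("127.0.0.1").list(System.out);
    }
}
```

The same key/value pairs would normally go into the properties file that is passed to JanusGraphFactory.open.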

Best,
Boxuan


ssbothe3@...
 

Hi

You are suggesting the above experiment to isolate the issue, right?


I thought about not using the CQL executor service, but we are at an early stage and have not yet run any workload tests to figure out the correct CQL driver config params.

So I would prefer to keep the executor service as a safety net that prevents too many parallel calls to the CQL driver.
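
For what it's worth, the "safety net" idea of capping parallel backend calls can also be expressed without a thread pool. The following is only a sketch using a JDK Semaphore; it is neither JanusGraph's executor service nor the driver's own throttling, and the class name, method names, and limits are invented for illustration.

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

class BoundedCalls {
    private final Semaphore permits;

    BoundedCalls(int maxInFlight) {
        this.permits = new Semaphore(maxInFlight);
    }

    // Runs backendCall only if a permit is obtained within waitMillis; otherwise fails fast.
    <T> T call(Supplier<T> backendCall, long waitMillis) {
        boolean acquired;
        try {
            acquired = permits.tryAcquire(waitMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for a permit", e);
        }
        if (!acquired) {
            throw new IllegalStateException("too many in-flight backend calls");
        }
        try {
            return backendCall.get();
        } finally {
            permits.release(); // always return the permit, even if the call throws
        }
    }

    int availablePermits() {
        return permits.availablePermits();
    }

    public static void main(String[] args) {
        BoundedCalls limiter = new BoundedCalls(2);
        System.out.println(limiter.call(() -> "result", 10));
    }
}
```

Because the caller runs the work itself and waits for a permit with a timeout, an overload shows up as an exception rather than as a thread that waits indefinitely.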


It also looks like someone else faced a similar issue:

 

https://lists.lfaidata.foundation/g/janusgraph-users/topic/thread_goes_into_waiting/79937111?p=,,,20,0,0,0::recentpostdate/sticky,,,20,0,0,79937111,previd=1634107810504447777,nextid=1630650635690684483&previd=1634107810504447777&nextid=1630650635690684483

Thanks,
Sujay Bothe