Not able to connect when 1 of 3 nodes is down in the Cassandra cluster


Bharat Dighe <bdi...@...>
 

I am using Titan 1.0 and am planning to move to JanusGraph very soon.

I have the following keyspace:

CREATE KEYSPACE my_ks WITH replication = {'class': 'NetworkTopologyStrategy', 'DC1': '2', 'DC2': '1'}  AND durable_writes = true;

The current status of the cluster nodes is as follows; one of the nodes in DC1 is down.

|/ State=Normal/Leaving/Joining/Moving
--  Address        Load       Tokens  Owns    Host ID                               Rack
DN  IP1            2.8 MB     256     ?       2a5abdad-af65-48e7-a74c-d40f1f759460  rac2
UN  IP2            4.33 MB    256     ?       4897d661-24d3-4d30-b07a-00a8103635f6  rac1
Datacenter: Sunnyside_DC
========================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address        Load       Tokens  Owns    Host ID                               Rack
UN  IP3            5.24 MB    256     ?       f830e2a9-6eea-4617-88dd-d63e44beb115  rac1


Titan is able to connect to the node in DC2 but fails to connect to the UP node in DC1.

TitanGraph graph = TitanFactory.build()
    .set("storage.backend", "cassandra")
    .set("storage.hostname", "IP2")
    .set("storage.port", 9160)
    .set("storage.cassandra.keyspace", "my_ks")
    .set("storage.read-only", false)
    .set("query.force-index", false)
    .set("storage.cassandra.astyanax.connection-pool-type", "ROUND_ROBIN")
    .set("storage.cassandra.astyanax.node-discovery-type", "NONE")
    .set("storage.cassandra.read-consistency-level", "LOCAL_QUORUM")
    .set("storage.cassandra.write-consistency-level", "LOCAL_QUORUM")
    .set("storage.cassandra.atomic-batch-mutate", false)
    .open();

It gives the following exception:

22:11:19,655 ERROR CountingConnectionPoolMonitor:94 - com.netflix.astyanax.connectionpool.exceptions.TokenRangeOfflineException: TokenRangeOfflineException: [host=IP2:9160, latency=100(100), attempts=1]UnavailableException()
com.netflix.astyanax.connectionpool.exceptions.TokenRangeOfflineException: TokenRangeOfflineException: [host=IP2:9160, latency=100(100), attempts=1]UnavailableException()
    at com.netflix.astyanax.thrift.ThriftConverter.ToConnectionPoolException(ThriftConverter.java:165)
    at com.netflix.astyanax.thrift.AbstractOperationImpl.execute(AbstractOperationImpl.java:65)
    at com.netflix.astyanax.thrift.AbstractOperationImpl.execute(AbstractOperationImpl.java:28)
    at com.netflix.astyanax.thrift.ThriftSyncConnectionFactoryImpl$ThriftConnection.execute(ThriftSyncConnectionFactoryImpl.java:153)
    at com.netflix.astyanax.connectionpool.impl.AbstractExecuteWithFailoverImpl.tryOperation(AbstractExecuteWithFailoverImpl.java:119)
    at com.netflix.astyanax.connectionpool.impl.AbstractHostPartitionConnectionPool.executeWithFailover(AbstractHostPartitionConnectionPool.java:352)
    at com.netflix.astyanax.thrift.ThriftColumnFamilyQueryImpl$4.execute(ThriftColumnFamilyQueryImpl.java:538)
    at com.thinkaurelius.titan.diskstorage.cassandra.astyanax.AstyanaxKeyColumnValueStore.getNamesSlice(AstyanaxKeyColumnValueStore.java:112)
    at com.thinkaurelius.titan.diskstorage.cassandra.astyanax.AstyanaxKeyColumnValueStore.getNamesSlice(AstyanaxKeyColumnValueStore.java:78)
    at com.thinkaurelius.titan.diskstorage.cassandra.astyanax.AstyanaxKeyColumnValueStore.getSlice(AstyanaxKeyColumnValueStore.java:67)
    at com.thinkaurelius.titan.diskstorage.configuration.backend.KCVSConfiguration$1.call(KCVSConfiguration.java:91)
    at com.thinkaurelius.titan.diskstorage.configuration.backend.KCVSConfiguration$1.call(KCVSConfiguration.java:1)
    at com.thinkaurelius.titan.diskstorage.util.BackendOperation.execute(BackendOperation.java:133)
    at com.thinkaurelius.titan.diskstorage.util.BackendOperation$1.call(BackendOperation.java:147)
    at com.thinkaurelius.titan.diskstorage.util.BackendOperation.executeDirect(BackendOperation.java:56)
    at com.thinkaurelius.titan.diskstorage.util.BackendOperation.execute(BackendOperation.java:42)
    at com.thinkaurelius.titan.diskstorage.util.BackendOperation.execute(BackendOperation.java:144)
    at com.thinkaurelius.titan.diskstorage.configuration.backend.KCVSConfiguration.get(KCVSConfiguration.java:88)
    at com.thinkaurelius.titan.diskstorage.configuration.BasicConfiguration.isFrozen(BasicConfiguration.java:93)
    at com.thinkaurelius.titan.graphdb.configuration.GraphDatabaseConfiguration.<init>(GraphDatabaseConfiguration.java:1338)
    at com.thinkaurelius.titan.core.TitanFactory.open(TitanFactory.java:94)
    at com.thinkaurelius.titan.core.TitanFactory.open(TitanFactory.java:84)
    at com.thinkaurelius.titan.core.TitanFactory$Builder.open(TitanFactory.java:139)
    at TestGraph.main(TestGraph.java:20)
Caused by: UnavailableException()
    at org.apache.cassandra.thrift.Cassandra$multiget_slice_result$multiget_slice_resultStandardScheme.read(Cassandra.java:14687)
    at org.apache.cassandra.thrift.Cassandra$multiget_slice_result$multiget_slice_resultStandardScheme.read(Cassandra.java:14633)
    at org.apache.cassandra.thrift.Cassandra$multiget_slice_result.read(Cassandra.java:14559)
    at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:78)
    at org.apache.cassandra.thrift.Cassandra$Client.recv_multiget_slice(Cassandra.java:741)
    at org.apache.cassandra.thrift.Cassandra$Client.multiget_slice(Cassandra.java:725)
    at com.netflix.astyanax.thrift.ThriftColumnFamilyQueryImpl$4$1.internalExecute(ThriftColumnFamilyQueryImpl.java:544)
    at com.netflix.astyanax.thrift.ThriftColumnFamilyQueryImpl$4$1.internalExecute(ThriftColumnFamilyQueryImpl.java:541)
    at com.netflix.astyanax.thrift.AbstractOperationImpl.execute(AbstractOperationImpl.java:60)
    ... 22 more


Please help me to resolve this.

Thanks
Bharat


Jason Plurad <plu...@...>
 

This is more of a Cassandra question than JanusGraph/Titan. If you have two nodes in DC1 and the read/write consistency settings are LOCAL_QUORUM, you can't reach a local quorum in DC1 when one node is down.
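
To make the arithmetic concrete, here is a rough sketch using the replication factors from the keyspace above. It assumes the usual quorum rule of floor(replicas / 2) + 1 and that each live node holds one replica; the class and method names (QuorumMath, quorum, satisfiable) are made up for illustration and are not part of Cassandra or Titan.

```java
public class QuorumMath {
    // Quorum size for a given number of replicas: floor(n / 2) + 1.
    public static int quorum(int replicas) {
        return replicas / 2 + 1;
    }

    // True when enough replicas are alive to meet the required count.
    public static boolean satisfiable(int aliveReplicas, int required) {
        return aliveReplicas >= required;
    }

    public static void main(String[] args) {
        int rfDc1 = 2, rfDc2 = 1;        // from the CREATE KEYSPACE statement
        int aliveDc1 = 1, aliveDc2 = 1;  // IP1 is down; IP2 and IP3 are up

        int localQuorumDc1 = quorum(rfDc1);        // 2 of 2 replicas in DC1
        int globalQuorum = quorum(rfDc1 + rfDc2);  // 2 of 3 replicas overall

        System.out.println("LOCAL_QUORUM in DC1 reachable: "
                + satisfiable(aliveDc1, localQuorumDc1));
        System.out.println("LOCAL_ONE in DC1 reachable: "
                + satisfiable(aliveDc1, 1));
        System.out.println("QUORUM reachable: "
                + satisfiable(aliveDc1 + aliveDc2, globalQuorum));
    }
}
```

With a replication factor of 2, a quorum is 2, so LOCAL_QUORUM in DC1 tolerates no node failures at all.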

You could try either LOCAL_ONE or QUORUM.

http://docs.datastax.com/en/cassandra/2.1/cassandra/dml/dml_config_consistency_c.html
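
As a sketch of what that change might look like, the snippet below collects the storage settings from your post into a plain map with both consistency levels swapped out. The ConsistencyOverride class and its settings helper are hypothetical, used only so the example runs without Titan on the classpath; in your code the same key/value pairs would go through the existing set(...) calls on TitanFactory.build().

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class ConsistencyOverride {
    // Storage settings from the original post, with the read and write
    // consistency levels replaced by the given level.
    public static Map<String, Object> settings(String consistencyLevel) {
        Map<String, Object> cfg = new LinkedHashMap<>();
        cfg.put("storage.backend", "cassandra");
        cfg.put("storage.hostname", "IP2");
        cfg.put("storage.port", 9160);
        cfg.put("storage.cassandra.keyspace", "my_ks");
        cfg.put("storage.cassandra.read-consistency-level", consistencyLevel);
        cfg.put("storage.cassandra.write-consistency-level", consistencyLevel);
        return cfg;
    }

    public static void main(String[] args) {
        // Each entry would be passed to set(key, value) before open().
        settings("LOCAL_ONE").forEach((k, v) -> System.out.println(k + " = " + v));
    }
}
```

Keep in mind that LOCAL_ONE only waits for a single replica in the local datacenter, so it gives weaker consistency guarantees than LOCAL_QUORUM.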


On Sunday, July 23, 2017 at 9:14:12 AM UTC-4, Bharat Dighe wrote: