Date
1 - 1 of 1
Delays during commit processing
graham...@...
Hi,
I posted a question a while back to the users list (here: https://groups.google.com/forum/#!topic/janusgraph-users/pGEZ0yMEP9k ) and subsequently did a bit of digging.
In a nutshell, I am seeing approximately 10 second delays every time I commit. Since startup involves around 350 commits it can take up to an hour to start my server.
I progressively added performance trace to a small number of classes involved in commit processing, and have found something that I really don't understand. To me it looks like a bug, but I could be wrong. Would appreciate any thoughts from the dev community.
The most important classes I added trace to are CacheTransaction, BackendOperation and HBaseStoreManager.
Here are the relevant snippets: I used log.error so that I would not need to enable full debug logging.
CacheTransaction:
private int persist(final Map<String, Map<StaticBuffer, KCVMutation>> subMutations) {
BackendOperation.execute(new Callable<Boolean>() {
@Override
public Boolean call() throws Exception {
log.error("PERF: === CacheTransaction.persist callable mutateMany");
manager.mutateMany(subMutations, tx);
log.error("PERF: === CacheTransaction.persist callable mutateMany done");
return true;
}
@Override
public String toString() {
return "CacheMutation";
}
}, maxWriteTime);
log.error("PERF: === CacheTransaction.persist submutations clear");
subMutations.clear();
log.error("PERF: === CacheTransaction.persist submutations clear done");
return 0;
}
BackendOperation
public static final<V> V executeDirect(Callable<V> exe, Duration totalWaitTime) throws BackendException {
Preconditions.checkArgument(!totalWaitTime.isZero(),"Need to specify a positive waitTime: %s",totalWaitTime);
long maxTime = System.currentTimeMillis()+totalWaitTime.toMillis();
log.error("PERF: === BackendOperation.executeDirect totalWaitTime={} ms, maxTime={}", totalWaitTime.toMillis(),maxTime);
Duration waitTime = pertubateTime(BASE_REATTEMPT_TIME);
BackendException lastException;
log.error("PERF: === BackendOperation.executeDirect pre while");
while (true) {
try {
log.error("PERF: === BackendOperation.executeDirect exe call");
return exe.call();
} catch (final Throwable e) {
log.error("PERF: === BackendOperation.executeDirect exe call threw exception {}", e.getMessage());
//Find inner-most StorageException
Throwable ex = e;
BackendException storeEx = null;
do {
if (ex instanceof BackendException) storeEx = (BackendException)ex;
} while ((ex=ex.getCause())!=null);
if (storeEx!=null && storeEx instanceof TemporaryBackendException) {
lastException = storeEx;
} else if (e instanceof BackendException) {
throw (BackendException)e;
} else {
throw new PermanentBackendException("Permanent exception while executing backend operation "+exe.toString(),e);
}
}
//Wait and retry
log.error("PERF: === BackendOperation.executeDirect wait and retry");
assert lastException!=null;
if (System.currentTimeMillis()+waitTime.toMillis()<maxTime) {
log.info("Temporary exception during backend operation ["+exe.toString()+"]. Attempting backoff retry.",lastException);
try {
Thread.sleep(waitTime.toMillis());
} catch (InterruptedException r) {
log.error("PERF: === BackendOperation.executeDirect sleep interrupted");
// added thread interrupt signal to support traversal interruption
Thread.currentThread().interrupt();
throw new PermanentBackendException("Interrupted while waiting to retry failed backend operation", r);
}
} else {
log.error("PERF: === BackendOperation.executeDirect reached maxTime");
break;
}
waitTime = pertubateTime(waitTime.multipliedBy(2));
log.error("PERF: === BackendOperation.executeDirect wait time now {}",waitTime);
}
throw new TemporaryBackendException("Could not successfully complete backend operation due to repeated temporary exceptions after "+totalWaitTime,lastException);
}
HBaseStoreManager.
public void mutateMany(Map<String, Map<StaticBuffer, KCVMutation>> mutations, StoreTransaction txh) throws BackendException {
logger.error("HBaseStoreManager: mutateMany, begin");
final MaskedTimestamp commitTime = new MaskedTimestamp(txh);
// In case of an addition and deletion with identical timestamps, the
// deletion tombstone wins.
// http://hbase.apache.org/book/versions.html#d244e4250
logger.error("HBaseStoreManager: mutateMany, convertToCommands");
final Map<StaticBuffer, Pair<List<Put>, Delete>> commandsPerKey =
convertToCommands(
mutations,
commitTime.getAdditionTime(times),
commitTime.getDeletionTime(times));
final List<Row> batch = new ArrayList<>(commandsPerKey.size()); // actual batch operation
logger.error("HBaseStoreManager: mutateMany, construct batch");
// convert sorted commands into representation required for 'batch' operation
for (Pair<List<Put>, Delete> commands : commandsPerKey.values()) {
if (commands.getFirst() != null && !commands.getFirst().isEmpty())
batch.addAll(commands.getFirst());
if (commands.getSecond() != null)
batch.add(commands.getSecond());
}
logger.error("HBaseStoreManager: mutateMany, batch additions done");
try {
TableMask table = null;
try {
logger.error("HBaseStoreManager: mutateMany, get table");
table = cnx.getTable(tableName);
logger.error("HBaseStoreManager: mutateMany, batch (size={})",batch.size());
table.batch(batch, new Object[batch.size()]);
} finally {
IOUtils.closeQuietly(table);
}
} catch (IOException e) {
throw new TemporaryBackendException(e);
} catch (InterruptedException e) {
throw new TemporaryBackendException(e);
}
logger.error("HBaseStoreManager: sleepAfterWrite, commitTime={}",commitTime);
sleepAfterWrite(txh, commitTime);
logger.error("HBaseStoreManager: sleepAfterWrite, done");
}
With the above logging in place, I narrowed down the cause of the time delays to the following. Note in particular the thread switch (to pool5-thread-1) and the 4 attempts to execute the operation. I was kind of expecting that they might be due to retries caused by waiting for HBase - but we don't get to the HBaseManager until the last line of this snippet.
2018-07-04 10:16:45,149 ERROR - [pool-1-thread-10 - cd79b8f7-4f8f-4a2f-a1ba-e0fded20b7bc:] ~ PERF: === CacheTransaction.persist callable mutateMany (CacheTransaction$1:104)
2018-07-04 10:16:49,489 ERROR - [pool-5-thread-1:] ~ PERF: === BackendOperation.executeDirect totalWaitTime=4000 ms, maxTime=1530695813489 (BackendOperation:65)
2018-07-04 10:16:49,489 ERROR - [pool-5-thread-1:] ~ PERF: === BackendOperation.executeDirect pre while (BackendOperation:68)
2018-07-04 10:16:49,489 ERROR - [pool-5-thread-1:] ~ PERF: === BackendOperation.executeDirect exe call (BackendOperation:71)
2018-07-04 10:16:54,494 ERROR - [pool-5-thread-1:] ~ PERF: === BackendOperation.executeDirect totalWaitTime=4000 ms, maxTime=1530695818494 (BackendOperation:65)
2018-07-04 10:16:54,494 ERROR - [pool-5-thread-1:] ~ PERF: === BackendOperation.executeDirect pre while (BackendOperation:68)
2018-07-04 10:16:54,494 ERROR - [pool-5-thread-1:] ~ PERF: === BackendOperation.executeDirect exe call (BackendOperation:71)
2018-07-04 10:16:55,130 ERROR - [pool-1-thread-10 - cd79b8f7-4f8f-4a2f-a1ba-e0fded20b7bc:] ~ PERF: === BackendOperation.executeDirect totalWaitTime=10000 ms, maxTime=1530695825130 (BackendOperation:65)
2018-07-04 10:16:55,132 ERROR - [pool-1-thread-10 - cd79b8f7-4f8f-4a2f-a1ba-e0fded20b7bc:] ~ PERF: === BackendOperation.executeDirect pre while (BackendOperation:68)
2018-07-04 10:16:55,133 ERROR - [pool-1-thread-10 - cd79b8f7-4f8f-4a2f-a1ba-e0fded20b7bc:] ~ PERF: === BackendOperation.executeDirect exe call (BackendOperation:71)
2018-07-04 10:16:55,136 ERROR - [pool-1-thread-10 - cd79b8f7-4f8f-4a2f-a1ba-e0fded20b7bc:] ~ PERF: === BackendOperation.executeDirect totalWaitTime=10000 ms, maxTime=1530695825136 (BackendOperation:65)
2018-07-04 10:16:55,137 ERROR - [pool-1-thread-10 - cd79b8f7-4f8f-4a2f-a1ba-e0fded20b7bc:] ~ PERF: === BackendOperation.executeDirect pre while (BackendOperation:68)
2018-07-04 10:16:55,138 ERROR - [pool-1-thread-10 - cd79b8f7-4f8f-4a2f-a1ba-e0fded20b7bc:] ~ PERF: === BackendOperation.executeDirect exe call (BackendOperation:71)
2018-07-04 10:16:55,141 ERROR - [pool-1-thread-10 - cd79b8f7-4f8f-4a2f-a1ba-e0fded20b7bc:] ~ HBaseStoreManager: mutateMany, begin (HBaseStoreManager:414)
So I am puzzled as to why there appear to be 4 attempts to invoke the mutateMany operation, with wait times of 4 seconds, 4 seconds, 10 seconds and 10 seconds.
Why any retries?
Why the 4, 4, 10, 10 escalation? (I have examined the wait time perturbation and still don't get this).
But most importantly I am puzzling over why CacheTransaction creates a Callable and asks BackendOperation to execute it directly instead of using the ExecutorService and Futures. I can't see exactly what is causing the thread switch - but it appears to be redundant and costs a chunk of time each time we switch threads.
I realize that this is performance critical code so there may be a great reason that I am not aware of yet. If not, then I am happy to experiment with converting this to using the ExecutorService.
Best regards
Graham