Re: [BLOG] Configuring JanusGraph for spark-yarn


HadoopMarc <bi...@...>
 

Hi Joseph,

You ran into terrain I have not yet covered myself. Up till now I have been using the graben1437 PR for Titan, and for OLAP I adopted a poor man's approach in which node ids are distributed over Spark tasks and each Spark executor makes its own Titan/HBase connection. This performs well, but it lacks the nice abstraction of the HBaseInputFormat.
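A minimal sketch of that approach (Spark + Scala, written against JanusGraph rather than Titan); the properties file name and the per-partition work are placeholders, not the exact code:

    import org.apache.spark.SparkContext
    import org.janusgraph.core.JanusGraphFactory

    object PoorMansOlap {
      // Count how many of the given vertex ids exist, using one graph connection per Spark task.
      def countExisting(sc: SparkContext, vertexIds: Seq[Long], numTasks: Int): Long = {
        sc.parallelize(vertexIds, numTasks)
          .mapPartitions { ids =>
            // each executor/task opens its own JanusGraph/HBase connection
            val graph = JanusGraphFactory.open("conf/janusgraph-hbase.properties")
            val g = graph.traversal()
            val found = ids.count(id => g.V(Long.box(id)).hasNext)  // per-vertex work goes here
            graph.close()
            Iterator.single(found.toLong)
          }
          .reduce(_ + _)  // sum the per-partition counts on the driver
      }
    }

The point is only that the connection is opened inside mapPartitions, so it lives on the executor rather than on the driver.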

So, no clear answer to this one, but just some thoughts:
 - could you try to move some regions manually and see what it does to performance? (see the shell example below this list)
 - how do your OLAP vertex count times compare to the OLTP count times?
 - how does the sum of Spark task execution times compare to the YARN start-to-end time difference you reported? In other words, how much of the start-to-end time is spent waiting for timeouts?
 - unless you managed to create a vertex larger than 1 GB, the RowTooBigException sounds like a bug (which you can report on JanusGraph's GitHub page). HBase does not like large rows at all, so vertex/edge properties should not have blob values.
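For the manual region move, something like the following from the hbase shell should do (the encoded region name and target server below are placeholders):

    balance_switch false
    move 'ENCODED_REGION_NAME', 'regionserver-2.example.com,16020,1501234567890'

It would be interesting to see whether the hot region server cools down and the Spark task times even out after that.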
 
@(David Robinson): do you have any additional thoughts on this?

Cheers,    Marc

On Monday, August 7, 2017 at 23:12:02 UTC+2, Joseph Obernberger wrote:

Hi Marc - I've been able to get it to run longer, but I am now getting a RowTooBigException from HBase.  How does JanusGraph store data in HBase?  The current max size of a row is 1 GByte, which makes me think this error is covering something else up.

What I'm seeing so far in testing with a 5 server cluster - each machine with 128G of RAM:
The HBase table is 1.5G in size, split across 7 regions, and has 20,001,105 rows.  A g.V().count() takes 2 hours and returns 3,842,755 vertices.

Another HBase table is 5.7G in size, split across 10 regions, and has 57,620,276 rows; the count took 6.5 hours and returned 10,859,491 nodes.  While running, it hits one server very hard even though the YARN tasks are distributed across the cluster - one HBase node gets hammered.

The RowTooBigException is below.  Anything to try?  Thank you for any help!


org.janusgraph.core.JanusGraphException: Could not process individual retrieval call
                at org.janusgraph.graphdb.query.QueryUtil.processIntersectingRetrievals(QueryUtil.java:257)
                at org.janusgraph.graphdb.transaction.StandardJanusGraphTx$6.execute(StandardJanusGraphTx.java:1269)
                at org.janusgraph.graphdb.transaction.StandardJanusGraphTx$6.execute(StandardJanusGraphTx.java:1137)
                at org.janusgraph.graphdb.query.QueryProcessor$LimitAdjustingIterator.getNewIterator(QueryProcessor.java:209)
                at org.janusgraph.graphdb.query.LimitAdjustingIterator.hasNext(LimitAdjustingIterator.java:75)
                at org.janusgraph.graphdb.query.ResultSetIterator.nextInternal(ResultSetIterator.java:54)
                at org.janusgraph.graphdb.query.ResultSetIterator.next(ResultSetIterator.java:67)
                at org.janusgraph.graphdb.query.ResultSetIterator.next(ResultSetIterator.java:28)
                at com.google.common.collect.Iterators$7.computeNext(Iterators.java:651)
                at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:143)
                at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:138)
                at org.janusgraph.hadoop.formats.util.input.current.JanusGraphHadoopSetupImpl.getTypeInspector(JanusGraphHadoopSetupImpl.java:60)
                at org.janusgraph.hadoop.formats.util.JanusGraphVertexDeserializer.<init>(JanusGraphVertexDeserializer.java:55)
                at org.janusgraph.hadoop.formats.util.GiraphInputFormat.lambda$static$0(GiraphInputFormat.java:49)
                at org.janusgraph.hadoop.formats.util.GiraphInputFormat$RefCountedCloseable.acquire(GiraphInputFormat.java:100)
                at org.janusgraph.hadoop.formats.util.GiraphRecordReader.<init>(GiraphRecordReader.java:47)
                at org.janusgraph.hadoop.formats.util.GiraphInputFormat.createRecordReader(GiraphInputFormat.java:67)
                at org.apache.spark.rdd.NewHadoopRDD$$anon$1.<init>(NewHadoopRDD.scala:166)
                at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:133)
                at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:65)
                at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
                at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
                at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
                at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
                at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
                at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
                at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
                at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
                at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
                at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
                at org.apache.spark.scheduler.Task.run(Task.scala:89)
                at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:242)
                at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
                at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
                at java.lang.Thread.run(Thread.java:745)
Caused by: org.janusgraph.core.JanusGraphException: Could not call index
                at org.janusgraph.graphdb.transaction.StandardJanusGraphTx$6$6.call(StandardJanusGraphTx.java:1262)
                at org.janusgraph.graphdb.query.QueryUtil.processIntersectingRetrievals(QueryUtil.java:255)
                ... 34 more
Caused by: org.janusgraph.core.JanusGraphException: Could not execute operation due to backend exception
                at org.janusgraph.diskstorage.util.BackendOperation.execute(BackendOperation.java:57)
                at org.janusgraph.diskstorage.BackendTransaction.executeRead(BackendTransaction.java:444)
                at org.janusgraph.diskstorage.BackendTransaction.indexQuery(BackendTransaction.java:395)
                at org.janusgraph.graphdb.query.graph.MultiKeySliceQuery.execute(MultiKeySliceQuery.java:51)
                at org.janusgraph.graphdb.database.IndexSerializer.query(IndexSerializer.java:529)
                at org.janusgraph.graphdb.transaction.StandardJanusGraphTx$6$6$1.lambda$call$5(StandardJanusGraphTx.java:1258)
                at org.janusgraph.graphdb.query.profile.QueryProfiler.profile(QueryProfiler.java:97)
                at org.janusgraph.graphdb.query.profile.QueryProfiler.profile(QueryProfiler.java:89)
                at org.janusgraph.graphdb.query.profile.QueryProfiler.profile(QueryProfiler.java:81)
                at org.janusgraph.graphdb.transaction.StandardJanusGraphTx$6$6$1.call(StandardJanusGraphTx.java:1258)
                at org.janusgraph.graphdb.transaction.StandardJanusGraphTx$6$6$1.call(StandardJanusGraphTx.java:1255)
                at com.google.common.cache.LocalCache$LocalManualCache$1.load(LocalCache.java:4742)
                at com.google.common.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3527)
                at com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2319)
                at com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2282)
                at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2197)
                at com.google.common.cache.LocalCache.get(LocalCache.java:3937)
                at com.google.common.cache.LocalCache$LocalManualCache.get(LocalCache.java:4739)
                at org.janusgraph.graphdb.transaction.StandardJanusGraphTx$6$6.call(StandardJanusGraphTx.java:1255)
                ... 35 more
Caused by: org.janusgraph.diskstorage.TemporaryBackendException: Could not successfully complete backend operation due to repeated temporary exceptions after PT10S
                at org.janusgraph.diskstorage.util.BackendOperation.executeDirect(BackendOperation.java:101)
                at org.janusgraph.diskstorage.util.BackendOperation.execute(BackendOperation.java:55)
                ... 53 more
Caused by: org.janusgraph.diskstorage.TemporaryBackendException: Temporary failure in storage backend
                at org.janusgraph.diskstorage.hbase.HBaseKeyColumnValueStore.getHelper(HBaseKeyColumnValueStore.java:202)
                at org.janusgraph.diskstorage.hbase.HBaseKeyColumnValueStore.getSlice(HBaseKeyColumnValueStore.java:90)
                at org.janusgraph.diskstorage.keycolumnvalue.KCVSProxy.getSlice(KCVSProxy.java:77)
                at org.janusgraph.diskstorage.keycolumnvalue.KCVSProxy.getSlice(KCVSProxy.java:77)
                at org.janusgraph.diskstorage.BackendTransaction$5.call(BackendTransaction.java:398)
                at org.janusgraph.diskstorage.BackendTransaction$5.call(BackendTransaction.java:395)
                at org.janusgraph.diskstorage.util.BackendOperation.executeDirect(BackendOperation.java:69)
                ... 54 more
Caused by: org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed after attempts=35, exceptions:
Sat Aug 05 07:22:03 EDT 2017, RpcRetryingCaller{globalStartTime=1501932111280, pause=100, retries=35}, org.apache.hadoop.hbase.regionserver.RowTooBigException: rg.apache.hadoop.hbase.regionserver.RowTooBigException: Max row size allowed: 1073741824, but the row is bigger than that.
                at org.apache.hadoop.hbase.regionserver.StoreScanner.next(StoreScanner.java:564)
                at org.apache.hadoop.hbase.regionserver.KeyValueHeap.next(KeyValueHeap.java:147)
                at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.populateResult(HRegion.java:5697)
                at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.nextInternal(HRegion.java:5856)
                at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.nextRaw(HRegion.java:5634)
                at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.next(HRegion.java:5611)
                at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.next(HRegion.java:5597)
                at org.apache.hadoop.hbase.regionserver.HRegion.get(HRegion.java:6792)
                at org.apache.hadoop.hbase.regionserver.HRegion.get(HRegion.java:6770)
                at org.apache.hadoop.hbase.regionserver.RSRpcServices.get(RSRpcServices.java:2023)
                at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:33644)
                at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2170)
                at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:109)
                at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:185)
                at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:165)


On 8/6/2017 3:50 PM, HadoopMarc wrote:
Hi ... and others,  I have been offline for a few weeks enjoying a holiday and will now start looking into your questions and make the suggested corrections. Thanks for following the recipes and helping others with them.

..., did you run the recipe on the same HDP sandbox and the same TinkerPop version? I remember (from 4 weeks ago) that copying the zookeeper.znode.parent property from the HBase configs to the JanusGraph configs was essential to get JanusGraph's HBaseInputFormat working (that is, to read graph data for the Spark tasks).
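For reference, the relevant lines of the JanusGraph properties file for the HBaseInputFormat look roughly like this (the /hbase-unsecure value is just the usual HDP default; take the actual znode and zookeeper quorum from your own hbase-site.xml):

    janusgraphmr.ioformat.conf.storage.backend=hbase
    janusgraphmr.ioformat.conf.storage.hostname=<your zookeeper quorum>
    janusgraphmr.ioformat.conf.storage.hbase.table=janusgraph
    janusgraphmr.ioformat.conf.storage.hbase.ext.zookeeper.znode.parent=/hbase-unsecure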

Cheers,    Marc

On Monday, July 24, 2017 at 10:12:13 UTC+2, spi...@... wrote:
Hi, thanks for your post.
I followed the steps in the post, but I ran into a problem.
15:58:49,110  INFO SecurityManager:58 - Changing view acls to: rc
15:58:49,110  INFO SecurityManager:58 - Changing modify acls to: rc
15:58:49,110  INFO SecurityManager:58 - SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(rc); users with modify permissions: Set(rc)
15:58:49,111  INFO Client:58 - Submitting application 25 to ResourceManager
15:58:49,320  INFO YarnClientImpl:274 - Submitted application application_1500608983535_0025
15:58:49,321  INFO SchedulerExtensionServices:58 - Starting Yarn extension services with app application_1500608983535_0025 and attemptId None
15:58:50,325  INFO Client:58 - Application report for application_1500608983535_0025 (state: ACCEPTED)
15:58:50,326  INFO Client:58 -
client token: N/A
diagnostics: N/A
ApplicationMaster host: N/A
ApplicationMaster RPC port: -1
queue: default
start time: 1500883129115
final status: UNDEFINED
user: rc
15:58:51,330  INFO Client:58 - Application report for application_1500608983535_0025 (state: ACCEPTED)
15:58:52,333  INFO Client:58 - Application report for application_1500608983535_0025 (state: ACCEPTED)
15:58:53,335  INFO Client:58 - Application report for application_1500608983535_0025 (state: ACCEPTED)
15:58:54,337  INFO Client:58 - Application report for application_1500608983535_0025 (state: ACCEPTED)
15:58:55,340  INFO Client:58 - Application report for application_1500608983535_0025 (state: ACCEPTED)
15:58:56,343  INFO Client:58 - Application report for application_1500608983535_0025 (state: ACCEPTED)
15:58:56,802  INFO YarnSchedulerBackend$YarnSchedulerEndpoint:58 - ApplicationMaster registered as NettyRpcEndpointRef(null)
15:58:56,822  INFO YarnClientSchedulerBackend:58 - Add WebUI Filter. org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter, Map(PROXY_HOSTS -> dl-rc-optd-ambari-master-v-test-1.host.dataengine.com,dl-rc-optd-ambari-master-v-test-2.host.dataengine.com, PROXY_URI_BASES -> http://dl-rc-optd-ambari-master-v-test-1.host.dataengine.com:8088/proxy/application_1500608983535_0025,http://dl-rc-optd-ambari-master-v-test-2.host.dataengine.com:8088/proxy/application_1500608983535_0025), /proxy/application_1500608983535_0025
15:58:56,824  INFO JettyUtils:58 - Adding filter: org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter
15:58:57,346  INFO Client:58 - Application report for application_1500608983535_0025 (state: RUNNING)
15:58:57,347  INFO Client:58 -
client token: N/A
diagnostics: N/A
ApplicationMaster host: 10.200.48.154
ApplicationMaster RPC port: 0
queue: default
start time: 1500883129115
final status: UNDEFINED
user: rc
15:58:57,348  INFO YarnClientSchedulerBackend:58 - Application application_1500608983535_0025 has started running.
15:58:57,358  INFO Utils:58 - Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 47514.
15:58:57,358  INFO NettyBlockTransferService:58 - Server created on 47514
15:58:57,360  INFO BlockManagerMaster:58 - Trying to register BlockManager
15:58:57,363  INFO BlockManagerMasterEndpoint:58 - Registering block manager 10.200.48.112:47514 with 2.4 GB RAM, BlockManagerId(driver, 10.200.48.112, 47514)
15:58:57,366  INFO BlockManagerMaster:58 - Registered BlockManager
15:58:57,585  INFO EventLoggingListener:58 - Logging events to hdfs:///spark-history/application_1500608983535_0025
15:59:07,177  WARN YarnSchedulerBackend$YarnSchedulerEndpoint:70 - Container marked as failed: container_e170_1500608983535_0025_01_000002 on host: dl-rc-optd-ambari-slave-v-test-1.host.dataengine.com. Exit status: 1. Diagnostics: Exception from container-launch.
Container id: container_e170_1500608983535_0025_01_000002
Exit code: 1
Stack trace: ExitCodeException exitCode=1:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:576)
at org.apache.hadoop.util.Shell.run(Shell.java:487)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:753)
at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.launchContainer(LinuxContainerExecutor.java:371)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:303)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

Shell output: main : command provided 1
main : run as user is rc
main : requested yarn user is rc


Container exited with a non-zero exit code 1
Display stack trace? [yN]
15:59:57,702  WARN TransportChannelHandler:79 - Exception in connection from 10.200.48.155/10.200.48.155:50921
java.io.IOException: Connection reset by peer
at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
at sun.nio.ch.IOUtil.read(IOUtil.java:192)
at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
at io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:313)
at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:881)
at io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:242)
at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:119)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
at java.lang.Thread.run(Thread.java:748)
15:59:57,704 ERROR TransportResponseHandler:132 - Still have 1 requests outstanding when connection from 10.200.48.155/10.200.48.155:50921 is closed
15:59:57,706  WARN NettyRpcEndpointRef:91 - Error sending message [message = RequestExecutors(0,0,Map())] in 1 attempts
java.io.IOException: Connection reset by peer
at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
at sun.nio.ch.IOUtil.read(IOUtil.java:192)
at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
at io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:313)
at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:881)
at io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:242)
at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:119)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
at java.lang.Thread.run(Thread.java:748)

I am confused about that. Could you please help me?



On Thursday, July 6, 2017 at 4:15:37 PM UTC+8, HadoopMarc wrote:

Readers wanting to run OLAP queries on a real spark-yarn cluster might want to check my recent post:

http://yaaics.blogspot.nl/2017/07/configuring-janusgraph-for-spark-yarn.html

Regards,  Marc