Re: bulk loading error


HadoopMarc <m.c.d...@...>
 

And this was the answer that Eliz referred to above:

Hi Eliz,

Good to hear that you are making progress. I do not see this post on the gremlin users list. Would you be so kind as to post it there? I'll then add the answers below.

As to your questions:

  • id block reservation during bulkload is described in section 20.1.2 of:
    http://docs.janusgraph.org/latest/bulk-loading.html

  • Fighting GC/OOM: give the Gremlin Console's JVM more memory in its startup script (the java -Xmx command-line option). Another possibility is to limit each transaction to, say, 100,000 vertices, i.e. commit more often.

  • OLAP: the exception does not look familiar to me. Maybe the JG code example refers to an older TP version.
    It could therefore help to compare it with the BulkLoaderVertexProgram (blvp) example in TP (TP runs all code examples during ref doc generation!):
    http://tinkerpop.apache.org/docs/3.2.3/reference/#sparkgraphcomputer
    As far as I know the dependencies in the JG distribution are complete and do not need a TP install.
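
To make the ID block point above more concrete, here is a hedged sketch of the kind of settings the JanusGraph bulk-loading documentation discusses. The option names (ids.block-size, ids.authority.wait-time, ids.renew-timeout) are JanusGraph configuration options; the values are illustrative assumptions only and need tuning for your cluster. Some of these are global options, so check the configuration reference for how each one may be changed on an existing graph.

```properties
# Illustrative values only (assumptions, not recommendations)
storage.batch-loading=true
# Reserve larger ID blocks so that blocks are claimed less often during the load
ids.block-size=1000000
# Allow more time (in ms) for an ID block claim before it is rejected as "too slow"
ids.authority.wait-time=1000
# Allow more time (in ms) for renewing an ID block before giving up
ids.renew-timeout=600000
```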

HTH,
Marc



On Monday, 26 June 2017 at 15:30:03 UTC+2, Ted Wilmes wrote:

Hi Eliz,
For your first code snippet, you'll need to add in a periodic commit every X number of vertices instead of after you've loaded the whole file. That X will vary depending on your hardware, etc. but you can experiment and find what gives you the best performance. I'd suggest starting at 100 and going from there. Once you get that working, you could try loading data in parallel by spinning up multiple threads that are addV'ing and periodically committing.
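
A minimal sketch of such a periodic commit, assuming a hypothetical helper named loadInBatches (not part of JanusGraph or TinkerPop). It takes the per-line action and the commit action as closures, so the batching logic can be tried without a running graph; the batch size is something to experiment with.

```groovy
// Hypothetical helper: calls addOne for every line of the file and
// calls commit after every batchSize lines, plus once for any remainder.
def loadInBatches(File file, int batchSize, Closure addOne, Closure commit) {
    long count = 0
    file.eachLine { line ->
        addOne(line)
        count++
        if (count % batchSize == 0) {
            commit()          // periodic commit keeps each transaction small
        }
    }
    if (count % batchSize != 0) {
        commit()              // flush the final partial batch
    }
    return count
}

// Possible use in the Gremlin Console (assumes an open `graph` and the
// schema from the original post; `label` is the console's T.label import):
// loadInBatches(new File("/home/dev/wanmeng/adjlist/vertices1000000.txt"), 100,
//               { l -> graph.addVertex(label, "userId", "uid", l) },
//               { graph.tx().commit() })
```

Passing the two actions as closures is only a design choice for this sketch; inlining the counter and the commit into the eachLine closure of the original snippet would work the same way.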

For the second approach, using the TinkerPop BulkLoaderVertexProgram, you do not need to download TP separately. I think from looking at your stacktrace, you may just be missing a bit when you constructed the vertex program. Did you call create at the end of its construction like in this little snippet?

blvp = BulkLoaderVertexProgram.build().
                    bulkLoader(OneTimeBulkLoader).
                    writeGraph(writeGraphConf).create(modern)

Create takes the input graph that you're reading from as an argument.

--Ted

On Sunday, June 25, 2017 at 8:48:57 PM UTC-5, Elizabeth wrote:
Hi Marc,

I am posting this here as you requested :)

Thanks so much! I did indeed follow "the powers of ten", and made the load even simpler: it does not check whether the vertex already exists, since I have done that beforehand. Here is the code, which just reads the file and calls addVertex line by line:

 def loadTestSchema(graph)  {
    g = graph.traversal()

    t=System.currentTimeMillis()
    new File("/home/dev/wanmeng/adjlist/vertices1000000.txt").eachLine{l-> p=l; graph.addVertex(label,"userId","uid", p);  }
    graph.tx().commit()

    u = System.currentTimeMillis()-t
    print u/1000+" seconds \n"
    g = graph.traversal()
    g.V().has('uid', 1)

}

The schema is as follows:
def defineTestSchema(graph) {
    mgmt = graph.openManagement()
    g = graph.traversal()
    // vertex labels
    userId= mgmt.makeVertexLabel("userId").make()
    // edge labels
    relatedby = mgmt.makeEdgeLabel("relatedby").make()
    // vertex and edge properties
    uid = mgmt.makePropertyKey("uid").dataType(Long.class).cardinality(Cardinality.SET).make()
    // global indices
    //mgmt.buildIndex("byuid", Vertex.class).addKey(uid).indexOnly(userId).buildCompositeIndex()
    mgmt.buildIndex("byuid", Vertex.class).addKey(uid).buildCompositeIndex()
    mgmt.commit()

    //mgmt = graph.openManagement()
    //mgmt.updateIndex(mgmt.getGraphIndex('byuid'), SchemaAction.REINDEX).get()
    //mgmt.commit()
}

The configuration file is janusgraph-hbase-es.properties:

gremlin.graph=org.janusgraph.core.JanusGraphFactory
storage.backend=hbase
storage.batch-loading=true
schema.default=none
storage.hostname=127.0.0.1
cache.db-cache = true
cache.db-cache-clean-wait = 20
cache.db-cache-time = 180000
cache.db-cache-size = 0.5

index.search.elasticsearch.interface=TRANSPORT_CLIENT
index.search.backend=elasticsearch
index.search.hostname=127.0.0.1

However, the loading time is still very long.

vertices      time
100           0.026 seconds
10k           49.001 seconds
100k          35.827 seconds
1 million     379.05 seconds
10 million    error:
gremlin> loadTestSchema(graph)
15:59:27 WARN  org.janusgraph.diskstorage.idmanagement.ConsistentKeyIDAuthority  - Temporary storage exception while acquiring id block - retrying in PT0.6S: org.janusgraph.diskstorage.TemporaryBackendException: Wrote claim for id block [2880001, 2960001) in PT2.213S => too slow, threshold is: PT0.3S
GC overhead limit exceeded
Type ':help' or ':h' for help.
Display stack trace? [yN]y
java.lang.OutOfMemoryError: GC overhead limit exceeded

What I am wondering is:
1) Why does bulk loading not seem to work, even though I have already set storage.batch-loading=true? What else should I set to make bulk loading take effect? Do I need to drop the index in order to speed up bulk loading?
2) How can I solve the "GC overhead limit exceeded" error?

3) At the same time, I am using Kryo + BulkLoaderVertexProgram to load the data, but the last step failed:

gremlin> graph.compute(SparkGraphComputer).program(blvp).submit().get()
No signature of method: org.apache.tinkerpop.gremlin.spark.process.computer.SparkGraphComputer.program() is applicable for argument types: (org.apache.tinkerpop.gremlin.process.computer.bulkloading.BulkLoaderVertexProgram$Builder) values: [org.apache.tinkerpop.gremlin.process.computer.bulkloading.BulkLoaderVertexProgram$Builder@6bb4cc0e]
Possible solutions: program(org.apache.tinkerpop.gremlin.process.computer.VertexProgram), profile(java.util.concurrent.Callable)

Do I need to install TinkerPop 3 in addition to JanusGraph to use graph.compute(SparkGraphComputer).program(blvp).submit().get()?

Many thanks!

Eliz
