Re: bulk loading error


Ted Wilmes <twi...@...>
 

Hi Eliz,
For your first code snippet, you'll need to add a periodic commit every X number of vertices instead of committing only after you've loaded the whole file. That X will vary depending on your hardware, etc., but you can experiment to find what gives you the best performance. I'd suggest starting at 100 and going from there. Once you get that working, you could try loading the data in parallel by spinning up multiple threads that are each addV'ing and periodically committing.
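To make that concrete, here is a rough sketch of a loader with periodic commits (the function name, file path, and batch size are placeholders based on your snippet, so treat this as a starting point rather than a drop-in fix):

def loadWithPeriodicCommit(graph, String path, int batchSize = 100) {
    // hypothetical helper: commit every batchSize vertices instead of once at the end
    int count = 0
    new File(path).eachLine { l ->
        graph.addVertex(label, "userId", "uid", l)   // same addVertex call as in your snippet
        if (++count % batchSize == 0) {
            graph.tx().commit()                      // flush this batch to the backend
        }
    }
    graph.tx().commit()                              // commit any leftover vertices
}

Keeping each transaction small like this also bounds how much uncommitted state is held in memory, which should help with the GC overhead error you hit at 10 million vertices.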

For the second approach, using the TinkerPop BulkLoaderVertexProgram, you do not need to download TinkerPop separately. From looking at your stack trace, I think you may just be missing one step when constructing the vertex program. Did you call create() at the end of its construction, like in this little snippet?

blvp = BulkLoaderVertexProgram.build().
        bulkLoader(OneTimeBulkLoader).
        writeGraph(writeGraphConf).
        create(modern)

create() takes the input graph that you're reading from as an argument. Without that final call, blvp is still the Builder object, which is why program() complains that there is no matching signature for BulkLoaderVertexProgram$Builder.
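In case it helps, the end-to-end sequence could look roughly like this (just a sketch: readGraph and the Hadoop/Kryo properties path are placeholder names, and writeGraphConf stands for the path to the JanusGraph properties file you posted):

// open the input graph that holds your Kryo data (placeholder path)
readGraph = GraphFactory.open('conf/hadoop-graph/read-kryo.properties')

blvp = BulkLoaderVertexProgram.build().
        bulkLoader(OneTimeBulkLoader).
        writeGraph(writeGraphConf).
        create(readGraph)            // create() turns the builder into the actual VertexProgram

readGraph.compute(SparkGraphComputer).program(blvp).submit().get()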

--Ted

On Sunday, June 25, 2017 at 8:48:57 PM UTC-5, Elizabeth wrote:
Hi Marc,

This is in response to your request to post here :)

Thanks so much! I indeed followed "the powers of ten" and made the load even simpler -- it does not check whether a vertex already exists, since I have done that beforehand. Here is the code; it just reads each line and adds a vertex row by row:

def loadTestSchema(graph) {
    g = graph.traversal()

    t = System.currentTimeMillis()
    new File("/home/dev/wanmeng/adjlist/vertices1000000.txt").eachLine { l ->
        p = l
        graph.addVertex(label, "userId", "uid", p)
    }
    graph.tx().commit()

    u = System.currentTimeMillis() - t
    print u/1000 + " seconds \n"
    g = graph.traversal()
    g.V().has('uid', 1)
}

The schema is as follows:
def defineTestSchema(graph) {
    mgmt = graph.openManagement()
    g = graph.traversal()
    // vertex labels
    userId = mgmt.makeVertexLabel("userId").make()
    // edge labels
    relatedby = mgmt.makeEdgeLabel("relatedby").make()
    // vertex and edge properties
    uid = mgmt.makePropertyKey("uid").dataType(Long.class).cardinality(Cardinality.SET).make()
    // global indices
    //mgmt.buildIndex("byuid", Vertex.class).addKey(uid).indexOnly(userId).buildCompositeIndex()
    mgmt.buildIndex("byuid", Vertex.class).addKey(uid).buildCompositeIndex()
    mgmt.commit()

    //mgmt = graph.openManagement()
    //mgmt.updateIndex(mgmt.getGraphIndex('byuid'), SchemaAction.REINDEX).get()
    //mgmt.commit()
}

The configuration file (janusgraph-hbase-es.properties) is:

gremlin.graph=org.janusgraph.core.JanusGraphFactory
storage.backend=hbase
storage.batch-loading=true
schema.default=none
storage.hostname=127.0.0.1
cache.db-cache = true
cache.db-cache-clean-wait = 20
cache.db-cache-time = 180000
cache.db-cache-size = 0.5

index.search.elasticsearch.interface=TRANSPORT_CLIENT
index.search.backend=elasticsearch
index.search.hostname=127.0.0.1

However, the loading time is still very long.

100:        0.026 seconds
10k:        49.001 seconds
100k:       35.827 seconds
1 million:  379.05 seconds
10 million: error
gremlin> loadTestSchema(graph)
15:59:27 WARN  org.janusgraph.diskstorage.idmanagement.ConsistentKeyIDAuthority  - Temporary storage exception while acquiring id block - retrying in PT0.6S: org.janusgraph.diskstorage.TemporaryBackendException: Wrote claim for id block [2880001, 2960001) in PT2.213S => too slow, threshold is: PT0.3S
GC overhead limit exceeded
Type ':help' or ':h' for help.
Display stack trace? [yN]y
java.lang.OutOfMemoryError: GC overhead limit exceeded

What I am wondering is:
1) Why does bulk loading not seem to work, even though I have already set storage.batch-loading=true? What else should I set to make bulk loading take effect? Do I need to drop the index in order to speed up bulk loading?
2) How do I fix the "GC overhead limit exceeded" error?

3) At the same time, I am using Kryo + BulkLoaderVertexProgram to load the data, but the last step failed:

gremlin> graph.compute(SparkGraphComputer).program(blvp).submit().get()
No signature of method: org.apache.tinkerpop.gremlin.spark.process.computer.SparkGraphComputer.program() is applicable for argument types: (org.apache.tinkerpop.gremlin.process.computer.bulkloading.BulkLoaderVertexProgram$Builder) values: [org.apache.tinkerpop.gremlin.process.computer.bulkloading.BulkLoaderVertexProgram$Builder@6bb4cc0e]
Possible solutions: program(org.apache.tinkerpop.gremlin.process.computer.VertexProgram), profile(java.util.concurrent.Callable)

Do I need to install TinkerPop 3 in addition to JanusGraph in order to use graph.compute(SparkGraphComputer).program(blvp).submit().get()?

Many thanks!

Eliz
