Hi Marc,
Posting this here as you requested :)
Thanks so much! I did indeed follow "the powers of ten", and made the loader even simpler: it does not check whether a vertex already exists, because I deduplicated the input beforehand. Here is the code; it just reads the file and calls addVertex row by row:
def loadTestSchema(graph) {
    g = graph.traversal()
    t = System.currentTimeMillis()
    // parse each line explicitly so the stored value matches the Long type of 'uid'
    new File("/home/dev/wanmeng/adjlist/vertices1000000.txt").eachLine { l ->
        graph.addVertex(label, "userId", "uid", Long.parseLong(l.trim()))
    }
    graph.tx().commit()
    u = System.currentTimeMillis() - t
    print u / 1000 + " seconds \n"
    g = graph.traversal()
    // sanity check: look one vertex up through the index
    g.V().has('uid', 1L)
}
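I call it from the Gremlin Console like this (the path to the properties file is as on my machine):

gremlin> graph = JanusGraphFactory.open('conf/janusgraph-hbase-es.properties')
gremlin> defineTestSchema(graph)
gremlin> loadTestSchema(graph)

One thing I have been wondering about (an untested sketch; the batch size of 10000 is just a guess): would committing in smaller batches, instead of one huge transaction at the end, keep the heap from filling up?

def loadTestSchemaBatched(graph) {
    t = System.currentTimeMillis()
    count = 0
    new File("/home/dev/wanmeng/adjlist/vertices1000000.txt").eachLine { l ->
        graph.addVertex(label, "userId", "uid", Long.parseLong(l.trim()))
        // commit every 10k vertices so a single transaction never holds millions of vertices in memory
        if (++count % 10000 == 0) graph.tx().commit()
    }
    graph.tx().commit()
    u = System.currentTimeMillis() - t
    print u / 1000 + " seconds \n"
}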
The schema is as follows:
def defineTestSchema(graph) {
    mgmt = graph.openManagement()
    g = graph.traversal()
    // vertex labels
    userId = mgmt.makeVertexLabel("userId").make()
    // edge labels
    relatedby = mgmt.makeEdgeLabel("relatedby").make()
    // vertex and edge properties
    uid = mgmt.makePropertyKey("uid").dataType(Long.class).cardinality(Cardinality.SET).make()
    // global indices
    //mgmt.buildIndex("byuid", Vertex.class).addKey(uid).indexOnly(userId).buildCompositeIndex()
    mgmt.buildIndex("byuid", Vertex.class).addKey(uid).buildCompositeIndex()
    mgmt.commit()
    //mgmt = graph.openManagement()
    //mgmt.updateIndex(mgmt.getGraphIndex('byuid'), SchemaAction.REINDEX).get()
    //mgmt.commit()
}
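After defining the schema I also wait for the index to become usable before loading (ManagementSystem.awaitGraphIndexStatus is the JanusGraph API for this; I am not sure the wait is strictly needed when the index is created before any data exists):

import org.janusgraph.core.schema.ManagementSystem
// block until the 'byuid' composite index reports a stable status (e.g. ENABLED)
ManagementSystem.awaitGraphIndexStatus(graph, 'byuid').call()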
The configuration file (janusgraph-hbase-es.properties) is:
gremlin.graph=org.janusgraph.core.JanusGraphFactory
storage.backend=hbase
storage.batch-loading=true
schema.default=none
storage.hostname=127.0.0.1
cache.db-cache = true
cache.db-cache-clean-wait = 20
cache.db-cache-time = 180000
cache.db-cache-size = 0.5
index.search.elasticsearch.interface=TRANSPORT_CLIENT
index.search.backend=elasticsearch
index.search.hostname=127.0.0.1
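One setting I have not touched is id block allocation. The ConsistentKeyIDAuthority warning shown below makes me think I should raise ids.block-size for bulk loads, perhaps something like this (the exact value is my guess):

# larger id blocks = fewer slow round trips to the id authority while loading
ids.block-size=1000000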
However, the loading time is still very long:
100: 0.026 seconds
10k: 49.001 seconds
100k: 35.827 seconds
1 million: 379.05 seconds
10 million: error
gremlin> loadTestSchema(graph)
15:59:27 WARN org.janusgraph.diskstorage.idmanagement.ConsistentKeyIDAuthority - Temporary storage exception while acquiring id block - retrying in PT0.6S: org.janusgraph.diskstorage.TemporaryBackendException: Wrote claim for id block [2880001, 2960001) in PT2.213S => too slow, threshold is: PT0.3S
GC overhead limit exceeded
Type ':help' or ':h' for help.
Display stack trace? [yN]y
java.lang.OutOfMemoryError: GC overhead limit exceeded
What I am wondering is:
1) Why does bulk loading seem not to be working, even though I have already set storage.batch-loading=true? What else should I set to make bulk loading take effect? Do I need to drop the index to speed up bulk loading?
2) How do I get past the "GC overhead limit exceeded" error? (Would committing in smaller batches, like the variant sketched above, help?)
3) In parallel, I am trying Kryo + the BulkLoaderVertexProgram to load the data, and the last step failed:
gremlin> graph.compute(SparkGraphComputer).program(blvp).submit().get()
No signature of method: org.apache.tinkerpop.gremlin.spark.process.computer.SparkGraphComputer.program() is applicable for argument types: (org.apache.tinkerpop.gremlin.process.computer.bulkloading.BulkLoaderVertexProgram$Builder) values: [org.apache.tinkerpop.gremlin.process.computer.bulkloading.BulkLoaderVertexProgram$Builder@6bb4cc0e]
Possible solutions: program(org.apache.tinkerpop.gremlin.process.computer.VertexProgram), profile(java.util.concurrent.Callable)
Do I need to install TinkerPop 3 separately, besides JanusGraph, to be able to run graph.compute(SparkGraphComputer).program(blvp).submit().get()?
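Or is the problem simply that I am passing the Builder where a built VertexProgram is expected? The error message hints at program(VertexProgram), so maybe I just need to call create() on the builder first, e.g. (blvp is my BulkLoaderVertexProgram.build() builder):

// build the VertexProgram from its builder before handing it to the computer
graph.compute(SparkGraphComputer).program(blvp.create(graph)).submit().get()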
Many thanks!
Eliz