- Did you increase the block size? It should be larger than the number of vertices you want to insert in the bulk load. (ids.block-size * ids.renew-percentage should be larger than the number of vertices inserted during bulk loading, to prevent any complex ID block generation.)
- We also have memory issues with CQL. Does your GC run cleanups often?
- Without bulk loading, it will take longer because of many additional checks.
An extra question: do you store the data for a long time, or only for a short analysis?
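The rule of thumb above (ids.block-size * ids.renew-percentage larger than the number of vertices) can be turned into a quick calculation. The following is only a minimal sketch: the class and method names are hypothetical, and the vertex count and the renew percentage of 0.3 are assumed example values, so check them against your own configuration.

```java
// Hypothetical helper, not part of JanusGraph: estimates a block size that
// satisfies blockSize * renewPercentage >= vertexCount.
final class BlockSizeEstimate {

    /** Returns the smallest block size with blockSize * renewPercentage >= vertexCount. */
    static long minBlockSize(long vertexCount, double renewPercentage) {
        if (vertexCount < 0 || renewPercentage <= 0.0 || renewPercentage > 1.0) {
            throw new IllegalArgumentException("vertexCount must be >= 0 and 0 < renewPercentage <= 1");
        }
        return (long) Math.ceil(vertexCount / renewPercentage);
    }

    public static void main(String[] args) {
        long vertices = 2_000_000L;    // assumed size of the bulk load
        double renewPercentage = 0.3;  // assumed value of ids.renew-percentage
        System.out.println("ids.block-size should be at least "
                + minBlockSize(vertices, renewPercentage));
    }
}
```

Choosing a value somewhat above the computed minimum leaves headroom if the vertex count is only an estimate.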
On Thursday, 10 October 2019 17:38:38 UTC+2, Lilly wrote:
I have now experimented with many kinds of settings for the CQL connection and timed how long each run took. My observations are the following:
- Embedded with bulk loading took 16 min.
- CQL without bulk loading is extremely slow (> 2 h).
- CQL with bulk loading (same settings as for embedded for the parameters storage.batch.loading, ids.block.size, ids.renew.timeout, cache.db-cache, cache.db-cache-clean-wait, cache.db-cache-time, cache.db-cache-size) took 27 min and used a considerable amount of my RAM (which was not the case in embedded mode).
- CQL as above, but additionally with storage.cql.batch-statement-size = 500 and storage.batch-loading = true, took 24 min and used not quite as much RAM.
I honestly do not know what else might be the issue.
On Wednesday, 9 October 2019 08:17:13 UTC+2, fa...@... wrote:
Regarding "violation of unique key": it could be that CQL checks IDs for uniqueness (JanusGraph could run out of IDs in batch loading mode), but I'm not sure what the embedded backend is doing.
On Tuesday, 8 October 2019 17:50:23 UTC+2, Lilly wrote:
Hi Jan,
So I tried it again. First of all, I remembered that for CQL I need to commit after each step. Otherwise I get "violation of unique key" errors, even though I am not actually violating any unique key. Is this supposed to be the case (having to commit each time)? Now that I commit after each function call, I found that with the adapted properties configuration (see my last reply) it is really very slow. If I use the "default" configuration for CQL, it is a bit faster but still much slower than in the embedded case. I also tried it with another graph, which I persisted like this:
public void persist(Map<Integer, Map<String,Object>> nodes, Map<Integer,Integer> edges,
        Map<Integer,Map<String,String>> names) {
    g = graph.traversal();

    int counter = 0;
    for(Map.Entry<Integer, Map<String,Object>> e: nodes.entrySet()) {
        Vertex v = g.addV().property("taxId",e.getKey()).
            property("rank",e.getValue().get("rank")).
            property("divId",e.getValue().get("divId")).
            property("genId",e.getValue().get("genId")).next();
        g.tx().commit();
        Map<String,String> n = names.get(e.getKey());
        if(n != null) {
            for(Map.Entry<String,String> vals: n.entrySet()) {
                g.V(v).property(vals.getKey(),vals.getValue()).iterate();
                g.tx().commit();
            }
        }
        if(counter % BULK_CHOP_SIZE == 0) {
            System.out.println(counter);
        }
        counter++;
    }

    counter = 0;
    for(Map.Entry<Integer,Integer> e: edges.entrySet()) {
        g.V().has("taxId",e.getKey()).as("v1").V().
            has("taxId",e.getValue()).as("v2").
            addE("has_parent").from("v1").to("v2").iterate();
        g.tx().commit();
        if(counter % BULK_CHOP_SIZE == 0) {
            System.out.println(counter);
        }
        counter++;
    }

    g.V().has("taxId",1).as("v").outE().filter(__.inV().where(P.eq("v"))).drop().iterate();
    g.tx().commit();
    System.out.println("Done with persistence");
}
I had the same problem in either case.
I am probably using the CQL backend wrong somehow and would appreciate any help on what else to do!
Thanks,
Lilly
On Tuesday, 8 October 2019 09:05:56 UTC+2, Lilly wrote:
Hi Jan,
OK, then I probably made a mistake somewhere. I assumed this slowdown was to be expected, which is why I did not check it more thoroughly.
Maybe the way I persist the data does not work well for CQL. I will try to create a test scenario where I do not have to persist all my data and see how it performs with CQL again.
In principle, what I do is call this function:
public void updateEdges(String kmer, int pos, boolean strand, int record, List<SequenceParser.Feature> features){
    if(features == null) {
        features = Arrays.asList();
    }
    g.withSideEffect("features",features)
        .V().has("prefix", kmer.substring(0,kmer.length()-1)).fold().coalesce(__.unfold(),
            __.addV("prefix_node").property("prefix",kmer.substring(0,kmer.length()-1)) ).as("v1").
        coalesce(__.V().has("prefix", kmer.substring(1,kmer.length())),
            __.addV("prefix_node").property("prefix",kmer.substring(1,kmer.length())) ).as("v2").
        sideEffect(__.choose(__.select("features").unfold().count().is(P.eq(0)),
            __.addE("suffix_edge").property("record",record).
                property("strand",strand).property("pos",pos).from("v1").to("v2")).
            select("features").unfold().
            addE("suffix_edge").property("record",record).property("strand",strand).property("pos",pos)
                .property(__.map(t -> ((SequenceParser.Feature)t.get()).category),
                    __.map(t -> ((SequenceParser.Feature)t.get()).feature)).from("v1").to("v2")).
        iterate();
}
Roughly every 50000 calls I do a commit. As a side remark, all of the above properties are indexed, and Feature is a simple class with two attributes, category and feature.
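The "commit roughly every 50000 calls" pattern described above can be factored into a small counter so that the last partial batch is not forgotten. This is only a sketch under assumptions: BatchedCommitter is a hypothetical helper, not a JanusGraph or TinkerPop class, and the commit action is passed in as a Runnable (for example () -> g.tx().commit()).

```java
// Hypothetical helper: runs a commit action once every batchSize operations.
final class BatchedCommitter {
    private final int batchSize;
    private final Runnable commit;
    private int pending = 0;  // operations since the last commit
    private int commits = 0;  // number of commits performed so far

    BatchedCommitter(int batchSize, Runnable commit) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        this.batchSize = batchSize;
        this.commit = commit;
    }

    /** Call once after each write; commits when the batch is full. */
    void operationDone() {
        pending++;
        if (pending >= batchSize) {
            flush();
        }
    }

    /** Commits any pending operations; call once at the end of the load. */
    void flush() {
        if (pending > 0) {
            commit.run();
            commits++;
            pending = 0;
        }
    }

    int commitCount() {
        return commits;
    }
}
```

In the loading loop one would call operationDone() after each updateEdges(...) call and flush() once at the end. Whether larger batches are safe with the CQL backend is exactly the open question in this thread, so the batch size is something to experiment with.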
Also I adapted the configuration file in the following way:
storage.batch-loading = true
ids.block-size = 100000
ids.authority.wait-time = 2000 ms
ids.renew-timeout = 1000000 ms
I tried the same with cql and embedded.
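To keep the embedded and CQL runs comparable, the shared settings could be built in one place and written to each backend's properties file. The sketch below uses only java.util.Properties and copies the keys and values from the configuration above; the class name BulkLoadSettings is hypothetical, and backend-specific keys (storage backend, hostname, and so on) are deliberately left out.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.util.Properties;

// Hypothetical helper: collects the bulk-loading settings quoted in this thread.
final class BulkLoadSettings {

    /** Returns the settings shared by the embedded and the CQL test runs. */
    static Properties bulkLoadProperties() {
        Properties p = new Properties();
        p.setProperty("storage.batch-loading", "true");
        p.setProperty("ids.block-size", "100000");
        p.setProperty("ids.authority.wait-time", "2000 ms");
        p.setProperty("ids.renew-timeout", "1000000 ms");
        return p;
    }

    public static void main(String[] args) throws IOException {
        // Render in .properties syntax; a real run would write this to a file.
        StringWriter out = new StringWriter();
        bulkLoadProperties().store(out, "shared bulk-loading settings");
        System.out.println(out);
    }
}
```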
I will get back to you once I have tested it once again. But maybe you already spot an issue? Thanks
Lilly
On Monday, 7 October 2019 20:14:29 UTC+2, fa...@... wrote:
We don't see this problem with persistence.
It would be good to know which part takes longer. Would you like to give some more information?
Jan
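One way to find out which part takes longer is to time the phases of the load separately (for example vertex inserts, edge inserts, commits). The following is a minimal sketch using System.nanoTime; PhaseTimer and the phase names are hypothetical and not part of any library.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: accumulates wall-clock time per named phase.
final class PhaseTimer {
    private final Map<String, Long> nanosByPhase = new LinkedHashMap<>();

    /** Runs the work and adds its elapsed time to the given phase. */
    void time(String phase, Runnable work) {
        long start = System.nanoTime();
        try {
            work.run();
        } finally {
            nanosByPhase.merge(phase, System.nanoTime() - start, Long::sum);
        }
    }

    /** Total nanoseconds per phase, in the order phases were first seen. */
    Map<String, Long> totals() {
        return nanosByPhase;
    }

    /** Prints one line per phase with the accumulated time in milliseconds. */
    void report() {
        nanosByPhase.forEach((phase, nanos) ->
                System.out.println(phase + ": " + (nanos / 1_000_000) + " ms"));
    }
}
```

For example, wrapping each traversal in timer.time("addVertex", ...) and each commit in timer.time("commit", ...) and calling report() at the end would show whether the time goes into the traversals or into the commits.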