Re: Property with multiple data types
Laura Morales <laur...@...>
Thank you for your further comments.
I'm still a bit confused though. Janus advertises itself as a database for huge graphs. But I'm asking myself whether that means huge "homogeneous" graphs (i.e. a simple schema but a lot of nodes/edges) or huge "heterogeneous" graphs (i.e. lots of nodes/edges but also a complex schema). With really big graphs I think it's reasonable to assume that different nodes will want to use the same property key but with different data types. Labels however don't really help with this use case, because there can only be 1 per node, and maybe some nodes want to use both types.
How would you design the schema of a big "heterogeneous" graph? Like a schema that can describe 100s of different domains in a single graph. Is Object.class the only way?

Sent: Monday, December 14, 2020 at 2:29 PM
From: "HadoopMarc" <bi...@...>
To: "JanusGraph users" <janusgra...@...>
Subject: Re: Property with multiple data types

Hi Laura,
Things are a bit different than you ask:
- a vertex has a single label only
- a property key has a single datatype only, but it can be Object.class, see https://docs.janusgraph.org/basics/schema/#property-key-data-type
- indices can have a label constraint, but these are not helpful if you want to mix the datatypes in a property for the same vertex
I cannot predict well how the various janusgraph parts will behave when mixing up real integers and "string-integers" in a property key of the Object.class datatype. I guess that the gremlin traversals will have problems, while an indexing backend for MixedIndices probably can deal with it. The ref docs definitely advise to use the basic datatypes and spare yourself future headaches (so, unify the datatypes on ingestion).
Best wishes, Marc
|
|
Re: Property with multiple data types
HadoopMarc <bi...@...>
Hi Laura, Things are a bit different than you ask:
Best wishes, Marc
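The advice in this thread to unify the datatypes on ingestion could look roughly like the sketch below. It is plain Python and makes no JanusGraph calls; the function name `normalize_age` and the fallback key `age_text` are invented for illustration and are not part of any JanusGraph API.

```python
def normalize_age(raw):
    """Return (property_key, value) for a raw 'age' input.

    Integers and integer-like strings become an int under 'age'.
    Anything else is kept as text under the hypothetical key 'age_text',
    so that each property key ends up with exactly one datatype.
    """
    if isinstance(raw, bool):
        # bool is a subclass of int in Python; do not treat it as an age
        return ("age_text", str(raw))
    if isinstance(raw, int):
        return ("age", raw)
    text = str(raw).strip()
    try:
        return ("age", int(text))
    except ValueError:
        return ("age_text", text)


records = [23, "23", " 41 ", "twenty-seven"]
normalized = [normalize_age(r) for r in records]
```

With this kind of preprocessing, "age" can be declared with a basic Integer datatype, and the rare free-text values live under a separate key instead of forcing Object.class.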
|
|
Re: Is there a standard, human-friendly, serialization format?
Evgeniy Ignatiev <yevgeniy...@...>
Hi Laura,
Many people use CSV for data and JSON for schemas (in other graph databases too). Not sure what will be the most "standard" or "efficient" approach, but CSV/JSON seems to be most common. For example:
https://github.com/IBM/janusgraph-utils
https://github.com/dengziming/janusgraph-util
Best regards, Evgenii Ignatev.
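To make the CSV-for-data, JSON-for-schema idea concrete, here is a minimal sketch in plain Python. The column names, the schema layout and the `load_vertices` helper are all assumptions made up for this example; they are not the formats used by the two linked projects, and actually writing the result into JanusGraph would be a separate step.

```python
import csv
import io
import json

# Hypothetical schema file content: property key -> type name.
SCHEMA_JSON = '{"name": "string", "age": "integer"}'
# Hypothetical vertex file content: one row per vertex.
VERTICES_CSV = "id,label,name,age\n1,person,Alice,23\n2,person,Bob,27\n"

# Type names allowed in the hypothetical schema and how to parse them.
CASTS = {"string": str, "integer": int, "float": float}


def load_vertices(csv_text, schema_text):
    """Parse vertex rows and cast each property as the schema declares."""
    schema = json.loads(schema_text)
    vertices = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        props = {key: CASTS[type_name](row[key])
                 for key, type_name in schema.items()}
        vertices.append({"id": row["id"], "label": row["label"],
                         "properties": props})
    return vertices


vertices = load_vertices(VERTICES_CSV, SCHEMA_JSON)
```

Both files stay easy to edit by hand, and a bad value (say, a non-numeric age) fails at parse time rather than after it has reached the database.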
|
|
Is there a standard, human-friendly, serialization format?
Laura Morales <laur...@...>
All the examples that I see in the Janus documentation seem to use Groovy. Instructions such as JanusGraphFactory.open(), graph.openManagement(), mgmt.makeEdgeLabel() etc.
Is there any human-friendly plaintext format that I can use to write my graph with, and then load into Janus? In practical terms what I would like to do is this:
1. Write my graph in a text file, all nodes and edges, and the schema too. So the format should be human-friendly and easy to edit manually. Hopefully not XML.
2. Load this graph into Janus by asking Janus to read my graph file. Not in-memory though, I mean to create a new persistent database that is always there when Janus starts.
|
|
Re: Property with multiple data types
Laura Morales <laur...@...>
Thank you Marc, I think it does indeed! If I understand correctly, I can use labels to "namespace" my nodes, or in other words as a way to identify subgraphs.
If I have a node with 2 labels instead, say label1 and label2, I can create 2 indices for the same node, right? That is an index for label1.age (Integer) and an index for label2.age (String), both indices containing the same node. In this scenario I should be allowed to add 2 types of properties to the same node, one containing an Integer and the other one containing a String. Then query by choosing a specific label. Does this work? Can I do something like this?
|
|
Re: Property with multiple data types
HadoopMarc <bi...@...>
Hi Laura,
Good that you pay close attention to understanding indices in JanusGraph, because they are essential to proper use. Does the following section of the ref docs answer your question?
https://docs.janusgraph.org/index-management/index-performance/#label-constraint
Best wishes, Marc
|
|
Re: Centric Indexes failing to support all conditions for better performance.
chrism <cmil...@...>
Thank you Boxuan Li,
It is obvious that you are an expert. Is there any other way, apart from isFitted=true, to know whether an index is used or not? (It may even be debugging the JanusGraph server or Cassandra.)
We need to construct Gremlin queries that utilize these indexes in full, and always. The problem is just what to type, as our implementation requires more complicated conditions to match than the ones above. Using the above as a sample, it would be:
    (rating >= value AND time < value) OR HasNot( time )
where HasNot( time ) means that "time" was not specified.
What is visible from profile() is that we cannot use coalesce() or or() steps, and trying all kinds of workarounds cannot be verified easily when all we have is isFitted=false and no other "good" indication of using indexes.
Cheers, Christopher
|
|
Property with multiple data types
Laura Morales <laur...@...>
I'm new to Janus and LPGs. I have a question after reading the Janus documentation. As far as I understand, edge labels as well as properties (for both nodes and edges) are indexed globally. What happens when I have a sufficiently large graph, such that completely unrelated and separate nodes want to use a property with the same name but holding different data types? For example, a property called "age" could be used by some nodes with an Integer type (e.g. "age": 23), but other nodes on the far side of my big graph might want/need/require to use a String type (e.g. "age": "twenty-seven"). Is this configuration possible with Janus? Or do I *have to* use two different names such as age_int and age_string?
|
|
Re: Centric Indexes failing to support all conditions for better performance.
BO XUAN LI <libo...@...>
Hi Christopher,
isFitted = true basically means no in-memory filtering is needed.
If you see isFitted = false, it does not necessarily mean vertex-centric indexes are not used. It could be the case that some vertex-centric index is used, but further in-memory filtering is still needed. Conversely, if you see isFitted = true, it does not necessarily mean any index is used. It could be the case that you are fetching all edges of a given vertex.
I totally understand your confusion because the documentation does not explain how the vertex-centric index is built. In JanusGraph, vertices and edges are stored in the "edgestore" store, while composite indexes are stored in the "graphindex" store. Mixed indexes are stored in an external index store like Elasticsearch. This might be a bit counter-intuitive, but vertex-centric indexes are stored in the "edgestore" store. Recall how edges are stored (https://docs.janusgraph.org/advanced-topics/data-model/#individual-edge-layout).
Roughly speaking, if you don't have any vertex-centric index, then your edge is stored once for one endpoint. If you have one vertex-centric index, then applicable edges are stored twice. If you have two vertex-centric indexes, then applicable edges are stored three times, and so on. These edges, although seemingly duplicates, have different "sort keys" which conform to the corresponding vertex-centric indexes. Let's say you have built a "battlesByRating" vertex-centric index based on the property "rating". Then, apart from the ordinary edge, JanusGraph creates an additional edge whose "sort key" is the rating value. Because the "column" is sorted in the underlying data storage (e.g. "column" in the JanusGraph model is mapped to "clustering column" in Cassandra), you essentially gain the ability to search an index by "rating" value/range.
What happens when your vertex-centric index has two properties like the following?
    mgmt.buildEdgeIndex(battled, 'battlesByRatingAndTime', Direction.OUT, Order.asc, rating, time)
Now your "sort key" is a combination of "rating" and "time" (note "rating" comes before "time"). Under this vertex-centric index, "sort keys" look like this:
    (rating=1, time=2), (rating=1, time=3), (rating=2, time=1), (rating=2, time=5), (rating=4, time=2), ...
This explains why isFitted = true when your query is has('rating', 5.0).has('time', inside(10, 50)) but not when your query is has('time', 5.0).has('rating', inside(10, 50)).
Again, note that isFitted = false does not necessarily mean your query is not optimized by a vertex-centric index. I think the profiler should be improved to state whether and which vertex-centric index is used.
I am not quite sure about the case b) you mentioned. It seems to be a design consideration, but right now I cannot tell why it is there.
"hasNot" almost never uses indexes because JanusGraph cannot index something that does not exist. (Note that a "null" value is not valid in JanusGraph.)
Hope this helps.
Best regards, Boxuan
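The effect of the (rating, time) ordering can be illustrated with a small, self-contained Python model. This is only a sketch of the sorted-key idea described above, not JanusGraph's actual storage code; the function names are invented, and the ranges are inclusive here whereas Gremlin's inside() is exclusive.

```python
from bisect import bisect_left, bisect_right

# Sort keys (rating, time) taken from the example above, kept in sorted order
# the way a storage backend keeps its columns sorted.
SORT_KEYS = sorted([(1, 2), (1, 3), (2, 1), (2, 5), (4, 2)])


def rating_equals_time_between(keys, rating, t_lo, t_hi):
    """Equality on the first component, inclusive range on the second.

    All matches are adjacent, so two binary searches give the exact answer
    and no further filtering is needed (the "fitted" case).
    """
    lo = bisect_left(keys, (rating, t_lo))
    hi = bisect_right(keys, (rating, t_hi))
    return keys[lo:hi]


def time_equals_rating_between(keys, time, r_lo, r_hi):
    """Equality on the second component, inclusive range on the first.

    Only the rating range can bound the slice; the time condition has to be
    checked entry by entry afterwards (the "not fitted" case).
    Returns (number of entries read, matching entries).
    """
    lo = bisect_left(keys, (r_lo,))
    hi = bisect_right(keys, (r_hi, float("inf")))
    candidates = keys[lo:hi]
    matches = [key for key in candidates if key[1] == time]
    return len(candidates), matches
```

In the second function the sorted order still narrows the read to the rating range, which mirrors the point that isFitted = false does not imply the index went unused.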
|
|
Re: Configuring Transaction Log feature
Sandeep Mishra <sandy...@...>
Pawan,
I was able to make your code work. The problem is "setStartTimeNow()". Instead, use setStartTime(Instant.now()) and test. It works. I have yet to explore the difference between the two APIs. Make sure to use a new log identifier to test.
Regards, Sandeep
|
|
Re: Error when running JanusGraph with YARN and CQL
Varun Ganesh <operatio...@...>
Thanks a lot for responding Marc.
Yes, I had initially tried setting spark.yarn.archive with the path to spark-gremlin.zip. However with this approach, the containers were failing with the message "Error: Could not find or load main class org.apache.spark.deploy.yarn.ExecutorLauncher". I'm yet to understand the differences between the spark.yarn.archive and the HADOOP_GREMLIN_LIBS approaches. Will update this thread as I find out more. Thank you, Varun
|
|
Re: How to improve traversal query performance
HadoopMarc <bi...@...>
Hi Manabu, Yes, providing an example graph works much better in exploring the problem space. I am afraid, though, that I did not find much that will help you out.
So, concluding, there does not seem to be much you can do about the query: you simply want a large result set from a traversal with multiple steps. Depending on the size of your graph, you could hold the graph in memory using the inmemory backend, or you could replace cassandra with cql and put it on infrastructure with SSD storage. Of course, you could also precompute and store results, or split up the query with repeat().times(1), repeat().times(2), etc. for faster intermediate results.
Best wishes, Marc
|
|
Re: Profile() seems inconsisten with System.currentTimeMillis
HadoopMarc <bi...@...>
In the meantime I found that the difference between profile() and currentTimeMillis can be much larger. Apparently, the profile() step takes into account that for real queries, vertices are not present in the database cache, and assumes some time duration to retrieve a vertex or properties from the backend. Is there any documentation on these assumptions?
Best wishes, Marc
|
|
Profile() seems inconsisten with System.currentTimeMillis
HadoopMarc <bi...@...>
Hi,
Can anyone explain why the total duration displayed by the profile() step is more than twice as large as the time difference clocked with System.currentTimeMillis? See below. For those who wonder, the query without profile() also takes about 300 msec.
Thanks, Marc

gremlin> start = System.currentTimeMillis()
==>1607676127027
gremlin> g.V().has('serial', within('1654145144','1648418968','1652445288','1654952168','1653379120',
    '1654325440','1653383216','1658298568','1649680536','1649819672','1654964456','1649729552',
    '1656103144','1655460032','1656111336','1654669360')).inE('assembled').outV().profile()
==>Traversal Metrics
Step                                                               Count  Traversers       Time (ms)    % Dur
=============================================================================================================
JanusGraphStep([],[serial.within([1654145144, 1...                    16          16           0,486    59,26
    \_condition=((serial = 1654145144 OR serial = 1648418968 OR serial = 1652445288 OR serial = 1654952168
               OR serial = 1653379120 OR serial = 1654325440 OR serial = 1653383216 OR serial = 1658298568
               OR serial = 1649680536 OR serial = 1649819672 OR serial = 1654964456 OR serial = 1649729552
               OR serial = 1656103144 OR serial = 1655460032 OR serial = 1656111336 OR serial = 1654669360))
    \_orders=[]
    \_isFitted=true
    \_isOrdered=true
    \_query=multiKSQ[16]@2000
    \_index=bySerial
  optimization                                                                                 0,009
  optimization                                                                                 0,267
JanusGraphVertexStep(IN,[assembled],vertex)                           73          73           0,334    40,74
    \_condition=type[assembled]
    \_orders=[]
    \_isFitted=true
    \_isOrdered=true
    \_query=org.janusgraph.diskstorage.keycolumnvalue.SliceQuery@812bd43d
    \_vertices=1
  optimization                                                                                 0,037
  optimization                                                                                 0,008
  optimization                                                                                 0,005
  optimization                                                                                 0,004
  optimization                                                                                 0,004
  optimization                                                                                 0,004
  optimization                                                                                 0,017
  optimization                                                                                 0,004
  optimization                                                                                 0,004
  optimization                                                                                 0,004
  optimization                                                                                 0,004
  optimization                                                                                 0,004
  optimization                                                                                 0,004
  optimization                                                                                 0,004
  optimization                                                                                 0,004
  optimization                                                                                 0,004
                                            >TOTAL                     -           -           0,820        -
gremlin> System.currentTimeMillis() - start
==>322
|
|
Re: Running OLAP on HBase with SparkGraphComputer fails with Error Container killed by YARN for exceeding memory limits
HadoopMarc <bi...@...>
Hi Roy,
I think I would first check whether the skew is absent if you count the rows reading the HBase table directly from spark (so, without using janusgraph), e.g.:
https://stackoverflow.com/questions/42019905/how-to-use-newapihadooprdd-spark-in-java-to-read-hbase-data
If this works all right, then you know that somehow in the janusgraph HBaseInputFormat the mappers do not get the right key ranges to read from.
Best wishes, Marc
|
|
Re: Error when running JanusGraph with YARN and CQL
HadoopMarc <bi...@...>
Hi Varun,
Good job. However, your last solution will only work with everything running on a single machine. So, indeed, there is something wrong with the contents of spark-gremlin.zip or with the way it is put in the executor's local working directory. Note that you already put /Users/my_comp/Downloads/janusgraph-0.5.2/lib/janusgraph-cql-0.5.2.jar explicitly on the executor classpath, while it should have been available already through ./spark-gremlin.zip/*
Oh, I think I see now what is different. You have used spark.yarn.dist.archives, while the TinkerPop recipes use spark.yarn.archive. They differ in whether or not the jars are extracted from the zip. I guess either can be used, provided it is done consistently. You can use the environment tab in the Spark web UI to inspect how things are picked up by spark.
Best wishes, Marc
|
|
Re: Janusgraph Hadoop Spark standalone cluster - Janusgraph job always creates constant number 513 of Spark tasks
Varun Ganesh <operatio...@...>
Thank you Marc. I was able to reduce the tasks by adjusting the `num_tokens` settings on Cassandra. Still unsure about why each task takes so long though. Hoping that this is a per-task overhead that stays the same as we process larger datasets.
|
|
Re: Error when running JanusGraph with YARN and CQL
Varun Ganesh <operatio...@...>
Answering my own question. I was able to fix the above error and successfully run the count job after explicitly adding /Users/my_comp/Downloads/janusgraph-0.5.2/lib/* to spark.executor.extraClassPath
But I am not yet sure as to why that was needed. I had assumed that adding spark-gremlin.zip to the path would have provided the required dependencies.
|
|
Re: Error when running JanusGraph with YARN and CQL
Varun Ganesh <operatio...@...>
An update on this: I tried setting the env var below:
    export HADOOP_GREMLIN_LIBS=$GREMLIN_HOME/lib
After doing this I was able to successfully run the tinkerpop-modern.kryo example from the Recipes documentation (though the guide at http://yaaics.blogspot.com/2017/07/configuring-janusgraph-for-spark-yarn.html explicitly asks us to ignore this).
Unfortunately, it is still not working with CQL. But the error is now different. Please see below:

12:46:33 ERROR org.apache.spark.scheduler.TaskSetManager - Task 3 in stage 0.0 failed 4 times; aborting job
org.apache.spark.SparkException: Job aborted due to stage failure: Task 3 in stage 0.0 failed 4 times, most recent failure: Lost task 3.3 in stage 0.0 (TID 9, 192.168.1.160, executor 2): java.lang.NoClassDefFoundError: org/janusgraph/hadoop/formats/util/HadoopInputFormat
    at java.lang.ClassLoader.defineClass1(Native Method)
    at java.lang.ClassLoader.defineClass(ClassLoader.java:756)
    at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
    at java.net.URLClassLoader.defineClass(URLClassLoader.java:468)
    at java.net.URLClassLoader.access$100(URLClassLoader.java:74)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:369)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:363)
    at java.security.AccessController.doPrivileged(Native Method)
    ... (skipping)
Caused by: java.lang.ClassNotFoundException: org.janusgraph.hadoop.formats.util.HadoopInputFormat
    at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
    ... 130 more

Is there some additional dependency that I may need to add? Thanks in advance!
|
|