
Re: Data Loading Script Optimization

hadoopmarc@...
 

Hi Vinayak,

Good to see some progress!

Some suggestions:
  • Is 40% relative to a single core or to all cores (e.g. CPU usage for a java process in top can be 800% if 8 cores are present)?
  • Ncore * 100% is not necessarily the maximum CPU load of the groovy process + storage backend if the loading becomes IO limited. Can you find out what the IO usage is?
  • Do you use CompositeIndices on the properties "name" and "e-mail" for the has() filters?
  • Regarding the idea from Nicolas, I would rather use a ConcurrentMap that maps ORG ids to vertex ids, but only fill it as you go for the ORGs that you add or look up. The JanusGraph transaction and database caches should be large enough to hold the vertices that are referenced two or more times, thus accommodating g.V(id) lookups.
  • On a single system Apache Spark will not help you.
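Marc's fill-as-you-go cache can be sketched outside of any JanusGraph API. In the sketch below, `OrgIdCache` and `lookupOrCreateOrg` are hypothetical names, and the backend lookup (the real find-or-create traversal) is stubbed as a plain function that returns the vertex id:

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Function;

// Sketch of the fill-as-you-go cache: orgCache maps an ORG business id to
// its (long) vertex id so that later references can use the fast g.V(id)
// lookup instead of a has() filter. lookupOrCreateOrg is a stand-in for the
// real traversal that finds or creates the ORG vertex and returns its id.
public class OrgIdCache {
    private final ConcurrentMap<String, Long> orgCache = new ConcurrentHashMap<>();
    private final Function<String, Long> lookupOrCreateOrg; // hypothetical backend call
    private final AtomicLong backendCalls = new AtomicLong();

    public OrgIdCache(Function<String, Long> lookupOrCreateOrg) {
        this.lookupOrCreateOrg = lookupOrCreateOrg;
    }

    /** Returns the vertex id, hitting the backend only on the first request per ORG. */
    public long vertexId(String orgId) {
        return orgCache.computeIfAbsent(orgId, id -> {
            backendCalls.incrementAndGet();
            return lookupOrCreateOrg.apply(id);
        });
    }

    public long backendCallCount() {
        return backendCalls.get();
    }
}
```

`computeIfAbsent` makes the fill-as-you-go behaviour thread-safe, so several loader threads can share one cache without looking up the same ORG twice.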

Best wishes,    Marc


Re: Data Loading Script Optimization

Nicolas Trangosi
 

Hi,
You could first create a local cache for ORG by retrieving all ORG vertices:

Map<String, Long> orgCache = g.V().has('vertexLabel', 'ORG').project("name", "id").by("orgName").by(T.id)...

Then replace __.V().has('vertexLabel', 'ORG').has('orgName', orgName) by __.V(orgCache.get(orgName))

The same trick may be used for persons, to remove the coalesce() step, if you know that you import more users than already exist in the db.
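The shape of this prefetch can be sketched with plain collections. `OrgPrefetch` and the row format below are hypothetical; a small list of (orgName, id) pairs stands in for the result of the g.V().has('vertexLabel', 'ORG') projection above:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of the prefetch trick. In the real script the cache would be filled
// once from the projection
//   g.V().has('vertexLabel', 'ORG').project('name', 'id').by('orgName').by(T.id)
// here a list of (name, id) string pairs stands in for that traversal result.
public class OrgPrefetch {
    public static Map<String, Long> buildCache(List<String[]> projectedRows) {
        Map<String, Long> orgCache = new HashMap<>();
        for (String[] row : projectedRows) {
            // row[0] = orgName, row[1] = vertex id rendered as a string
            orgCache.put(row[0], Long.parseLong(row[1]));
        }
        return orgCache;
    }

    /** Stands in for __.V().has('vertexLabel','ORG').has('orgName', name); a miss (null) means the ORG still has to be created. */
    public static Long resolve(Map<String, Long> orgCache, String orgName) {
        return orgCache.get(orgName);
    }
}
```

A cache miss is the signal to fall back to the slower find-or-create path, so the trick stays correct even when new ORGs appear during the import.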

Nicolas


On Mon, Aug 9, 2021 at 10:07 AM, Vinayak Bali <vinayakbali16@...> wrote:
Hi Marc, 

To avoid confusion, I included a new transaction at line 39 as well as at line 121.
Line 39: GraphTraversalSource g = graph.newTransaction().traversal();
Line 121: g = ctx.g = graph.newTransaction().traversal();
The total time taken was 11 minutes. The maximum CPU utilization was 40%. As the hardware configuration of the instance is on the higher side, we still have enough RAM to increase the performance.
Could you share, based on your experience, whether it is possible to increase the performance further using some process (Hadoop/Spark, etc.)?

Thanks & Regards,
Vinayak

On Sun, Aug 8, 2021 at 1:53 PM <hadoopmarc@...> wrote:
Hi Vinayak,

Yes, it should be possible to improve on the 3% CPU usage.

The newTransaction() should be added to line 39 (GraphTraversalSource g = graph.traversal();) as the global g from line 121 is not used.

Marc



--

Nicolas Trangosi
Lead back
+33 (0)6 77 86 66 44






Re: Data Loading Script Optimization

Vinayak Bali
 

Hi Marc, 

To avoid confusion, I included a new transaction at line 39 as well as at line 121.
Line 39: GraphTraversalSource g = graph.newTransaction().traversal();
Line 121: g = ctx.g = graph.newTransaction().traversal();
The total time taken was 11 minutes. The maximum CPU utilization was 40%. As the hardware configuration of the instance is on the higher side, we still have enough RAM to increase the performance.
Could you share, based on your experience, whether it is possible to increase the performance further using some process (Hadoop/Spark, etc.)?

Thanks & Regards,
Vinayak

On Sun, Aug 8, 2021 at 1:53 PM <hadoopmarc@...> wrote:
Hi Vinayak,

Yes, it should be possible to improve on the 3% CPU usage.

The newTransaction() should be added to line 39 (GraphTraversalSource g = graph.traversal();) as the global g from line 121 is not used.

Marc


Re: Data Loading Script Optimization

hadoopmarc@...
 

Hi Vinayak,

Yes, it should be possible to improve on the 3% CPU usage.

The newTransaction() should be added to line 39 (GraphTraversalSource g = graph.traversal();) as the global g from line 121 is not used.

Marc


Re: Data Loading Script Optimization

Vinayak Bali
 

Hi Marc, 

The storage backend used is Cassandra.
Yes, the storage backend, janusgraph and the load script are all on the same server.
I specified storage.batch-loading=true.
CPU usage is very low, not more than 3 percent. The machine has a high-end hardware configuration, so I need suggestions on how we can make full use of the hardware.
I will use graph.newTransaction().traversal(), replacing line 121 in the code, and share the results.
Current line: g = ctx.g = graph.traversal();
Modified : g = ctx.g = graph.newTransaction().traversal();
Please validate and confirm the changes. 
As data increases, we should use the global GraphTraversalSource g at the bottom of the script for the bulk loading.

Thanks & Regards,
Vinayak


On Sat, Aug 7, 2021 at 6:21 PM <hadoopmarc@...> wrote:
Hi Vinayak,

What storage backend do you use? Do I understand right that the storage backend and the load script all run on the same server? If so, are all available CPU resources actively used during batch loading? What is the CPU usage of the groovy process, and what of the storage backend?

Specific details in the script:
  • did you specify storage.batch-loading=true
  • I am not sure whether each traversal() call on the graph gets its own thread-independent transaction (that is why I ask for the groovy CPU usage). Maybe you need g = graph.newTransaction().traversal() in CsvImporter
  • I assume that the global GraphTraversalSource g at the bottom of the script is not used for the bulk loading.
Best wishes,    Marc


Re: Data Loading Script Optimization

hadoopmarc@...
 

Hi Vinayak,

What storage backend do you use? Do I understand right that the storage backend and the load script all run on the same server? If so, are all available CPU resources actively used during batch loading? What is the CPU usage of the groovy process, and what of the storage backend?

Specific details in the script:
  • did you specify storage.batch-loading=true
  • I am not sure whether each traversal() call on the graph gets its own thread-independent transaction (that is why I ask for the groovy CPU usage). Maybe you need g = graph.newTransaction().traversal() in CsvImporter
  • I assume that the global GraphTraversalSource g at the bottom of the script is not used for the bulk loading.
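For reference, a minimal properties fragment for bulk loading might look as follows (the ids.block-size value is only an illustration; tune it to the expected insertion volume):

```properties
# janusgraph.properties (fragment) -- options relevant to bulk loading
# Relaxes internal consistency checks and retrievals during loading.
storage.batch-loading=true
# Larger id blocks reduce id-allocation round trips under heavy insertion.
ids.block-size=1000000
```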
Best wishes,    Marc


Re: Not able to enable Write-ahead logs using tx.log-tx for existing JanusGraph setup

Boxuan Li
 

Is it expected that tx.log-tx works only for a fresh JanusGraph setup?

No. It should work well for your existing JanusGraph setup too. Note that it is a GLOBAL option so it must be changed for the entire cluster. See https://docs.janusgraph.org/basics/configuration/#global-configuration
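A hedged sketch of changing such a GLOBAL option from the Gremlin Console, per the global-configuration docs linked above (this requires an open graph, so it is shown as a console fragment only):

```groovy
// Sketch (Gremlin Console): changing a GLOBAL option on a running graph.
// Global options cannot be set in the local properties file of a single
// instance; they must be changed once through the management system.
mgmt = graph.openManagement()
mgmt.set('tx.log-tx', true)
mgmt.commit()
```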

Best,
Boxuan


Not able to enable Write-ahead logs using tx.log-tx for existing JanusGraph setup

Radhika Kundam
 

Hi All,

I am trying to enable write-ahead logs by using the config property tx.log-tx, to handle transaction recovery.
It's working fine for a new JanusGraph setup, but not for a JanusGraph setup that has already been in use for some time.

Is it expected that tx.log-tx works only for a fresh JanusGraph setup?
Do we need to follow any additional steps to enable tx.log-tx for an existing JanusGraph setup?

Thanks,
Radhika


Re: config skip-schema-check=true is not honored for HBase

hadoopmarc@...
 

Hi Jigar,

Yes, I think it is an issue. I did not fully dive into it; in particular, I did not check whether any tests exist for the "disable schema check" configuration option. So, go ahead and create an issue for it.

Best wishes,   Marc


Data Loading Script Optimization

Vinayak Bali
 

Hi All, 

I have attached a groovy script that I use to load data into janusgraph. 
The script takes 4 mins to load 1.5 million nodes and 13 mins to load approx. 3 million edges. The server on which the script runs has a high-end configuration. I am looking for different ways to improve the performance of the script. Your feedback will help.
Thank You for the responses. 

Thanks & Regards,
Vinayak


Re: config skip-schema-check=true is not honored for HBase

jigar patel <jigar.9408266552@...>
 

Hi Marc

Here is the full stack trace,

https://gist.github.com/jigs1993/5cc1682a919cfb5e8290bf4636f1c766

possible fix is here: https://github.com/jigs1993/janusgraph/pull/1/files

Let me know if you think this is actually an issue; I can raise the PR against the master branch.


Re: config skip-schema-check=true is not honored for HBase

hadoopmarc@...
 

Hi Jigar,

Can you provide the properties file you used for opening the graph, as well as the complete stacktrace for the exception listed above?

Best wishes,    Marc


config skip-schema-check=true is not honored for HBase

jigar patel <jigar.9408266552@...>
 

org.apache.hadoop.hbase.security.AccessDeniedException: org.apache.hadoop.hbase.security.AccessDeniedException: Insufficient permissions (user=<user>, scope=<namespace>:<table>, params=[table=<namespace>:<table>],action=CREATE)
at org.apache.hadoop.hbase.security.access.AccessController.requirePermission(AccessController.java:468)
at org.apache.hadoop.hbase.security.access.AccessController.preGetTableDescriptors(AccessController.java:2576)

I got the above error while running OLAP without CREATE permission for <user>, at this line: https://github.com/JanusGraph/janusgraph/blob/master/janusgraph-hbase/src/main/java/org/janusgraph/diskstorage/hbase/HBaseStoreManager.java#L732

It succeeded with CREATE permission given to the user.

It looks like this is due to the call at https://github.com/JanusGraph/janusgraph/blob/master/janusgraph-hbase/src/main/java/org/janusgraph/diskstorage/hbase/HBaseStoreManager.java#L543 being made regardless of the value of the boolean variable skipSchemaCheck (https://github.com/JanusGraph/janusgraph/blob/master/janusgraph-hbase/src/main/java/org/janusgraph/diskstorage/hbase/HBaseStoreManager.java#L258).

Is this a bug?


Re: Property keys unique per label

hadoopmarc@...
 

Hi Laura,

Thanks for explaining in more detail. Another example is a "color" property. Different data sources could use different types of color objects. As long as you do not want to query for paints and socks with the same color, there is no real need to harmonize the color data-value types.
Also note that an index on a property can already be constrained to a single vertex or edge label. So, if anyone would contribute your idea as a JanusGraph feature, I would guess there would be no objection.

Best wishes,   Marc


Re: Property keys unique per label

Laura Morales <lauretas@...>
 

Janus describes itself like this

a scalable graph database optimized for storing and querying graphs containing hundreds of billions of vertices and edges distributed across a multi-machine cluster

but my feeling when using it is that this definition means "a simple schema with billions of vertices/edges" and not "a graph with a large schema". This limitation with properties is an example. What I mean is a graph big enough that vertices on one corner of the graph represent something entirely different (semantically) from vertices on the far end of the same graph. For example, I could use the property "age" with one meaning, but use it with a completely different meaning somewhere else on the graph. Because property names are unique, I must namespace them, for example "contextA.age" and "contextB.age". But if nodes could be grouped by "context", or if properties could be bound to labels, I would not need to namespace them and their datatype would only depend on their context.
I don't know if this makes sense to others, but to me it does.
 
 
 

Sent: Tuesday, July 27, 2021 at 2:00 PM
From: hadoopmarc@...
To: janusgraph-users@...
Subject: Re: [janusgraph-users] Property keys unique per label
Hi Laura,

Indeed, unique property key names are a limitation. But to be honest: if two properties have a different data-value type I would say these are different properties, so why give them the same name?

Best wishes,    Marc


Re: Property keys unique per label

hadoopmarc@...
 

Hi Laura,

Indeed, unique property key names are a limitation. But to be honest: if two properties have a different data-value type I would say these are different properties, so why give them the same name?

Best wishes,    Marc


Re: How to create users and roles

hadoopmarc@...
 

Hi Jonathan,

User authorization for Gremlin Server was introduced in TinkerPop 3.5.0, see https://tinkerpop.apache.org/docs/current/reference/#authorization

JanusGraph will use TinkerPop 3.5.x in its upcoming 0.6.0 release. If you want, you can already build the 0.6.0-SNAPSHOT distribution archives from master, using:

mvn clean install -Pjanusgraph-release -Dgpg.skip=true -DskipTests=true
Best wishes,     Marc


How to create users and roles

jonathan.mercier.fr@...
 

Dear,

I have not found in the documentation the process to create and manage users and roles in order to control data access.
On this page https://docs.janusgraph.org/basics/server/ we can see there is connection and authentication through HTTP or WebSocket.
But I do not see where it describes how to manage users and roles.

Thanks


Property keys unique per label

Laura Morales <lauretas@...>
 

The documentation says "Property key names must be unique in the graph". Does it mean that it's not possible to have property keys that are unique *per label*? In other words, can I have two distinct properties with the same name but different data-value types, as long as they are applied to vertices with different labels?


Re: janusgraph and deeplearning

hadoopmarc@...
 

Hi Jonathan,

One thing is not yet clear to me: does your graph fit into a single node (regarding memory and GPU) or do you plan to use distributed pytorch? Either way, I guess it would be most efficient to use a two step process:

  1. get all data from janusgraph and store it on disk in a suitable format
  2. run pytorch geometric (maybe in a distributed way) from the files on disk
JanusGraph only supports the hadoop InputFormats to retrieve graph data in a distributed way. Some teams succeeded in retrieving data from partitions of the janusgraph storage backends (not using any janusgraph API, see here), which could be done in a custom pytorch loader, but this is not documented (yet).

Cool that you apply janusgraph to this use case, so do not hesitate to ask for more details!

Marc
