Data Loading Script Optimization
Vinayak Bali
Hi All,
I have attached a Groovy script that I use to load data into JanusGraph. The script takes 4 minutes to load 1.5 million nodes and 13 minutes to load approximately 3 million edges. The server on which the script runs has a high hardware configuration. I am looking for ways to improve the performance of the script. Your feedback will help. Thank you for the responses.
Thanks & Regards, Vinayak
hadoopmarc@...
Hi Vinayak,
What storage backend do you use? Do I understand correctly that the storage backend and the load script all run on the same server? If so, are all available CPU resources actively used during batch loading? What is the CPU usage of the Groovy process, and what is that of the storage backend? Specific details in the script:
Vinayak Bali
Hi Marc,
The storage backend is Cassandra. Yes, the storage backend, JanusGraph, and the load script all run on the same server. I have specified storage.batch-loading=true. CPU usage is very low, not more than 3 percent. The machine has a high hardware configuration, so I need suggestions on how to make full use of it. I will use graph.newTransaction().traversal() in place of line 121 of the code and share the results.
Current line: g = ctx.g = graph.traversal();
Modified: g = ctx.g = graph.newTransaction().traversal();
Please validate and confirm the changes. As data increases, we should use the global GraphTraversalSource g at the bottom of the script for the bulk loading.
Thanks & Regards, Vinayak
On Sat, Aug 7, 2021 at 6:21 PM <hadoopmarc@...> wrote: Hi Vinayak,
hadoopmarc@...
Hi Vinayak,
Yes, it should be possible to improve on the 3% CPU usage. The newTransaction() call should be added at line 39 (GraphTraversalSource g = graph.traversal();), because the global g from line 121 is not used.
Marc
Vinayak Bali
Hi Marc,
To avoid confusion, I included a new transaction at line 39 as well as at line 121.
Line 39: GraphTraversalSource g = graph.newTransaction().traversal();
Line 121: g = ctx.g = graph.newTransaction().traversal();
The total time taken was 11 minutes. The maximum CPU utilization was 40%. Since the instance has a high hardware configuration, there is still plenty of RAM available to improve performance. Please share whether, in your experience, performance can be improved further with some other approach (Hadoop, Spark, etc.).
Thanks & Regards, Vinayak
On Sun, Aug 8, 2021 at 1:53 PM <hadoopmarc@...> wrote: Hi Vinayak,
Nicolas Trangosi <nicolas.trangosi@...>
Hi,
You could first create a local cache for ORG by retrieving all ORG vertices up front:
Map<String, Long> orgCache = g.V().has('vertexLabel', 'ORG').project("name", "id").by("orgName").by(T.id)...
Then replace __.V().has('vertexLabel', 'ORG').has('orgName', orgName) with __.V(orgCache.get(orgName)).
The same trick may be used for persons, to remove the coalesce, if you know that you are importing more users than already exist in the database.
On Mon, Aug 9, 2021 at 10:07, Vinayak Bali <vinayakbali16@...> wrote:
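The orgCache idea above can be illustrated with a minimal, stdlib-only Java sketch. It does not call JanusGraph or TinkerPop: the `rows` list is a hypothetical stand-in for whatever the project("name", "id") query returns, and the class name, the method names, and the sample names and ids are all assumptions made for illustration. The point is only that the name-to-id map is built once, so each edge can resolve its ORG endpoint with a map lookup instead of a per-edge has('orgName', ...) query.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class OrgCacheSketch {

    /** Hypothetical helper: one result row shaped like project("name", "id") output. */
    public static Map<String, Object> row(String name, long id) {
        Map<String, Object> m = new HashMap<>();
        m.put("name", name);
        m.put("id", id); // autoboxed to Long
        return m;
    }

    /** Build the orgName -> vertex id cache once, before edge loading starts. */
    public static Map<String, Long> buildOrgCache(List<Map<String, Object>> rows) {
        Map<String, Long> cache = new HashMap<>();
        for (Map<String, Object> r : rows) {
            cache.put((String) r.get("name"), (Long) r.get("id"));
        }
        return cache;
    }

    /** Resolve an ORG vertex id from the cache; fail fast if the name is unknown. */
    public static long resolveOrgId(Map<String, Long> cache, String orgName) {
        Long id = cache.get(orgName);
        if (id == null) {
            throw new IllegalArgumentException("Unknown ORG: " + orgName);
        }
        return id;
    }

    public static void main(String[] args) {
        // Sample data only; in the real script these rows would come from the graph.
        List<Map<String, Object>> rows = new ArrayList<>();
        rows.add(row("acme", 4104L));
        rows.add(row("globex", 8200L));

        Map<String, Long> orgCache = buildOrgCache(rows);

        // During edge loading, the id found here would be passed to __.V(id).
        System.out.println(resolveOrgId(orgCache, "globex"));
    }
}
```

Failing fast on an unknown name is a design choice of this sketch, not something from the thread; a real load script might instead fall back to a graph lookup or create the missing vertex.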
hadoopmarc@...
Hi Vinayak,
Good to see some progress! Some suggestions:
Best wishes, Marc