How to upload RDF bulk data to JanusGraph


Arpan Jain <arpan...@...>

I have data in RDF (Turtle) format with around 6 million triples. Currently I am using the rdf2gremlin Python script for the conversion, but it is taking too much time: for 10k records it took around 1 hour. I am using ScyllaDB as the JanusGraph storage backend. Below is the Python code I am using.

import pathlib
import rdflib
from rdf2g import setup_graph

DEFAULT_LOCAL_CONNECTION_STRING = "ws://localhost:8182/gremlin"
g = setup_graph(DEFAULT_LOCAL_CONNECTION_STRING)

OUTPUT_FILE_LAM_PROPERTIES = pathlib.Path("path/to/ttl/file/.ttl").resolve()
rdf_graph = rdflib.Graph()
rdf_graph.parse(str(OUTPUT_FILE_LAM_PROPERTIES), format="ttl")
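
The snippet above only parses the Turtle file; the push into JanusGraph happens in a separate call. A minimal sketch of that step, assuming rdf2g's load_rdf2g helper:

import rdf2g

# Assumed loading step: rdf2g walks the parsed rdflib graph and issues
# Gremlin mutations over the websocket connection, roughly one traversal
# per RDF resource, which is likely why 10k records take so long.
rdf2g.load_rdf2g(g, rdf_graph)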

The same RDF data takes only around 10 minutes to load completely into Neo4j, but I want to use JanusGraph.

Kindly suggest the best way to bulk upload RDF data to JanusGraph using Python or Java.


"alex...@gmail.com" <alexand...@...>
 

Hi,

Try enabling batch loading: "storage.batch-loading=true".
Increase your batch mutations buffer: "storage.buffer-size=20480".
Increase the ID block size: "ids.block-size=10000000".
I'm not sure whether your flow only adds data or also upserts it. If it upserts, you may also set "query.batch=true".
That said, I haven't used rdf2gremlin and can't suggest much about it. The configurations above are just the options I can immediately think of; a proper investigation would be needed to recommend real performance improvements. You may additionally tune ScyllaDB for your use case.
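
Putting those suggestions together, the relevant fragment of a JanusGraph properties file might look like the sketch below. The storage lines are illustrative and assume ScyllaDB reached over the Cassandra CQL protocol:

# storage backend (illustrative; ScyllaDB speaks the CQL protocol)
storage.backend=cql
storage.hostname=127.0.0.1

# bulk-loading tweaks
storage.batch-loading=true
storage.buffer-size=20480
ids.block-size=10000000
query.batch=true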

Best regards,
Oleksandr


Arpan Jain <arpan...@...>

All these properties need to be set in the JanusGraph properties file, right? I mean the config the server starts with, i.e. the file where we set the storage backend, host, etc.


"alex...@gmail.com" <alexand...@...>
 

That's right.
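
Concretely: in the standard JanusGraph distribution, Gremlin Server points at that graph properties file from its own YAML config, so the settings take effect when the server starts. A sketch of the relevant fragment of gremlin-server.yaml (file names vary by version and install; these are illustrative):

graphs: {
  graph: conf/gremlin-server/janusgraph-cql-server.properties
}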


Arpan Jain <arpan...@...>

Actually I have around 70 fields. So my doubt is: is it possible to insert some data first without batch loading, so that JanusGraph creates its own schema, and then enable batch loading for the remaining data?
Will this process give an error?
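
For reference, the JanusGraph docs recommend the opposite order for bulk loads: define the schema explicitly up front and set "schema.default=none" to disable automatic schema creation, since "storage.batch-loading=true" skips consistency checks and trusts the incoming data. A minimal sketch of defining a couple of property keys from Python, with hypothetical key names standing in for the ~70 real fields; it assumes the default server setup where the graph is bound as "graph":

from gremlin_python.driver.client import Client

# Schema management runs server-side, so submit it as a script.
schema_script = """
mgmt = graph.openManagement()
if (mgmt.getPropertyKey('name') == null) {
    mgmt.makePropertyKey('name').dataType(String.class).cardinality(Cardinality.SINGLE).make()
}
if (mgmt.getPropertyKey('age') == null) {
    mgmt.makePropertyKey('age').dataType(Integer.class).cardinality(Cardinality.SINGLE).make()
}
mgmt.commit()
"""

client = Client("ws://localhost:8182/gremlin", "g")
try:
    client.submit(schema_script).all().result()
finally:
    client.close()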
