ids.placement configuration question
Manish Baid <mmb...@...>
Hi,
I am using embedded JanusGraph to connect to the backends (Solr, Cassandra) over the network. We are storing a large volume of data and are looking to partition it based on an application-provided partitioning key (PropertyPlacementStrategy).
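For context, this is roughly how I open the graph (a minimal sketch; the hostnames and Solr URL are placeholders, and the full strategy class path is my assumption):

```java
import org.janusgraph.core.JanusGraph;
import org.janusgraph.core.JanusGraphFactory;

public class EmbeddedGraph {
    public static void main(String[] args) {
        // open an embedded instance against remote Cassandra (CQL) and Solr
        JanusGraph graph = JanusGraphFactory.build()
                .set("storage.backend", "cql")
                .set("storage.hostname", "cassandra-host")   // placeholder
                .set("index.search.backend", "solr")
                .set("index.search.solr.mode", "http")
                .set("index.search.solr.http-urls", "http://solr-host:8983/solr") // placeholder
                // custom vertex placement strategy, referenced by class name
                .set("ids.placement",
                     "org.janusgraph.graphdb.database.idassigner.placement.PropertyPlacementStrategy")
                .open();
        graph.close();
    }
}
```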
I see the following statement in the docs:
Edge cuts are more meaningful when the JanusGraph servers are on the same hosts as the storage backend. If you have to make a network call to a different host on each hop of a traversal, the benefit of edge cuts and custom placement strategies can be largely nullified.
Why is it only beneficial when JanusGraph and the backend are local to each other?
Doesn't the partition id decide which Cassandra partition the record goes to, i.e. the key in the Cassandra tables?
PRIMARY KEY (key, column1)
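My rough mental model (an illustrative sketch only; the real bit layout lives in JanusGraph's IDManager and may differ) is that the partition id is folded into the vertex id, which is then serialized into that `key` column, and Cassandra routes the row by hashing the key:

```java
// Illustrative only: JanusGraph reserves some bits of the vertex id for
// the partition id (the exact layout is in IDManager and may differ).
// The resulting id becomes the `key` column of the edgestore table.
static long toPartitionedId(long count, long partitionId, int partitionBits) {
    // keep the partition bits together at the top of the id so that ids
    // from one partition are contiguous under byte-ordered key comparison
    return (partitionId << (Long.SIZE - partitionBits)) | count;
}
```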
Thanks
Steve Todorov <steve....@...>
Hi,
I was actually researching the same thing just yesterday!
The `cluster.max-partitions` setting suggests you need to know in advance how many storage backend instances you will add in the future.
However, Cassandra can have a practically unlimited number of instances, which you can re-balance.
How do you account for that open-endedness, and how would this work with Cassandra? :)
Also, what happens if you add more storage backend instances than `cluster.max-partitions`?
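From the config reference, `cluster.max-partitions` must be a power of 2 and should be larger than the maximum number of nodes you ever expect in the cluster, which suggests it defines virtual partitions rather than physical instances. A minimal sketch of setting it (hostname and value are just examples):

```java
import org.janusgraph.core.JanusGraph;
import org.janusgraph.core.JanusGraphFactory;

public class MaxPartitionsExample {
    public static void main(String[] args) {
        // cluster.max-partitions sets the number of *virtual* partitions,
        // not physical nodes: a power of 2, sized above the largest
        // cluster you ever expect to run
        JanusGraph graph = JanusGraphFactory.build()
                .set("storage.backend", "cql")
                .set("storage.hostname", "cassandra-host") // placeholder
                .set("cluster.max-partitions", 64)         // example value
                .open();
        graph.close();
    }
}
```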
Regards,
Steve
HadoopMarc <bi...@...>
Hi,
Your quote from the ref docs:
Edge cuts are more meaningful when the JanusGraph servers are on the same hosts as the storage backend. If you have to make a network call to a different host on each hop of a traversal, the benefit of edge cuts and custom placement strategies can be largely nullified.
This makes no sense to me either. A Gremlin traversal is executed by a single JanusGraph instance, and in general this instance needs to retrieve vertices from multiple storage backend hosts. When traversing an edge, the JanusGraph calls to the storage backend for retrieving the inVertex and the outVertex are separated in time.
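To make that concrete, consider a two-hop traversal (a sketch; the properties file, property key, and edge label are made up):

```java
import org.apache.tinkerpop.gremlin.process.traversal.dsl.graph.GraphTraversalSource;
import org.janusgraph.core.JanusGraph;
import org.janusgraph.core.JanusGraphFactory;

public class TwoHopExample {
    public static void main(String[] args) {
        JanusGraph graph = JanusGraphFactory.open("conf/janusgraph-cql.properties");
        GraphTraversalSource g = graph.traversal();
        // each out() step first resolves the adjacent vertex ids, then
        // fetches those vertices from whichever backend hosts own their
        // keys; that is two separate rounds of storage calls over time
        long n = g.V().has("name", "alice").out("knows").out("knows").count().next();
        System.out.println(n);
        graph.close();
    }
}
```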
Because the original Titan developers were no fools, I suspect there is a different rationale behind the edge cut. It would have to do with the inner dynamics of the storage backend, which determine the number of network exchanges needed to retrieve a particular vertex. Anyone with more detailed knowledge about this?
Marc
Manish Baid <mmb...@...>
All,
Partitioning is an important consideration when storing large volumes of data.
Does anyone know about this feature in detail?
This is a MUST-have requirement for our POC.
Regards
"alex...@gmail.com" <alexand...@...>
Hi,
Notice the statement in the doc "Currently explicit partitioning is not supported." (https://docs.janusgraph.org/advanced-topics/partitioning/).
You may try to reach out to Chris Hupman (https://github.com/chupman), who put that statement in the docs, for more details, but as far as I remember there is currently no way to control partition boundaries. That means placing two vertices within the same partition might still put those vertices on different hosts (i.e. Cassandra nodes). Even though we can choose a partition within JanusGraph, it doesn't guarantee that choosing the same partition will place the vertices on the same host.
As far as I know, the only viable partitioning is random partitioning, which distributes data randomly across the cluster.
That said, I didn't investigate that area much, so I will be happy to hear more details about this feature.
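As a quick illustration of why same-partition keys scatter (a sketch using Guava's murmur3_128 as a stand-in; Cassandra's Murmur3Partitioner uses its own variant, so the exact tokens differ, but the scattering behaviour is the same idea):

```java
import com.google.common.hash.Hashing;
import java.nio.charset.StandardCharsets;

public class TokenScatter {
    public static void main(String[] args) {
        // keys sharing a JanusGraph partition prefix still hash to
        // unrelated tokens, hence potentially unrelated Cassandra nodes
        for (String key : new String[]{"p1-vertex-a", "p1-vertex-b", "p1-vertex-c"}) {
            long token = Hashing.murmur3_128()
                    .hashString(key, StandardCharsets.UTF_8)
                    .asLong();
            System.out.printf("%s -> token %d%n", key, token);
        }
    }
}
```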
Oleksandr