[Performance Optimization] Optimization around the `system_properties` table interaction
Hi all
The interaction with the underlying KV store via the JanusGraph client hits the `system_properties` table with a range query where the key is `configuration` (key = 0x636f6e66696775726174696f6e).
- The JanusGraph client stores all configurations (static + dynamic) against the single `configuration` key.
- When we run the job with Spark executors, each using JanusGraph in embedded mode, every executor creates executor-level (dynamic) entries under the same `configuration` key.
- Thus, as the number of executors increases, the partition with key `configuration` becomes a large partition, and queries with key=`configuration` turn into range scans over that large partition, as seen in the graphs below (from the Scylla Monitoring Grafana dashboard).

I would like to know whether a fix for this issue is in progress. We at Zeotap use JanusGraph at tremendous scale (single graphs with 70 billion vertices and 50 billion edges) and have identified a couple of solutions to fix it.

Thanks
Saurabh Verma
+91 7976984604
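For context on why everything lands in one partition: JanusGraph's CQL backend creates plain key/value tables, so `system_properties` looks roughly like the sketch below. This is an approximation for illustration; the keyspace name and exact column types may differ in your deployment.

```cql
-- Approximate shape of the table the JanusGraph CQL backend creates
-- ('janusgraph' keyspace assumed; yours may differ).
CREATE TABLE janusgraph.system_properties (
    key     blob,   -- partition key; always 0x636f6e66696775726174696f6e ('configuration')
    column1 blob,   -- clustering column: the property name
    value   blob,   -- serialized property value
    PRIMARY KEY (key, column1)
);

-- Every executor's dynamic entries share the same partition key,
-- so initialization reads become a scan of one ever-growing partition:
SELECT column1, value FROM janusgraph.system_properties
WHERE key = 0x636f6e66696775726174696f6e;
```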
Hi Saurabh!
Thanks for reporting that issue! Looking at the open pull requests, I don't see one that addresses this problem. You're always welcome to share your solutions by discussing them here or even submitting PRs directly. Do you already have these fixes in place and in use in your production environment, or is it rather an early-stage draft?
simone3.cattani@...
On Mon, Feb 15, 2021 at 05:59 AM, sauverma wrote:
Hi Saurabh, we are experiencing the exact same issue: a Spark job with one JanusGraph instance per partition calling a ScyllaDB cluster. Our symptom is a ton of timeout exceptions due to missing QUORUM on read queries. If you want to share your proposed solutions, we will be happy to try them.
Thank you folks for getting back.
@Simone3, yes, this issue shows up as read timeouts from the shard holding the `system_properties` table (there is only one partition for `system_properties`, unreplicated). Based on our observations, we've used the workarounds below to bypass it for now (the required code change in JanusGraph is under test right now):
- set `gc_grace_seconds` for `system_properties` to 0
- truncate the `system_properties` table periodically (say, every 2 hours)

Thanks
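In CQL terms, the two workarounds above look roughly like this (assuming the default `janusgraph` keyspace; adjust to yours):

```cql
-- Workaround 1: drop the tombstone grace period so deleted entries
-- are purged at compaction instead of lingering in the hot partition.
ALTER TABLE janusgraph.system_properties WITH gc_grace_seconds = 0;

-- Workaround 2: periodically clear the table (e.g. from a cron job
-- every 2 hours). Caution: a later message in this thread reports that
-- truncating while ingestion is running can corrupt the graph.
TRUNCATE janusgraph.system_properties;
```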
simone3.cattani@...
Hi @sauverma,
Nice, we are truncating the table too. It's good to have your confirmation that it can safely work around the issue. We will look into the `gc_grace_seconds` setting now. Just out of curiosity: are you fixing it by changing `KCVSConfiguration` to store properties as rows instead of columns?
On Mon, Feb 22, 2021 at 01:58 AM, <simone3.cattani@...> wrote:
Hi @simone, we were planning to segregate the static configurations from the runtime dynamic configurations (last-update timestamp from the client, etc.). AFAIK only the static configurations are required by JanusGraph clients during initialization. Thanks Saurabh
Hi all
Updates on this issue:
- We found that periodic removal of `system_properties` (while ingestion is running) leads to graph corruption (mentioned at a high level at https://docs.janusgraph.org/advanced-topics/recovery/).
- The perf issues we saw were due to the following:
  - improper handling of Dataproc scale-down, which left connections to JanusGraph unclosed and thus an ever-growing `system_properties` table
  - unbounded, unthrottled access to the Scylla caching layer, which slowed down other queries because of the single, hot `system_properties` partition
- In addition, the data model for `system_properties` still needs to be fixed via clustering keys: by design `system_properties` has only one SINGLE partition, and all Spark executors hit it during initialization, leading to query slowdown -> query queuing -> query timeouts.

Thanks
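One hypothetical way to apply clustering keys as suggested above would be to pull the per-executor (dynamic) entries out of the single `configuration` partition by widening the partition key with an instance scope. This is purely an illustrative sketch, not JanusGraph's actual schema or a committed fix; the table and column names are made up:

```cql
-- Hypothetical re-modeling: spread dynamic, per-executor entries
-- across partitions while keeping static config in one small partition.
CREATE TABLE system_properties_v2 (
    scope   blob,   -- e.g. 'static', or a per-instance/executor id
    key     blob,   -- the original 'configuration' key
    column1 blob,   -- property name
    value   blob,
    PRIMARY KEY ((scope, key), column1)
);
```

Executors would then read only the small static partition at initialization instead of range-scanning every instance's dynamic rows, though, as discussed in this thread, any such change has to contend with backward compatibility.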
Boxuan Li
Hi @sauverma,
I am just curious: I noticed you said "there is only 1 partition for system_properties unreplicated". Do you have `storage.cql.replication-factor = 1`?
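For reference, the replication factor is a JanusGraph configuration option that is applied when JanusGraph first creates the keyspace; a typical snippet looks like the following (an existing keyspace keeps whatever replication settings it was created with):

```properties
storage.backend=cql
storage.cql.keyspace=janusgraph
# Only takes effect when JanusGraph creates the keyspace;
# change an existing keyspace via ALTER KEYSPACE instead.
storage.cql.replication-factor=3
```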
Hi Boxuan Li
Boxuan Li
Yeah, that makes sense. I saw you said "unreplicated" and thus wondered. I am not familiar with how `system_properties` is handled, but I just want to point out that it is very difficult, if not impossible, to change the data model while keeping backward compatibility at the same time.