mappers.per.region option to HBaseInputFormat


HadoopMarc <bi...@...>
 

Hi dev team,

The JanusGraph users list has seen a number of threads regarding OLAP performance with janusgraph-hbase. In particular, it turns out that the initial loading of a graph is problematic when the HBase table is stored in a small number of large regions of, say, 10 GB. Such large region sizes give optimal HBase performance, so system managers are not expected to like HBase-backed graphs with the many small regions needed for good parallelism during OLAP operations. As a result, HBase 2.0 alpha has introduced a mappers.per.region option to TableInputFormatBase, which allows a single region to be spread over multiple mappers or Spark tasks. Anxious to use this feature before HBase 2.0 and a JG version supporting it come out, I made a quick attempt to backport the feature. This turns out to be quite doable, see: https://github.com/vtslab/janusgraph/commit/87bf1000c01dfce92e857349ba479db0d3ef6bd1. This is initial work and I plan to do a performance benchmark with the Friendster graph, like the TinkerPop team did.
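
To illustrate the idea behind mappers.per.region, here is a minimal Java sketch of cutting one region's row-key range into several contiguous sub-ranges, one per mapper. It is not the HBase 2.0 code and not the back-port in the commit above. The class RegionSplitSketch and its methods are hypothetical names, and the sketch assumes non-empty start and end keys that can be compared as unsigned big-endian numbers.

```java
import java.math.BigInteger;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only: splits the key range of one region
// into n contiguous sub-ranges of roughly equal width.
class RegionSplitSketch {

    // Reads a key as an unsigned big-endian number, right-padded with zero bytes to `width`.
    static BigInteger toNumber(byte[] key, int width) {
        byte[] padded = new byte[width];
        System.arraycopy(key, 0, padded, 0, key.length);
        return new BigInteger(1, padded);
    }

    // Converts a non-negative number back into a key of exactly `width` bytes.
    static byte[] toKey(BigInteger value, int width) {
        byte[] raw = value.toByteArray(); // may be shorter than width or carry a leading sign byte
        byte[] key = new byte[width];
        int n = Math.min(raw.length, width);
        System.arraycopy(raw, raw.length - n, key, width - n, n);
        return key;
    }

    // Returns n + 1 boundary keys; consecutive pairs are the sub-ranges [k_i, k_{i+1}).
    static List<byte[]> boundaries(byte[] start, byte[] end, int n) {
        int width = Math.max(start.length, end.length);
        BigInteger lo = toNumber(start, width);
        BigInteger span = toNumber(end, width).subtract(lo);
        List<byte[]> keys = new ArrayList<>();
        for (int i = 0; i <= n; i++) {
            BigInteger offset = span.multiply(BigInteger.valueOf(i)).divide(BigInteger.valueOf(n));
            keys.add(toKey(lo.add(offset), width));
        }
        return keys;
    }

    public static void main(String[] args) {
        // One region covering keys 0x00 .. 0x40, spread over 4 mappers.
        for (byte[] key : boundaries(new byte[] {0x00}, new byte[] {0x40}, 4)) {
            StringBuilder hex = new StringBuilder();
            for (byte b : key) {
                hex.append(String.format("%02x", b));
            }
            System.out.println(hex);
        }
    }
}
```

Each mapper would then scan one sub-range instead of the whole region. Equal-width key ranges only give equally sized tasks if the row keys are spread evenly over the region, which is an assumption of this sketch.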

My questions to you:
  • would this work be welcomed as a JanusGraph PR before a release based on HBase 2.0 comes out?
  • if so, do you have any suggestions to improve on the work?


Some additional notes:

  • SparkGraphComputer has an option to repartition the graph using the workers() method of the GraphComputer builder, but this does not improve the parallelization of the initial load
  • The current HBaseInputFormat has a rather intricate inheritance structure, which will probably need rigorous refactoring to use the HBase 2.0 TableInputFormatBase
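
The first note can be made concrete with some back-of-the-envelope arithmetic. The sketch below assumes that the initial read runs one task per input split and that each region is cut uniformly into mappers-per-region splits. The class InitialLoadTasks, its methods, and the figure of 8 regions are hypothetical and for illustration only.

```java
// Hypothetical arithmetic only: estimates the parallelism of the initial load.
class InitialLoadTasks {

    // Number of input splits, and hence read tasks, under the uniform-split assumption.
    static long readTasks(long regions, int mappersPerRegion) {
        return regions * mappersPerRegion;
    }

    // Approximate data volume per read task in megabytes.
    static double megabytesPerTask(double regionSizeMb, int mappersPerRegion) {
        return regionSizeMb / mappersPerRegion;
    }

    public static void main(String[] args) {
        long regions = 8;              // assumed number of regions
        double regionSizeMb = 10240.0; // 10 GB per region, as in the example above
        for (int mappers : new int[] {1, 16}) {
            System.out.println(mappers + " mapper(s) per region: "
                    + readTasks(regions, mappers) + " read tasks, "
                    + megabytesPerTask(regionSizeMb, mappers) + " MB each");
        }
    }
}
```

With one mapper per region the initial load is limited to 8 tasks of 10 GB each, whatever value is passed to workers(), because the repartitioning only happens after the data has been read.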

Cheers,   Marc


Robert Dale <rob...@...>
 

Can’t make a release on a snapshot. Do they have a pre-release release? We already have one dep on an rc1 release.


--
You received this message because you are subscribed to the Google Groups "JanusGraph developers" group.
To unsubscribe from this group and stop receiving emails from it, send an email to janusgr...@....
To view this discussion on the web visit https://groups.google.com/d/msgid/janusgraph-dev/fc87971b-664c-4b0b-961a-aef593d9fb40%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
--
Robert Dale


HadoopMarc <marc.d...@...>
 

Hi Robert,

I see I caused some confusion. The link I sent does not have any deps on HBase 2.0; it only copies (back-ports) a small bit of HBase 2.0 code, which becomes part of JanusGraph and is tested as such. Once HBase 2.0 becomes available, the back-port would vanish from JanusGraph again on the release branch that depends on HBase 2.0.
So my question is whether there would be support for this back-ported feature on the 0.2.X and 0.3.X branches, which depend on HBase 1.Y.

Cheers,    Marc


Jerry He <jerr...@...>
 

I think it is a useful feature to have. See the JIRA for more
details: https://issues.apache.org/jira/browse/HBASE-16894
On the other hand, it is about timing. It is very likely that the
next release of JanusGraph (say 3 months from now) will be close to
either HBase 2.0 or HBase 1.4 (which also contains the fix). Then we
will have a code backport that may quickly become a duplicate.

Thanks,


HadoopMarc <bi...@...>
 

Hi Jerry,

Thanks for the info about the HBase-1.4 branch, I was not aware of that. I agree then that it is better to focus our efforts on having JanusGraph's HBaseInputFormat inherit from HBase's TableInputFormatBase. I will do the benchmark test with my current code anyway, to see if performance works out as expected.

Marc


Jerry He <jerr...@...>
 

That will be a very nice performance test!

Thanks.
