Posted to user@hbase.apache.org by Saad Mufti <sa...@gmail.com> on 2018/03/11 00:17:23 UTC

TableSnapshotInputFormat Behavior In HBase 1.4.0

Hi,

I am running a Spark job (Spark 2.2.1) on an EMR cluster in AWS. There is
no HBase installed on the cluster, only HBase libs linked into my Spark app.
We are reading the snapshot info from an HBase folder in S3 using the
TableSnapshotInputFormat class from HBase 1.4.0, so the Spark job reads the
snapshot data directly from the S3-based filesystem instead of going
through any region server.

I have observed a few concerning behaviors while debugging performance;
some we could mitigate, and on others I am looking for clarity:

1) The TableSnapshotInputFormatImpl code tries to gather locality
information for the region splits. For a snapshot with a large number of
files (over 350,000 in our case), this causes a single-threaded scan of all
the file listings in the driver. It is also useless, because there is no
meaningful locality information to glean when all the files are in S3
rather than HDFS. So I was forced to make a copy of
TableSnapshotInputFormatImpl.java in our code and control this with a
config setting I made up. That got rid of the hours-long scan, so I am good
with this part for now.
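For what it's worth, the knob I added is just a boolean read from the job's
Configuration before the locality computation runs. A sketch of how we set it
(the property name below is one I made up for our fork; it is NOT a stock
HBase 1.4.0 property):

```xml
<!-- Hypothetical key from our forked TableSnapshotInputFormatImpl; not a
     stock HBase property. When false, the driver skips the per-file
     locality/listing pass entirely, which is safe on S3 where there is no
     block locality to discover. -->
<property>
  <name>custom.tablesnapshotinputformat.locality.enabled</name>
  <value>false</value>
</property>
```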

2) I have set a single column family in the Scan that I set on the HBase
configuration via

scan.addFamily(str.getBytes())

hBaseConf.set(TableInputFormat.SCAN, convertScanToString(scan))


But when this code executes under Spark and I observe the threads and logs
on the Spark executors, I see it reading from S3 files for a column family
that was not included in the scan. This column family was intentionally
excluded because it is much larger than the others, and we wanted to avoid
that cost.

Any advice on what I am doing wrong would be appreciated.
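One small aside while I was staring at this code (probably unrelated to the
wrong-family reads, but worth being tidy about): str.getBytes() with no
argument encodes with the JVM's platform default charset, whereas HBase's own
Bytes.toBytes(String) always uses UTF-8. A minimal pure-Java sketch of the
explicit version (class and helper names are mine, for illustration only):

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class FamilyBytes {
    // Mirrors what HBase's Bytes.toBytes(String) does: always UTF-8,
    // independent of the JVM's default charset.
    static byte[] toBytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] family = toBytes("cf1");
        // For ASCII family names the two encodings agree; being explicit
        // just removes any dependence on the platform default charset.
        System.out.println(Arrays.equals(family, new byte[] {'c', 'f', '1'}));
    }
}
```

so scan.addFamily(toBytes(str)) behaves the same on every JVM, whatever
file.encoding happens to be.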

3) We also explicitly set block caching to false on the scan, although I
see that TableSnapshotInputFormatImpl.java sets it to false internally as
well. But when running the Spark job, some executors were taking much
longer than others, and when I observe their threads, I see periodic
messages about a few hundred megabytes of RAM used by the block cache. The
thread sits there reading data from S3 and is occasionally blocked by a
couple of other threads that have "hfile-prefetcher" in their names. Going
back to 2) above, they seem to be reading the wrong column family, but in
this item I am more concerned about why they appear to be prefetching
blocks and caching them when the Scan object is set not to cache blocks at
all.
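If it is the prefetch-on-open flag from a column family's schema kicking in,
one thing I have been experimenting with is turning prefetch off at the
CacheConfig level in the job's configuration. A sketch (I have NOT verified
whether this key actually overrides a per-family PREFETCH_BLOCKS_ON_OPEN
setting on the code path TableSnapshotInputFormat uses in 1.4.0):

```xml
<!-- CacheConfig-level default for prefetching HFile blocks on open.
     Whether this wins over the column family descriptor on the snapshot
     read path is something I have not confirmed. -->
<property>
  <name>hbase.rs.prefetchblocksonopen</name>
  <value>false</value>
</property>
```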

Thanks in advance for any insights anyone can provide.

----
Saad

Re: TableSnapshotInputFormat Behavior In HBase 1.4.0

Posted by Saad Mufti <sa...@gmail.com>.
Thanks, will do that.

----
Saad


On Mon, Mar 12, 2018 at 12:14 PM, Ted Yu <yu...@gmail.com> wrote:

> Saad:
> I encourage you to open an HBase JIRA outlining your use case and the
> config knobs you added through a patch.
>
> We can see the details for each config and make recommendations accordingly.
>
> Thanks
>
> On Mon, Mar 12, 2018 at 8:43 AM, Saad Mufti <sa...@gmail.com> wrote:
>
> > [quoted messages trimmed; full copies appear later in this thread]

Re: TableSnapshotInputFormat Behavior In HBase 1.4.0

Posted by Ted Yu <yu...@gmail.com>.
Saad:
I encourage you to open an HBase JIRA outlining your use case and the
config knobs you added through a patch.

We can see the details for each config and make recommendations accordingly.

Thanks

On Mon, Mar 12, 2018 at 8:43 AM, Saad Mufti <sa...@gmail.com> wrote:

> [quoted messages trimmed; full copies appear later in this thread]

Re: TableSnapshotInputFormat Behavior In HBase 1.4.0

Posted by Saad Mufti <sa...@gmail.com>.
I have created a company-specific branch and added 4 new flags to control
this behavior; these gave us a huge performance boost when running Spark
jobs on snapshots of very large tables in S3. I tried to do everything
cleanly, but:

a) not being familiar with the test strategies, I haven't had time to add
any useful tests, though of course I left the default behavior the same,
and a lot of the behavior I control with these flags only affects
performance, not the final result, so I would need some pointers on how to
add useful tests
b) I added a new flag to be an overall override for prefetch behavior that
overrides any setting, even in the column family descriptor; I am not sure
whether what I did was entirely in the spirit of what HBase does

Again, these flags, if used properly, would only impact jobs using
TableSnapshotInputFormat in their Spark or M-R jobs. Would someone from the
core team be willing to look at my patch? I have never done this before, so
I would appreciate a quick pointer on how to send a patch and get some
quick feedback.

Cheers.

----
Saad



On Sat, Mar 10, 2018 at 9:56 PM, Saad Mufti <sa...@gmail.com> wrote:

> [quoted messages trimmed; full copies appear later in this thread]

Re: TableSnapshotInputFormat Behavior In HBase 1.4.0

Posted by Saad Mufti <sa...@gmail.com>.
The question remains, though, of why it is even accessing a column family's
files that should be excluded based on the Scan. And that column family
does NOT specify prefetch-on-open in its schema. Only the one we want to
read specifies prefetch-on-open, which we want to override for the Spark
job if possible.

----
Saad

On Sat, Mar 10, 2018 at 9:51 PM, Saad Mufti <sa...@gmail.com> wrote:

> [quoted messages trimmed; full copies appear later in this thread]

Re: TableSnapshotInputFormat Behavior In HBase 1.4.0

Posted by Saad Mufti <sa...@gmail.com>.
See below more I found on item 3.

Cheers.

----
Saad

On Sat, Mar 10, 2018 at 7:17 PM, Saad Mufti <sa...@gmail.com> wrote:

> [quoted original message trimmed; see the first message in this thread]

I think I figured out item 3: the column family descriptor for the table in
question has prefetch-on-open set in its schema. Now, for the Spark job, I
don't think this serves any useful purpose, does it? But I can't see any
way to override it. If there is, I'd appreciate some advice.

Thanks.

