Posted to issues@hbase.apache.org by "Lars Hofhansl (JIRA)" <ji...@apache.org> on 2013/12/13 02:04:07 UTC
[jira] [Comment Edited] (HBASE-8369) MapReduce over snapshot files

    [ https://issues.apache.org/jira/browse/HBASE-8369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13847004#comment-13847004 ] 

Lars Hofhansl edited comment on HBASE-8369 at 12/13/13 1:02 AM:
----------------------------------------------------------------

The only changes to existing HBase classes are exactly these hooks, though. Without them, this cannot be done with outside code. Once those are in place anyway, we might as well add some new classes for the M/R stuff; but it's fine to keep these outside, they just become part of the M/R job then.

To explain my comment above:
Adding a few classes is not a fork, of course, but it starts a slippery slope. Once you have started, it's easy to pile on top of that. And there are some HBase changes needed, so it is an actual patch we need to maintain.
We have so far completely avoided that (except for some hopefully temporary security-related changes to HDFS), and I have been a strong advocate for that in our organization. We have also always forward-ported any changes we made to 0.96+. So it is frustrating to have to start this even (or especially) for such a small change.

So please pardon my frustration.
I do not understand the reluctance here, as there is almost no risk and some folks will be using 0.94 for a while.
Whether it's a new "feature" or not is not relevant (IMHO). HBase's slow M/R performance could be considered a bug too, and then this would be a bug fix.

We're not breaking up over this :)

So it seems a good compromise would be to get the required hooks into HBase...?
[~jesse_yates], FYI.



> MapReduce over snapshot files
> -----------------------------
>
>                 Key: HBASE-8369
>                 URL: https://issues.apache.org/jira/browse/HBASE-8369
>             Project: HBase
>          Issue Type: New Feature
>          Components: mapreduce, snapshots
>            Reporter: Enis Soztutar
>            Assignee: Enis Soztutar
>             Fix For: 0.98.0
>
>         Attachments: HBASE-8369-0.94.patch, HBASE-8369-0.94_v2.patch, HBASE-8369-0.94_v3.patch, HBASE-8369-0.94_v4.patch, HBASE-8369-0.94_v5.patch, HBASE-8369-trunk_v1.patch, HBASE-8369-trunk_v2.patch, HBASE-8369-trunk_v3.patch, hbase-8369_v0.patch, hbase-8369_v11.patch, hbase-8369_v5.patch, hbase-8369_v6.patch, hbase-8369_v7.patch, hbase-8369_v8.patch, hbase-8369_v9.patch
>
>
> The idea is to add an InputFormat that can run a MapReduce job over snapshot files directly, bypassing the HBase server layer. The InputFormat is similar in usage to TableInputFormat, taking a Scan object from the user, but instead of running against an online table, it runs against a table snapshot. We do one split per region in the snapshot, and open an HRegion inside the RecordReader. A RegionScanner is used internally for doing the scan without any HRegionServer bits.
> Users have been asking and searching for ways to run MR jobs by reading directly from HFiles, so this allows new use cases if reading from stale data is ok:
>  - Take snapshots periodically, and run MR jobs only on snapshots.
>  - Export snapshots to a remote HDFS cluster, and run the MR jobs on that cluster without an HBase cluster.
>  - (Future use case) Combine snapshot data with online HBase data: scan from yesterday's snapshot, but read today's data from the online HBase cluster.
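
For readers who want a picture of the "one split per region" idea in the description above, here is a rough, self-contained sketch. It is not the code from any of the attached patches and it uses no HBase classes: `SnapshotSplitSketch`, `Region`, `Split` and `planSplits` are all hypothetical names. It also assumes that each split is clipped to the user's scan range and that row keys can be compared as plain strings (HBase compares raw bytes), so treat it as an illustration only.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch only: none of these types are HBase classes.
final class SnapshotSplitSketch {

    /** A region recorded in a snapshot. An empty key means "unbounded". */
    static final class Region {
        final String name;
        final String startKey; // inclusive
        final String endKey;   // exclusive
        Region(String name, String startKey, String endKey) {
            this.name = name;
            this.startKey = startKey;
            this.endKey = endKey;
        }
    }

    /** One unit of map work: a single region, clipped to the scan range. */
    static final class Split {
        final String regionName;
        final String startRow;
        final String stopRow;
        Split(String regionName, String startRow, String stopRow) {
            this.regionName = regionName;
            this.startRow = startRow;
            this.stopRow = stopRow;
        }
    }

    /** Emits at most one split per region, skipping regions outside the scan range. */
    static List<Split> planSplits(List<Region> regions, String scanStart, String scanStop) {
        List<Split> splits = new ArrayList<>();
        for (Region r : regions) {
            boolean endsBeforeScan = !r.endKey.isEmpty() && r.endKey.compareTo(scanStart) <= 0;
            boolean startsAfterScan = !scanStop.isEmpty() && r.startKey.compareTo(scanStop) >= 0;
            if (endsBeforeScan || startsAfterScan) {
                continue; // no overlap with the requested rows
            }
            // Lower bound: the larger of the two start keys ("" sorts first).
            String start = r.startKey.compareTo(scanStart) >= 0 ? r.startKey : scanStart;
            // Upper bound: the smaller of the two stop keys, where "" is unbounded.
            String stop;
            if (r.endKey.isEmpty()) {
                stop = scanStop;
            } else if (scanStop.isEmpty()) {
                stop = r.endKey;
            } else {
                stop = r.endKey.compareTo(scanStop) <= 0 ? r.endKey : scanStop;
            }
            splits.add(new Split(r.name, start, stop));
        }
        return splits;
    }
}
```

In the design described above, each such split would then be handed to a RecordReader that opens the region's files and scans them locally; that part is left out here because it depends on HBase internals.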


