Posted to dev@hbase.apache.org by Cosmin Lehene <cl...@adobe.com> on 2012/03/02 22:45:56 UTC

Re: MR job "randomly" scans up thousands of rows less than it should.

Following up on this.

Backporting HBASE-4485 didn't seem to help.
We were a bit under pressure and I didn't have time to investigate more
deeply (there's a small chance I missed something during the backport).

We eventually upgraded to 0.92, which fixed the problem :)

Thanks a lot for helping with this,
Cosmin

On 2/15/12 1:33 PM, "Cosmin Lehene" <cl...@adobe.com> wrote:

>Amit, HBASE-4485 describes the behavior I'm seeing, thanks.
>
>Looking over the patches, I'm under the impression that HBASE-4485, which
>is a subtask of HBASE-2856, was backported to 0.92 through HBASE-4838 by
>Lars.
>Am I wrong?
>
>Thanks,
>Cosmin
>
>
>On 2/14/12 11:06 PM, "Amitanand Aiyer" <am...@fb.com> wrote:
>
>>Hi Cosmin,
>>  https://issues.apache.org/jira/browse/HBASE-4485 might be applicable.
>>
>>  The patch was included in the fix for 2856.
>>
>>Cheers,
>>-Amit
>>
>>________________________________________
>>From: Cosmin Lehene [clehene@adobe.com]
>>Sent: Tuesday, February 14, 2012 12:02 PM
>>To: dev@hbase.apache.org
>>Subject: Re: MR job "randomly" scans up thousands of rows less than it
>>should.
>>
>>I just got back to this issue. Initially, the behavior we've seen (missing
>>rows) wouldn't reproduce on 0.90 using TestAcidGuarantees.
>>However, if the puts in the writer threads include additional rows, the
>>scanners start reading fewer rows. This reproduces consistently on 0.90,
>>and 0.92 seems to behave correctly.
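>>
>>A rough sketch of the kind of writer/scanner loop that exposes this (not
>>the actual TestAcidGuarantees code; class, table and family names below
>>are made up, and the real test runs writers and scanners in separate
>>threads):
>>
>>import org.apache.hadoop.conf.Configuration;
>>import org.apache.hadoop.hbase.HBaseConfiguration;
>>import org.apache.hadoop.hbase.client.*;
>>import org.apache.hadoop.hbase.util.Bytes;
>>
>>public class MissingRowsRepro {
>>  // hypothetical table/family names, purely for illustration
>>  static final byte[] TABLE = Bytes.toBytes("repro");
>>  static final byte[] CF_A = Bytes.toBytes("a");
>>  static final byte[] CF_B = Bytes.toBytes("b");
>>
>>  public static void main(String[] args) throws Exception {
>>    Configuration conf = HBaseConfiguration.create();
>>    HTable table = new HTable(conf, TABLE);
>>    long written = 0;
>>    for (int i = 0; i < 100000; i++) {
>>      // writer: every put adds a brand-new row touching both families
>>      Put p = new Put(Bytes.toBytes(String.format("row-%08d", i)));
>>      p.add(CF_A, Bytes.toBytes("q"), Bytes.toBytes(i));
>>      p.add(CF_B, Bytes.toBytes("q"), Bytes.toBytes(i));
>>      table.put(p);
>>      written++;
>>
>>      if (i % 1000 == 0) {
>>        // scanner: periodic full scan; on 0.90 this occasionally
>>        // returns fewer rows than have already been written
>>        long seen = 0;
>>        ResultScanner rs = table.getScanner(new Scan());
>>        for (Result r : rs) seen++;
>>        rs.close();
>>        if (seen < written) {
>>          System.out.println("scan saw " + seen + " of " + written);
>>        }
>>      }
>>    }
>>    table.close();
>>  }
>>}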
>>
>>HBASE-2856/HBASE-4838 are probably the solution, although there's a
>>chance it's some other fix in 0.92 (ideas?)
>>
>>We're undecided whether backporting to 0.90 or upgrading the affected
>>clusters to 0.92 would be the better option.
>>Also, is there interest in this fix for 0.90?
>>
>>Thanks,
>>Cosmin
>>
>>On 2/6/12 6:25 PM, "Cosmin Lehene" <cl...@adobe.com> wrote:
>>
>>>Thanks Ted!
>>>
>>>I wonder if it would make more sense to port it to 0.90.X or upgrade to
>>>0.92.
>>>
>>>Cosmin
>>>
>>>On 2/2/12 5:03 PM, "Ted Yu" <yu...@gmail.com> wrote:
>>>
>>>>HBASE-4838 ports HBASE-2856 to 0.92
>>>>
>>>>FYI
>>>>
>>>>On Thu, Feb 2, 2012 at 4:46 PM, Cosmin Lehene <cl...@adobe.com>
>>>>wrote:
>>>>
>>>>> (sorry for the damaged subject :))
>>>>>
>>>>>
>>>>> Hey Jon,
>>>>> We have two column families.
>>>>> There are no filters and there's a full table scan. We're not skipping
>>>>> rows (rough sketch of the scan setup below).
>>>>> I did see, however, a single time that we had one qualifier "fault" in
>>>>> the job counters (the qualifier was missing, and it wasn't supposed to
>>>>> be missing). However, that was only once and it doesn't happen when we
>>>>> encounter missing rows.
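>>>>>
>>>>> For reference, the scan setup is roughly the following (a simplified
>>>>> sketch, not our actual code; table, family and class names are
>>>>> placeholders):
>>>>>
>>>>> import org.apache.hadoop.conf.Configuration;
>>>>> import org.apache.hadoop.hbase.HBaseConfiguration;
>>>>> import org.apache.hadoop.hbase.client.Result;
>>>>> import org.apache.hadoop.hbase.client.Scan;
>>>>> import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
>>>>> import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
>>>>> import org.apache.hadoop.hbase.mapreduce.TableMapper;
>>>>> import org.apache.hadoop.hbase.util.Bytes;
>>>>> import org.apache.hadoop.mapreduce.Job;
>>>>> import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;
>>>>>
>>>>> public class RollupJob {
>>>>>   // trivial mapper; the real per-row aggregation is irrelevant here
>>>>>   public static class RollupMapper
>>>>>       extends TableMapper<ImmutableBytesWritable, Result> {
>>>>>     protected void map(ImmutableBytesWritable key, Result value,
>>>>>         Context ctx) {
>>>>>       // per-row work would go here
>>>>>     }
>>>>>   }
>>>>>
>>>>>   public static void main(String[] args) throws Exception {
>>>>>     Configuration conf = HBaseConfiguration.create();
>>>>>     Job job = new Job(conf, "timeseries-rollup");
>>>>>     job.setJarByClass(RollupJob.class);
>>>>>
>>>>>     Scan scan = new Scan();                 // full table: no start/stop row
>>>>>     scan.addFamily(Bytes.toBytes("data"));  // both families, no filters
>>>>>     scan.addFamily(Bytes.toBytes("meta"));
>>>>>     scan.setCaching(500);
>>>>>     scan.setCacheBlocks(false);
>>>>>
>>>>>     TableMapReduceUtil.initTableMapperJob("timeseries", scan,
>>>>>         RollupMapper.class, ImmutableBytesWritable.class, Result.class,
>>>>>         job);
>>>>>     job.setNumReduceTasks(0);               // map-only in this sketch
>>>>>     job.setOutputFormatClass(NullOutputFormat.class);
>>>>>     System.exit(job.waitForCompletion(true) ? 0 : 1);
>>>>>   }
>>>>> }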
>>>>>
>>>>> We're getting this behavior consistently, although I couldn't figure
>>>>> out a way to reproduce it. I'll try running multiple instances of the
>>>>> job in parallel to figure out if that would affect the outcome.
>>>>> I'll probably have to add more debugging for the affected rows and dig
>>>>> deeper.
>>>>>
>>>>> HBASE-2856 is a pretty large issue - do you think it could be related
>>>>> to what I'm seeing? If so, it could help me reproduce it.
>>>>>
>>>>> Thanks,
>>>>> Cosmin
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On 2/1/12 11:30 PM, "Jonathan Hsieh" <jo...@cloudera.com> wrote:
>>>>>
>>>>> >Cosmin,
>>>>> >
>>>>> >How many column families do you have in this table? Are you using any
>>>>> >filters in your HBase scans? Are you skipping rows that may not have
>>>>> >qualifiers present?
>>>>> >
>>>>> >There are a few known issues with multi-CF atomicity, and a recent one
>>>>> >about flushes may be related to this problem. There's HBASE-2856, a fix
>>>>> >having to do with flushes, which is pretty intricate and only in 0.92.
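>>>>> >
>>>>> >To make the atomicity concern concrete: the worry is a scanner seeing
>>>>> >only part of a single put that spans both families, something like the
>>>>> >sketch below (table, family and qualifier names are illustrative only):
>>>>> >
>>>>> >import org.apache.hadoop.hbase.HBaseConfiguration;
>>>>> >import org.apache.hadoop.hbase.client.HTable;
>>>>> >import org.apache.hadoop.hbase.client.Put;
>>>>> >import org.apache.hadoop.hbase.util.Bytes;
>>>>> >
>>>>> >public class MultiCfPut {
>>>>> >  public static void main(String[] args) throws Exception {
>>>>> >    HTable table = new HTable(HBaseConfiguration.create(), "t");
>>>>> >    Put p = new Put(Bytes.toBytes("row1"));
>>>>> >    p.add(Bytes.toBytes("a"), Bytes.toBytes("q"), Bytes.toBytes("v1"));
>>>>> >    p.add(Bytes.toBytes("b"), Bytes.toBytes("q"), Bytes.toBytes("v2"));
>>>>> >    // a single row mutation across families "a" and "b"; a scan racing
>>>>> >    // with a flush on 0.90 may see one cell but not the other
>>>>> >    table.put(p);
>>>>> >    table.close();
>>>>> >  }
>>>>> >}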
>>>>> >
>>>>> >Jon.
>>>>> >
>>>>> >On Wed, Feb 1, 2012 at 8:46 PM, Cosmin Lehene <cl...@adobe.com>
>>>>>wrote:
>>>>> >
>>>>> >> We have an MR job that runs every few minutes on some time series
>>>>> >> data which is continuously updated (never deleted).
>>>>> >> Every few runs (in the range of tens to hundreds), the map task that
>>>>> >> covers the last region will get fewer input records (off by 500-5000
>>>>> >> rows) without any splits happening. This lower number of input
>>>>> >> records can persist for a few MR runs, but will eventually get back
>>>>> >> to the "correct" value.
>>>>> >>
>>>>> >> This drop can be seen in the "map input records" metric, and it's
>>>>> >> also correlated with the metrics computed by the MR job itself (so
>>>>> >> it's not an MR counter bug).
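>>>>> >>
>>>>> >> (As a sanity check independent of the MR counters, we can count rows
>>>>> >> in the suspect key range with a plain client-side scan - a rough
>>>>> >> sketch, with the table name and start key as placeholders:)
>>>>> >>
>>>>> >> import org.apache.hadoop.hbase.HBaseConfiguration;
>>>>> >> import org.apache.hadoop.hbase.client.HTable;
>>>>> >> import org.apache.hadoop.hbase.client.Result;
>>>>> >> import org.apache.hadoop.hbase.client.ResultScanner;
>>>>> >> import org.apache.hadoop.hbase.client.Scan;
>>>>> >> import org.apache.hadoop.hbase.util.Bytes;
>>>>> >>
>>>>> >> public class RangeRowCount {
>>>>> >>   public static void main(String[] args) throws Exception {
>>>>> >>     HTable table = new HTable(HBaseConfiguration.create(), "timeseries");
>>>>> >>     Scan scan = new Scan(Bytes.toBytes(args[0])); // start key of last region
>>>>> >>     scan.setCaching(1000);
>>>>> >>     long rows = 0;
>>>>> >>     ResultScanner rs = table.getScanner(scan);
>>>>> >>     for (Result r : rs) rows++;
>>>>> >>     rs.close();
>>>>> >>     System.out.println("rows in range: " + rows);
>>>>> >>     table.close();
>>>>> >>   }
>>>>> >> }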
>>>>> >>
>>>>> >> There are no exceptions in the MR job or in the region server, and
>>>>> >> this doesn't seem to be correlated with any compaction, split or
>>>>> >> region movement.
>>>>> >> The only "variable" in this scenario is that new data gets injected
>>>>> >> continuously (plus the actual MR job, which is idempotent).
>>>>> >>
>>>>> >> This entire puzzle takes place on HBase 0.90.5-ish (12 Dec 2011) on
>>>>> >> top of Hadoop cdh3u2.
>>>>> >>
>>>>> >> Cosmin
>>>>> >>
>>>>> >>
>>>>> >>
>>>>> >>
>>>>> >
>>>>> >
>>>>> >--
>>>>> >// Jonathan Hsieh (shay)
>>>>> >// Software Engineer, Cloudera
>>>>> >// jon@cloudera.com
>>>>>
>>
>