Posted to user@hbase.apache.org by Ted Yu <yu...@gmail.com> on 2015/02/25 04:01:19 UTC

Re: HBase scan time range, inconsistency

What's the TTL setting for your table?

Which HBase release are you using?

Was there compaction in between the scans?

Thanks
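
For anyone following along, the TTL part of these questions can be answered programmatically with something like this sketch (the table name is invented; 0.94 client API):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.client.HBaseAdmin;
    import org.apache.hadoop.hbase.util.Bytes;

    public class TtlCheck {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            HBaseAdmin admin = new HBaseAdmin(conf);
            HTableDescriptor desc = admin.getTableDescriptor(Bytes.toBytes("my_table"));
            for (HColumnDescriptor family : desc.getColumnFamilies()) {
                // HConstants.FOREVER (Integer.MAX_VALUE) is the "set to the max" default
                System.out.println(family.getNameAsString() + " TTL=" + family.getTimeToLive());
            }
            admin.close();
        }
    }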


> On Feb 24, 2015, at 2:32 PM, Stephen Durfey <sj...@gmail.com> wrote:
> 
> I have some code that accepts a time range and looks for data written to an HBase table during that range. If anything has been written for a row during that range, the row key is saved off, and sometime later in the pipeline those row keys are used to extract the entire row. I’m testing against a fixed time range at some point in the past. This is being done as part of a Map/Reduce job (using Apache Crunch). I have some job counters set up to keep track of the number of rows extracted. Since the time range is fixed, I would expect the scan to return the same number of rows with data in the provided time range. However, I am seeing this number vary from scan to scan (bouncing between increasing and decreasing).
> 
> I’ve eliminated the possibility that data is being pulled in from outside the time range. I did this by scanning for one column qualifier (using only that qualifier to decide whether a row had data in the time range), getting the timestamp on the cell for each returned row, and comparing it against the begin and end times for the scan; I didn’t find any that fell outside the range. I’ve observed some row keys show up in the 1st scan, then drop out in the 2nd scan, only to show back up again in the 3rd scan (all with the exact same Scan object). These numbers have varied wildly, from being off by 2-3 between subsequent scans to 40-row increases followed by a drop of 70 rows.
> 
> I’m kind of looking for ideas to track down what could be causing this to happen. The code itself is pretty simple: it creates a Scan object, scans the table, extracts the row key in the map phase, and at the end dumps the keys to a directory in HDFS.
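
For reference, the kind of Scan described above might be built like the sketch below, a hypothetical reconstruction (the family and qualifier names are invented) against the HBase 0.94 client API that comes up later in the thread:

    import java.io.IOException;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.util.Bytes;

    public class TimeRangeScanSketch {
        // Build the fixed time-range Scan that feeds the row-key extraction job.
        static Scan buildScan(long beginMillis, long endMillis) throws IOException {
            Scan scan = new Scan();
            // Only cells with timestamps in [beginMillis, endMillis) are returned.
            scan.setTimeRange(beginMillis, endMillis);
            // The single qualifier used as the "row had data in the range" marker.
            scan.addColumn(Bytes.toBytes("data"), Bytes.toBytes("marker"));
            return scan;
        }
    }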

Re: HBase scan time range, inconsistency

Posted by ramkrishna vasudevan <ra...@gmail.com>.
In your case, since the TTL is set to the max and you have a time range in
your scan, it would go with the first case. Every time, the scan tries to
fetch only one version (the latest) for the given record, but if the latest
version does not fall within the time range, those cells are skipped. My
doubt, though, is that if there are no new updates and the same scan queries
are just running in a loop, there should be no difference in the output.

Are you sure about the scan queries you have formed? Is the time range
always calculated from System.currentTimeMillis(), or is it formed from
two known boundary values?
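
To make that question concrete, here is a sketch (with invented timestamps) of the two ways the range could be formed; the fixed form should produce repeatable scans, while the moving form changes on every run:

    import java.io.IOException;
    import org.apache.hadoop.hbase.client.Scan;

    public class RangeFormingSketch {
        // Two known boundary values: identical on every run.
        static Scan fixedRange() throws IOException {
            Scan scan = new Scan();
            scan.setTimeRange(1422748800000L, 1424822400000L); // Feb 1 to Feb 25, 2015 (UTC)
            return scan;
        }

        // Boundary from the clock: the end of the range differs on every run.
        static Scan movingRange() throws IOException {
            Scan scan = new Scan();
            scan.setTimeRange(1422748800000L, System.currentTimeMillis());
            return scan;
        }
    }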





Re: HBase scan time range, inconsistency

Posted by Stephen Durfey <sj...@gmail.com>.
Right, it is 1 by default, but shouldn't the Scan return a version from
within that time range if there is one? Without the number of versions
specified, I thought it returned the most recent version. Is that the most
recent version within the time range, or the most recent version in the
history of the cell? In the first case, I would expect the count to still
be the same across subsequent scans, since a version exists. If it is the
second case, I would expect the number to be lower than the actual count,
but to be consistently lower, rather than fluctuating all over the place
like I was seeing, since the table isn't actively being updated with new
rows and updates.
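
Per Ram's explanation earlier on this page, the difference between the two configurations in question looks roughly like this sketch (illustrative, not the poster's actual job code; 0.94 client API):

    import java.io.IOException;
    import org.apache.hadoop.hbase.client.Scan;

    public class VersionScanSketch {
        // Default: maxVersions is 1, so only the newest version of a cell is a
        // candidate, and a row can be skipped when that newest version falls
        // outside the requested range.
        static Scan newestVersionOnly(long begin, long end) throws IOException {
            Scan scan = new Scan();
            scan.setTimeRange(begin, end);
            return scan;
        }

        // Requesting all versions lets an older version inside the range match.
        static Scan anyVersionInRange(long begin, long end) throws IOException {
            Scan scan = new Scan();
            scan.setTimeRange(begin, end);
            scan.setMaxVersions(); // all versions, instead of the default of 1
            return scan;
        }
    }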


Re: HBase scan time range, inconsistency

Posted by Ted Yu <yu...@gmail.com>.
The maxVersions field of the Scan object is 1 by default:

  private int maxVersions = 1;

Cheers
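
A minimal illustration of that default and how a caller overrides it (0.94 API; the snippet itself is mine, not from the thread):

    import org.apache.hadoop.hbase.client.Scan;

    public class MaxVersionsDefault {
        public static void main(String[] args) {
            Scan scan = new Scan();
            System.out.println(scan.getMaxVersions()); // prints 1, the default quoted above
            scan.setMaxVersions(3);  // keep up to three versions per cell
            scan.setMaxVersions();   // or all versions
        }
    }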


Re: HBase scan time range, inconsistency

Posted by Stephen Durfey <sj...@gmail.com>.
> 1) What do you mean by saying you have a partitioned HBase table?
> (Regions and partitions are not the same)


By partitions, I just mean logical partitions, using the row key to keep
data from separate data sources apart from each other.

I think the issue may be resolved now, but it isn't obvious to me why the
change works. The table is set to save the max number of versions, but
the number of versions is not specified in the Scan object. Once I changed
the Scan to request the max number of versions, the counts remained the
same across all subsequent job runs. Can anyone provide some insight as to
why this is the case?
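
As an illustration of that layout (the partition id below is invented; the /partition_id/person_id key structure is described elsewhere in the thread), one logical partition can be scanned by bounding the row keys on the partition prefix:

    import java.io.IOException;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.util.Bytes;

    public class PartitionScanSketch {
        // Scan a single logical partition by row-key prefix.
        static Scan partitionScan(String partitionId, long begin, long end) throws IOException {
            byte[] startRow = Bytes.toBytes("/" + partitionId + "/");
            byte[] stopRow = Bytes.toBytes("/" + partitionId + "0"); // '0' is the byte after '/'
            Scan scan = new Scan(startRow, stopRow);
            scan.setTimeRange(begin, end);
            scan.setMaxVersions(); // the change described above
            return scan;
        }
    }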


Re: HBase scan time range, inconsistency

Posted by Michael Segel <mi...@hotmail.com>.
Ok… 

Silly question time… so just humor me for a second.

1) What do you mean by saying you have a partitioned HBase table?  (Regions and partitions are not the same)

2) There’s a question of the isolation level during the scan. What happens when there is a compaction running or there’s RLL (row-level locking) taking place?

Does your scan get locked/blocked? Does it skip the row? 
(This should be documented.) 
Do you count the number of rows scanned when building the list of rows that need to be processed further? 






The opinions expressed here are mine, while they may reflect a cognitive thought, that is purely accidental. 
Use at your own risk. 
Michael Segel
michael_segel (AT) hotmail.com






Re: HBase scan time range, inconsistency

Posted by Stephen Durfey <sj...@gmail.com>.
> Are you writing any Deletes? Are you writing any duplicates?

No physical deletes are occurring in my data, and there is a very real
possibility of duplicates.

> How is the partitioning done?

The key structure would be /partition_id/person_id .... I'm dealing with
clinical data, with a data source identified by the partition, and the
person data is associated with that particular partition at load time.

> Are you doing the column filtering with a custom filter or one of the
> prepackaged ones?

They appear to be all prepackaged filters: FamilyFilter, KeyOnlyFilter,
QualifierFilter, and ColumnPrefixFilter are used under various conditions,
depending upon what is requested on the Scan object.
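
A sketch of how those prepackaged filters might be combined (the family name and column prefix are invented; the actual abstraction layer is not shown in the thread):

    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.filter.ColumnPrefixFilter;
    import org.apache.hadoop.hbase.filter.FilterList;
    import org.apache.hadoop.hbase.filter.KeyOnlyFilter;
    import org.apache.hadoop.hbase.util.Bytes;

    public class PrefixFilterSketch {
        static Scan prefixScan(String userPrefix) {
            FilterList filters = new FilterList(FilterList.Operator.MUST_PASS_ALL);
            // Keep only columns starting with the user-supplied prefix...
            filters.addFilter(new ColumnPrefixFilter(Bytes.toBytes(userPrefix)));
            // ...and return keys only, since just the row keys are collected.
            filters.addFilter(new KeyOnlyFilter());
            Scan scan = new Scan();
            scan.addFamily(Bytes.toBytes("data")); // invented family name
            scan.setFilter(filters);
            return scan;
        }
    }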


On Wed, Feb 25, 2015 at 4:35 PM, Sean Busbey <bu...@cloudera.com> wrote:

> Are you writing any Deletes? Are you writing any duplicates?
>
> How is the partitioning done?
>
> What does the entire key structure look like?
>
> Are you doing the column filtering with a custom filter or one of the
> prepackaged ones?
>
> On Wed, Feb 25, 2015 at 12:57 PM, Stephen Durfey <sj...@gmail.com> wrote:
>
> > > What's the TTL setting for your table?
> > >
> > > Which HBase release are you using?
> > >
> > > Was there compaction in between the scans?
> > >
> > > Thanks
> >
> > The TTL is set to the max. The HBase version is 0.94.6-cdh4.4.0. I don’t
> > want to say compactions aren’t a factor, but the jobs are short lived
> > (4-5 minutes), and I have run them frequently over the last couple of
> > days trying to gather stats around what was being extracted, and trying
> > to find the difference and intersection in row keys between job runs.
> >
> > > > These numbers have varied wildly, from being off by 2-3 between
> > > > subsequent scans to 40 row increases, followed by a drop of 70 rows.
> > >
> > > When you say there is a variation in the number of rows retrieved -
> > > the 40 rows that got increased - are those rows in the expected time
> > > range? Or is the system retrieving some rows which are not in the
> > > specified time range?
> > >
> > > And when the rows drop by 70, did any row which needed to be retrieved
> > > get missed out?
> >
> > The best I can tell, if there is an increase in counts, those rows are
> > not coming from outside of the time range. In the job, I am maintaining
> > a list of rows that have a timestamp outside of my provided time range,
> > and then writing those out to hdfs at the end of the map task. So far,
> > nothing has been written out.
> >
> > > Any filters in your scan?
> > >
> > > Regards
> > > Ram
> >
> > There are some column filters. There is an API abstraction on top of
> > HBase that I am using to allow users to easily extract data from columns
> > that start with a provided column prefix. So, the column filters are in
> > place to ensure I am only getting back data from columns that start with
> > the provided prefix.
> >
> > To add a little more detail, my row keys are separated out by partition.
> > At periodic times (through oozie), data is loaded from a source into the
> > appropriate partition. I ran some scans against a partition that hadn't
> > been updated in almost a year (with a scan range around the times of the
> > 2nd to last load into the table), and the row key counts were consistent
> > across multiple scans. I chose another partition that is actively being
> > updated once a day. I chose a scan time around the 4th most recent load,
> > and the results were inconsistent from scan to scan (fluctuating up and
> > down). Setting the begin time to 4 days in the past and the end time on
> > the scan range to 'right now', using System.currentTimeMillis() (with
> > the time being after the daily load), the results also fluctuated up and
> > down. So, it kind of seems like there is some sort of temporal recency
> > that is causing the counts to fluctuate.
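
The out-of-range check described in the quoted reply might look something like this in the map phase (a sketch with invented names; the actual Crunch job is not shown):

    import java.util.List;
    import org.apache.hadoop.hbase.KeyValue;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.util.Bytes;

    public class RangeCheckSketch {
        // Collect row keys whose returned cells fall outside the requested
        // range; Scan.setTimeRange uses a half-open interval, [begin, end).
        static void recordOutOfRange(Result result, long begin, long end,
                                     List<String> outOfRange) {
            for (KeyValue kv : result.raw()) {
                long ts = kv.getTimestamp();
                if (ts < begin || ts >= end) {
                    outOfRange.add(Bytes.toStringBinary(kv.getRow()));
                }
            }
        }
    }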

Re: HBase scan time range, inconsistency

Posted by Sean Busbey <bu...@cloudera.com>.
Are you writing any Deletes? Are you writing any duplicates?

How is the partitioning done?

What does the entire key structure look like?

Are you doing the column filtering with a custom filter or one of the
prepackaged ones?
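
One quick way to rule Deletes in or out, by the way, is a raw scan over
the same window; a raw scan hands back the delete markers that a normal
scan hides. A rough sketch against the 0.94 client API (the table name
and the range boundaries below are placeholders, not your values):

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;

public class TombstoneCheck {
  public static void main(String[] args) throws IOException {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "my_table");  // placeholder table name
    Scan scan = new Scan();
    scan.setTimeRange(1424736000000L, 1424822400000L);  // placeholder range
    scan.setRaw(true);      // also return delete markers and deleted cells
    scan.setMaxVersions();  // keep every version, nothing collapsed
    ResultScanner scanner = table.getScanner(scan);
    try {
      for (Result r : scanner) {
        for (KeyValue kv : r.raw()) {
          if (kv.isDelete()) {  // any tombstone inside the window
            System.out.println("delete marker: " + kv);
          }
        }
      }
    } finally {
      scanner.close();
      table.close();
    }
  }
}

If markers do turn up inside the window, compaction timing could account
for rows flapping in and out between runs.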

On Wed, Feb 25, 2015 at 12:57 PM, Stephen Durfey <sj...@gmail.com> wrote:

> >
> > What's the TTL setting for your table ?
> >
> > Which hbase release are you using ?
> >
> > Was there compaction in between the scans ?
> >
> > Thanks
> >
>
> The TTL is set to the max. The HBase version is 0.94.6-cdh4.4.0. I don’t
> want to say compactions aren’t a factor, but the jobs are short lived (4-5
> minutes), and I have run them frequently over the last couple of days
> trying to gather stats around what was being extracted, and trying to find
> the difference and intersection in row keys between job runs.
>
> > > These numbers have varied wildly, from being off by 2-3 between
> > > subsequent scans to 40 row increases, followed by a drop of 70 rows.
> > When you say there is a variation in the number of rows retrieved - the
> > 40 rows that got increased - are those rows in the expected time range?
> > Or is the system retrieving some rows which are not in the specified
> > time range?
> >
> > And when the rows drop by 70, did any row that should have been
> > retrieved get missed?
> >
>
> As best I can tell, if there is an increase in counts, those rows are not
> coming from outside of the time range. In the job, I am maintaining a list
> of rows that have a timestamp outside of my provided time range, and then
> writing those out to hdfs at the end of the map task. So far, nothing has
> been written out.
>
> > Any filters in your scan?
> >
> > Regards
> > Ram
> >
>
> There are some column filters. There is an API abstraction on top of hbase
> that I am using to allow users to easily extract data from columns that
> start with a provided column prefix. So, the column filters are in place to
> ensure I am only getting back data from columns that start with the
> provided prefix.
>
> To add a little more detail, my row keys are separated out by partition. At
> periodic times (through oozie), data is loaded from a source into the
> appropriate partition. I ran some scans against a partition that hadn't
> been updated in almost a year (with a scan range around the times of the
> 2nd to last load into the table), and the row key counts were consistent
> across multiple scans. I chose another partition that is actively being
> updated once a day. I chose a scan time around the 4th most recent load,
> and the results were inconsistent from scan to scan (fluctuating up and
> down). Setting the begin time to 4 days in the past and the end time on
> the scan range to 'right now', using System.currentTimeMillis() (with the
> time being after the daily load), the results also fluctuated up and
> down. So, it kind of seems like there is some sort of temporal recency
> effect that is causing the counts to fluctuate.
>



-- 
Sean

Re: HBase scan time range, inconsistency

Posted by Stephen Durfey <sj...@gmail.com>.
>
> What's the TTL setting for your table ?
>
> Which hbase release are you using ?
>
> Was there compaction in between the scans ?
>
> Thanks
>

The TTL is set to the max. The HBase version is 0.94.6-cdh4.4.0. I don’t
want to say compactions aren’t a factor, but the jobs are short lived (4-5
minutes), and I have run them frequently over the last couple of days
trying to gather stats around what was being extracted, and trying to find
the difference and intersection in row keys between job runs.
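
For what it's worth, the diffing itself is nothing fancy, roughly the
following, where the input paths are placeholders for the row-key dumps
the job writes out:

import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashSet;
import java.util.Set;

public class RowKeyDiff {
  public static void main(String[] args) throws IOException {
    Charset utf8 = Charset.forName("UTF-8");
    // Row-key dumps from two runs of the same scan; paths are placeholders.
    Set<String> run1 = new HashSet<String>(
        Files.readAllLines(Paths.get("run1-rowkeys.txt"), utf8));
    Set<String> run2 = new HashSet<String>(
        Files.readAllLines(Paths.get("run2-rowkeys.txt"), utf8));

    Set<String> dropped = new HashSet<String>(run1);
    dropped.removeAll(run2);   // keys seen in run 1 but gone in run 2

    Set<String> appeared = new HashSet<String>(run2);
    appeared.removeAll(run1);  // keys that newly showed up in run 2

    System.out.println("dropped: " + dropped.size());
    System.out.println("appeared: " + appeared.size());
  }
}

That is how I spotted keys showing up in one run, dropping out of the
next, and then coming back.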

> > These numbers have varied wildly, from being off by 2-3 between
> > subsequent scans to 40 row increases, followed by a drop of 70 rows.
> When you say there is a variation in the number of rows retrieved - the 40
> rows that got increased - are those rows in the expected time range? Or is
> the system retrieving some rows which are not in the specified time range?
>
> And when the rows drop by 70, did any row that should have been retrieved
> get missed?
>

As best I can tell, if there is an increase in counts, those rows are not
coming from outside of the time range. In the job, I am maintaining a list
of rows that have a timestamp outside of my provided time range, and then
writing those out to hdfs at the end of the map task. So far, nothing has
been written out.
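
The check in the map task amounts to something like this (a trimmed-down
sketch; in the real job it runs inside a Crunch DoFn and the boundary
values come from the scan):

import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class TimeRangeCheck {
  // Collect row keys with any cell outside [begin, end); note that
  // Scan.setTimeRange treats begin as inclusive and end as exclusive.
  public static List<String> outOfRange(Result result, long begin, long end) {
    List<String> bad = new ArrayList<String>();
    for (KeyValue kv : result.raw()) {
      long ts = kv.getTimestamp();
      if (ts < begin || ts >= end) {
        bad.add(Bytes.toStringBinary(kv.getRow()));
      }
    }
    return bad;
  }
}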

> Any filters in your scan?
>
> Regards
> Ram
>

There are some column filters. There is an API abstraction on top of hbase
that I am using to allow users to easily extract data from columns that
start with a provided column prefix. So, the column filters are in place to
ensure I am only getting back data from columns that start with the
provided prefix.
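
Concretely, the scan the abstraction ends up building looks roughly like
this (a simplified sketch; the family name and prefix are placeholders,
the real values come from the caller):

import java.io.IOException;

import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.ColumnPrefixFilter;
import org.apache.hadoop.hbase.util.Bytes;

public class PrefixScanBuilder {
  public static Scan build(long begin, long end) throws IOException {
    Scan scan = new Scan();
    scan.addFamily(Bytes.toBytes("data"));  // placeholder column family
    // Only return columns whose qualifier starts with the given prefix.
    scan.setFilter(new ColumnPrefixFilter(Bytes.toBytes("user_prefix_")));
    scan.setTimeRange(begin, end);          // [begin, end)
    return scan;
  }
}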

To add a little more detail, my row keys are separated out by partition. At
periodic times (through oozie), data is loaded from a source into the
appropriate partition. I ran some scans against a partition that hadn't
been updated in almost a year (with a scan range around the times of the
2nd to last load into the table), and the row key counts were consistent
across multiple scans. I chose another partition that is actively being
updated once a day. I chose a scan time around the 4th most recent load,
and the results were inconsistent from scan to scan (fluctuating up and
down). Setting the begin time to 4 days in the past and the end time on the
scan range to 'right now', using System.currentTimeMillis() (with the time
being after the daily load), the results also fluctuated up and down. So, it
kind of seems like there is some sort of temporal recency effect that is
causing the counts to fluctuate.
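
To be explicit about the two variants I tried (the epoch values here are
made-up placeholders):

import java.io.IOException;

import org.apache.hadoop.hbase.client.Scan;

public class RangeVariants {
  public static void main(String[] args) throws IOException {
    // Variant 1: two fixed, known boundaries; the case I expected to
    // return the same rows on every run.
    Scan fixedRange = new Scan();
    fixedRange.setTimeRange(1424736000000L, 1424822400000L);

    // Variant 2: the end boundary pinned to "right now", recomputed on
    // each run, with the begin boundary four days back.
    Scan movingRange = new Scan();
    long now = System.currentTimeMillis();
    movingRange.setTimeRange(now - 4L * 24 * 60 * 60 * 1000, now);
  }
}

Both variants showed the fluctuation, so the moving end boundary alone
does not explain it.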




Re: HBase scan time range, inconsistency

Posted by ramkrishna vasudevan <ra...@gmail.com>.
>> These numbers have varied wildly, from being off by 2-3 between
>> subsequent scans to 40 row increases, followed by a drop of 70 rows.
When you say there is a variation in the number of rows retrieved - the 40
rows that got increased - are those rows in the expected time range? Or is
the system retrieving some rows which are not in the specified time range?

And when the rows drop by 70, did any row that should have been retrieved
get missed?

Any filters in your scan?

Regards
Ram

On Wed, Feb 25, 2015 at 8:31 AM, Ted Yu <yu...@gmail.com> wrote:

> What's the TTL setting for your table ?
>
> Which hbase release are you using ?
>
> Was there compaction in between the scans ?
>
> Thanks
>