Posted to user@hbase.apache.org by Shrijeet Paliwal <sh...@rocketfuel.com> on 2012/08/06 08:42:38 UTC
Find rows which do not have any of the given columns
Hi All,
I am writing a job which finds rows that do not have a cell corresponding
to any of the columns in the given set of columns.
This is how I have configured my scan (a combination of QualifierFilters
and a SkipFilter):
// columns is a CSV string of column names
Iterable<String> columnsSet = Splitter.on(',').split(columns);
List<Filter> qualifierFilters = new ArrayList<Filter>();
for (String qual : columnsSet) {
    // Each filter passes only cells whose qualifier differs from qual
    qualifierFilters.add(new QualifierFilter(CompareOp.NOT_EQUAL,
        new BinaryComparator(Bytes.toBytes(qual))));
}
// SkipFilter drops the whole row as soon as one cell fails the wrapped filter
Filter skipFilter = new SkipFilter(
    new FilterList(Operator.MUST_PASS_ALL, qualifierFilters));
Scan scan = new Scan();
scan.addFamily(Bytes.toBytes(family));
scan.setCacheBlocks(false);
scan.setCaching(1000);
scan.setFilter(skipFilter);
// args[4] and args[5] are the min (inclusive) and max (exclusive) timestamps
scan.setTimeRange(Long.valueOf(args[4]), Long.valueOf(args[5]));
In my test table the scan worked as expected, but in the production run I got
rows which had cells for one of the given qualifiers (not expected).
Can someone help me spot the mistake?
-Shrijeet
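
For readers who want to see the intended row-level semantics in isolation, here is a minimal plain-Java sketch of what the filter combination above is meant to compute. It does not use HBase at all: the class `SkipRowModel`, the method `rowsWithoutColumns`, and the sample data are hypothetical names introduced only for illustration. The model assumes that a cell passes when its qualifier differs from every listed name, and that one failing cell drops the whole row.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Plain-Java model of the filter combination; not HBase code. */
class SkipRowModel {
    /**
     * rows maps a row key to the qualifiers of its cells.
     * A cell passes when its qualifier is not in the excluded set
     * (the MUST_PASS_ALL list of NOT_EQUAL checks); a single failing
     * cell drops the whole row (the SkipFilter part).
     */
    static List<String> rowsWithoutColumns(Map<String, List<String>> rows,
                                           Set<String> excluded) {
        List<String> kept = new ArrayList<String>();
        for (Map.Entry<String, List<String>> row : rows.entrySet()) {
            boolean allCellsPass = true;
            for (String qualifier : row.getValue()) {
                if (excluded.contains(qualifier)) {
                    allCellsPass = false; // one bad cell is enough to skip the row
                    break;
                }
            }
            if (allCellsPass) {
                kept.add(row.getKey());
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        Map<String, List<String>> rows = new LinkedHashMap<String, List<String>>();
        rows.put("row1", Arrays.asList("a", "b"));
        rows.put("row2", Arrays.asList("c"));
        rows.put("row3", Arrays.asList("c", "b"));
        Set<String> excluded = new HashSet<String>(Arrays.asList("a", "b"));
        System.out.println(rowsWithoutColumns(rows, excluded)); // prints [row2]
    }
}
```

The model only sees the cells it is handed. Anything that hides cells before this check runs changes the answer, which is relevant to the rest of the thread.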
Re: [potential bug] Find rows which do not have any of the given columns
Posted by Shrijeet Paliwal <sh...@rocketfuel.com>.
Zahoor,
Thank you for the input. I still feel it is counterintuitive.
-Shrijeet
On Tue, Aug 7, 2012 at 2:57 AM, J Mohamed Zahoor <jm...@gmail.com> wrote:
> Hi
>
> Nice one. But I think this is valid behavior.
> Time ranges are something which qualifies certain rows to be made available
> to the client (something which is related to MVCC).
> Once certain rows are qualified, the filters are applied to them.
>
> The fact that both can be set simultaneously on a "Scan" object hints that
> they are orthogonal.
>
> ./zahoor
Re: [potential bug] Find rows which do not have any of the given columns
Posted by J Mohamed Zahoor <jm...@gmail.com>.
Hi
Nice one. But I think this is valid behavior.
Time ranges are something which qualifies certain rows to be made available
to the client (something which is related to MVCC).
Once certain rows are qualified, the filters are applied to them.
The fact that both can be set simultaneously on a "Scan" object hints that
they are orthogonal.
./zahoor
On Tue, Aug 7, 2012 at 2:10 AM, Shrijeet Paliwal <sh...@rocketfuel.com> wrote:
> - user
> +dev
>
> Hi Devs,
>
> Please follow the discussion to get full context. tl;dr: "Did a scan with
> a time range and filters, and the scan output was incorrect. Repeated the
> scan with the filter only, and the scan output was correct."
>
> HBase version: 0.90.3
> Hadoop: CDH3u0
> Issue:
> A scan that is set with both a time range and a filter can behave in
> an unintuitive way. I am calling it unintuitive instead of wrong, since I do
> not know if this is a known limitation of scan. Picture a filter setup like
> mine: "filter out rows which have cells pertaining to certain columns". This
> filter is set on a scan which has a time range constraint as well. AFAIK
> we skip HFiles based on metadata when dealing with time ranges. Suppose a
> region has two HFiles, and one of them has cells for the unwanted columns but
> the other one does not. We may then get an incorrect result depending on how
> the time range is set (if the time range scan optimizer skips the HFile
> containing the unwanted cells).
>
> Does this sound like a valid issue? Also, I can see this happening with more
> than one kind of SkipFilter.
>
> -Shrijeet
[potential bug] Find rows which do not have any of the given columns
Posted by Shrijeet Paliwal <sh...@rocketfuel.com>.
- user
+dev
Hi Devs,
Please follow the discussion to get full context. tl;dr: "Did a scan with
a time range and filters, and the scan output was incorrect. Repeated the
scan with the filter only, and the scan output was correct."
HBase version: 0.90.3
Hadoop: CDH3u0
Issue:
A scan that is set with both a time range and a filter can behave in
an unintuitive way. I am calling it unintuitive instead of wrong, since I do
not know if this is a known limitation of scan. Picture a filter setup like
mine: "filter out rows which have cells pertaining to certain columns". This
filter is set on a scan which has a time range constraint as well. AFAIK
we skip HFiles based on metadata when dealing with time ranges. Suppose a
region has two HFiles, and one of them has cells for the unwanted columns but
the other one does not. We may then get an incorrect result depending on how
the time range is set (if the time range scan optimizer skips the HFile
containing the unwanted cells).
Does this sound like a valid issue? Also, I can see this happening with more
than one kind of SkipFilter.
-Shrijeet
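
The two-HFile scenario described above can be reproduced with a toy model. This is a sketch only: `TimeRangePruningDemo`, `Cell`, `canSkipFile`, `scan`, and `sampleFiles` are hypothetical names, the files are plain in-memory lists, and the pruning rule (skip a file whose min/max timestamps do not overlap the half-open range) is an assumption made for illustration rather than a description of the 0.90.3 internals.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Toy model of the two-HFile scenario; none of these types are HBase classes. */
class TimeRangePruningDemo {
    static final class Cell {
        final String row;
        final String qualifier;
        final long ts;

        Cell(String row, String qualifier, long ts) {
            this.row = row;
            this.qualifier = qualifier;
            this.ts = ts;
        }
    }

    /** True when the file's timestamp span does not overlap [minTs, maxTs). */
    static boolean canSkipFile(List<Cell> file, long minTs, long maxTs) {
        long fileMin = Long.MAX_VALUE;
        long fileMax = Long.MIN_VALUE;
        for (Cell c : file) {
            fileMin = Math.min(fileMin, c.ts);
            fileMax = Math.max(fileMax, c.ts);
        }
        return fileMax < minTs || fileMin >= maxTs;
    }

    /** Returns rows whose visible cells all avoid the unwanted qualifiers. */
    static Set<String> scan(List<List<Cell>> files, Set<String> unwanted,
                            long minTs, long maxTs) {
        Map<String, Boolean> rowPasses = new HashMap<String, Boolean>();
        for (List<Cell> file : files) {
            if (canSkipFile(file, minTs, maxTs)) {
                continue; // whole file pruned; its cells never reach the check
            }
            for (Cell c : file) {
                if (c.ts < minTs || c.ts >= maxTs) {
                    continue; // cell is outside the time range
                }
                boolean cellPasses = !unwanted.contains(c.qualifier);
                rowPasses.merge(c.row, cellPasses, Boolean::logicalAnd);
            }
        }
        Set<String> result = new TreeSet<String>();
        for (Map.Entry<String, Boolean> e : rowPasses.entrySet()) {
            if (e.getValue()) {
                result.add(e.getKey());
            }
        }
        return result;
    }

    /** Older file holds the unwanted cell for r1; newer file does not. */
    static List<List<Cell>> sampleFiles() {
        List<Cell> older = Arrays.asList(
            new Cell("r1", "bad", 100L), new Cell("r2", "ok", 100L));
        List<Cell> newer = Arrays.asList(
            new Cell("r1", "ok", 200L), new Cell("r2", "ok", 200L));
        return Arrays.asList(older, newer);
    }

    public static void main(String[] args) {
        Set<String> unwanted = Collections.singleton("bad");
        // No effective time restriction: r1 is skipped because of its "bad" cell.
        System.out.println(scan(sampleFiles(), unwanted, 0L, Long.MAX_VALUE)); // prints [r2]
        // Range excludes the older file: the "bad" cell is never seen.
        System.out.println(scan(sampleFiles(), unwanted, 150L, 300L)); // prints [r1, r2]
    }
}
```

In this model the per-cell timestamp check alone would give the same second result, so file pruning is only an early exit. The surprising output comes from the time range hiding the unwanted cell before the row-level check sees it.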
On Mon, Aug 6, 2012 at 11:38 AM, Shrijeet Paliwal
<sh...@rocketfuel.com> wrote:
> It seems setting the time range is the problem. I was doing:
> scan.setTimeRange(Long.valueOf(args[4]), Long.valueOf(args[5]));
>
> I was working on the assumption that filter logic runs before scan logic; in
> other words, a KV dropped by a filter will not make it to the scan. In the
> case of a time range this might not be true.
>
> -Shrijeet
Re: Find rows which do not have any of the given columns
Posted by Shrijeet Paliwal <sh...@rocketfuel.com>.
It seems setting the time range is the problem. I was doing:
scan.setTimeRange(Long.valueOf(args[4]), Long.valueOf(args[5]));

I was working on the assumption that filter logic runs before scan logic; in
other words, a KV dropped by a filter will not make it to the scan. In the
case of a time range this might not be true.
-Shrijeet
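
One way to make the two constraints independent is to evaluate them separately. The sketch below is plain Java, not HBase code, and it assumes an intent the thread does not state: return rows that have at least one cell inside the time range, but only if the row holds none of the unwanted qualifiers at any timestamp. `TwoPassCheck`, `Cell`, and `rowsInRangeWithoutColumns` are hypothetical names.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

/** Plain-Java sketch of checking time range and columns independently. */
class TwoPassCheck {
    static final class Cell {
        final String row;
        final String qualifier;
        final long ts;

        Cell(String row, String qualifier, long ts) {
            this.row = row;
            this.qualifier = qualifier;
            this.ts = ts;
        }
    }

    /**
     * Rows with at least one cell in [minTs, maxTs), minus rows that hold
     * an unwanted qualifier at any timestamp.
     */
    static Set<String> rowsInRangeWithoutColumns(List<Cell> cells, Set<String> unwanted,
                                                 long minTs, long maxTs) {
        Set<String> candidates = new TreeSet<String>();
        Set<String> tainted = new HashSet<String>();
        for (Cell c : cells) {
            if (c.ts >= minTs && c.ts < maxTs) {
                candidates.add(c.row); // row is active in the time range
            }
            if (unwanted.contains(c.qualifier)) {
                tainted.add(c.row); // checked regardless of timestamp
            }
        }
        candidates.removeAll(tainted);
        return candidates;
    }

    public static void main(String[] args) {
        List<Cell> cells = Arrays.asList(
            new Cell("r1", "bad", 100L),
            new Cell("r1", "ok", 200L),
            new Cell("r2", "ok", 200L),
            new Cell("r3", "ok", 50L));
        Set<String> unwanted = Collections.singleton("bad");
        System.out.println(rowsInRangeWithoutColumns(cells, unwanted, 150L, 300L)); // prints [r2]
    }
}
```

Against a real table this would presumably mean a second lookup without the time range for each candidate row, which costs extra reads. Whether that trade-off is acceptable depends on the job.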
On Mon, Aug 6, 2012 at 9:25 AM, jmozah <jm...@gmail.com> wrote:
> Hmm, I missed it. Otherwise I don't spot anything wrong in this.
> Are you sure about the column names?
>
> ./zahoor
Re: Find rows which do not have any of the given columns
Posted by jmozah <jm...@gmail.com>.
Hmm, I missed it. Otherwise I don't spot anything wrong in this.
Are you sure about the column names?
./zahoor
On 06-Aug-2012, at 9:34 PM, Shrijeet Paliwal <sh...@rocketfuel.com> wrote:
> I am using FilterList. Could you elaborate?
Re: Find rows which do not have any of the given columns
Posted by Shrijeet Paliwal <sh...@rocketfuel.com>.
I am using FilterList. Could you elaborate?
On Mon, Aug 6, 2012 at 8:48 AM, jmozah <jm...@gmail.com> wrote:
>
>
> Use a FilterList instead of a List of Filters.
>
> ./Zahoor
Re: Find rows which do not have any of the given columns
Posted by jmozah <jm...@gmail.com>.
Use a FilterList instead of a List of Filters.
./Zahoor