Posted to user@hbase.apache.org by Shrijeet Paliwal <sh...@rocketfuel.com> on 2012/08/06 08:42:38 UTC

Find rows which do not have any of the given columns

Hi All,

I am writing a job that finds rows that do not have a cell for any of the
columns in a given set of columns.
This is how I have configured my scan (a combination of QualifierFilters
and a SkipFilter):

    // columns is a CSV string of column names
    Iterable<String> columnsSet = Splitter.on(',').split(columns);
    List<Filter> qualifierFilters = new ArrayList<Filter>();
    for (String qual : columnsSet) {
      qualifierFilters.add(new QualifierFilter(CompareOp.NOT_EQUAL,
          new BinaryComparator(Bytes.toBytes(qual))));
    }
    Filter skipFilter = new SkipFilter(
        new FilterList(Operator.MUST_PASS_ALL, qualifierFilters));
    Scan scan = new Scan();
    scan.addFamily(Bytes.toBytes(family));
    scan.setCacheBlocks(false);
    scan.setCaching(1000);
    scan.setFilter(skipFilter);
    scan.setTimeRange(Long.valueOf(args[4]), Long.valueOf(args[5]));

In my test table the scan worked as expected, but in the production run I got
rows that had cells for one of the given qualifiers (not expected).
Can someone help me spot the mistake?

-Shrijeet

Re: [potential bug]Find rows which do not have any of the given columns

Posted by Shrijeet Paliwal <sh...@rocketfuel.com>.
Zahoor,

Thank you for the input. I still feel it is counterintuitive.

-Shrijeet

On Tue, Aug 7, 2012 at 2:57 AM, J Mohamed Zahoor <jm...@gmail.com> wrote:
> Hi
>
> Nice one. But I think this is valid behavior.
> Time ranges are something which qualifies certain rows to be made available
> to the client (something which is related to MVCC).
> Once certain rows are qualified, the filters are applied to them.
>
> The fact that both can be set simultaneously on a "Scan" object hints that
> they are orthogonal.
>
> ./zahoor

Re: [potential bug]Find rows which do not have any of the given columns

Posted by J Mohamed Zahoor <jm...@gmail.com>.
Hi

Nice one. But I think this is valid behavior.
Time ranges are something which qualifies certain rows to be made available
to the client (something which is related to MVCC).
Once certain rows are qualified, the filters are applied to them.

The fact that both can be set simultaneously on a "Scan" object hints that
they are orthogonal.

./zahoor
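
The ordering described above can be sketched with a toy model. This is not HBase code: the Cell class and both helper methods are hypothetical stand-ins, and the only behavior borrowed from HBase is that a scan time range is half-open, [minStamp, maxStamp). Under those assumptions it shows that when the time range is applied first, a row whose unwanted cell lies outside the range still passes a skip-style filter.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

/** Toy model (not HBase code) of a time range applied before a skip-style row filter. */
public class TimeRangeBeforeFilter {

    /** Hypothetical stand-in for a KeyValue: row key, column qualifier, timestamp. */
    public static final class Cell {
        public final String row;
        public final String qualifier;
        public final long timestamp;

        public Cell(String row, String qualifier, long timestamp) {
            this.row = row;
            this.qualifier = qualifier;
            this.timestamp = timestamp;
        }
    }

    /** Keeps only cells with minStamp <= timestamp < maxStamp. */
    public static List<Cell> inTimeRange(List<Cell> cells, long minStamp, long maxStamp) {
        List<Cell> kept = new ArrayList<Cell>();
        for (Cell c : cells) {
            if (c.timestamp >= minStamp && c.timestamp < maxStamp) {
                kept.add(c);
            }
        }
        return kept;
    }

    /** A row survives only if none of the cells handed to this method has an unwanted qualifier. */
    public static Set<String> rowsWithoutColumns(List<Cell> cells, Set<String> unwanted) {
        Set<String> rows = new TreeSet<String>();
        Set<String> rejected = new HashSet<String>();
        for (Cell c : cells) {
            rows.add(c.row);
            if (unwanted.contains(c.qualifier)) {
                rejected.add(c.row);
            }
        }
        rows.removeAll(rejected);
        return rows;
    }

    public static void main(String[] args) {
        List<Cell> cells = Arrays.asList(
            new Cell("row1", "a", 100L),
            new Cell("row1", "x", 10L),   // unwanted column, older than the time range
            new Cell("row2", "a", 100L),
            new Cell("row3", "a", 110L),
            new Cell("row3", "x", 120L)); // unwanted column, inside the time range
        Set<String> unwanted = new HashSet<String>(Arrays.asList("x"));

        // Filter only: every cell is seen, so row1 and row3 are rejected.
        System.out.println(rowsWithoutColumns(cells, unwanted)); // [row2]

        // Time range first: the filter never sees row1's old "x" cell.
        System.out.println(rowsWithoutColumns(inTimeRange(cells, 50L, 200L), unwanted)); // [row1, row2]
    }
}
```

Here row1 has an unwanted cell at timestamp 10, so the filter-only call rejects it, while the time-range-first call never shows that cell to the filter and row1 is returned.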

On Tue, Aug 7, 2012 at 2:10 AM, Shrijeet Paliwal <sh...@rocketfuel.com>wrote:

> - user
> +dev
>
> Hi Devs,
>
> Please follow the discussion to get full context. tl;dr: "Did a scan with
> a time range and filters, scan output was incorrect. Repeated the scan with
> the filter only, scan output was correct."
>
> HBase version: 0.90.3
> Hadoop: CDH3u0
> Issue:
> A scan set with both a time range and a filter can behave in an
> unintuitive way. I am calling it unintuitive instead of wrong, since I do not
> know if this is a known limitation of scan. Picture a filter setup like
> mine: "Filter out rows which have cells pertaining to certain columns". This
> filter is set on a scan which has a time range constraint as well. AFAIK
> we skip HFiles based on metadata when dealing with time ranges. Suppose a
> region has two HFiles, and one of them has cells for the unwanted columns
> but the other does not. We may then get an incorrect result depending on how
> the time range is set (if the time range scan optimizer skips the HFile
> containing the unwanted cells).
>
> Does this sound like a valid issue? Also, I can see this happening with more
> than one kind of SkipFilter.
>
> -Shrijeet

[potential bug]Find rows which do not have any of the given columns

Posted by Shrijeet Paliwal <sh...@rocketfuel.com>.
- user
+dev

Hi Devs,

Please follow the discussion to get full context. tl;dr: "Did a scan with
a time range and filters, scan output was incorrect. Repeated the scan with
the filter only, scan output was correct."

HBase version: 0.90.3
Hadoop: CDH3u0
Issue:
A scan set with both a time range and a filter can behave in an
unintuitive way. I am calling it unintuitive instead of wrong, since I do not
know if this is a known limitation of scan. Picture a filter setup like
mine: "Filter out rows which have cells pertaining to certain columns". This
filter is set on a scan which has a time range constraint as well. AFAIK
we skip HFiles based on metadata when dealing with time ranges. Suppose a
region has two HFiles, and one of them has cells for the unwanted columns
but the other does not. We may then get an incorrect result depending on how
the time range is set (if the time range scan optimizer skips the HFile
containing the unwanted cells).

Does this sound like a valid issue? Also, I can see this happening with more
than one kind of SkipFilter.

-Shrijeet
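
A toy model of the HFile skipping described above. It assumes, as this message does, that each store file records the minimum and maximum timestamps of its cells and that a file whose span does not overlap the scan's time range is not read. The class and method names are hypothetical, and the real region server logic is more involved than this.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Toy model (not HBase code) of skipping store files by their timestamp metadata. */
public class StoreFileSkipSketch {

    /** Hypothetical stand-in for an HFile holding cells of a single row. */
    public static final class StoreFile {
        public final String name;
        public final long minTimestamp;
        public final long maxTimestamp;
        public final List<String> qualifiers;

        public StoreFile(String name, long minTimestamp, long maxTimestamp, String... qualifiers) {
            this.name = name;
            this.minTimestamp = minTimestamp;
            this.maxTimestamp = maxTimestamp;
            this.qualifiers = Arrays.asList(qualifiers);
        }
    }

    /** A file is read only if [minTimestamp, maxTimestamp] overlaps [minStamp, maxStamp). */
    public static boolean isRead(StoreFile file, long minStamp, long maxStamp) {
        return file.maxTimestamp >= minStamp && file.minTimestamp < maxStamp;
    }

    /** Qualifiers a filter would get to see for the row once non-overlapping files are skipped. */
    public static List<String> visibleQualifiers(List<StoreFile> files, long minStamp, long maxStamp) {
        List<String> seen = new ArrayList<String>();
        for (StoreFile file : files) {
            if (isRead(file, minStamp, maxStamp)) {
                seen.addAll(file.qualifiers);
            }
        }
        return seen;
    }

    public static void main(String[] args) {
        List<StoreFile> files = Arrays.asList(
            new StoreFile("older", 10L, 20L, "x"),     // holds the unwanted column
            new StoreFile("newer", 100L, 120L, "a"));  // holds only a wanted column

        // With the time range, the older file is skipped and "x" is never seen.
        System.out.println(visibleQualifiers(files, 50L, 200L)); // [a]

        // Without a restricting range, both files are read.
        System.out.println(visibleQualifiers(files, 0L, Long.MAX_VALUE)); // [x, a]
    }
}
```

In the first call the filter has no way to know the row ever had column x, which matches the production symptom reported at the start of the thread.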


On Mon, Aug 6, 2012 at 11:38 AM, Shrijeet Paliwal
<sh...@rocketfuel.com>wrote:

> It seems setting the time range is the problem. I was doing
> scan.setTimeRange(Long.valueOf(args[4]), Long.valueOf(args[5]));
>
> I was working on the assumption that filter logic runs before scan logic; in
> other words, a KV dropped by a filter will not make it to the scan. In the
> case of a time range this might not be true.
>
> -Shrijeet

Re: Find rows which do not have any of the given columns

Posted by Shrijeet Paliwal <sh...@rocketfuel.com>.
It seems setting the time range is the problem. I was doing
scan.setTimeRange(Long.valueOf(args[4]), Long.valueOf(args[5]));

I was working on the assumption that filter logic runs before scan logic; in
other words, a KV dropped by a filter will not make it to the scan. In the
case of a time range this might not be true.

-Shrijeet
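
The ordering assumed in this message (filter first, time range second) can be written down as a toy model. This is a sketch of the intended semantics only, not HBase code and not something the thread confirms HBase offers: the Cell class and the method name are hypothetical. A row is rejected if it has an unwanted column at any timestamp, and only then is the time range used to pick which of the remaining rows to report.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

/** Toy model (not HBase code): decide exclusion over all cells, then apply the time range. */
public class FilterThenTimeRange {

    /** Hypothetical stand-in for a KeyValue: row key, column qualifier, timestamp. */
    public static final class Cell {
        public final String row;
        public final String qualifier;
        public final long timestamp;

        public Cell(String row, String qualifier, long timestamp) {
            this.row = row;
            this.qualifier = qualifier;
            this.timestamp = timestamp;
        }
    }

    /**
     * Returns rows that have at least one cell in [minStamp, maxStamp) and
     * no cell with an unwanted qualifier at any timestamp.
     */
    public static Set<String> rowsLackingColumns(List<Cell> cells, Set<String> unwanted,
                                                 long minStamp, long maxStamp) {
        Set<String> rejected = new HashSet<String>();
        Set<String> inRange = new TreeSet<String>();
        for (Cell c : cells) {
            if (unwanted.contains(c.qualifier)) {
                rejected.add(c.row); // timestamp deliberately ignored here
            }
            if (c.timestamp >= minStamp && c.timestamp < maxStamp) {
                inRange.add(c.row);
            }
        }
        inRange.removeAll(rejected);
        return inRange;
    }

    public static void main(String[] args) {
        List<Cell> cells = Arrays.asList(
            new Cell("row1", "a", 100L),
            new Cell("row1", "x", 10L),  // unwanted column outside the range still rejects row1
            new Cell("row2", "a", 100L),
            new Cell("row3", "a", 5L));  // no cell inside the range
        Set<String> unwanted = new HashSet<String>(Arrays.asList("x"));

        System.out.println(rowsLackingColumns(cells, unwanted, 50L, 200L)); // [row2]
    }
}
```

One way to approximate this against a real table, going by the observation in this thread that the filter-only scan returned correct rows, would be to scan without a time range and check cell timestamps on the client; that is an untested suggestion rather than a verified recipe.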


On Mon, Aug 6, 2012 at 9:25 AM, jmozah <jm...@gmail.com> wrote:

> Hmmm.. Missed it. Otherwise I don't spot anything wrong in this.
> Are you sure about the column names?
>
> ./zahoor

Re: Find rows which do not have any of the given columns

Posted by jmozah <jm...@gmail.com>.
Hmmm.. Missed it. Otherwise I don't spot anything wrong in this.
Are you sure about the column names?

./zahoor


On 06-Aug-2012, at 9:34 PM, Shrijeet Paliwal <sh...@rocketfuel.com> wrote:

> I am using FilterList. Could you elaborate?

Re: Find rows which do not have any of the given columns

Posted by Shrijeet Paliwal <sh...@rocketfuel.com>.
I am using FilterList. Could you elaborate?

On Mon, Aug 6, 2012 at 8:48 AM, jmozah <jm...@gmail.com> wrote:

>
>
> Use FilterList instead of List of Filters.
>
> ./Zahoor

Re: Find rows which do not have any of the given columns

Posted by jmozah <jm...@gmail.com>.

Use FilterList instead of List of Filters.

./Zahoor
