Posted to user@hbase.apache.org by Varun Sharma <va...@pinterest.com> on 2012/12/10 14:58:21 UTC

Filtering/Collection columns during Major Compaction

Hi,

My understanding of major compaction is that it rewrites the store into one
store file - it merges the memstore and the store files on disk, cleans out
delete tombstones and the puts prior to them, and cleans out excess versions.
We want to limit the number of columns per row in HBase. Also, we want to
limit them in lexicographically sorted order - which means we take, say, the
100 smallest columns (in the lexicographical sense), keep only those, and
discard the rest.

One way to do this would be to clean out columns in a daily mapreduce job.
Another way is to clean them out during the major compaction, which can be
run daily too. I see from the code that a major compaction essentially
invokes a Scan over the region - so if the Scan is invoked with the
appropriate filter (say ColumnCountGetFilter), would that do the trick?

Thanks
Varun

RE: Filtering/Collection columns during Major Compaction

Posted by Anoop Sam John <an...@huawei.com>.
Hi Varun

>but I am trying to understand why, for the
> real compaction - smallestReadPoint needs to be passed - I thought the read
> point was a memstore only thing

No, this is needed not only for the memstore. The memstore can get flushed in the middle of the scan... That is why the MVCC ts is also getting written to the HFile.
Hope the reply from Ram helped you do what you want. If you are facing any issues pls let us know. We have already done this using the CP hooks. Thanks to Lars H for these new hooks :) Very useful...

-Anoop-
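
The rule Anoop describes boils down to a comparison: a compaction can only forget a KeyValue's MVCC/memstore timestamp once that timestamp is at or below the smallest read point still in use by any open scanner; otherwise the timestamp has to be carried into the new HFile. A hedged illustration (the class and helper names are made up for this sketch, not actual HBase code; only the condition is the point):

    import org.apache.hadoop.hbase.KeyValue;

    // Hedged sketch, not real compaction code: the MVCC/memstore timestamp
    // travels with each KeyValue, and a compaction may only drop it once no
    // open scanner (read point) could still need it.
    public final class MvccExample {
      static boolean canDropMvccTimestamp(KeyValue kv, long smallestReadPoint) {
        return kv.getMemstoreTS() <= smallestReadPoint;
      }
    }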

Re: Filtering/Collection columns during Major Compaction

Posted by lars hofhansl <lh...@yahoo.com>.
In this case (rows in the HBase sense), on each iteration you should get all KeyValues (KVs) for all columns in this column family for a single row,
i.e. each KV should have the same rowkey.

-- Lars
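
Given that per-row behaviour, the trimming Varun is after reduces to a small list operation inside next(). A hedged sketch, assuming each next() call really does hand back one whole row's KVs sorted by qualifier; the 100-column cap and the helper name are illustrative, not from this thread:

    import java.util.List;

    import org.apache.hadoop.hbase.KeyValue;
    import org.apache.hadoop.hbase.util.Bytes;

    public final class RowTrim {
      private static final int MAX_COLS = 100;  // illustrative cap

      // Keep only the KVs belonging to the first MAX_COLS distinct qualifiers
      // of the row handed back by one next() call; drop the rest.
      static void trimToFirstColumns(List<KeyValue> results) {
        int distinct = 0;
        byte[] lastQualifier = null;
        int cut = results.size();
        for (int i = 0; i < results.size(); i++) {
          byte[] q = results.get(i).getQualifier();
          if (lastQualifier == null || Bytes.compareTo(q, lastQualifier) != 0) {
            distinct++;
            lastQualifier = q;
          }
          if (distinct > MAX_COLS) {
            cut = i;
            break;
          }
        }
        results.subList(cut, results.size()).clear();
      }
    }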




Re: Filtering/Collection columns during Major Compaction

Posted by Varun Sharma <va...@pinterest.com>.
Hi Lars,

Thanks for the detailed tip - we will go down that path. Looking at the
javadoc for InternalScanner.next() - it says grab the next row's values -
is this rows in the HBase sense or rows in the HFile? I suspect it is
the latter.

Thanks !


Re: Filtering/Collection columns during Major Compaction

Posted by lars hofhansl <lh...@yahoo.com>.
Filters do not work for compactions. We only support them for user scans.
(some of them might incidentally work, but that is entirely untested and unsupported)

Your best bet is to use the preCompact hook and return a wrapper scanner like so:

    public InternalScanner preCompact(ObserverContext<RegionCoprocessorEnvironment> e,
        Store store, final InternalScanner scanner) {
      return new InternalScanner() {
        public boolean next(List<KeyValue> results) throws IOException {
          return next(results, -1);
        }
        public boolean next(List<KeyValue> results, String metric)
            throws IOException {
          return next(results, -1, metric);
        }
        public boolean next(List<KeyValue> results, int limit)
            throws IOException {
          return next(results, limit, null);
        }
        public boolean next(List<KeyValue> results, int limit, String metric)
            throws IOException {

            // call next on the passed scanner
            // do your filtering here
        }

        public void close() throws IOException {
          scanner.close();
        }
      };
    }

-- Lars
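
For reference, a hedged sketch of what the "do your filtering here" part might look like when fleshed out against the 0.94-era API, keeping only the first 100 column qualifiers of each row. The BaseRegionObserver base class, the 100-column cap, and the cross-batch bookkeeping (so the sketch does not depend on a whole row arriving in a single next() call) are illustrative assumptions, not something confirmed in this thread:

    import java.io.IOException;
    import java.util.Iterator;
    import java.util.List;

    import org.apache.hadoop.hbase.KeyValue;
    import org.apache.hadoop.hbase.coprocessor.BaseRegionObserver;
    import org.apache.hadoop.hbase.coprocessor.ObserverContext;
    import org.apache.hadoop.hbase.coprocessor.RegionCoprocessorEnvironment;
    import org.apache.hadoop.hbase.regionserver.InternalScanner;
    import org.apache.hadoop.hbase.regionserver.Store;
    import org.apache.hadoop.hbase.util.Bytes;

    public class ColumnTrimmingObserver extends BaseRegionObserver {
      private static final int MAX_COLS = 100;  // illustrative cap

      @Override
      public InternalScanner preCompact(ObserverContext<RegionCoprocessorEnvironment> e,
          Store store, final InternalScanner scanner) {
        return new InternalScanner() {
          // Bookkeeping kept across next() calls so a row may span several batches.
          private byte[] currentRow = null;
          private byte[] lastQualifier = null;
          private int distinctCols = 0;

          public boolean next(List<KeyValue> results) throws IOException {
            return next(results, -1);
          }
          public boolean next(List<KeyValue> results, String metric) throws IOException {
            return next(results, -1, metric);
          }
          public boolean next(List<KeyValue> results, int limit) throws IOException {
            return next(results, limit, null);
          }
          public boolean next(List<KeyValue> results, int limit, String metric)
              throws IOException {
            // Ask the wrapped compaction scanner for the next batch of KVs,
            // then drop everything past the first MAX_COLS qualifiers of each row.
            boolean hasMore = scanner.next(results, limit, metric);
            Iterator<KeyValue> it = results.iterator();
            while (it.hasNext()) {
              KeyValue kv = it.next();
              if (currentRow == null || Bytes.compareTo(kv.getRow(), currentRow) != 0) {
                currentRow = kv.getRow();
                lastQualifier = null;
                distinctCols = 0;
              }
              byte[] q = kv.getQualifier();
              if (lastQualifier == null || Bytes.compareTo(q, lastQualifier) != 0) {
                distinctCols++;
                lastQualifier = q;
              }
              if (distinctCols > MAX_COLS) {
                it.remove();
              }
            }
            return hasMore;
          }

          public void close() throws IOException {
            scanner.close();
          }
        };
      }
    }

Anything dropped here never makes it into the rewritten store file, which is exactly the permanent clean-up being asked for, but it is worth keeping in mind while testing.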




Re: Filtering/Collection columns during Major Compaction

Posted by Varun Sharma <va...@pinterest.com>.
Hi Lars,

In my case, I just want to use ColumnPaginationFilter() rather than
implementing my own filtering logic. Is there an easy way to apply this
filter on top of an existing scanner? Do I do something like

RegionScannerImpl scanner = new RegionScannerImpl(scan_with_my_filter,
original_compaction_scanner)

Thanks
Varun


Re: Filtering/Collection columns during Major Compaction

Posted by lars hofhansl <lh...@yahoo.com>.
In your case you probably just want to filter on top of the provided scanner with preCompact (rather than actually replacing the scanner, which preCompactScannerOpen does). 

(And sorry I only saw this reply after I sent my own reply to your initial question.)




Re: Filtering/Collection columns during Major Compaction

Posted by Varun Sharma <va...@pinterest.com>.
Okay - I looked more thoroughly again - I should be able to extract these
from the region observer.

Thanks !


Re: Filtering/Collection columns during Major Compaction

Posted by Varun Sharma <va...@pinterest.com>.
Thanks! This is exactly what I need. I am looking at the code in
compactStore() under Store.java, but I am trying to understand why, for the
real compaction, smallestReadPoint needs to be passed - I thought the read
point was a memstore-only thing. Also, preCompactScannerOpen does not
have a way of passing this value.

Varun


Re: Filtering/Collection columns during Major Compaction

Posted by ramkrishna vasudevan <ra...@gmail.com>.
Hi Varun

If you are using the 0.94 version, there are coprocessor hooks that get
invoked before and after compaction selection.
preCompactScannerOpen() lets you create your own scanner, which actually
does the next() operation.
Now if you wrap your own scanner and implement your own next(), it will let
you play with the KVs that you need.  So basically you can say which columns
to include and which to exclude.
Does this help you, Varun?

Regards
Ram
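
For the hooks to fire, the observer has to be registered on the table (or globally via the hbase.coprocessor.region.classes property in hbase-site.xml). A hedged sketch of the per-table route with the 0.94 client API; the table name and the observer class (the ColumnTrimmingObserver sketched above) are illustrative, and the observer's jar must already be on the region servers' classpath:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.client.HBaseAdmin;
    import org.apache.hadoop.hbase.util.Bytes;

    public class AttachTrimmingObserver {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(conf);
        byte[] table = Bytes.toBytes("my_table");        // hypothetical table

        // Add the (hypothetical) observer class to the table descriptor and
        // push the modified descriptor back; the table must be disabled first.
        HTableDescriptor desc = admin.getTableDescriptor(table);
        desc.addCoprocessor("com.example.ColumnTrimmingObserver");
        admin.disableTable(table);
        admin.modifyTable(table, desc);
        admin.enableTable(table);
        admin.close();
      }
    }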


Re: Filtering/Collection columns during Major Compaction

Posted by Varun Sharma <va...@pinterest.com>.
So, I actually wrote something that uses preCompactScannerOpen and
initializes a StoreScanner in exactly the same way as we do for a major
compaction, except that I add the filter I need (ColumnPaginationFilter) to
this scanner - I guess that should accomplish the same thing.
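
For context on the filter mentioned here: ColumnPaginationFilter(limit, offset) keeps at most limit columns per row starting at column offset, so (100, 0) keeps the 100 lexicographically smallest columns of each row. A hedged sketch of its normal user-scan usage (table and family names are illustrative); whether it behaves identically when plugged into a compaction-time StoreScanner is exactly the untested territory Lars warns about above:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.filter.ColumnPaginationFilter;
    import org.apache.hadoop.hbase.util.Bytes;

    public class PaginationScanExample {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "my_table");     // hypothetical table
        try {
          Scan scan = new Scan();
          scan.addFamily(Bytes.toBytes("cf"));           // hypothetical family
          // Keep at most 100 columns per row, starting from the smallest qualifier.
          scan.setFilter(new ColumnPaginationFilter(100, 0));
          ResultScanner rs = table.getScanner(scan);
          for (Result r : rs) {
            System.out.println(r);
          }
          rs.close();
        } finally {
          table.close();
        }
      }
    }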


Re: Filtering/Collection columns during Major Compaction

Posted by lars hofhansl <lh...@yahoo.com>.
You can replace (or post filter) the scanner used for the compaction using coprocessors.
Take a look at RegionObserver.preCompact, which is passed a scanner that will iterate over all KVs that should make it into the new store file.
You can now wrap this scanner and then do any filtering you'd like.


