Posted to dev@carbondata.apache.org by Jacky Li <ja...@qq.com> on 2019/11/04 09:15:35 UTC

Re: [DISCUSSION] Page Level Bloom Filter

Hi Manhua,

+1 for this feature.

One question:
Since one column chunk in one blocklet is Carbon's minimum IO unit, why not
create the bloom filter at the blocklet level? If it is at the page level, we still need to
read the page data into memory; the saving is only for decompression.


Regards,
Jacky



--
Sent from: http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/

Re: [DISCUSSION] Page Level Bloom Filter

Posted by Manhua <ke...@gmail.com>.
To my understanding, only the IO of the *filter columns*' column pages is saved if we do this, on the condition that min/max or the page bloom filter decides we can *skip* scanning those pages.
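The skip condition described above can be sketched as a per-page test. This is an illustrative model only (CarbonData is Java and its reader API differs); the bloom filter is idealized here as a plain set, which shares the relevant property of having no false negatives:

```python
def can_skip_page(key, page_min, page_max, page_bloom: set) -> bool:
    """A page can be skipped when min/max excludes the key, or when the
    page-level bloom filter reports the key absent. A bloom filter never
    has false negatives, so a negative answer is a safe skip.
    The filter is modeled as a plain set for simplicity."""
    if key < page_min or key > page_max:
        return True                 # min/max pruning
    return key not in page_bloom    # bloom pruning (set = idealized bloom)

# Key 7 is outside [10, 20]: skipped by min/max alone.
assert can_skip_page(7, 10, 20, {10, 15})
# Key 12 is in range but absent from the filter: skipped by bloom.
assert can_skip_page(12, 10, 20, {10, 15})
# Key 15 passes both checks: the page must be read and scanned.
assert not can_skip_page(15, 10, 20, {10, 15})
```

Only when one of these checks fires can the page's IO be avoided, which is why the saving applies to the filter columns' pages alone.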

On 2019/11/12 03:37:08, Jacky Li <ja...@apache.org> wrote: 
> 
> 
> On 2019/11/05 02:30:30, Manhua Jiang <ma...@apache.org> wrote: 
> > Hi Jacky,
> >   If we create the bloom filter at the blocklet level, it may be too similar to the bloom datamap and would face the same problems the bloom datamap faces, except that the pruning runs on the executor side.
> >   Page level is preferred since the page size is KNOWN, which lets us avoid working out how many bits the bloom filter's bitmap needs; only the FPP needs to be set.
> >   I checked, and the problem you mentioned does exist. It is also a problem when pruning pages by page min/max: although min/max may conclude that a page does not need to be scanned, the current query logic has already loaded both the datachunk3 and the column pages, so the IO for the column pages is wasted. Should we change this first? Is it worth separating one IO operation into two?
> 
> In my opinion, yes. We should leverage the datachunk3 and check whether the column pages are needed before reading them. This can reduce IO dramatically for some use cases, for example, high-selectivity filter queries.
> 
> > 
> > Anyone interested in this part is welcome to share your ideas as well.
> > 
> > Thanks.
> > Manhua
> > 
> > On 2019/11/04 09:15:35, Jacky Li <ja...@qq.com> wrote: 
> > > Hi Manhua,
> > > 
> > > +1 for this feature.
> > > 
> > > One question:
> > > Since one column chunk in one blocklet is Carbon's minimum IO unit, why not
> > > create the bloom filter at the blocklet level? If it is at the page level, we still need to
> > > read the page data into memory; the saving is only for decompression.
> > > 
> > > 
> > > Regards,
> > > Jacky
> > > 
> > > 
> > > 
> > > --
> > > Sent from: http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/
> > > 
> > 
> 

Re: [DISCUSSION] Page Level Bloom Filter

Posted by Jacky Li <ja...@apache.org>.

On 2019/11/05 02:30:30, Manhua Jiang <ma...@apache.org> wrote: 
> Hi Jacky,
>   If we create bloom filter in blocklet level, maybe too similar to bloom datamap and have to face the same problems bloom datamap facing, except the pruning is running in executor side.
>   Page level is preferred since page size is KNOWN and this let us get rid of considering how many bit should we need in the bitmap of bloom filter, only the FPP needed to be set.
>   I checked the problem you mentioned actually exists. This also a problem when pruning pages by page minmax. Although minmax may believes this page does not need to scan, current query logic already loaded both the datachunk3 and column pages. The IO for column page is wasted. Should we change this first? Is this worth for us to separate one IO operation into two? 

In my opinion, yes. We should leverage the datachunk3 and check whether the column pages are needed before reading them. This can reduce IO dramatically for some use cases, for example, high-selectivity filter queries.
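The split suggested here can be sketched as two reads: first fetch only the page-level metadata (datachunk3, which holds per-page min/max and, under this proposal, page bloom filters), prune, then issue a second read covering just the surviving pages. All names below are hypothetical and not CarbonData's actual reader API:

```python
from dataclasses import dataclass

@dataclass
class PageMeta:
    offset: int    # byte offset of the page inside the column chunk
    length: int    # page length in bytes
    min_val: int   # page-level min (as stored in datachunk3)
    max_val: int   # page-level max (as stored in datachunk3)

def read_needed_pages(chunk: bytes, metas: list[PageMeta], key: int) -> list[bytes]:
    """IO #1 (fetching `metas`) has already happened; IO #2 touches only
    pages whose min/max range can contain `key`, instead of the whole chunk."""
    return [chunk[m.offset:m.offset + m.length]
            for m in metas
            if m.min_val <= key <= m.max_val]

# Two 4-byte pages; only the second page's range [40, 60] can contain key 50,
# so only that page's bytes are read.
chunk = b"AAAA" + b"BBBB"
metas = [PageMeta(0, 4, 1, 10), PageMeta(4, 4, 40, 60)]
assert read_needed_pages(chunk, metas, 50) == [b"BBBB"]
assert read_needed_pages(chunk, metas, 30) == []
```

The trade-off raised in the thread is visible here: the single chunk read becomes two operations, which only pays off when pruning discards enough pages.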

> 
> Anyone interested in this part is welcome to share your ideas as well.
> 
> Thanks.
> Manhua
> 
> On 2019/11/04 09:15:35, Jacky Li <ja...@qq.com> wrote: 
> > Hi Manhua,
> > 
> > +1 for this feature.
> > 
> > One question:
> > Since one column chunk in one blocklet is Carbon's minimum IO unit, why not
> > create the bloom filter at the blocklet level? If it is at the page level, we still need to
> > read the page data into memory; the saving is only for decompression.
> > 
> > 
> > Regards,
> > Jacky
> > 
> > 
> > 
> > --
> > Sent from: http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/
> > 
> 

Re: [DISCUSSION] Page Level Bloom Filter

Posted by Manhua Jiang <ma...@apache.org>.
Hi Jacky,
  If we create the bloom filter at the blocklet level, it may be too similar to the bloom datamap and would face the same problems the bloom datamap faces, except that the pruning runs on the executor side.
  Page level is preferred since the page size is KNOWN, which lets us avoid working out how many bits the bloom filter's bitmap needs; only the FPP needs to be set.
  I checked, and the problem you mentioned does exist. It is also a problem when pruning pages by page min/max: although min/max may conclude that a page does not need to be scanned, the current query logic has already loaded both the datachunk3 and the column pages, so the IO for the column pages is wasted. Should we change this first? Is it worth separating one IO operation into two?
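The sizing point can be made concrete with the standard bloom filter formulas m = -n ln p / (ln 2)^2 and k = (m/n) ln 2: since n (rows per page) is fixed, choosing the FPP determines everything else. A minimal sketch, assuming CarbonData's default of 32,000 rows per page:

```python
import math

def bloom_sizing(n_rows: int, fpp: float) -> tuple[int, int]:
    """Standard bloom filter sizing: bitmap size m (bits) and hash count k
    for n_rows inserted keys at false-positive probability fpp."""
    m = math.ceil(-n_rows * math.log(fpp) / (math.log(2) ** 2))
    k = max(1, round(m / n_rows * math.log(2)))
    return m, k

# With a fixed page size of 32,000 rows and a target FPP of 1%:
bits, hashes = bloom_sizing(32000, 0.01)
print(bits, hashes)  # about 306,722 bits (~37 KB) and 7 hash functions
```

This is why a known, fixed page size simplifies the design: the bitmap size per page is a constant derived from the single FPP setting, unlike a blocklet- or segment-level filter whose row count varies.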

Anyone interested in this part is welcome to share your ideas as well.

Thanks.
Manhua

On 2019/11/04 09:15:35, Jacky Li <ja...@qq.com> wrote: 
> Hi Manhua,
> 
> +1 for this feature.
> 
> One question:
> Since one column chunk in one blocklet is Carbon's minimum IO unit, why not
> create the bloom filter at the blocklet level? If it is at the page level, we still need to
> read the page data into memory; the saving is only for decompression.
> 
> 
> Regards,
> Jacky
> 
> 
> 
> --
> Sent from: http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/
>