Posted to user@drill.apache.org by "蔡自强(伏念)" <fu...@alibaba-inc.com> on 2015/01/27 09:23:31 UTC

question for pre-filter parquet file data

Hi Drill developers,

We are deploying Drill 0.7 for statistical analysis. I noticed that Parquet files store column summary information in the page header (min, max, count, and so on), but the data reader does not seem to use this information to pre-filter files. For example, when I search for records where attribute_A = 10 and the column's (min, max) = (1, 9), skipping the scan of that data seems like the best choice. I would like to check whether Drill performs this optimization during query processing.

By the way, in the TableStatsCalculator.getRegionSizeInBytes method, if avgRowSizeInBytes is too large, the return value will exceed the int range. The code should be changed to something like "return ((long)avgRowSizeInBytes)*1024L*1024L".

Thanks & Regards

Re: question for pre-filter parquet file data

Posted by Steven Phillips <sp...@maprtech.com>.

Our parquet reader doesn't currently have filter pushdown, but this is
something we will be adding in the near future. Once that work is done, we
will be able to skip entire pages as you describe.
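
For illustration only, the skip decision described in the question can be sketched as below. This is not Drill's reader code: PageStatsSketch, ColumnStats and canSkipForEquals are hypothetical names, and the sketch assumes the page's min and max for the column have already been read from the page header.

```java
// Hypothetical sketch: none of these names come from Drill or the Parquet library.
final class PageStatsSketch {

    /** Min/max summary for one column in one page (assumed to be already decoded). */
    static final class ColumnStats {
        final long min;
        final long max;

        ColumnStats(long min, long max) {
            this.min = min;
            this.max = max;
        }
    }

    /** Returns true when no value in the page can satisfy "column = value". */
    static boolean canSkipForEquals(ColumnStats stats, long value) {
        return value < stats.min || value > stats.max;
    }

    public static void main(String[] args) {
        ColumnStats page = new ColumnStats(1, 9);
        System.out.println(canSkipForEquals(page, 10)); // true: 10 lies outside [1, 9]
        System.out.println(canSkipForEquals(page, 5));  // false: 5 may be present
    }
}
```

A false result only means the page might contain the value, so the page still has to be scanned in that case.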

Also, could you file a jira for the TableStatsCalculator bug?
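
The overflow itself can be reproduced in isolation. The sketch below does not reproduce the actual body of TableStatsCalculator.getRegionSizeInBytes; RegionSizeSketch and its two methods are hypothetical, and it assumes avgRowSizeInBytes is an int that is multiplied by 1024 * 1024 before being widened to long.

```java
// Hypothetical sketch of the reported overflow; not the actual Drill source.
final class RegionSizeSketch {

    /** Multiplies in int arithmetic, so the product wraps before it is widened to long. */
    static long overflowing(int avgRowSizeInBytes) {
        return avgRowSizeInBytes * 1024 * 1024;
    }

    /** Widens to long first, as proposed in the original message. */
    static long widened(int avgRowSizeInBytes) {
        return ((long) avgRowSizeInBytes) * 1024L * 1024L;
    }

    public static void main(String[] args) {
        int avgRowSizeInBytes = 2048; // 2048 * 1024 * 1024 = 2^31, one past Integer.MAX_VALUE
        System.out.println(overflowing(avgRowSizeInBytes)); // -2147483648
        System.out.println(widened(avgRowSizeInBytes));     // 2147483648
    }
}
```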


-- 
 Steven Phillips
 Software Engineer

 mapr.com