Posted to dev@drill.apache.org by salim achouche <sa...@gmail.com> on 2018/02/09 01:42:09 UTC

Batch Sizing for Parquet Flat Reader

The following document
<https://docs.google.com/document/d/1A6zFkjxnC_-9RwG4h0sI81KI5ZEvJ7HzgClCUFpB5WE/edit?ts=5a793606#>
describes
a proposal for enforcing batch sizing constraints (count and memory) within
the Parquet Reader (Flat Schema). Please feel free to take a look and
provide feedback.

Thanks!

Regards,
Salim

Re: Batch Sizing for Parquet Flat Reader

Posted by salim achouche <sa...@gmail.com>.
I have updated the document
<https://docs.google.com/document/d/1A6zFkjxnC_-9RwG4h0sI81KI5ZEvJ7HzgClCUFpB5WE/edit?ts=5a793606&pli=1#>
with more design details.


Re: Batch Sizing for Parquet Flat Reader

Posted by salim achouche <sa...@gmail.com>.
Thanks, Parth, for your feedback! I am planning to enhance the document based
on the feedback received and on the prototype I am currently working on.

Regards,
Salim


Re: Batch Sizing for Parquet Flat Reader

Posted by Paul Rogers <pa...@yahoo.com.INVALID>.
Hi Salim.
Thanks much for the detailed explanation! You clearly have developed a deep understanding of the Parquet code and its impact on CPU and I/O performance. My comments are more from a holistic perspective, considering Drill as a whole.
Far too much to discuss on the dev list. I've added your comments, and my response, to DRILL-6147.
The key question is: what is our end goal and what path gets us there with the least effort? Perhaps those design doc updates Parth requested could spell that out a bit more.
Thanks,
- Paul



Re: Batch Sizing for Parquet Flat Reader

Posted by salim achouche <sa...@gmail.com>.
Thanks, Paul, for your feedback! Let me try to answer some of your questions and comments:

Duplicate Implementation
- I am not contemplating two different implementations, one for Parquet and another for the rest of the code
- Instead, I am reacting to the fact that we have two different processing patterns: row-oriented and columnar
- The goal is to offer both strategies, depending on the operator

Complex vs. Flat Parquet Readers
- The Complex and Flat Parquet readers are quite different
- I presume that, for the sake of performance, we can enhance our SQL capabilities so that the Flat Parquet reader can be invoked more frequently

Predicate Pushdown
- The reason I invoked predicate pushdown within the document is to help the analysis:
   o Notice how record batch materialization could involve many more pages
   o A solution that relies mainly on the current set of pages (one per column) might pay a heavy IO price without much to show for it
      + By waiting for all columns to have at least one page loaded so that upfront stats are gathered
      + Batch memory is then divided optimally across columns and the current batch size is computed
      + Unfortunately, such logic will fail if more pages are involved than the ones taken into consideration
   o Example (a sketch follows below) -
      + Two variable-length columns c1 and c2
      + The reader waits for two pages, P1-1 and P2-1, so that we a) allocate memory optimally across c1 and c2 and b) compute a batch size that will minimize overflow logic
      + Assume, because of data length skew or predicate pushdown, that more pages end up involved in loading the batch
      + For c1: {P1-1, P1-2, P1-3, P1-4}; for c2: {P2-1, P2-2}
      + It is now highly possible that the overflow logic will not be optimal, since only two pages' statistics were considered instead of six
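
To make the example concrete, here is a minimal, self-contained sketch of the page-based statistical approach described above, and of how it goes wrong when a batch spans more pages than were sampled. All names are illustrative, not the actual reader code:

    // Sketch: size a batch from first-page statistics only (one page per column).
    final class PageStatsSizer {

        // Average value size observed on a column's first loaded page.
        static double avgValueSize(long pageSizeBytes, int pageValueCount) {
            return (double) pageSizeBytes / pageValueCount;
        }

        // Batch record count derived from first-page averages only.
        static int batchRecordCount(long batchBudgetBytes,
                                    long[] pageSizes, int[] pageValueCounts) {
            double estimatedRowWidth = 0;
            for (int i = 0; i < pageSizes.length; i++) {
                estimatedRowWidth += avgValueSize(pageSizes[i], pageValueCounts[i]);
            }
            return (int) (batchBudgetBytes / estimatedRowWidth);
        }

        public static void main(String[] args) {
            // c1 and c2 each expose one page up front: 1 MB holding 100K values.
            int rows = batchRecordCount(16L << 20,
                new long[] {1 << 20, 1 << 20}, new int[] {100_000, 100_000});
            System.out.println("estimated rows = " + rows);
            // If the later pages of c1 (P1-2..P1-4) carry much wider values, the
            // true row width exceeds this estimate and the batch overshoots its
            // memory budget; the statistics of those pages were never consulted.
        }
    }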

 - I have added new logic to the ScanBatch so as to log (on demand) extra batch statistics, which will help us assess the efficiency of the batch sizing strategy; I will add this information to the document when this sub-task is done


Implementation Strategy
- DRILL-6147's mission is to implement batch sizing for Flat Parquet with minimal overhead
- This will also help us test this functionality for end-to-end cases (whole query)
- My next task (after DRILL-6147) is to incorporate your framework with Parquet
- I will a) enhance the framework to support columnar processing and b) refactor the Parquet code to use the framework
- I agree there might be some duplicated effort, but I really believe this will be minimal
- DRILL-6147 is not more than one week of research & analysis and one week of implementation

Regards,
Salim





Re: Batch Sizing for Parquet Flat Reader

Posted by Paul Rogers <pa...@yahoo.com.INVALID>.
Hi All,
Perhaps this topic needs just a bit more thought and discussion to avoid working at cross purposes. I've outlined the issues, and a possible path forward, in a comment to DRILL-6147.
Quick summary: creating a second batch size implementation just for Parquet will be very difficult once we handle all the required use cases as spelled out in the comment. We'd want to be very sure that we do, indeed, want to duplicate this effort before we head down that route. Duplicating the effort means repeating all the work done over the last six months to make the original result set loader work, and the future work needed to maintain two parallel systems. This is not a decision to make by default.
Thanks,
- Paul


Re: Batch Sizing for Parquet Flat Reader

Posted by Aman Sinha <am...@apache.org>.
Hi Paul,
thanks for your comments. I have added my thoughts in the DRILL-6147 JIRA
as well. Regarding the hangout, let me find out about the availability of
other folks and circle back with you.

thanks,
Aman


Re: Batch Sizing for Parquet Flat Reader

Posted by Paul Rogers <pa...@yahoo.com.INVALID>.
Hi Aman,
To follow up, we should look at all sides of the issue. One factor overlooked in my previous note is that code now is better than code later.
DRILL-6147 is available today and will immediately give users a performance boost. The result set loader is large and will take some months to commit, and so can't offer a benefit until then.
It is hard to argue that we wait. Let's get DRILL-6147 in now, then revisit the issue later (doing the proposed test) once the result set loader is available.
And, as discussed, DRILL-6147 works only for the flat Parquet reader. We'll need the result set loader for the Parquet reader that reads nested types.

Thanks,
- Paul

 


Re: Batch Sizing for Parquet Flat Reader

Posted by Paul Rogers <pa...@yahoo.com.INVALID>.
Hi Aman,
Please see my comment in DRILL-6147.
For the hangout to be productive, perhaps we should create test cases that will show the benefit of DRILL-6147 relative to the result set loader.
The test case of interest has three parts:
* Multiple variable-width fields (say five) with a large variance in field widths in each field
* Large data set that will be split across multiple batches (say 10 or 50 batches)
* Constraints on total batch size and size of the largest vector
Clearly, we can't try this out with Parquet: that's the topic we are discussing.
But, we can generate a data set in code, then do a unit test of the two methods (just the vector loading bits) and time the result. Similar code already exists in the result set loader branch that can be repurposed for this use. We'd want to create a similar test for the DRILL-6147 mechanisms. We can work out the details in a separate discussion.
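
For illustration only, a harness for such a timing test might look like the sketch below; the two Loader lambdas are stand-ins where the DRILL-6147 bulk path and the result set loader path would be plugged in (all names here are made up, not existing Drill classes):

    import java.util.Random;

    public class VectorLoadBenchmark {

        interface Loader { void load(byte[][] values); }

        // Variable-width values with a large variance in width (1..512 bytes).
        static byte[][] generate(int count, long seed) {
            Random rnd = new Random(seed);
            byte[][] values = new byte[count][];
            for (int i = 0; i < count; i++) {
                values[i] = new byte[1 + rnd.nextInt(512)];
            }
            return values;
        }

        static void time(String label, Loader loader, byte[][] values) {
            long start = System.nanoTime();
            loader.load(values);
            System.out.printf("%s: %d ms%n", label,
                (System.nanoTime() - start) / 1_000_000);
        }

        public static void main(String[] args) {
            byte[][] values = generate(5_000_000, 42); // enough for many batches
            time("bulk (DRILL-6147)", v -> { /* plug in bulk loading here */ }, values);
            time("result set loader", v -> { /* plug in loader here */ }, values);
        }
    }
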
IMHO, if the results are the same, we should go with one solution. If DRILL-6147 is significantly faster, the decision is clear: we have two solutions.
We also should consider things such as selection, null columns, implicit columns, and the other higher-level functionality provided by the result set loader. Since Parquet already has ad-hoc solutions for these, with DRILL-6147 we'd simply keep those solutions for Parquet, while the other readers use the new, unified mechanisms.
In terms of time, this week is busy:
* Wed. 3PM or later
* Fri. 3PM or later
The week of the 12th is much more open.
Thanks,
- Paul

 


Re: Batch Sizing for Parquet Flat Reader

Posted by Aman Sinha <am...@apache.org>.
Hi all, with reference to DRILL-6147
<https://issues.apache.org/jira/browse/DRILL-6147>, given the overlapping
approaches, I feel we should have a separate hangout session with
interested parties to discuss the details.
Let me know and I can set one up.

Aman


Re: Batch Sizing for Parquet Flat Reader

Posted by Padma Penumarthy <pp...@mapr.com>.
If our goal is not to allocate more than 16 MB for individual vectors, to avoid external fragmentation, I guess
we can also take that into consideration in our calculations to figure out the outgoing number of rows.
The math might become more complex. But the main point, like you said, is that operators know what they are
getting and can figure out how to deal with it to honor the constraints imposed.
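
For illustration, the row count arithmetic could look roughly like this (a sketch only; the names are made up and the real operator code differs):

    // Sketch: choose an outgoing row count that honors both the total batch
    // memory budget and a per-vector cap (e.g. 16 MB), given average column
    // widths observed on the incoming batch.
    final class OutgoingRowCount {

        static long compute(long batchBudgetBytes, long vectorCapBytes,
                            double[] avgColumnWidths) {
            double rowWidth = 0;
            long rowsByVectorCap = Long.MAX_VALUE;
            for (double width : avgColumnWidths) {
                rowWidth += width;
                // Each vector individually must stay under the cap.
                rowsByVectorCap = Math.min(rowsByVectorCap,
                    (long) (vectorCapBytes / Math.max(width, 1.0)));
            }
            long rowsByBudget = (long) (batchBudgetBytes / rowWidth);
            return Math.min(rowsByBudget, rowsByVectorCap);
        }

        public static void main(String[] args) {
            // Two columns averaging 200 and 8 bytes; 32 MB batch, 16 MB cap.
            System.out.println("outgoing rows = "
                + compute(32L << 20, 16L << 20, new double[] {200, 8}));
        }
    }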

Thanks
Padma




Re: Batch Sizing for Parquet Flat Reader

Posted by Paul Rogers <pa...@yahoo.com.INVALID>.
Agreed that allocating vectors up front is another good improvement.
The average batch size approach gets us 80% of the way to the goal: it limits batch size and allows vector preallocation.
What it cannot do is limit individual vector sizes. Nor can it ensure that the resulting batch is optimally loaded with data. Getting the remaining 20% requires the level of detail provided by the result set loader.
We are driving to use the result set loader first in readers, since readers can't use the average batch size (they don't have an input batch from which to obtain sizes).
To use the result set loader in non-leaf operators, we'd need to modify code generation. AFAIK, that is not something anyone is working on, so another advantage of the average batch size method is that it works with the code generation we already have.
Thanks,
- Paul

 


Re: Batch Sizing for Parquet Flat Reader

Posted by Padma Penumarthy <pp...@mapr.com>.
With the average row size method, since I know the number of rows and the average size for each column,
I am planning to use that information to allocate the required memory for each vector up front.
This should help avoid copying every time we double, and it should also improve memory utilization.
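
A minimal sketch of that idea (illustrative only; the ByteBuffer stands in for a Drill value vector's backing buffer):

    import java.nio.ByteBuffer;

    // Sketch: size the buffer once from the row count and the average value
    // width, instead of growing it by doubling (and copying) as values arrive.
    final class Preallocation {

        static ByteBuffer preallocate(int rowCount, double avgValueWidth) {
            int bytes = (int) Math.ceil(rowCount * avgValueWidth);
            return ByteBuffer.allocateDirect(bytes); // one allocation, no copies
        }

        public static void main(String[] args) {
            // 64K rows at an observed average of 40 bytes per value.
            ByteBuffer data = preallocate(65_536, 40.0);
            System.out.println("preallocated " + data.capacity() + " bytes");
        }
    }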

Thanks
Padma




Re: Batch Sizing for Parquet Flat Reader

Posted by salim achouche <sa...@gmail.com>.
Paul,

I cannot thank you enough for your help and guidance! You are right that columnar readers will have a harder time balancing resource requirements and performance. Nevertheless, DRILL-6147 is a starting point; it should allow us to gain knowledge and accordingly refine our strategy as we go.

FYI - on a completely different topic: I was working on an EBF regarding the Parquet complex reader (though the bug was midstream). I was surprised by the level of overhead associated with nested data processing; literally, the code was jumping from one column/level to another just to process a single value. There was a comment suggesting that such processing be performed in bulk (which I agree with). The moral of the story is that Drill is dealing with complex use cases that haven't been dealt with before (at least not with great success); as can be seen, we started with simpler solutions only to realize they are inefficient. What is needed is to spend time understanding such use cases and incrementally fixing those shortcomings.

Regards,
Salim




Re: Batch Sizing for Parquet Flat Reader

Posted by Paul Rogers <pa...@yahoo.com.INVALID>.
One more thought:
> > 3) Assuming that you go with the average batch size calculation approach,

The average batch size approach is a quick and dirty approach for non-leaf operators that can observe an incoming batch to estimate row width. Because Drill batches are large, the law of large numbers means that the average of a large input batch is likely to be a good estimator for the average size of a large output batch.
Note that this works only because non-leaf operators have an input batch to sample. Leaf operators (readers) do not have this luxury. Hence the result set loader uses the actual accumulated size for the current batch.
Also note that the average row method, while handy, is not optimal. It will, in general, result in greater internal fragmentation than the result set loader. Why? The result set loader packs vectors right up to the point where the largest would overflow. The average row method works at the aggregate level and will likely result in wasted space (internal fragmentation) in the largest vector. Said another way, with the average row size method, we can usually pack in a few more rows before the batch actually fills, and so we end up with batches with lower "density" than the optimal. This is important when the consuming operator is a buffering one such as sort.
The key reason Padma is using the quick & dirty average row size method is not that it is ideal (it is not), but rather that it is, in fact, quick.
We do want to move to the result set loader over time so we get improved memory utilization. And, it is the only way to control row size in readers such as CSV or JSON in which we have no size information until we read the data.
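
The estimator itself is simple; a sketch (illustrative only, not Drill's actual sizing code):

    // Sketch: estimate the output row width by averaging over the incoming
    // batch, then derive a row limit for the outgoing batch. Over a large
    // batch the average is stable, but, as noted above, the largest vector
    // may still carry wasted (internally fragmented) space.
    final class AvgRowSizer {

        static int outputRowLimit(long incomingBatchBytes, int incomingRowCount,
                                  long outgoingBudgetBytes) {
            double avgRowWidth = (double) incomingBatchBytes / incomingRowCount;
            return (int) (outgoingBudgetBytes / avgRowWidth);
        }

        public static void main(String[] args) {
            // 48 MB incoming batch of 64K rows; 16 MB outgoing budget.
            System.out.println(outputRowLimit(48L << 20, 65_536, 16L << 20));
        }
    }
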
- Paul   

Re: Batch Sizing for Parquet Flat Reader

Posted by Parth Chandra <pa...@apache.org>.
Thanks Salim.
Can you add this to the JIRA/design doc? Also, I would venture to suggest
that the section on predicate pushdown can be made clearer.
Also, since you're proposing the average batch size approach with overflow
handling, some detail on the proposed changes to the framework would be
useful in the design doc. (Perhaps pseudo code and affected classes.)
Essentially, some guarantees provided by the framework will change, and this
may affect (or not) the existing usage. These should be enumerated in the
design doc.
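
As a rough picture of the kind of overflow handling being discussed (a sketch only, with made-up names; not the proposed design):

    import java.util.ArrayList;
    import java.util.List;

    // Sketch: bulk-write values into the current batch; values that would push
    // the batch past its byte limit are carried over to seed the next batch.
    final class OverflowSketch {
        private final long batchLimitBytes;
        private long usedBytes;
        private final List<byte[]> current = new ArrayList<>();
        private final List<byte[]> carryOver = new ArrayList<>();

        OverflowSketch(long batchLimitBytes) {
            this.batchLimitBytes = batchLimitBytes;
        }

        // Returns true when the batch fills and should be sent downstream.
        boolean writeBulk(byte[][] chunk) {
            for (int i = 0; i < chunk.length; i++) {
                if (usedBytes + chunk[i].length > batchLimitBytes) {
                    for (int j = i; j < chunk.length; j++) {
                        carryOver.add(chunk[j]); // remainder goes to next batch
                    }
                    return true;
                }
                current.add(chunk[i]);
                usedBytes += chunk[i].length;
            }
            return false;
        }

        public static void main(String[] args) {
            OverflowSketch batch = new OverflowSketch(1024);
            boolean full = batch.writeBulk(new byte[][] {new byte[600], new byte[600]});
            System.out.println("batch full: " + full); // second value overflows
        }
    }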





On Fri, Feb 9, 2018 at 11:52 PM, salim achouche <sa...@gmail.com>
wrote:

> Thank you Parth for providing feedback; please find my answers below:
>
> I have created Apache JIRA DRILL-6147
> <https://issues.apache.org/jira/browse/DRILL-6147?filter=-1> for tracking
> this improvement.
>
> >  2) Not sure where you were going with the predicate pushdown section and
> > how it pertains to your proposed batch sizing.
>
> Predicate push down was part of the Design Considerations section; the
> intent is that the design should be able to handle future use cases such as
> push down. Notice how the page-based statistical approach will not work
> well with predicate push down, as one single batch can span many pages per
> column.
>
> > 3) Assuming that you go with the average batch size calculation approach,
> > are you proposing to have a Parquet scan specific overflow implementation?
> > Or are you planning to leverage the ResultSet loader mechanism? If you plan
> > to use the latter, it will need to be enhanced to handle a bulk chunk as
> > opposed to a single value at a time. If not using the ResultSet loader
> > mechanism, why not (you would be reinventing the wheel)?
>
> Padma Penumarthy and I are currently working on the batch sizing
> functionality and selected a few TPC-H queries to showcase end-to-end use
> cases. Immediately after this task, I'll be working on enhancing the new
> framework to support columnar processing and, as such, retrofitting the
> DRILL-6147 implementation as part of the new framework. So essentially we
> want to make progress on both fronts, so that a) OOM conditions are
> minimized as soon as possible and b) the new reader framework is applied
> to all readers and operators and rolled out in the next few releases.
>
> > Also note that memory allocations by Netty greater than the 16MB chunk
> > size are returned to the OS when the memory is free'd. Both this document
> > and the original document on memory fragmentation state incorrectly that
> > such memory is not released back to the OS. A quick thought experiment -
> > where does this memory go if it is not released back to the OS?
>
> I have the same understanding as you:
> - I think Paul meant that 16 MB blocks are not released to the OS (they are
> cached within Netty)
> - Many memory allocators exhibit the same behavior, as the release mechanism
> is slow (heuristics are used to decide when to release, so as to balance
> performance and resource usage)
> - Basically, if Drill holds a large count of 16 MB blocks, then a 32 MB, 64
> MB, etc. memory allocation might fail since
>   *  none of the Netty-allocated blocks can satisfy the new request
>   *  a new OS allocation would take Drill beyond the maximum direct memory
>
>
> On Fri, Feb 9, 2018 at 4:08 AM, Parth Chandra <pa...@apache.org> wrote:
>
> > Is there a JIRA for this? Would be useful to capture the comments in the
> > JIRA. Note that the document itself is not comment-able as it is shared
> > with view-only permissions.
> >
> > Some thoughts in no particular order-
> > 1) The Page based statistical approach is likely to run into trouble
> > with the encoding used for Parquet fields, especially RLE, which
> > drastically changes the size of a field. So pageSize/numValues is going
> > to be wildly inaccurate with RLE.
> > 2) Not sure where you were going with the predicate pushdown section and
> > how it pertains to your proposed batch sizing.
> > 3) Assuming that you go with the average batch size calculation approach,
> > are you proposing to have a Parquet scan specific overflow implementation?
> > Or are you planning to leverage the ResultSet loader mechanism? If you
> > plan to use the latter, it will need to be enhanced to handle a bulk
> > chunk as opposed to a single value at a time. If not using the ResultSet
> > loader mechanism, why not (you would be reinventing the wheel)?
> > 4) Parquet page level stats are probably not reliable. You can assume
> > page size (compressed/uncompressed) and value count are accurate, but
> > nothing else.
> >
> > Also note that memory allocations by Netty greater than the 16MB chunk
> > size are returned to the OS when the memory is free'd. Both this document
> > and the original document on memory fragmentation state incorrectly that
> > such memory is not released back to the OS. A quick thought experiment -
> > where does this memory go if it is not released back to the OS?
> >
> >
> >
> > On Fri, Feb 9, 2018 at 7:12 AM, salim achouche <sa...@gmail.com>
> > wrote:
> >
> > > The following document
> > > <https://docs.google.com/document/d/1A6zFkjxnC_-
> > > 9RwG4h0sI81KI5ZEvJ7HzgClCUFpB5WE/edit?ts=5a793606#>
> > > describes
> > > a proposal for enforcing batch sizing constraints (count and memory)
> > > within the Parquet Reader (Flat Schema). Please feel free to take a
> > > look and provide feedback.
> > >
> > > Thanks!
> > >
> > > Regards,
> > > Salim
> > >
> >
>

Re: Batch Sizing for Parquet Flat Reader

Posted by salim achouche <sa...@gmail.com>.
Thank you Parth for providing feedback; please find my answers below:

I have created Apache JIRA DRILL-6147
<https://issues.apache.org/jira/browse/DRILL-6147?filter=-1> for tracking
this improvement.

>  2) Not sure where you were going with the predicate pushdown section and
> how it pertains to your proposed batch sizing.

Predicate push down was part of the Design Considerations section; the
intent is that the design should be able to handle future use cases such as
push down. Notice how the Page based Statistical Approach will not work
well with predicate push down, as a single batch can span many pages per
column.
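
To illustrate with a rough sketch (all names are hypothetical), the page
based approach would size the batch from the first loaded page of each
column; the caveat is spelled out in the last comment:

    // Hypothetical sketch of the page based statistical approach.
    long avgRowWidth = 0;
    for (ColumnChunk column : columns) {
      PageHeader page = column.currentPageHeader();
      // average bytes per value, computed from this one page only
      avgRowWidth += page.uncompressedSize() / page.valueCount();
    }
    int batchRecordCount = (int) Math.min(maxRecordsPerBatch,
        memoryBudgetBytes / Math.max(1, avgRowWidth));

    // Problem: with predicate push down (or value-width skew), the batch
    // keeps reading into later pages whose statistics were never seen, so
    // batchRecordCount no longer bounds the batch's actual memory.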

> 3) Assuming that you go with the average batch size calculation approach,
> are you proposing to have a Parquet scan specific overflow implementation?
> Or are you planning to leverage the ResultSet loader mechanism? If you plan
> to use the latter, it will need to be enhanced to handle a bulk chunk as
> opposed to a single value at a time. If not using the ResultSet loader
> mechanism, why not (you would be reinventing the wheel)?

Padma Penumarthy and I are currently working on the batch sizing
functionality and selected a few TPCH queries to showcase end-to-end use
cases. Immediately after this task, I'll be working on enhancing the new
framework to support columnar processing and, as such, retrofit the
DRILL-6147 implementation as part of the new framework. So essentially we
want to make progress on both fronts so that a) OOM conditions are
minimized as soon as possible and b) the new Reader framework, applied to
all readers and operators, is rolled out in the next few releases.

> Also note that memory allocations by Netty greater than the 16MB chunk
> size are returned to the OS when the memory is free'd. Both this document
> and the original document on memory fragmentation state incorrectly that
> such memory is not released back to the OS. A quick thought experiment -
> where does this memory go if it is not released back to the OS?

I have the same understanding as you:
- I think Paul meant that 16 MB blocks are not released to the OS (cached
within Netty)
- Many memory allocators exhibit the same behavior, as the release mechanism
is slow (heuristics are used to decide when to release so as to balance
performance and resource usage)
- Basically, if Drill holds a large count of 16 MB blocks, then a 32 MB, 64
MB, etc. memory allocation might fail since
  *  none of the Netty allocated blocks can satisfy the new request
  *  a new OS allocation will take Drill beyond the maximum direct memory
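
To put made-up numbers on it: with an 8 GB (8,192 MB) direct memory limit,
suppose Netty's free list is caching 511 idle 16 MB chunks (8,176 MB). A
32 MB request cannot be satisfied by any single cached 16 MB chunk, so a
fresh OS allocation is needed, and 8,176 MB + 32 MB = 8,208 MB exceeds the
limit; the allocation fails even though nearly all of our memory is sitting
idle on the free list.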


On Fri, Feb 9, 2018 at 4:08 AM, Parth Chandra <pa...@apache.org> wrote:

> Is there a JIRA for this? Would be useful to capture the comments in the
> JIRA. Note that the document itself is not comment-able as it is shared
> with view-only permissions.
>
> Some thoughts in no particular order-
> 1) The Page based statistical approach is likely to run into trouble with
> the encoding used for Parquet fields, especially RLE, which drastically
> changes the size of a field. So pageSize/numValues is going to be wildly
> inaccurate with RLE.
> 2) Not sure where you were going with the predicate pushdown section and
> how it pertains to your proposed batch sizing.
> 3) Assuming that you go with the average batch size calculation approach,
> are you proposing to have a Parquet scan specific overflow implementation?
> Or are you planning to leverage the ResultSet loader mechanism? If you plan
> to use the latter, it will need to be enhanced to handle a bulk chunk as
> opposed to a single value at a time. If not using the ResultSet loader
> mechanism, why not (you would be reinventing the wheel)?
> 4) Parquet page level stats are probably not reliable. You can assume page
> size (compressed/uncompressed) and value count are accurate, but nothing
> else.
>
> Also note that memory allocations by Netty greater than the 16MB chunk size
> are returned to the OS when the memory is free'd. Both this document and
> the original document on memory fragmentation state incorrectly that such
> memory is not released back to the OS. A quick thought experiment - where
> does this memory go if it is not released back to the OS?
>
>
>
> On Fri, Feb 9, 2018 at 7:12 AM, salim achouche <sa...@gmail.com>
> wrote:
>
> > The following document
> > <https://docs.google.com/document/d/1A6zFkjxnC_-
> > 9RwG4h0sI81KI5ZEvJ7HzgClCUFpB5WE/edit?ts=5a793606#>
> > describes
> > a proposal for enforcing batch sizing constraints (count and memory)
> > within the Parquet Reader (Flat Schema). Please feel free to take a look
> > and provide feedback.
> >
> > Thanks!
> >
> > Regards,
> > Salim
> >
>

Re: Batch Sizing for Parquet Flat Reader

Posted by Paul Rogers <pa...@yahoo.com.INVALID>.
Parth notes:

Also note that memory allocations by Netty greater than the 16MB chunk size
are returned to the OS when the memory is free'd. Both this document and
the original document on memory fragmentation state incorrectly that such
memory is not released back to the OS. A quick thought experiment - where
does this memory go if it is not released back to the OS?

This is true. If the original docs said otherwise, then it is an error for which I apologize. If this were not true, we'd have lots of memory leaks, which we'd have found and fixed. So, clearly memory is returned.
Returning memory to the OS is not the issue. Rather, it is the fragmentation that occurs when most memory is on the Netty free list and we want to get a large chunk from the OS. We can run out of memory even when lots is free (in Netty).
The original jemalloc paper talks about an algorithm to return unused memory to the OS; perhaps we can add that to our own Netty-based allocator.
We'd want to be clever, however, because allocations from the OS are 1000 times slower than allocations from the Netty free list, or at least that was true in a prototype I did a year ago on the Mac.
Further, in the general case, even Netty is not a panacea. Even if we keep blocks to 16 MB and smaller, doing random sized allocations in random order will cause Netty fragmentation: we might want a 16 MB block, half of memory might be free, but due to historical alloc/free patterns all of it is free as 8 MB blocks, and so the allocation fails.
Java avoids this issue because it does compaction of free heap space. I'd guess we don't really want to try to implement that for direct memory.
This is why DBs generally use fixed-size allocations: it completely avoids memory fragmentation issues. One of the goals of the recent "result set loader" work is to encapsulate all vector accesses in a higher-level abstraction so that, eventually, we can try alternative memory layouts with minimal impact on the rest of Drill code. (The column reader and writer layer isolates code from actual vector APIs and memory layout.)
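As a toy illustration of the fixed-size idea (not Drill code): because every
block is the same size, any free block satisfies any request, and free space
never splinters.

    import java.nio.ByteBuffer;
    import java.util.ArrayDeque;
    import java.util.Deque;

    class FixedBlockPool {
      static final int BLOCK_SIZE = 16 * 1024 * 1024; // 16 MB

      private final Deque<ByteBuffer> freeList = new ArrayDeque<>();

      ByteBuffer allocate() {
        ByteBuffer block = freeList.poll();
        return (block != null) ? block : ByteBuffer.allocateDirect(BLOCK_SIZE);
      }

      void release(ByteBuffer block) {
        block.clear();
        // a real pool would cap the list and return excess blocks to the OS
        freeList.push(block);
      }
    }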
Thanks,
- Paul  

Re: Batch Sizing for Parquet Flat Reader

Posted by Parth Chandra <pa...@apache.org>.
Is there a JIRA for this? Would be useful to capture the comments in the
JIRA. Note that the document itself is not comment-able as it is shared
with view-only permissions.

Some thoughts in no particular order-
1) The Page based statistical approach is likely to run into trouble with
the encoding used for Parquet fields, especially RLE, which drastically
changes the size of a field. So pageSize/numValues is going to be wildly
inaccurate with RLE (see the toy example after this list).
2) Not sure where you were going with the predicate pushdown section and
how it pertains to your proposed batch sizing.
3) Assuming that you go with the average batch size calculation approach,
are you proposing to have a Parquet scan specific overflow implementation?
Or are you planning to leverage the ResultSet loader mechanism? If you plan
to use the latter, it will need to be enhanced to handle a bulk chunk as
opposed to a single value at a time. If not using the ResultSet loader
mechanism, why not (you would be reinventing the wheel)?
4) Parquet page level stats are probably not reliable. You can assume page
size (compressed/uncompressed) and value count are accurate, but nothing
else.
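
To put toy numbers on point 1: an RLE encoded INT32 page holding 100,000
identical values can occupy just a few hundred bytes on disk, so
pageSize/numValues suggests well under one byte per value; the materialized
vector, however, needs 4 bytes per value, i.e. 400 KB. A batch sized from
the on-disk average would overshoot its memory budget by roughly three
orders of magnitude.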

Also note that memory allocations by Netty greater than the 16MB chunk size
are returned to the OS when the memory is free'd. Both this document and
the original document on memory fragmentation state incorrectly that such
memory is not released back to the OS. A quick thought experiment - where
does this memory go if it is not released back to the OS?



On Fri, Feb 9, 2018 at 7:12 AM, salim achouche <sa...@gmail.com> wrote:

> The following document
> <https://docs.google.com/document/d/1A6zFkjxnC_-
> 9RwG4h0sI81KI5ZEvJ7HzgClCUFpB5WE/edit?ts=5a793606#>
> describes
> a proposal for enforcing batch sizing constraints (count and memory) within
> the Parquet Reader (Flat Schema). Please feel free to take a look and
> provide feedback.
>
> Thanks!
>
> Regards,
> Salim
>