Posted to dev@spark.apache.org by kazuyuki tanimura <kt...@apple.com.INVALID> on 2023/01/31 17:33:15 UTC

[DISCUSS] SPIP: Lazy Materialization for Parquet Read Performance Improvement

Hi everyone,

I would like to start a discussion on “Lazy Materialization for Parquet Read Performance Improvement”.

Chao and I propose a Parquet reader with lazy materialization. For Spark SQL filter operations, evaluating the filters first and lazily materializing only the values that pass them can avoid wasted computation and improve read performance.
The current Spark implementation requires the values it reads to be fully materialized (i.e., decompressed, decoded, etc.) in memory before the filters are applied, even though the filters may eventually discard many of those values.
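
To make the idea concrete, here is a minimal toy sketch in plain Python. It is not Spark's Parquet reader and not the design from the SPIP doc: `EncodedColumn`, `eager_read`, and `lazy_read` are hypothetical names, and `decode` is only a stand-in for real decompression and decoding. It assumes a single filter column and a single payload column, and it counts how many values each strategy has to materialize.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class EncodedColumn:
    """Toy stand-in for an encoded column chunk (hypothetical, not a Spark class)."""
    raw: List[int]
    decoded_count: int = 0  # number of values materialized so far

    def decode(self, row: int) -> int:
        # Stand-in for the real decompress/decode work on one value.
        self.decoded_count += 1
        return self.raw[row] + 1000


def eager_read(filter_col: EncodedColumn, payload_col: EncodedColumn,
               predicate: Callable[[int], bool]) -> List[Tuple[int, int]]:
    # Materialize every value of both columns first, then apply the filter.
    n = len(filter_col.raw)
    keys = [filter_col.decode(i) for i in range(n)]
    payload = [payload_col.decode(i) for i in range(n)]
    return [(k, p) for k, p in zip(keys, payload) if predicate(k)]


def lazy_read(filter_col: EncodedColumn, payload_col: EncodedColumn,
              predicate: Callable[[int], bool]) -> List[Tuple[int, int]]:
    # Materialize only the filter column, evaluate the filter,
    # then decode the payload column for the surviving rows only.
    n = len(filter_col.raw)
    keys = [filter_col.decode(i) for i in range(n)]
    selected = [i for i in range(n) if predicate(keys[i])]
    return [(keys[i], payload_col.decode(i)) for i in selected]
```

Both functions return the same rows; the difference is that `lazy_read` calls `decode` on the payload column only for rows that pass the predicate. A real reader works on pages and batches rather than single values, so the actual savings depend on encoding and selectivity.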

Our tracking Jira and design doc are linked below.
SPIP Jira: https://issues.apache.org/jira/browse/SPARK-42256
SPIP Doc: https://docs.google.com/document/d/1Kr3y2fVZUbQXGH0y8AvdCAeWC49QJjpczapiaDvFzME

Liang-Chi was kind enough to shepherd this effort. 

Thank you
Kazu

Re: [DISCUSS] SPIP: Lazy Materialization for Parquet Read Performance Improvement

Posted by kazuyuki tanimura <kt...@apple.com.INVALID>.
Thank you Mich. I have addressed your point in the SPIP doc.

Kazu

> On Feb 1, 2023, at 2:04 AM, Mich Talebzadeh <mi...@gmail.com> wrote:
> 
> 
> In your answer to Q2 in the SPIP, you state, and I quote:
> 
> "... File formats other than Parquet are beyond the scope of this SPIP.."
> 
> It is important that you explain why you chose Parquet for this work. Apache Parquet <https://parquet.apache.org/> is an open-source, column-oriented data format that is widely used in the Apache Hadoop ecosystem and beyond. It is designed for efficient data storage and retrieval. Many data warehouses prefer to store data in external storage in Parquet format. Since reading such data is a common ETL workload for Spark, it makes sense to optimise data retrieval as much as possible.
> 
> HTH
> 
>    view my Linkedin profile <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
> 
>  https://en.everybodywiki.com/Mich_Talebzadeh <https://en.everybodywiki.com/Mich_Talebzadeh>
>  
> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.
>  


Re: [DISCUSS] SPIP: Lazy Materialization for Parquet Read Performance Improvement

Posted by Mich Talebzadeh <mi...@gmail.com>.
In your answer to Q2 in the SPIP, you state, and I quote:


"... File formats other than Parquet are beyond the scope of this SPIP.."


It is important that you explain why you chose Parquet for this work. Apache
Parquet <https://parquet.apache.org/> is an open-source, column-oriented
data format that is widely used in the Apache Hadoop ecosystem and beyond.
It is designed for efficient data storage and retrieval. Many data
warehouses prefer to store data in external storage in Parquet format.
Since reading such data is a common ETL workload for Spark, it makes sense
to optimise data retrieval as much as possible.

HTH



Re: [DISCUSS] SPIP: Lazy Materialization for Parquet Read Performance Improvement

Posted by "L. C. Hsieh" <vi...@gmail.com>.
Hi Mich,

The title of this thread is "[DISCUSS]". A SPIP needs a public discussion
to collect comments before we can move forward and call for a vote on it.


On Mon, Feb 13, 2023 at 2:35 PM Mich Talebzadeh <mi...@gmail.com>
wrote:

> Hi,
>
> I thought we already voted to go ahead with this proposal!
>
>
>
>
>
>
>
> On Mon, 13 Feb 2023 at 20:41, kazuyuki tanimura <kt...@apple.com>
> wrote:
>
>> Thank you Liang-Chi!
>>
>> Kazu
>>
>> On Feb 11, 2023, at 7:12 PM, L. C. Hsieh <vi...@gmail.com> wrote:
>>
>> Thanks all for your feedback.
>>
>> Given this positive feedback, if there are no other comments or discussion,
>> I will start a vote in the next few days.
>>
>> Thank you again!
>>
>> On Thu, Feb 2, 2023 at 10:12 AM kazuyuki tanimura <
>> ktanimura@apple.com.invalid> wrote:
>>
>>> Thank you all for the +1s and for reviewing the SPIP doc.
>>>
>>> Kazu
>>>
>>> On Feb 1, 2023, at 1:28 AM, Dongjoon Hyun <do...@gmail.com>
>>> wrote:
>>>
>>> +1
>>>
>>> On Wed, Feb 1, 2023 at 12:52 AM Mich Talebzadeh <
>>> mich.talebzadeh@gmail.com> wrote:
>>>
>>>> +1
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On Wed, 1 Feb 2023 at 02:23, huaxin gao <hu...@gmail.com> wrote:
>>>>
>>>>> +1
>>>>>
>>>>> On Tue, Jan 31, 2023 at 6:10 PM DB Tsai <db...@dbtsai.com> wrote:
>>>>>
>>>>>> +1
>>>>>>
>>>>>> Sent from my iPhone
>>>>>>
>>>>>> On Jan 31, 2023, at 4:16 PM, Yuming Wang <wg...@gmail.com> wrote:
>>>>>>
>>>>>> 
>>>>>> +1.
>>>>>>
>>>>>> On Wed, Feb 1, 2023 at 7:42 AM kazuyuki tanimura <
>>>>>> ktanimura@apple.com.invalid> wrote:
>>>>>>
>>>>>>> Great! Much appreciated, Mich!
>>>>>>>
>>>>>>> Kazu
>>>>>>>
>>>>>>> On Jan 31, 2023, at 3:07 PM, Mich Talebzadeh <
>>>>>>> mich.talebzadeh@gmail.com> wrote:
>>>>>>>
>>>>>>> Thanks, Kazu.
>>>>>>>
>>>>>>> I followed that template link and, as you pointed out, it is indeed a
>>>>>>> common template. If it works, then it is what it is.
>>>>>>>
>>>>>>> I will go through your design proposal, and hopefully we can review
>>>>>>> it.
>>>>>>>
>>>>>>> Regards,
>>>>>>>
>>>>>>> Mich
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Tue, 31 Jan 2023 at 22:34, kazuyuki tanimura <kt...@apple.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Thank you Mich. I followed the instructions at
>>>>>>>> https://spark.apache.org/improvement-proposals.html and used the
>>>>>>>> template there.
>>>>>>>> While we are open to revising our design doc, it sounds more like you
>>>>>>>> are proposing that the community change the instructions themselves?
>>>>>>>>
>>>>>>>> Kazu
>>>>>>>>
>>>>>>>> On Jan 31, 2023, at 11:24 AM, Mich Talebzadeh <
>>>>>>>> mich.talebzadeh@gmail.com> wrote:
>>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> Thanks for these proposals; they are good suggestions. Is this style
>>>>>>>> of breaking down your approach standard?
>>>>>>>>
>>>>>>>> My view is that it may make more sense to follow the
>>>>>>>> industry-established approach of breaking down your technical
>>>>>>>> proposal into:
>>>>>>>>
>>>>>>>>
>>>>>>>>    1. Background
>>>>>>>>    2. Objective
>>>>>>>>    3. Scope
>>>>>>>>    4. Constraints
>>>>>>>>    5. Assumptions
>>>>>>>>    6. Reporting
>>>>>>>>    7. Deliverables
>>>>>>>>    8. Timelines
>>>>>>>>    9. Appendix
>>>>>>>>
>>>>>>>> Your current approach, using the questions below, may not do
>>>>>>>> justice to your proposal:
>>>>>>>>
>>>>>>>> Q1. What are you trying to do? Articulate your objectives using
>>>>>>>> absolutely no jargon. What are you trying to achieve?
>>>>>>>> Q2. What problem is this proposal NOT designed to solve? What
>>>>>>>> issues is the proposal not going to address?
>>>>>>>> Q3. How is it done today, and what are the limits of current
>>>>>>>> practice?
>>>>>>>> Q4. What is new in your approach, and why do you think it will
>>>>>>>> succeed?
>>>>>>>> Q5. Who cares? If you are successful, what difference will it make?
>>>>>>>> If your proposal succeeds, what tangible benefits will it add?
>>>>>>>> Q6. What are the risks?
>>>>>>>> Q7. How long will it take?
>>>>>>>> Q8. What are the midterm and final “exams” to check for success?
>>>>>>>>
>>>>>>>> HTH
>>>>>>>>
>>>>>>>> Mich
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>
>>

Re: [DISCUSS] SPIP: Lazy Materialization for Parquet Read Performance Improvement

Posted by Mich Talebzadeh <mi...@gmail.com>.
Hi,

I thought we already voted to go ahead with this proposal!




Re: [DISCUSS] SPIP: Lazy Materialization for Parquet Read Performance Improvement

Posted by kazuyuki tanimura <kt...@apple.com.INVALID>.
Thank you Liang-Chi!

Kazu



Re: [DISCUSS] SPIP: Lazy Materialization for Parquet Read Performance Improvement

Posted by "L. C. Hsieh" <vi...@gmail.com>.
Thanks all for your feedback.

Given this positive feedback, if there are no other comments or discussion,
I will start a vote in the next few days.

Thank you again!


Re: [DISCUSS] SPIP: Lazy Materialization for Parquet Read Performance Improvement

Posted by kazuyuki tanimura <kt...@apple.com.INVALID>.
Thank you all for the +1s and for reviewing the SPIP doc.

Kazu

> On Feb 1, 2023, at 1:28 AM, Dongjoon Hyun <do...@gmail.com> wrote:
> 
> +1
> 
> On Wed, Feb 1, 2023 at 12:52 AM Mich Talebzadeh <mich.talebzadeh@gmail.com <ma...@gmail.com>> wrote:
> +1
> 
> 
> On Wed, 1 Feb 2023 at 02:23, huaxin gao <huaxin.gao11@gmail.com <ma...@gmail.com>> wrote:
> +1
> 
> On Tue, Jan 31, 2023 at 6:10 PM DB Tsai <dbtsai@dbtsai.com <ma...@dbtsai.com>> wrote:
> +1
> 
> Sent from my iPhone
> 
>> On Jan 31, 2023, at 4:16 PM, Yuming Wang <wgyumg@gmail.com <ma...@gmail.com>> wrote:
>> 
>> 
>> +1.
>> 
>> On Wed, Feb 1, 2023 at 7:42 AM kazuyuki tanimura <kt...@apple.com.invalid> wrote:
>> Great! Much appreciated, Mich!
>> 
>> Kazu
>> 
>>> On Jan 31, 2023, at 3:07 PM, Mich Talebzadeh <mich.talebzadeh@gmail.com <ma...@gmail.com>> wrote:
>>> 
>>> Thanks, Kazu.
>>> 
>>> I followed that template link and indeed as you pointed out it is a common template. If it works then it is what it is.
>>> 
>>> I will be going through your design proposals and hopefully we can review it.
>>> 
>>> Regards,
>>> 
>>> Mich
>>> 
>>> 
>>> On Tue, 31 Jan 2023 at 22:34, kazuyuki tanimura <ktanimura@apple.com <ma...@apple.com>> wrote:
>>> Thank you Mich. I followed the instruction at https://spark.apache.org/improvement-proposals.html <https://spark.apache.org/improvement-proposals.html> and used its template.
>>> While we are open to revise our design doc, it seems more like you are proposing the community to change the instruction per se?
>>> 
>>> Kazu
>>> 
>>>> On Jan 31, 2023, at 11:24 AM, Mich Talebzadeh <mich.talebzadeh@gmail.com <ma...@gmail.com>> wrote:
>>>> 
>>>> Hi,
>>>> 
>>>> Thanks for these proposals. good suggestions. Is this style of breaking down your approach standard?
>>>> 
>>>> My view would be that perhaps it makes more sense to follow the industry established approach of breaking down your technical proposal  into:
>>>> 
>>>> Background
>>>> Objective
>>>> Scope
>>>> Constraints
>>>> Assumptions
>>>> Reporting
>>>> Deliverables
>>>> Timelines
>>>> Appendix
>>>> Your current approach using below 
>>>> 
>>>> Q1. What are you trying to do? Articulate your objectives using absolutely no jargon. What are you trying to achieve?
>>>> Q2. What problem is this proposal NOT designed to solve? What issues the suggested proposal is not going to address
>>>> Q3. How is it done today, and what are the limits of current practice?
>>>> Q4. What is new in your approach approach and why do you think it will be successful succeed?
>>>> Q5. Who cares? If you are successful, what difference will it make? If your proposal succeeds, what tangible benefits will it add?
>>>> Q6. What are the risks?
>>>> Q7. How long will it take?
>>>> Q8. What are the midterm and final “exams” to check for success?
>>>>  
>>>> May not do  justice to your proposal.
>>>> 
>>>> HTH
>>>> 
>>>> Mich
>>>> 
>>>> 
>>>> On Tue, 31 Jan 2023 at 17:35, kazuyuki tanimura <ktanimura@apple.com.invalid <ma...@apple.com.invalid>> wrote:
>>>> Hi everyone,
>>>> 
>>>> I would like to start a discussion on “Lazy Materialization for Parquet Read Performance Improvement"
>>>> 
>>>> Chao and I propose a Parquet reader with lazy materialization. For Spark-SQL filter operations, evaluating the filters first and lazily materializing only the used values can save computation wastes and improve the read performance.
>>>> The current implementation of Spark requires the read values to materialize (i.e. decompress, de-code, etc...) onto memory first before applying the filters even though the filters may eventually throw away many values.
>>>> 
>>>> We made our design doc as follows.
>>>> SPIP Jira: https://issues.apache.org/jira/browse/SPARK-42256 <https://issues.apache.org/jira/browse/SPARK-42256> 
>>>> SPIP Doc: https://docs.google.com/document/d/1Kr3y2fVZUbQXGH0y8AvdCAeWC49QJjpczapiaDvFzME <https://docs.google.com/document/d/1Kr3y2fVZUbQXGH0y8AvdCAeWC49QJjpczapiaDvFzME>
>>>> 
>>>> Liang-Chi was kind enough to shepherd this effort. 
>>>> 
>>>> Thank you
>>>> Kazu
>>> 
>> 


Re: [DISCUSS] SPIP: Lazy Materialization for Parquet Read Performance Improvement

Posted by Dongjoon Hyun <do...@gmail.com>.
+1

On Wed, Feb 1, 2023 at 12:52 AM Mich Talebzadeh <mi...@gmail.com>
wrote:

> +1
>
>
>
> On Wed, 1 Feb 2023 at 02:23, huaxin gao <hu...@gmail.com> wrote:
>
>> +1
>>
>> On Tue, Jan 31, 2023 at 6:10 PM DB Tsai <db...@dbtsai.com> wrote:
>>
>>> +1
>>>
>>> Sent from my iPhone
>>>
>>> On Jan 31, 2023, at 4:16 PM, Yuming Wang <wg...@gmail.com> wrote:
>>>
>>> 
>>> +1.
>>>
>>> On Wed, Feb 1, 2023 at 7:42 AM kazuyuki tanimura
>>> <kt...@apple.com.invalid> wrote:
>>>
>>>> Great! Much appreciated, Mich!
>>>>
>>>> Kazu
>>>>
>>>> On Jan 31, 2023, at 3:07 PM, Mich Talebzadeh <mi...@gmail.com>
>>>> wrote:
>>>>
>>>> Thanks, Kazu.
>>>>
>>>> I followed that template link and indeed as you pointed out it is a
>>>> common template. If it works then it is what it is.
>>>>
>>>> I will be going through your design proposals and hopefully we can
>>>> review it.
>>>>
>>>> Regards,
>>>>
>>>> Mich
>>>>
>>>>
>>>> On Tue, 31 Jan 2023 at 22:34, kazuyuki tanimura <kt...@apple.com>
>>>> wrote:
>>>>
>>>>> Thank you Mich. I followed the instruction at
>>>>> https://spark.apache.org/improvement-proposals.html and used its
>>>>> template.
>>>>> While we are open to revise our design doc, it seems more like you are
>>>>> proposing the community to change the instruction per se?
>>>>>
>>>>> Kazu
>>>>>
>>>>> On Jan 31, 2023, at 11:24 AM, Mich Talebzadeh <
>>>>> mich.talebzadeh@gmail.com> wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> Thanks for these proposals. good suggestions. Is this style of
>>>>> breaking down your approach standard?
>>>>>
>>>>> My view would be that perhaps it makes more sense to follow the
>>>>> industry established approach of breaking down
>>>>> your technical proposal  into:
>>>>>
>>>>>
>>>>>    1. Background
>>>>>    2. Objective
>>>>>    3. Scope
>>>>>    4. Constraints
>>>>>    5. Assumptions
>>>>>    6. Reporting
>>>>>    7. Deliverables
>>>>>    8. Timelines
>>>>>    9. Appendix
>>>>>
>>>>> Your current approach using below
>>>>>
>>>>> Q1. What are you trying to do? Articulate your objectives using
>>>>> absolutely no jargon. What are you trying to achieve?
>>>>> Q2. What problem is this proposal NOT designed to solve? What issues
>>>>> the suggested proposal is not going to address
>>>>> Q3. How is it done today, and what are the limits of current practice?
>>>>> Q4. What is new in your approach approach and why do you think it
>>>>> will be successful succeed?
>>>>> Q5. Who cares? If you are successful, what difference will it make?
>>>>> If your proposal succeeds, what tangible benefits will it add?
>>>>> Q6. What are the risks?
>>>>> Q7. How long will it take?
>>>>> Q8. What are the midterm and final “exams” to check for success?
>>>>>
>>>>>
>>>>> May not do  justice to your proposal.
>>>>>
>>>>> HTH
>>>>>
>>>>> Mich
>>>>>
>>>>> On Tue, 31 Jan 2023 at 17:35, kazuyuki tanimura <
>>>>> ktanimura@apple.com.invalid> wrote:
>>>>>
>>>>>> Hi everyone,
>>>>>>
>>>>>> I would like to start a discussion on “Lazy Materialization for
>>>>>> Parquet Read Performance Improvement"
>>>>>>
>>>>>> Chao and I propose a Parquet reader with lazy materialization. For
>>>>>> Spark-SQL filter operations, evaluating the filters first and lazily
>>>>>> materializing only the used values can save computation wastes and improve
>>>>>> the read performance.
>>>>>> The current implementation of Spark requires the read values to
>>>>>> materialize (i.e. decompress, de-code, etc...) onto memory first before
>>>>>> applying the filters even though the filters may eventually throw away many
>>>>>> values.
>>>>>>
>>>>>> We made our design doc as follows.
>>>>>> SPIP Jira: https://issues.apache.org/jira/browse/SPARK-42256
>>>>>> SPIP Doc:
>>>>>> https://docs.google.com/document/d/1Kr3y2fVZUbQXGH0y8AvdCAeWC49QJjpczapiaDvFzME
>>>>>>
>>>>>> Liang-Chi was kind enough to shepherd this effort.
>>>>>>
>>>>>> Thank you
>>>>>> Kazu
>>>>>>
>>>>>
>>>>>
>>>>

Re: [DISCUSS] SPIP: Lazy Materialization for Parquet Read Performance Improvement

Posted by Mich Talebzadeh <mi...@gmail.com>.
+1



   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Wed, 1 Feb 2023 at 02:23, huaxin gao <hu...@gmail.com> wrote:

> +1
>
> On Tue, Jan 31, 2023 at 6:10 PM DB Tsai <db...@dbtsai.com> wrote:
>
>> +1
>>
>> Sent from my iPhone
>>
>> On Jan 31, 2023, at 4:16 PM, Yuming Wang <wg...@gmail.com> wrote:
>>
>> 
>> +1.
>>
>> On Wed, Feb 1, 2023 at 7:42 AM kazuyuki tanimura
>> <kt...@apple.com.invalid> wrote:
>>
>>> Great! Much appreciated, Mich!
>>>
>>> Kazu
>>>
>>> On Jan 31, 2023, at 3:07 PM, Mich Talebzadeh <mi...@gmail.com>
>>> wrote:
>>>
>>> Thanks, Kazu.
>>>
>>> I followed that template link and indeed as you pointed out it is a
>>> common template. If it works then it is what it is.
>>>
>>> I will be going through your design proposals and hopefully we can
>>> review it.
>>>
>>> Regards,
>>>
>>> Mich
>>>
>>>
>>> On Tue, 31 Jan 2023 at 22:34, kazuyuki tanimura <kt...@apple.com>
>>> wrote:
>>>
>>>> Thank you Mich. I followed the instruction at
>>>> https://spark.apache.org/improvement-proposals.html and used its
>>>> template.
>>>> While we are open to revise our design doc, it seems more like you are
>>>> proposing the community to change the instruction per se?
>>>>
>>>> Kazu
>>>>
>>>> On Jan 31, 2023, at 11:24 AM, Mich Talebzadeh <
>>>> mich.talebzadeh@gmail.com> wrote:
>>>>
>>>> Hi,
>>>>
>>>> Thanks for these proposals. good suggestions. Is this style of breaking
>>>> down your approach standard?
>>>>
>>>> My view would be that perhaps it makes more sense to follow the
>>>> industry established approach of breaking down
>>>> your technical proposal  into:
>>>>
>>>>
>>>>    1. Background
>>>>    2. Objective
>>>>    3. Scope
>>>>    4. Constraints
>>>>    5. Assumptions
>>>>    6. Reporting
>>>>    7. Deliverables
>>>>    8. Timelines
>>>>    9. Appendix
>>>>
>>>> Your current approach using below
>>>>
>>>> Q1. What are you trying to do? Articulate your objectives using
>>>> absolutely no jargon. What are you trying to achieve?
>>>> Q2. What problem is this proposal NOT designed to solve? What issues
>>>> the suggested proposal is not going to address
>>>> Q3. How is it done today, and what are the limits of current practice?
>>>> Q4. What is new in your approach approach and why do you think it will be
>>>> successful succeed?
>>>> Q5. Who cares? If you are successful, what difference will it make? If
>>>> your proposal succeeds, what tangible benefits will it add?
>>>> Q6. What are the risks?
>>>> Q7. How long will it take?
>>>> Q8. What are the midterm and final “exams” to check for success?
>>>>
>>>>
>>>> May not do  justice to your proposal.
>>>>
>>>> HTH
>>>>
>>>> Mich
>>>>
>>>> On Tue, 31 Jan 2023 at 17:35, kazuyuki tanimura <
>>>> ktanimura@apple.com.invalid> wrote:
>>>>
>>>>> Hi everyone,
>>>>>
>>>>> I would like to start a discussion on “Lazy Materialization for
>>>>> Parquet Read Performance Improvement"
>>>>>
>>>>> Chao and I propose a Parquet reader with lazy materialization. For
>>>>> Spark-SQL filter operations, evaluating the filters first and lazily
>>>>> materializing only the used values can save computation wastes and improve
>>>>> the read performance.
>>>>> The current implementation of Spark requires the read values to
>>>>> materialize (i.e. decompress, de-code, etc...) onto memory first before
>>>>> applying the filters even though the filters may eventually throw away many
>>>>> values.
>>>>>
>>>>> We made our design doc as follows.
>>>>> SPIP Jira: https://issues.apache.org/jira/browse/SPARK-42256
>>>>> SPIP Doc:
>>>>> https://docs.google.com/document/d/1Kr3y2fVZUbQXGH0y8AvdCAeWC49QJjpczapiaDvFzME
>>>>>
>>>>> Liang-Chi was kind enough to shepherd this effort.
>>>>>
>>>>> Thank you
>>>>> Kazu
>>>>>
>>>>
>>>>
>>>

Re: [DISCUSS] SPIP: Lazy Materialization for Parquet Read Performance Improvement

Posted by huaxin gao <hu...@gmail.com>.
+1

On Tue, Jan 31, 2023 at 6:10 PM DB Tsai <db...@dbtsai.com> wrote:

> +1
>
> Sent from my iPhone
>
> On Jan 31, 2023, at 4:16 PM, Yuming Wang <wg...@gmail.com> wrote:
>
> 
> +1.
>
> On Wed, Feb 1, 2023 at 7:42 AM kazuyuki tanimura
> <kt...@apple.com.invalid> wrote:
>
>> Great! Much appreciated, Mich!
>>
>> Kazu
>>
>> On Jan 31, 2023, at 3:07 PM, Mich Talebzadeh <mi...@gmail.com>
>> wrote:
>>
>> Thanks, Kazu.
>>
>> I followed that template link and indeed as you pointed out it is a
>> common template. If it works then it is what it is.
>>
>> I will be going through your design proposals and hopefully we can review
>> it.
>>
>> Regards,
>>
>> Mich
>>
>>
>> On Tue, 31 Jan 2023 at 22:34, kazuyuki tanimura <kt...@apple.com>
>> wrote:
>>
>>> Thank you Mich. I followed the instruction at
>>> https://spark.apache.org/improvement-proposals.html and used its
>>> template.
>>> While we are open to revise our design doc, it seems more like you are
>>> proposing the community to change the instruction per se?
>>>
>>> Kazu
>>>
>>> On Jan 31, 2023, at 11:24 AM, Mich Talebzadeh <mi...@gmail.com>
>>> wrote:
>>>
>>> Hi,
>>>
>>> Thanks for these proposals. good suggestions. Is this style of breaking
>>> down your approach standard?
>>>
>>> My view would be that perhaps it makes more sense to follow the industry
>>> established approach of breaking down your technical proposal  into:
>>>
>>>
>>>    1. Background
>>>    2. Objective
>>>    3. Scope
>>>    4. Constraints
>>>    5. Assumptions
>>>    6. Reporting
>>>    7. Deliverables
>>>    8. Timelines
>>>    9. Appendix
>>>
>>> Your current approach using below
>>>
>>> Q1. What are you trying to do? Articulate your objectives using
>>> absolutely no jargon. What are you trying to achieve?
>>> Q2. What problem is this proposal NOT designed to solve? What issues
>>> the suggested proposal is not going to address
>>> Q3. How is it done today, and what are the limits of current practice?
>>> Q4. What is new in your approach approach and why do you think it will be
>>> successful succeed?
>>> Q5. Who cares? If you are successful, what difference will it make? If
>>> your proposal succeeds, what tangible benefits will it add?
>>> Q6. What are the risks?
>>> Q7. How long will it take?
>>> Q8. What are the midterm and final “exams” to check for success?
>>>
>>>
>>> May not do  justice to your proposal.
>>>
>>> HTH
>>>
>>> Mich
>>>
>>> On Tue, 31 Jan 2023 at 17:35, kazuyuki tanimura <
>>> ktanimura@apple.com.invalid> wrote:
>>>
>>>> Hi everyone,
>>>>
>>>> I would like to start a discussion on “Lazy Materialization for Parquet
>>>> Read Performance Improvement"
>>>>
>>>> Chao and I propose a Parquet reader with lazy materialization. For
>>>> Spark-SQL filter operations, evaluating the filters first and lazily
>>>> materializing only the used values can save computation wastes and improve
>>>> the read performance.
>>>> The current implementation of Spark requires the read values to
>>>> materialize (i.e. decompress, de-code, etc...) onto memory first before
>>>> applying the filters even though the filters may eventually throw away many
>>>> values.
>>>>
>>>> We made our design doc as follows.
>>>> SPIP Jira: https://issues.apache.org/jira/browse/SPARK-42256
>>>> SPIP Doc:
>>>> https://docs.google.com/document/d/1Kr3y2fVZUbQXGH0y8AvdCAeWC49QJjpczapiaDvFzME
>>>>
>>>> Liang-Chi was kind enough to shepherd this effort.
>>>>
>>>> Thank you
>>>> Kazu
>>>>
>>>
>>>
>>

Re: [DISCUSS] SPIP: Lazy Materialization for Parquet Read Performance Improvement

Posted by DB Tsai <db...@dbtsai.com>.
+1

Sent from my iPhone

> On Jan 31, 2023, at 4:16 PM, Yuming Wang <wg...@gmail.com> wrote:
>
> +1.
>
> On Wed, Feb 1, 2023 at 7:42 AM kazuyuki tanimura
> <kt...@apple.com.invalid> wrote:
>
>> Great! Much appreciated, Mich!
>>
>> Kazu
>>
>>> On Jan 31, 2023, at 3:07 PM, Mich Talebzadeh
>>> <mich.talebzadeh@gmail.com> wrote:
>>>
>>> Thanks, Kazu.
>>>
>>> I followed that template link and indeed as you pointed out it is a common
>>> template. If it works then it is what it is.
>>>
>>> I will be going through your design proposals and hopefully we can review
>>> it.
>>>
>>> Regards,
>>>
>>> Mich
>>>
>>> On Tue, 31 Jan 2023 at 22:34, kazuyuki tanimura
>>> <ktanimura@apple.com> wrote:
>>>
>>>> Thank you Mich. I followed the instruction at
>>>> https://spark.apache.org/improvement-proposals.html and used its template.
>>>> While we are open to revise our design doc, it seems more like you are
>>>> proposing the community to change the instruction per se?
>>>>
>>>> Kazu
>>>>
>>>>> On Jan 31, 2023, at 11:24 AM, Mich Talebzadeh
>>>>> <mich.talebzadeh@gmail.com> wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> Thanks for these proposals. good suggestions. Is this style of breaking
>>>>> down your approach standard?
>>>>>
>>>>> My view would be that perhaps it makes more sense to follow the industry
>>>>> established approach of breaking down your technical proposal into:
>>>>>
>>>>>   1. Background
>>>>>   2. Objective
>>>>>   3. Scope
>>>>>   4. Constraints
>>>>>   5. Assumptions
>>>>>   6. Reporting
>>>>>   7. Deliverables
>>>>>   8. Timelines
>>>>>   9. Appendix
>>>>>
>>>>> Your current approach using below
>>>>>
>>>>> Q1 ~~. What are you trying to do? Articulate your objectives using
>>>>> absolutely no jargon~~. What are you trying to achieve?
>>>>> Q2. ~~What problem is this proposal NOT designed to solve?~~ What issues
>>>>> the suggested proposal is not going to address
>>>>> Q3. How is it done today, and what are the limits of current practice?
>>>>> Q4. What is new in your ~~approach~~ approach and why do you think it
>>>>> will ~~be successful~~ succeed?
>>>>> Q5. ~~Who cares? If you are successful, what difference will it make?~~
>>>>> If your proposal succeeds, what tangible benefits will it add?
>>>>> Q6. What are the risks?
>>>>> Q7. How long will it take?
>>>>> Q8. What are the midterm and final “exams” to check for success?
>>>>>
>>>>> May not do justice to your proposal.
>>>>>
>>>>> HTH
>>>>>
>>>>> Mich
>>>>>
>>>>> On Tue, 31 Jan 2023 at 17:35, kazuyuki tanimura
>>>>> <ktanimura@apple.com.invalid> wrote:
>>>>>
>>>>>> Hi everyone,
>>>>>>
>>>>>> I would like to start a discussion on “Lazy Materialization for Parquet
>>>>>> Read Performance Improvement"
>>>>>>
>>>>>> Chao and I propose a Parquet reader with lazy materialization. For
>>>>>> Spark-SQL filter operations, evaluating the filters first and lazily
>>>>>> materializing only the used values can save computation wastes and improve the
>>>>>> read performance.
>>>>>> The current implementation of Spark requires the read values to
>>>>>> materialize (i.e. decompress, de-code, etc...) onto memory first before
>>>>>> applying the filters even though the filters may eventually throw away many
>>>>>> values.
>>>>>>
>>>>>> We made our design doc as follows.
>>>>>> SPIP Jira: https://issues.apache.org/jira/browse/SPARK-42256
>>>>>> SPIP Doc:
>>>>>> https://docs.google.com/document/d/1Kr3y2fVZUbQXGH0y8AvdCAeWC49QJjpczapiaDvFzME
>>>>>>
>>>>>> Liang-Chi was kind enough to shepherd this effort.
>>>>>>
>>>>>> Thank you
>>>>>> Kazu

Re: [DISCUSS] SPIP: Lazy Materialization for Parquet Read Performance Improvement

Posted by Yuming Wang <wg...@gmail.com>.
+1.

On Wed, Feb 1, 2023 at 7:42 AM kazuyuki tanimura
<kt...@apple.com.invalid> wrote:

> Great! Much appreciated, Mich!
>
> Kazu
>
> On Jan 31, 2023, at 3:07 PM, Mich Talebzadeh <mi...@gmail.com>
> wrote:
>
> Thanks, Kazu.
>
> I followed that template link and indeed as you pointed out it is a common
> template. If it works then it is what it is.
>
> I will be going through your design proposals and hopefully we can review
> it.
>
> Regards,
>
> Mich
>
>
> On Tue, 31 Jan 2023 at 22:34, kazuyuki tanimura <kt...@apple.com>
> wrote:
>
>> Thank you Mich. I followed the instruction at
>> https://spark.apache.org/improvement-proposals.html and used its
>> template.
>> While we are open to revise our design doc, it seems more like you are
>> proposing the community to change the instruction per se?
>>
>> Kazu
>>
>> On Jan 31, 2023, at 11:24 AM, Mich Talebzadeh <mi...@gmail.com>
>> wrote:
>>
>> Hi,
>>
>> Thanks for these proposals. good suggestions. Is this style of breaking
>> down your approach standard?
>>
>> My view would be that perhaps it makes more sense to follow the industry
>> established approach of breaking down your technical proposal  into:
>>
>>
>>    1. Background
>>    2. Objective
>>    3. Scope
>>    4. Constraints
>>    5. Assumptions
>>    6. Reporting
>>    7. Deliverables
>>    8. Timelines
>>    9. Appendix
>>
>> Your current approach using below
>>
>> Q1. What are you trying to do? Articulate your objectives using
>> absolutely no jargon. What are you trying to achieve?
>> Q2. What problem is this proposal NOT designed to solve? What issues the
>> suggested proposal is not going to address
>> Q3. How is it done today, and what are the limits of current practice?
>> Q4. What is new in your approach approach and why do you think it will be
>> successful succeed?
>> Q5. Who cares? If you are successful, what difference will it make? If
>> your proposal succeeds, what tangible benefits will it add?
>> Q6. What are the risks?
>> Q7. How long will it take?
>> Q8. What are the midterm and final “exams” to check for success?
>>
>>
>> May not do  justice to your proposal.
>>
>> HTH
>>
>> Mich
>>
>> On Tue, 31 Jan 2023 at 17:35, kazuyuki tanimura <
>> ktanimura@apple.com.invalid> wrote:
>>
>>> Hi everyone,
>>>
>>> I would like to start a discussion on “Lazy Materialization for Parquet
>>> Read Performance Improvement"
>>>
>>> Chao and I propose a Parquet reader with lazy materialization. For
>>> Spark-SQL filter operations, evaluating the filters first and lazily
>>> materializing only the used values can save computation wastes and improve
>>> the read performance.
>>> The current implementation of Spark requires the read values to
>>> materialize (i.e. decompress, de-code, etc...) onto memory first before
>>> applying the filters even though the filters may eventually throw away many
>>> values.
>>>
>>> We made our design doc as follows.
>>> SPIP Jira: https://issues.apache.org/jira/browse/SPARK-42256
>>> SPIP Doc:
>>> https://docs.google.com/document/d/1Kr3y2fVZUbQXGH0y8AvdCAeWC49QJjpczapiaDvFzME
>>>
>>> Liang-Chi was kind enough to shepherd this effort.
>>>
>>> Thank you
>>> Kazu
>>>
>>
>>
>

Re: [DISCUSS] SPIP: Lazy Materialization for Parquet Read Performance Improvement

Posted by kazuyuki tanimura <kt...@apple.com.INVALID>.
Great! Much appreciated, Mich!

Kazu

> On Jan 31, 2023, at 3:07 PM, Mich Talebzadeh <mi...@gmail.com> wrote:
> 
> Thanks, Kazu.
> 
> I followed that template link and indeed, as you pointed out, it is a common template. If it works then it is what it is.
> 
> I will be going through your design proposals and hopefully we can review them.
> 
> Regards,
> 
> Mich
> 


Re: [DISCUSS] SPIP: Lazy Materialization for Parquet Read Performance Improvement

Posted by Mich Talebzadeh <mi...@gmail.com>.
Thanks, Kazu.

I followed that template link and indeed, as you pointed out, it is a common
template. If it works then it is what it is.

I will be going through your design proposals and hopefully we can review
them.

Regards,

Mich



On Tue, 31 Jan 2023 at 22:34, kazuyuki tanimura <kt...@apple.com> wrote:

> Thank you Mich. I followed the instructions at
> https://spark.apache.org/improvement-proposals.html and used its template.
> While we are open to revising our design doc, it seems more like you are
> proposing that the community change the instructions themselves?
>
> Kazu

Re: [DISCUSS] SPIP: Lazy Materialization for Parquet Read Performance Improvement

Posted by kazuyuki tanimura <kt...@apple.com.INVALID>.
Thank you Mich. I followed the instructions at https://spark.apache.org/improvement-proposals.html and used its template.
While we are open to revising our design doc, it seems more like you are proposing that the community change the instructions themselves?

Kazu

> On Jan 31, 2023, at 11:24 AM, Mich Talebzadeh <mich.talebzadeh@gmail.com> wrote:
> 
> Hi,
> 
> Thanks for these proposals. Good suggestions. Is this style of breaking down your approach standard?
> 
> My view would be that perhaps it makes more sense to follow the industry-established approach of breaking down your technical proposal into:
> 
>    1. Background
>    2. Objective
>    3. Scope
>    4. Constraints
>    5. Assumptions
>    6. Reporting
>    7. Deliverables
>    8. Timelines
>    9. Appendix
> 
> Your current approach, using the questions below:
> 
> Q1. What are you trying to do? Articulate your objectives using absolutely no jargon. What are you trying to achieve?
> Q2. What problem is this proposal NOT designed to solve? What issues is the suggested proposal not going to address?
> Q3. How is it done today, and what are the limits of current practice?
> Q4. What is new in your approach and why do you think it will succeed?
> Q5. Who cares? If you are successful, what difference will it make? If your proposal succeeds, what tangible benefits will it add?
> Q6. What are the risks?
> Q7. How long will it take?
> Q8. What are the midterm and final “exams” to check for success?
> 
> may not do justice to your proposal.
> 
> HTH
> 
> Mich
> 


Re: [DISCUSS] SPIP: Lazy Materialization for Parquet Read Performance Improvement

Posted by Mich Talebzadeh <mi...@gmail.com>.
Hi,

Thanks for these proposals. Good suggestions. Is this style of breaking
down your approach standard?

My view would be that perhaps it makes more sense to follow the
industry-established approach of breaking down your technical proposal into:


   1. Background
   2. Objective
   3. Scope
   4. Constraints
   5. Assumptions
   6. Reporting
   7. Deliverables
   8. Timelines
   9. Appendix

Your current approach, using the questions below:

Q1. What are you trying to do? Articulate your objectives using absolutely
no jargon. What are you trying to achieve?

Q2. What problem is this proposal NOT designed to solve? What issues is the
suggested proposal not going to address?

Q3. How is it done today, and what are the limits of current practice?

Q4. What is new in your approach and why do you think it will succeed?

Q5. Who cares? If you are successful, what difference will it make? If your
proposal succeeds, what tangible benefits will it add?

Q6. What are the risks?

Q7. How long will it take?

Q8. What are the midterm and final “exams” to check for success?


may not do justice to your proposal.

HTH

Mich


On Tue, 31 Jan 2023 at 17:35, kazuyuki tanimura <kt...@apple.com.invalid>
wrote:

> Hi everyone,
>
> I would like to start a discussion on “Lazy Materialization for Parquet
> Read Performance Improvement”
>
> Chao and I propose a Parquet reader with lazy materialization. For
> Spark SQL filter operations, evaluating the filters first and lazily
> materializing only the values that are actually used can avoid wasted
> computation and improve read performance.
> The current implementation of Spark requires the values it reads to be
> materialized (decompressed, decoded, etc.) in memory before the filters
> are applied, even though the filters may eventually discard many of
> them.
>
> Our design doc is available here:
> SPIP Jira: https://issues.apache.org/jira/browse/SPARK-42256
> SPIP Doc:
> https://docs.google.com/document/d/1Kr3y2fVZUbQXGH0y8AvdCAeWC49QJjpczapiaDvFzME
>
> Liang-Chi was kind enough to shepherd this effort.
>
> Thank you
> Kazu
>
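
For readers new to the idea in the quoted proposal, the difference between
eager and lazy materialization can be sketched in a few lines of Python.
This is only a toy illustration, not Spark's or Parquet's reader code: the
CountingDecoder class, the eager_read and lazy_read functions, and the
string-encoded columns are all hypothetical stand-ins, and "decoding" here
is just int() with a call counter in place of real decompression and
decoding.

```python
from typing import Callable, Dict, List


class CountingDecoder:
    """Hypothetical stand-in for an expensive decode step; counts its calls."""

    def __init__(self) -> None:
        self.calls = 0

    def decode(self, encoded: str) -> int:
        self.calls += 1
        return int(encoded)


def eager_read(columns: Dict[str, List[str]], filter_col: str,
               predicate: Callable[[int], bool],
               decoder: CountingDecoder) -> List[Dict[str, int]]:
    """Decode every value of every column, then apply the filter."""
    decoded = {name: [decoder.decode(v) for v in values]
               for name, values in columns.items()}
    keep = [i for i, v in enumerate(decoded[filter_col]) if predicate(v)]
    return [{name: decoded[name][i] for name in columns} for i in keep]


def lazy_read(columns: Dict[str, List[str]], filter_col: str,
              predicate: Callable[[int], bool],
              decoder: CountingDecoder) -> List[Dict[str, int]]:
    """Decode the filter column first, then other columns for surviving rows only."""
    filter_values = [decoder.decode(v) for v in columns[filter_col]]
    keep = [i for i, v in enumerate(filter_values) if predicate(v)]
    rows = []
    for i in keep:
        row = {filter_col: filter_values[i]}
        for name, values in columns.items():
            if name != filter_col:
                row[name] = decoder.decode(values[i])
        rows.append(row)
    return rows


# Toy data: three "encoded" columns with four rows each.
columns = {
    "id": ["1", "2", "3", "4"],
    "price": ["10", "20", "30", "40"],
    "qty": ["5", "6", "7", "8"],
}

eager, lazy = CountingDecoder(), CountingDecoder()
eager_rows = eager_read(columns, "id", lambda v: v > 3, eager)
lazy_rows = lazy_read(columns, "id", lambda v: v > 3, lazy)

print(eager_rows == lazy_rows)  # True: both paths return the same rows
print(eager.calls)              # 12: every value in every column is decoded
print(lazy.calls)               # 6: 4 filter values + 2 values for the one kept row
```

The more selective the filter, the larger the gap between the two counts.
When most rows pass the filter, the lazy path decodes about as much as the
eager one.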