Posted to common-dev@hadoop.apache.org by Ben Maurer <bm...@andrew.cmu.edu> on 2008/10/27 21:07:52 UTC

Re: [hive-users] Hive Roadmap (Some information)

Have you guys considered translating the syntax tree for queries into Java 
bytecode? Java bytecode is great for this type of process because it's 
extremely high level -- the code generation mostly focuses on type 
checking and name resolution. However, it enables the JIT to perform 
register allocation and other low level optimizations for good 
performance.
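
A rough sketch of the idea, not Hive code. Everything here is an
assumption made for illustration: the Expr interface and the
col/lit/add/mul helpers are hypothetical names, and Java closures stand
in for emitted JVM bytecode so the example needs no bytecode library.
The point it tries to show is that column names are resolved once, when
the tree is compiled, so the per-row path is plain code the JIT can
optimize.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.ToLongFunction;

public class ExprCompileSketch {
    // Hypothetical AST node: compiles itself against a table schema.
    public interface Expr {
        ToLongFunction<long[]> compile(List<String> schema);
    }

    // Column reference: the name is resolved to an index once, at compile time.
    public static Expr col(String name) {
        return schema -> {
            int idx = schema.indexOf(name);
            if (idx < 0) {
                throw new IllegalArgumentException("unknown column: " + name);
            }
            return row -> row[idx];
        };
    }

    // Literal constant.
    public static Expr lit(long value) {
        return schema -> row -> value;
    }

    // Addition: both children are compiled once, then combined.
    public static Expr add(Expr left, Expr right) {
        return schema -> {
            ToLongFunction<long[]> l = left.compile(schema);
            ToLongFunction<long[]> r = right.compile(schema);
            return row -> l.applyAsLong(row) + r.applyAsLong(row);
        };
    }

    // Multiplication, same shape as add.
    public static Expr mul(Expr left, Expr right) {
        return schema -> {
            ToLongFunction<long[]> l = left.compile(schema);
            ToLongFunction<long[]> r = right.compile(schema);
            return row -> l.applyAsLong(row) * r.applyAsLong(row);
        };
    }

    public static void main(String[] args) {
        List<String> schema = Arrays.asList("price", "qty");
        // price * qty + 5
        Expr tree = add(mul(col("price"), col("qty")), lit(5));
        // Compile once; name lookups happen here, not per row.
        ToLongFunction<long[]> compiled = tree.compile(schema);
        System.out.println(compiled.applyAsLong(new long[] {3, 4})); // prints 17
    }
}
```

A real bytecode backend would emit a class instead of chaining
closures, but the split between a one-time compile step and a cheap
per-row call would be the same.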

-b

On Mon, 27 Oct 2008, Ashish Thusoo wrote:

> Folks,
>
> Here are some of the things that we are working on internally at Facebook. We thought it would be a good idea to let everyone know what is going on with Hive development. We will put this up on the wiki as well.
>
> 1. Integrating Dynamic SerDe with the DDL. (Zheng/Pete) - This allows the users to create typed tables along with list and map types from the DDL
> 2. Support for Statistics. (Ashish) - These stats are needed to make optimization decisions
> 3. Join Optimizations. (Prasad) - Mapside joins, semi join techniques etc to do the join faster
> 4. Predicate Pushdown Optimizations. (Namit) - pushing predicates just above the table scan for certain situations in joins as well as ensuring that only required columns are sent across map/reduce boundaries
> 5. Group By Optimizations. (Joydeep) - various optimizations to make group by faster
> 6. Optimizations to reduce the number of map files created by filter operations. (Dhruba) - Filters with a large number of mappers produce a lot of files, which slows down the following operations. This tries to address that problem.
> 7. Transformations in LOAD. (Joydeep) - LOAD currently does not transform the input data if it is not in the format expected by the destination table.
> 8. Schemaless map/reduce. (Zheng) - TRANSFORM needs a schema while map/reduce is schemaless.
> 9. Improvements to TRANSFORM. (Zheng) - Make this more intuitive to map/reduce developers - evaluate some other keywords etc.
> 10. Error Reporting Improvements. (Pete) - Make error reporting for parse errors better
> 11. Help on CLI. (Joydeep) - add help to the CLI
> 12. Explode and Collect Operators. (Zheng) - Explode and collect operators to convert collections to individual items and vice versa.
> 13. Propagating sort properties to destination tables. (Prasad) - If the query produces sorted output, we want to capture that in the destination table's metadata so that downstream optimizations can be enabled.
>
> Other contributions from outside FB ...
> 1. JDBC driver (Michi Mutsuzaki @ stanford.edu, Raghu @ stanford.edu)
> 2. Fixes to CLI driver (Jeremy Huylebroeck)
> 3. Web interface...
>
> Most of these have a JIRA associated. A lot of focus is on running things faster in Hive considering that we have a good feature set now...
>
> Comments/contributions are welcome. Please go to the JIRA and check out contrib/hive...
>
> Thanks,
> Ashish
> _______________________________________________
> hive-users mailing list
> hive-users@publists.facebook.com
> http://publists.facebook.com/mailman/listinfo/hive-users
>
>

Re: [hive-users] Hive Roadmap (Some information)

Posted by Dhruba Borthakur <dh...@facebook.com>.

Hi Ben,

And, if I may add, if you would like to contribute the code to make this happen, that will be awesome! In that case, we can move this discussion to a JIRA.

Thanks,
dhruba


On 10/27/08 1:41 PM, "Ashish Thusoo" <at...@facebook.com> wrote:

We did have some discussions around it a while back but we put it on the back burner considering that there were a lot of algorithmic improvements that we could make in the current code itself. We reckoned that we could make significant improvements there first and then measure the improvements we could get out of byte code generation. What kind of performance speedups have you seen with byte code generation in data processing applications?

Ashish


RE: [hive-users] Hive Roadmap (Some information)

Posted by Ashish Thusoo <at...@facebook.com>.

We did have some discussions around it a while back but we put it on the back burner considering that there were a lot of algorithmic improvements that we could make in the current code itself. We reckoned that we could make significant improvements there first and then measure the improvements we could get out of byte code generation. What kind of performance speedups have you seen with byte code generation in data processing applications?

Ashish

-----Original Message-----
From: Ben Maurer [mailto:bmaurer@andrew.cmu.edu]
Sent: Monday, October 27, 2008 1:08 PM
To: Ashish Thusoo
Cc: hive-users@publists.facebook.com; core-dev@hadoop.apache.org; core-user@hadoop.apache.org
Subject: Re: [hive-users] Hive Roadmap (Some information)

Have you guys considered translating the syntax tree for queries into Java bytecode? Java bytecode is great for this type of process because it's extremely high level -- the code generation mostly focuses on type checking and name resolution. However, it enables the JIT to perform register allocation and other low level optimizations for good performance.

-b

