Posted to dev@hive.apache.org by "Joydeep Sen Sarma (JIRA)" <ji...@apache.org> on 2009/01/11 21:12:59 UTC

[jira] Created: (HIVE-224) implement lfu based flushing policy for map side aggregates

implement lfu based flushing policy for map side aggregates
-----------------------------------------------------------

                 Key: HIVE-224
                 URL: https://issues.apache.org/jira/browse/HIVE-224
             Project: Hadoop Hive
          Issue Type: Improvement
            Reporter: Joydeep Sen Sarma


currently we flush some random set of rows when the map side hash table approaches memory limits.

we have discussed a strategy of flushing the hash table entries that have been seen the least number of times (effectively an LFU flushing strategy). This should be very effective at reducing the amount of data sent from the map step to the reduce step, as well as reducing the chance of skew.
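
A minimal sketch of the LFU idea described above, not Hive's actual operator code. Every name here (LfuFlushSketch, Partial, maxEntries, flushFraction, emitted) is hypothetical; an entry-count cap stands in for a real memory limit, and a list of strings stands in for the map output.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical map-side partial aggregator (a running sum) with LFU flushing. */
class LfuFlushSketch {
    /** Partial aggregate plus the number of input rows merged into it. */
    static final class Partial {
        long sum;
        long hits;
    }

    private final Map<String, Partial> table = new HashMap<>();
    private final int maxEntries;        // assumption: stand-in for a memory limit
    private final double flushFraction;  // assumption: share of entries evicted per flush
    final List<String> emitted = new ArrayList<>(); // stand-in for rows sent to reducers

    LfuFlushSketch(int maxEntries, double flushFraction) {
        this.maxEntries = maxEntries;
        this.flushFraction = flushFraction;
    }

    /** Merge one input row; flush before inserting a new key into a full table. */
    void add(String key, long value) {
        Partial p = table.get(key);
        if (p == null) {
            if (table.size() >= maxEntries) {
                flushLeastFrequent();
            }
            p = new Partial();
            table.put(key, p);
        }
        p.sum += value;
        p.hits++;
    }

    /** Emit and drop the entries that have been seen the fewest times. */
    private void flushLeastFrequent() {
        List<Map.Entry<String, Partial>> byHits = new ArrayList<>(table.entrySet());
        byHits.sort(Comparator.comparingLong(e -> e.getValue().hits));
        int n = Math.max(1, (int) (byHits.size() * flushFraction));
        for (int i = 0; i < n; i++) {
            Map.Entry<String, Partial> victim = byHits.get(i);
            emitted.add(victim.getKey() + "=" + victim.getValue().sum);
            table.remove(victim.getKey());
        }
    }

    boolean holds(String key) {
        return table.containsKey(key);
    }

    /** Emit everything still in memory at the end of the map task. */
    void close() {
        for (Map.Entry<String, Partial> e : table.entrySet()) {
            emitted.add(e.getKey() + "=" + e.getValue().sum);
        }
        table.clear();
    }
}
```

Sorting the whole table on every flush is the simplest thing that works; a real implementation would probably want a cheaper way to find low-count entries. The point of the sketch is only that frequently seen keys stay in memory and keep absorbing rows, while rarely seen keys are the ones sent downstream.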

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (HIVE-224) implement lfu based flushing policy for map side aggregates

Posted by "James Warren (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12843878#action_12843878 ] 

James Warren commented on HIVE-224:
-----------------------------------

Unfortunately have bandwidth limitations myself -- but when (if?) my queue clears I'll be happy to give it a go.

cheers,
-James

[jira] Commented: (HIVE-224) implement lfu based flushing policy for map side aggregates

Posted by "Jeff Hammerbacher (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12757646#action_12757646 ] 

Jeff Hammerbacher commented on HIVE-224:
----------------------------------------

Hey Joy,

Out of curiosity, did you guys ever look at this issue further?

Thanks,
Jeff

[jira] Commented: (HIVE-224) implement lfu based flushing policy for map side aggregates

Posted by "James Warren (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12841692#action_12841692 ] 

James Warren commented on HIVE-224:
-----------------------------------

think i bumped up against this or a related issue today - are there any plans to incorporate this into a future release?

thanks,
-James

[jira] Commented: (HIVE-224) implement lfu based flushing policy for map side aggregates

Posted by "Joydeep Sen Sarma (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12757715#action_12757715 ] 

Joydeep Sen Sarma commented on HIVE-224:
----------------------------------------

no - i guess we didn't - although it's an easy one.. fallout of reading the SOSP paper?

ridiculous - they are reporting 'accumulator partial-hash' as something new (never reported in literature) when reference #1 in their paper implements exactly that. so much for research.


[jira] Commented: (HIVE-224) implement lfu based flushing policy for map side aggregates

Posted by "Zheng Shao (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/HIVE-224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12841714#action_12841714 ] 

Zheng Shao commented on HIVE-224:
---------------------------------

Hi James, currently we don't have the bandwidth to do this, but I guess it won't be too hard - we just need to use http://java.sun.com/j2se/1.4.2/docs/api/java/util/LinkedHashMap.html (search for LRU).
Are you interested in joining forces on this?
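
A rough sketch of the LinkedHashMap route, assuming an entry-count cap instead of a real memory check. Note that an access-ordered LinkedHashMap evicts the least recently used entry, which only approximates the LFU policy in the issue title. The class name LruFlushSketch and the fields maxEntries and emitted are hypothetical; the three-argument constructor and removeEldestEntry are the standard java.util.LinkedHashMap API.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical partial-sum table that flushes its least recently used entry. */
class LruFlushSketch extends LinkedHashMap<String, Long> {
    private final int maxEntries; // assumption: stand-in for a memory limit
    final List<String> emitted = new ArrayList<>(); // stand-in for rows sent to reducers

    LruFlushSketch(int maxEntries) {
        super(16, 0.75f, true); // third argument true = access order, i.e. LRU ordering
        this.maxEntries = maxEntries;
    }

    /** Called by put() after an insert; returning true drops the eldest entry. */
    @Override
    protected boolean removeEldestEntry(Map.Entry<String, Long> eldest) {
        if (size() > maxEntries) {
            emitted.add(eldest.getKey() + "=" + eldest.getValue());
            return true;
        }
        return false;
    }

    /** Merge one input row into the partial sum for its key. */
    void add(String key, long value) {
        Long old = get(key); // get() counts as an access and refreshes the entry
        put(key, old == null ? value : old + value);
    }
}
```

With a cap of 2, adding a, b, a, c in that order flushes b: the second touch of a makes b the eldest entry by the time c arrives.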

