Posted to dev@hive.apache.org by "Namit Jain (JIRA)" <ji...@apache.org> on 2009/01/29 00:57:00 UTC

[jira] Created: (HIVE-256) map side aggregation : number of output rows is same as number of input rows

map side aggregation : number of output rows is same as number of input rows
----------------------------------------------------------------------------

                 Key: HIVE-256
                 URL: https://issues.apache.org/jira/browse/HIVE-256
             Project: Hadoop Hive
          Issue Type: Bug
          Components: Query Processor
            Reporter: Namit Jain
            Assignee: Namit Jain


map side aggregation : number of output rows is same as number of input rows

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HIVE-256) map side aggregation : number of output rows is same as number of input rows

Posted by "Namit Jain (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Namit Jain updated HIVE-256:
----------------------------

    Attachment: patch-256.1.txt



[jira] Updated: (HIVE-256) map side aggregation : number of output rows is same as number of input rows

Posted by "Carl Steinbach (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Carl Steinbach updated HIVE-256:
--------------------------------

    Fix Version/s: 0.3.0
                       (was: 0.6.0)



[jira] Commented: (HIVE-256) map side aggregation : number of output rows is same as number of input rows

Posted by "Joydeep Sen Sarma (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12668334#action_12668334 ] 

Joydeep Sen Sarma commented on HIVE-256:
----------------------------------------

man - the commits are fast! 

i had some questions:

Curious - 

-    if ((numEntries % NUMROWSESTIMATESIZE) == 0) {
+    if ((numEntriesHashTable == 0) || ((numEntries % NUMROWSESTIMATESIZE) == 0)) {

Guessing this was the critical change - but couldn't follow how numEntriesHashTable could ever be 0 without numEntries being 0 as well.
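To make the question concrete, here is a minimal stand-alone sketch that evaluates the old and new conditions from the diff side by side. The class name, the constant's value of 1000, and the sample inputs are assumptions for illustration only; the thread does not state the real value or exactly how the two counters are maintained.

```java
// Hypothetical comparison of the two conditions quoted in the diff above.
final class FlushCheckSketch {
    // Assumed value for illustration; the thread does not give the real constant.
    static final int NUMROWSESTIMATESIZE = 1000;

    // Condition before the patch: true only on multiples of NUMROWSESTIMATESIZE.
    static boolean oldCondition(long numEntries) {
        return (numEntries % NUMROWSESTIMATESIZE) == 0;
    }

    // Condition after the patch: also true whenever the hash table count is zero.
    static boolean newCondition(long numEntriesHashTable, long numEntries) {
        return (numEntriesHashTable == 0) || ((numEntries % NUMROWSESTIMATESIZE) == 0);
    }

    public static void main(String[] args) {
        // Each pair is {numEntriesHashTable, numEntries}; values are made up.
        long[][] cases = { {0, 0}, {0, 5}, {3, 5}, {3, 1000} };
        for (long[] c : cases) {
            System.out.printf("numEntriesHashTable=%d numEntries=%d old=%b new=%b%n",
                    c[0], c[1], oldCondition(c[1]), newCondition(c[0], c[1]));
        }
    }
}
```

The two conditions differ only when numEntriesHashTable is 0 while numEntries is not a multiple of NUMROWSESTIMATESIZE, which is exactly the case being asked about.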

Also - I couldn't understand this code fragment (quoting the old code since it doesn't matter):

      Field[] fArr = agg.getFields();
      for (Field f : fArr) 
        fixedRowSize += getSize(i, agg, f);

the getSize() call doesn't even look at the Field - it seems to base its decision on the class type of agg - and I think will default to unknowntype (which is a whopping 256 bytes)?

(sorry - unrelated to this bug - just generally curious since looking at this code in detail for first time)





[jira] Commented: (HIVE-256) map side aggregation : number of output rows is same as number of input rows

Posted by "Namit Jain (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12668301#action_12668301 ] 

Namit Jain commented on HIVE-256:
---------------------------------

It is used to keep track of the aggregation classes that use variable-length fields. There were a few bugs there, which were hidden because the reflection API getFields() only shows public fields, whereas getDeclaredFields() shows all fields.

For example, max has a string variable whose size is not fixed; it is sampled at runtime.
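
The difference between the two reflection calls can be seen in a small stand-alone sketch. MaxLikeAgg below is a hypothetical stand-in, not one of the actual Hive aggregation classes.

```java
import java.lang.reflect.Field;

// Hypothetical demo: getFields() versus getDeclaredFields().
final class ReflectionFieldsSketch {
    // Made-up aggregation-like class with one public and two private fields.
    static class MaxLikeAgg {
        public int visibleCount;
        private String currentMax;   // variable-length value
        private boolean empty;
    }

    // Counts only public fields (including inherited public ones).
    static int publicFieldCount(Class<?> c) {
        return c.getFields().length;
    }

    // Counts every field declared directly in the class, whatever its visibility.
    static int declaredFieldCount(Class<?> c) {
        return c.getDeclaredFields().length;
    }

    public static void main(String[] args) {
        for (Field f : MaxLikeAgg.class.getFields()) {
            System.out.println("getFields():         " + f.getName());
        }
        for (Field f : MaxLikeAgg.class.getDeclaredFields()) {
            System.out.println("getDeclaredFields(): " + f.getName());
        }
    }
}
```

With getFields() the private String field is never visited, so code that relies on it to find variable-length fields would not see it.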



[jira] Commented: (HIVE-256) map side aggregation : number of output rows is same as number of input rows

Posted by "Ashish Thusoo (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12668282#action_12668282 ] 

Ashish Thusoo commented on HIVE-256:
------------------------------------

where is the aggrPositions variable used? Seems like it is not used anywhere...



[jira] Commented: (HIVE-256) map side aggregation : number of output rows is same as number of input rows

Posted by "Ashish Thusoo (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12668307#action_12668307 ] 

Ashish Thusoo commented on HIVE-256:
------------------------------------

got it..

+1

Running tests now and will check in if everything passes.




[jira] Updated: (HIVE-256) map side aggregation : number of output rows is same as number of input rows

Posted by "Namit Jain (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Namit Jain updated HIVE-256:
----------------------------

    Status: Patch Available  (was: Open)



[jira] Updated: (HIVE-256) map side aggregation : number of output rows is same as number of input rows

Posted by "Ashish Thusoo (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ashish Thusoo updated HIVE-256:
-------------------------------

       Resolution: Fixed
    Fix Version/s: 0.2.0
     Hadoop Flags: [Reviewed]
           Status: Resolved  (was: Patch Available)

committed. Thanks Namit!!



[jira] Commented: (HIVE-256) map side aggregation : number of output rows is same as number of input rows

Posted by "Namit Jain (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12668311#action_12668311 ] 

Namit Jain commented on HIVE-256:
---------------------------------

yes, ran that on the production successfully



[jira] Commented: (HIVE-256) map side aggregation : number of output rows is same as number of input rows

Posted by "Joydeep Sen Sarma (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12668341#action_12668341 ] 

Joydeep Sen Sarma commented on HIVE-256:
----------------------------------------

the getSize() call ignores the field altogether. it looks at the aggregation class. the code seems to indicate that the intent was to sum up the sizes of the fields in the aggregation class - so something is amiss there. perhaps we should just open a separate jira since the issue here is resolved.
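
A rough sketch of what summing per-field sizes could look like, assuming that was the intent. This is not the Hive implementation: sizeOf(), the per-type byte counts, and the two sample classes are all hypothetical; only the 256-byte fallback for unknown types comes from this thread.

```java
import java.lang.reflect.Field;

// Hypothetical sketch: estimate a row size from each field's own type.
final class FieldSizeSketch {
    // Fallback for types with no fixed size (the 256 heuristic from the thread).
    static final int UNKNOWN_TYPE_SIZE = 256;

    // Assumed per-type sizes in bytes; illustrative values only.
    static int sizeOf(Field f) {
        Class<?> t = f.getType();
        if (t == boolean.class || t == byte.class) return 1;
        if (t == short.class || t == char.class) return 2;
        if (t == int.class || t == float.class) return 4;
        if (t == long.class || t == double.class) return 8;
        return UNKNOWN_TYPE_SIZE;
    }

    // Sums the estimate over all declared fields, private ones included.
    static int fixedRowSize(Class<?> aggClass) {
        int total = 0;
        for (Field f : aggClass.getDeclaredFields()) {
            if (f.isSynthetic()) continue; // skip compiler-generated fields
            total += sizeOf(f);
        }
        return total;
    }

    // Made-up aggregation-like classes.
    static class SumLikeAgg { private double sum; private boolean empty; }
    static class MaxLikeAgg { private String max; private boolean empty; }

    public static void main(String[] args) {
        System.out.println("SumLikeAgg: " + fixedRowSize(SumLikeAgg.class));
        System.out.println("MaxLikeAgg: " + fixedRowSize(MaxLikeAgg.class));
    }
}
```

Here the decision depends on f.getType() rather than on the class of the aggregation object, so a class made only of primitives gets a small estimate and only the String field falls back to 256.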

ok - got it on the numEntries thing - seems like we never call shouldFlushed with hashsize of zero.



[jira] Commented: (HIVE-256) map side aggregation : number of output rows is same as number of input rows

Posted by "Joydeep Sen Sarma (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12668310#action_12668310 ] 

Joydeep Sen Sarma commented on HIVE-256:
----------------------------------------

is this the source of the count(1) issue?



[jira] Commented: (HIVE-256) map side aggregation : number of output rows is same as number of input rows

Posted by "Ashish Thusoo (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12668339#action_12668339 ] 

Ashish Thusoo commented on HIVE-256:
------------------------------------

From what Namit told me, there were 3 bugs:

1. aggrPositions was not being initialized, so we were not tracking the variable-length fields properly.
2. Instead of getFields he had to use getDeclaredFields, as the former only gives public fields whereas most of our fields are private.
3. The numEntries stuff, which would not let the code kick in if there were fewer than NUMROWSESTIMATESIZE rows...

We had used 256 as a heuristic for unknown types...



[jira] Commented: (HIVE-256) map side aggregation : number of output rows is same as number of input rows

Posted by "Ashish Thusoo (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12668281#action_12668281 ] 

Ashish Thusoo commented on HIVE-256:
------------------------------------

can you give a description of the bug? How is it triggered?
