Posted to dev@hive.apache.org by "He Yongqiang (JIRA)" <ji...@apache.org> on 2010/01/23 21:00:18 UTC

[jira] Created: (HIVE-1088) RCFile RecordReader's first split will read duplicate rows if the split end is < the first SYNC mark

RCFile RecordReader's first split will read duplicate rows if the split end is < the first SYNC mark
----------------------------------------------------------------------------------------------------

                 Key: HIVE-1088
                 URL: https://issues.apache.org/jira/browse/HIVE-1088
             Project: Hadoop Hive
          Issue Type: Bug
            Reporter: He Yongqiang
             Fix For: 0.5.0, 0.6.0
         Attachments: hive-rcfile-reader-branch-0.5.patch, hive-rcfile-reader-trunk.patch



-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HIVE-1088) RCFile RecordReader's first split will read duplicate rows if the split end is < the first SYNC mark

Posted by "Carl Steinbach (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-1088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Carl Steinbach updated HIVE-1088:
---------------------------------

          Component/s: Serializers/Deserializers
    Affects Version/s:     (was: 0.6.0)
                           (was: 0.5.0)
        Fix Version/s:     (was: 0.6.0)

> RCFile RecordReader's first split will read duplicate rows if the split end is < the first SYNC mark
> ----------------------------------------------------------------------------------------------------
>
>                 Key: HIVE-1088
>                 URL: https://issues.apache.org/jira/browse/HIVE-1088
>             Project: Hadoop Hive
>          Issue Type: Bug
>          Components: Serializers/Deserializers
>            Reporter: He Yongqiang
>            Assignee: He Yongqiang
>            Priority: Blocker
>             Fix For: 0.5.0
>
>         Attachments: hive-1088-branch0.5-2010-1-25.2.patch, hive-1088-branch0.5-2010-1-25.patch, hive-1088-trunk-2010-1-25.2.patch, hive-1088-trunk-2010-1-25.patch, hive-rcfile-reader-branch-0.5.patch, hive-rcfile-reader-trunk.2.patch, hive-rcfile-reader-trunk.patch
>
>




[jira] Updated: (HIVE-1088) RCFile RecordReader's first split will read duplicate rows if the split end is < the first SYNC mark

Posted by "He Yongqiang (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-1088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

He Yongqiang updated HIVE-1088:
-------------------------------

    Attachment: hive-1088-trunk-2010-1-25.patch
                hive-1088-branch0.5-2010-1-25.patch

Attached 2 patches. hive-1088-branch0.5-2010-1-25.patch is for branch 0.5, and hive-1088-trunk-2010-1-25.patch is for trunk code.

> RCFile RecordReader's first split will read duplicate rows if the split end is < the first SYNC mark
> ----------------------------------------------------------------------------------------------------
>
>                 Key: HIVE-1088
>                 URL: https://issues.apache.org/jira/browse/HIVE-1088
>             Project: Hadoop Hive
>          Issue Type: Bug
>    Affects Versions: 0.5.0, 0.6.0
>            Reporter: He Yongqiang
>            Assignee: He Yongqiang
>            Priority: Blocker
>             Fix For: 0.5.0, 0.6.0
>
>         Attachments: hive-1088-branch0.5-2010-1-25.patch, hive-1088-trunk-2010-1-25.patch, hive-rcfile-reader-branch-0.5.patch, hive-rcfile-reader-trunk.2.patch, hive-rcfile-reader-trunk.patch
>
>




[jira] Updated: (HIVE-1088) RCFile RecordReader's first split will read duplicate rows if the split end is < the first SYNC mark

Posted by "He Yongqiang (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-1088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

He Yongqiang updated HIVE-1088:
-------------------------------

    Attachment:     (was: hive-1088-trunk-2010-1-25.2.patch)

> RCFile RecordReader's first split will read duplicate rows if the split end is < the first SYNC mark
> ----------------------------------------------------------------------------------------------------
>
>                 Key: HIVE-1088
>                 URL: https://issues.apache.org/jira/browse/HIVE-1088
>             Project: Hadoop Hive
>          Issue Type: Bug
>    Affects Versions: 0.5.0, 0.6.0
>            Reporter: He Yongqiang
>            Assignee: He Yongqiang
>            Priority: Blocker
>             Fix For: 0.5.0, 0.6.0
>
>         Attachments: hive-1088-branch0.5-2010-1-25.patch, hive-1088-trunk-2010-1-25.patch, hive-rcfile-reader-branch-0.5.patch, hive-rcfile-reader-trunk.2.patch, hive-rcfile-reader-trunk.patch
>
>




[jira] Updated: (HIVE-1088) RCFile RecordReader's first split will read duplicate rows if the split end is < the first SYNC mark

Posted by "He Yongqiang (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-1088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

He Yongqiang updated HIVE-1088:
-------------------------------

    Attachment: hive-1088-trunk-2010-1-25.2.patch
                hive-1088-branch0.5-2010-1-25.2.patch

> RCFile RecordReader's first split will read duplicate rows if the split end is < the first SYNC mark
> ----------------------------------------------------------------------------------------------------
>
>                 Key: HIVE-1088
>                 URL: https://issues.apache.org/jira/browse/HIVE-1088
>             Project: Hadoop Hive
>          Issue Type: Bug
>    Affects Versions: 0.5.0, 0.6.0
>            Reporter: He Yongqiang
>            Assignee: He Yongqiang
>            Priority: Blocker
>             Fix For: 0.5.0, 0.6.0
>
>         Attachments: hive-1088-branch0.5-2010-1-25.2.patch, hive-1088-branch0.5-2010-1-25.patch, hive-1088-trunk-2010-1-25.2.patch, hive-1088-trunk-2010-1-25.patch, hive-rcfile-reader-branch-0.5.patch, hive-rcfile-reader-trunk.2.patch, hive-rcfile-reader-trunk.patch
>
>




[jira] Updated: (HIVE-1088) RCFile RecordReader's first split will read duplicate rows if the split end is < the first SYNC mark

Posted by "He Yongqiang (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-1088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

He Yongqiang updated HIVE-1088:
-------------------------------

    Attachment: hive-rcfile-reader-trunk.patch
                hive-rcfile-reader-branch-0.5.patch

> RCFile RecordReader's first split will read duplicate rows if the split end is < the first SYNC mark
> ----------------------------------------------------------------------------------------------------
>
>                 Key: HIVE-1088
>                 URL: https://issues.apache.org/jira/browse/HIVE-1088
>             Project: Hadoop Hive
>          Issue Type: Bug
>            Reporter: He Yongqiang
>             Fix For: 0.5.0, 0.6.0
>
>         Attachments: hive-rcfile-reader-branch-0.5.patch, hive-rcfile-reader-trunk.patch
>
>




[jira] Updated: (HIVE-1088) RCFile RecordReader's first split will read duplicate rows if the split end is < the first SYNC mark

Posted by "Zheng Shao (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-1088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zheng Shao updated HIVE-1088:
-----------------------------

             Priority: Blocker  (was: Major)
    Affects Version/s: 0.6.0
                       0.5.0

> RCFile RecordReader's first split will read duplicate rows if the split end is < the first SYNC mark
> ----------------------------------------------------------------------------------------------------
>
>                 Key: HIVE-1088
>                 URL: https://issues.apache.org/jira/browse/HIVE-1088
>             Project: Hadoop Hive
>          Issue Type: Bug
>    Affects Versions: 0.5.0, 0.6.0
>            Reporter: He Yongqiang
>            Assignee: He Yongqiang
>            Priority: Blocker
>             Fix For: 0.5.0, 0.6.0
>
>         Attachments: hive-rcfile-reader-branch-0.5.patch, hive-rcfile-reader-trunk.2.patch, hive-rcfile-reader-trunk.patch
>
>




[jira] Commented: (HIVE-1088) RCFile RecordReader's first split will read duplicate rows if the split end is < the first SYNC mark

Posted by "Ning Zhang (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-1088?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12804853#action_12804853 ] 

Ning Zhang commented on HIVE-1088:
----------------------------------

+1. Looks good. Will commit if tests pass.

> RCFile RecordReader's first split will read duplicate rows if the split end is < the first SYNC mark
> ----------------------------------------------------------------------------------------------------
>
>                 Key: HIVE-1088
>                 URL: https://issues.apache.org/jira/browse/HIVE-1088
>             Project: Hadoop Hive
>          Issue Type: Bug
>    Affects Versions: 0.5.0, 0.6.0
>            Reporter: He Yongqiang
>            Assignee: He Yongqiang
>            Priority: Blocker
>             Fix For: 0.5.0, 0.6.0
>
>         Attachments: hive-1088-branch0.5-2010-1-25.2.patch, hive-1088-branch0.5-2010-1-25.patch, hive-1088-trunk-2010-1-25.2.patch, hive-1088-trunk-2010-1-25.patch, hive-rcfile-reader-branch-0.5.patch, hive-rcfile-reader-trunk.2.patch, hive-rcfile-reader-trunk.patch
>
>




[jira] Updated: (HIVE-1088) RCFile RecordReader's first split will read duplicate rows if the split end is < the first SYNC mark

Posted by "He Yongqiang (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-1088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

He Yongqiang updated HIVE-1088:
-------------------------------

    Attachment:     (was: hive-1088-branch0.5-2010-1-25.2.patch)

> RCFile RecordReader's first split will read duplicate rows if the split end is < the first SYNC mark
> ----------------------------------------------------------------------------------------------------
>
>                 Key: HIVE-1088
>                 URL: https://issues.apache.org/jira/browse/HIVE-1088
>             Project: Hadoop Hive
>          Issue Type: Bug
>    Affects Versions: 0.5.0, 0.6.0
>            Reporter: He Yongqiang
>            Assignee: He Yongqiang
>            Priority: Blocker
>             Fix For: 0.5.0, 0.6.0
>
>         Attachments: hive-1088-branch0.5-2010-1-25.patch, hive-1088-trunk-2010-1-25.patch, hive-rcfile-reader-branch-0.5.patch, hive-rcfile-reader-trunk.2.patch, hive-rcfile-reader-trunk.patch
>
>




[jira] Resolved: (HIVE-1088) RCFile RecordReader's first split will read duplicate rows if the split end is < the first SYNC mark

Posted by "Ning Zhang (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-1088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ning Zhang resolved HIVE-1088.
------------------------------

    Resolution: Fixed

Committed to 0.5.0 and trunk (0.6.0).  Thanks Yongqiang!

> RCFile RecordReader's first split will read duplicate rows if the split end is < the first SYNC mark
> ----------------------------------------------------------------------------------------------------
>
>                 Key: HIVE-1088
>                 URL: https://issues.apache.org/jira/browse/HIVE-1088
>             Project: Hadoop Hive
>          Issue Type: Bug
>    Affects Versions: 0.5.0, 0.6.0
>            Reporter: He Yongqiang
>            Assignee: He Yongqiang
>            Priority: Blocker
>             Fix For: 0.5.0, 0.6.0
>
>         Attachments: hive-1088-branch0.5-2010-1-25.2.patch, hive-1088-branch0.5-2010-1-25.patch, hive-1088-trunk-2010-1-25.2.patch, hive-1088-trunk-2010-1-25.patch, hive-rcfile-reader-branch-0.5.patch, hive-rcfile-reader-trunk.2.patch, hive-rcfile-reader-trunk.patch
>
>




[jira] Updated: (HIVE-1088) RCFile RecordReader's first split will read duplicate rows if the split end is < the first SYNC mark

Posted by "He Yongqiang (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-1088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

He Yongqiang updated HIVE-1088:
-------------------------------

    Attachment: hive-rcfile-reader-trunk.2.patch

Added a testcase in the patch for trunk. (hive-rcfile-reader-trunk.2.patch)

> RCFile RecordReader's first split will read duplicate rows if the split end is < the first SYNC mark
> ----------------------------------------------------------------------------------------------------
>
>                 Key: HIVE-1088
>                 URL: https://issues.apache.org/jira/browse/HIVE-1088
>             Project: Hadoop Hive
>          Issue Type: Bug
>            Reporter: He Yongqiang
>            Assignee: He Yongqiang
>             Fix For: 0.5.0, 0.6.0
>
>         Attachments: hive-rcfile-reader-branch-0.5.patch, hive-rcfile-reader-trunk.2.patch, hive-rcfile-reader-trunk.patch
>
>




[jira] Updated: (HIVE-1088) RCFile RecordReader's first split will read duplicate rows if the split end is < the first SYNC mark

Posted by "He Yongqiang (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-1088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

He Yongqiang updated HIVE-1088:
-------------------------------

    Attachment: hive-1088-trunk-2010-1-25.2.patch
                hive-1088-branch0.5-2010-1-25.2.patch

> RCFile RecordReader's first split will read duplicate rows if the split end is < the first SYNC mark
> ----------------------------------------------------------------------------------------------------
>
>                 Key: HIVE-1088
>                 URL: https://issues.apache.org/jira/browse/HIVE-1088
>             Project: Hadoop Hive
>          Issue Type: Bug
>    Affects Versions: 0.5.0, 0.6.0
>            Reporter: He Yongqiang
>            Assignee: He Yongqiang
>            Priority: Blocker
>             Fix For: 0.5.0, 0.6.0
>
>         Attachments: hive-1088-branch0.5-2010-1-25.2.patch, hive-1088-branch0.5-2010-1-25.patch, hive-1088-trunk-2010-1-25.2.patch, hive-1088-trunk-2010-1-25.patch, hive-rcfile-reader-branch-0.5.patch, hive-rcfile-reader-trunk.2.patch, hive-rcfile-reader-trunk.patch
>
>




[jira] Commented: (HIVE-1088) RCFile RecordReader's first split will read duplicate rows if the split end is < the first SYNC mark

Posted by "He Yongqiang (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-1088?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12804434#action_12804434 ] 

He Yongqiang commented on HIVE-1088:
------------------------------------

Will add a testcase. 

>>Also, do you think it is a good idea to convert a subset of tests to use rcfile ?
Yes. We may need to do this sooner or later, and right now is a good time. It may be better to do this in a separate jira.
What do others think?

> RCFile RecordReader's first split will read duplicate rows if the split end is < the first SYNC mark
> ----------------------------------------------------------------------------------------------------
>
>                 Key: HIVE-1088
>                 URL: https://issues.apache.org/jira/browse/HIVE-1088
>             Project: Hadoop Hive
>          Issue Type: Bug
>            Reporter: He Yongqiang
>            Assignee: He Yongqiang
>             Fix For: 0.5.0, 0.6.0
>
>         Attachments: hive-rcfile-reader-branch-0.5.patch, hive-rcfile-reader-trunk.patch
>
>




[jira] Assigned: (HIVE-1088) RCFile RecordReader's first split will read duplicate rows if the split end is < the first SYNC mark

Posted by "Namit Jain (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-1088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Namit Jain reassigned HIVE-1088:
--------------------------------

    Assignee: He Yongqiang

> RCFile RecordReader's first split will read duplicate rows if the split end is < the first SYNC mark
> ----------------------------------------------------------------------------------------------------
>
>                 Key: HIVE-1088
>                 URL: https://issues.apache.org/jira/browse/HIVE-1088
>             Project: Hadoop Hive
>          Issue Type: Bug
>            Reporter: He Yongqiang
>            Assignee: He Yongqiang
>             Fix For: 0.5.0, 0.6.0
>
>         Attachments: hive-rcfile-reader-branch-0.5.patch, hive-rcfile-reader-trunk.patch
>
>




[jira] Commented: (HIVE-1088) RCFile RecordReader's first split will read duplicate rows if the split end is < the first SYNC mark

Posted by "Namit Jain (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-1088?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12804389#action_12804389 ] 

Namit Jain commented on HIVE-1088:
----------------------------------

Can you add a test for this ?

Also, do you think it is a good idea to convert a subset of tests to use rcfile ?

> RCFile RecordReader's first split will read duplicate rows if the split end is < the first SYNC mark
> ----------------------------------------------------------------------------------------------------
>
>                 Key: HIVE-1088
>                 URL: https://issues.apache.org/jira/browse/HIVE-1088
>             Project: Hadoop Hive
>          Issue Type: Bug
>            Reporter: He Yongqiang
>            Assignee: He Yongqiang
>             Fix For: 0.5.0, 0.6.0
>
>         Attachments: hive-rcfile-reader-branch-0.5.patch, hive-rcfile-reader-trunk.patch
>
>




[jira] Commented: (HIVE-1088) RCFile RecordReader's first split will read duplicate rows if the split end is < the first SYNC mark

Posted by "Ning Zhang (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-1088?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12804639#action_12804639 ] 

Ning Zhang commented on HIVE-1088:
----------------------------------

Another suggestion: in RCFileRecordReader.java, next(LongWritable, BytesRefArrayWritable) shares much common code with next(LongWritable). Can you refactor this function to something like:

public boolean next(LongWritable key, BytesRefArrayWritable value)
    throws IOException {
  more = next(key);
  if (more) {
    in.getCurrentRow(value);
  }
  return more;
}

This will always keep the logic consistent for the two next() functions.
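
To illustrate the shape of that refactoring outside of Hive, here is a minimal, self-contained sketch. It is not the RCFileRecordReader code: the class name DelegatingReaderSketch, the in-memory list of rows, and the use of long[] and StringBuilder in place of LongWritable and BytesRefArrayWritable are all assumptions made for illustration. The only point it shows is that the two-argument next() holds no advancing or end-of-input logic of its own.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical stand-in for a record reader; not Hive's RCFileRecordReader. */
public class DelegatingReaderSketch {
  private final List<String> rows;  // stand-in for the underlying file reader ("in")
  private int nextIndex = 0;        // index of the next row to hand out
  private boolean more = true;      // mirrors the "more" flag used in the suggestion

  public DelegatingReaderSketch(List<String> rows) {
    this.rows = rows;
  }

  /** Advances to the next row and stores its index in key[0]. */
  public boolean next(long[] key) {
    more = nextIndex < rows.size();
    if (more) {
      key[0] = nextIndex;
      nextIndex++;
    }
    return more;
  }

  /** Delegates advancing and end-of-input handling to next(key). */
  public boolean next(long[] key, StringBuilder value) {
    more = next(key);
    if (more) {
      // Stand-in for in.getCurrentRow(value) in the suggestion above.
      value.setLength(0);
      value.append(rows.get((int) key[0]));
    }
    return more;
  }

  public static void main(String[] args) {
    DelegatingReaderSketch reader =
        new DelegatingReaderSketch(Arrays.asList("r0", "r1", "r2"));
    long[] key = new long[1];
    StringBuilder value = new StringBuilder();
    List<String> seen = new ArrayList<>();
    while (reader.next(key, value)) {
      seen.add(key[0] + "=" + value);
    }
    System.out.println(seen);  // prints [0=r0, 1=r1, 2=r2]
  }
}
```

With this shape, any change to the stop condition (for example, a split-boundary check) only has to be made in next(key).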

> RCFile RecordReader's first split will read duplicate rows if the split end is < the first SYNC mark
> ----------------------------------------------------------------------------------------------------
>
>                 Key: HIVE-1088
>                 URL: https://issues.apache.org/jira/browse/HIVE-1088
>             Project: Hadoop Hive
>          Issue Type: Bug
>            Reporter: He Yongqiang
>            Assignee: He Yongqiang
>             Fix For: 0.5.0, 0.6.0
>
>         Attachments: hive-rcfile-reader-branch-0.5.patch, hive-rcfile-reader-trunk.2.patch, hive-rcfile-reader-trunk.patch
>
>




[jira] Commented: (HIVE-1088) RCFile RecordReader's first split will read duplicate rows if the split end is < the first SYNC mark

Posted by "Ning Zhang (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-1088?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12804632#action_12804632 ] 

Ning Zhang commented on HIVE-1088:
----------------------------------

Yongqiang, I noticed you have added a unit test in TestRCFile.java. Is it possible to add a unit test as a .q file? The benefit of doing this is that TestRCFile only tests one code path for particular functions. If we change the code path later in the real execution engine, we may not cover the error case, but a .q file will be more likely to catch the case.

> RCFile RecordReader's first split will read duplicate rows if the split end is < the first SYNC mark
> ----------------------------------------------------------------------------------------------------
>
>                 Key: HIVE-1088
>                 URL: https://issues.apache.org/jira/browse/HIVE-1088
>             Project: Hadoop Hive
>          Issue Type: Bug
>            Reporter: He Yongqiang
>            Assignee: He Yongqiang
>             Fix For: 0.5.0, 0.6.0
>
>         Attachments: hive-rcfile-reader-branch-0.5.patch, hive-rcfile-reader-trunk.2.patch, hive-rcfile-reader-trunk.patch
>
>

