Posted to dev@hive.apache.org by "Zheng Shao (JIRA)" <ji...@apache.org> on 2009/01/16 12:23:59 UTC

[jira] Created: (HIVE-235) DynamicSerDe does not work with Thrift Protocols that can have missing fields for null values

DynamicSerDe does not work with Thrift Protocols that can have missing fields for null values
---------------------------------------------------------------------------------------------

                 Key: HIVE-235
                 URL: https://issues.apache.org/jira/browse/HIVE-235
             Project: Hadoop Hive
          Issue Type: Bug
          Components: Serializers/Deserializers
            Reporter: Zheng Shao
            Priority: Blocker


The current DynamicSerDe code assumes that every field is present in each record.

However, Thrift protocols can omit a field when its value is null.

In that case, DynamicSerDe may misbehave in two ways:
1. An array-index-out-of-bounds error, because DynamicSerDe assumes the number of fields in the record equals the number of fields in the DDL;
2. Fields with null values take the value from the previous record. This may produce wrong results for queries.

In order to fix this, we need to:
1. Pass ObjectInspector/TypeInfo recursively so that we know the number of fields when deserializing the record.
2. Clear out fields that are missing from the record.
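
The second failure mode and the "clear out" fix can be illustrated with a minimal sketch. This is not the DynamicSerDe code: the class ReusedRowSketch, its method names, and the map-based record are all hypothetical, and the sketch only assumes that the deserializer reuses one row object across records.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a deserializer that reuses one row array
// for every record. None of these names come from Hive.
class ReusedRowSketch {
    private final Object[] row; // sized by the number of fields in the DDL

    ReusedRowSketch(int declaredFields) {
        this.row = new Object[declaredFields];
    }

    // Buggy variant: overwrites only the fields present in the record,
    // so an absent (null) field keeps the previous record's value.
    Object[] deserializeWithoutClearing(Map<Integer, Object> record) {
        for (Map.Entry<Integer, Object> e : record.entrySet()) {
            row[e.getKey()] = e.getValue();
        }
        return row;
    }

    // Fixed variant: reset every slot first, so absent fields read as null.
    Object[] deserializeWithClearing(Map<Integer, Object> record) {
        Arrays.fill(row, null);
        return deserializeWithoutClearing(record);
    }

    public static void main(String[] args) {
        Map<Integer, Object> first = new HashMap<>();
        first.put(0, "a");
        first.put(1, "b");
        Map<Integer, Object> second = new HashMap<>();
        second.put(0, "c"); // field 1 is null, so the protocol omits it

        ReusedRowSketch buggy = new ReusedRowSketch(2);
        buggy.deserializeWithoutClearing(first);
        // prints [c, b]: the stale "b" leaks in from the first record
        System.out.println(Arrays.toString(buggy.deserializeWithoutClearing(second)));

        ReusedRowSketch fixed = new ReusedRowSketch(2);
        fixed.deserializeWithClearing(first);
        // prints [c, null]
        System.out.println(Arrays.toString(fixed.deserializeWithClearing(second)));
    }
}
```

Clearing the whole row up front is the simplest option for a sketch; an implementation could equally track which fields were seen and null out only the rest.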


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HIVE-235) DynamicSerDe does not work with Thrift Protocols that can have missing fields for null values

Posted by "Zheng Shao (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-235?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zheng Shao updated HIVE-235:
----------------------------

    Attachment: HIVE-235.2.patch

Had to change the test case to make sure the output is deterministic.





[jira] Commented: (HIVE-235) DynamicSerDe does not work with Thrift Protocols that can have missing fields for null values

Posted by "Joydeep Sen Sarma (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12665517#action_12665517 ] 

Joydeep Sen Sarma commented on HIVE-235:
----------------------------------------

+1




[jira] Resolved: (HIVE-235) DynamicSerDe does not work with Thrift Protocols that can have missing fields for null values

Posted by "Zheng Shao (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-235?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zheng Shao resolved HIVE-235.
-----------------------------

      Resolution: Fixed
    Release Note: 
HIVE-235. Fixed DynamicSerDe to work with null values with Thrift Protocols that can have missing fields for null values. (zshao)

    Hadoop Flags: [Reviewed]

Committed revision 736092.




[jira] Updated: (HIVE-235) DynamicSerDe does not work with Thrift Protocols that can have missing fields for null values

Posted by "Carl Steinbach (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-235?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Carl Steinbach updated HIVE-235:
--------------------------------

    Fix Version/s: 0.3.0




[jira] Commented: (HIVE-235) DynamicSerDe does not work with Thrift Protocols that can have missing fields for null values

Posted by "Zheng Shao (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12665502#action_12665502 ] 

Zheng Shao commented on HIVE-235:
---------------------------------

You are right. getNumFields actually returns the same value as the length of ordered_types. It is not related to the number of non-null fields actually present in the record.
I thought that might be a problem, but it is not. In practice, we have not seen the array-index-out-of-bounds error.





[jira] Assigned: (HIVE-235) DynamicSerDe does not work with Thrift Protocols that can have missing fields for null values

Posted by "Zheng Shao (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-235?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zheng Shao reassigned HIVE-235:
-------------------------------

    Assignee: Zheng Shao




[jira] Commented: (HIVE-235) DynamicSerDe does not work with Thrift Protocols that can have missing fields for null values

Posted by "Joydeep Sen Sarma (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12665318#action_12665318 ] 

Joydeep Sen Sarma commented on HIVE-235:
----------------------------------------

Looks good for handling the null columns case.

By the way, getNumFields and ordered_types have the same length. It doesn't seem to me that the SerDe assumes the record has to have all fields (it bails out when it hits stop, even if not all fields have been read). So if we are trying to solve an out-of-bounds issue, I'm not sure this is going to resolve it.
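
The point about bailing out at stop can be sketched as follows. This is an illustration under stated assumptions, not the SerDe's actual read loop: StopMarkerLoopSketch, its Field type, and the STOP constant are hypothetical, and the wire format is reduced to a sequence of (field id, value) pairs ending in a stop marker.

```java
import java.util.Iterator;

// Hypothetical sketch of a struct read loop. None of these names come
// from Hive or Thrift; they only illustrate the control flow.
class StopMarkerLoopSketch {
    static final int STOP = -1; // assumed end-of-struct marker

    // One (field id, value) pair as it might appear on the wire.
    static final class Field {
        final int id;
        final Object value;

        Field(int id, Object value) {
            this.id = id;
            this.value = value;
        }
    }

    // The row is sized by the declared schema, not by the record, and the
    // loop ends at STOP, so a record with fewer fields never indexes past
    // the end of the array.
    static Object[] readStruct(Iterator<Field> in, int declaredFields) {
        Object[] row = new Object[declaredFields];
        while (in.hasNext()) {
            Field f = in.next();
            if (f.id == STOP) {
                break; // bail out even if not all declared fields were read
            }
            row[f.id] = f.value;
        }
        return row;
    }
}
```

With three declared fields and a record carrying only field 0, readStruct returns a three-element array whose other slots are null, with no out-of-bounds access. The sketch allocates a fresh array per call, so it does not show the stale-value problem, which only arises when the row object is reused.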





[jira] Updated: (HIVE-235) DynamicSerDe does not work with Thrift Protocols that can have missing fields for null values

Posted by "Zheng Shao (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-235?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zheng Shao updated HIVE-235:
----------------------------

    Attachment: HIVE-235.1.patch

It turns out to be simpler than I thought. We don't need to pass ObjectInspector because the class already knows the number of fields.

Also added a test case.


