Posted to dev@hive.apache.org by "Zheng Shao (JIRA)" <ji...@apache.org> on 2009/06/03 10:58:07 UTC

[jira] Created: (HIVE-537) Hive TypeInfo/ObjectInspector to support union (besides struct, array, and map)

Hive TypeInfo/ObjectInspector to support union (besides struct, array, and map)
-------------------------------------------------------------------------------

                 Key: HIVE-537
                 URL: https://issues.apache.org/jira/browse/HIVE-537
             Project: Hadoop Hive
          Issue Type: New Feature
            Reporter: Zheng Shao


There are already some places in the code where we use heterogeneous data: JoinOperator and UnionOperator (in the sense that different parents can pass in records with different ObjectInspectors).
We currently use the Operator's parentID to distinguish them. However, that approach does not extend to more complex plans that might be needed in the future.

We will support the union type like this:

{code}
TypeDefinition:
  type: primitivetype | structtype | arraytype | maptype | uniontype
  uniontype: "union" "<" tag ":" type ("," tag ":" type)* ">"
Example:
  union<0:int,1:double,2:array<string>,3:struct<a:int,b:string>>

Serialized data format:
  We store the tag byte first, then the serialized object. On deserialization, we read the tag byte first; it tells us the type of the object that follows, so we can deserialize it correctly.

Interface for ObjectInspector:
interface UnionObjectInspector {
  /** Returns the array of ObjectInspectors (OIs), one for each of the tags
   */
  ObjectInspector[] getObjectInspectors();
  /** Return the tag of the object.
   */
  byte getTag(Object o);
  /** Return the field based on the tag value associated with the Object.
   */
  Object getField(Object o);
};

{code}
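As a rough illustration of the tag-byte scheme described above, here is a minimal, self-contained Java sketch. The names TaggedUnion and TagByteDemo, the two-alternative type union<0:int,1:string>, and the use of DataOutputStream encodings are all assumptions made for the example. This is not the Hive implementation, only one way the "write the tag, then the object" idea could look.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical holder for a value of type union<0:int,1:string>.
final class TaggedUnion {
  final byte tag;      // which alternative is stored
  final Object value;  // the stored object; its type is determined by the tag

  TaggedUnion(byte tag, Object value) {
    this.tag = tag;
    this.value = value;
  }
}

final class TagByteDemo {
  // Write the tag byte first, then the object in the encoding chosen by the tag.
  static byte[] serialize(TaggedUnion u) {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    try (DataOutputStream out = new DataOutputStream(bytes)) {
      out.writeByte(u.tag);
      switch (u.tag) {
        case 0: out.writeInt((Integer) u.value); break;
        case 1: out.writeUTF((String) u.value); break;
        default: throw new IllegalArgumentException("unknown tag " + u.tag);
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return bytes.toByteArray();
  }

  // Read the tag byte first; it tells us how to decode the bytes that follow.
  static TaggedUnion deserialize(byte[] data) {
    try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
      byte tag = in.readByte();
      switch (tag) {
        case 0: return new TaggedUnion(tag, in.readInt());
        case 1: return new TaggedUnion(tag, in.readUTF());
        default: throw new IllegalArgumentException("unknown tag " + tag);
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    TaggedUnion back = deserialize(serialize(new TaggedUnion((byte) 1, "login")));
    System.out.println(back.tag + " -> " + back.value);
  }
}
```

Because the reader learns the type only from the leading byte, both sides must agree on the tag-to-type mapping given in the union type definition.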


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HIVE-537) Hive TypeInfo/ObjectInspector to support union (besides struct, array, and map)

Posted by "Min Zhou (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12727999#action_12727999 ] 

Min Zhou commented on HIVE-537:
-------------------------------

Zheng, how would you get a field value from an object without an ordinal?


> Hive TypeInfo/ObjectInspector to support union (besides struct, array, and map)
> -------------------------------------------------------------------------------
>
>                 Key: HIVE-537
>                 URL: https://issues.apache.org/jira/browse/HIVE-537
>             Project: Hadoop Hive
>          Issue Type: New Feature
>            Reporter: Zheng Shao
>            Assignee: Min Zhou
>         Attachments: HIVE-537.1.patch
>
>


[jira] Updated: (HIVE-537) Hive TypeInfo/ObjectInspector to support union (besides struct, array, and map)

Posted by "John Sichi (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-537?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

John Sichi updated HIVE-537:
----------------------------

    Status: Open  (was: Patch Available)

> Hive TypeInfo/ObjectInspector to support union (besides struct, array, and map)
> -------------------------------------------------------------------------------
>
>                 Key: HIVE-537
>                 URL: https://issues.apache.org/jira/browse/HIVE-537
>             Project: Hadoop Hive
>          Issue Type: New Feature
>            Reporter: Zheng Shao
>            Assignee: Amareshwari Sriramadasu
>         Attachments: HIVE-537.1.patch, patch-537-1.txt, patch-537.txt
>
>


[jira] Updated: (HIVE-537) Hive TypeInfo/ObjectInspector to support union (besides struct, array, and map)

Posted by "Namit Jain (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-537?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Namit Jain updated HIVE-537:
----------------------------

      Resolution: Fixed
    Hadoop Flags: [Reviewed]
          Status: Resolved  (was: Patch Available)

Committed. Thanks, Amareshwari.

> Hive TypeInfo/ObjectInspector to support union (besides struct, array, and map)
> -------------------------------------------------------------------------------
>
>                 Key: HIVE-537
>                 URL: https://issues.apache.org/jira/browse/HIVE-537
>             Project: Hadoop Hive
>          Issue Type: New Feature
>            Reporter: Zheng Shao
>            Assignee: Amareshwari Sriramadasu
>             Fix For: 0.7.0
>
>         Attachments: HIVE-537.1.patch, patch-537-1.txt, patch-537-2.txt, patch-537-3.txt, patch-537-4.txt, patch-537-5.txt, patch-537.txt
>
>


[jira] Commented: (HIVE-537) Hive TypeInfo/ObjectInspector to support union (besides struct, array, and map)

Posted by "Zheng Shao (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12913670#action_12913670 ] 

Zheng Shao commented on HIVE-537:
---------------------------------

{code}
union<T0,T1,T2> create_union(byte tag, T0 o0, T1 o1, T2 o2, ...)
Some real examples:
union<School,Company> create_union( is_student ? 0 : 1, school, company)
{code}

Depending on the value of the tag, the returned union object will choose to store only the object corresponding to that tag.
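The behavior described above can be sketched in plain Java. This is only an illustration under stated assumptions: CreateUnionDemo, UnionValue, and createUnion are hypothetical names, the candidates are passed as ordinary Java objects, and nothing here reflects how the Hive UDF is actually implemented.

```java
final class CreateUnionDemo {
  // Hypothetical result type: keeps the tag and only the object selected by it.
  static final class UnionValue {
    final byte tag;
    final Object value;

    UnionValue(byte tag, Object value) {
      this.tag = tag;
      this.value = value;
    }
  }

  // Mirrors create_union(tag, o0, o1, ...): the tag picks which candidate is stored.
  static UnionValue createUnion(int tag, Object... candidates) {
    if (tag < 0 || tag >= candidates.length || tag > Byte.MAX_VALUE) {
      throw new IllegalArgumentException("tag out of range: " + tag);
    }
    return new UnionValue((byte) tag, candidates[tag]);
  }

  public static void main(String[] args) {
    boolean isStudent = false;
    // Placeholder strings stand in for the School and Company objects.
    UnionValue u = createUnion(isStudent ? 0 : 1, "Some School", "Some Company");
    System.out.println(u.tag + ":" + u.value);
  }
}
```

The candidates that are not selected are simply dropped, so the resulting object carries one value plus the tag that identifies its type.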


> Hive TypeInfo/ObjectInspector to support union (besides struct, array, and map)
> -------------------------------------------------------------------------------
>
>                 Key: HIVE-537
>                 URL: https://issues.apache.org/jira/browse/HIVE-537
>             Project: Hadoop Hive
>          Issue Type: New Feature
>            Reporter: Zheng Shao
>            Assignee: Amareshwari Sriramadasu
>         Attachments: HIVE-537.1.patch, patch-537-1.txt, patch-537.txt
>
>


[jira] Commented: (HIVE-537) Hive TypeInfo/ObjectInspector to support union (besides struct, array, and map)

Posted by "Zheng Shao (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12912614#action_12912614 ] 

Zheng Shao commented on HIVE-537:
---------------------------------

I think so. Let's use a different name for the UDF.

Using 'UNION' as UDF name will not cause grammar ambiguity, but it may cause other issues in the future.

Zheng


> Hive TypeInfo/ObjectInspector to support union (besides struct, array, and map)
> -------------------------------------------------------------------------------
>
>                 Key: HIVE-537
>                 URL: https://issues.apache.org/jira/browse/HIVE-537
>             Project: Hadoop Hive
>          Issue Type: New Feature
>            Reporter: Zheng Shao
>            Assignee: Amareshwari Sriramadasu
>         Attachments: HIVE-537.1.patch, patch-537-1.txt, patch-537.txt
>
>


[jira] Assigned: (HIVE-537) Hive TypeInfo/ObjectInspector to support union (besides struct, array, and map)

Posted by "Zheng Shao (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-537?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zheng Shao reassigned HIVE-537:
-------------------------------

    Assignee: Min Zhou  (was: Zheng Shao)

> Hive TypeInfo/ObjectInspector to support union (besides struct, array, and map)
> -------------------------------------------------------------------------------
>
>                 Key: HIVE-537
>                 URL: https://issues.apache.org/jira/browse/HIVE-537
>             Project: Hadoop Hive
>          Issue Type: New Feature
>            Reporter: Zheng Shao
>            Assignee: Min Zhou
>         Attachments: HIVE-537.1.patch
>
>


[jira] Updated: (HIVE-537) Hive TypeInfo/ObjectInspector to support union (besides struct, array, and map)

Posted by "Min Zhou (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-537?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Min Zhou updated HIVE-537:
--------------------------

    Attachment: HIVE-537.1.patch

HIVE-537.1.patch

> Hive TypeInfo/ObjectInspector to support union (besides struct, array, and map)
> -------------------------------------------------------------------------------
>
>                 Key: HIVE-537
>                 URL: https://issues.apache.org/jira/browse/HIVE-537
>             Project: Hadoop Hive
>          Issue Type: New Feature
>            Reporter: Zheng Shao
>            Assignee: Zheng Shao
>         Attachments: HIVE-537.1.patch
>
>


[jira] Updated: (HIVE-537) Hive TypeInfo/ObjectInspector to support union (besides struct, array, and map)

Posted by "Amareshwari Sriramadasu (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-537?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Amareshwari Sriramadasu updated HIVE-537:
-----------------------------------------

    Assignee: Amareshwari Sriramadasu  (was: Min Zhou)

Min, if you are not working on this, I would like to work on the follow-up patch.

> Hive TypeInfo/ObjectInspector to support union (besides struct, array, and map)
> -------------------------------------------------------------------------------
>
>                 Key: HIVE-537
>                 URL: https://issues.apache.org/jira/browse/HIVE-537
>             Project: Hadoop Hive
>          Issue Type: New Feature
>            Reporter: Zheng Shao
>            Assignee: Amareshwari Sriramadasu
>         Attachments: HIVE-537.1.patch
>
>


[jira] Assigned: (HIVE-537) Hive TypeInfo/ObjectInspector to support union (besides struct, array, and map)

Posted by "Zheng Shao (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-537?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zheng Shao reassigned HIVE-537:
-------------------------------

    Assignee: Zheng Shao

> Hive TypeInfo/ObjectInspector to support union (besides struct, array, and map)
> -------------------------------------------------------------------------------
>
>                 Key: HIVE-537
>                 URL: https://issues.apache.org/jira/browse/HIVE-537
>             Project: Hadoop Hive
>          Issue Type: New Feature
>            Reporter: Zheng Shao
>            Assignee: Zheng Shao
>


[jira] Commented: (HIVE-537) Hive TypeInfo/ObjectInspector to support union (besides struct, array, and map)

Posted by "Zheng Shao (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12838256#action_12838256 ] 

Zheng Shao commented on HIVE-537:
---------------------------------

I suppose the tag is always the first field of an object. Is that reasonable?


> Hive TypeInfo/ObjectInspector to support union (besides struct, array, and map)
> -------------------------------------------------------------------------------
>
>                 Key: HIVE-537
>                 URL: https://issues.apache.org/jira/browse/HIVE-537
>             Project: Hadoop Hive
>          Issue Type: New Feature
>            Reporter: Zheng Shao
>            Assignee: Min Zhou
>         Attachments: HIVE-537.1.patch
>
>


[jira] Updated: (HIVE-537) Hive TypeInfo/ObjectInspector to support union (besides struct, array, and map)

Posted by "Zheng Shao (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-537?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zheng Shao updated HIVE-537:
----------------------------

    Description: 
There are already some places in the code where we use heterogeneous data: JoinOperator and UnionOperator (in the sense that different parents can pass in records with different ObjectInspectors).
We currently use the Operator's parentID to distinguish them. However, that approach does not extend to more complex plans that might be needed in the future.

We will support the union type like this:

{code}
TypeDefinition:
  type: primitivetype | structtype | arraytype | maptype | uniontype
  uniontype: "union" "<" tag ":" type ("," tag ":" type)* ">"
Example:
  union<0:int,1:double,2:array<string>,3:struct<a:int,b:string>>

Serialized data format:
  We store the tag byte first, then the serialized object. On deserialization, we read the tag byte first; it tells us the type of the object that follows, so we can deserialize it correctly.

Interface for ObjectInspector:
interface UnionObjectInspector {
  /** Returns the array of ObjectInspectors (OIs), one for each of the tags
   */
  ObjectInspector[] getObjectInspectors();
  /** Return the tag of the object.
   */
  byte getTag(Object o);
  /** Return the field based on the tag value associated with the Object.
   */
  Object getField(Object o);
};

An example serialization format (using a delimited format, with ' ' as the first-level delimiter and '=' as the second-level delimiter):
userid:int,log:union<0:struct<touserid:int,message:string>,1:string>
123 1=login
123 0=243=helloworld
123 1=logout

{code}
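To make the delimited example above concrete, the following Java sketch parses rows of that shape. DelimitedUnionDemo and parseRow are hypothetical names, and the sketch assumes that the text before the first '=' in the log column is the tag and that the struct fields of tag 0 are also separated by '='. It is an illustration, not the SerDe code.

```java
import java.util.Arrays;
import java.util.List;

final class DelimitedUnionDemo {
  // Returns [userid, tag, value]. For tag 1 the value is a String;
  // for tag 0 it is a List holding [touserid, message].
  static List<Object> parseRow(String line) {
    // First-level delimiter ' ' separates the userid column from the log column.
    String[] cols = line.split(" ", 2);
    int userid = Integer.parseInt(cols[0]);

    // Inside the union, the text before the first '=' is the tag.
    String[] tagAndBody = cols[1].split("=", 2);
    byte tag = Byte.parseByte(tagAndBody[0]);

    Object value;
    if (tag == 0) {
      // struct<touserid:int,message:string>
      String[] fields = tagAndBody[1].split("=", 2);
      value = Arrays.<Object>asList(Integer.parseInt(fields[0]), fields[1]);
    } else if (tag == 1) {
      // string
      value = tagAndBody[1];
    } else {
      throw new IllegalArgumentException("unknown tag " + tag);
    }
    return Arrays.<Object>asList(userid, tag, value);
  }

  public static void main(String[] args) {
    for (String row : new String[] {"123 1=login", "123 0=243=helloworld", "123 1=logout"}) {
      System.out.println(parseRow(row));
    }
  }
}
```

The tag is read before anything else in the column, which is what lets the parser pick the right field layout for the rest of the value.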


  was:
There are already some places in the code where we use heterogeneous data: JoinOperator and UnionOperator (in the sense that different parents can pass in records with different ObjectInspectors).
We currently use the Operator's parentID to distinguish them. However, that approach does not extend to more complex plans that might be needed in the future.

We will support the union type like this:

{code}
TypeDefinition:
  type: primitivetype | structtype | arraytype | maptype | uniontype
  uniontype: "union" "<" tag ":" type ("," tag ":" type)* ">"
Example:
  union<0:int,1:double,2:array<string>,3:struct<a:int,b:string>>

Serialized data format:
  We store the tag byte first, then the serialized object. On deserialization, we read the tag byte first; it tells us the type of the object that follows, so we can deserialize it correctly.

Interface for ObjectInspector:
interface UnionObjectInspector {
  /** Returns the array of ObjectInspectors (OIs), one for each of the tags
   */
  ObjectInspector[] getObjectInspectors();
  /** Return the tag of the object.
   */
  byte getTag(Object o);
  /** Return the field based on the tag value associated with the Object.
   */
  Object getField(Object o);
};

{code}



> Hive TypeInfo/ObjectInspector to support union (besides struct, array, and map)
> -------------------------------------------------------------------------------
>
>                 Key: HIVE-537
>                 URL: https://issues.apache.org/jira/browse/HIVE-537
>             Project: Hadoop Hive
>          Issue Type: New Feature
>            Reporter: Zheng Shao
>            Assignee: Zheng Shao
>
> There are already some cases inside the code that we use heterogeneous data: JoinOperator, and UnionOperator (in the sense that different parents can pass in records with different ObjectInspectors).
> We currently use Operator's parentID to distinguish that. However that approach does not extend to more complex plans that might be needed in the future.
> We will support the union type like this:
> {code}
> TypeDefinition:
>   type: primitivetype | structtype | arraytype | maptype | uniontype
>   uniontype: "union" "<" tag ":" type ("," tag ":" type)* ">"
> Example:
>   union<0:int,1:double,2:array<string>,3:struct<a:int,b:string>>
> Example of serialized data format:
>   We will first store the tag byte before we serialize the object. On deserialization, we will first read out the tag byte, then we know what is the current type of the following object, so we can deserialize it successfully.
> Interface for ObjectInspector:
> interface UnionObjectInspector {
>   /** Returns the array of OIs that are for each of the tags
>    */
>   ObjectInspector[] getObjectInspectors();
>   /** Return the tag of the object.
>    */
>   byte getTag(Object o);
>   /** Return the field based on the tag value associated with the Object.
>    */
>   Object getField(Object o);
> };
> An example serialization format (using a delimited format, with ' ' as the first-level delimiter and '=' as the second-level delimiter):
> userid:int,log:union<0:struct<touserid:int,message:string>,1:string>
> 123 1=login
> 123 0=243=helloworld
> 123 1=logout
> {code}
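
The three example rows can be decoded by hand along these lines. This is a minimal sketch that follows the rows exactly as written in the description, including the reuse of '=' between the struct fields; the class UnionRowDecoder and its Row type are hypothetical names, and nothing here is claimed to match how LazySimpleSerDe assigns delimiters.

```java
/** Hypothetical decoder for rows of userid:int, log:union<0:struct<...>,1:string>. */
final class UnionRowDecoder {

  /** Decoded row: userid, the union tag, and the fields of the selected branch. */
  static final class Row {
    final int userId;
    final byte tag;
    final String[] fields;

    Row(int userId, byte tag, String[] fields) {
      this.userId = userId;
      this.tag = tag;
      this.fields = fields;
    }
  }

  static Row decode(String line) {
    String[] columns = line.split(" ", 2);       // first-level delimiter: userid, log
    String[] tagged = columns[1].split("=", 2);  // union tag, then the payload
    byte tag = Byte.parseByte(tagged[0]);
    String[] fields;
    switch (tag) {
      case 0:  // struct<touserid:int,message:string>
        fields = tagged[1].split("=", 2);
        break;
      case 1:  // plain string
        fields = new String[] {tagged[1]};
        break;
      default:
        throw new IllegalArgumentException("unknown tag " + tag);
    }
    return new Row(Integer.parseInt(columns[0]), tag, fields);
  }
}
```

For "123 0=243=helloworld" the tag 0 selects the struct branch, so the payload is split once more into touserid and message; for "123 1=login" the tag 1 selects the string branch and the payload is taken as is.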



[jira] Commented: (HIVE-537) Hive TypeInfo/ObjectInspector to support union (besides struct, array, and map)

Posted by "Namit Jain (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12919015#action_12919015 ] 

Namit Jain commented on HIVE-537:
---------------------------------

Add the following line to serde.thrift:


const string FOO = "foo"


(whichever constant you want to add).

Then, call 

ant thriftif 

in the serde directory. It will generate the required files:

      src/gen-py/org_apache_hadoop_hive_serde/constants.py
      src/gen-java/org/apache/hadoop/hive/serde/Constants.java
      src/gen-cpp/serde_constants.cpp
      src/gen-cpp/serde_constants.h
      src/gen-php/serde_constants.php


Check in these files, but don't hand-edit them.

> Hive TypeInfo/ObjectInspector to support union (besides struct, array, and map)
> -------------------------------------------------------------------------------
>
>                 Key: HIVE-537
>                 URL: https://issues.apache.org/jira/browse/HIVE-537
>             Project: Hadoop Hive
>          Issue Type: New Feature
>            Reporter: Zheng Shao
>            Assignee: Amareshwari Sriramadasu
>             Fix For: 0.7.0
>
>         Attachments: HIVE-537.1.patch, patch-537-1.txt, patch-537-2.txt, patch-537-3.txt, patch-537-4.txt, patch-537.txt
>
>



[jira] Commented: (HIVE-537) Hive TypeInfo/ObjectInspector to support union (besides struct, array, and map)

Posted by "Zheng Shao (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12727846#action_12727846 ] 

Zheng Shao commented on HIVE-537:
---------------------------------

@HIVE-537.1.patch:
1. Can you remove the property changes? These java files don't need to be executable:
Property changes on: src/java/org/apache/hadoop/hive/serde2/objectinspector/StandardUnionObjectInspector.java
___________________________________________________________________
Name: svn:executable
   + *
2. UnionObjectInspector.java: byte getTag(Object o, int ordinal);
We don't need ordinal here.
3. Can you add union to TypeInfoUtils.java: class TypeInfoParser as well?
4. We need some test cases. Please take a look at TestStandardObjectInspectors.java
5. We need to add the capability of serializing/deserializing Union types to LazySimpleSerDe.
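
For item 3, the parsing step for the proposed grammar might look roughly like the sketch below. It is not the TypeInfoParser code: UnionTypeStringParser and parseBranches are hypothetical names, and the sketch only splits a union type string into its tag/type branches while respecting nested angle brackets, leaving each branch type as text.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper: splits "union<tag:type,...>" into its branches. */
final class UnionTypeStringParser {

  /** Returns tag -> type text, honouring nested '<' and '>' inside branch types. */
  static Map<Integer, String> parseBranches(String typeString) {
    if (!typeString.startsWith("union<") || !typeString.endsWith(">")) {
      throw new IllegalArgumentException("not a union type: " + typeString);
    }
    String body = typeString.substring("union<".length(), typeString.length() - 1);
    Map<Integer, String> branches = new LinkedHashMap<>();
    int depth = 0;  // current nesting level of '<' ... '>'
    int start = 0;  // start index of the branch being scanned
    for (int i = 0; i <= body.length(); i++) {
      // A virtual trailing comma closes the last branch.
      char c = i < body.length() ? body.charAt(i) : ',';
      if (c == '<') {
        depth++;
      } else if (c == '>') {
        depth--;
      } else if (c == ',' && depth == 0) {
        String branch = body.substring(start, i);
        int colon = branch.indexOf(':');
        if (colon < 0) {
          throw new IllegalArgumentException("missing tag in: " + branch);
        }
        branches.put(Integer.parseInt(branch.substring(0, colon)),
            branch.substring(colon + 1));
        start = i + 1;
      }
    }
    if (depth != 0) {
      throw new IllegalArgumentException("unbalanced brackets: " + typeString);
    }
    return branches;
  }
}
```

Applied to the example from the description, union<0:int,1:double,2:array<string>,3:struct<a:int,b:string>>, this yields four branches; the comma inside the struct is not treated as a branch separator because it sits at nesting depth 1.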


> Hive TypeInfo/ObjectInspector to support union (besides struct, array, and map)
> -------------------------------------------------------------------------------
>
>                 Key: HIVE-537
>                 URL: https://issues.apache.org/jira/browse/HIVE-537
>             Project: Hadoop Hive
>          Issue Type: New Feature
>            Reporter: Zheng Shao
>            Assignee: Min Zhou
>         Attachments: HIVE-537.1.patch
>
>



[jira] Commented: (HIVE-537) Hive TypeInfo/ObjectInspector to support union (besides struct, array, and map)

Posted by "Namit Jain (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12918740#action_12918740 ] 

Namit Jain commented on HIVE-537:
---------------------------------

Otherwise it looks good to me

> Hive TypeInfo/ObjectInspector to support union (besides struct, array, and map)
> -------------------------------------------------------------------------------
>
>                 Key: HIVE-537
>                 URL: https://issues.apache.org/jira/browse/HIVE-537
>             Project: Hadoop Hive
>          Issue Type: New Feature
>            Reporter: Zheng Shao
>            Assignee: Amareshwari Sriramadasu
>             Fix For: 0.7.0
>
>         Attachments: HIVE-537.1.patch, patch-537-1.txt, patch-537-2.txt, patch-537-3.txt, patch-537-4.txt, patch-537.txt
>
>



[jira] Updated: (HIVE-537) Hive TypeInfo/ObjectInspector to support union (besides struct, array, and map)

Posted by "Amareshwari Sriramadasu (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-537?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Amareshwari Sriramadasu updated HIVE-537:
-----------------------------------------

    Attachment: patch-537.txt

Updated the patch with the following changes:
* Removed the ordinal from getTag() in UnionObjectInspector.
* Added union to TypeInfoUtils, TypeInfoParser, LazySimpleSerDe and LazyFactory.
* Added unit tests to TestLazyArrayMapStruct and TestStandardObjectInspectors.
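
With the ordinal removed, getTag() takes only the object, as in the interface from the issue description. A minimal sketch of an inspector with that shape is shown below. It does not reproduce the patch: SketchInspector stands in for Hive's ObjectInspector, and TaggedValue and SimpleUnionInspector are hypothetical names introduced only for illustration.

```java
/** Hypothetical stand-in for Hive's ObjectInspector; only a type name is modelled. */
interface SketchInspector {
  String typeName();
}

/** The three methods from the issue description, restated against the stand-in. */
interface SketchUnionInspector extends SketchInspector {
  SketchInspector[] getObjectInspectors();
  byte getTag(Object o);
  Object getField(Object o);
}

/** Hypothetical in-memory union value: a tag plus the object that tag selects. */
final class TaggedValue {
  final byte tag;
  final Object value;

  TaggedValue(byte tag, Object value) {
    this.tag = tag;
    this.value = value;
  }
}

final class SimpleUnionInspector implements SketchUnionInspector {
  private final SketchInspector[] children;

  SimpleUnionInspector(SketchInspector... children) {
    this.children = children.clone();
  }

  @Override public SketchInspector[] getObjectInspectors() { return children.clone(); }

  /** No ordinal argument: the tag is read from the object itself. */
  @Override public byte getTag(Object o) { return ((TaggedValue) o).tag; }

  @Override public Object getField(Object o) { return ((TaggedValue) o).value; }

  /** Renders the type in the tag:type form used by the proposal. */
  @Override public String typeName() {
    StringBuilder sb = new StringBuilder("union<");
    for (int i = 0; i < children.length; i++) {
      if (i > 0) sb.append(',');
      sb.append(i).append(':').append(children[i].typeName());
    }
    return sb.append('>').toString();
  }
}
```

A caller would typically read the tag, pick getObjectInspectors()[tag], and use that inspector on the result of getField().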

> Hive TypeInfo/ObjectInspector to support union (besides struct, array, and map)
> -------------------------------------------------------------------------------
>
>                 Key: HIVE-537
>                 URL: https://issues.apache.org/jira/browse/HIVE-537
>             Project: Hadoop Hive
>          Issue Type: New Feature
>            Reporter: Zheng Shao
>            Assignee: Amareshwari Sriramadasu
>         Attachments: HIVE-537.1.patch, patch-537.txt
>
>



[jira] Updated: (HIVE-537) Hive TypeInfo/ObjectInspector to support union (besides struct, array, and map)

Posted by "Amareshwari Sriramadasu (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-537?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Amareshwari Sriramadasu updated HIVE-537:
-----------------------------------------

    Attachment: patch-537-3.txt

The earlier patch had gone stale. Updated the patch to trunk.

> Hive TypeInfo/ObjectInspector to support union (besides struct, array, and map)
> -------------------------------------------------------------------------------
>
>                 Key: HIVE-537
>                 URL: https://issues.apache.org/jira/browse/HIVE-537
>             Project: Hadoop Hive
>          Issue Type: New Feature
>            Reporter: Zheng Shao
>            Assignee: Amareshwari Sriramadasu
>             Fix For: 0.7.0
>
>         Attachments: HIVE-537.1.patch, patch-537-1.txt, patch-537-2.txt, patch-537-3.txt, patch-537.txt
>
>



[jira] Commented: (HIVE-537) Hive TypeInfo/ObjectInspector to support union (besides struct, array, and map)

Posted by "Amareshwari Sriramadasu (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12904099#action_12904099 ] 

Amareshwari Sriramadasu commented on HIVE-537:
----------------------------------------------

Min, any update on the patch?

> Hive TypeInfo/ObjectInspector to support union (besides struct, array, and map)
> -------------------------------------------------------------------------------
>
>                 Key: HIVE-537
>                 URL: https://issues.apache.org/jira/browse/HIVE-537
>             Project: Hadoop Hive
>          Issue Type: New Feature
>            Reporter: Zheng Shao
>            Assignee: Min Zhou
>         Attachments: HIVE-537.1.patch
>
>



[jira] Updated: (HIVE-537) Hive TypeInfo/ObjectInspector to support union (besides struct, array, and map)

Posted by "Amareshwari Sriramadasu (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-537?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Amareshwari Sriramadasu updated HIVE-537:
-----------------------------------------

           Status: Patch Available  (was: Open)
    Fix Version/s: 0.7.0

All the tests passed with the patch.

Zheng, can you have a look at the updated patch? Thanks.

> Hive TypeInfo/ObjectInspector to support union (besides struct, array, and map)
> -------------------------------------------------------------------------------
>
>                 Key: HIVE-537
>                 URL: https://issues.apache.org/jira/browse/HIVE-537
>             Project: Hadoop Hive
>          Issue Type: New Feature
>            Reporter: Zheng Shao
>            Assignee: Amareshwari Sriramadasu
>             Fix For: 0.7.0
>
>         Attachments: HIVE-537.1.patch, patch-537-1.txt, patch-537-2.txt, patch-537.txt
>
>



[jira] Updated: (HIVE-537) Hive TypeInfo/ObjectInspector to support union (besides struct, array, and map)

Posted by "Amareshwari Sriramadasu (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-537?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Amareshwari Sriramadasu updated HIVE-537:
-----------------------------------------

    Attachment: patch-537-5.txt

Added the constant to serde.thrift and regenerated the files.
Added an example to the "describe extended" text for create_union.

> Hive TypeInfo/ObjectInspector to support union (besides struct, array, and map)
> -------------------------------------------------------------------------------
>
>                 Key: HIVE-537
>                 URL: https://issues.apache.org/jira/browse/HIVE-537
>             Project: Hadoop Hive
>          Issue Type: New Feature
>            Reporter: Zheng Shao
>            Assignee: Amareshwari Sriramadasu
>             Fix For: 0.7.0
>
>         Attachments: HIVE-537.1.patch, patch-537-1.txt, patch-537-2.txt, patch-537-3.txt, patch-537-4.txt, patch-537-5.txt, patch-537.txt
>
>



[jira] Commented: (HIVE-537) Hive TypeInfo/ObjectInspector to support union (besides struct, array, and map)

Posted by "Min Zhou (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12724916#action_12724916 ] 

Min Zhou commented on HIVE-537:
-------------------------------

We've run a test on this issue with a dataset of 700m records.

With the first approach, each distinct count needs 119 seconds, which means 10 distinct counts need at least 1190 seconds.
With the second approach, where distinct keys are distinguished by a tag, 10 distinct counts need 148 seconds.

> Hive TypeInfo/ObjectInspector to support union (besides struct, array, and map)
> -------------------------------------------------------------------------------
>
>                 Key: HIVE-537
>                 URL: https://issues.apache.org/jira/browse/HIVE-537
>             Project: Hadoop Hive
>          Issue Type: New Feature
>            Reporter: Zheng Shao
>            Assignee: Zheng Shao
>



[jira] Commented: (HIVE-537) Hive TypeInfo/ObjectInspector to support union (besides struct, array, and map)

Posted by "Amareshwari Sriramadasu (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12918795#action_12918795 ] 

Amareshwari Sriramadasu commented on HIVE-537:
----------------------------------------------

bq. Constants.java is a generated file? Can you change serde/if/serde.thrift
After adding the constant to serde/if/serde.thrift, do I need to regenerate the Java file? If yes, how should I do it?

> Hive TypeInfo/ObjectInspector to support union (besides struct, array, and map)
> -------------------------------------------------------------------------------
>
>                 Key: HIVE-537
>                 URL: https://issues.apache.org/jira/browse/HIVE-537
>             Project: Hadoop Hive
>          Issue Type: New Feature
>            Reporter: Zheng Shao
>            Assignee: Amareshwari Sriramadasu
>             Fix For: 0.7.0
>
>         Attachments: HIVE-537.1.patch, patch-537-1.txt, patch-537-2.txt, patch-537-3.txt, patch-537-4.txt, patch-537.txt
>
>



[jira] Updated: (HIVE-537) Hive TypeInfo/ObjectInspector to support union (besides struct, array, and map)

Posted by "Amareshwari Sriramadasu (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-537?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Amareshwari Sriramadasu updated HIVE-537:
-----------------------------------------

    Status: Patch Available  (was: Open)

Zheng, can you please have a look at the attached patch?

> Hive TypeInfo/ObjectInspector to support union (besides struct, array, and map)
> -------------------------------------------------------------------------------
>
>                 Key: HIVE-537
>                 URL: https://issues.apache.org/jira/browse/HIVE-537
>             Project: Hadoop Hive
>          Issue Type: New Feature
>            Reporter: Zheng Shao
>            Assignee: Amareshwari Sriramadasu
>         Attachments: HIVE-537.1.patch, patch-537-1.txt, patch-537.txt
>
>



[jira] Updated: (HIVE-537) Hive TypeInfo/ObjectInspector to support union (besides struct, array, and map)

Posted by "Amareshwari Sriramadasu (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-537?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Amareshwari Sriramadasu updated HIVE-537:
-----------------------------------------

    Attachment: patch-537-4.txt

Fixed a minor bug in BinarySortableSerDe while working on HIVE-474.

Can somebody review the patch before it goes stale again?

> Hive TypeInfo/ObjectInspector to support union (besides struct, array, and map)
> -------------------------------------------------------------------------------
>
>                 Key: HIVE-537
>                 URL: https://issues.apache.org/jira/browse/HIVE-537
>             Project: Hadoop Hive
>          Issue Type: New Feature
>            Reporter: Zheng Shao
>            Assignee: Amareshwari Sriramadasu
>             Fix For: 0.7.0
>
>         Attachments: HIVE-537.1.patch, patch-537-1.txt, patch-537-2.txt, patch-537-3.txt, patch-537-4.txt, patch-537.txt
>
>


[jira] Commented: (HIVE-537) Hive TypeInfo/ObjectInspector to support union (besides struct, array, and map)

Posted by "Zheng Shao (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12715878#action_12715878 ] 

Zheng Shao commented on HIVE-537:
---------------------------------

One example use case is multiple distincts. Min Zhou talked with me offline and has shown that computing multiple distincts in a single map-reduce job can be much faster than computing them separately and then joining the results.

{code}
Query:
  select a, count(distinct b), count(distinct c), sum(d)

Plan:
  Map side:
    Emit: distribution_key: a, sort_key: a, 0, b, value: d
    Emit: distribution_key: a, sort_key: a, 1, c, value: nothing
  Reduce side:
    Group By:
      a, 0, count(distinct b), sum(d)
      a, 1, count(distinct c)
    Flatten:
      a, count(distinct b), sum(d), count(distinct c)
{code}
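
The plan above can be illustrated in plain Java, without any Hive or Hadoop classes. This is a rough sketch only: the names MultiDistinctDemo and Rec are hypothetical, and the shuffle is simulated by sorting one group's records in memory. Each input row (a, b, c, d) is emitted twice, once per distinct expression, and the tag in the sort key keeps the two streams apart on the reduce side.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class MultiDistinctDemo {
    /** Hypothetical map output: sort key is (a, tag, distinctValue); d is null for tag 1. */
    static final class Rec {
        final String a;
        final int tag;
        final String distinctValue;
        final Long d;

        Rec(String a, int tag, String distinctValue, Long d) {
            this.a = a;
            this.tag = tag;
            this.distinctValue = distinctValue;
            this.d = d;
        }
    }

    /** Map side: one record per distinct expression for each input row. */
    static List<Rec> map(String a, String b, String c, long d) {
        List<Rec> out = new ArrayList<>();
        out.add(new Rec(a, 0, b, d));    // tag 0 carries b and the value d
        out.add(new Rec(a, 1, c, null)); // tag 1 carries c and no value
        return out;
    }

    /** Reduce side for one group a: {count(distinct b), count(distinct c), sum(d)}. */
    static long[] reduce(List<Rec> group) {
        List<Rec> sorted = new ArrayList<>(group);
        // Simulates the shuffle sort on (tag, distinctValue) within a single key a.
        sorted.sort(Comparator.comparingInt((Rec r) -> r.tag)
                .thenComparing(r -> r.distinctValue));
        long[] distinct = new long[2];
        long sum = 0;
        Rec prev = null;
        for (Rec r : sorted) {
            // In sorted order, a value is new exactly when it differs from its predecessor.
            boolean isNew = prev == null
                    || prev.tag != r.tag
                    || !prev.distinctValue.equals(r.distinctValue);
            if (isNew) {
                distinct[r.tag]++;
            }
            if (r.d != null) {
                sum += r.d;
            }
            prev = r;
        }
        // Flatten: a single output row per group.
        return new long[] {distinct[0], distinct[1], sum};
    }
}
```

Because d travels only with the tag 0 records, sum(d) counts each input row once even though every row is emitted twice.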


> Hive TypeInfo/ObjectInspector to support union (besides struct, array, and map)
> -------------------------------------------------------------------------------
>
>                 Key: HIVE-537
>                 URL: https://issues.apache.org/jira/browse/HIVE-537
>             Project: Hadoop Hive
>          Issue Type: New Feature
>            Reporter: Zheng Shao
>


[jira] Commented: (HIVE-537) Hive TypeInfo/ObjectInspector to support union (besides struct, array, and map)

Posted by "Namit Jain (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12918739#action_12918739 ] 

Namit Jain commented on HIVE-537:
---------------------------------

I will review again; a few initial comments:

1. Constants.java is a generated file. Can you change serde/if/serde.thrift instead?
2. The desc extended output for create_union is not detailed enough.

> Hive TypeInfo/ObjectInspector to support union (besides struct, array, and map)
> -------------------------------------------------------------------------------
>
>                 Key: HIVE-537
>                 URL: https://issues.apache.org/jira/browse/HIVE-537
>             Project: Hadoop Hive
>          Issue Type: New Feature
>            Reporter: Zheng Shao
>            Assignee: Amareshwari Sriramadasu
>             Fix For: 0.7.0
>
>         Attachments: HIVE-537.1.patch, patch-537-1.txt, patch-537-2.txt, patch-537-3.txt, patch-537-4.txt, patch-537.txt
>
>


[jira] Updated: (HIVE-537) Hive TypeInfo/ObjectInspector to support union (besides struct, array, and map)

Posted by "Amareshwari Sriramadasu (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-537?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Amareshwari Sriramadasu updated HIVE-537:
-----------------------------------------

    Attachment: patch-537-1.txt

Patch with a minor change in the test case.

> Hive TypeInfo/ObjectInspector to support union (besides struct, array, and map)
> -------------------------------------------------------------------------------
>
>                 Key: HIVE-537
>                 URL: https://issues.apache.org/jira/browse/HIVE-537
>             Project: Hadoop Hive
>          Issue Type: New Feature
>            Reporter: Zheng Shao
>            Assignee: Amareshwari Sriramadasu
>         Attachments: HIVE-537.1.patch, patch-537-1.txt, patch-537.txt
>
>


[jira] Commented: (HIVE-537) Hive TypeInfo/ObjectInspector to support union (besides struct, array, and map)

Posted by "HBase Review Board (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12906759#action_12906759 ] 

HBase Review Board commented on HIVE-537:
-----------------------------------------

Message from: "Amareshwari Sriramadasu" <am...@yahoo-inc.com>

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
http://review.cloudera.org/r/795/
-----------------------------------------------------------

Review request for Hive Developers.


Summary
-------

Adds Union type to Standard ObjectInspectors, TypeInfo and Lazy ObjectInspectors.


This addresses bug HIVE-537.
    http://issues.apache.org/jira/browse/HIVE-537


Diffs
-----

  trunk/serde/src/gen-java/org/apache/hadoop/hive/serde/Constants.java 991812 
  trunk/serde/src/java/org/apache/hadoop/hive/serde2/SerDeUtils.java 991812 
  trunk/serde/src/java/org/apache/hadoop/hive/serde2/binarysortable/BinarySortableSerDe.java 991812 
  trunk/serde/src/java/org/apache/hadoop/hive/serde2/lazy/LazyFactory.java 991812 
  trunk/serde/src/java/org/apache/hadoop/hive/serde2/lazy/LazySimpleSerDe.java 991812 
  trunk/serde/src/java/org/apache/hadoop/hive/serde2/lazy/LazyUnion.java PRE-CREATION 
  trunk/serde/src/java/org/apache/hadoop/hive/serde2/lazy/objectinspector/LazyObjectInspectorFactory.java 991812 
  trunk/serde/src/java/org/apache/hadoop/hive/serde2/lazy/objectinspector/LazyUnionObjectInspector.java PRE-CREATION 
  trunk/serde/src/java/org/apache/hadoop/hive/serde2/objectinspector/ObjectInspector.java 991812 
  trunk/serde/src/java/org/apache/hadoop/hive/serde2/objectinspector/ObjectInspectorFactory.java 991812 
  trunk/serde/src/java/org/apache/hadoop/hive/serde2/objectinspector/ObjectInspectorUtils.java 991812 
  trunk/serde/src/java/org/apache/hadoop/hive/serde2/objectinspector/StandardUnionObjectInspector.java PRE-CREATION 
  trunk/serde/src/java/org/apache/hadoop/hive/serde2/objectinspector/UnionObject.java PRE-CREATION 
  trunk/serde/src/java/org/apache/hadoop/hive/serde2/objectinspector/UnionObjectInspector.java PRE-CREATION 
  trunk/serde/src/java/org/apache/hadoop/hive/serde2/typeinfo/TypeInfo.java 991812 
  trunk/serde/src/java/org/apache/hadoop/hive/serde2/typeinfo/TypeInfoFactory.java 991812 
  trunk/serde/src/java/org/apache/hadoop/hive/serde2/typeinfo/TypeInfoUtils.java 991812 
  trunk/serde/src/java/org/apache/hadoop/hive/serde2/typeinfo/UnionTypeInfo.java PRE-CREATION 
  trunk/serde/src/test/org/apache/hadoop/hive/serde2/lazy/TestLazyArrayMapStruct.java 991812 
  trunk/serde/src/test/org/apache/hadoop/hive/serde2/objectinspector/TestStandardObjectInspectors.java 991812 

Diff: http://review.cloudera.org/r/795/diff


Testing
-------


Thanks,

Amareshwari




> Hive TypeInfo/ObjectInspector to support union (besides struct, array, and map)
> -------------------------------------------------------------------------------
>
>                 Key: HIVE-537
>                 URL: https://issues.apache.org/jira/browse/HIVE-537
>             Project: Hadoop Hive
>          Issue Type: New Feature
>            Reporter: Zheng Shao
>            Assignee: Amareshwari Sriramadasu
>         Attachments: HIVE-537.1.patch, patch-537-1.txt, patch-537.txt
>
>


[jira] Issue Comment Edited: (HIVE-537) Hive TypeInfo/ObjectInspector to support union (besides struct, array, and map)

Posted by "Min Zhou (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12724916#action_12724916 ] 

Min Zhou edited comment on HIVE-537 at 6/27/09 8:18 PM:
--------------------------------------------------------

We ran a test on this issue with a dataset of 700m records.

With the first approach, where each distinct count was computed one by one, each count took 119 seconds, which means 10 distinct counts need at least 1190 seconds.
With the second approach, where distinct keys were distinguished by a tag, 10 distinct counts needed 148 seconds.

      was (Author: coderplay):
    we've done a test about this issue, dataset: 700m records.

first approach, each distinct count needs 119 seconds, that's means 10 distinct count needs at least  1190 seconds.
second approach where distinct keys were distinguished by a tag,  10 distinct count need 148 seconds.
  
> Hive TypeInfo/ObjectInspector to support union (besides struct, array, and map)
> -------------------------------------------------------------------------------
>
>                 Key: HIVE-537
>                 URL: https://issues.apache.org/jira/browse/HIVE-537
>             Project: Hadoop Hive
>          Issue Type: New Feature
>            Reporter: Zheng Shao
>            Assignee: Zheng Shao
>


[jira] Commented: (HIVE-537) Hive TypeInfo/ObjectInspector to support union (besides struct, array, and map)

Posted by "Min Zhou (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12725532#action_12725532 ] 

Min Zhou commented on HIVE-537:
-------------------------------

Even once UnionObjectInspector has been implemented, DynamicSerDe does not seem to support a schema with a union type, which Thrift cannot recognize.
We must find a way to solve this; any suggestions?

> Hive TypeInfo/ObjectInspector to support union (besides struct, array, and map)
> -------------------------------------------------------------------------------
>
>                 Key: HIVE-537
>                 URL: https://issues.apache.org/jira/browse/HIVE-537
>             Project: Hadoop Hive
>          Issue Type: New Feature
>            Reporter: Zheng Shao
>            Assignee: Zheng Shao
>


[jira] Commented: (HIVE-537) Hive TypeInfo/ObjectInspector to support union (besides struct, array, and map)

Posted by "Namit Jain (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12919165#action_12919165 ] 

Namit Jain commented on HIVE-537:
---------------------------------

Will look at it tomorrow.

> Hive TypeInfo/ObjectInspector to support union (besides struct, array, and map)
> -------------------------------------------------------------------------------
>
>                 Key: HIVE-537
>                 URL: https://issues.apache.org/jira/browse/HIVE-537
>             Project: Hadoop Hive
>          Issue Type: New Feature
>            Reporter: Zheng Shao
>            Assignee: Amareshwari Sriramadasu
>             Fix For: 0.7.0
>
>         Attachments: HIVE-537.1.patch, patch-537-1.txt, patch-537-2.txt, patch-537-3.txt, patch-537-4.txt, patch-537-5.txt, patch-537.txt
>
>


[jira] Commented: (HIVE-537) Hive TypeInfo/ObjectInspector to support union (besides struct, array, and map)

Posted by "Min Zhou (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12718373#action_12718373 ] 

Min Zhou commented on HIVE-537:
-------------------------------

First approach:
  O(mN/p) + O(m(N/p log (N/p))) + O(mN/r) + O(m)
I don't agree with you about this O(m). It would in fact be a very large cost. You should also add the cost of joining all the results into one at the end.

For the second approach, I think it should be
  O(N/p) + O(mN/p log (mN/p)) + O(mN/r)

> Hive TypeInfo/ObjectInspector to support union (besides struct, array, and map)
> -------------------------------------------------------------------------------
>
>                 Key: HIVE-537
>                 URL: https://issues.apache.org/jira/browse/HIVE-537
>             Project: Hadoop Hive
>          Issue Type: New Feature
>            Reporter: Zheng Shao
>            Assignee: Zheng Shao
>


[jira] Commented: (HIVE-537) Hive TypeInfo/ObjectInspector to support union (besides struct, array, and map)

Posted by "Ashish Thusoo (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12716015#action_12716015 ] 

Ashish Thusoo commented on HIVE-537:
------------------------------------

One thing that you need to be careful about is that you will be increasing the number of rows crossing the map/reduce boundary, which, if there are a lot of distincts, can lead to data explosion and a subsequent slowdown in the sort.

By that I mean the following:

Suppose we have a query with m different distincts, a base table with N rows, p mappers, and r reducers.
By doing multiple map/reduce jobs, the predominant terms in our complexity are

O(mN/p) + O(m(N/p log (N/p))) + O(mN/r) + O(m)

i.e.
map-side scan + map-side sort + reduce-side merge + fixed cost of starting the map/reduce jobs.

Now, with the current approach, the corresponding formula will be

O(mN/p) + O(mN/p log (mN/p)) + O(mN/r)
=
O(mN/p) + O(mN/p log (N/p)) + O(mN/p log m) + O(mN/r)

There may be situations where one is better than the other... Something to keep in mind.
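
To make the extra log m factor concrete, the map-side sort terms of the two formulas can be evaluated side by side. The sketch below is illustrative only: the class name DistinctCostDemo and the values chosen for m, N and p are hypothetical, constant factors are ignored, and base-2 logarithms are an arbitrary choice. It shows that the single-job sort term equals the multi-job sort term plus (mN/p) log m, which is the expansion written out above.

```java
final class DistinctCostDemo {
    static double log2(double x) {
        return Math.log(x) / Math.log(2);
    }

    /** Map-side sort term for m separate jobs: m * (N/p) * log(N/p). */
    static double sortSeparate(double m, double n, double p) {
        return m * (n / p) * log2(n / p);
    }

    /** Map-side sort term for one tagged job: (mN/p) * log(mN/p). */
    static double sortSingle(double m, double n, double p) {
        return (m * n / p) * log2(m * n / p);
    }

    /** Extra term introduced by the row expansion: (mN/p) * log(m). */
    static double extraTerm(double m, double n, double p) {
        return (m * n / p) * log2(m);
    }

    public static void main(String[] args) {
        // Hypothetical parameters, chosen only for illustration.
        double m = 10, n = 1e9, p = 1000;
        System.out.printf("separate jobs sort term: %.3e%n", sortSeparate(m, n, p));
        System.out.printf("single job sort term:    %.3e%n", sortSingle(m, n, p));
        System.out.printf("extra (mN/p) log m term: %.3e%n", extraTerm(m, n, p));
    }
}
```

Whether the extra sort work outweighs the saved scans and the per-job startup cost depends on the actual constants, which this sketch does not model.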


> Hive TypeInfo/ObjectInspector to support union (besides struct, array, and map)
> -------------------------------------------------------------------------------
>
>                 Key: HIVE-537
>                 URL: https://issues.apache.org/jira/browse/HIVE-537
>             Project: Hadoop Hive
>          Issue Type: New Feature
>            Reporter: Zheng Shao
>


[jira] Commented: (HIVE-537) Hive TypeInfo/ObjectInspector to support union (besides struct, array, and map)

Posted by "HBase Review Board (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12909919#action_12909919 ] 

HBase Review Board commented on HIVE-537:
-----------------------------------------

Message from: "Zheng Shao" <zs...@gmail.com>

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
http://review.cloudera.org/r/795/#review1231
-----------------------------------------------------------


Overall this looks like a good first step.  We need to change Hive.g, add UDFs, etc. to allow users to use it in the Hive language.


trunk/serde/src/java/org/apache/hadoop/hive/serde2/objectinspector/ObjectInspectorFactory.java
<http://review.cloudera.org/r/795/#comment4192>

    unioin -> union



trunk/serde/src/java/org/apache/hadoop/hive/serde2/objectinspector/ObjectInspectorUtils.java
<http://review.cloudera.org/r/795/#comment4193>

    We cannot compare two union objects like this.  We need to compare their tags first.  Only when the tags are the same should we compare the fields.
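
A tag-first comparison along the lines of this comment could look like the sketch below. It is not the code from the patch: TaggedUnion and UnionCompareDemo are hypothetical stand-ins, and the per-tag comparators take the place of whatever the ObjectInspector for each tag would provide.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

final class UnionCompareDemo {
    /** Hypothetical stand-in for a union value: a tag plus the field it selects. */
    static final class TaggedUnion {
        final byte tag;
        final Object field;

        TaggedUnion(byte tag, Object field) {
            this.tag = tag;
            this.field = field;
        }
    }

    /**
     * Compares tags first. Fields are compared only when the tags match,
     * using the comparator registered for that tag.
     */
    static int compare(TaggedUnion x, TaggedUnion y, List<Comparator<Object>> fieldComparators) {
        if (x.tag != y.tag) {
            return Byte.compare(x.tag, y.tag);
        }
        return fieldComparators.get(x.tag).compare(x.field, y.field);
    }

    public static void main(String[] args) {
        // One comparator per tag, for a union<0:int,1:string>.
        Comparator<Object> ints = (p, q) -> Integer.compare((Integer) p, (Integer) q);
        Comparator<Object> strings = (p, q) -> ((String) p).compareTo((String) q);
        List<Comparator<Object>> cmps = Arrays.asList(ints, strings);

        TaggedUnion five = new TaggedUnion((byte) 0, 5);
        TaggedUnion seven = new TaggedUnion((byte) 0, 7);
        TaggedUnion text = new TaggedUnion((byte) 1, "a");

        System.out.println(compare(five, seven, cmps)); // same tag: field order decides
        System.out.println(compare(seven, text, cmps)); // different tags: tag order decides
    }
}
```

Comparing the tags first also avoids casting a field to the wrong type, since the field comparator is only ever called on two values with the same tag.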


- Zheng





> Hive TypeInfo/ObjectInspector to support union (besides struct, array, and map)
> -------------------------------------------------------------------------------
>
>                 Key: HIVE-537
>                 URL: https://issues.apache.org/jira/browse/HIVE-537
>             Project: Hadoop Hive
>          Issue Type: New Feature
>            Reporter: Zheng Shao
>            Assignee: Amareshwari Sriramadasu
>         Attachments: HIVE-537.1.patch, patch-537-1.txt, patch-537.txt
>
>
> There are already some cases inside the code that we use heterogeneous data: JoinOperator, and UnionOperator (in the sense that different parents can pass in records with different ObjectInspectors).
> We currently use Operator's parentID to distinguish that. However that approach does not extend to more complex plans that might be needed in the future.
> We will support the union type like this:
> {code}
> TypeDefinition:
>   type: primitivetype | structtype | arraytype | maptype | uniontype
>   uniontype: "union" "<" tag ":" type ("," tag ":" type)* ">"
> Example:
>   union<0:int,1:double,2:array<string>,3:struct<a:int,b:string>>
> Example of serialized data format:
>   We will store the tag byte first, then the serialized object. On deserialization, we will read the tag byte first; it tells us the type of the object that follows, so we can deserialize it successfully.
> Interface for ObjectInspector:
> interface UnionObjectInspector {
>   /** Returns the array of OIs that are for each of the tags
>    */
>   ObjectInspector[] getObjectInspectors();
>   /** Return the tag of the object.
>    */
>   byte getTag(Object o);
>   /** Return the field based on the tag value associated with the Object.
>    */
>   Object getField(Object o);
> };
> An example serialization format (using a delimited format, with ' ' as the first-level delimiter and '=' as the second-level delimiter)
> userid:int,log:union<0:struct<touserid:int,message:string>,1:string>
> 123 1=login
> 123 0=243=helloworld
> 123 1=logout
> {code}
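
A minimal sketch of reading the tag from a delimited union field such as "1=login" in the example rows above. It is an illustration only, not the LazyUnion code from the patch; DelimitedUnionSketch, tagOf, and payloadOf are hypothetical names, and treating everything after the first '=' as the payload is an assumption.

```java
final class DelimitedUnionSketch {
    /** Returns the tag: the text before the first delimiter, parsed as a byte. */
    static byte tagOf(String field, char delimiter) {
        int i = field.indexOf(delimiter);
        if (i < 0) {
            throw new IllegalArgumentException("no tag delimiter in: " + field);
        }
        return Byte.parseByte(field.substring(0, i));
    }

    /** Returns the still-serialized object that follows the tag. */
    static String payloadOf(String field, char delimiter) {
        int i = field.indexOf(delimiter);
        if (i < 0) {
            throw new IllegalArgumentException("no tag delimiter in: " + field);
        }
        return field.substring(i + 1);
    }

    public static void main(String[] args) {
        // Tag 1 selects the string alternative of the log column.
        System.out.println(tagOf("1=login", '=') + " -> " + payloadOf("1=login", '='));
        // Tag 0 selects the struct alternative; its fields remain to be split.
        System.out.println(tagOf("0=243=helloworld", '=') + " -> " + payloadOf("0=243=helloworld", '='));
    }
}
```

Only after the tag is known can the payload be handed to the deserializer for the matching type.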

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HIVE-537) Hive TypeInfo/ObjectInspector to support union (besides struct, array, and map)

Posted by "HBase Review Board (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12912420#action_12912420 ] 

HBase Review Board commented on HIVE-537:
-----------------------------------------

Message from: "Amareshwari Sriramadasu" <am...@yahoo-inc.com>


bq.  On 2010-09-15 15:15:08, Zheng Shao wrote:
bq.  > Overall looks like a good first step.  We need to change Hive.g, add UDF etc to allow users to use it in the Hive language.

Zheng, there is already a keyword (KW_UNION: 'UNION') used for union/union all operations. Do you think we should use a different keyword for specifying the union type?


- Amareshwari







> Hive TypeInfo/ObjectInspector to support union (besides struct, array, and map)
> -------------------------------------------------------------------------------
>
>                 Key: HIVE-537
>                 URL: https://issues.apache.org/jira/browse/HIVE-537
>             Project: Hadoop Hive
>          Issue Type: New Feature
>            Reporter: Zheng Shao
>            Assignee: Amareshwari Sriramadasu
>         Attachments: HIVE-537.1.patch, patch-537-1.txt, patch-537.txt
>
>



[jira] Updated: (HIVE-537) Hive TypeInfo/ObjectInspector to support union (besides struct, array, and map)

Posted by "Amareshwari Sriramadasu (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-537?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Amareshwari Sriramadasu updated HIVE-537:
-----------------------------------------

    Attachment: patch-537-2.txt

Patch incorporating review comments:

Changes include:
* Added the UDF "create_union" to create a union object. Added a test query using create_union.
* Added the UNIONTYPE keyword to Hive.g.  Added a test query to create a table with a union column.
* Fixed a couple of minor bugs in LazySimpleSerDe and LazyUnion.
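
A rough sketch of what a create_union-style function does, assuming the first argument is the tag and the remaining arguments are candidate values, one per tag. This is a plain-Java illustration rather than the UDF in the patch; CreateUnionSketch and createUnion are hypothetical names, and the {tag, value} array is only a stand-in for the real union object.

```java
final class CreateUnionSketch {
    /**
     * Returns a two-element array {tag, value}, where value is the candidate
     * selected by the tag. The other candidates are discarded, since a union
     * holds only one object at a time.
     */
    static Object[] createUnion(int tag, Object... values) {
        if (tag < 0 || tag >= values.length) {
            throw new IllegalArgumentException("tag " + tag + " has no matching value");
        }
        return new Object[] { (byte) tag, values[tag] };
    }

    public static void main(String[] args) {
        // Tag 1 picks the second candidate.
        Object[] u = createUnion(1, 42, "login");
        System.out.println(u[0] + ":" + u[1]);
    }
}
```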


> Hive TypeInfo/ObjectInspector to support union (besides struct, array, and map)
> -------------------------------------------------------------------------------
>
>                 Key: HIVE-537
>                 URL: https://issues.apache.org/jira/browse/HIVE-537
>             Project: Hadoop Hive
>          Issue Type: New Feature
>            Reporter: Zheng Shao
>            Assignee: Amareshwari Sriramadasu
>         Attachments: HIVE-537.1.patch, patch-537-1.txt, patch-537-2.txt, patch-537.txt
>
>



[jira] Commented: (HIVE-537) Hive TypeInfo/ObjectInspector to support union (besides struct, array, and map)

Posted by "Namit Jain (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12919342#action_12919342 ] 

Namit Jain commented on HIVE-537:
---------------------------------

+1

> Hive TypeInfo/ObjectInspector to support union (besides struct, array, and map)
> -------------------------------------------------------------------------------
>
>                 Key: HIVE-537
>                 URL: https://issues.apache.org/jira/browse/HIVE-537
>             Project: Hadoop Hive
>          Issue Type: New Feature
>            Reporter: Zheng Shao
>            Assignee: Amareshwari Sriramadasu
>             Fix For: 0.7.0
>
>         Attachments: HIVE-537.1.patch, patch-537-1.txt, patch-537-2.txt, patch-537-3.txt, patch-537-4.txt, patch-537-5.txt, patch-537.txt
>
>



[jira] Commented: (HIVE-537) Hive TypeInfo/ObjectInspector to support union (besides struct, array, and map)

Posted by "Amareshwari Sriramadasu (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12913475#action_12913475 ] 

Amareshwari Sriramadasu commented on HIVE-537:
----------------------------------------------

Zheng, can you give an example usage of the union type in a UDF? I looked at the struct, map and array UDFs, but union is quite different from them because it holds only one object at any point in time.

> Hive TypeInfo/ObjectInspector to support union (besides struct, array, and map)
> -------------------------------------------------------------------------------
>
>                 Key: HIVE-537
>                 URL: https://issues.apache.org/jira/browse/HIVE-537
>             Project: Hadoop Hive
>          Issue Type: New Feature
>            Reporter: Zheng Shao
>            Assignee: Amareshwari Sriramadasu
>         Attachments: HIVE-537.1.patch, patch-537-1.txt, patch-537.txt
>
>
