You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@hive.apache.org by "Zheng Shao (JIRA)" <ji...@apache.org> on 2009/03/11 00:32:50 UTC

[jira] Created: (HIVE-337) LazySimpleSerDe should support array and map types

LazySimpleSerDe should support array and map types
--------------------------------------------------

                 Key: HIVE-337
                 URL: https://issues.apache.org/jira/browse/HIVE-337
             Project: Hadoop Hive
          Issue Type: Bug
          Components: Serializers/Deserializers
    Affects Versions: 0.2.0
            Reporter: Zheng Shao
            Assignee: Zheng Shao
            Priority: Blocker


Once we do that, we can completely deprecate DynamicSerDe/TCTLSeparatedProtocol, and close any bugs that DynamicSerDe/TCTLSeparatedProtocol has.


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HIVE-337) LazySimpleSerDe should support array and map types

Posted by "Zheng Shao (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zheng Shao updated HIVE-337:
----------------------------

    Attachment: HIVE-337.2.patch

Second cut. Fixed 2 problems popped up from the review:

1. Make sure we don't do "setAll" on the field/element/value again if "setAll" has been called already.
2. When getting the value out of a LazyMap using a key (which is assumed to be a primitive), we are now comparing the key directly with the keys deserialized from the map (instead of comparing the serialized version of keys to avoid matching problems between "01" and "1" as an int value).

This will be the patch intended for check-in (except the test results are not overwritten yet, in order to help with the review).


> LazySimpleSerDe should support array and map types
> --------------------------------------------------
>
>                 Key: HIVE-337
>                 URL: https://issues.apache.org/jira/browse/HIVE-337
>             Project: Hadoop Hive
>          Issue Type: Bug
>          Components: Serializers/Deserializers
>    Affects Versions: 0.2.0
>            Reporter: Zheng Shao
>            Assignee: Zheng Shao
>            Priority: Blocker
>         Attachments: HIVE-337.1.patch, HIVE-337.2.patch
>
>
> Once we do that, we can completely deprecate DynamicSerDe/TCTLSeparatedProtocol, and close any bugs that DynamicSerDe/TCTLSeparatedProtocol has.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HIVE-337) LazySimpleSerDe should support array and map types

Posted by "Zheng Shao (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12688548#action_12688548 ] 

Zheng Shao commented on HIVE-337:
---------------------------------

Comments from code review with Ashish, Raghu, and Namit:

1. Move factory methods in LazyUtils to LazyFactory;

2. LazyArray etc should use Factory methods (with TypeInfo as parameter) to create elements;

3. Comment that LazySimpleSerDe only supports 10 levels of separators currently.

4. Use a global pointer to byte[], and all LazyObjects should point to the global pointer.
This helps with garbage collection.

5. More Javadoc.

6. Move the 3 fields (bytes, start, length) from LazyNonPrimitive to LazyObject, add "parsed" to LazyPrimitive.
This makes the code cleaner because all LazyObjects now behave in the same way - nothing happens in setAll (but cached the values), and actual parsing happens later.
The trade-off is that LazyPrimitive takes more memory - 12 bytes for bytes, start, and length.

For 6, a better way would be to always directly parse the data when "setAll" is called (basically, convert the way LazyNonPrimitive works to the way that LazyPrimitive works.) The reason is that there is no point of "setAll" if we don't need to parse the data later, so why don't we directly parse the data in the "setAll" call?


> LazySimpleSerDe should support array and map types
> --------------------------------------------------
>
>                 Key: HIVE-337
>                 URL: https://issues.apache.org/jira/browse/HIVE-337
>             Project: Hadoop Hive
>          Issue Type: Bug
>          Components: Serializers/Deserializers
>    Affects Versions: 0.2.0
>            Reporter: Zheng Shao
>            Assignee: Zheng Shao
>            Priority: Blocker
>         Attachments: HIVE-337.1.patch, HIVE-337.2.patch
>
>
> Once we do that, we can completely deprecate DynamicSerDe/TCTLSeparatedProtocol, and close any bugs that DynamicSerDe/TCTLSeparatedProtocol has.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HIVE-337) LazySimpleSerDe should support array and map types

Posted by "Raghotham Murthy (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12681050#action_12681050 ] 

Raghotham Murthy commented on HIVE-337:
---------------------------------------

I am not sure what you mean by null array.

Given an array column, ideally, we should distinguish between the following cases (I am repeating them for clarity):

1. NULL - array column is null (is this what you mean by null array?)
2. [NULL] - array containing one element (NULL)
3. [''] - array containing one element (empty string)
4. [] - array containing no elements

Is there are plan for LazySimpleSerDe to support nested arrays? If so, we cant really have a single delimiter for arrays and maps. We should introduce array begin and end markers in the serialization format. Alternatively, we could store the number of bytes in the array before the array column value itself.

> LazySimpleSerDe should support array and map types
> --------------------------------------------------
>
>                 Key: HIVE-337
>                 URL: https://issues.apache.org/jira/browse/HIVE-337
>             Project: Hadoop Hive
>          Issue Type: Bug
>          Components: Serializers/Deserializers
>    Affects Versions: 0.2.0
>            Reporter: Zheng Shao
>            Assignee: Zheng Shao
>            Priority: Blocker
>
> Once we do that, we can completely deprecate DynamicSerDe/TCTLSeparatedProtocol, and close any bugs that DynamicSerDe/TCTLSeparatedProtocol has.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HIVE-337) LazySimpleSerDe should support array and map types

Posted by "Raghotham Murthy (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12681141#action_12681141 ] 

Raghotham Murthy commented on HIVE-337:
---------------------------------------

@Zheng. I agree with what you said except the last statement.

> At this point, the goal is only to support what is already supported (by Meta../TCTL...). For new features, let's keep the discussion on a separate jira.

The reason is that if our current focus is only to support what is already supported by Meta..., we will have to visit the same problem again of creating a new SerDe to support arrays of arrays. Then another serde to support maps of arrays and so on. I'd rather we make the decision and the code changes now such that LazySerDe becomes the default general serde for nested maps and arrays. I am not saying that you need to make lazy serde support everything as part of this jira. But, what I am hoping for is the decision to *add* support for nested structures to LazySimpleSerDe rather than creating yet another wafer thin serde.

> LazySimpleSerDe should support array and map types
> --------------------------------------------------
>
>                 Key: HIVE-337
>                 URL: https://issues.apache.org/jira/browse/HIVE-337
>             Project: Hadoop Hive
>          Issue Type: Bug
>          Components: Serializers/Deserializers
>    Affects Versions: 0.2.0
>            Reporter: Zheng Shao
>            Assignee: Zheng Shao
>            Priority: Blocker
>
> Once we do that, we can completely deprecate DynamicSerDe/TCTLSeparatedProtocol, and close any bugs that DynamicSerDe/TCTLSeparatedProtocol has.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HIVE-337) LazySimpleSerDe should support array and map types

Posted by "Zheng Shao (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12681010#action_12681010 ] 

Zheng Shao commented on HIVE-337:
---------------------------------

2 pitfalls for the delimited format:

1. Empty array has exactly the same serialized format as an array with a single element which is an empty String.

2. Null array has exactly the same serialized format as an array with a single element which is Null.

We have to make a choice between the two. After some discussions with Prasad, we've got the consensus that we should support empty array and null array (which means we are NOT going to support an array with a single element that is empty or null.)

This is the same as what TCTLSeparatedProtocol is doing.


> LazySimpleSerDe should support array and map types
> --------------------------------------------------
>
>                 Key: HIVE-337
>                 URL: https://issues.apache.org/jira/browse/HIVE-337
>             Project: Hadoop Hive
>          Issue Type: Bug
>          Components: Serializers/Deserializers
>    Affects Versions: 0.2.0
>            Reporter: Zheng Shao
>            Assignee: Zheng Shao
>            Priority: Blocker
>
> Once we do that, we can completely deprecate DynamicSerDe/TCTLSeparatedProtocol, and close any bugs that DynamicSerDe/TCTLSeparatedProtocol has.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HIVE-337) LazySimpleSerDe should support array and map types

Posted by "Namit Jain (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12681126#action_12681126 ] 

Namit Jain commented on HIVE-337:
---------------------------------

Isnt the goal to use LazySerde as the default for all kinds of schemas ? If yes, we should support all nestings.

If that is not the goal, then as Raghu said, LazySerde might be deprecated soon. It might be painful to have separate serde as a default based on the schema.

> LazySimpleSerDe should support array and map types
> --------------------------------------------------
>
>                 Key: HIVE-337
>                 URL: https://issues.apache.org/jira/browse/HIVE-337
>             Project: Hadoop Hive
>          Issue Type: Bug
>          Components: Serializers/Deserializers
>    Affects Versions: 0.2.0
>            Reporter: Zheng Shao
>            Assignee: Zheng Shao
>            Priority: Blocker
>
> Once we do that, we can completely deprecate DynamicSerDe/TCTLSeparatedProtocol, and close any bugs that DynamicSerDe/TCTLSeparatedProtocol has.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HIVE-337) LazySimpleSerDe should support multi-level nested array, map, struct types

Posted by "Zheng Shao (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zheng Shao updated HIVE-337:
----------------------------

    Summary: LazySimpleSerDe should support multi-level nested array, map, struct types  (was: LazySimpleSerDe should support array and map types)

> LazySimpleSerDe should support multi-level nested array, map, struct types
> --------------------------------------------------------------------------
>
>                 Key: HIVE-337
>                 URL: https://issues.apache.org/jira/browse/HIVE-337
>             Project: Hadoop Hive
>          Issue Type: Bug
>          Components: Serializers/Deserializers
>    Affects Versions: 0.2.0
>            Reporter: Zheng Shao
>            Assignee: Zheng Shao
>            Priority: Blocker
>         Attachments: HIVE-337.1.patch, HIVE-337.2.patch, HIVE-337.5.patch
>
>
> Once we do that, we can completely deprecate DynamicSerDe/TCTLSeparatedProtocol, and close any bugs that DynamicSerDe/TCTLSeparatedProtocol has.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HIVE-337) LazySimpleSerDe should support multi-level nested array, map, struct types

Posted by "Raghotham Murthy (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12688879#action_12688879 ] 

Raghotham Murthy commented on HIVE-337:
---------------------------------------

+1

looks good

> LazySimpleSerDe should support multi-level nested array, map, struct types
> --------------------------------------------------------------------------
>
>                 Key: HIVE-337
>                 URL: https://issues.apache.org/jira/browse/HIVE-337
>             Project: Hadoop Hive
>          Issue Type: Bug
>          Components: Serializers/Deserializers
>    Affects Versions: 0.2.0
>            Reporter: Zheng Shao
>            Assignee: Zheng Shao
>            Priority: Blocker
>         Attachments: HIVE-337.1.patch, HIVE-337.2.patch, HIVE-337.5.patch
>
>
> Once we do that, we can completely deprecate DynamicSerDe/TCTLSeparatedProtocol, and close any bugs that DynamicSerDe/TCTLSeparatedProtocol has.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HIVE-337) LazySimpleSerDe should support array and map types

Posted by "Raghotham Murthy (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12681059#action_12681059 ] 

Raghotham Murthy commented on HIVE-337:
---------------------------------------

I am not suggesting changing the serialization format for everything. just for maps and arrays. If I have to write a new SerDe I'd have to copy over code from LazySimpleSerDe for everything other than for maps and arrays. Can you make LazySimpleSerDe parameterizable so that it can deserialize the old format (just single delimiters) as well as the new format (with begin-end delimiters say) and always serialize into the new format? This will give us a migration path to a format which is compatible with nested arrays and maps. Thoughts?

> LazySimpleSerDe should support array and map types
> --------------------------------------------------
>
>                 Key: HIVE-337
>                 URL: https://issues.apache.org/jira/browse/HIVE-337
>             Project: Hadoop Hive
>          Issue Type: Bug
>          Components: Serializers/Deserializers
>    Affects Versions: 0.2.0
>            Reporter: Zheng Shao
>            Assignee: Zheng Shao
>            Priority: Blocker
>
> Once we do that, we can completely deprecate DynamicSerDe/TCTLSeparatedProtocol, and close any bugs that DynamicSerDe/TCTLSeparatedProtocol has.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HIVE-337) LazySimpleSerDe should support array and map types

Posted by "Zheng Shao (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12681165#action_12681165 ] 

Zheng Shao commented on HIVE-337:
---------------------------------

Ok, let's discuss tomorrow whether our current focus is only to fix this blocker, or we want to add new features at the same time.


> LazySimpleSerDe should support array and map types
> --------------------------------------------------
>
>                 Key: HIVE-337
>                 URL: https://issues.apache.org/jira/browse/HIVE-337
>             Project: Hadoop Hive
>          Issue Type: Bug
>          Components: Serializers/Deserializers
>    Affects Versions: 0.2.0
>            Reporter: Zheng Shao
>            Assignee: Zheng Shao
>            Priority: Blocker
>
> Once we do that, we can completely deprecate DynamicSerDe/TCTLSeparatedProtocol, and close any bugs that DynamicSerDe/TCTLSeparatedProtocol has.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HIVE-337) LazySimpleSerDe should support array and map types

Posted by "Prasad Chakka (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12681016#action_12681016 ] 

Prasad Chakka commented on HIVE-337:
------------------------------------

just to clarify what i meant,

a null array, an empty array and an array with one empty string or null are all same. So users can't rely on being able to differentiate between the 4 cases. All of them are represented by an empty string in the serialized string of the array. such an array should be deserialzied to either null or empty array. i prefer a null array since it is easier to test for in conditions.



> LazySimpleSerDe should support array and map types
> --------------------------------------------------
>
>                 Key: HIVE-337
>                 URL: https://issues.apache.org/jira/browse/HIVE-337
>             Project: Hadoop Hive
>          Issue Type: Bug
>          Components: Serializers/Deserializers
>    Affects Versions: 0.2.0
>            Reporter: Zheng Shao
>            Assignee: Zheng Shao
>            Priority: Blocker
>
> Once we do that, we can completely deprecate DynamicSerDe/TCTLSeparatedProtocol, and close any bugs that DynamicSerDe/TCTLSeparatedProtocol has.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HIVE-337) LazySimpleSerDe should support array and map types

Posted by "Zheng Shao (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12681064#action_12681064 ] 

Zheng Shao commented on HIVE-337:
---------------------------------

For the migration path, we can easily create new tables with new SerDes, and everything will work transparently. If you are suggesting letting LazySerDe automatically figure out the old/new format, I don't think that's even possible, and if it is, users will be easily confused by that.

Delimited format is meant to be simple and human-readable, and it is only good for simple data. If the structure really gets complicated, we should store the data in binary format instead of delimited format. For example, we can use Thrift etc.

If we really want to write a new SerDe that shares a lot with LazySimpleSerDe (with an extended delimited format), we can easily do that by reusing a lot of the classes introduced by LazySimpleSerDe. There is not much to copy - and if there is, it's better to factor the common code out, instead of pushing all logics (new format/old format) into the same class.

Let's open another jira for discussions on new features like this.


So the question here is that we have to make a choice between the two: whether treat "" to be an empty array or an array with an empty string as the only element (and the same question for NULL).


> LazySimpleSerDe should support array and map types
> --------------------------------------------------
>
>                 Key: HIVE-337
>                 URL: https://issues.apache.org/jira/browse/HIVE-337
>             Project: Hadoop Hive
>          Issue Type: Bug
>          Components: Serializers/Deserializers
>    Affects Versions: 0.2.0
>            Reporter: Zheng Shao
>            Assignee: Zheng Shao
>            Priority: Blocker
>
> Once we do that, we can completely deprecate DynamicSerDe/TCTLSeparatedProtocol, and close any bugs that DynamicSerDe/TCTLSeparatedProtocol has.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HIVE-337) LazySimpleSerDe should support array and map types

Posted by "Raghotham Murthy (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12681073#action_12681073 ] 

Raghotham Murthy commented on HIVE-337:
---------------------------------------

I see that the trend is to keep writing new SerDes either for performance (MetadataTyped... and possibly others) or for expressibility (DynamicSerDe etc). Eventually, most of these SerDes are not used at all other than for backward compatibility with existing data. Isnt it better to find a balance between performance and and expressibility in a single serde which can be used in general. Of course, if people want more performance/expressibility they can use/write other serdes which use binary formats etc. 

Using array begin and end markers dont decrease human-readability imo (people are fine with reading json right?) and they allow expressing nested structures. I am not sure there is a disadvantage to this. What I was hoping was that LazySimpleSerDe becomes the default SerDe for most requirements. 

Specifically, I have a requirement for arrays of maps. If you dont provide support for that in LazySimpleSerDe (which is probably not a big change, i might be mistaken though), then for my requirement, we would have to go through the process of creating a new SerDe. And once we create that serde I'd rather use it than LazySimpleSerDe for all of my future requirements. I am guessing that pretty soon we would have to deprecate LazySimpleSerDe in favor of this new serde because of its expressibility.

Regarding automatically detecting the serialization format for arrays in the data, maybe I am mistaken, but arent you already using some logic to create LazySimpleSerDe when the metastore has MetadataTypedColumnSetSerDe for that table? In that same logic, cant you add a parameter to the lazy serde to indicate which array serialization format to use?

Again, I am not suggesting that we should add several serialization formats to the same SerDe. All I am suggesting is that there is a middle ground between proliferating class for each small feature difference and putting all features into a single class.

> LazySimpleSerDe should support array and map types
> --------------------------------------------------
>
>                 Key: HIVE-337
>                 URL: https://issues.apache.org/jira/browse/HIVE-337
>             Project: Hadoop Hive
>          Issue Type: Bug
>          Components: Serializers/Deserializers
>    Affects Versions: 0.2.0
>            Reporter: Zheng Shao
>            Assignee: Zheng Shao
>            Priority: Blocker
>
> Once we do that, we can completely deprecate DynamicSerDe/TCTLSeparatedProtocol, and close any bugs that DynamicSerDe/TCTLSeparatedProtocol has.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HIVE-337) LazySimpleSerDe should support array and map types

Posted by "Zheng Shao (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12681033#action_12681033 ] 

Zheng Shao commented on HIVE-337:
---------------------------------

We use "\\N" to represent NULL, so we can still distinguish between empty array and null array.


> LazySimpleSerDe should support array and map types
> --------------------------------------------------
>
>                 Key: HIVE-337
>                 URL: https://issues.apache.org/jira/browse/HIVE-337
>             Project: Hadoop Hive
>          Issue Type: Bug
>          Components: Serializers/Deserializers
>    Affects Versions: 0.2.0
>            Reporter: Zheng Shao
>            Assignee: Zheng Shao
>            Priority: Blocker
>
> Once we do that, we can completely deprecate DynamicSerDe/TCTLSeparatedProtocol, and close any bugs that DynamicSerDe/TCTLSeparatedProtocol has.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HIVE-337) LazySimpleSerDe should support array and map types

Posted by "Zheng Shao (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zheng Shao updated HIVE-337:
----------------------------

    Attachment: HIVE-337.5.patch

Done with all the comments except 6.

I also renamed the setAll() function to init() to make it clearer.

Because we now pass TypeInfo around in LazyObject hierarchy, we don't even need to create the LazyObject for an array element if that element is never accessed (we can create it on demand when it's accessed).

The current code works fine without the change of 6.  The change of 6 requires either 12 bytes more storage per primitive object (by adding the byte[], int, int to the LazyPrimitive), or more complicated logic in removing the int start and int length from LazyNonPrimitive (we will have to parse the data right in init(..) but we don't have access to the separators because it's in the next-level ObjectInspectors - unless we add the pointers from LazyObject to ObjectInspector, but that's another overhead and complicates the data structure).

After all, the implementation of init() is private to the class and I don't think there is a strong need to make the implementation the same across LazyPrimitive and LazyNonPrimitive. The fact that the parsing of LazyPrimitive does not require delimiters and LazyNonPrimitive requires is good enough for them to have different implementations.


Future improvements include:
1. Support escaping: HIVE-136;
2. Columnar storage: HIVE-352;
3. Use Writable/Text for values: HIVE-266;
4. Short-circuit serialization: HIVE-358;
5. Short-circuit expression evaluation: HIVE-359.


> LazySimpleSerDe should support array and map types
> --------------------------------------------------
>
>                 Key: HIVE-337
>                 URL: https://issues.apache.org/jira/browse/HIVE-337
>             Project: Hadoop Hive
>          Issue Type: Bug
>          Components: Serializers/Deserializers
>    Affects Versions: 0.2.0
>            Reporter: Zheng Shao
>            Assignee: Zheng Shao
>            Priority: Blocker
>         Attachments: HIVE-337.1.patch, HIVE-337.2.patch, HIVE-337.5.patch
>
>
> Once we do that, we can completely deprecate DynamicSerDe/TCTLSeparatedProtocol, and close any bugs that DynamicSerDe/TCTLSeparatedProtocol has.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HIVE-337) LazySimpleSerDe should support array and map types

Posted by "Zheng Shao (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12681134#action_12681134 ] 

Zheng Shao commented on HIVE-337:
---------------------------------

It's true that LazySimpleSerDe does not share code with MetadataTypedColumnsetSerDe or DynamicSerDe, but the reason of that is because the design principle is completely different and there is no shared code at all.

@Raghu: I thought you were saying automatically detect the format based on data (not metadata). If we can rely on metadata, then we are on the same track.

@Namit: The current goal of LazySimpleSerDe is to replace MetaDataTypedColumnsetSerDe and TCTLSeparatedProtocol. LazySerDe is a thin class on top of a lot of utility classes that can be reused for any lazily-initialized serdes.


I think overall we are on agreement:
1. We should reuse code as much as possible.
2. We should rely on metadata to find out the serialization format difference (instead of automatically figure it out from the data).

The difference between parametering a class and 2 classes are so small - I can write a wrapper to wrap 2 classes. This is a minor issue.
This is different from reusing the code, because most of the code are in utility classes that will be reused.

At this point, the goal is only to support what is already supported (by Meta../TCTL...). For new features, let's keep the discussion on a separate jira.


> LazySimpleSerDe should support array and map types
> --------------------------------------------------
>
>                 Key: HIVE-337
>                 URL: https://issues.apache.org/jira/browse/HIVE-337
>             Project: Hadoop Hive
>          Issue Type: Bug
>          Components: Serializers/Deserializers
>    Affects Versions: 0.2.0
>            Reporter: Zheng Shao
>            Assignee: Zheng Shao
>            Priority: Blocker
>
> Once we do that, we can completely deprecate DynamicSerDe/TCTLSeparatedProtocol, and close any bugs that DynamicSerDe/TCTLSeparatedProtocol has.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Resolved: (HIVE-337) LazySimpleSerDe should support multi-level nested array, map, struct types

Posted by "Zheng Shao (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zheng Shao resolved HIVE-337.
-----------------------------

       Resolution: Fixed
    Fix Version/s: 0.3.0
     Release Note: HIVE-337. LazySimpleSerDe to support multi-level nested array, map, struct types. (zshao)
     Hadoop Flags: [Reviewed]

Committed revision 758089.


> LazySimpleSerDe should support multi-level nested array, map, struct types
> --------------------------------------------------------------------------
>
>                 Key: HIVE-337
>                 URL: https://issues.apache.org/jira/browse/HIVE-337
>             Project: Hadoop Hive
>          Issue Type: Bug
>          Components: Serializers/Deserializers
>    Affects Versions: 0.2.0
>            Reporter: Zheng Shao
>            Assignee: Zheng Shao
>            Priority: Blocker
>             Fix For: 0.3.0
>
>         Attachments: HIVE-337.1.patch, HIVE-337.2.patch, HIVE-337.5.patch
>
>
> Once we do that, we can completely deprecate DynamicSerDe/TCTLSeparatedProtocol, and close any bugs that DynamicSerDe/TCTLSeparatedProtocol has.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HIVE-337) LazySimpleSerDe should support array and map types

Posted by "Zheng Shao (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zheng Shao updated HIVE-337:
----------------------------

    Attachment: HIVE-337.1.patch

First cut.

The code is written in a way to make sure it's very easy to extend to nested types.

I also replaced DynamicSerDe/TBinaryProtocol for map-reduce value to LazySimpleSerDe. On a typical table with 8 string columns, we are seeing the amount of data passed from mappers to reducers decreased from 224MB to 148MB (roughly a 1/3 improvement).

Escaping are not supported yet, but it shouldn't take too much time to add.

> LazySimpleSerDe should support array and map types
> --------------------------------------------------
>
>                 Key: HIVE-337
>                 URL: https://issues.apache.org/jira/browse/HIVE-337
>             Project: Hadoop Hive
>          Issue Type: Bug
>          Components: Serializers/Deserializers
>    Affects Versions: 0.2.0
>            Reporter: Zheng Shao
>            Assignee: Zheng Shao
>            Priority: Blocker
>         Attachments: HIVE-337.1.patch
>
>
> Once we do that, we can completely deprecate DynamicSerDe/TCTLSeparatedProtocol, and close any bugs that DynamicSerDe/TCTLSeparatedProtocol has.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HIVE-337) LazySimpleSerDe should support multi-level nested array, map, struct types

Posted by "Raghotham Murthy (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12688882#action_12688882 ] 

Raghotham Murthy commented on HIVE-337:
---------------------------------------

btw can you attach the patch file with all the tests (just for completeness)?

> LazySimpleSerDe should support multi-level nested array, map, struct types
> --------------------------------------------------------------------------
>
>                 Key: HIVE-337
>                 URL: https://issues.apache.org/jira/browse/HIVE-337
>             Project: Hadoop Hive
>          Issue Type: Bug
>          Components: Serializers/Deserializers
>    Affects Versions: 0.2.0
>            Reporter: Zheng Shao
>            Assignee: Zheng Shao
>            Priority: Blocker
>         Attachments: HIVE-337.1.patch, HIVE-337.2.patch, HIVE-337.5.patch
>
>
> Once we do that, we can completely deprecate DynamicSerDe/TCTLSeparatedProtocol, and close any bugs that DynamicSerDe/TCTLSeparatedProtocol has.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (HIVE-337) LazySimpleSerDe should support array and map types

Posted by "Zheng Shao (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12681055#action_12681055 ] 

Zheng Shao commented on HIVE-337:
---------------------------------

By null array, I mean case 1 (as you mentioned above).

There is no plan for LazySerDe to support nested array at this moment - it is used to replace MetaDataTypeColumnsetSerDe and TCTLSeparatedProtocol by exactly the same serialization format.

If we want to change the serialization format, that will be a new SerDe.


> LazySimpleSerDe should support array and map types
> --------------------------------------------------
>
>                 Key: HIVE-337
>                 URL: https://issues.apache.org/jira/browse/HIVE-337
>             Project: Hadoop Hive
>          Issue Type: Bug
>          Components: Serializers/Deserializers
>    Affects Versions: 0.2.0
>            Reporter: Zheng Shao
>            Assignee: Zheng Shao
>            Priority: Blocker
>
> Once we do that, we can completely deprecate DynamicSerDe/TCTLSeparatedProtocol, and close any bugs that DynamicSerDe/TCTLSeparatedProtocol has.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Issue Comment Edited: (HIVE-337) LazySimpleSerDe should support array and map types

Posted by "Zheng Shao (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12688599#action_12688599 ] 

Zheng Shao edited comment on HIVE-337 at 3/24/09 1:58 PM:
----------------------------------------------------------

Done with all the comments except 6.

I also renamed the setAll() function to init() to make it clearer.

Because we now pass TypeInfo around in LazyObject hierarchy, we don't even need to create the LazyObject for an array element if that element is never accessed (we can create it on demand when it's accessed).

The current code works fine without the change of 6.  The change of 6 requires either 12 bytes more storage per primitive object (by adding the byte[], int, int to the LazyPrimitive), or more complicated logic in removing the int start and int length from LazyNonPrimitive (we will have to parse the data right in init(..) but we don't have access to the separators because it's in the next-level ObjectInspectors - unless we add the pointers from LazyObject to ObjectInspector, but that's another overhead and complicates the data structure).

After all, the implementation of init() is private to the class and I don't think there is a strong need to make the implementation the same across LazyPrimitive and LazyNonPrimitive. The fact that the parsing of LazyPrimitive does not require delimiters and LazyNonPrimitive requires is good enough for them to have different implementations.


Future improvements include:
1. Support escaping: HIVE-136;
2. Columnar storage: HIVE-352;
3. Use Writable/Text for values: HIVE-266;
4. Short-circuit serialization: HIVE-358;
5. Short-circuit expression evaluation: HIVE-359.
6. Common expression evaluation: HIVE-364


      was (Author: zshao):
    Done with all the comments except 6.

I also renamed the setAll() function to init() to make it clearer.

Because we now pass TypeInfo around in LazyObject hierarchy, we don't even need to create the LazyObject for an array element if that element is never accessed (we can create it on demand when it's accessed).

The current code works fine without the change of 6.  The change of 6 requires either 12 bytes more storage per primitive object (by adding the byte[], int, int to the LazyPrimitive), or more complicated logic in removing the int start and int length from LazyNonPrimitive (we will have to parse the data right in init(..) but we don't have access to the separators because it's in the next-level ObjectInspectors - unless we add the pointers from LazyObject to ObjectInspector, but that's another overhead and complicates the data structure).

After all, the implementation of init() is private to the class and I don't think there is a strong need to make the implementation the same across LazyPrimitive and LazyNonPrimitive. The fact that the parsing of LazyPrimitive does not require delimiters and LazyNonPrimitive requires is good enough for them to have different implementations.


Future improvements include:
1. Support escaping: HIVE-136;
2. Columnar storage: HIVE-352;
3. Use Writable/Text for values: HIVE-266;
4. Short-circuit serialization: HIVE-358;
5. Short-circuit expression evaluation: HIVE-359.

  
> LazySimpleSerDe should support array and map types
> --------------------------------------------------
>
>                 Key: HIVE-337
>                 URL: https://issues.apache.org/jira/browse/HIVE-337
>             Project: Hadoop Hive
>          Issue Type: Bug
>          Components: Serializers/Deserializers
>    Affects Versions: 0.2.0
>            Reporter: Zheng Shao
>            Assignee: Zheng Shao
>            Priority: Blocker
>         Attachments: HIVE-337.1.patch, HIVE-337.2.patch, HIVE-337.5.patch
>
>
> Once we do that, we can completely deprecate DynamicSerDe/TCTLSeparatedProtocol, and close any bugs that DynamicSerDe/TCTLSeparatedProtocol has.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.