Posted to dev@hive.apache.org by "He Yongqiang (JIRA)" <ji...@apache.org> on 2011/04/01 01:14:06 UTC

[jira] [Commented] (HIVE-2065) RCFile issues

    [ https://issues.apache.org/jira/browse/HIVE-2065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13014272#comment-13014272 ] 

He Yongqiang commented on HIVE-2065:
------------------------------------

The column-specific compression is very interesting, but it is not directly related to making RCFile compatible with SequenceFile; we can still do column-specific compression without that compatibility.

Some input that may be useful to you:
We examined column groups and sorted the data internally based on one column within each column group (but we did not try different compression codecs across column groups). We tried this with 3-4 tables and saw ~20% storage savings on one table compared to the previous RCFile. The main problem with this approach is that it is hard to find the correct/most efficient column group definitions.
For example, for a table tbl_1 with 20 columns, the user could define:

col_1,col_2,col_11,col_13:0;col_3,col_4,col_15,col_16:1;

This puts col_1, col_2, col_11, col_13 into one column group and reorders that group by sorting on col_1 (0 is the index of the sort column within the group, i.e. its first column); puts col_3, col_4, col_15, col_16 into a second column group and reorders it by sorting on col_4 (index 1); and finally puts all remaining columns into the default column group in their original order.
It should also be easy to allow a different compression codec for each column group.
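
To make the definition format above concrete, here is a minimal parsing sketch (the class and field names are hypothetical, not the code we have internally):

{code}
// Hypothetical parser for the column group definition string described above,
// e.g. "col_1,col_2,col_11,col_13:0;col_3,col_4,col_15,col_16:1;"
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ColumnGroupSpec {
  public final List<String> columns;  // columns in this group, in declared order
  public final int sortColumnIndex;   // index into 'columns' of the sort column

  public ColumnGroupSpec(List<String> columns, int sortColumnIndex) {
    this.columns = columns;
    this.sortColumnIndex = sortColumnIndex;
  }

  public static List<ColumnGroupSpec> parse(String spec) {
    List<ColumnGroupSpec> groups = new ArrayList<ColumnGroupSpec>();
    for (String group : spec.split(";")) {
      if (group.isEmpty()) {
        continue;  // tolerate a trailing ';'
      }
      int colon = group.lastIndexOf(':');
      List<String> cols = Arrays.asList(group.substring(0, colon).split(","));
      int sortIdx = Integer.parseInt(group.substring(colon + 1));
      groups.add(new ColumnGroupSpec(cols, sortIdx));
    }
    return groups;  // columns not listed anywhere fall into the default group
  }
}
{code}

Parsing the example above would yield two groups, sorted on col_1 and col_4 respectively; a per-group codec could then be just one more field on this spec.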

The main blocking issue for this approach is building a full set of utilities to find the best column group definition.

Instead of doing that in the existing RCFile, do you think it would be better to explore it in the new format I just mentioned? If you find it interesting, we can share the existing code we have for the things I mentioned, and you could work on the compression codecs based on the new format and provide a utility tool to find the best column group definition.

What do you think?

> RCFile issues
> -------------
>
>                 Key: HIVE-2065
>                 URL: https://issues.apache.org/jira/browse/HIVE-2065
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Krishna Kumar
>            Assignee: Krishna Kumar
>            Priority: Minor
>         Attachments: HIVE.2065.patch.0.txt, Slide1.png, proposal.png
>
>
> Some potential issues with RCFile
> 1. Remove unwanted synchronized modifiers on the methods of RCFile. As per Yongqiang He, the class is not meant to be thread-safe (and it is not), so we might as well get rid of the confusing and performance-impacting lock acquisitions.
> 2. Record length is overstated for compressed files. IIUC, the key compression happens after we have written the record length:
> {code}
>       int keyLength = key.getSize();
>       if (keyLength < 0) {
>         throw new IOException("negative length keys not allowed: " + key);
>       }
>       out.writeInt(keyLength + valueLength); // total record length
>       out.writeInt(keyLength); // key portion length
>       if (!isCompressed()) {
>         out.writeInt(keyLength);
>         key.write(out); // key
>       } else {
>         keyCompressionBuffer.reset();
>         keyDeflateFilter.resetState();
>         key.write(keyDeflateOut);
>         keyDeflateOut.flush();
>         keyDeflateFilter.finish();
>         int compressedKeyLen = keyCompressionBuffer.getLength();
>         out.writeInt(compressedKeyLen);
>         out.write(keyCompressionBuffer.getData(), 0, compressedKeyLen);
>       }
> {code}
> 3. For SequenceFile compatibility, the field following the record length should be the compressed key length, not the uncompressed key length.
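
For reference, a rough sketch of how the write path could be reordered to address items 2 and 3, assuming the same writer fields as the excerpt above (out, key, valueLength, keyCompressionBuffer, keyDeflateFilter, keyDeflateOut); this is a sketch only, not the attached patch:

{code}
// Sketch only: compress the key first, so both the total record length and the
// key-portion length written right after it reflect the compressed size,
// mirroring the SequenceFile layout (record length, key length, key bytes).
int keyLength = key.getSize();
if (keyLength < 0) {
  throw new IOException("negative length keys not allowed: " + key);
}
if (!isCompressed()) {
  out.writeInt(keyLength + valueLength);        // total record length
  out.writeInt(keyLength);                      // key portion length
  key.write(out);                               // key
} else {
  keyCompressionBuffer.reset();
  keyDeflateFilter.resetState();
  key.write(keyDeflateOut);
  keyDeflateOut.flush();
  keyDeflateFilter.finish();
  int compressedKeyLen = keyCompressionBuffer.getLength();
  out.writeInt(compressedKeyLen + valueLength); // total record length, now based on the compressed key
  out.writeInt(compressedKeyLen);               // key portion length (compressed)
  out.write(keyCompressionBuffer.getData(), 0, compressedKeyLen);
}
{code}

How the reader and the uncompressed key length field would need to change is left out of this sketch.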

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira