Posted to dev@hive.apache.org by "Bennie Schut (JIRA)" <ji...@apache.org> on 2010/02/08 11:58:29 UTC

[jira] Created: (HIVE-1138) Hive using LZO compression returns unexpected results.

Hive using LZO compression returns unexpected results.
------------------------------------------------------

                 Key: HIVE-1138
                 URL: https://issues.apache.org/jira/browse/HIVE-1138
             Project: Hadoop Hive
          Issue Type: Bug
          Components: Query Processor
    Affects Versions: 0.6.0
         Environment: hadoop 0.20.1, hive trunk 2010-02-03
            Reporter: Bennie Schut
            Priority: Blocker


I have a tab-separated file which I loaded with "load data inpath"; then I run:

SET hive.exec.compress.output=true;
SET mapred.output.compression.codec=com.hadoop.compression.lzo.LzoCodec;
SET mapred.map.output.compression.codec=com.hadoop.compression.lzo.LzoCodec;
select distinct login_cldr_id as cldr_id from chatsessions_load;

Ended Job = job_201001151039_1641
OK
NULL
NULL
NULL
Time taken: 49.06 seconds

However, if I run the query without the set commands I get this:
Ended Job = job_201001151039_1642
OK
2283
Time taken: 45.308 seconds

That is the correct result.

When I do an "insert overwrite" into an RCFile table it actually compresses the data correctly.
When I disable compression and query this new table, the result is correct.
When I enable compression, it's wrong again.
I see no errors in the logs.


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (HIVE-1138) Hive using LZO compression returns unexpected results.

Posted by "Bennie Schut (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-1138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bennie Schut updated HIVE-1138:
-------------------------------

    Attachment: test.csv

How to reproduce the problem:

{noformat} 
CREATE TABLE test_load (
  id     int
, code   string
) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n';

LOAD DATA INPATH '/user/dwh/test.csv' INTO TABLE test_load;

-- this one correctly returns 5 rows.
select distinct id from test_load;

SET hive.exec.compress.output=true;
SET mapred.output.compression.codec=com.hadoop.compression.lzo.LzoCodec;
SET mapred.map.output.compression.codec=com.hadoop.compression.lzo.LzoCodec;

-- this one returns incorrect results.
select distinct id from test_load;
{noformat} 



[jira] Commented: (HIVE-1138) Hive using LZO compression returns unexpected results.

Posted by "Bennie Schut (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-1138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12831339#action_12831339 ] 

Bennie Schut commented on HIVE-1138:
------------------------------------

I was looking a little in that direction already, but found this in the com.hadoop.compression.lzo.LzoCodec source:

{noformat} 
  /**
   * Get the default filename extension for this kind of compression.
   * @return the extension including the '.'
   */
  public String getDefaultExtension() {
    return ".lzo_deflate";
  }
{noformat} 

This looks the same as what GzipCodec does:

{noformat}
  public String getDefaultExtension() {
    return ".gz";
  }
{noformat}



[jira] Commented: (HIVE-1138) Hive using LZO compression returns unexpected results.

Posted by "He Yongqiang (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-1138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12831112#action_12831112 ] 

He Yongqiang commented on HIVE-1138:
------------------------------------

1) Can you check whether LzoCodec is loaded/installed correctly? (Has LzoCodec been removed from the Hadoop version you are using? If so, how did you install it and use it with Hive?)
2) If yes to 1), can you upload a small piece of data? No need for real data; I think a test table with some rows will be fine. Just make sure I can reproduce the problem.



[jira] Commented: (HIVE-1138) Hive using LZO compression returns unexpected results.

Posted by "Bennie Schut (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-1138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12830944#action_12830944 ] 

Bennie Schut commented on HIVE-1138:
------------------------------------

On the filesystem I do find a job output file like this: attempt_201001151039_1841_r_000001_0.lzo_deflate,
which contains the correct value in compressed form:
^@^@^@^E^@^@^@  ^V2283
^Q^@^@
Perhaps Hive reads this as a non-compressed file?



[jira] Commented: (HIVE-1138) Hive using LZO compression returns unexpected results.

Posted by "Bennie Schut (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-1138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12830908#action_12830908 ] 

Bennie Schut commented on HIVE-1138:
------------------------------------

This doesn't seem to happen when I set the compression codec to Gzip:

set mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
set mapred.map.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;




[jira] Resolved: (HIVE-1138) Hive using LZO compression returns unexpected results.

Posted by "Bennie Schut (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-1138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bennie Schut resolved HIVE-1138.
--------------------------------

    Resolution: Not A Problem
      Assignee: Bennie Schut

Ah, a clear case of RTFM.
The codec needs to be in the list of configured codecs, like this:
{noformat} 
<property>
 <name>io.compression.codecs</name>
 <value>org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.DefaultCodec,com.hadoop.compression.lzo.LzoCodec,com.hadoop.compression.lzo.LzopCodec,org.apache.hadoop.io.compress.BZip2Codec</value>
</property>
{noformat} 

So this is a configuration mistake and not a bug in Hive.
I just wouldn't have expected this behavior, since it partly seems to work.
Hopefully someone else can learn from my mistake ;-)
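
For anyone hitting the same thing, a quick way to double-check the effective setting from the Hive CLI is to print the property; the LZO codecs have to show up in the value (just a sketch, the session shown is only illustrative):
{noformat}
-- printing a property name without a value shows its effective setting for this session;
-- com.hadoop.compression.lzo.LzoCodec must appear in the list
set io.compression.codecs;
{noformat}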

Thanks Zheng and He for the support on this.



[jira] Commented: (HIVE-1138) Hive using LZO compression returns unexpected results.

Posted by "Bennie Schut (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-1138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12831120#action_12831120 ] 

Bennie Schut commented on HIVE-1138:
------------------------------------

On 0.20.1 LZO has been removed. I installed the "hadoop-gpl-compression-read-only" code from googlecode.com, and it seems to work correctly on Hadoop.

In the reduce step I see entries like this in the logs:
{noformat} 
2010-02-08 22:06:36,554 INFO com.hadoop.compression.lzo.GPLNativeCodeLoader: Loaded native gpl library
2010-02-08 22:06:36,555 INFO com.hadoop.compression.lzo.LzoCodec: Successfully loaded & initialized native-lzo library
2010-02-08 22:06:36,556 INFO org.apache.hadoop.hive.ql.io.CodecPool: Got brand-new compressor
2010-02-08 22:06:36,558 INFO org.apache.hadoop.hive.ql.io.CodecPool: Got brand-new compressor
{noformat} 

2) I'll add some data + example code tomorrow morning.

Thanks for looking at this.



[jira] Commented: (HIVE-1138) Hive using LZO compression returns unexpected results.

Posted by "Zheng Shao (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-1138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12831301#action_12831301 ] 

Zheng Shao commented on HIVE-1138:
----------------------------------

Bennie, I think the problem is that when Hive reads the data and prints it to the screen, TextInputFormat didn't pick up the codec for the output file: attempt_201001151039_1841_r_000001_0.lzo_deflate

I remember that TextInputFormat takes the extension of the file name and then decides which codec to use. I think LzoCodec is not correctly configured to handle *.lzo_deflate files.
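
A minimal sketch of that lookup (a hypothetical standalone check, not Hive code; it assumes Hadoop's CompressionCodecFactory, which as far as I know is what TextInputFormat consults): only codecs listed in io.compression.codecs are registered with the factory, so if the LZO codecs are missing from that list, ".lzo_deflate" resolves to no codec at all and the file gets read as plain uncompressed text.
{noformat}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;

public class CodecLookupCheck {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Codec list without the LZO codecs, i.e. the misconfigured case.
    conf.set("io.compression.codecs",
        "org.apache.hadoop.io.compress.GzipCodec,"
      + "org.apache.hadoop.io.compress.DefaultCodec,"
      + "org.apache.hadoop.io.compress.BZip2Codec");
    CompressionCodecFactory factory = new CompressionCodecFactory(conf);
    CompressionCodec codec = factory.getCodec(
        new Path("attempt_201001151039_1841_r_000001_0.lzo_deflate"));
    // Prints "null": no registered codec claims the .lzo_deflate extension.
    // Add com.hadoop.compression.lzo.LzoCodec to the list above and this
    // resolves to LzoCodec, so the reducer output would be decompressed.
    System.out.println(codec == null ? "null" : codec.getClass().getName());
  }
}
{noformat}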




[jira] Commented: (HIVE-1138) Hive using LZO compression returns unexpected results.

Posted by "He Yongqiang (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-1138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12831043#action_12831043 ] 

He Yongqiang commented on HIVE-1138:
------------------------------------

Hi Bennie,

Are you seeing this on an RCFile table or on a table with another file format?

{quote}
I have a tab-separated file which I loaded with "load data inpath"; then I run:
SET hive.exec.compress.output=true;
SET mapred.output.compression.codec=com.hadoop.compression.lzo.LzoCodec;
SET mapred.map.output.compression.codec=com.hadoop.compression.lzo.LzoCodec;
select distinct login_cldr_id as cldr_id from chatsessions_load;

Ended Job = job_201001151039_1641
OK
NULL
NULL
NULL
Time taken: 49.06 seconds

however if I start it without the set commands I get this:
Ended Job = job_201001151039_1642
OK
2283
Time taken: 45.308 seconds
{quote}
What is the file format here?



[jira] Commented: (HIVE-1138) Hive using LZO compression returns unexpected results.

Posted by "Bennie Schut (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-1138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12831106#action_12831106 ] 

Bennie Schut commented on HIVE-1138:
------------------------------------

This example is on a text table:
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n';

but when I do this:
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe' STORED AS RCFILE;
I get the same result.


