Posted to dev@hive.apache.org by "Vitaliy Fuks (JIRA)" <ji...@apache.org> on 2011/08/19 07:08:27 UTC

[jira] [Created] (HIVE-2395) Misleading "No LZO codec found, cannot run." exception when using external table and LZO / DeprecatedLzoTextInputFormat

Misleading "No LZO codec found, cannot run." exception when using external table and LZO / DeprecatedLzoTextInputFormat
-----------------------------------------------------------------------------------------------------------------------

                 Key: HIVE-2395
                 URL: https://issues.apache.org/jira/browse/HIVE-2395
             Project: Hive
          Issue Type: Bug
          Components: Serializers/Deserializers
    Affects Versions: 0.7.1
         Environment: Cloudera 3u1 with https://github.com/kevinweil/hadoop-lzo or https://github.com/kevinweil/elephant-bird
            Reporter: Vitaliy Fuks


We have a {{/tables/}} directory containing .lzo files with our data, compressed using lzop.

We {{CREATE EXTERNAL TABLE}} on top of this directory, using {{STORED AS INPUTFORMAT "com.hadoop.mapred.DeprecatedLzoTextInputFormat"}}.

.lzo files require that LzoIndexer be run on them. When this is done, a .lzo.index file is created for every .lzo file, so we end up with:

{noformat}
/tables/ourdata_2011-08-19.lzo
/tables/ourdata_2011-08-19.lzo.index
/tables/ourdata_2011-08-18.lzo
/tables/ourdata_2011-08-18.lzo.index
..etc
{noformat}

The issue is that org.apache.hadoop.hive.ql.io.CombineHiveRecordReader attempts to call getRecordReader() on the .lzo.index files as well. This throws a pretty confusing exception:

{noformat}
Caused by: java.io.IOException: No LZO codec found, cannot run.
        at com.hadoop.mapred.DeprecatedLzoLineRecordReader.<init>(DeprecatedLzoLineRecordReader.java:53)
        at com.hadoop.mapred.DeprecatedLzoTextInputFormat.getRecordReader(DeprecatedLzoTextInputFormat.java:128)
        at org.apache.hadoop.hive.ql.io.CombineHiveRecordReader.<init>(CombineHiveRecordReader.java:68)
{noformat}

More precisely, it dies on the second invocation of getRecordReader(). Here is some System.out.println() output:

{noformat}
DeprecatedLzoTextInputFormat.getRecordReader(): split=/tables/ourdata_2011-08-19.lzo:0+616479
DeprecatedLzoTextInputFormat.getRecordReader(): split=/tables/ourdata_2011-08-19.lzo.index:0+64
{noformat}

DeprecatedLzoTextInputFormat contains the following code, which throws the exception and kills the query, since there is obviously no codec registered for .lzo.index files.

{noformat}
    final CompressionCodec codec = codecFactory.getCodec(file);
    if (codec == null) {
      throw new IOException("No LZO codec found, cannot run.");
    }
{noformat}
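
For what it's worth, here is a tiny standalone illustration (this is not Hive or hadoop-lzo code; it assumes the hadoop-lzo jar is on the classpath, as in our environment) of why that getCodec() call comes back null: CompressionCodecFactory resolves codecs by filename suffix, and nothing is registered for the .lzo.index suffix.

{noformat}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;

public class LzoCodecLookup {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Register the lzop codec; assumes the hadoop-lzo jar is on the classpath.
    conf.set("io.compression.codecs", "com.hadoop.compression.lzo.LzopCodec");

    CompressionCodecFactory factory = new CompressionCodecFactory(conf);

    // ".lzo" matches the suffix registered by LzopCodec, so a codec is found.
    CompressionCodec data = factory.getCodec(new Path("/tables/ourdata_2011-08-19.lzo"));

    // ".lzo.index" matches no registered suffix, so getCodec() returns null,
    // which is exactly the condition that throws the IOException above.
    CompressionCodec index = factory.getCodec(new Path("/tables/ourdata_2011-08-19.lzo.index"));

    System.out.println(".lzo       -> " + data);
    System.out.println(".lzo.index -> " + index); // null
  }
}
{noformat}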

As I understand it, the way things are right now, Hive considers all files within a directory to be part of the table. There is an open patch, HIVE-951, which would allow a quick workaround for this problem.

Does it make sense to add some hooks so that CombineHiveRecordReader or its parents are more aware of what files should be considered instead of blindly trying to read everything?
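
Purely to illustrate what I mean by a hook (the class below and any wiring of it into Hive are hypothetical; nothing like it exists today): even a simple PathFilter consulted while splits are being computed would be enough to keep side files out.

{noformat}
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.PathFilter;

// Hypothetical filter; Hive does not consult anything like this today.
public class SkipLzoIndexPathFilter implements PathFilter {
  @Override
  public boolean accept(Path path) {
    // Keep everything except LZO index side files.
    return !path.getName().endsWith(".lzo.index");
  }
}
{noformat}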

Any suggestions for a quick workaround to make it skip .index files?

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HIVE-2395) Misleading "No LZO codec found, cannot run." exception when using external table and LZO / DeprecatedLzoTextInputFormat

Posted by "Vitaliy Fuks (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-2395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13089907#comment-13089907 ] 

Vitaliy Fuks commented on HIVE-2395:
------------------------------------

Right, of course, and that's the workaround we went with. With smaller-than-block files it really doesn't matter.

After I filed this ticket, I did a quick-and-dirty hack on DeprecatedLzoTextInputFormat to ignore .lzo.index files. However, I then found that it doesn't read larger-than-block-size .lzo files correctly at all: it either crashes with things like ArrayIndexOutOfBoundsException in LzoDecompressor.setInput() or outright ignores all data beyond the block size. This happens even when the .lzo.index files are absent.
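
For reference, a sketch of the kind of hack I mean, written here as a subclass rather than the actual edit, with a made-up class name:

{noformat}
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;

import com.hadoop.mapred.DeprecatedLzoTextInputFormat;

public class IndexSkippingLzoTextInputFormat extends DeprecatedLzoTextInputFormat {

  @Override
  public RecordReader<LongWritable, Text> getRecordReader(InputSplit split, JobConf conf,
      Reporter reporter) throws IOException {
    // Hand back an "empty" reader for .lzo.index side files so the query keeps running.
    if (((FileSplit) split).getPath().getName().endsWith(".lzo.index")) {
      return new RecordReader<LongWritable, Text>() {
        public boolean next(LongWritable key, Text value) { return false; }
        public LongWritable createKey() { return new LongWritable(); }
        public Text createValue() { return new Text(); }
        public long getPos() { return 0; }
        public float getProgress() { return 1.0f; }
        public void close() { }
      };
    }
    return super.getRecordReader(split, conf, reporter);
  }
}
{noformat}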

So I then recreated the tables without INPUTFORMAT "DeprecatedLzoTextInputFormat" and they return correct data. Hive still attempts to read the .lzo.index files as data, so we are going un-indexed as a workaround (with the lack of splitting as the obvious side effect). At this point I'm not sure why we were using DeprecatedLzoTextInputFormat in the first place, other than that Google "told" us to. Maybe the Hive codebase has moved beyond needing it?

I will try code changes from https://github.com/kevinweil/hadoop-lzo/pull/28 when I have some free time.

PS. Our hadoop-lzo.jar is from the January 2011 release (v0.4.8), built by Gerrit.


[jira] [Resolved] (HIVE-2395) Misleading "No LZO codec found, cannot run." exception when using external table and LZO / DeprecatedLzoTextInputFormat

Posted by "Vitaliy Fuks (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-2395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vitaliy Fuks resolved HIVE-2395.
--------------------------------

    Resolution: Won't Fix

Latest hadoop-lzo libraries do not exhibit this behavior.

[jira] [Commented] (HIVE-2395) Misleading "No LZO codec found, cannot run." exception when using external table and LZO / DeprecatedLzoTextInputFormat

Posted by "Raghu Angadi (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-2395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13089612#comment-13089612 ] 

Raghu Angadi commented on HIVE-2395:
------------------------------------

> .lzo files require that an LzoIndexer is run on them.

This is not a requirement. You need the index files only if you want to split large .lzo files. You could just remove the index files as a quick workaround (in which case you might as well just use plain TextInputFormat).
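
Roughly, that cleanup could look like this (a sketch against the plain FileSystem API, with the /tables/ path taken from the description; a hadoop fs -rm on the same glob would do the same thing):

{noformat}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class RemoveLzoIndexFiles {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    // Glob the LZO index side files under the external table directory.
    FileStatus[] indexFiles = fs.globStatus(new Path("/tables/*.lzo.index"));
    if (indexFiles != null) {
      for (FileStatus status : indexFiles) {
        fs.delete(status.getPath(), false); // non-recursive delete of a single file
      }
    }
  }
}
{noformat}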

