Posted to user@hive.apache.org by Dave <dr...@gmail.com> on 2011/08/19 00:53:28 UTC

Ignore subdirectories when querying external table

Hi,

I have a partitioned external table in Hive, and in the partition
directories there are other subdirectories that are not related to the table
itself. Hive seems to want to scan those directories, as I am getting an
error message when trying to do a SELECT on the table:

Failed with exception java.io.IOException:java.io.IOException: Not a file:
hdfs://path/to/partition/path/to/subdir

Also, it seems to ignore directories prefixed by an underscore (_directory).

I am using hive 0.7.1 on Hadoop 0.20.2.

Is there a way to force Hive to ignore all subdirectories in external tables
and only look at files?

Thanks in advance,
-Dave

Re: Ignore subdirectories when querying external table

Posted by Sam William <sa...@stumbleupon.com>.
Dave,
 Where do you specify the classpath before starting the Hive shell when you introduce a custom class like this?

Sam


On Aug 19, 2011, at 1:22 PM, Dave wrote:

> I solved my own problem. For anyone who's curious:
> 
> It turns out that subclassing an InputFormat allows one to override the listStatus method, which returns the list of files for Hive (or mapreduce in general) to process. All I had to do was subclass org.apache.hadoop.mapred.TextInputFormat and override the listStatus method and voila; I was able to make it ignore directories. Here's the java code that I used:
> 
> public class TextFileInputFormatIgnoreSubDir extends TextInputFormat {
>     @Override
>     protected FileStatus[] listStatus (JobConf job) throws IOException {
>         FileStatus[] files = super.listStatus(job);
>         List<FileStatus> newFiles = new ArrayList<FileStatus>();
>         int len = files.length;
>         for (int i = 0; i < len; ++i) {
>             FileStatus file = files[i];
>             if (!file.isDir()) {
>                 newFiles.add(file);
>             }
>         }
> 
>         files = new FileStatus[newFiles.size()];
>         for (int i = 0; i < newFiles.size(); ++i) {
>             files[i] = newFiles.get(i);
>         }
> 
>         return files;
>     }
> }
> 
> And the HiveQL code I used to define the table:
> 
> CREATE EXTERNAL TABLE users (id STRING, user_name STRING)
> ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
> STORED AS INPUTFORMAT 'com.example.mapreduce.input.TextFileInputFormatIgnoreSubDir'
> OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
> LOCATION '/data/test/users';
> 
> Hope this saves someone else the trouble of figuring it out...
> 
> -Dave
> 

Sam William
sampd@stumbleupon.com




Re: Ignore subdirectories when querying external table

Posted by Sam William <sa...@stumbleupon.com>.
On similar lines, I want to have Hive include subdirs. That is:

I have an external table partitioned by month (data for each month under a folder). Under the current month I want to keep adding folders daily. Is this possible without having to subclass InputFormat?





Sam William
sampd@stumbleupon.com




Re: Ignore subdirectories when querying external table

Posted by Dave <dr...@gmail.com>.
I solved my own problem. For anyone who's curious:

It turns out that subclassing an InputFormat allows one to override the
listStatus method, which returns the list of files for Hive (or mapreduce in
general) to process. All I had to do was subclass
org.apache.hadoop.mapred.TextInputFormat and override the listStatus method
and voila; I was able to make it ignore directories. Here's the java code
that I used:

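package com.example.mapreduce.input;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;

// TextInputFormat variant that skips subdirectories found in the input paths.
public class TextFileInputFormatIgnoreSubDir extends TextInputFormat {
    @Override
    protected FileStatus[] listStatus(JobConf job) throws IOException {
        // Let the parent class list the input paths, then keep only plain files.
        FileStatus[] files = super.listStatus(job);
        List<FileStatus> newFiles = new ArrayList<FileStatus>();
        for (FileStatus file : files) {
            // isDir() is the directory check available on Hadoop 0.20.x.
            if (!file.isDir()) {
                newFiles.add(file);
            }
        }
        return newFiles.toArray(new FileStatus[newFiles.size()]);
    }
}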

And the HiveQL code I used to define the table:

CREATE EXTERNAL TABLE users (id STRING, user_name STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS INPUTFORMAT 'com.example.mapreduce.input.TextFileInputFormatIgnoreSubDir'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION '/data/test/users';
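
For Hive to find a custom class like this, it has to be on Hive's classpath. A minimal sketch of one way to do that, assuming the class has been compiled and packaged into a jar (the jar path below is hypothetical, and this has not been verified against Hive 0.7.1 specifically), is to register the jar in the session before creating or querying the table:

```sql
-- Hypothetical path: replace with wherever the compiled class was packaged.
ADD JAR /path/to/ignore-subdir-inputformat.jar;

-- Show the jars registered for this session to confirm it was picked up.
LIST JARS;
```

If the CLI in use does not accept the comment lines interactively, drop them. Starting the shell with hive --auxpath pointing at the jar should be another option when the class is needed in every session.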

Hope this saves someone else the trouble of figuring it out...

-Dave
