Posted to user@hive.apache.org by Dave <dr...@gmail.com> on 2011/08/19 00:53:28 UTC
Ignore subdirectories when querying external table
Hi,
I have a partitioned external table in Hive, and in the partition
directories there are other subdirectories that are not related to the table
itself. Hive seems to want to scan those directories, as I am getting an
error message when trying to do a SELECT on the table:
Failed with exception java.io.IOException:java.io.IOException: Not a file:
hdfs://path/to/partition/path/to/subdir
Hive does, however, seem to ignore directories prefixed with an underscore
(_directory). I am using Hive 0.7.1 on Hadoop 0.20.2.
Is there a way to force Hive to ignore all subdirectories in external tables
and only look at files?
Thanks in advance,
-Dave
Re: Ignore subdirectories when querying external table
Posted by Sam William <sa...@stumbleupon.com>.
Dave,
Where do you specify the classpath before starting the Hive shell when you introduce a custom class like this?
Sam
Sam William
sampd@stumbleupon.com
Re: Ignore subdirectories when querying external table
Posted by Sam William <sa...@stumbleupon.com>.
Along similar lines, I want Hive to include subdirectories. That is,
I have an external table partitioned by month (the data for each month is under its own folder). Under the current month I want to keep adding folders daily. Is this possible without having to subclass InputFormat?
Sam William
sampd@stumbleupon.com
Re: Ignore subdirectories when querying external table
Posted by Dave <dr...@gmail.com>.
I solved my own problem. For anyone who's curious:
It turns out that subclassing a file-based InputFormat allows one to override
the listStatus method, which returns the list of files for Hive (or MapReduce
in general) to process. All I had to do was subclass
org.apache.hadoop.mapred.TextInputFormat and override listStatus, and voilà:
I was able to make it ignore directories. Here's the Java code that I used:
package com.example.mapreduce.input;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;

public class TextFileInputFormatIgnoreSubDir extends TextInputFormat {
    @Override
    protected FileStatus[] listStatus(JobConf job) throws IOException {
        // Let the parent class list everything under the input paths.
        FileStatus[] files = super.listStatus(job);
        // Keep regular files only; drop any subdirectory entries.
        List<FileStatus> newFiles = new ArrayList<FileStatus>();
        for (FileStatus file : files) {
            if (!file.isDir()) {
                newFiles.add(file);
            }
        }
        return newFiles.toArray(new FileStatus[newFiles.size()]);
    }
}
And the HiveQL code I used to define the table:
CREATE EXTERNAL TABLE users (id STRING, user_name STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS
  INPUTFORMAT 'com.example.mapreduce.input.TextFileInputFormatIgnoreSubDir'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION '/data/test/users';
Hope this saves someone else the trouble of figuring it out...
-Dave
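For anyone who wants to try the filtering step without a Hadoop cluster, here is a minimal sketch of the same idea using only the JDK. It is an analogy, not the code from the thread: java.io.File stands in for FileStatus, and the class name RegularFilesOnly and the method dropDirectories are hypothetical names made up for this illustration.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Hadoop or Hive: it mirrors the filtering
// loop in listStatus, with java.io.File standing in for FileStatus.
final class RegularFilesOnly {

    private RegularFilesOnly() {
    }

    // Returns the entries that are not directories, in their original order.
    static File[] dropDirectories(File[] entries) {
        List<File> kept = new ArrayList<File>();
        for (File entry : entries) {
            if (!entry.isDirectory()) {
                kept.add(entry);
            }
        }
        return kept.toArray(new File[kept.size()]);
    }
}
```

Passing the result of File.listFiles() for a partition-like directory to dropDirectories should return only the data files in it, which is what the overridden listStatus does for the entries that the parent TextInputFormat returns.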