Posted to user@hive.apache.org by Karthik <ka...@yahoo.com> on 2010/05/26 19:45:07 UTC

Query HDFS files without using LOAD (move)

Is there a way I can specify a list of files (or a file pattern / regex) from an HDFS location outside the Hive warehouse as a parameter to a Hive query?  I have a bunch of files that are also used by other applications, and I need to query them with Hive as well, so I do not want to use LOAD, which would move those files from their original location into the Hive warehouse.

My query is on incremental data (new files) added on a daily basis; it does not need the full list of files in a folder, so I need to specify a list of files or a pattern, something like a file filter, for the query.

Please suggest.

- KK.
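
(One common approach, sketched here purely for illustration; the table name, columns, and paths are hypothetical rather than taken from this thread: define an external table partitioned by date and add each day's existing folder as a partition at its current HDFS location, so a query can be filtered to just the new files without any LOAD.)

    CREATE EXTERNAL TABLE events_ext (
      user_id STRING,
      action  STRING
    )
    PARTITIONED BY (dt STRING)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    STORED AS TEXTFILE;

    -- Register an existing daily folder as a partition (path is made up).
    ALTER TABLE events_ext ADD PARTITION (dt='2010-05-26')
      LOCATION 'hdfs:///data/app/events/2010-05-26';

    -- Only that day's files are read.
    SELECT action, COUNT(1) FROM events_ext WHERE dt='2010-05-26' GROUP BY action;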


RE: Query HDFS files without using LOAD (move)

Posted by John Sichi <js...@facebook.com>.
For the pattern part, there is a JIRA issue open, but from the comment thread, I'm not sure where we are with it:

https://issues.apache.org/jira/browse/HIVE-951

JVS
________________________________________
From: Ashish Thusoo [athusoo@facebook.com]
Sent: Wednesday, May 26, 2010 11:03 AM
To: hive-user@hadoop.apache.org
Subject: RE: Query HDFS files without using LOAD (move)

You could probably use external tables?? CREATE EXTERNAL TABLE allows you to create tables on hdfs files but I do not think that it takes file patterns / regex. If all the files are created within a directory then you could point the external table to the directory location and then querying on that table would automatically query all the files in that directory. Are your files in a single directory or are they spread out?

http://wiki.apache.org/hadoop/Hive/LanguageManual/DDL#Create_Table

Ashish

-----Original Message-----
From: Karthik [mailto:karthik_swa@yahoo.com]
Sent: Wednesday, May 26, 2010 10:45 AM
To: hive-user@hadoop.apache.org
Subject: Query HDFS files without using LOAD (move)

Is there a way where I can specify a list of files (or file pattern / regex) from a HDFS location other than the Hive Warehouse as a parameter to a Hive Query?  I have a bunch of files that are used by other applications as well and I need to perform queries on those as well using Hive and so I do not want to use LOAD and move those files on to Hive warehouse from the original location.

My query is on incremental data (new files) that are added on a daily basis and need not use the full list of files on a folder and so I need to specify a list of file / pattern, something like a filter of files to the query.

Please suggest.

- KK.


RE: Query HDFS files without using LOAD (move)

Posted by John Sichi <js...@facebook.com>.
https://issues.apache.org/jira/browse/HIVE-1272

________________________________________
From: Edward Capriolo [edlinuxguru@gmail.com]
Sent: Wednesday, May 26, 2010 11:09 AM
To: hive-user@hadoop.apache.org
Subject: Re: Query HDFS files without using LOAD (move)

Also in trunk there is a feature that uses a file with a list of files as the input. I do not know the Jira #
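
(If that feature is the SymlinkTextInputFormat that HIVE-1272 appears to track, usage would look roughly like the sketch below; the table name, columns, and paths are hypothetical. The idea is that the table's location holds small text files, each listing the real HDFS data paths, one per line.)

    CREATE EXTERNAL TABLE events_by_list (
      user_id STRING,
      action  STRING
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    STORED AS
      INPUTFORMAT  'org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat'
      OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
    LOCATION 'hdfs:///data/app/symlinks';

    -- A file such as hdfs:///data/app/symlinks/2010-05-26.txt would contain:
    --   hdfs:///data/app/events/2010-05-26/part-00000
    --   hdfs:///data/app/events/2010-05-26/part-00001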


Re: Query HDFS files without using LOAD (move)

Posted by Edward Capriolo <ed...@gmail.com>.
On Wed, May 26, 2010 at 2:03 PM, Ashish Thusoo <at...@facebook.com> wrote:

> You could probably use external tables?? CREATE EXTERNAL TABLE allows you
> to create tables on hdfs files but I do not think that it takes file
> patterns / regex. If all the files are created within a directory then you
> could point the external table to the directory location and then querying
> on that table would automatically query all the files in that directory. Are
> your files in a single directory or are they spread out?
>
> http://wiki.apache.org/hadoop/Hive/LanguageManual/DDL#Create_Table
>
> Ashish
>
> -----Original Message-----
> From: Karthik [mailto:karthik_swa@yahoo.com]
> Sent: Wednesday, May 26, 2010 10:45 AM
> To: hive-user@hadoop.apache.org
> Subject: Query HDFS files without using LOAD (move)
>
> Is there a way where I can specify a list of files (or file pattern /
> regex) from a HDFS location other than the Hive Warehouse as a parameter to
> a Hive Query?  I have a bunch of files that are used by other applications
> as well and I need to perform queries on those as well using Hive and so I
> do not want to use LOAD and move those files on to Hive warehouse from the
> original location.
>
> My query is on incremental data (new files) that are added on a daily basis
> and need not use the full list of files on a folder and so I need to specify
> a list of file / pattern, something like a filter of files to the query.
>
> Please suggest.
>
> - KK.
>

Also in trunk there is a feature that uses a file with a list of files as the input. I do not know the Jira #

Re: Build failure on latest Hive trunk

Posted by John Sichi <js...@facebook.com>.
Those line numbers match method declarations in branch-0.5, not trunk.  Make sure you got this:

http://svn.apache.org/repos/asf/hadoop/hive/trunk

Not:

http://svn.apache.org/repos/asf/hadoop/hive/branches/branch-0.5

JVS

On May 27, 2010, at 10:37 AM, Karthik wrote:

> I checked out the latest Hive trunk "http://svn.apache.org/repos/asf/hadoop/hive/" today and when I did a build (ant package) I get this below error:  (Any quick fixes??)
> 
> build_shims:
>     [echo] Compiling shims against hadoop 0.17.2.1 (/Hadoop/downloads/hive/hive/build/hadoopcore/hadoop-0.17.2.1)
>    [javac] Compiling 5 source files to /Hadoop/downloads/hive/hive/build/shims/classes
>    [javac] /Hadoop/downloads/hive/hive/shims/src/0.17/java/org/apache/hadoop/hive/shims/Hadoop17Shims.java:96: method does not override a method from its superclass
>    [javac]   @Override
>    [javac]    ^
>    [javac] /Hadoop/downloads/hive/hive/shims/src/0.17/java/org/apache/hadoop/hive/shims/Hadoop17Shims.java:109: method does not override a method from its superclass
>    [javac]   @Override
>    [javac]    ^
>    [javac] 2 errors
> 
> BUILD FAILED
> 


Build failure on latest Hive trunk

Posted by Karthik <ka...@yahoo.com>.
I checked out the latest Hive trunk from "http://svn.apache.org/repos/asf/hadoop/hive/" today, and when I ran a build (ant package) I got the error below.  (Any quick fixes?)

build_shims:
     [echo] Compiling shims against hadoop 0.17.2.1 (/Hadoop/downloads/hive/hive/build/hadoopcore/hadoop-0.17.2.1)
    [javac] Compiling 5 source files to /Hadoop/downloads/hive/hive/build/shims/classes
    [javac] /Hadoop/downloads/hive/hive/shims/src/0.17/java/org/apache/hadoop/hive/shims/Hadoop17Shims.java:96: method does not override a method from its superclass
    [javac]   @Override
    [javac]    ^
    [javac] /Hadoop/downloads/hive/hive/shims/src/0.17/java/org/apache/hadoop/hive/shims/Hadoop17Shims.java:109: method does not override a method from its superclass
    [javac]   @Override
    [javac]    ^
    [javac] 2 errors

BUILD FAILED


Re: Query HDFS files without using LOAD (move)

Posted by John Sichi <js...@facebook.com>.
Are you using the latest Hive trunk?  There were some patches such as HIVE-1200 and HIVE-132 to make Hive compatible with this Hadoop feature.

JVS

On May 26, 2010, at 5:01 PM, Karthik wrote:

> Hi John,
> 
> I tried your suggestion and almost worked :(
> 
> I see that both Map and Reduce tasks complete 100%  (and all the files I need under different sub folders are read by the mappers without any issue), but after the reducers are done, instead of getting the results printed, I get this error:
> 
> 2010-05-26 23:45:42,048 ERROR CliDriver (SessionState.java:printError(279)) - Failed with exception java.io.IOException:java.io.IOException: No input paths specified in job
> java.io.IOException: java.io.IOException: No input paths specified in job
> 
> Any idea?
> 
> Thanks again for helping out so far.
> 
> Regards,
> Karthik.
> 
> 
> 
> ----- Original Message ----
> From: John Sichi <js...@facebook.com>
> To: "hive-user@hadoop.apache.org" <hi...@hadoop.apache.org>
> Sent: Wed, May 26, 2010 11:14:26 AM
> Subject: RE: Query HDFS files without using LOAD (move)
> 
> Use a Hadoop version which includes this:
> 
> https://issues.apache.org/jira/browse/MAPREDUCE-1501
> 
> and
> 
> set mapred.input.dir.recursive=true; 
> 
> We are currently using this in production.  However, it does not deal with the pattern case.
> 
> JVS
> 
> ________________________________________
> From: Karthik [karthik_swa@yahoo.com]
> Sent: Wednesday, May 26, 2010 11:08 AM
> To: hive-user@hadoop.apache.org
> Subject: Re: Query HDFS files without using LOAD (move)
> 
> Thanks a lot for the quick reply Ashish.
> 
> The files are currently across multiple folders as they high in number and so they are arranged by category (functionally) across multiple folders in HDFS.  Any work around to support multiple folders?
> 
> -KK.
> 
> 
> 
> ----- Original Message ----
> From: Ashish Thusoo <at...@facebook.com>
> To: "hive-user@hadoop.apache.org" <hi...@hadoop.apache.org>
> Sent: Wed, May 26, 2010 11:03:43 AM
> Subject: RE: Query HDFS files without using LOAD (move)
> 
> You could probably use external tables?? CREATE EXTERNAL TABLE allows you to create tables on hdfs files but I do not think that it takes file patterns / regex. If all the files are created within a directory then you could point the external table to the directory location and then querying on that table would automatically query all the files in that directory. Are your files in a single directory or are they spread out?
> 
> http://wiki.apache.org/hadoop/Hive/LanguageManual/DDL#Create_Table
> 
> Ashish
> 
> -----Original Message-----
> From: Karthik [mailto:karthik_swa@yahoo.com]
> Sent: Wednesday, May 26, 2010 10:45 AM
> To: hive-user@hadoop.apache.org
> Subject: Query HDFS files without using LOAD (move)
> 
> Is there a way where I can specify a list of files (or file pattern / regex) from a HDFS location other than the Hive Warehouse as a parameter to a Hive Query?  I have a bunch of files that are used by other applications as well and I need to perform queries on those as well using Hive and so I do not want to use LOAD and move those files on to Hive warehouse from the original location.
> 
> My query is on incremental data (new files) that are added on a daily basis and need not use the full list of files on a folder and so I need to specify a list of file / pattern, something like a filter of files to the query.
> 
> Please suggest.
> 
> - KK.
> 


Re: Query HDFS files without using LOAD (move)

Posted by Karthik <ka...@yahoo.com>.
Hi John,

I tried your suggestion and it almost worked :(

I see that both the Map and Reduce tasks complete 100% (and all the files I need under the different subfolders are read by the mappers without any issue), but after the reducers are done, instead of the results being printed, I get this error:

2010-05-26 23:45:42,048 ERROR CliDriver (SessionState.java:printError(279)) - Failed with exception java.io.IOException:java.io.IOException: No input paths specified in job
java.io.IOException: java.io.IOException: No input paths specified in job

Any idea?

Thanks again for helping out so far.

Regards,
Karthik.



----- Original Message ----
From: John Sichi <js...@facebook.com>
To: "hive-user@hadoop.apache.org" <hi...@hadoop.apache.org>
Sent: Wed, May 26, 2010 11:14:26 AM
Subject: RE: Query HDFS files without using LOAD (move)

Use a Hadoop version which includes this:

https://issues.apache.org/jira/browse/MAPREDUCE-1501

and

set mapred.input.dir.recursive=true; 

We are currently using this in production.  However, it does not deal with the pattern case.

JVS

________________________________________
From: Karthik [karthik_swa@yahoo.com]
Sent: Wednesday, May 26, 2010 11:08 AM
To: hive-user@hadoop.apache.org
Subject: Re: Query HDFS files without using LOAD (move)

Thanks a lot for the quick reply Ashish.

The files are currently across multiple folders as they high in number and so they are arranged by category (functionally) across multiple folders in HDFS.  Any work around to support multiple folders?

-KK.



----- Original Message ----
From: Ashish Thusoo <at...@facebook.com>
To: "hive-user@hadoop.apache.org" <hi...@hadoop.apache.org>
Sent: Wed, May 26, 2010 11:03:43 AM
Subject: RE: Query HDFS files without using LOAD (move)

You could probably use external tables?? CREATE EXTERNAL TABLE allows you to create tables on hdfs files but I do not think that it takes file patterns / regex. If all the files are created within a directory then you could point the external table to the directory location and then querying on that table would automatically query all the files in that directory. Are your files in a single directory or are they spread out?

http://wiki.apache.org/hadoop/Hive/LanguageManual/DDL#Create_Table

Ashish

-----Original Message-----
From: Karthik [mailto:karthik_swa@yahoo.com]
Sent: Wednesday, May 26, 2010 10:45 AM
To: hive-user@hadoop.apache.org
Subject: Query HDFS files without using LOAD (move)

Is there a way where I can specify a list of files (or file pattern / regex) from a HDFS location other than the Hive Warehouse as a parameter to a Hive Query?  I have a bunch of files that are used by other applications as well and I need to perform queries on those as well using Hive and so I do not want to use LOAD and move those files on to Hive warehouse from the original location.

My query is on incremental data (new files) that are added on a daily basis and need not use the full list of files on a folder and so I need to specify a list of file / pattern, something like a filter of files to the query.

Please suggest.

- KK.


RE: Query HDFS files without using LOAD (move)

Posted by John Sichi <js...@facebook.com>.
Use a Hadoop version which includes this:

https://issues.apache.org/jira/browse/MAPREDUCE-1501

and

set mapred.input.dir.recursive=true; 

We are currently using this in production.  However, it does not deal with the pattern case.
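
For example, with such a Hadoop build, something along these lines should pick up files in nested subfolders (the table, columns, and path here are only illustrative):

    SET mapred.input.dir.recursive=true;

    CREATE EXTERNAL TABLE events_all (
      user_id STRING,
      action  STRING
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    LOCATION 'hdfs:///data/app/events';  -- parent folder; daily subfolders hold the files

    SELECT COUNT(1) FROM events_all;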

JVS

________________________________________
From: Karthik [karthik_swa@yahoo.com]
Sent: Wednesday, May 26, 2010 11:08 AM
To: hive-user@hadoop.apache.org
Subject: Re: Query HDFS files without using LOAD (move)

Thanks a lot for the quick reply Ashish.

The files are currently across multiple folders as they high in number and so they are arranged by category (functionally) across multiple folders in HDFS.  Any work around to support multiple folders?

-KK.



----- Original Message ----
From: Ashish Thusoo <at...@facebook.com>
To: "hive-user@hadoop.apache.org" <hi...@hadoop.apache.org>
Sent: Wed, May 26, 2010 11:03:43 AM
Subject: RE: Query HDFS files without using LOAD (move)

You could probably use external tables?? CREATE EXTERNAL TABLE allows you to create tables on hdfs files but I do not think that it takes file patterns / regex. If all the files are created within a directory then you could point the external table to the directory location and then querying on that table would automatically query all the files in that directory. Are your files in a single directory or are they spread out?

http://wiki.apache.org/hadoop/Hive/LanguageManual/DDL#Create_Table

Ashish

-----Original Message-----
From: Karthik [mailto:karthik_swa@yahoo.com]
Sent: Wednesday, May 26, 2010 10:45 AM
To: hive-user@hadoop.apache.org
Subject: Query HDFS files without using LOAD (move)

Is there a way where I can specify a list of files (or file pattern / regex) from a HDFS location other than the Hive Warehouse as a parameter to a Hive Query?  I have a bunch of files that are used by other applications as well and I need to perform queries on those as well using Hive and so I do not want to use LOAD and move those files on to Hive warehouse from the original location.

My query is on incremental data (new files) that are added on a daily basis and need not use the full list of files on a folder and so I need to specify a list of file / pattern, something like a filter of files to the query.

Please suggest.

- KK.

Re: Query HDFS files without using LOAD (move)

Posted by Karthik <ka...@yahoo.com>.
Thanks a lot for the quick reply Ashish.

The files are currently spread across multiple folders: they are high in number, so they are arranged by category (functionally) across multiple folders in HDFS.  Is there any workaround to support multiple folders?

-KK.



----- Original Message ----
From: Ashish Thusoo <at...@facebook.com>
To: "hive-user@hadoop.apache.org" <hi...@hadoop.apache.org>
Sent: Wed, May 26, 2010 11:03:43 AM
Subject: RE: Query HDFS files without using LOAD (move)

You could probably use external tables?? CREATE EXTERNAL TABLE allows you to create tables on hdfs files but I do not think that it takes file patterns / regex. If all the files are created within a directory then you could point the external table to the directory location and then querying on that table would automatically query all the files in that directory. Are your files in a single directory or are they spread out? 

http://wiki.apache.org/hadoop/Hive/LanguageManual/DDL#Create_Table

Ashish

-----Original Message-----
From: Karthik [mailto:karthik_swa@yahoo.com] 
Sent: Wednesday, May 26, 2010 10:45 AM
To: hive-user@hadoop.apache.org
Subject: Query HDFS files without using LOAD (move)

Is there a way where I can specify a list of files (or file pattern / regex) from a HDFS location other than the Hive Warehouse as a parameter to a Hive Query?  I have a bunch of files that are used by other applications as well and I need to perform queries on those as well using Hive and so I do not want to use LOAD and move those files on to Hive warehouse from the original location.

My query is on incremental data (new files) that are added on a daily basis and need not use the full list of files on a folder and so I need to specify a list of file / pattern, something like a filter of files to the query.

Please suggest.

- KK.

RE: Query HDFS files without using LOAD (move)

Posted by Ashish Thusoo <at...@facebook.com>.
You could probably use external tables. CREATE EXTERNAL TABLE allows you to create tables on HDFS files, but I do not think it takes file patterns / regexes. If all the files are created within a directory, you could point the external table at that directory location, and querying that table would automatically query all the files in the directory. Are your files in a single directory, or are they spread out?

http://wiki.apache.org/hadoop/Hive/LanguageManual/DDL#Create_Table
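
A minimal sketch of that (the table name, columns, and path are hypothetical):

    CREATE EXTERNAL TABLE events_raw (
      user_id STRING,
      action  STRING
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    LOCATION 'hdfs:///data/app/events/current';

    -- The files already under that directory are queried in place;
    -- nothing is moved into the Hive warehouse.
    SELECT COUNT(1) FROM events_raw;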

Ashish

-----Original Message-----
From: Karthik [mailto:karthik_swa@yahoo.com] 
Sent: Wednesday, May 26, 2010 10:45 AM
To: hive-user@hadoop.apache.org
Subject: Query HDFS files without using LOAD (move)

Is there a way where I can specify a list of files (or file pattern / regex) from a HDFS location other than the Hive Warehouse as a parameter to a Hive Query?  I have a bunch of files that are used by other applications as well and I need to perform queries on those as well using Hive and so I do not want to use LOAD and move those files on to Hive warehouse from the original location.

My query is on incremental data (new files) that are added on a daily basis and need not use the full list of files on a folder and so I need to specify a list of file / pattern, something like a filter of files to the query.

Please suggest.

- KK.