Posted to dev@pig.apache.org by Jae Lee <Ja...@forward.co.uk> on 2010/12/01 13:33:01 UTC

has anyone tried using HiveColumnarLoader over TextFile fileformat?

Hi everyone.

I've tried using HiveColumnarLoader and getting java.io.IOException: hdfs://file_path not a RCFile

I've noticed HiveColumnarLoader is expecting HiveRCRecordReader from prepareToRead method..

Could you guys give any guidance how possible it is to modify HiveRCRecordReader to support any RecordReader?

J

Re: has anyone tried using HiveColumnarLoader over TextFile fileformat?

Posted by Jae Lee <Ja...@forward.co.uk>.
Hi,

Thanks for letting me know about the Jira ticket.
Yes, it would be necessary to have those partitions as part of the schema to group them by.

J

On 1 Dec 2010, at 16:33, Gerrit Jansen van Vuuren wrote:

> Hi,
> 
> 
> You'll have to tell pig in the AS statement what the schema is: 
> e.g. I = LOAD '$INPUT' using AllLoader() AS ( viewTime:int, userid:long,
> page_url:chararray, referrer_url:chararray, ip:chararray, country:chararray
> );
> 
> The only problem with the AllLoader currently (until the jira I sent earlier
> is fixed) is that the partition keys won't be in the schema itself, but you
> can still filter by partition using the all loader constructor: for example
> AllLoader("date>='2010-11-01'")
> 
> 
> Cheers,
> Gerrit
> 
> 
>> viewTime INT, userid BIGINT,
>>    page_url STRING, referrer_url STRING,
>>    ip STRING COMMENT 'IP Address of the User',
>>    country STRING COMMENT 'country of origination')
> 
> -----Original Message-----
> From: Jae Lee [mailto:Jae.Lee@forward.co.uk] 
> Sent: Wednesday, December 01, 2010 4:24 PM
> To: dev@pig.apache.org
> Subject: Re: has anyone tried using HiveColumnarLoader over TextFile
> fileformat?
> 
> Thanks Gerrit,
> 
> yeah it seems to work as in it loads up the files properly...
> 
> however it fails to understand schema and there's no way to specify the
> underlying schema....
> 
> Would you have any recommendation to get the schema right?
> 
> J
> 
> On 1 Dec 2010, at 15:48, Gerrit Jansen van Vuuren wrote:
> 
>> Hi,
>> 
>> 
>> 
>> Short answer is yes. As long as the partition keys are reflected in the
>> folder path itself AllLoader will pick it up.
>> 
>> Partition keys in hive are (normally from my understanding) reflected in the
>> file path itself so that if you have 
>> partitions: type, date
>> The table path will actually be
>> $HIVE_ROOT/warehouse/mytable/type=[value]/date=[value]
>> 
>> The AllLoader does understand this type of partitioning. So that if you
>> point it to load $HIVE_ROOT/warehouse/mytable
>> It will allow you to use the type and date columns to filter (note that you
>> can only specify the filtering in the AllLoader() part  see:
>> https://issues.apache.org/jira/browse/PIG-1717 )
>> 
>> The partitioning is detected by the AllLoader (and HiveColumnarLoader) by
>> looking at the actual folders in the path, and reading all key=value
>> patterns in the path name itself, registering these internally as partition
>> keys.
>> 
>> 
>> -----Original Message-----
>> From: Jae Lee [mailto:Jae.Lee@forward.co.uk] 
>> Sent: Wednesday, December 01, 2010 2:03 PM
>> To: dev@pig.apache.org
>> Subject: Re: has anyone tried using HiveColumnarLoader over TextFile
>> fileformat?
>> 
>> Hi Gerrit,
>> 
>> Yeah Hive table isn't stored as RCFILE but TEXTFILE
>> 
>> so our table creation ddl looks like below
>> 
>> CREATE EXTERNAL TABLE page_view(viewTime INT, userid BIGINT,
>>    page_url STRING, referrer_url STRING,
>>    ip STRING COMMENT 'IP Address of the User',
>>    country STRING COMMENT 'country of origination')
>> COMMENT 'This is the staging page view table'
>> ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
>> STORED AS TEXTFILE
>> 
>> Does AllLoader understand the notion of partition keys, like HiveColumnarLoader does?
>> 
>> J
>> 
>> On 1 Dec 2010, at 13:48, Gerrit Jansen van Vuuren wrote:
>> 
>>> Hi,
>>> 
>>> The HiveColumnarLoader can only read files written by hive or the hive
>>> API(s), and has its own InputFormat returning the HiveRCRecordReader.
>>> 
>>> Are you trying to read a plain text format? 
>>> Under the hood the HiveRCRecordReader uses the hive specific rc reader to
>>> read the input file and throws an error either if the file is not hive rc or
>>> is a corrupt hiverc.
>>> 
>>> 
>>> If what you want is a Loader that loads all types of files, have a look at
>>> the AllLoader (latest piggybank trunk). It uses configuration that you set
>>> in the pig.properties to decide on the fly what loader to use for what files
>>> (does extension, content and path matching), it also has the hive style path
>>> partitioning for dates etc. Using this loader you can point it at a directory
>>> with lzo, gz, bz2 hiverc etc files in it and if you setup the loaders
>>> correctly it will load each file with its preconfigured loader.
>>> The javadoc in the class explains how to configure it.
>>> 
>>> Cheers,
>>> Gerrit
>>> 
>>> -----Original Message-----
>>> From: Jae Lee [mailto:Jae.Lee@forward.co.uk] 
>>> Sent: Wednesday, December 01, 2010 12:33 PM
>>> To: dev@pig.apache.org
>>> Subject: has anyone tried using HiveColumnarLoader over TextFile
>> fileformat?
>>> 
>>> Hi everyone.
>>> 
>>> I've tried using HiveColumnarLoader and getting java.io.IOException:
>>> hdfs://file_path not a RCFile
>>> 
>>> I've noticed HiveColumnarLoader is expecting HiveRCRecordReader from
>>> prepareToRead method..
>>> 
>>> Could you guys give any guidance how possible it is to modify
>>> HiveRCRecordReader to support any RecordReader?
>>> 
>>> J
>>> 
>>> 
>> 
>> 
>> 
> 
> 
> 


RE: has anyone tried using HiveColumnarLoader over TextFile fileformat?

Posted by Gerrit Jansen van Vuuren <gv...@specificmedia.com>.
Hi,


You'll have to tell pig in the AS statement what the schema is: 
e.g. I = LOAD '$INPUT' using AllLoader() AS ( viewTime:int, userid:long,
page_url:chararray, referrer_url:chararray, ip:chararray, country:chararray
);

The only problem with the AllLoader currently (until the jira I sent earlier
is fixed) is that the partition keys won't be in the schema itself, but you
can still filter by partition using the all loader constructor: for example
AllLoader("date>='2010-11-01'")
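A filter string like date>='2010-11-01' works on the partition value as text; since ISO dates (yyyy-MM-dd) order lexicographically, a plain string comparison is enough. A toy standalone sketch of that idea (the class and method names here are hypothetical, not the actual AllLoader internals):

```java
public class PartitionFilter {
    // Stand-in for a date>='2010-11-01'-style partition filter.
    // ISO yyyy-MM-dd dates compare correctly as plain strings.
    public static boolean dateAtLeast(String partitionDate, String cutoff) {
        return partitionDate.compareTo(cutoff) >= 0;
    }

    public static void main(String[] args) {
        System.out.println(dateAtLeast("2010-11-15", "2010-11-01")); // true
        System.out.println(dateAtLeast("2010-10-31", "2010-11-01")); // false
    }
}
```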


Cheers,
 Gerrit


> viewTime INT, userid BIGINT,
>     page_url STRING, referrer_url STRING,
>     ip STRING COMMENT 'IP Address of the User',
>     country STRING COMMENT 'country of origination')

-----Original Message-----
From: Jae Lee [mailto:Jae.Lee@forward.co.uk] 
Sent: Wednesday, December 01, 2010 4:24 PM
To: dev@pig.apache.org
Subject: Re: has anyone tried using HiveColumnarLoader over TextFile
fileformat?

Thanks Gerrit,

yeah it seems to work as in it loads up the files properly...

however it fails to understand schema and there's no way to specify the
underlying schema....

Would you have any recommendation to get the schema right?

J

On 1 Dec 2010, at 15:48, Gerrit Jansen van Vuuren wrote:

> Hi,
> 
> 
> 
> Short answer is yes. As long as the partition keys are reflected in the
> folder path itself AllLoader will pick it up.
> 
> Partition keys in hive are (normally from my understanding) reflected in the
> file path itself so that if you have 
> partitions: type, date
> The table path will actually be
> $HIVE_ROOT/warehouse/mytable/type=[value]/date=[value]
> 
> The AllLoader does understand this type of partitioning. So that if you
> point it to load $HIVE_ROOT/warehouse/mytable
> It will allow you to use the type and date columns to filter (note that you
> can only specify the filtering in the AllLoader() part  see:
> https://issues.apache.org/jira/browse/PIG-1717 )
> 
> The partitioning is detected by the AllLoader (and HiveColumnarLoader) by
> looking at the actual folders in the path, and reading all key=value
> patterns in the path name itself, registering these internally as partition
> keys.
> 
> 
> -----Original Message-----
> From: Jae Lee [mailto:Jae.Lee@forward.co.uk] 
> Sent: Wednesday, December 01, 2010 2:03 PM
> To: dev@pig.apache.org
> Subject: Re: has anyone tried using HiveColumnarLoader over TextFile
> fileformat?
> 
> Hi Gerrit,
> 
> Yeah Hive table isn't stored as RCFILE but TEXTFILE
> 
> so our table creation ddl looks like below
> 
> CREATE EXTERNAL TABLE page_view(viewTime INT, userid BIGINT,
>     page_url STRING, referrer_url STRING,
>     ip STRING COMMENT 'IP Address of the User',
>     country STRING COMMENT 'country of origination')
> COMMENT 'This is the staging page view table'
> ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
> STORED AS TEXTFILE
> 
> Does AllLoader understand the notion of partition keys, like HiveColumnarLoader does?
> 
> J
> 
> On 1 Dec 2010, at 13:48, Gerrit Jansen van Vuuren wrote:
> 
>> Hi,
>> 
>> The HiveColumnarLoader can only read files written by hive or the hive
>> API(s), and has its own InputFormat returning the HiveRCRecordReader.
>> 
>> Are you trying to read a plain text format? 
>> Under the hood the HiveRCRecordReader uses the hive specific rc reader to
>> read the input file and throws an error either if the file is not hive rc or
>> is a corrupt hiverc.
>> 
>> 
>> If what you want is a Loader that loads all types of files, have a look at
>> the AllLoader (latest piggybank trunk). It uses configuration that you set
>> in the pig.properties to decide on the fly what loader to use for what files
>> (does extension, content and path matching), it also has the hive style path
>> partitioning for dates etc. Using this loader you can point it at a directory
>> with lzo, gz, bz2 hiverc etc files in it and if you setup the loaders
>> correctly it will load each file with its preconfigured loader.
>> The javadoc in the class explains how to configure it.
>> 
>> Cheers,
>> Gerrit
>> 
>> -----Original Message-----
>> From: Jae Lee [mailto:Jae.Lee@forward.co.uk] 
>> Sent: Wednesday, December 01, 2010 12:33 PM
>> To: dev@pig.apache.org
>> Subject: has anyone tried using HiveColumnarLoader over TextFile
> fileformat?
>> 
>> Hi everyone.
>> 
>> I've tried using HiveColumnarLoader and getting java.io.IOException:
>> hdfs://file_path not a RCFile
>> 
>> I've noticed HiveColumnarLoader is expecting HiveRCRecordReader from
>> prepareToRead method..
>> 
>> Could you guys give any guidance how possible it is to modify
>> HiveRCRecordReader to support any RecordReader?
>> 
>> J
>> 
>> 
> 
> 
> 



Re: has anyone tried using HiveColumnarLoader over TextFile fileformat?

Posted by Jae Lee <Ja...@forward.co.uk>.
Also, it doesn't seem to include the partitioning field as part of the returned tuple either...

J

On 1 Dec 2010, at 16:24, Jae Lee wrote:

> Thanks Gerrit,
> 
> yeah it seems to work as in it loads up the files properly...
> 
> however it fails to understand schema and there's no way to specify the underlying schema....
> 
> Would you have any recommendation to get the schema right?
> 
> J
> 
> On 1 Dec 2010, at 15:48, Gerrit Jansen van Vuuren wrote:
> 
>> Hi,
>> 
>> 
>> 
>> Short answer is yes. As long as the partition keys are reflected in the
>> folder path itself AllLoader will pick it up.
>> 
>> Partition keys in hive are (normally from my understanding) reflected in the
>> file path itself so that if you have 
>> partitions: type, date
>> The table path will actually be
>> $HIVE_ROOT/warehouse/mytable/type=[value]/date=[value]
>> 
>> The AllLoader does understand this type of partitioning. So that if you
>> point it to load $HIVE_ROOT/warehouse/mytable
>> It will allow you to use the type and date columns to filter (note that you
>> can only specify the filtering in the AllLoader() part  see:
>> https://issues.apache.org/jira/browse/PIG-1717 )
>> 
>> The partitioning is detected by the AllLoader (and HiveColumnarLoader) by
>> looking at the actual folders in the path, and reading all key=value
>> patterns in the path name itself, registering these internally as partition
>> keys.
>> 
>> 
>> -----Original Message-----
>> From: Jae Lee [mailto:Jae.Lee@forward.co.uk] 
>> Sent: Wednesday, December 01, 2010 2:03 PM
>> To: dev@pig.apache.org
>> Subject: Re: has anyone tried using HiveColumnarLoader over TextFile
>> fileformat?
>> 
>> Hi Gerrit,
>> 
>> Yeah Hive table isn't stored as RCFILE but TEXTFILE
>> 
>> so our table creation ddl looks like below
>> 
>> CREATE EXTERNAL TABLE page_view(viewTime INT, userid BIGINT,
>>    page_url STRING, referrer_url STRING,
>>    ip STRING COMMENT 'IP Address of the User',
>>    country STRING COMMENT 'country of origination')
>> COMMENT 'This is the staging page view table'
>> ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
>> STORED AS TEXTFILE
>> 
>> Does AllLoader understand the notion of partition keys, like HiveColumnarLoader does?
>> 
>> J
>> 
>> On 1 Dec 2010, at 13:48, Gerrit Jansen van Vuuren wrote:
>> 
>>> Hi,
>>> 
>>> The HiveColumnarLoader can only read files written by hive or the hive
>>> API(s), and has its own InputFormat returning the HiveRCRecordReader.
>>> 
>>> Are you trying to read a plain text format? 
>>> Under the hood the HiveRCRecordReader uses the hive specific rc reader to
>>> read the input file and throws an error either if the file is not hive rc or
>>> is a corrupt hiverc.
>>> 
>>> 
>>> If what you want is a Loader that loads all types of files, have a look at
>>> the AllLoader (latest piggybank trunk). It uses configuration that you set
>>> in the pig.properties to decide on the fly what loader to use for what files
>>> (does extension, content and path matching), it also has the hive style path
>>> partitioning for dates etc. Using this loader you can point it at a directory
>>> with lzo, gz, bz2 hiverc etc files in it and if you setup the loaders
>>> correctly it will load each file with its preconfigured loader.
>>> The javadoc in the class explains how to configure it.
>>> 
>>> Cheers,
>>> Gerrit
>>> 
>>> -----Original Message-----
>>> From: Jae Lee [mailto:Jae.Lee@forward.co.uk] 
>>> Sent: Wednesday, December 01, 2010 12:33 PM
>>> To: dev@pig.apache.org
>>> Subject: has anyone tried using HiveColumnarLoader over TextFile
>> fileformat?
>>> 
>>> Hi everyone.
>>> 
>>> I've tried using HiveColumnarLoader and getting java.io.IOException:
>>> hdfs://file_path not a RCFile
>>> 
>>> I've noticed HiveColumnarLoader is expecting HiveRCRecordReader from
>>> prepareToRead method..
>>> 
>>> Could you guys give any guidance how possible it is to modify
>>> HiveRCRecordReader to support any RecordReader?
>>> 
>>> J
>>> 
>>> 
>> 
>> 
>> 
> 
> 


Re: has anyone tried using HiveColumnarLoader over TextFile fileformat?

Posted by Jae Lee <Ja...@forward.co.uk>.
Thanks Gerrit,

yeah it seems to work as in it loads up the files properly...

however it fails to understand schema and there's no way to specify the underlying schema....

Would you have any recommendation to get the schema right?

J

On 1 Dec 2010, at 15:48, Gerrit Jansen van Vuuren wrote:

> Hi,
> 
> 
> 
> Short answer is yes. As long as the partition keys are reflected in the
> folder path itself AllLoader will pick it up.
> 
> Partition keys in hive are (normally from my understanding) reflected in the
> file path itself so that if you have 
> partitions: type, date
> The table path will actually be
> $HIVE_ROOT/warehouse/mytable/type=[value]/date=[value]
> 
> The AllLoader does understand this type of partitioning. So that if you
> point it to load $HIVE_ROOT/warehouse/mytable
> It will allow you to use the type and date columns to filter (note that you
> can only specify the filtering in the AllLoader() part  see:
> https://issues.apache.org/jira/browse/PIG-1717 )
> 
> The partitioning is detected by the AllLoader (and HiveColumnarLoader) by
> looking at the actual folders in the path, and reading all key=value
> patterns in the path name itself, registering these internally as partition
> keys.
> 
> 
> -----Original Message-----
> From: Jae Lee [mailto:Jae.Lee@forward.co.uk] 
> Sent: Wednesday, December 01, 2010 2:03 PM
> To: dev@pig.apache.org
> Subject: Re: has anyone tried using HiveColumnarLoader over TextFile
> fileformat?
> 
> Hi Gerrit,
> 
> Yeah Hive table isn't stored as RCFILE but TEXTFILE
> 
> so our table creation ddl looks like below
> 
> CREATE EXTERNAL TABLE page_view(viewTime INT, userid BIGINT,
>     page_url STRING, referrer_url STRING,
>     ip STRING COMMENT 'IP Address of the User',
>     country STRING COMMENT 'country of origination')
> COMMENT 'This is the staging page view table'
> ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
> STORED AS TEXTFILE
> 
> Does AllLoader understand the notion of partition keys, like HiveColumnarLoader does?
> 
> J
> 
> On 1 Dec 2010, at 13:48, Gerrit Jansen van Vuuren wrote:
> 
>> Hi,
>> 
>> The HiveColumnarLoader can only read files written by hive or the hive
>> API(s), and has its own InputFormat returning the HiveRCRecordReader.
>> 
>> Are you trying to read a plain text format? 
>> Under the hood the HiveRCRecordReader uses the hive specific rc reader to
>> read the input file and throws an error either if the file is not hive rc or
>> is a corrupt hiverc.
>> 
>> 
>> If what you want is a Loader that loads all types of files, have a look at
>> the AllLoader (latest piggybank trunk). It uses configuration that you set
>> in the pig.properties to decide on the fly what loader to use for what files
>> (does extension, content and path matching), it also has the hive style path
>> partitioning for dates etc. Using this loader you can point it at a directory
>> with lzo, gz, bz2 hiverc etc files in it and if you setup the loaders
>> correctly it will load each file with its preconfigured loader.
>> The javadoc in the class explains how to configure it.
>> 
>> Cheers,
>> Gerrit
>> 
>> -----Original Message-----
>> From: Jae Lee [mailto:Jae.Lee@forward.co.uk] 
>> Sent: Wednesday, December 01, 2010 12:33 PM
>> To: dev@pig.apache.org
>> Subject: has anyone tried using HiveColumnarLoader over TextFile
> fileformat?
>> 
>> Hi everyone.
>> 
>> I've tried using HiveColumnarLoader and getting java.io.IOException:
>> hdfs://file_path not a RCFile
>> 
>> I've noticed HiveColumnarLoader is expecting HiveRCRecordReader from
>> prepareToRead method..
>> 
>> Could you guys give any guidance how possible it is to modify
>> HiveRCRecordReader to support any RecordReader?
>> 
>> J
>> 
>> 
> 
> 
> 


RE: has anyone tried using HiveColumnarLoader over TextFile fileformat?

Posted by Gerrit Jansen van Vuuren <gv...@specificmedia.com>.
Hi,



Short answer is yes. As long as the partition keys are reflected in the
folder path itself AllLoader will pick it up.

Partition keys in hive are (normally from my understanding) reflected in the
file path itself so that if you have 
partitions: type, date
The table path will actually be
$HIVE_ROOT/warehouse/mytable/type=[value]/date=[value]

The AllLoader does understand this type of partitioning. So that if you
point it to load $HIVE_ROOT/warehouse/mytable
It will allow you to use the type and date columns to filter (note that you
can only specify the filtering in the AllLoader() part  see:
https://issues.apache.org/jira/browse/PIG-1717 )

The partitioning is detected by the AllLoader (and HiveColumnarLoader) by
looking at the actual folders in the path, and reading all key=value
patterns in the path name itself, registering these internally as partition
keys.
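The key=value scan described above can be sketched in plain Java. This is a standalone illustration of the convention, not the actual piggybank code, and the class name and sample path are hypothetical:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class PartitionKeys {
    // Walk each folder in the path and collect key=value segments,
    // in the order they appear, as partition keys.
    public static Map<String, String> parse(String path) {
        Map<String, String> keys = new LinkedHashMap<String, String>();
        for (String segment : path.split("/")) {
            int eq = segment.indexOf('=');
            if (eq > 0) {
                keys.put(segment.substring(0, eq), segment.substring(eq + 1));
            }
        }
        return keys;
    }

    public static void main(String[] args) {
        Map<String, String> k =
            parse("/user/hive/warehouse/mytable/type=web/date=2010-11-01/part-00000");
        System.out.println(k); // {type=web, date=2010-11-01}
    }
}
```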


-----Original Message-----
From: Jae Lee [mailto:Jae.Lee@forward.co.uk] 
Sent: Wednesday, December 01, 2010 2:03 PM
To: dev@pig.apache.org
Subject: Re: has anyone tried using HiveColumnarLoader over TextFile
fileformat?

Hi Gerrit,

Yeah Hive table isn't stored as RCFILE but TEXTFILE

so our table creation ddl looks like below

CREATE EXTERNAL TABLE page_view(viewTime INT, userid BIGINT,
     page_url STRING, referrer_url STRING,
     ip STRING COMMENT 'IP Address of the User',
     country STRING COMMENT 'country of origination')
 COMMENT 'This is the staging page view table'
 ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
 STORED AS TEXTFILE

Does AllLoader understand the notion of partition keys, like HiveColumnarLoader does?

J

On 1 Dec 2010, at 13:48, Gerrit Jansen van Vuuren wrote:

> Hi,
> 
> The HiveColumnarLoader can only read files written by hive or the hive
> API(s), and has its own InputFormat returning the HiveRCRecordReader.
> 
> Are you trying to read a plain text format? 
> Under the hood the HiveRCRecordReader uses the hive specific rc reader to
> read the input file and throws an error either if the file is not hive rc or
> is a corrupt hiverc.
> 
> 
> If what you want is a Loader that loads all types of files, have a look at
> the AllLoader (latest piggybank trunk). It uses configuration that you set
> in the pig.properties to decide on the fly what loader to use for what files
> (does extension, content and path matching), it also has the hive style path
> partitioning for dates etc. Using this loader you can point it at a directory
> with lzo, gz, bz2 hiverc etc files in it and if you setup the loaders
> correctly it will load each file with its preconfigured loader.
> The javadoc in the class explains how to configure it.
> 
> Cheers,
> Gerrit
> 
> -----Original Message-----
> From: Jae Lee [mailto:Jae.Lee@forward.co.uk] 
> Sent: Wednesday, December 01, 2010 12:33 PM
> To: dev@pig.apache.org
> Subject: has anyone tried using HiveColumnarLoader over TextFile
fileformat?
> 
> Hi everyone.
> 
> I've tried using HiveColumnarLoader and getting java.io.IOException:
> hdfs://file_path not a RCFile
> 
> I've noticed HiveColumnarLoader is expecting HiveRCRecordReader from
> prepareToRead method..
> 
> Could you guys give any guidance how possible it is to modify
> HiveRCRecordReader to support any RecordReader?
> 
> J
> 
> 



Re: has anyone tried using HiveColumnarLoader over TextFile fileformat?

Posted by Jae Lee <Ja...@forward.co.uk>.
Hi Gerrit,

Yeah Hive table isn't stored as RCFILE but TEXTFILE

so our table creation ddl looks like below

CREATE EXTERNAL TABLE page_view(viewTime INT, userid BIGINT,
     page_url STRING, referrer_url STRING,
     ip STRING COMMENT 'IP Address of the User',
     country STRING COMMENT 'country of origination')
 COMMENT 'This is the staging page view table'
 ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
 STORED AS TEXTFILE
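Since this table is stored as tab-delimited text, a row can be split outside Hive with an ordinary string split. A throwaway sketch (the class name and sample values are hypothetical, and this is not the loader's actual parsing code):

```java
public class PageViewRow {
    // Split one line of the page_view TEXTFILE table on the '\t'
    // delimiter declared in the DDL; -1 keeps trailing empty fields.
    public static String[] split(String line) {
        return line.split("\t", -1);
    }

    public static void main(String[] args) {
        String[] f = split("1291208000\t42\t/home\t/ref\t10.0.0.1\tUK");
        // f[0]=viewTime, f[1]=userid, f[2]=page_url,
        // f[3]=referrer_url, f[4]=ip, f[5]=country
        System.out.println(f.length); // 6
    }
}
```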

Does AllLoader understand the notion of partition keys, like HiveColumnarLoader does?

J

On 1 Dec 2010, at 13:48, Gerrit Jansen van Vuuren wrote:

> Hi,
> 
> The HiveColumnarLoader can only read files written by hive or the hive
> API(s), and has its own InputFormat returning the HiveRCRecordReader.
> 
> Are you trying to read a plain text format? 
> Under the hood the HiveRCRecordReader uses the hive specific rc reader to
> read the input file and throws an error either if the file is not hive rc or
> is a corrupt hiverc.
> 
> 
> If what you want is a Loader that loads all types of files, have a look at
> the AllLoader (latest piggybank trunk). It uses configuration that you set
> in the pig.properties to decide on the fly what loader to use for what files
> (does extension, content and path matching), it also has the hive style path
> partitioning for dates etc. Using this loader you can point it at a directory
> with lzo, gz, bz2 hiverc etc files in it and if you setup the loaders
> correctly it will load each file with its preconfigured loader.
> The javadoc in the class explains how to configure it.
> 
> Cheers,
> Gerrit
> 
> -----Original Message-----
> From: Jae Lee [mailto:Jae.Lee@forward.co.uk] 
> Sent: Wednesday, December 01, 2010 12:33 PM
> To: dev@pig.apache.org
> Subject: has anyone tried using HiveColumnarLoader over TextFile fileformat?
> 
> Hi everyone.
> 
> I've tried using HiveColumnarLoader and getting java.io.IOException:
> hdfs://file_path not a RCFile
> 
> I've noticed HiveColumnarLoader is expecting HiveRCRecordReader from
> prepareToRead method..
> 
> Could you guys give any guidance how possible it is to modify
> HiveRCRecordReader to support any RecordReader?
> 
> J
> 
> 


RE: has anyone tried using HiveColumnarLoader over TextFile fileformat?

Posted by Gerrit Jansen van Vuuren <gv...@specificmedia.com>.
Hi,

The HiveColumnarLoader can only read files written by hive or the hive
API(s), and has its own InputFormat returning the HiveRCRecordReader.

Are you trying to read a plain text format? 
Under the hood the HiveRCRecordReader uses the hive specific rc reader to
read the input file and throws an error either if the file is not hive rc or
is a corrupt hiverc.


If what you want is a Loader that loads all types of files, have a look at
the AllLoader (latest piggybank trunk). It uses configuration that you set
in the pig.properties to decide on the fly what loader to use for what files
(does extension, content and path matching), it also has the hive style path
partitioning for dates etc. Using this loader you can point it at a directory
with lzo, gz, bz2 hiverc etc files in it and if you setup the loaders
correctly it will load each file with its preconfigured loader.
The javadoc in the class explains how to configure it.

Cheers,
 Gerrit

-----Original Message-----
From: Jae Lee [mailto:Jae.Lee@forward.co.uk] 
Sent: Wednesday, December 01, 2010 12:33 PM
To: dev@pig.apache.org
Subject: has anyone tried using HiveColumnarLoader over TextFile fileformat?

Hi everyone.

I've tried using HiveColumnarLoader and getting java.io.IOException:
hdfs://file_path not a RCFile

I've noticed HiveColumnarLoader is expecting HiveRCRecordReader from
prepareToRead method..

Could you guys give any guidance how possible it is to modify
HiveRCRecordReader to support any RecordReader?

J