Posted to user@pig.apache.org by Vincent Barat <vi...@gmail.com> on 2013/12/26 12:35:58 UTC

How to read a file generated by Pig+BinStorage using the HDFS API ?

Hi all and merry Christmas !

I generate a file using a Pig script embedded in a Java process and 
store it using BinStorage.

Then, I would like to read this file directly from another Java 
client, without starting a Pig script (i.e., using only the Hadoop 
API and Pig's BinStorage class).
The goal is to do some real-time computation by scanning the file, 
so I cannot afford to start a Pig script for the computation: the 
overhead of starting the script and getting the result is too long 
for my real-time objectives (I need a result in a few seconds).

Of course, I could use JsonStorage and read my file with a JSON 
deserializer, but my guess is that it would be much slower, and also 
painful to handle the various parts generated for the output file 
(part-r-XXXXX).

Best regards,

Re: How to read a file generated by Pig+BinStorage using the HDFS API ?

Posted by Vincent Barat <vb...@capptain.com>.

Thanks for your answer.

Yes, I guess I will try to use these classes directly to access my data.

Best regards,

On 29/12/2013 03:22, Cheolsoo Park wrote:
> I haven't done it myself, so I can't give you a detailed answer. But every
> storage function is associated with an InputFormat/OutputFormat as well as
> a RecordReader/RecordWriter.
>
> As for BinStorage, you can take a look at BinStorageRecordReader:
> https://github.com/apache/pig/blob/trunk/src/org/apache/pig/impl/io/BinStorageRecordReader.java#L40

Re: How to read a file generated by Pig+BinStorage using the HDFS API ?

Posted by Vincent Barat <vi...@gmail.com>.

Thanks for your help. I succeeded in reading my data. Here is the code:

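     // Written against the Hadoop 1.x API. On Hadoop 2.x TaskAttemptContext is
     // an interface, so TaskAttemptContextImpl would be needed instead.
     Configuration conf = new Configuration();
     // Assumed: fileSystem is the default file system for this configuration
     FileSystem fileSystem = FileSystem.get(conf);
     Path path = new Path("/mydata");
     FileStatus fileStatus = fileSystem.getFileStatus(path);

     // One split covering the whole file, with no host hints
     BinStorageRecordReader recordReader = new BinStorageRecordReader();
     recordReader.initialize(
       new FileSplit(path, 0, fileStatus.getLen(), null),
       new TaskAttemptContext(conf, new TaskAttemptID()));

     while (recordReader.nextKeyValue())
     {
       Tuple tuple = recordReader.getCurrentValue();
        ...
     }
     recordReader.close();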
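
The snippet above reads a single file. Since Pig output is normally a directory of part files (part-r-XXXXX), a client would probably have to pick those files out of the directory listing and read them one by one. Here is a minimal sketch of that selection step only, in plain Java so it runs without a cluster. The class and method names (PartFiles, select) are hypothetical and not part of Pig or Hadoop, and in a real client the names would presumably come from FileSystem.listStatus() rather than a hard-coded list:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of Pig or Hadoop.
final class PartFiles {
    // Assumption: outputs are named part-m-NNNNN (map-only job)
    // or part-r-NNNNN (job with a reduce phase).
    private static final Pattern PART = Pattern.compile("part-[mr]-\\d+");

    // Keeps only the part files and returns them in name order,
    // skipping other entries such as the _SUCCESS marker.
    static List<String> select(List<String> names) {
        List<String> parts = new ArrayList<>();
        for (String name : names) {
            if (PART.matcher(name).matches()) {
                parts.add(name);
            }
        }
        Collections.sort(parts);
        return parts;
    }

    public static void main(String[] args) {
        // Stand-in for a directory listing of the Pig output path
        List<String> listing =
            Arrays.asList("_SUCCESS", "part-r-00001", "_logs", "part-r-00000");
        System.out.println(select(listing)); // prints [part-r-00000, part-r-00001]
    }
}
```

Each selected name could then be resolved against the output directory and passed to the record reader in turn, with one FileSplit per part file.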

Best regards,

Re: How to read a file generated by Pig+BinStorage using the HDFS API ?

Posted by Cheolsoo Park <pi...@gmail.com>.

I haven't done it myself, so I can't give you a detailed answer. But every
storage function is associated with an InputFormat/OutputFormat as well as
a RecordReader/RecordWriter.

As for BinStorage, you can take a look at BinStorageRecordReader:
https://github.com/apache/pig/blob/trunk/src/org/apache/pig/impl/io/BinStorageRecordReader.java#L40

