Posted to user@pig.apache.org by Vincent Barat <vi...@gmail.com> on 2013/12/26 12:35:58 UTC
How to read a file generated by Pig+BinStorage using the HDFS API ?
Hi all and merry Christmas !
I generate a file using a Pig script embedded in a Java process and
store it using BinStorage.
Then, I would like to read this file directly from another Java
client, but without starting a Pig script (i.e. only by using the
Hadoop API and Pig's BinStorage class).
The goal is to achieve some real-time computation by scanning the
file in real time, so I cannot afford to start a Pig script to do the
computation: the overhead of starting the script and getting the
result is too long for my real-time objectives (I need a result
within a few seconds).
Of course, I could use JsonStorage and read my file with a JSON
deserializer, but my guess is that it would be much slower, and also
painful to handle the various part files generated for the output
(part-r-XXXXX).
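[Editorial sketch on the part-file concern: the HDFS FileSystem API can glob
the part files of an output directory in one call, so they need not be handled
by hand. The /mydata path and the PartFileLister class name below are
placeholders, not from the thread.]

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PartFileLister {

    // MapReduce/Pig outputs are named part-r-XXXXX (or part-m-XXXXX for
    // map-only jobs); this predicate matches both.
    static boolean isPartFile(String name) {
        return name.startsWith("part-");
    }

    public static void main(String[] args) throws IOException {
        FileSystem fs = FileSystem.get(new Configuration());
        // Glob every part file under the output directory in one call.
        FileStatus[] parts = fs.globStatus(new Path("/mydata/part-*"));
        for (FileStatus part : parts) {
            System.out.println(part.getPath() + " (" + part.getLen() + " bytes)");
        }
    }
}
```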
Best regards,
Re: How to read a file generated by Pig+BinStorage using the HDFS API ?
Posted by Vincent Barat <vb...@capptain.com>.
Thanks for your answer.
Yes, I guess I will try to use these classes directly to access my data.
Best regards,
On 29/12/2013 03:22, Cheolsoo Park wrote:
> I haven't done it myself, so I can't give you a detailed answer. But every
> storage is associated with Input/outputFormat as well as
> RecordReader/Writer.
>
> As for BinStorage, you can take a look at BinStorageRecordReader-
> https://github.com/apache/pig/blob/trunk/src/org/apache/pig/impl/io/BinStorageRecordReader.java#L40
>
>
> On Thu, Dec 26, 2013 at 3:35 AM, Vincent Barat <vi...@gmail.com> wrote:
>> [...]
--
Vincent BARAT
CTO, Capptain
p. +33 299 656 913
m. +33 615 411 518
e. vbarat@capptain.com
w. http://www.capptain.com/
a. 18 rue Tronchet, 75008 Paris, France
Re: How to read a file generated by Pig+BinStorage using the HDFS API ?
Posted by Vincent Barat <vi...@gmail.com>.
Thanks for your help. I succeeded in reading my data. Here is the code:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.TaskAttemptID;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.pig.data.Tuple;
import org.apache.pig.impl.io.BinStorageRecordReader;

// Read one BinStorage file as a single split spanning the whole file.
Path path = new Path("/mydata");
FileSystem fileSystem = FileSystem.get(new Configuration());
FileStatus fileStatus = fileSystem.getFileStatus(path);
BinStorageRecordReader recordReader = new BinStorageRecordReader();
recordReader.initialize(
    new FileSplit(path, 0, fileStatus.getLen(), null),
    new TaskAttemptContext(new Configuration(), new TaskAttemptID()));
while (recordReader.nextKeyValue()) {
  Tuple tuple = recordReader.getCurrentValue();
  ...
}
Best regards,
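[Editorial note for later Hadoop versions, an assumption about the reader's
environment rather than something stated in the thread: from Hadoop 2.x on,
TaskAttemptContext is an interface, so the concrete TaskAttemptContextImpl
must be constructed instead. The same approach then extends to every part
file of the output. The BinStorageScanner class name and /mydata path are
illustrative.]

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.TaskAttemptID;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl;
import org.apache.pig.data.Tuple;
import org.apache.pig.impl.io.BinStorageRecordReader;

public class BinStorageScanner {

    // Glob pattern covering every reducer output of the Pig job.
    static final String PART_GLOB = "/mydata/part-*";

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        for (FileStatus part : fs.globStatus(new Path(PART_GLOB))) {
            BinStorageRecordReader reader = new BinStorageRecordReader();
            // One split per part file, spanning the whole file.
            reader.initialize(
                new FileSplit(part.getPath(), 0, part.getLen(), null),
                new TaskAttemptContextImpl(conf, new TaskAttemptID()));
            while (reader.nextKeyValue()) {
                Tuple tuple = reader.getCurrentValue();
                // process the tuple here
            }
            reader.close();
        }
    }
}
```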
On 29/12/2013 03:22, Cheolsoo Park wrote:
> [...]
Re: How to read a file generated by Pig+BinStorage using the HDFS API ?
Posted by Cheolsoo Park <pi...@gmail.com>.
I haven't done it myself, so I can't give you a detailed answer. But every
storage is associated with an InputFormat/OutputFormat as well as a
RecordReader/Writer.
As for BinStorage, you can take a look at BinStorageRecordReader:
https://github.com/apache/pig/blob/trunk/src/org/apache/pig/impl/io/BinStorageRecordReader.java#L40
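[Editorial sketch of the pairing described above, assuming only the public
Pig LoadFunc API: every loader advertises the Hadoop InputFormat it reads
with, and that InputFormat is what hands out the RecordReaders.]

```java
import org.apache.hadoop.mapreduce.InputFormat;
import org.apache.pig.builtin.BinStorage;

public class ShowInputFormat {
    public static void main(String[] args) throws Exception {
        // BinStorage is a LoadFunc; getInputFormat() exposes the
        // InputFormat Pig itself uses to read BinStorage files.
        BinStorage loader = new BinStorage();
        InputFormat<?, ?> format = loader.getInputFormat();
        System.out.println(format.getClass().getName());
    }
}
```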
On Thu, Dec 26, 2013 at 3:35 AM, Vincent Barat <vi...@gmail.com> wrote:
> [...]