Posted to user@pig.apache.org by Vincent Barat <vi...@ubikod.com> on 2009/10/28 17:20:28 UTC

How can I read stored files using the PIG API in Java?

Hello,

I'm using PIG from Java and I store my results using the regular call:

     pigServer.store(pigAlias, outputFilePath);

Now, I need to read the file produced (in order to store it in a 
MySQL table).

The problem is that PIG (when used in map/reduce mode) creates a 
directory plus a set of part files for each stored "file".

I cannot figure out how to read this output: should I concatenate 
all the part files? Is there a PIG API that hides this complexity?

Thanks for your help, as it is a blocking issue for me.

Regards,



Re: How can I read stored files using the PIG API in Java?

Posted by Vincent Barat <vi...@ubikod.com>.
I guess it is actually exactly what I was looking for 
(since I have a handle to the alias).

Thanks a lot :)

Benjamin Reed wrote:
> Dump is just pigServer.openIterator(alias). The problem is that you must 
> have a handle to the alias, so if you are reading from another program, 
> you would need to do a load and then open an iterator on that alias, 
> which will probably run a map/reduce job.
> 
> ben
> 
> Vincent Barat wrote:
>> Thank you for your help.
>>
>> I've implemented this method by scanning all part-* files found in the 
>> directory. It is far from elegant, but at least the code is 
>> written :)
>>
>> Dump cannot be called from Java, AFAIK.
>> I will definitely have a look at Zebra's contrib.
>>
>> zaki rahaman wrote:
>>  
>>> From my understanding, the part-0000 files correspond to each of the 
>>> final
>>> reduce tasks in a M/R job (whether you're running it from Pig or 
>>> directly in
>>> Hadoop). The easiest solution is to just cat the part files in the 
>>> created
>>> directory as you suggested. I'm not sure if there's some other method 
>>> in the
>>> API to directly read output. I suppose you could call dump and read 
>>> it in
>>> that way, but that seems even less elegant. Alternatively, if you're 
>>> looking
>>> to store into table output, take a look at the zebra contrib, although I
>>> myself am pretty clueless as to the details.
>>>
>>> On Wed, Oct 28, 2009 at 12:20 PM, Vincent Barat 
>>> <vi...@ubikod.com> wrote:
>>>
>>>    
>>>> Hello,
>>>>
>>>> I'm using PIG from Java and I store my results using the regular call:
>>>>
>>>>    pigServer.store(pigAlias, outputFilePath);
>>>>
>>>> Now, I need to read the file produced (in order to store it in a MySQL
>>>> table).
>>>>
>>>> The problem is that PIG (when used in map/reduce mode) creates a
>>>> directory plus a set of part files for each stored "file".
>>>>
>>>> I cannot figure out how to read this output: should I concatenate all
>>>> the part files? Is there a PIG API that hides this complexity?
>>>>
>>>> Thanks for your help, as it is a blocking issue for me.
>>>>
>>>> Regards,
>>>>
>>>>
>>>>
>>>>       
>>>     
> 
> 
> 

Re: How can I read stored files using the PIG API in Java?

Posted by Benjamin Reed <br...@yahoo-inc.com>.
Dump is just pigServer.openIterator(alias). The problem is that you must 
have a handle to the alias, so if you are reading from another program, 
you would need to do a load and then open an iterator on that alias, 
which will probably run a map/reduce job.

ben
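
For reference, a minimal sketch of this approach, assuming the output was 
written under '/tmp/pig-output' by PigStorage with its default tab 
delimiter (both the path and the alias name below are placeholders, not 
details from the thread):

    import java.util.Iterator;
    import org.apache.pig.ExecType;
    import org.apache.pig.PigServer;
    import org.apache.pig.data.Tuple;

    public class ReadPigOutput {
        public static void main(String[] args) throws Exception {
            // Connect in map/reduce mode, like the original job.
            PigServer pigServer = new PigServer(ExecType.MAPREDUCE);

            // Re-load the stored output; Pig reads all the part files
            // in the directory transparently.
            pigServer.registerQuery(
                "result = LOAD '/tmp/pig-output' USING PigStorage();");

            // openIterator() is what DUMP uses under the hood; this
            // will probably launch a map/reduce job.
            Iterator<Tuple> it = pigServer.openIterator("result");
            while (it.hasNext()) {
                Tuple t = it.next();
                System.out.println(t); // e.g. write each tuple to MySQL here
            }
        }
    }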

Vincent Barat wrote:
> Thank you for your help.
>
> I've implemented this method by scanning all part-* files found in 
> the directory. It is far from elegant, but at least the code 
> is written :)
>
> Dump cannot be called from Java, AFAIK.
> I will definitely have a look at Zebra's contrib.
>
> zaki rahaman wrote:
>   
>> From my understanding, the part-0000 files correspond to each of the final
>> reduce tasks in a M/R job (whether you're running it from Pig or directly in
>> Hadoop). The easiest solution is to just cat the part files in the created
>> directory as you suggested. I'm not sure if there's some other method in the
>> API to directly read output. I suppose you could call dump and read it in
>> that way, but that seems even less elegant. Alternatively, if you're looking
>> to store into table output, take a look at the zebra contrib, although I
>> myself am pretty clueless as to the details.
>>
>> On Wed, Oct 28, 2009 at 12:20 PM, Vincent Barat <vi...@ubikod.com> wrote:
>>
>>     
>>> Hello,
>>>
>>> I'm using PIG from Java and I store my results using the regular call:
>>>
>>>    pigServer.store(pigAlias, outputFilePath);
>>>
>>> Now, I need to read the file produced (in order to store it in a MySQL
>>> table).
>>>
>>> The problem is that PIG (when used in map/reduce mode) creates a
>>> directory plus a set of part files for each stored "file".
>>>
>>> I cannot figure out how to read this output: should I concatenate all
>>> the part files? Is there a PIG API that hides this complexity?
>>>
>>> Thanks for your help, as it is a blocking issue for me.
>>>
>>> Regards,
>>>
>>>
>>>
>>>       
>>     


Re: How can I read stored files using the PIG API in Java?

Posted by Vincent Barat <vi...@ubikod.com>.
Thank you for your help.

I've implemented this method by scanning all part-* files found in 
the directory. It is far from elegant, but at least the code 
is written :)

Dump cannot be called from Java, AFAIK.
I will definitely have a look at Zebra's contrib.
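
A rough sketch of this part-* scanning approach, using the Hadoop 
FileSystem API and assuming plain-text output from PigStorage under a 
placeholder path ('/tmp/pig-output'); the MySQL insertion is left out:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class PartFileScanner {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());

            // Match every part file in the output directory.
            FileStatus[] parts =
                fs.globStatus(new Path("/tmp/pig-output/part-*"));
            for (FileStatus part : parts) {
                BufferedReader reader = new BufferedReader(
                    new InputStreamReader(fs.open(part.getPath())));
                try {
                    String line;
                    while ((line = reader.readLine()) != null) {
                        // One tuple per line; PigStorage's default
                        // field delimiter is a tab.
                        String[] fields = line.split("\t");
                        // ... insert fields into the MySQL table here ...
                    }
                } finally {
                    reader.close();
                }
            }
        }
    }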

zaki rahaman wrote:
> From my understanding, the part-0000 files correspond to each of the final
> reduce tasks in a M/R job (whether you're running it from Pig or directly in
> Hadoop). The easiest solution is to just cat the part files in the created
> directory as you suggested. I'm not sure if there's some other method in the
> API to directly read output. I suppose you could call dump and read it in
> that way, but that seems even less elegant. Alternatively, if you're looking
> to store into table output, take a look at the zebra contrib, although I
> myself am pretty clueless as to the details.
> 
> On Wed, Oct 28, 2009 at 12:20 PM, Vincent Barat <vi...@ubikod.com> wrote:
> 
>> Hello,
>>
>> I'm using PIG from Java and I store my results using the regular call:
>>
>>    pigServer.store(pigAlias, outputFilePath);
>>
>> Now, I need to read the file produced (in order to store it in a MySQL
>> table).
>>
>> The problem is that PIG (when used in map/reduce mode) creates a
>> directory plus a set of part files for each stored "file".
>>
>> I cannot figure out how to read this output: should I concatenate all
>> the part files? Is there a PIG API that hides this complexity?
>>
>> Thanks for your help, as it is a blocking issue for me.
>>
>> Regards,
>>
>>
>>
> 
> 

Re: How can I read stored files using the PIG API in Java?

Posted by zaki rahaman <za...@gmail.com>.
From my understanding, the part-0000 files correspond to each of the final
reduce tasks in a M/R job (whether you're running it from Pig or directly in
Hadoop). The easiest solution is to just cat the part files in the created
directory as you suggested. I'm not sure if there's some other method in the
API to directly read output. I suppose you could call dump and read it in
that way, but that seems even less elegant. Alternatively, if you're looking
to store into table output, take a look at the zebra contrib, although I
myself am pretty clueless as to the details.
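
If you want to do the "cat" from Java rather than the shell, one option is 
Hadoop's FileUtil.copyMerge, the programmatic counterpart of 
"hadoop fs -getmerge". A sketch, with placeholder paths:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.FileUtil;
    import org.apache.hadoop.fs.Path;

    public class MergePartFiles {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            // Concatenate every file in the output directory into a
            // single file that can then be fed to MySQL.
            FileUtil.copyMerge(fs, new Path("/tmp/pig-output"),
                               fs, new Path("/tmp/pig-output-merged"),
                               false, // keep the source directory
                               conf, null);
        }
    }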

On Wed, Oct 28, 2009 at 12:20 PM, Vincent Barat <vi...@ubikod.com> wrote:

> Hello,
>
> I'm using PIG from Java and I store my results using the regular call:
>
>    pigServer.store(pigAlias, outputFilePath);
>
> Now, I need to read the file produced (in order to store it in a MySQL
> table).
>
> The problem is that PIG (when used in map/reduce mode) creates a
> directory plus a set of part files for each stored "file".
>
> I cannot figure out how to read this output: should I concatenate all
> the part files? Is there a PIG API that hides this complexity?
>
> Thanks for your help, as it is a blocking issue for me.
>
> Regards,
>
>
>


-- 
Zaki Rahaman