Posted to user@pig.apache.org by Mark <st...@gmail.com> on 2013/05/01 18:51:25 UTC

Pig write to single file

Thought I understood how to output to a single file, but it doesn't seem to be working. Anything I'm missing here?


-- Dedupe and store

rows   = LOAD '$input';
unique = DISTINCT rows PARALLEL 1;

STORE unique INTO '$output';



Re: Pig write to single file

Posted by Mark <st...@gmail.com>.
What I'm doing is, at the end of each day, deduping and storing all my log files in LZO format in an archive directory. I thought that since LZO is splittable and Hadoop likes larger files, this would be best. Is this not the case?
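
(For reference, a minimal sketch of one way to get LZO-compressed output from a Pig script, assuming the hadoop-lzo codec is installed on the cluster; the mapred.* property names are the pre-YARN ones and are an assumption, not taken from this thread:)

-- assumes hadoop-lzo is deployed on every node; use the newer
-- mapreduce.* property names on YARN-era clusters
SET mapred.output.compress true;
SET mapred.output.compression.codec 'com.hadoop.compression.lzo.LzopCodec';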

And to answer your question, there seem to be two files, each around 800 MB in size.

On May 1, 2013, at 10:17 AM, Mike Sukmanowsky <mi...@parsely.com> wrote:

> How many output files are you getting? You can add SET DEFAULT_PARALLEL 1;
> so you don't have to specify parallelism on each reduce phase.
> 
> In general though, I wouldn't recommend forcing your output into one file
> (parallelism is good). Just write a shell/python/ruby/perl script that
> concatenates the output files after the full job finishes.
> 
> 
> On Wed, May 1, 2013 at 12:51 PM, Mark <st...@gmail.com> wrote:
> 
>> Thought I understood how to output to a single file, but it doesn't seem
>> to be working. Anything I'm missing here?
>> 
>> 
>> -- Dedupe and store
>> 
>> rows   = LOAD '$input';
>> unique = DISTINCT rows PARALLEL 1;
>> 
>> STORE unique INTO '$output';
>> 
>> 
>> 
> 
> 
> -- 
> Mike Sukmanowsky
> 
> Product Lead, http://parse.ly
> 989 Avenue of the Americas, 3rd Floor
> New York, NY  10018
> p: +1 (416) 953-4248
> e: mike@parsely.com


Re: Pig write to single file

Posted by Mike Sukmanowsky <mi...@parsely.com>.
How many output files are you getting? You can add SET DEFAULT_PARALLEL 1;
so you don't have to specify parallelism on each reduce phase.
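
(For example, a minimal sketch based on the script from the original post; DEFAULT_PARALLEL sets the reducer count for every reduce phase in the script:)

SET DEFAULT_PARALLEL 1;

rows   = LOAD '$input';
unique = DISTINCT rows;  -- no per-statement PARALLEL clause needed now

STORE unique INTO '$output';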

In general though, I wouldn't recommend forcing your output into one file
(parallelism is good). Just write a shell/python/ruby/perl script that
concatenates the output files after the full job finishes.
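
(One common way to do that, sketched here assuming a merged copy on the local filesystem is acceptable; the local path is illustrative. Pig can invoke Hadoop's FsShell directly, so this can even live at the end of the Pig script:)

-- run after the STORE finishes; getmerge concatenates every
-- part-* file under $output into a single local file
fs -getmerge $output /tmp/deduped-merged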


On Wed, May 1, 2013 at 12:51 PM, Mark <st...@gmail.com> wrote:

> Thought I understood how to output to a single file, but it doesn't seem
> to be working. Anything I'm missing here?
>
>
> -- Dedupe and store
>
> rows   = LOAD '$input';
> unique = DISTINCT rows PARALLEL 1;
>
> STORE unique INTO '$output';
>
>
>


-- 
Mike Sukmanowsky

Product Lead, http://parse.ly
989 Avenue of the Americas, 3rd Floor
New York, NY  10018
p: +1 (416) 953-4248
e: mike@parsely.com