Posted to user@pig.apache.org by Jameson Li <ho...@gmail.com> on 2011/04/01 09:57:24 UTC

store less files

Hi,

When I run the Pig code below:
a = load '/logs/2011-03-31';
b = filter a by $1=='a' and $2=='b';
store b into '20110331-ab';

It runs an M/R job with thousands of maps, and then creates an output
directory containing the same large number of files.

How can I store fewer files when I use Pig to write to HDFS?


Thanks,
Jameson Li.

Re: store less files

Posted by Mridul Muralidharan <mr...@yahoo-inc.com>.
Using rand() as a group key is, in general, a pretty bad idea in case of
failures.
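
The reason: a failed map or reduce task is re-executed, and rand() returns
different values on the rerun, so records can be routed to different groups
than on the first attempt, duplicating or dropping data. A deterministic key
avoids this. A minimal sketch, assuming $0 is an integer field (the modulus
of 30 is illustrative and sets the number of groups):

a = load '/logs/2011-03-31';
b = filter a by $1 == 'a' and $2 == 'b';
c = group b by ($0 % 30) parallel 30; -- the same record always lands in the same group
d = foreach c generate FLATTEN(b);
store d into '20110331-ab';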


- Mridul

On Saturday 02 April 2011 12:23 AM, Dmitriy Ryaboy wrote:
> Don't order, that's expensive.
> Just group by rand(), specify parallelism on the group by, and store the
> result of "foreach grouped generate FLATTEN(name_of_original_relation);"
>
> On Fri, Apr 1, 2011 at 11:22 AM, Xiaomeng Wan<sh...@gmail.com>  wrote:
>
>> Hi Jameson,
>>
>> Do you mind adding something like this:
>>
>> c = order b by $0 parallel n;
>> store c into '20110331-ab';
>>
>> You can order on anything; it adds a reduce stage and gives you fewer files.
>>
>> Regards,
>> Shawn
>> On Fri, Apr 1, 2011 at 1:57 AM, Jameson Li<ho...@gmail.com>  wrote:
>>> Hi,
>>>
>>> When I run the Pig code below:
>>> a = load '/logs/2011-03-31';
>>> b = filter a by $1=='a' and $2=='b';
>>> store b into '20110331-ab';
>>>
>>> It runs an M/R job with thousands of maps, and then creates an output
>>> directory containing the same large number of files.
>>>
>>> How can I store fewer files when I use Pig to write to HDFS?
>>>
>>>
>>> Thanks,
>>> Jameson Li.
>>>
>>


Re: store less files

Posted by Jameson Li <ho...@gmail.com>.
Thanks, all of you.

I have tested it, and it works well.
Below is the Pig code:
    a = load '/logs/2011-03-31';
    b = filter a by $1=='a' and $2=='b';
    c = group b by RANDOM() parallel 30; /* change the parallel number here;
it sets the number of output files. */
    d = foreach c generate flatten(b);
    store d into 'youroutputdir';

But I still have a doubt: the 'group by RANDOM()' approach adds an extra
step. Is there no way to directly control the number of files stored?
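
A map-side alternative that avoids the extra reduce is to let Pig combine
small input splits, so fewer mappers (and hence fewer output files) are
created. A sketch, assuming a Pig version with split combination support
(0.8 or later); the 1 GB target size is illustrative:

set pig.splitCombination true; -- allow Pig to merge small input splits
set pig.maxCombinedSplitSize 1073741824; -- aim for ~1 GB per combined split/mapper
a = load '/logs/2011-03-31';
b = filter a by $1 == 'a' and $2 == 'b';
store b into '20110331-ab'; -- one part file per (combined) mapper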


2011/4/2 Dmitriy Ryaboy <dv...@gmail.com>

> Don't order, that's expensive.
> Just group by rand(), specify parallelism on the group by, and store the
> result of "foreach grouped generate FLATTEN(name_of_original_relation);"
>
> On Fri, Apr 1, 2011 at 11:22 AM, Xiaomeng Wan <sh...@gmail.com> wrote:
>
> > Hi Jameson,
> >
> > Do you mind adding something like this:
> >
> > c = order b by $0 parallel n;
> > store c into '20110331-ab';
> >
> > You can order on anything; it adds a reduce stage and gives you fewer files.
> >
> > Regards,
> > Shawn
> > On Fri, Apr 1, 2011 at 1:57 AM, Jameson Li <ho...@gmail.com> wrote:
> > > Hi,
> > >
> > > When I run the Pig code below:
> > > a = load '/logs/2011-03-31';
> > > b = filter a by $1=='a' and $2=='b';
> > > store b into '20110331-ab';
> > >
> > > It runs an M/R job with thousands of maps, and then creates an output
> > > directory containing the same large number of files.
> > >
> > > How can I store fewer files when I use Pig to write to HDFS?
> > >
> > >
> > > Thanks,
> > > Jameson Li.
> > >
> >
>

Re: store less files

Posted by Dmitriy Ryaboy <dv...@gmail.com>.
Don't order, that's expensive.
Just group by rand(), specify parallelism on the group by, and store the
result of "foreach grouped generate FLATTEN(name_of_original_relation);"

On Fri, Apr 1, 2011 at 11:22 AM, Xiaomeng Wan <sh...@gmail.com> wrote:

> Hi Jameson,
>
> Do you mind adding something like this:
>
> c = order b by $0 parallel n;
> store c into '20110331-ab';
>
> You can order on anything; it adds a reduce stage and gives you fewer files.
>
> Regards,
> Shawn
> On Fri, Apr 1, 2011 at 1:57 AM, Jameson Li <ho...@gmail.com> wrote:
> > Hi,
> >
> > When I run the Pig code below:
> > a = load '/logs/2011-03-31';
> > b = filter a by $1=='a' and $2=='b';
> > store b into '20110331-ab';
> >
> > It runs an M/R job with thousands of maps, and then creates an output
> > directory containing the same large number of files.
> >
> > How can I store fewer files when I use Pig to write to HDFS?
> >
> >
> > Thanks,
> > Jameson Li.
> >
>

Re: store less files

Posted by Xiaomeng Wan <sh...@gmail.com>.
Hi Jameson,

Do you mind adding something like this:

c = order b by $0 parallel n;
store c into '20110331-ab';

You can order on anything; it adds a reduce stage and gives you fewer files.

Regards,
Shawn
On Fri, Apr 1, 2011 at 1:57 AM, Jameson Li <ho...@gmail.com> wrote:
> Hi,
>
> When I run the Pig code below:
> a = load '/logs/2011-03-31';
> b = filter a by $1=='a' and $2=='b';
> store b into '20110331-ab';
>
> It runs an M/R job with thousands of maps, and then creates an output
> directory containing the same large number of files.
>
> How can I store fewer files when I use Pig to write to HDFS?
>
>
> Thanks,
> Jameson Li.
>

Re: store less files

Posted by Jameson Li <ho...@gmail.com>.
If I have many TB of input and a block size of 128 MB, the job will generate
thousands of mappers and therefore thousands of output files.
Because too many files increase the load on the NameNode and also add I/O
load across the cluster, I need to reduce the number of files stored to HDFS.


2011/4/2 Jameson Lopp <ja...@bronto.com>

> I can't think of a simple way to accomplish that without reducing the
> parallelism of your M/R jobs, which of course would affect the performance
> of your script.
>
> Things I'd take into account:
>        * how much data are you reading / writing with this pig script?
>        * do you really need thousands of mappers / how adversely would your
> M/R performance be affected by reducing parallelism?
>        * why do you need to reduce the number of files stored to HDFS?
> --
> Jameson Lopp
> Software Engineer
> Bronto Software, Inc.
>
>
> On 04/01/2011 03:57 AM, Jameson Li wrote:
>
>> Hi,
>>
>> When I run the Pig code below:
>> a = load '/logs/2011-03-31';
>> b = filter a by $1=='a' and $2=='b';
>> store b into '20110331-ab';
>>
>> It runs an M/R job with thousands of maps, and then creates an output
>> directory containing the same large number of files.
>>
>> How can I store fewer files when I use Pig to write to HDFS?
>>
>>
>> Thanks,
>> Jameson Li.
>>
>>

Re: store less files

Posted by Jameson Lopp <ja...@bronto.com>.
I can't think of a simple way to accomplish that without reducing the parallelism of your M/R jobs, 
which of course would affect the performance of your script.

Things I'd take into account:
	* how much data are you reading / writing with this pig script?
	* do you really need thousands of mappers / how adversely would your M/R performance be affected by 
reducing parallelism?
	* why do you need to reduce the number of files stored to HDFS?
--
Jameson Lopp
Software Engineer
Bronto Software, Inc.

On 04/01/2011 03:57 AM, Jameson Li wrote:
> Hi,
>
> When I run the Pig code below:
> a = load '/logs/2011-03-31';
> b = filter a by $1=='a' and $2=='b';
> store b into '20110331-ab';
>
> It runs an M/R job with thousands of maps, and then creates an output
> directory containing the same large number of files.
>
> How can I store fewer files when I use Pig to write to HDFS?
>
>
> Thanks,
> Jameson Li.
>