Posted to user@pig.apache.org by Mix Nin <pi...@gmail.com> on 2013/05/24 21:11:40 UTC

Single Output file from STORE command

The STORE command produces multiple output files. I want a single output
file, so I tried the command below:

STORE (foreach (group NoNullData all) generate flatten($1))  into 'xxxx';

This command produces a single file, but it also forces everything through a
single reducer, which kills performance.

How do I get around this?

Normally the STORE command produces multiple output files. Apart from those,
I see another file, "_SUCCESS", in the output directory, and I am also
generating a metadata file there (using PigStorage('\t', '-schema')).
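For reference, the output directory then looks roughly like this (a sketch;
the exact part-file names vary by Hadoop version and job shape):

  xxxx/part-r-00000   <- data
  xxxx/part-r-00001   <- data
  xxxx/_SUCCESS       <- empty job-completion marker
  xxxx/.pig_schema    <- metadata written by the '-schema' option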

I thought of using getmerge, as follows:

hadoop fs -getmerge <dir_of_input_files> <local file>

But this requires a round trip (sketched below):
1) eliminating the non-data files from the HDFS directory first
2) merging into a single file in the local directory, because getmerge cannot
write back to HDFS directly
3) moving the merged file from the local directory back to HDFS, which may
take additional time depending on its size
4) putting back the files eliminated in step 1
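Something along these lines (a sketch only; the side directory and the
"xxxx_merged" destination are made-up names):

  # step 1: move the non-data files aside
  hadoop fs -mkdir /tmp/aside
  hadoop fs -mv xxxx/_SUCCESS xxxx/.pig_schema /tmp/aside/
  # step 2: merge the part files to local disk
  hadoop fs -getmerge xxxx /tmp/merged.txt
  # step 3: copy the merged file back to HDFS
  hadoop fs -mkdir xxxx_merged
  hadoop fs -put /tmp/merged.txt xxxx_merged/
  # step 4: restore the files moved aside in step 1
  hadoop fs -mv /tmp/aside/_SUCCESS /tmp/aside/.pig_schema xxxx/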


Is there a more efficient way to do this?

Thanks

Re: Single Output file from STORE command

Posted by Aniket Mokashi <an...@gmail.com>.
You can use Pig to do what "hadoop fs -getmerge" does, in a separate Pig
script. It will still be one reducer, though.
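The second, merge-only script could be as small as this (a sketch; 'xxxx' is
the output path from the first script, and 'xxxx_merged' is a made-up
destination):

  -- merge-only pass: GROUP ... ALL funnels every record through a single
  -- reducer, so keep the expensive processing in the first, parallel script
  data   = LOAD 'xxxx' USING PigStorage('\t');
  one    = GROUP data ALL;
  merged = FOREACH one GENERATE FLATTEN($1);
  STORE merged INTO 'xxxx_merged' USING PigStorage('\t');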


On Tue, May 28, 2013 at 8:29 AM, Alan Gates <ga...@hortonworks.com> wrote:

> Nothing that uses MapReduce as an underlying execution engine creates a
> single file when running multiple reducers, because MapReduce itself
> doesn't.  The real question is: if you want to keep the file on Hadoop, why
> worry about whether it's a single file?  Most applications on Hadoop will
> take a directory as input and read all the files contained in it.
>
> Alan.


-- 
"...:::Aniket:::... Quetzalco@tl"

Re: Single Output file from STORE command

Posted by Alan Gates <ga...@hortonworks.com>.
Nothing that uses MapReduce as an underlying execution engine creates a single file when running multiple reducers, because MapReduce itself doesn't.  The real question is: if you want to keep the file on Hadoop, why worry about whether it's a single file?  Most applications on Hadoop will take a directory as input and read all the files contained in it.
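For example, a downstream Pig script can simply point LOAD at the directory
and it will read every part file in it (a sketch, using the placeholder path
'xxxx' from the original post):

  -- reads all part files under xxxx/; files whose names start with
  -- '_' or '.' (e.g. _SUCCESS, .pig_schema) are skipped by default
  data = LOAD 'xxxx' USING PigStorage('\t');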

Alan.
