Posted to user@pig.apache.org by Keren Ouaknine <ke...@gmail.com> on 2014/07/09 23:02:48 UTC

Configuring Pig to store results all in one file

Hi,

I am aware there are several threads on the topic already :), however the
suggestions out there didn't seem to work on my script.

My output folder contains many parts:
part-m-00000  part-m-00003  part-m-00006  part-m-00009  part-m-00012
part-m-00015  part-m-00018  part-m-00021  part-m-00024  part-m-00027
part-m-00030
part-m-00001  part-m-00004  part-m-00007  part-m-00010  part-m-00013
part-m-00016  part-m-00019  part-m-00022  part-m-00025  part-m-00028
_temporary
part-m-00002  part-m-00005  part-m-00008  part-m-00011  part-m-00014
part-m-00017  part-m-00020  part-m-00023  part-m-00026  part-m-00029

I am reading from one local file and executing in local mode, so I would
expect to get only one part-m-00000 as my output. Any clue why I get more
than one part?

I pasted my script below:
register /home/kereno/pigmix.jar

page_views = load
'/home/kereno/more/pig-0.13.0-RC1/conversion_pig_scripts/page_views' using
org.apache.pig.test.pigmix.udf.PigPerformanceLoader() as (user, action,
timespent, query_term, ip_addr, timestamp, estimated_revenue,
page_info, page_links);

page_views_flattened = foreach page_views generate user, action, timespent,
query_term, ip_addr, timestamp, estimated_revenue,
((map[]) page_info) as page_info, (bag{tuple(map[])})page_links as
page_links;

store page_views_flattened into 'parsed/ADM-format/page_views' using
org.apache.pig.builtin.PigStorage_for_AQL('\t');

Thanks,
Keren



-- 
Keren Ouaknine
www.kereno.com

Re: Configuring Pig to store results all in one file

Posted by Cheolsoo Park <pi...@gmail.com>.
Hi Keren,

The # of output files is determined by the # of tasks that write out output
files. Given your query, Pig will run a map-only job. But even if you run
it on a single local file, multiple tasks (threads) can be launched if the
input file is big and splittable. You can probably enforce a single task by
tuning pig.maxCombinedSplitSize and mapred.max.split.size.
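
As a rough, untested sketch, the two properties could be set at the top of
the script with Pig's set command. The 1 GB value (1073741824 bytes) is only
an assumption for illustration; it would need to be larger than the input
file, and whether a single task results may depend on the Pig and Hadoop
versions in use.

```pig
-- Assumed value: 1 GB in bytes; pick something larger than the input file.
-- Let Pig combine the input splits into one split of up to this size.
set pig.maxCombinedSplitSize 1073741824;
-- Raise the upper bound on the size of a single input split.
set mapred.max.split.size 1073741824;
```

If this works as intended, the map-only job runs as one task and the output
folder holds a single part-m-00000.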

Thanks,
Cheolsoo

