Posted to user@pig.apache.org by Keren Ouaknine <ke...@gmail.com> on 2014/07/09 23:02:48 UTC
Configuring Pig to store results all in one file
Hi,
I am aware there are several threads on this topic already :), but the
suggestions there didn't seem to work for my script.
My output folder contains many parts:
_temporary
part-m-00000 part-m-00001 part-m-00002 part-m-00003 part-m-00004
part-m-00005 part-m-00006 part-m-00007 part-m-00008 part-m-00009
part-m-00010 part-m-00011 part-m-00012 part-m-00013 part-m-00014
part-m-00015 part-m-00016 part-m-00017 part-m-00018 part-m-00019
part-m-00020 part-m-00021 part-m-00022 part-m-00023 part-m-00024
part-m-00025 part-m-00026 part-m-00027 part-m-00028 part-m-00029
part-m-00030
I am reading from one local file and executing in local mode, so I would
expect to get only one part-m-00000 as my output. Any clue why I get more
than one part?
I pasted my script below:
register /home/kereno/pigmix.jar
page_views = load '/home/kereno/more/pig-0.13.0-RC1/conversion_pig_scripts/page_views'
    using org.apache.pig.test.pigmix.udf.PigPerformanceLoader()
    as (user, action, timespent, query_term, ip_addr, timestamp,
        estimated_revenue, page_info, page_links);
page_views_flattened = foreach page_views generate user, action, timespent,
    query_term, ip_addr, timestamp, estimated_revenue,
    ((map[]) page_info) as page_info,
    (bag{tuple(map[])}) page_links as page_links;
store page_views_flattened into 'parsed/ADM-format/page_views'
    using org.apache.pig.builtin.PigStorage_for_AQL('\t');
Thanks,
Keren
--
Keren Ouaknine
www.kereno.com
Re: Configuring Pig to store results all in one file
Posted by Cheolsoo Park <pi...@gmail.com>.
Hi Keren,
The # of output files is determined by the # of tasks that write out output
files. Given your query, Pig will run a map-only job. But even if you run
it on a single local file, multiple tasks (threads) can be launched if the
input file is big and splittable. You can probably enforce a single task by
tuning pig.maxCombinedSplitSize and mapred.max.split.size.
Thanks,
Cheolsoo
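A minimal sketch of the tuning suggested above, meant to go at the top of the script before the load statement. The 1 GB value is an assumption for illustration only (the size of the input file is not given in this thread); it would need to be larger than the input for the splits to end up in a single task. Whether these two settings alone yield a single part file in local mode has not been verified here.

```pig
-- Illustrative values only: 1073741824 bytes = 1 GB, assumed to exceed the input size.
-- Let Pig combine input splits into one split of up to this many bytes.
set pig.maxCombinedSplitSize 1073741824;
-- Raise the upper bound on the size of an individual input split.
set mapred.max.split.size 1073741824;
```

If more than one part file still appears, the input is probably still being divided into several splits, and the values or the underlying split settings would need another look.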