Posted to user@pig.apache.org by prasenjit mukherjee <pr...@gmail.com> on 2009/09/09 17:48:05 UTC

Fwd: [Fwd: Re: OutOfMemory Errors when loading a Gzip file]

Message forwarded from Irfan:


---------- Forwarded message ----------
From: Irfan Mohammed <ir...@gmail.com>
Date: Wed, Sep 9, 2009 at 11:26 AM
Subject: [Fwd: Re: OutOfMemory Errors when loading a Gzip file]
To: Prasenjit <pr...@gmail.com>


My mails keep getting bounced off the Pig mailing list. Can you send
it to the list?

-------- Original Message --------
Subject: Re: OutOfMemory Errors when loading a Gzip file
Date: Wed, 09 Sep 2009 11:24:00 -0400
From: Irfan Mohammed <ir...@gmail.com>
To: pig-user@hadoop.apache.org

<<< I have been trying to post this message since yesterday, but it
keeps getting flagged as spam and the mails are bounced >>>

Thanks for the quick response.

Each record is about 512 ASCII characters.
# of records: 1 million+
Multi-megabyte records? No.
I tried decompressing the file, and the OutOfMemory errors still occur.

x1 = LOAD 'file:///mnt/transaction_ar20090907_1102_126.CSV' using
PigStorage('\u0002');
y1 = LIMIT x1 10;
dump y1;

Mridul Muralidharan wrote:

It could depend on the rest of your pipeline before the map-reduce
boundary (or after it).
What exactly is your script?



As an example:

A = load 'file';
B = foreach A generate FLATTEN(myUdf1(*)), FLATTEN(myUdf2(*));

If the myUdf* UDFs generate 'huge' bags, then the foreach above, which
results in a CROSS of the two bags, might cause an OOM (might, since it
is possible to avoid it; I am not sure what Pig does here, it could be
avoiding it!).
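
To put rough numbers on it (the UDF names and bag sizes here are
hypothetical): if myUdf1 returns a bag of about 1,000 tuples and myUdf2
a bag of about 10,000 tuples for a single input record, the cross of the
two bags is 1,000 x 10,000 = 10,000,000 output tuples for that one
record, and that is what would have to be materialized.

A = load 'file';
-- hypothetical sizes: myUdf1 emits ~1,000 tuples per record, myUdf2 ~10,000
B = foreach A generate FLATTEN(myUdf1(*)), FLATTEN(myUdf2(*));
-- each input record expands to 1,000 x 10,000 = 10,000,000 tuples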

A more common reason, though, is after a map-reduce boundary, where the
key distribution is skewed.

A = load 'file';
B = GROUP A by $0;
C = foreach B generate group, myUdf($1);

Here, the $1 bag can become too large, so materializing the bag will
cause an OOM; for example, suppose the group key is "/" and you have 100
million input tuples with $0 as "/".
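
If the script looks like the one above, one way to see whether a single
key dominates (a minimal sketch, reusing the hypothetical 'file' path and
the $0 key, with Pig's built-in COUNT) is to count tuples per key before
running the real UDF:

A = load 'file';
keys = GROUP A by $0;
-- count per key; one very large count marks the key whose bag will not fit in memory
counts = foreach keys generate group, COUNT(A);
ordered = ORDER counts by $1 DESC;
top10 = LIMIT ordered 10;
dump top10;

COUNT is algebraic, so Pig should be able to use the combiner here
instead of building each bag in memory.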



Hope this helps explain why ... this is a pretty vague and general mail,
given I can't use your script as a reference to understand why there is
an OOM :-)

Regards,
Mridul


prasenjit mukherjee wrote:

Thanks Ankur for your quick reply.

In my case the files are unzipped. What possible reasons could there be
for it to fail with OOM errors in that case?

-Prasen

On Wed, Sep 9, 2009 at 1:29 AM, Ankur Goel<ga...@yahoo-inc.com> wrote:

When you ask Pig to use a local file, it actually transfers the file to
DFS so that the file becomes available to the map-reduce processes. Since
gz files are essentially non-splittable, 1 map task will process 1 gz
file, and if that file is large you run the risk of the process running
OOM. It is suggested to use bzip2-compressed files since they are
splittable and Pig automatically takes care of them.
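
As an illustration (a minimal sketch, assuming the same data has been
recompressed into a hypothetical .bz2 file next to the original), the
script from the first mail would then become:

x1 = LOAD 'file:///mnt/transaction_ar20090907_1102_126.CSV.bz2' using
PigStorage('\u0002');
y1 = LIMIT x1 10;
dump y1;

With a bzip2 input the file can be split, so several map tasks each read
a slice instead of one task reading the whole thing.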

Assuming you are using the capacity scheduler, to give more memory to
your map process try:

pig -Dmapred.job.map.memory.mb=2048 -Dmapred.child.java.opts=-Xmx1792m <your-script>

Note that the mapred.job.map.memory.mb limit should be less than or equal
to mapred.cluster.max.map.memory.mb, the cluster-wide max virtual memory
limit for a map process.

-@nkur

----- Original Message -----
From: "prasenjit mukherjee" <pr...@gmail.com>
To: pig-user@hadoop.apache.org
Sent: Tuesday, September 8, 2009 11:29:16 PM GMT +05:30 Chennai,
Kolkata, Mumbai, New Delhi
Subject: Re: OutOfMemory Errors when loading a Gzip file

I am also trying to grapple with a similar class of problems. In my
case the files are unzipped (and hence, I assume, splittable on record
boundaries). The record size is pretty small, though the total number of
records could be in the 100s of millions.

I would like to know how Pig splits files for LOAD/STORE operations
specifically. Does the fact that Pig is instructed to use a local file
(LOAD file:///....) make any difference?

-Thanks,
Prasen

On Tue, Sep 8, 2009 at 10:58 AM, Irfan Mohammed<ir...@gmail.com> wrote:

Hi,
I am trying to load a large gzip file and process it using Pig. Every
time I run the following script, I get OutOfMemory errors.

The hadoop-site.xml is attached. The pig and the hadoop jobtracker logs are
attached as well.

$ pig
x1 = LOAD 'file:///mnt/transaction_ar20090907_1102_126.CSV.gz' using
PigStorage('\u0002');
y1 = LIMIT x1 10;
dump y1;

Environment:
hadoop-0.20.0
pig-0.3.0 [ patched with Pig-660-4 to work with hadoop-0.20.0 ]
ec2 [ c1.medium ]

Thanks,
Irfan





---------- Forwarded message ----------
From: Mail Delivery Subsystem <ma...@googlemail.com>
To: irfan.ma@gmail.com
Date: Wed, 09 Sep 2009 08:24:32 -0700 (PDT)
Subject: Delivery Status Notification (Failure)
This is an automatically generated Delivery Status Notification

Delivery to the following recipient failed permanently:

    pig-user@hadoop.apache.org

Technical details of permanent failure:
Google tried to deliver your message, but it was rejected by the
recipient domain. We recommend contacting the other email provider for
further information about the cause of this error. The error that the
other server returned was: 552 552 spam score (5.7) exceeded threshold
(state 18).

  ----- Original message -----

  ----- Message truncated -----