Posted to common-user@hadoop.apache.org by xa...@orange-ftgroup.com on 2009/10/19 19:58:15 UTC

How to catch IO exceptions using Python

Hi Everybody,

I'm working on a project where I have to read a large set of compressed
files (gz). I'm using Python and streaming to achieve my goals. However,
I have a problem: there are corrupt compressed files that are killing my
map/reduce jobs.
My environment is the following:
Hadoop-0.18.3 (CDH1) 
 

Do you have any recommendations on how to handle this case?
How can I catch that exception using Python so that my jobs don't fail?
How can I identify these files using Python and move them to a
corrupt-file folder?
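For that last part, something along these lines is what I have in mind,
assuming for the moment that the files sit on the local filesystem (the
folder names are made up):

    import gzip
    import os
    import shutil
    import zlib

    SRC_DIR = "incoming"      # hypothetical local input folder
    CORRUPT_DIR = "corrupt"   # hypothetical quarantine folder

    def is_valid_gzip(path, chunk_size=1024 * 1024):
        # The only reliable check is to decompress the whole file;
        # corruption surfaces as IOError, EOFError, or zlib.error.
        f = gzip.open(path, "rb")
        try:
            while f.read(chunk_size):
                pass
            return True
        except (IOError, EOFError, zlib.error):
            return False
        finally:
            f.close()

    if not os.path.isdir(CORRUPT_DIR):
        os.mkdir(CORRUPT_DIR)
    for name in os.listdir(SRC_DIR):
        if name.endswith(".gz"):
            path = os.path.join(SRC_DIR, name)
            if not is_valid_gzip(path):
                shutil.move(path, os.path.join(CORRUPT_DIR, name))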

I'd really appreciate any recommendations.

Xavier


RE: How to catch IO exceptions using Python

Posted by xa...@orange-ftgroup.com.
Hi Jeff,
Thanks for the suggestion. However, I'm now running a small (2-machine)
cluster with CDH2, against a folder that contains two files, one corrupt
and one not, but I still get the exception and the streaming job is
killed.
That is OK, but I want to know a way to handle this exception
(java.util.zip.ZipException: invalid block type, or any other) using
streaming (Python).
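
As a stopgap I'm considering validating the files up front and moving
the bad ones aside on HDFS, since streaming decompresses the .gz files
on the Java side before my Python script sees any input. A rough sketch
(the HDFS paths are made up, the hadoop binary is assumed to be on the
PATH, and the list of files to check is passed as command-line
arguments, e.g. collected from hadoop fs -ls):

    import subprocess
    import sys
    import zlib

    CORRUPT_DIR = "/user/xavier/corrupt"  # hypothetical HDFS quarantine folder

    def hdfs_gz_is_valid(hdfs_path, chunk_size=1024 * 1024):
        # Stream the file out of HDFS and run it through a zlib
        # decompressor set up for gzip framing (16 + MAX_WBITS);
        # corruption shows up as zlib.error.
        cat = subprocess.Popen(["hadoop", "fs", "-cat", hdfs_path],
                               stdout=subprocess.PIPE)
        decomp = zlib.decompressobj(16 + zlib.MAX_WBITS)
        try:
            while True:
                chunk = cat.stdout.read(chunk_size)
                if not chunk:
                    break
                decomp.decompress(chunk)
            decomp.flush()
            return cat.wait() == 0
        except zlib.error:
            cat.stdout.close()
            cat.wait()
            return False

    # Ignore the return code: -mkdir fails harmlessly if the dir exists.
    subprocess.call(["hadoop", "fs", "-mkdir", CORRUPT_DIR])
    for path in sys.argv[1:]:
        if not hdfs_gz_is_valid(path):
            subprocess.call(["hadoop", "fs", "-mv", path, CORRUPT_DIR])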

I would really appreciate it if you could point me to a way to catch the exception.

Thanks again

Xavier

  

-----Original Message-----
From: Jeff Hammerbacher [mailto:hammer@cloudera.com] 
Sent: Monday, October 19, 2009 11:02 AM
To: common-user@hadoop.apache.org
Subject: Re: How to catch IO exceptions using Python

Hey Xavier,

The functionality you are looking for was added in 0.19 and above:
http://issues.apache.org/jira/browse/HADOOP-3828. If you upgrade your
cluster to CDH2, you should be good to go.

Regards,
Jeff

On Mon, Oct 19, 2009 at 10:58 AM, <xa...@orange-ftgroup.com> wrote:

> Hi Everybody,
>
> I'm working on a project where I have to read a large set of compressed
> files (gz). I'm using Python and streaming to achieve my goals. However,
> I have a problem: there are corrupt compressed files that are killing my
> map/reduce jobs.
> My environment is the following:
> Hadoop-0.18.3 (CDH1)
>
>
> Do you have any recommendations on how to handle this case?
> How can I catch that exception using Python so that my jobs don't fail?
> How can I identify these files using Python and move them to a
> corrupt-file folder?
>
> I'd really appreciate any recommendations.
>
> Xavier
>
>

Re: How to catch IO exceptions using Python

Posted by Jeff Hammerbacher <ha...@cloudera.com>.
Hey Xavier,

The functionality you are looking for was added in 0.19 and above:
http://issues.apache.org/jira/browse/HADOOP-3828. If you upgrade your
cluster to CDH2, you should be good to go.
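
For a streaming job, turning the skipping on might look roughly like the
command below. The property names come from the skip-bad-records
feature, so double-check them against the docs for your exact version;
the jar path, HDFS paths, and mapper.py are placeholders, and whether a
decompression failure can be attributed to specific input records may
depend on the input format:

    hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-*-streaming.jar \
        -D mapred.skip.map.max.skip.records=1 \
        -D mapred.skip.attempts.to.start.skipping=2 \
        -D mapred.map.max.attempts=6 \
        -input /user/xavier/input \
        -output /user/xavier/output \
        -mapper mapper.py \
        -file mapper.py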

Regards,
Jeff

On Mon, Oct 19, 2009 at 10:58 AM, <xa...@orange-ftgroup.com> wrote:

> Hi Everybody,
>
> I'm working on a project where I have to read a large set of compressed
> files (gz). I'm using Python and streaming to achieve my goals. However,
> I have a problem: there are corrupt compressed files that are killing my
> map/reduce jobs.
> My environment is the following:
> Hadoop-0.18.3 (CDH1)
>
>
> Do you have any recommendations on how to handle this case?
> How can I catch that exception using Python so that my jobs don't fail?
> How can I identify these files using Python and move them to a
> corrupt-file folder?
>
> I'd really appreciate any recommendations.
>
> Xavier
>
>