Posted to dev@hama.apache.org by "Edward J. Yoon" <ed...@apache.org> on 2012/09/27 11:01:19 UTC

Fault tolerant test

Hi,

Today I tested Hama TRUNK on a 1152-core cluster. Everything seems OK
except for the AvroMessageManager and memory issues.

I'm planning to test the new FT (fault tolerance) system in two weeks
(I'll be on vacation next week). So, could you please let me know where
I should concentrate my efforts?

-- 
Best Regards, Edward J. Yoon
@eddieyoon

Re: Fault tolerant test

Posted by "Edward J. Yoon" <ed...@apache.org>.
Interesting. Was it a "Too many open files" error, or something else?

On Thu, Sep 27, 2012 at 6:13 PM, Yuesheng Hu <yu...@gmail.com> wrote:
> TB or hundreds GB data job, and long-time running job.
> I test a 200GB dataset for kmeans this afternoon, every superstep taken
> about 30m(our cluster is small), it will throw "Filesystem closed"
> exception occasionally.
>
> 2012/9/27 Edward J. Yoon <ed...@apache.org>
>
>> Hi,
>>
>> Today I tested Hama TRUNK on 1152 cores cluster, everything seems OK
>> except the AvroMessageManager and Memory issues.
>>
>> I'm planning on testing new FT system 2 weeks later (I'll be vacation
>> Next week). So, could you please let me know where I should
>> concentrate my efforts?
>>
>> --
>> Best Regards, Edward J. Yoon
>> @eddieyoon
>>



-- 
Best Regards, Edward J. Yoon
@eddieyoon

Re: Fault tolerant test

Posted by "Edward J. Yoon" <ed...@apache.org>.
My guess is that this problem will be fixed by the FT system.

I have seen this error many times; it also occurs on MapReduce. It's a
network-related problem if it's not a bug in KMeanBSP.

On Thu, Sep 27, 2012 at 6:26 PM, Yuesheng Hu <yu...@gmail.com> wrote:
> my input dir contain only 45 file. The Error message of LOG is "Error
> running bsp setup and bsp function."  because of the "Filesystem closed"
> exception.



-- 
Best Regards, Edward J. Yoon
@eddieyoon

Re: Fault tolerant test

Posted by Yuesheng Hu <yu...@gmail.com>.
https://issues.apache.org/jira/browse/HAMA-647 is related to HAMA-613.
The goal of this patch is to guarantee that numSplits will not be larger
than numTasks, so the situation in HAMA-613 will not happen.
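
To illustrate the invariant only (numSplits <= numTasks), here is a minimal
sketch. It is not the HAMA-647 patch itself: the class SplitCapSketch and the
method capSplits are hypothetical names, and the round-robin grouping is an
assumption chosen for simplicity.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration, not Hama code: merges input splits into
// groups so that the number of groups never exceeds numTasks.
class SplitCapSketch {

    static <T> List<List<T>> capSplits(List<T> splits, int numTasks) {
        if (numTasks <= 0) {
            throw new IllegalArgumentException("numTasks must be positive");
        }
        // Never create more groups than tasks, or more groups than splits.
        int groups = Math.min(splits.size(), numTasks);
        List<List<T>> result = new ArrayList<>();
        for (int i = 0; i < groups; i++) {
            result.add(new ArrayList<>());
        }
        // Assign splits round-robin; with zero splits this loop is skipped.
        for (int i = 0; i < splits.size(); i++) {
            result.get(i % groups).add(splits.get(i));
        }
        return result;
    }

    public static void main(String[] args) {
        // 45 input files (as in this thread) and an assumed 8 tasks.
        List<Integer> splits = new ArrayList<>();
        for (int i = 0; i < 45; i++) {
            splits.add(i);
        }
        List<List<Integer>> grouped = capSplits(splits, 8);
        System.out.println("groups = " + grouped.size());
    }
}
```

A real implementation would more likely group by data locality or split
size rather than round-robin; the point here is only the cap.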

2012/9/27 Suraj Menon <me...@gmail.com>

> :) Sure. Has anyone gauged the effort needed for
> HAMA-561<https://issues.apache.org/jira/browse/HAMA-561>
>  and HAMA-613 <https://issues.apache.org/jira/browse/HAMA-613> ? Is anyone
> working on it? We have one user probably waiting for HAMA-561 and the other
> one is a blocker.
>
> -Suraj
>
> On Thu, Sep 27, 2012 at 6:11 AM, Edward J. Yoon <edwardyoon@apache.org
> >wrote:
>
> > That's great. if you need some worker, feel free to use me. ;-)
> >
> > On Thu, Sep 27, 2012 at 6:50 PM, Suraj Menon <me...@gmail.com>
> > wrote:
> > > Two weeks should be enough for me to implement a better FT.
> > > I am currently working on spilling buffer and the new Superstep Chain
> > API.
> > > Should be out soon.
> > >
> > > -Suraj
> > >
> > > On Thu, Sep 27, 2012 at 5:26 AM, Yuesheng Hu <yu...@gmail.com>
> > wrote:
> > >
> > >> my input dir contain only 45 file. The Error message of LOG is "Error
> > >> running bsp setup and bsp function."  because of the "Filesystem
> closed"
> > >> exception.
> > >>
> >
> >
> >
> > --
> > Best Regards, Edward J. Yoon
> > @eddieyoon
> >
>

Re: Fault tolerant test

Posted by Suraj Menon <me...@gmail.com>.
:) Sure. Has anyone gauged the effort needed for
HAMA-561 <https://issues.apache.org/jira/browse/HAMA-561>
and HAMA-613 <https://issues.apache.org/jira/browse/HAMA-613>? Is anyone
working on them? We have one user probably waiting for HAMA-561, and the
other one is a blocker.

-Suraj

On Thu, Sep 27, 2012 at 6:11 AM, Edward J. Yoon <ed...@apache.org> wrote:

> That's great. if you need some worker, feel free to use me. ;-)
>
> On Thu, Sep 27, 2012 at 6:50 PM, Suraj Menon <me...@gmail.com>
> wrote:
> > Two weeks should be enough for me to implement a better FT.
> > I am currently working on spilling buffer and the new Superstep Chain
> API.
> > Should be out soon.
> >
> > -Suraj
> >
> > On Thu, Sep 27, 2012 at 5:26 AM, Yuesheng Hu <yu...@gmail.com>
> wrote:
> >
> >> my input dir contain only 45 file. The Error message of LOG is "Error
> >> running bsp setup and bsp function."  because of the "Filesystem closed"
> >> exception.
> >>
>
>
>
> --
> Best Regards, Edward J. Yoon
> @eddieyoon
>

Re: Fault tolerant test

Posted by "Edward J. Yoon" <ed...@apache.org>.
The error seems to have happened during file reading, in the setup() method.

Tasks should be attempted multiple times.
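
As a rough sketch of what "attempting multiple times" could look like, the
helper below retries a piece of work when it fails with an IOException. This
is generic Java, not Hama's task scheduler: RetrySketch, runWithRetries and
maxAttempts are hypothetical names used only for illustration.

```java
import java.io.IOException;
import java.util.concurrent.Callable;

// Hypothetical illustration, not Hama code: re-runs a unit of work
// when it fails with an IOException, up to maxAttempts times.
class RetrySketch {

    static <T> T runWithRetries(Callable<T> attempt, int maxAttempts)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        IOException last = null;
        for (int i = 1; i <= maxAttempts; i++) {
            try {
                return attempt.call();
            } catch (IOException e) {
                // Treated as transient here, e.g. a "Filesystem closed" error.
                last = e;
            }
        }
        // Every attempt failed; surface the most recent failure.
        throw last;
    }

    public static void main(String[] args) throws Exception {
        final int[] calls = {0};
        String result = runWithRetries(() -> {
            calls[0]++;
            if (calls[0] < 3) {
                throw new IOException("Filesystem closed");
            }
            return "setup done";
        }, 5);
        System.out.println(result + " after " + calls[0] + " attempts");
    }
}
```

Retrying only helps if the failure is transient. If the cause is a bug
(for example, a shared file system handle closed too early), every attempt
would fail the same way.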

On Thu, Sep 27, 2012 at 7:29 PM, Suraj Menon <me...@gmail.com> wrote:
>
> Hi Yuesheng,   Just curious, did you check the HDFS status when you saw
> this error? We currently use HDFS to checkpoint. We are working on getting
> more selective on what to put on HDFS.
>
>
> On Thu, Sep 27, 2012 at 6:16 AM, Yuesheng Hu <yu...@gmail.com> wrote:
>>
>> It could not be better if the FT can resolve this problem.
>>
>> 2012/9/27 Edward J. Yoon <ed...@apache.org>
>>>
>>> That's great. if you need some worker, feel free to use me. ;-)
>>>
>>>
>>> On Thu, Sep 27, 2012 at 6:50 PM, Suraj Menon <me...@gmail.com>
>>> wrote:
>>> > Two weeks should be enough for me to implement a better FT.
>>> > I am currently working on spilling buffer and the new Superstep Chain
>>> > API.
>>> > Should be out soon.
>>> >
>>> > -Suraj
>>> >
>>> > On Thu, Sep 27, 2012 at 5:26 AM, Yuesheng Hu <yu...@gmail.com>
>>> > wrote:
>>> >
>>> >> my input dir contain only 45 file. The Error message of LOG is "Error
>>> >> running bsp setup and bsp function."  because of the "Filesystem
>>> >> closed"
>>> >> exception.
>>> >>
>>>
>>>
>>>
>>> --
>>> Best Regards, Edward J. Yoon
>>> @eddieyoon
>>
>>
>



--
Best Regards, Edward J. Yoon
@eddieyoon

Re: Fault tolerant test

Posted by Suraj Menon <me...@gmail.com>.
Hi Yuesheng,

Just curious, did you check the HDFS status when you saw this error? We
currently use HDFS for checkpointing. We are working on being more
selective about what we put on HDFS.

On Thu, Sep 27, 2012 at 6:16 AM, Yuesheng Hu <yu...@gmail.com> wrote:

> It could not be better if the FT can resolve this problem.
>
> 2012/9/27 Edward J. Yoon <ed...@apache.org>
>
>> That's great. if you need some worker, feel free to use me. ;-)
>>
>>
>> On Thu, Sep 27, 2012 at 6:50 PM, Suraj Menon <me...@gmail.com>
>> wrote:
>> > Two weeks should be enough for me to implement a better FT.
>> > I am currently working on spilling buffer and the new Superstep Chain
>> API.
>> > Should be out soon.
>> >
>> > -Suraj
>> >
>> > On Thu, Sep 27, 2012 at 5:26 AM, Yuesheng Hu <yu...@gmail.com>
>> wrote:
>> >
>> >> my input dir contain only 45 file. The Error message of LOG is "Error
>> >> running bsp setup and bsp function."  because of the "Filesystem
>> closed"
>> >> exception.
>> >>
>>
>>
>>
>> --
>> Best Regards, Edward J. Yoon
>> @eddieyoon
>>
>
>

Re: Fault tolerant test

Posted by Yuesheng Hu <yu...@gmail.com>.
It would be great if the FT system can resolve this problem.

2012/9/27 Edward J. Yoon <ed...@apache.org>

> That's great. if you need some worker, feel free to use me. ;-)
>
> On Thu, Sep 27, 2012 at 6:50 PM, Suraj Menon <me...@gmail.com>
> wrote:
> > Two weeks should be enough for me to implement a better FT.
> > I am currently working on spilling buffer and the new Superstep Chain
> API.
> > Should be out soon.
> >
> > -Suraj
> >
> > On Thu, Sep 27, 2012 at 5:26 AM, Yuesheng Hu <yu...@gmail.com>
> wrote:
> >
> >> my input dir contain only 45 file. The Error message of LOG is "Error
> >> running bsp setup and bsp function."  because of the "Filesystem closed"
> >> exception.
> >>
>
>
>
> --
> Best Regards, Edward J. Yoon
> @eddieyoon
>

Re: Fault tolerant test

Posted by "Edward J. Yoon" <ed...@apache.org>.
That's great. If you need an extra worker, feel free to use me. ;-)

On Thu, Sep 27, 2012 at 6:50 PM, Suraj Menon <me...@gmail.com> wrote:
> Two weeks should be enough for me to implement a better FT.
> I am currently working on spilling buffer and the new Superstep Chain API.
> Should be out soon.
>
> -Suraj
>
> On Thu, Sep 27, 2012 at 5:26 AM, Yuesheng Hu <yu...@gmail.com> wrote:
>
>> my input dir contain only 45 file. The Error message of LOG is "Error
>> running bsp setup and bsp function."  because of the "Filesystem closed"
>> exception.
>>



-- 
Best Regards, Edward J. Yoon
@eddieyoon

Re: Fault tolerant test

Posted by Suraj Menon <me...@gmail.com>.
Two weeks should be enough for me to implement a better FT.
I am currently working on the spilling buffer and the new Superstep Chain
API, which should be out soon.

-Suraj

On Thu, Sep 27, 2012 at 5:26 AM, Yuesheng Hu <yu...@gmail.com> wrote:

> my input dir contain only 45 file. The Error message of LOG is "Error
> running bsp setup and bsp function."  because of the "Filesystem closed"
> exception.
>

Re: Fault tolerant test

Posted by Yuesheng Hu <yu...@gmail.com>.
My input dir contains only 45 files. The error message in the log is "Error
running bsp setup and bsp function.", caused by the "Filesystem closed"
exception.

Re: Fault tolerant test

Posted by Yuesheng Hu <yu...@gmail.com>.
Jobs with TBs or hundreds of GBs of data, and long-running jobs.
I tested a 200GB dataset with k-means this afternoon. Every superstep took
about 30 minutes (our cluster is small), and it occasionally throws a
"Filesystem closed" exception.

2012/9/27 Edward J. Yoon <ed...@apache.org>

> Hi,
>
> Today I tested Hama TRUNK on 1152 cores cluster, everything seems OK
> except the AvroMessageManager and Memory issues.
>
> I'm planning on testing new FT system 2 weeks later (I'll be vacation
> Next week). So, could you please let me know where I should
> concentrate my efforts?
>
> --
> Best Regards, Edward J. Yoon
> @eddieyoon
>