Posted to dev@hama.apache.org by "Edward J. Yoon" <ed...@apache.org> on 2012/12/07 02:27:57 UTC

Task Priorities

Hello all,

I think large data processing capability is more important than fault
tolerance at the moment.

Here are a few of my thoughts.

1. We have to fix the partitioning issue (mostly related to graph processing).

2. In my experience, the Avro RPC implementation is unstable
and requires huge resources. I'd like to simply remove it if nobody
maintains this implementation.

Additionally, we should focus more on memory efficiency and fast
parallel algorithms, rather than disk-based ones.

-- 
Best Regards, Edward J. Yoon
@eddieyoon

Re: Task Priorities

Posted by "Edward J. Yoon" <ed...@apache.org>.
P.S., regarding my earlier point that we should focus more on memory
efficiency and fast parallel algorithms rather than disk-based ones:

What I meant there was parallel algorithms (the examples).

On Sat, Dec 8, 2012 at 9:26 PM, Edward J. Yoon <ed...@apache.org> wrote:
> The 'GraphJobRunner' BSP program has already shown why the disk-queue is
> important. Users can always run into memory issues.
>
> But I'm talking about our tasks' priorities. High-performance computers
> and their parts are cheap and getting cheaper, and I'm sure
> message-passing and in-memory technologies are receiving attention as
> a near-future trend.
>
> In my case, the memory is 40 GB per node. I want to confirm whether
> Hama is a good candidate (ASAP). Hama can't process large data yet, but
> the Hama team is currently working on YARN, FT, and the disk-queue.
>
> On Sat, Dec 8, 2012 at 6:28 PM, Thomas Jungblut
> <th...@gmail.com> wrote:
>> Yes, that's nothing new; my rule of thumb is 10x the input size.
>> Which is bad, but scalability must be addressed on multiple levels.
>> Spilling the graph to disk is just one part, because it consumes at least
>> half of the memory for really sparse graphs.
>> The other is messaging; removing the bundling and the compression will not
>> save you much space.
>> We are writing messages to disk for fault tolerance anyway, so why not
>> write them directly and then bundle/compress on the fly while sending
>> (e.g. in 32 MB chunks)?
>>
>> 2012/12/8 Edward J. Yoon <ed...@apache.org>
>>
>>> A task is created per input split, and input splits are created one per
>>> block of each input file by default. If the block size is 60~200 MB,
>>> 1~3 GB of memory per task is enough.
>>>
>>> Yeah, there's still a queueing/messaging scalability issue, as you
>>> know. However, in my experience, the message bundler and compressor are
>>> mainly responsible for the poor scalability and consume huge amounts of
>>> memory. This is more urgent than the "queue".
>>>
>>> On Sat, Dec 8, 2012 at 2:05 AM, Thomas Jungblut
>>> <th...@gmail.com> wrote:
>>> >>
>>> >>  not disk-based.
>>> >
>>> >
>>> > So how do you want to achieve scalability without that?
>>> > In order to process tasks independently of each other (not in parallel,
>>> > but e.g. in small mini-batches), you have to save the state. RAM is
>>> > limited and can't store huge states (persistently, in case of crashes).
>>> >
>>> > 2012/12/7 Suraj Menon <su...@apache.org>
>>> >
>>> >> On Thu, Dec 6, 2012 at 8:27 PM, Edward J. Yoon <edwardyoon@apache.org
>>> >> >wrote:
>>> >>
>>> >> > I think large data processing capability is more important than fault
>>> >> > tolerance at the moment.
>>> >> >
>>> >>
>>> >> +1
>>> >>
>>>
>>>
>>>
>>> --
>>> Best Regards, Edward J. Yoon
>>> @eddieyoon
>>>
>
>
>
> --
> Best Regards, Edward J. Yoon
> @eddieyoon



-- 
Best Regards, Edward J. Yoon
@eddieyoon

Re: Task Priorities

Posted by "Edward J. Yoon" <ed...@apache.org>.
The 'GraphJobRunner' BSP program has already shown why the disk-queue is
important. Users can always run into memory issues.

But I'm talking about our tasks' priorities. High-performance computers
and their parts are cheap and getting cheaper, and I'm sure
message-passing and in-memory technologies are receiving attention as
a near-future trend.

In my case, the memory is 40 GB per node. I want to confirm whether
Hama is a good candidate (ASAP). Hama can't process large data yet, but
the Hama team is currently working on YARN, FT, and the disk-queue.
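
To illustrate the kind of disk-queue being discussed, here is a minimal
sketch: messages stay on the heap until a size threshold is reached and are
then appended to a local file, to be replayed at the next superstep. Class
and method names are hypothetical; this is not Hama's actual implementation.

  import java.io.BufferedOutputStream;
  import java.io.DataOutputStream;
  import java.io.File;
  import java.io.FileOutputStream;
  import java.io.IOException;
  import java.util.ArrayDeque;
  import java.util.Deque;

  // Hypothetical sketch, not Hama's real queue implementation.
  public class SpillingMessageQueue {
    private final Deque<byte[]> inMemory = new ArrayDeque<byte[]>();
    private final File spillFile;
    private final long thresholdBytes;
    private long inMemoryBytes = 0;
    private DataOutputStream spillOut;

    public SpillingMessageQueue(File spillFile, long thresholdBytes) {
      this.spillFile = spillFile;
      this.thresholdBytes = thresholdBytes;
    }

    // Keep messages in RAM until the threshold, then append them to local disk.
    public void add(byte[] serializedMsg) throws IOException {
      if (inMemoryBytes + serializedMsg.length <= thresholdBytes) {
        inMemory.addLast(serializedMsg);
        inMemoryBytes += serializedMsg.length;
      } else {
        if (spillOut == null) {
          spillOut = new DataOutputStream(
              new BufferedOutputStream(new FileOutputStream(spillFile)));
        }
        spillOut.writeInt(serializedMsg.length);
        spillOut.write(serializedMsg);
      }
    }
  }

Reading it back at the next superstep would drain the in-memory deque first
and then replay the spill file.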

On Sat, Dec 8, 2012 at 6:28 PM, Thomas Jungblut
<th...@gmail.com> wrote:
> Yes, that's nothing new; my rule of thumb is 10x the input size.
> Which is bad, but scalability must be addressed on multiple levels.
> Spilling the graph to disk is just one part, because it consumes at least
> half of the memory for really sparse graphs.
> The other is messaging; removing the bundling and the compression will not
> save you much space.
> We are writing messages to disk for fault tolerance anyway, so why not
> write them directly and then bundle/compress on the fly while sending
> (e.g. in 32 MB chunks)?
>
> 2012/12/8 Edward J. Yoon <ed...@apache.org>
>
>> A task is created per input split, and input splits are created one per
>> block of each input file by default. If the block size is 60~200 MB,
>> 1~3 GB of memory per task is enough.
>>
>> Yeah, there's still a queueing/messaging scalability issue, as you
>> know. However, in my experience, the message bundler and compressor are
>> mainly responsible for the poor scalability and consume huge amounts of
>> memory. This is more urgent than the "queue".
>>
>> On Sat, Dec 8, 2012 at 2:05 AM, Thomas Jungblut
>> <th...@gmail.com> wrote:
>> >>
>> >>  not disk-based.
>> >
>> >
>> > So how do you want to achieve scalability without that?
>> > In order to process tasks independently of each other (not in parallel,
>> > but e.g. in small mini-batches), you have to save the state. RAM is
>> > limited and can't store huge states (persistently, in case of crashes).
>> >
>> > 2012/12/7 Suraj Menon <su...@apache.org>
>> >
>> >> On Thu, Dec 6, 2012 at 8:27 PM, Edward J. Yoon <edwardyoon@apache.org
>> >> >wrote:
>> >>
>> >> > I think large data processing capability is more important than fault
>> >> > tolerance at the moment.
>> >> >
>> >>
>> >> +1
>> >>
>>
>>
>>
>> --
>> Best Regards, Edward J. Yoon
>> @eddieyoon
>>



-- 
Best Regards, Edward J. Yoon
@eddieyoon

Re: Task Priorities

Posted by Thomas Jungblut <th...@gmail.com>.
Yes, that's nothing new; my rule of thumb is 10x the input size.
Which is bad, but scalability must be addressed on multiple levels.
Spilling the graph to disk is just one part, because it consumes at least
half of the memory for really sparse graphs.
The other is messaging; removing the bundling and the compression will not
save you much space.
We are writing messages to disk for fault tolerance anyway, so why not
write them directly and then bundle/compress on the fly while sending
(e.g. in 32 MB chunks)?
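
Purely as a sketch of that idea (hypothetical names, not the actual sender
code): serialize each message straight into a compressed buffer and ship a
chunk as soon as it reaches a fixed size such as 32 MB.

  import java.io.ByteArrayOutputStream;
  import java.io.DataOutputStream;
  import java.io.IOException;
  import java.util.zip.DeflaterOutputStream;

  // Illustrative sketch: bundle/compress messages on the fly in fixed-size chunks.
  public class ChunkedMessageSender {
    private static final int CHUNK_BYTES = 32 * 1024 * 1024; // 32 MB per chunk

    private ByteArrayOutputStream buffer = new ByteArrayOutputStream();
    private DataOutputStream out =
        new DataOutputStream(new DeflaterOutputStream(buffer));

    // Append one serialized message; ship a chunk once the compressed buffer is big enough.
    public void send(byte[] serializedMsg) throws IOException {
      out.writeInt(serializedMsg.length);
      out.write(serializedMsg);
      out.flush();
      if (buffer.size() >= CHUNK_BYTES) {
        flushChunk();
      }
    }

    private void flushChunk() throws IOException {
      out.close(); // finishes the deflate stream
      transferToPeer(buffer.toByteArray()); // placeholder for the real RPC call
      buffer = new ByteArrayOutputStream();
      out = new DataOutputStream(new DeflaterOutputStream(buffer));
    }

    private void transferToPeer(byte[] chunk) {
      // send the compressed chunk to the receiving peer (omitted here)
    }
  }

The point is only that nothing has to be held back for a separate bundling
pass; compression happens while the data streams out.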

2012/12/8 Edward J. Yoon <ed...@apache.org>

> A task is created per input split, and input splits are created one per
> block of each input file by default. If the block size is 60~200 MB,
> 1~3 GB of memory per task is enough.
>
> Yeah, there's still a queueing/messaging scalability issue, as you
> know. However, in my experience, the message bundler and compressor are
> mainly responsible for the poor scalability and consume huge amounts of
> memory. This is more urgent than the "queue".
>
> On Sat, Dec 8, 2012 at 2:05 AM, Thomas Jungblut
> <th...@gmail.com> wrote:
> >>
> >>  not disk-based.
> >
> >
> > So how do you want to achieve scalability without that?
> > In order to process tasks independently of each other (not in parallel,
> > but e.g. in small mini-batches), you have to save the state. RAM is
> > limited and can't store huge states (persistently, in case of crashes).
> >
> > 2012/12/7 Suraj Menon <su...@apache.org>
> >
> >> On Thu, Dec 6, 2012 at 8:27 PM, Edward J. Yoon <edwardyoon@apache.org
> >> >wrote:
> >>
> >> > I think large data processing capability is more important than fault
> >> > tolerance at the moment.
> >> >
> >>
> >> +1
> >>
>
>
>
> --
> Best Regards, Edward J. Yoon
> @eddieyoon
>

Re: Task Priorities

Posted by "Edward J. Yoon" <ed...@apache.org>.
A task is created per input split, and input splits are created one per
block of each input file by default. If the block size is 60~200 MB,
1~3 GB of memory per task is enough.

Yeah, there's still a queueing/messaging scalability issue, as you
know. However, in my experience, the message bundler and compressor are
mainly responsible for the poor scalability and consume huge amounts of
memory. This is more urgent than the "queue".
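
To make the arithmetic concrete with an invented example: a 12 GB input with
a 128 MB block size yields about 96 splits and therefore about 96 tasks; with
the 10x rule of thumb mentioned below, that is roughly 1.3 GB of working
memory per task, which falls inside the 1~3 GB estimate above.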

On Sat, Dec 8, 2012 at 2:05 AM, Thomas Jungblut
<th...@gmail.com> wrote:
>>
>>  not disk-based.
>
>
> So how do you want to achieve scalability without that?
> In order to process tasks independently of each other (not in parallel, but
> e.g. in small mini-batches), you have to save the state. RAM is limited and
> can't store huge states (persistently, in case of crashes).
>
> 2012/12/7 Suraj Menon <su...@apache.org>
>
>> On Thu, Dec 6, 2012 at 8:27 PM, Edward J. Yoon <edwardyoon@apache.org
>> >wrote:
>>
>> > I think large data processing capability is more important than fault
>> > tolerance at the moment.
>> >
>>
>> +1
>>



-- 
Best Regards, Edward J. Yoon
@eddieyoon

Re: Task Priorities

Posted by Thomas Jungblut <th...@gmail.com>.
>
>  not disk-based.


So how do you want to achieve scalability without that?
In order to process tasks independently of each other (not in parallel, but
e.g. in small mini-batches), you have to save the state. RAM is limited and
can't store huge states (persistently, in case of crashes).
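
As a rough sketch of what "saving the state" could look like (hypothetical
names, not an actual Hama API): the per-task state is written to a local file
between mini-batches, so each batch only needs its own state in RAM and the
state survives a crash.

  import java.io.File;
  import java.io.FileInputStream;
  import java.io.FileOutputStream;
  import java.io.IOException;
  import java.io.ObjectInputStream;
  import java.io.ObjectOutputStream;
  import java.io.Serializable;

  // Illustrative only: persist task state so mini-batches can be processed one
  // at a time and recovered after a crash, instead of keeping everything in RAM.
  public class TaskState implements Serializable {
    private static final long serialVersionUID = 1L;

    long superstep;
    byte[] pendingMessages;

    public void save(File file) throws IOException {
      ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream(file));
      try {
        out.writeObject(this);
      } finally {
        out.close();
      }
    }

    public static TaskState load(File file)
        throws IOException, ClassNotFoundException {
      ObjectInputStream in = new ObjectInputStream(new FileInputStream(file));
      try {
        return (TaskState) in.readObject();
      } finally {
        in.close();
      }
    }
  }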

2012/12/7 Suraj Menon <su...@apache.org>

> On Thu, Dec 6, 2012 at 8:27 PM, Edward J. Yoon <edwardyoon@apache.org
> >wrote:
>
> > I think large data processing capability is more important than fault
> > tolerance at the moment.
> >
>
> +1
>

Re: Task Priorities

Posted by Suraj Menon <su...@apache.org>.
On Thu, Dec 6, 2012 at 8:27 PM, Edward J. Yoon <ed...@apache.org> wrote:

> I think large data processing capability is more important than fault
> tolerance at the moment.
>

+1