Posted to user@cassandra.apache.org by Alicia Leong <lc...@gmail.com> on 2013/04/16 02:27:42 UTC

Re: Vnodes - HUNDRED of MapReduce jobs

Hi cem <ca...@gmail.com>,

In your previous reply, you mentioned that you have a simple solution.
Can you share it with us? :)

Thanks in advance.


On Sat, Mar 30, 2013 at 2:33 AM, Edward Capriolo <ed...@gmail.com> wrote:

> It should be easy to control the number of map tasks; see
> http://wiki.apache.org/hadoop/HowManyMapsAndReduces. In standard HDFS you
> might run into a directory with 10,000 small files, and you do not want
> 10,000 map tasks. This is what the CombinedInputFormats do: they help you
> control the number of map tasks a job will generate. For example, imagine I
> have a multi-tenant cluster. If a job kicks off 10,000 map tasks, all those
> tasks can starve out other jobs. Being able to say "I only want 4 map tasks
> per C* node regardless of the number of vnodes" would be a meaningful and
> useful feature.
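
The "4 map tasks per node regardless of the number of vnodes" idea can be sketched roughly as follows. This is only an illustration of the grouping step, not Cassandra's or Hadoop's actual input format code; `combine_splits`, `max_tasks_per_node`, and the `(node, token_range)` tuples are hypothetical names chosen for this example.

```python
from collections import defaultdict

def combine_splits(splits, max_tasks_per_node):
    """Group per-vnode splits so each node gets at most max_tasks_per_node tasks.

    `splits` is a list of (node, token_range) pairs (a hypothetical shape,
    not the real Cassandra split type).
    """
    by_node = defaultdict(list)
    for node, token_range in splits:
        by_node[node].append(token_range)

    combined = []
    for node, ranges in by_node.items():
        buckets = min(max_tasks_per_node, len(ranges))
        # Deal the node's ranges round-robin into `buckets` groups,
        # so each combined task still reads only local data.
        for i in range(buckets):
            combined.append((node, ranges[i::buckets]))
    return combined

# 3 nodes with 256 vnodes each, as in the original question.
splits = [(f"node{n}", (n, v)) for n in range(3) for v in range(256)]
combined = combine_splits(splits, max_tasks_per_node=4)
print(len(splits), "splits ->", len(combined), "map tasks")
```

Every token range is still read exactly once; only the number of tasks the job tracker has to schedule changes.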
>
>
> On Fri, Mar 29, 2013 at 2:17 PM, Edward Capriolo <ed...@gmail.com> wrote:
>
>> Yes, but my point is that with 50 map slots you can only be processing 50
>> at once. So it will take 1000/50 "waves" of mappers to complete the job.
>>
>>
>> On Fri, Mar 29, 2013 at 11:46 AM, Jonathan Ellis <jb...@gmail.com> wrote:
>>
>>> My point is that if you have over 16MB of data per node, you're going
>>> to get thousands of map tasks (that is: hundreds per node) with or
>>> without vnodes.
>>>
>>> On Fri, Mar 29, 2013 at 9:42 AM, Edward Capriolo <ed...@gmail.com>
>>> wrote:
>>> > Every map reduce task typically has a minimum Xmx of 256MB memory. See
>>> > mapred.child.java.opts...
>>> > So if you have a 10 node cluster with 256 vnodes each, you will need to
>>> > spawn 2,560 map tasks to complete a job.
>>> > And with a 10 node Hadoop cluster with 5 map slots a node, you have 50
>>> > map slots.
>>> >
>>> > Wouldn't it be better if the input format spawned 10 map tasks instead
>>> > of 2,560?
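
The figures in this message can be checked with a few lines of arithmetic. This is a sketch using only the numbers quoted above; the variable names are made up for illustration, and the "waves" value assumes every slot runs exactly one map task at a time.

```python
import math

nodes = 10
vnodes_per_node = 256
map_slots_per_node = 5

# One map task per vnode, as described in the thread.
map_tasks = nodes * vnodes_per_node
map_slots = nodes * map_slots_per_node

# Number of rounds ("waves") needed when only map_slots tasks run at once.
waves = math.ceil(map_tasks / map_slots)

print(map_tasks, "map tasks,", map_slots, "slots,", waves, "waves")
```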
>>> >
>>> >
>>> > On Fri, Mar 29, 2013 at 10:28 AM, Jonathan Ellis <jb...@gmail.com>
>>> > wrote:
>>> >>
>>> >> I still don't see the hole in the following reasoning:
>>> >>
>>> >> - Input splits are 64k by default.  At this size, map processing time
>>> >> dominates job creation.
>>> >> - Therefore, if job creation time dominates, you have a toy data set
>>> >> (< 64K * 256 vnodes = 16 MB)
>>> >>
>>> >> Adding complexity to our inputformat to improve performance for this
>>> >> niche does not sound like a good idea to me.
>>> >>
>>> >> On Thu, Mar 28, 2013 at 8:40 AM, cem <ca...@gmail.com> wrote:
>>> >> > Hi Alicia ,
>>> >> >
>>> >> > Cassandra's input format creates as many mappers as there are vnodes.
>>> >> > It is a known issue. You need to lower the number of vnodes :(
>>> >> >
>>> >> > I have a simple solution for that and am ready to write a patch.
>>> >> > Should I create a ticket for it? I don't know the procedure.
>>> >> >
>>> >> >  Regards,
>>> >> > Cem
>>> >> >
>>> >> >
>>> >> > On Thu, Mar 28, 2013 at 2:30 PM, Alicia Leong <lc...@gmail.com>
>>> >> > wrote:
>>> >> >>
>>> >> >> Hi All,
>>> >> >>
>>> >> >> I have 3 nodes of Cassandra 1.2.3 and edited cassandra.yaml to enable
>>> >> >> vnodes.
>>> >> >>
>>> >> >> When I execute an M/R job, the console shows HUNDREDS of map tasks.
>>> >> >>
>>> >> >> May I know whether this is normal with vnodes? If so, it has slowed
>>> >> >> down the M/R job.
>>> >> >>
>>> >> >>
>>> >> >> Thanks
>>> >> >
>>> >> >
>>> >>
>>> >>
>>> >>
>>> >> --
>>> >> Jonathan Ellis
>>> >> Project Chair, Apache Cassandra
>>> >> co-founder, http://www.datastax.com
>>> >> @spyced
>>> >
>>> >
>>>
>>>
>>>
>>>
>>
>>
>