Posted to common-user@hadoop.apache.org by ArunKumar <ar...@gmail.com> on 2011/12/01 04:18:14 UTC

Availability of Job traces or logs

Hi guys!

Apart from generating job traces with Rumen, can I get logs or job traces
of varied sizes from some organizations?

How can I make sure that Rumen generates only, say, 25 or 50 jobs?


Thanks,
Arun

--
View this message in context: http://lucene.472066.n3.nabble.com/Availability-of-Job-traces-or-logs-tp3550462p3550462.html
Sent from the Hadoop lucene-users mailing list archive at Nabble.com.

Re: Availability of Job traces or logs

Posted by ArunKumar <ar...@gmail.com>.
Praveen,

I was not referring to the Hadoop code. I was referring to Mumak, where I
want to change the preferred locations of the jobs' tasks in the topology
and launch them under different scenarios, as local or non-local map tasks.
Basically, I want to study the behavior of the scheduler when I launch
more local or non-local tasks in a controlled manner.

I am trying to design a better storage- and computation-aware scheduler.


Arun

On Sun, Dec 4, 2011 at 6:13 PM, Praveen Sripati [via Lucene] <
ml-node+s472066n3558972h11@n3.nabble.com> wrote:

> Arun,
>
> >I want to control the split placements.
>
> InputSplits are logical: each one describes a part of the input data, so
> there is nothing to place. InputSplits are calculated on the client by the
> InputFormat class when a job is submitted, and the InputSplit metadata is
> put in HDFS to be fetched later.
>
> Each InputSplit is processed by a map task. The Hadoop framework tries to
> run each task as close as possible to the data of the InputSplit it
> processes, to avoid the overhead of reading over the network.
>
> MAPREDUCE-207 is for moving the calculation of the InputSplits from the
> client to the cluster, but I don't see any progress in it.
>
> BTW, what is the new scheduler about?
>
> Regards,
> Praveen
>

Re: Availability of Job traces or logs

Posted by Praveen Sripati <pr...@gmail.com>.
Arun,

>I want to control the split placements.

InputSplits are logical: each one describes a part of the input data, so
there is nothing to place. InputSplits are calculated on the client by the
InputFormat class when a job is submitted, and the InputSplit metadata is
put in HDFS to be fetched later.

Each InputSplit is processed by a map task. The Hadoop framework tries to
run each task as close as possible to the data of the InputSplit it
processes, to avoid the overhead of reading over the network.
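
To make the "as close as possible" idea concrete, here is a minimal Python
sketch that classifies where a map task runs relative to the preferred
locations of its split. It is an illustration only, not Hadoop's scheduler
code: the function name locality, the (rack, host) pairs, and the three labels
are assumptions made for this example.

```python
def locality(task_host, task_rack, preferred):
    """Classify a task placement against a split's preferred locations.

    preferred is a list of (rack, host) pairs, one per replica of the
    split's data (a hypothetical representation for this sketch).
    """
    # Best case: the task runs on a host that stores the split's data.
    if any(host == task_host for _, host in preferred):
        return "node-local"
    # Next best: the task runs in the same rack as one of the replicas.
    if any(rack == task_rack for rack, _ in preferred):
        return "rack-local"
    # Otherwise the data has to cross racks.
    return "off-rack"


prefs = [("/rack1", "host-a"), ("/rack2", "host-b")]
print(locality("host-a", "/rack1", prefs))
print(locality("host-c", "/rack2", prefs))
print(locality("host-d", "/rack3", prefs))
```

A scheduler experiment that counts how many tasks fall into each class would
give a simple measure of how local or non-local a run was.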

MAPREDUCE-207 is for moving the calculation of the InputSplits from the
client to the cluster, but I don't see any progress in it.

BTW, what is the new scheduler about?

Regards,
Praveen

On Sun, Dec 4, 2011 at 10:19 AM, ArunKumar <ar...@gmail.com> wrote:

> Amar,
>
> I am attempting to write a new scheduler for Hadoop and test it using
> Mumak.
>
> 1> I want to test its behaviour with job traces of different sizes (meaning
> the number of jobs, say 5, 10, 25, 50, 100) and with different numbers of
> nodes. Until now I have used only the test/data shipped with Mumak, which
> has 19 jobs and a 1529-node topology.
> I don't have many nodes on which to run programs, collect logs, and
> use Rumen to generate traces.
>
> 2> I want to control the split placements, so I need to modify the preferred
> locations for task attempts in the trace, but the trace for even 19 jobs is
> huge. So I was wondering whether I can get small, medium, and large job
> traces with the corresponding topology traces, so that modifying them will
> be easier.
>
>
> Arun

Re: Availability of Job traces or logs

Posted by ArunKumar <ar...@gmail.com>.
Amar,

I want to test scheduler behavior with 50, 100, and 1000 jobs, but I have
only the 19 jobs from Mumak's test/data.
How or where can I generate or get such a job trace with the corresponding
topology trace?
You mentioned sleep jobs. Do you mean that if I have 20 jobs, I can run 30
sleep jobs with various sleep times, for a total of 50 jobs?
You also mentioned turning Hadoop security off, using the default
controller, and designing the topology script intelligently.
I didn't understand these and have no idea how to do them.

Can you explain in detail?

Thanks,
Arun



Re: Availability of Job traces or logs

Posted by Amar Kamat <am...@yahoo-inc.com>.
Arun,
> I want to test its behaviour under different size of jobs traces (meaning number of jobs, say 5, 10, 25, 50, 100) under different
> number of nodes.
> Till now I was using only the test/data given by Mumak, which has 19 jobs and a 1529-node topology. I don't have many nodes
> with me to run some programs and collect logs and use Rumen to generate traces.
For varying the number of jobs, you can run sleep jobs with varying numbers of map/reduce tasks and sleep times. For varying the cluster size, you can run multiple task-trackers on the same node. You can start with 5 trackers per node. Since you will be running sleep jobs, this should be OK. Make sure Hadoop security is turned off and the default task controller is used. Design your topology script so that it places all the trackers running on the same node under one rack.
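
As an illustration of the topology-script advice, here is a minimal Python
sketch. It assumes the usual contract for Hadoop topology scripts (the script
receives one or more host names or IP addresses as arguments and prints one
rack path per argument) and it assumes that "one rack per physical node" is
the intended layout. The "/rack-<host>" naming scheme and the function name
rack_for are hypothetical choices for this example.

```python
#!/usr/bin/env python3
"""Hypothetical topology script: every physical host becomes its own rack."""
import re
import sys

# Rack used when no host is given; this matches Hadoop's usual default name.
DEFAULT_RACK = "/default-rack"


def rack_for(host):
    """Return a rack path derived from a host name or IP address."""
    host = host.strip()
    if not host:
        return DEFAULT_RACK
    # Replace characters that would be awkward inside a rack path component.
    safe = re.sub(r"[^A-Za-z0-9.-]", "_", host)
    return "/rack-" + safe


def main(argv):
    # One output line per input argument, in the same order.
    for host in argv[1:]:
        print(rack_for(host))


if __name__ == "__main__":
    main(sys.argv)
```

Because all trackers started on one machine report the same host, they all
resolve to the same rack, while trackers on different machines land in
different racks. In Hadoop releases of that era the script is typically
registered through the topology.script.file.name property; check the
documentation of the release in use.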

> I want to control the split placements, so I need to modify preferred locations for task attempts in the trace, but the trace for
> even 19 jobs is huge. So, I was thinking whether I can get a small, medium and large number of job traces with
> corresponding topology trace so that modifying will be easier.
For this, you need to understand how Rumen handles job logs. I have created MAPREDUCE-3508 for adding filtering capabilities to Rumen. You can use this feature to modify Rumen output and play around with splits. You can also use it to select a few jobs (say 10, 50, etc.) from the input trace.
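
Independently of the filtering feature proposed in MAPREDUCE-3508, the same
kind of edit can be sketched in a few lines of Python on an already parsed
trace. This is a sketch under assumptions, not Rumen's own tooling: it assumes
each job is a dictionary with a "mapTasks" list, and that each task carries a
"preferredLocations" list of {"layers": [rack, host]} entries. Verify these
field names against your own trace file. The function name retarget_maps is
hypothetical.

```python
import copy


def retarget_maps(jobs, limit, new_location):
    """Return copies of the first `limit` jobs with new preferred locations.

    jobs is a list of job dictionaries and new_location is a (rack, host)
    pair. The field names used below are assumptions about the trace layout.
    """
    result = []
    for job in jobs[:limit]:
        job = copy.deepcopy(job)  # leave the input trace untouched
        for task in job.get("mapTasks", []):
            # Point every map task at the single requested location.
            task["preferredLocations"] = [{"layers": list(new_location)}]
        result.append(job)
    return result


trace = [
    {"jobID": "job_1", "mapTasks": [
        {"taskID": "task_1_m_0",
         "preferredLocations": [{"layers": ["/rackA", "host1"]}]}]},
    {"jobID": "job_2", "mapTasks": []},
    {"jobID": "job_3", "mapTasks": []},
]
small = retarget_maps(trace, 2, ("/rackB", "host9"))
print(len(small), small[0]["mapTasks"][0]["preferredLocations"])
```

Working on a small slice first keeps the edited trace easy to inspect before
it is fed to the simulator.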

Amar



Re: Availability of Job traces or logs

Posted by ArunKumar <ar...@gmail.com>.
Amar,

I am attempting to write a new scheduler for Hadoop and test it using Mumak.

1> I want to test its behaviour with job traces of different sizes (meaning
the number of jobs, say 5, 10, 25, 50, 100) and with different numbers of
nodes. Until now I have used only the test/data shipped with Mumak, which
has 19 jobs and a 1529-node topology.
I don't have many nodes on which to run programs, collect logs, and
use Rumen to generate traces.

2> I want to control the split placements, so I need to modify the preferred
locations for task attempts in the trace, but the trace for even 19 jobs is
huge. So I was wondering whether I can get small, medium, and large job
traces with the corresponding topology traces, so that modifying them will
be easier.


Arun


On Sat, Dec 3, 2011 at 1:15 PM, Amar Kamat [via Lucene] <
ml-node+s472066n3556710h89@n3.nabble.com> wrote:

> Arun,
> You can very well run synthetic workloads like large scale sort, wordcount
> etc or more realistic workloads like PigMix (
> https://cwiki.apache.org/confluence/display/PIG/PigMix). On a decent
> enough cluster, these workloads work pretty well. Is there a specific
> reason why you want traces of varied sizes from various organizations?
>
> > How can i make sure that the rumen generates only say 25 jobs,50 jobs or so
> Do you want to get 25/50 jobs based on some filtering criterion? I
> recently faced a similar situation where I wanted to extract jobs from a
> Rumen trace based on job ids. I will be happy to share these filtering
> tools.
>
> Amar

Re: Availability of Job traces or logs

Posted by Amar Kamat <am...@yahoo-inc.com>.
Arun,
You can very well run synthetic workloads like large-scale sort, wordcount, etc., or more realistic workloads like PigMix (https://cwiki.apache.org/confluence/display/PIG/PigMix). On a decent enough cluster, these workloads work pretty well. Is there a specific reason why you want traces of varied sizes from various organizations?

> How can i make sure that the rumen generates only say 25 jobs,50 jobs or so
Do you want to get 25/50 jobs based on some filtering criterion? I recently faced a similar situation where I wanted to extract jobs from a Rumen trace based on job ids. I will be happy to share these filtering tools.
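
As a rough illustration of extracting jobs by ID (this is not the filtering
tool mentioned above), the following Python sketch reads a trace and keeps
only the wanted jobs. It assumes the trace file is a sequence of JSON objects
written one after another, each with a "jobID" field. Both the layout and the
field name are assumptions to check against a real trace, and the function
names iter_jobs and select_jobs are hypothetical.

```python
import json


def iter_jobs(text):
    """Yield each top-level JSON object from a string of concatenated objects."""
    decoder = json.JSONDecoder()
    pos, end = 0, len(text)
    while True:
        # raw_decode does not skip leading whitespace, so skip it here.
        while pos < end and text[pos].isspace():
            pos += 1
        if pos >= end:
            return
        job, pos = decoder.raw_decode(text, pos)
        yield job


def select_jobs(text, wanted_ids):
    """Return the jobs whose "jobID" is in wanted_ids, in trace order."""
    wanted = set(wanted_ids)
    return [job for job in iter_jobs(text) if job.get("jobID") in wanted]


sample = '{"jobID": "job_1"}\n{"jobID": "job_2"}\n{"jobID": "job_3"}\n'
picked = select_jobs(sample, ["job_1", "job_3"])
print([job["jobID"] for job in picked])
```

The selected jobs can then be written back with json.dumps, one object after
another, to produce a smaller trace of, say, 25 or 50 jobs.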

Amar

