Posted to user@mesos.apache.org by Vinod Kone <vi...@gmail.com> on 2014/07/03 22:40:49 UTC

0.19.1

Hi,

We are planning to release 0.19.1 (likely next week) which will be a bug
fix release. Specifically, these are the fixes that we are planning to
cherry pick.

https://issues.apache.org/jira/issues/?filter=12326191&jql=project%20%3D%20MESOS%20AND%20%22Target%20Version%2Fs%22%20%3D%200.19.1

If there are other critical fixes that need to be backported to 0.19.1
please reply here as soon as possible.

Thanks,

Re: 0.19.1

Posted by Benjamin Mahler <be...@gmail.com>.
Will also pull in MESOS-1538
<https://issues.apache.org/jira/browse/MESOS-1538> for this.


On Wed, Jul 9, 2014 at 11:12 AM, Benjamin Mahler <be...@gmail.com>
wrote:

> I've added it to the 0.19.1 list since it's trivial and helps those using
> S3.
>
> On Fri, Jul 4, 2014 at 12:52 PM, Tom Arnfeld <to...@duedil.com> wrote:
>
>> Happy to. It surprised me that this wasn't supported, especially
>> considering the fetcher is supposed to be able to download URIs from any
>> URL using http(s). This is most useful (and in my opinion quite an
>> important issue) for downloading executors from S3 in situations a redirect
>> is incurred, and more specifically, github tar archives which almost always
>> go through a 301.
>>
>> Don't mind if going into the next non-bugfix release if you don't agree
>> it's that important.
>>
>> On 4 Jul 2014, at 20:48, Dominic Hamon <dh...@twopensource.com> wrote:
>>
>> > Hi
>> >
>> > Can you give some background as to why this is a critical fix? We try to
>> > minimise what we include in bug fix releases to avoid feature creep.
>> >
>> > Thanks
>> > On Jul 4, 2014 12:31 PM, "Tom Arnfeld" <to...@duedil.com> wrote:
>> >
>> >> Any chance we can get https://issues.apache.org/jira/browse/MESOS-1448
>> >> too?
>> >>
>> >> On 3 Jul 2014, at 21:40, Vinod Kone <vi...@gmail.com> wrote:
>> >>
>> >> Hi,
>> >>
>> >> We are planning to release 0.19.1 (likely next week) which will be a
>> bug
>> >> fix release. Specifically, these are the fixes that we are planning to
>> >> cherry pick.
>> >>
>> >>
>> >>
>> https://issues.apache.org/jira/issues/?filter=12326191&jql=project%20%3D%20MESOS%20AND%20%22Target%20Version%2Fs%22%20%3D%200.19.1
>> >>
>> >> If there are other critical fixes that need to be backported to 0.19.1
>> >> please reply here as soon as possible.
>> >>
>> >> Thanks,
>> >>
>> >>
>> >>
>>
>>
>

Re: 0.19.1

Posted by Benjamin Mahler <be...@gmail.com>.
I've added it to the 0.19.1 list since it's trivial and helps those using
S3.

On Fri, Jul 4, 2014 at 12:52 PM, Tom Arnfeld <to...@duedil.com> wrote:

> Happy to. It surprised me that this wasn't supported, especially
> considering the fetcher is supposed to be able to download URIs from any
> URL using http(s). This is most useful (and in my opinion quite an
> important issue) for downloading executors from S3 in situations a redirect
> is incurred, and more specifically, github tar archives which almost always
> go through a 301.
>
> Don't mind if going into the next non-bugfix release if you don't agree
> it's that important.
>
> On 4 Jul 2014, at 20:48, Dominic Hamon <dh...@twopensource.com> wrote:
>
> > Hi
> >
> > Can you give some background as to why this is a critical fix? We try to
> > minimise what we include in bug fix releases to avoid feature creep.
> >
> > Thanks
> > On Jul 4, 2014 12:31 PM, "Tom Arnfeld" <to...@duedil.com> wrote:
> >
> >> Any chance we can get https://issues.apache.org/jira/browse/MESOS-1448
> >> too?
> >>
> >> On 3 Jul 2014, at 21:40, Vinod Kone <vi...@gmail.com> wrote:
> >>
> >> Hi,
> >>
> >> We are planning to release 0.19.1 (likely next week) which will be a bug
> >> fix release. Specifically, these are the fixes that we are planning to
> >> cherry pick.
> >>
> >>
> >>
> https://issues.apache.org/jira/issues/?filter=12326191&jql=project%20%3D%20MESOS%20AND%20%22Target%20Version%2Fs%22%20%3D%200.19.1
> >>
> >> If there are other critical fixes that need to be backported to 0.19.1
> >> please reply here as soon as possible.
> >>
> >> Thanks,
> >>
> >>
> >>
>
>

Re: 0.19.1

Posted by Tom Arnfeld <to...@duedil.com>.
Happy to. It surprised me that this wasn't supported, especially considering the fetcher is supposed to be able to download URIs from any URL using http(s). This is most useful (and in my opinion quite an important issue) for downloading executors from S3 in situations where a redirect is incurred, and more specifically for GitHub tar archives, which almost always go through a 301.

I don't mind if this goes into the next non-bugfix release if you don't agree it's that important.
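
To illustrate the behaviour being asked for, here is a minimal Python sketch of a download helper that follows a 301 instead of stopping at it. It is not the Mesos fetcher code: the function name, the `fetch` callable, the response tuple shape and the example.com URLs are all assumptions made up for this illustration, and the network is replaced by a small in-memory table so the sketch runs anywhere.

```python
from typing import Callable, Dict, Tuple

# (status code, response headers, body): a hypothetical shape for this sketch.
Response = Tuple[int, Dict[str, str], bytes]

REDIRECT_CODES = {301, 302, 303, 307, 308}


def fetch_following_redirects(
    fetch: Callable[[str], Response], url: str, max_hops: int = 5
) -> Tuple[str, bytes]:
    """Return (final_url, body), following Location headers up to max_hops times."""
    for _ in range(max_hops + 1):
        status, headers, body = fetch(url)
        if status in REDIRECT_CODES:
            # Assumes an absolute Location value; a relative one would need urljoin.
            url = headers["Location"]
            continue
        if status != 200:
            raise RuntimeError(f"fetch of {url} failed with HTTP {status}")
        return url, body
    raise RuntimeError(f"gave up after {max_hops} redirects")


# Stand-in for the network: one 301 hop, then the archive itself.
FAKE_SERVER = {
    "https://example.com/executor.tar.gz": (
        301, {"Location": "https://cdn.example.com/executor.tar.gz"}, b""),
    "https://cdn.example.com/executor.tar.gz": (200, {}, b"executor-bytes"),
}


def fake_fetch(url: str) -> Response:
    return FAKE_SERVER[url]


final_url, body = fetch_following_redirects(
    fake_fetch, "https://example.com/executor.tar.gz")
print(final_url, len(body))
```

A client that does not follow the Location header would end up with the empty 301 response rather than the archive, which is the failure mode described above. The hop limit is there so that a redirect loop fails instead of spinning forever.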

On 4 Jul 2014, at 20:48, Dominic Hamon <dh...@twopensource.com> wrote:

> Hi
> 
> Can you give some background as to why this is a critical fix? We try to
> minimise what we include in bug fix releases to avoid feature creep.
> 
> Thanks
> On Jul 4, 2014 12:31 PM, "Tom Arnfeld" <to...@duedil.com> wrote:
> 
>> Any chance we can get https://issues.apache.org/jira/browse/MESOS-1448
>> too?
>> 
>> On 3 Jul 2014, at 21:40, Vinod Kone <vi...@gmail.com> wrote:
>> 
>> Hi,
>> 
>> We are planning to release 0.19.1 (likely next week) which will be a bug
>> fix release. Specifically, these are the fixes that we are planning to
>> cherry pick.
>> 
>> 
>> https://issues.apache.org/jira/issues/?filter=12326191&jql=project%20%3D%20MESOS%20AND%20%22Target%20Version%2Fs%22%20%3D%200.19.1
>> 
>> If there are other critical fixes that need to be backported to 0.19.1
>> please reply here as soon as possible.
>> 
>> Thanks,
>> 
>> 
>> 


Re: 0.19.1

Posted by Dominic Hamon <dh...@twopensource.com>.
Hi

Can you give some background as to why this is a critical fix? We try to
minimise what we include in bug fix releases to avoid feature creep.

Thanks
On Jul 4, 2014 12:31 PM, "Tom Arnfeld" <to...@duedil.com> wrote:

> Any chance we can get https://issues.apache.org/jira/browse/MESOS-1448
> too?
>
> On 3 Jul 2014, at 21:40, Vinod Kone <vi...@gmail.com> wrote:
>
> Hi,
>
> We are planning to release 0.19.1 (likely next week) which will be a bug
> fix release. Specifically, these are the fixes that we are planning to
> cherry pick.
>
>
> https://issues.apache.org/jira/issues/?filter=12326191&jql=project%20%3D%20MESOS%20AND%20%22Target%20Version%2Fs%22%20%3D%200.19.1
>
> If there are other critical fixes that need to be backported to 0.19.1
> please reply here as soon as possible.
>
> Thanks,
>
>
>

Re: 0.19.1

Posted by Tom Arnfeld <to...@duedil.com>.
Any chance we can get https://issues.apache.org/jira/browse/MESOS-1448  too?

On 3 Jul 2014, at 21:40, Vinod Kone <vi...@gmail.com> wrote:

> Hi,
> 
> We are planning to release 0.19.1 (likely next week) which will be a bug fix release. Specifically, these are the fixes that we are planning to cherry pick.
> 
> https://issues.apache.org/jira/issues/?filter=12326191&jql=project%20%3D%20MESOS%20AND%20%22Target%20Version%2Fs%22%20%3D%200.19.1
> 
> If there are other critical fixes that need to be backported to 0.19.1 please reply here as soon as possible.
> 
> Thanks,


Re: spark and mesos issue

Posted by Ray Rodriguez <ra...@gmail.com>.
I've been running into the same issue with task counts greater than 600 or so, using Spark with Mesos in fine-grained mode.

On Fri, Jul 4, 2014 at 5:06 AM, Gurvinder Singh
<gu...@uninett.no> wrote:

> We are getting this issue when we are running jobs with close to 1000
> workers. Spark is from the github version and mesos is 0.19.0
> ERROR storage.BlockManagerMasterActor: Got two different block manager
> registrations on 201407031041-1227224054-5050-24004-0
> Googling about it seems that mesos is starting slaves at the same time
> and giving them the same id. So may bug in mesos ?
> Thanks,
> Gurvinder
> On 07/04/2014 01:03 AM, Vinod Kone wrote:
>> correct url:
>> 
>> https://issues.apache.org/jira/issues/?jql=project%20%3D%20MESOS%20AND%20%22Target%20Version%2Fs%22%20%3D%200.19.1
>> 
>> 
>> On Thu, Jul 3, 2014 at 1:40 PM, Vinod Kone <vinodkone@gmail.com
>> <ma...@gmail.com>> wrote:
>> 
>>     Hi,
>> 
>>     We are planning to release 0.19.1 (likely next week) which will be a
>>     bug fix release. Specifically, these are the fixes that we are
>>     planning to cherry pick.
>> 
>>     https://issues.apache.org/jira/issues/?filter=12326191&jql=project%20%3D%20MESOS%20AND%20%22Target%20Version%2Fs%22%20%3D%200.19.1
>> 
>>     If there are other critical fixes that need to be backported to
>>     0.19.1 please reply here as soon as possible.
>> 
>>     Thanks,
>> 
>> 

Re: spark and mesos issue

Posted by Gurvinder Singh <gu...@uninett.no>.
It might not be related only to the memory issue. The memory issue is
also there, as you mentioned; I have seen that one too. The fine-grained
mode issue is mainly Spark considering that it got two different block
managers for the same ID, whereas if I search for the ID in the Mesos
slave logs, it exists on only one slave, not on several of them. This
might be due to the size of the ID, as Spark prints the error as

14/09/16 08:04:29 ERROR BlockManagerMasterActor: Got two different
block manager registrations on 20140822-112818-711206558-5050-25951-0

whereas in the Mesos slave I see logs such as

I0915 20:55:18.293903 31434 containerizer.cpp:392] Starting container
'3aab2237-d32f-470d-a206-7bada454ad3f' for executor
'20140822-112818-711206558-5050-25951-0' of framework
'20140822-112818-711206558-5050-25951-0053'

I0915 20:53:28.039218 31437 containerizer.cpp:392] Starting container
'fe4b344f-16c9-484a-9c2f-92bd92b43f6d' for executor
'20140822-112818-711206558-5050-25951-0' of framework
'20140822-112818-711206558-5050-25951-0050'


You can see that the last 3 digits of the ID are missing in Spark,
whereas they are different in the Mesos slave logs.
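
The comparison above can be reproduced mechanically. The following Python sketch parses the two slave log lines quoted in this message (re-joined onto single lines) and groups framework IDs by executor ID. The regular expression and the function name are assumptions based only on the log format shown here, not on any Mesos tooling.

```python
import re
from collections import defaultdict

# The two slave log lines from the message, re-joined onto single lines.
SLAVE_LOG = """\
I0915 20:55:18.293903 31434 containerizer.cpp:392] Starting container '3aab2237-d32f-470d-a206-7bada454ad3f' for executor '20140822-112818-711206558-5050-25951-0' of framework '20140822-112818-711206558-5050-25951-0053'
I0915 20:53:28.039218 31437 containerizer.cpp:392] Starting container 'fe4b344f-16c9-484a-9c2f-92bd92b43f6d' for executor '20140822-112818-711206558-5050-25951-0' of framework '20140822-112818-711206558-5050-25951-0050'
"""

# Pattern inferred from the lines above; other Mesos versions may log differently.
PATTERN = re.compile(
    r"Starting container '([^']+)' for executor '([^']+)' of framework '([^']+)'"
)


def frameworks_by_executor(log_text):
    """Map each executor ID to the framework IDs that started it, in log order."""
    groups = defaultdict(list)
    for _container, executor, framework in PATTERN.findall(log_text):
        groups[executor].append(framework)
    return dict(groups)


groups = frameworks_by_executor(SLAVE_LOG)
for executor, frameworks in groups.items():
    print(executor, "->", frameworks)
```

On these two lines the executor ID is identical while the framework IDs end in 0053 and 0050, so the ID in the Spark error matches the executor ID string and not either framework ID.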

- Gurvinder
On 09/15/2014 11:13 PM, Brenden Matthews wrote:
> I started hitting a similar problem, and it seems to be related to 
> memory overhead and tasks getting OOM killed.  I filed a ticket
> here:
> 
> https://issues.apache.org/jira/browse/SPARK-3535
> 
> On Wed, Jul 16, 2014 at 5:27 AM, Ray Rodriguez
> <rayrod2030@gmail.com <ma...@gmail.com>> wrote:
> 
> I'll set some time aside today to gather and post some logs and 
> details about this issue from our end.
> 
> 
> On Wed, Jul 16, 2014 at 2:05 AM, Vinod Kone <vinodkone@gmail.com 
> <ma...@gmail.com>> wrote:
> 
> 
> 
> 
> On Tue, Jul 15, 2014 at 11:02 PM, Vinod Kone <vinod@twitter.com 
> <ma...@twitter.com>> wrote:
> 
> 
> On Fri, Jul 4, 2014 at 2:05 AM, Gurvinder Singh 
> <gurvinder.singh@uninett.no <ma...@uninett.no>>
> wrote:
> 
> ERROR storage.BlockManagerMasterActor: Got two different block
> manager registrations on 201407031041-1227224054-5050-24004-0
> 
> Googling about it seems that mesos is starting slaves at the same
> time and giving them the same id. So may bug in mesos ?
> 
> 
> Has this issue been resolved? We need more information to triage
> this. Maybe some logs that show the lifecycle of the duplicate
> instances?
> 
> 
> @vinodkone
> 
> 
> 
> 



Re: spark and mesos issue

Posted by Tim St Clair <ts...@redhat.com>.
inline - 

----- Original Message -----
> From: "CCAAT" <cc...@tampabay.rr.com>
> To: user@mesos.apache.org
> Cc: ccaat@tampabay.rr.com
> Sent: Monday, September 15, 2014 5:33:08 PM
> Subject: Re: spark and mesos issue
> 
> Hello Brenden/Vinod,
> 
> Is your installation using "systemd" ?
> 
> Has anyone documented systemd configurations/issues for the various
> linux distro running mesos/spark?
> 
> What if a cluster is running on a mixture of systems that use/do_not_use
> systemd; are there any issues, related to systemd and mesos/spark?

Yes, and I'll have patches posted today, I'm still debugging.  Basically you need the init kickers + re-parent the cgroup code. 

> 
> Has anyone tried to use Ftrace/trace-cmd/kernelshark in tracing down
> or optimizations of the linux kernel for machines dedicated to
> mesos/spark?
> 
> Are there  (kernel) .config files published for key kernel resources
> dedicated to the optimization of mesos/spark anywhere ?
> 
> 
> curiously,
> James
> 
> 
> 
> 
> On 09/15/14 16:13, Brenden Matthews wrote:
> > I started hitting a similar problem, and it seems to be related to
> > memory overhead and tasks getting OOM killed.  I filed a ticket here:
> >
> > https://issues.apache.org/jira/browse/SPARK-3535
> >
> > On Wed, Jul 16, 2014 at 5:27 AM, Ray Rodriguez <rayrod2030@gmail.com
> > <ma...@gmail.com>> wrote:
> >
> >     I'll set some time aside today to gather and post some logs and
> >     details about this issue from our end.
> >
> >
> >     On Wed, Jul 16, 2014 at 2:05 AM, Vinod Kone <vinodkone@gmail.com
> >     <ma...@gmail.com>> wrote:
> >
> >
> >
> >
> >         On Tue, Jul 15, 2014 at 11:02 PM, Vinod Kone <vinod@twitter.com
> >         <ma...@twitter.com>> wrote:
> >
> >
> >             On Fri, Jul 4, 2014 at 2:05 AM, Gurvinder Singh
> >             <gurvinder.singh@uninett.no
> >             <ma...@uninett.no>> wrote:
> >
> >                 ERROR storage.BlockManagerMasterActor: Got two different
> >                 block manager
> >                 registrations on 201407031041-1227224054-5050-24004-0
> >
> >                 Googling about it seems that mesos is starting slaves at
> >                 the same time
> >                 and giving them the same id. So may bug in mesos ?
> >
> >
> >             Has this issue been resolved? We need more information to
> >             triage this. Maybe some logs that show the lifecycle of the
> >             duplicate instances?
> >
> >
> >             @vinodkone
> >
> >
> >
> >
> 
> 

-- 
Cheers,
Timothy St. Clair
Red Hat Inc.

Re: spark and mesos issue

Posted by CCAAT <cc...@tampabay.rr.com>.
Hello Brenden/Vinod,

Is your installation using "systemd" ?

Has anyone documented systemd configurations/issues for the various
Linux distros running mesos/spark?

What if a cluster is running on a mixture of systems that use/do_not_use
systemd; are there any issues related to systemd and mesos/spark?

Has anyone tried using Ftrace/trace-cmd/kernelshark to trace or
optimize the Linux kernel on machines dedicated to mesos/spark?

Are there (kernel) .config files published anywhere for key kernel
resources dedicated to the optimization of mesos/spark?


curiously,
James




On 09/15/14 16:13, Brenden Matthews wrote:
> I started hitting a similar problem, and it seems to be related to
> memory overhead and tasks getting OOM killed.  I filed a ticket here:
>
> https://issues.apache.org/jira/browse/SPARK-3535
>
> On Wed, Jul 16, 2014 at 5:27 AM, Ray Rodriguez <rayrod2030@gmail.com
> <ma...@gmail.com>> wrote:
>
>     I'll set some time aside today to gather and post some logs and
>     details about this issue from our end.
>
>
>     On Wed, Jul 16, 2014 at 2:05 AM, Vinod Kone <vinodkone@gmail.com
>     <ma...@gmail.com>> wrote:
>
>
>
>
>         On Tue, Jul 15, 2014 at 11:02 PM, Vinod Kone <vinod@twitter.com
>         <ma...@twitter.com>> wrote:
>
>
>             On Fri, Jul 4, 2014 at 2:05 AM, Gurvinder Singh
>             <gurvinder.singh@uninett.no
>             <ma...@uninett.no>> wrote:
>
>                 ERROR storage.BlockManagerMasterActor: Got two different
>                 block manager
>                 registrations on 201407031041-1227224054-5050-24004-0
>
>                 Googling about it seems that mesos is starting slaves at
>                 the same time
>                 and giving them the same id. So may bug in mesos ?
>
>
>             Has this issue been resolved? We need more information to
>             triage this. Maybe some logs that show the lifecycle of the
>             duplicate instances?
>
>
>             @vinodkone
>
>
>
>


Re: spark and mesos issue

Posted by Brenden Matthews <br...@airbedandbreakfast.com>.
I started hitting a similar problem, and it seems to be related to memory
overhead and tasks getting OOM killed.  I filed a ticket here:

https://issues.apache.org/jira/browse/SPARK-3535

On Wed, Jul 16, 2014 at 5:27 AM, Ray Rodriguez <ra...@gmail.com> wrote:

> I'll set some time aside today to gather and post some logs and details
> about this issue from our end.
>
>
> On Wed, Jul 16, 2014 at 2:05 AM, Vinod Kone <vi...@gmail.com> wrote:
>
>>
>>
>>
>> On Tue, Jul 15, 2014 at 11:02 PM, Vinod Kone <vi...@twitter.com> wrote:
>>
>>>
>>> On Fri, Jul 4, 2014 at 2:05 AM, Gurvinder Singh <
>>> gurvinder.singh@uninett.no> wrote:
>>>
>>>> ERROR storage.BlockManagerMasterActor: Got two different block manager
>>>> registrations on 201407031041-1227224054-5050-24004-0
>>>>
>>>> Googling about it seems that mesos is starting slaves at the same time
>>>> and giving them the same id. So may bug in mesos ?
>>>>
>>>
>>> Has this issue been resolved? We need more information to triage this.
>>> Maybe some logs that show the lifecycle of the duplicate instances?
>>>
>>>
>>> @vinodkone
>>>
>>
>>
>

Re: spark and mesos issue

Posted by Ray Rodriguez <ra...@gmail.com>.
I'll set some time aside today to gather and post some logs and details about this issue from our end.

On Wed, Jul 16, 2014 at 2:05 AM, Vinod Kone <vi...@gmail.com> wrote:

> On Tue, Jul 15, 2014 at 11:02 PM, Vinod Kone <vi...@twitter.com> wrote:
>>
>> On Fri, Jul 4, 2014 at 2:05 AM, Gurvinder Singh <
>> gurvinder.singh@uninett.no> wrote:
>>
>>> ERROR storage.BlockManagerMasterActor: Got two different block manager
>>> registrations on 201407031041-1227224054-5050-24004-0
>>>
>>> Googling about it seems that mesos is starting slaves at the same time
>>> and giving them the same id. So may bug in mesos ?
>>>
>>
>> Has this issue been resolved? We need more information to triage this.
>> Maybe some logs that show the lifecycle of the duplicate instances?
>>
>>
>> @vinodkone
>>

Re: spark and mesos issue

Posted by Vinod Kone <vi...@gmail.com>.
On Tue, Jul 15, 2014 at 11:02 PM, Vinod Kone <vi...@twitter.com> wrote:

>
> On Fri, Jul 4, 2014 at 2:05 AM, Gurvinder Singh <
> gurvinder.singh@uninett.no> wrote:
>
>> ERROR storage.BlockManagerMasterActor: Got two different block manager
>> registrations on 201407031041-1227224054-5050-24004-0
>>
>> Googling about it seems that mesos is starting slaves at the same time
>> and giving them the same id. So may bug in mesos ?
>>
>
> Has this issue been resolved? We need more information to triage this.
> Maybe some logs that show the lifecycle of the duplicate instances?
>
>
> @vinodkone
>

Re: spark and mesos issue

Posted by Vinod Kone <vi...@twitter.com>.
On Fri, Jul 4, 2014 at 2:05 AM, Gurvinder Singh <gu...@uninett.no>
wrote:

> ERROR storage.BlockManagerMasterActor: Got two different block manager
> registrations on 201407031041-1227224054-5050-24004-0
>
> Googling about it seems that mesos is starting slaves at the same time
> and giving them the same id. So may bug in mesos ?
>

Has this issue been resolved? We need more information to triage this.
Maybe some logs that show the lifecycle of the duplicate instances?


@vinodkone

spark and mesos issue

Posted by Gurvinder Singh <gu...@uninett.no>.
We are getting this issue when running jobs with close to 1000
workers. Spark is built from the GitHub version and Mesos is 0.19.0.

ERROR storage.BlockManagerMasterActor: Got two different block manager
registrations on 201407031041-1227224054-5050-24004-0

Googling it suggests that Mesos is starting slaves at the same time
and giving them the same ID. So maybe a bug in Mesos?

Thanks,
Gurvinder
On 07/04/2014 01:03 AM, Vinod Kone wrote:
> correct url:
> 
> https://issues.apache.org/jira/issues/?jql=project%20%3D%20MESOS%20AND%20%22Target%20Version%2Fs%22%20%3D%200.19.1
> 
> 
> On Thu, Jul 3, 2014 at 1:40 PM, Vinod Kone <vinodkone@gmail.com
> <ma...@gmail.com>> wrote:
> 
>     Hi,
> 
>     We are planning to release 0.19.1 (likely next week) which will be a
>     bug fix release. Specifically, these are the fixes that we are
>     planning to cherry pick.
> 
>     https://issues.apache.org/jira/issues/?filter=12326191&jql=project%20%3D%20MESOS%20AND%20%22Target%20Version%2Fs%22%20%3D%200.19.1
> 
>     If there are other critical fixes that need to be backported to
>     0.19.1 please reply here as soon as possible.
> 
>     Thanks,
> 
> 


Re: 0.19.1

Posted by Vinod Kone <vi...@gmail.com>.
correct url:

https://issues.apache.org/jira/issues/?jql=project%20%3D%20MESOS%20AND%20%22Target%20Version%2Fs%22%20%3D%200.19.1
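
For readers comparing the two links, this small Python sketch decodes both query strings with the standard library. It only shows what differs between the URLs; the message does not say why the first one needed correcting, and the variable names are made up for the illustration.

```python
from urllib.parse import parse_qs, urlsplit

ORIGINAL = (
    "https://issues.apache.org/jira/issues/?filter=12326191&jql="
    "project%20%3D%20MESOS%20AND%20%22Target%20Version%2Fs%22%20%3D%200.19.1"
)
CORRECTED = (
    "https://issues.apache.org/jira/issues/?jql="
    "project%20%3D%20MESOS%20AND%20%22Target%20Version%2Fs%22%20%3D%200.19.1"
)

# parse_qs percent-decodes the values and returns a dict of lists.
original_params = parse_qs(urlsplit(ORIGINAL).query)
corrected_params = parse_qs(urlsplit(CORRECTED).query)

print(sorted(original_params))     # parameter names in the first link
print(sorted(corrected_params))    # parameter names in the corrected link
print(corrected_params["jql"][0])  # the decoded JQL query
```

Both links carry the same JQL query (project MESOS, target version 0.19.1); the corrected one simply omits the `filter` parameter.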


On Thu, Jul 3, 2014 at 1:40 PM, Vinod Kone <vi...@gmail.com> wrote:

> Hi,
>
> We are planning to release 0.19.1 (likely next week) which will be a bug
> fix release. Specifically, these are the fixes that we are planning to
> cherry pick.
>
>
> https://issues.apache.org/jira/issues/?filter=12326191&jql=project%20%3D%20MESOS%20AND%20%22Target%20Version%2Fs%22%20%3D%200.19.1
>
> If there are other critical fixes that need to be backported to 0.19.1
> please reply here as soon as possible.
>
> Thanks,
>
