Posted to dev@aurora.apache.org by Hussein Elgridly <hu...@broadinstitute.org> on 2015/02/11 20:46:39 UTC

Speeding up Aurora client job creation

Hi folks,

I'm looking at a use case that involves submitting potentially hundreds of
jobs a second to our Mesos cluster. My tests show that the Aurora client is
taking 1-2 seconds for each job submission, and that I can run about four
client processes in parallel before they peg the CPU at 100%. I need more
throughput than this!

Squashing jobs down to the Process or Task level doesn't really make sense
for our use case. I'm aware that with some shenanigans I can batch jobs
together using job instances, but that's more work than my current
timeframe allows (and of questionable utility, given that the jobs certainly won't
have identical resource requirements).

What I really need is (at least) an order of magnitude speedup in terms of
being able to submit jobs to the Aurora scheduler (via the client or
otherwise).

Conceptually it doesn't seem like adding a job to a queue should be a thing
that takes a couple of seconds, so I'm baffled as to why it's taking so
long. As an experiment, I wrapped the call to client.execute() in
client.py:proxy_main in cProfile and called aurora job create with a very
simple test job.
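
The wrapping amounts to something like the following (illustrative sketch; `create_job` here is a stand-in for the real `client.execute()` call):

```python
import cProfile
import io
import pstats

def create_job():
    # stand-in workload for the call being profiled
    # (client.execute() in the real client)
    return sum(range(1000))

profiler = cProfile.Profile()
profiler.enable()
result = create_job()
profiler.disable()

# report the five most expensive entries by cumulative time
buf = io.StringIO()
pstats.Stats(profiler, stream=buf).sort_stats("cumulative").print_stats(5)
print(buf.getvalue())
```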

Results of the profile are in the Gist below:

https://gist.github.com/helgridly/b37a0d27f04a37e72bb5

Out of a 0.977s profile time, the two things that stick out to me are:

1. 0.526s spent in Pystachio for a job that doesn't use any templates
2. 0.564s spent in create_job, presumably talking to the scheduler (and
setting up the machinery for doing so)

I imagine I can sidestep #1 with a check for "{{" in the job file and
bypass Pystachio entirely. Can I also skip the Aurora client entirely and
talk directly to the scheduler? If so what does that entail, and are there
any risks associated?
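
The check for #1 would be nothing more sophisticated than this (sketch; the function name is mine):

```python
def needs_pystachio(job_file_text: str) -> bool:
    # crude heuristic: only invoke the template engine when
    # mustache-style markers appear in the job file
    return "{{" in job_file_text

print(needs_pystachio("cmdline = 'echo hello'"))
print(needs_pystachio("cmdline = 'echo {{thermos.ports[http]}}'"))
```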

Thanks,
-Hussein

Hussein Elgridly
Senior Software Engineer, DSDE
The Broad Institute of MIT and Harvard

Re: Speeding up Aurora client job creation

Posted by Hussein Elgridly <hu...@broadinstitute.org>.
Another update: Thrift has a pull request open for Python 3 support [1],
but it was out of date and needed rebasing onto master. I did this in
my own fork [2] and managed to build a Py3-generating version of Thrift.
This allowed me to generate Python 3 Thrift bindings for Aurora, which I'm
including in my project along with a tarball of the Python 3 Thrift
libraries. Success!

[1] https://github.com/apache/thrift/pull/213
[2] https://github.com/broadinstitute/thrift/tree/eevee/python3

The changes make Thrift fail on Python 2, so I imagine it'll be a while
before they make it into official Thrift. But it works for me, so I'm happy
:)


Hussein Elgridly
Senior Software Engineer, DSDE
The Broad Institute of MIT and Harvard


On 17 March 2015 at 15:18, Hussein Elgridly <hu...@broadinstitute.org>
wrote:

> For anyone following along at home, I managed to make my own THTTPClient
> for thriftpy just fine. Unfortunately, thriftpy's TJSONProtocol seems to be
> *a* JSON protocol, not *the* JSON protocol:
>
> thrift: [1,"getJobSummary",1,0,{}]
> thriftpy: {"metadata": {"ttype": 1, "name": "getJobSummary", "version": 1,
> "seqid": 0}, "payload": {}}
>
> Which is frustrating to say the least. I am now debating whether to:
>
> 1. Stub out the subset of the API that I actually need (currently only
> createJob and getTasksWithoutConfigs);
> 2. Roll my own protocol, based on Thrift's code [1]; or
> 3. Backport my project to Python 2.7 and use official Thrift.
>
> [1]
> https://github.com/apache/thrift/blob/93fea15b51494a79992a5323c803325537134bd8/lib/py/src/protocol/TJSONProtocol.py
>
>
> Hussein Elgridly
> Senior Software Engineer, DSDE
> The Broad Institute of MIT and Harvard
>
>
> On 16 March 2015 at 23:37, Hussein Elgridly <hu...@broadinstitute.org>
> wrote:
>
>> As a general rule we're trying to stick to Python 3.4. I don't imagine
>> implementing a THTTPClient of my own will be too difficult,
>> especially given that I have the Aurora client's TRequestsTransport [1] for
>> reference.
>>
>> [1]
>> https://github.com/apache/incubator-aurora/blob/master/src/main/python/apache/aurora/common/transport.py
>>
>> Hussein Elgridly
>> Senior Software Engineer, DSDE
>> The Broad Institute of MIT and Harvard
>>
>>
>> On 16 March 2015 at 22:58, Bill Farner <wf...@apache.org> wrote:
>>
>>> Exploring the possibilities - can you use python 2.7?  If so, you could
>>> leverage some of the private libraries within the client and lower the
>>> surface area of what you need to build.  It won't be a stable
>>> programmatic
>>> API, but you might get moving faster.  I assume this is what Stephan is
>>> suggesting.
>>>
>>> -=Bill
>>>
>>> On Mon, Mar 16, 2015 at 7:52 PM, Hussein Elgridly <
>>> hussein@broadinstitute.org> wrote:
>>>
>>> > I'm not quite sure I understand your question, so I'll be painfully
>>> > explicit instead.
>>> >
>>> > I don't want to use the existing Aurora client because it's slow
>>> (Pystachio
>>> > + repeated HTTP connection overheads, as detailed earlier in this
>>> thread).
>>> > Instead, I want to use the Thrift interface to talk to the Aurora
>>> scheduler
>>> > directly - I can skip Pystachio entirely and keep the HTTP connection
>>> > open.
>>> >
>>> > I cannot use the official Thrift bindings for Python as they do not yet
>>> > support Python 3 [1]. There is a third-party, pure Python
>>> implementation of
>>> > Thrift that does support Python 3 called thriftpy [2]. However,
>>> thriftpy
>>> > does not include a THTTPClient transport, which is what the Aurora
>>> > scheduler uses. I will therefore have to write my own THTTPClient
>>> transport
>>> > (and probably contribute it back to thriftpy).
>>> >
>>> > [1] https://issues.apache.org/jira/browse/THRIFT-1857
>>> > [2] https://github.com/eleme/thriftpy
>>> >
>>> > Hussein Elgridly
>>> > Senior Software Engineer, DSDE
>>> > The Broad Institute of MIT and Harvard
>>> >
>>> >
>>> > On 16 March 2015 at 19:11, Erb, Stephan <St...@blue-yonder.com>
>>> > wrote:
>>> >
>>> > > Just to make sure I get this correctly: You say, you cannot use the
>>> > > existing python client because it is python 2.7 only so you want to
>>> > write a
>>> > > new one in python 3?
>>> > >
>>> > > Regards,
>>> > > Stephan
>>> > > ________________________________________
>>> > > From: Hussein Elgridly <hu...@broadinstitute.org>
>>> > > Sent: Monday, March 16, 2015 11:44 PM
>>> > > To: dev@aurora.incubator.apache.org
>>> > > Subject: Re: Speeding up Aurora client job creation
>>> > >
>>> > > So this has now bubbled back to the top of my TODO list and I'm
>>> actively
>>> > > working on it. I am entirely new to Thrift so please forgive the
>>> newbie
>>> > > questions...
>>> > >
>>> > > I would like to talk to the Aurora scheduler directly from my
>>> (Python)
>>> > > application using Thrift. Since I'm on Python 3.4 I've had to use
>>> > thriftpy:
>>> > > https://github.com/eleme/thriftpy
>>> > >
>>> > > As far as I can tell, the following should work (by default, thriftpy
>>> > uses
>>> > > a TBufferedTransport around a TSocket):
>>> > >
>>> > > ---
>>> > > import thriftpy
>>> > > import thriftpy.rpc
>>> > >
>>> > > aurora_api = thriftpy.load("api.thrift")
>>> > >
>>> > > client = thriftpy.rpc.make_client(aurora_api.AuroraSchedulerManager,
>>> > > host="localhost", port=8081,
>>> > > proto_factory=thriftpy.protocol.TJSONProtocolFactory() )
>>> > >
>>> > > print(client.getJobSummary())
>>> > > ---
>>> > >
>>> > > Obviously I wouldn't be writing this email if it did work :) It
>>> hangs.
>>> > >
>>> > > I jumped into pdb and found it was sending the following payload:
>>> > >
>>> > > b'\x00\x00\x00\\{"metadata": {"name": "getJobSummary", "seqid": 0,
>>> > "ttype":
>>> > > 1, "version": 1}, "payload": {}}'
>>> > >
>>> > > to a socket that looked like this:
>>> > >
>>> > > <socket.socket fd=3, family=AddressFamily.AF_INET, type=2049,
>>> proto=0,
>>> > > laddr=('<localhost's_private_ip>', 49167),
>>> > raddr=('localhost's_private_ip',
>>> > > 8081)>
>>> > >
>>> > > ...but was waiting forever to receive any data. Adding a timeout just
>>> > > triggered the timeout.
>>> > >
>>> > > I'm stumped. Any clues?
>>> > >
>>> > >
>>> > > Hussein Elgridly
>>> > > Senior Software Engineer, DSDE
>>> > > The Broad Institute of MIT and Harvard
>>> > >
>>> > >
>>> > > On 12 February 2015 at 04:15, Erb, Stephan <
>>> Stephan.Erb@blue-yonder.com>
>>> > > wrote:
>>> > >
>>> > > > Hi Hussein,
>>> > > >
>>> > > > we also had slight performance problems when talking to Aurora. We
>>> > ended
>>> > > > up using the existing python client directly in our code (see
>>> > > > apache.aurora.client.api.__init__.py). This allowed us to reuse
>>> the api
>>> > > > object and its scheduler connection, dropping a connection latency
>>> of
>>> > > about
>>> > > > 0.3-0.4 seconds per request.
>>> > > >
>>> > > > Best Regards,
>>> > > > Stephan
>>> > > > ________________________________________
>>> > > > From: Bill Farner <wf...@apache.org>
>>> > > > Sent: Wednesday, February 11, 2015 9:29 PM
>>> > > > To: dev@aurora.incubator.apache.org
>>> > > > Subject: Re: Speeding up Aurora client job creation
>>> > > >
>>> > > > To reduce that time you will indeed want to talk directly to the
>>> > > > scheduler.  This will definitely require you to roll up your
>>> sleeves a
>>> > > bit
>>> > > > and set up a thrift client to our api (based on api.thrift [1]),
>>> since
>>> > > you
>>> > > > will need to specify your tasks in a format that the thermos
>>> executor
>>> > can
>>> > > > understand.  Turns out this is JSON data, so it should not be *too*
>>> > > > prohibitive.
>>> > > >
>>> > > > However, there is another technical limitation you will hit for the
>>> > > > submission rate you are after.  The scheduler is backed by a
>>> durable
>>> > > store
>>> > > > whose write latency is at minimum the amount of time required to
>>> fsync.
>>> > > >
>>> > > > [1]
>>> > > >
>>> > > >
>>> > >
>>> >
>>> https://github.com/apache/incubator-aurora/blob/master/api/src/main/thrift/org/apache/aurora/gen/api.thrift
>>> > > >
>>> > > > -=Bill

Re: Speeding up Aurora client job creation

Posted by Hussein Elgridly <hu...@broadinstitute.org>.
For anyone following along at home, I managed to make my own THTTPClient
for thriftpy just fine. Unfortunately, thriftpy's TJSONProtocol seems to be
*a* JSON protocol, not *the* JSON protocol:

thrift: [1,"getJobSummary",1,0,{}]
thriftpy: {"metadata": {"ttype": 1, "name": "getJobSummary", "version": 1,
"seqid": 0}, "payload": {}}

Which is frustrating to say the least. I am now debating whether to:

1. Stub out the subset of the API that I actually need (currently only
createJob and getTasksWithoutConfigs);
2. Roll my own protocol, based on Thrift's code [1]; or
3. Backport my project to Python 2.7 and use official Thrift.

[1]
https://github.com/apache/thrift/blob/93fea15b51494a79992a5323c803325537134bd8/lib/py/src/protocol/TJSONProtocol.py


Hussein Elgridly
Senior Software Engineer, DSDE
The Broad Institute of MIT and Harvard



Re: Speeding up Aurora client job creation

Posted by Hussein Elgridly <hu...@broadinstitute.org>.
As a general rule we're trying to stick to Python 3.4. I don't imagine
implementing a THTTPClient of my own will be too difficult,
especially given that I have the Aurora client's TRequestsTransport [1] for
reference.

[1]
https://github.com/apache/incubator-aurora/blob/master/src/main/python/apache/aurora/common/transport.py
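
Sketched out, the transport would look something like this (the open/read/write/flush method names follow the usual Thrift transport interface; the class name and the injectable `post` hook are mine, and wiring it into thriftpy is omitted):

```python
import io
from urllib.request import Request, urlopen  # stdlib HTTP client

class THttpTransport:
    """Minimal HTTP transport sketch: buffer writes, POST the buffer on
    flush, then serve the response bytes to subsequent reads."""

    def __init__(self, url, post=None):
        self.url = url
        self._wbuf = io.BytesIO()
        self._rbuf = io.BytesIO()
        # `post` is injectable so the class can be exercised without a network
        self._post = post or self._default_post

    def _default_post(self, url, body):
        req = Request(url, data=body,
                      headers={"Content-Type": "application/x-thrift"})
        with urlopen(req) as resp:
            return resp.read()

    def open(self):
        pass  # connection is established per-request in flush()

    def close(self):
        pass

    def write(self, data):
        self._wbuf.write(data)

    def flush(self):
        body = self._wbuf.getvalue()
        self._wbuf = io.BytesIO()
        self._rbuf = io.BytesIO(self._post(self.url, body))

    def read(self, sz):
        return self._rbuf.read(sz)

# Exercising the buffering logic with a fake endpoint:
echo = THttpTransport("http://example.invalid/api",
                      post=lambda url, body: b"response:" + body)
echo.write(b"[1,")
echo.write(b'"getJobSummary",1,0,{}]')
echo.flush()
print(echo.read(100))
```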

Hussein Elgridly
Senior Software Engineer, DSDE
The Broad Institute of MIT and Harvard



Re: Speeding up Aurora client job creation

Posted by Bill Farner <wf...@apache.org>.
Exploring the possibilities - can you use python 2.7?  If so, you could
leverage some of the private libraries within the client and lower the
surface area of what you need to build.  It won't be a stable programmatic
API, but you might get moving faster.  I assume this is what Stephan is
suggesting.

-=Bill

On Mon, Mar 16, 2015 at 7:52 PM, Hussein Elgridly <
hussein@broadinstitute.org> wrote:

> I'm not quite sure I understand your question, so I'll be painfully
> explicit instead.
>
> I don't want to use the existing Aurora client because it's slow (Pystachio
> + repeated HTTP connection overheads, as detailed earlier in this thread).
> Instead, I want to use the Thrift interface to talk to the Aurora scheduler
> directly (I can skip Pystachio entirely and keep the HTTP connection
> open).
>
> I cannot use the official Thrift bindings for Python as they do not yet
> support Python 3 [1]. There is a third-party, pure Python implementation of
> Thrift that does support Python 3 called thriftpy [2]. However, thriftpy
> does not include a THTTPClient transport, which is what the Aurora
> scheduler uses. I will therefore have to write my own THTTPClient transport
> (and probably contribute it back to thriftpy).
>
> [1] https://issues.apache.org/jira/browse/THRIFT-1857
> [2] https://github.com/eleme/thriftpy
>
> Hussein Elgridly
> Senior Software Engineer, DSDE
> The Broad Institute of MIT and Harvard
>
>
> On 16 March 2015 at 19:11, Erb, Stephan <St...@blue-yonder.com>
> wrote:
>
> > Just to make sure I get this correctly: You say, you cannot use the
> > existing python client because it is python 2.7 only so you want to
> write a
> > new one in python 3?
> >
> > Regards,
> > Stephan
> > ________________________________________
> > From: Hussein Elgridly <hu...@broadinstitute.org>
> > Sent: Monday, March 16, 2015 11:44 PM
> > To: dev@aurora.incubator.apache.org
> > Subject: Re: Speeding up Aurora client job creation
> >
> > So this has now bubbled back to the top of my TODO list and I'm actively
> > working on it. I am entirely new to Thrift so please forgive the newbie
> > questions...
> >
> > I would like to talk to the Aurora scheduler directly from my (Python)
> > application using Thrift. Since I'm on Python 3.4 I've had to use
> thriftpy:
> > https://github.com/eleme/thriftpy
> >
> > As far as I can tell, the following should work (by default, thriftpy
> uses
> > a TBufferedTransport around a TSocket):
> >
> > ---
> > import thriftpy
> > import thriftpy.rpc
> >
> > aurora_api = thriftpy.load("api.thrift")
> >
> > client = thriftpy.rpc.make_client(aurora_api.AuroraSchedulerManager,
> > host="localhost", port=8081,
> > proto_factory=thriftpy.protocol.TJSONProtocolFactory() )
> >
> > print(client.getJobSummary())
> > ---
> >
> > Obviously I wouldn't be writing this email if it did work :) It hangs.
> >
> > I jumped into pdb and found it was sending the following payload:
> >
> > b'\x00\x00\x00\\{"metadata": {"name": "getJobSummary", "seqid": 0,
> "ttype":
> > 1, "version": 1}, "payload": {}}'
> >
> > to a socket that looked like this:
> >
> > <socket.socket fd=3, family=AddressFamily.AF_INET, type=2049, proto=0,
> > laddr=('<localhost's_private_ip>', 49167),
> raddr=('localhost's_private_ip',
> > 8081)>
> >
> > ...but was waiting forever to receive any data. Adding a timeout just
> > triggered the timeout.
> >
> > I'm stumped. Any clues?
> >
> >
> > Hussein Elgridly
> > Senior Software Engineer, DSDE
> > The Broad Institute of MIT and Harvard
> >
> >
> > On 12 February 2015 at 04:15, Erb, Stephan <St...@blue-yonder.com>
> > wrote:
> >
> > > Hi Hussein,
> > >
> > > we also had slight performance problems when talking to Aurora. We
> ended
> > > up using the existing python client directly in our code (see
> > > apache.aurora.client.api.__init__.py). This allowed us to reuse the api
> > > object and its scheduler connection, dropping a connection latency of
> > about
> > > 0.3-0.4 seconds per request.
> > >
> > > Best Regards,
> > > Stephan
> > > ________________________________________
> > > From: Bill Farner <wf...@apache.org>
> > > Sent: Wednesday, February 11, 2015 9:29 PM
> > > To: dev@aurora.incubator.apache.org
> > > Subject: Re: Speeding up Aurora client job creation
> > >
> > > To reduce that time you will indeed want to talk directly to the
> > > scheduler.  This will definitely require you to roll up your sleeves a
> > bit
> > > and set up a thrift client to our api (based on api.thrift [1]), since
> > you
> > > will need to specify your tasks in a format that the thermos executor
> can
> > > understand.  Turns out this is JSON data, so it should not be *too*
> > > prohibitive.
> > >
> > > However, there is another technical limitation you will hit for the
> > > submission rate you are after.  The scheduler is backed by a durable
> > store
> > > whose write latency is at minimum the amount of time required to fsync.
> > >
> > > [1]
> > >
> > >
> >
> https://github.com/apache/incubator-aurora/blob/master/api/src/main/thrift/org/apache/aurora/gen/api.thrift
> > >
> > > -=Bill
> > >
> > > On Wed, Feb 11, 2015 at 11:46 AM, Hussein Elgridly <
> > > hussein@broadinstitute.org> wrote:
> > >
> > > > Hi folks,
> > > >
> > > > I'm looking at a use case that involves submitting potentially
> > hundreds
> > > of
> > > > jobs a second to our Mesos cluster. My tests show that the aurora
> > client
> > > is
> > > > taking 1-2 seconds for each job submission, and that I can run about
> > four
> > > > client processes in parallel before they peg the CPU at 100%. I need
> > more
> > > > throughput than this!
> > > >
> > > > Squashing jobs down to the Process or Task level doesn't really make
> > > sense
> > > > for our use case. I'm aware that with some shenanigans I can batch
> jobs
> > > > together using job instances, but that's a lot of work on my current
> > > > timeframe (and of questionable utility given that the jobs certainly
> > > won't
> > > > have identical resource requirements).
> > > >
> > > > What I really need is (at least) an order of magnitude speedup in
> terms
> > > of
> > > > being able to submit jobs to the Aurora scheduler (via the client or
> > > > otherwise).
> > > >
> > > > Conceptually it doesn't seem like adding a job to a queue should be a
> > > thing
> > > > that takes a couple of seconds, so I'm baffled as to why it's taking
> so
> > > > long. As an experiment, I wrapped the call to client.execute() in
> > > > client.py:proxy_main in cProfile and called aurora job create with a
> > very
> > > > simple test job.
> > > >
> > > > Results of the profile are in the Gist below:
> > > >
> > > > https://gist.github.com/helgridly/b37a0d27f04a37e72bb5
> > > >
> > > > Out of a 0.977s profile time, the two things that stick out to me
> are:
> > > >
> > > > 1. 0.526s spent in Pystachio for a job that doesn't use any templates
> > > > 2. 0.564s spent in create_job, presumably talking to the scheduler
> (and
> > > > setting up the machinery for doing so)
> > > >
> > > > I imagine I can sidestep #1 with a check for "{{" in the job file and
> > > > bypass Pystachio entirely. Can I also skip the Aurora client entirely
> > and
> > > > talk directly to the scheduler? If so what does that entail, and are
> > > there
> > > > any risks associated?
> > > >
> > > > Thanks,
> > > > -Hussein
> > > >
> > > > Hussein Elgridly
> > > > Senior Software Engineer, DSDE
> > > > The Broad Institute of MIT and Harvard
> > > >
> > >
> >
>

Re: Speeding up Aurora client job creation

Posted by Hussein Elgridly <hu...@broadinstitute.org>.
I'm not quite sure I understand your question, so I'll be painfully
explicit instead.

I don't want to use the existing Aurora client because it's slow (Pystachio
+ repeated HTTP connection overheads, as detailed earlier in this thread).
Instead, I want to use the Thrift interface to talk to the Aurora scheduler
directly (I can skip Pystachio entirely and keep the HTTP connection open).

I cannot use the official Thrift bindings for Python as they do not yet
support Python 3 [1]. There is a third-party, pure Python implementation of
Thrift that does support Python 3 called thriftpy [2]. However, thriftpy
does not include a THTTPClient transport, which is what the Aurora
scheduler uses. I will therefore have to write my own THTTPClient transport
(and probably contribute it back to thriftpy).

[1] https://issues.apache.org/jira/browse/THRIFT-1857
[2] https://github.com/eleme/thriftpy

Hussein Elgridly
Senior Software Engineer, DSDE
The Broad Institute of MIT and Harvard


On 16 March 2015 at 19:11, Erb, Stephan <St...@blue-yonder.com> wrote:

> Just to make sure I get this correctly: You say, you cannot use the
> existing python client because it is python 2.7 only so you want to write a
> new one in python 3?
>
> Regards,
> Stephan
> ________________________________________
> From: Hussein Elgridly <hu...@broadinstitute.org>
> Sent: Monday, March 16, 2015 11:44 PM
> To: dev@aurora.incubator.apache.org
> Subject: Re: Speeding up Aurora client job creation
>
> So this has now bubbled back to the top of my TODO list and I'm actively
> working on it. I am entirely new to Thrift so please forgive the newbie
> questions...
>
> I would like to talk to the Aurora scheduler directly from my (Python)
> application using Thrift. Since I'm on Python 3.4 I've had to use thriftpy:
> https://github.com/eleme/thriftpy
>
> As far as I can tell, the following should work (by default, thriftpy uses
> a TBufferedTransport around a TSocket):
>
> ---
> import thriftpy
> import thriftpy.rpc
>
> aurora_api = thriftpy.load("api.thrift")
>
> client = thriftpy.rpc.make_client(aurora_api.AuroraSchedulerManager,
> host="localhost", port=8081,
> proto_factory=thriftpy.protocol.TJSONProtocolFactory() )
>
> print(client.getJobSummary())
> ---
>
> Obviously I wouldn't be writing this email if it did work :) It hangs.
>
> I jumped into pdb and found it was sending the following payload:
>
> b'\x00\x00\x00\\{"metadata": {"name": "getJobSummary", "seqid": 0, "ttype":
> 1, "version": 1}, "payload": {}}'
>
> to a socket that looked like this:
>
> <socket.socket fd=3, family=AddressFamily.AF_INET, type=2049, proto=0,
> laddr=('<localhost's_private_ip>', 49167), raddr=('localhost's_private_ip',
> 8081)>
>
> ...but was waiting forever to receive any data. Adding a timeout just
> triggered the timeout.
>
> I'm stumped. Any clues?
>
>
> Hussein Elgridly
> Senior Software Engineer, DSDE
> The Broad Institute of MIT and Harvard
>
>
> On 12 February 2015 at 04:15, Erb, Stephan <St...@blue-yonder.com>
> wrote:
>
> > Hi Hussein,
> >
> > we also had slight performance problems when talking to Aurora. We ended
> > up using the existing python client directly in our code (see
> > apache.aurora.client.api.__init__.py). This allowed us to reuse the api
> > object and its scheduler connection, dropping a connection latency of
> about
> > 0.3-0.4 seconds per request.
> >
> > Best Regards,
> > Stephan
> > ________________________________________
> > From: Bill Farner <wf...@apache.org>
> > Sent: Wednesday, February 11, 2015 9:29 PM
> > To: dev@aurora.incubator.apache.org
> > Subject: Re: Speeding up Aurora client job creation
> >
> > To reduce that time you will indeed want to talk directly to the
> > scheduler.  This will definitely require you to roll up your sleeves a
> bit
> > and set up a thrift client to our api (based on api.thrift [1]), since
> you
> > will need to specify your tasks in a format that the thermos executor can
> > understand.  Turns out this is JSON data, so it should not be *too*
> > prohibitive.
> >
> > However, there is another technical limitation you will hit for the
> > submission rate you are after.  The scheduler is backed by a durable
> store
> > whose write latency is at minimum the amount of time required to fsync.
> >
> > [1]
> >
> >
> https://github.com/apache/incubator-aurora/blob/master/api/src/main/thrift/org/apache/aurora/gen/api.thrift
> >
> > -=Bill
> >
> > On Wed, Feb 11, 2015 at 11:46 AM, Hussein Elgridly <
> > hussein@broadinstitute.org> wrote:
> >
> > > Hi folks,
> > >
> > > I'm looking at a use case that involves submitting potentially
> hundreds
> > of
> > > jobs a second to our Mesos cluster. My tests show that the aurora
> client
> > is
> > > taking 1-2 seconds for each job submission, and that I can run about
> four
> > > client processes in parallel before they peg the CPU at 100%. I need
> more
> > > throughput than this!
> > >
> > > Squashing jobs down to the Process or Task level doesn't really make
> > sense
> > > for our use case. I'm aware that with some shenanigans I can batch jobs
> > > together using job instances, but that's a lot of work on my current
> > > timeframe (and of questionable utility given that the jobs certainly
> > won't
> > > have identical resource requirements).
> > >
> > > What I really need is (at least) an order of magnitude speedup in terms
> > of
> > > being able to submit jobs to the Aurora scheduler (via the client or
> > > otherwise).
> > >
> > > Conceptually it doesn't seem like adding a job to a queue should be a
> > thing
> > > that takes a couple of seconds, so I'm baffled as to why it's taking so
> > > long. As an experiment, I wrapped the call to client.execute() in
> > > client.py:proxy_main in cProfile and called aurora job create with a
> very
> > > simple test job.
> > >
> > > Results of the profile are in the Gist below:
> > >
> > > https://gist.github.com/helgridly/b37a0d27f04a37e72bb5
> > >
> > > Out of a 0.977s profile time, the two things that stick out to me are:
> > >
> > > 1. 0.526s spent in Pystachio for a job that doesn't use any templates
> > > 2. 0.564s spent in create_job, presumably talking to the scheduler (and
> > > setting up the machinery for doing so)
> > >
> > > I imagine I can sidestep #1 with a check for "{{" in the job file and
> > > bypass Pystachio entirely. Can I also skip the Aurora client entirely
> and
> > > talk directly to the scheduler? If so what does that entail, and are
> > there
> > > any risks associated?
> > >
> > > Thanks,
> > > -Hussein
> > >
> > > Hussein Elgridly
> > > Senior Software Engineer, DSDE
> > > The Broad Institute of MIT and Harvard
> > >
> >
>

Re: Speeding up Aurora client job creation

Posted by "Erb, Stephan" <St...@blue-yonder.com>.
Just to make sure I get this correctly: You say, you cannot use the existing python client because it is python 2.7 only so you want to write a new one in python 3?

Regards,
Stephan
________________________________________
From: Hussein Elgridly <hu...@broadinstitute.org>
Sent: Monday, March 16, 2015 11:44 PM
To: dev@aurora.incubator.apache.org
Subject: Re: Speeding up Aurora client job creation

So this has now bubbled back to the top of my TODO list and I'm actively
working on it. I am entirely new to Thrift so please forgive the newbie
questions...

I would like to talk to the Aurora scheduler directly from my (Python)
application using Thrift. Since I'm on Python 3.4 I've had to use thriftpy:
https://github.com/eleme/thriftpy

As far as I can tell, the following should work (by default, thriftpy uses
a TBufferedTransport around a TSocket):

---
import thriftpy
import thriftpy.rpc

aurora_api = thriftpy.load("api.thrift")

client = thriftpy.rpc.make_client(aurora_api.AuroraSchedulerManager,
host="localhost", port=8081,
proto_factory=thriftpy.protocol.TJSONProtocolFactory() )

print(client.getJobSummary())
---

Obviously I wouldn't be writing this email if it did work :) It hangs.

I jumped into pdb and found it was sending the following payload:

b'\x00\x00\x00\\{"metadata": {"name": "getJobSummary", "seqid": 0, "ttype":
1, "version": 1}, "payload": {}}'

to a socket that looked like this:

<socket.socket fd=3, family=AddressFamily.AF_INET, type=2049, proto=0,
laddr=('<localhost's_private_ip>', 49167), raddr=('localhost's_private_ip',
8081)>

...but was waiting forever to receive any data. Adding a timeout just
triggered the timeout.

I'm stumped. Any clues?


Hussein Elgridly
Senior Software Engineer, DSDE
The Broad Institute of MIT and Harvard


On 12 February 2015 at 04:15, Erb, Stephan <St...@blue-yonder.com>
wrote:

> Hi Hussein,
>
> we also had slight performance problems when talking to Aurora. We ended
> up using the existing python client directly in our code (see
> apache.aurora.client.api.__init__.py). This allowed us to reuse the api
> object and its scheduler connection, dropping a connection latency of about
> 0.3-0.4 seconds per request.
>
> Best Regards,
> Stephan
> ________________________________________
> From: Bill Farner <wf...@apache.org>
> Sent: Wednesday, February 11, 2015 9:29 PM
> To: dev@aurora.incubator.apache.org
> Subject: Re: Speeding up Aurora client job creation
>
> To reduce that time you will indeed want to talk directly to the
> scheduler.  This will definitely require you to roll up your sleeves a bit
> and set up a thrift client to our api (based on api.thrift [1]), since you
> will need to specify your tasks in a format that the thermos executor can
> understand.  Turns out this is JSON data, so it should not be *too*
> prohibitive.
>
> However, there is another technical limitation you will hit for the
> submission rate you are after.  The scheduler is backed by a durable store
> whose write latency is at minimum the amount of time required to fsync.
>
> [1]
>
> https://github.com/apache/incubator-aurora/blob/master/api/src/main/thrift/org/apache/aurora/gen/api.thrift
>
> -=Bill
>
> On Wed, Feb 11, 2015 at 11:46 AM, Hussein Elgridly <
> hussein@broadinstitute.org> wrote:
>
> > Hi folks,
> >
> > I'm looking at a use case that involves submitting potentially hundreds
> of
> > jobs a second to our Mesos cluster. My tests show that the aurora client
> is
> > taking 1-2 seconds for each job submission, and that I can run about four
> > client processes in parallel before they peg the CPU at 100%. I need more
> > throughput than this!
> >
> > Squashing jobs down to the Process or Task level doesn't really make
> sense
> > for our use case. I'm aware that with some shenanigans I can batch jobs
> > together using job instances, but that's a lot of work on my current
> > timeframe (and of questionable utility given that the jobs certainly
> won't
> > have identical resource requirements).
> >
> > What I really need is (at least) an order of magnitude speedup in terms
> of
> > being able to submit jobs to the Aurora scheduler (via the client or
> > otherwise).
> >
> > Conceptually it doesn't seem like adding a job to a queue should be a
> thing
> > that takes a couple of seconds, so I'm baffled as to why it's taking so
> > long. As an experiment, I wrapped the call to client.execute() in
> > client.py:proxy_main in cProfile and called aurora job create with a very
> > simple test job.
> >
> > Results of the profile are in the Gist below:
> >
> > https://gist.github.com/helgridly/b37a0d27f04a37e72bb5
> >
> > Out of a 0.977s profile time, the two things that stick out to me are:
> >
> > 1. 0.526s spent in Pystachio for a job that doesn't use any templates
> > 2. 0.564s spent in create_job, presumably talking to the scheduler (and
> > setting up the machinery for doing so)
> >
> > I imagine I can sidestep #1 with a check for "{{" in the job file and
> > bypass Pystachio entirely. Can I also skip the Aurora client entirely and
> > talk directly to the scheduler? If so what does that entail, and are
> there
> > any risks associated?
> >
> > Thanks,
> > -Hussein
> >
> > Hussein Elgridly
> > Senior Software Engineer, DSDE
> > The Broad Institute of MIT and Harvard
> >
>

Re: Speeding up Aurora client job creation

Posted by Hussein Elgridly <hu...@broadinstitute.org>.
I dug into TRequestsTransport and I get it now. Sending raw bytes across a
socket is not the same as doing an HTTP POST with said bytes stuffed in the
body!

I guess I too will be rolling my own HTTP transport...

Hussein Elgridly
Senior Software Engineer, DSDE
The Broad Institute of MIT and Harvard
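[Editor's note: the distinction above — raw framed bytes on a socket versus the same bytes as an HTTP POST body — is exactly what a hand-rolled transport has to bridge. The sketch below is a minimal, hypothetical illustration using only the Python standard library; the class name, the /api path, and the application/x-thrift content type are assumptions rather than Aurora's documented interface, and a real implementation should mirror whatever Aurora's own TRequestsTransport does.]

```python
from io import BytesIO
from http.client import HTTPConnection


class THttpTransport:
    """Sketch of a Thrift-over-HTTP client transport: buffer every
    write(), then send the whole serialized message as one HTTP POST
    on flush() and expose the response body to the protocol layer.
    Endpoint path and content type are assumptions."""

    def __init__(self, host, port, path="/api"):
        self.host = host
        self.port = port
        self.path = path
        self._wbuf = BytesIO()  # outgoing serialized request
        self._rbuf = BytesIO()  # response bytes for the protocol to read
        self._conn = None       # kept open across requests

    def is_open(self):
        return True

    def write(self, data):
        self._wbuf.write(data)

    def read(self, size):
        return self._rbuf.read(size)

    def flush(self):
        body = self._wbuf.getvalue()
        self._wbuf = BytesIO()
        if self._conn is None:
            # Reuse one connection for all submissions, avoiding the
            # per-request setup latency discussed earlier in the thread.
            self._conn = HTTPConnection(self.host, self.port)
        self._conn.request("POST", self.path, body,
                           {"Content-Type": "application/x-thrift"})
        self._rbuf = BytesIO(self._conn.getresponse().read())
```

Plugging an object like this in where thriftpy would otherwise build a TSocket keeps the protocol layer unchanged; only the byte shipping differs.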


On 16 March 2015 at 18:44, Hussein Elgridly <hu...@broadinstitute.org>
wrote:

> So this has now bubbled back to the top of my TODO list and I'm actively
> working on it. I am entirely new to Thrift so please forgive the newbie
> questions...
>
> I would like to talk to the Aurora scheduler directly from my (Python)
> application using Thrift. Since I'm on Python 3.4 I've had to use thriftpy:
> https://github.com/eleme/thriftpy
>
> As far as I can tell, the following should work (by default, thriftpy uses
> a TBufferedTransport around a TSocket):
>
> ---
> import thriftpy
> import thriftpy.rpc
>
> aurora_api = thriftpy.load("api.thrift")
>
> client = thriftpy.rpc.make_client(aurora_api.AuroraSchedulerManager,
> host="localhost", port=8081,
> proto_factory=thriftpy.protocol.TJSONProtocolFactory() )
>
> print(client.getJobSummary())
> ---
>
> Obviously I wouldn't be writing this email if it did work :) It hangs.
>
> I jumped into pdb and found it was sending the following payload:
>
> b'\x00\x00\x00\\{"metadata": {"name": "getJobSummary", "seqid": 0,
> "ttype": 1, "version": 1}, "payload": {}}'
>
> to a socket that looked like this:
>
> <socket.socket fd=3, family=AddressFamily.AF_INET, type=2049, proto=0,
> laddr=('<localhost's_private_ip>', 49167), raddr=('localhost's_private_ip',
> 8081)>
>
> ...but was waiting forever to receive any data. Adding a timeout just
> triggered the timeout.
>
> I'm stumped. Any clues?
>
>
> Hussein Elgridly
> Senior Software Engineer, DSDE
> The Broad Institute of MIT and Harvard
>
>
> On 12 February 2015 at 04:15, Erb, Stephan <St...@blue-yonder.com>
> wrote:
>
>> Hi Hussein,
>>
>> we also had slight performance problems when talking to Aurora. We ended
>> up using the existing python client directly in our code (see
>> apache.aurora.client.api.__init__.py). This allowed us to reuse the api
>> object and its scheduler connection, dropping a connection latency of about
>> 0.3-0.4 seconds per request.
>>
>> Best Regards,
>> Stephan
>> ________________________________________
>> From: Bill Farner <wf...@apache.org>
>> Sent: Wednesday, February 11, 2015 9:29 PM
>> To: dev@aurora.incubator.apache.org
>> Subject: Re: Speeding up Aurora client job creation
>>
>> To reduce that time you will indeed want to talk directly to the
>> scheduler.  This will definitely require you to roll up your sleeves a bit
>> and set up a thrift client to our api (based on api.thrift [1]), since you
>> will need to specify your tasks in a format that the thermos executor can
>> understand.  Turns out this is JSON data, so it should not be *too*
>> prohibitive.
>>
>> However, there is another technical limitation you will hit for the
>> submission rate you are after.  The scheduler is backed by a durable store
>> whose write latency is at minimum the amount of time required to fsync.
>>
>> [1]
>>
>> https://github.com/apache/incubator-aurora/blob/master/api/src/main/thrift/org/apache/aurora/gen/api.thrift
>>
>> -=Bill
>>
>> On Wed, Feb 11, 2015 at 11:46 AM, Hussein Elgridly <
>> hussein@broadinstitute.org> wrote:
>>
>> > Hi folks,
>> >
>> > I'm looking at a use case that involves submitting potentially
>> hundreds of
>> > jobs a second to our Mesos cluster. My tests show that the aurora
>> client is
>> > taking 1-2 seconds for each job submission, and that I can run about
>> four
>> > client processes in parallel before they peg the CPU at 100%. I need
>> more
>> > throughput than this!
>> >
>> > Squashing jobs down to the Process or Task level doesn't really make
>> sense
>> > for our use case. I'm aware that with some shenanigans I can batch jobs
>> > together using job instances, but that's a lot of work on my current
>> > timeframe (and of questionable utility given that the jobs certainly
>> won't
>> > have identical resource requirements).
>> >
>> > What I really need is (at least) an order of magnitude speedup in terms
>> of
>> > being able to submit jobs to the Aurora scheduler (via the client or
>> > otherwise).
>> >
>> > Conceptually it doesn't seem like adding a job to a queue should be a
>> thing
>> > that takes a couple of seconds, so I'm baffled as to why it's taking so
>> > long. As an experiment, I wrapped the call to client.execute() in
>> > client.py:proxy_main in cProfile and called aurora job create with a
>> very
>> > simple test job.
>> >
>> > Results of the profile are in the Gist below:
>> >
>> > https://gist.github.com/helgridly/b37a0d27f04a37e72bb5
>> >
>> > Out of a 0.977s profile time, the two things that stick out to me are:
>> >
>> > 1. 0.526s spent in Pystachio for a job that doesn't use any templates
>> > 2. 0.564s spent in create_job, presumably talking to the scheduler (and
>> > setting up the machinery for doing so)
>> >
>> > I imagine I can sidestep #1 with a check for "{{" in the job file and
>> > bypass Pystachio entirely. Can I also skip the Aurora client entirely
>> and
>> > talk directly to the scheduler? If so what does that entail, and are
>> there
>> > any risks associated?
>> >
>> > Thanks,
>> > -Hussein
>> >
>> > Hussein Elgridly
>> > Senior Software Engineer, DSDE
>> > The Broad Institute of MIT and Harvard
>> >
>>
>
>

Re: Speeding up Aurora client job creation

Posted by Hussein Elgridly <hu...@broadinstitute.org>.
So this has now bubbled back to the top of my TODO list and I'm actively
working on it. I am entirely new to Thrift so please forgive the newbie
questions...

I would like to talk to the Aurora scheduler directly from my (Python)
application using Thrift. Since I'm on Python 3.4 I've had to use thriftpy:
https://github.com/eleme/thriftpy

As far as I can tell, the following should work (by default, thriftpy uses
a TBufferedTransport around a TSocket):

---
import thriftpy
import thriftpy.rpc

aurora_api = thriftpy.load("api.thrift")

client = thriftpy.rpc.make_client(aurora_api.AuroraSchedulerManager,
host="localhost", port=8081,
proto_factory=thriftpy.protocol.TJSONProtocolFactory() )

print(client.getJobSummary())
---

Obviously I wouldn't be writing this email if it did work :) It hangs.

I jumped into pdb and found it was sending the following payload:

b'\x00\x00\x00\\{"metadata": {"name": "getJobSummary", "seqid": 0, "ttype":
1, "version": 1}, "payload": {}}'

to a socket that looked like this:

<socket.socket fd=3, family=AddressFamily.AF_INET, type=2049, proto=0,
laddr=('<localhost's_private_ip>', 49167), raddr=('localhost's_private_ip',
8081)>

...but was waiting forever to receive any data. Adding a timeout just
triggered the timeout.

I'm stumped. Any clues?


Hussein Elgridly
Senior Software Engineer, DSDE
The Broad Institute of MIT and Harvard


On 12 February 2015 at 04:15, Erb, Stephan <St...@blue-yonder.com>
wrote:

> Hi Hussein,
>
> we also had slight performance problems when talking to Aurora. We ended
> up using the existing python client directly in our code (see
> apache.aurora.client.api.__init__.py). This allowed us to reuse the api
> object and its scheduler connection, dropping a connection latency of about
> 0.3-0.4 seconds per request.
>
> Best Regards,
> Stephan
> ________________________________________
> From: Bill Farner <wf...@apache.org>
> Sent: Wednesday, February 11, 2015 9:29 PM
> To: dev@aurora.incubator.apache.org
> Subject: Re: Speeding up Aurora client job creation
>
> To reduce that time you will indeed want to talk directly to the
> scheduler.  This will definitely require you to roll up your sleeves a bit
> and set up a thrift client to our api (based on api.thrift [1]), since you
> will need to specify your tasks in a format that the thermos executor can
> understand.  Turns out this is JSON data, so it should not be *too*
> prohibitive.
>
> However, there is another technical limitation you will hit for the
> submission rate you are after.  The scheduler is backed by a durable store
> whose write latency is at minimum the amount of time required to fsync.
>
> [1]
>
> https://github.com/apache/incubator-aurora/blob/master/api/src/main/thrift/org/apache/aurora/gen/api.thrift
>
> -=Bill
>
> On Wed, Feb 11, 2015 at 11:46 AM, Hussein Elgridly <
> hussein@broadinstitute.org> wrote:
>
> > Hi folks,
> >
> > I'm looking at a use case that involves submitting potentially hundreds
> of
> > jobs a second to our Mesos cluster. My tests show that the aurora client
> is
> > taking 1-2 seconds for each job submission, and that I can run about four
> > client processes in parallel before they peg the CPU at 100%. I need more
> > throughput than this!
> >
> > Squashing jobs down to the Process or Task level doesn't really make
> sense
> > for our use case. I'm aware that with some shenanigans I can batch jobs
> > together using job instances, but that's a lot of work on my current
> > timeframe (and of questionable utility given that the jobs certainly
> won't
> > have identical resource requirements).
> >
> > What I really need is (at least) an order of magnitude speedup in terms
> of
> > being able to submit jobs to the Aurora scheduler (via the client or
> > otherwise).
> >
> > Conceptually it doesn't seem like adding a job to a queue should be a
> thing
> > that takes a couple of seconds, so I'm baffled as to why it's taking so
> > long. As an experiment, I wrapped the call to client.execute() in
> > client.py:proxy_main in cProfile and called aurora job create with a very
> > simple test job.
> >
> > Results of the profile are in the Gist below:
> >
> > https://gist.github.com/helgridly/b37a0d27f04a37e72bb5
> >
> > Out of a 0.977s profile time, the two things that stick out to me are:
> >
> > 1. 0.526s spent in Pystachio for a job that doesn't use any templates
> > 2. 0.564s spent in create_job, presumably talking to the scheduler (and
> > setting up the machinery for doing so)
> >
> > I imagine I can sidestep #1 with a check for "{{" in the job file and
> > bypass Pystachio entirely. Can I also skip the Aurora client entirely and
> > talk directly to the scheduler? If so what does that entail, and are
> there
> > any risks associated?
> >
> > Thanks,
> > -Hussein
> >
> > Hussein Elgridly
> > Senior Software Engineer, DSDE
> > The Broad Institute of MIT and Harvard
> >
>

Re: Speeding up Aurora client job creation

Posted by "Erb, Stephan" <St...@blue-yonder.com>.
Hi Hussein,

we also had slight performance problems when talking to Aurora. We ended up using the existing python client directly in our code (see apache.aurora.client.api.__init__.py). This allowed us to reuse the api object and its scheduler connection, dropping a connection latency of about 0.3-0.4 seconds per request.

Best Regards,
Stephan
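[Editor's note: Stephan's suggestion amounts to amortizing connection setup — construct the API object once and reuse it for every submission. The sketch below illustrates only the caching pattern, with a placeholder class standing in for the real client; the actual entry point is apache.aurora.client.api, whose constructor arguments are not shown here.]

```python
import functools


class SchedulerClient:
    """Placeholder for the real apache.aurora.client.api client, whose
    construction opens the scheduler connection -- the ~0.3-0.4 s cost
    measured per request above."""

    connections_opened = 0  # visible proof that setup runs only once

    def __init__(self, cluster):
        self.cluster = cluster
        SchedulerClient.connections_opened += 1  # expensive setup here

    def create_job(self, config):
        return "OK"  # the real client would issue a createJob RPC


@functools.lru_cache(maxsize=None)
def get_client(cluster):
    # One client per cluster, shared by all submissions in the process.
    return SchedulerClient(cluster)


def submit_all(cluster, configs):
    client = get_client(cluster)  # no per-job connection setup
    return [client.create_job(c) for c in configs]
```

With this shape, submitting a hundred jobs pays the connection cost once instead of a hundred times.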
________________________________________
From: Bill Farner <wf...@apache.org>
Sent: Wednesday, February 11, 2015 9:29 PM
To: dev@aurora.incubator.apache.org
Subject: Re: Speeding up Aurora client job creation

To reduce that time you will indeed want to talk directly to the
scheduler.  This will definitely require you to roll up your sleeves a bit
and set up a thrift client to our api (based on api.thrift [1]), since you
will need to specify your tasks in a format that the thermos executor can
understand.  Turns out this is JSON data, so it should not be *too*
prohibitive.

However, there is another technical limitation you will hit for the
submission rate you are after.  The scheduler is backed by a durable store
whose write latency is at minimum the amount of time required to fsync.

[1]
https://github.com/apache/incubator-aurora/blob/master/api/src/main/thrift/org/apache/aurora/gen/api.thrift

-=Bill

On Wed, Feb 11, 2015 at 11:46 AM, Hussein Elgridly <
hussein@broadinstitute.org> wrote:

> Hi folks,
>
> I'm looking at a use case that involves submitting potentially hundreds of
> jobs a second to our Mesos cluster. My tests show that the aurora client is
> taking 1-2 seconds for each job submission, and that I can run about four
> client processes in parallel before they peg the CPU at 100%. I need more
> throughput than this!
>
> Squashing jobs down to the Process or Task level doesn't really make sense
> for our use case. I'm aware that with some shenanigans I can batch jobs
> together using job instances, but that's a lot of work on my current
> timeframe (and of questionable utility given that the jobs certainly won't
> have identical resource requirements).
>
> What I really need is (at least) an order of magnitude speedup in terms of
> being able to submit jobs to the Aurora scheduler (via the client or
> otherwise).
>
> Conceptually it doesn't seem like adding a job to a queue should be a thing
> that takes a couple of seconds, so I'm baffled as to why it's taking so
> long. As an experiment, I wrapped the call to client.execute() in
> client.py:proxy_main in cProfile and called aurora job create with a very
> simple test job.
>
> Results of the profile are in the Gist below:
>
> https://gist.github.com/helgridly/b37a0d27f04a37e72bb5
>
> Out of a 0.977s profile time, the two things that stick out to me are:
>
> 1. 0.526s spent in Pystachio for a job that doesn't use any templates
> 2. 0.564s spent in create_job, presumably talking to the scheduler (and
> setting up the machinery for doing so)
>
> I imagine I can sidestep #1 with a check for "{{" in the job file and
> bypass Pystachio entirely. Can I also skip the Aurora client entirely and
> talk directly to the scheduler? If so what does that entail, and are there
> any risks associated?
>
> Thanks,
> -Hussein
>
> Hussein Elgridly
> Senior Software Engineer, DSDE
> The Broad Institute of MIT and Harvard
>

Re: Speeding up Aurora client job creation

Posted by Brian Wickman <wi...@apache.org>.
To echo what Bill said, pro tip:
Pystachio objects have .json_dump and .json_dumps methods that serialize to
file objects and strings respectively.  Analogously, they have classmethods
.json_load and .json_loads that deserialize this data (these behave
exactly like the Python json module's methods.)  So if you have an .aurora
config, you can serialize it to JSON and roll your own client in the way
Bill mentions.  Similarly, for those who want to generate configs
programmatically: use Job.json_load in a boilerplate .aurora config and run
the aurora client as usual (at an invocation cost penalty.)  That being said,
a lot of this effort will be obviated by a REST API.
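
Since those methods mirror the stdlib json module, the round-trip pattern
looks like this stdlib sketch (the config dict is a made-up placeholder; with
a real pystachio Job you would call job.json_dumps() and Job.json_loads()
instead):

```python
import json

# Placeholder for a job config; pystachio's json_dumps/json_loads behave
# like json.dumps/json.loads applied to the object's own fields.
config = {"name": "hello_world", "instances": 1, "task": {"processes": []}}

serialized = json.dumps(config)    # analogous to job.json_dumps()
restored = json.loads(serialized)  # analogous to Job.json_loads(serialized)

assert restored == config
print(restored["name"])  # -> hello_world
```

Generating these JSON blobs directly is what lets a hand-rolled client skip
the .aurora evaluation step entirely.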

