Posted to dev@uima.apache.org by Yi-Wen Liu <yi...@usc.edu> on 2015/11/29 01:23:09 UTC

lower preprocessing time

Hi,

I am using DUCC to process text files (cTAKES), and one of my inputs is quite
short, about 10 lines.
But it takes more than two minutes to process, as follows:
After submitting,
00:00-00:08 > no status
00:09-00:30 > waiting for driver
00:31-01:00 > waiting for resources
01:01-02:00 > initializing
02:01-02:30 > completing
02:31 > completed

Is there any way to lower the preprocessing time? (Time to wait for driver,
resources, initializing...)

I am wondering why it takes so long to complete. I have tried different
parameter values, for example a lower initialization time and lower
resource requirements, but saw little improvement.

Here are the parameters I am using now:

process_memory_size 2
process_jvm_args -Xmx4g
driver_jvm_args -Xmx4g
process_thread_count 2
process_per_item_time_max 5
process_deployments_max 999
environment AE_INIT_TIME=5 AE_INIT_RANGE=5 INIT_ERROR=0

Any suggestion is appreciated.

Thanks,
Yi-Wen

Re: lower preprocessing time

Posted by Lou DeGenaro <lo...@gmail.com>.
I have now tried the suggestion from my previous append.

My first recommendation is that one should not change these values
(especially for a production system) away from the defaults, as some of my
co-committers have reminded me offline.

On my *test* system I did modify site.ducc.properties and the system ran
1.job just fine.  I did not examine resource consumption (CPU), though I
am sure it had to be higher to support the increased communications and
scheduling overhead.  And remember that 1.job is a "fake" job - the work
items only sleep, so there is no competition for CPU.  Also, on the System
Daemons page, the ResourceManager showed as "down" every once in a while
(even though it was really up) because its minimum publish rate is 5
seconds.

My second recommendation is to do one of the following instead:

1. submit Jobs with more than 1 work item
2. re-imagine your Job as a Service
3. Use all-in-one local
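
As a rough illustration of option 1, here is a minimal Python sketch of how
the startup latency is amortized over more work items.  It assumes the 02:31
(151 seconds) reported in the original post is almost entirely fixed overhead
that does not grow with the number of work items; that is an assumption, not
a measurement.  The function name is hypothetical.

```python
# Sketch only: assumes the job's startup/teardown latency is a fixed cost
# that is independent of the number of work items (an assumption).

FIXED_OVERHEAD_S = 151  # 02:31 total elapsed time from the original post


def overhead_per_item_s(fixed_overhead_s, work_items):
    """Return the share of the fixed overhead carried by each work item."""
    if work_items < 1:
        raise ValueError("a job needs at least one work item")
    return fixed_overhead_s / work_items


if __name__ == "__main__":
    for n in (1, 10, 100):
        share = overhead_per_item_s(FIXED_OVERHEAD_S, n)
        print(f"{n:4d} work item(s): {share:.2f} s of overhead per item")
```

Under that assumption the per-item cost of the latency shrinks in proportion
to the number of work items, which is why a single-item Job looks so slow.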

Lou.




Re: lower preprocessing time

Posted by Lou DeGenaro <lo...@gmail.com>.
Yi-Wen,

The latency you are experiencing is by design for a large-ish computing
cluster.  The normal life cycle for a Job is:

Received
WaitingForDriver
WaitingForResources
Assigned
Initializing
Running
Completing
Completed

There are some knobs you can turn to tune for your situation.

1. DUCC inter-daemon communications - states affected: All

DUCC is implemented as a small collection of daemons that communicate with
each other at discrete publishing intervals.  The publishing intervals are
configured in $DUCC_HOME/resources/ducc.properties.  The default interval
values are on the order of 15-60 seconds.  At the cost of more chatter
between daemons on the network, you can try lowering some of these values.

These times are the current default ones and are specified in milliseconds:

ducc.jd.state.publish.rate=15000
ducc.orchestrator.state.publish.rate=10000
ducc.pm.state.publish.rate=15000

I have not tried this myself, but perhaps try lowering them to:

ducc.jd.state.publish.rate=2000
ducc.orchestrator.state.publish.rate=1000
ducc.pm.state.publish.rate=1000
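
To get a feel for the effect, here is a minimal Python sketch of a deliberately
simplified model: it assumes a state change has to wait, in the worst case,
one full publish interval at each of the three daemons listed above.  That
model is an assumption for illustration and not DUCC's actual algorithm; the
function name and the "hops" parameter are hypothetical.  The property names
and values are the ones quoted above.

```python
# Sketch only: a simplified worst-case model, NOT how DUCC is implemented.
# Assumption: each state change may wait one full interval per daemon.

DEFAULT_RATES_MS = {
    "ducc.jd.state.publish.rate": 15000,
    "ducc.orchestrator.state.publish.rate": 10000,
    "ducc.pm.state.publish.rate": 15000,
}

LOWERED_RATES_MS = {
    "ducc.jd.state.publish.rate": 2000,
    "ducc.orchestrator.state.publish.rate": 1000,
    "ducc.pm.state.publish.rate": 1000,
}


def worst_case_wait_ms(rates_ms, hops=1):
    """Upper bound on publish-interval waiting for `hops` state changes."""
    if hops < 0:
        raise ValueError("hops must be non-negative")
    return hops * sum(rates_ms.values())


if __name__ == "__main__":
    for label, rates in (("default", DEFAULT_RATES_MS),
                         ("lowered", LOWERED_RATES_MS)):
        print(f"{label}: up to {worst_case_wait_ms(rates) / 1000:.0f} s "
              "of waiting per state change")
```

In this model the lowered values cut the worst-case wait per state change by
a factor of ten, at the cost of the extra network chatter mentioned above.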

2. DUCC scheduling - state affected: WaitingForResources

The DUCC scheduler does not do continuous resource management, but rather
calculates a desired allocation at discrete intervals.  After each
scheduling cycle, the scheduler publishes its layout for the other daemons
to implement.  By default, the scheduler is doing this calculation and
publication whenever it receives an orchestrator.state publication:

ducc.rm.state.publish.ratio = 1

This seems fine as is.

3. DUCC deployment of Job - states affected: WaitingForDriver, Initializing

Once a Job is accepted, the Job Driver [your CollectionReader] and one or
more Job Processes [your AnalysisEngine] must be launched.

The partial sequence of states here is:

WaitingForDriver: The Job Driver is launched, and not until it reports that
it is ready to produce work items will the next state (WaitingForResources)
occur
...
Initializing: A Job Process is launched, and not until it has completed
initialization of all threads will it ask the Job Driver for the first work
item
Running: The first work item has been dispatched

Minimizing the time for your CR to initialize will help make the transition
from WaitingForDriver to WaitingForResources faster.
Minimizing the time for your AE to initialize will help make the transition
from Initializing to Running faster.

Hope this helps.

Lou.


Re: lower preprocessing time

Posted by Yi-Wen Liu <yi...@usc.edu>.
Hi,

Thanks for the reply, and yes, I only have a single work item.

Thanks,
Yi-Wen


Re: lower preprocessing time

Posted by Eddie Epstein <ea...@gmail.com>.
Hi,

Yes, there are some site.ducc.properties entries that will speed up the
timing. Will respond with those tomorrow.
Are you often running jobs with only a single work item?

Eddie
