Posted to user@uima.apache.org by Eddie Epstein <ea...@gmail.com> on 2020/05/18 12:47:41 UTC

Re: UIMA DUCC slow processing

Hi,

Removing the AE from the pipeline was a good idea to help isolate the
bottleneck. The other two most likely possibilities are the collection
reader pulling from Elasticsearch or the CAS consumer writing the
processing output.

DUCC Jobs are a simple way to scale out compute bottlenecks across a
cluster. Scaleout may be of limited or no value for I/O-bound jobs.
Please give a more complete picture of the processing scenario on DUCC.

Regards,
Eddie


On Sat, May 16, 2020 at 1:29 AM Raja Muhammad Suleman <
Sulemanr@edgehill.ac.uk> wrote:

> Hi,
> I've been trying to run a very small UIMA DUCC cluster with 2 slave nodes
> having 32GB of RAM each. I wrote a custom Collection Reader to read data
> from an Elasticsearch index and dump it into a new index after certain
> analysis engine processing. The Analysis Engine is a simple sentiment
> analysis code. The performance I'm getting is very slow as it is only able
> to process ~150 documents/minute.
> To test the performance without the analysis engine, I removed the AE from
> the pipeline but still I did not get any improvement in the processing
> speeds. Can you please guide me as to where I might be going wrong or what
> I can do to improve the processing speeds?
>
> Thank you.

Re: UIMA DUCC slow processing

Posted by Lou DeGenaro <lo...@gmail.com>.
Have you examined various DUCC and system log files for possible clues?

On Fri, Jun 12, 2020 at 1:16 PM Dr. Raja M. Suleman <
raja.m.sulaiman@gmail.com> wrote:


Re: UIMA DUCC slow processing

Posted by Eddie Epstein <ea...@gmail.com>.
The time sequence of a DUCC job is as follows:
1. The JobDriver is started and the CR's init method is called.
2. When CR init completes successfully, one or more JobProcesses are started
and the aggregate pipeline's init method is called in each.
3. If the first pipeline init to complete is successful, the DUCC job status
changes to RUNNING.

The Processes tab on the job details page shows the init times for the JD
(JobDriver) and each of the JobProcesses. The ducc.log file on the Files
tab gives timestamps for job state changes.

Reported initialization times correspond to the init() method calls of the
UIMA components. Is the initialization delay in the CR init, or the
JobProcess init? Anything interesting in the logfiles for those components?

Normally the number of tasks should match the number of workitems. These
can be quite different if the JobProcess is using a custom UIMA-AS
asynchronous threading model. What do you see on the Work Items tab?

For debugging, DUCC's --all_in_one option allows running all the components,
CR + CM + AE + CC, in a single thread in the same process. I'd suggest that
for the CasConsumer issue. If that works, and if you are running multiple
pipelines, then there is likely a thread-safety issue involving the
Elasticsearch API.
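
If it helps, an all_in_one run can be requested in the job specification
itself. A rough sketch only: the descriptor names below are placeholders
for your own, and the option names should be double-checked against the
DuccBook before use.

# sentiment.job -- illustrative job specification (names are placeholders)
driver_descriptor_CR   = descriptors/ElasticsearchCollectionReader.xml
process_descriptor_CM  = descriptors/ReviewCasMultiplier.xml
process_descriptor_AE  = descriptors/SentimentAE.xml
process_descriptor_CC  = descriptors/ElasticsearchCasConsumer.xml
process_memory_size    = 4
all_in_one             = local

# submitted with:  $DUCC_HOME/bin/ducc_submit --specification sentiment.job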

Eddie

On Mon, Jun 15, 2020 at 1:30 AM Dr. Raja M. Suleman <
raja.m.sulaiman@gmail.com> wrote:


Re: UIMA DUCC slow processing

Posted by "Dr. Raja M. Suleman" <ra...@gmail.com>.
Thank you very much for your response.

Actually, I am working on a project that will require horizontal scaling,
so I am focused on DUCC at the moment. My original query was about a job I
had created which was giving me low throughput. The pipeline for this job
looks like this:

   1. A CollectionReader connects to an Elasticsearch server, reads ids
   from an index, and puts *1* id in each workitem, which is then passed to
   the CasMultiplier.
   2. The CasMultiplier uses the 'id' in each workitem to get the 'document
   text' from the Elasticsearch index. Each document text is a short review
   (1 - 20 lines) of English. In the overridden 'next()' method I create an
   empty JCas object, add the document text and other details related to the
   review to the DocumentInfo(newcas), and return the JCas object (a sketch
   of this method follows the list).
   3. My AnalysisEngine runs sentiment analysis on the document text.
   Sentiment analysis is a computationally expensive operation, especially
   for longer reviews.
   4. Finally, my CasConsumer writes each DocumentInfo object into an
   Elasticsearch index.
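
For reference, that next() method is roughly the following (a trimmed
sketch: DocumentInfo is my own JCas type and its setters are omitted, and
fetchReviewText() stands in for the real Elasticsearch 'get by id' call):

import org.apache.uima.analysis_component.JCasMultiplier_ImplBase;
import org.apache.uima.analysis_engine.AnalysisEngineProcessException;
import org.apache.uima.cas.AbstractCas;
import org.apache.uima.jcas.JCas;

public class ReviewCasMultiplier extends JCasMultiplier_ImplBase {

  private String pendingId;   // id carried by the incoming workitem CAS

  @Override
  public void process(JCas workitem) throws AnalysisEngineProcessException {
    // each workitem CAS carries a single Elasticsearch document id
    // (here assumed to be the workitem's document text)
    pendingId = workitem.getDocumentText();
  }

  @Override
  public boolean hasNext() {
    return pendingId != null;
  }

  @Override
  public AbstractCas next() throws AnalysisEngineProcessException {
    JCas newcas = getEmptyJCas();                  // from JCasMultiplier_ImplBase
    newcas.setDocumentText(fetchReviewText(pendingId));
    DocumentInfo info = new DocumentInfo(newcas);  // my own type; review
    info.addToIndexes();                           // metadata setters omitted
    pendingId = null;
    return newcas;
  }

  private String fetchReviewText(String id) {
    return "...";   // placeholder for the Elasticsearch GET by id
  }
}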


A few things I noticed while running this job; I would be grateful for your
comments on them:

   1. The job's initialization time increases sharply with the number of
   documents in the index. I'm using the Elasticsearch scroll API, which
   returns all the document ids within milliseconds, yet the DUCC job takes
   a long time to start running (~35 minutes for 100k documents), and the
   delay seems to grow exponentially with the number of records. Is this
   due to the new CASes being generated for each document in the
   CollectionReader?
   2. While checking the Performance tab of a job in the webserver UI, I
   noticed that under the "Tasks" column, the number of tasks for all the
   components except the AnalysisEngine (AE) is twice the number of
   documents processed, e.g. if the job has processed 100 documents, it
   will show 200 tasks for all components and 100 for the AE component.
   3. In the CasConsumer, I tried to use the BulkProcessor provided by the
   Elasticsearch Java API, which sends bulk indexing requests
   asynchronously. However, the asynchronous calls never seemed to register
   and the CasConsumer would return without writing anything to the
   Elasticsearch index. I checked the job logs and couldn't find any error
   messages.
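
Roughly, the indexing part of the CasConsumer follows this pattern (a
simplified sketch against the 7.x-style high-level REST client; the index
name and field mapping are placeholders):

import java.util.Map;

import org.apache.uima.analysis_component.JCasAnnotator_ImplBase;
import org.apache.uima.jcas.JCas;
import org.elasticsearch.action.bulk.BulkProcessor;
import org.elasticsearch.action.bulk.BulkRequest;
import org.elasticsearch.action.bulk.BulkResponse;
import org.elasticsearch.action.index.IndexRequest;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestHighLevelClient;

public class ElasticsearchCasConsumer extends JCasAnnotator_ImplBase {

  private RestHighLevelClient client;   // created in initialize(), not shown
  private BulkProcessor bulk;           // also built in initialize()

  private BulkProcessor buildBulkProcessor() {
    // the listener callbacks are where bulk failures would surface;
    // leaving them empty (as here) hides any errors
    return BulkProcessor.builder(
        (request, listener) -> client.bulkAsync(request, RequestOptions.DEFAULT, listener),
        new BulkProcessor.Listener() {
          @Override public void beforeBulk(long id, BulkRequest req) { }
          @Override public void afterBulk(long id, BulkRequest req, BulkResponse resp) { }
          @Override public void afterBulk(long id, BulkRequest req, Throwable failure) { }
        })
        .setBulkActions(500)
        .build();
  }

  @Override
  public void process(JCas jcas) {
    Map<String, Object> doc = Map.of("text", jcas.getDocumentText());
    // requests are only buffered here and sent asynchronously later
    bulk.add(new IndexRequest("reviews_out").source(doc));
  }

  @Override
  public void collectionProcessComplete() {
    // nothing waits for the buffered requests, so the process can end
    // before any bulk request is actually sent; my understanding is that
    // bulk.flush() or bulk.awaitClose(...) would be needed here
  }
}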

I'm sorry for another long message and I truly am grateful to you for your
kind guidance.

Thank you very much.

On Mon, 15 Jun 2020, 00:34 Eddie Epstein, <ea...@gmail.com> wrote:


Re: UIMA DUCC slow processing

Posted by Eddie Epstein <ea...@gmail.com>.
I forgot to add: if your application does not require horizontal scale-out
to many CPUs on multiple machines, UIMA has a vertical scale-out tool, the
CPE, which can run multiple pipeline threads on a single machine.
More information is at
http://uima.apache.org/d/uimaj-current/tutorials_and_users_guides.html#ugr.tug.cpe
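
As a sketch of what driving a CPE looks like (following the SimpleRunCPE
example in uimaj-examples; the descriptor path is a placeholder, and the
StatusCallbackListener normally registered to track progress is omitted):

import org.apache.uima.UIMAFramework;
import org.apache.uima.collection.CollectionProcessingEngine;
import org.apache.uima.collection.metadata.CpeDescription;
import org.apache.uima.util.XMLInputSource;

public class RunSentimentCpe {
  public static void main(String[] args) throws Exception {
    // the CPE descriptor wires the CR, AE and CC together; its
    // processingUnitThreadCount setting controls how many pipeline
    // threads run inside this single JVM
    CpeDescription cpeDesc = UIMAFramework.getXMLParser()
        .parseCpeDescription(new XMLInputSource("desc/SentimentCPE.xml"));

    CollectionProcessingEngine cpe =
        UIMAFramework.produceCollectionProcessingEngine(cpeDesc);

    // processing is asynchronous; a StatusCallbackListener would normally
    // be added before this call to learn when the collection is complete
    cpe.process();
  }
}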




On Sun, Jun 14, 2020 at 7:06 PM Eddie Epstein <ea...@gmail.com> wrote:


Re: UIMA DUCC slow processing

Posted by Eddie Epstein <ea...@gmail.com>.
In this case the problem is not DUCC; rather, it is the high overhead of
opening small files and sending them to a remote computer individually. I/O
works much more efficiently with larger blocks of data. Many small files
can be merged into larger files using zip archives. DUCC sample code shows
how to do this for CASes, and very similar code could be used for input
documents as well.
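
As a sketch of the idea (plain java.util.zip; the file and entry names are
made up), a workitem could name one archive, and the component reading that
workitem would walk its entries, one document per entry:

import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Enumeration;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

public class ZipWorkitemExample {
  public static void main(String[] args) throws IOException {
    // one workitem = one archive holding many small documents
    try (ZipFile zip = new ZipFile("workitem-0001.zip")) {
      Enumeration<? extends ZipEntry> entries = zip.entries();
      while (entries.hasMoreElements()) {
        ZipEntry entry = entries.nextElement();
        if (entry.isDirectory()) {
          continue;
        }
        try (InputStream in = zip.getInputStream(entry)) {
          String text = new String(in.readAllBytes(), StandardCharsets.UTF_8);
          // in a CasMultiplier this is where a new CAS would be created
          // with 'text' as its document text
          System.out.printf("%s: %d chars%n", entry.getName(), text.length());
        }
      }
    }
  }
}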

Implementing efficient scale out is highly dependent on good treatment of
input and output data.
Best,
Eddie


On Sat, Jun 13, 2020 at 6:24 AM Dr. Raja M. Suleman <
raja.m.sulaiman@gmail.com> wrote:


Re: UIMA DUCC slow processing

Posted by "Dr. Raja M. Suleman" <ra...@gmail.com>.
Hello,

Thank you very much for your response and even more so for the detailed
explanation.

So, if I understand it correctly, DUCC is more suited for scenarios where
we have large input documents rather than many small ones?

Thank you once again.

On Fri, 12 Jun 2020, 22:18 Eddie Epstein, <ea...@gmail.com> wrote:


Re: UIMA DUCC slow processing

Posted by Eddie Epstein <ea...@gmail.com>.
Hi,

In this simple scenario there is a CollectionReader running in a JobDriver
process, delivering 100K workitems to multiple remote JobProcesses. The
processing time is essentially zero. (30 * 60 seconds) / 100,000 workitems
= 18 milliseconds per workitem. This is roughly the expected overhead
of a DUCC JobDriver delivering workitems to remote JobProcesses and
recording the results. DUCC jobs are much more efficient if the overhead
per workitem is much smaller than the processing time.

Typically DUCC jobs would be processing much larger blocks of content per
workitem. For example, if a workitem were a whole document, and the document
were parsed into small CASes by the CasMultiplier, the throughput would be
much better. However, with this example, as the number of working
JobProcess threads is scaled up, the CR (JobDriver) would become a
bottleneck. That's why a typical DUCC Job will not send the document
content as a workitem, but rather send a reference to the workitem content
and have the CasMultipliers in the JobProcesses read the content directly
from the source.
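
A sketch of that pattern (my own class and field names; here the reference
is just a range of document ids carried as the workitem's text):

import org.apache.uima.cas.CAS;
import org.apache.uima.collection.CollectionReader_ImplBase;
import org.apache.uima.util.Progress;
import org.apache.uima.util.ProgressImpl;

public class ReferenceOnlyCollectionReader extends CollectionReader_ImplBase {

  private static final int TOTAL_DOCS = 100_000;
  private static final int BLOCK_SIZE = 1_000;   // ids per workitem
  private int nextBlock = 0;

  @Override
  public boolean hasNext() {
    return nextBlock * BLOCK_SIZE < TOTAL_DOCS;
  }

  @Override
  public void getNext(CAS cas) {
    int first = nextBlock * BLOCK_SIZE;
    int last = Math.min(first + BLOCK_SIZE, TOTAL_DOCS) - 1;
    // the workitem carries only "first-last"; the CasMultiplier in the
    // JobProcess turns this range into real documents read from the source
    cas.setDocumentText(first + "-" + last);
    nextBlock++;
  }

  @Override
  public Progress[] getProgress() {
    return new Progress[] {
        new ProgressImpl(nextBlock, TOTAL_DOCS / BLOCK_SIZE, Progress.ENTITIES) };
  }

  @Override
  public void close() {
  }
}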

Even though having the JobProcesses read the content directly is much more
efficient, as scaleout continues to increase in this non-computation
scenario the bottleneck would eventually move to the underlying filesystem,
or to whatever the document source and JobProcess output are. The main
motivation for DUCC was jobs similar to those in the DUCC examples, which
use OpenNLP to process large documents. That is, jobs where CPU processing
is the bottleneck rather than I/O.

Hopefully this helps. If not, happy to continue the discussion.
Eddie

On Fri, Jun 12, 2020 at 1:16 PM Dr. Raja M. Suleman <
raja.m.sulaiman@gmail.com> wrote:


Re: UIMA DUCC slow processing

Posted by "Dr. Raja M. Suleman" <ra...@gmail.com>.
Hi,
Thank you for your reply and I'm sorry I couldn't get back to this earlier. 

To get a better picture of the processing speed of DUCC, I made a dummy
pipeline where the CollectionReader runs a for loop to generate 100k
workitems (so no disk reads); each workitem only has a simple string in it.
These are then passed on to the CasMultiplier where, for each workitem, I
create a new CAS with a DocumentInfo (again only holding a simple string
value) and pass it as a newcas to the CasConsumer. The CasConsumer doesn't
do anything except add the Document received in the CAS to the logger. So
basically this pipeline isn't doing anything: no input reads, and the only
output is the information added to the logger. Running this on the cluster
with 2 slave nodes with 8 CPUs and 32GB RAM each still takes more than 30
minutes. I don't understand how this is possible since there's no heavy I/O
processing happening in the code.

Any ideas please?

Thank you.

On 2020/05/18 12:47:41, Eddie Epstein <ea...@gmail.com> wrote: 

Re: UIMA DUCC slow processing

Posted by Marshall Schor <ms...@schor.com>.
Hi,

An important variable to know/measure in the 150 docs/minute:  How large are
these documents?

-Marshall

On 5/18/2020 8:47 AM, Eddie Epstein wrote: