Posted to user@uima.apache.org by rohan rai <hi...@gmail.com> on 2008/06/26 13:35:31 UTC

Annotation (Indexing) a bottleneck in UIMA in terms of speed

When I profile my UIMA application,
I see that annotation takes a lot of time.
Profiling shows that annotating 1 record takes around 0.06 seconds.
Now you may say that's good.
Now scale up.
It does not scale up linearly, but here is a rough estimate from
experiments done:
6000 records take 6 min to annotate.
800000 records take around 10 hrs to annotate.
Which is bad.
One thing is that I am treating each record individually as a CAS.
Even if I treat all the records as a single CAS it takes around 6-7 hrs,
which is still not good in terms of speed.

Is there a way out?
Can I improve performance by any means??

Regards
Rohan

Re: Annotation (Indexing) a bottleneck in UIMA in terms of speed

Posted by Thilo Goetz <tw...@gmx.de>.
Great minds think alike :-)

LeHouillier, Frank D. wrote:
> To test your theory that it is the writing of Annotations to the CAS
> that is taking so long I ran an annotator with this code: 
> 
> public class TestAnnotator extends JCasAnnotator_ImplBase {
> 
> 	@Override
> 	public void process(JCas arg0) throws AnalysisEngineProcessException {
> 		for (int i = 0; i < 100000; i++) {
> 			Annotation a = new Annotation(arg0);
> 			a.setBegin(1);
> 			a.setEnd(2);
> 			a.addToIndexes();
> 		}
> 		System.out.println("Done");
> 	}
> }
> 
> This takes less than two seconds to run on my laptop.  Is it possible
> your bottleneck isn't where you think it is?
> 
> -----Original Message-----
> From: rohan rai [mailto:hirohanin@gmail.com] 
> Sent: Thursday, June 26, 2008 12:04 PM
> To: uima-user@incubator.apache.org
> Subject: Re: Annotation (Indexing) a bottleneck in UIMA in terms of
> speed
> 
> @Pascal: As I have already said, the timing does not scale linearly.
>               Secondly, these are the approximate times I have specified.
> @Frank:
>      I was talking about the actual adding of annotations to the CAS.
>     A record refers to, let's say, content in tags like these: <a>.....</a>,
>     and the document consists of such records.
>     Annotation is done via this method:
>                                MyType annotation = new MyType(jCas);
>                                annotation.setBegin(start);
>                                annotation.setEnd(end);
>                                annotation.addToIndexes();
>    This takes a lot of time, which is not acceptable.
> 
> Regards
> Rohan
> 
> 
> On Thu, Jun 26, 2008 at 8:15 PM, LeHouillier, Frank D. <
> Frank.LeHouillier@gd-ais.com> wrote:
> 
>> Just to clarify, what do you mean by "annotation"?  Is there a 
>> specific Analysis Engine that you are using? What is a "record"? Is 
>> this a document?  It would actually be surprising for many 
>> applications if annotation were not the bottleneck, given that some 
>> annotation processes are quite expensive, but this doesn't seem like 
>> what you mean here. I can't tell from your question whether it is the 
>> process that determines the annotations that is a burden or the actual
> 
>> adding of the annotations to the cas.
>>
>> -----Original Message-----
>> From: rohan rai [mailto:hirohanin@gmail.com]
>> Sent: Thursday, June 26, 2008 7:36 AM
>> To: uima-user@incubator.apache.org
>> Subject: Annotation (Indexing) a bottleneck in UIMA in terms of speed
>>
>> When I profile my UIMA application, I see that annotation takes a lot of
>> time. Profiling shows that annotating 1 record takes around 0.06 seconds.
>> That may sound good, but now scale up. It does not scale linearly, but
>> here is a rough estimate from experiments: 6000 records take 6 min to
>> annotate; 800000 records take around 10 hrs. Which is bad.
>> One thing is that I am treating each record individually as a CAS. Even
>> if I treat all the records as a single CAS it takes around 6-7 hrs,
>> which is still not good in terms of speed.
>>
>> Is there a way out?
>> Can I improve performance by any means??
>>
>> Regards
>> Rohan
>>

RE: Annotation (Indexing) a bottleneck in UIMA in terms of speed

Posted by "LeHouillier, Frank D." <Fr...@gd-ais.com>.
To test your theory that it is the writing of Annotations to the CAS
that is taking so long I ran an annotator with this code: 

public class TestAnnotator extends JCasAnnotator_ImplBase {

	@Override
	public void process(JCas arg0) throws AnalysisEngineProcessException {
		for (int i = 0; i < 100000; i++) {
			Annotation a = new Annotation(arg0);
			a.setBegin(1);
			a.setEnd(2);
			a.addToIndexes();
		}
		System.out.println("Done");
	}
}

This takes less than two seconds to run on my laptop.  Is it possible
your bottleneck isn't where you think it is?

-----Original Message-----
From: rohan rai [mailto:hirohanin@gmail.com] 
Sent: Thursday, June 26, 2008 12:04 PM
To: uima-user@incubator.apache.org
Subject: Re: Annotation (Indexing) a bottleneck in UIMA in terms of
speed

@Pascal: As I have already said, the timing does not scale linearly.
              Secondly, these are the approximate times I have specified.
@Frank:
     I was talking about the actual adding of annotations to the CAS.
    A record refers to, let's say, content in tags like these: <a>.....</a>,
    and the document consists of such records.
    Annotation is done via this method:
                               MyType annotation = new MyType(jCas);
                               annotation.setBegin(start);
                               annotation.setEnd(end);
                               annotation.addToIndexes();
   This takes a lot of time, which is not acceptable.

Regards
Rohan


On Thu, Jun 26, 2008 at 8:15 PM, LeHouillier, Frank D. <
Frank.LeHouillier@gd-ais.com> wrote:

> Just to clarify, what do you mean by "annotation"?  Is there a 
> specific Analysis Engine that you are using? What is a "record"? Is 
> this a document?  It would actually be surprising for many 
> applications if annotation were not the bottleneck, given that some 
> annotation processes are quite expensive, but this doesn't seem like 
> what you mean here. I can't tell from your question whether it is the 
> process that determines the annotations that is a burden or the actual

> adding of the annotations to the cas.
>
> -----Original Message-----
> From: rohan rai [mailto:hirohanin@gmail.com]
> Sent: Thursday, June 26, 2008 7:36 AM
> To: uima-user@incubator.apache.org
> Subject: Annotation (Indexing) a bottleneck in UIMA in terms of speed
>
> When I profile my UIMA application, I see that annotation takes a lot of
> time. Profiling shows that annotating 1 record takes around 0.06 seconds.
> That may sound good, but now scale up. It does not scale linearly, but
> here is a rough estimate from experiments: 6000 records take 6 min to
> annotate; 800000 records take around 10 hrs. Which is bad.
> One thing is that I am treating each record individually as a CAS. Even
> if I treat all the records as a single CAS it takes around 6-7 hrs,
> which is still not good in terms of speed.
>
> Is there a way out?
> Can I improve performance by any means??
>
> Regards
> Rohan
>

Re: Annotation (Indexing) a bottleneck in UIMA in terms of speed

Posted by Julien Nioche <li...@gmail.com>.
Rohan,

I was not asking about scalability at all, but about the way you built the
job file. I have found the answer to my problem in the meantime: the
procedure you described on the Wiki page is valid in distributed mode only
(pseudo or real); I was trying it in standalone mode. I will update the Wiki
page.

J.

2008/8/19 rohan rai <hi...@gmail.com>

> Hey Julien
>
>  There are two aspects of making UIMA work with hadoop.
>
> First, make it run... somehow run on short data for the proof of
> concept...
>
> And then worry about the scalability.
>
> Have you gone through these links?
> http://cwiki.apache.org/confluence/display/UIMA/Running+UIMA+Apps+on+Hadoop
> or
> http://rohanrai.blogspot.com/2008/06/uima-hadoop.html
> Once you have understood what is going on there,
> you should look at this thread, which specifically talks about
> scalability issues.
>
> Feel free to ask more if you are still unable to make progress.
> Regards
> Rohan
>
>
>
> On Tue, Aug 19, 2008 at 3:49 PM, Julien Nioche <
> lists.digitalpebble@gmail.com> wrote:
>
>> Hi Rohan,
>>
>> I saw that thread on the uima list and am in a similar situation. Would
>> you mind telling me how you built the job file? I have one which contains
>> all my libs and xml configuration files but it does not get automatically
>> extracted + I can't access my files using the ClassLoader.
>>
>> Do you use conf.setJar() at all?
>>
>> Thanks
>>
>> Julien
>>
>>
>> 2008/6/30 rohan rai <hi...@gmail.com>
>>
>> Sorry for misleading you guys by keeping a few facts to myself.
>>> Let me elaborate and tell you the actual problem and the solution I found.
>>>
>>> Actually I am running my UIMA app over hadoop.
>>> There I encountered a big problem, about which I had asked in this forum
>>> before.
>>> Then I found the solution, which later got posted over here:
>>> http://cwiki.apache.org/UIMA/running-uima-apps-on-hadoop.html
>>> This solved one set of problems, but it started to give performance
>>> issues. Instead of speeding up and scaling up, I started facing two sets
>>> of problems because of the solution mentioned in the wiki.
>>>
>>> Problem 1) Out of memory error
>>> The solution talks about using
>>> XMLInputSource in = new
>>> XMLInputSource(ClassLoader.getSystemResourceAsStream(aeXmlDescriptor),null)
>>>
>>> to load the XML descriptors, and using a resource manager to do so.
>>>
>>> But if this activity is carried out in the Map/Reduce class, then
>>> eventually one gets an out of memory error in spite of increasing the
>>> heap size considerably.
>>>
>>> The solution is to initialize the Analysis Engine etc. in the
>>> configure(JobConf) method of the Mapper/Reducer class, so as to create a
>>> single instance of it in each hadoop task. One can even reuse the CAS by
>>> calling its cas.reset() method.
>>>
>>> This way the out of memory problem was solved.
>>>
>>> Problem 2) Poor performance
>>> The source of this was the usage of the Resource Manager mentioned in
>>> the wiki to solve another problem.
>>>
>>> It was caused because each class mentioned in the descriptor was brought
>>> from the job temp directory to the task temp directory.
>>>
>>> So the problem became how to achieve what the wiki entry was made for,
>>> without using the Resource Manager.
>>>
>>> The solution is to fake imports (yeah, ironical indeed, that faking
>>> proved to be useful :)). In the class file where the Map/Reduce task is
>>> implemented, we import all the classes required by the descriptors
>>> initialized in that class.
>>>
>>> This ensures the presence of these classes at each individual task, thus
>>> giving a considerable increase in performance.
>>>
>>> Keeping these points in mind, I was now able to use the beauty of UIMA
>>> and hadoop together to my own benefit.
>>>
>>> Regards
>>> Rohan
>>>
>>>
>>>
>>>
>>> On Thu, Jun 26, 2008 at 10:52 PM, Thilo Goetz <tw...@gmx.de> wrote:
>>>
>>> > rohan rai wrote:
>>> >
>>> >> @Pascal: As I have already said, the timing does not scale linearly.
>>> >>              Secondly, these are the approximate times I have specified.
>>> >> @Frank:
>>> >>     I was talking about the actual adding of annotations to the CAS.
>>> >>    A record refers to, let's say, content in tags like these: <a>.....</a>,
>>> >>    and the document consists of such records.
>>> >>    Annotation is done via this method:
>>> >>                               MyType annotation = new MyType(jCas);
>>> >>                               annotation.setBegin(start);
>>> >>                               annotation.setEnd(end);
>>> >>                               annotation.addToIndexes();
>>> >>   This takes a lot of time, which is not acceptable.
>>> >>
>>> >
>>> > I don't know what you mean by a lot of time, but
>>> > you can create hundreds of thousands of annotations
>>> > like this per second on a standard windows machine.
>>> > You can easily verify this by running this code in
>>> > isolation (with mock data).
>>> >
>>> > You're more likely seeing per document overhead.
>>> > For example, resetting the CAS after processing
>>> > a document is not so cheap.  However, I still don't
>>> > know why things are so slow for you.  For example,
>>> > I ran the following experiment.  I installed the
>>> > Whitespace Tokenizer pear file into c:\tmp and ran
>>> > it 10000 times on its own descriptor.  That creates
>>> > approx 10Mio annotations.  On my 18 months old Xeon
>>> > this ran in about 4 seconds.  Code and output is
>>> > below, for you to recreate.  So I'm not sure you have
>>> > correctly identified your bottleneck.
>>> >
>>> >  public static void main(String[] args) {
>>> >    try {
>>> >      System.out.println("Starting setup.");
>>> >      XMLParser parser = UIMAFramework.getXMLParser();
>>> >      ResourceSpecifier spec = parser.parseResourceSpecifier(new XMLInputSource(new File(
>>> >          "c:\\tmp\\WhitespaceTokenizer\\WhitespaceTokenizer_pear.xml")));
>>> >      AnalysisEngine ae = UIMAFramework.produceAnalysisEngine(spec, null, null);
>>> >      String text = FileUtils.file2String(new File(
>>> >          "c:\\tmp\\WhitespaceTokenizer\\desc\\WhitespaceTokenizer.xml"));
>>> >      CAS cas = ae.newCAS();
>>> >      System.out.println("Setup done, starting processing.");
>>> >      final int max = 10000;
>>> >      long time = System.currentTimeMillis();
>>> >      for (int i = 0; i < max; i++) {
>>> >        cas.reset();
>>> >        cas.setDocumentText(text);
>>> >        ae.process(cas);
>>> >        if (cas.getAnnotationIndex().size() != 1080) {
>>> >          // There are 1080 annotations created for each run
>>> >          System.out.println("Processing error.");
>>> >        }
>>> >      }
>>> >      time = System.currentTimeMillis() - time;
>>> >      System.out.println("Time for processing " + max + " documents, " + max * 1080
>>> >          + " annotations: " + new TimeSpan(time));
>>> >    } catch (Exception e) {
>>> >      e.printStackTrace();
>>> >    }
>>> >  }
>>> >
>>> > Output on my machine:
>>> >
>>> > Starting setup.
>>> > Setup done, starting processing.
>>> > Time for processing 10000 documents, 10800000 annotations: 4.078 sec
>>> >
>>> > --Thilo
>>> >
>>> >
>>> >
>>> >
>>> >> Regards
>>> >> Rohan
>>> >>
>>> >>
>>> >> On Thu, Jun 26, 2008 at 8:15 PM, LeHouillier, Frank D. <
>>> >> Frank.LeHouillier@gd-ais.com> wrote:
>>> >>
>>> >>  Just to clarify, what do you mean by "annotation"?  Is there a
>>> specific
>>> >>> Analysis Engine that you are using? What is a "record"? Is this a
>>> >>> document?  It would actually be surprising for many applications if
>>> >>> annotation were not the bottleneck, given that some annotation
>>> processes
>>> >>> are quite expensive, but this doesn't seem like what you mean here. I
>>> >>> can't tell from your question whether it is the process that
>>> determines
>>> >>> the annotations that is a burden or the actual adding of the
>>> annotations
>>> >>> to the cas.
>>> >>>
>>> >>> -----Original Message-----
>>> >>> From: rohan rai [mailto:hirohanin@gmail.com]
>>> >>> Sent: Thursday, June 26, 2008 7:36 AM
>>> >>> To: uima-user@incubator.apache.org
>>> >>> Subject: Annotation (Indexing) a bottleneck in UIMA in terms of speed
>>> >>>
>>> >>> When I profile my UIMA application, I see that annotation takes a
>>> >>> lot of time. Profiling shows that annotating 1 record takes around
>>> >>> 0.06 seconds. That may sound good, but now scale up. It does not
>>> >>> scale linearly, but here is a rough estimate from experiments:
>>> >>> 6000 records take 6 min to annotate; 800000 records take around
>>> >>> 10 hrs. Which is bad.
>>> >>> One thing is that I am treating each record individually as a CAS.
>>> >>> Even if I treat all the records as a single CAS it takes around
>>> >>> 6-7 hrs, which is still not good in terms of speed.
>>> >>>
>>> >>> Is there a way out?
>>> >>> Can I improve performance by any means??
>>> >>>
>>> >>> Regards
>>> >>> Rohan
>>> >>>
>>> >>>
>>> >>
>>>
>>
>>
>>
>> --
>> DigitalPebble Ltd
>> http://www.digitalpebble.com
>>
>
>


-- 
DigitalPebble Ltd
http://www.digitalpebble.com

Re: Annotation (Indexing) a bottleneck in UIMA in terms of speed

Posted by rohan rai <hi...@gmail.com>.
Hey Julien

 There are two aspects of making UIMA work with hadoop.

First, make it run... somehow run on short data for the proof of concept...

And then worry about the scalability.

Have you gone through these links?
http://cwiki.apache.org/confluence/display/UIMA/Running+UIMA+Apps+on+Hadoop
or
http://rohanrai.blogspot.com/2008/06/uima-hadoop.html
Once you have understood what is going on there,
you should look at this thread, which specifically talks about
scalability issues.

Feel free to ask more if you are still unable to make progress.
Regards
Rohan



On Tue, Aug 19, 2008 at 3:49 PM, Julien Nioche <
lists.digitalpebble@gmail.com> wrote:

> Hi Rohan,
>
> I saw that thread on the uima list and am in a similar situation. Would you
> mind telling me how you built the job file? I have one which contains all my
> libs and xml configuration files but it does not get automatically extracted
> + I can't access my files using the ClassLoader.
>
> Do you use conf.setJar() at all?
>
> Thanks
>
> Julien
>
>
> 2008/6/30 rohan rai <hi...@gmail.com>
>
> Sorry for misleading you guys by keeping a few facts to myself.
>> Let me elaborate and tell you the actual problem and the solution I found.
>>
>> Actually I am running my UIMA app over hadoop.
>> There I encountered a big problem, about which I had asked in this forum
>> before.
>> Then I found the solution, which later got posted over here:
>> http://cwiki.apache.org/UIMA/running-uima-apps-on-hadoop.html
>> This solved one set of problems, but it started to give performance
>> issues. Instead of speeding up and scaling up, I started facing two sets
>> of problems because of the solution mentioned in the wiki.
>>
>> Problem 1) Out of memory error
>> The solution talks about using
>> XMLInputSource in = new
>> XMLInputSource(ClassLoader.getSystemResourceAsStream(aeXmlDescriptor),null)
>>
>> to load the XML descriptors, and using a resource manager to do so.
>>
>> But if this activity is carried out in the Map/Reduce class, then
>> eventually one gets an out of memory error in spite of increasing the
>> heap size considerably.
>>
>> The solution is to initialize the Analysis Engine etc. in the
>> configure(JobConf) method of the Mapper/Reducer class, so as to create a
>> single instance of it in each hadoop task. One can even reuse the CAS by
>> calling its cas.reset() method.
>>
>> This way the out of memory problem was solved.
>>
>> Problem 2) Poor performance
>> The source of this was the usage of the Resource Manager mentioned in
>> the wiki to solve another problem.
>>
>> It was caused because each class mentioned in the descriptor was brought
>> from the job temp directory to the task temp directory.
>>
>> So the problem became how to achieve what the wiki entry was made for,
>> without using the Resource Manager.
>>
>> The solution is to fake imports (yeah, ironical indeed, that faking
>> proved to be useful :)). In the class file where the Map/Reduce task is
>> implemented, we import all the classes required by the descriptors
>> initialized in that class.
>>
>> This ensures the presence of these classes at each individual task, thus
>> giving a considerable increase in performance.
>>
>> Keeping these points in mind, I was now able to use the beauty of UIMA
>> and hadoop together to my own benefit.
>>
>> Regards
>> Rohan
>>
>>
>>
>>
>> On Thu, Jun 26, 2008 at 10:52 PM, Thilo Goetz <tw...@gmx.de> wrote:
>>
>> > rohan rai wrote:
>> >
>> >> @Pascal: As I have already said, the timing does not scale linearly.
>> >>              Secondly, these are the approximate times I have specified.
>> >> @Frank:
>> >>     I was talking about the actual adding of annotations to the CAS.
>> >>    A record refers to, let's say, content in tags like these: <a>.....</a>,
>> >>    and the document consists of such records.
>> >>    Annotation is done via this method:
>> >>                               MyType annotation = new MyType(jCas);
>> >>                               annotation.setBegin(start);
>> >>                               annotation.setEnd(end);
>> >>                               annotation.addToIndexes();
>> >>   This takes a lot of time, which is not acceptable.
>> >>
>> >
>> > I don't know what you mean by a lot of time, but
>> > you can create hundreds of thousands of annotations
>> > like this per second on a standard windows machine.
>> > You can easily verify this by running this code in
>> > isolation (with mock data).
>> >
>> > You're more likely seeing per document overhead.
>> > For example, resetting the CAS after processing
>> > a document is not so cheap.  However, I still don't
>> > know why things are so slow for you.  For example,
>> > I ran the following experiment.  I installed the
>> > Whitespace Tokenizer pear file into c:\tmp and ran
>> > it 10000 times on its own descriptor.  That creates
>> > approx 10Mio annotations.  On my 18 months old Xeon
>> > this ran in about 4 seconds.  Code and output is
>> > below, for you to recreate.  So I'm not sure you have
>> > correctly identified your bottleneck.
>> >
>> >  public static void main(String[] args) {
>> >    try {
>> >      System.out.println("Starting setup.");
>> >      XMLParser parser = UIMAFramework.getXMLParser();
>> >      ResourceSpecifier spec = parser.parseResourceSpecifier(new XMLInputSource(new File(
>> >          "c:\\tmp\\WhitespaceTokenizer\\WhitespaceTokenizer_pear.xml")));
>> >      AnalysisEngine ae = UIMAFramework.produceAnalysisEngine(spec, null, null);
>> >      String text = FileUtils.file2String(new File(
>> >          "c:\\tmp\\WhitespaceTokenizer\\desc\\WhitespaceTokenizer.xml"));
>> >      CAS cas = ae.newCAS();
>> >      System.out.println("Setup done, starting processing.");
>> >      final int max = 10000;
>> >      long time = System.currentTimeMillis();
>> >      for (int i = 0; i < max; i++) {
>> >        cas.reset();
>> >        cas.setDocumentText(text);
>> >        ae.process(cas);
>> >        if (cas.getAnnotationIndex().size() != 1080) {
>> >          // There are 1080 annotations created for each run
>> >          System.out.println("Processing error.");
>> >        }
>> >      }
>> >      time = System.currentTimeMillis() - time;
>> >      System.out.println("Time for processing " + max + " documents, " + max * 1080
>> >          + " annotations: " + new TimeSpan(time));
>> >    } catch (Exception e) {
>> >      e.printStackTrace();
>> >    }
>> >  }
>> >
>> > Output on my machine:
>> >
>> > Starting setup.
>> > Setup done, starting processing.
>> > Time for processing 10000 documents, 10800000 annotations: 4.078 sec
>> >
>> > --Thilo
>> >
>> >
>> >
>> >
>> >> Regards
>> >> Rohan
>> >>
>> >>
>> >> On Thu, Jun 26, 2008 at 8:15 PM, LeHouillier, Frank D. <
>> >> Frank.LeHouillier@gd-ais.com> wrote:
>> >>
>> >>  Just to clarify, what do you mean by "annotation"?  Is there a
>> specific
>> >>> Analysis Engine that you are using? What is a "record"? Is this a
>> >>> document?  It would actually be surprising for many applications if
>> >>> annotation were not the bottleneck, given that some annotation
>> processes
>> >>> are quite expensive, but this doesn't seem like what you mean here. I
>> >>> can't tell from your question whether it is the process that
>> determines
>> >>> the annotations that is a burden or the actual adding of the
>> annotations
>> >>> to the cas.
>> >>>
>> >>> -----Original Message-----
>> >>> From: rohan rai [mailto:hirohanin@gmail.com]
>> >>> Sent: Thursday, June 26, 2008 7:36 AM
>> >>> To: uima-user@incubator.apache.org
>> >>> Subject: Annotation (Indexing) a bottleneck in UIMA in terms of speed
>> >>>
>> >>> When I profile my UIMA application, I see that annotation takes a
>> >>> lot of time. Profiling shows that annotating 1 record takes around
>> >>> 0.06 seconds. That may sound good, but now scale up. It does not
>> >>> scale linearly, but here is a rough estimate from experiments:
>> >>> 6000 records take 6 min to annotate; 800000 records take around
>> >>> 10 hrs. Which is bad.
>> >>> One thing is that I am treating each record individually as a CAS.
>> >>> Even if I treat all the records as a single CAS it takes around
>> >>> 6-7 hrs, which is still not good in terms of speed.
>> >>>
>> >>> Is there a way out?
>> >>> Can I improve performance by any means??
>> >>>
>> >>> Regards
>> >>> Rohan
>> >>>
>> >>>
>> >>
>>
>
>
>
> --
> DigitalPebble Ltd
> http://www.digitalpebble.com
>

Re: Annotation (Indexing) a bottleneck in UIMA in terms of speed

Posted by rohan rai <hi...@gmail.com>.
Sorry for misleading you guys by keeping a few facts to myself.
Let me elaborate and tell you the actual problem and the solution I found.

Actually I am running my UIMA app over hadoop.
There I encountered a big problem, about which I had asked in this forum
before.
Then I found the solution, which later got posted over here:
http://cwiki.apache.org/UIMA/running-uima-apps-on-hadoop.html
This solved one set of problems, but it started to give performance issues.
Instead of speeding up and scaling up, I started facing two sets of problems
because of the solution mentioned in the wiki.

Problem 1) Out of memory error
The solution talks about using
XMLInputSource in = new
XMLInputSource(ClassLoader.getSystemResourceAsStream(aeXmlDescriptor),null)

to load the XML descriptors, and using a resource manager to do so.

But if this activity is carried out in the Map/Reduce class, then eventually
one gets an out of memory error in spite of increasing the heap size
considerably.

The solution is to initialize the Analysis Engine etc. in the
configure(JobConf) method of the Mapper/Reducer class, so as to create a
single instance of it in each hadoop task. One can even reuse the CAS by
calling its cas.reset() method.

This way the out of memory problem was solved.
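As a rough sketch, the per-task initialization could look like the following (this assumes the old Hadoop 0.x org.apache.hadoop.mapred API with its configure(JobConf) hook; the descriptor name "MyAE.xml", the class name, and the key/value types are illustrative only, not taken from the actual setup):

```java
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
import org.apache.uima.UIMAFramework;
import org.apache.uima.analysis_engine.AnalysisEngine;
import org.apache.uima.cas.CAS;
import org.apache.uima.resource.ResourceSpecifier;
import org.apache.uima.util.XMLInputSource;

public class UimaRecordMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {

  private AnalysisEngine ae; // one engine per hadoop task, not per record
  private CAS cas;           // reused across records via cas.reset()

  @Override
  public void configure(JobConf job) {
    try {
      // Expensive setup happens once per task, in configure().
      XMLInputSource in = new XMLInputSource(
          ClassLoader.getSystemResourceAsStream("MyAE.xml"), null);
      ResourceSpecifier spec =
          UIMAFramework.getXMLParser().parseResourceSpecifier(in);
      ae = UIMAFramework.produceAnalysisEngine(spec);
      cas = ae.newCAS();
    } catch (Exception e) {
      throw new RuntimeException("UIMA setup failed", e);
    }
  }

  public void map(LongWritable key, Text value,
      OutputCollector<Text, Text> output, Reporter reporter)
      throws IOException {
    try {
      cas.reset(); // reuse the CAS instead of allocating a new one
      cas.setDocumentText(value.toString());
      ae.process(cas);
      // ... walk cas.getAnnotationIndex() and collect results ...
    } catch (Exception e) {
      throw new IOException(e);
    }
  }
}
```

The point of the pattern is simply that the engine and the CAS live in fields set up once per task, so map() does no per-record construction.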

Problem 2) Poor performance
The source of this was the usage of the Resource Manager mentioned in the
wiki to solve another problem.

It was caused because each class mentioned in the descriptor was brought
from the job temp directory to the task temp directory.

So the problem became how to achieve what the wiki entry was made for,
without using the Resource Manager.

The solution is to fake imports (yeah, ironical indeed, that faking proved
to be useful :)). In the class file where the Map/Reduce task is
implemented, we import all the classes required by the descriptors
initialized in that class.

This ensures the presence of these classes at each individual task, thus
giving a considerable increase in performance.
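The fake-import trick might look like this (the annotator class names below are purely hypothetical placeholders for whatever classes your own descriptor references):

```java
// In the class that implements the Map/Reduce task, reference every
// annotator class that only the XML descriptor names. A plain unused
// import can be dropped by the compiler, so holding an explicit Class
// reference is the safer variant of the trick.
import com.example.uima.SentenceAnnotator; // hypothetical annotator
import com.example.uima.TokenAnnotator;    // hypothetical annotator

public class UimaTaskClasspathPin {
  // These "fake" references force the annotator classes onto each
  // task's classpath, so the descriptor can instantiate them without
  // going through the Resource Manager.
  @SuppressWarnings("unused")
  private static final Class<?>[] DESCRIPTOR_CLASSES = {
      TokenAnnotator.class, SentenceAnnotator.class
  };
}
```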

Keeping these points in mind, I was now able to use the beauty of UIMA and
hadoop together to my own benefit.

Regards
Rohan




On Thu, Jun 26, 2008 at 10:52 PM, Thilo Goetz <tw...@gmx.de> wrote:

> rohan rai wrote:
>
>> @Pascal: As I have already said, the timing does not scale linearly.
>>              Secondly, these are the approximate times I have specified.
>> @Frank:
>>     I was talking about the actual adding of annotations to the CAS.
>>    A record refers to, let's say, content in tags like these: <a>.....</a>,
>>    and the document consists of such records.
>>    Annotation is done via this method:
>>                               MyType annotation = new MyType(jCas);
>>                               annotation.setBegin(start);
>>                               annotation.setEnd(end);
>>                               annotation.addToIndexes();
>>   This takes a lot of time, which is not acceptable.
>>
>
> I don't know what you mean by a lot of time, but
> you can create hundreds of thousands of annotations
> like this per second on a standard windows machine.
> You can easily verify this by running this code in
> isolation (with mock data).
>
> You're more likely seeing per document overhead.
> For example, resetting the CAS after processing
> a document is not so cheap.  However, I still don't
> know why things are so slow for you.  For example,
> I ran the following experiment.  I installed the
> Whitespace Tokenizer pear file into c:\tmp and ran
> it 10000 times on its own descriptor.  That creates
> approx 10Mio annotations.  On my 18 months old Xeon
> this ran in about 4 seconds.  Code and output is
> below, for you to recreate.  So I'm not sure you have
> correctly identified your bottleneck.
>
>  public static void main(String[] args) {
>    try {
>      System.out.println("Starting setup.");
>      XMLParser parser = UIMAFramework.getXMLParser();
>      ResourceSpecifier spec = parser.parseResourceSpecifier(new
> XMLInputSource(new File(
>          "c:\\tmp\\WhitespaceTokenizer\\WhitespaceTokenizer_pear.xml")));
>      AnalysisEngine ae = UIMAFramework.produceAnalysisEngine(spec, null,
> null);
>      String text = FileUtils.file2String(new File(
>          "c:\\tmp\\WhitespaceTokenizer\\desc\\WhitespaceTokenizer.xml"));
>      CAS cas = ae.newCAS();
>      System.out.println("Setup done, starting processing.");
>      final int max = 10000;
>      long time = System.currentTimeMillis();
>      for (int i = 0; i < max; i++) {
>        cas.reset();
>        cas.setDocumentText(text);
>        ae.process(cas);
>        if (cas.getAnnotationIndex().size() != 1080) {
>          // There are 1080 annotations created for each run
>          System.out.println("Processing error.");
>        }
>      }
>      time = System.currentTimeMillis() - time;
>      System.out.println("Time for processing " + max + " documents, " + max
> * 1080
>          + " annotations: " + new TimeSpan(time));
>    } catch (Exception e) {
>      e.printStackTrace();
>    }
>  }
>
> Output on my machine:
>
> Starting setup.
> Setup done, starting processing.
> Time for processing 10000 documents, 10800000 annotations: 4.078 sec
>
> --Thilo
>
>
>
>
>> Regards
>> Rohan
>>
>>
>> On Thu, Jun 26, 2008 at 8:15 PM, LeHouillier, Frank D. <
>> Frank.LeHouillier@gd-ais.com> wrote:
>>
>>  Just to clarify, what do you mean by "annotation"?  Is there a specific
>>> Analysis Engine that you are using? What is a "record"? Is this a
>>> document?  It would actually be surprising for many applications if
>>> annotation were not the bottleneck, given that some annotation processes
>>> are quite expensive, but this doesn't seem like what you mean here. I
>>> can't tell from your question whether it is the process that determines
>>> the annotations that is a burden or the actual adding of the annotations
>>> to the cas.
>>>
>>> -----Original Message-----
>>> From: rohan rai [mailto:hirohanin@gmail.com]
>>> Sent: Thursday, June 26, 2008 7:36 AM
>>> To: uima-user@incubator.apache.org
>>> Subject: Annotation (Indexing) a bottleneck in UIMA in terms of speed
>>>
>>> When I profile my UIMA application, I see that annotation takes a lot of
>>> time. Profiling shows that annotating 1 record takes around 0.06 seconds.
>>> That may sound good, but now scale up. It does not scale linearly, but
>>> here is a rough estimate from experiments: 6000 records take 6 min to
>>> annotate; 800000 records take around 10 hrs. Which is bad.
>>> One thing is that I am treating each record individually as a CAS. Even
>>> if I treat all the records as a single CAS it takes around 6-7 hrs, which
>>> is still not good in terms of speed.
>>>
>>> Is there a way out?
>>> Can I improve performance by any means??
>>>
>>> Regards
>>> Rohan
>>>
>>>
>>

Re: Annotation (Indexing) a bottleneck in UIMA in terms of speed

Posted by Thilo Goetz <tw...@gmx.de>.
rohan rai wrote:
> @Pascal: As I have already said, the timing does not scale linearly.
>               Secondly, those are the approximate times which I have specified.
> @Frank:
>      I was talking about the actual adding of annotations to the CAS.
>     A record refers to, let's say, content in tags like these: <a>.....</a>
>     and the document consists of such records.
>     Annotation is done via this method:
>                                MyType annotation = new MyType(jCas);
>                                annotation.setBegin(start);
>                                annotation.setEnd(end);
>                                annotation.addToIndexes();
>    This takes a lot of time, which is not acceptable.

I don't know what you mean by a lot of time, but
you can create hundreds of thousands of annotations
like this per second on a standard Windows machine.
You can easily verify this by running this code in
isolation (with mock data).

You're more likely seeing per-document overhead.
For example, resetting the CAS after processing
a document is not so cheap.  However, I still don't
know why things are so slow for you.  For example,
I ran the following experiment.  I installed the
Whitespace Tokenizer PEAR file into c:\tmp and ran
it 10000 times on its own descriptor.  That creates
approx. 10 million annotations.  On my 18-month-old
Xeon this ran in about 4 seconds.  Code and output are
below, for you to recreate.  So I'm not sure you have
correctly identified your bottleneck.

   public static void main(String[] args) {
     try {
       System.out.println("Starting setup.");
       XMLParser parser = UIMAFramework.getXMLParser();
       ResourceSpecifier spec = parser.parseResourceSpecifier(new XMLInputSource(new File(
           "c:\\tmp\\WhitespaceTokenizer\\WhitespaceTokenizer_pear.xml")));
       AnalysisEngine ae = UIMAFramework.produceAnalysisEngine(spec, null, null);
       String text = FileUtils.file2String(new File(
           "c:\\tmp\\WhitespaceTokenizer\\desc\\WhitespaceTokenizer.xml"));
       CAS cas = ae.newCAS();
       System.out.println("Setup done, starting processing.");
       final int max = 10000;
       long time = System.currentTimeMillis();
       for (int i = 0; i < max; i++) {
         cas.reset();
         cas.setDocumentText(text);
         ae.process(cas);
         if (cas.getAnnotationIndex().size() != 1080) {
           // There are 1080 annotations created for each run
           System.out.println("Processing error.");
         }
       }
       time = System.currentTimeMillis() - time;
       System.out.println("Time for processing " + max + " documents, " + max * 1080
           + " annotations: " + new TimeSpan(time));
     } catch (Exception e) {
       e.printStackTrace();
     }
   }

Output on my machine:

Starting setup.
Setup done, starting processing.
Time for processing 10000 documents, 10800000 annotations: 4.078 sec

--Thilo


> 
> Regards
> Rohan
> 
> 
> On Thu, Jun 26, 2008 at 8:15 PM, LeHouillier, Frank D. <
> Frank.LeHouillier@gd-ais.com> wrote:
> 
>> Just to clarify, what do you mean by "annotation"?  Is there a specific
>> Analysis Engine that you are using? What is a "record"? Is this a
>> document?  It would actually be surprising for many applications if
>> annotation were not the bottleneck, given that some annotation processes
>> are quite expensive, but this doesn't seem like what you mean here. I
>> can't tell from your question whether it is the process that determines
>> the annotations that is the burden or the actual adding of the annotations
>> to the CAS.
>>
>> -----Original Message-----
>> From: rohan rai [mailto:hirohanin@gmail.com]
>> Sent: Thursday, June 26, 2008 7:36 AM
>> To: uima-user@incubator.apache.org
>> Subject: Annotation (Indexing) a bottleneck in UIMA in terms of speed
>>
>> When I profile a UIMA application
>> What I see is that annotation takes a lot of time. If I profile I see
>> that to annotate 1 record, it takes around 0.06 seconds. Now you may
>> say that's good. Now scale up. Although it does not scale up linearly,
>> here is a rough estimate from experiments done: 6000 records take 6
>> min to annotate; 800000 records take around 10 hrs to annotate. Which
>> is bad.
>> One thing is that I am treating each record individually as a CAS.
>> Even if I treat all the records as a single CAS it takes around 6-7
>> hrs, which is still not good in terms of speed.
>>
>> Is there a way out?
>> Can I improve performance by any means??
>>
>> Regards
>> Rohan
>>
> 

Re: Annotation (Indexing) a bottleneck in UIMA in terms of speed

Posted by rohan rai <hi...@gmail.com>.
@Pascal: As I have already said, the timing does not scale linearly.
              Secondly, those are the approximate times which I have specified.
@Frank:
     I was talking about the actual adding of annotations to the CAS.
    A record refers to, let's say, content in tags like these: <a>.....</a>
    and the document consists of such records.
    Annotation is done via this method:
                               MyType annotation = new MyType(jCas);
                               annotation.setBegin(start);
                               annotation.setEnd(end);
                               annotation.addToIndexes();
   This takes a lot of time, which is not acceptable.
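[Editor's note: a standalone way to sanity-check this claim. The sketch below times the same pattern as the snippet above (allocate an object, set begin/end, append to an index), but uses a mock class and a plain ArrayList so it runs without the UIMA jars or the generated MyType class. MockAnnotation and the ArrayList "index" are stand-ins, not real UIMA types.]

```java
import java.util.ArrayList;
import java.util.List;

// Mock micro-benchmark: isolates the cost pattern of
//   new MyType(jCas); setBegin; setEnd; addToIndexes()
// without the UIMA runtime, so it can be run in isolation.
public class AnnotationCostSketch {

  // Stand-in for a generated JCas annotation type.
  static final class MockAnnotation {
    int begin;
    int end;
  }

  public static void main(String[] args) {
    final int n = 800_000; // one mock "annotation" per record
    List<MockAnnotation> index = new ArrayList<>(n);

    long t0 = System.nanoTime();
    for (int i = 0; i < n; i++) {
      MockAnnotation a = new MockAnnotation();
      a.begin = i;
      a.end = i + 1;
      index.add(a);
    }
    long elapsedMs = (System.nanoTime() - t0) / 1_000_000;

    // On commodity hardware this loop finishes in well under a second,
    // which suggests a 10-hour run is dominated by something else
    // (I/O, per-document setup, the analysis logic itself), not the adds.
    System.out.println(n + " adds took " + elapsedMs + " ms, index size = "
        + index.size());
  }
}
```

If this mock loop is fast on the same machine while the real pipeline is slow, the bottleneck is almost certainly outside the `addToIndexes()` calls.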

Regards
Rohan


On Thu, Jun 26, 2008 at 8:15 PM, LeHouillier, Frank D. <
Frank.LeHouillier@gd-ais.com> wrote:

> Just to clarify, what do you mean by "annotation"?  Is there a specific
> Analysis Engine that you are using? What is a "record"? Is this a
> document?  It would actually be surprising for many applications if
> annotation were not the bottleneck, given that some annotation processes
> are quite expensive, but this doesn't seem like what you mean here. I
> can't tell from your question whether it is the process that determines
> the annotations that is the burden or the actual adding of the annotations
> to the CAS.
>
> -----Original Message-----
> From: rohan rai [mailto:hirohanin@gmail.com]
> Sent: Thursday, June 26, 2008 7:36 AM
> To: uima-user@incubator.apache.org
> Subject: Annotation (Indexing) a bottleneck in UIMA in terms of speed
>
> When I profile a UIMA application
> What I see is that annotation takes a lot of time. If I profile I see
> that to annotate 1 record, it takes around 0.06 seconds. Now you may say
> that's good. Now scale up. Although it does not scale up linearly, here
> is a rough estimate from experiments done: 6000 records take 6 min to
> annotate; 800000 records take around 10 hrs to annotate. Which is bad.
> One thing is that I am treating each record individually as a CAS. Even
> if I treat all the records as a single CAS it takes around 6-7 hrs, which
> is still not good in terms of speed.
>
> Is there a way out?
> Can I improve performance by any means??
>
> Regards
> Rohan
>

RE: Annotation (Indexing) a bottleneck in UIMA in terms of speed

Posted by "LeHouillier, Frank D." <Fr...@gd-ais.com>.
Just to clarify, what do you mean by "annotation"?  Is there a specific
Analysis Engine that you are using? What is a "record"? Is this a
document?  It would actually be surprising for many applications if
annotation were not the bottleneck, given that some annotation processes
are quite expensive, but this doesn't seem like what you mean here. I
can't tell from your question whether it is the process that determines
the annotations that is the burden or the actual adding of the annotations
to the CAS.

-----Original Message-----
From: rohan rai [mailto:hirohanin@gmail.com] 
Sent: Thursday, June 26, 2008 7:36 AM
To: uima-user@incubator.apache.org
Subject: Annotation (Indexing) a bottleneck in UIMA in terms of speed

When I profile a UIMA application
What I see is that annotation takes a lot of time. If I profile I see
that to annotate 1 record, it takes around 0.06 seconds. Now you may say
that's good. Now scale up. Although it does not scale up linearly, here
is a rough estimate from experiments done: 6000 records take 6 min to
annotate; 800000 records take around 10 hrs to annotate. Which is bad.
One thing is that I am treating each record individually as a CAS. Even
if I treat all the records as a single CAS it takes around 6-7 hrs, which
is still not good in terms of speed.

Is there a way out?
Can I improve performance by any means??

Regards
Rohan

RE: Annotation (Indexing) a bottleneck in UIMA in terms of speed

Posted by Pascal Coupet <pa...@temis.com>.
Hi Rohan,

6000 records in 6 min => 1000/min
800000 records in 10H => 1333/min

Not that bad! I guess one of these numbers is wrong. 

Are you distributing the load across several machines? Vinci is not that
good for load balancing across a lot of machines (>20-50, depending on
your annotator speed).

Pascal 
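[Editor's note: distributing the load, as Pascal suggests, can also be approximated inside a single JVM. UIMA analysis engines are not thread-safe, so the usual pattern is one engine instance per worker thread. The sketch below shows that fan-out shape with a MockEngine stand-in (hypothetical, not a real UIMA class) so it runs without the UIMA jars; in real code each worker would build its own AnalysisEngine and CAS.]

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicInteger;

// Sketch of the one-engine-per-thread fan-out pattern.
public class ParallelPipelineSketch {

  // Stand-in for an AnalysisEngine; real engines are not thread-safe,
  // which is why each worker thread gets its own instance below.
  static final class MockEngine {
    int process(String record) {
      return record.length(); // placeholder for real annotation work
    }
  }

  public static void main(String[] args) throws Exception {
    final int workers = Runtime.getRuntime().availableProcessors();
    ExecutorService pool = Executors.newFixedThreadPool(workers);
    // One engine per worker thread, never shared across threads.
    ThreadLocal<MockEngine> engine = ThreadLocal.withInitial(MockEngine::new);

    List<String> records = java.util.Collections.nCopies(10_000, "<a>rec</a>");
    AtomicInteger processed = new AtomicInteger();

    List<Future<?>> futures = new java.util.ArrayList<>();
    for (String rec : records) {
      futures.add(pool.submit(() -> {
        engine.get().process(rec);   // thread-confined engine instance
        processed.incrementAndGet();
      }));
    }
    for (Future<?> f : futures) {
      f.get(); // propagate any worker failure
    }
    pool.shutdown();

    System.out.println("processed = " + processed.get()); // prints 10000
  }
}
```

With CPU-bound annotators, throughput should scale roughly with the number of cores until I/O or memory bandwidth becomes the limit, which is the single-machine analogue of Pascal's multi-machine suggestion.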
> -----Original Message-----
> From: rohan rai [mailto:hirohanin@gmail.com]
> Sent: Thursday, June 26, 2008 7:36 AM
> To: uima-user@incubator.apache.org
> Subject: Annotation (Indexing) a bottleneck in UIMA in terms of speed
> 
> When I profile a UIMA application
> What I see is that annotation takes a lot of time
> If I profile I see that to annotate 1 record, it takes around 0.06
> seconds
> Now you may say that's good
> Now scale up
> Although it does not scale up linearly, here is a rough estimate from
> experiments done
> 6000 records take 6 min to annotate
> 800000 records take around 10 hrs to annotate
> Which is bad.
> One thing is that I am treating each record individually as a CAS
> Even if I treat all the records as a single CAS it takes around 6-7 hrs
> Which is still not good in terms of speed
> 
> Is there a way out?
> Can I improve performance by any means??
> 
> Regards
> Rohan