Posted to user@uima.apache.org by Mario Gazzo <ma...@gmail.com> on 2016/01/04 16:13:04 UTC

Re: Very long Ruta stream initialization

Hi Peter,

No problem, I was pretty much offline myself during the Christmas holidays anyway.

The term “overhead” is probably an exaggeration in this context, especially now that I have disabled the MARKUP initialisation. We had earlier implemented our own XML markup annotator, tailored to our needs with additional annotation types and properties, so the Ruta MARKUP is currently not used. It just happens that we don’t directly use RutaBasic in any of our rules in this particular case, so I was curious whether we could avoid creating them in the first place, since there seem to be quite a few. However, the overall processing required by our Ruta scripts is now small compared to other processing steps, and optimising this further by making RutaBasic optional would currently be a very low priority for us. We would prioritise other features higher, e.g. being able to assign annotations to variables, as we discussed previously in another thread.
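
As a rough illustration of the caching role Peter describes for RutaBasic in the quoted message below, here is a minimal, self-contained Java sketch. It is not Ruta's implementation: the class BasicPartitionSketch, its Span and Basic types, and the partOf/endsWith helpers are all hypothetical names. It splits a document into minimal disjoint segments at every annotation boundary and records, per segment, which types begin there, end there, or cover it, so that PARTOF- or ENDSWITH-style checks become set lookups instead of index operations.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

/** Toy model only; all names are hypothetical and this is not Ruta code. */
final class BasicPartitionSketch {

    /** A typed span over the document text; the end offset is exclusive. */
    static final class Span {
        final String type;
        final int begin;
        final int end;
        Span(String type, int begin, int end) {
            this.type = type;
            this.begin = begin;
            this.end = end;
        }
    }

    /** One minimal segment plus the cached type information for it. */
    static final class Basic {
        final int begin;
        final int end;
        final Set<String> beginning = new HashSet<>(); // types starting at this segment
        final Set<String> ending = new HashSet<>();    // types ending with this segment
        final Set<String> covering = new HashSet<>();  // types covering this segment
        Basic(int begin, int end) {
            this.begin = begin;
            this.end = end;
        }
    }

    /** Cuts [0, docLength) at every span boundary and fills the per-segment caches. */
    static List<Basic> build(int docLength, List<Span> spans) {
        TreeSet<Integer> bounds = new TreeSet<>();
        bounds.add(0);
        bounds.add(docLength);
        for (Span s : spans) {
            bounds.add(s.begin);
            bounds.add(s.end);
        }
        List<Basic> basics = new ArrayList<>();
        Integer previous = null;
        for (int bound : bounds) {
            if (previous != null) {
                basics.add(new Basic(previous, bound));
            }
            previous = bound;
        }
        // Quadratic on purpose: this is a readability-first toy, not an optimised build.
        for (Basic basic : basics) {
            for (Span s : spans) {
                if (s.begin <= basic.begin && basic.end <= s.end) {
                    basic.covering.add(s.type);
                    if (s.begin == basic.begin) basic.beginning.add(s.type);
                    if (s.end == basic.end) basic.ending.add(s.type);
                }
            }
        }
        return basics;
    }

    /** Simplified PARTOF-like check: every segment inside [begin, end) is covered by the type. */
    static boolean partOf(List<Basic> basics, int begin, int end, String type) {
        boolean sawSegment = false;
        for (Basic b : basics) {
            if (b.begin >= begin && b.end <= end) {
                sawSegment = true;
                if (!b.covering.contains(type)) return false;
            }
        }
        return sawSegment;
    }

    /** Simplified ENDSWITH-like check: an annotation of the type ends exactly at the given offset. */
    static boolean endsWith(List<Basic> basics, int end, String type) {
        for (Basic b : basics) {
            if (b.end == end) return b.ending.contains(type);
        }
        return false;
    }

    public static void main(String[] args) {
        List<Span> spans = List.of(new Span("Headline", 0, 10), new Span("Keyword", 4, 7));
        List<Basic> basics = build(20, spans);
        System.out.println("segments: " + basics.size());
        System.out.println("Keyword part of Headline: " + partOf(basics, 4, 7, "Headline"));
        System.out.println("Headline ends at 10: " + endsWith(basics, 10, "Headline"));
    }
}
```

The sketch also hints at why the number of such segments grows with the number of distinct annotation boundaries, which is why there can be quite a few of them even in a small document.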

We haven’t processed documents as large as those you mention, since books have so far been divided into chapters, so processing could be parallelised accordingly. We also drop extreme outliers above a certain size if we encounter them and batch process them later in smaller chunks, but this has so far not been necessary with our current data sets. As in your case, our processing bottlenecks are now in other components.
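
The size cut-off and later chunking mentioned above could look roughly like the following Java sketch. The class name OutlierChunkingSketch, the chunk method and the character-based limit are assumptions made purely for illustration. It cuts just after a newline where possible so that chunks do not split a line, and falls back to a hard cut otherwise.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only; names and the character-based limit are hypothetical. */
final class OutlierChunkingSketch {

    /** Splits text into pieces of at most maxChars, preferring to cut right after a newline. */
    static List<String> chunk(String text, int maxChars) {
        if (maxChars <= 0) {
            throw new IllegalArgumentException("maxChars must be positive");
        }
        List<String> chunks = new ArrayList<>();
        int start = 0;
        while (start < text.length()) {
            int end = Math.min(start + maxChars, text.length());
            if (end < text.length()) {
                // Look backwards for the last newline inside the current window.
                int lastBreak = text.lastIndexOf('\n', end - 1);
                if (lastBreak >= start) {
                    end = lastBreak + 1; // keep the newline with the preceding chunk
                }
            }
            chunks.add(text.substring(start, end));
            start = end; // always advances because end > start
        }
        return chunks;
    }

    public static void main(String[] args) {
        List<String> parts = chunk("chapter one\nchapter two\nchapter three", 15);
        System.out.println("number of chunks: " + parts.size());
    }
}
```

Concatenating the chunks gives back the original text. Annotation offsets produced on a chunk are relative to that chunk, so the start offset of each chunk would have to be tracked if results are to be mapped back to the full document.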

Cheers
Mario

> On 30 Dec 2015, at 16:44 , Peter Klügl <pe...@averbis.com> wrote:
> 
> Hi,
> 
> sorry for the delayed reply.
> 
> RutaEngine::initializeStream:
> 
> The special treatment of MARKUPs that causes the increased initialization time is just a workaround because I was too lazy to write a working JFlex rule. Well, I tried but failed. It shouldn't be hard to improve this code... I will create an issue for it. When I did the last performance optimization, UIMA did not check the indexes yet and my test set did not contain markups.
> 
> Deactivate creation of RutaBasic:
> Short answer is no. I was already thinking about making RutaBasic optional in the future so that the user can configure whether they are used. However, right now, they are required for rule inference and are what makes the rule inference "fast" in the first place. RutaBasic is just an internal annotation like RutaAnnotation (for SCORE, MARKSCORE) and RutaFrame, and rules should not match on them at all.
> 
> Some background information:
> 
> RutaBasics are used for three things:
> - store additional information in order to avoid index operations. Some useful conditions would otherwise require many index operations, e.g., PARTOF or ENDSWITH. RutaBasic is utilized as a cache of which annotations start and end at which position, and which positions are covered by which types.
> - provide a container to make this information available across analysis engines. Information shared by analysis engines is normally stored in the CAS, e.g. in annotations (or in external resources). This is the role of RutaBasic. It is not really implemented right now as it should be, but I will improve it soon. Then there will be no performance decrease when a pipeline is spammed with small Ruta engines.
> - provide a basic minimal disjoint partitioning of the document for the coverage-based visibility concept.
> 
> Making RutaBasic optional is possible. If there is a real need for it, e.g., in order to reduce the memory footprint or when processing large documents where parts are simply not interesting, then I will put it on my TODO list. I am also open to other/new ideas on how to solve the challenges (and for incremental usage of internal caches).
> 
> What is your experience with the processing overhead concerning RutaBasic? Is it the rule matching or rather the initialization? I myself have already had some performance problems with the initialization and memory consumption in large CASes (PDFs of 500+ pages). However, other components, serialization and the CAS editor were the actual bottlenecks.
> 
> Best,
> 
> Peter
> 
> 
> Am 22.12.2015 um 17:26 schrieb Mario Gazzo:
>> I got around it by removing the default seeders by specifying an empty seeders list since we don’t need the MARKUP annotations anymore.
>> 
>> I still don’t know why it created so much overhead but it sometimes seemed to rival the POS tagger in processing time.
>> 
>> Anyway, this leads me to the next question. Can I disable the creation of Ruta basic annotations entirely to save processing overhead and only apply Ruta rules to other annotation types created by other AEs such as our own?
>> 
>> Cheers
>> Mario
>> 
>>> On 21 Dec 2015, at 16:09 , Mario Juric <ma...@gmail.com> wrote:
>>> 
>>> Hi Peter,
>>> 
>>> I noticed that occasionally the initialisation in RutaEngine::initializeStream can take a very long time. I can’t really explain it, and it seems independent of document length, since I have seen this even with very small XML documents.
>>> 
>>> The method seems to spend much of its time in the DefaultSeeder when creating MARKUP annotations during subiterator.moveToNext calls (line 89), and inside Subiterator it seems to be the while loop inside adjustForStrictForward (line 232), which is inside UIMA core classes. I haven’t gone into any deeper analysis yet, but I would first like to hear whether you have an idea what the main cause(s) could be.
>>> 
>>> We use Ruta 2.3.1 with UIMA 2.8.1
>>> 
>>> 
>>> Cheers
>>> Mario
> 


Re: Very long Ruta stream initialization

Posted by Peter Klügl <pe...@averbis.com>.
Here's the description on the UIMA site:
https://uima.apache.org/get-involved.html

Here's the description of the general Apache process:
http://www.apache.org/dev/new-committers-guide.html#cla

A short summary of what to do:
- complete the ICLA (http://www.apache.org/licenses/icla.pdf), print it,
sign it and scan it
- maybe do the same for the CCLA
(http://www.apache.org/licenses/cla-corporate.txt) if your employer
requires it and you did the contribution/implementation during work time
- send the scanned document (or both) to secretary@apache.org

"apache id" and "notify project" are optional but I would add it (so
that we get informed that the documents have been processed, and you
already have an id in case you gain committer rights).

I hope I have not forgotten anything...

Best,

Peter


Am 07.01.2016 um 10:22 schrieb Mario Gazzo:
> Yes, where do we sign this?
>
> :-)
>
>> On 07 Jan 2016, at 10:16 , Peter Klügl <pe...@averbis.com> wrote:
>>
>> :-) let me know if you need help or have any questions.
>>
>> Am 07.01.2016 um 10:12 schrieb Mario Gazzo:
>>> Yes, let us just sign and submit it.
>>>
>>>> On 07 Jan 2016, at 10:11 , Peter Klügl <pe...@averbis.com> wrote:
>>>>
>>>> Hi,
>>>>
>>>> thanks, that would be great. Patches are simply attached to the issue.
>>>> Non-trivial changes require an ICLA. Do you want to sign and submit it?
>>>>
>>>> Best,
>>>>
>>>> Peter
>>>>
>>>>
>>>> Am 07.01.2016 um 10:08 schrieb Mario Gazzo:
>>>>> Thanks,
>>>>>
>>>>> I just added the JIRA issue: https://issues.apache.org/jira/browse/UIMA-4729
>>>>>
>>>>> If you like, we can also implement it and submit a patch; just let us know what the process is.
>>>>>
>>>>> Cheers
>>>>> Mario
>>>>>
>>>>>> On 07 Jan 2016, at 09:08 , Peter Klügl <pe...@averbis.com> wrote:
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> Am 06.01.2016 um 14:48 schrieb Mario Gazzo:
>>>>>>> Hi Peter,
>>>>>>>
>>>>>>> I had a look at the test cases and I think there are many interesting and useful features that cover many of our use cases but I will have to experiment with them before I know what might be missing. I have a few questions though:
>>>>>>>
>>>>>>> 1) It appears that we would then also be able to assign annotations to lists, which is nice. I am not sure from looking at the tests whether it is possible to use ADD with the annotation lists but I assume so.
>>>>>> Not yet, but I will implement it. It's still work in progress. But
>>>>>> thanks for pointing it out, I would probably have forgotten about it.
>>>>>>
>>>>>>> 2) The use of addresses is unclear to me just from reading the test; maybe you could explain them? This concept is very new to me.
>>>>>> It's not intended to be utilized directly in a rule file. It's rather
>>>>>> just a way to combine logic in Java with Ruta rules or to use Ruta
>>>>>> functionality in Java code.
>>>>>> Let's say we have a new method like
>>>>>> boolean Ruta.matches(CAS cas, String rule, AnnotationFS... annotations)
>>>>>> and you call it with something like (syntax is not yet specified)
>>>>>> Ruta.matches(cas, "${PARTOF(Headline)} Keyword;", annotation)
>>>>>> Then, the "$" would be replaced by the address of the annotation and the
>>>>>> method would return whether the annotation is covered by a Headline
>>>>>> annotation and is followed by a Keyword annotation.
>>>>>>
>>>>>>> 3) The annotation feature expression looks nice but I wonder whether an array element can also be referenced using an int expression and not just a constant e.g. Struct.as[intVar+1]{->T1};
>>>>>> Yes, without allowing number expressions, it would not really be useful.
>>>>>> The current implementation is just a test in order to check whether the
>>>>>> internal object model is good enough to cover it. The complete
>>>>>> functionality will probably not be included in the next release since
>>>>>> there is still much work left in order to get it up and running. The
>>>>>> semantics of such expressions (Struct.as) are resolved on the fly, and
>>>>>> the code does not support expressions at all. I still have to think
>>>>>> about a way to implement it.
>>>>>>
>>>>>>> The label expressions are also useful and will make some of our rules more readable.
>>>>>>>
>>>>>>> Finally, I have one additional question about the MARKUP initialisation. I have a case where I need the token seeds coming from the default seeder but don’t want to run the markup initialisation. Is there a separate seeder defined for this somewhere? Right now I have my own copy of the default seeder without the MARKUP initialisation, but obviously I do not want to maintain this. It looks as if the default seeder could be split into two seeders, both added by default, and then I could override the defaults with my own seeder list containing only the token seeder.
>>>>>> Yes, we can split them or just add another one that ignores markup. I
>>>>>> was also always thinking about adding a DetailedSeeder that creates much
>>>>>> more fine-grained types like different brackets and quotes... but it was
>>>>>> never at the top of my TODO list.
>>>>>>
>>>>>> Do you want to open a jira issue for it?
>>>>>>
>>>>>> Best,
>>>>>>
>>>>>> Peter
>>>>>>
>>>>>>> Cheers
>>>>>>> Mario
>>>>>>>
>>>>>>>
>>>>>>>> On 04 Jan 2016, at 17:06 , Peter Klügl <pe...@averbis.com> wrote:
>>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> Am 04.01.2016 um 16:13 schrieb Mario Gazzo:
>>>>>>>>> Hi Peter,
>>>>>>>>>
>>>>>>>>> No problem, I was pretty much offline myself during the Christmas holidays anyway.
>>>>>>>>>
>>>>>>>>> The term “overhead” is probably an exaggeration in this context, especially now that I have disabled the MARKUP initialisation. We had earlier implemented our own XML markup annotator, tailored to our needs with additional annotation types and properties, so the Ruta MARKUP is currently not used. It just happens that we don’t directly use RutaBasic in any of our rules in this particular case, so I was curious whether we could avoid creating them in the first place, since there seem to be quite a few. However, the overall processing required by our Ruta scripts is now small compared to other processing steps, and optimising this further by making RutaBasic optional would currently be a very low priority for us. We would prioritise other features higher, e.g. being able to assign annotations to variables, as we discussed previously in another thread.
>>>>>>>> I am working on this right now and there is finally some first progress :-)
>>>>>>>>
>>>>>>>> I fear that I won't catch all use cases (combinations with language
>>>>>>>> elements) with the first attempt. If you are interested (and want to make
>>>>>>>> sure I do not miss your use case), feel free to take a look at the new
>>>>>>>> unit tests:
>>>>>>>> https://svn.apache.org/repos/asf/uima/ruta/trunk/ruta-core/src/test/java/org/apache/uima/ruta/expression/annotation
>>>>>>>>
>>>>>>>> It's still work in progress. Proposals for more unit tests are very welcome.
>>>>>>>>
>>>>>>>>> We haven’t processed documents as large as those you mention, since books have so far been divided into chapters, so processing could be parallelised accordingly. We also drop extreme outliers above a certain size if we encounter them and batch process them later in smaller chunks, but this has so far not been necessary with our current data sets. As in your case, our processing bottlenecks are now in other components.
>>>>>>>> Ah, it's nice to hear that Ruta is not the bottleneck :-D
>>>>>>>>
>>>>>>>> Best,
>>>>>>>>
>>>>>>>> Peter
>>>>>>>>
>>>>>>>>
>>>>>>>>> Cheers
>>>>>>>>> Mario


Re: Very long Ruta stream initialization

Posted by Mario Gazzo <ma...@gmail.com>.
Yes, where do we sign this?

:-)

> On 07 Jan 2016, at 10:16 , Peter Klügl <pe...@averbis.com> wrote:
> 
> :-) let me know if you need help or have any questions.
> 


Re: Very long Ruta stream initialization

Posted by Peter Klügl <pe...@averbis.com>.
:-) let me know if you need help or have any questions.

Am 07.01.2016 um 10:12 schrieb Mario Gazzo:
> Yes, let us just sign and submit it.
>>>>>> It's still work in progress. Proposals for more unit tests are very welcome.
>>>>>>
>>>>>>> We haven’t processed documents as large as those you mention since books have so far been divided into chapters and processing could therefore be parallelised accordingly. We also drop extreme outliers above a certain size if we encounter them and then we batch process them later in smaller chunks but this has so far not been necessary with our current data sets. Like you, our processing bottlenecks are now in different components.
>>>>>> Ah, that's nice to hear that ruta is not the bottleneck :-D
>>>>>>
>>>>>> Best,
>>>>>>
>>>>>> Peter
>>>>>>
>>>>>>
>>>>>>> Cheers
>>>>>>> Mario
>>>>>>>
>>>>>>>> On 30 Dec 2015, at 16:44 , Peter Klügl <pe...@averbis.com> wrote:
>>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> sorry for the delayed reply.
>>>>>>>>
>>>>>>>> RutaEngine::initializeStream:
>>>>>>>>
>>>>>>>> The special treatment of MARKUPs that causes the increased time required for initialization is just a workaround because I was too lazy to write a working jflex rule. Well, I tried but failed. It shouldn't be hard to improve this code... I will create an issue for it. When I did the last performance optimization, uima did not check the indexes yet and my test set did not contain markups.
>>>>>>>>
>>>>>>>> Deactivate creation of RutaBasic:
>>>>>>>> Short answer is no. I was already thinking about making RutaBasic optional in future so that the user can configure if they are used. However, right now, they are required for rule inference and make the rule inference "fast" in the first place. RutaBasic is just an internal annotation like RutaAnnotation (for SCORE, MARKSCORE) and RutaFrame, and rules should not match on them at all.
>>>>>>>>
>>>>>>>> Some background information:
>>>>>>>>
>>>>>>>> RutaBasics are used for three things:
>>>>>>>> - store additional information in order to avoid index operations. Some useful conditions would require many index operations, e.g., PARTOF or ENDSWITH. RutaBasic is utilized as a cache of which annotations start and end at which position, and which positions are covered by which types.
>>>>>>>> - provide a container to make this information available across analysis engines. Information shared by analysis engines is normally stored in the CAS, e.g. in annotations, (or in external resources). This is the role of RutaBasic. It is not really implemented right now as it should be but I will improve it soon. Then, there is no performance decrease when a pipeline is spammed with small ruta engines.
>>>>>>>> - a basic minimal disjoint partitioning of the document for the coverage based visibility concept.
>>>>>>>>
>>>>>>>> Making RutaBasic optional is possible. If there is a real need for it, e.g., in order to reduce the memory footprint or when processing large documents where parts are simply not interesting, then I will put it on my TODO list. I am also open for other/new ideas how to solve the challenges (and for incremental usage of internal caches).
>>>>>>>>
>>>>>>>> What is your experience with the processing overhead concerning RutaBasic? Is it the rule matching or rather the initialization? I myself had already some performance problems with the initialization and memory consumption in large CAS (500+ pages pdfs). However, other components, serialization and the CAS editor were the actual bottlenecks.
>>>>>>>>
>>>>>>>> Best,
>>>>>>>>
>>>>>>>> Peter
>>>>>>>>
>>>>>>>>
>>>>>>>> Am 22.12.2015 um 17:26 schrieb Mario Gazzo:
>>>>>>>>> I got around it by removing the default seeders by specifying an empty seeders list since we don’t need the MARKUP annotations anymore.
>>>>>>>>>
>>>>>>>>> I still don’t know why it created so much overhead but it sometimes seemed to rival the POS tagger in processing time.
>>>>>>>>>
>>>>>>>>> Anyway, this leads me to the next question. Can I disable the creation of Ruta basic annotations entirely to save processing overhead and only apply Ruta rules to other annotation types created by other AEs such as our own?
>>>>>>>>>
>>>>>>>>> Cheers
>>>>>>>>> Mario
>>>>>>>>>
>>>>>>>>>> On 21 Dec 2015, at 16:09 , Mario Juric <ma...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>> Hi Peter,
>>>>>>>>>>
>>>>>>>>>> I noticed that occasionally the initialisation in RutaEngine::initializeStream can take a very long time. I can’t really explain it and it seems independent of document length since I have seen this with even very small XML documents.
>>>>>>>>>>
>>>>>>>>>> The method seems to spend much time in the DefaultSeeder when creating MARKUP annotations during subiterator.moveToNext calls (line 89) and inside Subiterator it seems to be the while loop inside adjustForStrictForward (line 232), which is inside UIMA core classes. I haven’t gone into any deeper analysis yet but I would first like to hear whether you have an idea what could be the main cause(s) for this?
>>>>>>>>>>
>>>>>>>>>> We use Ruta 2.3.1 with UIMA 2.8.1
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Cheers
>>>>>>>>>> Mario


Re: Very long Ruta stream initialization

Posted by Mario Gazzo <ma...@gmail.com>.
Yes, let us just sign and submit it.

> On 07 Jan 2016, at 10:11 , Peter Klügl <pe...@averbis.com> wrote:
> 
> Hi,
> 
> thanks, that would be great. Patches are simply attached to the issue.
> Non-trivial changes require an ICLA. Do you want to sign and submit it?
> 
> Best,
> 
> Peter
> 
> 
> Am 07.01.2016 um 10:08 schrieb Mario Gazzo:
>> Thanks,
>> 
>> I just added the JIRA issue: https://issues.apache.org/jira/browse/UIMA-4729
>> 
>> If you like, then we can also implement it and submit a patch, just let us know what the process is.
>> 
>> Cheers
>> Mario
>> 
>>> On 07 Jan 2016, at 09:08 , Peter Klügl <pe...@averbis.com> wrote:
>>> 
>>> Hi,
>>> 
>>> Am 06.01.2016 um 14:48 schrieb Mario Gazzo:
>>>> Hi Peter,
>>>> 
>>>> I had a look at the test cases and I think there are many interesting and useful features that cover many of our use cases but I will have to experiment with them before I know what might be missing. I have a few questions though:
>>>> 
>>>> 1) It appears that we would then also be able to assign annotations to lists, which is nice. I am not sure from looking at the tests whether it is possible to use ADD with the annotation lists but I assume so.
>>> Not yet, but I will implement it. It's still work in progress. But
>>> thanks for pointing it out, I would probably have forgotten about it.
>>> 
>>>> 2) The use of addresses is unclear to me just from reading the test, maybe you could explain them? This concept is very new to me.
>>> It's not intended to be utilized directly in a rule file. It's rather
>>> just a way to combine logic in java with ruta rules or use ruta
>>> functionality in java code.
>>> Let's say we have a new method like
>>> boolean Ruta.matches(CAS cas, String rule, AnnotationFS... annotations)
>>> and you call it with something like (syntax is not yet specified)
>>> Ruta.matches(cas, "${PARTOF(Headline)} Keyword;", annotation)
>>> Then, the "$" would be replaced by the address of the annotation and the
>>> method would return whether the annotation is covered by a Headline
>>> annotation and is followed by a Keyword annotation.
>>> 
>>>> 3) The annotation feature expression looks nice but I wonder whether an array element can also be referenced using an int expression and not just a constant e.g. Struct.as[intVar+1]{->T1};
>>> Yes, without allowing number expressions, it would not really be useful.
>>> The current implementation is just a test in order to check whether the
>>> internal object model is good enough to cover it. The complete
>>> functionality will probably not be included in the next release since
>>> there is still much work left in order to get it up and running. The
>>> semantics of such expressions (Struct.as) are resolved on the fly, and
>>> the code does not support expressions at all. I still have to think
>>> about a way to implement it.
>>> 
>>>> The label expressions are also useful and will make some of our rules more readable.
>>>> 
>>>> Finally I have one additional question to the MARKUP initialisation. I have a case where I need the token seeds coming from the default seeder but I don’t want to run the markup initialisation. Is there a separate seeder defined for this somewhere? Right now I have my own copy of the default seeder without the MARKUP initialisation but obviously I do not want to maintain this. It looks as if they could also be split in two seeders with both added as default and then I could overwrite with my own seeder list containing only the token seeder.
>>> Yes, we can split them or just add another one that ignores markup. I
>>> was also always thinking about adding a DetailedSeeder that creates much
>>> more finegrained types like different brackets and quotes... but it was
>>> never on top of my todo list.
>>> 
>>> Do you want to open a jira issue for it?
>>> 
>>> Best,
>>> 
>>> Peter
>>> 
>>>> Cheers
>>>> Mario
>>>> 
>>>> 
>>>>> On 04 Jan 2016, at 17:06 , Peter Klügl <pe...@averbis.com> wrote:
>>>>> 
>>>>> Hi,
>>>>> 
>>>>> Am 04.01.2016 um 16:13 schrieb Mario Gazzo:
>>>>>> Hi Peter,
>>>>>> 
>>>>>> No problem, I was anyway pretty much offline myself during Christmas holidays.
>>>>>> 
>>>>>> The term “overhead” is probably an exaggeration in this context especially after I disabled the MARKUP initialisation. We implemented earlier our own XML markup annotator tailored to better fit our needs with additional annotation types and properties, so the Ruta MARKUP is currently not used. It just happens that we don’t directly use RutaBasic in any of our rules in this particular case so I was curious to know whether we could avoid creating them in the first place since there seems to be quite a few. However, overall processing required by our Ruta scripts compared to other processing steps is now small and sub-optimising this further by making RutaBasic optional would currently be of very low priority to us. We would prioritise other features higher e.g. being able to assign annotations to variables as we discussed previously in another thread.
>>>>> I am working on this right now and there is finally some first progress :-)
>>>>> 
>>>>> I fear that I won't catch all use cases (combinations with language
>>>>> elements) with the first attempt. If you are interested (and wanna take
>>>>> care I do not miss your use case), feel free to take a look at the new
>>>>> unit tests:
>>>>> https://svn.apache.org/repos/asf/uima/ruta/trunk/ruta-core/src/test/java/org/apache/uima/ruta/expression/annotation
>>>>> 
>>>>> It's still work in progress. Proposals for more unit tests are very welcome.
>>>>> 
>>>>>> We haven’t processed documents as large as those you mention since books have so far been divided into chapters and processing could therefore be parallelised accordingly. We also drop extreme outliers above a certain size if we encounter them and then we batch process them later in smaller chunks but this has so far not been necessary with our current data sets. Like you, our processing bottlenecks are now in different components.
>>>>> Ah, that's nice to hear that ruta is not the bottleneck :-D
>>>>> 
>>>>> Best,
>>>>> 
>>>>> Peter
>>>>> 
>>>>> 
>>>>>> Cheers
>>>>>> Mario
>>>>>> 
>>>>>>> On 30 Dec 2015, at 16:44 , Peter Klügl <pe...@averbis.com> wrote:
>>>>>>> 
>>>>>>> Hi,
>>>>>>> 
>>>>>>> sorry for the delayed reply.
>>>>>>> 
>>>>>>> RutaEngine::initializeStream:
>>>>>>> 
>>>>>>> The special treatment of MARKUPs that causes the increased time required for initialization is just a workaround because I was too lazy to write a working jflex rule. Well, I tried but failed. It shouldn't be hard to improve this code... I will create an issue for it. When I did the last performance optimization, uima did not check the indexes yet and my test set did not contain markups.
>>>>>>> 
>>>>>>> Deactivate creation of RutaBasic:
>>>>>>> Short answer is no. I was already thinking about making RutaBasic optional in future so that the user can configure if they are used. However, right now, they are required for rule inference and make the rule inference "fast" in the first place. RutaBasic is just an internal annotation like RutaAnnotation (for SCORE, MARKSCORE) and RutaFrame, and rules should not match on them at all.
>>>>>>> 
>>>>>>> Some background information:
>>>>>>> 
>>>>>>> RutaBasics are used for three things:
>>>>>>> - store additional information in order to avoid index operations. Some useful conditions would require many index operations, e.g., PARTOF or ENDSWITH. RutaBasic is utilized as a cache of which annotations start and end at which position, and which positions are covered by which types.
>>>>>>> - provide a container to make this information available across analysis engines. Information shared by analysis engines is normally stored in the CAS, e.g. in annotations, (or in external resources). This is the role of RutaBasic. It is not really implemented right now as it should be but I will improve it soon. Then, there is no performance decrease when a pipeline is spammed with small ruta engines.
>>>>>>> - a basic minimal disjoint partitioning of the document for the coverage based visibility concept.
>>>>>>> 
>>>>>>> Making RutaBasic optional is possible. If there is a real need for it, e.g., in order to reduce the memory footprint or when processing large documents where parts are simply not interesting, then I will put it on my TODO list. I am also open for other/new ideas how to solve the challenges (and for incremental usage of internal caches).
>>>>>>> 
>>>>>>> What is your experience with the processing overhead concerning RutaBasic? Is it the rule matching or rather the initialization? I myself had already some performance problems with the initialization and memory consumption in large CAS (500+ pages pdfs). However, other components, serialization and the CAS editor were the actual bottlenecks.
>>>>>>> 
>>>>>>> Best,
>>>>>>> 
>>>>>>> Peter
>>>>>>> 
>>>>>>> 
>>>>>>> Am 22.12.2015 um 17:26 schrieb Mario Gazzo:
>>>>>>>> I got around it by removing the default seeders by specifying an empty seeders list since we don’t need the MARKUP annotations anymore.
>>>>>>>> 
>>>>>>>> I still don’t know why it created so much overhead but it sometimes seemed to rival the POS tagger in processing time.
>>>>>>>> 
>>>>>>>> Anyway, this leads me to the next question. Can I disable the creation of Ruta basic annotations entirely to save processing overhead and only apply Ruta rules to other annotation types created by other AEs such as our own?
>>>>>>>> 
>>>>>>>> Cheers
>>>>>>>> Mario
>>>>>>>> 
>>>>>>>>> On 21 Dec 2015, at 16:09 , Mario Juric <ma...@gmail.com> wrote:
>>>>>>>>> 
>>>>>>>>> Hi Peter,
>>>>>>>>> 
>>>>>>>>> I noticed that occasionally the initialisation in RutaEngine::initializeStream can take a very long time. I can’t really explain it and it seems independent of document length since I have seen this with even very small XML documents.
>>>>>>>>> 
>>>>>>>>> The method seems to spend much time in the DefaultSeeder when creating MARKUP annotations during subiterator.moveToNext calls (line 89) and inside Subiterator it seems to be the while loop inside adjustForStrictForward (line 232), which is inside UIMA core classes. I haven’t gone into any deeper analysis yet but I would first like to hear whether you have an idea what could be the main cause(s) for this?
>>>>>>>>> 
>>>>>>>>> We use Ruta 2.3.1 with UIMA 2.8.1
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> Cheers
>>>>>>>>> Mario
>> 
> 


Re: Very long Ruta stream initialization

Posted by Peter Klügl <pe...@averbis.com>.
Hi,

thanks, that would be great. Patches are simply attached to the issue.
Non-trivial changes require an ICLA. Do you want to sign and submit it?

Best,

Peter
 

Am 07.01.2016 um 10:08 schrieb Mario Gazzo:
> Thanks,
>
> I just added the JIRA issue: https://issues.apache.org/jira/browse/UIMA-4729
>
> If you like, then we can also implement it and submit a patch, just let us know what the process is.
>
> Cheers
> Mario
>
>> On 07 Jan 2016, at 09:08 , Peter Klügl <pe...@averbis.com> wrote:
>>
>> Hi,
>>
>> Am 06.01.2016 um 14:48 schrieb Mario Gazzo:
>>> Hi Peter,
>>>
>>> I had a look at the test cases and I think there are many interesting and useful features that cover many of our use cases but I will have to experiment with them before I know what might be missing. I have a few questions though:
>>>
>>> 1) It appears that we would then also be able to assign annotations to lists, which is nice. I am not sure from looking at the tests whether it is possible to use ADD with the annotation lists but I assume so.
>> Not yet, but I will implement it. It's still work in progress. But
>> thanks for pointing it out, I would probably have forgotten about it.
>>
>>> 2) The use of addresses is unclear to me just from reading the test, maybe you could explain them? This concept is very new to me.
>> It's not intended to be utilized directly in a rule file. It's rather
>> just a way to combine logic in java with ruta rules or use ruta
>> functionality in java code.
>> Let's say we have a new method like
>> boolean Ruta.matches(CAS cas, String rule, AnnotationFS... annotations)
>> and you call it with something like (syntax is not yet specified)
>> Ruta.matches(cas, "${PARTOF(Headline)} Keyword;", annotation)
>> Then, the "$" would be replaced by the address of the annotation and the
>> method would return whether the annotation is covered by a Headline
>> annotation and is followed by a Keyword annotation.
>>
>>> 3) The annotation feature expression looks nice but I wonder whether an array element can also be referenced using an int expression and not just a constant e.g. Struct.as[intVar+1]{->T1};
>> Yes, without allowing number expressions, it would not really be useful.
>> The current implementation is just a test in order to check whether the
>> internal object model is good enough to cover it. The complete
>> functionality will probably not be included in the next release since
>> there is still much work left in order to get it up and running. The
>> semantics of such expressions (Struct.as) are resolved on the fly, and
>> the code does not support expressions at all. I still have to think
>> about a way to implement it.
>>
>>> The label expressions are also useful and will make some of our rules more readable.
>>>
>>> Finally I have one additional question to the MARKUP initialisation. I have a case where I need the token seeds coming from the default seeder but I don’t want to run the markup initialisation. Is there a separate seeder defined for this somewhere? Right now I have my own copy of the default seeder without the MARKUP initialisation but obviously I do not want to maintain this. It looks as if they could also be split in two seeders with both added as default and then I could overwrite with my own seeder list containing only the token seeder.
>> Yes, we can split them or just add another one that ignores markup. I
>> was also always thinking about adding a DetailedSeeder that creates much
>> more finegrained types like different brackets and quotes... but it was
>> never on top of my todo list.
>>
>> Do you want to open a jira issue for it?
>>
>> Best,
>>
>> Peter
>>
>>> Cheers
>>> Mario
>>>
>>>
>>>> On 04 Jan 2016, at 17:06 , Peter Klügl <pe...@averbis.com> wrote:
>>>>
>>>> Hi,
>>>>
>>>> Am 04.01.2016 um 16:13 schrieb Mario Gazzo:
>>>>> Hi Peter,
>>>>>
>>>>> No problem, I was anyway pretty much offline myself during Christmas holidays.
>>>>>
>>>>> The term “overhead” is probably an exaggeration in this context especially after I disabled the MARKUP initialisation. We implemented earlier our own XML markup annotator tailored to better fit our needs with additional annotation types and properties, so the Ruta MARKUP is currently not used. It just happens that we don’t directly use RutaBasic in any of our rules in this particular case so I was curious to know whether we could avoid creating them in the first place since there seems to be quite a few. However, overall processing required by our Ruta scripts compared to other processing steps is now small and sub-optimising this further by making RutaBasic optional would currently be of very low priority to us. We would prioritise other features higher e.g. being able to assign annotations to variables as we discussed previously in another thread.
>>>> I am working on this right now and there is finally some first progress :-)
>>>>
>>>> I fear that I won't catch all use cases (combinations with language
>>>> elements) with the first attempt. If you are interested (and wanna take
>>>> care I do not miss your use case), feel free to take a look at the new
>>>> unit tests:
>>>> https://svn.apache.org/repos/asf/uima/ruta/trunk/ruta-core/src/test/java/org/apache/uima/ruta/expression/annotation
>>>>
>>>> It's still work in progress. Proposals for more unit tests are very welcome.
>>>>
>>>>> We haven’t processed documents as large as those you mention since books have so far been divided into chapters and processing could therefore be parallelised accordingly. We also drop extreme outliers above a certain size if we encounter them and then we batch process them later in smaller chunks but this has so far not been necessary with our current data sets. Like you, our processing bottlenecks are now in different components.
>>>> Ah, that's nice to hear that ruta is not the bottleneck :-D
>>>>
>>>> Best,
>>>>
>>>> Peter
>>>>
>>>>
>>>>> Cheers
>>>>> Mario
>>>>>
>>>>>> On 30 Dec 2015, at 16:44 , Peter Klügl <pe...@averbis.com> wrote:
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> sorry for the delayed reply.
>>>>>>
>>>>>> RutaEngine::initializeStream:
>>>>>>
>>>>>> The special treatment of MARKUPs that causes the increased time required for initialization is just a workaround because I was too lazy to write a working jflex rule. Well, I tried but failed. It shouldn't be hard to improve this code... I will create an issue for it. When I did the last performance optimization, uima did not check the indexes yet and my test set did not contain markups.
>>>>>>
>>>>>> Deactivate creation of RutaBasic:
>>>>>> Short answer is no. I was already thinking about making RutaBasic optional in future so that the user can configure if they are used. However, right now, they are required for rule inference and make the rule inference "fast" in the first place. RutaBasic is just an internal annotation like RutaAnnotation (for SCORE, MARKSCORE) and RutaFrame, and rules should not match on them at all.
>>>>>>
>>>>>> Some background information:
>>>>>>
>>>>>> RutaBasics are used for three things:
>>>>>> - store additional information in order to avoid index operations. Some useful conditions would require many index operations, e.g., PARTOF or ENDSWITH. RutaBasic is utilized as a cache of which annotations start and end at which position, and which positions are covered by which types.
>>>>>> - provide a container to make this information available across analysis engines. Information shared by analysis engines is normally stored in the CAS, e.g. in annotations, (or in external resources). This is the role of RutaBasic. It is not really implemented right now as it should be but I will improve it soon. Then, there is no performance decrease when a pipeline is spammed with small ruta engines.
>>>>>> - a basic minimal disjoint partitioning of the document for the coverage based visibility concept.
>>>>>>
>>>>>> Making RutaBasic optional is possible. If there is a real need for it, e.g., in order to reduce the memory footprint or when processing large documents where parts are simply not interesting, then I will put it on my TODO list. I am also open for other/new ideas how to solve the challenges (and for incremental usage of internal caches).
>>>>>>
>>>>>> What is your experience with the processing overhead concerning RutaBasic? Is it the rule matching or rather the initialization? I myself had already some performance problems with the initialization and memory consumption in large CAS (500+ pages pdfs). However, other components, serialization and the CAS editor were the actual bottlenecks.
>>>>>>
>>>>>> Best,
>>>>>>
>>>>>> Peter
>>>>>>
>>>>>>
>>>>>> Am 22.12.2015 um 17:26 schrieb Mario Gazzo:
>>>>>>> I got around it by removing the default seeders by specifying an empty seeders list since we don’t need the MARKUP annotations anymore.
>>>>>>>
>>>>>>> I still don’t know why it created so much overhead but it sometimes seemed to rival the POS tagger in processing time.
>>>>>>>
>>>>>>> Anyway, this leads me to the next question. Can I disable the creation of Ruta basic annotations entirely to save processing overhead and only apply Ruta rules to other annotation types created by other AEs such as our own?
>>>>>>>
>>>>>>> Cheers
>>>>>>> Mario
>>>>>>>
>>>>>>>> On 21 Dec 2015, at 16:09 , Mario Juric <ma...@gmail.com> wrote:
>>>>>>>>
>>>>>>>> Hi Peter,
>>>>>>>>
>>>>>>>> I noticed that occasionally the initialisation in RutaEngine::initializeStream can take a very long time. I can’t really explain it and it seems independent of document length since I have seen this with even very small XML documents.
>>>>>>>>
>>>>>>>> The method seems to spend much time in the DefaultSeeder when creating MARKUP annotations during subiterator.moveToNext calls (line 89) and inside Subiterator it seems to be the while loop inside adjustForStrictForward (line 232), which is inside UIMA core classes. I haven’t gone into any deeper analysis yet but I would first like to hear whether you have an idea what could be the main cause(s) for this?
>>>>>>>>
>>>>>>>> We use Ruta 2.3.1 with UIMA 2.8.1
>>>>>>>>
>>>>>>>>
>>>>>>>> Cheers
>>>>>>>> Mario
>
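
The first and third roles of RutaBasic described in the quoted message above (a cache of which types cover which positions, built on a minimal disjoint partitioning of the document) can be pictured with a toy example. This is only a sketch of the general idea and not Ruta's actual implementation: the class PartitionCacheSketch, the Span type and the partOf method are hypothetical names invented for illustration, and the check is a simplified per-segment lookup rather than the real PARTOF semantics.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

/** Illustration only: a toy cache over a minimal disjoint partitioning of a document. */
public class PartitionCacheSketch {

    /** A typed span with begin (inclusive) and end (exclusive) offsets. Hypothetical type. */
    public static final class Span {
        final String type;
        final int begin;
        final int end;

        public Span(String type, int begin, int end) {
            this.type = type;
            this.begin = begin;
            this.end = end;
        }
    }

    // Sorted segment boundaries; segment i covers [bounds[i], bounds[i + 1]).
    private final int[] bounds;
    // For each segment, the set of types that cover it (computed once, then only looked up).
    private final List<Set<String>> coveringTypes = new ArrayList<>();

    public PartitionCacheSketch(int docLength, List<Span> spans) {
        // Every annotation boundary becomes a cut, which yields the minimal disjoint segments.
        TreeSet<Integer> cuts = new TreeSet<>(List.of(0, docLength));
        for (Span s : spans) {
            cuts.add(s.begin);
            cuts.add(s.end);
        }
        bounds = cuts.stream().mapToInt(Integer::intValue).toArray();
        for (int i = 0; i + 1 < bounds.length; i++) {
            Set<String> types = new HashSet<>();
            for (Span s : spans) {
                if (s.begin <= bounds[i] && bounds[i + 1] <= s.end) {
                    types.add(s.type);
                }
            }
            coveringTypes.add(types);
        }
    }

    /** True if every segment overlapping [begin, end) is covered by the given type. */
    public boolean partOf(int begin, int end, String type) {
        int i = Arrays.binarySearch(bounds, begin);
        if (i < 0) {
            i = Math.max(0, -i - 2); // index of the segment that contains 'begin'
        }
        for (; i + 1 < bounds.length && bounds[i] < end; i++) {
            if (!coveringTypes.get(i).contains(type)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        PartitionCacheSketch cache = new PartitionCacheSketch(30, List.of(
                new Span("Headline", 0, 20),
                new Span("Keyword", 5, 12)));
        System.out.println(cache.partOf(5, 12, "Headline"));  // true
        System.out.println(cache.partOf(12, 25, "Headline")); // false
    }
}
```

The point of the sketch is the trade-off mentioned in the thread: building the segments costs time and memory up front, but a later coverage check becomes a lookup over a few segments instead of repeated index operations.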


Re: Very long Ruta stream initialization

Posted by Mario Gazzo <ma...@gmail.com>.
Thanks,

I just added the JIRA issue: https://issues.apache.org/jira/browse/UIMA-4729

If you like, then we can also implement it and submit a patch, just let us know what the process is.

Cheers
Mario

> On 07 Jan 2016, at 09:08 , Peter Klügl <pe...@averbis.com> wrote:
> 
> Hi,
> 
> Am 06.01.2016 um 14:48 schrieb Mario Gazzo:
>> Hi Peter,
>> 
>> I had a look at the test cases and I think there are many interesting and useful features that cover many of our use cases but I will have to experiment with them before I know what might be missing. I have a few questions though:
>> 
>> 1) It appears that we would then also be able to assign annotations to lists, which is nice. I am not sure from looking at the tests whether it is possible to use ADD with the annotation lists but I assume so.
> 
> Not yet, but I will implement it. It's still work in progress. But
> thanks for pointing it out, I would probably have forgotten about it.
> 
>> 2) The use of addresses is unclear to me just from reading the test, maybe you could explain them? This concept is very new to me.
> 
> It's not intended to be utilized directly in a rule file. It's rather
> just a way to combine logic in java with ruta rules or use ruta
> functionality in java code.
> Let's say we have a new method like
> boolean Ruta.matches(CAS cas, String rule, AnnotationFS... annotations)
> and you call it with something like (syntax is not yet specified)
> Ruta.matches(cas, "${PARTOF(Headline)} Keyword;", annotation)
> Then, the "$" would be replaced by the address of the annotation and the
> method would return whether the annotation is covered by a Headline
> annotation and is followed by a Keyword annotation.
> 
>> 3) The annotation feature expression looks nice but I wonder whether an array element can also be referenced using an int expression and not just a constant e.g. Struct.as[intVar+1]{->T1};
> 
> Yes, without allowing number expressions, it would not really be useful.
> The current implementation is just a test in order to check whether the
> internal object model is good enough to cover it. The complete
> functionality will probably not be included in the next release since
> there is still much work left in order to get it up and running. The
> semantics of such expressions (Struct.as) are resolved on the fly, and
> the code does not support expressions at all. I still have to think
> about a way to implement it.
> 
>> The label expressions are also useful and will make some of our rules more readable.
>> 
>> Finally I have one additional question to the MARKUP initialisation. I have a case where I need the token seeds coming from the default seeder but I don’t want to run the markup initialisation. Is there a separate seeder defined for this somewhere? Right now I have my own copy of the default seeder without the MARKUP initialisation but obviously I do not want to maintain this. It looks as if they could also be split in two seeders with both added as default and then I could overwrite with my own seeder list containing only the token seeder.
> 
> Yes, we can split them or just add another one that ignores markup. I
> was also always thinking about adding a DetailedSeeder that creates much
> more finegrained types like different brackets and quotes... but it was
> never on top of my todo list.
> 
> Do you want to open a jira issue for it?
> 
> Best,
> 
> Peter
> 
>> Cheers
>> Mario
>> 
>> 
>>> On 04 Jan 2016, at 17:06 , Peter Klügl <pe...@averbis.com> wrote:
>>> 
>>> Hi,
>>> 
>>> Am 04.01.2016 um 16:13 schrieb Mario Gazzo:
>>>> Hi Peter,
>>>> 
>>>> No problem, I was anyway pretty much offline myself during Christmas holidays.
>>>> 
>>>> The term “overhead” is probably an exaggeration in this context especially after I disabled the MARKUP initialisation. We implemented earlier our own XML markup annotator tailored to better fit our needs with additional annotation types and properties, so the Ruta MARKUP is currently not used. It just happens that we don’t directly use RutaBasic in any of our rules in this particular case so I was curious to know whether we could avoid creating them in the first place since there seems to be quite a few. However, overall processing required by our Ruta scripts compared to other processing steps is now small and sub-optimising this further by making RutaBasic optional would currently be of very low priority to us. We would prioritise other features higher e.g. being able to assign annotations to variables as we discussed previously in another thread.
>>> I am working on this right now and there is finally some first progress :-)
>>> 
>>> I fear that I won't catch all use cases (combinations with language
>>> elements) with the first attempt. If you are interested (and wanna take
>>> care I do not miss your use case), feel free to take a look at the new
>>> unit tests:
>>> https://svn.apache.org/repos/asf/uima/ruta/trunk/ruta-core/src/test/java/org/apache/uima/ruta/expression/annotation
>>> 
>>> It's still work in progress. Proposals for more unit tests are very welcome.
>>> 
>>>> We haven’t processed documents as large as those you mention since books have so far been divided into chapters and processing could therefore be parallelised accordingly. We also drop extreme outliers above a certain size if we encounter them and then we batch process them later in smaller chunks but this has so far not been necessary with our current data sets. Like you, our processing bottlenecks are now in different components.
>>> Ah, that's nice to hear that ruta is not the bottleneck :-D
>>> 
>>> Best,
>>> 
>>> Peter
>>> 
>>> 
>>>> Cheers
>>>> Mario
>>>> 
>>>>> On 30 Dec 2015, at 16:44 , Peter Klügl <pe...@averbis.com> wrote:
>>>>> 
>>>>> Hi,
>>>>> 
>>>>> sorry for the delayed reply.
>>>>> 
>>>>> RutaEngine::initializeStream:
>>>>> 
>>>>> The special treatment of MARKUPs that causes the increased time required for initialization is just a workaround because I was too lazy to write a working jflex rule. Well, I tried but failed. It shouldn't be hard to improve this code... I will create an issue for it. When I did the last performance optimization, uima did not check the indexes yet and my test set did not contain markups.
>>>>> 
>>>>> Deactivate creation of RutaBasic:
>>>>> Short answer is no. I was already thinking about making RutaBasic optional in future so that the user can configure if they are used. However, right now, they are required for rule inference and make the rule inference "fast" in the first place. RutaBasic is just an internal annotation like RutaAnnotation (for SCORE, MARKSCORE) and RutaFrame, and rules should not match on them at all.
>>>>> 
>>>>> Some background information:
>>>>> 
>>>>> RutaBasics are used for three things:
>>>>> - store additional information in order to avoid index operations. Some useful conditions would require many index operations, e.g., PARTOF or ENDSWITH. RutaBasic is utilized as a cache what annotations start and end at which position, and which positions are covered by which types.
>>>>> - provide a container to make this information available across analysis engines. Information shared by analysis engine is normally stored in the CAS, e.g. in annotations, (or in external resources). This is the role of RutaBasic. It is not really implemented right now as it should be but I will improve it soon. Then, there is no performance decrease when a pipeline is spammed with small ruta engines.
>>>>> - a basic minimal disjunct partitioning of the document for the coverage based visibility concept.
>>>>> 
>>>>> Making RutaBasic optional is possible. If there is a real need for it, e.g., in order to reduce the memory footprint or when processing large documents where parts are simply not interesting, then I will put it on my TODO list. I am also open for other/new ideas how to solve the challenges (and for incremental usage of internal caches).
>>>>> 
>>>>> What is your experience with the processing overhead concerning RutaBasic? Is it the rule matching or rather the initialization? I myself have already had some performance problems with the initialization and memory consumption in large CAS (500+ page PDFs). However, other components, serialization and the CAS editor were the actual bottlenecks.
>>>>> 
>>>>> Best,
>>>>> 
>>>>> Peter
>>>>> 
>>>>> 
>>>>> Am 22.12.2015 um 17:26 schrieb Mario Gazzo:
>>>>>> I got around it by removing the default seeders by specifying an empty seeders list since we don’t need the MARKUP annotations anymore.
>>>>>> 
>>>>>> I still don’t know why it created so much overhead but it sometimes seemed to rival the POS tagger in processing time.
>>>>>> 
>>>>>> Anyway, this leads me to the next question. Can I disable the creation of Ruta basic annotations entirely to save processing overhead and only apply Ruta rules to other annotation types created by other AEs such as our own?
>>>>>> 
>>>>>> Cheers
>>>>>> Mario
>>>>>> 
>>>>>>> On 21 Dec 2015, at 16:09 , Mario Juric <ma...@gmail.com> wrote:
>>>>>>> 
>>>>>>> Hi Peter,
>>>>>>> 
>>>>>>> I noticed that occasionally the initialisation in RutaEngine::initializeStream can take a very long time. I can’t really explain it, and it seems independent of document length, since I have seen this even with very small XML documents.
>>>>>>> 
>>>>>>> The method seems to spend much time in the DefaultSeeder when creating MARKUP annotations during subiterator.moveToNext calls (line 89) and inside Subiterator it seems to be the while loop inside adjustForStrictForward (line 232), which is inside UIMA core classes. I haven’t done any deeper analysis yet, but I would first like to hear whether you have an idea what the main cause(s) could be.
>>>>>>> 
>>>>>>> We use Ruta 2.3.1 with UIMA 2.8.1
>>>>>>> 
>>>>>>> 
>>>>>>> Cheers
>>>>>>> Mario
> 


Re: Very long Ruta stream initialization

Posted by Peter Klügl <pe...@averbis.com>.
Hi,

Am 06.01.2016 um 14:48 schrieb Mario Gazzo:
> Hi Peter,
>
> I had a look at the test cases and I think there are many interesting and useful features that cover many of our use cases but I will have to experiment with them before I know what might be missing. I have a few questions though:
>
> 1) It appears that we would then also be able to assign annotations to lists, which is nice. I am not sure from looking at the tests whether it is possible to use ADD with the annotation lists but I assume so.

Not yet, but I will implement it. It's still work in progress. But
thanks for pointing it out, I would probably have forgotten about it.

> 2) The use of addresses is unclear to me just from reading the tests; maybe you could explain them? This concept is very new to me.

It's not intended to be used directly in a rule file. It's rather
just a way to combine logic in Java with Ruta rules or to use Ruta
functionality in Java code.
Let's say we have a new method like
boolean Ruta.matches(CAS cas, String rule, AnnotationFS... annotations)
and you call it with something like (syntax is not yet specified)
Ruta.matches(cas, "${PARTOF(Headline)} Keyword;", annotation)
Then, the "$" would be replaced by the address of the annotation and the
method would return whether the annotation is covered by a Headline
annotation and is followed by a Keyword annotation.
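
The placeholder step alone can be illustrated with a small standalone
sketch. It is not Ruta code and makes several assumptions:
bindAddresses is a hypothetical helper, the addresses are plain ints
standing in for real annotation addresses, and the textual form of the
bound rule is only a guess, since the syntax is not yet specified.

```java
public class PlaceholderSketch {

    // Hypothetical helper, not part of the Ruta API: replaces each "$" in the
    // rule text with the next address, in order of appearance.
    public static String bindAddresses(String rule, int... addresses) {
        StringBuilder bound = new StringBuilder();
        int next = 0;
        for (char c : rule.toCharArray()) {
            if (c == '$') {
                if (next >= addresses.length) {
                    throw new IllegalArgumentException("more placeholders than addresses");
                }
                bound.append(addresses[next++]);
            } else {
                bound.append(c);
            }
        }
        if (next != addresses.length) {
            throw new IllegalArgumentException("more addresses than placeholders");
        }
        return bound.toString();
    }

    public static void main(String[] args) {
        // 1234 is a made-up address standing in for a real annotation.
        System.out.println(bindAddresses("${PARTOF(Headline)} Keyword;", 1234));
    }
}
```

The bound rule text would then be handed to the rule engine, which
would look up the annotation by its address before matching.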

> 3) The annotation feature expression looks nice but I wonder whether an array element can also be referenced using an int expression and not just a constant e.g. Struct.as[intVar+1]{->T1};

Yes, without allowing number expressions, it would not really be useful.
The current implementation is just a test in order to check whether the
internal object model is good enough to cover it. The complete
functionality will probably not be included in the next release since
there is still much work left in order to get it up and running. The
semantics of such expressions (Struct.as) are resolved on the fly, and
the code does not support expressions at all. I still have to think
about a way to implement it.

> The label expressions are also useful and will make some of our rules more readable.
>
> Finally, I have one additional question about the MARKUP initialisation. I have a case where I need the token seeds coming from the default seeder but I don’t want to run the markup initialisation. Is there a separate seeder defined for this somewhere? Right now I have my own copy of the default seeder without the MARKUP initialisation, but obviously I do not want to maintain this. It looks as if they could be split into two seeders, both added by default, and then I could override the defaults with my own seeder list containing only the token seeder.

Yes, we can split them or just add another one that ignores markup. I
was also always thinking about adding a DetailedSeeder that creates much
more finegrained types like different brackets and quotes... but it was
never on top of my todo list.

Do you want to open a jira issue for it?

Best,

Peter

> Cheers
> Mario
>
>
>> On 04 Jan 2016, at 17:06 , Peter Klügl <pe...@averbis.com> wrote:
>>
>> Hi,
>>
>> Am 04.01.2016 um 16:13 schrieb Mario Gazzo:
>>> Hi Peter,
>>>
>>> No problem, I was anyway pretty much offline myself during Christmas holidays.
>>>
>>> The term “overhead” is probably an exaggeration in this context especially after I disabled the MARKUP initialisation. We implemented earlier our own XML markup annotator tailored to better fit our needs with additional annotation types and properties, so the Ruta MARKUP is currently not used. It just happens that we don’t directly use RutaBasic in any of our rules in this particular case so I was curious to know whether we could avoid creating them in the first place since there seems to be quite a few. However, overall processing required by our Ruta scripts compared to other processing steps is now small and sub-optimising this further by making RutaBasic optional would currently be of very low priority to us. We would prioritise other features higher e.g. being able to assign annotations to variables as we discussed previously in another thread.
>> I am working on this right now and there is finally some first progress :-)
>>
>> I fear that I won't catch all use cases (combinations with language
>> elements) with the first attempt. If you are interested (and wanna take
>> care I do not miss your use case), feel free to take a look at the new
>> unit tests:
>> https://svn.apache.org/repos/asf/uima/ruta/trunk/ruta-core/src/test/java/org/apache/uima/ruta/expression/annotation
>>
>> It's still work in progress. Proposals for more unit tests are very welcome.
>>
>>> We haven’t processed documents as large as those you mention since books have so far been divided into chapters and processing could therefore be parallelised accordingly. We also drop extreme outliers above a certain size if we encounter them and then we batch process them later in smaller chunks but this has so far not been necessary with our current data sets. Like you, our processing bottlenecks are now in different components.
>> Ah, that's nice to hear that ruta is not the bottleneck :-D
>>
>> Best,
>>
>> Peter
>>
>>
>>> Cheers
>>> Mario
>>>
>>>> On 30 Dec 2015, at 16:44 , Peter Klügl <pe...@averbis.com> wrote:
>>>>
>>>> Hi,
>>>>
>>>> sorry for the delayed reply.
>>>>
>>>> RutaEngine::initializeStream:
>>>>
>>>> The special treatment of MARKUPs that causes the increased time required for initialization is just a workaround because I was too lazy to write a working jflex rule. Well, I tried but failed. It shouldn't be hard to improve this code... I will create an issue for it. When I did the last performance optimization, uima did not check the indexes yet and my test set did not contain markups.
>>>>
>>>> Deactivate creation of RutaBasic:
>>>> Short answer is no. I was already thinking about making RutaBasic optional in future so that the user can configure if they are used. However, right now, they are required for rule inference and make the rule inference "fast" in the first place. RutaBasic is just an internal annotation like RutaAnnotation (for SCORE, MARKSCORE) and RutaFrame, and rules should not match on them at all.
>>>>
>>>> Some background information:
>>>>
>>>> RutaBasics are used for three things:
>>>> - store additional information in order to avoid index operations. Some useful conditions would require many index operations, e.g., PARTOF or ENDSWITH. RutaBasic is utilized as a cache what annotations start and end at which position, and which positions are covered by which types.
>>>> - provide a container to make this information available across analysis engines. Information shared by analysis engine is normally stored in the CAS, e.g. in annotations, (or in external resources). This is the role of RutaBasic. It is not really implemented right now as it should be but I will improve it soon. Then, there is no performance decrease when a pipeline is spammed with small ruta engines.
>>>> - a basic minimal disjunct partitioning of the document for the coverage based visibility concept.
>>>>
>>>> Making RutaBasic optional is possible. If there is a real need for it, e.g., in order to reduce the memory footprint or when processing large documents where parts are simply not interesting, then I will put it on my TODO list. I am also open for other/new ideas how to solve the challenges (and for incremental usage of internal caches).
>>>>
>>>> What is your experience with the processing overhead concerning RutaBasic? Is it the rule matching or rather the initialization? I myself have already had some performance problems with the initialization and memory consumption in large CAS (500+ page PDFs). However, other components, serialization and the CAS editor were the actual bottlenecks.
>>>>
>>>> Best,
>>>>
>>>> Peter
>>>>
>>>>
>>>> Am 22.12.2015 um 17:26 schrieb Mario Gazzo:
>>>>> I got around it by removing the default seeders by specifying an empty seeders list since we don’t need the MARKUP annotations anymore.
>>>>>
>>>>> I still don’t know why it created so much overhead but it sometimes seemed to rival the POS tagger in processing time.
>>>>>
>>>>> Anyway, this leads me to the next question. Can I disable the creation of Ruta basic annotations entirely to save processing overhead and only apply Ruta rules to other annotation types created by other AEs such as our own?
>>>>>
>>>>> Cheers
>>>>> Mario
>>>>>
>>>>>> On 21 Dec 2015, at 16:09 , Mario Juric <ma...@gmail.com> wrote:
>>>>>>
>>>>>> Hi Peter,
>>>>>>
>>>>>> I noticed that occasionally the initialisation in RutaEngine::initializeStream can take a very long time. I can’t really explain it, and it seems independent of document length, since I have seen this even with very small XML documents.
>>>>>>
>>>>>> The method seems to spend much time in the DefaultSeeder when creating MARKUP annotations during subiterator.moveToNext calls (line 89) and inside Subiterator it seems to be the while loop inside adjustForStrictForward (line 232), which is inside UIMA core classes. I haven’t done any deeper analysis yet, but I would first like to hear whether you have an idea what the main cause(s) could be.
>>>>>>
>>>>>> We use Ruta 2.3.1 with UIMA 2.8.1
>>>>>>
>>>>>>
>>>>>> Cheers
>>>>>> Mario


Re: Very long Ruta stream initialization

Posted by Mario Gazzo <ma...@gmail.com>.
Hi Peter,

I had a look at the test cases and I think there are many interesting and useful features that cover many of our use cases but I will have to experiment with them before I know what might be missing. I have a few questions though:

1) It appears that we would then also be able to assign annotations to lists, which is nice. I am not sure from looking at the tests whether it is possible to use ADD with the annotation lists but I assume so.
2) The use of addresses is unclear to me just from reading the tests; maybe you could explain them? This concept is very new to me.
3) The annotation feature expression looks nice but I wonder whether an array element can also be referenced using an int expression and not just a constant e.g. Struct.as[intVar+1]{->T1};

The label expressions are also useful and will make some of our rules more readable.

Finally, I have one additional question about the MARKUP initialisation. I have a case where I need the token seeds coming from the default seeder but I don’t want to run the markup initialisation. Is there a separate seeder defined for this somewhere? Right now I have my own copy of the default seeder without the MARKUP initialisation, but obviously I do not want to maintain this. It looks as if they could be split into two seeders, both added by default, and then I could override the defaults with my own seeder list containing only the token seeder.

Cheers
Mario


> On 04 Jan 2016, at 17:06 , Peter Klügl <pe...@averbis.com> wrote:
> 
> Hi,
> 
> Am 04.01.2016 um 16:13 schrieb Mario Gazzo:
>> Hi Peter,
>> 
>> No problem, I was anyway pretty much offline myself during Christmas holidays.
>> 
>> The term “overhead” is probably an exaggeration in this context especially after I disabled the MARKUP initialisation. We implemented earlier our own XML markup annotator tailored to better fit our needs with additional annotation types and properties, so the Ruta MARKUP is currently not used. It just happens that we don’t directly use RutaBasic in any of our rules in this particular case so I was curious to know whether we could avoid creating them in the first place since there seems to be quite a few. However, overall processing required by our Ruta scripts compared to other processing steps is now small and sub-optimising this further by making RutaBasic optional would currently be of very low priority to us. We would prioritise other features higher e.g. being able to assign annotations to variables as we discussed previously in another thread.
> 
> I am working on this right now and there is finally some first progress :-)
> 
> I fear that I won't catch all use cases (combinations with language
> elements) with the first attempt. If you are interested (and wanna take
> care I do not miss your use case), feel free to take a look at the new
> unit tests:
> https://svn.apache.org/repos/asf/uima/ruta/trunk/ruta-core/src/test/java/org/apache/uima/ruta/expression/annotation
> 
> It's still work in progress. Proposals for more unit tests are very welcome.
> 
>> We haven’t processed documents as large as those you mention since books have so far been divided into chapters and processing could therefore be parallelised accordingly. We also drop extreme outliers above a certain size if we encounter them and then we batch process them later in smaller chunks but this has so far not been necessary with our current data sets. Like you, our processing bottlenecks are now in different components.
> 
> Ah, that's nice to hear that ruta is not the bottleneck :-D
> 
> Best,
> 
> Peter
> 
> 
>> Cheers
>> Mario
>> 
>>> On 30 Dec 2015, at 16:44 , Peter Klügl <pe...@averbis.com> wrote:
>>> 
>>> Hi,
>>> 
>>> sorry for the delayed reply.
>>> 
>>> RutaEngine::initializeStream:
>>> 
>>> The special treatment of MARKUPs that causes the increased time required for initialization is just a workaround because I was too lazy to write a working jflex rule. Well, I tried but failed. It shouldn't be hard to improve this code... I will create an issue for it. When I did the last performance optimization, uima did not check the indexes yet and my test set did not contain markups.
>>> 
>>> Deactivate creation of RutaBasic:
>>> Short answer is no. I was already thinking about making RutaBasic optional in future so that the user can configure if they are used. However, right now, they are required for rule inference and make the rule inference "fast" in the first place. RutaBasic is just an internal annotation like RutaAnnotation (for SCORE, MARKSCORE) and RutaFrame, and rules should not match on them at all.
>>> 
>>> Some background information:
>>> 
>>> RutaBasics are used for three things:
>>> - store additional information in order to avoid index operations. Some useful conditions would require many index operations, e.g., PARTOF or ENDSWITH. RutaBasic is utilized as a cache what annotations start and end at which position, and which positions are covered by which types.
>>> - provide a container to make this information available across analysis engines. Information shared by analysis engine is normally stored in the CAS, e.g. in annotations, (or in external resources). This is the role of RutaBasic. It is not really implemented right now as it should be but I will improve it soon. Then, there is no performance decrease when a pipeline is spammed with small ruta engines.
>>> - a basic minimal disjunct partitioning of the document for the coverage based visibility concept.
>>> 
>>> Making RutaBasic optional is possible. If there is a real need for it, e.g., in order to reduce the memory footprint or when processing large documents where parts are simply not interesting, then I will put it on my TODO list. I am also open for other/new ideas how to solve the challenges (and for incremental usage of internal caches).
>>> 
>>> What is your experience with the processing overhead concerning RutaBasic? Is it the rule matching or rather the initialization? I myself have already had some performance problems with the initialization and memory consumption in large CAS (500+ page PDFs). However, other components, serialization and the CAS editor were the actual bottlenecks.
>>> 
>>> Best,
>>> 
>>> Peter
>>> 
>>> 
>>> Am 22.12.2015 um 17:26 schrieb Mario Gazzo:
>>>> I got around it by removing the default seeders by specifying an empty seeders list since we don’t need the MARKUP annotations anymore.
>>>> 
>>>> I still don’t know why it created so much overhead but it sometimes seemed to rival the POS tagger in processing time.
>>>> 
>>>> Anyway, this leads me to the next question. Can I disable the creation of Ruta basic annotations entirely to save processing overhead and only apply Ruta rules to other annotation types created by other AEs such as our own?
>>>> 
>>>> Cheers
>>>> Mario
>>>> 
>>>>> On 21 Dec 2015, at 16:09 , Mario Juric <ma...@gmail.com> wrote:
>>>>> 
>>>>> Hi Peter,
>>>>> 
>>>>> I noticed that occasionally the initialisation in RutaEngine::initializeStream can take a very long time. I can’t really explain it, and it seems independent of document length, since I have seen this even with very small XML documents.
>>>>> 
>>>>> The method seems to spend much time in the DefaultSeeder when creating MARKUP annotations during subiterator.moveToNext calls (line 89) and inside Subiterator it seems to be the while loop inside adjustForStrictForward (line 232), which is inside UIMA core classes. I haven’t done any deeper analysis yet, but I would first like to hear whether you have an idea what the main cause(s) could be.
>>>>> 
>>>>> We use Ruta 2.3.1 with UIMA 2.8.1
>>>>> 
>>>>> 
>>>>> Cheers
>>>>> Mario
> 


Re: Very long Ruta stream initialization

Posted by Peter Klügl <pe...@averbis.com>.
Hi,

Am 04.01.2016 um 16:13 schrieb Mario Gazzo:
> Hi Peter,
>
> No problem, I was anyway pretty much offline myself during Christmas holidays.
>
> The term “overhead” is probably an exaggeration in this context especially after I disabled the MARKUP initialisation. We implemented earlier our own XML markup annotator tailored to better fit our needs with additional annotation types and properties, so the Ruta MARKUP is currently not used. It just happens that we don’t directly use RutaBasic in any of our rules in this particular case so I was curious to know whether we could avoid creating them in the first place since there seems to be quite a few. However, overall processing required by our Ruta scripts compared to other processing steps is now small and sub-optimising this further by making RutaBasic optional would currently be of very low priority to us. We would prioritise other features higher e.g. being able to assign annotations to variables as we discussed previously in another thread.

I am working on this right now and there is finally some first progress :-)

I fear that I won't catch all use cases (combinations with language
elements) with the first attempt. If you are interested (and wanna take
care I do not miss your use case), feel free to take a look at the new
unit tests:
https://svn.apache.org/repos/asf/uima/ruta/trunk/ruta-core/src/test/java/org/apache/uima/ruta/expression/annotation

It's still work in progress. Proposals for more unit tests are very welcome.

> We haven’t processed documents as large as those you mention since books have so far been divided into chapters and processing could therefore be parallelised accordingly. We also drop extreme outliers above a certain size if we encounter them and then we batch process them later in smaller chunks but this has so far not been necessary with our current data sets. Like you, our processing bottlenecks are now in different components.

Ah, that's nice to hear that ruta is not the bottleneck :-D

Best,

Peter


> Cheers
> Mario
>
>> On 30 Dec 2015, at 16:44 , Peter Klügl <pe...@averbis.com> wrote:
>>
>> Hi,
>>
>> sorry for the delayed reply.
>>
>> RutaEngine::initializeStream:
>>
>> The special treatment of MARKUPs that causes the increased time required for initialization is just a workaround because I was too lazy to write a working jflex rule. Well, I tried but failed. It shouldn't be hard to improve this code... I will create an issue for it. When I did the last performance optimization, uima did not check the indexes yet and my test set did not contain markups.
>>
>> Deactivate creation of RutaBasic:
>> Short answer is no. I was already thinking about making RutaBasic optional in future so that the user can configure if they are used. However, right now, they are required for rule inference and make the rule inference "fast" in the first place. RutaBasic is just an internal annotation like RutaAnnotation (for SCORE, MARKSCORE) and RutaFrame, and rules should not match on them at all.
>>
>> Some background information:
>>
>> RutaBasics are used for three things:
>> - store additional information in order to avoid index operations. Some useful conditions would require many index operations, e.g., PARTOF or ENDSWITH. RutaBasic is utilized as a cache what annotations start and end at which position, and which positions are covered by which types.
>> - provide a container to make this information available across analysis engines. Information shared by analysis engine is normally stored in the CAS, e.g. in annotations, (or in external resources). This is the role of RutaBasic. It is not really implemented right now as it should be but I will improve it soon. Then, there is no performance decrease when a pipeline is spammed with small ruta engines.
>> - a basic minimal disjunct partitioning of the document for the coverage based visibility concept.
>>
>> Making RutaBasic optional is possible. If there is a real need for it, e.g., in order to reduce the memory footprint or when processing large documents where parts are simply not interesting, then I will put it on my TODO list. I am also open for other/new ideas how to solve the challenges (and for incremental usage of internal caches).
>>
>> What is your experience with the processing overhead concerning RutaBasic? Is it the rule matching or rather the initialization? I myself have already had some performance problems with the initialization and memory consumption in large CAS (500+ page PDFs). However, other components, serialization and the CAS editor were the actual bottlenecks.
>>
>> Best,
>>
>> Peter
>>
>>
>> Am 22.12.2015 um 17:26 schrieb Mario Gazzo:
>>> I got around it by removing the default seeders by specifying an empty seeders list since we don’t need the MARKUP annotations anymore.
>>>
>>> I still don’t know why it created so much overhead but it sometimes seemed to rival the POS tagger in processing time.
>>>
>>> Anyway, this leads me to the next question. Can I disable the creation of Ruta basic annotations entirely to save processing overhead and only apply Ruta rules to other annotation types created by other AEs such as our own?
>>>
>>> Cheers
>>> Mario
>>>
>>>> On 21 Dec 2015, at 16:09 , Mario Juric <ma...@gmail.com> wrote:
>>>>
>>>> Hi Peter,
>>>>
>>>> I noticed that occasionally the initialisation in RutaEngine::initializeStream can take a very long time. I can’t really explain it, and it seems independent of document length, since I have seen this even with very small XML documents.
>>>>
>>>> The method seems to spend much time in the DefaultSeeder when creating MARKUP annotations during subiterator.moveToNext calls (line 89) and inside Subiterator it seems to be the while loop inside adjustForStrictForward (line 232), which is inside UIMA core classes. I haven’t done any deeper analysis yet, but I would first like to hear whether you have an idea what the main cause(s) could be.
>>>>
>>>> We use Ruta 2.3.1 with UIMA 2.8.1
>>>>
>>>>
>>>> Cheers
>>>> Mario