Posted to dev@uima.apache.org by Gaurav Dudeja <ga...@ephesoft.com> on 2017/07/20 10:27:29 UTC

High Memory Usage in Uima ruta

This refers to the following question, which I raised on StackOverflow:
https://stackoverflow.com/questions/44351051/uima-ruta-out-of-memory-issue-in-spark-context

According to @Peter Kluegl, there is a lot of scope for improving the code,
so I would like to know how I can improve this script.

=========================================================
TYPESYSTEM EDMTypeSystem;

WORDLIST EnglishStopWordList = 'en/anchor/en_stopWords.txt';
WORDLIST FiltersList = 'en/anchor/AnchorFilters.txt';
DECLARE Filters, EnglishStopWords;
DECLARE Anchors, SpanStart,SpanClose;

DocumentAnnotation{-> ADDRETAINTYPE(MARKUP)};

DocumentAnnotation{-> MARKFAST(Filters, FiltersList)};

STRING MixCharacterRegex = "[0-9]+[a-zA-Z]+";

DocumentAnnotation{-> MARKFAST(EnglishStopWords, EnglishStopWordList,true)};
(SW | CW | CAP ) { -> MARK(Anchors, 1, 2)};
Anchors{CONTAINS(EnglishStopWords) -> UNMARK(Anchors)};

(SPECIAL{REGEXP("['\"-=()\\[\\]]")}| PM) (SW | CW | CAP ) (SPECIAL{REGEXP("['\"-=()\\[\\]]")}| PM) EnglishStopWords? { -> MARK(Anchors, 1, 4)};
(SPECIAL{REGEXP("['\"-=()\\[\\]]")}| PM)? (SW | CW | CAP ) (SPECIAL{REGEXP("['\"-=()\\[\\]]")}| PM) EnglishStopWords? { -> MARK(Anchors, 1, 4)};
(SPECIAL{REGEXP("['\"-=()\\[\\]]")}| PM) (SW | CW | CAP ) (SPECIAL{REGEXP("['\"-=()\\[\\]]")}| PM)? EnglishStopWords? { -> MARK(Anchors, 1, 4)};
(SW | CW | CAP ) (SPECIAL{REGEXP("['\"-=()\\[\\]]")}| PM) EnglishStopWords? { -> MARK(Anchors, 1, 3)};

Anchors{CONTAINS(MARKUP) -> UNMARK(Anchors)};
MixCharacterRegex -> Anchors;

"<Value>"  -> SpanStart;
"</Value>" -> SpanClose;

Anchors{-> CREATE(ExtractedData, "type" = "ANCHOR", "value" = Anchors)};

SpanStart Filters? SPACE? ExtractedData SPACE? Filters? SpanClose{-> GATHER(Data, 2, 6, "ExtractedData" = 4)};
=========================================================

Re: High Memory Usage in Uima ruta

Posted by Peter Klügl <pe...@averbis.com>.

Hi,


If you want me to improve the rules, you have to provide some
representative text.

If I make up some text and optimize the rules, I'll report a speedup of X.

Then you test the optimized rules and, in case their results are even
correct (I have no realistic text to test them on), you measure a speedup of Y.

Then we start again, with me asking for some representative text.


Best,


Peter


On 20.07.2017 at 14:52, Peter Klügl wrote:
> Hi,
>
>
> Can you provide a dummy/exemplary document for the optimization? It should be
> as similar to your usual input as possible.
>
> The size of the document, the coverage and amount of annotations are
> some important key figures for the optimization.
>
>
> Best,
>
>
> Peter

Re: High Memory Usage in Uima ruta

Posted by Peter Klügl <pe...@averbis.com>.

Hi,


Can you provide a dummy/exemplary document for the optimization? It should be
as similar to your usual input as possible.

The size of the document, the coverage and amount of annotations are
some important key figures for the optimization.


Best,


Peter
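
As a rough, tool-independent illustration of the key figures mentioned above (document size, number of annotations, coverage), the following Python sketch computes them from plain character offsets. It is only a sketch under assumptions: the function name key_figures and the (begin, end) span format are hypothetical and are not part of UIMA or Ruta, and the sample text is made up.

```python
def key_figures(text, spans):
    """Return document length, annotation count, and covered fraction.

    text  -- the document text
    spans -- (begin, end) character offsets of annotations, end exclusive
             (hypothetical format, chosen only for this sketch)
    """
    spans = list(spans)
    covered = 0
    last_end = 0
    # Walk the spans in order and count each covered character only once,
    # so overlapping or nested annotations do not inflate the coverage.
    for begin, end in sorted(spans):
        begin = max(begin, last_end)
        if end > begin:
            covered += end - begin
            last_end = end
    length = len(text)
    return {
        "length": length,
        "annotations": len(spans),
        "coverage": covered / length if length else 0.0,
    }


if __name__ == "__main__":
    # Made-up sample text and offsets, not taken from the thread.
    sample = "<Value>abc 12ab</Value>"
    print(key_figures(sample, [(7, 10), (11, 15), (7, 15)]))
```

Figures of this kind, taken from a realistic document, are what makes a reported speedup comparable between the person optimizing the rules and the person running them.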

