Posted to dev@uima.apache.org by Renaud Richardet <re...@gmail.com> on 2015/08/23 21:40:38 UTC

understanding Ruta's MARKFAST options

Hello,

I am facing some issues with Ruta's MARKFAST, and started to dig into the
JUnit tests to make sure I understood it correctly. Below is some
documentation I came up with. I will commit it, but for now I don't quite
understand the second result (T2). Could anyone explain in detail what is
going on? Thanks.

-- Renaud


// MARKFAST(T1,...) ignores WS in resources (MarkFastTestList.txt) so
// they 'collapse' to 100 and 200;
// spaces are not retained in Ruta script, so all 3 numerical inputs in
// MarkFastText.txt also collapse to 100, 100, 200 and thus match.
RutaTestUtils.assertAnnotationsEquals(cas, 1, 3, "1 0 0", "100",
    "2 0 0");

// MARKFAST(T2,...) does NOT ignore WS in resources, so no resource
// matched (since SPACE are not retained, resources collapse to 100 and
// 200, while input is 1 0 0 and 2 0 0). WHY?!?!?!
RutaTestUtils.assertAnnotationsEquals(cas, 2, 0);

// from now on, SPACE are retained

// MARKFAST(T3,...) space retained in input, but we ignore WS in resources
// Thus 1 0 0 and 100 are identical and get matched 1 time;
// The resource 2 0 0 gets collapsed to 200 and does not match
// the inputs ('1 0 0' or '2 0 0').
RutaTestUtils.assertAnnotationsEquals(cas, 3, 1, "100");

// T4 space retained in input and WS "retained" in resources,
// so 2 'exact' matches (not 100 match)
RutaTestUtils.assertAnnotationsEquals(cas, 4, 2, "1 0 0", "2 0 0");

Re: understanding Ruta's MARKFAST options

Posted by Peter Klügl <pe...@averbis.com>.
Hi,

a short summary (maybe for others to follow the problem):

The input text is something like "1 0 0 \n aaaaa \n 100 \n bbbbb \n ....".
The wordlist list1 is something like "1 0 0 \n 2 0 0"
The rule is: MARKFAST(T2, list1, false, 0, false);
... skipWhiteSpaces (last argument) is deactivated and the rule is
applied using the default filtering setting. In the underlying analysis
engine, the parameter dictRemoveWS is set to false.
The result: no annotations of the type T2 are created.

Ok, here's what happens:

- dictRemoveWS is false, therefore the created trie in the tree word
list contains nodes for the whitespaces. The trie has two root nodes (1
and 2) each followed by a node with a space
- the filtering setting is the default one, therefore the spaces are
invisible and the dictionary lookup in the MARKFAST action sees only
strings like "100", but none with spaces like "1 0 0".
- skipWhiteSpaces is deactivated and therefore, if a node in the trie
lookup contains a space (as it happens to be), then it is not skipped
but we try to match it. This fails for the given input because there are
no spaces.
- as a result, no matches are found for this word list and the
configuration/filtering.
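
To make the steps above concrete, here is a minimal Java sketch of the lookup as a toy model. It is not Ruta's implementation: the class MarkFastSketch and its methods are hypothetical, the input lines are assumed from the test description, and the dictionary entries keep their spaces (as with dictRemoveWS set to false). The flag spacesVisible stands in for the filtering setting (false for the default, true once SPACE is retained), and skipWhiteSpaces stands in for the last MARKFAST argument.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class MarkFastSketch {

    // Hypothetical helper: walks the dictionary entry character by character,
    // the way a trie lookup would walk its nodes.
    public static boolean entryMatches(String entry, String line,
                                       boolean spacesVisible, boolean skipWhiteSpaces) {
        // With the default filtering, the lookup never sees the spaces of the input.
        String visible = spacesVisible ? line : line.replaceAll("\\s+", "");
        int pos = 0;
        for (char node : entry.toCharArray()) {
            if (skipWhiteSpaces && Character.isWhitespace(node)) {
                continue; // whitespace node is skipped, nothing is consumed from the input
            }
            if (pos >= visible.length() || visible.charAt(pos) != node) {
                return false; // e.g. a space node compared against a digit
            }
            pos++;
        }
        return pos == visible.length();
    }

    // Returns the original text of every input line matched by some entry.
    public static List<String> mark(List<String> lines, List<String> entries,
                                    boolean spacesVisible, boolean skipWhiteSpaces) {
        List<String> covered = new ArrayList<>();
        for (String line : lines) {
            for (String entry : entries) {
                if (entryMatches(entry, line, spacesVisible, skipWhiteSpaces)) {
                    covered.add(line);
                    break;
                }
            }
        }
        return covered;
    }

    public static void main(String[] args) {
        // Assumed input and word list, modelled on the test resources.
        List<String> lines = Arrays.asList("1 0 0", "aaaaa", "100", "bbbbb", "2 0 0");
        List<String> entries = Arrays.asList("1 0 0", "2 0 0");
        System.out.println("T1 " + mark(lines, entries, false, true));
        System.out.println("T2 " + mark(lines, entries, false, false));
        System.out.println("T3 " + mark(lines, entries, true, true));
        System.out.println("T4 " + mark(lines, entries, true, false));
    }
}
```

Under these assumptions the model gives the same counts as the four assertions in the test: three matches for T1, none for T2, only "100" for T3, and the two exact matches for T4. In the T2 case the lookup fails at the second node, where the space node of the entry is compared against the "0" of the visible input "100".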


Best,

Peter

