Posted to dev@uima.apache.org by "Jasper Huzen (JIRA)" <de...@uima.apache.org> on 2018/03/19 14:11:00 UTC

[jira] [Commented] (UIMA-5723) MARKTABLE fails to assign feature for single word entry in first CSV column

    [ https://issues.apache.org/jira/browse/UIMA-5723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16404876#comment-16404876 ] 

Jasper Huzen commented on UIMA-5723:
------------------------------------

The change/fix in UIMA-4556 causes some problems when using a CSV file whose entries contain whitespace.

When the parameter PARAM_DICT_REMOVE_WS is set to TRUE and whitespace (WS) is not visible in the token stream:
- all items in the dictionary are recognized
- all items are also recognized no matter where whitespace is added between the words. For example, IlikeRUTA, Ilike Ruta, and I like Ruta all result in the same match.

If whitespace is visible, entries that contain spaces are not recognized.

The cause of the problem is that the flag for ignoring whitespace is hard-coded to true by default:
{code:java}
private IBooleanExpression ignoreWS = new SimpleBooleanExpression(true);
{code}

This is not correct: if whitespace is significant for your dictionary, matching will not work as intended. The matcher should use the value set in the PARAM_DICT_REMOVE_WS parameter, or the value set via the setIgnoreWS method.
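
As a rough illustration of the intended behaviour, here is a minimal, self-contained sketch. It is not the actual Ruta code and not the attached patch: the classes IgnoreWsSketch and WordListMatcher are hypothetical stand-ins, and only the names ignoreWS and setIgnoreWS are taken from the description above. The point is simply that the flag comes from configuration instead of being fixed to true.

```java
import java.util.HashSet;
import java.util.Set;

public class IgnoreWsSketch {

    /** Hypothetical stand-in for a dictionary matcher; not the Ruta implementation. */
    public static final class WordListMatcher {
        private final Set<String> entries = new HashSet<>();
        private final Set<String> entriesWithoutWs = new HashSet<>();

        // Taken from configuration instead of being hard-coded to true.
        private boolean ignoreWS;

        /** dictRemoveWs plays the role of the PARAM_DICT_REMOVE_WS value (assumption). */
        public WordListMatcher(boolean dictRemoveWs) {
            this.ignoreWS = dictRemoveWs;
        }

        public void setIgnoreWS(boolean ignoreWS) {
            this.ignoreWS = ignoreWS;
        }

        public void add(String entry) {
            entries.add(entry);
            entriesWithoutWs.add(stripWs(entry));
        }

        public boolean matches(String candidate) {
            if (ignoreWS) {
                // Whitespace is irrelevant: compare the whitespace-free forms.
                return entriesWithoutWs.contains(stripWs(candidate));
            }
            // Whitespace is significant: only the exact entry matches.
            return entries.contains(candidate);
        }

        private static String stripWs(String s) {
            return s.replaceAll("\\s+", "");
        }
    }

    public static void main(String[] args) {
        WordListMatcher lenient = new WordListMatcher(true);
        lenient.add("I like Ruta");
        System.out.println(lenient.matches("IlikeRuta"));
        System.out.println(lenient.matches("Ilike Ruta"));

        WordListMatcher strict = new WordListMatcher(false);
        strict.add("I like Ruta");
        System.out.println(strict.matches("I like Ruta"));
        System.out.println(strict.matches("IlikeRuta"));
    }
}
```

With the flag set to true, all spacing variants of an entry match; with it set to false, only the entry as written in the dictionary matches.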

I attached a patch to fix this issue. [^UIMA-5723.patch]

> MARKTABLE fails to assign feature for single word entry in first CSV column
> ---------------------------------------------------------------------------
>
>                 Key: UIMA-5723
>                 URL: https://issues.apache.org/jira/browse/UIMA-5723
>             Project: UIMA
>          Issue Type: Bug
>          Components: Ruta
>    Affects Versions: 2.6.1ruta
>            Reporter: Andreas Thiel
>            Assignee: Peter Klügl
>            Priority: Major
>         Attachments: UIMA-5723.patch
>
>
> When using Ruta's MARKTABLE action with a CSV file {{nl_law_names.csv}} like this
> {code:xml}
> WAZ;WAZELF
> Wet arbeidsongeschiktheidsverzekering zelfstandigen;WAZELF
> {code}
> and corresponding Ruta script containing these lines
> {code:java}
> WORDTABLE LawNameTable = 'nl_law_names.csv';
> Document{->MARKTABLE(WetNaam, 1, LawNameTable, "WetIdentifier" = 2)};
> {code}
> it seems that the text {{WAZ}} is detected, but the {{WetIdentifier}} feature of the resulting annotation is not filled by the string following the semicolon. Instead, it remains empty.
> (Note: _WetNaam_ annotation is defined elsewhere via type system description)
> In contrast, the fully written name {{Wet arbeidsongeschiktheidsverzekering zelfstandigen}} is detected and processed as expected with feature WetIdentifier = WAZELF after annotating.
> Could it be that problems arise when only a single word (i.e. no spaces or uppercase letters following lowercase chars) is present in the first column in the CSV file? Or is it a matter of configuration?
> We experimented also with the optional arguments of MARKTABLE regarding uppercase/lowercase distinction, but to no avail.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)