
Posted to dev@uima.apache.org by "Rinat Gareyev (JIRA)" <de...@uima.apache.org> on 2012/08/14 18:52:38 UTC

[jira] [Created] (UIMA-2455) Make ordering of getNextAnnotations result configurable

Rinat Gareyev created UIMA-2455:
-----------------------------------

Summary: Make ordering of getNextAnnotations result configurable
Key: UIMA-2455
URL: https://issues.apache.org/jira/browse/UIMA-2455
Project: UIMA
Issue Type: New Feature
Components: TextMarker
Reporter: Rinat Gareyev

Example rule:
A B C{-PARTOF(D)->MARK(D,3)};

Example text:
aText bText cText cMoreText

where the following correspondence between annotations and tokens holds:
A = aText
B = bText
C = cText
C = cText cMoreText

Rule results in the following:
D = cText

However I expect that:
D = cText cMoreText

The reason for the actual behaviour is the org.apache.uima.textmarker.rule.AnnotationComparator#compare implementation. It returns a shorter annotation before a longer one. That is why the sequence 'aText bText cText' is matched and the sequence 'aText bText cText cMoreText' is not: it is considered later and then fails the NOT PARTOF condition.
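
To make the ordering difference concrete, here is a minimal sketch of the two orderings for annotations that share a start offset. It does not reproduce the real AnnotationComparator source: the Span class, the OrderingSketch class and the two comparator constants are hypothetical names introduced only for this illustration, and the offsets are simply the character positions in the example text.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical stand-in for an annotation: a type name plus character offsets.
final class Span {
    final String type;
    final int begin;
    final int end;

    Span(String type, int begin, int end) {
        this.type = type;
        this.begin = begin;
        this.end = end;
    }

    @Override
    public String toString() {
        return type + "[" + begin + "," + end + ")";
    }
}

final class OrderingSketch {
    // Same begin: the shorter annotation comes first (the behaviour reported here).
    static final Comparator<Span> SHORTER_FIRST =
            Comparator.comparingInt((Span s) -> s.begin).thenComparingInt(s -> s.end);

    // Same begin: the longer annotation comes first (the behaviour requested here).
    static final Comparator<Span> LONGER_FIRST =
            Comparator.comparingInt((Span s) -> s.begin)
                    .thenComparing(Comparator.comparingInt((Span s) -> s.end).reversed());

    // Returns a sorted copy so the input list is left untouched.
    static List<Span> sorted(List<Span> spans, Comparator<Span> order) {
        List<Span> copy = new ArrayList<>(spans);
        copy.sort(order);
        return copy;
    }

    public static void main(String[] args) {
        // In "aText bText cText cMoreText" both C annotations begin at offset 12.
        List<Span> cs = Arrays.asList(new Span("C", 12, 17), new Span("C", 12, 27));
        System.out.println(sorted(cs, SHORTER_FIRST)); // [C[12,17), C[12,27)]
        System.out.println(sorted(cs, LONGER_FIRST));  // [C[12,27), C[12,17)]
    }
}
```

With the shorter-first ordering the rule sees 'cText' before 'cText cMoreText', which is exactly the situation described above.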

I discovered this after migrating to the latest TextMarker sources (from the ASF repo). Before that we used the version from Sourceforge.net. In the old (Sourceforge) version this problem did not arise because TextMarkerBasic could keep only one annotation per Type as 'begin anchor'. Returning to the example, this means that the TextMarkerBasic for 'cText' held only one C annotation as begin anchor.

In the current version (rev. 1371274) TextMarkerBasic keeps a set of begin and end anchors per Type. This is actually a good improvement.
But I suggest making the ordering of anchored annotations returned by the TextMarkerRuleElement#getNextAnnotations(boolean, AnnotationFS, TextMarkerStream) method more controllable,
e.g., by adding a parameter for TextMarkerEngine or for the script that defines the AnnotationComparator#compare implementation.
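
One way such a parameter could look is sketched below. Everything in it is an assumption made for illustration: the AnchorOrdering enum, its two values, the fromParameter helper and the Offsets interface are hypothetical and are not part of TextMarkerEngine or of any existing TextMarker API.

```java
import java.util.Comparator;
import java.util.Locale;

// Hypothetical minimal view of an annotation: only the offsets matter for ordering.
interface Offsets {
    int getBegin();
    int getEnd();
}

// Hypothetical values for an ordering parameter; the name and values are assumptions.
enum AnchorOrdering {
    SHORTER_FIRST(Comparator.comparingInt(Offsets::getBegin)
            .thenComparingInt(Offsets::getEnd)),
    LONGER_FIRST(Comparator.comparingInt(Offsets::getBegin)
            .thenComparing(Comparator.comparingInt(Offsets::getEnd).reversed()));

    private final Comparator<Offsets> comparator;

    AnchorOrdering(Comparator<Offsets> comparator) {
        this.comparator = comparator;
    }

    Comparator<Offsets> comparator() {
        return comparator;
    }

    // Maps a configuration string to an ordering; a missing value selects the fallback.
    // An unknown value makes valueOf throw IllegalArgumentException.
    static AnchorOrdering fromParameter(String value, AnchorOrdering fallback) {
        if (value == null || value.trim().isEmpty()) {
            return fallback;
        }
        return valueOf(value.trim().toUpperCase(Locale.ROOT));
    }
}

final class AnchorOrderingDemo {
    // Small helper that builds an Offsets instance for the demo.
    static Offsets span(final int begin, final int end) {
        return new Offsets() {
            public int getBegin() { return begin; }
            public int getEnd() { return end; }
        };
    }

    public static void main(String[] args) {
        AnchorOrdering ordering =
                AnchorOrdering.fromParameter("longer_first", AnchorOrdering.SHORTER_FIRST);
        // Prints true: the longer span [12,27) sorts before the shorter span [12,17).
        System.out.println(ordering.comparator().compare(span(12, 27), span(12, 17)) < 0);
    }
}
```

Keeping the comparator behind an enum means the engine or script only has to carry a single string value, and the default can stay whatever the project decides.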

Also, returning longer annotations before shorter ones seems to be more consistent with the UIMA default indexing. See http://uima.apache.org/d/uimaj-2.4.0/references.html#ugr.ref.cas.index.built_in_indexes

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Closed] (UIMA-2455) Make ordering of getNextAnnotations result configurable

Posted by "Rinat Gareyev (JIRA)" <de...@uima.apache.org>.

     [ https://issues.apache.org/jira/browse/UIMA-2455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rinat Gareyev closed UIMA-2455.
-------------------------------

    

[jira] [Updated] (UIMA-2455) Make ordering of getNextAnnotations result configurable

Posted by "Rinat Gareyev (JIRA)" <de...@uima.apache.org>.

     [ https://issues.apache.org/jira/browse/UIMA-2455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rinat Gareyev updated UIMA-2455:
--------------------------------

    Description: 
Example rule:
A B C{NOT(PARTOF(D))->MARK(D,3)};

Example text:
aText bText cText cMoreText

where the following correspondence between annotations and tokens holds:
A = aText
B = bText
C = cText
C = cText cMoreText

Rule results in the following:
D = cText

However I expect that:
D = cText cMoreText

The reason for the actual behaviour is the org.apache.uima.textmarker.rule.AnnotationComparator#compare implementation. It returns a shorter annotation before a longer one. That is why the sequence 'aText bText cText' is matched and the sequence 'aText bText cText cMoreText' is not: it is considered later and then fails the NOT PARTOF condition.

I discovered this after migrating to the latest TextMarker sources (from the ASF repo). Before that we used the version from Sourceforge.net. In the old (Sourceforge) version this problem did not arise because TextMarkerBasic could keep only one annotation per Type as 'begin anchor'. Returning to the example, this means that the TextMarkerBasic for 'cText' held only one C annotation as begin anchor.

In the current version (rev. 1371274) TextMarkerBasic keeps a set of begin and end anchors per Type. This is actually a good improvement.
But I suggest making the ordering of anchored annotations returned by the TextMarkerRuleElement#getNextAnnotations(boolean, AnnotationFS, TextMarkerStream) method more controllable,
e.g., by adding a parameter for TextMarkerEngine or for the script that defines the AnnotationComparator#compare implementation.

Also, returning longer annotations before shorter ones seems to be more consistent with the UIMA default indexing. See http://uima.apache.org/d/uimaj-2.4.0/references.html#ugr.ref.cas.index.built_in_indexes


    

[jira] [Commented] (UIMA-2455) Make ordering of getNextAnnotations result configurable

Posted by "Peter Klügl (JIRA)" <de...@uima.apache.org>.

    [ https://issues.apache.org/jira/browse/UIMA-2455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13442437#comment-13442437 ] 

Peter Klügl commented on UIMA-2455:
-----------------------------------

I agree that the longer annotation should be returned. However, I would go a step further and expect both annotations to be returned, in the correct order. I thought a lot about the problem of annotations with the same start offset last summer, but then forgot about or ignored it, probably because the older TextMarker implementation taught me to avoid that situation. Most of the code for this functionality is already in place, and the rest should not be a problem.

The changes I have in mind should result in the following functionality:

Rule: 
A B C{NOT(PARTOF(D))->MARK(D,3)};
Result:
D = cText cMoreText

Rule: 
A B C{->MARK(D,3)};
Result:
D = cText cMoreText
D = cText

... because there are two valid alternatives (two Cs) and no conditions.

Thanks for pointing this out.
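
The two expected results above can be reproduced with a small simulation. This is a simplified model and not TextMarker's actual rule matching. The RuleSketch class and its methods are hypothetical, and PARTOF is assumed to mean "the candidate's begin offset lies inside an existing D annotation", an assumption chosen because it matches the behaviour described in this issue.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class RuleSketch {
    static final String TEXT = "aText bText cText cMoreText";

    // Simplified PARTOF (assumption): the candidate begins inside an existing D span.
    static boolean partOf(int[] candidate, List<int[]> ds) {
        for (int[] d : ds) {
            if (d[0] <= candidate[0] && candidate[0] < d[1]) {
                return true;
            }
        }
        return false;
    }

    // Tries each candidate for C in the given order, creates a D over every accepted
    // candidate, and returns the covered text of each D that was created.
    static List<String> mark(List<int[]> candidatesInOrder, boolean requireNotPartOf) {
        List<int[]> ds = new ArrayList<>();
        List<String> covered = new ArrayList<>();
        for (int[] c : candidatesInOrder) {
            if (requireNotPartOf && partOf(c, ds)) {
                continue; // NOT(PARTOF(D)) fails, so this alternative is skipped
            }
            ds.add(c);
            covered.add(TEXT.substring(c[0], c[1]));
        }
        return covered;
    }

    public static void main(String[] args) {
        int[] shortC = {12, 17}; // "cText"
        int[] longC = {12, 27};  // "cText cMoreText"
        List<int[]> longerFirst = Arrays.asList(longC, shortC);
        List<int[]> shorterFirst = Arrays.asList(shortC, longC);
        System.out.println(mark(longerFirst, true));   // [cText cMoreText]
        System.out.println(mark(longerFirst, false));  // [cText cMoreText, cText]
        System.out.println(mark(shorterFirst, true));  // [cText]
    }
}
```

The last line corresponds to the originally reported behaviour, where the shorter C is tried first and blocks the longer one.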


                

[jira] [Commented] (UIMA-2455) Make ordering of getNextAnnotations result configurable

Posted by "Rinat Gareyev (JIRA)" <de...@uima.apache.org>.

    [ https://issues.apache.org/jira/browse/UIMA-2455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13442639#comment-13442639 ] 

Rinat Gareyev commented on UIMA-2455:
-------------------------------------

Yep, it seems to be ok now. Should I resolve this issue?
                

[jira] [Resolved] (UIMA-2455) Make ordering of getNextAnnotations result configurable

Posted by "Rinat Gareyev (JIRA)" <de...@uima.apache.org>.

     [ https://issues.apache.org/jira/browse/UIMA-2455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rinat Gareyev resolved UIMA-2455.
---------------------------------

    Resolution: Fixed

Fixed. For a related problem, see https://issues.apache.org/jira/browse/UIMA-2462
                

[jira] [Commented] (UIMA-2455) Make ordering of getNextAnnotations result configurable

Posted by "Peter Klügl (JIRA)" <de...@uima.apache.org>.

    [ https://issues.apache.org/jira/browse/UIMA-2455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13443033#comment-13443033 ] 

Peter Klügl commented on UIMA-2455:
-----------------------------------

Yes, and you can also close this issue. I will create a new one for the problem with the identical annotations.
                

[jira] [Commented] (UIMA-2455) Make ordering of getNextAnnotations result configurable

Posted by "Peter Klügl (JIRA)" <de...@uima.apache.org>.

    [ https://issues.apache.org/jira/browse/UIMA-2455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13442491#comment-13442491 ] 

Peter Klügl commented on UIMA-2455:
-----------------------------------

I have to correct my last comment. The problem I mentioned (and had somewhat ignored) concerns the situation where there are two annotations of the same type and the same offset.

Your use case should work just fine with the fixed AnnotationComparator that I have committed.
                

[jira] [Assigned] (UIMA-2455) Make ordering of getNextAnnotations result configurable

Posted by "Peter Klügl (JIRA)" <de...@uima.apache.org>.

     [ https://issues.apache.org/jira/browse/UIMA-2455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Peter Klügl reassigned UIMA-2455:
---------------------------------

    Assignee: Peter Klügl
    