Posted to dev@lucene.apache.org by "Andrew Duffy (JIRA)" <ji...@apache.org> on 2008/09/17 12:55:44 UTC

[jira] Created: (LUCENE-1389) SimpleSpanFragmenter can create very short fragments

SimpleSpanFragmenter can create very short fragments
----------------------------------------------------

                 Key: LUCENE-1389
                 URL: https://issues.apache.org/jira/browse/LUCENE-1389
             Project: Lucene - Java
          Issue Type: Bug
          Components: contrib/highlighter
    Affects Versions: 2.3.2
            Reporter: Andrew Duffy
            Priority: Minor


Line 74 of SimpleSpanFragmenter returns true when the current token is the start of a hit on a span or phrase, thus starting a new fragment. Two problems occur:

- The previous fragment may be very short. If it contains a hit, though, it is later merged with the new fragment, so the problem disappears in that case.
- If the token is close to a natural fragment boundary, the new fragment will end up very short, possibly as short as the span or phrase itself. This happens because a new fragment is created without incrementing currentNumFrags.

To fix, remove or comment out line 74. Fragments then average out to the configured fragment size. The exception is a span or phrase hit towards the end of a fragment: that fragment is made larger, and the following fragment shorter, to accommodate the hit.
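
The effect described above can be reproduced with a small toy model. This is only a sketch under stated assumptions, not the Lucene source: the class ToySpanFragmenter, its fields, the fixed token width, and the single hard-coded span are all hypothetical, and a token that triggers a break is assumed to begin the new fragment. The breakOnSpanStart flag stands in for the early return on line 74.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of a span-aware fragmenter (hypothetical; not the Lucene class). */
final class ToySpanFragmenter {
    private final int fragmentSize;
    private final int spanStart;            // token position where the one span hit starts
    private final int spanEnd;              // token position where it ends (inclusive)
    private final boolean breakOnSpanStart; // true models the early "return true"

    private int position = -1;
    private int waitForPos = -1;
    private int currentNumFrags = 1;

    ToySpanFragmenter(int fragmentSize, int spanStart, int spanEnd, boolean breakOnSpanStart) {
        this.fragmentSize = fragmentSize;
        this.spanStart = spanStart;
        this.spanEnd = spanEnd;
        this.breakOnSpanStart = breakOnSpanStart;
    }

    boolean isNewFragment(int tokenEndOffset) {
        position++;
        if (waitForPos == position) {
            waitForPos = -1;               // the span has ended, breaks are allowed again
        } else if (waitForPos != -1) {
            return false;                  // never break in the middle of a span
        }
        if (position == spanStart) {
            waitForPos = spanEnd + 1;
            if (breakOnSpanStart) {
                return true;               // new fragment, but currentNumFrags is NOT incremented
            }
        }
        boolean isNewFrag = tokenEndOffset >= fragmentSize * currentNumFrags;
        if (isNewFrag) {
            currentNumFrags++;
        }
        return isNewFrag;
    }

    /** Fragment lengths for tokenCount tokens, each tokenWidth characters wide. */
    static List<Integer> fragmentLengths(int tokenCount, int tokenWidth, int fragmentSize,
                                         int spanStart, int spanEnd, boolean breakOnSpanStart) {
        ToySpanFragmenter f =
            new ToySpanFragmenter(fragmentSize, spanStart, spanEnd, breakOnSpanStart);
        List<Integer> lengths = new ArrayList<>();
        int fragStart = 0;
        for (int i = 0; i < tokenCount; i++) {
            int startOffset = i * tokenWidth;
            if (f.isNewFragment(startOffset + tokenWidth)) {
                lengths.add(startOffset - fragStart); // close the previous fragment
                fragStart = startOffset;              // this token begins the new one
            }
        }
        lengths.add(tokenCount * tokenWidth - fragStart);
        return lengths;
    }
}
```

With 29 tokens of 10 characters each, a fragment size of 100, and a two-token phrase at positions 8 and 9, fragmentLengths(29, 10, 100, 8, 9, true) gives [80, 20, 90, 100]: the second fragment is exactly the phrase. With the flag set to false the same input gives [100, 90, 100].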

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Resolved: (LUCENE-1389) SimpleSpanFragmenter can create very short fragments

Posted by "Mark Miller (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-1389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mark Miller resolved LUCENE-1389.
---------------------------------

    Resolution: Fixed

Thanks Andrew.



[jira] Updated: (LUCENE-1389) SimpleSpanFragmenter can create very short fragments

Posted by "Andrew Duffy (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-1389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrew Duffy updated LUCENE-1389:
---------------------------------

    Attachment: positions.patch

I've attached another diff, again against the trunk version. It includes a slight optimisation: the loop over spans now exits early once a span is found at the current position.

The main change is to start(String), though. Previously, it set currentPosition to 0, meaning every position was off by one and spans were not matched. It now starts currentPosition at -1 so the first token position ends up 0 as it should.
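
The off-by-one can be illustrated with a minimal sketch. The class and method names below are hypothetical, and the model assumes only that the position counter is advanced by each token's position increment before it is compared with a span's start position.

```java
/** Hypothetical helper showing how the initial counter value shifts token positions. */
final class PositionCounterSketch {
    /** Positions seen by the fragmenter when the counter starts at initialPosition. */
    static int[] positions(int[] positionIncrements, int initialPosition) {
        int[] out = new int[positionIncrements.length];
        int position = initialPosition;
        for (int i = 0; i < positionIncrements.length; i++) {
            position += positionIncrements[i]; // advance first, then record
            out[i] = position;
        }
        return out;
    }

    /** Index of the token whose position equals spanStart, or -1 if none matches. */
    static int tokenAtSpanStart(int[] positions, int spanStart) {
        for (int i = 0; i < positions.length; i++) {
            if (positions[i] == spanStart) {
                return i; // exit early once a match is found
            }
        }
        return -1;
    }
}
```

For three tokens with an increment of 1 each, a counter starting at 0 yields positions 1, 2, 3, so a span recorded at position 0 is never matched. Starting at -1 yields 0, 1, 2, and the span at position 0 matches the first token.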



[jira] Updated: (LUCENE-1389) SimpleSpanFragmenter can create very short fragments

Posted by "Andrew Duffy (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-1389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrew Duffy updated LUCENE-1389:
---------------------------------

    Attachment: tailfragments.patch

Another problem with the simple fragmenters is that they can produce a very short fragment at the end of the stream. The attached diff from the trunk SimpleSpanFragmenter.java remembers how long the text being fragmented is and doesn't make a fragment break if the token is less than half a fragment away from the end of the text. SimpleFragmenter could easily be changed in the same way.
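
The tail rule can be sketched as follows. This is an illustration under assumptions rather than the attached patch: TailAwareFragmenterSketch, its fields, and the fixed token width are hypothetical, and "half a fragment" is taken to mean fragmentSize / 2 measured from the token's end offset.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical size-based fragmenter that can refuse to break near the end of the text. */
final class TailAwareFragmenterSketch {
    private final int fragmentSize;
    private final int textLength;            // remembered when fragmenting starts
    private final boolean suppressShortTail; // false models the original behaviour
    private int currentNumFrags = 1;

    TailAwareFragmenterSketch(int fragmentSize, int textLength, boolean suppressShortTail) {
        this.fragmentSize = fragmentSize;
        this.textLength = textLength;
        this.suppressShortTail = suppressShortTail;
    }

    boolean isNewFragment(int tokenEndOffset) {
        boolean isNewFrag = tokenEndOffset >= fragmentSize * currentNumFrags;
        if (suppressShortTail && textLength - tokenEndOffset < fragmentSize / 2) {
            isNewFrag = false; // less than half a fragment remains: keep the tail attached
        }
        if (isNewFrag) {
            currentNumFrags++;
        }
        return isNewFrag;
    }

    /** Fragment lengths for tokenCount tokens, each tokenWidth characters wide. */
    static List<Integer> fragmentLengths(int tokenCount, int tokenWidth, int fragmentSize,
                                         boolean suppressShortTail) {
        int textLength = tokenCount * tokenWidth;
        TailAwareFragmenterSketch f =
            new TailAwareFragmenterSketch(fragmentSize, textLength, suppressShortTail);
        List<Integer> lengths = new ArrayList<>();
        int fragStart = 0;
        for (int i = 0; i < tokenCount; i++) {
            int startOffset = i * tokenWidth;
            if (f.isNewFragment(startOffset + tokenWidth)) {
                lengths.add(startOffset - fragStart); // close the previous fragment
                fragStart = startOffset;              // this token begins the new one
            }
        }
        lengths.add(textLength - fragStart);
        return lengths;
    }
}
```

For 30 tokens of 10 characters and a fragment size of 100, the sketch gives [90, 100, 100, 10] without the tail check and [90, 100, 110] with it, so the final 10-character fragment is absorbed into the one before it.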



[jira] Updated: (LUCENE-1389) SimpleSpanFragmenter can create very short fragments

Posted by "Mark Miller (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-1389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mark Miller updated LUCENE-1389:
--------------------------------

    Fix Version/s: 2.9
         Assignee: Mark Miller



[jira] Updated: (LUCENE-1389) SimpleSpanFragmenter can create very short fragments

Posted by "Mark Miller (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-1389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mark Miller updated LUCENE-1389:
--------------------------------

    Attachment: Lucene-1389.patch

Thanks Andrew!
