Posted to dev@lucene.apache.org by "Andrew Duffy (JIRA)" <ji...@apache.org> on 2008/09/17 12:55:44 UTC
[jira] Created: (LUCENE-1389) SimpleSpanFragmenter can create very short fragments
SimpleSpanFragmenter can create very short fragments
----------------------------------------------------
Key: LUCENE-1389
URL: https://issues.apache.org/jira/browse/LUCENE-1389
Project: Lucene - Java
Issue Type: Bug
Components: contrib/highlighter
Affects Versions: 2.3.2
Reporter: Andrew Duffy
Priority: Minor
Line 74 of SimpleSpanFragmenter returns true when the current token is the start of a hit on a span or phrase, thus starting a new fragment. This causes two problems:
- The previous fragment may be very short. If it contains a hit, though, it is later combined with the new fragment, so the problem disappears.
- If the token is close to a natural fragment boundary, the new fragment ends up very short, possibly as short as the span or phrase itself. This happens because the new fragment is created without incrementing currentNumFrags.
To fix this, remove or comment out line 74. Fragments then average out to the fragment size, except when a span or phrase hit falls towards the end of a fragment: that fragment is made larger, and the following fragment shorter, to accommodate the hit.
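The effect described above can be reproduced with a simplified model. This is only a sketch, not the SimpleSpanFragmenter source: the class SpanBreakSketch, the method fragmentStarts, the single-span restriction and the breakAtSpanStart flag are all assumptions made for illustration. Only currentNumFrags and the offset comparison follow the description.

```java
import java.util.ArrayList;
import java.util.List;

/** Simplified model of an offset-based fragmenter (hypothetical, not Lucene code). */
class SpanBreakSketch {

    /**
     * Returns the token positions at which a new fragment starts.
     * Token i sits at position i; one span covers positions spanStart..spanEnd.
     */
    static List<Integer> fragmentStarts(int[] tokenEndOffsets, int spanStart, int spanEnd,
                                        int fragmentSize, boolean breakAtSpanStart) {
        List<Integer> starts = new ArrayList<>();
        int currentNumFrags = 1;
        int waitForPos = -1; // first position after the span, or -1 when outside a span
        for (int position = 0; position < tokenEndOffsets.length; position++) {
            if (waitForPos == position) {
                waitForPos = -1;          // the span has ended
            } else if (waitForPos != -1) {
                continue;                 // never break inside a span
            }
            if (position == spanStart) {
                waitForPos = spanEnd + 1;
                if (breakAtSpanStart) {
                    // Reported behaviour: break here without touching currentNumFrags.
                    starts.add(position);
                    continue;
                }
            }
            // Natural boundary: the token ends at or past the next multiple of fragmentSize.
            if (tokenEndOffsets[position] >= fragmentSize * currentNumFrags) {
                currentNumFrags++;
                starts.add(position);
            }
        }
        return starts;
    }
}
```

With ten tokens ending at offsets 10, 20, ..., 100, a fragment size of 50 and a span over positions 3 to 4, the model breaks at positions 3, 5 and 9 when breakAtSpanStart is true, so the fragment that starts at position 3 contains nothing but the span. With the flag false it breaks at positions 5 and 9, and the first fragment simply grows to include the span.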
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org
[jira] Resolved: (LUCENE-1389) SimpleSpanFragmenter can create very short fragments
Posted by "Mark Miller (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/LUCENE-1389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Mark Miller resolved LUCENE-1389.
---------------------------------
Resolution: Fixed
Thanks Andrew.
[jira] Updated: (LUCENE-1389) SimpleSpanFragmenter can create very short fragments
Posted by "Andrew Duffy (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/LUCENE-1389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Andrew Duffy updated LUCENE-1389:
---------------------------------
Attachment: positions.patch
I've attached another diff, again against the trunk version. There is a slight optimisation: the span loop now exits early when a span is found at the current position.
The main change is to start(String), though. Previously it set currentPosition to 0, so every position was off by one and spans were not matched. It now starts currentPosition at -1, so the first token's position ends up as 0, as it should.
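A minimal sketch of the off-by-one, assuming positions are accumulated by adding each token's position increment to a running counter. The class PositionSketch and both method names are hypothetical, and spans are modelled as plain {start, end} pairs rather than the highlighter's own types.

```java
/** Hypothetical model of position tracking; not the patched Lucene code. */
class PositionSketch {

    /** Assigns a position to each token by summing position increments. */
    static int[] positions(int[] positionIncrements, int initialPosition) {
        int[] result = new int[positionIncrements.length];
        int currentPosition = initialPosition;
        for (int i = 0; i < positionIncrements.length; i++) {
            currentPosition += positionIncrements[i];
            result[i] = currentPosition;
        }
        return result;
    }

    /** Returns true if any span (as {start, end}) begins at the given position. */
    static boolean startsSpan(int position, int[][] spans) {
        for (int[] span : spans) {
            if (span[0] == position) {
                return true; // leave the loop as soon as a match is found
            }
        }
        return false;
    }
}
```

For three tokens with an increment of 1 each, an initial value of 0 yields positions 1, 2, 3, while an initial value of -1 yields 0, 1, 2. A span recorded as starting at position 0 is therefore only matched in the second case.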
[jira] Updated: (LUCENE-1389) SimpleSpanFragmenter can create very short fragments
Posted by "Andrew Duffy (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/LUCENE-1389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Andrew Duffy updated LUCENE-1389:
---------------------------------
Attachment: tailfragments.patch
Another problem with the simple fragmenters is that they can produce a very short fragment at the end of the stream. The attached diff against the trunk SimpleSpanFragmenter.java records the length of the text being fragmented and does not make a fragment break if the token is less than half a fragment away from the end of the text. SimpleFragmenter could easily be changed in the same way.
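The tail check could look roughly like the following. This is a sketch under assumptions, not the contents of tailfragments.patch: the class TailGuardSketch and the stateless method signature are hypothetical, and measuring the distance from the token's end offset is one possible reading of "away from the end of the text".

```java
/** Hypothetical illustration of suppressing a short trailing fragment. */
class TailGuardSketch {

    /**
     * Decides whether a new fragment should start at this token.
     * A break at a natural boundary is skipped when less than half a
     * fragment of text remains after the token.
     */
    static boolean isNewFragment(int tokenEndOffset, int textLength,
                                 int fragmentSize, int currentNumFrags) {
        boolean pastBoundary = tokenEndOffset >= fragmentSize * currentNumFrags;
        boolean tailTooShort = textLength - tokenEndOffset < fragmentSize / 2;
        return pastBoundary && !tailTooShort;
    }
}
```

For a 120-character text and a fragment size of 50, a token ending at offset 50 still starts a new fragment, but a token ending at offset 100 does not, because only 20 characters remain and they are kept in the current fragment.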
[jira] Updated: (LUCENE-1389) SimpleSpanFragmenter can create very short fragments
Posted by "Mark Miller (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/LUCENE-1389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Mark Miller updated LUCENE-1389:
--------------------------------
Fix Version/s: 2.9
Assignee: Mark Miller
[jira] Updated: (LUCENE-1389) SimpleSpanFragmenter can create very short fragments
Posted by "Mark Miller (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/LUCENE-1389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Mark Miller updated LUCENE-1389:
--------------------------------
Attachment: Lucene-1389.patch
Thanks Andrew!