Posted to dev@lucene.apache.org by "Michael McCandless (Created) (JIRA)" <ji...@apache.org> on 2012/03/31 02:07:33 UTC

[jira] [Created] (LUCENE-3940) When Japanese (Kuromoji) tokenizer removes a punctuation token it should leave a hole

When Japanese (Kuromoji) tokenizer removes a punctuation token it should leave a hole
-------------------------------------------------------------------------------------

                 Key: LUCENE-3940
                 URL: https://issues.apache.org/jira/browse/LUCENE-3940
             Project: Lucene - Java
          Issue Type: Bug
            Reporter: Michael McCandless
            Assignee: Michael McCandless
            Priority: Minor
             Fix For: 4.0


I modified BaseTokenStreamTestCase to assert that the start/end
offsets match for graph (posLen > 1) tokens, and this caught a bug in
Kuromoji when the decompounding of a compound token has a punctuation
token that's dropped.

In this case we should leave hole(s) so that the graph is intact, ie,
the graph should look the same as if the punctuation tokens were not
initially removed, but then a StopFilter had removed them.

This also affects tokens that have no compound over them, ie we fail
to leave a hole today when we remove the punctuation tokens.

I'm not sure this is serious enough to warrant fixing in 3.6 at the
last minute...
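
To make "leave a hole" concrete: it just means the next surviving token's
position increment grows by the number of dropped positions, which is exactly
what StopFilter does when position increments are enabled. Below is an
illustration-only sketch written as a standalone TokenFilter (this is not the
proposed Kuromoji change, and the punctuation test is a simplified stand-in;
it also skips the end-of-stream bookkeeping a production filter would need):

{noformat}
import java.io.IOException;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;

/** Drops punctuation-only tokens but leaves a hole, the way StopFilter would. */
public final class DropPunctuationLeavingHoleFilter extends TokenFilter {
  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
  private final PositionIncrementAttribute posIncAtt = addAttribute(PositionIncrementAttribute.class);

  public DropPunctuationLeavingHoleFilter(TokenStream in) {
    super(in);
  }

  @Override
  public boolean incrementToken() throws IOException {
    int skipped = 0;
    while (input.incrementToken()) {
      if (isPunctuationOnly(termAtt)) {
        skipped += posIncAtt.getPositionIncrement();   // remember the hole
      } else {
        posIncAtt.setPositionIncrement(posIncAtt.getPositionIncrement() + skipped);
        return true;
      }
    }
    return false;
  }

  // Simplified: treat a term as punctuation if every char falls in a punctuation category.
  private static boolean isPunctuationOnly(CharSequence term) {
    for (int i = 0; i < term.length(); i++) {
      int t = Character.getType(term.charAt(i));
      if (t != Character.OTHER_PUNCTUATION && t != Character.START_PUNCTUATION
          && t != Character.END_PUNCTUATION && t != Character.DASH_PUNCTUATION) {
        return false;
      }
    }
    return term.length() > 0;
  }
}
{noformat}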


[jira] [Commented] (LUCENE-3940) When Japanese (Kuromoji) tokenizer removes a punctuation token it should leave a hole

Posted by "Michael McCandless (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-3940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13243785#comment-13243785 ] 

Michael McCandless commented on LUCENE-3940:
--------------------------------------------

OK here's one possible fix...

Right now, when we are glomming up an UNKNOWN token, we glom only as long as the character class of each character is the same as that of the first character.

What if we also require that isPunct-ness is the same?  That way we would never create an UNKNOWN token mixing punct and non-punct...

I implemented that and the tests seem to pass w/ offset checking fully turned on again...
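
Roughly, the grouping rule would look like the sketch below (illustration only,
not the actual Kuromoji source; Character.getType() and the isPunctuation()
helper are simplified stand-ins for CharacterDefinition's character classes and
the tokenizer's real punctuation check):

{noformat}
public class UnknownRun {

  /** Length of the UNKNOWN run starting at offset: stop when either the
   *  character class or the punct-ness differs from the first character. */
  public static int unknownTokenLength(char[] text, int offset, int end) {
    final int firstClass = Character.getType(text[offset]);
    final boolean firstIsPunct = isPunctuation(text[offset]);
    int length = 1;
    for (int pos = offset + 1; pos < end; pos++) {
      final char ch = text[pos];
      if (Character.getType(ch) != firstClass || isPunctuation(ch) != firstIsPunct) {
        break;   // never mix punct and non-punct in one UNKNOWN token
      }
      length++;
    }
    return length;
  }

  // Simplified punctuation test (the real check covers more categories).
  static boolean isPunctuation(char ch) {
    switch (Character.getType(ch)) {
      case Character.OTHER_PUNCTUATION:
      case Character.START_PUNCTUATION:
      case Character.END_PUNCTUATION:
      case Character.DASH_PUNCTUATION:
      case Character.CONNECTOR_PUNCTUATION:
      case Character.INITIAL_QUOTE_PUNCTUATION:
      case Character.FINAL_QUOTE_PUNCTUATION:
        return true;
      default:
        return false;
    }
  }
}
{noformat}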


[jira] [Updated] (LUCENE-3940) When Japanese (Kuromoji) tokenizer removes a punctuation token it should leave a hole

Posted by "Michael McCandless (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-3940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated LUCENE-3940:
---------------------------------------

    Attachment: LUCENE-3940.patch

New test-only patch, breaking out the non-controversial (I think!)
part of the patch.

With this new patch, Kuromoji still silently discards punctuation
(just like StandardAnalyzer), but at least we get better test coverage
in BTSTC to verify graph tokens are not messing up their offsets.

I had to turn it off when testing Kuromoji w/ punctuation
removal... but it's still tested w/o punctuation removal, so I think
it'd likely catch any bugs in how Kuromoji sets offsets of the
compound tokens... at least it's better than not checking at all
(ie, today).

The only change outside the tests is that I uncommented an assert in Kuromoji;
I think it's a valid assert.
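
For reference, the invariant being asserted is roughly the following
(illustration only, not the actual BTSTC code): every position in the token
graph gets exactly one start offset and one end offset, so a graph token
(posLen > 1) must agree with the single-position tokens it spans.

{noformat}
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionLengthAttribute;

public class GraphOffsetCheck {
  /** Consumes the stream and fails if two tokens disagree about a position's offset. */
  public static void assertConsistentGraphOffsets(TokenStream ts) throws IOException {
    PositionIncrementAttribute posIncAtt = ts.addAttribute(PositionIncrementAttribute.class);
    PositionLengthAttribute posLenAtt = ts.addAttribute(PositionLengthAttribute.class);
    OffsetAttribute offsetAtt = ts.addAttribute(OffsetAttribute.class);
    Map<Integer,Integer> posToStartOffset = new HashMap<Integer,Integer>();
    Map<Integer,Integer> posToEndOffset = new HashMap<Integer,Integer>();
    int pos = -1;
    ts.reset();
    while (ts.incrementToken()) {
      pos += posIncAtt.getPositionIncrement();
      int endPos = pos + posLenAtt.getPositionLength();
      // all tokens leaving this position must share one start offset
      Integer start = posToStartOffset.get(pos);
      if (start == null) {
        posToStartOffset.put(pos, offsetAtt.startOffset());
      } else if (start.intValue() != offsetAtt.startOffset()) {
        throw new AssertionError("inconsistent startOffset at position " + pos);
      }
      // all tokens arriving at endPos must share one end offset
      Integer end = posToEndOffset.get(endPos);
      if (end == null) {
        posToEndOffset.put(endPos, offsetAtt.endOffset());
      } else if (end.intValue() != offsetAtt.endOffset()) {
        throw new AssertionError("inconsistent endOffset at position " + endPos);
      }
    }
    ts.end();
    ts.close();
  }
}
{noformat}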



[jira] [Commented] (LUCENE-3940) When Japanese (Kuromoji) tokenizer removes a punctuation token it should leave a hole

Posted by "Steven Rowe (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-3940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13243784#comment-13243784 ] 

Steven Rowe commented on LUCENE-3940:
-------------------------------------

bq. I think its well accepted that words carry the information content of a doc, punctuation has no information content really here, it doesn't tell me what the doc is about, and I don't think this is controversial, I just think your view on this is extreme...

I disagree with you, Robert.  (If punctuation has no information content, why does it exist?)  IMHO Mike's examples are not at all extreme, e.g. some punctuation tokens could be used to trigger position increment gaps.

bq. StandardTokenizer doesnt leave holes when it drops punctuation, I think holes should only be real 'words' for the most part

"Standard"Tokenizer is drawn from Unicode UAX#29, which only describes word *boundaries*.  Lucene has grafted onto these boundary rules an assumption that only alphanumeric "words" should be tokens - this assumption does not exist in the standard itself.

My opinion is that we should have both types of things: a tokenizer that discards non-alphanumeric characters between word boundaries; and a different type of analysis component that discards nothing.  I think of the discard-nothing process as *segmentation* rather than tokenization, and I've [argued for it previously|https://issues.apache.org/jira/browse/LUCENE-2498?focusedCommentId=12878963&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-12878963].


[jira] [Commented] (LUCENE-3940) When Japanese (Kuromoji) tokenizer removes a punctuation token it should leave a hole

Posted by "Robert Muir (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-3940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13243126#comment-13243126 ] 

Robert Muir commented on LUCENE-3940:
-------------------------------------

I don't think we should do this. StandardTokenizer doesn't leave holes when it drops punctuation;
I think holes should only be left for real 'words' for the most part


[jira] [Updated] (LUCENE-3940) When Japanese (Kuromoji) tokenizer removes a punctuation token it should leave a hole

Posted by "Robert Muir (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-3940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated LUCENE-3940:
--------------------------------

    Fix Version/s: 3.6.1


[jira] [Updated] (LUCENE-3940) When Japanese (Kuromoji) tokenizer removes a punctuation token it should leave a hole

Posted by "Michael McCandless (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-3940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated LUCENE-3940:
---------------------------------------

    Attachment: LUCENE-3940.patch

New patch, implementing the idea above.  I think it's ready...

It also means we can unconditionally check for correct offsets in
graph tokens, which is nice.



[jira] [Commented] (LUCENE-3940) When Japanese (Kuromoji) tokenizer removes a punctuation token it should leave a hole

Posted by "Michael McCandless (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-3940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13243299#comment-13243299 ] 

Michael McCandless commented on LUCENE-3940:
--------------------------------------------

bq.  StandardTokenizer doesnt leave holes when it drops punctuation,

But is that really good?

This means a PhraseQuery will match across end-of-sentence (.),
semicolon, colon, comma, etc. (English examples..).

I think tokenizers should throw away as little information as
possible... we can always filter out such tokens in a later stage?

For example, if a tokenizer created punct tokens (instead of silently
discarding them), other token filters could make use of them in the
meantime, eg a synonym rule for "u.s.a. -> usa" or maybe a dedicated
English "acronyms" filter.  We could then later filter them out, even
without leaving holes, and have the same behavior that we have now?

Are there non-English examples where you would want the PhraseQuery to
match over punctuation...?  EG, for Japanese, I assume we don't want
PhraseQuery applying across periods/commas, like it will now? (Not
sure about middle dot...?  Others...?).
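
As a tiny illustration of the phrase case (field name and terms are made up,
using the PhraseQuery API of this era): if the '.' between two sentences is
dropped with no hole, the last word of one sentence and the first word of the
next sit at adjacent positions, so the slop-0 query below matches across the
boundary; a hole pushes them one position apart and the match goes away.

{noformat}
import org.apache.lucene.index.Term;
import org.apache.lucene.search.PhraseQuery;

public class SentenceBoundaryPhrase {
  /** Matches "... sentence. Next ..." when no hole is left for the dropped '.',
   *  but not when a hole bumps "next" one position further away. */
  public static PhraseQuery acrossTheBoundary() {
    PhraseQuery pq = new PhraseQuery();            // default slop is 0
    pq.add(new Term("body", "sentence"));
    pq.add(new Term("body", "next"));
    return pq;
  }
}
{noformat}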


[jira] [Commented] (LUCENE-3940) When Japanese (Kuromoji) tokenizer removes a punctuation token it should leave a hole

Posted by "Robert Muir (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-3940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13243744#comment-13243744 ] 

Robert Muir commented on LUCENE-3940:
-------------------------------------

I think kuromoji has the problem: here it creates 'fake' intermediate 'tokens' and then deletes them, and this somehow screws up the decompounding graph???

It should never create these tokens in the first place! I think it's well accepted that words carry the information content of a doc; punctuation really has no information content here, it doesn't tell me what the doc is about, and I don't think this is controversial, I just think your view on this is extreme...



[jira] [Commented] (LUCENE-3940) When Japanese (Kuromoji) tokenizer removes a punctuation token it should leave a hole

Posted by "Christian Moen (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-3940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13244111#comment-13244111 ] 

Christian Moen commented on LUCENE-3940:
----------------------------------------

I'm not familiar with the various considerations that were made with StandardTokenizer, but please allow me to share some comments anyway.

Perhaps it's useful to distinguish between _analysis for information retrieval_ and _analysis for information extraction_ here?

I like Michael's and Steven's idea of doing tokenization that doesn't discard any information.  This is certainly useful in the case of _information extraction_.  For example, if we'd like to extract noun-phrases based on part-of-speech tags, we don't want to conjoin tokens in case there's a punctuation character between two nouns (unless that punctuation character is a middle dot).

Robert is of course correct that we generally don't want to index punctuation characters that occur in every document, so from an _information retrieval_ point of view, we'd like punctuation characters removed.

If there's an established convention that Tokenizer variants discard punctuation and produce the terms that are meant to be directly searchable, it sounds like a good idea that we stick to that convention here as well.

If there's no established convention, it seems useful that a Tokenizer would provide as much detail as possible about the input text and leave downstream Filters/Analyzers to remove whatever is suitable based on a particular processing purpose.  We can provide common ready-to-use Analyzers with reasonable defaults that users can look to, e.g. to process a specific language or do another common high-level task with text.

Hence, perhaps each Tokenizer can decide what makes the most sense to do based on that particular tokenizer's scope of processing?

To Robert's point, this would leave processing totally arbitrary and inconsistent, but this would be _by design_, as it wouldn't be the Tokenizer's role to enforce any overall consistency -- e.g. with regard to punctuation -- higher-level Analyzers would provide that.

Thoughts?


[jira] [Updated] (LUCENE-3940) When Japanese (Kuromoji) tokenizer removes a punctuation token it should leave a hole

Posted by "Michael McCandless (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-3940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated LUCENE-3940:
---------------------------------------

    Attachment: LUCENE-3940.patch

Patch; I think it's ready, except I need to make a simpler test case than that new curious string....


[jira] [Resolved] (LUCENE-3940) When Japanese (Kuromoji) tokenizer removes a punctuation token it should leave a hole

Posted by "Michael McCandless (Resolved) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-3940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless resolved LUCENE-3940.
----------------------------------------

    Resolution: Fixed


[jira] [Commented] (LUCENE-3940) When Japanese (Kuromoji) tokenizer removes a punctuation token it should leave a hole

Posted by "Robert Muir (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-3940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13243799#comment-13243799 ] 

Robert Muir commented on LUCENE-3940:
-------------------------------------

{quote}
I disagree with you, Robert. (If punctuation has no information content, why does it exist?) IMHO Mike's examples are not at all extreme, e.g. some punctuation tokens could be used to trigger position increment gaps.
{quote}

Punctuation simply doesn't tell you anything about the document: this is fact. If we start indexing punctuation, we just create useless terms that go to every document

Because of this, nobody wastes their time trying to figure out how to index "punctuation tokens". Mike's problem is basically the fact that he is creating a compound token of '??' 

Furthermore, the idea that 'if we don't leave a hole for anything removed, we are losing information' is totally arbitrary, confusing, and inconsistent anyway. How come we leave holes for definiteness in English but not for plurals in English, yet in Arabic or Bulgarian we don't leave holes for definiteness, because it happens to be attached to the word and stemmed?



[jira] [Updated] (LUCENE-3940) When Japanese (Kuromoji) tokenizer removes a punctuation token it should leave a hole

Posted by "Michael McCandless (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-3940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated LUCENE-3940:
---------------------------------------

    Attachment: LUCENE-3940.patch

New patch, fixing a bug in the last one, and adding a few more test cases.  I also made the "print curious string on exception" from BTSTC more ASCII-friendly.

I think it's ready.


[jira] [Commented] (LUCENE-3940) When Japanese (Kuromoji) tokenizer removes a punctuation token it should leave a hole

Posted by "Robert Muir (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-3940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13244118#comment-13244118 ] 

Robert Muir commented on LUCENE-3940:
-------------------------------------

{quote}
Perhaps it's useful to distinguish between analysis for information retrieval and analysis for information extraction here?
{quote}

Yes: since we are an information retrieval library, there is no sense in adding *traps* and *complexity* to our analysis API.

{quote}
I like Michael's and Steven's idea of doing tokenization that doesn't discard any information.
{quote}

For IR, this is definitely not information... calling it data is a stretch.

{quote}
If there's an established convention that Tokenizer variants discards punctuation and produces the terms that are meant to be directly searchable, it sounds like a good idea that we stick to the convention here as well.
{quote}

That's what the tokenizers do today: they find tokens (in the IR sense). So yeah, there is an established convention already. Changing
this would be a *monster trap*, because suddenly tons of people would be indexing tons of useless punctuation. I would strongly
oppose such a change.


[jira] [Commented] (LUCENE-3940) When Japanese (Kuromoji) tokenizer removes a punctuation token it should leave a hole

Posted by "Michael McCandless (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-3940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13243782#comment-13243782 ] 

Michael McCandless commented on LUCENE-3940:
--------------------------------------------

Here's an example where we create a compound token with punctuation.

I got this from the Japanese Wikipedia export, with our MockCharFilter
sometimes doubling characters: we are at a position that has the
characters 〇〇'''、''' after it... that 〇 is this Unicode character:
http://www.fileformat.info/info/unicode/char/3007/index.htm

When Kuromoji extends from this position, both 〇 and 〇〇 are KNOWN,
but then we also extend by unknown 〇〇'''、''' (ie, 〇〇 plus only
punctuation):

Note that 〇 is not considered punctuation by Kuromoji's isPunctuation
method...

{noformat}
  + UNKNOWN word 〇〇'''、''' toPos=41 cost=21223 penalty=3400 toPos.idx=0
  + KNOWN word 〇〇 toPos=34 cost=9895 penalty=0 toPos.idx=0
  + KNOWN word 〇 toPos=33 cost=2766 penalty=0 toPos.idx=0
  + KNOWN word 〇 toPos=33 cost=5256 penalty=0 toPos.idx=1
{noformat}

And then on backtrace we make a compound token (UNKNOWN) for all of
"〇〇'''、'''", while the decompounded path keeps two separate "〇"
tokens but drops the '''、''' since it's all punctuation, thus
creating inconsistent offsets.
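
For the curious, the distinction is easy to see from plain Java; this is a
simplified stand-in for the isPunctuation check (the real method looks at more
categories), just to show why 〇 falls on the non-punct side:

{noformat}
public class PunctCheck {
  public static void main(String[] args) {
    // U+3007 IDEOGRAPHIC NUMBER ZERO is a letter-number, so it is not punctuation,
    // while the ideographic comma U+3001 is.
    System.out.println(Character.getType('〇') == Character.LETTER_NUMBER);      // true
    System.out.println(Character.getType('、') == Character.OTHER_PUNCTUATION);  // true
  }
}
{noformat}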



[jira] [Commented] (LUCENE-3940) When Japanese (Kuromoji) tokenizer removes a punctuation token it should leave a hole

Posted by "Robert Muir (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-3940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13244124#comment-13244124 ] 

Robert Muir commented on LUCENE-3940:
-------------------------------------

{quote}
This is certainly useful in the case of information extraction. For example, if we'd like to extract noun-phrases based on part-of-speech tags, we don't want to conjoin tokens in case there's a punctuation character between two nouns (unless that punctuation character is a middle dot).
{quote}

The option still exists in kuromoji (discardPunctuation=false) if you want to use it for this.
I added this option because it originally kept the punctuation (for downstream filters to remove).
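Rough usage sketch of that expert option (class and constructor names as I
recall them for the 3.6/4.0 module, where the Kuromoji* classes were being
renamed to Japanese*, so double-check the javadocs): keep punctuation tokens
and let a downstream filter decide what to drop.

{noformat}
import java.io.StringReader;

import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.ja.JapaneseTokenizer;
import org.apache.lucene.analysis.ja.JapaneseTokenizer.Mode;

public class KeepPunctuationExample {
  public static Tokenizer newTokenizer() {
    return new JapaneseTokenizer(
        new StringReader("今日は晴れ。明日は雨。"),
        null,          // no user dictionary
        false,         // discardPunctuation=false: emit punctuation tokens too
        Mode.SEARCH);
  }
}
{noformat}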

lucene-gosen worked the same way, and after some time I saw *numerous* examples across the internet
where people simply configured the tokenizer without any filters, which meant huge amounts of
punctuation being indexed by default. People don't pay attention to documentation or details,
so all these people were getting bad performance.

Based on this experience, I didn't want keeping punctuation to be the default, nor even achievable
via things like Solr factories here. But I added the (expert) option to Kuromoji because it's really
a more general-purpose thing for Japanese analysis, because it's already being used for other things,
and because allowing a boolean was not expensive or complex to support.

But I don't consider this a bona fide option of the Lucene APIs; I would be strongly against adding
this to the Solr factories, or as an option to KuromojiAnalyzer. And I don't think we should add such
a thing to other tokenizers either. 

Our general mission is search; if we want to decide that we are expanding our analysis API to be generally
useful outside of information retrieval, that's a much bigger decision with more complex tradeoffs that
everyone should vote on (e.g. moving analyzers outside of lucene.apache.org and everything).