Posted to dev@lucene.apache.org by "Robert Muir (JIRA)" <ji...@apache.org> on 2010/09/18 18:18:32 UTC

[jira] Created: (LUCENE-2653) ThaiAnalyzer assumes things about your jre

ThaiAnalyzer assumes things about your jre
------------------------------------------

                 Key: LUCENE-2653
                 URL: https://issues.apache.org/jira/browse/LUCENE-2653
             Project: Lucene - Java
          Issue Type: Bug
          Components: contrib/analyzers
    Affects Versions: 3.1, 4.0
            Reporter: Robert Muir


The ThaiAnalyzer/ThaiWordFilter depends on the fact that BreakIterator.getWordInstance(new Locale("th")) returns a dictionary-based break iterator that can segment Thai phrases into words (Thai does not separate words with whitespace).

But it is non-standard for a JRE to specialize this locale in this way. It's nice, but you can't depend on it.
For example, if you are running on an IBM JRE, this analyzer/word filter is completely "broken" in the sense that it won't do what it claims to do.
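
To see what a given JRE actually does here, a small diagnostic along these lines can help. This is a minimal sketch using only java.text.BreakIterator; the class name WordBoundaryDump, the boundaries helper, and the sample phrase are illustrative assumptions, not anything from Lucene. It prints the boundary offsets reported for a two-word Thai phrase written without a space. A dictionary-based iterator would be expected to report a boundary inside the phrase, while a JRE without Thai support would likely report only the start and end offsets.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical diagnostic, not part of Lucene: lists the word boundaries a JRE reports. */
final class WordBoundaryDump {

    /** Returns every boundary offset the iterator reports for the text, in order. */
    static List<Integer> boundaries(BreakIterator words, String text) {
        List<Integer> offsets = new ArrayList<>();
        words.setText(text);
        for (int pos = words.first(); pos != BreakIterator.DONE; pos = words.next()) {
            offsets.add(pos);
        }
        return offsets;
    }

    public static void main(String[] args) {
        // Assumed sample: "phasa thai" (Thai language), two words of four and
        // three chars written without a space, given as Unicode escapes.
        String sample = "\u0E20\u0E32\u0E29\u0E32\u0E44\u0E17\u0E22";
        BreakIterator thai = BreakIterator.getWordInstance(new Locale("th"));
        // Print the implementation class too, since that is what differs between JREs.
        System.out.println(thai.getClass().getName() + " -> " + boundaries(thai, sample));
    }
}
```

Comparing the output across JREs shows whether ThaiWordFilter has anything to work with on that platform.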

At a minimum, we need to document this and suggest users look at ICUTokenizer for Thai, which always has this break iterator and is not JRE-dependent.

Better would be to check statically that the thing actually works.
When creating a new ThaiWordFilter we could clone() the BreakIterator, which is often cheaper than making a new one anyway.
We could throw an exception if it's not supported, and add a boolean so the user knows whether it works.
And we could refer to this boolean with JUnit's Assume in its tests.


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


[jira] Commented: (LUCENE-2653) ThaiAnalyzer assumes things about your jre

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12912227#action_12912227 ] 

Robert Muir commented on LUCENE-2653:
-------------------------------------

bq. Could that have been a bw break since it did not do what it claimed to do?

I don't understand the question. ThaiWordFilter has always been broken this way; it is broken by design.



[jira] Reopened: (LUCENE-2653) ThaiAnalyzer assumes things about your jre

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-2653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir reopened LUCENE-2653:
---------------------------------


Reopening for a possible 2.9.4/3.0.3 backport.




[jira] Commented: (LUCENE-2653) ThaiAnalyzer assumes things about your jre

Posted by "Simon Willnauer (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12912215#action_12912215 ] 

Simon Willnauer commented on LUCENE-2653:
-----------------------------------------

Looks good to me, Robert! Make sure you add a CHANGES.txt entry. Could that have been a bw break, since it did not do what it claimed to do?

simon



[jira] Commented: (LUCENE-2653) ThaiAnalyzer assumes things about your jre

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12926304#action_12926304 ] 

Robert Muir commented on LUCENE-2653:
-------------------------------------

I'm going to shoot for a documentation-only fix here for 2.9.x and 3.0.x as well...
It's a no-risk "fix" that at least alerts people that this won't work on e.g. the IBM JDK...



[jira] Updated: (LUCENE-2653) ThaiAnalyzer assumes things about your jre

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-2653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated LUCENE-2653:
--------------------------------

    Fix Version/s: 3.0.3
                   2.9.4



[jira] Commented: (LUCENE-2653) ThaiAnalyzer assumes things about your jre

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12912234#action_12912234 ] 

Robert Muir commented on LUCENE-2653:
-------------------------------------

No, in this case the filter does not work at all; it does nothing.



[jira] Resolved: (LUCENE-2653) ThaiAnalyzer assumes things about your jre

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-2653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir resolved LUCENE-2653.
---------------------------------

         Assignee: Robert Muir
    Fix Version/s: 3.1
                   4.0
       Resolution: Fixed

Committed revision 998684 (trunk), 998688 (3x)



[jira] Resolved: (LUCENE-2653) ThaiAnalyzer assumes things about your jre

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-2653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir resolved LUCENE-2653.
---------------------------------

    Resolution: Fixed

Committed documentation about this in:
Revision 1028789 for 3.0.x
Revision 1028791 for 2.9.x



[jira] Updated: (LUCENE-2653) ThaiAnalyzer assumes things about your jre

Posted by "Robert Muir (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-2653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Muir updated LUCENE-2653:
--------------------------------

    Attachment: LUCENE-2653.patch

Here's a patch: it detects statically whether the BreakIterator for the Thai locale will actually work at all,
and sets a boolean DBBI_AVAILABLE.

In the constructor, if this is false, it throws UOE("This JRE does not have support for Thai segmentation").

I also added docs referring to ICUTokenizer in case you need this across all JREs, and put
Assume.assumeTrue(ThaiWordFilter.DBBI_AVAILABLE) in the tests.
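
The approach described in this comment could look roughly like the sketch below. It is not the attached patch: GuardedThaiSegmenter is a hypothetical stand-in for ThaiWordFilter so that the example runs without Lucene, and the probe phrase and the offset checked are assumptions. Only the DBBI_AVAILABLE name, the exception type, and its message are taken from the comment.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical stand-in for ThaiWordFilter; only the availability guard follows the comment. */
final class GuardedThaiSegmenter {
    // Shared prototype, created once; each instance clones it instead of asking the JRE again.
    private static final BreakIterator PROTO = BreakIterator.getWordInstance(new Locale("th"));

    /** True if this JRE's Thai BreakIterator found the word boundary in the probe phrase. */
    static final boolean DBBI_AVAILABLE;

    static {
        // Assumed probe: "phasa thai", two words with the boundary expected at offset 4.
        PROTO.setText("\u0E20\u0E32\u0E29\u0E32\u0E44\u0E17\u0E22");
        DBBI_AVAILABLE = PROTO.isBoundary(4);
    }

    private final BreakIterator breaker;

    GuardedThaiSegmenter() {
        if (!DBBI_AVAILABLE) {
            throw new UnsupportedOperationException(
                    "This JRE does not have support for Thai segmentation");
        }
        breaker = (BreakIterator) PROTO.clone();
    }

    /** Splits text at the reported word boundaries, dropping whitespace-only pieces. */
    List<String> split(String text) {
        List<String> words = new ArrayList<>();
        breaker.setText(text);
        int start = breaker.first();
        int end = breaker.next();
        while (end != BreakIterator.DONE) {
            String piece = text.substring(start, end);
            if (!piece.trim().isEmpty()) {
                words.add(piece);
            }
            start = end;
            end = breaker.next();
        }
        return words;
    }

    public static void main(String[] args) {
        System.out.println("DBBI_AVAILABLE = " + DBBI_AVAILABLE);
        if (DBBI_AVAILABLE) {
            System.out.println(new GuardedThaiSegmenter().split("\u0E20\u0E32\u0E29\u0E32\u0E44\u0E17\u0E22"));
        }
    }
}
```

Failing in the constructor means an unsupported JRE is noticed when the analysis chain is built rather than by silently indexing unsegmented text, and a JUnit test can guard itself with Assume.assumeTrue on the same flag so that it is skipped instead of failing.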





[jira] Commented: (LUCENE-2653) ThaiAnalyzer assumes things about your jre

Posted by "Simon Willnauer (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-2653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12912231#action_12912231 ] 

Simon Willnauer commented on LUCENE-2653:
-----------------------------------------

bq. I dont understand the question. ThaiWordFilter has always been broken this way, it is broken by design.
Could somebody have used the broken behavior and come to rely on it? Just making sure it's not a bw break that we should document.


