Posted to dev@lucene.apache.org by "Grant Ingersoll (JIRA)" <ji...@apache.org> on 2007/12/28 03:47:43 UTC
[jira] Resolved: (LUCENE-1068) Invalid behavior of StandardTokenizerImpl
[ https://issues.apache.org/jira/browse/LUCENE-1068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Grant Ingersoll resolved LUCENE-1068.
-------------------------------------
Resolution: Fixed
Committed.
> Invalid behavior of StandardTokenizerImpl
> -----------------------------------------
>
> Key: LUCENE-1068
> URL: https://issues.apache.org/jira/browse/LUCENE-1068
> Project: Lucene - Java
> Issue Type: Bug
> Components: Analysis
> Reporter: Shai Erera
> Assignee: Grant Ingersoll
> Priority: Minor
> Fix For: 2.3
>
> Attachments: LUCENE-1068.patch, StandardTokenizer-java-4.patch, StandardTokenizer-test-4.patch, StandardTokenizerImpl-2.patch, StandardTokenizerImpl-3.patch, StandardTokenizerImpl-5.patch, standardTokenizerImpl.jflex.patch, standardTokenizerImpl.patch
>
>
> The following code prints the output of StandardAnalyzer:
>     Analyzer analyzer = new StandardAnalyzer();
>     TokenStream ts = analyzer.tokenStream("content", new StringReader("<some text>"));
>     Token t;
>     while ((t = ts.next()) != null) {
>         System.out.println(t);
>     }
> If you pass "www.abc.com", the output is (www.abc.com,0,11,type=<HOST>) (which is correct in my opinion).
> However, if you pass "www.abc.com." (notice the extra '.' at the end), the output is (wwwabccom,0,12,type=<ACRONYM>).
> I think the behavior in the second case is incorrect for several reasons:
> 1. It recognizes the string incorrectly (no argument there).
> 2. It effectively prevents you from putting a URL at the end of a sentence, which is perfectly legal.
> 3. An ACRONYM, at least to the best of my understanding, is of the form A.B.C. and not ABC.DEF.
> I looked at StandardTokenizerImpl.jflex and I think the problem comes from this definition:
> // acronyms: U.S.A., I.B.M., etc.
> // use a post-filter to remove dots
> ACRONYM = {ALPHA} "." ({ALPHA} ".")+
> Note that the comment describes acronyms as U.S.A. or I.B.M., not as anything else. I changed the definition to
> ACRONYM = {LETTER} "." ({LETTER} ".")+
> and it solved the problem.
> This was also reported here:
> http://www.nabble.com/Inconsistent-StandardTokenizer-behaviour-tf596059.html#a1593383
> http://www.nabble.com/Standard-Analyzer---Host-and-Acronym-tf3620533.html#a10109926
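
The difference between the two ACRONYM definitions quoted above can be illustrated with ordinary Java regular expressions. This is only a rough sketch, not the JFlex grammar itself: it assumes that ALPHA stands for a run of one or more letters and that LETTER is a single letter, and it approximates both with the regex class \p{L}. The class and method names (AcronymRuleSketch, isOldAcronym, isNewAcronym) are hypothetical and exist only for this illustration.

```java
import java.util.regex.Pattern;

public class AcronymRuleSketch {
    // Approximation of the original rule: ACRONYM = {ALPHA} "." ({ALPHA} ".")+
    // Assumption: ALPHA is one or more letters, modelled here as \p{L}+.
    static final Pattern OLD_ACRONYM = Pattern.compile("\\p{L}+\\.(\\p{L}+\\.)+");

    // Approximation of the proposed rule: ACRONYM = {LETTER} "." ({LETTER} ".")+
    // Assumption: LETTER is a single letter, modelled here as \p{L}.
    static final Pattern NEW_ACRONYM = Pattern.compile("\\p{L}\\.(\\p{L}\\.)+");

    // True if the whole string matches the approximated original rule.
    static boolean isOldAcronym(String s) {
        return OLD_ACRONYM.matcher(s).matches();
    }

    // True if the whole string matches the approximated proposed rule.
    static boolean isNewAcronym(String s) {
        return NEW_ACRONYM.matcher(s).matches();
    }

    public static void main(String[] args) {
        String[] samples = {"U.S.A.", "I.B.M.", "www.abc.com.", "www.abc.com"};
        for (String s : samples) {
            System.out.println(s + " old=" + isOldAcronym(s) + " new=" + isNewAcronym(s));
        }
    }
}
```

Under these assumptions, "www.abc.com." matches the old pattern, because every dot-terminated run of letters counts as an acronym segment, but it does not match the new one, which allows only one letter per segment. "U.S.A." matches both. The real tokenizer also weighs competing rules such as HOST and picks the longest match, which this sketch does not model.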