Posted to dev@lucene.apache.org by "Lars Feistner (JIRA)" <ji...@apache.org> on 2011/04/27 11:33:03 UTC
[jira] [Created] (LUCENE-3047) HyphenationCompoundWordTokenFilter does not work correctly with the German word Brustamputation
Key: LUCENE-3047
URL: https://issues.apache.org/jira/browse/LUCENE-3047
Project: Lucene - Java
Issue Type: Bug
Components: contrib/analyzers
Affects Versions: 3.1
Environment: Linux 2.6.32-31-generic
             java version "1.6.0_21"
             Java(TM) SE Runtime Environment (build 1.6.0_21-b06)
             Java HotSpot(TM) 64-Bit Server VM (build 17.0-b16, mixed mode)
Reporter: Lars Feistner
Priority: Minor
The following test fails:
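// Imports the test needs (Assert and @Test are assumed to be TestNG,
// judging by the message-last argument order).
import java.io.IOException;
import java.io.InputStream;
import java.io.Reader;
import java.io.StringReader;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceTokenizer;
import org.apache.lucene.analysis.compound.CompoundWordTokenFilterBase;
import org.apache.lucene.analysis.compound.HyphenationCompoundWordTokenFilter;
import org.apache.lucene.analysis.compound.hyphenation.HyphenationTree;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;
import org.testng.Assert;
import org.testng.annotations.Test;
import org.xml.sax.InputSource;

@Test
public void testBrustamputation() throws IOException {
    Analyzer compoundAnalyzer = new Analyzer() {
        @Override
        public TokenStream tokenStream( String fieldName, Reader reader ) {
            // Load the German hyphenation grammar from the classpath.
            InputStream in = this.getClass().getResourceAsStream( "/de_DR.xml" );
            final InputSource inputSource = new InputSource( in );
            inputSource.setEncoding( "iso-8859-1" );
            HyphenationTree hyphenator = null;
            try {
                hyphenator = HyphenationCompoundWordTokenFilter.getHyphenationTree( inputSource );
            } catch ( Exception ex ) {
                Assert.fail( "could not load hyphenation grammar /de_DR.xml", ex );
            }
            // Dictionary holding the two expected subwords.
            Set<String> dict = new HashSet<String>( Arrays.asList( "brust", "amputation" ) );
            // minSubwordSize = 4, onlyLongestMatch = false
            return new HyphenationCompoundWordTokenFilter( Version.LUCENE_31,
                new WhitespaceTokenizer( Version.LUCENE_31, reader ), hyphenator, dict,
                CompoundWordTokenFilterBase.DEFAULT_MIN_WORD_SIZE,
                4, CompoundWordTokenFilterBase.DEFAULT_MAX_SUBWORD_SIZE, false );
        }
    };
    TokenStream tokenStream = compoundAnalyzer.tokenStream( "Kurztext", new StringReader( "brustamputation" ) );
    CharTermAttribute t = tokenStream.addAttribute( CharTermAttribute.class );
    // Collect every emitted token.
    Set<String> tokenSet = new HashSet<String>();
    while ( tokenStream.incrementToken() ) {
        tokenSet.add( t.toString() );
        System.out.println( t );
    }
    // Expect the original word and both dictionary subwords.
    Assert.assertTrue( tokenSet.contains( "brust" ), "brust" );
    Assert.assertTrue( tokenSet.contains( "brustamputation" ), "brustamputation" );
    Assert.assertTrue( tokenSet.contains( "amputation" ), "amputation" );
}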
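
For reference, the three assertions in the test expect the token set to contain brustamputation, brust and amputation. A minimal sketch of that expected decomposition, using a plain substring lookup against the dictionary instead of the hyphenation grammar, might look like the following. CompoundSketch and decompose are hypothetical names and not Lucene API, and this is not how HyphenationCompoundWordTokenFilter is implemented. The sizes 5 and 15 are assumed stand-ins for the default constants; only the minimum subword size of 4 comes from the test.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper for illustration only; not part of Lucene.
class CompoundSketch {

    // Returns the original word followed by every dictionary entry that occurs
    // as a substring with a length in [minSubwordSize, maxSubwordSize].
    static List<String> decompose(String word, Set<String> dict,
                                  int minWordSize, int minSubwordSize, int maxSubwordSize) {
        List<String> tokens = new ArrayList<String>();
        tokens.add(word); // the original token is always kept
        if (word.length() < minWordSize) {
            return tokens; // too short to be treated as a compound
        }
        String lower = word.toLowerCase(Locale.ROOT);
        for (int start = 0; start + minSubwordSize <= lower.length(); start++) {
            int maxLen = Math.min(maxSubwordSize, lower.length() - start);
            for (int len = minSubwordSize; len <= maxLen; len++) {
                String part = lower.substring(start, start + len);
                if (dict.contains(part)) {
                    tokens.add(part);
                }
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        Set<String> dict = new HashSet<String>(Arrays.asList("brust", "amputation"));
        // Assumed sizes: min word size 5, min subword size 4, max subword size 15.
        System.out.println(decompose("brustamputation", dict, 5, 4, 15));
        // prints [brustamputation, brust, amputation]
    }
}
```

With a lookup like this, both subwords are found, which is the result the failing test expects from the hyphenation-based filter as well.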