Posted to dev@lucene.apache.org by "Lars Feistner (JIRA)" <ji...@apache.org> on 2011/04/27 11:33:03 UTC

[jira] [Created] (LUCENE-3047) HyphenationCompoundWordTokenFilter does not work correctly with the German word Brustamputation

HyphenationCompoundWordTokenFilter does not work correctly with the German word Brustamputation
-----------------------------------------------------------------------------------------------

                 Key: LUCENE-3047
                 URL: https://issues.apache.org/jira/browse/LUCENE-3047
             Project: Lucene - Java
          Issue Type: Bug
          Components: contrib/analyzers
    Affects Versions: 3.1
         Environment: Linux 2.6.32-31-generic
                      java version "1.6.0_21"
                      Java(TM) SE Runtime Environment (build 1.6.0_21-b06)
                      Java HotSpot(TM) 64-Bit Server VM (build 17.0-b16, mixed mode)
            Reporter: Lars Feistner
            Priority: Minor


The following test fails:

    @Test
    public void testBrustamputation() throws IOException {
        Analyzer compoundAnalyzer = new Analyzer() {
            @Override
            public TokenStream tokenStream( String fieldName, Reader reader ) {
                // Load the German hyphenation grammar from the classpath.
                InputStream in = this.getClass().getResourceAsStream( "/de_DR.xml" );
                final InputSource inputSource = new InputSource( in );
                inputSource.setEncoding( "iso-8859-1" );

                HyphenationTree hyphenator = null;
                try {
                    hyphenator = HyphenationCompoundWordTokenFilter.getHyphenationTree( inputSource );
                } catch ( Exception ex ) {
                    Assert.fail( "could not load hyphenation tree", ex );
                }

                // Dictionary of the two subwords expected in the compound.
                Set<String> dict = new HashSet<String>( Arrays.asList( "brust", "amputation" ) );
                // Minimum subword size is 4; onlyLongestMatch is false.
                return new HyphenationCompoundWordTokenFilter( Version.LUCENE_31,
                        new WhitespaceTokenizer( Version.LUCENE_31, reader ), hyphenator,
                        dict, CompoundWordTokenFilterBase.DEFAULT_MIN_WORD_SIZE,
                        4, CompoundWordTokenFilterBase.DEFAULT_MAX_SUBWORD_SIZE, false );
            }
        };

        TokenStream tokenStream = compoundAnalyzer.tokenStream( "Kurztext", new StringReader( "brustamputation" ) );
        CharTermAttribute t = tokenStream.addAttribute( CharTermAttribute.class );

        // Collect every emitted token so the assertions do not depend on order.
        Set<String> tokenSet = new HashSet<String>();
        while ( tokenStream.incrementToken() ) {
            tokenSet.add( t.toString() );
            System.out.println( t );
        }

        // Expected: the original token plus both dictionary subwords.
        Assert.assertTrue( tokenSet.contains( "brust" ), "brust" );
        Assert.assertTrue( tokenSet.contains( "brustamputation" ), "brustamputation" );
        Assert.assertTrue( tokenSet.contains( "amputation" ), "amputation" );
    }
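
To make the expected result concrete, here is a minimal, Lucene-independent sketch of the token set the test asserts. It is not how HyphenationCompoundWordTokenFilter works internally: the class ExpectedDecomposition and its decompose method are hypothetical names, and the approach (brute-force dictionary lookup over all substrings, ignoring hyphenation points) is an assumption chosen only to show which tokens the assertions look for. The maximum subword size of 15 is likewise assumed for illustration.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical helper, not part of Lucene: checks every substring of the
// word against the dictionary, without consulting any hyphenation grammar.
class ExpectedDecomposition {

    static Set<String> decompose(String word, Set<String> dict,
                                 int minSubwordSize, int maxSubwordSize) {
        Set<String> tokens = new LinkedHashSet<String>();
        tokens.add(word); // the original token is always kept
        for (int start = 0; start + minSubwordSize <= word.length(); start++) {
            for (int len = minSubwordSize;
                 len <= maxSubwordSize && start + len <= word.length();
                 len++) {
                String candidate = word.substring(start, start + len);
                if (dict.contains(candidate)) {
                    tokens.add(candidate);
                }
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        Set<String> dict = new HashSet<String>(Arrays.asList("brust", "amputation"));
        // 4 mirrors the minimum subword size used in the test; 15 is an assumed maximum.
        System.out.println(decompose("brustamputation", dict, 4, 15));
        // prints [brustamputation, brust, amputation]
    }
}
```

The failing test expects the same three tokens from the hyphenation-based filter, with the difference that the real filter only considers subwords that start and end at hyphenation points reported by the grammar.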
