Posted to java-user@lucene.apache.org by Wulf Berschin <be...@dosco.de> on 2011/04/01 17:50:28 UTC

Undo hyphenation when indexing

Hi,

For indexing PDF files we have to undo word hyphenation. The basic idea 
is simply to remove the hyphen when it is followed by a line break and a 
lowercase letter. Of course this approach isn't 100% foolproof, but 
checking against a dictionary wouldn't be either...
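
The rule described above can be sketched in plain Java with a regular 
expression. This is only a minimal illustration with no Lucene dependency; 
the class and method names (Dehyphenator, undoHyphenation) are made up for 
the example, and tolerating spaces or tabs around the line break is an 
added assumption.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Lucene or Solr.
class Dehyphenator {
    // A hyphen, optional trailing blanks, a line break, optional indentation,
    // followed by a lowercase letter. The letter is matched by a lookahead,
    // so it is kept in the output.
    private static final Pattern LINE_END_HYPHEN =
            Pattern.compile("-[ \\t]*\\r?\\n[ \\t]*(?=\\p{Ll})");

    // Removes the hyphen and the line break, joining the two word halves.
    static String undoHyphenation(String text) {
        return LINE_END_HYPHEN.matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(undoHyphenation("word hyphen-\nation in a PDF"));
    }
}
```

A hyphen followed by a line break and an uppercase letter is left alone, 
and so is a hyphen inside a line (as in "well-known"). Genuinely hyphenated 
compounds that happen to break at the hyphen would still be joined, which 
is the known weakness of the rule.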

Since we face this problem too when highlighting using HTMLCharStripper 
(yes, we do have hyphenation in our HTML docs...), it seems to me I have 
to adjust the JFlex-generated StandardTokenizerImpl.

Is this the right approach, and how would I have to modify this script?

Thanks
Wulf


PS: I see that there are changes in the brand-new 3.1.0 version (we 
are using 3.0.3), but as far as I understand there are no relevant 
changes in this respect.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: Undo hyphenation when indexing

Posted by Wulf Berschin <be...@dosco.de>.

Thank you, Yonik, for this hint. (Again, I wasn't aware that Solr 
obviously offers useful extensions for the Lucene indexing process, and I 
wonder why they haven't been added to Lucene itself.)

Anyway, since the HyphenatedWordsFilter needs newlines in the input, I 
will have to use a Tokenizer other than StandardTokenizer. If I simply 
take the WhitespaceTokenizerFactory (as suggested by 
HyphenatedWordsFilterFactory), I will lose the punctuation handling done 
by StandardTokenizer, right? What will I have to borrow for that? Or do 
I have to extend StandardTokenizerImpl.jflex?
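
For orientation, the token-joining idea behind such a filter can be 
sketched in plain Java on a list of whitespace-separated tokens. This is 
not the Solr implementation, only a rough illustration under the 
assumption that a token ending in a hyphen is glued to the token that 
follows it; the names HyphenJoiner and joinHyphenated are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration, not the Solr HyphenatedWordsFilter itself.
class HyphenJoiner {
    // Joins every token that ends with '-' to the token that follows it.
    static List<String> joinHyphenated(List<String> tokens) {
        List<String> out = new ArrayList<>();
        String carry = null; // first half of a broken word, hyphen removed
        for (String token : tokens) {
            String current = (carry == null) ? token : carry + token;
            carry = null;
            if (current.length() > 1 && current.endsWith("-")) {
                // Hold the fragment back until the next token arrives.
                carry = current.substring(0, current.length() - 1);
            } else {
                out.add(current);
            }
        }
        if (carry != null) {
            out.add(carry + "-"); // input ended on a fragment: emit it unchanged
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> tokens = new ArrayList<>();
        tokens.add("ecologi-");
        tokens.add("cal");
        System.out.println(joinHyphenated(tokens));
    }
}
```

Working on whitespace tokens means the line break itself is no longer 
visible, so this sketch joins on any trailing hyphen and does not check 
for a lowercase letter; punctuation stays attached to the tokens, which is 
the drawback raised above.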

Wulf


On 01.04.2011 18:23, Yonik Seeley wrote:
> Solr has a hyphenated word filter you could copy.
> http://lucene.apache.org/solr/api/org/apache/solr/analysis/HyphenatedWordsFilterFactory.html
>
> On trunk, this has been folded into the analysis module.
>
> -Yonik



Re: Undo hyphenation when indexing

Posted by Yonik Seeley <yo...@lucidimagination.com>.

Solr has a hyphenated word filter you could copy.
http://lucene.apache.org/solr/api/org/apache/solr/analysis/HyphenatedWordsFilterFactory.html

On trunk, this has been folded into the analysis module.

-Yonik
http://www.lucenerevolution.org -- Lucene/Solr User Conference, May
25-26, San Francisco

