Posted to java-user@lucene.apache.org by Hang Mang <gu...@googlemail.com> on 2013/06/27 17:45:55 UTC
Language detection
Hello,
is there some kind of filter or component that I could use to filter out
non-English text? I have a preprocessing step, and I only want to index
English documents.
Best,
Gucko
Re: Language detection
Posted by Jack Krupansky <ja...@basetechnology.com>.
Oops... sorry, I just realized this was on the Lucene-user list. My response
was for Solr-ONLY!
-- Jack Krupansky
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org
Re: Language detection
Posted by Jack Krupansky <ja...@basetechnology.com>.
You can use the LangDetectLanguageIdentifierUpdateProcessorFactory update
processor to redirect text in each detected language to alternate fields,
and then set the non-English fields to be "ignored". The document would
still be indexed, however, just without the redirected text fields.
(Examples of using that update processor are in my book - but not the
"ignored" step.)
There is also a Tika-specific processor:
TikaLanguageIdentifierUpdateProcessorFactory
If you really want to completely suppress the indexing of documents
containing non-English text, you'll have to make an explicit check before
sending the document to Solr. Tika also has language detection, so you
could call Tika from an external process before sending the document to
Solr.
-- Jack Krupansky
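A minimal sketch of the explicit pre-indexing check described above, offered
as an illustration rather than as part of the original reply. It assumes the
language detector is supplied as a function from text to a language code. The
names EnglishOnlyFilter, keepEnglish and toyDetect are hypothetical, and
toyDetect is only a crude stopword-based stand-in so the example runs on its
own; in practice the detector would wrap a real library such as Tika, whose
API is not shown here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.function.Function;

// Hypothetical helper: drops non-English documents before they are indexed.
class EnglishOnlyFilter {

    // Assumption: a tiny stopword list used only by the stand-in detector.
    static final Set<String> ENGLISH_STOPWORDS = new HashSet<>(Arrays.asList(
            "the", "and", "is", "of", "to", "in", "that", "it", "for", "with"));

    // Stand-in detector (NOT a real language identifier): returns "en" when
    // at least 20% of the word tokens are common English stopwords.
    static String toyDetect(String text) {
        String[] tokens = text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+");
        int total = 0;
        int hits = 0;
        for (String token : tokens) {
            if (token.isEmpty()) {
                continue; // split() can yield an empty leading token
            }
            total++;
            if (ENGLISH_STOPWORDS.contains(token)) {
                hits++;
            }
        }
        return (total > 0 && hits * 5 >= total) ? "en" : "unknown";
    }

    // Keeps only the documents whose detected language code is "en".
    // The detector is pluggable, so a library-backed one can replace toyDetect.
    static List<String> keepEnglish(List<String> docs,
                                    Function<String, String> detector) {
        List<String> kept = new ArrayList<>();
        for (String doc : docs) {
            if ("en".equals(detector.apply(doc))) {
                kept.add(doc);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList(
                "The cat is in the garden and it is happy.",
                "Der Hund läuft schnell durch den Garten.");
        // Only the documents that pass the check would be sent on for indexing.
        for (String doc : keepEnglish(docs, EnglishOnlyFilter::toyDetect)) {
            System.out.println("would index: " + doc);
        }
    }
}
```

Because the check runs before anything is sent to the indexer, rejected
documents are never indexed at all, unlike the update-processor approach,
where the document is still indexed without its redirected fields.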