Posted to java-user@lucene.apache.org by Graham Sugden <gr...@gmail.com> on 2011/08/18 18:23:42 UTC

Multiple fields derived from same source text?

Hi,

I am just beginning to implement text indexing for an application I am
building and am not quite sure of a few things. The documents indexed will
be in various languages, ranging mostly from short notes to ~20 page
articles (with the occasional book-length text). My plan is to have
separate indexes for each language, each of which would contain a number of
fields created from the same text analyzed in a number of ways. So for an
English document I might have the fields

         stem, suffix, token

generated from the same text with, respectively,

        an EnglishAnalyzer(), a custom analyzer with a
ReverseStringFilter(), and a StandardAnalyzer().

As doing things this way seems to mean having the text go through the
standard and stop-word filters 3 times, once for each field, I am wondering
if there is a way to do something like this (with custom analyzers or an
implementation of PerFieldAnalyzerWrapper, or even out of the box--I'm
very new to Lucene) that could avoid that duplicate processing*? Maybe a
way to store the result of the analysis for the "token" field**, to be
reused as the starting point for the analysis for the "stem" and "suffix"
fields (which would then just need the application of a stemming filter and
the ReverseStringFilter respectively).

Note I am keen to avoid any pre-analysis processing of the text, as I would
like to keep the offsets etc. in line with the sources (stored externally)
for hit highlighting when I eventually get that far!

Any help/advice greatly appreciated.

Thanks and kind regards,

graham

* In languages requiring removal of diacritics for some fields and not
others, etc., there will I guess be more duplication.
** Would this be achievable with reusableTokenStream()? With my Google
skills I haven't been able to get any clear idea of how to go about using
it.

Re: Multiple fields derived from same source text?

Posted by Graham Sugden <gr...@gmail.com>.

Closed! TeeSinkTokenFilter and CachingTokenFilter seem to provide the
functionality/code examples I was looking for.
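
For anyone finding this thread later, the underlying idea (run the shared
tokenizing and stop-word steps once, cache the tokens with their offsets,
then derive each field from the cache) can be sketched in plain Java. This is
only an illustration and does not use the Lucene API: the SharedAnalysis and
Token classes, the tiny stop-word list and the toy "strip a trailing s"
stemmer are all made up for the example. In Lucene itself the caching part
would be handled by TeeSinkTokenFilter or CachingTokenFilter.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical, Lucene-free sketch: analyze once, reuse the result per field.
final class SharedAnalysis {

    /** A term plus its character offsets in the original, untouched text. */
    static final class Token {
        final String term;
        final int start;
        final int end;

        Token(String term, int start, int end) {
            this.term = term;
            this.start = start;
            this.end = end;
        }
    }

    // Assumed stop-word list, for illustration only.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "the", "of", "and"));

    /** Shared step, run once per document: split, lowercase, drop stop words. */
    static List<Token> baseTokens(String text) {
        List<Token> out = new ArrayList<>();
        int i = 0;
        int n = text.length();
        while (i < n) {
            while (i < n && !Character.isLetterOrDigit(text.charAt(i))) {
                i++; // skip separators
            }
            int start = i;
            while (i < n && Character.isLetterOrDigit(text.charAt(i))) {
                i++; // consume one word
            }
            if (i > start) {
                String term = text.substring(start, i).toLowerCase(Locale.ROOT);
                if (!STOP_WORDS.contains(term)) {
                    out.add(new Token(term, start, i));
                }
            }
        }
        return out;
    }

    /** "suffix" field: reverse each cached term; offsets are carried over. */
    static List<Token> reversed(List<Token> base) {
        List<Token> out = new ArrayList<>();
        for (Token t : base) {
            String r = new StringBuilder(t.term).reverse().toString();
            out.add(new Token(r, t.start, t.end));
        }
        return out;
    }

    /** "stem" field: toy stemmer standing in for a real stemming filter. */
    static List<Token> stemmed(List<Token> base) {
        List<Token> out = new ArrayList<>();
        for (Token t : base) {
            String s = t.term;
            if (s.length() > 3 && s.endsWith("s")) {
                s = s.substring(0, s.length() - 1);
            }
            out.add(new Token(s, t.start, t.end));
        }
        return out;
    }

    public static void main(String[] args) {
        String text = "The analyzers of Lucene";
        List<Token> token = baseTokens(text);   // analyzed once
        List<Token> stem = stemmed(token);      // derived from the cache
        List<Token> suffix = reversed(token);   // derived from the cache
        for (int i = 0; i < token.size(); i++) {
            Token t = token.get(i);
            System.out.println(t.term + " / " + stem.get(i).term + " / "
                    + suffix.get(i).term + " @ " + t.start + "-" + t.end);
        }
    }
}
```

Because the derived fields copy the start and end offsets from the cached
tokens, every field still points at the same positions in the source text,
which is what hit highlighting against externally stored sources needs.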

Thanks, graham.
