Posted to java-user@lucene.apache.org by hui <hu...@triplehop.com> on 2003/09/19 17:15:16 UTC

some requests

Hi,
I am wondering whether we could make some changes in a future version of Lucene.
1. Move the Analyzer down from the document level to the field level, so that
some fields can use a special analyzer while the other fields still use the
default analyzer from the document level.
For example, I do not need to index the numbers in the "content" field, which
reduces the index size a lot when I have Excel files. But I always need
"created_date" to be indexed, even though it is a numeric field.

I know some workarounds have been posted to the group, but I think it would be
a good feature to have.

2. Does it affect performance or space much if I use a large maximum field
length, such as 1000000? (Normally only "content" needs that.) If so, could
we associate this parameter with the field rather than the document?

3. Based on the article
(http://www.onjava.com/pub/a/onjava/2003/03/05/lucene.html) by Otis (it is
really great!), the mergeFactor controls both when the indexed documents are
put into one segment and when the segments are merged into one big segment.
On Windows, a larger mergeFactor can cause the "Too many files open" problem
when merging the segments, but a lower mergeFactor slows down indexing.
Could we expose two different parameters here? One would control when the
documents are written into one segment on disk, so I can set it larger when
the machine has enough memory; the other would control when the segments are
merged.
Right now both happen at powers of the mergeFactor.

Regards,
Hui


Re: Proposition :adding minMergeDoc to IndexWriter

Posted by hui <hu...@triplehop.com>.
It is great, Julien. Thanks.
Next time I am going to post the requests to the developer groups.

Regards,
Hui
----- Original Message ----- 
From: "Julien Nioche" <Ju...@lingway.com>
To: "Lucene Users List" <lu...@jakarta.apache.org>
Sent: Tuesday, September 23, 2003 5:38 AM
Subject: Proposition :adding minMergeDoc to IndexWriter


> Hui,
>
> Concerning another point of your request list: I proposed a patch this
> weekend on the lucene-dev list and I totally forgot that this feature was
> requested on the user list.
>
> This new feature should help you to set a number of Documents to be merged
> in memory independently of the mergeFactor.
>
> Any comments would be appreciated
>
> Best regards
>
> Julien Nioche
> http://www.lingway.com
>

---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org


Proposition :adding minMergeDoc to IndexWriter

Posted by Julien Nioche <Ju...@lingway.com>.
Hui,

Concerning another point of your request list: I proposed a patch this
weekend on the lucene-dev list and I totally forgot that this feature was
requested on the user list.

This new feature should let you set the number of Documents to be merged
in memory independently of the mergeFactor.

Any comments would be appreciated.

Best regards

Julien Nioche
http://www.lingway.com

---------- Start of original message -----------

From    : "fp235-5" <ju...@lingway.com>
To      : "lucene-dev" <lu...@jakarta.apache.org>
Cc      :
Date    : Sat, 20 Sep 2003 16:06:06 +0200
Subject : [PATCH] IndexWriter : controling the number of Docs merged

Hello,

Someone made a suggestion yesterday about adding a variable to IndexWriter in
order to control the number of Documents merged in a RAMDirectory
independently of the mergeFactor. (I'm sorry, I don't remember who exactly,
and the mail arrived at my office.)
I'm proposing a tiny modification of IndexWriter to add this functionality. A
variable minMergeDocs specifies the number of Documents to be merged in memory
before starting a new Segment. The mergeFactor still controls the number of
Segments created in the Directory, and thus it's possible to avoid the file
number limitation problem.
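
To make the interaction between the two parameters concrete, here is a small
toy model in plain Java. It is only a sketch and not the code from the patch
or from IndexWriter: the class MergeSketch and the method segmentSizes are
made-up names, and the model assumes that a segment is written each time
minMergeDocs documents have been buffered and that mergeFactor segments of
equal size are merged into one.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of segment merging; NOT Lucene's actual IndexWriter code. */
final class MergeSketch {

    /**
     * Returns the sizes (in documents) of the on-disk segments after adding
     * numDocs documents. minMergeDocs documents are buffered in memory before
     * a segment is written; mergeFactor equal-sized segments are merged.
     */
    static List<Integer> segmentSizes(int numDocs, int minMergeDocs, int mergeFactor) {
        if (minMergeDocs < 1 || mergeFactor < 2) {
            throw new IllegalArgumentException("need minMergeDocs >= 1 and mergeFactor >= 2");
        }
        List<Integer> segments = new ArrayList<>();
        int buffered = 0;
        for (int i = 0; i < numDocs; i++) {
            buffered++;
            if (buffered == minMergeDocs) {
                segments.add(buffered);   // write the in-memory buffer as one segment
                buffered = 0;
                mergeEqualSegments(segments, mergeFactor);
            }
        }
        if (buffered > 0) {
            segments.add(buffered);       // final flush, as when the writer is closed
        }
        return segments;
    }

    /** Merges the newest mergeFactor segments while they all have the same size. */
    private static void mergeEqualSegments(List<Integer> segments, int mergeFactor) {
        while (segments.size() >= mergeFactor) {
            int n = segments.size();
            int size = segments.get(n - 1);
            for (int j = n - mergeFactor; j < n; j++) {
                if (segments.get(j) != size) {
                    return;               // newest segments differ in size: nothing to merge
                }
            }
            for (int j = 0; j < mergeFactor; j++) {
                segments.remove(segments.size() - 1);
            }
            segments.add(size * mergeFactor);
        }
    }
}
```

In this model, indexing 250 documents with mergeFactor 10 leaves seven
segments when minMergeDocs is 10 and three segments when minMergeDocs is 100,
so raising minMergeDocs reduces the number of files without touching the
mergeFactor.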

The diff file is attached.

As noticed by Dmitry and Erik, there are no true JUnit tests. I'd be OK to
write a JUnit test for this feature. The problem is that the SegmentInfos
field is private in IndexWriter and can't be used to check the number and size
of the Segments. I ran a test using the infoStream variable of IndexWriter,
and everything seems to be OK.

Any comments / suggestions are welcome.

Regards

Julien









----- Original Message -----
From: "hui" <hu...@triplehop.com>
To: "Lucene Users List" <lu...@jakarta.apache.org>
Sent: Monday, September 22, 2003 3:40 PM
Subject: Re: per-field Analyzer (was Re: some requests)


> Good work, Erik.
>
> Hui
>

Re: per-field Analyzer (was Re: some requests)

Posted by hui <hu...@triplehop.com>.
Good work, Erik.

Hui

----- Original Message ----- 
From: "Erik Hatcher" <er...@ehatchersolutions.com>
To: "Lucene Users List" <lu...@jakarta.apache.org>
Sent: Saturday, September 20, 2003 4:13 AM
Subject: per-field Analyzer (was Re: some requests)




per-field Analyzer (was Re: some requests)

Posted by Erik Hatcher <er...@ehatchersolutions.com>.
On Friday, September 19, 2003, at 07:45  PM, Erik Hatcher wrote:
> On Friday, September 19, 2003, at 11:15  AM, hui wrote:
>> 1. Move the Analyzer down to field level from document level so some 
>> fields
>> could be applied a special analyzer. Other fields still use the default
>> analyzer from the document level.
>> For example, I do not need to index the number for the "content" 
>> field. It
>> helps me reduce the index size a lot when I have some excel files. 
>> But I
>> always need the "created_date" to be indexed though it is a number 
>> field.
>>
>> I know there are some workarounds put in the group, but I think it 
>> should be
>> a good feature to have.
>
> The "workaround" is to write a custom analyzer and have it do the 
> desired thing per-field.
>
> Hmmm.... just thinking out loud here without knowing if this is 
> possible, but could a generic "wrapper" Analyzer be written that 
> allows other analyzers to be used under the covers based on a field 
> name/analyzer mapping?   If so, that would be quite cool and save 
> folks from having to write custom analyzers as much to handle this 
> pretty typical use-case.  I'll look into this more in the very near 
> future personally, but feel free to have a look at this yourself and 
> see what you can come up with.

What about something like this?

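import java.io.Reader;
import java.util.HashMap;
import java.util.Map;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;

/** Delegates to a per-field analyzer, falling back to a default one. */
public class PerFieldWrapperAnalyzer extends Analyzer {
   private Analyzer defaultAnalyzer;
   // Field name (String) -> Analyzer to use for that field.
   private Map analyzerMap = new HashMap();

   public PerFieldWrapperAnalyzer(Analyzer defaultAnalyzer) {
     this.defaultAnalyzer = defaultAnalyzer;
   }

   public void addAnalyzer(String fieldName, Analyzer analyzer) {
     analyzerMap.put(fieldName, analyzer);
   }

   public TokenStream tokenStream(String fieldName, Reader reader) {
     // Use the analyzer registered for this field, if any.
     Analyzer analyzer = (Analyzer) analyzerMap.get(fieldName);
     if (analyzer == null) {
       analyzer = defaultAnalyzer;
     }

     return analyzer.tokenStream(fieldName, reader);
   }
}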

This would allow you to construct a single analyzer out of others, on a 
per-field basis, including a default one for any fields that do not 
have a special one.  Whether the constructor should take the map or an 
addAnalyzer method should be provided is debatable, but I prefer the 
addAnalyzer way.  Maybe addAnalyzer could return 'this' so you could 
chain: new PerFieldWrapperAnalyzer(new 
StandardAnalyzer()).addAnalyzer("field1", new 
WhitespaceAnalyzer()).addAnalyzer(.....).  And I'm more inclined to 
call this thing PerFieldAnalyzerWrapper instead.  Any naming 
suggestions?
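
A rough sketch of the chaining idea follows, kept free of Lucene so that it
runs on its own. Everything in it is a stand-in chosen for illustration:
SimpleAnalyzer replaces Lucene's Analyzer and returns plain strings instead of
a TokenStream, and ChainablePerFieldAnalyzer and ChainingDemo are hypothetical
names. The only point is that addAnalyzer returns 'this'.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Stand-in for Lucene's Analyzer (assumption): returns tokens as strings. */
interface SimpleAnalyzer {
    List<String> tokens(String fieldName, String text);
}

/** Hypothetical chainable variant of the per-field wrapper. */
final class ChainablePerFieldAnalyzer implements SimpleAnalyzer {
    private final SimpleAnalyzer defaultAnalyzer;
    private final Map<String, SimpleAnalyzer> analyzerMap = new HashMap<>();

    ChainablePerFieldAnalyzer(SimpleAnalyzer defaultAnalyzer) {
        this.defaultAnalyzer = defaultAnalyzer;
    }

    /** Registers an analyzer for one field and returns this for chaining. */
    ChainablePerFieldAnalyzer addAnalyzer(String fieldName, SimpleAnalyzer analyzer) {
        analyzerMap.put(fieldName, analyzer);
        return this;
    }

    @Override
    public List<String> tokens(String fieldName, String text) {
        // Fall back to the default analyzer for fields without a special one.
        SimpleAnalyzer analyzer = analyzerMap.getOrDefault(fieldName, defaultAnalyzer);
        return analyzer.tokens(fieldName, text);
    }
}

final class ChainingDemo {
    static ChainablePerFieldAnalyzer build() {
        // Default: split on whitespace and drop purely numeric tokens.
        SimpleAnalyzer dropNumbers = (field, text) -> {
            List<String> out = new ArrayList<>();
            for (String t : text.trim().split("\\s+")) {
                if (!t.isEmpty() && !t.matches("\\d+")) {
                    out.add(t);
                }
            }
            return out;
        };
        // Special case: keep every whitespace-separated token, numbers included.
        SimpleAnalyzer keepAll = (field, text) -> Arrays.asList(text.trim().split("\\s+"));
        return new ChainablePerFieldAnalyzer(dropNumbers).addAnalyzer("created_date", keepAll);
    }
}
```

With this setup the "content" field loses its numeric tokens while
"created_date" keeps them, which is the case from the original request.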

This simple little class would seem to be the answer to a very commonly 
asked question.

Thoughts?  Should this be made part of the core?

	Erik


per-field Analyzer (was Re: some requests)

Posted by Erik Hatcher <er...@ehatchersolutions.com>.
On Friday, September 19, 2003, at 07:45  PM, Erik Hatcher wrote:
> On Friday, September 19, 2003, at 11:15  AM, hui wrote:
>> 1. Move the Analyzer down to field level from document level so some 
>> fields
>> could be applied a specail analyzer.Other fields still use the default
>> analyzer from the document level.
>> For example, I do not need to index the number for the "content" 
>> field. It
>> helps me reduce the index size a lot when I have some excel files. 
>> But I
>> always need the "created_date" to be indexed though it is a number 
>> field.
>>
>> I know there are some workarounds put in the group, but I think it 
>> should be
>> a good feature to have.
>
> The "workaround" is to write a custom analyzer and and have it do the 
> desired thing per-field.
>
> Hmmm.... just thinking out loud here without knowing if this is 
> possible, but could a generic "wrapper" Analyzer be written that 
> allows other analyzers to be used under the covers based on a field 
> name/analyzer mapping?   If so, that would be quite cool and save 
> folks from having to write custom analyzers as much to handle this 
> pretty typical use-case.  I'll look into this more in the very near 
> future personally, but feel free to have a look at this yourself and 
> see what you can come up with.

What about something like this?

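import java.io.Reader;
import java.util.HashMap;
import java.util.Map;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;

// Delegates to a per-field analyzer, falling back to a default one.
public class PerFieldWrapperAnalyzer extends Analyzer {
   private final Analyzer defaultAnalyzer;
   private final Map analyzerMap = new HashMap();  // field name -> Analyzer

   public PerFieldWrapperAnalyzer(Analyzer defaultAnalyzer) {
     this.defaultAnalyzer = defaultAnalyzer;
   }

   // Registers the analyzer to use for the given field.
   public void addAnalyzer(String fieldName, Analyzer analyzer) {
     analyzerMap.put(fieldName, analyzer);
   }

   public TokenStream tokenStream(String fieldName, Reader reader) {
     Analyzer analyzer = (Analyzer) analyzerMap.get(fieldName);
     if (analyzer == null) {
       analyzer = defaultAnalyzer;  // no special analyzer for this field
     }

     return analyzer.tokenStream(fieldName, reader);
   }
}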

This would allow you to construct a single analyzer out of others, on a 
per-field basis, including a default one for any fields that do not 
have a special one.  Whether the constructor should take the map or the 
addAnalyzer method is implemented is debatable, but I prefer the 
addAnalyzer way.  Maybe addAnalyzer could return 'this' so you could 
chain: new PerFieldWrapperAnalyzer(new 
StandardAnalyzer()).addAnalyzer("field1", new 
WhitespaceAnalyzer()).addAnalyzer(.....).  And I'm more inclined to 
call this thing PerFieldAnalyzerWrapper instead.  Any naming 
suggestions?
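
To make the chaining idea concrete, here is a rough, self-contained sketch 
of the variant where addAnalyzer returns 'this'.  It is not Lucene code: 
SimpleAnalyzer, ChainablePerFieldAnalyzer and PerFieldSketch are made-up 
names, and SimpleAnalyzer is only a stand-in for Analyzer (text in, tokens 
out) so the example runs without Lucene on the classpath.  It also assumes 
a Java version with lambdas.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for Lucene's Analyzer: turns field text into tokens.
interface SimpleAnalyzer {
    List<String> tokenize(String fieldName, String text);
}

// Per-field wrapper whose addAnalyzer returns 'this' so calls can be chained.
class ChainablePerFieldAnalyzer implements SimpleAnalyzer {
    private final SimpleAnalyzer defaultAnalyzer;
    private final Map<String, SimpleAnalyzer> analyzerMap = new HashMap<>();

    ChainablePerFieldAnalyzer(SimpleAnalyzer defaultAnalyzer) {
        this.defaultAnalyzer = defaultAnalyzer;
    }

    ChainablePerFieldAnalyzer addAnalyzer(String fieldName, SimpleAnalyzer analyzer) {
        analyzerMap.put(fieldName, analyzer);
        return this;  // enables new ...(default).addAnalyzer(...).addAnalyzer(...)
    }

    @Override
    public List<String> tokenize(String fieldName, String text) {
        // Use the field's own analyzer if one was registered, else the default.
        SimpleAnalyzer analyzer = analyzerMap.getOrDefault(fieldName, defaultAnalyzer);
        return analyzer.tokenize(fieldName, text);
    }
}

// Illustrative helpers only; real analyzers would come from Lucene.
class PerFieldSketch {
    // Splits on whitespace, optionally lowercasing first.
    static SimpleAnalyzer whitespace(boolean lowerCase) {
        return (field, text) -> {
            String s = lowerCase ? text.toLowerCase() : text;
            return Arrays.asList(s.trim().split("\\s+"));
        };
    }

    // Lowercasing by default, case-preserving for the "created_date" field.
    static ChainablePerFieldAnalyzer example() {
        return new ChainablePerFieldAnalyzer(whitespace(true))
                .addAnalyzer("created_date", whitespace(false));
    }
}
```

With this shape the whole configuration fits in one expression, at the 
cost of addAnalyzer no longer being a plain void setter.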

This simple little class would seem to answer a very commonly asked 
question.

Thoughts?  Should this be made part of the core?

	Erik


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org


Re: some requests

Posted by Erik Hatcher <er...@ehatchersolutions.com>.
On Friday, September 19, 2003, at 11:15  AM, hui wrote:
> 1. Move the Analyzer down to field level from document level so some 
> fields
> could be applied a special analyzer. Other fields still use the default
> analyzer from the document level.
> For example, I do not need to index the number for the "content" 
> field. It
> helps me reduce the index size a lot when I have some excel files. But 
> I
> always need the "created_date" to be indexed though it is a number 
> field.
>
> I know there are some workarounds put in the group, but I think it 
> should be
> a good feature to have.

The "workaround" is to write a custom analyzer and have it do the 
desired thing per-field.

Hmmm.... just thinking out loud here without knowing if this is 
possible, but could a generic "wrapper" Analyzer be written that allows 
other analyzers to be used under the covers based on a field 
name/analyzer mapping?   If so, that would be quite cool and save folks 
from having to write custom analyzers as much to handle this pretty 
typical use-case.  I'll look into this more in the very near future 
personally, but feel free to have a look at this yourself and see what 
you can come up with.

	Erik



