
Posted to dev@lucene.apache.org by Peter Mularien <pm...@deploy.com> on 2002/11/12 21:17:19 UTC

getAllFieldNames diffs

Attached. Please comment or critique. I've used the java.util.Collection 
classes here extensively, which may be an issue since other parts of the 
API return Enumeration. Collections have been preferred since JDK 1.2 
for a number of reasons I won't go into here. Please let me know if 
these are OK to use in Lucene.

I have added an [ambitiously named (: ] TestIndexReader to the unit
tests; it contains a couple of simple unit tests for this functionality.
Diffs are also attached.

Thanks
Peter

RE: language identifier, stemmers and analyzers

Posted by Alex Murzaku <li...@lissus.com>.

Putting everything in one index should be fine as long as you know the
analyzer that created each term. This means that you need to store the
language ID of each document indexed, which effectively means building
virtual separate indices for each language.

Once you index using different analyzers/stemmers, you also need to
establish your search strategy. The same analyzers should be applied to
the search process as well. The problem with automatic analyzer
selection is that queries are usually short, so the language guesser
will be less effective on them. You might use the language field and
manual language selection for this.
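
To make the idea of "virtual separate indices" concrete, here is a minimal sketch in modern Java (generics and lambdas postdate this thread). Everything in it is an assumption for illustration: the class name, the language codes, and the toy suffix rules that stand in for real Snowball stemmers. None of it is Lucene API. The point is only that the same language ID must pick the same stemmer at index time and at query time.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.function.UnaryOperator;

class LanguageRouting {
    // Hypothetical per-language stemmers; real code would plug in Snowball stemmers here.
    private static final Map<String, UnaryOperator<String>> STEMMERS = new HashMap<>();
    static {
        STEMMERS.put("en", w -> w.endsWith("s") ? w.substring(0, w.length() - 1) : w);
        STEMMERS.put("de", w -> w.endsWith("en") ? w.substring(0, w.length() - 2) : w);
    }

    /** Lowercases the term and applies the stemmer registered for the language, if any. */
    static String analyze(String languageId, String term) {
        UnaryOperator<String> stemmer =
                STEMMERS.getOrDefault(languageId, UnaryOperator.identity());
        return stemmer.apply(term.toLowerCase(Locale.ROOT));
    }

    /** Prefixes the stemmed term with its language ID so each language keeps its own term space. */
    static String indexTerm(String languageId, String term) {
        return languageId + ":" + analyze(languageId, term);
    }
}
```

A query for "Cats" with the language manually set to "en" would then be looked up as `LanguageRouting.indexTerm("en", "Cats")`, which is the same key the indexer produced for English documents.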

-- 
Alex Murzaku
___________________________________________
 alex(at)lissus.com  http://www.lissus.com            

-----Original Message-----
From: maurits van wijland [mailto:m.vanwijland@quicknet.nl] 
Sent: Saturday, November 16, 2002 8:22 AM
To: Lucene Developers List
Cc: Brad Wellington
Subject: Re: language identifier, stemmers and analyzers


Otis,

Thanks for the reply.

>
> 1. Ideally, yes, if you ask me.  You get email in at least 2 languages
> - wouldn't it make sense to have it all indexed in a single email 
> index?

>
> 2. I think it would be nice to have an Analyzer that can pick the 
> correct Analyzer based on the language, but since language identifier 
> can also be retrieved from Brad's code directly, one will always be 
> able to opt for using custom logic in their application instead of 
> using your language-aware Analyzer. So my opinion is that a 
> specialized Analyzer that can pick the right Analyzer implementation 
> based on the language of the input would be good, as it does not 
> prevent developers from using Brad's code directly.

That makes sense. I first thought that the analyzer would be a problem,
because the QueryParser should use the same analyzer! But I guess that
this special analyzer would instantiate a language-specific analyzer to
stem the words accordingly.

And yes, of course, Brad's code can be used directly. Brad has made a
terrific language identifier that is suitable for uses beyond Lucene.
It works like a charm and works with international character standards.

I will put together a package with an analyzer and a language model (I
will include the language source files so anybody can rebuild the model).
Give me a couple of days, because I am currently swamped with work, but
I will soon post the result to the list.

>
> Is this something that can be included in Lucene core/sandbox?
>
This is for the core/sandbox, yes.

regards,

Maurits.

> Otis
>
>
> --- maurits van wijland <m....@quicknet.nl> wrote:
> > Dear all,
> >
> > Brad Wellington has created a language identifier which can be used
> > in combination with the Snowball stemmers donated to Lucene by Alex
> > Murzaku. I have now built a solid language model for use with
> > the language identifier for the languages: Danish, Dutch, English,
> > Finnish, French, German, Italian, Norwegian, Portuguese, Spanish and
> > Swedish.
> >
> > The language identifier is based on a Naive Bayes classifier. Now, 
> > this is all nice, but I have some integration questions, and I hope 
> > you can help
> > out.
> >
> > Basically, the process of indexing is:
> > Create an analyzer
> > Open an IndexWriter
> > Pass it the analyzer
> > Process a document
> > Add document to Index
> > Optimize writer
> > Close writer
> >
> > Now, the language identifier can help automatically identify what
> > language a document is written in. Based on the suggestion of the
> > identifier, an appropriate analyzer can be selected.
> >
> > This is all great, but...
> >
> > 1. Do we index all the terms from various documents in various 
> > languages into 1 index?
> > 2. Do I build a specialised Analyzer that selects the stemmer based
> > on the
> > Language Identifier or leave that up to the custom indexing
> > application?
> >
> > Your thoughts please...
> >
> > regards,
> >
> > Maurits
>
>
>
> --
> To unsubscribe, e-mail:   <ma...@jakarta.apache.org>
> For additional commands, e-mail: <ma...@jakarta.apache.org>
>



Re: language identifier, stemmers and analyzers

Posted by maurits van wijland <m....@quicknet.nl>.

Otis,

Thanks for the reply.

>
> 1. Ideally, yes, if you ask me.  You get email in at least 2 languages
> - wouldn't it make sense to have it all indexed in a single email
> index?

>
> 2. I think it would be nice to have an Analyzer that can pick the
> correct Analyzer based on the language, but since language identifier
> can also be retrieved from Brad's code directly, one will always be
> able to opt for using custom logic in their application instead of
> using your language-aware Analyzer.
> So my opinion is that a specialized Analyzer that can pick the right
> Analyzer implementation based on the language of the input would be
> good, as it does not prevent developers from using Brad's code
> directly.

That makes sense. I first thought that the analyzer would be a problem,
because the QueryParser should use the same analyzer! But I guess that
this special analyzer would instantiate a language-specific analyzer to
stem the words accordingly.

And yes, of course, Brad's code can be used directly. Brad has made a
terrific language identifier that is suitable for uses beyond Lucene.
It works like a charm and works with international character standards.

I will put together a package with an analyzer and a language model (I
will include the language source files so anybody can rebuild the model).
Give me a couple of days, because I am currently swamped with work, but
I will soon post the result to the list.

>
> Is this something that can be included in Lucene core/sandbox?
>
This is for the core/sandbox, yes.

regards,

Maurits.

> Otis
>
>
> --- maurits van wijland <m....@quicknet.nl> wrote:
> > Dear all,
> >
> > Brad Wellington has created a language identifier which can be used
> > in combination with the Snowball stemmers donated to Lucene by Alex
> > Murzaku. I have now built a solid language model for use with
> > the language identifier for the languages: Danish, Dutch, English,
> > Finnish, French, German, Italian, Norwegian, Portuguese, Spanish and
> > Swedish.
> >
> > The language identifier is based on a Naive Bayes classifier. Now,
> > this is
> > all nice, but I have some integration questions, and I hope you can
> > help
> > out.
> >
> > Basically, the process of indexing is:
> > Create an analyzer
> > Open an IndexWriter
> > Pass it the analyzer
> > Process a document
> > Add document to Index
> > Optimize writer
> > Close writer
> >
> > Now, the language identifier can help automatically identify what
> > language a document is written in. Based on the suggestion of the
> > identifier, an appropriate analyzer can be selected.
> >
> > This is all great, but...
> >
> > 1. Do we index all the terms from various documents in various
> > languages
> > into 1 index?
> > 2. Do I build a specialised Analyzer that selects the stemmer based
> > on the
> > Language Identifier or leave that up to the custom indexing
> > application?
> >
> > Your thoughts please...
> >
> > regards,
> >
> > Maurits
>
>
>

Re: language identifier, stemmers and analyzers

Posted by Otis Gospodnetic <ot...@yahoo.com>.

Maurits,

1. Ideally, yes, if you ask me.  You get email in at least 2 languages
- wouldn't it make sense to have it all indexed in a single email
index?

2. I think it would be nice to have an Analyzer that can pick the
correct Analyzer based on the language, but since language identifier
can also be retrieved from Brad's code directly, one will always be
able to opt for using custom logic in their application instead of
using your language-aware Analyzer.
So my opinion is that a specialized Analyzer that can pick the right
Analyzer implementation based on the language of the input would be
good, as it does not prevent developers from using Brad's code
directly.
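
A language-aware analyzer of this kind could be sketched as a thin delegating wrapper. The sketch below is in modern Java and does not use Lucene's real Analyzer class: `TextAnalyzer`, `LanguageAwareAnalyzer`, and `suffixStripper` are hypothetical names, the identifier is passed in as a plain function, and the suffix stripping is a toy stand-in for a real stemmer. Because the identifier is injected, an application remains free to call it directly and skip the wrapper.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical, simplified stand-in for an analyzer: text in, terms out. */
interface TextAnalyzer {
    List<String> analyze(String text);
}

/** Picks a per-language delegate based on what the injected identifier reports. */
class LanguageAwareAnalyzer implements TextAnalyzer {
    private final Function<String, String> identifier; // text -> language code
    private final Map<String, TextAnalyzer> delegates = new HashMap<>();
    private final TextAnalyzer fallback;

    LanguageAwareAnalyzer(Function<String, String> identifier, TextAnalyzer fallback) {
        this.identifier = identifier;
        this.fallback = fallback;
    }

    void register(String language, TextAnalyzer analyzer) {
        delegates.put(language, analyzer);
    }

    @Override
    public List<String> analyze(String text) {
        String language = identifier.apply(text);
        return delegates.getOrDefault(language, fallback).analyze(text);
    }

    /** Toy analyzer: lowercases, splits on whitespace, strips one suffix. Illustrative only. */
    static TextAnalyzer suffixStripper(String suffix) {
        return text -> {
            List<String> terms = new ArrayList<>();
            for (String token : text.toLowerCase(Locale.ROOT).trim().split("\\s+")) {
                if (token.isEmpty()) {
                    continue;
                }
                boolean strip = token.endsWith(suffix) && token.length() > suffix.length();
                terms.add(strip ? token.substring(0, token.length() - suffix.length()) : token);
            }
            return terms;
        };
    }
}
```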

Is this something that can be included in Lucene core/sandbox?

Otis

--- maurits van wijland <m....@quicknet.nl> wrote:
> Dear all,
> 
> Brad Wellington has created a language identifier which can be used
> in combination with the Snowball stemmers donated to Lucene by Alex
> Murzaku. I have now built a solid language model for use with
> the language identifier for the languages: Danish, Dutch, English,
> Finnish, French, German, Italian, Norwegian, Portuguese, Spanish and
> Swedish.
> 
> The language identifier is based on a Naive Bayes classifier. Now,
> this is
> all nice, but I have some integration questions, and I hope you can
> help
> out.
> 
> Basically, the process of indexing is:
> Create an analyzer
> Open an IndexWriter
> Pass it the analyzer
> Process a document
> Add document to Index
> Optimize writer
> Close writer
>
> Now, the language identifier can help automatically identify what
> language a document is written in. Based on the suggestion of the
> identifier, an appropriate analyzer can be selected.
>
> This is all great, but...
> 
> 1. Do we index all the terms from various documents in various
> languages
> into 1 index?
> 2. Do I build a specialised Analyzer that selects the stemmer based
> on the
> Language Identifier or leave that up to the custom indexing
> application?
> 
> Your thoughts please...
> 
> regards,
> 
> Maurits


language identifier, stemmers and analyzers

Posted by maurits van wijland <m....@quicknet.nl>.

Dear all,

Brad Wellington has created a language identifier which can be used in
combination with the Snowball stemmers donated to Lucene by Alex Murzaku.
I have now built a solid language model for use with the language
identifier for the languages: Danish, Dutch, English, Finnish, French,
German, Italian, Norwegian, Portuguese, Spanish and Swedish.

The language identifier is based on a Naive Bayes classifier. Now, this is
all nice, but I have some integration questions, and I hope you can help
out.
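
For readers unfamiliar with the approach, the following is a minimal sketch of a Naive Bayes language guesser in modern Java. It is not Brad's code: the class name, the use of character bigrams as features, the uniform prior over languages, and the add-one smoothing are all assumptions made for illustration. A real language model would be trained on far more text per language.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

class NaiveBayesLanguageGuesser {
    // language -> (character bigram -> count seen in training text)
    private final Map<String, Map<String, Integer>> counts = new HashMap<>();
    // language -> total number of bigrams seen in training text
    private final Map<String, Integer> totals = new HashMap<>();

    /** Adds the character bigrams of the text to the model for the given language. */
    void train(String language, String text) {
        Map<String, Integer> model = counts.computeIfAbsent(language, k -> new HashMap<>());
        String s = text.toLowerCase(Locale.ROOT);
        for (int i = 0; i + 2 <= s.length(); i++) {
            model.merge(s.substring(i, i + 2), 1, Integer::sum);
            totals.merge(language, 1, Integer::sum);
        }
    }

    /** Returns the language with the highest log-likelihood, or null if nothing was trained. */
    String guess(String text) {
        String s = text.toLowerCase(Locale.ROOT);
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, Map<String, Integer>> entry : counts.entrySet()) {
            Map<String, Integer> model = entry.getValue();
            int total = totals.getOrDefault(entry.getKey(), 0);
            double score = 0.0; // uniform prior, so only the likelihood terms matter
            for (int i = 0; i + 2 <= s.length(); i++) {
                int c = model.getOrDefault(s.substring(i, i + 2), 0);
                // add-one smoothing; the vocabulary size is approximated by model.size() + 1
                score += Math.log((c + 1.0) / (total + model.size() + 1.0));
            }
            if (score > bestScore) {
                bestScore = score;
                best = entry.getKey();
            }
        }
        return best;
    }
}
```

Summing log-probabilities rather than multiplying probabilities avoids numeric underflow on longer documents.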

Basically, the process of indexing is:
Create an analyzer
Open an IndexWriter
Pass it the analyzer
Process a document
Add document to Index
Optimize writer
Close writer

Now, the language identifier can help automatically identify what language a
document is written in. Based on the suggestion of the identifier, an
appropriate analyzer can be selected.

This is all great, but...

1. Do we index all the terms from various documents in various languages
into 1 index?
2. Do I build a specialised Analyzer that selects the stemmer based on the
Language Identifier or leave that up to the custom indexing application?

Your thoughts please...

regards,

Maurits





Re: getAllFieldNames diffs

Posted by Peter Mularien <pm...@deploy.com>.

Actually, I started thinking about this as I was going to make the
changes, and I don't believe it is appropriate to return a Set after all.
I think that the postcondition we're asserting about the contents of the
returned data (uniqueness) needn't be enforced by the return type itself,
because at some point we might choose to drop or modify that constraint.
At that point, we'd need to change the API and introduce an
incompatibility.

Since we're using a Set internally, as you mentioned, the uniqueness
constraint is still enforced (internally), but externally, users need to
take it on faith that the method behaves according to its documented
assertions. I believe this is consistent with most other APIs that behave
in a similar fashion with regard to postcondition assertions.
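
The trade-off can be shown with a small sketch in modern Java (the patch itself uses raw pre-generics types). `FieldNameSource` is a hypothetical class, not part of Lucene, and wrapping the result as unmodifiable is an extra choice of this sketch rather than something the patch does. The method is declared to return Collection, the uniqueness guarantee lives in the javadoc, and a Set does the work internally.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class FieldNameSource {
    private final List<String> rawNames;

    FieldNameSource(String... names) {
        this.rawNames = Arrays.asList(names);
    }

    /**
     * Returns the unique field names. Uniqueness is a documented postcondition,
     * not something the declared return type promises.
     */
    Collection<String> getFieldNames() {
        Set<String> unique = new HashSet<>(rawNames); // duplicates collapse here
        return Collections.unmodifiableSet(unique);
    }
}
```

A caller that needs Set semantics can still copy the result with `new HashSet<>(source.getFieldNames())`, while the declared type leaves room to relax the constraint later without breaking the signature.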

So --- I think I'd propose that the diffs I made be applied "as-is", 
given that the other thread (Collections vs. Iterator vs. Enumeration) 
also seems to have wrapped up in favor of Collection being returned in 
"new" APIs.

Anyone disagree?

Thanks
Peter

Darren Hobbs wrote:

>Given that the concrete implementation returns a Set and the javadoc states
>that the field names will be unique, would it be better to declare the method
>to return a Set rather than a Collection?  That would seem to better capture
>the intention.  Apologies if this seems nit-picky - it means I can't find
>anything worse wrong with it!
>
>Regards,
>
>-Darren

Re: getAllFieldNames diffs

Posted by Peter Mularien <pm...@deploy.com>.

Great comment, I will do it (assuming I don't hear any conflicting 
opinions from the list! (:  )

Peter

>Given that the concrete implementation returns a Set and the javadoc states
>that the field names will be unique, would it be better to declare the method
>to return a Set rather than a Collection?  That would seem to better capture
>the intention.  Apologies if this seems nit-picky - it means I can't find
>anything worse wrong with it!
>
>Regards,
>
>-Darren

Re: getAllFieldNames diffs

Posted by Darren Hobbs <da...@farfetched.org>.

On Tue, Nov 12, 2002 at 03:17:19PM -0500, Peter Mularien wrote:
> Attached. Please comment or critique. I've used the java.util.Collection 
> classes here extensively, which may be an issue since other parts of the 
> API return Enumeration. Collections have been preferred since JDK 1.2 
> for a number of reasons I won't go into here. Please let me know if 
> these are OK to use in Lucene.
> 
> +     /**
> +      * Return a list of all unique field names which exist in the index pointed to by
> +      * this IndexReader.
> +      * @return Collection of Strings indicating the names of the fields
> +      * @throws IOException if there is a problem with accessing the index
> +      */
> +     public abstract Collection getFieldNames() throws IOException;
>   
>     /**
> +     public Collection getFieldNames() throws IOException {
> +         // maintain a unique set of field names
> +         Set fieldSet = new HashSet();
> +         for (int i = 0; i < fieldInfos.size(); i++) {
> +             FieldInfo fi = fieldInfos.fieldInfo(i);
> +             fieldSet.add(fi.name);
> +         }
> +         return fieldSet;
> +     }
>   }

Given that the concrete implementation returns a Set and the javadoc states
that the field names will be unique, would it be better to declare the method
to return a Set rather than a Collection?  That would seem to better capture
the intention.  Apologies if this seems nit-picky - it means I can't find
anything worse wrong with it!

Regards,

-Darren



Re: getAllFieldNames diffs

Posted by Peter Mularien <pm...@deploy.com>.

Your reasoning is logical, and it fits in nicely with many of the existing
API calls returning Enumeration.

Thanks
Peter
----- Original Message -----
From: "Clemens Marschner" <cm...@lanlab.de>
To: "Lucene Developers List" <lu...@jakarta.apache.org>
Sent: Tuesday, November 12, 2002 6:25 PM
Subject: Re: getAllFieldNames diffs


> Instead of returning Object[] or Collection I would consider returning an
> iterator. Iterators may be designed data-driven, that is, temporary objects
> are only created when next() is called and not at the time the method is
> called. There are powerful frameworks like the XXL library that extensively
> use iterators to implement cursors efficiently.
>
> Finally, Iterators are supposed to be the standard mechanism for returning
> collections in Java, aren't they?
>
> Clemens
>
> ----- Original Message -----
> From: "Peter Mularien" <pm...@deploy.com>
> To: "Lucene Developers List" <lu...@jakarta.apache.org>
> Sent: Tuesday, November 12, 2002 9:33 PM
> Subject: Re: getAllFieldNames diffs
>
>
> > Personal preference. I don't tend to like returning arrays from methods
> > unless I have to, primarily because then the caller has to check for a
> > null value being returned. When returning Collection (or other Set-type
> > object) it is easy to always return a value, even if the object is
> > empty. For this particular method, it seemed useful also to be able to
> > test for membership, do sorting, etc., although I suppose
> > java.util.Arrays can do most of that anyway.
> >
> > Peter
> >
> > Otis Gospodnetic wrote:
> >
> > >Nice :)
> > >I looked at the code first, and was about to ask - why not just return
> > >String[]?  What is the advantage of Collection in this case?
> > >
> > >Thanks,
> > >Otis

Re: getAllFieldNames diffs

Posted by Scott Ganyo <sc...@etapestry.com>.

Follow up to my other comment about not passing Iterators.  Here is the 
"official" word from the designer of the Java Collections API:

http://java.sun.com/products/jdk/1.2/docs/guide/collections/designfaq.html#8

Scott

Clemens Marschner wrote:

> Instead of returning Object[] or Collection I would consider returning an
> iterator. Iterators may be designed data-driven, that is, temporary objects
> are only created when next() is called and not at the time the method is
> called. There are powerful frameworks like the XXL library that extensively
> use iterators to implement cursors efficiently.
>
> Finally, Iterators are supposed to be the standard mechanism for returning
> collections in Java, aren't they?
>
> Clemens
>
> ----- Original Message -----
> From: "Peter Mularien"
> To: "Lucene Developers List"
> Sent: Tuesday, November 12, 2002 9:33 PM
> Subject: Re: getAllFieldNames diffs
>
>
>
> >Personal preference. I don't tend to like returning arrays from methods
> >unless I have to, primarily because then the caller has to check for a
> >null value being returned. When returning Collection (or other Set-type
> >object) it is easy to always return a value, even if the object is
> >empty. For this particular method, it seemed useful also to be able to
> >test for membership, do sorting, etc., although I suppose
> >java.util.Arrays can do most of that anyway.
> >
> >Peter
> >
> >Otis Gospodnetic wrote:
> >
> >
> >>Nice :)
> >>I looked at the code first, and was about to ask - why not just return
> >>String[]?  What is the advantage of Collection in this case?
> >>
> >>Thanks,
> >>Otis


-- 
Brain: Pinky, are you pondering what I’m pondering?
Pinky: I think so, Brain, but calling it a pu-pu platter? Huh, what were 
they thinking?



Re: getAllFieldNames diffs

Posted by Clemens Marschner <cm...@lanlab.de>.

Instead of returning Object[] or Collection I would consider returning an
iterator. Iterators may be designed data-driven, that is, temporary objects
are only created when next() is called and not at the time the method is
called. There are powerful frameworks like the XXL library that extensively
use iterators to implement cursors efficiently.

Finally, Iterators are supposed to be the standard mechanism for returning
collections in Java, aren't they?
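
As a minimal sketch of what "data-driven" means here, the iterator below builds each element only when next() is called. It is written in modern Java; `LazyFieldNames` and its generated "fieldN" names are hypothetical and exist only to make the laziness observable through the `created` counter.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;

class LazyFieldNames implements Iterator<String> {
    private final int count;
    private int next = 0;
    int created = 0; // how many strings have actually been built so far

    LazyFieldNames(int count) {
        this.count = count;
    }

    @Override
    public boolean hasNext() {
        return next < count;
    }

    @Override
    public String next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        created++;               // the element is materialized here, not at construction
        return "field" + next++;
    }
}
```

Constructing `new LazyFieldNames(1000)` allocates no strings at all; a caller that stops after the first element pays for exactly one.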

Clemens

----- Original Message -----
From: "Peter Mularien" <pm...@deploy.com>
To: "Lucene Developers List" <lu...@jakarta.apache.org>
Sent: Tuesday, November 12, 2002 9:33 PM
Subject: Re: getAllFieldNames diffs


> Personal preference. I don't tend to like returning arrays from methods
> unless I have to, primarily because then the caller has to check for a
> null value being returned. When returning Collection (or other Set-type
> object) it is easy to always return a value, even if the object is
> empty. For this particular method, it seemed useful also to be able to
> test for membership, do sorting, etc., although I suppose
> java.util.Arrays can do most of that anyway.
>
> Peter
>
> Otis Gospodnetic wrote:
>
> >Nice :)
> >I looked at the code first, and was about to ask - why not just return
> >String[]?  What is the advantage of Collection in this case?
> >
> >Thanks,
> >Otis

Re: getAllFieldNames diffs

Posted by Peter Mularien <pm...@deploy.com>.

Personal preference. I don't tend to like returning arrays from methods 
unless I have to, primarily because then the caller has to check for a 
null value being returned. When returning Collection (or other Set-type 
object) it is easy to always return a value, even if the object is 
empty. For this particular method, it seemed useful also to be able to 
test for membership, do sorting, etc., although I suppose
java.util.Arrays can do most of that anyway.
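
A short sketch of that reasoning, in modern Java with hypothetical names (`FieldNameQueries`, `fieldNames`, `sorted`): the method never returns null, so callers can test membership or sort without a guard.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;

class FieldNameQueries {
    /** Never returns null: missing input yields an empty collection instead. */
    static Collection<String> fieldNames(String[] stored) {
        if (stored == null) {
            return Collections.emptySet();
        }
        return new HashSet<>(Arrays.asList(stored));
    }

    /** Returns a sorted copy, leaving the original collection untouched. */
    static List<String> sorted(Collection<String> names) {
        List<String> copy = new ArrayList<>(names);
        Collections.sort(copy);
        return copy;
    }
}
```

With this shape, `FieldNameQueries.fieldNames(null).contains("title")` is simply false rather than a NullPointerException.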

Peter

Otis Gospodnetic wrote:

>Nice :)
>I looked at the code first, and was about to ask - why not just return
>String[]?  What is the advantage of Collection in this case?
>
>Thanks,
>Otis

Re: getAllFieldNames diffs

Posted by Otis Gospodnetic <ot...@yahoo.com>.

Nice :)
I looked at the code first, and was about to ask - why not just return
String[]?  What is the advantage of Collection in this case?

Thanks,
Otis

--- Peter Mularien <pm...@deploy.com> wrote:
> Attached. Please comment or critique. I've used the java.util.Collection
> classes here extensively, which may be an issue since other parts of the
> API return Enumeration. Collections have been preferred since JDK 1.2
> for a number of reasons I won't go into here. Please let me know if
> these are OK to use in Lucene.
>
> I have added an [ambitiously named (: ] TestIndexReader to the unit
> tests; it contains a couple of simple unit tests for this functionality.
> Diffs are also attached.
> 
> Thanks
> Peter
> > ? src/test/org/apache/lucene/index/TestIndexReader.java
> Index: src/java/org/apache/lucene/index/IndexReader.java
> ===================================================================
> RCS file: /home/cvspublic/jakarta-lucene/src/java/org/apache/lucene/index/IndexReader.java,v
> retrieving revision 1.11
> diff -u -c -r1.11 IndexReader.java
> *** src/java/org/apache/lucene/index/IndexReader.java	7 Nov 2002 05:55:39 -0000	1.11
> --- src/java/org/apache/lucene/index/IndexReader.java	12 Nov 2002 20:11:20 -0000
> ***************
> *** 56,61 ****
> --- 56,63 ----
>   
>   import java.io.IOException;
>   import java.io.File;
> + import java.util.Collection;
> + 
>   import org.apache.lucene.store.Directory;
>   import org.apache.lucene.store.FSDirectory;
>   import org.apache.lucene.store.Lock;
> ***************
> *** 301,306 ****
> --- 303,316 ----
>         writeLock = null;
>       }
>     }
> + 
> +     /**
> +      * Return a list of all unique field names which exist in the index pointed to by
> +      * this IndexReader.
> +      * @return Collection of Strings indicating the names of the fields
> +      * @throws IOException if there is a problem with accessing the index
> +      */
> +     public abstract Collection getFieldNames() throws IOException;
>   
>     /**
>      * Returns <code>true</code> iff the index in the named directory is
> Index: src/java/org/apache/lucene/index/SegmentReader.java
> ===================================================================
> RCS file: /home/cvspublic/jakarta-lucene/src/java/org/apache/lucene/index/SegmentReader.java,v
> retrieving revision 1.6
> diff -u -c -r1.6 SegmentReader.java
> *** src/java/org/apache/lucene/index/SegmentReader.java	7 Nov 2002 05:55:39 -0000	1.6
> --- src/java/org/apache/lucene/index/SegmentReader.java	12 Nov 2002 20:11:20 -0000
> ***************
> *** 55,69 ****
>    */
>   
>   import java.io.IOException;
> ! import java.util.Hashtable;
>   import java.util.Enumeration;
>   import java.util.Vector;
>   
> - import org.apache.lucene.util.BitVector;
> - import org.apache.lucene.store.Directory;
> - import org.apache.lucene.store.Lock;
> - import org.apache.lucene.store.InputStream;
>   import org.apache.lucene.document.Document;
>   
>   final class SegmentReader extends IndexReader {
>     private boolean closeDirectory = false;
> --- 55,71 ----
>    */
>   
>   import java.io.IOException;
> ! import java.util.Collection;
>   import java.util.Enumeration;
> + import java.util.HashSet;
> + import java.util.Hashtable;
> + import java.util.Set;
>   import java.util.Vector;
>   
>   import org.apache.lucene.document.Document;
> + import org.apache.lucene.store.InputStream;
> + import org.apache.lucene.store.Lock;
> + import org.apache.lucene.util.BitVector;
>   
>   final class SegmentReader extends IndexReader {
>     private boolean closeDirectory = false;
> ***************
> *** 73,79 ****
>     private FieldsReader fieldsReader;
>   
>     TermInfosReader tis;
> !   
>     BitVector deletedDocs = null;
>     private boolean deletedDocsDirty = false;
>   
> --- 75,81 ----
>     private FieldsReader fieldsReader;
>   
>     TermInfosReader tis;
> ! 
>     BitVector deletedDocs = null;
>     private boolean deletedDocsDirty = false;
>   
> ***************
> *** 113,119 ****
>       proxStream = directory.openFile(segment + ".prx");
>       openNorms();
>     }
> !   
>     final synchronized void doClose() throws IOException {
>       if (deletedDocsDirty) {
>         synchronized (directory) {		  // in- & inter-process sync
> --- 115,121 ----
>       proxStream = directory.openFile(segment + ".prx");
>       openNorms();
>     }
> ! 
>     final synchronized void doClose() throws IOException {
>       if (deletedDocsDirty) {
>         synchronized (directory) {		  // in- & inter-process sync
> ***************
> *** 256,262 ****
>     private final void openNorms() throws IOException {
>       for (int i = 0; i < fieldInfos.size(); i++) {
>         FieldInfo fi = fieldInfos.fieldInfo(i);
> !       if (fi.isIndexed) 
>   	norms.put(fi.name,
>   		  new Norm(directory.openFile(segment + ".f" + fi.number)));
>       }
> --- 258,264 ----
>     private final void openNorms() throws IOException {
>       for (int i = 0; i < fieldInfos.size(); i++) {
>         FieldInfo fi = fieldInfos.fieldInfo(i);
> !       if (fi.isIndexed)
>   	norms.put(fi.name,
>   		  new Norm(directory.openFile(segment + ".f" + fi.number)));
>       }
> ***************
> *** 271,274 ****
> --- 273,287 ----
>         }
>       }
>     }
> + 
> +     // javadoc inherited
> +     public Collection getFieldNames() throws IOException {
> +         // maintain a unique set of field names
> +         Set fieldSet = new HashSet();
> +         for (int i = 0; i < fieldInfos.size(); i++) {
> +             FieldInfo fi = fieldInfos.fieldInfo(i);
> +             fieldSet.add(fi.name);
> +         }
> +         return fieldSet;
> +     }
>   }
> Index: src/java/org/apache/lucene/index/SegmentsReader.java
> ===================================================================
> RCS file: /home/cvspublic/jakarta-lucene/src/java/org/apache/lucene/index/SegmentsReader.java,v
> retrieving revision 1.9
> diff -u -c -r1.9 SegmentsReader.java
> *** src/java/org/apache/lucene/index/SegmentsReader.java	7 Nov 2002 05:55:39 -0000	1.9
> --- src/java/org/apache/lucene/index/SegmentsReader.java	12 Nov 2002 20:11:20 -0000
> ***************
> *** 55,64 ****
>    */
>   
>   import java.io.IOException;
>   import java.util.Hashtable;
>   
> - import org.apache.lucene.store.Directory;
>   import org.apache.lucene.document.Document;
>   
>   /**
>    * FIXME: Describe class <code>SegmentsReader</code> here.
> --- 55,68 ----
>    */
>   
>   import java.io.IOException;
> + import java.util.Collection;
> + import java.util.HashSet;
>   import java.util.Hashtable;
> + import java.util.Iterator;
> + import java.util.Set;
>   
>   import org.apache.lucene.document.Document;
> + import org.apache.lucene.store.Directory;
>   
>   /**
>    * FIXME: Describe class <code>SegmentsReader</code> here.
> ***************
> *** 174,179 ****
> --- 178,199 ----
>       for (int i = 0; i < readers.length; i++)
>         readers[i].close();
>     }
> + 
> +     // javadoc inherited
> +     public Collection getFieldNames() throws IOException {
> +         // maintain a unique set of field names
> +         Set fieldSet = new HashSet();
> +         for (int i = 0; i < readers.length; i++) {
> +             SegmentReader reader = readers[i];
> +             Collection names = reader.getFieldNames();
> +             // iterate through the field names and add them to the set
> +             for (Iterator iterator = names.iterator(); iterator.hasNext();) {
> +                 String s = (String) iterator.next();
> +                 fieldSet.add(s);
> +             }
> +         }
> +         return fieldSet;
> +     }
>   }
>   
>   class SegmentsTermEnum extends TermEnum {
> > package org.apache.lucene.index;
> 
> import junit.framework.TestCase;
> import org.apache.lucene.store.RAMDirectory;
> import org.apache.lucene.analysis.standard.StandardAnalyzer;
> import org.apache.lucene.document.Document;
> import org.apache.lucene.document.Field;
> 
> import java.util.Collection;
> import java.io.IOException;
> 
> /* ====================================================================
>  * The Apache Software License, Version 1.1
>  *
>  * Copyright (c) 2001 The Apache Software Foundation.  All rights
>  * reserved.
>  *
>  * Redistribution and use in source and binary forms, with or without
>  * modification, are permitted provided that the following conditions
>  * are met:
>  *
>  * 1. Redistributions of source code must retain the above copyright
>  *    notice, this list of conditions and the following disclaimer.
>  *
>  * 2. Redistributions in binary form must reproduce the above copyright
>  *    notice, this list of conditions and the following disclaimer in
>  *    the documentation and/or other materials provided with the
>  *    distribution.
>  *
>  * 3. The end-user documentation included with the redistribution,
>  *    if any, must include the following acknowledgment:
>  *       "This product includes software developed by the
>  *        Apache Software Foundation (http://www.apache.org/)."
>  *    Alternately, this acknowledgment may appear in the software itself,
>  *    if and wherever such third-party acknowledgments normally appear.
>  *
>  * 4. The names "Apache" and "Apache Software Foundation" and
>  *    "Apache Lucene" must not be used to endorse or promote products
>  *    derived from this software without prior written permission. For
>  *    written permission, please contact apache@apache.org.
>  *
>  * 5. Products derived from this software may not be called "Apache",
>  *    "Apache Lucene", nor may "Apache" appear in their name, without
>  *    prior written permission of the Apache Software Foundation.
>  *
>  * THIS SOFTWARE IS PROVIDED ``AS IS'' AND ANY EXPRESSED OR IMPLIED
>  * WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES
>  * OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
>  * DISCLAIMED.  IN NO EVENT SHALL THE APACHE SOFTWARE FOUNDATION OR
>  * ITS CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
>  * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
>  * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF
>  * USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
>  * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
>  * OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT
>  * OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
>  * SUCH DAMAGE.
>  * ====================================================================
>  *
>  * This software consists of voluntary contributions made by many
>  * individuals on behalf of the Apache Software Foundation.  For more
>  * information on the Apache Software Foundation, please see
>  * <http://www.apache.org/>.
>  */
> 
> public class TestIndexReader extends TestCase {
>     /**
>      * Test the IndexReader.getFieldNames implementation
>      * @throws Exception on error
>      */
>     public void testGetFieldNames() throws Exception {
>         RAMDirectory d = new RAMDirectory();
>         // set up writer
>         IndexWriter writer = new IndexWriter(d, new StandardAnalyzer(), true);
>         addDocumentWithFields(writer);
>         writer.close();
>         // set up reader
>         IndexReader reader = IndexReader.open(d);
>         Collection fieldNames = reader.getFieldNames();
>         assertTrue(fieldNames.contains("keyword"));
>         assertTrue(fieldNames.contains("text"));
>         assertTrue(fieldNames.contains("unindexed"));
>         assertTrue(fieldNames.contains("unstored"));
>         // release the first reader before the index is modified and reopened
>         reader.close();
>         // add more documents
>         writer = new IndexWriter(d, new StandardAnalyzer(), false);
>         // want to get some more segments here
>         for(int i=0;i<5*writer.mergeFactor;i++) {
>             addDocumentWithFields(writer);
>         }
>         // new fields are in some different segments (we hope)
>         for(int i=0;i<5*writer.mergeFactor;i++) {
>             addDocumentWithDifferentFields(writer);
>         }
>         writer.close();
>         // verify fields again
>         reader = IndexReader.open(d);
>         fieldNames = reader.getFieldNames();
>         assertTrue(fieldNames.contains("keyword"));
>         assertTrue(fieldNames.contains("text"));
>         assertTrue(fieldNames.contains("unindexed"));
>         assertTrue(fieldNames.contains("unstored"));
>         assertTrue(fieldNames.contains("keyword2"));
>         assertTrue(fieldNames.contains("text2"));
>         assertTrue(fieldNames.contains("unindexed2"));
>         assertTrue(fieldNames.contains("unstored2"));
>         reader.close();
>     }
> 
>     private void addDocumentWithFields(IndexWriter writer) throws IOException {
>         Document doc = new Document();
>         doc.add(Field.Keyword("keyword","test1"));
>         doc.add(Field.Text("text","test1"));
>         doc.add(Field.UnIndexed("unindexed","test1"));
>         doc.add(Field.UnStored("unstored","test1"));
>         writer.addDocument(doc);
>     }
> 
>     private void addDocumentWithDifferentFields(IndexWriter writer) throws IOException {
>         Document doc = new Document();
>         doc.add(Field.Keyword("keyword2","test1"));
>         doc.add(Field.Text("text2","test1"));
>         doc.add(Field.UnIndexed("unindexed2","test1"));
>         doc.add(Field.UnStored("unstored2","test1"));
>         writer.addDocument(doc);
>     }
> }
> 
> --
> To unsubscribe, e-mail:   <ma...@jakarta.apache.org>
> For additional commands, e-mail: <ma...@jakarta.apache.org>
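
For reference, the merging step in the quoted SegmentsReader.getFieldNames() patch amounts to a set union over the field names reported by each segment. Below is a minimal standalone sketch of that idea. It assumes a hypothetical SegmentFields interface standing in for Lucene's SegmentReader, and it uses generics and a lambda, which the 2002 patch predates, so it illustrates the approach rather than proposing a replacement for the patch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class FieldNameUnion {

    /** Hypothetical stand-in for a per-segment reader; not a Lucene type. */
    public interface SegmentFields {
        Collection<String> getFieldNames();
    }

    /**
     * Collects the distinct field names across all segments.
     * A Set drops names that occur in more than one segment.
     */
    public static Set<String> unionFieldNames(List<SegmentFields> segments) {
        // LinkedHashSet keeps first-seen order; the patch uses a plain HashSet
        Set<String> fieldSet = new LinkedHashSet<>();
        for (SegmentFields segment : segments) {
            fieldSet.addAll(segment.getFieldNames());
        }
        return fieldSet;
    }

    public static void main(String[] args) {
        List<SegmentFields> segments = new ArrayList<>();
        segments.add(() -> Arrays.asList("keyword", "text"));
        segments.add(() -> Arrays.asList("text", "keyword2"));
        // "text" occurs in both segments but is reported once
        System.out.println(unionFieldNames(segments));
    }
}
```

Set.addAll does the same work as the explicit Iterator loop in the patch; the explicit loop only matters if each name needs a cast or a check before it is added.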


