You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@lucene.apache.org by Daniel Naber <da...@t-online.de> on 2004/08/28 15:53:32 UTC

API cleanup for Field

Hi,

here's a patch to clean up the API as described by Doug:
http://www.mail-archive.com/lucene-user%40jakarta.apache.org/msg08479.html

The Field constructor with three booleans is deprecated because it's too 
easy to mix up the order of those parameters. Also, one variation of 
Field.Text() is deprecated because it behaves different depending on if 
you pass it a String or a StringReader, which is very confusing.

The static Field.UnStored/Keyword etc methods are not deprecated. I think 
these can be confusing (e.g. what exactly does UnIndexed mean -- I always 
have to look it up), but nobody is forced to use them so there's no reason 
to deprecate them.

The only boolean left is the one for term vectors. Should we add another 
enumeration like TermVectorIndex.NO/YES/...? I know that there's a patch 
that adds position information to the term vectors. How does that fit in 
here?

Regards
 Daniel

-- 
http://www.danielnaber.de

Re: API cleanup for Field

Posted by Christoph Goller <go...@detego-software.de>.
Daniel Naber wrote:
> On Wednesday 01 September 2004 21:55, Doug Cutting wrote:
> 
> 
>>That's right.  In particular I think we'll need:
>>
>>  public Field(String, Reader, Index); // Reader is never stored
> 
> 
> Actually you'll also get an exception when you try to index the field 
> UN_TOKENIZED. I didn't check what exactly happens in that case, for now I 
> just left out the "Index" parameter, too (i.e. default to TOKENIZE).

A field with Reader value cannot be stored, as Doug said, since some
Readers (e.g. PipedReader) do not support reset and FieldsWriter would
consume such a Reader in order to store the field. After storing, indexing
would then no longer be possible.

DocumentWriter could probably be changed to allow UN_TOKENIZED fields
with ReaderValue, but that's probably not necesary, since a ReaderValue
is something supposed to be quite long and so not tokenizing it does not
make much sense.

Internally Field still uses the boolean values isStored, isIndexed,
isTokenized, and storeTermVector. And there are public getters for them.
Shouldn't we substitute them totally with the respective enums?

Christoph


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-dev-help@jakarta.apache.org


Re: API cleanup for Field

Posted by Daniel Naber <da...@t-online.de>.
On Wednesday 01 September 2004 21:55, Doug Cutting wrote:

> That's right.  In particular I think we'll need:
>
>   public Field(String, Reader, Index); // Reader is never stored

Actually you'll also get an exception when you try to index the field 
UN_TOKENIZED. I didn't check what exactly happens in that case, for now I 
just left out the "Index" parameter, too (i.e. default to TOKENIZE).

Regards
 Daniel

-- 
http://www.danielnaber.de

---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-dev-help@jakarta.apache.org


Re: API cleanup for Field

Posted by Doug Cutting <cu...@apache.org>.
Daniel Naber wrote:
> Okay, but the Field.Text that takes a Reader doesn't seem to be a pure 
> convenience method, it makes use of the Reader, doesn't it? So to keep 
> that feature we probably need another public constructor just like the 
> current one with parameters name, value, and with two resp. three 
> enumerations, but it needs to take a Reader instead of a String. Is that 
> correct?

That's right.  In particular I think we'll need:

  public Field(String, Reader, Index); // Reader is never stored
  public Field(String, Reader, Index, TermVector);

Doug

---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-dev-help@jakarta.apache.org


Re: API cleanup for Field

Posted by Daniel Naber <da...@t-online.de>.
On Monday 30 August 2004 18:34, Doug Cutting wrote:

> If they're confusing and have a less-confusing alternative then we
> should eventually remove them from the API, so we should deprecate them
> now.  We should move entirely to the new enumeration-based contructors.
>   Everything else should be deprecated.

Okay, but the Field.Text that takes a Reader doesn't seem to be a pure 
convenience method, it makes use of the Reader, doesn't it? So to keep 
that feature we probably need another public constructor just like the 
current one with parameters name, value, and with two resp. three 
enumerations, but it needs to take a Reader instead of a String. Is that 
correct?

Regards
 Daniel

-- 
http://www.danielnaber.de

---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-dev-help@jakarta.apache.org


Re: API cleanup for Field

Posted by Todd VanderVeen <td...@part.net>.
Otis Gospodnetic wrote:

>--- Doug Cutting <cu...@apache.org> wrote:
>
>  
>
>>Otis Gospodnetic wrote:
>>    
>>
>>>I like short names, too, but maybe Field.Vector.NO will make
>>>      
>>>
>>somebody
>>    
>>
>>>think java.util.Vector.  How about Field.TermVector or
>>>      
>>>
>>Field.TVector,
>>    
>>
>>>or Field.TV?
>>>      
>>>
>>Field.TermVector seems redundant to me, and Field.TV too telegraphic.
>> I 
>>think, in this context, the use of Vector is unambiguous.  Also,
>>doesn't 
>>everyone use ArrayList now instead of Vector?
>>    
>>
>
>You'd think so, but I still see people just use Vectors when they don't
>want to bother thinking about synchronization issues. :( Don't we still
>have Vectors in Lucene in a number of places?
>
>  
>
>>What do others think?
>>    
>>
>
>Field.Vector is fine by me, but those less familiar with Lucene may
>first think java.util.Vector, even though that wouldn't work even
>syntactically.
>
>Otis
>
>---------------------------------------------------------------------
>To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
>For additional commands, e-mail: lucene-dev-help@jakarta.apache.org
>
>  
>
I recommend sticking with TermVector. Consider the use of static imports 
in 1.5. Space will be gained by dropping "Field." so what remains should 
be unambiguous.

---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-dev-help@jakarta.apache.org


Re: API cleanup for Field

Posted by Otis Gospodnetic <ot...@yahoo.com>.
--- Doug Cutting <cu...@apache.org> wrote:

> Otis Gospodnetic wrote:
> > I like short names, too, but maybe Field.Vector.NO will make
> somebody
> > think java.util.Vector.  How about Field.TermVector or
> Field.TVector,
> > or Field.TV?
> 
> Field.TermVector seems redundant to me, and Field.TV too telegraphic.
>  I 
> think, in this context, the use of Vector is unambiguous.  Also,
> doesn't 
> everyone use ArrayList now instead of Vector?

You'd think so, but I still see people just use Vectors when they don't
want to bother thinking about synchronization issues. :( Don't we still
have Vectors in Lucene in a number of places?

> What do others think?

Field.Vector is fine by me, but those less familiar with Lucene may
first think java.util.Vector, even though that wouldn't work even
syntactically.

Otis

---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-dev-help@jakarta.apache.org


Re: API cleanup for Field

Posted by Doug Cutting <cu...@apache.org>.
Otis Gospodnetic wrote:
> I like short names, too, but maybe Field.Vector.NO will make somebody
> think java.util.Vector.  How about Field.TermVector or Field.TVector,
> or Field.TV?

Field.TermVector seems redundant to me, and Field.TV too telegraphic.  I 
think, in this context, the use of Vector is unambiguous.  Also, doesn't 
everyone use ArrayList now instead of Vector?

What do others think?

Doug

---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-dev-help@jakarta.apache.org


Re: API cleanup for Field

Posted by Otis Gospodnetic <ot...@yahoo.com>.
I like short names, too, but maybe Field.Vector.NO will make somebody
think java.util.Vector.  How about Field.TermVector or Field.TVector,
or Field.TV?

Otis

--- Doug Cutting <cu...@apache.org> wrote:

> Daniel Naber wrote:
> > here's a patch to clean up the API as described by Doug:
> >
>
http://www.mail-archive.com/lucene-user%40jakarta.apache.org/msg08479.html
> 
> Daniel,
> 
> This looks great!  Thanks for this and all of the other wonderful
> work 
> you've done recently for Lucene.
> 
> > The static Field.UnStored/Keyword etc methods are not deprecated. I
> think 
> > these can be confusing (e.g. what exactly does UnIndexed mean -- I
> always 
> > have to look it up), but nobody is forced to use them so there's no
> reason 
> > to deprecate them.
> 
> If they're confusing and have a less-confusing alternative then we 
> should eventually remove them from the API, so we should deprecate
> them 
> now.  We should move entirely to the new enumeration-based
> contructors. 
>   Everything else should be deprecated.
> 
> > The only boolean left is the one for term vectors. Should we add
> another 
> > enumeration like TermVectorIndex.NO/YES/...? I know that there's a
> patch 
> > that adds position information to the term vectors. How does that
> fit in 
> > here?
> 
> I think we should also add an enumeration for vector indexing.  I
> like 
> keeping names shorter when possible, so I'd vote for something like 
> Field.Vector.YES, and Field.Vector.NO.  This parameter should be 
> optional, defaulting to Field.Vector.NO.
> 
> Thanks again,
> 
> Doug



---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-dev-help@jakarta.apache.org


Re: API cleanup for Field

Posted by Doug Cutting <cu...@apache.org>.
Daniel Naber wrote:
> here's a patch to clean up the API as described by Doug:
> http://www.mail-archive.com/lucene-user%40jakarta.apache.org/msg08479.html

Daniel,

This looks great!  Thanks for this and all of the other wonderful work 
you've done recently for Lucene.

> The static Field.UnStored/Keyword etc methods are not deprecated. I think 
> these can be confusing (e.g. what exactly does UnIndexed mean -- I always 
> have to look it up), but nobody is forced to use them so there's no reason 
> to deprecate them.

If they're confusing and have a less-confusing alternative then we 
should eventually remove them from the API, so we should deprecate them 
now.  We should move entirely to the new enumeration-based contructors. 
  Everything else should be deprecated.

> The only boolean left is the one for term vectors. Should we add another 
> enumeration like TermVectorIndex.NO/YES/...? I know that there's a patch 
> that adds position information to the term vectors. How does that fit in 
> here?

I think we should also add an enumeration for vector indexing.  I like 
keeping names shorter when possible, so I'd vote for something like 
Field.Vector.YES, and Field.Vector.NO.  This parameter should be 
optional, defaulting to Field.Vector.NO.

Thanks again,

Doug

---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-dev-help@jakarta.apache.org


Re: API cleanup for Field

Posted by Christoph Goller <go...@detego-software.de>.
Daniel Naber wrote:
> Hi,
> 
> here's a patch to clean up the API as described by Doug:
> http://www.mail-archive.com/lucene-user%40jakarta.apache.org/msg08479.html
> 
> The Field constructor with three booleans is deprecated because it's too 
> easy to mix up the order of those parameters. Also, one variation of 
> Field.Text() is deprecated because it behaves different depending on if 
> you pass it a String or a StringReader, which is very confusing.
> 
> The static Field.UnStored/Keyword etc methods are not deprecated. I think 
> these can be confusing (e.g. what exactly does UnIndexed mean -- I always 
> have to look it up), but nobody is forced to use them so there's no reason 
> to deprecate them.
> 
> The only boolean left is the one for term vectors. Should we add another 
> enumeration like TermVectorIndex.NO/YES/...? I know that there's a patch 
> that adds position information to the term vectors. How does that fit in 
> here?

+1
Your patch looks ok for me. I haven't looked into the Termvector-positions
patch. Anyway, an enumeration like TermVectorIndex.NO/YES/..., that could be
extended for Termvector-positions if necessary, seems reasonable.

Christoph


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-dev-help@jakarta.apache.org


Re: API cleanup for Field

Posted by Christoph Goller <go...@detego-software.de>.
Otis Gospodnetic wrote:
> This looks much nicer.  I'm for deprecating all the Field.* factory
> methods and switching to this new API fully, and including that
> TV/TVector/Vector/TermVector enum akin to Store/Index inner class.

I agree. The factory methods never helped me much since I never
remembered their exact meaning.

Christoph


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-dev-help@jakarta.apache.org


Re: API cleanup for Field

Posted by Otis Gospodnetic <ot...@yahoo.com>.
This looks much nicer.  I'm for deprecating all the Field.* factory
methods and switching to this new API fully, and including that
TV/TVector/Vector/TermVector enum akin to Store/Index inner class.

Nitpick: some exception messages start with upper case, some don't. 
Some end with a full stop, others do not.

+1

Otis

--- Daniel Naber <da...@t-online.de> wrote:

> Hi,
> 
> here's a patch to clean up the API as described by Doug:
>
http://www.mail-archive.com/lucene-user%40jakarta.apache.org/msg08479.html
> 
> The Field constructor with three booleans is deprecated because it's
> too 
> easy to mix up the order of those parameters. Also, one variation of 
> Field.Text() is deprecated because it behaves different depending on
> if 
> you pass it a String or a StringReader, which is very confusing.
> 
> The static Field.UnStored/Keyword etc methods are not deprecated. I
> think 
> these can be confusing (e.g. what exactly does UnIndexed mean -- I
> always 
> have to look it up), but nobody is forced to use them so there's no
> reason 
> to deprecate them.
> 
> The only boolean left is the one for term vectors. Should we add
> another 
> enumeration like TermVectorIndex.NO/YES/...? I know that there's a
> patch 
> that adds position information to the term vectors. How does that fit
> in 
> here?
> 
> Regards
>  Daniel
> 
> -- 
> http://www.danielnaber.de
> > Index: Field.java
> ===================================================================
> RCS file:
>
/home/cvs/jakarta-lucene/src/java/org/apache/lucene/document/Field.java,v
> retrieving revision 1.16
> diff -u -r1.16 Field.java
> --- Field.java	17 Aug 2004 20:22:33 -0000	1.16
> +++ Field.java	27 Aug 2004 23:13:18 -0000
> @@ -40,6 +40,50 @@
>    private boolean isTokenized = true;
>  
>    private float boost = 1.0f;
> +  
> +  public static final class Store {
> +    private String name;
> +    private Store() {}
> +    private Store(String name) {
> +      this.name = name;
> +    }
> +    public String toString() {
> +      return name;
> +    }
> +    /** Store the original field value in the index. This is useful
> for short texts
> +     * like a document's title which whould be displayed with the
> results. The
> +     * value is stored in its original form, i.e. no analyzer is
> used before it is
> +     * stored. 
> +     */
> +    public static final Store YES = new Store("YES");
> +    /** Do not store the field value in the index. */
> +    public static final Store NO = new Store("NO");
> +  }
> +  
> +  public static final class Index {
> +    private String name;
> +    private Index() {}
> +    private Index(String name) {
> +      this.name = name;
> +    }
> +    public String toString() {
> +      return name;
> +    }
> +    /** Do not index the field value. This field can thus not be
> searched,
> +     * but one can still access its contents provided it is 
> +     * {@link Field.Store stored}. */
> +    public static final Index NO = new Index("NO");
> +    /** Index the field's value so it can be searched. An Analyzer
> will be used
> +     * to tokenize and possibly further normalize the text before
> its
> +     * terms will be stored in the index. This is useful for common
> text.
> +     */
> +    public static final Index TOKENIZED = new Index("TOKENIZED");
> +    /** Index the field's value without using an Analyzer, so it can
> be searched.
> +     * As no analyzer is used the value will be stored as a single
> term. This is 
> +     * useful for unique Ids like product numbers.
> +     */
> +    public static final Index UN_TOKENIZED = new
> Index("UN_TOKENIZED");
> +  }
>  
>    /** Sets the boost factor hits on this field.  This value will be
>     * multiplied into the score of all hits on this this field of
> this
> @@ -91,7 +135,9 @@
>  
>    /** Constructs a String-valued Field that is tokenized and
> indexed,
>      and is stored in the index, for return with hits.  Useful for
> short text
> -    fields, like "title" or "subject". Term vector will not be
> stored for this field. */
> +    fields, like "title" or "subject". Term vector will not be
> stored for this field.
> +  @deprecated use {@link #Field(String, String, Field.Store,
> Field.Index)
> +    Field(name, value, Field.Store.YES, Field.Index.TOKENIZED)}
> instead */
>    public static final Field Text(String name, String value) {
>      return Text(name, value, false);
>    }
> @@ -104,7 +150,9 @@
>  
>    /** Constructs a String-valued Field that is tokenized and
> indexed,
>      and is stored in the index, for return with hits.  Useful for
> short text
> -    fields, like "title" or "subject". */
> +    fields, like "title" or "subject".
> +    @deprecated use {@link #Field(String, String, Field.Store,
> Field.Index, boolean)
> +      Field(name, value, Field.Store.YES, Field.Index.TOKENIZED,
> boolean)} instead */
>    public static final Field Text(String name, String value, boolean
> storeTermVector) {
>      return new Field(name, value, true, true, true,
> storeTermVector);
>    }
> @@ -152,6 +200,63 @@
>    /** Create a field by specifying all parameters except for
> <code>storeTermVector</code>,
>     *  which is set to <code>false</code>.
>     */
> +  public Field(String name, String string, Store store, Index index)
> {
> +    this(name, string, store, index, false);
> +  }
> +
> +  /**
> +   * Create a field by specifying its name, value and how it will
> +   * be saved in the index.
> +   * 
> +   * @param name The name of the field
> +   * @param string The string to process
> +   * @param store whether <code>string</code> should be stored in
> the index
> +   * @param index whether the field should be indexed, and if so, if
> it should
> +   *  be tokenized before indexing 
> +   * @param storeTermVector true if we should store the Term Vector
> info
> +   * @throws IllegalArgumentException if
> <code>storeTermVector</code> is
> +   *  <code>true</code> but the field is not indexed or if the field
> +   *  is neither stored nor indexed 
> +   */ 
> +  public Field(String name, String string, Store store, Index index,
> boolean storeTermVector) {
> +      if (name == null)
> +         throw new IllegalArgumentException("name cannot be null");
> +      if (string == null)
> +        throw new IllegalArgumentException("value cannot be null");
> +      if (index == Index.NO && storeTermVector)
> +        throw new IllegalArgumentException("cannot store a term
> vector for fields that are not indexed.");
> +      if (index == Index.NO && store == Store.NO)
> +        throw new IllegalArgumentException("it doesn't make sense to
> have a field that "
> +            + "is neither indexed nor stored");
> +
> +      this.name = name.intern();        // field names are interned
> +      this.stringValue = string;
> +      if (store == Store.YES)
> +        this.isStored = true;
> +      else if (store == Store.NO)
> +        this.isStored = false;
> +      else
> +        throw new IllegalArgumentException("Unknown store parameter
> " + store);
> +      
> +      if (index == Index.NO) {
> +        this.isIndexed = false;
> +        this.isTokenized = false;
> +      } else if (index == Index.TOKENIZED) {
> +        this.isIndexed = true;
> +        this.isTokenized = true;
> +      } else if (index == Index.UN_TOKENIZED) {
> +        this.isIndexed = true;
> +        this.isTokenized = false;
> +      } else {
> +        throw new IllegalArgumentException("Unknown index parameter
> " + index);
> +      }
> +      this.storeTermVector = storeTermVector;
> +}
> +
> +  /** Create a field by specifying all parameters except for
> <code>storeTermVector</code>,
> +   *  which is set to <code>false</code>.
> +   * @deprecated use {@link #Field(String, String, Field.Store,
> Field.Index)} instead
> +   */
>    public Field(String name, String string,
>  	       boolean store, boolean index, boolean token) {
>      this(name, string, store, index, token, false);
> @@ -165,6 +270,7 @@
>     * @param index true if the field should be indexed
>     * @param token true if the field should be tokenized
>     * @param storeTermVector true if we should store the Term Vector
> info
> +   * @deprecated use {@link #Field(String, String, Field.Store,
> Field.Index, boolean)} instead
>     */ 
>    public Field(String name, String string,
>  	       boolean store, boolean index, boolean token, boolean
> storeTermVector) {
> 
> >
---------------------------------------------------------------------
> To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
> For additional commands, e-mail: lucene-dev-help@jakarta.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-dev-help@jakarta.apache.org