Posted to user@lucenenet.apache.org by Matt Diehl <ma...@gooddiehl.net.INVALID> on 2017/06/11 02:56:46 UTC

Lucene 4.8 - Reusing Document during indexing

Hi,

I am not understanding how to reuse Document like we could in 3.0.3 for
indexing purposes.

For instance, in 3.0.3, I could create and then set several common Field
values, and then just iterate changing a single field in the Document, and
add to index:

Document luceneDoc = createDocumentAndSetFileSpecificFields( file );

foreach ( var block in blocks )
{
        luceneDoc.GetField( "text" ).SetValue( block.Text );
        indexWriter.AddDocument( luceneDoc );
}

In 4.8, SetValue is not a function anymore, and it seems like I have to
recreate my 8-field Document every time I write to Index.

foreach ( var block in blocks )
{
    Document luceneDoc = createDocumentAndSetFileSpecificFields( file, block.Text );
    indexWriter.AddDocument( luceneDoc );
}

Can someone help me realize what I am missing?

Thanks,
Matt

RE: Lucene 4.8 - Reusing Document during indexing

Posted by Shad Storhaug <sh...@shadstorhaug.com>.
Thanks for the feedback.

I see now this isn't an issue with the Field type, but with the Document type. Document.GetField() returns an IIndexableField, which doesn't expose the setter methods. However, you can cast the result to Field to make them visible.

((Field)luceneDoc.GetField("text")).SetStringValue( block.Text );

That said, it would definitely be ideal if there were no need to cast.
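
A minimal sketch of reusing one Document across blocks in Lucene.NET 4.8, assuming the Lucene.Net package is referenced. The IndexBlocks helper, its parameters, and the "path" field are hypothetical and only stand in for the 8-field document from the original question. Keeping a reference to the TextField that was added to the Document avoids both GetField() and the cast; the commented line shows the cast form from this thread for cases where only the Document is at hand.

```csharp
using System.Collections.Generic;
using Lucene.Net.Documents;
using Lucene.Net.Index;

public static class ReuseDocumentSketch
{
    // Hypothetical helper: indexes one document per text block,
    // reusing the same Document and Field instances throughout.
    public static void IndexBlocks(IndexWriter indexWriter, string path, IEnumerable<string> blocks)
    {
        var luceneDoc = new Document();

        // Fields that stay the same for every block of this file.
        luceneDoc.Add(new StringField("path", path, Field.Store.YES));

        // Keep a typed reference to the field that changes per block.
        var textField = new TextField("text", string.Empty, Field.Store.NO);
        luceneDoc.Add(textField);

        foreach (var text in blocks)
        {
            textField.SetStringValue(text);
            // Equivalent, using the cast discussed in this thread:
            // ((Field)luceneDoc.GetField("text")).SetStringValue(text);
            indexWriter.AddDocument(luceneDoc);
        }
    }
}
```

The writer reads the field values when AddDocument is called, so changing the value afterwards for the next block does not affect documents that were already added.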

There are just two concrete implementations of IIndexableField: Field and LazyField. LazyField is read-only, so it would not be practical to make these setters part of IIndexableField. There are also read-only test mocks for IIndexableField.

In 3.0.3, there was a GetField() that returned a Field, and GetFieldable() that returned IFieldable. Unfortunately, the way it was done in 3.0.3 seems more practical than in 4.8.0, but 4.8.0 is more extensible - it allows you to use your own field types. Making it return type Field would make it more practical to use, but would limit extensibility to classes that subclass Field, meaning all fields would have to be read-write. It looks like the cast to Field is required in Lucene 4.8.0. I am going to have to contemplate how to best handle this case - suggestions welcome. 
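
The trade-off described above can be sketched without Lucene.NET at all. The types below are simplified stand-ins invented for illustration (IReadableField plays the role of IIndexableField, MutableField the role of Field, ReadOnlyField the role of LazyField); they are not the library's types. The sketch shows why the setters cannot live on the interface when some implementations are read-only, and how a caller could guard the cast instead of assuming it succeeds.

```csharp
using System;

// Stand-in for IIndexableField: read-only view of a field.
public interface IReadableField
{
    string Name { get; }
    string GetStringValue();
}

// Stand-in for Field: adds a setter on top of the read-only interface.
public class MutableField : IReadableField
{
    private string value;
    public MutableField(string name, string value) { Name = name; this.value = value; }
    public string Name { get; }
    public string GetStringValue() => value;
    public void SetStringValue(string newValue) => value = newValue;
}

// Stand-in for LazyField: implements the interface but has no setter.
public sealed class ReadOnlyField : IReadableField
{
    private readonly string value;
    public ReadOnlyField(string name, string value) { Name = name; this.value = value; }
    public string Name { get; }
    public string GetStringValue() => value;
}

public static class FieldDemo
{
    // Returns true if the value was updated, false if the field is read-only.
    public static bool TrySetStringValue(IReadableField field, string newValue)
    {
        if (field is MutableField mutable)
        {
            mutable.SetStringValue(newValue);
            return true;
        }
        return false;
    }
}
```

A type test like this fails softly for read-only implementations, whereas a direct cast would throw InvalidCastException at run time.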

Thanks,
Shad Storhaug (NightOwl888)



Re: Lucene 4.8 - Reusing Document during indexing

Posted by Matt Diehl <md...@lexprompt.com.INVALID>.
Thanks for the details Shad.

It's a little bit of a pain to use. Not as easy as what you showed, since
you have to typecast:
 ((TextField)luceneDoc.GetField("text")).SetStringValue( block.Text );

If you do not typecast, then SetStringValue is not available.

Also strangely, it doesn't matter what I typecast it to. I can typecast to
Int32Field and I get SetStringValue and SetInt32Value. I can typecast to
TextField and still have SetInt32Value.
If it doesn't matter what it is cast to, can we get the function
definitions in IIndexableField, which is the return type of GetField()?

Thanks,
Matt



RE: Lucene 4.8 - Reusing Document during indexing

Posted by Shad Storhaug <sh...@shadstorhaug.com>.
Matt,

Since a field needs to keep track of both the value and the type, the field values are set using methods that include type name.

luceneDoc.GetField( "text" ).SetStringValue( block.Text );

Setting the field value using a common SetValue function is something that was carefully considered, but it would mean you would have to be extremely explicit when setting the correct type. For example:

float value1 = 5.00000001f;
string value2 = value1.ToString();

luceneDoc.GetField( "number" ).SetValue(value2);

object value3 = luceneDoc.GetField( "number" ).GetNumericValue();


The above code would produce an error because the field was set as a string, but a float was expected to be stored. This would produce a bug that might be hard to track down, whereas forcing the developer to think about what type they are trying to set (SetSingleValue) makes it more explicit and less likely to go wrong, since a mismatch would produce a compile-time error.
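
To make the concern concrete, here is a self-contained sketch with a hypothetical LooseField class that offers an overloaded SetValue. Nothing in it is part of Lucene.NET, and returning null from GetNumericValue is only one assumed way the mismatch could surface. The point is that passing the string compiles without complaint and binds to the string overload, so the problem only shows up later when a number is read back.

```csharp
using System;
using System.Globalization;

// Hypothetical field with an overloaded SetValue; not a Lucene.NET type.
public class LooseField
{
    private object data;

    public void SetValue(string value) => data = value;
    public void SetValue(float value) => data = value;

    // Returns the stored number, or null when the stored value is not numeric.
    public object GetNumericValue() => (data is float) ? data : null;
}

public static class OverloadDemo
{
    public static object StoreAsStringThenReadNumber()
    {
        float value1 = 5.5f;
        string value2 = value1.ToString(CultureInfo.InvariantCulture);

        var field = new LooseField();
        field.SetValue(value2);          // compiles: binds to the string overload
        return field.GetNumericValue();  // no number was ever stored
    }
}
```

With a type-specific method name such as SetSingleValue, passing value2 would instead be rejected by the compiler.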


That said, an overloaded SetValue is more .NET-like and in this particular case we don't have any duplicate types that would cause collisions so we could add an overloaded SetValue method and convert the existing methods into extension methods in the Support namespace. I would be interested in hearing any feedback on whether explicitly specifying the type in the method name or explicitly casting to the correct type (as was the case in 3.0.3) is preferable. In .NET, the overloaded methods don't normally all store the value in the same object variable under the covers, so making explicit methods seems like a better choice to me.


On a side note, it looks like we should deprecate all of the FieldExtensions methods except IsStored to make sure people are aware that they will not be available after Lucene.Net 4.8, since the corresponding enumerations have been deprecated.

Thanks,
Shad Storhaug (NightOwl888)

