Posted to dev@lucene.apache.org by eksdev <ek...@googlemail.com> on 2013/03/17 13:43:09 UTC

StoredField

Is there any way to get a BytesRef from a field originally stored as a String?

I am playing with Sorter to implement a StoredDocSorter, analogous to NumericDocValuesSorter, but realised I do not need the BytesRef -> String conversion just to compare fields (byte order would be just as good for sorting).

StoredDocument d1 = reader.document(docID1, fieldNamesSet);
String value1 = d1.get("fieldName");
BytesRef bytes1 = d1.getStringAsBytesValue("fieldName"); // hypothetical method, would love to have it

I need the String type in other places, so indexing as byte[] would be too much hassle.

A String is internally stored as byte[], so is there any reason not to expose it for StoredField (or any other type)?
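
A minimal sketch of the comparison in question, in plain Java without any Lucene classes: it orders two strings by their UTF-8 bytes, treating each byte as unsigned. The class and method names are invented for illustration and are not part of Lucene. One caveat: unsigned UTF-8 byte order equals Unicode code point order, which can differ from String.compareTo for characters outside the Basic Multilingual Plane.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not a Lucene class.
final class Utf8ByteOrder {

    /** Compares two byte arrays lexicographically, treating each byte as unsigned. */
    static int compareUnsigned(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            // Mask to 0..255 so that bytes >= 0x80 sort after ASCII bytes.
            int diff = (a[i] & 0xFF) - (b[i] & 0xFF);
            if (diff != 0) {
                return diff;
            }
        }
        // Equal common prefix: the shorter array sorts first.
        return a.length - b.length;
    }

    public static void main(String[] args) {
        byte[] apple = "apple".getBytes(StandardCharsets.UTF_8);
        byte[] apricot = "apricot".getBytes(StandardCharsets.UTF_8);
        // Prints whether "apple" sorts before "apricot" in byte order.
        System.out.println(compareUnsigned(apple, apricot) < 0);
    }
}
```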

Re: StoredField

Posted by Adrien Grand <jp...@gmail.com>.

On Sun, Mar 17, 2013 at 5:56 PM, eksdev <ek...@googlemail.com> wrote:
> Hi Adrien,

Hi eksdev,

> I cannot tell whether such a thing would make it less or more robust, just thinking aloud  :)
>
> I am thinking of it as a way to postpone the byte -> type conversion to the moment where it is really needed.  Simply, keep the byte[] around as long as possible.
> *Theoretically*, this should improve GC behaviour and memory footprint for some types of downstream processing. It all depends on how easy something like that would be.
>
> There is already a way to achieve this by using a binary field type… hmmm, maybe some lucene.expert hack to make Lucene think every field is binary would be simple and robust enough?
> e.g. Visitor.transportOnlySerializedValuesWithoutTypeConversion()

Sorry, but I think it would do more harm than good:
 - Stored fields encoding is an implementation detail, so someone could
write a StoredFieldsFormat that serializes strings in UTF-16 to avoid
decoding overhead at read time. How would
transportOnlySerializedValuesWithoutTypeConversion know the
actual encoding used by the underlying StoredFieldsFormat?
 - It would make users think that this kind of optimization is
valuable performance-wise, while I think it's unnoticeable.
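
To make the first point concrete, here is a small sketch, assuming a hypothetical format that stored strings as UTF-16BE instead of UTF-8. It uses only the JDK, and the class and method names are made up for illustration. The same two strings have different raw bytes under the two encodings and even sort in the opposite order, so code consuming raw bytes would silently depend on the codec in use.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not a Lucene class.
final class EncodingDependentOrder {

    /** Compares two byte arrays lexicographically, treating each byte as unsigned. */
    static int compareUnsigned(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int diff = (a[i] & 0xFF) - (b[i] & 0xFF);
            if (diff != 0) {
                return diff;
            }
        }
        return a.length - b.length;
    }

    /** Encodes both strings with the given charset, then compares the raw bytes. */
    static int compareEncoded(String x, String y, Charset charset) {
        return compareUnsigned(x.getBytes(charset), y.getBytes(charset));
    }

    public static void main(String[] args) {
        String tilde = "\uFF5E";       // U+FF5E:  UTF-8 EF BD 9E,    UTF-16BE FF 5E
        String emoji = "\uD83D\uDE00"; // U+1F600: UTF-8 F0 9F 98 80, UTF-16BE D8 3D DE 00

        // In UTF-8 the tilde sorts first (EF < F0) ...
        System.out.println(compareEncoded(tilde, emoji, StandardCharsets.UTF_8) < 0);
        // ... while in UTF-16BE it sorts last (FF > D8).
        System.out.println(compareEncoded(tilde, emoji, StandardCharsets.UTF_16BE) > 0);
    }
}
```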

-- 
Adrien

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

Re: StoredField

Posted by eksdev <ek...@googlemail.com>.

Hi Adrien,
I cannot tell whether such a thing would make it less or more robust, just thinking aloud  :)

I am thinking of it as a way to postpone the byte -> type conversion to the moment where it is really needed.  Simply, keep the byte[] around as long as possible.
*Theoretically*, this should improve GC behaviour and memory footprint for some types of downstream processing. It all depends on how easy something like that would be.

There is already a way to achieve this by using a binary field type… hmmm, maybe some lucene.expert hack to make Lucene think every field is binary would be simple and robust enough?
e.g. Visitor.transportOnlySerializedValuesWithoutTypeConversion()

---------

By the way, the trick with TimSort in Sorter worked great. For 1.1 million short documents, the time to sort an unsorted index on a handful of stored fields went from 490 seconds to 380.
Congrats and thanks for it! It also improved compression by 12% (very small, 4k chunk size).

On Mar 17, 2013, at 5:26 PM, Adrien Grand <jp...@gmail.com> wrote:

> Hi,
> 
> I understand that it is frustrating to perform a String -> byte[]
> conversion if Lucene just did the opposite. But because it needs to
> perform one random seek per document (on a file which is often large),
> the stored fields API is much slower than a String -> UTF-8 bytes
> conversion, so I think we should keep the API robust rather than
> allowing for these kinds of optimizations?
> 
> -- 
> Adrien

Re: StoredField

Posted by Adrien Grand <jp...@gmail.com>.

Hi,

On Sun, Mar 17, 2013 at 2:58 PM, eksdev <ek...@googlemail.com> wrote:
> sure, there is a way to make anything -> byte[] ;)
>
> it looks like this byte[]->type conversion is done deep-down and this
> visitor user-api gets already correct types  …
>
> Maybe an idea would be to delay byte[] -> type conversion to field access
> time, i do not know what mines would be on the road to do it.
>
> use cases that require identity checks, or not locale specific sorting and
> co would benefit from having raw, serialised representations without type
> conversion… anyhow, I could switch over to byte[] fields completely to do
> it…

I understand that it is frustrating to perform a String -> byte[]
conversion if Lucene just did the opposite. But because it needs to
perform one random seek per document (on a file which is often large),
the stored fields API is much slower than a String -> UTF-8 bytes
conversion, so I think we should keep the API robust rather than
allowing for these kinds of optimizations?

-- 
Adrien


Re: StoredField

Posted by eksdev <ek...@googlemail.com>.

Sure, there is a way to make anything -> byte[] ;)

It looks like this byte[] -> type conversion is done deep down, and this visitor user API already gets the correct types…

Maybe an idea would be to delay the byte[] -> type conversion to field access time; I do not know what mines would be on the road to doing it.

Use cases that require identity checks, or non-locale-specific sorting and the like, would benefit from having raw, serialised representations without type conversion… Anyhow, I could switch over to byte[] fields completely to do it…

Thanks for responding!  




On Mar 17, 2013, at 2:24 PM, Shai Erera <se...@gmail.com> wrote:

> No no, not irony at all. I misunderstood the first time. You wrote "is there any way to get ByteRef from a field originally stored as String?", so I understand the first thing that came to mind :).
> 
> But I understand the question now -- you say that since the String field is written as byte[] in the file, you want to read the byte[] as they are, without translating them to String. right?
> 
> I don't know if it's possible. I'd try field.binaryValue(), though looking at the impl it doesn't suggest it will do what you want.
> 
> Shai

Re: StoredField

Posted by Shai Erera <se...@gmail.com>.

No no, not irony at all. I misunderstood the first time. You wrote "is
there any way to get ByteRef from a field originally stored as String?", so
I answered with the first thing that came to mind :).

But I understand the question now -- you say that since the String field is
written as byte[] in the file, you want to read the byte[] as they are,
without translating them to String. right?

I don't know if it's possible. I'd try field.binaryValue(), though looking
at the impl it doesn't suggest it will do what you want.

Shai

On Sun, Mar 17, 2013 at 3:02 PM, eksdev <ek...@googlemail.com> wrote:

> Shai,  was that irony or I am missing something big time?
>
> I would like to spare BytesRef -> String conversion, not to introduce
> another one back to BytesRef
>
> Simply, for sorting, you do not need to do this byte[]->String conversion,
> byte representation of the String is perfectly sortable…

Re: StoredField

Posted by eksdev <ek...@googlemail.com>.

Shai, was that irony, or am I missing something big time?

I would like to spare the BytesRef -> String conversion, not introduce another one back to BytesRef.

Simply, for sorting, you do not need to do this byte[] -> String conversion; the byte representation of the String is perfectly sortable…
 

On Mar 17, 2013, at 1:53 PM, Shai Erera <se...@gmail.com> wrote:

> You can do new BytesRef(d1.get("fieldName")).
> 
> Shai

Re: StoredField

Posted by Shai Erera <se...@gmail.com>.

You can do new BytesRef(d1.get("fieldName")).

Shai
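
In plain Java terms, this workaround amounts to the round trip sketched below: the stored bytes are decoded to a String (what d1.get does, assuming the stored fields format keeps strings as UTF-8) and then encoded back to UTF-8 (what the BytesRef constructor taking a CharSequence does). The class and method names are illustrative only. For valid UTF-8 input the result is byte-for-byte identical to what was stored, so the workaround is correct, just with two extra conversions.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class, not a Lucene class.
final class StringRoundTrip {

    /** Decodes UTF-8 bytes to a String and encodes the String back to UTF-8. */
    static byte[] decodeThenEncode(byte[] storedUtf8) {
        String decoded = new String(storedUtf8, StandardCharsets.UTF_8); // bytes -> String
        return decoded.getBytes(StandardCharsets.UTF_8);                 // String -> bytes
    }

    public static void main(String[] args) {
        byte[] stored = "fieldValue".getBytes(StandardCharsets.UTF_8);
        byte[] roundTripped = decodeThenEncode(stored);
        // Prints whether the round trip reproduced the stored bytes exactly.
        System.out.println(Arrays.equals(stored, roundTripped));
    }
}
```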

