Posted to solr-user@lucene.apache.org by Mathias Lux <ml...@itec.uni-klu.ac.at> on 2013/08/12 14:38:22 UTC

Is there a way to store binary data (byte[]) in DocValues?

Hi!

I'm basically searching for a method to put byte[] data into Lucene
DocValues of type BINARY (see [1]). Currently only primitives and
Strings are supported according to [1].

I know that this can be done with a custom update handler, but I'd
like to avoid that.

cheers,
Mathias

[1] http://wiki.apache.org/solr/DocValues

-- 
Dr. Mathias Lux
Assistant Professor, Klagenfurt University, Austria
http://tinyurl.com/mlux-itec

Re: Is there a way to store binary data (byte[]) in DocValues?

Posted by Robert Muir <rc...@gmail.com>.
On Mon, Aug 12, 2013 at 12:25 PM, Mathias Lux <ml...@itec.uni-klu.ac.at> wrote:
>
> Another reason for not using the SORTED_SET and SORTED
> implementations is that Solr currently works with Strings on those, and
> I want to have a small memory footprint for millions of images ...
> which does not go well with immutables.

Just as a side note, again these work with byte[]. It happens to be
the case that solr uses these for its StringField (converting the
strings to bytes), but if you wanted to use these with BinaryField you
could (they just take BytesRef).

Re: Is there a way to store binary data (byte[]) in DocValues?

Posted by Mathias Lux <ml...@itec.uni-klu.ac.at>.
Hi Robert,

I'm basically "mis-using" Solr for content based image search. So I
have indexed fields (hashes) for candidate selection, i.e. 1,500
candidate results retrieved with the IndexSearcher by hashes, which I
then have to re-rank based on numeric vectors I'm storing in byte[]
arrays. I had an implementation where this was based on the binary
field, but reading from an index with a lot of small stored fields is
not a good idea with the current compression approach (I've already
discussed this in the Lucene user group :)). BINARY is the thing for me
to go for: as you said, there's nothing, just the values.
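
To make the "numeric vectors in byte[] arrays" part concrete, here is a minimal sketch of one way such a vector could be serialized. The class name VectorCodec, the choice of float components and the big-endian layout are assumptions for illustration, not the encoding used in LIRE. It only needs the JDK; with Lucene, the resulting byte[] would be wrapped in a BytesRef before being handed to a BINARY DocValues field.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

// Hypothetical helper: turns a feature vector into a compact byte[] and back.
final class VectorCodec {

    // 4 bytes per component, written in ByteBuffer's default big-endian order.
    static byte[] encode(float[] vector) {
        ByteBuffer buf = ByteBuffer.allocate(vector.length * Float.BYTES);
        for (float v : vector) {
            buf.putFloat(v);
        }
        return buf.array();
    }

    // Inverse of encode(); assumes bytes.length is a multiple of 4.
    static float[] decode(byte[] bytes) {
        ByteBuffer buf = ByteBuffer.wrap(bytes);
        float[] vector = new float[bytes.length / Float.BYTES];
        for (int i = 0; i < vector.length; i++) {
            vector[i] = buf.getFloat();
        }
        return vector;
    }

    public static void main(String[] args) {
        float[] original = {0.5f, -1.25f, 3.0f};
        byte[] bytes = encode(original);
        System.out.println(bytes.length + " bytes, round trip ok: "
                + Arrays.equals(original, decode(bytes)));
    }
}
```

A fixed-width layout like this keeps every document's value the same size, which makes decoding during re-ranking a simple loop with no parsing.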

Another reason for not using the SORTED_SET and SORTED
implementations is that Solr currently works with Strings on those, and
I want to have a small memory footprint for millions of images ...
which does not go well with immutables.

However, I now already have a solution, which I just wanted to post
here when I saw your answer. Basically I copied the source from the
BinaryField and changed it to a BinaryDocValuesField (see line 68 at
http://pastebin.com/dscPTwhr). This works out well for indexing when
you adapt the schema to use this class:

[...]
<!-- ColorLayout -->
<field name="cl_ha" type="text_ws" indexed="true" stored="false" required="false"/>
<field name="cl_hi" type="binaryDV" indexed="false" stored="true" required="false"/>
[...]
<fieldtype name="binaryDV" class="net.semanticmetadata.lire.solr.BinaryDocValuesField"/>
[...]

I then have a custom request handler, that does the search for me.
First based on the hashes (field cl_ha, treated as whitespace
delimited terms) and then re-ranking the 1,500 first results based on
the DocValues.
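
The two-stage idea (select candidates by hash, then re-rank them by their binary feature values) can be sketched roughly as follows. This is not the actual request handler: the class ReRankSketch, the Candidate type and the L1 distance over unsigned bytes are all assumptions made for the example, and the candidates are plain in-memory objects rather than values read from DocValues.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch of the re-ranking step only.
final class ReRankSketch {

    // One hit from the hash-based first pass: a document id plus its feature bytes.
    static final class Candidate {
        final int docId;
        final byte[] features;

        Candidate(int docId, byte[] features) {
            this.docId = docId;
            this.features = features;
        }
    }

    // L1 (Manhattan) distance, treating each byte as an unsigned value 0..255.
    // The metric is an assumption; arrays are assumed to have equal length.
    static int l1Distance(byte[] a, byte[] b) {
        int sum = 0;
        for (int i = 0; i < a.length; i++) {
            sum += Math.abs((a[i] & 0xFF) - (b[i] & 0xFF));
        }
        return sum;
    }

    // Sorts a copy of the candidates by distance to the query and keeps the best topK.
    static List<Candidate> reRank(byte[] query, List<Candidate> candidates, int topK) {
        List<Candidate> sorted = new ArrayList<>(candidates);
        sorted.sort(Comparator.comparingInt(c -> l1Distance(query, c.features)));
        return sorted.subList(0, Math.min(topK, sorted.size()));
    }

    public static void main(String[] args) {
        byte[] query = {10, 20};
        List<Candidate> candidates = List.of(
                new Candidate(1, new byte[] {50, 50}),
                new Candidate(2, new byte[] {10, 21}),
                new Candidate(3, new byte[] {12, 25}));
        for (Candidate c : reRank(query, candidates, 2)) {
            System.out.println("doc " + c.docId + " distance " + l1Distance(query, c.features));
        }
    }
}
```

Since only the candidate set (1,500 documents here) is touched in the second stage, the cost of the exact comparison stays bounded no matter how large the index is.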

Now it works rather fast; a demo with 1M images is available at
http://demo-itec.uni-klu.ac.at/liredemo/. Hash-based search time is
still not optimal, but that's an issue of the distribution of terms,
which is not well suited to this kind of index (the runtime, split
into search and re-rank, is shown at the end of the page).

I'll put the whole (open, GPL-ed) source online at the end of
September (as module of LIRE), after some stress tests, documentation
and further bug fixing.

cheers,
  Mathias

On Mon, Aug 12, 2013 at 4:51 PM, Robert Muir <rc...@gmail.com> wrote:
> On Mon, Aug 12, 2013 at 8:38 AM, Mathias Lux <ml...@itec.uni-klu.ac.at> wrote:
>> Hi!
>>
>> I'm basically searching for a method to put byte[] data into Lucene
>> DocValues of type BINARY (see [1]). Currently only primitives and
>> Strings are supported according to [1].
>>
>> I know that this can be done with a custom update handler, but I'd
>> like to avoid that.
>>
>
> Can you describe a little bit what kind of operations you want to do with it?
> I don't really know how BinaryField is typically used, but maybe it
> could support this option. On the other hand adding it to BinaryField
> might not "buy" you much without some additional stuff depending upon
> what you need to do. Like if you really want to do sort/facet on the
> thing, SORTED(SET) would probably be a better implementation: it
> doesn't care that the values are binary.
>
> BINARY, SORTED, and SORTED_SET actually all take byte[]: the difference is:
> * SORTED: deduplicates/compresses the unique byte[]'s and gives each
> document an ordinal number that reflects sort order (for
> sorting/faceting/grouping/etc)
> * SORTED_SET: similar, except each document has a "set" (which can be
> empty), of ordinal numbers (e.g. for faceting multivalued fields)
> * BINARY: just stores the byte[] for each document (no deduplication,
> no compression, no ordinals, nothing).
>
> So for sorting/faceting: BINARY is generally not very efficient unless
> there is something custom going on: for example lucene's faceting
> package stores the "values" elsewhere in a separate taxonomy index, so
> it uses this type just to encode a delta-compressed ordinal list for
> each document.
>
> For scoring factors/function queries: encoding the values inside
> NUMERIC(s) [up to 64 bits each] might still be best on average: the
> compression applied here is surprisingly efficient.



-- 
Dr. Mathias Lux
Assistant Professor, Klagenfurt University, Austria
http://tinyurl.com/mlux-itec

Re: Is there a way to store binary data (byte[]) in DocValues?

Posted by Robert Muir <rc...@gmail.com>.
On Mon, Aug 12, 2013 at 8:38 AM, Mathias Lux <ml...@itec.uni-klu.ac.at> wrote:
> Hi!
>
> I'm basically searching for a method to put byte[] data into Lucene
> DocValues of type BINARY (see [1]). Currently only primitives and
> Strings are supported according to [1].
>
> I know that this can be done with a custom update handler, but I'd
> like to avoid that.
>

Can you describe a little bit what kind of operations you want to do with it?
I don't really know how BinaryField is typically used, but maybe it
could support this option. On the other hand adding it to BinaryField
might not "buy" you much without some additional stuff depending upon
what you need to do. Like if you really want to do sort/facet on the
thing, SORTED(SET) would probably be a better implementation: it
doesn't care that the values are binary.

BINARY, SORTED, and SORTED_SET actually all take byte[]: the difference is:
* SORTED: deduplicates/compresses the unique byte[]'s and gives each
document an ordinal number that reflects sort order (for
sorting/faceting/grouping/etc)
* SORTED_SET: similar, except each document has a "set" (which can be
empty), of ordinal numbers (e.g. for faceting multivalued fields)
* BINARY: just stores the byte[] for each document (no deduplication,
no compression, no ordinals, nothing).
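
As a rough mental model of the SORTED case described above, the sketch below deduplicates per-document byte[] values and gives each document the ordinal of its value in sorted order. It is a conceptual illustration only, not how Lucene implements it; the class name and the use of unsigned byte-wise ordering are assumptions (Arrays.compareUnsigned needs Java 9 or later). BINARY, by contrast, would simply keep perDocValues as they are.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.Map;
import java.util.TreeMap;

// Conceptual model of SORTED doc values: unique values plus one ordinal per document.
final class SortedOrdinalsSketch {

    // Returns, for each document, the rank of its value among the sorted unique values.
    static int[] assignOrdinals(byte[][] perDocValues) {
        Comparator<byte[]> unsignedOrder = (a, b) -> Arrays.compareUnsigned(a, b);

        // Step 1: deduplicate and sort. The TreeMap keeps one entry per distinct value.
        TreeMap<byte[], Integer> unique = new TreeMap<>(unsignedOrder);
        for (byte[] value : perDocValues) {
            unique.put(value, 0);
        }

        // Step 2: number the distinct values in sort order.
        int next = 0;
        for (Map.Entry<byte[], Integer> entry : unique.entrySet()) {
            entry.setValue(next++);
        }

        // Step 3: each document stores only the small ordinal, not the bytes.
        int[] ordinals = new int[perDocValues.length];
        for (int doc = 0; doc < perDocValues.length; doc++) {
            ordinals[doc] = unique.get(perDocValues[doc]);
        }
        return ordinals;
    }

    public static void main(String[] args) {
        byte[][] values = {{2}, {1}, {2}, {3}};
        System.out.println(Arrays.toString(assignOrdinals(values)));
    }
}
```

Comparing two documents for sorting then only needs an int comparison of ordinals, which is why SORTED suits sorting and faceting while BINARY does not.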

So for sorting/faceting: BINARY is generally not very efficient unless
there is something custom going on: for example lucene's faceting
package stores the "values" elsewhere in a separate taxonomy index, so
it uses this type just to encode a delta-compressed ordinal list for
each document.

For scoring factors/function queries: encoding the values inside
NUMERIC(s) [up to 64 bits each] might still be best on average: the
compression applied here is surprisingly efficient.
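
To illustrate the "encode the values inside NUMERIC(s)" suggestion: a value of up to 64 bits fits in one long, so eight byte-sized components could be packed into a single numeric value. The sketch below shows only the packing arithmetic; the class name and the big-endian component order are assumptions, and whether this layout compresses well for a given data set would have to be measured.

```java
import java.util.Arrays;

// Hypothetical helper: packs exactly eight bytes into one long and back.
final class LongPacking {

    // First array element ends up in the most significant byte.
    static long pack(byte[] eight) {
        if (eight.length != 8) {
            throw new IllegalArgumentException("need exactly 8 bytes, got " + eight.length);
        }
        long packed = 0L;
        for (byte b : eight) {
            packed = (packed << 8) | (b & 0xFFL);
        }
        return packed;
    }

    // Inverse of pack(): peels bytes off from the least significant end.
    static byte[] unpack(long packed) {
        byte[] out = new byte[8];
        for (int i = 7; i >= 0; i--) {
            out[i] = (byte) packed;
            packed >>>= 8;
        }
        return out;
    }

    public static void main(String[] args) {
        byte[] components = {1, 2, 3, 4, 5, 6, 7, (byte) 200};
        long packed = pack(components);
        System.out.println(Arrays.equals(components, unpack(packed)));
    }
}
```

A longer vector would need several such fields, one per group of eight components.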

Re: Is there a way to store binary data (byte[]) in DocValues?

Posted by Mathias Lux <ml...@itec.uni-klu.ac.at>.
Hi!

That's what I'm doing currently, but it ends up in StoredField
implementations, which create an overhead on decompression I want to
avoid.

cheers,
Mathias

On Mon, Aug 12, 2013 at 3:11 PM, Raymond Wiker <rw...@gmail.com> wrote:
> base64-encode the binary data? That will give you strings, at the expense
> of some storage overhead.
>
>
> On Mon, Aug 12, 2013 at 2:38 PM, Mathias Lux <ml...@itec.uni-klu.ac.at> wrote:
>
>> Hi!
>>
>> I'm basically searching for a method to put byte[] data into Lucene
>> DocValues of type BINARY (see [1]). Currently only primitives and
>> Strings are supported according to [1].
>>
>> I know that this can be done with a custom update handler, but I'd
>> like to avoid that.
>>
>> cheers,
>> Mathias
>>
>> [1] http://wiki.apache.org/solr/DocValues
>>
>> --
>> Dr. Mathias Lux
>> Assistant Professor, Klagenfurt University, Austria
>> http://tinyurl.com/mlux-itec
>>



-- 
Dr. Mathias Lux
Assistant Professor, Klagenfurt University, Austria
http://tinyurl.com/mlux-itec

Re: Is there a way to store binary data (byte[]) in DocValues?

Posted by Raymond Wiker <rw...@gmail.com>.
base64-encode the binary data? That will give you strings, at the expense
of some storage overhead.
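
For a feel of that storage overhead, here is a small sketch using the JDK's java.util.Base64. Standard Base64 with padding turns every 3 input bytes into 4 output characters, so the encoded form is about a third larger. The class name and the 33-byte example size are arbitrary choices for illustration.

```java
import java.util.Arrays;
import java.util.Base64;

// Measures how much larger binary data becomes when Base64-encoded into a String.
final class Base64Overhead {

    // Length in characters of the padded, standard Base64 encoding of raw.
    static int encodedLength(byte[] raw) {
        return Base64.getEncoder().encodeToString(raw).length();
    }

    public static void main(String[] args) {
        byte[] raw = new byte[33]; // stand-in for a small feature vector
        String encoded = Base64.getEncoder().encodeToString(raw);
        byte[] decoded = Base64.getDecoder().decode(encoded);
        System.out.println(raw.length + " bytes -> " + encoded.length()
                + " chars, round trip ok: " + Arrays.equals(raw, decoded));
    }
}
```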


On Mon, Aug 12, 2013 at 2:38 PM, Mathias Lux <ml...@itec.uni-klu.ac.at> wrote:

> Hi!
>
> I'm basically searching for a method to put byte[] data into Lucene
> DocValues of type BINARY (see [1]). Currently only primitives and
> Strings are supported according to [1].
>
> I know that this can be done with a custom update handler, but I'd
> like to avoid that.
>
> cheers,
> Mathias
>
> [1] http://wiki.apache.org/solr/DocValues
>
> --
> Dr. Mathias Lux
> Assistant Professor, Klagenfurt University, Austria
> http://tinyurl.com/mlux-itec
>