Posted to java-user@lucene.apache.org by Debasish Das <de...@gmail.com> on 2016/12/20 00:53:04 UTC
SortedSetDocValue vs BinaryDocValues
Hi,
I need to store col1: Array[String], col2: Array[Int] and col3: Array[Float]
as doc values.
col1: Array[String] is a sparse dimension from the OLAP world
col2 + col3: Array[Int] + Array[Float] together represent a sparse vector for
a sparse measure from the OLAP world, with the dictionary encoding of col1
mapped to col2
I have a few options to implement it:
1. Use a SortedSetDocValuesField for each one of them, with the String, Int
and Float values mapped to bytes
2. Generate a byte array from Array[String], Array[Int] and Array[Float] and
save them as a byte blob using BinaryDocValuesField
I know for sure that Array[Int] and Array[Float] will compress better if I
save them using a type-specific encoding, but I am unsure whether to use
option 1 or 2 to implement the idea.
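For illustration, option 2 could be sketched roughly as below. This is only a minimal sketch under stated assumptions: the class name SparseRowCodec, the Row holder and the byte layout (an int count followed by the elements, strings as length-prefixed UTF-8) are all hypothetical and not anything Lucene defines. It uses only java.nio, so it says nothing about how Lucene compresses the blob.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: packs the three arrays of one document into a single byte[].
// Layout (an assumption made for this sketch):
//   [nDims][len, utf8 bytes]* [nIndices][int]* [nValues][float]*
public class SparseRowCodec {

    // Simple holder for the decoded arrays.
    public static final class Row {
        public final String[] dims;
        public final int[] indices;
        public final float[] values;

        Row(String[] dims, int[] indices, float[] values) {
            this.dims = dims;
            this.indices = indices;
            this.values = values;
        }
    }

    public static byte[] encode(String[] dims, int[] indices, float[] values) {
        byte[][] utf8 = new byte[dims.length][];
        int size = 3 * Integer.BYTES; // one count header per array
        for (int i = 0; i < dims.length; i++) {
            utf8[i] = dims[i].getBytes(StandardCharsets.UTF_8);
            size += Integer.BYTES + utf8[i].length; // length prefix + payload
        }
        size += indices.length * Integer.BYTES + values.length * Float.BYTES;

        ByteBuffer buf = ByteBuffer.allocate(size);
        buf.putInt(dims.length);
        for (byte[] s : utf8) {
            buf.putInt(s.length);
            buf.put(s);
        }
        buf.putInt(indices.length);
        for (int v : indices) {
            buf.putInt(v);
        }
        buf.putInt(values.length);
        for (float v : values) {
            buf.putFloat(v);
        }
        return buf.array();
    }

    public static Row decode(byte[] bytes) {
        ByteBuffer buf = ByteBuffer.wrap(bytes);
        String[] dims = new String[buf.getInt()];
        for (int i = 0; i < dims.length; i++) {
            byte[] s = new byte[buf.getInt()];
            buf.get(s);
            dims[i] = new String(s, StandardCharsets.UTF_8);
        }
        int[] indices = new int[buf.getInt()];
        for (int i = 0; i < indices.length; i++) {
            indices[i] = buf.getInt();
        }
        float[] values = new float[buf.getInt()];
        for (int i = 0; i < values.length; i++) {
            values[i] = buf.getFloat();
        }
        return new Row(dims, indices, values);
    }
}
```

With Lucene on the classpath, the encoded array could then be added to a document as new BinaryDocValuesField("row", new BytesRef(bytes)), where the field name "row" is just a placeholder. One design point that may matter here: a blob keeps element order and duplicates, whereas SortedSetDocValues stores each document's values as a sorted, de-duplicated set, so the positional pairing between the Int and Float arrays would not survive option 1 without extra work.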
Option 1 limits the number of bytes I can store, and I am not sure if
pushing a Set to serialize to disk is a good idea (I am not sure yet whether
a Set is actually being serialized to disk; most likely not).
I am open to coming up with a specific encoding for the Array data type that
re-uses the String, Int and Float encodings that we already have.
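As one illustration of what an Array[Int]-specific encoding might look like, the sketch below delta-encodes the indices and writes each delta as a variable-length integer (7 payload bits per byte, high bit set when more bytes follow). This is an assumption-laden sketch, not an existing Lucene codec: the class DeltaVarIntCodec is hypothetical, and the space saving assumes the indices are sorted ascending and non-negative, as is typical for the index half of a sparse vector.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical codec: count, then successive differences, each as a variable-length int.
public class DeltaVarIntCodec {

    public static byte[] encode(int[] sorted) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeVInt(out, sorted.length);
        int prev = 0;
        for (int v : sorted) {
            writeVInt(out, v - prev); // small deltas fit in a single byte
            prev = v;
        }
        return out.toByteArray();
    }

    public static int[] decode(byte[] bytes) {
        int[] pos = {0}; // mutable read cursor shared with readVInt
        int[] result = new int[readVInt(bytes, pos)];
        int prev = 0;
        for (int i = 0; i < result.length; i++) {
            prev += readVInt(bytes, pos);
            result[i] = prev;
        }
        return result;
    }

    // Writes 7 bits at a time, lowest bits first; the high bit marks continuation.
    private static void writeVInt(ByteArrayOutputStream out, int value) {
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write(value);
    }

    private static int readVInt(byte[] bytes, int[] pos) {
        int value = 0;
        int shift = 0;
        byte b;
        do {
            b = bytes[pos[0]++];
            value |= (b & 0x7F) << shift;
            shift += 7;
        } while ((b & 0x80) != 0);
        return value;
    }
}
```

For example, the indices {3, 7, 300} take 5 bytes in this form (one for the count, then deltas 3, 4 and 293) instead of 12 bytes as raw 4-byte ints. The Float array would not benefit from the same trick, so it would presumably stay as fixed-width values or need a different scheme.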
It would be great if experts could provide some pointers on using
SortedSetDocValues or on serializing/deserializing with BinaryDocValuesField.
The idea of sparse dimensions and measures comes from Oracle Essbase, and I
believe we may bring in tensors as well in the future.
Thanks.
Deb