Posted to dev@flink.apache.org by Christoph Alt <ch...@posteo.de> on 2015/05/08 15:20:50 UTC

SparseVector.fromCOO keeps zero entries

Hi,

Felix and I are currently working on the implementation of the FeatureHasher (Issue #1735), which in the end returns a SparseVector.

When using “SparseVector.fromCOO” I’m seeing some odd behaviour that I didn’t expect.

Assume I call SparseVector.fromCOO(numFeatures, Map((0, 1.0), (1, 1.0), (1, -1.0))); this returns a SparseVector((0, 1.0), (1, 0.0)).
I would have expected that, after summing up the values of identical indices, an index with a resulting value of 0.0 would be dropped during the creation of the SparseVector.
Is this the expected behaviour or does this need to be fixed?
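
To make the expected behaviour concrete, here is a minimal sketch in plain Scala. It is not the Flink ML code: the object CooSketch and the method sumAndDropZeros are hypothetical names, and the entries are passed as a Seq of pairs because a Scala Map cannot hold the same key twice.

```scala
// Hypothetical stand-alone sketch; not the Flink ML implementation.
object CooSketch {

  /** Sums the values that share an index, then drops every index whose
    * sum is exactly 0.0. Returns (index, value) pairs sorted by index.
    */
  def sumAndDropZeros(size: Int, entries: Seq[(Int, Double)]): Seq[(Int, Double)] = {
    // Reject coordinates outside the vector.
    entries.foreach { case (i, _) =>
      require(i >= 0 && i < size, s"index $i out of bounds for size $size")
    }

    entries
      .groupBy(_._1)                                  // index -> all entries with that index
      .map { case (i, es) => (i, es.map(_._2).sum) }  // sum duplicate coordinates
      .filter { case (_, v) => v != 0.0 }             // drop indices that cancel out
      .toSeq
      .sortBy(_._1)
  }
}
```

With the entries from above, CooSketch.sumAndDropZeros(3, Seq((0, 1.0), (1, 1.0), (1, -1.0))) would keep only (0, 1.0), since the two values at index 1 cancel out.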

Furthermore, are there any plans to extend the SparseVector implementation with a SparseVector.fromArray(), which takes an array like Array(0.0, 1.0, 2.0, 0.0, 3.2) as a parameter and creates a SparseVector((1, 1.0), (2, 2.0), (4, 3.2)) of size array.length while only keeping the non-zero entries?
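
A rough sketch of what such a conversion could look like, again in plain Scala and independent of Flink. ArraySketch and nonZeroEntries are made-up names, and the result is returned as a (size, indices, values) triple rather than an actual SparseVector.

```scala
// Hypothetical stand-alone sketch; SparseVector.fromArray does not exist as such.
object ArraySketch {

  /** Keeps only the non-zero entries of `dense`.
    * Returns (size, indices of the kept entries, their values).
    */
  def nonZeroEntries(dense: Array[Double]): (Int, Array[Int], Array[Double]) = {
    // Pair each value with its position and keep the pairs whose value is not 0.0.
    val kept = dense.zipWithIndex.collect { case (v, i) if v != 0.0 => (i, v) }
    (dense.length, kept.map(_._1), kept.map(_._2))
  }
}
```

For Array(0.0, 1.0, 2.0, 0.0, 3.2) this would give size 5, indices 1, 2, 4 and values 1.0, 2.0, 3.2.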

Best,
Christoph

Re: SparseVector.fromCOO keeps zero entries

Posted by Till Rohrmann <ti...@gmail.com>.
Hi Christoph,

the thing with the current implementation of the SparseVector is that you
can only modify entries which are “non-zero”, i.e. explicitly stored. All
other entries are not represented in the underlying data structures. This
means that you have to create a new SparseVector if you want to set a zero
entry to non-zero. If the user specifies non-zero entries, then they might
want to modify these entries later on. Therefore, we have implemented the
SparseVector initialization in such a way that elements which add up to *0*
are explicitly represented and thus modifiable. I agree that this might not
be intuitive, and maybe the other way around, meaning filtering out these 0
values, might be better.
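
The trade-off described above can be sketched with a small stand-alone class. This is only an illustration, not Flink's SparseVector: TinySparseVector is a hypothetical name, and the sketch assumes that the stored indices are sorted in ascending order.

```scala
import java.util.Arrays

// Hypothetical minimal class for illustration; not Flink's SparseVector.
// Assumes `indices` is sorted ascending and has the same length as `data`.
final class TinySparseVector(val size: Int, val indices: Array[Int], val data: Array[Double]) {

  // Position of `index` within `indices`, or a negative value if it is not stored.
  private def locate(index: Int): Int = Arrays.binarySearch(indices, index)

  /** Reads an entry; indices that are not stored are implicitly 0.0. */
  def apply(index: Int): Double = {
    val pos = locate(index)
    if (pos >= 0) data(pos) else 0.0
  }

  /** Overwrites an entry in place; only possible for stored indices. */
  def update(index: Int, value: Double): Unit = {
    val pos = locate(index)
    if (pos < 0) {
      throw new IllegalArgumentException(s"index $index is not stored and cannot be updated")
    }
    data(pos) = value
  }
}
```

In this sketch, a vector built as new TinySparseVector(3, Array(0, 1), Array(1.0, 0.0)) keeps index 1 as an explicit zero, so it can later be set to another value in place. Had index 1 been filtered out at construction time, the same update would throw and a new vector would have to be created.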

I’m not so sure whether it makes sense to initialize a SparseVector from an
array of values. My gut feeling is that you would use an array to represent
a DenseVector because you have to specify a value for each index. If you
have only a few non-zero entries, then a different data structure, e.g. a
set of pairs (index, value), seems to be more efficient to me. But adding
such an initialization method is not a big deal. What kind of use case do
you have in mind?

Cheers,
Till
