You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@mahout.apache.org by "Sean Owen (JIRA)" <ji...@apache.org> on 2015/09/08 10:38:47 UTC

[jira] [Created] (MAHOUT-1771) Cluster dumper omits indices and 0 elements for dense vectors

Sean Owen created MAHOUT-1771:
---------------------------------

             Summary: Cluster dumper omits indices and 0 elements for dense vectors
                 Key: MAHOUT-1771
                 URL: https://issues.apache.org/jira/browse/MAHOUT-1771
             Project: Mahout
          Issue Type: Bug
          Components: Clustering, mrlegacy
    Affects Versions: 0.9
            Reporter: Sean Owen
            Priority: Minor


Blast from the past -- are patches still being accepted for "mrlegacy" code? Something turned up incidentally when working with a customer that looks like a minor bug in the cluster dumper code.

In {{AbstractCluster.java}}:

{code}
public static List<Object> formatVectorAsJson(Vector v, String[] bindings) throws IOException {

    boolean hasBindings = bindings != null;
    boolean isSparse = !v.isDense() && v.getNumNondefaultElements() != v.size();

    // we assume sequential access in the output
    Vector provider = v.isSequentialAccess() ? v : new SequentialAccessSparseVector(v);

    List<Object> terms = new LinkedList<>();
    String term = "";

    for (Element elem : provider.nonZeroes()) {

      if (hasBindings && bindings.length >= elem.index() + 1 && bindings[elem.index()] != null) {
        term = bindings[elem.index()];
      } else if (hasBindings || isSparse) {
        term = String.valueOf(elem.index());
      }

      Map<String, Object> term_entry = new HashMap<>();
      double roundedWeight = (double) Math.round(elem.get() * 1000) / 1000;
      if (hasBindings || isSparse) {
        term_entry.put(term, roundedWeight);
        terms.add(term_entry);
      } else {
        terms.add(roundedWeight);
      }
    }

    return terms;
  }
{code}

Imagine a {{DenseVector}} with 5 elements, of which two are 0. It's considered dense in this method since the number of non-default elements is 5 (all elements are "non default" in a dense vector).

However the iteration is over non-zero elements only. And indices are only printed if it's sparse (or has bindings). So the result will be the 3 non-zero elements printed without indices. Which dimensions they are can't be determined.

The fix seems to be either:
- Compare number of _non-zero_ elements to the size when determining if it's sparse
- Iterate over all elements if non-sparse

I think the first is the intent? it would be a one-line change if so.

{code}
    boolean isSparse = !v.isDense() && v.getNumZeroElements() != v.size();
{code}

Pretty straightforward, and minor, but wanted to check with everyone before making a change.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Re: [jira] [Created] (MAHOUT-1771) Cluster dumper omits indices and 0 elements for dense vectors

Posted by Ted Dunning <te...@gmail.com>.
On Tue, Sep 8, 2015 at 1:38 AM, Sean Owen (JIRA) <ji...@apache.org> wrote:

> Sean Owen created MAHOUT-1771:
> ---------------------------------
>
>              Summary: Cluster dumper omits indices and 0 elements for
> dense vectors
>                  Key: MAHOUT-1771
>                  URL: https://issues.apache.org/jira/browse/MAHOUT-1771
>              Project: Mahout
>           Issue Type: Bug
>           Components: Clustering, mrlegacy
>     Affects Versions: 0.9
>             Reporter: Sean Owen
>             Priority: Minor
>
>
> Blast from the past -- are patches still being accepted for "mrlegacy"
> code? Something turned up incidentally when working with a customer that
> looks like a minor bug in the cluster dumper code.
>
> In {{AbstractCluster.java}}:
>
> {code}
> public static List<Object> formatVectorAsJson(Vector v, String[] bindings)
> throws IOException {
>
>     boolean hasBindings = bindings != null;
>     boolean isSparse = !v.isDense() && v.getNumNondefaultElements() !=
> v.size();
>
>     // we assume sequential access in the output
>     Vector provider = v.isSequentialAccess() ? v : new
> SequentialAccessSparseVector(v);
>
>     List<Object> terms = new LinkedList<>();
>     String term = "";
>
>     for (Element elem : provider.nonZeroes()) {
>
>       if (hasBindings && bindings.length >= elem.index() + 1 &&
> bindings[elem.index()] != null) {
>         term = bindings[elem.index()];
>       } else if (hasBindings || isSparse) {
>         term = String.valueOf(elem.index());
>       }
>
>       Map<String, Object> term_entry = new HashMap<>();
>       double roundedWeight = (double) Math.round(elem.get() * 1000) / 1000;
>       if (hasBindings || isSparse) {
>         term_entry.put(term, roundedWeight);
>         terms.add(term_entry);
>       } else {
>         terms.add(roundedWeight);
>       }
>     }
>
>     return terms;
>   }
> {code}
>
> Imagine a {{DenseVector}} with 5 elements, of which two are 0. It's
> considered dense in this method since the number of non-default elements is
> 5 (all elements are "non default" in a dense vector).
>
> However the iteration is over non-zero elements only. And indices are only
> printed if it's sparse (or has bindings). So the result will be the 3
> non-zero elements printed without indices. Which dimensions they are can't
> be determined.
>
> The fix seems to be either:
> - Compare number of _non-zero_ elements to the size when determining if
> it's sparse
> - Iterate over all elements if non-sparse
>
> I think the first is the intent? it would be a one-line change if so.
>
> {code}
>     boolean isSparse = !v.isDense() && v.getNumZeroElements() != v.size();
> {code}
>
> Pretty straightforward, and minor, but wanted to check with everyone
> before making a change.
>
>
>
> --
> This message was sent by Atlassian JIRA
> (v6.3.4#6332)
>