You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by zero323 <gi...@git.apache.org> on 2015/10/13 14:42:55 UTC

[GitHub] spark pull request: [SPARK-11084][ML][PYTHON] Check if index can c...

GitHub user zero323 opened a pull request:

    https://github.com/apache/spark/pull/9098

    [SPARK-11084][ML][PYTHON] Check if index can contain non-zero value before binary search

    At this moment `SparseVector.__getitem__` executes `np.searchsorted` first and checks if result is in an expected range after that. It is possible to check if index can contain non-zero value before executing `np.searchsorted`.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/zero323/spark sparse_vector_getitem_improved

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/9098.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #9098
    
----
commit 7f1e455fff7f68508db39a5186d6894a9857b54c
Author: zero323 <ma...@gmail.com>
Date:   2015-10-13T11:59:42Z

    Check if index can contain non-zero value before binary search

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11084][ML][PYTHON] Check if index can c...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/9098#issuecomment-148155252
  
      [Test build #1900 has finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/1900/console) for   PR 9098 at commit [`7f1e455`](https://github.com/apache/spark/commit/7f1e455fff7f68508db39a5186d6894a9857b54c).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11084][ML][PYTHON] Check if index can c...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/9098#discussion_r42033324
  
    --- Diff: python/pyspark/mllib/linalg/__init__.py ---
    @@ -769,10 +769,10 @@ def __getitem__(self, index):
             if index >= self.size or index < 0:
                 raise ValueError("Index %d out of bounds." % index)
     
    -        insert_index = np.searchsorted(inds, index)
    -        if insert_index >= inds.size:
    +        if (inds.size == 0) or (index > inds.item(-1)):
    --- End diff --
    
    Curious: Is ```inds.item(-1)``` faster than ```inds[-1]```?  I wasn't sure.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11084][ML][PYTHON] Check if index can c...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/9098#issuecomment-148147241
  
    Thanks for the PR!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11084][ML][PYTHON] Check if index can c...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/9098


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11084][ML][PYTHON] Check if index can c...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/9098#discussion_r42033329
  
    --- Diff: python/pyspark/mllib/tests.py ---
    @@ -252,6 +252,16 @@ def test_sparse_vector_indexing(self):
             for ind in [7.8, '1']:
                 self.assertRaises(TypeError, sv.__getitem__, ind)
     
    +        zeros = SparseVector(4, {})
    --- End diff --
    
    Isn't this already covered in the previous part of this test with "sv"?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11084][ML][PYTHON] Check if index can c...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/9098#discussion_r42298325
  
    --- Diff: python/pyspark/mllib/linalg/__init__.py ---
    @@ -769,10 +769,10 @@ def __getitem__(self, index):
             if index >= self.size or index < 0:
                 raise ValueError("Index %d out of bounds." % index)
     
    -        insert_index = np.searchsorted(inds, index)
    -        if insert_index >= inds.size:
    +        if (inds.size == 0) or (index > inds.item(-1)):
    --- End diff --
    
    Sounds good, I'd keep what you have.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11084][ML][PYTHON] Check if index can c...

Posted by zero323 <gi...@git.apache.org>.
Github user zero323 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/9098#discussion_r42055507
  
    --- Diff: python/pyspark/mllib/linalg/__init__.py ---
    @@ -769,10 +769,10 @@ def __getitem__(self, index):
             if index >= self.size or index < 0:
                 raise ValueError("Index %d out of bounds." % index)
     
    -        insert_index = np.searchsorted(inds, index)
    -        if insert_index >= inds.size:
    +        if (inds.size == 0) or (index > inds.item(-1)):
    --- End diff --
    
    By itself it is slower than using `__getitem__` but it returns standard Python scalar and overall it should be marginally faster when index is a standard scalar as well. 
    
    If you think  that `inds[-1]` is better it shouldn't really matter. With crude tests I get ~40 ns difference on average (245 ns for `[]` and, 205 ns for `inds`)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11084][ML][PYTHON] Check if index can c...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/9098#discussion_r42033334
  
    --- Diff: python/pyspark/mllib/tests.py ---
    @@ -252,6 +252,16 @@ def test_sparse_vector_indexing(self):
             for ind in [7.8, '1']:
                 self.assertRaises(TypeError, sv.__getitem__, ind)
     
    +        zeros = SparseVector(4, {})
    +        self.assertEqual(zeros[0], 0.0)
    +        self.assertEqual(zeros[3], 0.0)
    +        for ind in [4, -5]:
    +            self.assertRaises(ValueError, zeros.__getitem__, ind)
    +
    +        empty = SparseVector(0, {})
    --- End diff --
    
    This seems like a good test to have.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11084][ML][PYTHON] Check if index can c...

Posted by zero323 <gi...@git.apache.org>.
Github user zero323 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/9098#discussion_r42056374
  
    --- Diff: python/pyspark/mllib/tests.py ---
    @@ -252,6 +252,16 @@ def test_sparse_vector_indexing(self):
             for ind in [7.8, '1']:
                 self.assertRaises(TypeError, sv.__getitem__, ind)
     
    +        zeros = SparseVector(4, {})
    --- End diff --
    
    Well, It should be. I played with different implementations and I just wanted to be sure I didn't break anything. I can remove this if you want.                   


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11084][ML][PYTHON] Check if index can c...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/9098#issuecomment-147704842
  
    Can one of the admins verify this patch?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11084][ML][PYTHON] Check if index can c...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/9098#issuecomment-148147705
  
      [Test build #1900 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/1900/consoleFull) for   PR 9098 at commit [`7f1e455`](https://github.com/apache/spark/commit/7f1e455fff7f68508db39a5186d6894a9857b54c).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11084][ML][PYTHON] Check if index can c...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on the pull request:

    https://github.com/apache/spark/pull/9098#issuecomment-148858684
  
    LGTM
    Merging with master
    Thank you!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-11084][ML][PYTHON] Check if index can c...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/9098#discussion_r42298385
  
    --- Diff: python/pyspark/mllib/tests.py ---
    @@ -252,6 +252,16 @@ def test_sparse_vector_indexing(self):
             for ind in [7.8, '1']:
                 self.assertRaises(TypeError, sv.__getitem__, ind)
     
    +        zeros = SparseVector(4, {})
    --- End diff --
    
    It's OK for now; it's a quick test


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org