Posted to reviews@spark.apache.org by anantasty <gi...@git.apache.org> on 2014/10/27 02:52:57 UTC

[GitHub] spark pull request: [examples][mllib][python] SPARK-3838

GitHub user anantasty opened a pull request:

    https://github.com/apache/spark/pull/2952

    [examples][mllib][python] SPARK-3838

    This pull request refers to issue: https://issues.apache.org/jira/browse/SPARK-3838
    
    Python example for word2vec

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/anantasty/spark SPARK-3838

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/2952.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #2952
    
----
commit c015b14a8987831dedf742624a65479d388fd217
Author: Anant <an...@gmail.com>
Date:   2014-10-27T01:48:55Z

    Added python example for word2vec

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-3838][examples][mllib][python] Word2Vec...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/2952#issuecomment-61352555
  
    LGTM. Merged into master. Thanks!




[GitHub] spark pull request: [SPARK-3838][examples][mllib][python] Word2Vec...

Posted by davies <gi...@git.apache.org>.
Github user davies commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2952#discussion_r19581493
  
    --- Diff: examples/src/main/python/mllib/word2vec.py ---
    @@ -0,0 +1,48 @@
    +#
    +# Licensed to the Apache Software Foundation (ASF) under one or more
    +# contributor license agreements.  See the NOTICE file distributed with
    +# this work for additional information regarding copyright ownership.
    +# The ASF licenses this file to You under the Apache License, Version 2.0
    +# (the "License"); you may not use this file except in compliance with
    +# the License.  You may obtain a copy of the License at
    +#
    +#    http://www.apache.org/licenses/LICENSE-2.0
    +#
    +# Unless required by applicable law or agreed to in writing, software
    +# distributed under the License is distributed on an "AS IS" BASIS,
    +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    +# See the License for the specific language governing permissions and
    +# limitations under the License.
    +#
    +
    +# This example uses text8 file from http://mattmahoney.net/dc/text8.zip
    +# The file was unziped and split into multiple lines using
    +# grep -o -E '\w+(\W+\w+){0,15}' text8 > text8_lines
    +# This was done so that the example can be run in local mode
    --- End diff --
    
    Could you provide runnable bash commands here to generate "text8_lines"?




[GitHub] spark pull request: [SPARK-3838][examples][mllib][python] Word2Vec...

Posted by anantasty <gi...@git.apache.org>.
Github user anantasty commented on the pull request:

    https://github.com/apache/spark/pull/2952#issuecomment-61316422
  
    The URL shows 0 failures; I am not sure why it says the tests fail.




[GitHub] spark pull request: [SPARK-3838][examples][mllib][python] Word2Vec...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/2952#issuecomment-60794001
  
    ok to test




[GitHub] spark pull request: [SPARK-3838][examples][mllib][python] Word2Vec...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/2952




[GitHub] spark pull request: [SPARK-3838][examples][mllib][python] Word2Vec...

Posted by anantasty <gi...@git.apache.org>.
Github user anantasty commented on the pull request:

    https://github.com/apache/spark/pull/2952#issuecomment-61305691
  
    @mengxr @davies Thanks for the time and guidance




[GitHub] spark pull request: [SPARK-3838][examples][mllib][python] Word2Vec...

Posted by davies <gi...@git.apache.org>.
Github user davies commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2952#discussion_r19653338
  
    --- Diff: examples/src/main/python/mllib/word2vec.py ---
    @@ -0,0 +1,50 @@
    +#
    +# Licensed to the Apache Software Foundation (ASF) under one or more
    +# contributor license agreements.  See the NOTICE file distributed with
    +# this work for additional information regarding copyright ownership.
    +# The ASF licenses this file to You under the Apache License, Version 2.0
    +# (the "License"); you may not use this file except in compliance with
    +# the License.  You may obtain a copy of the License at
    +#
    +#    http://www.apache.org/licenses/LICENSE-2.0
    +#
    +# Unless required by applicable law or agreed to in writing, software
    +# distributed under the License is distributed on an "AS IS" BASIS,
    +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    +# See the License for the specific language governing permissions and
    +# limitations under the License.
    +#
    +
    +# This example uses text8 file from http://mattmahoney.net/dc/text8.zip
    +# The file was downloadded, unziped and split into multiple lines using
    +#
    +# wget http://mattmahoney.net/dc/text8.zip
    +# unzip text8.zip
    +# grep -o -E '\w+(\W+\w+){0,15}' text8 > text8_lines
    +# This was done so that the example can be run in local mode
    +
    +
    +import sys
    +
    +from pyspark import SparkContext
    +from pyspark.mllib.feature import Word2Vec
    +
    +USAGE = ("bin/spark-submit --driver-memory 4g "
    +         "examples/src/main/python/mllib/word2vec.py text8_lines")
    +
    +if __name__ == "__main__":
    +    if len(sys.argv) < 2:
    +        print USAGE
    +        sys.exit("Argument for file not provided")
    --- End diff --
    
    ```
    sys.exit = exit(...)
        exit([status])
    
        Exit the interpreter by raising SystemExit(status).
        If the status is omitted or None, it defaults to zero (i.e., success).
        If the status is an integer, it will be used as the system exit status.
        If it is another kind of object, it will be printed and the system
        exit status will be one (i.e., failure).
    ```
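
The quoted docstring can be checked directly. The following is a minimal sketch in Python 3 syntax (the PR itself targets Python 2); the helper name `exit_status` is hypothetical and exists only for this illustration. It runs small snippets in a child interpreter and records the exit status and anything written to stderr.

```python
import subprocess
import sys


def exit_status(snippet):
    """Run a snippet in a child interpreter; return (exit status, stderr text)."""
    result = subprocess.run(
        [sys.executable, "-c", snippet],
        capture_output=True,
        text=True,
    )
    return result.returncode, result.stderr.strip()


# A string argument is printed to stderr and the exit status becomes 1.
message_status = exit_status(
    'import sys; sys.exit("Argument for file not provided")'
)

# An integer argument is used directly as the exit status; nothing is printed.
integer_status = exit_status("import sys; sys.exit(1)")

# With no argument the status defaults to 0, i.e. success.
default_status = exit_status("import sys; sys.exit()")
```

Under these assumptions, `sys.exit("...")` and `sys.exit(1)` both report failure to the shell; the string form additionally prints the message.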




[GitHub] spark pull request: [SPARK-3838][examples][mllib][python] Word2Vec...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2952#discussion_r19453785
  
    --- Diff: examples/src/main/python/mllib/word2vec.py ---
    @@ -0,0 +1,47 @@
    +#
    +# Licensed to the Apache Software Foundation (ASF) under one or more
    +# contributor license agreements.  See the NOTICE file distributed with
    +# this work for additional information regarding copyright ownership.
    +# The ASF licenses this file to You under the Apache License, Version 2.0
    +# (the "License"); you may not use this file except in compliance with
    +# the License.  You may obtain a copy of the License at
    +#
    +#    http://www.apache.org/licenses/LICENSE-2.0
    +#
    +# Unless required by applicable law or agreed to in writing, software
    +# distributed under the License is distributed on an "AS IS" BASIS,
    +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    +# See the License for the specific language governing permissions and
    +# limitations under the License.
    +#
    +
    +# This example uses text8 file from http://mattmahoney.net/dc/text8.zip
    +# The file was unziped and split into multiple lines using
    +# grep -o -E '\w+(\W+\w+){0,15}' text8 > text8_lines
    +# This was done so that the example can be run in local mode
    +
    +
    +import sys
    +from pyspark import SparkContext
    +from pyspark.mllib.feature import Word2Vec
    +
    +USAGE = ("bin/spark-submit --driver-memory 4g "
    +         "examples/src/main/python/mllib/word2vec.py text8_lines")
    +
    +if __name__ == "__main__":
    +    if len(sys.argv) < 2:
    +        print USAGE
    +        return
    +    file_path = sys.argv[1]
    +    sc = SparkContext(appName='Word2Vec')
    +    inp = sc.textFile("text8_lines").map(lambda row: [row])
    --- End diff --
    
    ditto: `inp` -> `sentences`, `[row]` -> `row.split(' ')`
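
The difference between the two mappings can be seen without Spark. This is a small sketch in plain Python 3; the sample line and the function names are made up for illustration and are not part of the PR.

```python
def wrap_row(row):
    # Original mapping: the whole line becomes a single "word".
    return [row]


def tokenize_row(row):
    # Suggested mapping: one token per space-separated word.
    return row.split(" ")


sample_line = "anarchism originated as a term of abuse"

as_single_token = wrap_row(sample_line)
as_tokens = tokenize_row(sample_line)
```

With `[row]`, each sentence handed to `Word2Vec` would contain one token that is the entire line, so there would be no word-level vocabulary to learn from; `row.split(' ')` gives the list of words that a sentence is expected to be.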




[GitHub] spark pull request: [SPARK-3838][examples][mllib][python] Word2Vec...

Posted by anantasty <gi...@git.apache.org>.
Github user anantasty commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2952#discussion_r19683664
  
    --- Diff: examples/src/main/python/mllib/word2vec.py ---
    @@ -0,0 +1,50 @@
    +#
    +# Licensed to the Apache Software Foundation (ASF) under one or more
    +# contributor license agreements.  See the NOTICE file distributed with
    +# this work for additional information regarding copyright ownership.
    +# The ASF licenses this file to You under the Apache License, Version 2.0
    +# (the "License"); you may not use this file except in compliance with
    +# the License.  You may obtain a copy of the License at
    +#
    +#    http://www.apache.org/licenses/LICENSE-2.0
    +#
    +# Unless required by applicable law or agreed to in writing, software
    +# distributed under the License is distributed on an "AS IS" BASIS,
    +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    +# See the License for the specific language governing permissions and
    +# limitations under the License.
    +#
    +
    +# This example uses text8 file from http://mattmahoney.net/dc/text8.zip
    +# The file was downloadded, unziped and split into multiple lines using
    +#
    +# wget http://mattmahoney.net/dc/text8.zip
    +# unzip text8.zip
    +# grep -o -E '\w+(\W+\w+){0,15}' text8 > text8_lines
    +# This was done so that the example can be run in local mode
    +
    +
    +import sys
    +
    +from pyspark import SparkContext
    +from pyspark.mllib.feature import Word2Vec
    +
    +USAGE = ("bin/spark-submit --driver-memory 4g "
    +         "examples/src/main/python/mllib/word2vec.py text8_lines")
    +
    +if __name__ == "__main__":
    +    if len(sys.argv) < 2:
    +        print USAGE
    +        sys.exit("Argument for file not provided")
    --- End diff --
    
    I am not sure what exactly you are pointing out. This basically returns 1 as the exit status, which is a failure status. Would it be better if I used sys.exit(1)?




[GitHub] spark pull request: [SPARK-3838][examples][mllib][python] Word2Vec...

Posted by davies <gi...@git.apache.org>.
Github user davies commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2952#discussion_r19588505
  
    --- Diff: docs/mllib-feature-extraction.md ---
    @@ -162,6 +162,40 @@ for((synonym, cosineSimilarity) <- synonyms) {
     }
     {% endhighlight %}
     </div>
    +<div data-lang="python">
    +{% highlight python %}
    +# This example uses text8 file from http://mattmahoney.net/dc/text8.zip
    +# The file was unziped and split into multiple lines using
    +# grep -o -E '\w+(\W+\w+){0,15}' text8 > text8_lines
    +# This was done so that the example can be run in local mode
    +
    +import sys
    +
    +from pyspark import SparkContext
    +from pyspark.mllib.feature import Word2Vec
    +
    +USAGE = ("bin/spark-submit --driver-memory 4g "
    --- End diff --
    
    It should look like the Scala one. I think the following should be enough:
    ```
        from pyspark import SparkContext
        from pyspark.mllib.feature import Word2Vec
    
        sc = SparkContext(appName='Word2Vec')
        inp = sc.textFile("text8_lines").map(lambda row: row.split(" "))
    
        word2vec = Word2Vec()
        model = word2vec.fit(inp)
    
        synonyms = model.findSynonyms('china', 40)
        for word, cosine_distance in synonyms:
            print "{}: {}".format(word, cosine_distance)
    ```
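
The Scala loop in the diff context names the second tuple element `cosineSimilarity`, so the value bound to `cosine_distance` in the snippet above is presumably a similarity score, where larger means closer. As a rough sketch of what that score measures, here is a plain Python 3 version using toy vectors; the vectors and the helper name `cosine_similarity` are illustrative assumptions, not output of a trained model or part of the PySpark API.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy 3-dimensional vectors; real Word2Vec vectors are learned from text.
vectors = {
    "china": [1.0, 2.0, 0.0],
    "japan": [2.0, 4.0, 0.0],   # same direction as "china"
    "banana": [0.0, 0.0, 3.0],  # orthogonal to "china"
}

query = vectors["china"]

# Rank every other word by similarity to the query, highest first.
ranked = sorted(
    (
        (word, cosine_similarity(query, vec))
        for word, vec in vectors.items()
        if word != "china"
    ),
    key=lambda pair: pair[1],
    reverse=True,
)
```

In this toy setup the vector pointing in the same direction as the query ranks first and the orthogonal one ranks last.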




[GitHub] spark pull request: [SPARK-3838][examples][mllib][python] Word2Vec...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2952#issuecomment-60795375
  
      [Test build #22364 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22364/consoleFull) for   PR 2952 at commit [`3d3c9ee`](https://github.com/apache/spark/commit/3d3c9eed4153e3b8c5d6f60a00202df9d296319f).
     * This patch **fails Python style tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request: [SPARK-3838][examples][mllib][python] Word2Vec...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2952#issuecomment-61341766
  
      [Test build #22658 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22658/consoleFull) for   PR 2952 at commit [`87bd723`](https://github.com/apache/spark/commit/87bd723d05a1a7d187dd25e22e21a98a7eecda83).
     * This patch merges cleanly.




[GitHub] spark pull request: [SPARK-3838][examples][mllib][python] Word2Vec...

Posted by davies <gi...@git.apache.org>.
Github user davies commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2952#discussion_r19588522
  
    --- Diff: examples/src/main/python/mllib/word2vec.py ---
    @@ -0,0 +1,48 @@
    +#
    +# Licensed to the Apache Software Foundation (ASF) under one or more
    +# contributor license agreements.  See the NOTICE file distributed with
    +# this work for additional information regarding copyright ownership.
    +# The ASF licenses this file to You under the Apache License, Version 2.0
    +# (the "License"); you may not use this file except in compliance with
    +# the License.  You may obtain a copy of the License at
    +#
    +#    http://www.apache.org/licenses/LICENSE-2.0
    +#
    +# Unless required by applicable law or agreed to in writing, software
    +# distributed under the License is distributed on an "AS IS" BASIS,
    +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    +# See the License for the specific language governing permissions and
    +# limitations under the License.
    +#
    +
    +# This example uses text8 file from http://mattmahoney.net/dc/text8.zip
    +# The file was unziped and split into multiple lines using
    +# grep -o -E '\w+(\W+\w+){0,15}' text8 > text8_lines
    +# This was done so that the example can be run in local mode
    --- End diff --
    
    It's better to include the download and unzip steps.




[GitHub] spark pull request: [SPARK-3838][examples][mllib][python] Word2Vec...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2952#issuecomment-61347387
  
      [Test build #22658 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22658/consoleFull) for   PR 2952 at commit [`87bd723`](https://github.com/apache/spark/commit/87bd723d05a1a7d187dd25e22e21a98a7eecda83).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request: [SPARK-3838][examples][mllib][python] Word2Vec...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2952#issuecomment-61222418
  
      [Test build #22598 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22598/consoleFull) for   PR 2952 at commit [`4bd439e`](https://github.com/apache/spark/commit/4bd439e0b0dade19a94cdbc0e3708ad634fd35e2).
     * This patch merges cleanly.




[GitHub] spark pull request: [SPARK-3838][examples][mllib][python] Word2Vec...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/2952#issuecomment-61228648
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22598/
    Test PASSed.




[GitHub] spark pull request: [SPARK-3838][examples][mllib][python] Word2Vec...

Posted by anantasty <gi...@git.apache.org>.
Github user anantasty commented on the pull request:

    https://github.com/apache/spark/pull/2952#issuecomment-60687518
  
    @mengxr I updated the example code as well.





[GitHub] spark pull request: [SPARK-3838][examples][mllib][python] Word2Vec...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2952#discussion_r19451852
  
    --- Diff: docs/mllib-feature-extraction.md ---
    @@ -162,6 +162,28 @@ for((synonym, cosineSimilarity) <- synonyms) {
     }
     {% endhighlight %}
     </div>
    +<div data-lang="python">
    +{% highlight python %}
    +# example uses text8 file from http://mattmahoney.net/dc/text8.zip
    +# the file was unziped and split into multiple lines using
    +# grep -o '[^ ]\+' text8 > text8_lines
    --- End diff --
    
    The input to `Word2Vec` should be sentences instead of individual words, though it doesn't affect the implementation. The following command extracts up to 16 words per line.
    
    ~~~
    grep -o -E '\w+(\W+\w+){0,15}' text8 > text8_lines
    ~~~
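
To see what the pattern does, here is a rough Python 3 equivalent of the grep command, run on a made-up 40-word string rather than the real text8 file. The function name `split_into_lines` is hypothetical, and Python's `\w` is only assumed to behave like grep's for plain ASCII words.

```python
import re

# Same pattern as the grep command, with a non-capturing group so that
# each match is the whole chunk: one word plus up to 15 following words.
CHUNK = re.compile(r"\w+(?:\W+\w+){0,15}")


def split_into_lines(text):
    """Return successive non-overlapping chunks of at most 16 words."""
    return [match.group(0) for match in CHUNK.finditer(text)]


# Made-up input: 40 space-separated words on a single line.
sample = " ".join("w{}".format(i) for i in range(40))

lines = split_into_lines(sample)
words_per_line = [len(line.split()) for line in lines]
```

Each match is greedy, so every line gets 16 words until fewer than 16 remain; the final line holds whatever is left.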




[GitHub] spark pull request: [SPARK-3838][examples][mllib][python] Word2Vec...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2952#issuecomment-60794739
  
      [Test build #22364 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22364/consoleFull) for   PR 2952 at commit [`3d3c9ee`](https://github.com/apache/spark/commit/3d3c9eed4153e3b8c5d6f60a00202df9d296319f).
     * This patch merges cleanly.




[GitHub] spark pull request: [SPARK-3838][examples][mllib][python] Word2Vec...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2952#issuecomment-61306151
  
      [Test build #22618 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22618/consoleFull) for   PR 2952 at commit [`87bd723`](https://github.com/apache/spark/commit/87bd723d05a1a7d187dd25e22e21a98a7eecda83).
     * This patch merges cleanly.




[GitHub] spark pull request: [SPARK-3838][examples][mllib][python] Word2Vec...

Posted by davies <gi...@git.apache.org>.
Github user davies commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2952#discussion_r19653295
  
    --- Diff: docs/mllib-feature-extraction.md ---
    @@ -162,6 +162,24 @@ for((synonym, cosineSimilarity) <- synonyms) {
     }
     {% endhighlight %}
     </div>
    +<div data-lang="python">
    +{% highlight python %}
    +from pyspark import SparkContext
    +from pyspark.mllib.feature import Word2Vec
    +
    +sc = SparkContext(appName='Word2Vec')
    +inp = sc.textFile("text8_lines").map(lambda row: row.split(" "))
    +
    +word2vec = Word2Vec()
    +model = word2vec.fit(inp)
    +
    +synonyms = model.findSynonyms('china', 40)
    +
    +for word, cosine_distance in synonyms:
    +    print "{}: {}".format(word, cosine_distance)
    +    sc.stop()
    --- End diff --
    
    remove the indent or remove this line.




[GitHub] spark pull request: [SPARK-3838][examples][mllib][python] Word2Vec...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/2952#issuecomment-60794164
  
    test this please




[GitHub] spark pull request: [SPARK-3838][examples][mllib][python] Word2Vec...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/2952#issuecomment-60625545
  
    @anantasty Could you also update the doc in `https://github.com/apache/spark/blob/master/docs/mllib-feature-extraction.md`? Thanks!




[GitHub] spark pull request: [SPARK-3838][examples][mllib][python] Word2Vec...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2952#issuecomment-61314835
  
      [Test build #22618 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22618/consoleFull) for   PR 2952 at commit [`87bd723`](https://github.com/apache/spark/commit/87bd723d05a1a7d187dd25e22e21a98a7eecda83).
     * This patch **fails Spark unit tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request: [SPARK-3838][examples][mllib][python] Word2Vec...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2952#discussion_r19451872
  
    --- Diff: docs/mllib-feature-extraction.md ---
    @@ -162,6 +162,28 @@ for((synonym, cosineSimilarity) <- synonyms) {
     }
     {% endhighlight %}
     </div>
    +<div data-lang="python">
    +{% highlight python %}
    +# example uses text8 file from http://mattmahoney.net/dc/text8.zip
    +# the file was unziped and split into multiple lines using
    +# grep -o '[^ ]\+' text8 > text8_lines
    +# this was done so that the example can be run in local mode
    +
    +from pyspark import SparkContext
    +from pyspark.mllib.feature import Word2Vec
    +
    +sc = SparkContext(appName='Word2Vec')
    +inp = sc.textFile("text8_chunked").map(λ row: [row])
    --- End diff --
    
    `text8_chunked` -> `text8_lines`




[GitHub] spark pull request: [SPARK-3838][examples][mllib][python] Word2Vec...

Posted by davies <gi...@git.apache.org>.
Github user davies commented on the pull request:

    https://github.com/apache/spark/pull/2952#issuecomment-61305357
  
    @mengxr I think it's ready to merge.




[GitHub] spark pull request: [SPARK-3838][examples][mllib][python] Word2Vec...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2952#discussion_r19451976
  
    --- Diff: docs/mllib-feature-extraction.md ---
    @@ -162,6 +162,28 @@ for((synonym, cosineSimilarity) <- synonyms) {
     }
     {% endhighlight %}
     </div>
    +<div data-lang="python">
    +{% highlight python %}
    +# example uses text8 file from http://mattmahoney.net/dc/text8.zip
    --- End diff --
    
    Those should be complete sentences: `This example uses text8 file from ... .`, `Unzip the file and split it into lines using ...`, `This was done ... in local mode.`




[GitHub] spark pull request: [SPARK-3838][examples][mllib][python] Word2Vec...

Posted by anantasty <gi...@git.apache.org>.
Github user anantasty commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2952#discussion_r19585598
  
    --- Diff: docs/mllib-feature-extraction.md ---
    @@ -162,6 +162,40 @@ for((synonym, cosineSimilarity) <- synonyms) {
     }
     {% endhighlight %}
     </div>
    +<div data-lang="python">
    +{% highlight python %}
    +# This example uses text8 file from http://mattmahoney.net/dc/text8.zip
    +# The file was unziped and split into multiple lines using
    +# grep -o -E '\w+(\W+\w+){0,15}' text8 > text8_lines
    +# This was done so that the example can be run in local mode
    +
    +import sys
    +
    +from pyspark import SparkContext
    +from pyspark.mllib.feature import Word2Vec
    +
    +USAGE = ("bin/spark-submit --driver-memory 4g "
    --- End diff --
    
    @davies To simplify the docs, should I just remove the `USAGE` line and the creation of the context?




[GitHub] spark pull request: [SPARK-3838][examples][mllib][python] Word2Vec...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/2952#issuecomment-61228643
  
      [Test build #22598 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22598/consoleFull) for   PR 2952 at commit [`4bd439e`](https://github.com/apache/spark/commit/4bd439e0b0dade19a94cdbc0e3708ad634fd35e2).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request: [SPARK-3838][examples][mllib][python] Word2Vec...

Posted by anantasty <gi...@git.apache.org>.
Github user anantasty commented on the pull request:

    https://github.com/apache/spark/pull/2952#issuecomment-61222153
  
    @davies @mengxr Just made the suggested changes.





[GitHub] spark pull request: [SPARK-3838][examples][mllib][python] Word2Vec...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/2952#issuecomment-61341643
  
    test this please




[GitHub] spark pull request: [SPARK-3838][examples][mllib][python] Word2Vec...

Posted by anantasty <gi...@git.apache.org>.
Github user anantasty commented on the pull request:

    https://github.com/apache/spark/pull/2952#issuecomment-60710266
  
    @mengxr I just implemented those changes.
    I kept the command-line arguments very simple, rather than using argparse etc., for the sake of simplicity.
    Thanks for the review.




[GitHub] spark pull request: [SPARK-3838][examples][mllib][python] Word2Vec...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/2952#issuecomment-60795382
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22364/
    Test FAILed.




[GitHub] spark pull request: [SPARK-3838][examples][mllib][python] Word2Vec...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2952#discussion_r19453782
  
    --- Diff: docs/mllib-feature-extraction.md ---
    @@ -162,6 +162,38 @@ for((synonym, cosineSimilarity) <- synonyms) {
     }
     {% endhighlight %}
     </div>
    +<div data-lang="python">
    +{% highlight python %}
    +# This example uses text8 file from http://mattmahoney.net/dc/text8.zip
    +# The file was unziped and split into multiple lines using
    +# grep -o -E '\w+(\W+\w+){0,15}' text8 > text8_lines
    +# This was done so that the example can be run in local mode
    +
    +from pyspark import SparkContext
    +from pyspark.mllib.feature import Word2Vec
    +
    +USAGE = ("bin/spark-submit --driver-memory 4g "
    +         "examples/src/main/python/mllib/word2vec.py text8_lines")
    +
    +
    +if __name__ == "__main__":
    +    if len(sys.argv) < 2:
    +        print USAGE
    +        return
    +    file_path = sys.argv[1]
    +    sc = SparkContext(appName='Word2Vec')
    +    inp = sc.textFile("text8_lines").map(lambda row: [row])
    --- End diff --
    
    `inp` -> `sentences`, `[row]` -> `row.split(' ')`
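
The difference between the two mappings can be seen without Spark. A minimal sketch, using plain Python lists and made-up lines as a stand-in for the RDD read from `text8_lines`:

```python
# Made-up lines standing in for the contents of text8_lines.
lines = ["anarchism originated as a term", "of abuse first used against"]

# Original mapping: each line becomes a one-element list, so the whole
# line is treated as a single token.
as_single_tokens = [[row] for row in lines]

# Suggested mapping: each line becomes a list of words (a "sentence").
sentences = [row.split(" ") for row in lines]

print(as_single_tokens[0])
print(sentences[0])
```

Since `Word2Vec.fit` is trained on sequences of words, the second form is the one that gives the model individual words to learn from once the input file holds several words per line.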




[GitHub] spark pull request: [SPARK-3838][examples][mllib][python] Word2Vec...

Posted by davies <gi...@git.apache.org>.
Github user davies commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2952#discussion_r19581517
  
    --- Diff: docs/mllib-feature-extraction.md ---
    @@ -162,6 +162,40 @@ for((synonym, cosineSimilarity) <- synonyms) {
     }
     {% endhighlight %}
     </div>
    +<div data-lang="python">
    +{% highlight python %}
    +# This example uses text8 file from http://mattmahoney.net/dc/text8.zip
    +# The file was unziped and split into multiple lines using
    +# grep -o -E '\w+(\W+\w+){0,15}' text8 > text8_lines
    +# This was done so that the example can be run in local mode
    +
    +import sys
    +
    +from pyspark import SparkContext
    +from pyspark.mllib.feature import Word2Vec
    +
    +USAGE = ("bin/spark-submit --driver-memory 4g "
    --- End diff --
    
    It's better to put a simplified version here and then link to the full example.




[GitHub] spark pull request: [SPARK-3838][examples][mllib][python] Word2Vec...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/2952#issuecomment-61347393
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22658/
    Test PASSed.




[GitHub] spark pull request: [SPARK-3838][examples][mllib][python] Word2Vec...

Posted by anantasty <gi...@git.apache.org>.
Github user anantasty commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2952#discussion_r19653043
  
    --- Diff: examples/src/main/python/mllib/word2vec.py ---
    @@ -0,0 +1,48 @@
    +#
    +# Licensed to the Apache Software Foundation (ASF) under one or more
    +# contributor license agreements.  See the NOTICE file distributed with
    +# this work for additional information regarding copyright ownership.
    +# The ASF licenses this file to You under the Apache License, Version 2.0
    +# (the "License"); you may not use this file except in compliance with
    +# the License.  You may obtain a copy of the License at
    +#
    +#    http://www.apache.org/licenses/LICENSE-2.0
    +#
    +# Unless required by applicable law or agreed to in writing, software
    +# distributed under the License is distributed on an "AS IS" BASIS,
    +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    +# See the License for the specific language governing permissions and
    +# limitations under the License.
    +#
    +
    +# This example uses text8 file from http://mattmahoney.net/dc/text8.zip
    +# The file was unziped and split into multiple lines using
    +# grep -o -E '\w+(\W+\w+){0,15}' text8 > text8_lines
    +# This was done so that the example can be run in local mode
    +
    +
    +import sys
    +
    +from pyspark import SparkContext
    +from pyspark.mllib.feature import Word2Vec
    +
    +USAGE = ("bin/spark-submit --driver-memory 4g "
    +         "examples/src/main/python/mllib/word2vec.py text8_lines")
    +
    +if __name__ == "__main__":
    +    if len(sys.argv) < 2:
    +        print USAGE
    +        sys.exit("Argument for file not provided")
    +    file_path = sys.argv[1]
    +    sc = SparkContext(appName='Word2Vec')
    +    inp = sc.textFile("text8_lines").map(lambda row: row.split(" "))
    --- End diff --
    
    I left that old line in there. Thanks for catching that!




[GitHub] spark pull request: [SPARK-3838][examples][mllib][python] Word2Vec...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2952#discussion_r19452204
  
    --- Diff: examples/src/main/python/mllib/word2vec.py ---
    @@ -0,0 +1,36 @@
    +#
    +# Licensed to the Apache Software Foundation (ASF) under one or more
    +# contributor license agreements.  See the NOTICE file distributed with
    +# this work for additional information regarding copyright ownership.
    +# The ASF licenses this file to You under the Apache License, Version 2.0
    +# (the "License"); you may not use this file except in compliance with
    +# the License.  You may obtain a copy of the License at
    +#
    +#    http://www.apache.org/licenses/LICENSE-2.0
    +#
    +# Unless required by applicable law or agreed to in writing, software
    +# distributed under the License is distributed on an "AS IS" BASIS,
    +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    +# See the License for the specific language governing permissions and
    +# limitations under the License.
    +#
    +
    +# example uses text8 file from http://mattmahoney.net/dc/text8.zip
    +# the file was unziped and split into multiple lines using
    +# grep -o '[^ ]\+' text8 > text8_lines
    +# this was done so that the example can be run in local mode
    +
    +from pyspark import SparkContext
    +from pyspark.mllib.feature import Word2Vec
    +
    +sc = SparkContext(appName='Word2Vec')
    +inp = sc.textFile("text8_lines").map(lambda row: [row])
    --- End diff --
    
    Shall we make the input path configurable and show users the command to run the example code? For example,
    
    ~~~
    bin/spark-submit --driver-memory 4g examples/src/main/python/mllib/word2vec.py text8_lines
    ~~~
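
One way the configurable input path could look is sketched below. This is an illustration only: the helper name `input_path_from_args` is hypothetical, and the Spark calls are left out so the snippet runs on its own.

```python
import sys

USAGE = ("bin/spark-submit --driver-memory 4g "
         "examples/src/main/python/mllib/word2vec.py text8_lines")


def input_path_from_args(argv):
    """Return the first positional argument, or exit with the usage text."""
    if len(argv) < 2:
        print(USAGE)
        sys.exit("Argument for file not provided")
    return argv[1]


# Simulated argument vector; a real script would pass sys.argv.
file_path = input_path_from_args(["word2vec.py", "text8_lines"])
print(file_path)
```

The returned path would then be passed to `sc.textFile(...)` in place of a hard-coded file name.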




[GitHub] spark pull request: [examples][mllib][python] SPARK-3838

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/2952#issuecomment-60542640
  
    Can one of the admins verify this patch?




[GitHub] spark pull request: [SPARK-3838][examples][mllib][python] Word2Vec...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2952#discussion_r19451869
  
    --- Diff: docs/mllib-feature-extraction.md ---
    @@ -162,6 +162,28 @@ for((synonym, cosineSimilarity) <- synonyms) {
     }
     {% endhighlight %}
     </div>
    +<div data-lang="python">
    +{% highlight python %}
    +# example uses text8 file from http://mattmahoney.net/dc/text8.zip
    +# the file was unziped and split into multiple lines using
    +# grep -o '[^ ]\+' text8 > text8_lines
    +# this was done so that the example can be run in local mode
    +
    +from pyspark import SparkContext
    +from pyspark.mllib.feature import Word2Vec
    +
    +sc = SparkContext(appName='Word2Vec')
    +inp = sc.textFile("text8_chunked").map(λ row: [row])
    --- End diff --
    
    `λ` -> `lambda` (do not use special chars)
    
    If we use sentences, this should be `row.split(' ')`.




[GitHub] spark pull request: [SPARK-3838][examples][mllib][python] Word2Vec...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2952#discussion_r19453784
  
    --- Diff: examples/src/main/python/mllib/word2vec.py ---
    @@ -0,0 +1,47 @@
    +#
    +# Licensed to the Apache Software Foundation (ASF) under one or more
    +# contributor license agreements.  See the NOTICE file distributed with
    +# this work for additional information regarding copyright ownership.
    +# The ASF licenses this file to You under the Apache License, Version 2.0
    +# (the "License"); you may not use this file except in compliance with
    +# the License.  You may obtain a copy of the License at
    +#
    +#    http://www.apache.org/licenses/LICENSE-2.0
    +#
    +# Unless required by applicable law or agreed to in writing, software
    +# distributed under the License is distributed on an "AS IS" BASIS,
    +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    +# See the License for the specific language governing permissions and
    +# limitations under the License.
    +#
    +
    +# This example uses text8 file from http://mattmahoney.net/dc/text8.zip
    +# The file was unziped and split into multiple lines using
    +# grep -o -E '\w+(\W+\w+){0,15}' text8 > text8_lines
    +# This was done so that the example can be run in local mode
    +
    +
    +import sys
    --- End diff --
    
    insert an empty line after `python` imports




[GitHub] spark pull request: [SPARK-3838][examples][mllib][python] Word2Vec...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/2952#issuecomment-61314842
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22618/
    Test FAILed.




[GitHub] spark pull request: [SPARK-3838][examples][mllib][python] Word2Vec...

Posted by davies <gi...@git.apache.org>.
Github user davies commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2952#discussion_r19684029
  
    --- Diff: examples/src/main/python/mllib/word2vec.py ---
    @@ -0,0 +1,50 @@
    +#
    +# Licensed to the Apache Software Foundation (ASF) under one or more
    +# contributor license agreements.  See the NOTICE file distributed with
    +# this work for additional information regarding copyright ownership.
    +# The ASF licenses this file to You under the Apache License, Version 2.0
    +# (the "License"); you may not use this file except in compliance with
    +# the License.  You may obtain a copy of the License at
    +#
    +#    http://www.apache.org/licenses/LICENSE-2.0
    +#
    +# Unless required by applicable law or agreed to in writing, software
    +# distributed under the License is distributed on an "AS IS" BASIS,
    +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    +# See the License for the specific language governing permissions and
    +# limitations under the License.
    +#
    +
    +# This example uses text8 file from http://mattmahoney.net/dc/text8.zip
    +# The file was downloadded, unziped and split into multiple lines using
    +#
    +# wget http://mattmahoney.net/dc/text8.zip
    +# unzip text8.zip
    +# grep -o -E '\w+(\W+\w+){0,15}' text8 > text8_lines
    +# This was done so that the example can be run in local mode
    +
    +
    +import sys
    +
    +from pyspark import SparkContext
    +from pyspark.mllib.feature import Word2Vec
    +
    +USAGE = ("bin/spark-submit --driver-memory 4g "
    +         "examples/src/main/python/mllib/word2vec.py text8_lines")
    +
    +if __name__ == "__main__":
    +    if len(sys.argv) < 2:
    +        print USAGE
    +        sys.exit("Argument for file not provided")
    --- End diff --
    
    Sorry, I didn't read the help doc carefully. You are brilliant!




[GitHub] spark pull request: [SPARK-3838][examples][mllib][python] Word2Vec...

Posted by davies <gi...@git.apache.org>.
Github user davies commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2952#discussion_r19581470
  
    --- Diff: examples/src/main/python/mllib/word2vec.py ---
    @@ -0,0 +1,48 @@
    +#
    +# Licensed to the Apache Software Foundation (ASF) under one or more
    +# contributor license agreements.  See the NOTICE file distributed with
    +# this work for additional information regarding copyright ownership.
    +# The ASF licenses this file to You under the Apache License, Version 2.0
    +# (the "License"); you may not use this file except in compliance with
    +# the License.  You may obtain a copy of the License at
    +#
    +#    http://www.apache.org/licenses/LICENSE-2.0
    +#
    +# Unless required by applicable law or agreed to in writing, software
    +# distributed under the License is distributed on an "AS IS" BASIS,
    +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    +# See the License for the specific language governing permissions and
    +# limitations under the License.
    +#
    +
    +# This example uses text8 file from http://mattmahoney.net/dc/text8.zip
    +# The file was unziped and split into multiple lines using
    +# grep -o -E '\w+(\W+\w+){0,15}' text8 > text8_lines
    +# This was done so that the example can be run in local mode
    +
    +
    +import sys
    +
    +from pyspark import SparkContext
    +from pyspark.mllib.feature import Word2Vec
    +
    +USAGE = ("bin/spark-submit --driver-memory 4g "
    +         "examples/src/main/python/mllib/word2vec.py text8_lines")
    +
    +if __name__ == "__main__":
    +    if len(sys.argv) < 2:
    +        print USAGE
    +        sys.exit("Argument for file not provided")
    +    file_path = sys.argv[1]
    +    sc = SparkContext(appName='Word2Vec')
    +    inp = sc.textFile("text8_lines").map(lambda row: row.split(" "))
    --- End diff --
    
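    Use file_path here instead of the hard-coded "text8_lines"; otherwise the command-line argument parsed above is ignored.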
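
    A minimal sketch of the suggested change, not the code that was merged. The helper names input_path and load_tokens are hypothetical and only illustrate passing the parsed argument through to textFile; sc is assumed to be an existing SparkContext.

```python
import sys

USAGE = ("bin/spark-submit --driver-memory 4g "
         "examples/src/main/python/mllib/word2vec.py text8_lines")


def input_path(argv):
    """Return the input file path from argv, or exit with the usage text.

    Hypothetical helper: the original example does this check inline.
    """
    if len(argv) < 2:
        sys.exit("Argument for file not provided\n" + USAGE)
    return argv[1]


def load_tokens(sc, file_path):
    """Read file_path (rather than a hard-coded name) and split lines into words.

    `sc` is assumed to be a SparkContext; only textFile() and map() are used.
    """
    return sc.textFile(file_path).map(lambda row: row.split(" "))
```

    With this shape, the main block would call load_tokens(sc, input_path(sys.argv)), so the path given on the command line is the one that is read.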


[GitHub] spark pull request: [SPARK-3838][examples][mllib][python] Word2Vec...

Posted by mengxr <gi...@git.apache.org>.
Github user mengxr commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2952#discussion_r19452346
  
    --- Diff: docs/mllib-feature-extraction.md ---
    @@ -162,6 +162,28 @@ for((synonym, cosineSimilarity) <- synonyms) {
     }
     {% endhighlight %}
     </div>
    +<div data-lang="python">
    +{% highlight python %}
    +# example uses text8 file from http://mattmahoney.net/dc/text8.zip
    --- End diff --
    
    Btw, it is also worth noting the expected running time. This is not as fast as the C implementation. It may take ~10 minutes.


[GitHub] spark pull request: [SPARK-3838][examples][mllib][python] Word2Vec...

Posted by anantasty <gi...@git.apache.org>.
Github user anantasty commented on a diff in the pull request:

    https://github.com/apache/spark/pull/2952#discussion_r19585582
  
    --- Diff: examples/src/main/python/mllib/word2vec.py ---
    @@ -0,0 +1,48 @@
    +#
    +# Licensed to the Apache Software Foundation (ASF) under one or more
    +# contributor license agreements.  See the NOTICE file distributed with
    +# this work for additional information regarding copyright ownership.
    +# The ASF licenses this file to You under the Apache License, Version 2.0
    +# (the "License"); you may not use this file except in compliance with
    +# the License.  You may obtain a copy of the License at
    +#
    +#    http://www.apache.org/licenses/LICENSE-2.0
    +#
    +# Unless required by applicable law or agreed to in writing, software
    +# distributed under the License is distributed on an "AS IS" BASIS,
    +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    +# See the License for the specific language governing permissions and
    +# limitations under the License.
    +#
    +
    +# This example uses text8 file from http://mattmahoney.net/dc/text8.zip
    +# The file was unziped and split into multiple lines using
    +# grep -o -E '\w+(\W+\w+){0,15}' text8 > text8_lines
    +# This was done so that the example can be run in local mode
    --- End diff --
    
    @davies I just used the command listed in the comments:
    "grep -o -E '\w+(\W+\w+){0,15}' text8 > text8_lines"
    Did you want me to include a bash script?
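
    As an alternative to a bash script, the same split could be sketched in Python. This is an illustration under assumptions, not part of the pull request: the function name split_into_lines is hypothetical, and it applies the same regular expression as the grep command to the whole text held in memory.

```python
import re


def split_into_lines(text, max_words=16):
    """Split text into chunks of at most max_words words each.

    Mirrors grep -o -E '\\w+(\\W+\\w+){0,15}' for the default of 16 words:
    one leading word followed by up to max_words - 1 further words.
    """
    pattern = re.compile(r"\w+(?:\W+\w+){0,%d}" % (max_words - 1))
    return [m.group(0) for m in pattern.finditer(text)]
```

    The resulting chunks could then be written one per line to text8_lines, for example with "\n".join(...).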

