Posted to reviews@spark.apache.org by lazyman500 <gi...@git.apache.org> on 2015/02/06 04:41:11 UTC

[GitHub] spark pull request: [SPARK-5155] [PySpark]

GitHub user lazyman500 opened a pull request:

    https://github.com/apache/spark/pull/4417

    [SPARK-5155] [PySpark]

    add examples for PySpark

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/lazyman500/spark SPARK-5616

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/4417.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #4417
    
----
commit f7f7f2249b8b0adf0a3671c5ee94a609c80d5cb0
Author: lazyman <la...@gmail.com>
Date:   2015-02-06T03:30:32Z

    1.add boardcast example for PySpark
    2.add module example for PySpark

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark pull request: [SPARK-5616] [PySpark]

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4417#issuecomment-73406837
  
      [Test build #27037 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27037/consoleFull) for   PR 4417 at commit [`f7f7f22`](https://github.com/apache/spark/commit/f7f7f2249b8b0adf0a3671c5ee94a609c80d5cb0).
     * This patch **fails Python style tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request: [SPARK-5616] [PySpark] Add examples for PySpar...

Posted by lazyman500 <gi...@git.apache.org>.
Github user lazyman500 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4417#discussion_r24299603
  
    --- Diff: examples/src/main/python/boardcast.py ---
    @@ -0,0 +1,55 @@
    +#
    +# Licensed to the Apache Software Foundation (ASF) under one or more
    +# contributor license agreements.  See the NOTICE file distributed with
    +# this work for additional information regarding copyright ownership.
    +# The ASF licenses this file to You under the Apache License, Version 2.0
    +# (the "License"); you may not use this file except in compliance with
    +# the License.  You may obtain a copy of the License at
    +#
    +#    http://www.apache.org/licenses/LICENSE-2.0
    +#
    +# Unless required by applicable law or agreed to in writing, software
    +# distributed under the License is distributed on an "AS IS" BASIS,
    +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    +# See the License for the specific language governing permissions and
    +# limitations under the License.
    +#
    +
    +
    +import sys
    +import time
    +from operator import add
    +
    +from pyspark import SparkContext,SparkConf
    +
    +#Usage: BroadcastTest [slices] [numElem] [broadcastAlgo] [blockSize]
    +
    +if __name__ == "__main__":
    +
    +        slices  =  int(sys.argv[0]) if len(sys.argv) > 1 else 1
    +        num  =  int(sys.argv[1]) if len(sys.argv) > 2 else 10000000
    +        bcName = sys.argv[2] if len(sys.argv) > 3 else "Http"
    +        blockSize =  sys.argv[3] if len(sys.argv) > 4 else "4092"
    +
    +        conf = SparkConf().setAppName("Broadcast Test") \
    +                          .setMaster("local") \
    +                          .set("spark.broadcast.factory", "org.apache.spark.broadcast.%sBroadcastFactory"%bcName) \
    +                          .set("spark.broadcast.blockSize", blockSize)
    +
    +
    +
    +        sc = SparkContext(conf=conf)
    +        #simple broadcast       
    +        b = sc.broadcast([1, 2, 3])
    +        result =sc.parallelize([0, 0]).flatMap(lambda x: b.value).collect()
    +        for s in result:
    +            print "value is %s:" % s
    +        #large broadcast
    +        for i in range(3):
    +            print "Iteration %i" % i
    +            start = time.time()
    +            barr1 = sc.broadcast(range(num))
    --- End diff --
    
    I have added some comments to explain why we use broadcast variables, and I print the performance report:
    --------------------------------------------------------------
    Using broadcast: Iteration 0 cost time 0.829586982727
    Using broadcast: Iteration 1 cost time 0.0809919834137
    Using broadcast: Iteration 2 cost time 0.0794229507446
    Don't use broadcast: Iteration 0 cost time 2.80766296387
    Don't use broadcast: Iteration 1 cost time 2.83087706566
    Don't use broadcast: Iteration 2 cost time 3.16146707535
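
The comparison behind these timings can be sketched as below. This is a minimal sketch, not the code from the PR: `time_runs` and `compare_broadcast` are hypothetical helper names, and `sc` is assumed to be an already-created `SparkContext`. The first job reads the list through `bc.value`, while the second captures the same list in the lambda's closure, so it has to be serialized again for each job.

```python
import time


def time_runs(job, runs=3):
    """Call job() `runs` times and return the elapsed seconds of each call."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        job()
        timings.append(time.perf_counter() - start)
    return timings


def compare_broadcast(sc, num=1000000, slices=1, runs=3):
    """Time the same job with and without a broadcast variable.

    `sc` is assumed to be a live SparkContext (an assumption of this sketch).
    """
    data = list(range(num))
    rdd = sc.parallelize(range(10), slices)

    # Broadcast once; each task reads the cached copy through bc.value.
    bc = sc.broadcast(data)
    with_bc = time_runs(
        lambda: rdd.map(lambda x: len(bc.value)).collect(), runs)

    # No broadcast: `data` is captured by the closure and pickled with it.
    without_bc = time_runs(
        lambda: rdd.map(lambda x: len(data)).collect(), runs)
    return with_bc, without_bc
```

Calling `compare_broadcast(sc)` returns two lists of per-run timings. The absolute numbers depend on the cluster and on `num`, so they will not match the report above.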
     




[GitHub] spark pull request: [SPARK-5616] [PySpark] Add examples for PySpar...

Posted by rxin <gi...@git.apache.org>.
Github user rxin commented on the pull request:

    https://github.com/apache/spark/pull/4417#issuecomment-168112581
  
    I'm going to close this pull request. If this is still relevant and you are interested in pushing it forward, please open a new pull request. Thanks!





[GitHub] spark pull request: [SPARK-5616] [PySpark] Add examples for PySpar...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4417#issuecomment-73416677
  
      [Test build #27043 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27043/consoleFull) for   PR 4417 at commit [`1cf3e59`](https://github.com/apache/spark/commit/1cf3e59c504183e9c9a7fd27d6dd7cf8ad4f47b5).
     * This patch **fails Python style tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request: [SPARK-5616] [PySpark]

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4417#discussion_r24290123
  
    --- Diff: examples/src/main/python/boardcast.py ---
    @@ -0,0 +1,55 @@
    +#
    +# Licensed to the Apache Software Foundation (ASF) under one or more
    +# contributor license agreements.  See the NOTICE file distributed with
    +# this work for additional information regarding copyright ownership.
    +# The ASF licenses this file to You under the Apache License, Version 2.0
    +# (the "License"); you may not use this file except in compliance with
    +# the License.  You may obtain a copy of the License at
    +#
    +#    http://www.apache.org/licenses/LICENSE-2.0
    +#
    +# Unless required by applicable law or agreed to in writing, software
    +# distributed under the License is distributed on an "AS IS" BASIS,
    +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    +# See the License for the specific language governing permissions and
    +# limitations under the License.
    +#
    +
    +
    +import sys
    +import time
    +from operator import add
    +
    +from pyspark import SparkContext,SparkConf
    +
    +#Usage: BroadcastTest [slices] [numElem] [broadcastAlgo] [blockSize]
    +
    +if __name__ == "__main__":
    +
    +        slices  =  int(sys.argv[0]) if len(sys.argv) > 1 else 1
    +        num  =  int(sys.argv[1]) if len(sys.argv) > 2 else 10000000
    +        bcName = sys.argv[2] if len(sys.argv) > 3 else "Http"
    +        blockSize =  sys.argv[3] if len(sys.argv) > 4 else "4092"
    +
    +        conf = SparkConf().setAppName("Broadcast Test") \
    +                          .setMaster("local") \
    +                          .set("spark.broadcast.factory", "org.apache.spark.broadcast.%sBroadcastFactory"%bcName) \
    +                          .set("spark.broadcast.blockSize", blockSize)
    +
    +
    +
    +        sc = SparkContext(conf=conf)
    +        #simple broadcast       
    +        b = sc.broadcast([1, 2, 3])
    +        result =sc.parallelize([0, 0]).flatMap(lambda x: b.value).collect()
    +        for s in result:
    +            print "value is %s:" % s
    +        #large broadcast
    +        for i in range(3):
    +            print "Iteration %i" % i
    +            start = time.time()
    +            barr1 = sc.broadcast(range(num))
    --- End diff --
    
    I'm not sure it's clear what this is an example of. Yes, it uses broadcast variables, but the `als.py` example already does too. Why does it need to be done several times, and how does this show the difference versus non-broadcast variables? That, plus comments, might make this more useful.




[GitHub] spark pull request: [SPARK-5616] [PySpark] Add examples for PySpar...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/4417#issuecomment-96769679
  
    Can one of the admins verify this patch?




[GitHub] spark pull request: [SPARK-5616] [PySpark] Add examples for PySpar...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/4417#issuecomment-73416678
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27043/
    Test FAILed.




[GitHub] spark pull request: [SPARK-5616] [PySpark] Add examples for PySpar...

Posted by lazyman500 <gi...@git.apache.org>.
Github user lazyman500 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4417#discussion_r25486698
  
    --- Diff: examples/src/main/python/broadcast.py ---
    @@ -0,0 +1,60 @@
    +#
    +# Licensed to the Apache Software Foundation (ASF) under one or more
    +# contributor license agreements.  See the NOTICE file distributed with
    +# this work for additional information regarding copyright ownership.
    +# The ASF licenses this file to You under the Apache License, Version 2.0
    +# (the "License"); you may not use this file except in compliance with
    +# the License.  You may obtain a copy of the License at
    +#
    +#    http://www.apache.org/licenses/LICENSE-2.0
    +#
    +# Unless required by applicable law or agreed to in writing, software
    +# distributed under the License is distributed on an "AS IS" BASIS,
    +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    +# See the License for the specific language governing permissions and
    +# limitations under the License.
    +#
    +
    +import sys
    +import time
    +from operator import add
    +from pyspark import SparkContext, SparkConf
    +
    +# Broadcast variables allow the programmer to keep a read-only variable
    +# cached on each machine rather than  shipping a copy of it with tasks.
    +# Spark also attempts to distribute broadcast variables using efficient
    +# broadcast algorithms to reduce communication cost.
    +
    +# Usage: BroadcastTest [slices] [numElem] [broadcastAlgo] [blockSize]
    +
    +if __name__ == "__main__":
    +    slices = int(sys.argv[0]) if len(sys.argv) > 1 else 1
    +    num = int(sys.argv[1]) if len(sys.argv) > 2 else 1000000
    +    bcName = sys.argv[2] if len(sys.argv) > 3 else "Http"
    --- End diff --
    
    I imitated the Scala example (https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/BroadcastTest.scala).
    I guess the author wanted to show users how to change the broadcast type :)
    Do I need to change the default broadcast factory to TorrentBroadcastFactory?
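
For reference, the positional arguments from the usage line could be read as in the sketch below. `parse_args` is a hypothetical helper, not part of the PR, and the defaults are copied from the diff. It assumes the user arguments start at `sys.argv[1]`, because `sys.argv[0]` holds the script name, so the third argument is the one that selects the broadcast type.

```python
def parse_args(argv):
    """Return (slices, num, bc_name, block_size) from an argv-style list.

    argv[0] is the script name, so user arguments start at argv[1].
    """
    args = argv[1:]
    slices = int(args[0]) if len(args) > 0 else 1
    num = int(args[1]) if len(args) > 1 else 1000000
    bc_name = args[2] if len(args) > 2 else "Http"
    block_size = args[3] if len(args) > 3 else "4092"
    return slices, num, bc_name, block_size
```

In a script this would be called as `parse_args(sys.argv)`. Passing `Torrent` as the third argument would then select `TorrentBroadcastFactory` without changing the default.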




[GitHub] spark pull request: [SPARK-5616] [PySpark] Add examples for PySpar...

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the pull request:

    https://github.com/apache/spark/pull/4417#issuecomment-73418069
  
    @lazyman500 yes, this needs to pass the Python style checks. You can use `./dev/lint-python` to check your code. But I think it's also important for the Python maintainers to confirm whether this example adds something beyond the existing ones.




[GitHub] spark pull request: [SPARK-5616] [PySpark] Add examples for PySpar...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4417#issuecomment-73417070
  
      [Test build #27044 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27044/consoleFull) for   PR 4417 at commit [`d2ec368`](https://github.com/apache/spark/commit/d2ec3683df20b53c3b4753844cf7d73ec7775af1).
     * This patch merges cleanly.




[GitHub] spark pull request: [SPARK-5616] [PySpark] Add examples for PySpar...

Posted by davies <gi...@git.apache.org>.
Github user davies commented on the pull request:

    https://github.com/apache/spark/pull/4417#issuecomment-107753169
  
    It's good to have these examples; thanks for working on it. I took a pass over it and left a few comments.




[GitHub] spark pull request: [SPARK-5616] [PySpark] Add examples for PySpar...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4417#issuecomment-73424421
  
      [Test build #27045 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27045/consoleFull) for   PR 4417 at commit [`28b8a55`](https://github.com/apache/spark/commit/28b8a55a46cb8b4b4374ca376bbc08cece54e272).
     * This patch **passes all tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request: [SPARK-5616] [PySpark]

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4417#issuecomment-73406674
  
      [Test build #27037 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27037/consoleFull) for   PR 4417 at commit [`f7f7f22`](https://github.com/apache/spark/commit/f7f7f2249b8b0adf0a3671c5ee94a609c80d5cb0).
     * This patch merges cleanly.




[GitHub] spark pull request: [SPARK-5616] [PySpark] Add examples for PySpar...

Posted by davies <gi...@git.apache.org>.
Github user davies commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4417#discussion_r31484624
  
    --- Diff: examples/src/main/python/broadcast.py ---
    @@ -0,0 +1,60 @@
    +#
    +# Licensed to the Apache Software Foundation (ASF) under one or more
    +# contributor license agreements.  See the NOTICE file distributed with
    +# this work for additional information regarding copyright ownership.
    +# The ASF licenses this file to You under the Apache License, Version 2.0
    +# (the "License"); you may not use this file except in compliance with
    +# the License.  You may obtain a copy of the License at
    +#
    +#    http://www.apache.org/licenses/LICENSE-2.0
    +#
    +# Unless required by applicable law or agreed to in writing, software
    +# distributed under the License is distributed on an "AS IS" BASIS,
    +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    +# See the License for the specific language governing permissions and
    +# limitations under the License.
    +#
    +
    +import sys
    +import time
    +from operator import add
    +from pyspark import SparkContext, SparkConf
    +
    +# Broadcast variables allow the programmer to keep a read-only variable
    +# cached on each machine rather than  shipping a copy of it with tasks.
    +# Spark also attempts to distribute broadcast variables using efficient
    +# broadcast algorithms to reduce communication cost.
    +
    +# Usage: BroadcastTest [slices] [numElem] [broadcastAlgo] [blockSize]
    +
    +if __name__ == "__main__":
    +    slices = int(sys.argv[0]) if len(sys.argv) > 1 else 1
    +    num = int(sys.argv[1]) if len(sys.argv) > 2 else 1000000
    +    bcName = sys.argv[2] if len(sys.argv) > 3 else "Http"
    +    blockSize = sys.argv[3] if len(sys.argv) > 4 else "4092"
    +
    +    conf = SparkConf().setAppName("Broadcast Test") \
    +                      .setMaster("local") \
    +                      .set("spark.broadcast.factory", "org.apache.spark.broadcast.%sBroadcastFactory" % bcName) \
    +                      .set("spark.broadcast.blockSize", blockSize)
    +
    +    sc = SparkContext(conf=conf)
    +    # large broadcast,using broadcast will cost less time!
    +    barr1 = sc.broadcast(range(num))
    +    for i in range(3):
    +        start = time.time()
    +        # variable barr1 cached on each machine rather than shipping a copy of
    +        # it with tasks threes times
    +        broadcast_result = sc.parallelize(range(10), slices)
    --- End diff --
    
    Can you move this line out of the loop? Then we can reuse the broadcast object and see that the second and third runs are faster than the first one (the broadcast object is cached).
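
A sketch of the structure suggested here, with the broadcast variable and the RDD created once before the loop so that every iteration reuses them. `timed_runs` is a hypothetical name, `sc` is assumed to be an existing `SparkContext`, and this is one reading of the suggestion rather than the code that was eventually pushed.

```python
import time


def timed_runs(sc, num=1000000, slices=1, runs=3):
    """Reuse one broadcast variable across several jobs and time each job.

    Returns (timings, lengths), where lengths is the last job's result.
    """
    # Created once, outside the loop, so later iterations can reuse
    # whatever the first job already fetched and cached.
    bc = sc.broadcast(list(range(num)))
    rdd = sc.parallelize(range(10), slices)

    timings = []
    lengths = []
    for i in range(runs):
        start = time.perf_counter()
        lengths = rdd.map(lambda x: len(bc.value)).collect()
        timings.append(time.perf_counter() - start)
        print("Iteration %d took %.3f s" % (i, timings[-1]))
    return timings, lengths
```

If caching works as described in the comment, the first entry of `timings` should be the largest, although that depends on the deployment.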




[GitHub] spark pull request: [SPARK-5616] [PySpark]

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/4417#issuecomment-73178429
  
    Can one of the admins verify this patch?




[GitHub] spark pull request: [SPARK-5616] [PySpark]

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the pull request:

    https://github.com/apache/spark/pull/4417#issuecomment-73406550
  
    ok to test




[GitHub] spark pull request: [SPARK-5616] [PySpark] Add examples for PySpar...

Posted by lazyman500 <gi...@git.apache.org>.
Github user lazyman500 commented on the pull request:

    https://github.com/apache/spark/pull/4417#issuecomment-73417997
  
    Do I need to fix the Python style? I have used lint-python to check it.
    There are only some E501 errors like this:
    [examples/src/main/python/broadcast.py:46:80: E501 line too long (105 > 79 characters)]
    But I saw the same error in other scripts too. If I fix it, my comment will look ugly.
    [examples/src/main/python/streaming/recoverable_network_wordcount.py:47:80: E501 line too long (85 > 79 characters)]
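
One way to keep such a line under the limit without splitting the string awkwardly is to bind the long value to a name first. The sketch below is an assumption about how the configuration could be laid out, not the fix that went into the PR. `broadcast_conf_pairs` and `FACTORY_TEMPLATE` are hypothetical names, and the two keys are the ones set in the diff.

```python
FACTORY_TEMPLATE = "org.apache.spark.broadcast.%sBroadcastFactory"


def broadcast_conf_pairs(bc_name="Http", block_size="4092"):
    """Build the broadcast settings as (key, value) pairs.

    Naming the factory string first keeps every line under 79 columns.
    """
    factory = FACTORY_TEMPLATE % bc_name
    return [
        ("spark.broadcast.factory", factory),
        ("spark.broadcast.blockSize", block_size),
    ]
```

The pairs can then be passed to `SparkConf().setAll(...)`, which accepts a list of key-value pairs, so the chained `.set(...)` call that triggered E501 is no longer needed.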
    





[GitHub] spark pull request: [SPARK-5616] [PySpark]

Posted by srowen <gi...@git.apache.org>.
Github user srowen commented on the pull request:

    https://github.com/apache/spark/pull/4417#issuecomment-73368764
  
    The title of the PR should include the title of the JIRA. There is a typo throughout this PR: it's "broadcast" and not "boardcast". What are the other new files besides the main example file?




[GitHub] spark pull request: [SPARK-5616] [PySpark] Add examples for PySpar...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/4417#issuecomment-73417123
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27044/
    Test FAILed.




[GitHub] spark pull request: [SPARK-5616] [PySpark] Add examples for PySpar...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/4417#issuecomment-73424427
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27045/
    Test PASSed.




[GitHub] spark pull request: [SPARK-5616] [PySpark] Add examples for PySpar...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4417#issuecomment-73420745
  
      [Test build #27045 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27045/consoleFull) for   PR 4417 at commit [`28b8a55`](https://github.com/apache/spark/commit/28b8a55a46cb8b4b4374ca376bbc08cece54e272).
     * This patch merges cleanly.




[GitHub] spark pull request: [SPARK-5616] [PySpark] Add examples for PySpar...

Posted by davies <gi...@git.apache.org>.
Github user davies commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4417#discussion_r31484689
  
    --- Diff: examples/src/main/python/broadcast.py ---
    @@ -0,0 +1,60 @@
    +#
    +# Licensed to the Apache Software Foundation (ASF) under one or more
    +# contributor license agreements.  See the NOTICE file distributed with
    +# this work for additional information regarding copyright ownership.
    +# The ASF licenses this file to You under the Apache License, Version 2.0
    +# (the "License"); you may not use this file except in compliance with
    +# the License.  You may obtain a copy of the License at
    +#
    +#    http://www.apache.org/licenses/LICENSE-2.0
    +#
    +# Unless required by applicable law or agreed to in writing, software
    +# distributed under the License is distributed on an "AS IS" BASIS,
    +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    +# See the License for the specific language governing permissions and
    +# limitations under the License.
    +#
    +
    +import sys
    +import time
    +from operator import add
    +from pyspark import SparkContext, SparkConf
    +
    +# Broadcast variables allow the programmer to keep a read-only variable
    +# cached on each machine rather than shipping a copy of it with tasks.
    +# Spark also attempts to distribute broadcast variables using efficient
    +# broadcast algorithms to reduce communication cost.
    +
    +# Usage: BroadcastTest [slices] [numElem] [broadcastAlgo] [blockSize]
    +
    +if __name__ == "__main__":
    +    slices = int(sys.argv[1]) if len(sys.argv) > 1 else 1
    +    num = int(sys.argv[2]) if len(sys.argv) > 2 else 1000000
    +    bcName = sys.argv[3] if len(sys.argv) > 3 else "Http"
    +    blockSize = sys.argv[4] if len(sys.argv) > 4 else "4092"
    +
    +    conf = SparkConf().setAppName("Broadcast Test") \
    +                      .setMaster("local") \
    +                      .set("spark.broadcast.factory", "org.apache.spark.broadcast.%sBroadcastFactory" % bcName) \
    +                      .set("spark.broadcast.blockSize", blockSize)
    +
    +    sc = SparkContext(conf=conf)
    +    # large broadcast; using a broadcast variable should cost less time
    +    barr1 = sc.broadcast(range(num))
    +    for i in range(3):
    +        start = time.time()
    +        # variable barr1 is cached on each machine rather than shipped
    +        # with the tasks three times
    +        broadcast_result = sc.parallelize(range(10), slices)
    +        broadcast_result.map(lambda x: len(barr1.value)).collect()
    +        end = time.time()
    +        print "Using broadcast: Iteration %s cost time %s" % (i, end-start)
    +    # it will cost time
    --- End diff --
    
    PySpark creates a broadcast object automatically, so this will not make much difference (each run will create a new broadcast).
    
    I'd like to remove these.




[GitHub] spark pull request: [SPARK-5616] [PySpark] Add examples for PySpar...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4417#issuecomment-73416633
  
      [Test build #27043 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27043/consoleFull) for   PR 4417 at commit [`1cf3e59`](https://github.com/apache/spark/commit/1cf3e59c504183e9c9a7fd27d6dd7cf8ad4f47b5).
     * This patch merges cleanly.




[GitHub] spark pull request: [SPARK-5616] [PySpark]

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the pull request:

    https://github.com/apache/spark/pull/4417#issuecomment-73406840
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27037/
    Test FAILed.




[GitHub] spark pull request: [SPARK-5616] [PySpark] Add examples for PySpar...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/4417




[GitHub] spark pull request: [SPARK-5616] [PySpark] Add examples for PySpar...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the pull request:

    https://github.com/apache/spark/pull/4417#issuecomment-73417121
  
      [Test build #27044 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/27044/consoleFull) for   PR 4417 at commit [`d2ec368`](https://github.com/apache/spark/commit/d2ec3683df20b53c3b4753844cf7d73ec7775af1).
     * This patch **fails Python style tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark pull request: [SPARK-5616] [PySpark] Add examples for PySpar...

Posted by davies <gi...@git.apache.org>.
Github user davies commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4417#discussion_r31484830
  
    --- Diff: examples/src/main/python/module.py ---
    @@ -0,0 +1,39 @@
    +#
    +# Licensed to the Apache Software Foundation (ASF) under one or more
    +# contributor license agreements.  See the NOTICE file distributed with
    +# this work for additional information regarding copyright ownership.
    +# The ASF licenses this file to You under the Apache License, Version 2.0
    +# (the "License"); you may not use this file except in compliance with
    +# the License.  You may obtain a copy of the License at
    +#
    +#    http://www.apache.org/licenses/LICENSE-2.0
    +#
    +# Unless required by applicable law or agreed to in writing, software
    +# distributed under the License is distributed on an "AS IS" BASIS,
    +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    +# See the License for the specific language governing permissions and
    +# limitations under the License.
    +#
    +
    +import sys
    +import os
    +from pyspark import SparkContext
    +from mylib import myfunc
    +
    +# This is an example of using our own module file
    +if __name__ == "__main__":
    +
    +    """
    +    Usage: module.py
    +        bin/spark-submit examples/src/main/python/module.py
    +    Use --py-files instead of addPyFile():
    +        bin/spark-submit --py-files examples/src/main/python/mylib.zip \
    +            examples/src/main/python/module.py
    +    """
    +
    +    tmpdir = os.path.split(sys.argv[0])[0]
    +    sc = SparkContext(appName="PythonModule")
    +    path = os.path.join(tmpdir, "mylib.zip")
    --- End diff --
    
    It's confusing to have both mylib.py and mylib.zip (do both define the same myfunc?).
    
    Could you separate them, and catch the exception if the user forgets to use `--py-files`?
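
One way to act on this request is to guard the import so that a missing module produces a readable hint instead of a bare traceback. The following is only a minimal sketch: `load_myfunc` is a hypothetical helper name, and it assumes that forgetting `--py-files` shows up on the driver as an `ImportError` for `mylib`. The names `mylib` and `myfunc` are taken from the diff above.

```python
import sys


def load_myfunc():
    """Return mylib.myfunc, or None (with a hint on stderr) if mylib is missing."""
    try:
        # mylib is expected to be shipped with --py-files (or sc.addPyFile()).
        from mylib import myfunc
    except ImportError:
        sys.stderr.write(
            "Could not import mylib. Did you forget to pass "
            "--py-files examples/src/main/python/mylib.zip to spark-submit?\n"
        )
        return None
    return myfunc
```

The example script could call `load_myfunc()` before creating the `SparkContext` and exit early when it returns `None`.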




[GitHub] spark pull request: [SPARK-5616] [PySpark] Add examples for PySpar...

Posted by davies <gi...@git.apache.org>.
Github user davies commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4417#discussion_r25145797
  
    --- Diff: examples/src/main/python/broadcast.py ---
    @@ -0,0 +1,60 @@
    +#
    +# Licensed to the Apache Software Foundation (ASF) under one or more
    +# contributor license agreements.  See the NOTICE file distributed with
    +# this work for additional information regarding copyright ownership.
    +# The ASF licenses this file to You under the Apache License, Version 2.0
    +# (the "License"); you may not use this file except in compliance with
    +# the License.  You may obtain a copy of the License at
    +#
    +#    http://www.apache.org/licenses/LICENSE-2.0
    +#
    +# Unless required by applicable law or agreed to in writing, software
    +# distributed under the License is distributed on an "AS IS" BASIS,
    +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    +# See the License for the specific language governing permissions and
    +# limitations under the License.
    +#
    +
    +import sys
    +import time
    +from operator import add
    +from pyspark import SparkContext, SparkConf
    +
    +# Broadcast variables allow the programmer to keep a read-only variable
    +# cached on each machine rather than shipping a copy of it with tasks.
    +# Spark also attempts to distribute broadcast variables using efficient
    +# broadcast algorithms to reduce communication cost.
    +
    +# Usage: BroadcastTest [slices] [numElem] [broadcastAlgo] [blockSize]
    +
    +if __name__ == "__main__":
    +    slices = int(sys.argv[1]) if len(sys.argv) > 1 else 1
    +    num = int(sys.argv[2]) if len(sys.argv) > 2 else 1000000
    +    bcName = sys.argv[3] if len(sys.argv) > 3 else "Http"
    --- End diff --
    
    We changed the default broadcast factory to TorrentBroadcastFactory. Is there any reason to use `Http` as the default here?
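
For illustration, the argument handling in the quoted lines could be sketched as below, with the default switched to `Torrent` as the comment suggests. This is a hedged sketch, not the code that was merged: `parse_args` and `factory_class` are hypothetical helper names. Positional arguments are read starting at index 1, because `sys.argv[0]` holds the script name. The factory class pattern and the `"4092"` block size default are copied from the diff.

```python
def parse_args(argv):
    """Return (slices, num, bc_name, block_size) from a sys.argv-style list."""
    # argv[0] is the script name, so positional arguments start at index 1.
    slices = int(argv[1]) if len(argv) > 1 else 1
    num = int(argv[2]) if len(argv) > 2 else 1000000
    # Assumption: default to "Torrent" to match the default factory
    # mentioned in the review comment.
    bc_name = argv[3] if len(argv) > 3 else "Torrent"
    # Kept as a string, as in the diff, because it is passed to SparkConf.set().
    block_size = argv[4] if len(argv) > 4 else "4092"
    return slices, num, bc_name, block_size


def factory_class(bc_name):
    """Build the broadcast factory class name using the pattern from the diff."""
    return "org.apache.spark.broadcast.%sBroadcastFactory" % bc_name
```

Called as `parse_args(sys.argv)` from the script, this keeps the parsing separate from the Spark setup, so it can be checked without starting a `SparkContext`.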

