Posted to user@spark.apache.org by Congrui Yi <fi...@us.bosch.com> on 2014/06/16 20:54:41 UTC

pyspark-Failed to run first

Hi All,

I am just trying to compare the Scala and Python APIs on my local machine. I
tried to import a local matrix (1000 by 10, created in R) stored in a text
file via textFile in pyspark. When I run data.first() it fails to return the
line and gives error messages, including the following:
 

Then I did nothing except change the number of rows to 500 and import the
file again. data.first() then ran correctly.
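
For concreteness, a minimal sketch of the sequence described above (the file
name and SparkContext setup are illustrative assumptions, not the original
code):

from pyspark import SparkContext

sc = SparkContext("local", "first-repro")

# matrix.txt: 1000 rows x 10 columns exported from R, e.g. with
# write.table(m, "matrix.txt", row.names=FALSE, col.names=FALSE)
data = sc.textFile("matrix.txt")
print data.first()  # fails at 1000 rows; works after cutting the file to 500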

I also tried the same steps in Scala using spark-shell, which ran correctly
for both cases and for larger matrices.

Could somebody help me with this problem? I couldn't find an answer on the
internet. It looks like pyspark has a problem with even this most basic step.

Best,

Congrui Yi





Re: pyspark-Failed to run first

Posted by balajikvijayan <ba...@gmail.com>.
Any updates on this issue? A cursory search shows that others are still
experiencing it. I'm seeing it occur on trivial data sets in pyspark;
however, the same jobs run successfully in Scala.

While using Scala is an acceptable workaround, I would like to know whether
a fix is on the Spark roadmap or whether I should punt on pyspark entirely
and use only Scala.





Re: pyspark-Failed to run first

Posted by angel2014 <an...@gmail.com>.
It's kind of weird... if I try to execute this:

cotizas = sc.textFile("A_ko")
print cotizas.take(10)

it doesn't work, but if I remove just one "A" character from the file,
it's all OK.

At first I thought it was due to the number of splits or something like
that... but then I downloaded this file

http://www.briandunning.com/sample-data/uk-500.zip

and it works fine, even though it is larger both in number of lines (501
lines versus 50) and in size (96 KB versus 14 KB).
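
The same check against the downloaded sample, along the lines above (the
extracted file name is an assumption):

uk = sc.textFile("uk-500.csv")  # extracted from uk-500.zip; name assumed
print uk.take(10)               # works, despite being larger than A_ko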





2014-06-23 18:28 GMT+02:00 Congrui Yi [via Apache Spark User List] <
ml-node+s1001560n8128h58@n3.nabble.com>:

> So it does not work for files on HDFS either? That is really a problem.


A_ko (18K) <http://apache-spark-user-list.1001560.n3.nabble.com/attachment/8165/0/A_ko>





Re: pyspark-Failed to run first

Posted by Congrui Yi <fi...@us.bosch.com>.
So it does not work for files on HDFS either? That is really a problem. 




Re: pyspark-Failed to run first

Posted by angel2014 <an...@gmail.com>.
I've got the same problem trying to execute the following scriptlet from my
Eclipse environment:

v = sc.textFile("path_to_my_file")
print v.take(1)

  File "my_script.py", line 18, in <module>
    print v.take(1)
  File "spark-1.0.0-bin-hadoop2\python\pyspark\rdd.py", line 868, in take
 *   iterator =
mapped._jrdd.collectPartitions(partitionsToTake)[0].iterator()*
  File
"spark-1.0.0-bin-hadoop2\python\lib\py4j-0.8.1-src.zip\py4j\java_gateway.py",
line 537, in __call__
  File
"spark-1.0.0-bin-hadoop2\python\lib\py4j-0.8.1-src.zip\py4j\protocol.py",
line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling
o21.collectPartitions.
: java.net.SocketException: Connection reset by peer: socket write error


It doesn't matter whether the file is stored in HDFS or on my local hard
disk; what does matter is whether the file contains more than 315 lines
(records). If the file contains 315 lines or fewer, my script works
perfectly!
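
A quick way to probe that threshold (a sketch; the file path and row
contents are illustrative):

# Write a file of n lines, then try to read the first record back.
def probe(sc, n, path="probe.txt"):
    with open(path, "w") as f:
        for i in range(n):
            f.write("row %d\n" % i)
    return sc.textFile(path).take(1)

print probe(sc, 315)  # works
print probe(sc, 316)  # raises the Py4JJavaError shown above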





Re: pyspark-Failed to run first

Posted by Congrui Yi <fi...@us.bosch.com>.
I'm starting to develop ADMM for some models using pyspark (Spark version
1.0.0), so I constantly simulate data to test my code. I did the simulation
in Python, but then I ran into the same kind of problems mentioned above:
the same uninformative error messages show up when I try methods like
first, take, or takeSample. There is no "Out of Memory" error, so size
should not be a problem for pyspark.

Again, this is not a problem for Scala. 

I also installed and tried Spark 0.9.1; the same code runs correctly in
that version's pyspark.

So it is a problem only with pyspark in 1.0.0.

My code for data simulation:
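
An illustrative stand-in for this kind of simulation (the exact code is not
reproduced here; NumPy usage, sizes, and file names are assumptions):

import numpy as np

# Simulate a 1000 x 10 regression design matrix plus a response, then
# write it out for sc.textFile(); sc is the existing SparkContext.
n, p = 1000, 10
X = np.random.randn(n, p)
beta = np.random.randn(p)
y = X.dot(beta) + 0.1 * np.random.randn(n)

np.savetxt("sim_data.txt", np.column_stack((y, X)))

data = sc.textFile("sim_data.txt")
print data.first()  # fails under 1.0.0, works under 0.9.1 as described above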


-Congrui


