Posted to user@spark.apache.org by skane <sk...@websense.com> on 2014/11/06 19:39:10 UTC

Re: PySpark issue with sortByKey: "IndexError: list index out of range"

I don't have any insight into this bug, but on Spark version 1.0.0 I ran into
the same bug running the 'sort.py' example. On a smaller data set, it worked
fine. On a larger data set I got this error:

Traceback (most recent call last):
  File "/home/skane/spark/examples/src/main/python/sort.py", line 30, in <module>
    .sortByKey(lambda x: x)
  File "/usr/lib/spark/python/pyspark/rdd.py", line 480, in sortByKey
    bounds.append(samples[index])
IndexError: list index out of range
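
The traceback fails on `bounds.append(samples[index])`, that is, while picking partition boundaries out of a list of sampled keys. The sketch below is not the PySpark source. It is a hypothetical pure-Python illustration (the names `range_bounds` and `safe_range_bounds` and the index formula are assumptions) of one way such a line can fail: if the sample list is empty, every index is out of range. Whether that is what happened in this case is not established in this thread.

```python
def range_bounds(samples, num_partitions):
    """Pick num_partitions - 1 split keys from a sorted list of sampled keys."""
    bounds = []
    for i in range(num_partitions - 1):
        # Evenly spaced positions in the sample; with an empty sample
        # len(samples) - 1 is -1, so the computed index is invalid.
        index = (len(samples) - 1) * (i + 1) // num_partitions
        bounds.append(samples[index])
    return bounds


def safe_range_bounds(samples, num_partitions):
    """Same idea, but an empty sample simply yields no split keys."""
    if not samples:
        return []
    return range_bounds(samples, num_partitions)


print(range_bounds([1, 3, 5, 7, 9], 3))
try:
    range_bounds([], 3)
except IndexError as exc:
    print("IndexError:", exc)
print(safe_range_bounds([], 3))
```

The guarded variant only shows the shape of a defensive check; it says nothing about how the real fix was made.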



--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/PySpark-issue-with-sortByKey-IndexError-list-index-out-of-range-tp16445p18288.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org

Re: PySpark issue with sortByKey: "IndexError: list index out of range"

Posted by Davies Liu <da...@databricks.com>.

It should be fixed in 1.1+.

Do you have a script that reproduces it?


Re: PySpark issue with sortByKey: "IndexError: list index out of range"

Posted by Davies Liu <da...@databricks.com>.

The errors may happen because there is not enough memory on the
worker, so it keeps spilling into many small files. Could you verify
whether the PR [1] fixes your problem?

[1] https://github.com/apache/spark/pull/3252
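
To illustrate why a small memory budget leads to many small spill files, here is a simplified external sort in pure Python. It is only a sketch of the general technique, not Spark's implementation; the function name `external_sort` and the `max_in_memory` parameter are made up for this example. Note that the merge step keeps every spill file open at once, which is how a large number of spills can run into open-file limits.

```python
import heapq
import os
import tempfile


def read_ints(f):
    """Yield one integer per line from an open text file."""
    for line in f:
        yield int(line)


def external_sort(values, max_in_memory):
    """Sort integers while holding at most max_in_memory of them in memory.

    Returns (sorted_values, number_of_spill_files).
    """
    spill_paths = []
    chunk = []

    def spill():
        # Write the current chunk, sorted, to its own temporary file.
        fd, path = tempfile.mkstemp(suffix=".spill")
        with os.fdopen(fd, "w") as f:
            for v in sorted(chunk):
                f.write("%d\n" % v)
        spill_paths.append(path)
        del chunk[:]

    for v in values:
        chunk.append(v)
        if len(chunk) >= max_in_memory:
            spill()
    if chunk:
        spill()

    # The merge reads all spill files simultaneously.
    files = [open(p) for p in spill_paths]
    try:
        merged = list(heapq.merge(*[read_ints(f) for f in files]))
    finally:
        for f in files:
            f.close()
        for p in spill_paths:
            os.remove(p)
    return merged, len(spill_paths)


data = list(range(1000, 0, -1))
_, few = external_sort(data, max_in_memory=500)
_, many = external_sort(data, max_in_memory=10)
print("spill files with a large budget:", few)
print("spill files with a small budget:", many)
```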

On Thu, Nov 13, 2014 at 11:28 AM, santon <st...@gmail.com> wrote:
> Thanks for the thoughts. I've been testing on Spark 1.1 and haven't seen the
> IndexError yet. I've run into some other errors ("too many open files"), but
> these issues seem to have been discussed already. The dataset, by the way,
> was about 40 GB and 188 million lines; I'm running a sort on 3 worker nodes
> with a total of about 80 cores.
>
> Thanks again for the tips!

Re: PySpark issue with sortByKey: "IndexError: list index out of range"

Posted by santon <st...@gmail.com>.

Thanks for the thoughts. I've been testing on Spark 1.1 and haven't seen
the IndexError yet. I've run into some other errors ("too many open
files"), but these issues seem to have been discussed already. The dataset,
by the way, was about 40 GB and 188 million lines; I'm running a sort on 3
worker nodes with a total of about 80 cores.

Thanks again for the tips!
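
A "too many open files" error generally means a process has hit its operating-system limit on open file descriptors. As a quick check (a sketch, assuming a Unix-like node; the helper name `open_file_limits` is made up for illustration), Python's standard `resource` module reports the limit that applies to the current process:

```python
import resource


def open_file_limits():
    """Return the (soft, hard) limits on open file descriptors (Unix only)."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


soft, hard = open_file_limits()
print("open-file limit: soft=%s hard=%s" % (soft, hard))
```

The value seen from an interactive shell may differ from the one the Spark worker processes actually run under, so it is worth checking in the same environment that launches the workers.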

On Fri, Nov 7, 2014 at 6:03 PM, Davies Liu-2 [via Apache Spark User List] <
ml-node+s1001560n18393h26@n3.nabble.com> wrote:

> Could you tell us how large the data set is? It will help us debug this
> issue.




--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/PySpark-issue-with-sortByKey-IndexError-list-index-out-of-range-tp16445p18871.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: PySpark issue with sortByKey: "IndexError: list index out of range"

Posted by santon <st...@gmail.com>.

Sorry for the delay. I'll try to add some more details on Monday.

Unfortunately, I don't have a script to reproduce the error. Actually, it
seemed to be more about the data set than the script. The same code on
different data sets led to different results; only larger data sets on the
order of 40 GB seemed to crash with the described error. Also, I believe
our cluster was recently updated to CDH 5.2, which uses Spark 1.1. I'll
check to see if the issue was resolved.

On Fri, Nov 7, 2014 at 6:03 PM, Davies Liu-2 [via Apache Spark User List] <
ml-node+s1001560n18393h26@n3.nabble.com> wrote:

> Could you tell us how large the data set is? It will help us debug this
> issue.




--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/PySpark-issue-with-sortByKey-IndexError-list-index-out-of-range-tp16445p18442.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: PySpark issue with sortByKey: "IndexError: list index out of range"

Posted by Davies Liu <da...@databricks.com>.

Could you tell us how large the data set is? It will help us debug this issue.
