Posted to user@spark.apache.org by Christopher Bourez <ch...@gmail.com> on 2016/01/25 16:53:09 UTC

bug for large textfiles on windows

Dears,

I would like to reopen a case for a potential bug (its current status is
Resolved, but it does not seem to be):

https://issues.apache.org/jira/browse/SPARK-12261

I believe there is something wrong with the memory management under
Windows.

It makes no sense to be able to work only with files smaller than a few MB...

Do not hesitate to ask me questions if you want to help reproduce the
bug,

Best

Christopher Bourez
06 17 17 50 60

Re: bug for large textfiles on windows

Posted by Christopher Bourez <ch...@gmail.com>.
Dears,

I recompiled Spark on Windows and it seems to work better. My problem with
PySpark remains:
https://issues.apache.org/jira/browse/SPARK-12261

I do not know how to debug this; it seems to be linked to pickle and the
garbage collector... I would like to clear the Spark context to see if I
can gain anything.

Christopher Bourez
06 17 17 50 60


Re: bug for large textfiles on windows

Posted by Christopher Bourez <ch...@gmail.com>.
Here is a pic of the memory.
If I set --conf spark.driver.memory=3g, it increases the displayed memory,
but the problem remains... for a file that is only 13 MB.
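For reference (per the Spark configuration docs): in local/client mode,
spark.driver.memory must be set before the driver JVM starts, so passing it
with --conf on the command line as above, or putting it in
conf\spark-defaults.conf, both work, while setting it on a SparkConf inside
an already-running program has no effect. A sketch of the properties-file
entry (the 3g value is just illustrative):

```
# conf\spark-defaults.conf
spark.driver.memory    3g
```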

Christopher Bourez
06 17 17 50 60


Re: bug for large textfiles on windows

Posted by Christopher Bourez <ch...@gmail.com>.
The same problem occurs on my desktop at work.
What's great with AWS WorkSpaces is that you can easily reproduce it.

I created the test file with these commands:

for i in {0..300000}; do
  VALUE="$RANDOM"
  for j in {0..6}; do
    VALUE="$VALUE;$RANDOM";
  done
  echo $VALUE >> test.csv
done
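For reference, here is an equivalent generator in Python (a sketch, not the
script actually used; it mimics bash's $RANDOM range of 0..32767 and writes
8 values per line, ';'-separated, like the loop above):

```python
import random

def write_test_csv(path, n_rows=300001, n_cols=8):
    """Write n_rows lines of n_cols pseudo-random ints in 0..32767, ';'-separated."""
    with open(path, "w") as f:
        for _ in range(n_rows):
            f.write(";".join(str(random.randint(0, 32767)) for _ in range(n_cols)))
            f.write("\n")
```

Calling write_test_csv("test.csv") should produce a file of roughly the same
~13 MB size and shape.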

Christopher Bourez
06 17 17 50 60


Re: bug for large textfiles on windows

Posted by Christopher Bourez <ch...@gmail.com>.
Josh,

Thanks a lot!

You can download a video I created:
https://s3-eu-west-1.amazonaws.com/christopherbourez/public/video.mov

I created a sample file of 13 MB as explained:
https://s3-eu-west-1.amazonaws.com/christopherbourez/public/test.csv

Here are the steps I took:

I created an AWS WorkSpace with Windows 7 (which I can share with you if
you'd like), Standard instance, 2 GiB RAM.
On this instance I:
downloaded Spark (1.5 or 1.6, same problem) with Hadoop 2.6
installed the Java 8 JDK
downloaded Python 2.7.8

downloaded the sample file
https://s3-eu-west-1.amazonaws.com/christopherbourez/public/test.csv

And then the commands I launch are:
bin\pyspark --master local[1]
sc.textFile("test.csv").take(1)

As you can see, sc.textFile("test.csv", 2000).take(1) works well.
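A rough way to see why the second call behaves differently (assuming the
file is split roughly evenly; the exact boundaries are up to the Hadoop
input format): with the default partitioning the whole 13 MB file is read
as one split, while 2000 partitions give splits of only a few KB each:

```python
file_size = 13 * 1024 * 1024  # ~13 MB, in bytes

def approx_split_size(num_partitions):
    # Even split; real InputFormat boundaries differ slightly.
    return file_size // num_partitions

print(approx_split_size(1))     # 13631488 bytes: the whole file in one partition
print(approx_split_size(2000))  # 6815 bytes: a few KB per partition
```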

Thanks a lot!


Christopher Bourez
06 17 17 50 60


Re: bug for large textfiles on windows

Posted by Josh Rosen <jo...@databricks.com>.
Hi Christopher,

What would be super helpful here is a standalone reproduction: ideally a
single Scala file, or a set of commands that I can run in `spark-shell`,
that generates a giant file and then reads it in a way that demonstrates
the bug. If you have such a reproduction, could you attach it to that JIRA
ticket? Thanks!
