Posted to user@spark.apache.org by saatvikshah1994 <sa...@gmail.com> on 2017/06/29 19:59:08 UTC

PySpark working with Generators

Hi,

I have this file reading function called /foo/ which reads a file's contents
either into a list of lists or into a generator of lists of lists representing
the same file.

When reading each file as a complete chunk (one record array) I do something like:
rdd = file_paths_rdd.map(lambda x: foo(x, "wholeFile")).flatMap(lambda x: x)

I'd like to now do something similar but with the generator, so that I can
work with more cores and lower memory. I'm not sure how to tackle this, since
generators cannot be pickled, and thus I'm not sure how to distribute the work
of reading each file path across the RDD.
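
For concreteness, a minimal sketch of the generator-based version being asked
about (it assumes foo has a mode, hypothetically named "lazy", that returns a
generator; Spark only pickles the function passed to flatMap, and the
generator itself is created and consumed lazily on the executor, so it never
needs to be pickled):

def read_lazily(path):
    # Hypothetical "lazy" mode: foo yields records one at a time instead of
    # returning the whole list of lists. The generator is created here, on
    # the executor that processes this path, so it is never serialized.
    for record in foo(path, "lazy"):
        yield record

rdd = file_paths_rdd.flatMap(read_lazily)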



--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/PySpark-working-with-Generators-tp28810.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscribe@spark.apache.org


Re: PySpark working with Generators

Posted by ayan guha <gu...@gmail.com>.
Hi

I understand that now. However, your function foo() should take a string and
parse it, rather than trying to read from a file. That way you can separate
the file-reading part from the processing part.

r = sc.wholeTextFiles(path)  # RDD of (file_path, file_contents) pairs

parsed = r.map(lambda x: (x[0], foo(x[1])))  # keep the path, parse the contents
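
If foo produces several records per file rather than one parsed object, a
flatMap keeps one output row per record - a sketch, assuming a hypothetical
foo_from_string(contents) that parses an in-memory string:

# kv is a (file_path, file_contents) pair; foo_from_string is hypothetical
parsed_records = r.flatMap(lambda kv: foo_from_string(kv[1]))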



On Fri, Jun 30, 2017 at 1:25 PM, Saatvik Shah <sa...@gmail.com>
wrote:

> Hey Ayan,
>
> This isn't a typical text file - it's a proprietary data format for which a
> native Spark reader is not available.
>
> Thanks and Regards,
> Saatvik Shah
>
> On Thu, Jun 29, 2017 at 6:48 PM, ayan guha <gu...@gmail.com> wrote:
>
>> If your files are in the same location you can use sc.wholeTextFiles. If not,
>> sc.textFile accepts a comma-separated list of file paths.
>>
>> On Fri, 30 Jun 2017 at 5:59 am, saatvikshah1994 <
>> saatvikshah1994@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> I have this file reading function called /foo/ which reads a file's
>>> contents either into a list of lists or into a generator of lists of
>>> lists representing the same file.
>>>
>>> When reading each file as a complete chunk (one record array) I do
>>> something like:
>>> rdd = file_paths_rdd.map(lambda x: foo(x, "wholeFile")).flatMap(lambda x: x)
>>>
>>> I'd like to now do something similar but with the generator, so that I can
>>> work with more cores and lower memory. I'm not sure how to tackle this,
>>> since generators cannot be pickled, and thus I'm not sure how to
>>> distribute the work of reading each file path across the RDD.
>>>
>>>
>>>
>>> --
>>> View this message in context: http://apache-spark-user-list.
>>> 1001560.n3.nabble.com/PySpark-working-with-Generators-tp28810.html
>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe e-mail: user-unsubscribe@spark.apache.org
>>>
>>> --
>> Best Regards,
>> Ayan Guha
>>
>
>


-- 
Best Regards,
Ayan Guha

Re: PySpark working with Generators

Posted by Saatvik Shah <sa...@gmail.com>.
Hi Jörn,

I apologize for such a late response.

Yes, the data volume is very high (it won't fit in one machine's memory) and I
am getting a significant benefit from reading the files in a distributed
manner.
Since the data volume is high, converting it to an alternative format would be
a worst-case scenario.
I agree on writing a custom Spark data source, but that might take a while, and
to proceed with the work until then I was hoping to use the current
implementation itself, which is fast enough to work with. The only issue is the
one I've already discussed, which is working with generators to allow
low-memory executor tasks.
I'm not sure I fully understand your recommendation on core usage - could you
explain in a little more detail? I'm currently using dynamic allocation with
YARN, allowing each Spark executor 8 vcores (a rough configuration sketch
follows below).
The data format is proprietary and certainly not one you will have heard of.
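
For reference, a rough sketch of the kind of configuration described above
(the property names are real Spark settings, but the values here are
illustrative rather than taken from this thread):

from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .set("spark.dynamicAllocation.enabled", "true")
        .set("spark.shuffle.service.enabled", "true")  # required by dynamic allocation on YARN
        .set("spark.executor.cores", "8"))
sc = SparkContext(conf=conf)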

Thanks and Regards,
Saatvik Shah


On Fri, Jun 30, 2017 at 10:16 AM, Jörn Franke <jo...@gmail.com> wrote:

> In this case I do not see many benefits of using Spark. Is the data volume
> high?
> Alternatively, I recommend converting the proprietary format into a format
> Spark understands and then using that format in Spark.
> Another alternative would be to write a custom Spark data source. Even your
> proprietary format should then be able to be put on HDFS.
> That being said, I do not recommend using more cores outside Spark's
> control. The reason is that Spark thinks those cores are free and makes the
> wrong allocation of executors/tasks. This will slow down all applications
> on Spark.
>
> May I ask what the format is called?
>
> On 30. Jun 2017, at 16:05, Saatvik Shah <sa...@gmail.com> wrote:
>
> Hi Mahesh and Ayan,
>
> The files I'm working with are in a very complex proprietary format, for
> which I only have access to a reader function, as I described earlier, which
> only accepts a path on a local file system.
> This rules out sc.wholeTextFiles - since I cannot pass the contents returned
> by wholeTextFiles to a function (API call) expecting a local file path.
> For similar reasons, I cannot use HDFS and am bound to using a highly
> available Network File System arrangement currently.
> Any suggestions, given these constraints? Or any incorrect assumptions you
> think I've made?
>
> Thanks and Regards,
> Saatvik Shah
>
>
>
> On Fri, Jun 30, 2017 at 12:50 AM, Mahesh Sawaiker <
> mahesh_sawaiker@persistent.com> wrote:
>
>> Wouldn’t this work if you load the files in hdfs and let the partitions
>> be equal to the amount of parallelism you want?
>>
>>
>>
>> *From:* Saatvik Shah [mailto:saatvikshah1994@gmail.com]
>> *Sent:* Friday, June 30, 2017 8:55 AM
>> *To:* ayan guha
>> *Cc:* user
>> *Subject:* Re: PySpark working with Generators
>>
>>
>>
>> Hey Ayan,
>>
>>
>>
>> This isn't a typical text file - it's a proprietary data format for which a
>> native Spark reader is not available.
>>
>>
>>
>> Thanks and Regards,
>>
>> Saatvik Shah
>>
>>
>>
>> On Thu, Jun 29, 2017 at 6:48 PM, ayan guha <gu...@gmail.com> wrote:
>>
>> If your files are in the same location you can use sc.wholeTextFiles. If not,
>> sc.textFile accepts a comma-separated list of file paths.
>>
>>
>>
>> On Fri, 30 Jun 2017 at 5:59 am, saatvikshah1994 <
>> saatvikshah1994@gmail.com> wrote:
>>
>> Hi,
>>
>> I have this file reading function called /foo/ which reads a file's
>> contents either into a list of lists or into a generator of lists of
>> lists representing the same file.
>>
>> When reading each file as a complete chunk (one record array) I do
>> something like:
>> rdd = file_paths_rdd.map(lambda x: foo(x, "wholeFile")).flatMap(lambda x: x)
>>
>> I'd like to now do something similar but with the generator, so that I can
>> work with more cores and lower memory. I'm not sure how to tackle this,
>> since generators cannot be pickled, and thus I'm not sure how to
>> distribute the work of reading each file path across the RDD.
>>
>>
>>
>> --
>> View this message in context: http://apache-spark-user-list.
>> 1001560.n3.nabble.com/PySpark-working-with-Generators-tp28810.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> ---------------------------------------------------------------------
>> To unsubscribe e-mail: user-unsubscribe@spark.apache.org
>>
>> --
>>
>> Best Regards,
>> Ayan Guha
>>
>>
>> DISCLAIMER
>> ==========
>> This e-mail may contain privileged and confidential information which is
>> the property of Persistent Systems Ltd. It is intended only for the use of
>> the individual or entity to which it is addressed. If you are not the
>> intended recipient, you are not authorized to read, retain, copy, print,
>> distribute or use this message. If you have received this communication in
>> error, please notify the sender and delete all copies of this message.
>> Persistent Systems Ltd. does not accept any liability for virus infected
>> mails.
>>
>


-- 
*Saatvik Shah,*
*1st  Year,*
*Masters in the School of Computer Science,*
*Carnegie Mellon University*

*https://saatvikshah1994.github.io/*

Re: PySpark working with Generators

Posted by Jörn Franke <jo...@gmail.com>.
In this case I do not see many benefits of using Spark. Is the data volume high?
Alternatively, I recommend converting the proprietary format into a format Spark understands and then using that format in Spark.
Another alternative would be to write a custom Spark data source. Even your proprietary format should then be able to be put on HDFS.
That being said, I do not recommend using more cores outside Spark's control. The reason is that Spark thinks those cores are free and makes the wrong allocation of executors/tasks. This will slow down all applications on Spark.

May I ask what the format is called?
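
A rough sketch of the conversion idea above - it assumes a SparkSession named
spark, that each record yielded by foo can be wrapped in a Row, and an
illustrative output path; Parquet stands in for "a format Spark understands":

from pyspark.sql import Row

# One-off conversion job: read the proprietary files with the existing reader,
# then persist the records in a format Spark can read natively from HDFS.
rows = file_paths_rdd.flatMap(lambda p: foo(p, "wholeFile")) \
                     .map(lambda rec: Row(values=rec))
spark.createDataFrame(rows).write.parquet("hdfs:///converted/data")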

> On 30. Jun 2017, at 16:05, Saatvik Shah <sa...@gmail.com> wrote:
> 
> Hi Mahesh and Ayan,
> 
> The files I'm working with are in a very complex proprietary format, for which I only have access to a reader function, as I described earlier, which only accepts a path on a local file system.
> This rules out sc.wholeTextFiles - since I cannot pass the contents returned by wholeTextFiles to a function (API call) expecting a local file path.
> For similar reasons, I cannot use HDFS and am bound to using a highly available Network File System arrangement currently.
> Any suggestions, given these constraints? Or any incorrect assumptions you think I've made?
> 
> Thanks and Regards,
> Saatvik Shah
> 
>  
> 
>> On Fri, Jun 30, 2017 at 12:50 AM, Mahesh Sawaiker <ma...@persistent.com> wrote:
>> Wouldn’t this work if you load the files in hdfs and let the partitions be equal to the amount of parallelism you want?
>> 
>>  
>> 
>> From: Saatvik Shah [mailto:saatvikshah1994@gmail.com] 
>> Sent: Friday, June 30, 2017 8:55 AM
>> To: ayan guha
>> Cc: user
>> Subject: Re: PySpark working with Generators
>> 
>>  
>> 
>> Hey Ayan,
>> 
>>  
>> 
>> This isn't a typical text file - it's a proprietary data format for which a native Spark reader is not available.
>> 
>>  
>> 
>> Thanks and Regards,
>> 
>> Saatvik Shah
>> 
>>  
>> 
>> On Thu, Jun 29, 2017 at 6:48 PM, ayan guha <gu...@gmail.com> wrote:
>> 
>> If your files are in the same location you can use sc.wholeTextFiles. If not, sc.textFile accepts a comma-separated list of file paths.
>> 
>>  
>> 
>> On Fri, 30 Jun 2017 at 5:59 am, saatvikshah1994 <sa...@gmail.com> wrote:
>> 
>> Hi,
>> 
>> I have this file reading function called /foo/ which reads a file's
>> contents either into a list of lists or into a generator of lists of
>> lists representing the same file.
>>
>> When reading each file as a complete chunk (one record array) I do
>> something like:
>> rdd = file_paths_rdd.map(lambda x: foo(x, "wholeFile")).flatMap(lambda x: x)
>>
>> I'd like to now do something similar but with the generator, so that I can
>> work with more cores and lower memory. I'm not sure how to tackle this,
>> since generators cannot be pickled, and thus I'm not sure how to
>> distribute the work of reading each file path across the RDD.
>> 
>> 
>> 
>> --
>> View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/PySpark-working-with-Generators-tp28810.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>> 
>> ---------------------------------------------------------------------
>> To unsubscribe e-mail: user-unsubscribe@spark.apache.org
>> 
>> --
>> 
>> Best Regards,
>> Ayan Guha
>> 
>>  
>> 
>> DISCLAIMER
>> ==========
>> This e-mail may contain privileged and confidential information which is the property of Persistent Systems Ltd. It is intended only for the use of the individual or entity to which it is addressed. If you are not the intended recipient, you are not authorized to read, retain, copy, print, distribute or use this message. If you have received this communication in error, please notify the sender and delete all copies of this message. Persistent Systems  Ltd. does not accept any liability for virus infected mails.

Re: PySpark working with Generators

Posted by Saatvik Shah <sa...@gmail.com>.
Hi Mahesh and Ayan,

The files I'm working with are in a very complex proprietary format, for which
I only have access to a reader function, as I described earlier, which only
accepts a path on a local file system.
This rules out sc.wholeTextFiles - since I cannot pass the contents returned by
wholeTextFiles to a function (API call) expecting a local file path.
For similar reasons, I cannot use HDFS and am bound to using a highly available
Network File System arrangement currently.
Any suggestions, given these constraints? Or any incorrect assumptions you
think I've made?
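
Given these constraints, one pattern worth sketching (it assumes the NFS mount
is visible at the same path on every executor, that foo is importable on the
workers, and that the paths below are made up):

# Distribute the NFS paths; each task calls the local-file reader on whichever
# executor it runs on, since the mount point is the same everywhere.
paths = ["/mnt/nfs/data/file_0001.bin", "/mnt/nfs/data/file_0002.bin"]
file_paths_rdd = sc.parallelize(paths, len(paths))
records = file_paths_rdd.flatMap(lambda p: foo(p, "wholeFile"))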

Thanks and Regards,
Saatvik Shah



On Fri, Jun 30, 2017 at 12:50 AM, Mahesh Sawaiker <
mahesh_sawaiker@persistent.com> wrote:

> Wouldn’t this work if you load the files in hdfs and let the partitions be
> equal to the amount of parallelism you want?
>
>
>
> *From:* Saatvik Shah [mailto:saatvikshah1994@gmail.com]
> *Sent:* Friday, June 30, 2017 8:55 AM
> *To:* ayan guha
> *Cc:* user
> *Subject:* Re: PySpark working with Generators
>
>
>
> Hey Ayan,
>
>
>
> This isn't a typical text file - it's a proprietary data format for which a
> native Spark reader is not available.
>
>
>
> Thanks and Regards,
>
> Saatvik Shah
>
>
>
> On Thu, Jun 29, 2017 at 6:48 PM, ayan guha <gu...@gmail.com> wrote:
>
> If your files are in the same location you can use sc.wholeTextFiles. If not,
> sc.textFile accepts a comma-separated list of file paths.
>
>
>
> On Fri, 30 Jun 2017 at 5:59 am, saatvikshah1994 <sa...@gmail.com>
> wrote:
>
> Hi,
>
> I have this file reading function called /foo/ which reads a file's
> contents either into a list of lists or into a generator of lists of
> lists representing the same file.
>
> When reading each file as a complete chunk (one record array) I do
> something like:
> rdd = file_paths_rdd.map(lambda x: foo(x, "wholeFile")).flatMap(lambda x: x)
>
> I'd like to now do something similar but with the generator, so that I can
> work with more cores and lower memory. I'm not sure how to tackle this,
> since generators cannot be pickled, and thus I'm not sure how to
> distribute the work of reading each file path across the RDD.
>
>
>
> --
> View this message in context: http://apache-spark-user-list.
> 1001560.n3.nabble.com/PySpark-working-with-Generators-tp28810.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> ---------------------------------------------------------------------
> To unsubscribe e-mail: user-unsubscribe@spark.apache.org
>
> --
>
> Best Regards,
> Ayan Guha
>
>
> DISCLAIMER
> ==========
> This e-mail may contain privileged and confidential information which is
> the property of Persistent Systems Ltd. It is intended only for the use of
> the individual or entity to which it is addressed. If you are not the
> intended recipient, you are not authorized to read, retain, copy, print,
> distribute or use this message. If you have received this communication in
> error, please notify the sender and delete all copies of this message.
> Persistent Systems Ltd. does not accept any liability for virus infected
> mails.
>

RE: PySpark working with Generators

Posted by Mahesh Sawaiker <ma...@persistent.com>.
Wouldn't this work if you load the files into HDFS and set the number of partitions equal to the amount of parallelism you want?
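
Whether the files sit in HDFS or on a shared filesystem, the degree of read
parallelism is set by the partition count of the paths RDD - a small sketch
(the count of 64 and the variable all_file_paths are illustrative):

# At most 64 files are read concurrently; Spark schedules one task per partition.
file_paths_rdd = sc.parallelize(all_file_paths, 64)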

From: Saatvik Shah [mailto:saatvikshah1994@gmail.com]
Sent: Friday, June 30, 2017 8:55 AM
To: ayan guha
Cc: user
Subject: Re: PySpark working with Generators

Hey Ayan,

This isn't a typical text file - it's a proprietary data format for which a native Spark reader is not available.

Thanks and Regards,
Saatvik Shah

On Thu, Jun 29, 2017 at 6:48 PM, ayan guha <gu...@gmail.com>> wrote:
If your files are in the same location you can use sc.wholeTextFiles. If not, sc.textFile accepts a comma-separated list of file paths.

On Fri, 30 Jun 2017 at 5:59 am, saatvikshah1994 <sa...@gmail.com>> wrote:
Hi,

I have this file reading function called /foo/ which reads a file's contents
either into a list of lists or into a generator of lists of lists representing
the same file.

When reading each file as a complete chunk (one record array) I do something like:
rdd = file_paths_rdd.map(lambda x: foo(x, "wholeFile")).flatMap(lambda x: x)

I'd like to now do something similar but with the generator, so that I can
work with more cores and lower memory. I'm not sure how to tackle this, since
generators cannot be pickled, and thus I'm not sure how to distribute the work
of reading each file path across the RDD.



--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/PySpark-working-with-Generators-tp28810.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscribe@spark.apache.org
--
Best Regards,
Ayan Guha

DISCLAIMER
==========
This e-mail may contain privileged and confidential information which is the property of Persistent Systems Ltd. It is intended only for the use of the individual or entity to which it is addressed. If you are not the intended recipient, you are not authorized to read, retain, copy, print, distribute or use this message. If you have received this communication in error, please notify the sender and delete all copies of this message. Persistent Systems Ltd. does not accept any liability for virus infected mails.

Re: PySpark working with Generators

Posted by Saatvik Shah <sa...@gmail.com>.
Hey Ayan,

This isn't a typical text file - it's a proprietary data format for which a
native Spark reader is not available.

Thanks and Regards,
Saatvik Shah

On Thu, Jun 29, 2017 at 6:48 PM, ayan guha <gu...@gmail.com> wrote:

> If your files are in the same location you can use sc.wholeTextFiles. If not,
> sc.textFile accepts a comma-separated list of file paths.
>
> On Fri, 30 Jun 2017 at 5:59 am, saatvikshah1994 <sa...@gmail.com>
> wrote:
>
>> Hi,
>>
>> I have this file reading function called /foo/ which reads a file's
>> contents either into a list of lists or into a generator of lists of
>> lists representing the same file.
>>
>> When reading each file as a complete chunk (one record array) I do
>> something like:
>> rdd = file_paths_rdd.map(lambda x: foo(x, "wholeFile")).flatMap(lambda x: x)
>>
>> I'd like to now do something similar but with the generator, so that I can
>> work with more cores and lower memory. I'm not sure how to tackle this,
>> since generators cannot be pickled, and thus I'm not sure how to
>> distribute the work of reading each file path across the RDD.
>>
>>
>>
>> --
>> View this message in context: http://apache-spark-user-list.
>> 1001560.n3.nabble.com/PySpark-working-with-Generators-tp28810.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> ---------------------------------------------------------------------
>> To unsubscribe e-mail: user-unsubscribe@spark.apache.org
>>
>> --
> Best Regards,
> Ayan Guha
>

Re: PySpark working with Generators

Posted by ayan guha <gu...@gmail.com>.
If your files are in the same location you can use sc.wholeTextFiles. If not,
sc.textFile accepts a comma-separated list of file paths.
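
For illustration (the paths are made up); note that textFile takes multiple
paths as a single comma-separated string:

pairs = sc.wholeTextFiles("/data/mydir")                       # one (path, contents) pair per file
lines = sc.textFile(",".join(["/data/a.txt", "/data/b.txt"]))  # scattered files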

On Fri, 30 Jun 2017 at 5:59 am, saatvikshah1994 <sa...@gmail.com>
wrote:

> Hi,
>
> I have this file reading function called /foo/ which reads a file's
> contents either into a list of lists or into a generator of lists of
> lists representing the same file.
>
> When reading each file as a complete chunk (one record array) I do
> something like:
> rdd = file_paths_rdd.map(lambda x: foo(x, "wholeFile")).flatMap(lambda x: x)
>
> I'd like to now do something similar but with the generator, so that I can
> work with more cores and lower memory. I'm not sure how to tackle this,
> since generators cannot be pickled, and thus I'm not sure how to
> distribute the work of reading each file path across the RDD.
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/PySpark-working-with-Generators-tp28810.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> ---------------------------------------------------------------------
> To unsubscribe e-mail: user-unsubscribe@spark.apache.org
>
> --
Best Regards,
Ayan Guha