Posted to user@spark.apache.org by Bertrand Dechoux <de...@gmail.com> on 2014/04/16 13:27:10 UTC

PySpark still reading only text?

Hi,

I have browsed the online documentation, and it states that PySpark can only
read text files as sources. Is that still the case?

From what I understand, after this first step the RDD can hold any serialized
Python structure, as long as the class definitions are distributed to the workers.

Is it not possible to read those RDDs back? That is, to build a flow that
parses everything once and then, e.g. the next week, starts directly from the
binary, structured data?

Technically, what is the difficulty? I would assume the code for reading a
binary Python RDD and a binary Python file to be quite similar. Where can I
learn more about this subject?

Thanks in advance

Bertrand

Re: PySpark still reading only text?

Posted by Bertrand Dechoux <de...@gmail.com>.
Cool, thanks for the link.

Bertrand Dechoux


On Mon, Apr 21, 2014 at 7:31 PM, Nick Pentreath <ni...@gmail.com> wrote:

> Also see: https://github.com/apache/spark/pull/455
> [...]

Re: PySpark still reading only text?

Posted by Nick Pentreath <ni...@gmail.com>.
Also see: https://github.com/apache/spark/pull/455


This will add support for reading SequenceFiles and other InputFormats in PySpark, as long as the Writables are either simple (primitives, maps, and arrays of the same) or reasonably simple Java objects.

I'm about to push a change from MsgPack to Pyrolite for the serialization.

Support for saving as SequenceFiles or through other OutputFormats could then also come after that. It would be based on saving the picklable Python objects as SequenceFiles and being able to read those back.
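
Once that is merged, reading might look roughly like this (a sketch only,
written against the API proposed in the PR; the final method names and
signatures may differ, and the path is illustrative):

    # Hypothetical usage based on the API proposed in PR 455.
    rdd = sc.sequenceFile(
        "hdfs:///data/events.seq",
        keyClass="org.apache.hadoop.io.Text",
        valueClass="org.apache.hadoop.io.IntWritable")
    # Writables are converted to the corresponding Python types,
    # so this yields an RDD of (unicode, int) pairs.
    print(rdd.take(5))
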
—
Sent from Mailbox for iPhone


Re: PySpark still reading only text?

Posted by Bertrand Dechoux <de...@gmail.com>.
According to the Spark SQL documentation, indeed, this project allows
Python to be used while reading/writing tables, i.e. data which is not
necessarily in a text format.

Thanks a lot!

Bertrand Dechoux



Re: PySpark still reading only text?

Posted by Bertrand Dechoux <de...@gmail.com>.
Thanks for the JIRA reference. I really need to look at Spark SQL.

Am I right to understand that, thanks to Spark SQL, Hive data can be read
(and it does not need to be in a text format) and that 'classical' Spark can
then work on this extract?

It seems logical, but I haven't worked with Spark SQL so far.

Does it also imply that the reverse is true? That I can write results from an
arbitrary (Python) Spark application as Hive data using Spark SQL?

Bertrand Dechoux



Re: PySpark still reading only text?

Posted by Matei Zaharia <ma...@gmail.com>.
Yes, this JIRA would enable that. The Hive support also handles HDFS.

Matei

On Apr 16, 2014, at 9:55 PM, Jesvin Jose <fr...@gmail.com> wrote:

> When this is implemented, can you load/save an RDD of pickled objects to HDFS?


Re: PySpark still reading only text?

Posted by Jesvin Jose <fr...@gmail.com>.
When this is implemented, can you load/save an RDD of pickled objects to
HDFS?




-- 
We don't beat the reaper by living longer. We beat the reaper by living well
and living fully. The reaper will come for all of us. Question is, what do
we do between the time we are born and the time he shows up? -Randy Pausch

Re: PySpark still reading only text?

Posted by Matei Zaharia <ma...@gmail.com>.
Hi Bertrand,

We should probably add a SparkContext.pickleFile and RDD.saveAsPickleFile that will allow saving pickled objects. Unfortunately this is not in yet, but there is an issue up to track it: https://issues.apache.org/jira/browse/SPARK-1161.
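
The intended usage would be along these lines (hypothetical until SPARK-1161
is implemented; the method names come from the proposal above and the path is
illustrative):

    # Hypothetical API, tracked by SPARK-1161.
    rdd.saveAsPickleFile("hdfs:///tmp/my_rdd")      # save pickled objects
    restored = sc.pickleFile("hdfs:///tmp/my_rdd")  # load them back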

In 1.0, one feature we do have now is the ability to load binary data from Hive using Spark SQL’s Python API. Later we will also be able to save to Hive.
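
Loading from Hive in Python looks roughly like this (a sketch against the 1.0
API; the table and column names are illustrative):

    from pyspark.sql import HiveContext

    hive_ctx = HiveContext(sc)
    # Runs a HiveQL query; the result is a SchemaRDD that supports the
    # usual RDD operations from Python, regardless of the underlying
    # storage format of the Hive table.
    rows = hive_ctx.hql("SELECT key, value FROM my_table")
    print(rows.map(lambda r: r.value).take(5))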

Matei
