
Posted to user@spark.apache.org by Oleg Ruchovets <or...@gmail.com> on 2014/11/14 06:39:19 UTC

pyspark and hdfs file name

Hi,
  I am running a pyspark job.
I need to serialize the final result to *hdfs in binary files* and to be able
to give a *name for output files*.

I found this post:
http://stackoverflow.com/questions/25293962/specifying-the-output-file-name-in-apache-spark


but it explains how to do it using Scala.

Question:
 How can I do it using pyspark?

Thanks
Oleg.

Re: pyspark and hdfs file name

Posted by Davies Liu <da...@databricks.com>.

On Fri, Nov 14, 2014 at 12:14 AM, Oleg Ruchovets <or...@gmail.com> wrote:
> Hi Davies.
> Thank you for the quick answer.
>
> I have code like this:
>
> ....
>
> sc = SparkContext(appName="TAD")
> lines = sc.textFile(sys.argv[1], 1)
> # Index into the (key, values) pair; tuple-unpacking lambdas are Python 2 only.
> result = lines.map(doSplit).groupByKey().map(
>     lambda kv: traffic_process_model(kv[0], kv[1]))
> result.saveAsTextFile(sys.argv[2])
>
>
> Could you please give a short example of what I should do?
>
> Also, I found only saveAsTextFile. Does PySpark have a saveAsBinary option, or
> what is the way to change the output files from text format?

You can use saveAsPickleFile() [1] for binary output. To rename the output
files, you could use the following line (it's slow):

>>> os.system( "hadoop fs -mv URI [URI …] <dest>")

I just found that there is a pure Python client for HDFS [2] (I have not verified it).

[1] http://spark.apache.org/docs/latest/api/python/pyspark.rdd.RDD-class.html#saveAsPickleFile
[2] https://labs.spotify.com/2013/05/07/snakebite/
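
The rename step suggested above could be sketched roughly as follows. This is
only an illustration, not a tested recipe: the helper names
(build_rename_commands, rename_part_files), the "prefix-NNNNN" naming scheme,
and the example paths are assumptions made up for this sketch. The only
external tool it relies on is the "hadoop fs -mv" command already mentioned in
the reply, called through subprocess instead of os.system so that a failed
move raises an error.

```python
import subprocess


def build_rename_commands(part_files, dest_dir, prefix):
    """Return one `hadoop fs -mv SRC DEST` argument list per part file.

    part_files: HDFS paths of the part-NNNNN files written by Spark.
    dest_dir:   HDFS directory that receives the renamed files.
    prefix:     hypothetical base name chosen by the caller.
    """
    commands = []
    # Sort so that the new numbering follows the original part order.
    for index, src in enumerate(sorted(part_files)):
        dest = "%s/%s-%05d" % (dest_dir.rstrip("/"), prefix, index)
        commands.append(["hadoop", "fs", "-mv", src, dest])
    return commands


def rename_part_files(part_files, dest_dir, prefix,
                      runner=subprocess.check_call):
    """Run the moves one by one; `runner` can be swapped out for a dry run."""
    for command in build_rename_commands(part_files, dest_dir, prefix):
        runner(command)


if __name__ == "__main__":
    # Dry run with made-up paths: print the commands instead of executing them.
    rename_part_files(["/out/part-00000", "/out/part-00001"],
                      "/out", "traffic", runner=print)
```

Each move starts a separate hadoop process, which is why this approach is slow
when there are many part files.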


---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org

Re: pyspark and hdfs file name

Posted by Oleg Ruchovets <or...@gmail.com>.

Hi Davies.
Thank you for the quick answer.

I have code like this:

....

sc = SparkContext(appName="TAD")
lines = sc.textFile(sys.argv[1], 1)
# Index into the (key, values) pair; tuple-unpacking lambdas are Python 2 only.
result = lines.map(doSplit).groupByKey().map(
    lambda kv: traffic_process_model(kv[0], kv[1]))
result.saveAsTextFile(sys.argv[2])


Could you please give a short example of what I should do?

Also, I found only saveAsTextFile. Does PySpark have a saveAsBinary option, or
what is the way to change the output files from text format?

Thanks
Oleg.
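
For the binary-output part of the question, a minimal sketch might look like
the following. It assumes the job from the message above and swaps
saveAsTextFile for RDD.saveAsPickleFile, which writes the records as pickled
Python objects instead of text; SparkContext.pickleFile reads them back. The
function names (save_as_pickle, load_pickle, run_job) and their parameters are
hypothetical, and sc, do_split and process are passed in by the caller, so the
sketch itself does not import pyspark.

```python
def save_as_pickle(rdd, output_dir):
    """Write the RDD to output_dir as pickled (binary) records.

    The files inside output_dir are still named part-NNNNN by Spark,
    so choosing the file names remains a separate step.
    """
    rdd.saveAsPickleFile(output_dir)


def load_pickle(sc, input_dir):
    """Read records written by saveAsPickleFile back into an RDD."""
    return sc.pickleFile(input_dir)


def run_job(sc, input_path, output_dir, do_split, process):
    """Same pipeline as in the message, ending in binary output.

    do_split maps a text line to a (key, value) pair; process receives
    a key and the iterable of its grouped values.
    """
    lines = sc.textFile(input_path, 1)
    result = (lines.map(do_split)
                   .groupByKey()
                   .map(lambda kv: process(kv[0], kv[1])))
    save_as_pickle(result, output_dir)
```

With a real SparkContext this could be called as
run_job(sc, sys.argv[1], sys.argv[2], doSplit, traffic_process_model).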


Re: pyspark and hdfs file name

Posted by Davies Liu <da...@databricks.com>.

One option may be to call the HDFS tools or a client to rename the files after saveAsXXXFile().

