Posted to user@spark.apache.org by Andrew Ehrlich <an...@aehrlich.com> on 2016/08/01 00:18:29 UTC

Re: How to write contents of RDD to HDFS as separate file for each item in RDD (PySpark)

You could write each image to a different directory instead of a different file. That can be done by filtering the RDD into one RDD per image and then saving each. That might not be what you’re after, though, in terms of space and speed efficiency. Another way would be to save all of the outputs into one Parquet (or text) file. There might be information about the images you can partition on (probably some timestamp) to make lookups faster.
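
A minimal sketch of the first approach, assuming the (filename, bytes) pairs that sc.binaryFiles returns; the output paths and the collect of the keys are illustrative, not from the original thread:

images = sc.binaryFiles("/root/sift_images_test/*.jpg")
for path in images.keys().collect():
    # Derive a directory name from the file name, e.g. ".../one.jpg" -> "one"
    name = path.split("/")[-1].rsplit(".", 1)[0]
    # Filter down to the single image and save it under its own directory.
    # The default argument pins the current path for the closure.
    (images.filter(lambda kv, p=path: kv[0] == p)
           .map(lambda kv: len(kv[1]))
           .saveAsTextFile("hdfs://localhost:9000/tmp/output/" + name))

Note that the loop re-scans the RDD once per image, which is part of the speed cost mentioned above. And a sketch of the consolidated alternative, writing (filename, length) rows to one Parquet output; the column names are made up for illustration:

from pyspark.sql import SQLContext

sqlContext = SQLContext(sc)
rows = images.map(lambda kv: (kv[0], len(kv[1])))
df = sqlContext.createDataFrame(rows, ["filename", "length"])
df.write.parquet("hdfs://localhost:9000/tmp/lengths.parquet")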

> On Jul 30, 2016, at 8:01 PM, Bhaarat Sharma <bh...@gmail.com> wrote:
> 
> I am just trying to do this as a proof of concept. The actual content of the files will be quite big.
> 
> I'm having a problem using foreach or something similar on an RDD.
> sc.binaryFiles("/root/sift_images_test/*.jpg")
> returns
> ("filename1", bytes)
> ("filname2",bytes)
> I'm wondering if there is a do processing one each of these (process in this case is just getting the bytes length but will be something else in real world) and then write the contents to separate HDFS files. 
> If this doesn't make sense, would it make more sense to have all contents in a single HDFS file?
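> 
> A minimal sketch of such per-pair processing, assuming the same input path; it keeps the filename alongside the computed value, though it still writes one consolidated output rather than separate files:
> 
> images = sc.binaryFiles("/root/sift_images_test/*.jpg")
> # Pair each filename with its byte length instead of discarding the name
> images.map(lambda kv: (kv[0], len(kv[1]))).saveAsTextFile("hdfs://localhost:9000/tmp/lengths")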
> 
> On Sat, Jul 30, 2016 at 10:19 PM, ayan guha <guha.ayan@gmail.com> wrote:
> This sounds like a bad idea, given that HDFS does not work well with small files.
> 
> On Sun, Jul 31, 2016 at 8:57 AM, Bhaarat Sharma <bhaarat.s@gmail.com> wrote:
> I am reading a bunch of files in PySpark using binaryFiles. Then I want to get the number of bytes for each file and write this number to an HDFS file with the corresponding name.
> 
> Example:
> 
> if directory /myimages has one.jpg, two.jpg, and three.jpg, then I want three files one-success.txt, two-success.txt, and three-success.txt in HDFS, each containing a number. The number will be the length of the file in bytes.
> 
> Here is what I've done thus far:
> 
> from pyspark import SparkContext
> import numpy as np
> 
> sc = SparkContext("local", "test")
> 
> def bytes_length(rawdata):
>     # Convert the raw file contents to a uint8 array and count its elements
>     return len(np.asarray(bytearray(rawdata), dtype=np.uint8))
> 
> images = sc.binaryFiles("/root/sift_images_test/*.jpg")
> # Tuple unpacking in a lambda is Python 2 only; index into the pair instead
> images.map(lambda kv: bytes_length(kv[1])).saveAsTextFile("hdfs://localhost:9000/tmp/somfile")
> 
> However, doing this creates a single file in HDFS:
> $ hadoop fs -cat /tmp/somfile/part-00000
> 113212
> 144926
> 178923
> Instead I want /tmp/somfile in HDFS to contain three files:
> one-success.txt with value 113212
> two-success.txt with value 144926
> three-success.txt with value 178923
> 
> Is it possible to achieve what I'm after? I don't want to write files to the local file system and then put them in HDFS. Instead, I want to use the saveAsTextFile method on the RDD directly.
> 
> 
> 
> 
> -- 
> Best Regards,
> Ayan Guha
>