Posted to user@spark.apache.org by ayuio5799 <ay...@gmail.com> on 2022/12/25 13:40:23 UTC

RE: Re: RDD to InputStream

On 2015/03/18 17:20:54 Ayoub wrote:
> In case it interests other people, here is what I came up with, and it
> seems to work fine:
>
>   case class RDDAsInputStream(private val rdd: RDD[String]) extends
>       java.io.InputStream {
>     private val bytes = rdd.flatMap(_.getBytes("UTF-8")).toLocalIterator
>
>     def read(): Int = {
>       // mask to 0..255 so multi-byte UTF-8 characters are not returned
>       // as negative values and mistaken for end of stream
>       if (bytes.hasNext) bytes.next() & 0xff
>       else -1
>     }
>     override def markSupported(): Boolean = false
>   }
>
> 2015-03-13 13:56 GMT+01:00 Sean Owen <so...@cloudera.com>:
>
> > OK, then you do not want to collect() the RDD. You can get an iterator, yes.
> > There is no such thing as making an Iterator into an InputStream. An
> > Iterator is a sequence of arbitrary objects; an InputStream is a
> > channel to a stream of bytes.
> > I think you can employ similar Guava / Commons utilities to turn an
> > Iterator of Strings into a stream of Readers, join the Readers, and
> > encode the result as bytes in an InputStream.
> >
> > On Fri, Mar 13, 2015 at 10:33 AM, Ayoub <be...@gmail.com> wrote:
> > > Thanks Sean,
> > >
> > > I forgot to mention that the data is too big to be collected on the driver.
> > >
> > > So yes, your proposition would work in theory, but in my case I cannot hold
> > > all the data in driver memory, so it wouldn't work.
> > >
> > > I guess the crucial point is to do the collect in a lazy way, and on that
> > > subject I noticed that we can get a local iterator from an RDD, but that
> > > raises two questions:
> > >
> > > - does that involve an immediate collect, just like "collect()", or is it
> > >   a lazy process?
> > > - how to go from an iterator to an InputStream?
> > >
> > > 2015-03-13 11:17 GMT+01:00 Sean Owen <[hidden email]>:
> > > >
> > > > These are quite different creatures. You have a distributed set of
> > > > Strings, but want a local stream of bytes, which involves three
> > > > conversions:
> > > >
> > > > - collect the data to the driver
> > > > - concatenate the strings in some way
> > > > - encode the strings as bytes according to an encoding
> > > >
> > > > Your approach is OK, but it might be faster to avoid disk if you have
> > > > enough memory:
> > > >
> > > > - collect() to an Array[String] locally
> > > > - use Guava utilities to turn a bunch of Strings into a Reader
> > > > - use the Apache Commons ReaderInputStream to read it as encoded bytes
> > > >
> > > > I might wonder if that's all really what you want to do, though.
> > > >
> > > > On Fri, Mar 13, 2015 at 9:54 AM, Ayoub <[hidden email]> wrote:
> > > > > Hello,
> > > > >
> > > > > I need to convert an RDD[String] to a java.io.InputStream but I didn't
> > > > > find an easy way to do it.
> > > > > Currently I am saving the RDD as a temporary file and then opening an
> > > > > InputStream on the file, but that is not really optimal.
> > > > >
> > > > > Does anybody know a better way to do that?
> > > > >
> > > > > Thanks,
> > > > > Ayoub.
> > > > >
> > > > > --
> > > > > View this message in context:
> > > > > http://apache-spark-user-list.1001560.n3.nabble.com/RDD-to-InputStream-tp22031.html
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Re-RDD-to-InputStream-tp22121.html
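On the two questions raised in the thread: toLocalIterator does not materialize the whole RDD the way collect() does; it runs the partitions one at a time as the iterator advances, so driver memory use is bounded by the largest partition. For the iterator-to-InputStream part, a minimal sketch that avoids reading one byte at a time is to wrap toLocalIterator in a java.io.SequenceInputStream. The helper name rddAsInputStream and the newline appended to each record are assumptions for illustration, not something from the thread:

  import java.io.{ByteArrayInputStream, InputStream, SequenceInputStream}
  import java.nio.charset.StandardCharsets
  import org.apache.spark.rdd.RDD

  // Each record becomes a small ByteArrayInputStream; SequenceInputStream pulls
  // them lazily from the Enumeration, which in turn pulls from toLocalIterator,
  // so at most one partition is resident on the driver at a time.
  def rddAsInputStream(rdd: RDD[String]): InputStream = {
    val chunks = rdd.toLocalIterator.map { line =>
      new ByteArrayInputStream((line + "\n").getBytes(StandardCharsets.UTF_8)): InputStream
    }
    new SequenceInputStream(new java.util.Enumeration[InputStream] {
      def hasMoreElements: Boolean = chunks.hasNext
      def nextElement(): InputStream = chunks.next()
    })
  }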
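Sean's in-memory route (collect, join the Strings via Guava, encode with Commons IO) could look roughly like the sketch below, assuming the data fits in driver memory, that Guava's CharSource and Commons IO's ReaderInputStream are on the classpath, and with rdd standing for the RDD[String] in question; variable names are illustrative:

  import java.nio.charset.StandardCharsets
  import scala.collection.JavaConverters._
  import com.google.common.io.CharSource
  import org.apache.commons.io.input.ReaderInputStream

  val lines = rdd.collect()                                   // entire dataset on the driver
  // join the strings lazily as character sources, one newline per record
  val joined = CharSource.concat(lines.map(l => CharSource.wrap(l + "\n")).toSeq.asJava)
  // encode characters to UTF-8 bytes on demand
  val in: java.io.InputStream =
    new ReaderInputStream(joined.openStream(), StandardCharsets.UTF_8)

The trade-off between the two sketches is memory versus simplicity: the collect() version needs the whole dataset on the driver, which is exactly what Ayoub wanted to avoid.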