Posted to user@spark.apache.org by Steve Lewis <lo...@gmail.com> on 2014/09/22 18:22:14 UTC

Is there any way (in Java) to make a JavaRDD from an iterable

   The only way I can find is to turn it into a list - in effect holding
everything in memory (see the code below). Surely Spark has a better way.

Also, what about unterminated iterables, like a Fibonacci series (useful
only if limited in some other way)?


import java.util.ArrayList;
import java.util.List;

import javax.annotation.Nonnull;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

/**
 * Make an RDD from an iterable.
 *
 * @param inp the input iterable
 * @param ctx the Spark context
 * @param <T> the element type
 * @return an RDD built from the iterable's contents
 */
public static @Nonnull <T> JavaRDD<T> fromIterable(@Nonnull final Iterable<T> inp,
                                                   @Nonnull final JavaSparkContext ctx) {
    // copies every element into driver memory before parallelizing
    List<T> holder = new ArrayList<T>();
    for (T k : inp) {
        holder.add(k);
    }
    return ctx.parallelize(holder);
}

Re: Is there any way (in Java) to make a JavaRDD from an iterable

Posted by Steve Lewis <lo...@gmail.com>.
Is there a way to write it as a temporary file? Also, what about a Stream -
something like an RSS feed?

On Mon, Sep 22, 2014 at 10:21 AM, Victor Tso-Guillen <vt...@paxata.com>
wrote:

> You can write to disk and have Spark read it as a stream. This is how
> Hadoop files are iterated in Spark.
>
> On Mon, Sep 22, 2014 at 9:22 AM, Steve Lewis <lo...@gmail.com>
> wrote:
>
>>    The only way I find is to turn it into a list - in effect holding
>> everything in memory (see code below). Surely Spark has a better way.
>>
>> Also what about unterminated iterables like a Fibonacci series - (useful
>> only if limited in some other way )
>>
>> [code snipped; same fromIterable method as above]
>


-- 
Steven M. Lewis PhD
4221 105th Ave NE
Kirkland, WA 98033
206-384-1340 (cell)
Skype lordjoe_com

Re: Is there any way (in Java) to make a JavaRDD from an iterable

Posted by Victor Tso-Guillen <vt...@paxata.com>.
You can write to disk and have Spark read it as a stream. This is how
Hadoop files are iterated in Spark.
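
For illustration, a minimal sketch of that approach for String data,
assuming elements can be written one per line (the class and method names
here are hypothetical, and on a real cluster the file would need to live
on storage the executors can reach, such as HDFS, rather than the driver's
local disk):

    import java.io.File;
    import java.io.IOException;
    import java.io.PrintWriter;

    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    public class IterableToRDD {
        /** Spill the iterable to a temp file, then let Spark read it back. */
        public static JavaRDD<String> fromIterable(Iterable<String> inp,
                                                   JavaSparkContext ctx)
                throws IOException {
            File tmp = File.createTempFile("iterable", ".txt");
            tmp.deleteOnExit();
            PrintWriter out = new PrintWriter(tmp);
            try {
                // single pass over the iterable, one element per line;
                // nothing is held in memory beyond the current element
                for (String line : inp) {
                    out.println(line);
                }
            } finally {
                out.close();
            }
            // textFile reads the file back lazily and partitioned
            return ctx.textFile(tmp.getAbsolutePath());
        }
    }

The one pass that writes the file is the only place the iterable is
consumed, so nothing ever has to fit in driver memory at once.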

On Mon, Sep 22, 2014 at 9:22 AM, Steve Lewis <lo...@gmail.com> wrote:

>    The only way I find is to turn it into a list - in effect holding
> everything in memory (see code below). Surely Spark has a better way.
>
> Also what about unterminated iterables like a Fibonacci series - (useful
> only if limited in some other way )
>
> [code snipped; same fromIterable method as above]

Re: Is there any way (in Java) to make a JavaRDD from an iterable

Posted by Sean Owen <so...@cloudera.com>.
I imagine it is because parallelize() inherently only makes sense for
smallish data, since it has to be broadcast from the driver. Large
enough data should probably live in distributed storage to begin with.

The Scala equivalent wants a Seq, so I assume there is some need or
value in knowing the size of the input, which Iterable does not give.
(I am guessing Java's version could have taken a Collection, but hey.)

I don't know that it makes sense to contemplate infinite distributed
collections. You could do something like an RDD with 10 elements, each
of which was an infinite Range, each generating every 10th Fibonacci
number, and at least grapple with it in 10-way parallelism, but I don't
know if that's of much practical use. What save/count method at the end
would make sense?
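
For illustration, here is a rough Java sketch of that idea. Since an RDD
cannot actually be unbounded, this caps the series at a caller-supplied
limit (the class name, method name, and cap are assumptions, not anything
from the thread), splitting the work across 10 tasks as described:

    import java.math.BigInteger;
    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;

    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.api.java.function.FlatMapFunction;

    public class FibonacciRDD {
        /** First `limit` Fibonacci numbers, computed in 10-way parallelism. */
        public static JavaRDD<BigInteger> fibonacci(final JavaSparkContext ctx,
                                                    final int limit) {
            // one seed offset per parallel task
            List<Integer> offsets = Arrays.asList(0, 1, 2, 3, 4, 5, 6, 7, 8, 9);
            return ctx.parallelize(offsets, 10).flatMap(
                new FlatMapFunction<Integer, BigInteger>() {
                    public Iterable<BigInteger> call(Integer offset) {
                        // each task walks the series and keeps every 10th
                        // term, starting at its own offset
                        List<BigInteger> out = new ArrayList<BigInteger>();
                        BigInteger a = BigInteger.ZERO, b = BigInteger.ONE;
                        for (int i = 0; i < limit; i++) {
                            if (i % 10 == offset) {
                                out.add(a);
                            }
                            BigInteger next = a.add(b);
                            a = b;
                            b = next;
                        }
                        return out;
                    }
                });
        }
    }

Each task redundantly walks the whole series and keeps a tenth of it,
which is cheap relative to shipping state between tasks; the point is
only to show what "every 10th element" partitioning could look like.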

On Mon, Sep 22, 2014 at 5:22 PM, Steve Lewis <lo...@gmail.com> wrote:
>    The only way I find is to turn it into a list - in effect holding
> everything in memory (see code below). Surely Spark has a better way.
>
> Also what about unterminated iterables like a Fibonacci series - (useful
> only if limited in some other way )
>
> [code snipped; same fromIterable method as above]

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org