Posted to user@ignite.apache.org by Stephen Darlington <st...@gridgain.com> on 2019/01/02 10:11:58 UTC

Re: Loading data from Spark Cluster to Ignite Cache to perform Ignite ML

Where does the data in your Spark DataFrame come from? As I understand it, that would all be in Spark’s memory anyway?

Anyway, I didn’t test this exact scenario, but writing your Spark
DataFrame directly to Ignite through the Ignite data source should work.
Why did you think it wouldn’t? I can’t say whether it’s the most
efficient way of doing it, but it would certainly be more efficient than
your code below. Note that a batch write uses mode(), not outputMode(),
and save() needs to know the target table (the table name and key field
below are illustrative):

        ds.write()
           .format(IgniteDataFrameSettings.FORMAT_IGNITE())
           .option(IgniteDataFrameSettings.OPTION_CONFIG_FILE(), igniteCfgFile)
           // illustrative table name; required so Ignite knows where to write
           .option(IgniteDataFrameSettings.OPTION_TABLE(), "my_table")
           // assumes your DataFrame has an "id" column to use as the key
           .option(IgniteDataFrameSettings.OPTION_CREATE_TABLE_PRIMARY_KEY_FIELDS(), "id")
           .option(IgniteDataFrameSettings.OPTION_CREATE_TABLE_PARAMETERS(), "backups=1,key_type=Integer")
           .mode("append") // DataFrameWriter uses mode(), not outputMode()
           .save();

As a general point for bulk-loading data, using putAll with a collection
is more efficient than calling put in a loop. You might also consider
the IgniteDataStreamer API; see the sketch below.
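Here’s a rough sketch of the streamer approach (untested, and the names
are assumptions: "ignite" is a running Ignite instance, "dataCache"
matches the cache declaration in your question, and "rows" stands in for
a List<Object[]> you’ve already parsed from the DataFrame):

        try (IgniteDataStreamer<Integer, Object[]> streamer =
                 ignite.dataStreamer("dataCache")) {
            streamer.allowOverwrite(true); // replace existing keys on re-load
            for (int i = 0; i < rows.size(); i++)
                streamer.addData(i, rows.get(i)); // buffered and batched per node
        } // close() flushes any remaining buffered entries

The streamer batches entries and sends each batch to the node that owns
the keys, so you avoid the per-entry network round trip you get with
put() in a loop.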

Regards,
Stephen

> On 26 Dec 2018, at 13:38, zaleslaw <za...@gmail.com> wrote:
> 
> Hi, Igniters!
> 
> I am looking for a way to load data from a Spark RDD or DataFrame into
> an Ignite cache declared as IgniteCache<Integer, Object[]> dataCache,
> so that I can run Ignite ML algorithms on it.
> 
> As I understand it, the current Ignite-Spark integration is meant to
> store Spark RDDs/DataFrames in Ignite to speed up Spark jobs, so that
> mechanism wouldn't help me here. Am I correct?
> 
> Do you know how to make this small ETL step more efficient, without
> collecting all the data on one node as in the example below?
> 
> IgniteCache<Integer, Object[]> cache = getCache(ignite);
> 
>        SparkSession spark = SparkSession
>            .builder()
>            .appName("SparkForIgnite")
>            .master("local")
>            .config("spark.executor.instances", "2")
>            .getOrCreate();
> 
>        Dataset<Row> ds = <ds in Spark>;
> 
>        ds.show();
> 
>        List<Row> data = ds.collectAsList(); // stupid solution: everything lands on the driver
> 
>        for (int i = 0; i < data.size(); i++) {
>            // fresh array per row, so cached entries don't all share one array
>            Object[] parsedRow = new Object[14];
>            for (int j = 0; j < 14; j++)
>                parsedRow[j] = data.get(i).get(j);
>            cache.put(i, parsedRow); // one round trip per row
>        }
> 
>        spark.stop();
> 
> 
> 
> --
> Sent from: http://apache-ignite-users.70518.x6.nabble.com/