Posted to user@ignite.apache.org by Stephen Darlington <st...@gridgain.com> on 2019/01/02 10:11:58 UTC
Re: Loading data from Spark Cluster to Ignite Cache to perform Ignite ML
Where does the data in your Spark DataFrame come from? As I understand it, that would all be in Spark’s memory anyway?
Anyway, I didn’t test this exact scenario, but it seems that writing directly to Ignite via the DataFrame API should work — why did you think it wouldn’t? I can’t say whether it would be the most efficient way of doing it, but it would certainly be more efficient than your code below.
ds.write()
    .format(IgniteDataFrameSettings.FORMAT_IGNITE())
    .option(IgniteDataFrameSettings.OPTION_CONFIG_FILE(), igniteCfgFile)
    .option(IgniteDataFrameSettings.OPTION_TABLE(), "my_table") // target table; name is illustrative
    .option(IgniteDataFrameSettings.OPTION_CREATE_TABLE_PRIMARY_KEY_FIELDS(), "id") // assumes an "id" column
    .option(IgniteDataFrameSettings.OPTION_CREATE_TABLE_PARAMETERS(), "backups=1,key_type=Integer")
    .mode(SaveMode.Append) // batch writes use mode(), not the streaming outputMode()
    .save();
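For a sanity check after the write, the table can be read straight back into a Spark DataFrame with the same format. This is only a sketch: it assumes the same igniteCfgFile and that the write above targeted a table named "my_table" via OPTION_TABLE (both names are illustrative):

```java
// Read the Ignite table back as a Spark DataFrame to verify the load.
Dataset<Row> fromIgnite = spark.read()
    .format(IgniteDataFrameSettings.FORMAT_IGNITE())
    .option(IgniteDataFrameSettings.OPTION_CONFIG_FILE(), igniteCfgFile)
    .option(IgniteDataFrameSettings.OPTION_TABLE(), "my_table")
    .load();
fromIgnite.show();
```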
As a general point for bulk-loading data, calling putAll with a collection is more efficient than calling put in a loop. You might also consider the IgniteDataStreamer API.
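If you do stay with the cache API, a data streamer avoids per-entry round trips by batching entries per node, and combined with something like ds.foreachPartition it sidesteps collecting everything on the driver. A minimal sketch, assuming an Ignite instance `ignite`, a cache named "myCache", and an Iterable<Row> of Spark rows in scope (all names illustrative):

```java
// Bulk-load Spark rows into an Ignite cache via IgniteDataStreamer.
try (IgniteDataStreamer<Integer, Object[]> streamer = ignite.dataStreamer("myCache")) {
    streamer.allowOverwrite(true);     // replace existing keys if present
    streamer.perNodeBufferSize(1024);  // tune the per-node batch size

    int key = 0;
    for (Row row : rows) {             // rows: Iterable<Row>, e.g. one Spark partition
        Object[] values = new Object[row.size()];
        for (int j = 0; j < row.size(); j++)
            values[j] = row.get(j);
        streamer.addData(key++, values); // batched, asynchronous put
    }
} // close() flushes any buffered entries
```

Note that a sequential int key only works for a single loader; across multiple Spark partitions you would need a key scheme that cannot collide (for example, partition id combined with a per-partition counter).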
Regards,
Stephen
> On 26 Dec 2018, at 13:38, zaleslaw <za...@gmail.com> wrote:
>
> Hi, Igniters!
>
> I am looking for a way to load data from a Spark RDD or DataFrame into an
> Ignite cache with the following declaration, IgniteCache<Integer, Object[]> dataCache,
> to perform Ignite ML algorithms.
>
> As I understand it, the current Ignite-Spark integration is meant to store
> Spark RDDs/DataFrames in Ignite to improve the performance of Spark jobs,
> so this implementation can't help me — am I correct?
>
> Do you know how to make this small ETL more efficient, without
> collecting the data on one node as in the example below?
>
> IgniteCache<Integer, Object[]> cache = getCache(ignite);
>
> SparkSession spark = SparkSession
>     .builder()
>     .appName("SparkForIgnite")
>     .master("local")
>     .config("spark.executor.instances", "2")
>     .getOrCreate();
>
> Dataset<Row> ds = <ds in Spark>;
>
> ds.show();
>
> List<Row> data = ds.collectAsList(); // stupid solution: pulls everything onto the driver
>
> for (int i = 0; i < data.size(); i++) {
>     Object[] parsedRow = new Object[14]; // fresh array per entry, not shared across puts
>     for (int j = 0; j < 14; j++)
>         parsedRow[j] = data.get(i).get(j);
>     cache.put(i, parsedRow);
> }
>
> spark.stop();
>
>
>
> --
> Sent from: http://apache-ignite-users.70518.x6.nabble.com/