Posted to user@spark.apache.org by Onur EKİNCİ <oe...@innova.com.tr> on 2018/01/16 08:00:57 UTC

Run jobs in parallel in standalone mode

Hi,

We are trying to get data from an Oracle database into a Kinetica database through Apache Spark.

We installed Spark in standalone mode and executed the commands below, but no matter what we tried, we could not get the jobs to run in parallel. We use 2 IBM servers, each with 128 cores and 1 TB of memory.

We also added the following to spark-defaults.conf:
spark.executor.memory=64g
spark.executor.cores=32
spark.default.parallelism=32
spark.cores.max=64
spark.scheduler.mode=FAIR
spark.sql.shuffle.partitions=32


On the machine: 10.20.10.228
./start-master.sh --webui-port 8585

./start-slave.sh --webui-port 8586 spark://10.20.10.228:7077


On the machine 10.20.10.229:
./start-slave.sh --webui-port 8586 spark://10.20.10.228:7077


On the machine: 10.20.10.228:

We start the Spark shell:

spark-shell --master spark://10.20.10.228:7077

Then we run the following:

val df = spark.read.format("jdbc")
  .option("url", "jdbc:sqlserver://10.20.10.148:1433;databaseName=testdb")
  .option("dbtable", "dbo.temp_muh_hareket")
  .option("user", "gpudb")
  .option("password", "Kinetica2017!")
  .load()
import com.kinetica.spark._
val lp = new LoaderParams("http://10.20.10.228:9191", "jdbc:simba://10.20.10.228:9292;ParentSet=MASTER", "muh_hareket_20", false, "", 100000, true, true, "admin", "Kinetica2017!", 4, true, true, 1)
SparkKineticaLoader.KineticaWriter(df, lp)


The above commands work, and the data transfer completes. However, the jobs run serially, not in parallel. The executors also run serially and take turns; they don't work in parallel.

How can we make jobs work in parallel?


[attached: Spark UI screenshots]


I really appreciate your help. We have done everything that we could.



Onur EKİNCİ
Bilgi Yönetimi Yöneticisi
Knowledge Management Manager

m:+90 553 044 2341  d:+90 212 329 7000

İTÜ Ayazağa Kampüsü, Teknokent ARI4 Binası 34469 Maslak İstanbul - Google Maps<http://www.innova.com.tr/istanbul.asp>






Legal Notice:
This e-mail is subject to the Terms and Conditions document available at the following link:
http://www.innova.com.tr/disclaimer-yasal-uyari.asp

RE: Run jobs in parallel in standalone mode

Posted by Onur EKİNCİ <oe...@innova.com.tr>.
Thank you, Bill.

What about the number of ColumnProcessor.java:50 jobs?
How can we change their number? Does Spark set it automatically? It looks like Spark extracts the data column by column.





Onur EKİNCİ
Bilgi Yönetimi Yöneticisi
Knowledge Management Manager

m:+90 553 044 2341  d:+90 212 329 7000

İTÜ Ayazağa Kampüsü, Teknokent ARI4 Binası 34469 Maslak İstanbul - Google Maps<http://www.innova.com.tr/istanbul.asp>





From: Bill Schwanitz [mailto:bilsch@bilsch.org]
Sent: Tuesday, January 16, 2018 3:39 PM
To: Onur EKİNCİ <oe...@innova.com.tr>
Cc: user@spark.apache.org
Subject: Re: Run jobs in parallel in standalone mode

https://docs.databricks.com/spark/latest/data-sources/sql-databases.html#jdbc-reads

I had the same issue with a different database, but it comes down to the JDBC options and task management. You need to specify a partition column with upper and lower bounds. You also need to specify how many threads to use (1 thread per worker).



Re: Run jobs in parallel in standalone mode

Posted by Bill Schwanitz <bi...@bilsch.org>.
https://docs.databricks.com/spark/latest/data-sources/sql-databases.html#jdbc-reads

I had the same issue with a different database, but it comes down to the JDBC options and task management. You need to specify a partition column with upper and lower bounds. You also need to specify how many threads to use (1 thread per worker).
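For reference, a minimal sketch of that approach using the DataFrameReader.jdbc overload that takes a partition column and bounds; the column name "id", the bounds, and the partition count are illustrative assumptions, not values from the original post:

import java.util.Properties

// Hedged sketch of a bounded, partitioned JDBC read. Use a numeric column
// from your own table, with its real min/max as the bounds.
val props = new Properties()
props.setProperty("user", "gpudb")
props.setProperty("password", "<password>")

val df = spark.read.jdbc(
  "jdbc:sqlserver://10.20.10.148:1433;databaseName=testdb",
  "dbo.temp_muh_hareket",
  "id",   // partition column (assumed)
  0L,     // lower bound
  1000L,  // upper bound
  4,      // number of partitions = concurrent read tasks
  props
)

Each partition becomes one task, so the read can run on up to 4 executor cores at once.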


RE: Run jobs in parallel in standalone mode

Posted by Onur EKİNCİ <oe...@innova.com.tr>.
Hi Eyal,
Thank you for your help. The following options worked in terms of running multiple executors simultaneously. However, Spark repeats the same 10 jobs consecutively. It had been doing that before as well. The jobs extract data from MSSQL. Why would it run the same job 10 times?

.option("numPartitions", 4)
.option("partitionColumn", "MUHASEBESUBE_KD")
.option("lowerBound", 0)
.option("upperBound", 1000)

[attached: Spark UI screenshot]




Onur EKİNCİ
Bilgi Yönetimi Yöneticisi
Knowledge Management Manager

m:+90 553 044 2341  d:+90 212 329 7000

İTÜ Ayazağa Kampüsü, Teknokent ARI4 Binası 34469 Maslak İstanbul - Google Maps<http://www.innova.com.tr/istanbul.asp>







RE: Run jobs in parallel in standalone mode

Posted by Onur EKİNCİ <oe...@innova.com.tr>.
Hi Eyal,

I have added the following pictures just in case you did not receive them in the first email. In the picture giving information about the executors, it shows that more than one task has completed.

[attached: Spark UI screenshots]

Onur EKİNCİ
Bilgi Yönetimi Yöneticisi
Knowledge Management Manager

m:+90 553 044 2341  d:+90 212 329 7000

İTÜ Ayazağa Kampüsü, Teknokent ARI4 Binası 34469 Maslak İstanbul - Google Maps<http://www.innova.com.tr/istanbul.asp>







Re: Run jobs in parallel in standalone mode

Posted by Eyal Zituny <ey...@equalum.io>.
Hi,
As far as I know, this is not typical behavior for Spark; it might relate to the implementation of the Kinetica Spark connector. You can try writing the DF to a CSV instead, using df.write.csv("<path-to-csv-folder>"), and see how the Spark job behaves.
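(A minimal form of that diagnostic, with a placeholder output path:)

// hedged sketch: write the same DF to CSV to see whether the serial
// behavior follows the DF or the Kinetica writer; the path is made up
df.write.mode("overwrite").csv("/tmp/muh_hareket_csv")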

Eyal


RE: Run jobs in parallel in standalone mode

Posted by Onur EKİNCİ <oe...@innova.com.tr>.
Correction: we found out that Spark extracts data from the MSSQL database column by column. Spark divides the data by column, then executes 10 jobs to pull the data from the MSSQL database.

Is there a way to run those jobs in parallel, or to increase/decrease the number of jobs? By what criteria does Spark run the jobs, especially 10 of them?



Onur EKİNCİ
Bilgi Yönetimi Yöneticisi
Knowledge Management Manager

m:+90 553 044 2341  d:+90 212 329 7000

İTÜ Ayazağa Kampüsü, Teknokent ARI4 Binası 34469 Maslak İstanbul - Google Maps<http://www.innova.com.tr/istanbul.asp>







Re: Run jobs in parallel in standalone mode

Posted by Eyal Zituny <ey...@equalum.io>.
Hi,

I'm not familiar with the Kinetica Spark driver, but it seems that your job has a single task, which might indicate that you have a single partition in the DF. I would suggest trying to create your DF with more partitions; this can be done by adding the following options when reading the source:

.option("numPartitions", 4)
.option("partitionColumn", "id")
.option("lowerBound", 0)
.option("upperBound", 1000)

Take a look at the Spark JDBC configuration <https://spark.apache.org/docs/latest/sql-programming-guide.html#jdbc-to-other-databases> for more info.

You can also do df.repartition(10), but that might be less efficient, since the reading from the source will not be in parallel.
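Combined with the JDBC read from the original post, a minimal sketch might look like this; the partition column "id" and the bounds are illustrative assumptions, so use a real numeric column and its min/max:

val df = spark.read.format("jdbc")
  .option("url", "jdbc:sqlserver://10.20.10.148:1433;databaseName=testdb")
  .option("dbtable", "dbo.temp_muh_hareket")
  .option("user", "gpudb")
  .option("password", "<password>")
  .option("numPartitions", 4)        // 4 partitions -> up to 4 parallel read tasks
  .option("partitionColumn", "id")   // assumed numeric column
  .option("lowerBound", 0)
  .option("upperBound", 1000)
  .load()

Spark splits the [lowerBound, upperBound) range into numPartitions slices and issues one query per slice, so the slices can be read concurrently.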

Hope it will help.

Eyal




On Tue, Jan 16, 2018 at 11:01 AM, Onur EKİNCİ <oe...@innova.com.tr> wrote:

> Sorry it is not oracle. It is Mssql.
>
>
>
> Do you have any opinion for the solution. I really appreciate
>
>
>
>
>
> Onur EKİNCİ
> Bilgi Yönetimi Yöneticisi
> Knowledge Management Manager
>
> m:+90 553 044 2341 <+90%20553%20044%2023%2041>  d:+90 212 329 7000
> <(212)%20329-7000>
>
> İTÜ Ayazağa Kampüsü, Teknokent ARI4 Binası 34469 Maslak İstanbul - Google
> Maps <http://www.innova.com.tr/istanbul.asp>
>
> <http://www.innova.com.tr/> <http://www.innova.com.tr/>
> <http://www.innova.com.tr/> <http://www.innova.com.tr/>
> <http://www.innova.com.tr>
>
>
>
> *From:* Richard Qiao [mailto:richardqiao2000@gmail.com]
> *Sent:* Tuesday, January 16, 2018 11:59 AM
> *To:* Onur EKİNCİ <oe...@innova.com.tr>
> *Cc:* user@spark.apache.org
> *Subject:* Re: Run jobs in parallel in standalone mode
>
>
>
> Curious you are using"jdbc:sqlserve" to connect oracle, why?
>
> Also kindly reminder scrubbing your user id password.
>
> Sent from my iPhone
>
>
> On Jan 16, 2018, at 03:00, Onur EKİNCİ <oe...@innova.com.tr> wrote:
>
> Hi,
>
>
>
> We are trying to get data from an Oracle database into Kinetica database
> through Apache Spark.
>
>
>
> We installed Spark in standalone mode. We executed the following commands.
> However, we have tried everything but we couldnt manage to run jobs in
> parallel. We use 2 IBM servers each of which has 128cores and 1TB memory.
>
>
>
> We also added  in the spark-defaults.conf  :
>
> spark.executor.memory=64g
>
> spark.executor.cores=32
>
> spark.default.parallelism=32
>
> spark.cores.max=64
>
> spark.scheduler.mode=FAIR
>
> spark.sql.shuffle.partions=32
>
>
>
>
>
> *On the machine: 10.20.10.228*
>
> ./start-master.sh --webui-port 8585
>
>
>
> ./start-slave.sh --webui-port 8586 spark://10.20.10.228:7077
>
>
>
>
>
> *On the machine 10.20.10.229 <http://10.20.10.229>:*
>
> ./start-slave.sh --webui-port 8586 spark://10.20.10.228:7077
>
>
>
>
>
> *On the machine: 10.20.10.228 <http://10.20.10.228>:*
>
>
>
> We start the Spark shell:
>
>
>
> spark-shell --master spark://10.20.10.228:7077
>
>
>
> Then we make configurations:
>
>
>
> val df  = spark.read.format("jdbc").option("url", "jdbc:sqlserver://
> 10.20.10.148:1433;databaseName=testdb").option("dbtable",
> "dbo.temp_muh_hareket").option("user", "gpudb").option("password",
> "Kinetica2017!").load()
>
> import com.kinetica.spark._
>
> val lp = new LoaderParams("http://10.20.10.228:9191", "jdbc:simba://
> 10.20.10.228:9292;ParentSet=MASTER", "muh_hareket_20",
> false,"",100000,true,true,"admin","Kinetica2017!",4, true, true, 1)
>
> SparkKineticaLoader.KineticaWriter(df,lp);
>
>
>
>
>
> The above commands successfully work. The data transfer completes. However,
> jobs work serially not in parallel. Also executors work serially and take
> turns. They donw work in parallel.
>
>
>
> How can we make jobs work in parallel?
>
>
>
>
>
> <image002.jpg>
>
>
>
> <image006.jpg>
>
>
>
> <image008.jpg>
>
> <image012.jpg>
>
>
>
> <image013.jpg>
>
>
>
>
>
>
>
>
>
>
>
>
>
> <image014.jpg>
>
>
>
>
>
> I really appreciate your help. We have done everything that we could.
>
>
>
> *Onur EKİNCİ*
> Bilgi Yönetimi Yöneticisi
> Knowledge Management Manager
>
> m:+90 553 044 2341 <+90%20553%20044%2023%2041>  d:+90 212 329 7000
> <(212)%20329-7000>
>
> İTÜ Ayazağa Kampüsü, Teknokent ARI4 Binası 34469 Maslak İstanbul - Google
> Maps <http://www.innova.com.tr/istanbul.asp>
>
> <imza_2d4cbd2e-9f86-452e-8fa7-5851198cb9af.png>
> <http://www.innova.com.tr/>
>
>
>
>
> Yasal Uyarı :
> Bu elektronik posta işbu linki kullanarak ulaşabileceğiniz Koşul ve
> Şartlar dokümanına tabidir :
> http://www.innova.com.tr/disclaimer-yasal-uyari.asp
>
>

Re: Run jobs in parallel in standalone mode

Posted by Richard Qiao <ri...@gmail.com>.
2 points to consider:
1. Check the SQL Server/Simba max connection number.
2. Allocate 3-5 cores for each executor, and allocate more executors.
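A hedged sketch of what point 2 could look like in spark-defaults.conf for the two 128-core machines from the original post (the numbers are illustrative, not tuned values):

# assumption: smaller executors, more of them
spark.executor.cores=4
spark.executor.memory=16g
# with 4 cores per executor, this caps the application at 16 executors
spark.cores.max=64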

Sent from my iPhone


RE: Run jobs in parallel in standalone mode

Posted by Onur EKİNCİ <oe...@innova.com.tr>.
Sorry, it is not Oracle. It is MSSQL.

Do you have any opinion on a solution? I would really appreciate it.




Onur EKİNCİ
Bilgi Yönetimi Yöneticisi
Knowledge Management Manager

m:+90 553 044 2341  d:+90 212 329 7000

İTÜ Ayazağa Kampüsü, Teknokent ARI4 Binası 34469 Maslak İstanbul - Google Maps<http://www.innova.com.tr/istanbul.asp>






Re: Run jobs in parallel in standalone mode

Posted by Richard Qiao <ri...@gmail.com>.
Curious: you are using "jdbc:sqlserver" to connect to Oracle, why?
Also, a kind reminder to scrub your user ID and password.

Sent from my iPhone
