Posted to user@spark.apache.org by Chris Mattmann <ma...@apache.org> on 2016/08/08 14:03:59 UTC

FW: Have I done everything correctly when subscribing to Spark User List





On 8/8/16, 2:03 AM, "Matthias.Dueck@fiduciagad.de" <Ma...@fiduciagad.de> wrote:

>Hello,
>
>I am writing to you because I am not really sure whether I did everything right when registering for and subscribing to the Spark user list.
>
>I posted the appended question to the Spark user list after subscribing and receiving the "WELCOME to user@spark.apache.org" mail from "user-help@spark.apache.org".
>But this post is still in the state "This post has NOT been accepted by the mailing list yet.".
>
>Is this because I forgot something to do or did something wrong with my user account (dueckm)? Or is it because no member of the Spark User List reacted to that post yet?
>
>Thanks a lot for your help.
>
>Matthias
>
>Fiducia & GAD IT AG | www.fiduciagad.de
>Registered at Amtsgericht Frankfurt a. M., HRB 102381 | Registered office: Hahnstr. 48, 60528 Frankfurt a. M. | VAT ID no. DE 143582320
>Management Board: Klaus-Peter Bruns (Chairman), Claus-Dieter Toben (Deputy Chairman),
>
>Jens-Olaf Bartels, Martin Beyer, Jörg Dreinhöfer, Wolfgang Eckert, Carsten Pfläging, Jörg Staff
>Chairman of the Supervisory Board: Jürgen Brinkmann
>
>----- Forwarded by Matthias Dück/M/FAG/FIDUCIA/DE on 08.08.2016 10:57 -----
>
>From: dueckm <ma...@fiduciagad.de>
>To: user@spark.apache.org
>Date: 04.08.2016 13:27
>Subject: Are join/groupBy operations with wide Java Beans using Dataset API much slower than using RDD API?
>
>________________________________________
>
>
>
>Hello,
>
>I built a prototype that uses join and groupBy operations via the Spark RDD API.
>Recently I migrated it to the Dataset API. Now it runs much slower than with
>the original RDD implementation.
>Did I do something wrong here? Or is this a price I have to pay for the more
>convenient API?
>Is there a known solution to deal with this effect (e.g. configuration via
>"spark.sql.shuffle.partitions", set as sketched below - but how could I determine
>the correct value)?
>In my prototype I use Java Beans with a lot of attributes. Does this slow
>down Spark operations with Datasets?
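>
>For reference, here is a minimal sketch (not taken from the attached prototype) of how the
>shuffle-partition option could be set on the SparkSession; the value 8 is only an illustration,
>not a recommendation:
>
>import org.apache.spark.sql.SparkSession;
>
>SparkSession spark = SparkSession.builder()
>        .appName("JoinGroupByTest")
>        .master("local[*]")
>        // Illustrative value only: the default of 200 shuffle partitions is often too high
>        // for small local test data and would need tuning to the actual data volume and cluster.
>        .config("spark.sql.shuffle.partitions", "8")
>        .getOrCreate();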
>
>Here is a simple example that shows the difference (a rough sketch of both variants also follows below):
>JoinGroupByTest.zip
><http://apache-spark-user-list.1001560.n3.nabble.com/file/n27473/JoinGroupByTest.zip>
>- I build 2 RDDs, join and group them, and afterwards count and display the
>joined result (method de.testrddds.JoinGroupByTest.joinAndGroupViaRDD()).
>- When I do the same actions with Datasets it takes approximately 40 times
>as long (method de.testrddds.JoinGroupByTest.joinAndGroupViaDatasets()).
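>
>To make the comparison concrete without the attachment, here is a rough, hypothetical sketch of
>the two variants. The bean classes BeanA/BeanB, the id key, the input lists beansA/beansB, and the
>jsc/spark handles are placeholders, not the actual code from JoinGroupByTest.zip:
>
>import java.util.List;
>import org.apache.spark.api.java.JavaPairRDD;
>import org.apache.spark.api.java.JavaSparkContext;
>import org.apache.spark.sql.Dataset;
>import org.apache.spark.sql.Encoders;
>import org.apache.spark.sql.Row;
>import org.apache.spark.sql.SparkSession;
>import scala.Tuple2;
>
>// Placeholders: jsc is an existing JavaSparkContext, spark an existing SparkSession,
>// beansA/beansB are List<BeanA>/List<BeanB> built beforehand.
>
>// RDD variant: pair each bean with its key, join the two pair RDDs, then group by key.
>JavaPairRDD<Long, BeanA> left =
>        jsc.parallelize(beansA).mapToPair(a -> new Tuple2<Long, BeanA>(a.getId(), a));
>JavaPairRDD<Long, BeanB> right =
>        jsc.parallelize(beansB).mapToPair(b -> new Tuple2<Long, BeanB>(b.getId(), b));
>long rddCount = left.join(right).groupByKey().count();
>
>// Dataset variant: the same logical join/group expressed via bean encoders and columns.
>Dataset<BeanA> dsA = spark.createDataset(beansA, Encoders.bean(BeanA.class));
>Dataset<BeanB> dsB = spark.createDataset(beansB, Encoders.bean(BeanB.class));
>Dataset<Row> joined = dsA.join(dsB, dsA.col("id").equalTo(dsB.col("id")));
>long datasetCount = joined.groupBy(dsA.col("id")).count().count();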
>
>Thank you very much for your help.
>Matthias
>
>PS1: excuse me for sending this post more than once, but I am new to this
>mailing list and probably did something wrong when registering/subscribing,
>so my previous postings have not been accepted ...
>
>PS2: See the appended screenshots taken from Spark UI (jobs 0/1 belong to
>RDD implementation, jobs 2/3 to Dataset):
><http://apache-spark-user-list.1001560.n3.nabble.com/file/n27473/jobs.png>
>
><http://apache-spark-user-list.1001560.n3.nabble.com/file/n27473/Job_RDD_Details.png>
>
><http://apache-spark-user-list.1001560.n3.nabble.com/file/n27473/Job_Dataset_Details.png>
>
>
>
>
>--
>View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Are-join-groupBy-operations-with-wide-Java-Beans-using-Dataset-API-much-slower-than-using-RDD-API-tp27473.html
>Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
>---------------------------------------------------------------------
>To unsubscribe e-mail: user-unsubscribe@spark.apache.org
>


---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscribe@spark.apache.org


Re: Have I done everything correctly when subscribing to Spark User List

Posted by Ovidiu-Cristian MARCU <ov...@inria.fr>.
The yellow warning message is probably even more confusing than not receiving an answer or opinion on his post.

Best,
Ovidiu
> On 08 Aug 2016, at 20:10, Sean Owen <so...@cloudera.com> wrote:
> 
> I also don't know what's going on with the "This post has NOT been
> accepted by the mailing list yet" message, because actually the
> messages always do post. In fact this has been sent to the list 4
> times:
> 
> https://www.mail-archive.com/search?l=user%40spark.apache.org&q=dueckm&submit.x=0&submit.y=0
> 
> On Mon, Aug 8, 2016 at 3:03 PM, Chris Mattmann <ma...@apache.org> wrote:
>> 
>> 
>> 
>> 
>> 
>> On 8/8/16, 2:03 AM, "Matthias.Dueck@fiduciagad.de" <Ma...@fiduciagad.de> wrote:
>> 
>>> Hello,
>>> 
>>> I am writing to you because I am not really sure whether I did everything right when registering for and subscribing to the Spark user list.
>>>
>>> I posted the appended question to the Spark user list after subscribing and receiving the "WELCOME to user@spark.apache.org" mail from "user-help@spark.apache.org".
>>> But this post is still in the state "This post has NOT been accepted by the mailing list yet.".
>>> 
>>> Is this because I forgot something to do or did something wrong with my user account (dueckm)? Or is it because no member of the Spark User List reacted to that post yet?
>>> 
>>> Thanks a lot for your help.
>>> 
>>> Matthias
>>> 
>>> Fiducia & GAD IT AG | www.fiduciagad.de
>>> Registered at Amtsgericht Frankfurt a. M., HRB 102381 | Registered office: Hahnstr. 48, 60528 Frankfurt a. M. | VAT ID no. DE 143582320
>>> Management Board: Klaus-Peter Bruns (Chairman), Claus-Dieter Toben (Deputy Chairman),
>>>
>>> Jens-Olaf Bartels, Martin Beyer, Jörg Dreinhöfer, Wolfgang Eckert, Carsten Pfläging, Jörg Staff
>>> Chairman of the Supervisory Board: Jürgen Brinkmann
>>> 
>>> ----- Forwarded by Matthias Dück/M/FAG/FIDUCIA/DE on 08.08.2016 10:57 -----
>>>
>>> From: dueckm <ma...@fiduciagad.de>
>>> To: user@spark.apache.org
>>> Date: 04.08.2016 13:27
>>> Subject: Are join/groupBy operations with wide Java Beans using Dataset API much slower than using RDD API?
>>> 
>>> ________________________________________
>>> 
>>> 
>>> 
>>> Hello,
>>> 
>>> I built a prototype that uses join and groupBy operations via the Spark RDD API.
>>> Recently I migrated it to the Dataset API. Now it runs much slower than with
>>> the original RDD implementation.
>>> Did I do something wrong here? Or is this a price I have to pay for the more
>>> convenient API?
>>> Is there a known solution to deal with this effect (e.g. configuration via
>>> "spark.sql.shuffle.partitions" - but how could I determine the correct
>>> value)?
>>> In my prototype I use Java Beans with a lot of attributes. Does this slow
>>> down Spark operations with Datasets?
>>> 
>>> Here is a simple example that shows the difference:
>>> JoinGroupByTest.zip
>>> <http://apache-spark-user-list.1001560.n3.nabble.com/file/n27473/JoinGroupByTest.zip>
>>> - I build 2 RDDs, join and group them, and afterwards count and display the
>>> joined result (method de.testrddds.JoinGroupByTest.joinAndGroupViaRDD()).
>>> - When I do the same actions with Datasets it takes approximately 40 times
>>> as long (method de.testrddds.JoinGroupByTest.joinAndGroupViaDatasets()).
>>> 
>>> Thank you very much for your help.
>>> Matthias
>>> 
>>> PS1: excuse me for sending this post more than once, but I am new to this
>>> mailing list and probably did something wrong when registering/subscribing,
>>> so my previous postings have not been accepted ...
>>> 
>>> PS2: See the appended screenshots taken from Spark UI (jobs 0/1 belong to
>>> RDD implementation, jobs 2/3 to Dataset):
>>> <http://apache-spark-user-list.1001560.n3.nabble.com/file/n27473/jobs.png>
>>> 
>>> <http://apache-spark-user-list.1001560.n3.nabble.com/file/n27473/Job_RDD_Details.png>
>>> 
>>> <http://apache-spark-user-list.1001560.n3.nabble.com/file/n27473/Job_Dataset_Details.png>
>>> 
>>> 
>>> 
>>> 
>>> --
>>> View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Are-join-groupBy-operations-with-wide-Java-Beans-using-Dataset-API-much-slower-than-using-RDD-API-tp27473.html
>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>> 
>>> ---------------------------------------------------------------------
>>> To unsubscribe e-mail: user-unsubscribe@spark.apache.org
>>> 
>> 
>> 
>> ---------------------------------------------------------------------
>> To unsubscribe e-mail: user-unsubscribe@spark.apache.org
>> 
> 
> ---------------------------------------------------------------------
> To unsubscribe e-mail: user-unsubscribe@spark.apache.org
> 


---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscribe@spark.apache.org


Re: FW: Have I done everything correctly when subscribing to Spark User List

Posted by Sean Owen <so...@cloudera.com>.
I also don't know what's going on with the "This post has NOT been
accepted by the mailing list yet" message, because actually the
messages always do post. In fact this has been sent to the list 4
times:

https://www.mail-archive.com/search?l=user%40spark.apache.org&q=dueckm&submit.x=0&submit.y=0

On Mon, Aug 8, 2016 at 3:03 PM, Chris Mattmann <ma...@apache.org> wrote:
>
>
>
>
>
> On 8/8/16, 2:03 AM, "Matthias.Dueck@fiduciagad.de" <Ma...@fiduciagad.de> wrote:
>
>>Hello,
>>
>>I am writing to you because I am not really sure whether I did everything right when registering for and subscribing to the Spark user list.
>>
>>I posted the appended question to the Spark user list after subscribing and receiving the "WELCOME to user@spark.apache.org" mail from "user-help@spark.apache.org".
>>But this post is still in the state "This post has NOT been accepted by the mailing list yet.".
>>
>>Is this because I forgot something to do or did something wrong with my user account (dueckm)? Or is it because no member of the Spark User List reacted to that post yet?
>>
>>Thanks a lot for your help.
>>
>>Matthias
>>
>>Fiducia & GAD IT AG | www.fiduciagad.de
>>Registered at Amtsgericht Frankfurt a. M., HRB 102381 | Registered office: Hahnstr. 48, 60528 Frankfurt a. M. | VAT ID no. DE 143582320
>>Management Board: Klaus-Peter Bruns (Chairman), Claus-Dieter Toben (Deputy Chairman),
>>
>>Jens-Olaf Bartels, Martin Beyer, Jörg Dreinhöfer, Wolfgang Eckert, Carsten Pfläging, Jörg Staff
>>Chairman of the Supervisory Board: Jürgen Brinkmann
>>
>>----- Forwarded by Matthias Dück/M/FAG/FIDUCIA/DE on 08.08.2016 10:57 -----
>>
>>From: dueckm <ma...@fiduciagad.de>
>>To: user@spark.apache.org
>>Date: 04.08.2016 13:27
>>Subject: Are join/groupBy operations with wide Java Beans using Dataset API much slower than using RDD API?
>>
>>________________________________________
>>
>>
>>
>>Hello,
>>
>>I built a prototype that uses join and groupBy operations via the Spark RDD API.
>>Recently I migrated it to the Dataset API. Now it runs much slower than with
>>the original RDD implementation.
>>Did I do something wrong here? Or is this a price I have to pay for the more
>>convenient API?
>>Is there a known solution to deal with this effect (e.g. configuration via
>>"spark.sql.shuffle.partitions" - but how could I determine the correct
>>value)?
>>In my prototype I use Java Beans with a lot of attributes. Does this slow
>>down Spark operations with Datasets?
>>
>>Here is a simple example that shows the difference:
>>JoinGroupByTest.zip
>><http://apache-spark-user-list.1001560.n3.nabble.com/file/n27473/JoinGroupByTest.zip>
>>- I build 2 RDDs, join and group them, and afterwards count and display the
>>joined result (method de.testrddds.JoinGroupByTest.joinAndGroupViaRDD()).
>>- When I do the same actions with Datasets it takes approximately 40 times
>>as long (method de.testrddds.JoinGroupByTest.joinAndGroupViaDatasets()).
>>
>>Thank you very much for your help.
>>Matthias
>>
>>PS1: excuse me for sending this post more than once, but I am new to this
>>mailing list and probably did something wrong when registering/subscribing,
>>so my previous postings have not been accepted ...
>>
>>PS2: See the appended screenshots taken from Spark UI (jobs 0/1 belong to
>>RDD implementation, jobs 2/3 to Dataset):
>><http://apache-spark-user-list.1001560.n3.nabble.com/file/n27473/jobs.png>
>>
>><http://apache-spark-user-list.1001560.n3.nabble.com/file/n27473/Job_RDD_Details.png>
>>
>><http://apache-spark-user-list.1001560.n3.nabble.com/file/n27473/Job_Dataset_Details.png>
>>
>>
>>
>>
>>--
>>View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Are-join-groupBy-operations-with-wide-Java-Beans-using-Dataset-API-much-slower-than-using-RDD-API-tp27473.html
>>Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>>---------------------------------------------------------------------
>>To unsubscribe e-mail: user-unsubscribe@spark.apache.org
>>
>
>
> ---------------------------------------------------------------------
> To unsubscribe e-mail: user-unsubscribe@spark.apache.org
>

---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscribe@spark.apache.org