Posted to user@phoenix.apache.org by manohar mc <ma...@yahoo.co.in> on 2019/07/21 17:18:12 UTC

Fw: Read Performance in latest code

 

   ----- Forwarded message -----
   From: manohar mc <ma...@yahoo.co.in>
   To: user-allow@phoenix.apache.org <us...@phoenix.apache.org>
   Sent: Friday, 19 July, 2019, 11:14:41 am IST
   Subject: Read Performance in latest code
Hi List,

I am using the latest Phoenix Spark connector (https://github.com/apache/phoenix-connectors/tree/master/phoenix-spark). Initially we observed issues in write performance, and after some changes we were able to bring the write time down from 30 minutes to under 1 minute in our test environment. However, we are now seeing a lot of CPU time consumed while reading data into a DataFrame; as the picture below shows, more than 50% of CPU time is spent in ShuffleMapTask.
[image: profiler screenshot showing CPU time spent in ShuffleMapTask]
As the picture shows, there are many recursive calls before DataSourceRDD.compute gets called. I wanted to understand what is happening in this case and whether there is any way to reduce the CPU time spent in ShuffleMapTask.
  

Re: Fw: Read Performance in latest code

Posted by manohar3 <ma...@yahoo.co.in>.
Hi Chinmay,

Queries are of the type select * from ... where name=value; they are not
complex and have no joins. From the profiler I can see that a lot of CPU
time is consumed while instantiating
PhoenixInputPartition.createPartitionReader().

Please check the profiler picture I have attached to see which method is
consuming most of the time in ShuffleMapTask.
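For a simple select-with-equality-filter query like the one described above, one thing worth checking is whether the predicate is actually pushed down to Phoenix rather than applied by Spark after a full scan. A hedged pyspark sketch, assuming the option names documented in the phoenix-spark README (`table`, `zkUrl`) and a hypothetical table and ZooKeeper quorum; this needs a running Phoenix/HBase cluster and the connector on the classpath:

```python
# Sketch only: "MY_TABLE" and "zkhost:2181" are placeholders, and the
# option names are taken from the phoenix-spark README, not verified
# against this exact connector version.
df = (spark.read
      .format("phoenix")
      .option("table", "MY_TABLE")
      .option("zkUrl", "zkhost:2181")
      .load()
      .filter("NAME = 'value'"))  # ideally pushed down as a Phoenix WHERE clause

# If pushdown works, the filter shows up inside the scan node of the
# physical plan (e.g. as PushedFilters) instead of as a separate Spark
# Filter step over all rows.
df.explain()
```

If the filter is not pushed down, every row is deserialized and shuffled before being discarded, which would match the high CPU time seen in the partition reader.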





--
Sent from: http://apache-phoenix-user-list.1124778.n5.nabble.com/

Re: Fw: Read Performance in latest code

Posted by Chinmay Kulkarni <ch...@gmail.com>.
Hi Manohar,

What query are you using when reading the data into a DataFrame? Can you
show us the DAG for your job? Perhaps you can filter the data further to
decrease the amount of data being shuffled. Also, are you doing any
group-by or join operations which could lead to significant data
shuffling?
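For context on why shuffle cost scales with data volume: conceptually, a shuffle map task hash-partitions every record by key so that the next stage can fetch its share, which costs CPU per record. A minimal Python sketch of that idea (illustrative only, not Spark's actual implementation):

```python
# Conceptual sketch of a shuffle map task: each (key, value) record is
# hashed by key into an output bucket destined for one reducer partition.
def shuffle_map(records, num_partitions):
    """Assign each (key, value) record to a partition by key hash."""
    buckets = [[] for _ in range(num_partitions)]
    for key, value in records:
        buckets[hash(key) % num_partitions].append((key, value))
    return buckets

records = [("a", 1), ("b", 2), ("a", 3), ("c", 4)]
buckets = shuffle_map(records, num_partitions=2)

# Every record is touched once, so CPU cost grows linearly with row
# count; filtering rows out before the shuffle shrinks this work directly.
assert sum(len(b) for b in buckets) == len(records)

# Records with the same key always land in the same bucket, which is
# what lets the next stage group or join them.
a_buckets = {i for i, b in enumerate(buckets) for k, _ in b if k == "a"}
assert len(a_buckets) == 1
```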
Another thing I found is to tune the GC; see this answer:
https://stackoverflow.com/questions/38981772/spark-shuffle-operation-leading-to-long-gc-pause/39111205
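The GC advice usually translates into executor JVM options passed at submit time. A hypothetical invocation showing the kind of flags that answer discusses; the flag values and script name are placeholders to tune against your own GC logs, not recommendations:

```shell
# Hypothetical example: switch executors to the G1 collector, cap target
# pause time, and emit GC details for diagnosis. Values are placeholders.
spark-submit \
  --conf "spark.executor.extraJavaOptions=-XX:+UseG1GC -XX:MaxGCPauseMillis=200 -XX:+PrintGCDetails" \
  --conf spark.executor.memory=4g \
  your_job.py
```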

Thanks,
Chinmay

On Sun, Jul 21, 2019 at 10:19 AM manohar mc <ma...@yahoo.co.in> wrote:

> ...


-- 
Chinmay Kulkarni