Posted to user@spark.apache.org by suman bharadwaj <su...@gmail.com> on 2014/02/02 20:09:56 UTC

How to improve performance in searching for URLs.

Hi,

I am exploring Spark and, as part of that, trying to search a column
containing URLs.

We are essentially applying a contains check to that column. It takes more
than 3 minutes to return the results. Is there any way to optimize this
query?

.filter(line => line.contains("someUrl"))

I currently have a system running in standalone mode with *8 GB of RAM*.
Everything is stored in memory in deserialized form. The data size in
memory (deserialized) is around *1 GB*.


Any suggestions?

Thanks in advance.

Regards,
SB

Re: How to improve performance in searching for URLs.

Posted by Mayur Rustagi <ma...@gmail.com>.
Can you look at the task list on the Spark dashboard and describe the number
of mappers and reducers, and the time taken by each?
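
A rough sketch of one way to collect comparable numbers from code, next to
what the dashboard shows. The object name UrlFilterTiming, the helpers timed
and countMatches, the local[*] master and the sample lines are all
assumptions made for illustration, not part of the original job. The filter
is timed twice because the first action on a cached RDD also pays for
materializing the cache, so the two timings may differ.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object UrlFilterTiming {
  // Runs `body` and returns its result with the elapsed wall-clock time in ms.
  def timed[A](body: => A): (A, Long) = {
    val start = System.nanoTime()
    val result = body
    (result, (System.nanoTime() - start) / 1000000L)
  }

  // Counts the lines containing `needle`. filter is lazy; count() is the
  // action that actually triggers the scan.
  def countMatches(lines: RDD[String], needle: String): Long =
    lines.filter(line => line.contains(needle)).count()

  def main(args: Array[String]): Unit = {
    // Assumption: local mode, so the sketch runs without a cluster.
    val conf = new SparkConf().setAppName("url-filter-timing").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      // Hypothetical sample data standing in for the real URL column.
      val lines = sc.parallelize(Seq(
        "http://example.com/someUrl/a",
        "http://example.com/other",
        "http://example.org/someUrl/b"
      )).cache()

      // Number of partitions, i.e. how many tasks the filter stage will run.
      println(s"partitions: ${lines.partitions.length}")

      // First run includes materializing the cache; second run reads from it.
      val (firstCount, firstMs) = timed(countMatches(lines, "someUrl"))
      val (secondCount, secondMs) = timed(countMatches(lines, "someUrl"))
      println(s"first run:  $firstCount matches in $firstMs ms")
      println(s"second run: $secondCount matches in $secondMs ms")
    } finally {
      sc.stop()
    }
  }
}
```

If the second run is much faster than the first, most of the time is likely
going into loading the data rather than into the contains check itself.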


Mayur Rustagi
Ph: +919632149971
http://www.sigmoidanalytics.com
https://twitter.com/mayur_rustagi


