Posted to user@spark.apache.org by Sandeep Singh <sa...@techaddict.me> on 2014/09/07 09:15:30 UTC

Crawler and Scraper with different priorities

Hi all,

I am implementing a crawler and scraper. It should be able to process
requests for crawling & scraping within a few seconds of submitting the
job (around 1 million/sec); the rest can take more time (scheduled evenly
over the day). What is the best way to implement this?

Thanks.
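The two service levels described above (urgent requests served within
seconds, the bulk spread evenly over the day) could be sketched,
independent of any particular framework, roughly like this. All names
here are hypothetical, and the stagger logic is a deliberately minimal
stand-in for a real scheduler:

```python
import heapq
import time

# Minimal sketch (hypothetical names): a two-tier scheduler where
# urgent crawl requests always jump the queue, and bulk requests are
# staggered so they spread out over the day instead of arriving at once.
class CrawlScheduler:
    URGENT, BULK = 0, 1

    def __init__(self, day_seconds=86400):
        self._heap = []   # entries: (priority, ready_at, seq, url)
        self._seq = 0
        self._day = day_seconds

    def submit(self, url, urgent=False):
        now = time.time()
        if urgent:
            entry = (self.URGENT, now, self._seq, url)
        else:
            # Stagger bulk requests: each gets a later earliest-start
            # time, spreading the load evenly across the day.
            offset = self._seq % self._day
            entry = (self.BULK, now + offset, self._seq, url)
        heapq.heappush(self._heap, entry)
        self._seq += 1

    def next_ready(self, now=None):
        """Pop the highest-priority request whose start time has arrived."""
        now = time.time() if now is None else now
        if not self._heap:
            return None
        prio, ready_at, seq, url = self._heap[0]
        if prio == self.URGENT or ready_at <= now:
            heapq.heappop(self._heap)
            return url
        return None
```

Urgent submissions sort ahead of bulk ones regardless of arrival order,
so a worker polling next_ready() serves them within one poll interval.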



--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Crawler-and-Scraper-with-different-priorities-tp13645.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org


Re: Crawler and Scraper with different priorities

Posted by Peng Cheng <pc...@uow.edu.au>.
Hi Sandeep,

would you be interested in joining my open source project?

https://github.com/tribbloid/spookystuff

IMHO Spark is indeed not for general-purpose crawling, where the
distributed jobs are highly homogeneous. But it is good enough for directed
scraping, which involves heterogeneous input and deep graph following &
extraction. Please drop me a line if you have a use case, and I'll try to
integrate it as a feature.

Yours Peng



--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Crawler-Scraper-with-different-priorities-tp13645p13838.html


Re: Crawler and Scraper with different priorities

Posted by Sandeep Singh <sa...@techaddict.me>.
Hi Daniil,

I have to do some processing of the results, as well as push the data to
the front end. Currently I'm using Akka for this application, but I was
thinking Spark Streaming might be a better fit, since I could also use
MLlib for processing the results. Are there any specific reasons why Spark
Streaming wouldn't be better than Akka?



--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Crawler-Scraper-with-different-priorities-tp13645p13763.html


Re: Crawler and Scraper with different priorities

Posted by Daniil Osipov <da...@shazam.com>.
Depending on what you want to do with the result of the scraping, Spark may
not be the best framework for your use case. Take a look at a general Akka
application.
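For illustration, here is a rough Python analogue of the actor-style
pipeline being suggested (not Akka itself, and all names are made up):
independent workers, each with its own mailbox (queue), where one stage
scrapes and hands results to a second stage for processing:

```python
import asyncio

# Sketch only: an actor-style pipeline in plain asyncio. Each "actor"
# owns a mailbox (an asyncio.Queue) and communicates solely by messages,
# mirroring the Akka design: scraper actor -> processor actor.
async def scraper(inbox, outbox):
    while True:
        url = await inbox.get()
        if url is None:                 # poison pill: shut the actor down
            await outbox.put(None)      # forward shutdown downstream
            break
        # Stand-in for a real HTTP fetch + parse step.
        await outbox.put({"url": url, "status": "scraped"})

async def processor(inbox, results):
    while True:
        msg = await inbox.get()
        if msg is None:
            break
        results.append(msg)             # e.g. push to the front end here

async def run(urls):
    to_scrape, scraped, results = asyncio.Queue(), asyncio.Queue(), []
    for u in urls:
        to_scrape.put_nowait(u)
    to_scrape.put_nowait(None)          # signal end of input
    await asyncio.gather(scraper(to_scrape, scraped),
                         processor(scraped, results))
    return results
```

The appeal of this shape, in Akka or otherwise, is that each stage can be
scaled or rate-limited independently, which matches the requirement of
serving some requests immediately and deferring the rest.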

On Sun, Sep 7, 2014 at 12:15 AM, Sandeep Singh <sa...@techaddict.me>
wrote:

> Hi all,
>
> I am implementing a crawler and scraper. It should be able to process
> requests for crawling & scraping within a few seconds of submitting the
> job (around 1 million/sec); the rest can take more time (scheduled evenly
> over the day). What is the best way to implement this?
>
> Thanks.