Posted to user@nutch.apache.org by Alexander Sibiryakov <si...@yandex.ru> on 2015/10/02 17:33:23 UTC

Frontera: large-scale, distributed web crawling framework

Hi Nutch users!

For the last 8 months at Scrapinghub we’ve been working on a new web crawling framework called Frontera. It is a distributed implementation of the crawl frontier part of a web crawler: the component which decides what to crawl next, when, and when to stop. So it’s not a complete web crawler. However, it suggests an overall crawler design, and there is a clean and tested way to build such a crawler in half a day from existing components.

Here is a list of the main features:
- Online operation: scheduling of new batches and updating of DB state. No need to stop crawling to change the crawling strategy.
- Storage abstraction: write your own backend (SQLAlchemy and HBase backends are included).
- Canonical URL resolution abstraction: each document can be reached via many URLs, so which one should be used? We provide a place where you can code your own logic.
- Scrapy ecosystem: good documentation, a big community, ease of customization.
- The communication layer is Apache Kafka: easy to plug in somewhere and debug.
- Crawling strategy abstraction: the crawling goal, URL ordering and scoring model are coded in a separate module (see the sketch after this list).
- Polite by design: each website is downloaded by at most one spider process.
- Workers are implemented in Python.
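
To make the strategy abstraction a bit more concrete, here is a minimal, self-contained sketch of the kind of logic such a separate module holds: a scoring function and an ordering rule, kept apart from fetching and storage. The class and method names are illustrative only and are not the actual Frontera API; treat it as a sketch of the idea, not working Frontera code.

    # Illustrative crawling-strategy module: score URLs and order the frontier.
    # Names are NOT the Frontera API; this only shows the logic the
    # "crawling strategy" abstraction is meant to isolate.
    import heapq
    from urllib.parse import urlparse


    class ToyCrawlingStrategy:
        def __init__(self, preferred_domains):
            self.preferred_domains = set(preferred_domains)
            self._queue = []  # min-heap of (-score, url)

        def score(self, url):
            # Example policy: prefer known domains and shallow paths.
            parsed = urlparse(url)
            depth = parsed.path.count("/")
            domain_bonus = 1.0 if parsed.netloc in self.preferred_domains else 0.0
            return domain_bonus + 1.0 / (1 + depth)

        def links_extracted(self, links):
            # Called when new links are discovered; decide what gets scheduled.
            for url in links:
                heapq.heappush(self._queue, (-self.score(url), url))

        def get_next_requests(self, max_n):
            # Hand the next batch to the fetcher, highest score first.
            batch = []
            while self._queue and len(batch) < max_n:
                _, url = heapq.heappop(self._queue)
                batch.append(url)
            return batch


    if __name__ == "__main__":
        strategy = ToyCrawlingStrategy(preferred_domains={"example.com"})
        strategy.links_extracted([
            "http://example.com/news",
            "http://other.org/a/very/deep/path",
            "http://example.com/",
        ])
        print(strategy.get_next_requests(2))
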
In general, such a web crawler should be very easy to customize, easy to plug into existing infrastructure, and its online operation could be useful for crawling frequently changing web pages: news websites, for example. We tested it at some scale by crawling part of the Spanish internet; you can find details in my presentation.
http://events.linuxfoundation.org/sites/events/files/slides/Frontera-crawling%20the%20spanish%20web.pdf

The project is currently on GitHub; it’s open source, under its own license.
https://github.com/scrapinghub/frontera
https://github.com/scrapinghub/distributed-frontera

The questions are: what do you guys think? Is this a useful thing? If yes, what kind of use cases do you see? Currently I’m looking for businesses that could benefit from it, so please write me if you have any ideas on that.

A.

Re: Frontera: large-scale, distributed web crawling framework

Posted by Jessica Glover <gl...@gmail.com>.
Sorry, I just re-read your message and saw that it's open source, but under what
license? I apologize if you're not trying to sell this.


Re: Frontera: large-scale, distributed web crawling framework

Posted by Jessica Glover <gl...@gmail.com>.
Alexander, I apologize. I misunderstood the intent of your message and I
was very rude in my response. I will think about what you've asked and get
back to you.

Also, I enjoyed your slide presentation. It's very pleasing to the eye.

Sincerely,
Jessica


Re: Frontera: large-scale, distributed web crawling framework

Posted by "Mattmann, Chris A (3980)" <ch...@jpl.nasa.gov>.
Hi,

I don’t think Alexander is doing anything wrong. In fact, he’s
asking for input on his web crawling framework on the Nutch user
list which I imagine contains many people interested in distributed
web crawling. 

There doesn’t appear to be a direct Nutch connection here in his
framework; however, it uses other Apache technologies (Kafka, HBase,
etc.) that we are using, or thinking of using, and that are of interest,
at least from my perspective as a Nutch developer and PMC member.
There are also several efforts to figure out how to use Scrapy
with Nutch, and this may be an interesting connection.

If Alexander and people like him who aren’t using Nutch per se never
came to the Nutch list and discussed common web crawling topics of
interest, we’d continue to have our silos and our own separate lists,
and our own discussions, etc., instead of trying to work together
as a broader community of folks and we’d miss out on potential
opportunities where in the future, perhaps we could actually share
more than simply ideas, but also software too.

I applaud Alexander for coming to this list and not staying in his
own silo and trying to get input from the Apache Nutch community.

Thank you Alexander.

Cheers,
Chris

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattmann@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++






Re: Frontera: large-scale, distributed web crawling framework

Posted by Jessica Glover <gl...@gmail.com>.
Hmm... you're asking for a free consultation on an open source software
user mailing list? First, this doesn't exactly seem like the appropriate
place for that. Second, offer some incentive if you want someone to help
you with your business.


Re: Frontera: large-scale, distributed web crawling framework

Posted by "Mattmann, Chris A (3980)" <ch...@jpl.nasa.gov>.
Thank you so much for writing this up Alexander. This is a comprehensive
plan for how Nutch and Frontera overlap. I went ahead and created a wiki
page that we can use to outline possible goals and research projects,
perhaps Google Summer of Code projects where they can work together,
even starting this year. I would be happy to grant you permissions
to the wiki so you can update the page too:

https://wiki.apache.org/nutch/NutchAndFronteraDesignGoals

To answer your question about Nutch design goals: for me, these are the
areas that I am very interested in, and investing in, right now with Nutch:

1. Deep Web Extractions - both from a crawler, and using various
interactive and non-obtrusive JavaScript libraries. We started with
Selenium but are now looking at HtmlUnit, PhantomJS, and others.

2. Measuring Crawl Footprint - I think we need to better understand
the crawl footprint, and use that information to better guide and
strategize crawling.

3. Adaptive and ML-based crawling algorithms - my team is working on
a machine-learning-based algorithm for crawling that leverages Naive
Bayes and reinforcement learning (RL).

4. Content Extraction from more and more formats with Tika. This is
one potential area we could overlap on, since there is both a Tika Python
[1] library and a Nutch Python [2] library (originating from DARPA
Memex).
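
Regarding item 4, here is a small sketch of what using the Tika Python library [1] could look like. It assumes the tika package is installed (pip install tika) and that it can start a local Tika server on first use; "example.pdf" stands in for any real document.

    # Content extraction with the tika-python library [1]: a small usage sketch.
    # Assumes `pip install tika`; on first use the library starts a local Tika
    # server. "example.pdf" is just a placeholder file name.
    from tika import parser

    parsed = parser.from_file("example.pdf")
    print(parsed["metadata"].get("Content-Type"))   # detected MIME type
    print((parsed.get("content") or "")[:500])      # first 500 chars of extracted text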

It seems like the work we do in the Python libraries, and with
Tika, could potentially serve as an initial integration point.
crawling, I’m also interested in how Nutch and Spark can work together.
Nutch over Spark is something I have a few researchers in my team
working on now.

OK, those are my ideas. Sorry it took so long to reply!

Cheers,
Chris

[1] http://github.com/chrismattmann/tika-python/
[2] http://github.com/chrismattmann/nutch-python/

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattmann@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++







Re: Frontera: large-scale, distributed web crawling framework

Posted by Alexander Sibiryakov <si...@yandex.ru>.
Hi Chris,

Sorry for the long delay; it wasn’t easy to answer your questions, so I took some time to think. Please forgive me if I mention some facts about Nutch which aren’t true; this is mostly because of my time limitations.

Here are the possible goals of integrating Frontera and Nutch:
- to get the best of both: Nutch is good at scale and faster at fetching/parsing, while Frontera/Scrapy is online, much easier to customize, well documented and written in Python,
- to ease migration from Frontera to Nutch and the other way around,
- to identify and fix design problems.

Now, a few words on how Nutch and Frontera could work together. 
1. The Nutch Fetcher could easily be used with Frontera if it were implemented as a service, communicating by means of Kafka or ZeroMQ and speaking the Frontera protocol (which is documented). Fetching involves parsing and many string operations that could be more efficient in the JVM. FetchItem would require an adapter to the Frontera Request, and the same for ParseData.

It could help Frontera users save some time on fetching, but if the use case requires scraping (for broad crawling it usually doesn’t), they would need to add a scraping step later.
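
To make option 1 a bit more concrete, here is a rough sketch of the service side using the kafka-python client. The topic names, the JSON message shape and fetch_with_nutch() are placeholders invented for illustration; the real message encoding would have to follow the documented Frontera protocol.

    # Sketch only: a fetcher service that consumes Frontera-style requests from
    # Kafka and publishes fetch results back. Topic names, the JSON message
    # format and fetch_with_nutch() are hypothetical placeholders, not the
    # actual Frontera protocol.
    import json
    from kafka import KafkaConsumer, KafkaProducer


    def fetch_with_nutch(url):
        # Placeholder for handing the URL to the Nutch fetcher and returning
        # (status_code, body, extracted_links).
        raise NotImplementedError


    def run(bootstrap="localhost:9092"):
        consumer = KafkaConsumer(
            "frontera-requests",
            bootstrap_servers=bootstrap,
            value_deserializer=lambda m: json.loads(m.decode("utf-8")),
        )
        producer = KafkaProducer(
            bootstrap_servers=bootstrap,
            value_serializer=lambda m: json.dumps(m).encode("utf-8"),
        )
        for message in consumer:
            request = message.value            # e.g. {"url": ..., "meta": {...}}
            status, body, links = fetch_with_nutch(request["url"])
            producer.send("frontera-responses", {
                "url": request["url"],
                "status": status,
                "links": links,
                "meta": request.get("meta", {}),
            })
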

2. Scrapy could be used as a fetcher for Nutch too. We just need to figure out how to run a Scrapy spider in a Hadoop environment; input/output adapters and a process wrapper are needed. Some interface modifications are also required to use items extracted from content in a Nutch-Solr (or other Lucene-based) pipeline. Conceptually, Scrapy is much more efficient at network operations: an asynchronous select()/epoll-based HTTP client and a connection pool. This could be improved in Nutch.

This would make writing/debugging custom scraping code amazingly easy. Plus, Nutch would be used as a crawl frontier for Scrapy, and Tika-based parsing and indexing primitives could be used for building search.
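
For readers less familiar with the Scrapy side of option 2, here is a minimal spider of the kind that would be wrapped and driven by Nutch-generated input; the URLs and CSS selectors are placeholders.

    # Minimal Scrapy spider, the kind of per-site scraping code option 2 would
    # make easy to write and debug. URLs and CSS selectors are placeholders.
    import scrapy


    class NewsSpider(scrapy.Spider):
        name = "news"
        # In option 2 these would come from Nutch-generated segments.
        start_urls = ["http://example.com/news"]

        def parse(self, response):
            # Yield structured items that a Nutch-Solr pipeline could index.
            for article in response.css("article"):
                yield {
                    "url": response.url,
                    "title": article.css("h2::text").get(),
                    "summary": article.css("p::text").get(),
                }
            # Follow pagination links, letting the frontier decide priority.
            for href in response.css("a.next::attr(href)").getall():
                yield response.follow(href, callback=self.parse)
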

3. Frontera’s DB and strategy workers could be used in a Hadoop/Nutch pipeline to generate Nutch segments and read fetcher output, with slight modifications. It’s possible to generate quite a big segment by continuously running the get_next_requests() routine (which is meant to be used for small batches). The workers use low-level storage; currently HBase and RDBMS are supported. The number of workers can be scaled; they’re designed for this. The same problems apply here: the need for adapters and process wrappers. An RDBMS could suffer from concurrent access, but that’s solvable.

This would allow using Frontera as a crawl frontier with Nutch. It could be helpful if someone wants to implement a crawling strategy in Python.
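
As an illustration of the “big segment from small batches” idea in option 3, here is a sketch that repeatedly drains small batches from a Frontera-like backend into one segment-sized URL list. The backend object, the get_next_requests() signature as used here, and write_segment() are assumptions for illustration; the real API should be taken from the Frontera docs.

    # Sketch of option 3: build one large Nutch-style segment by repeatedly
    # draining small batches from a Frontera-like backend. The backend object
    # and write_segment() are placeholders; the real get_next_requests()
    # signature should be taken from the Frontera documentation.
    def build_segment(backend, segment_size=100000, batch_size=512):
        urls = []
        while len(urls) < segment_size:
            batch = backend.get_next_requests(batch_size)
            if not batch:
                break                      # frontier temporarily exhausted
            urls.extend(r.url for r in batch)
        return urls


    def write_segment(urls, path):
        # Placeholder: a real integration would emit a Nutch segment
        # (e.g. via a generate-like job); here we just dump a URL list.
        with open(path, "w") as f:
            f.write("\n".join(urls))
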

Nutch and Frontera use cases don’t completely overlap. The majority of people who look into Frontera want to crawl a small number of websites, scrape some data from them and revisit. Sometimes they need to scale the fetching (meaning no polite crawling here) or the parsing/scraping part, and sometimes they need custom prioritization or external queue management. Quite a few are using it for broad crawling with Kafka and HBase.

I would appreciate it if you could write up your vision of the major Nutch use cases, so we could compare.

It’s up to us which direction to choose, but I think options 1 and 2 are the most important.

Currently, Frontera is moving towards ease of use: a ZeroMQ transport, a transport layer abstraction, a standalone Frontera/Scrapy-based crawler in Docker, and a web UI.


A.



Re: Frontera: large-scale, distributed web crawling framework

Posted by "Mattmann, Chris A (3980)" <ch...@jpl.nasa.gov>.
Hi Alex,

I didn’t see any more traffic about this. Are you still looking
for feedback? Are there any plans to make Frontera and Nutch
work together?

I’m still interested of course. Thanks.

Thanks,
Chris

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattmann@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++




