Posted to dev@nutch.apache.org by xiao yang <ya...@gmail.com> on 2010/10/26 04:56:39 UTC

Are there any web crawlers based on database?

Hi, guys,

Nutch has its own data format for CrawlDB and LinkDB, which are
difficult to manage and share among applications.
Are there any web crawlers based on relational database?
I can see that Nutch is trying to use HBase for storage, but why not
use a relational database instead? We can use partitioning to solve
the scalability problem.

Thanks!
Xiao

Fwd: Are there any web crawlers based on database?

Posted by Scott Gonyea <sc...@aitrus.org>.
Ugh, I fail at all things GMail.  I wish it'd learn to do what I meant
to tell it to.


Lots of things will "work," the question is all about what you're
doing, specifically.  I avoid trolling with phrases like "MySQL can't
scale" (unless I know I can get a funny response).  MySQL works and
scales wonderfully for a specific set of problems, 'more than good
enough' for most problems, and will make your life needlessly
difficult for some others.

If you post some larger insights into what you want to warehouse from
your crawl data, and what you plan to do with it, I can try to give
some deeper feedback on how to approach it.  But really, nothing too
awful can come from putting it into SQL and picking up your own set of
lessons.  It may well be good enough and have just the right level of
convenience for whomever is using it.

There's no real "right" or "wrong" answer, which is what makes some of
this stuff a real PITA.  Sometimes, it'd be nice if someone told me
what tool to use--so I could move on with my life, and solve the
nonsense I was supposed to.  It's all still very new, right now--but
Solr (thus Lucene) has a fairly established track record in
indexing/cataloguing heavily de-normalized internet sludge.

Scott Gonyea

On Tue, Oct 26, 2010 at 10:14 PM, xiao yang <ya...@gmail.com> wrote:
> Hi, Scott,
>
> I agree with you on the uselessness of row-locking and transactional
> integrity features. But we can reduce the overhead by reading data by
> block. I mean read many rows(like 1K, or more) at a time, and process
> them in memory. Do you think whether it will work?
>
> Thanks!
> Xiao
>
> On Wed, Oct 27, 2010 at 4:53 AM, Scott Gonyea <me...@sgonyea.com> wrote:
>> Not that it's guaranteed to be of "next to no value" but really,
>> you've probably already lost pages just crawling them.  Server /
>> network errors, for example, takes the integrity question and makes it
>> a cost-benefit.  Do you recrawl a bunch?  At different times?
>> Different geographies?
>>
>> Row locking is reasonably nice, but that begs other questions.  It can
>> easily be solved one of two ways:  Put your data is Solr, and persist
>> your efforts in both places:  Solr and an SQL backend.  If you're
>> using riak (or Cassandra), you allow document collisions to exist and
>> reconcile them within your application.
>>
>> It sounds complex, but are actually quite trivial to implement.
>>
>> Scott
>>
>> On Tue, Oct 26, 2010 at 1:39 PM, Scott Gonyea <me...@sgonyea.com> wrote:
>>> I love relational databases, but their many features are (in my
>>> opinion) wasted on what you find in Nutch.  Row-locking and
>>> transactional integrity is great for lots of applications, but becomes
>>> a whole lot of overhead when it's of next-to-no-value to whatever
>>> you're doing.
>>>
>>> RE: counting URLs:  Have you looked at Solr's facets, etc?  I use them
>>> like they're going out of style--and it's very powerful.
>>>
>>> For my application, Solr *is* my database.  Nutch crawls data, stores
>>> it somewhere, then picks it back up and drops it in Solr.  all of my
>>> crawl data sits in Solr.  I actively report on stats from Solr, as
>>> well as make updates to the content that's stored.  Lots of fields /
>>> boolean attributes sit in the schema.
>>>
>>> As the user works through the app, their changes get pushed back into
>>> Solr.  Then when they next hit "Search," results disappear / move
>>> around as they had organized it.
>>>
>>> Scott
>>>
>>> On Tue, Oct 26, 2010 at 12:20 AM, xiao yang <ya...@gmail.com> wrote:
>>>> Hi, Scott,
>>>>
>>>> Thanks for your reply.
>>>> I'm curious about the reason why using database is awful.
>>>> Here is my requirement: we have two developers who want to do some
>>>> processing and analysis work on the crawled data. If the data is
>>>> stored in database, we can easily share our data, for the well-defined
>>>> data models. What's more, the analysis results can also be easily
>>>> stored back into the database by just adding a few fields.
>>>> For example, I need to know the average number of urls in one site. In
>>>> database, a single SQL will do. If I want to extract and store the
>>>> main part of web pages, I can't easily modify the data structure of
>>>> Nutch easily. Even in Solr, it's difficult and inefficient to iterate
>>>> through the data set.
>>>> The crawled data is structured, then why not using database?
>>>>
>>>> Thanks!
>>>> Xiao
>>>>
>>>> On Tue, Oct 26, 2010 at 11:54 AM, Scott Gonyea <me...@sgonyea.com> wrote:
>>>>> Use Solr?  At its core, Solr is a document database.  Using a relational database, to warehouse your crawl data, is generally an awful idea.  I'd go so far as to suggest that you're probably looking at things the wrong way. :)
>>>>>
>>>>> I liken crawl data to sludge.  Don't try to normalize it.  Know what you want to get from it, and expose that data the best way possible.  If you want to store it, index it, query it, transform it, collect statistics, etc... Solr is a terrific tool.  Amazingly so.
>>>>>
>>>>> That said, you also have another very good choice.  Take a look at Riak Search.  They hijacked many core elements of Solr, which I applaud, and is compatible with Solr's http interface.  In effect, you can point Nutch's solr-index job, instead, at a Riak Search node and put your data there.
>>>>>
>>>>> The other nice thing: Riak is a (self-described) "mini-hadoop."  So you can search across the Solr indexes, that it's built on top of, or you can throw MapReduce jobs at riak and perform some very detailed analytics.
>>>>>
>>>>> I don't know of a database that lacks a Java client, so the potential for indexing plugins is limitless... regardless of where the data is placed.
>>>>>
>>>>> Scott Gonyea
>>>>>
>>>>> On Oct 25, 2010, at 7:56 PM, xiao yang wrote:
>>>>>
>>>>>> Hi, guys,
>>>>>>
>>>>>> Nutch has its own data format for CrawlDB and LinkDB, which are
>>>>>> difficult to manage and share among applications.
>>>>>> Are there any web crawlers based on relational database?
>>>>>> I can see that Nutch is trying to use HBase for storage, but why not
>>>>>> use a relational database instead? We can use partitioning to solve
>>>>>> scalability problem.
>>>>>>
>>>>>> Thanks!
>>>>>> Xiao
>>>>>
>>>>>
>>>>
>>>
>>
>
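
Xiao's point in the quoted thread above -- that the average number of URLs
per site is a single SQL query away -- would look roughly like the JDBC
sketch below. The table and column names (pages, host, url) are assumptions
for illustration, not part of any existing Nutch schema.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class AvgUrlsPerSite {
    public static void main(String[] args) throws Exception {
        // Hypothetical table: pages(url VARCHAR, host VARCHAR, ...)
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:mysql://localhost/crawl", "user", "pass");
             Statement st = conn.createStatement();
             // total rows divided by distinct hosts = average URLs per site
             ResultSet rs = st.executeQuery(
                 "SELECT COUNT(*) / COUNT(DISTINCT host) AS avg_urls FROM pages")) {
            if (rs.next()) {
                System.out.println("average urls per site: " + rs.getDouble("avg_urls"));
            }
        }
    }
}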

More real-time crawling

Posted by Ken Krugler <kk...@transpac.com>.
Hi Xiao,

FWIR there is adaptive refetch interval support in Nutch currently -  
or are you looking for something different?

Regards,

-- Ken
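
For reference, the adaptive schedule Ken mentions is normally enabled through
configuration. A minimal sketch in Java follows; the interval property names
are recalled from nutch-default.xml and should be verified there, and the
values are only examples.

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.util.NutchConfiguration;

public class AdaptiveScheduleSetup {
    public static void main(String[] args) {
        Configuration conf = NutchConfiguration.create();
        // Swap the default fixed-interval schedule for the adaptive one.
        conf.set("db.fetch.schedule.class",
                 "org.apache.nutch.crawl.AdaptiveFetchSchedule");
        // Bounds on the adapted re-fetch interval, in seconds
        // (property names assumed from nutch-default.xml -- verify before use).
        conf.set("db.fetch.schedule.adaptive.min_interval", "3600");    // 1 hour
        conf.set("db.fetch.schedule.adaptive.max_interval", "2592000"); // 30 days
    }
}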

On Oct 27, 2010, at 1:42am, xiao yang wrote:

> I want to modify the schedule of crawler to make it more real-time.
> Some web pages are frequently updated, while others seldom change. My
> idea is to classify URL into 2 categories which will affect the score
> of URL, so I want to add a field to store which category a URL belongs
> to.
> The idea is simple, but I found it's not so easy to implement in  
> Nutch.
>
> Thanks!
> Xiao

--------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g






Re: Are there any web crawlers based on database?

Posted by xiao yang <ya...@gmail.com>.
I want to modify the crawler's schedule to make it more real-time.
Some web pages are frequently updated, while others seldom change. My
idea is to classify URLs into two categories, which will affect the
score of each URL, so I want to add a field to store which category a
URL belongs to.
The idea is simple, but I found it's not so easy to implement in Nutch.

Thanks!
Xiao
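
Outside of Nutch's own plugin APIs, the two-category idea reads roughly like
the sketch below; the categories, the URL pattern used to classify, and the
interval values are purely illustrative.

import java.util.concurrent.TimeUnit;

public class UrlCategory {
    // Two illustrative categories: pages that change often vs. rarely.
    enum Category { FREQUENTLY_UPDATED, SELDOM_CHANGED }

    // Hypothetical classifier; a real one might look at past fetch history,
    // URL patterns (news sections, forums) or content signatures.
    static Category classify(String url) {
        return url.contains("/news/") ? Category.FREQUENTLY_UPDATED
                                      : Category.SELDOM_CHANGED;
    }

    // The stored category can then drive the re-fetch interval (or a score boost).
    static long fetchIntervalSeconds(Category c) {
        return c == Category.FREQUENTLY_UPDATED
                ? TimeUnit.HOURS.toSeconds(1)
                : TimeUnit.DAYS.toSeconds(30);
    }
}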

On Wed, Oct 27, 2010 at 2:04 PM, Scott Gonyea <sc...@aitrus.org> wrote:
> Lots of things will "work," the question is all about what you're
> doing, specifically.  I avoid trolling with phrases like "MySQL can't
> scale" (unless I know I can get a funny response).  MySQL works and
> scales wonderfully for a specific set of problems, 'more than good
> enough' for most problems, and will make your life needlessly
> difficult for some others.
>
> If you post some larger insights into what you want to warehouse from
> your crawl data, and what you plan to do with it, I can try to give
> some deeper feedback on how to approach it.  But really, nothing too
> awful can come from putting it into SQL and picking up your own set of
> lessons.  It may well be good enough and have just the right level of
> convenience for whomever is using it.
>
> There's no real "right" or "wrong" answer, which is what makes some of
> this stuff a real PITA.  Sometimes, it'd be nice if someone told me
> what tool to use--so I could move on with my life, and solve the
> nonsense I was supposed to.  It's all still very new, right now--but
> Solr (thus Lucene) have a fairly established track record in
> indexing/cataloguing heavily de-normalized internet sludge.
>
> Scott Gonyea
>
> On Tue, Oct 26, 2010 at 10:14 PM, xiao yang <ya...@gmail.com> wrote:
>> Hi, Scott,
>>
>> I agree with you on the uselessness of row-locking and transactional
>> integrity features. But we can reduce the overhead by reading data by
>> block. I mean read many rows(like 1K, or more) at a time, and process
>> them in memory. Do you think whether it will work?
>>
>> Thanks!
>> Xiao
>>
>> On Wed, Oct 27, 2010 at 4:53 AM, Scott Gonyea <me...@sgonyea.com> wrote:
>>> Not that it's guaranteed to be of "next to no value" but really,
>>> you've probably already lost pages just crawling them.  Server /
>>> network errors, for example, takes the integrity question and makes it
>>> a cost-benefit.  Do you recrawl a bunch?  At different times?
>>> Different geographies?
>>>
>>> Row locking is reasonably nice, but that begs other questions.  It can
>>> easily be solved one of two ways:  Put your data is Solr, and persist
>>> your efforts in both places:  Solr and an SQL backend.  If you're
>>> using riak (or Cassandra), you allow document collisions to exist and
>>> reconcile them within your application.
>>>
>>> It sounds complex, but are actually quite trivial to implement.
>>>
>>> Scott
>>>
>>> On Tue, Oct 26, 2010 at 1:39 PM, Scott Gonyea <me...@sgonyea.com> wrote:
>>>> I love relational databases, but their many features are (in my
>>>> opinion) wasted on what you find in Nutch.  Row-locking and
>>>> transactional integrity is great for lots of applications, but becomes
>>>> a whole lot of overhead when it's of next-to-no-value to whatever
>>>> you're doing.
>>>>
>>>> RE: counting URLs:  Have you looked at Solr's facets, etc?  I use them
>>>> like they're going out of style--and it's very powerful.
>>>>
>>>> For my application, Solr *is* my database.  Nutch crawls data, stores
>>>> it somewhere, then picks it back up and drops it in Solr.  all of my
>>>> crawl data sits in Solr.  I actively report on stats from Solr, as
>>>> well as make updates to the content that's stored.  Lots of fields /
>>>> boolean attributes sit in the schema.
>>>>
>>>> As the user works through the app, their changes get pushed back into
>>>> Solr.  Then when they next hit "Search," results disappear / move
>>>> around as they had organized it.
>>>>
>>>> Scott
>>>>
>>>> On Tue, Oct 26, 2010 at 12:20 AM, xiao yang <ya...@gmail.com> wrote:
>>>>> Hi, Scott,
>>>>>
>>>>> Thanks for your reply.
>>>>> I'm curious about the reason why using database is awful.
>>>>> Here is my requirement: we have two developers who want to do some
>>>>> processing and analysis work on the crawled data. If the data is
>>>>> stored in database, we can easily share our data, for the well-defined
>>>>> data models. What's more, the analysis results can also be easily
>>>>> stored back into the database by just adding a few fields.
>>>>> For example, I need to know the average number of urls in one site. In
>>>>> database, a single SQL will do. If I want to extract and store the
>>>>> main part of web pages, I can't easily modify the data structure of
>>>>> Nutch easily. Even in Solr, it's difficult and inefficient to iterate
>>>>> through the data set.
>>>>> The crawled data is structured, then why not using database?
>>>>>
>>>>> Thanks!
>>>>> Xiao
>>>>>
>>>>> On Tue, Oct 26, 2010 at 11:54 AM, Scott Gonyea <me...@sgonyea.com> wrote:
>>>>>> Use Solr?  At its core, Solr is a document database.  Using a relational database, to warehouse your crawl data, is generally an awful idea.  I'd go so far as to suggest that you're probably looking at things the wrong way. :)
>>>>>>
>>>>>> I liken crawl data to sludge.  Don't try to normalize it.  Know what you want to get from it, and expose that data the best way possible.  If you want to store it, index it, query it, transform it, collect statistics, etc... Solr is a terrific tool.  Amazingly so.
>>>>>>
>>>>>> That said, you also have another very good choice.  Take a look at Riak Search.  They hijacked many core elements of Solr, which I applaud, and is compatible with Solr's http interface.  In effect, you can point Nutch's solr-index job, instead, at a Riak Search node and put your data there.
>>>>>>
>>>>>> The other nice thing: Riak is a (self-described) "mini-hadoop."  So you can search across the Solr indexes, that it's built on top of, or you can throw MapReduce jobs at riak and perform some very detailed analytics.
>>>>>>
>>>>>> I don't know of a database that lacks a Java client, so the potential for indexing plugins is limitless... regardless of where the data is placed.
>>>>>>
>>>>>> Scott Gonyea
>>>>>>
>>>>>> On Oct 25, 2010, at 7:56 PM, xiao yang wrote:
>>>>>>
>>>>>>> Hi, guys,
>>>>>>>
>>>>>>> Nutch has its own data format for CrawlDB and LinkDB, which are
>>>>>>> difficult to manage and share among applications.
>>>>>>> Are there any web crawlers based on relational database?
>>>>>>> I can see that Nutch is trying to use HBase for storage, but why not
>>>>>>> use a relational database instead? We can use partitioning to solve
>>>>>>> scalability problem.
>>>>>>>
>>>>>>> Thanks!
>>>>>>> Xiao
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

Re: Are there any web crawlers based on database?

Posted by Scott Gonyea <sc...@aitrus.org>.
Lots of things will "work," the question is all about what you're
doing, specifically.  I avoid trolling with phrases like "MySQL can't
scale" (unless I know I can get a funny response).  MySQL works and
scales wonderfully for a specific set of problems, 'more than good
enough' for most problems, and will make your life needlessly
difficult for some others.

If you post some larger insights into what you want to warehouse from
your crawl data, and what you plan to do with it, I can try to give
some deeper feedback on how to approach it.  But really, nothing too
awful can come from putting it into SQL and picking up your own set of
lessons.  It may well be good enough and have just the right level of
convenience for whomever is using it.

There's no real "right" or "wrong" answer, which is what makes some of
this stuff a real PITA.  Sometimes, it'd be nice if someone told me
what tool to use--so I could move on with my life, and solve the
nonsense I was supposed to.  It's all still very new, right now--but
Solr (thus Lucene) has a fairly established track record in
indexing/cataloguing heavily de-normalized internet sludge.

Scott Gonyea

On Tue, Oct 26, 2010 at 10:14 PM, xiao yang <ya...@gmail.com> wrote:
> Hi, Scott,
>
> I agree with you on the uselessness of row-locking and transactional
> integrity features. But we can reduce the overhead by reading data by
> block. I mean read many rows(like 1K, or more) at a time, and process
> them in memory. Do you think whether it will work?
>
> Thanks!
> Xiao
>
> On Wed, Oct 27, 2010 at 4:53 AM, Scott Gonyea <me...@sgonyea.com> wrote:
>> Not that it's guaranteed to be of "next to no value" but really,
>> you've probably already lost pages just crawling them.  Server /
>> network errors, for example, takes the integrity question and makes it
>> a cost-benefit.  Do you recrawl a bunch?  At different times?
>> Different geographies?
>>
>> Row locking is reasonably nice, but that begs other questions.  It can
>> easily be solved one of two ways:  Put your data is Solr, and persist
>> your efforts in both places:  Solr and an SQL backend.  If you're
>> using riak (or Cassandra), you allow document collisions to exist and
>> reconcile them within your application.
>>
>> It sounds complex, but are actually quite trivial to implement.
>>
>> Scott
>>
>> On Tue, Oct 26, 2010 at 1:39 PM, Scott Gonyea <me...@sgonyea.com> wrote:
>>> I love relational databases, but their many features are (in my
>>> opinion) wasted on what you find in Nutch.  Row-locking and
>>> transactional integrity is great for lots of applications, but becomes
>>> a whole lot of overhead when it's of next-to-no-value to whatever
>>> you're doing.
>>>
>>> RE: counting URLs:  Have you looked at Solr's facets, etc?  I use them
>>> like they're going out of style--and it's very powerful.
>>>
>>> For my application, Solr *is* my database.  Nutch crawls data, stores
>>> it somewhere, then picks it back up and drops it in Solr.  all of my
>>> crawl data sits in Solr.  I actively report on stats from Solr, as
>>> well as make updates to the content that's stored.  Lots of fields /
>>> boolean attributes sit in the schema.
>>>
>>> As the user works through the app, their changes get pushed back into
>>> Solr.  Then when they next hit "Search," results disappear / move
>>> around as they had organized it.
>>>
>>> Scott
>>>
>>> On Tue, Oct 26, 2010 at 12:20 AM, xiao yang <ya...@gmail.com> wrote:
>>>> Hi, Scott,
>>>>
>>>> Thanks for your reply.
>>>> I'm curious about the reason why using database is awful.
>>>> Here is my requirement: we have two developers who want to do some
>>>> processing and analysis work on the crawled data. If the data is
>>>> stored in database, we can easily share our data, for the well-defined
>>>> data models. What's more, the analysis results can also be easily
>>>> stored back into the database by just adding a few fields.
>>>> For example, I need to know the average number of urls in one site. In
>>>> database, a single SQL will do. If I want to extract and store the
>>>> main part of web pages, I can't easily modify the data structure of
>>>> Nutch easily. Even in Solr, it's difficult and inefficient to iterate
>>>> through the data set.
>>>> The crawled data is structured, then why not using database?
>>>>
>>>> Thanks!
>>>> Xiao
>>>>
>>>> On Tue, Oct 26, 2010 at 11:54 AM, Scott Gonyea <me...@sgonyea.com> wrote:
>>>>> Use Solr?  At its core, Solr is a document database.  Using a relational database, to warehouse your crawl data, is generally an awful idea.  I'd go so far as to suggest that you're probably looking at things the wrong way. :)
>>>>>
>>>>> I liken crawl data to sludge.  Don't try to normalize it.  Know what you want to get from it, and expose that data the best way possible.  If you want to store it, index it, query it, transform it, collect statistics, etc... Solr is a terrific tool.  Amazingly so.
>>>>>
>>>>> That said, you also have another very good choice.  Take a look at Riak Search.  They hijacked many core elements of Solr, which I applaud, and is compatible with Solr's http interface.  In effect, you can point Nutch's solr-index job, instead, at a Riak Search node and put your data there.
>>>>>
>>>>> The other nice thing: Riak is a (self-described) "mini-hadoop."  So you can search across the Solr indexes, that it's built on top of, or you can throw MapReduce jobs at riak and perform some very detailed analytics.
>>>>>
>>>>> I don't know of a database that lacks a Java client, so the potential for indexing plugins is limitless... regardless of where the data is placed.
>>>>>
>>>>> Scott Gonyea
>>>>>
>>>>> On Oct 25, 2010, at 7:56 PM, xiao yang wrote:
>>>>>
>>>>>> Hi, guys,
>>>>>>
>>>>>> Nutch has its own data format for CrawlDB and LinkDB, which are
>>>>>> difficult to manage and share among applications.
>>>>>> Are there any web crawlers based on relational database?
>>>>>> I can see that Nutch is trying to use HBase for storage, but why not
>>>>>> use a relational database instead? We can use partitioning to solve
>>>>>> scalability problem.
>>>>>>
>>>>>> Thanks!
>>>>>> Xiao
>>>>>
>>>>>
>>>>
>>>
>>
>

Re: Are there any web crawlers based on database?

Posted by xiao yang <ya...@gmail.com>.
Hi, Scott,

I agree with you on the uselessness of row-locking and transactional
integrity features. But we can reduce the overhead by reading data in
blocks. I mean reading many rows (like 1K, or more) at a time and
processing them in memory. Do you think it will work?

Thanks!
Xiao
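
A minimal JDBC sketch of the block-read idea above; the connection URL, table,
and column names are placeholders, and whether setFetchSize actually streams
rows depends on the driver.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class BlockReader {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:mysql://localhost/crawl", "user", "pass")) {
            PreparedStatement ps = conn.prepareStatement(
                "SELECT url, content FROM pages",
                ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY);
            // Hint the driver to hand rows back in chunks instead of
            // materializing the whole result set at once.
            ps.setFetchSize(1000);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    process(rs.getString("url"), rs.getString("content"));
                }
            }
        }
    }

    static void process(String url, String content) {
        // ... in-memory processing of one row ...
    }
}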

On Wed, Oct 27, 2010 at 4:53 AM, Scott Gonyea <me...@sgonyea.com> wrote:
> Not that it's guaranteed to be of "next to no value" but really,
> you've probably already lost pages just crawling them.  Server /
> network errors, for example, takes the integrity question and makes it
> a cost-benefit.  Do you recrawl a bunch?  At different times?
> Different geographies?
>
> Row locking is reasonably nice, but that begs other questions.  It can
> easily be solved one of two ways:  Put your data is Solr, and persist
> your efforts in both places:  Solr and an SQL backend.  If you're
> using riak (or Cassandra), you allow document collisions to exist and
> reconcile them within your application.
>
> It sounds complex, but are actually quite trivial to implement.
>
> Scott
>
> On Tue, Oct 26, 2010 at 1:39 PM, Scott Gonyea <me...@sgonyea.com> wrote:
>> I love relational databases, but their many features are (in my
>> opinion) wasted on what you find in Nutch.  Row-locking and
>> transactional integrity is great for lots of applications, but becomes
>> a whole lot of overhead when it's of next-to-no-value to whatever
>> you're doing.
>>
>> RE: counting URLs:  Have you looked at Solr's facets, etc?  I use them
>> like they're going out of style--and it's very powerful.
>>
>> For my application, Solr *is* my database.  Nutch crawls data, stores
>> it somewhere, then picks it back up and drops it in Solr.  all of my
>> crawl data sits in Solr.  I actively report on stats from Solr, as
>> well as make updates to the content that's stored.  Lots of fields /
>> boolean attributes sit in the schema.
>>
>> As the user works through the app, their changes get pushed back into
>> Solr.  Then when they next hit "Search," results disappear / move
>> around as they had organized it.
>>
>> Scott
>>
>> On Tue, Oct 26, 2010 at 12:20 AM, xiao yang <ya...@gmail.com> wrote:
>>> Hi, Scott,
>>>
>>> Thanks for your reply.
>>> I'm curious about the reason why using database is awful.
>>> Here is my requirement: we have two developers who want to do some
>>> processing and analysis work on the crawled data. If the data is
>>> stored in database, we can easily share our data, for the well-defined
>>> data models. What's more, the analysis results can also be easily
>>> stored back into the database by just adding a few fields.
>>> For example, I need to know the average number of urls in one site. In
>>> database, a single SQL will do. If I want to extract and store the
>>> main part of web pages, I can't easily modify the data structure of
>>> Nutch easily. Even in Solr, it's difficult and inefficient to iterate
>>> through the data set.
>>> The crawled data is structured, then why not using database?
>>>
>>> Thanks!
>>> Xiao
>>>
>>> On Tue, Oct 26, 2010 at 11:54 AM, Scott Gonyea <me...@sgonyea.com> wrote:
>>>> Use Solr?  At its core, Solr is a document database.  Using a relational database, to warehouse your crawl data, is generally an awful idea.  I'd go so far as to suggest that you're probably looking at things the wrong way. :)
>>>>
>>>> I liken crawl data to sludge.  Don't try to normalize it.  Know what you want to get from it, and expose that data the best way possible.  If you want to store it, index it, query it, transform it, collect statistics, etc... Solr is a terrific tool.  Amazingly so.
>>>>
>>>> That said, you also have another very good choice.  Take a look at Riak Search.  They hijacked many core elements of Solr, which I applaud, and is compatible with Solr's http interface.  In effect, you can point Nutch's solr-index job, instead, at a Riak Search node and put your data there.
>>>>
>>>> The other nice thing: Riak is a (self-described) "mini-hadoop."  So you can search across the Solr indexes, that it's built on top of, or you can throw MapReduce jobs at riak and perform some very detailed analytics.
>>>>
>>>> I don't know of a database that lacks a Java client, so the potential for indexing plugins is limitless... regardless of where the data is placed.
>>>>
>>>> Scott Gonyea
>>>>
>>>> On Oct 25, 2010, at 7:56 PM, xiao yang wrote:
>>>>
>>>>> Hi, guys,
>>>>>
>>>>> Nutch has its own data format for CrawlDB and LinkDB, which are
>>>>> difficult to manage and share among applications.
>>>>> Are there any web crawlers based on relational database?
>>>>> I can see that Nutch is trying to use HBase for storage, but why not
>>>>> use a relational database instead? We can use partitioning to solve
>>>>> scalability problem.
>>>>>
>>>>> Thanks!
>>>>> Xiao
>>>>
>>>>
>>>
>>
>

Re: Are there any web crawlers based on database?

Posted by Scott Gonyea <me...@sgonyea.com>.
Not that it's guaranteed to be of "next to no value" but really,
you've probably already lost pages just crawling them.  Server /
network errors, for example, take the integrity question and make it
a cost-benefit trade-off.  Do you recrawl a bunch?  At different times?
Different geographies?

Row locking is reasonably nice, but that begs other questions.  It can
easily be solved one of two ways:  Put your data in Solr, and persist
your efforts in both places:  Solr and an SQL backend.  If you're
using Riak (or Cassandra), you allow document collisions to exist and
reconcile them within your application.

It sounds complex, but it's actually quite trivial to implement.

Scott
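
A rough sketch of the "persist in both places" idea above, using SolrJ plus
plain JDBC; the Solr URL, field names, and table layout are assumptions, not
an existing Nutch facility.

import java.sql.Connection;
import java.sql.PreparedStatement;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class DualWriter {
    private final SolrServer solr;
    private final Connection sql;

    DualWriter(Connection sql) throws Exception {
        this.solr = new CommonsHttpSolrServer("http://localhost:8983/solr");
        this.sql = sql;
    }

    void save(String url, String content) throws Exception {
        // 1. Index into Solr for search, faceting and reporting.
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", url);
        doc.addField("content", content);
        solr.add(doc);

        // 2. Persist the same record in the SQL backend.
        PreparedStatement ps = sql.prepareStatement(
            "INSERT INTO pages (url, content) VALUES (?, ?)");
        ps.setString(1, url);
        ps.setString(2, content);
        ps.executeUpdate();
        ps.close();
    }
}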

On Tue, Oct 26, 2010 at 1:39 PM, Scott Gonyea <me...@sgonyea.com> wrote:
> I love relational databases, but their many features are (in my
> opinion) wasted on what you find in Nutch.  Row-locking and
> transactional integrity is great for lots of applications, but becomes
> a whole lot of overhead when it's of next-to-no-value to whatever
> you're doing.
>
> RE: counting URLs:  Have you looked at Solr's facets, etc?  I use them
> like they're going out of style--and it's very powerful.
>
> For my application, Solr *is* my database.  Nutch crawls data, stores
> it somewhere, then picks it back up and drops it in Solr.  all of my
> crawl data sits in Solr.  I actively report on stats from Solr, as
> well as make updates to the content that's stored.  Lots of fields /
> boolean attributes sit in the schema.
>
> As the user works through the app, their changes get pushed back into
> Solr.  Then when they next hit "Search," results disappear / move
> around as they had organized it.
>
> Scott
>
> On Tue, Oct 26, 2010 at 12:20 AM, xiao yang <ya...@gmail.com> wrote:
>> Hi, Scott,
>>
>> Thanks for your reply.
>> I'm curious about the reason why using database is awful.
>> Here is my requirement: we have two developers who want to do some
>> processing and analysis work on the crawled data. If the data is
>> stored in database, we can easily share our data, for the well-defined
>> data models. What's more, the analysis results can also be easily
>> stored back into the database by just adding a few fields.
>> For example, I need to know the average number of urls in one site. In
>> database, a single SQL will do. If I want to extract and store the
>> main part of web pages, I can't easily modify the data structure of
>> Nutch easily. Even in Solr, it's difficult and inefficient to iterate
>> through the data set.
>> The crawled data is structured, then why not using database?
>>
>> Thanks!
>> Xiao
>>
>> On Tue, Oct 26, 2010 at 11:54 AM, Scott Gonyea <me...@sgonyea.com> wrote:
>>> Use Solr?  At its core, Solr is a document database.  Using a relational database, to warehouse your crawl data, is generally an awful idea.  I'd go so far as to suggest that you're probably looking at things the wrong way. :)
>>>
>>> I liken crawl data to sludge.  Don't try to normalize it.  Know what you want to get from it, and expose that data the best way possible.  If you want to store it, index it, query it, transform it, collect statistics, etc... Solr is a terrific tool.  Amazingly so.
>>>
>>> That said, you also have another very good choice.  Take a look at Riak Search.  They hijacked many core elements of Solr, which I applaud, and is compatible with Solr's http interface.  In effect, you can point Nutch's solr-index job, instead, at a Riak Search node and put your data there.
>>>
>>> The other nice thing: Riak is a (self-described) "mini-hadoop."  So you can search across the Solr indexes, that it's built on top of, or you can throw MapReduce jobs at riak and perform some very detailed analytics.
>>>
>>> I don't know of a database that lacks a Java client, so the potential for indexing plugins is limitless... regardless of where the data is placed.
>>>
>>> Scott Gonyea
>>>
>>> On Oct 25, 2010, at 7:56 PM, xiao yang wrote:
>>>
>>>> Hi, guys,
>>>>
>>>> Nutch has its own data format for CrawlDB and LinkDB, which are
>>>> difficult to manage and share among applications.
>>>> Are there any web crawlers based on relational database?
>>>> I can see that Nutch is trying to use HBase for storage, but why not
>>>> use a relational database instead? We can use partitioning to solve
>>>> scalability problem.
>>>>
>>>> Thanks!
>>>> Xiao
>>>
>>>
>>
>

Re: Are there any web crawlers based on database?

Posted by Scott Gonyea <sc...@aitrus.org>.
Yep, GORA will be a huge boon.  One future problem to be dealt with is
going to be versioning many iterations of the same content.  My
application of Nutch is geared more towards compliance, and is less
interested in the analytics beyond high-level statistics and drilling
down into web content.

For my purposes, there will be a growing need to compare many
iterations of the same (or similar) content, over a given time period.
 More than that, retaining original data is very important (and is
something I already do, in Amazon's S3).

I'm really interested in GORA, and wish I had more time to really look
at it and contribute.  Such is life :(

Scott
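
One way to keep the per-fetch iterations Scott describes is to archive each
raw fetch in S3 under a key built from the URL and the fetch time, roughly as
below; the bucket name, key layout, and credentials handling are made up for
the example.

import java.io.ByteArrayInputStream;
import com.amazonaws.auth.BasicAWSCredentials;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3Client;
import com.amazonaws.services.s3.model.ObjectMetadata;

public class RawContentArchiver {
    private final AmazonS3 s3 = new AmazonS3Client(
        new BasicAWSCredentials("ACCESS_KEY", "SECRET_KEY"));

    // Every fetch of a URL becomes its own object, so later iterations of the
    // same page can be compared over a given time period.
    void archive(String url, long fetchTimeMillis, byte[] rawContent) {
        String key = "raw/" + Integer.toHexString(url.hashCode())
                   + "/" + fetchTimeMillis;
        ObjectMetadata meta = new ObjectMetadata();
        meta.setContentLength(rawContent.length);
        s3.putObject("my-crawl-archive", key,
                     new ByteArrayInputStream(rawContent), meta);
    }
}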

On Tue, Oct 26, 2010 at 2:15 PM, Andrzej Bialecki <ab...@getopt.org> wrote:
> On 2010-10-26 22:39, Scott Gonyea wrote:
>> I love relational databases, but their many features are (in my
>> opinion) wasted on what you find in Nutch.  Row-locking and
>> transactional integrity is great for lots of applications, but becomes
>> a whole lot of overhead when it's of next-to-no-value to whatever
>> you're doing.
>>
>> RE: counting URLs:  Have you looked at Solr's facets, etc?  I use them
>> like they're going out of style--and it's very powerful.
>>
>> For my application, Solr *is* my database.  Nutch crawls data, stores
>
> .. then you may be interested in the upcoming Gora feature:
> http://issues.apache.org/jira/browse/GORA-9 . When this is committed you
> will be able to keep all your data in Solr.
>
>
>
> --
> Best regards,
> Andrzej Bialecki     <><
>  ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
>
>

Re: Are there any web crawlers based on database?

Posted by Andrzej Bialecki <ab...@getopt.org>.
On 2010-10-26 22:39, Scott Gonyea wrote:
> I love relational databases, but their many features are (in my
> opinion) wasted on what you find in Nutch.  Row-locking and
> transactional integrity is great for lots of applications, but becomes
> a whole lot of overhead when it's of next-to-no-value to whatever
> you're doing.
> 
> RE: counting URLs:  Have you looked at Solr's facets, etc?  I use them
> like they're going out of style--and it's very powerful.
> 
> For my application, Solr *is* my database.  Nutch crawls data, stores

.. then you may be interested in the upcoming Gora feature:
http://issues.apache.org/jira/browse/GORA-9 . When this is committed you
will be able to keep all your data in Solr.



-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


Re: Are there any web crawlers based on database?

Posted by Scott Gonyea <me...@sgonyea.com>.
I love relational databases, but their many features are (in my
opinion) wasted on what you find in Nutch.  Row-locking and
transactional integrity are great for lots of applications, but become
a whole lot of overhead when they're of next-to-no value to whatever
you're doing.

RE: counting URLs:  Have you looked at Solr's facets, etc?  I use them
like they're going out of style--and it's very powerful.

For my application, Solr *is* my database.  Nutch crawls data, stores
it somewhere, then picks it back up and drops it in Solr.  All of my
crawl data sits in Solr.  I actively report on stats from Solr, as
well as make updates to the content that's stored.  Lots of fields /
boolean attributes sit in the schema.

As the user works through the app, their changes get pushed back into
Solr.  Then when they next hit "Search," results disappear / move
around as they had organized it.

Scott
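
The facet counting Scott mentions might look like the SolrJ sketch below; it
assumes a "host" field exists in the Solr schema.

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.FacetField;
import org.apache.solr.client.solrj.response.QueryResponse;

public class HostFacetReport {
    public static void main(String[] args) throws Exception {
        SolrServer solr = new CommonsHttpSolrServer("http://localhost:8983/solr");

        // Count crawled documents per host without fetching the documents.
        SolrQuery q = new SolrQuery("*:*");
        q.setRows(0);
        q.setFacet(true);
        q.addFacetField("host");   // assumes a "host" field in the schema

        QueryResponse rsp = solr.query(q);
        FacetField hosts = rsp.getFacetField("host");
        for (FacetField.Count c : hosts.getValues()) {
            System.out.println(c.getName() + " -> " + c.getCount());
        }
    }
}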

On Tue, Oct 26, 2010 at 12:20 AM, xiao yang <ya...@gmail.com> wrote:
> Hi, Scott,
>
> Thanks for your reply.
> I'm curious about the reason why using database is awful.
> Here is my requirement: we have two developers who want to do some
> processing and analysis work on the crawled data. If the data is
> stored in database, we can easily share our data, for the well-defined
> data models. What's more, the analysis results can also be easily
> stored back into the database by just adding a few fields.
> For example, I need to know the average number of urls in one site. In
> database, a single SQL will do. If I want to extract and store the
> main part of web pages, I can't easily modify the data structure of
> Nutch easily. Even in Solr, it's difficult and inefficient to iterate
> through the data set.
> The crawled data is structured, then why not using database?
>
> Thanks!
> Xiao
>
> On Tue, Oct 26, 2010 at 11:54 AM, Scott Gonyea <me...@sgonyea.com> wrote:
>> Use Solr?  At its core, Solr is a document database.  Using a relational database, to warehouse your crawl data, is generally an awful idea.  I'd go so far as to suggest that you're probably looking at things the wrong way. :)
>>
>> I liken crawl data to sludge.  Don't try to normalize it.  Know what you want to get from it, and expose that data the best way possible.  If you want to store it, index it, query it, transform it, collect statistics, etc... Solr is a terrific tool.  Amazingly so.
>>
>> That said, you also have another very good choice.  Take a look at Riak Search.  They hijacked many core elements of Solr, which I applaud, and is compatible with Solr's http interface.  In effect, you can point Nutch's solr-index job, instead, at a Riak Search node and put your data there.
>>
>> The other nice thing: Riak is a (self-described) "mini-hadoop."  So you can search across the Solr indexes, that it's built on top of, or you can throw MapReduce jobs at riak and perform some very detailed analytics.
>>
>> I don't know of a database that lacks a Java client, so the potential for indexing plugins is limitless... regardless of where the data is placed.
>>
>> Scott Gonyea
>>
>> On Oct 25, 2010, at 7:56 PM, xiao yang wrote:
>>
>>> Hi, guys,
>>>
>>> Nutch has its own data format for CrawlDB and LinkDB, which are
>>> difficult to manage and share among applications.
>>> Are there any web crawlers based on relational database?
>>> I can see that Nutch is trying to use HBase for storage, but why not
>>> use a relational database instead? We can use partitioning to solve
>>> scalability problem.
>>>
>>> Thanks!
>>> Xiao
>>
>>
>

Re: Are there any web crawlers based on database?

Posted by xiao yang <ya...@gmail.com>.
I've heard some rumors about Facebook abandoning Cassandra. Don't know
whether it's true.

On Tue, Oct 26, 2010 at 8:15 PM, Markus Jelsma
<ma...@openindex.io> wrote:
> They also use the column oriented database Apache Cassandra which is also
> supported as storage backend by Gora. I can, much like Apache CouchDB, scale
> very well.
>
> http://cassandra.apache.org/
> http://couchdb.apache.org/
>
> On Tuesday 26 October 2010 14:11:12 Andrzej Bialecki wrote:
>> On 2010-10-26 14:02, xiao yang wrote:
>> > Hi, Andrzej
>> >
>> > Great, I'll definitely try Nutch 2.0!
>> > As far as I know, Facebook is still using MySQL for storage. I believe
>> > the its data scale will exceed 100K. Do you have any clues how they
>> > solve the problem?
>>
>> By partitioning the database. There was a talk on the Facebook
>> architecture at the Lucene Revolution conference, and I remember that
>> there was a diagram that described their design - see
>> lucenerevolution.com for more details.
>
> --
> Markus Jelsma - CTO - Openindex
> http://www.linkedin.com/in/markus17
> 050-8536600 / 06-50258350
>

Re: Are there any web crawlers based on database?

Posted by Markus Jelsma <ma...@openindex.io>.
They also use the column-oriented database Apache Cassandra, which is also
supported as a storage backend by Gora. It can, much like Apache CouchDB, scale
very well.

http://cassandra.apache.org/
http://couchdb.apache.org/

On Tuesday 26 October 2010 14:11:12 Andrzej Bialecki wrote:
> On 2010-10-26 14:02, xiao yang wrote:
> > Hi, Andrzej
> > 
> > Great, I'll definitely try Nutch 2.0!
> > As far as I know, Facebook is still using MySQL for storage. I believe
> > the its data scale will exceed 100K. Do you have any clues how they
> > solve the problem?
> 
> By partitioning the database. There was a talk on the Facebook
> architecture at the Lucene Revolution conference, and I remember that
> there was a diagram that described their design - see
> lucenerevolution.com for more details.

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536600 / 06-50258350

Re: Are there any web crawlers based on database?

Posted by Andrzej Bialecki <ab...@getopt.org>.
On 2010-10-26 14:02, xiao yang wrote:
> Hi, Andrzej
> 
> Great, I'll definitely try Nutch 2.0!
> As far as I know, Facebook is still using MySQL for storage. I believe
> the its data scale will exceed 100K. Do you have any clues how they
> solve the problem?

By partitioning the database. There was a talk on the Facebook
architecture at the Lucene Revolution conference, and I remember that
there was a diagram that described their design - see
lucenerevolution.com for more details.
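
The basic idea behind that kind of partitioning is simple enough to sketch:
route each record to one of N databases using a stable hash of some key. The
key (the page's host name) and the shard count below are only assumptions for
illustration, not Facebook's actual design.

import java.net.URL;

public class ShardPicker {
  private static final int NUM_SHARDS = 16;   // assumed number of partitions

  // Map a page URL to a shard by hashing its host, so all pages of one
  // site land in the same partition.
  public static int shardFor(String pageUrl) throws Exception {
    String host = new URL(pageUrl).getHost();
    return (host.hashCode() & 0x7fffffff) % NUM_SHARDS;
  }

  public static void main(String[] args) throws Exception {
    System.out.println(shardFor("http://www.example.com/some/page"));
  }
}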


-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


Re: Are there any web crawlers based on database?

Posted by xiao yang <ya...@gmail.com>.
Hi, Andrzej

Great, I'll definitely try Nutch 2.0!
As far as I know, Facebook is still using MySQL for storage. I believe
its data scale well exceeds 100k pages. Do you have any clues about how they
solve the problem?

Thanks!
Xiao

On Tue, Oct 26, 2010 at 5:30 PM, Andrzej Bialecki <ab...@getopt.org> wrote:
> On 2010-10-26 09:20, xiao yang wrote:
>> Hi, Scott,
>>
>> Thanks for your reply.
>> I'm curious about the reason why using database is awful.
>
> I only partially agree with Scott's statement - it really depends on the
> scale. For a volume below 100k pages I think a DB could be a good
> storage platform. But as the volume of data grows, the cost of updates
> in a relational DB grows disproportionately high. As you reach a volume
> of tens of millions of documents, a relational database storage will
> have very poor performance or extremely high cost.
>
>
>> Here is my requirement: we have two developers who want to do some
>> processing and analysis work on the crawled data. If the data is
>> stored in database, we can easily share our data, for the well-defined
>> data models. What's more, the analysis results can also be easily
>> stored back into the database by just adding a few fields.
>> For example, I need to know the average number of urls in one site. In
>> database, a single SQL will do. If I want to extract and store the
>> main part of web pages, I can't easily modify the data structure of
>> Nutch easily. Even in Solr, it's difficult and inefficient to iterate
>> through the data set.
>> The crawled data is structured, then why not using database?
>
> What you want is Nutch 2.0 :) where the storage layer uses Gora (an
> abstraction for key-value stores), and one of the supported backend
> types is an SQL database. Your use case - to be able to use existing
> standard tools for DBs or other data warehousing platforms - was one of
> the motivations to redesign Nutch this way.
>
> Please check out Nutch trunk, and configure it to use an SQL backend.
> Currently only MySQL and HSQLDB databases are supported (and HBase), but
> it's not that hard to add support for other database types.
>
> --
> Best regards,
> Andrzej Bialecki     <><
>  ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
>
>

Re: Are there any web crawlers based on database?

Posted by Peter Burden <pe...@gmail.com>.
On 26 October 2010 10:30, Andrzej Bialecki <ab...@getopt.org> wrote:
> On 2010-10-26 09:20, xiao yang wrote:
>> Hi, Scott,
>>
>> Thanks for your reply.
>> I'm curious about the reason why using database is awful.
>
> I only partially agree with Scott's statement - it really depends on the
> scale. For a volume below 100k pages I think a DB could be a good
> storage platform. But as the volume of data grows, the cost of updates
> in a relational DB grows disproportionately high. As you reach a volume
> of tens of millions of documents, a relational database storage will
> have very poor performance or extremely high cost.

I wrote such a system a few years ago and this was exactly what I discovered.
Non-locality of database reads and writes proved an insurmountable bottleneck
beyond about 10 million pages. [The crawl slowed to 2-3 pages/second with
everything on a single PC; using a separate machine as the DB server didn't seem
to help.] I might have got further if I hadn't included a table that recorded
every inter-page link!

But it was really nice to be able to make arbitrary queries of the page
collection and its structure, although some queries could be horrendously slow.

I was using MySQL, BTW. I have seen some hints that the latest version is much
faster, so I might have another go.

Re: Are there any web crawlers based on database?

Posted by Andrzej Bialecki <ab...@getopt.org>.
On 2010-10-26 09:20, xiao yang wrote:
> Hi, Scott,
> 
> Thanks for your reply.
> I'm curious about the reason why using database is awful.

I only partially agree with Scott's statement - it really depends on the
scale. For a volume below 100k pages I think a DB could be a good
storage platform. But as the volume of data grows, the cost of updates
in a relational DB grows disproportionately high. As you reach a volume
of tens of millions of documents, relational database storage will either
perform very poorly or come at an extremely high cost.


> Here is my requirement: we have two developers who want to do some
> processing and analysis work on the crawled data. If the data is
> stored in database, we can easily share our data, for the well-defined
> data models. What's more, the analysis results can also be easily
> stored back into the database by just adding a few fields.
> For example, I need to know the average number of urls in one site. In
> database, a single SQL will do. If I want to extract and store the
> main part of web pages, I can't easily modify the data structure of
> Nutch easily. Even in Solr, it's difficult and inefficient to iterate
> through the data set.
> The crawled data is structured, then why not using database?

What you want is Nutch 2.0 :) where the storage layer uses Gora (an
abstraction for key-value stores), and one of the supported backend
types is an SQL database. Your use case - to be able to use existing
standard tools for DBs or other data warehousing platforms - was one of
the motivations to redesign Nutch this way.

Please check out Nutch trunk, and configure it to use an SQL backend.
Currently only MySQL and HSQLDB databases are supported (and HBase), but
it's not that hard to add support for other database types.
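
Once the SQL backend is configured, the "existing standard tools" part really
does mean plain JDBC. For instance, a quick sanity check that lists whatever
tables Gora created -- the MySQL URL and credentials below are placeholders,
not anything Nutch configures for you:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;

public class ListCrawlTables {
  public static void main(String[] args) throws Exception {
    // Placeholder connection details -- point this at the database you
    // configured as the Gora backend.
    Connection conn = DriverManager.getConnection(
        "jdbc:mysql://localhost:3306/nutch", "nutch", "secret");
    ResultSet tables = conn.getMetaData()
        .getTables(null, null, "%", new String[] { "TABLE" });
    while (tables.next()) {
      System.out.println(tables.getString("TABLE_NAME"));
    }
    conn.close();
  }
}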

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


Re: Are there any web crawlers based on database?

Posted by xiao yang <ya...@gmail.com>.
Hi, Scott,

Thanks for your reply.
I'm curious about the reason why using a database is awful.
Here is my requirement: we have two developers who want to do some
processing and analysis work on the crawled data. If the data is
stored in a database, we can easily share our data thanks to the
well-defined data models. What's more, the analysis results can also be
easily stored back into the database by just adding a few fields.
For example, I need to know the average number of URLs per site. In a
database, a single SQL query will do. If I want to extract and store the
main part of web pages, I can't easily modify Nutch's data
structures. Even in Solr, it's difficult and inefficient to iterate
through the data set.
The crawled data is structured, so why not use a database?
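
To be concrete, this is the kind of one-line query I have in mind, here wrapped
in plain JDBC. The pages table and its site column are made up for illustration;
whatever schema we end up sharing would work the same way:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class AvgUrlsPerSite {
  public static void main(String[] args) throws Exception {
    // Placeholder connection details.
    Connection conn = DriverManager.getConnection(
        "jdbc:mysql://localhost:3306/crawl", "crawler", "secret");
    Statement st = conn.createStatement();
    // Average number of URLs per site: count URLs per site, then average.
    ResultSet rs = st.executeQuery(
        "SELECT AVG(cnt) FROM (SELECT COUNT(*) AS cnt FROM pages GROUP BY site) t");
    if (rs.next()) {
      System.out.println("average urls per site: " + rs.getDouble(1));
    }
    conn.close();
  }
}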

Thanks!
Xiao

On Tue, Oct 26, 2010 at 11:54 AM, Scott Gonyea <me...@sgonyea.com> wrote:
> Use Solr?  At its core, Solr is a document database.  Using a relational database, to warehouse your crawl data, is generally an awful idea.  I'd go so far as to suggest that you're probably looking at things the wrong way. :)
>
> I liken crawl data to sludge.  Don't try to normalize it.  Know what you want to get from it, and expose that data the best way possible.  If you want to store it, index it, query it, transform it, collect statistics, etc... Solr is a terrific tool.  Amazingly so.
>
> That said, you also have another very good choice.  Take a look at Riak Search.  They hijacked many core elements of Solr, which I applaud, and is compatible with Solr's http interface.  In effect, you can point Nutch's solr-index job, instead, at a Riak Search node and put your data there.
>
> The other nice thing: Riak is a (self-described) "mini-hadoop."  So you can search across the Solr indexes, that it's built on top of, or you can throw MapReduce jobs at riak and perform some very detailed analytics.
>
> I don't know of a database that lacks a Java client, so the potential for indexing plugins is limitless... regardless of where the data is placed.
>
> Scott Gonyea
>
> On Oct 25, 2010, at 7:56 PM, xiao yang wrote:
>
>> Hi, guys,
>>
>> Nutch has its own data format for CrawlDB and LinkDB, which are
>> difficult to manage and share among applications.
>> Are there any web crawlers based on relational database?
>> I can see that Nutch is trying to use HBase for storage, but why not
>> use a relational database instead? We can use partitioning to solve
>> scalability problem.
>>
>> Thanks!
>> Xiao
>
>

Re: Are there any web crawlers based on database?

Posted by Scott Gonyea <me...@sgonyea.com>.
Use Solr?  At its core, Solr is a document database.  Using a relational database to warehouse your crawl data is generally an awful idea.  I'd go so far as to suggest that you're probably looking at things the wrong way. :)

I liken crawl data to sludge.  Don't try to normalize it.  Know what you want to get from it, and expose that data the best way possible.  If you want to store it, index it, query it, transform it, collect statistics, etc... Solr is a terrific tool.  Amazingly so.

That said, you also have another very good choice.  Take a look at Riak Search.  They hijacked many core elements of Solr, which I applaud, and it is compatible with Solr's HTTP interface.  In effect, you can point Nutch's solr-index job at a Riak Search node instead and put your data there.

The other nice thing: Riak is a (self-described) "mini-hadoop."  So you can search across the Solr indexes that it's built on top of, or you can throw MapReduce jobs at Riak and perform some very detailed analytics.

I don't know of a database that lacks a Java client, so the potential for indexing plugins is limitless... regardless of where the data is placed.

Scott Gonyea

On Oct 25, 2010, at 7:56 PM, xiao yang wrote:

> Hi, guys,
> 
> Nutch has its own data format for CrawlDB and LinkDB, which are
> difficult to manage and share among applications.
> Are there any web crawlers based on relational database?
> I can see that Nutch is trying to use HBase for storage, but why not
> use a relational database instead? We can use partitioning to solve
> scalability problem.
> 
> Thanks!
> Xiao

