Posted to user@nutch.apache.org by Gabriele Kahlout <ga...@mysimpatico.com> on 2011/07/16 02:00:14 UTC

Isn't there redundant/wasteful duplication between nutch crawldb and solr index?

Hello,

I had this draft lurking for a while now, and before archiving for personal reference I wondered if it's accurate, and if you recommend posting it to the wiki.

Nutch maintains a crawldb (and linkdb, for that matter) of the URLs it crawled, the fetch status, and the date. This data is maintained beyond the fetch so that pages may be re-crawled after a re-crawling period.
At the same time Solr maintains an inverted index of all the fetched pages.
It would seem more efficient if Nutch relied on the index instead of maintaining its own crawldb, to avoid storing the same URL twice.
[BUT THAT'S JUST A KEY/ID, NOT WASTE AT ALL, WOULD ALSO END UP THE SAME IN SOLR]

-- 
Regards,
K. Gabriele

--- unchanged since 20/9/10 ---
P.S. If the subject contains "[LON]" or the addressee acknowledges the receipt within 48 hours then I don't resend the email.
subject(this) ∈ L(LON*) ∨ ∃x. (x ∈ MyInbox ∧ Acknowledges(x, this) ∧ time(x) < Now + 48h) ⇒ ¬resend(I, this).

If an email is sent by a sender that is not a trusted contact or the email does not contain a valid code then the email is not received. A valid code starts with a hyphen and ends with "X".
∀x. x ∈ MyInbox ⇒ from(x) ∈ MySafeSenderList ∨ (∃y. y ∈ subject(x) ∧ y ∈ L(-[a-z]+[0-9]X)).

Re: Isn't there redundant/wasteful duplication between nutch crawldb and solr index?

Posted by Julien Nioche <li...@gmail.com>.

Gabriele

What you are describing could be done with Nutch 2.0 by adding a Solr
backend to Gora. Solr would be used to store the webtable and, provided that
you set up the schema accordingly, you could index the appropriate fields for
searching. I think there were plans to add Solr as a Gora backend. I think
Nutch 2.0 would be a natural fit for what you are describing, more than 1.x
IMHO.

HTH

Julien



-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com

Re: Isn't there redundant/wasteful duplication between nutch crawldb and solr index?

Posted by Markus Jelsma <ma...@openindex.io>.

Because Nutch is a crawler intended to write to more than one search engine.
Besides, the crawldb is gone, as a flat file, in trunk. Also, Solr is really
slow when it comes to updating millions of records; the crawldb isn't when
split over multiple machines.

Re: Isn't there redundant/wasteful duplication between nutch crawldb and solr index?

Posted by lewis john mcgibbney <le...@gmail.com>.

Please feel free to add this to the wiki as it is a question that will
undoubtedly arise in the future.

Lewis

-- 
*Lewis*

Re: Isn't there redundant/wasteful duplication between nutch crawldb and solr index?

Posted by Gabriele Kahlout <ga...@mysimpatico.com>.

On Sat, Jul 16, 2011 at 1:29 PM, lewis john mcgibbney <
lewis.mcgibbney@gmail.com> wrote:

> Hi Gabriele,
>
> At first this seems like a plausible argument,


Indeed, I think it could be a FAQ. Shall I add it to the Nutch wiki?


> however my question concerns
> what Nutch would do if we wished to change the Solr core to index to?
>
> If we removed this functionality from the crawldb there would be no way to
> determine what Nutch was to fetch and what it wasn't.
>

Indeed, you confirm my thought.

-- 
Regards,
K. Gabriele

Re: Isn't there redundant/wasteful duplication between nutch crawldb and solr index?

Posted by lewis john mcgibbney <le...@gmail.com>.

Hi Gabriele,

At first this seems like a plausible argument; however, my question concerns
what Nutch would do if we wished to change the Solr core to index to.

If we removed this functionality from the crawldb there would be no way to
determine what Nutch was to fetch and what it wasn't.
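
To make the point above concrete, here is a minimal sketch, assuming a toy
in-memory model rather than Nutch's real data structures. The names
CrawlEntry and urls_to_fetch and the status strings are hypothetical and only
for illustration. It shows that the crawldb also holds URLs that have never
been fetched, which a search index built from fetched pages cannot know about.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class CrawlEntry:
    """Hypothetical, simplified stand-in for a crawldb record."""
    status: str           # "unfetched", "fetched" or "gone" (illustrative values)
    fetch_time: datetime  # when the URL is next due to be fetched


def urls_to_fetch(crawldb, now):
    """Return the URLs that are due for a (re-)fetch at time `now`."""
    return sorted(
        url
        for url, entry in crawldb.items()
        if entry.status != "gone" and entry.fetch_time <= now
    )


now = datetime(2011, 7, 16)
crawldb = {
    "http://example.com/a": CrawlEntry("fetched", now + timedelta(days=30)),
    "http://example.com/b": CrawlEntry("unfetched", now),
    "http://example.com/c": CrawlEntry("fetched", now - timedelta(days=1)),
    "http://example.com/d": CrawlEntry("gone", now),
}

# A search index only ever contains pages that were fetched and indexed.
index = {url for url, entry in crawldb.items() if entry.status == "fetched"}

due = urls_to_fetch(crawldb, now)
# URLs that are due but that the index alone could never have told us about.
not_in_index = [url for url in due if url not in index]
```

In this toy model, pointing the indexer at a different Solr core would leave
`index` empty, while the fetch schedule in `crawldb` would be unaffected.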

-- 
*Lewis*