Posted to user@nutch.apache.org by lewis john mcgibbney <le...@gmail.com> on 2011/08/27 22:19:24 UTC

Trying to complete index structure wiki page

Hi,

As the title suggests, I'm in the process of getting some comprehensive
documentation sorted out for Nutch, and this obviously starts at wiki level. I'm
currently working on the IndexStructure page [1]. I would appreciate it if some
of you could have a quick look and correct it where you see fit.

In addition, I have a couple of quick questions regarding the last 4 fields
I'm trying to account for:

1) BOOST - As far as I am aware this was deprecated in Nutch 1.2 or Nutch
1.1... correct/wrong?
2) DIGEST - Don't have a clue
3) SEGMENT - as 2
4) TIMESTAMP - as 2

It would be great if people could fill me in on the grey areas, please.

Finally, what a job all contributors, devs and committers did cleaning up
the plugin directory, even between the Nutch 1.2 and 1.3 releases. It's not until
you see previous versions in SVN that you can fully appreciate the excellent job
that has been done with the 1.3 release.  :0)

[1] http://wiki.apache.org/nutch/IndexStructure

-- 
*Lewis*

Re: Trying to complete index structure wiki page

Posted by lewis john mcgibbney <le...@gmail.com>.
Excellent, Markus.

We are slowly but surely getting nearer... :0)

On Tue, Aug 30, 2011 at 1:14 AM, Markus Jelsma
<ma...@openindex.io> wrote:

>
> > Hi,
> >
> > As the title suggests, I'm in the process of getting some comprehensive
> > documentation sorted out for Nutch, this obviously starts at wiki level.
> > I'm currently working on the IndexStructure page [1]. I would appreciate
> > if some guys could have a quick look and correct where they see fit.
> >
> > In addition I have a couple of quick questions regarding the last 4 fields
> > I'm trying to account for
> >
> > 1) BOOST - As far as I am aware this was deprecated in Nutch 1.2 or Nutch
> > 1.1... correct/wrong?
>
> This would be value of the scoring filter, OPIC or LinkRank or some custom
> made scoring.
>
> > 2) DIGEST - Don't have a clue
>
> The digest of the document. Can be MD5 over content and headers or more
> sophisticated text profile of the content.
>
> > 3) SEGMENT - as 2
>
> The originating segment of the document, used to identify the most recent
> segment in which you can find this document. In older Nutch version this was
> also used (IIRC) to load a `cached` version of the document.
>
> > 4) TIMESTAMP - as 2
>
> Most recent fetch time.
>
> >
> > Would be great if people could fill me in with the grey areas please.
> >
> > Finally, what a job all contributors, dev's and committers made cleaning up
> > plugin directory even between Nutch 1.2 and 1.3 release. It's not until you
> > see previous versions on SVN that you can fully appreciate the excellent
> > job that has been made with 1.3 release.  :0)
> >
> > [1] http://wiki.apache.org/nutch/IndexStructure
>



-- 
*Lewis*

Re: Trying to complete index structure wiki page

Posted by Markus Jelsma <ma...@openindex.io>.
> Hi,
> 
> As the title suggests, I'm in the process of getting some comprehensive
> documentation sorted out for Nutch, this obviously starts at wiki level.
> I'm currently working on the IndexStructure page [1]. I would appreciate
> if some guys could have a quick look and correct where they see fit.
> 
> In addition I have a couple of quick questions regarding the last 4 fields
> I'm trying to account for
> 
> 1) BOOST - As far as I am aware this was deprecated in Nutch 1.2 or Nutch
> 1.1... correct/wrong?

This would be the value produced by the scoring filter: OPIC, LinkRank, or some
custom-made scoring.

> 2) DIGEST - Don't have a clue

The digest of the document. It can be an MD5 over content and headers or a more
sophisticated text profile of the content.
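
To make the MD5 case concrete, here is a minimal sketch in Python. It is not
Nutch's own signature code; the function name md5_digest and the sample pages
are made up for illustration, and it hashes only the raw content bytes. The
point is simply that identical content yields an identical digest, while any
change to the content yields a different one.

```python
import hashlib


def md5_digest(content: bytes) -> str:
    """Return the hex MD5 digest of raw page content (illustrative helper)."""
    return hashlib.md5(content).hexdigest()


# Hypothetical page bodies: a and b are byte-for-byte identical, c differs.
page_a = b"<html><body>Hello Nutch</body></html>"
page_b = b"<html><body>Hello Nutch</body></html>"
page_c = b"<html><body>Hello Nutch!</body></html>"

# Identical content gives the same digest; changed content gives another one.
print(md5_digest(page_a) == md5_digest(page_b))
print(md5_digest(page_a) == md5_digest(page_c))
```

A text-profile style digest would instead be computed over a normalised
summary of the text, so that near-identical pages can share a digest.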

> 3) SEGMENT - as 2

The originating segment of the document, used to identify the most recent
segment in which you can find this document. In older Nutch versions this was
also used (IIRC) to load a `cached` version of the document.

> 4) TIMESTAMP - as 2

Most recent fetch time.
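
As a rough illustration of how these four fields sit together on one indexed
document, here is a small Python sketch. Everything in it is an assumption
made for the example: the IndexedDoc class, the field spellings, the
timestamp-like segment names and the millisecond unit for tstamp are not taken
from the Nutch source. It shows one possible use of the fields, keeping only
the most recently fetched document for each digest; it is not a description of
Nutch's actual deduplication logic.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class IndexedDoc:
    url: str
    boost: float   # value produced by the scoring filter
    digest: str    # digest of the document content
    segment: str   # originating segment (name format assumed)
    tstamp: int    # most recent fetch time (epoch millis assumed)


def latest_per_digest(docs: List[IndexedDoc]) -> List[IndexedDoc]:
    """For each digest, keep the document with the most recent tstamp."""
    best: Dict[str, IndexedDoc] = {}
    for doc in docs:
        kept = best.get(doc.digest)
        if kept is None or doc.tstamp > kept.tstamp:
            best[doc.digest] = doc
    return list(best.values())
```

With two documents sharing a digest, only the one with the larger tstamp
survives, and its segment field says where to find it.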

> 
> Would be great if people could fill me in with the grey areas please.
> 
> Finally, what a job all contributors, dev's and committers made cleaning up
> plugin directory even between Nutch 1.2 and 1.3 release. It's not until you
> see previous versions on SVN that you can fully appreciate the excellent
> job that has been made with 1.3 release.  :0)
> 
> [1] http://wiki.apache.org/nutch/IndexStructure