
Posted to user@nutch.apache.org by Gilbert Groenendijk <gi...@gmail.com> on 2007/01/27 19:34:16 UTC

Nutch content with Lucene search

Hello,

Today I created a simple index with Nutch from the command line. After that I
copied the index to the machine where I want to use it in a Lucene
environment, without Nutch. Fetching the URL and title works pretty well, but
how can I get the content? If I take a look in Luke, the content field is not
stored or tokenized, but in nutch-default.xml and nutch-site.xml I have
defined:

<property>
  <name>fetcher.store.content</name>
  <value>true</value>
  <description>If true, fetcher will store content.</description>
</property>

It doesn't seem to work. Any ideas?


-- 
Gilbert Groenendijk
__________________________________________________

Re: Nutch content with Lucene search

Posted by Brian Whitman <br...@variogr.am>.


I'm pretty sure that just means to store content in the WebDB, not  
the Lucene index. The stored content in the WebDB is used for the  
cache and the search summary. The WebDB cannot be directly read by  
Lucene. You can write Java apps to work with the WebDB API, fetching  
the content per page as needed. Or, you could use the OpenSearch  
servlet to pull out the summaries and cache per URI.

Re: Nutch content with Lucene search

Posted by Gilbert Groenendijk <gi...@gmail.com>.

Thank you, it worked! This saved me a lot of time!



Re: Nutch content with Lucene search

Posted by Enis Soztutar <en...@gmail.com>.

Just change line 72 of BasicIndexingFilter in the index-basic plugin from

doc.add(new Field("content", parse.getText(), Field.Store.NO, Field.Index.TOKENIZED));

to

doc.add(new Field("content", parse.getText(), Field.Store.YES, Field.Index.TOKENIZED));

and you are done. You will have to rebuild the plugin and reindex for the
change to take effect. But remember that you do not need to store the content
to search it.
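
To illustrate that last point, here is a minimal sketch in plain Java. It is a toy model, not Lucene or Nutch code, and the class and method names (ToyIndex, add, search, get) are made up for this example. It only assumes that "indexed" means the terms go into a lookup structure and "stored" means the original text is kept alongside it.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Toy model of "indexed" versus "stored" text; hypothetical, not Lucene code. */
final class ToyIndex {
    // term -> ids of the documents that contain it (the "indexed" part)
    private final Map<String, Set<Integer>> postings = new HashMap<String, Set<Integer>>();
    // one slot per document: the original text if stored, otherwise null
    private final List<String> storedText = new ArrayList<String>();

    /** Indexes the text and keeps the original only when store is true. Returns the doc id. */
    int add(String text, boolean store) {
        int docId = storedText.size();
        storedText.add(store ? text : null);
        for (String term : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (term.isEmpty()) continue;
            Set<Integer> docs = postings.get(term);
            if (docs == null) {
                docs = new TreeSet<Integer>();
                postings.put(term, docs);
            }
            docs.add(docId);
        }
        return docId;
    }

    /** Searching only consults the postings, so it also finds unstored documents. */
    Set<Integer> search(String term) {
        Set<Integer> docs = postings.get(term.toLowerCase(Locale.ROOT));
        return docs == null ? new TreeSet<Integer>() : new TreeSet<Integer>(docs);
    }

    /** Returns the original text, or null if the document was added with store == false. */
    String get(int docId) {
        return storedText.get(docId);
    }

    public static void main(String[] args) {
        ToyIndex index = new ToyIndex();
        int unstored = index.add("Nutch crawls the web", false);
        int stored = index.add("Lucene searches the index", true);
        System.out.println(index.search("the"));  // both documents match
        System.out.println(index.get(unstored));  // null: indexed but not stored
        System.out.println(index.get(stored));    // the original text
    }
}
```

Both documents are searchable, but only the one added with store set to true can hand its text back. That is roughly the trade-off behind Field.Store.NO versus Field.Store.YES: storing makes the index bigger, which is presumably why the content field is left unstored by default.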

Re: Nutch content with Lucene search

Posted by Gilbert Groenendijk <gi...@gmail.com>.

Thank you (and Brian) for your answers. I noticed this too, but I want to get
the content through the Java API with Lucene 2.0. If that is impossible, I
will have to write some extensions for my current code, but I would rather
not. I guess the problem is that the field is unstored. Is there any config
property available for that?


RE: Nutch content with Lucene search

Posted by Gal Nitzan <gn...@usa.net>.

1. Open your index in Luke.

2. Click on the Documents tab.

3. Click on the next arrow to move to the first document.

4. Then click on the Reconstruct button.

You should see the content field data in the right pane.

HTH
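
As far as I understand it, the reconstruct step works for an unstored field because the index still records which terms occur at which positions. The following is a rough sketch of that idea in plain Java. It is not Luke's actual code, and the names (ToyReconstruct, index, reconstruct) are invented for this example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

/** Toy positional index for a single document; hypothetical, not Luke or Lucene code. */
final class ToyReconstruct {
    /** Maps each lowercased term to the word positions at which it occurs. */
    static Map<String, List<Integer>> index(String text) {
        Map<String, List<Integer>> positions = new HashMap<String, List<Integer>>();
        int pos = 0;
        for (String token : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (token.isEmpty()) continue;
            List<Integer> list = positions.get(token);
            if (list == null) {
                list = new ArrayList<Integer>();
                positions.put(token, list);
            }
            list.add(pos++);
        }
        return positions;
    }

    /** Rebuilds the token sequence by ordering every (position, term) pair by position. */
    static String reconstruct(Map<String, List<Integer>> positions) {
        TreeMap<Integer, String> byPosition = new TreeMap<Integer, String>();
        for (Map.Entry<String, List<Integer>> entry : positions.entrySet()) {
            for (Integer p : entry.getValue()) {
                byPosition.put(p, entry.getKey());
            }
        }
        StringBuilder sb = new StringBuilder();
        for (String term : byPosition.values()) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(term);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(reconstruct(index("Nutch crawls, Lucene searches!")));
        // prints: nutch crawls lucene searches
    }
}
```

Note that the result is the analyzed token stream, not the original text: case and punctuation are gone. So reconstruction is fine for inspecting an index but is not a substitute for storing the field.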
