Posted to user@nutch.apache.org by Pedro Bezunartea López <pe...@bezunartea.net> on 2010/02/21 23:23:56 UTC

Content storage, results highlighting

Hi,

I've developed a web application with Lucene that searches web pages using a
Nutch-generated index. I'd like to highlight the query terms when
showing the results, and I understand that the content of the pages needs to
be stored, as well as indexed.
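
To illustrate why the stored text matters: a highlighter marks the query terms
inside the original page text, so that text has to be retrievable at search
time. Below is a minimal, library-free sketch of the idea. It is not Lucene's
Highlighter API; the class name HighlightSketch, the method highlight and the
sample text are all hypothetical and only for illustration.

```java
import java.util.regex.Pattern;

public class HighlightSketch {

    // Wraps every case-insensitive, whole-word occurrence of `term`
    // in <b> tags. `storedText` stands in for the stored page content.
    static String highlight(String storedText, String term) {
        Pattern p = Pattern.compile("\\b" + Pattern.quote(term) + "\\b",
                Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);
        // "$0" refers to the whole match, so the original casing is kept.
        return p.matcher(storedText).replaceAll("<b>$0</b>");
    }

    public static void main(String[] args) {
        // Hypothetical stored content of one search hit.
        String stored = "Nutch crawls the web. A nutch index can be searched.";
        System.out.println(highlight(stored, "nutch"));
    }
}
```

Without the stored text there is nothing to pass in as storedText, which is
why an index-only "content" field is not enough for this kind of highlighting.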

This is what I've tried so far:
1.- In the file conf/nutch-site.xml, I changed the value of
"file.content.ignored" to false.
2.- In the file conf/schema.xml I modified the line:
 <field name="content" type="text" stored="false" indexed="true"/>
to
 <field name="content" type="text" stored="true" indexed="true"/>
3.- In the source file
src/plugin/index-basic/src/java/org/apache/nutch/indexer/basic/BasicIndexingFilter.java,
I changed line 116 to:
 LuceneWriter.addFieldOptions("content", LuceneWriter.STORE.YES,
        LuceneWriter.INDEX.TOKENIZED, conf);

I tried running the command "bin/nutch crawl urls -dir crawl -depth 10 -topN
5000" after the first two steps, but the crawl didn't store the contents. I
then tried the third step, recompiled Nutch, and ran the crawl command again,
to no avail.

What am I missing? Any hints, please?

TIA,

Pedro.

Re: Content storage, results highlighting

Posted by Pedro Bezunartea López <pe...@bezunartea.net>.

Hi Sami,

> The schema.xml file there is usable only when using Solr as the search
> server. Are you using Solr?
>

Not yet! Thanks for clarifying it. Cheers,

Pedro.



Re: Content storage, results highlighting

Posted by Sami Siren <ss...@gmail.com>.

The schema.xml file there is usable only when using Solr as the search 
server. Are you using Solr?

--
  Sami Siren
