Posted to user@nutch.apache.org by Kelvin <ks...@yahoo.com.sg> on 2011/07/20 05:41:43 UTC

How to get the original html file that is crawled by Nutch?

Dear all,

I have used both Nutch 1.2 and 1.3. Both work fine for crawling and indexing. When I search using some keywords, it returns results showing snippets of the HTML pages that contain the keywords. Is there a way to retrieve or access the full original HTML pages that contain the keywords?

Thank you for your help.

Re: How to get the original html file that is crawled by Nutch?

Posted by Julien Nioche <li...@gmail.com>.

The original content (e.g. HTML) is not sent for indexing; what is sent is the
content field (the extracted text). What you are describing would store that text
and should be sufficient for generating snippets in Solr.

On 20 July 2011 11:47, Chris Alexander <ch...@kusiri.com> wrote:

> One way I have seen this working is to edit the schema.xml file
> {SOLR_HOME}/conf/schema.xml. Modify the field with name "content" to have
> its "stored" parameter set to "true". Something like this:
>
> <field name="content" type="text" stored="true" .....
>
> You will need to re-index pages (either by emptying solr and deleting the
> crawl directory for nutch, or re-crawling the page when it has timed out)
> for this to take effect; new pages will have their content stored
> automatically.
>
> Hope this helps
>
> Chris



-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com

Re: How to get the original html file that is crawled by Nutch?

Posted by Chris Alexander <ch...@kusiri.com>.

One way I have seen this working is to edit the schema.xml file at
{SOLR_HOME}/conf/schema.xml. Modify the field with name "content" to have
its "stored" attribute set to "true". Something like this:

<field name="content" type="text" stored="true" .....

You will need to re-index pages (either by emptying Solr and deleting the
Nutch crawl directory, or by re-crawling each page once it is due to be
fetched again) for this to take effect; new pages will have their content
stored automatically.
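
Before re-indexing, it can help to confirm that the schema change is actually in place. The following is only a minimal Python sketch, not part of Nutch or Solr: the helper name content_field_explicitly_stored and the trimmed-down SAMPLE_SCHEMA are hypothetical, and the check only looks at an explicit stored attribute on the field itself, ignoring any defaults inherited from the field type.

```python
import xml.etree.ElementTree as ET


def content_field_explicitly_stored(schema_xml, field_name="content"):
    """Return True if the named <field> explicitly sets stored="true".

    Simplification: defaults inherited from the field type are ignored,
    so a field without a stored attribute is reported as False.
    """
    root = ET.fromstring(schema_xml.strip())
    for field in root.iter("field"):
        if field.get("name") == field_name:
            return field.get("stored", "").lower() == "true"
    return False


# Hypothetical, trimmed-down schema used only for illustration.
SAMPLE_SCHEMA = """
<schema name="nutch" version="1.3">
  <fields>
    <field name="url" type="string" stored="true" indexed="true"/>
    <field name="content" type="text" stored="true" indexed="true"/>
  </fields>
</schema>
"""

if __name__ == "__main__":
    print(content_field_explicitly_stored(SAMPLE_SCHEMA))
```

To check a real installation, read the text of {SOLR_HOME}/conf/schema.xml and pass it to the function instead of SAMPLE_SCHEMA.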

Hope this helps

Chris

Re: How to get the original html file that is crawled by Nutch?

Posted by Kelvin <ks...@yahoo.com.sg>.

I have found a solution to my problem and am posting it in case others get stuck on the same thing. :)

Nutch can store the whole text content of the HTML pages. For Nutch 1.3:


Step 1: In nutch/runtime/local/conf/nutch-site.xml, add the following (a negative value turns off truncation of the downloaded content):

<property>
 <name>http.content.limit</name>
 <value>-1</value>
</property>

Step 2: In Solr's /example/solr/conf/schema.xml, set:

<field name="content" type="text" stored="true" indexed="true"/>
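
Once the content field is stored and the pages have been re-indexed, the full extracted text can be requested from Solr along with each hit. The sketch below is one possible way to do that from Python and rests on several assumptions: the base URL http://localhost:8983/solr, the /select handler path, and the field names url and content all need to match your own setup, and the function names are made up for illustration. Note that what comes back is the extracted text that was indexed, not the raw HTML.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def build_select_url(solr_base, query, rows=10):
    """Build a Solr select URL that asks for the stored url and content fields."""
    params = {
        "q": query,            # keyword query, passed through unchanged
        "fl": "url,content",   # assumed field names from the Nutch Solr schema
        "rows": rows,
        "wt": "json",          # ask Solr for a JSON response
    }
    return solr_base.rstrip("/") + "/select?" + urlencode(params)


def extract_pages(response_text):
    """Return (url, content) pairs from a Solr JSON response body."""
    docs = json.loads(response_text)["response"]["docs"]
    return [(doc.get("url"), doc.get("content")) for doc in docs]


def fetch_pages(solr_base, query, rows=10):
    """Run the query against a live Solr instance (needs network access)."""
    with urlopen(build_select_url(solr_base, query, rows)) as resp:
        return extract_pages(resp.read().decode("utf-8"))


# Hypothetical response body, shaped like Solr's JSON output.
SAMPLE_RESPONSE = json.dumps({
    "responseHeader": {"status": 0},
    "response": {
        "numFound": 1,
        "start": 0,
        "docs": [{"url": "http://example.com/", "content": "Example page text"}],
    },
})

if __name__ == "__main__":
    # Assumed default location of the Solr example server.
    print(build_select_url("http://localhost:8983/solr", "nutch crawler"))
    print(extract_pages(SAMPLE_RESPONSE))
```

With a running Solr instance, fetch_pages("http://localhost:8983/solr", "your keywords") would return the stored text for each matching page.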

