Posted to user@nutch.apache.org by Christopher Gross <co...@gmail.com> on 2011/08/10 14:12:13 UTC

Crawl Page, Store full HTML content

I have Nutch 1.3 running, and have it connected to a Solr 3.3
instance.  Right now the data comes over from Nutch to Solr just fine,
but I'd like it to send the "content" field to Solr as the raw HTML,
so that I can have all the original markup to work with later.

I've tried digging around on Google and I can't seem to find anything.
Can someone please push me in the right direction?

Thanks!

-- Christopher Gross

Re: Crawl Page, Store full HTML content

Posted by Markus Jelsma <ma...@openindex.io>.

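Nutch doesn't put raw HTML in NutchDocument objects, so changing the Solr
field type won't bring the markup through; the content field only ever
receives the parsed text.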

> You can try using the string type as below:
> <field name="content" type="string" stored="true" indexed="true"/>

Re: Crawl Page, Store full HTML content

Posted by Way Cool <wa...@gmail.com>.

You can try using the string type as below:
<field name="content" type="string" stored="true" indexed="true"/>


Re: Crawl Page, Store full HTML content

Posted by Markus Jelsma <ma...@openindex.io>.

I'm not sure exactly how to do this, but I think creating a parse filter and
an indexing filter will do the trick. First, write the parse filter: it has
access to the Content object, so it can read the byte[] content from it and
add that raw data to the parse data.

In your indexing filter you then simply read that field and add it to the
document. See the plugin-writing example on the wiki for a basic introduction
to writing plugins.
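
The two-filter flow described above can be sketched in plain Java. This is
only an illustration of the data hand-off, not Nutch plugin code: the class
RawHtmlSketch, the methods parseStage and indexStage, and the field name
"rawcontent" are all hypothetical, and plain maps stand in for the parse
metadata and the outgoing document. It also assumes the page is UTF-8; a real
parse filter would need the page's actual encoding.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

// Stand-in sketch only: none of these names are Nutch classes or Nutch APIs.
class RawHtmlSketch {

    // Hypothetical key shared by the two stages.
    static final String RAW_KEY = "rawcontent";

    // Stage 1 (parse-filter role): keep the fetched bytes as a string
    // in the metadata that travels with the parse result.
    public static void parseStage(byte[] fetchedBytes, Map<String, String> parseMeta) {
        // Assumption: UTF-8. Real pages may declare a different charset.
        parseMeta.put(RAW_KEY, new String(fetchedBytes, StandardCharsets.UTF_8));
    }

    // Stage 2 (indexing-filter role): copy the stored markup into the
    // document that is about to be sent to the index.
    public static void indexStage(Map<String, String> parseMeta, Map<String, String> doc) {
        String raw = parseMeta.get(RAW_KEY);
        if (raw != null) {
            doc.put(RAW_KEY, raw);
        }
    }

    public static void main(String[] args) {
        byte[] fetched = "<html><body><p>Hello</p></body></html>"
                .getBytes(StandardCharsets.UTF_8);

        Map<String, String> parseMeta = new HashMap<>();
        Map<String, String> doc = new LinkedHashMap<>();
        doc.put("content", "Hello"); // the usual text-only field stays untouched

        parseStage(fetched, parseMeta);
        indexStage(parseMeta, doc);

        System.out.println(doc.get("content"));
        System.out.println(doc.get(RAW_KEY));
    }
}
```

Writing the markup to a separate field rather than overwriting "content" is a
design choice made for this sketch; the index schema would then need a stored
field with the matching name.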

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350