Posted to user@nutch.apache.org by Cam Bazz <ca...@gmail.com> on 2011/07/15 18:19:55 UTC

what does the parse command does

Hello,

Finally I got a working build environment, and I am doing some
modifications and playing around.

I also got my first plugin to build, and I am almost done with my custom parser.

I have my custom plugin and the method

public ParseResult filter(Content content, ParseResult parseResult,
HTMLMetaTags metaTags, DocumentFragment doc) { ...

does indeed have all the information that I need to do my custom parsing.

Now this is what I don't understand: there is a content field in Solr.
I have read the solrindexer code, and figured out that pretty much any
field in the doc is indexed to Solr.

What must I do to add another content-like field, such as "Content2",
and put my custom extracted data in it so that Solr indexes it? I
think this has less to do with Solr than with the fields in the
document.

In the recommended example, the found result is only added to
contentMeta, and that one is not indexed by Solr.

Best Regards,
-C.B.

Re: what does the parse command does

Posted by lewis john mcgibbney <le...@gmail.com>.
Hi C.B.,

Quite a few things here

On Fri, Jul 15, 2011 at 5:19 PM, Cam Bazz <ca...@gmail.com> wrote:

> Hello,
>
> Finally I got a working build environment, and I am doing some
> modifications and playing around.
>

Good to hear. Although it is off topic, can you please share with us any
hurdles you overcame? It would be good to hear how you solved your
configuration problems.


> I also got my first plugin to build, and I am almost done with my custom parser.
>

Excellent. I will proceed with adding your comment to a page in plugin
central on the wiki. In the meantime, it would be good to hear more about
your plugin and what functionality it encapsulates! Would it be possible to
get a wiki entry? We are a bit short on Nutch 1.3 custom plugin tutorials.

>
> I have my custom plugin and the method
>
> public ParseResult filter(Content content, ParseResult parseResult,
> HTMLMetaTags metaTags, DocumentFragment doc) { ...
>
> does indeed have all the information that I need to do my custom parsing.
>
> Now this is what I don't understand: there is a content field in Solr.
> I have read the solrindexer code, and figured out that pretty much any
> field in the doc is indexed to Solr.
>

If you have a look at both your schema and solr-mapping documents, you will
see how fields are generated and passed to Solr for indexing.

>
> What must I do to add another content-like field, such as "Content2",
> and put my custom extracted data in it so that Solr indexes it? I
> think this has less to do with Solr than with the fields in the
> document.
>
My suggestion would be to implement extraction of the field within the plugin
code, then add the corresponding field configuration to both of the
aforementioned config documents.
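
To make the data flow concrete, here is a rough, self-contained sketch. It
does not use the Nutch API at all: plain Java maps stand in for the parse
metadata and for the index document, and the class name, method names and
the "content2" key are all hypothetical. It only illustrates the idea that a
value stored in the parse metadata has to be copied into a document field
before the Solr indexer can see it. In Nutch that copy step would normally
live in an indexing filter plugin rather than in the parse filter.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

/**
 * Stand-alone model of the parse -> index flow. None of these types are
 * Nutch classes; the maps are stand-ins chosen for illustration only.
 */
public class Content2Sketch {

    // Hypothetical key and field name, borrowed from the "Content2" example.
    public static final String FIELD = "content2";

    /** Parse step: store the custom extraction in the parse metadata. */
    public static void parseFilter(String rawContent, Map<String, String> parseMeta) {
        // Placeholder extraction; a real parser would walk the DOM instead.
        String extracted = rawContent.trim();
        parseMeta.put(FIELD, extracted);
    }

    /** Index step: copy the metadata value into a field of the document. */
    public static void indexingFilter(Map<String, String> parseMeta, Map<String, String> doc) {
        String value = parseMeta.get(FIELD);
        if (value != null) {
            // Only values placed in the document are handed to the indexer.
            doc.put(FIELD, value);
        }
    }

    public static void main(String[] args) {
        Map<String, String> parseMeta = new HashMap<>();
        Map<String, String> doc = new LinkedHashMap<>();
        doc.put("content", "default parsed text");

        parseFilter("  custom data  ", parseMeta);
        indexingFilter(parseMeta, doc);

        // prints {content=default parsed text, content2=custom data}
        System.out.println(doc);
    }
}
```

With a step like the second one in place, the new field would presumably
still need a matching entry in the schema and in the solr-mapping document
before Solr accepts it.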


>
> In the recommended example, the found result is only added to
> contentMeta, and that one is not indexed by Solr.
>

What recommended example? I am not following you here.

>
> Best Regards,
> -C.B.
>



-- 
*Lewis*