Posted to user@nutch.apache.org by Richard Braman <rb...@bramantax.com> on 2006/03/09 20:19:59 UTC

RE: writing a metadata content tag:use case example

I am following this thread because I have a similar issue to deal with
in my upcoming development work.  Howie, thanks for your insights; I
think this may solve my problem.

I am trying to index Title 26 of the US Code
http://www.access.gpo.gov/uscode/title26/title26.html

The problem is that I don't want the search engine's users to go crazy
trying to find a particular code section.

Generally users cite the code in this format: 26USC1,
which translates to Title 26, Section 1.
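
To make the format concrete, here is a minimal Java sketch that splits
such a citation into its title and section. The class name UscCitation
and its methods are made up for illustration (they are not part of
Nutch), and the sketch assumes a section identifier consists only of
letters and digits.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch: splits a citation such as
// "26USC1" into its title and section parts.
class UscCitation {
    // Title digits, the literal "USC", then a section identifier.
    // Assumption: section identifiers are letters and digits only.
    private static final Pattern FORMAT =
            Pattern.compile("(\\d+)USC([0-9A-Za-z]+)");

    final int title;
    final String section; // kept as text, since a section may not be purely numeric

    private UscCitation(int title, String section) {
        this.title = title;
        this.section = section;
    }

    // Parses a citation, rejecting anything that does not fit the format.
    static UscCitation parse(String cite) {
        Matcher m = FORMAT.matcher(cite.trim());
        if (!m.matches()) {
            throw new IllegalArgumentException("Not a USC citation: " + cite);
        }
        return new UscCitation(Integer.parseInt(m.group(1)), m.group(2));
    }

    // Renders the long form that the short citation stands for.
    String describe() {
        return "Title " + title + ", Section " + section;
    }

    public static void main(String[] args) {
        System.out.println(parse("26USC1").describe()); // prints "Title 26, Section 1"
    }
}
```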

Fortunately, the government puts the citation at the top of each page:
[CITE: 26USC1]
See:
http://frwebgate.access.gpo.gov/cgi-bin/getdoc.cgi?dbname=browse_usc&docid=Cite:+26USC1

My goal is to parse that citation out so that users can search on it.

So would I do something like 

1. parse out the citation
2. metadata.put(<citation>, <citation>);

?

Thanks for your help on this.


-----Original Message-----
From: Raghavendra Prabhu [mailto:rrprabhu@gmail.com] 
Sent: Thursday, March 09, 2006 2:53 AM
To: nutch-user@lucene.apache.org
Subject: Re: writing a metadata content tag


Hi Howie

That is what I am looking at.

But as you said, to generalize for all requirements, including the
intranet requirement, I am better off doing what you said.

Rgds
Prabu


On 3/9/06, Howie Wang <ho...@hotmail.com> wrote:
>
> >What I want to do is add some header info in the parse filter, which
> >the index filter will then use to add my own new FIELD
> >
> >Rgds
> >Prabhu
>
> I would recommend doing it at the index phase if possible. If the end 
> goal is to have it searchable from the index, ask if you really need 
> to have the information at the parsing stage. If you decide you want 
> to tweak your keywords, it's easy to re-index. If you do it at the 
> parsing stage, it will take twice as long since you have to re-parse 
> and then re-index. Plus re-parsing is not complicated, but involves 
> kind of a hack with renaming a bunch of directories.
>
> One reason to do your analysis at parse time is that it's easier to 
> get the entire page contents like HTML tags in case you need that for 
> categorization. If you don't need this stuff, you probably don't need 
> to categorize at the parsing phase.
>
> If you really want to do it at parse time, it's not difficult. Take a 
> look at parse-html. You can use the metadata object to store your 
> category. Look in HtmlParseFilter.java in getParse. Just do:
>
> metadata.put("myfield", "sports");
>
> In your index filter, you can then do a metadata.get to get your 
> category and then index it.
>
> Howie
>
>
>


Re: writing a metadata content tag:use case example

Posted by TDLN <di...@gmail.com>.
Richard,

> So would I do something like
>
> 1. parse out the citation
> 2. metadata.put(<citation>, <citation>);



Yes, I think that is the way to proceed. Then move on to implementing the
Indexing and Query Filters, all as described in the WritingPluginExample
tutorial:
http://wiki.apache.org/nutch/WritingPluginExample
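
As a rough sketch of steps 1 and 2 in plain Java, assuming the citation
always appears in the page text as "[CITE: ...]". A java.util.Map stands
in for the Nutch metadata object here, and the class name
CitationExtractor and the field name "cite" are made up for
illustration; none of this is the actual Nutch plugin API.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: pulls the "[CITE: 26USC1]" marker out of page text
// and stores it under a fixed field name.
class CitationExtractor {
    // Group 1 captures the citation inside the brackets.
    // Assumption: the citation is made of letters and digits only.
    private static final Pattern CITE =
            Pattern.compile("\\[CITE:\\s*([0-9A-Za-z]+)\\]");

    // Hypothetical field name shared by the parse and index steps.
    static final String FIELD = "cite";

    // Step 1: returns the first citation found, or null if there is none.
    static String extract(String pageText) {
        Matcher m = CITE.matcher(pageText);
        return m.find() ? m.group(1) : null;
    }

    // Step 2: stores the citation in the metadata (a plain Map in this sketch).
    static void addCitation(Map<String, String> metadata, String pageText) {
        String cite = extract(pageText);
        if (cite != null) {
            metadata.put(FIELD, cite);
        }
    }

    public static void main(String[] args) {
        Map<String, String> metadata = new HashMap<>();
        addCitation(metadata, "[CITE: 26USC1]\n(rest of the page text)");
        // An index filter would read the value back the same way.
        System.out.println(metadata.get(FIELD)); // prints "26USC1"
    }
}
```

Using a fixed field name as the key, rather than the citation itself,
follows Howie's metadata.put("myfield", "sports") example and lets the
index filter find the value with a single metadata.get.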

Rgrds, Thomas
