Posted to user@nutch.apache.org by Raghavendra Prabhu <rr...@gmail.com> on 2006/03/08 21:59:54 UTC

writing a metadata content tag

Hi guys,

Sorry for the follow-up mail.

As I mentioned previously, my requirement is to stamp documents
with some kind of type.

How do I do it?

For example, add "sports" to a field TYPEFIELD on seeing "football" or
"tennis" in the extracted text.

For example, add "technology" to the same field TYPEFIELD on seeing
"web" or "internet".

Where do I add this?

Rgds,

Prabhu

Re: writing a metadata content tag:use case example

Posted by TDLN <di...@gmail.com>.
Richard,

> So would I do something like
>
> 1. parse out the citation
> 2. metadata.put(<citation>, <citation>);
>
> ?

Yes, I think that is the way to proceed. Then implement the
indexing and query filters, as described in the WritingPluginExample tutorial:
http://wiki.apache.org/nutch/WritingPluginExample

Rgrds, Thomas

RE: writing a metadata content tag:use case example

Posted by Richard Braman <rb...@bramantax.com>.
I am following this thread as I have a similar issue to deal with in my
upcoming development work. Howie, thanks for your insights into this, as I
think this may solve my problem.

I am trying to index Title 26 of the US Code:
http://www.access.gpo.gov/uscode/title26/title26.html

The problem is I don't want the search engine's users to have to go crazy
trying to find a particular code section.

Generally, users cite the code in this format: 26USC1,
which translates to Title 26, Section 1.

Fortunately, the government puts the citation at the top of each page:
[CITE: 26USC1]
See the top of the page at
http://frwebgate.access.gpo.gov/cgi-bin/getdoc.cgi?dbname=browse_usc&docid=Cite:+26USC1

My goal is to parse that citation out so that users can
search on the citation.

So would I do something like 

1. parse out the citation
2. metadata.put(<citation>, <citation>);

?
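For step 1, a minimal sketch of pulling the citation out of the page text
with a regular expression might look like the following. The class name
CitationExtractor and the exact pattern are assumptions for illustration;
the pattern only assumes the "[CITE: 26USC1]" layout shown above, and it is
plain Java with no Nutch dependency.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch: extracts the citation token
// from text containing a marker such as "[CITE: 26USC1]".
final class CitationExtractor {
    // "[CITE:" followed by optional whitespace, then the citation itself
    // (any run of characters that are neither whitespace nor ']'), then "]".
    private static final Pattern CITE =
            Pattern.compile("\\[CITE:\\s*([^\\]\\s]+)\\]");

    // Returns the first citation found in the text, if any.
    static Optional<String> extract(String pageText) {
        Matcher m = CITE.matcher(pageText);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        // prints "26USC1"
        System.out.println(extract("[CITE: 26USC1] Sec. 1. Tax imposed").orElse("none"));
    }
}
```

The extracted string would then be the value handed to metadata.put in
step 2.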

Thanks for your help on this.




Re: writing a metadata content tag

Posted by Raghavendra Prabhu <rr...@gmail.com>.
Hi Howie,

That is what I am looking at.

But as you said, to generalize for all requirements, including the
intranet requirement, I am better off doing what you said.

Rgds,
Prabhu


On 3/9/06, Howie Wang <ho...@hotmail.com> wrote:
>
> >What i want to do is i should add some header info in parse-filter which
> >will be used by index-filter to add my own nature of the new FIELD
> >
> >Rgds
> >Prabhu
>
> I would recommend doing it at the index phase if possible. If the end
> goal is to have it searchable from the index, ask if you really need to
> have
> the information at the parsing stage. If you decide you want to
> tweak your keywords, it's easy to re-index. If you do it at the parsing
> stage, it will take twice as long since you have to re-parse and then
> re-index. Plus re-parsing is not complicated, but involves kind of a
> hack with renaming a bunch of directories.
>
> One reason to do your analysis at parse time is that it's easier to
> get the entire page contents like HTML tags in case you need that
> for categorization. If you don't need this stuff, you probably don't
> need to categorize at the parsing phase.
>
> If you really want to do it at parse time, it's not difficult. Take a
> look at parse-html. You can use the metadata object to store
> your category. Look in HtmlParseFilter.java in getParse. Just do:
>
> metadata.put("myfield", "sports");
>
> In your index filter, you can then do a metadata.get to get your
> category and then index it.
>
> Howie
>
>
>

Re: writing a metadata content tag

Posted by Howie Wang <ho...@hotmail.com>.
>What i want to do is i should add some header info in parse-filter which
>will be used by index-filter to add my own nature of the new FIELD
>
>Rgds
>Prabhu

I would recommend doing it at the index phase if possible. If the end
goal is to have it searchable from the index, ask if you really need to have
the information at the parsing stage. If you decide you want to
tweak your keywords, it's easy to re-index. If you do it at the parsing
stage, it will take twice as long since you have to re-parse and then
re-index. Plus re-parsing is not complicated, but involves kind of a
hack with renaming a bunch of directories.

One reason to do your analysis at parse time is that it's easier to
get the entire page contents like HTML tags in case you need that
for categorization. If you don't need this stuff, you probably don't
need to categorize at the parsing phase.

If you really want to do it at parse time, it's not difficult. Take a
look at parse-html. You can use the metadata object to store
your category. Look in HtmlParseFilter.java in getParse. Just do:

metadata.put("myfield", "sports");

In your index filter, you can then do a metadata.get to get your
category and then index it.
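The hand-off between the two phases can be sketched roughly as below. This
is only an illustration: java.util.Properties is used as a stand-in for the
Nutch metadata object (whose actual type depends on the Nutch version), and
the class and method names are made up for the example.

```java
import java.util.Properties;

// Hypothetical sketch of the parse-time / index-time hand-off.
// Properties stands in for the Nutch metadata object here.
final class MetadataHandOff {
    static final String FIELD = "myfield";

    // Parse-time step: record the category under a well-known key.
    static void tagAtParseTime(Properties metadata, String category) {
        metadata.setProperty(FIELD, category);
    }

    // Index-time step: read the category back; null if it was never set.
    static String categoryAtIndexTime(Properties metadata) {
        return metadata.getProperty(FIELD);
    }

    public static void main(String[] args) {
        Properties metadata = new Properties();
        tagAtParseTime(metadata, "sports");
        // prints "sports"
        System.out.println(categoryAtIndexTime(metadata));
    }
}
```

The only contract between the two filters is the key name, so it is worth
keeping it in one shared constant.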

Howie



Re: writing a metadata content tag

Posted by Raghavendra Prabhu <rr...@gmail.com>.
Hi Howie,

What you have mentioned covers the indexing fields.

I am speaking about content.

I thought there are three steps:

parse-filter
index-filter
query-filter

I think you are referring to the second step, index-filter. I want to know
more about the first step, parse-filter.

What I want to do is add some header info in the parse-filter, which
will then be used by the index-filter to set the value of the new FIELD.

Rgds,
Prabhu


On 3/9/06, Howie Wang <ho...@hotmail.com> wrote:
>
> You need to write your own indexing filter plugin. Take a look
> at index-basic. In BasicIndexingFilter.java there are a whole
> bunch of lines that do something like:
>
> doc.add(Field.Text("myfield", myFieldValue));
>
> Just add your own field. You have access to title, anchor,
> and page text in this function. Search the text for your
> keywords and add whatever field you want.
>
> To search on this field, you'll have to create a query filter plugin also
> so that you can search for "myfield:sports".  See query-site for an
> example. You'll only have to change a couple of lines of code:
>
> public class MyQueryFilter extends RawFieldQueryFilter {
> public MyQueryFilter() {
>    super("myfield");
> }
> }
>
> Don't forget to add your new plugins to nutch-site.xml.
>
> By the way, I would recommend writing some extra code to
> allow yourself to read in keywords from a file and map them
> to your category. It's kind of a pain to edit the code every
> time you think of a new keyword.
>
> Howie
>
> >Hi guys
> >
> >Sorry for the follow up mail
> >
> >My requirement as i was mentioning previously shud let me stamp documents
> >with some kind of type
> >
> >
> >How do i do it ?
> >
> >
> >For example add sports to a field TYPEFIELD on seeing football,tennis in
> >extracted text
> >
> >For example add technology to the same field TYPEFIELD on seeing
> >web,internet
> >
> >
> >Where do i add this ??
> >
> >Rgds
> >
> >Prabhu
>
>
>

RE: writing a metadata content tag

Posted by Howie Wang <ho...@hotmail.com>.
You need to write your own indexing filter plugin. Take a look
at index-basic. In BasicIndexingFilter.java there are a whole
bunch of lines that do something like:

doc.add(Field.Text("myfield", myFieldValue));

Just add your own field. You have access to title, anchor,
and page text in this function. Search the text for your
keywords and add whatever field you want.
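The keyword search itself could be sketched like this. It is plain Java
with no Nutch dependency; the class name CategoryTagger and the hard-coded
table are assumptions for illustration, using the keywords from the
original question. The returned category would be the value passed to
doc.add.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

// Hypothetical helper: maps page text to a category by keyword lookup.
final class CategoryTagger {
    private static final Map<String, String> KEYWORD_TO_CATEGORY = new LinkedHashMap<>();
    static {
        KEYWORD_TO_CATEGORY.put("football", "sports");
        KEYWORD_TO_CATEGORY.put("tennis", "sports");
        KEYWORD_TO_CATEGORY.put("web", "technology");
        KEYWORD_TO_CATEGORY.put("internet", "technology");
    }

    // Returns the category of the first known keyword that appears in the
    // text as a whole word (case-insensitive), or empty if none matches.
    static Optional<String> categorize(String text) {
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            String category = KEYWORD_TO_CATEGORY.get(token);
            if (category != null) {
                return Optional.of(category);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // prints "sports"
        System.out.println(categorize("She plays tennis every weekend").orElse("none"));
    }
}
```

Matching on whole tokens avoids tagging a page as technology just because
it contains a word like "cobweb".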

To search on this field, you'll have to create a query filter plugin also
so that you can search for "myfield:sports".  See query-site for an
example. You'll only have to change a couple of lines of code:

public class MyQueryFilter extends RawFieldQueryFilter {
  public MyQueryFilter() {
    super("myfield");
  }
}

Don't forget to add your new plugins to nutch-site.xml.

By the way, I would recommend writing some extra code to
allow yourself to read in keywords from a file and map them
to your category. It's kind of a pain to edit the code every
time you think of a new keyword.
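One possible way to do that, as a rough sketch: keep the mapping in a
"keyword=category" file and load it with java.util.Properties. The file
format and the class name KeywordFile are assumptions for the example; a
StringReader stands in for the real file so the snippet runs on its own.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Properties;

// Hypothetical loader for a keyword-to-category file with lines such as
// "football=sports".
final class KeywordFile {
    // Reads "keyword=category" lines and returns a lower-cased keyword map.
    static Map<String, String> load(Reader in) {
        Properties props = new Properties();
        try {
            props.load(in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        Map<String, String> keywordToCategory = new HashMap<>();
        for (String keyword : props.stringPropertyNames()) {
            keywordToCategory.put(keyword.toLowerCase(Locale.ROOT),
                                  props.getProperty(keyword).trim());
        }
        return keywordToCategory;
    }

    public static void main(String[] args) {
        String sample = "football=sports\ntennis=sports\nweb=technology\ninternet=technology\n";
        Map<String, String> map = load(new StringReader(sample));
        // prints "technology"
        System.out.println(map.get("internet"));
    }
}
```

In a real plugin the Reader would come from a file on disk, so adding a
keyword only means editing that file and re-indexing.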

Howie

>Hi guys
>
>Sorry for the follow up mail
>
>My requirement as i was mentioning previously shud let me stamp documents
>with some kind of type
>
>
>How do i do it ?
>
>
>For example add sports to a field TYPEFIELD on seeing football,tennis in
>extracted text
>
>For example add technology to the same field TYPEFIELD on seeing
>web,internet
>
>
>Where do i add this ??
>
>Rgds
>
>Prabhu