You are viewing a plain text version of this content. The canonical link for it is here.
Posted to java-user@lucene.apache.org by Hetan Shah <He...@Sun.COM> on 2004/10/13 20:42:40 UTC

Index + Searching

Hello,

I am using the IndexHTML class to index around 30,000 files and it is 
working fine. Question that I have is, is there a way to add multiple 
fields to index so that when the actual search is performed I can 
extract the exact match.
E.g.
the fields can be
1) title - abc
2) name - foo inc,
3) description - Lorem ipsum dolor sit
4) URL - www.lorem.ipsum

and so on,

 From search when the match for title 'abc' is found then searching for 
doc.get("name") can return foo inc and so on.

Is this already happening in any other indexing class if not what do I 
need to add to IndexHTML class to accomplish this?

thanks for all the help gang.
-H


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org


Re: Index + Searching

Posted by Fred Toth <ft...@synernet.com>.
Hi,

I built a MetaTagsManager class.

In HTMLDocument, there is a line of code that causes the doc
to be parsed:

HTMLParser parser = new HTMLParser(f);

After parsing, all of the parse results are available. I added:

Properties prop = parser.getMetaTags();
//prop.list(System.out);
MetaTagManager.process(doc, prop);

Then, in the "process" method of MetaTagManager, you have
access to all of the meta tags as well as the document. You
can add them as any sort of fields you need. A typical bit might be:

         String desc = prop.getProperty(META_DESCRIPTION);
         if (desc != null) {
             System.out.println("DESC: " + desc);
             doc.add(Field.UnIndexed(DOC_DESCRIPTION, desc));
         }


Be warned, though: As has been discussed on this list several times,
IndexHTML has some significant limitations. The most important
one for me has been the fact that it's not unicode-clean.

However, I'm still using it!

Good luck,

Fred


At 07:06 PM 10/14/2004, you wrote:
>Hi Fred,
>
>Thanks for replying. I see what you mean by creating tags for name and 
>description. I am not sure about how to hack the methods and where. If you 
>can point me in right direction that would really be appreciated. I am 
>thinking about editing the shipped HTMLDocument.java and HTMLParser.java 
>files but not sure what I need to add. Can you please explain a little more.
>
>TIA.
>-H
>
>Fred Toth wrote:
>>Hi,
>>Could be your best bet is to use HTML <meta> tags. Create tags
>>for name, description, etc. (title is already parsed). The HTML parser
>>that ships with Lucene will parse these tags into java Properties. You
>>will need to hack a bit, but you can easily pick these up and add them
>>as specific fields to your index.
>>Fred
>>At 02:42 PM 10/13/2004, you wrote:
>>
>>>Hello,
>>>
>>>I am using the IndexHTML class to index around 30,000 files and it is 
>>>working fine. Question that I have is, is there a way to add multiple 
>>>fields to index so that when the actual search is performed I can 
>>>extract the exact match.
>>>E.g.
>>>the fields can be
>>>1) title - abc
>>>2) name - foo inc,
>>>3) description - Lorem ipsum dolor sit
>>>4) URL - www.lorem.ipsum
>>>
>>>and so on,
>>>
>>> From search when the match for title 'abc' is found then searching for 
>>> doc.get("name") can return foo inc and so on.
>>>
>>>Is this already happening in any other indexing class if not what do I 
>>>need to add to IndexHTML class to accomplish this?
>>>
>>>thanks for all the help gang.
>>>-H
>>>
>>>
>>>---------------------------------------------------------------------
>>>To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
>>>For additional commands, e-mail: lucene-user-help@jakarta.apache.org
>>
>>
>>---------------------------------------------------------------------
>>To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
>>For additional commands, e-mail: lucene-user-help@jakarta.apache.org
>
>
>---------------------------------------------------------------------
>To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
>For additional commands, e-mail: lucene-user-help@jakarta.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org


Re: Index + Searching

Posted by Hetan Shah <He...@Sun.COM>.
Hi Fred,

Thanks for replying. I see what you mean by creating tags for name and 
description. I am not sure about how to hack the methods and where. If 
you can point me in right direction that would really be appreciated. I 
am thinking about editing the shipped HTMLDocument.java and 
HTMLParser.java files but not sure what I need to add. Can you please 
explain a little more.

TIA.
-H

Fred Toth wrote:
> Hi,
> 
> Could be your best bet is to use HTML <meta> tags. Create tags
> for name, description, etc. (title is already parsed). The HTML parser
> that ships with Lucene will parse these tags into java Properties. You
> will need to hack a bit, but you can easily pick these up and add them
> as specific fields to your index.
> 
> Fred
> 
> At 02:42 PM 10/13/2004, you wrote:
> 
>> Hello,
>>
>> I am using the IndexHTML class to index around 30,000 files and it is 
>> working fine. Question that I have is, is there a way to add multiple 
>> fields to index so that when the actual search is performed I can 
>> extract the exact match.
>> E.g.
>> the fields can be
>> 1) title - abc
>> 2) name - foo inc,
>> 3) description - Lorem ipsum dolor sit
>> 4) URL - www.lorem.ipsum
>>
>> and so on,
>>
>> From search when the match for title 'abc' is found then searching for 
>> doc.get("name") can return foo inc and so on.
>>
>> Is this already happening in any other indexing class if not what do I 
>> need to add to IndexHTML class to accomplish this?
>>
>> thanks for all the help gang.
>> -H
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
>> For additional commands, e-mail: lucene-user-help@jakarta.apache.org
> 
> 
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
> For additional commands, e-mail: lucene-user-help@jakarta.apache.org
> 


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org


Re: Index + Searching

Posted by Fred Toth <ft...@synernet.com>.
Hi,

Could be your best bet is to use HTML <meta> tags. Create tags
for name, description, etc. (title is already parsed). The HTML parser
that ships with Lucene will parse these tags into java Properties. You
will need to hack a bit, but you can easily pick these up and add them
as specific fields to your index.

Fred

At 02:42 PM 10/13/2004, you wrote:
>Hello,
>
>I am using the IndexHTML class to index around 30,000 files and it is 
>working fine. Question that I have is, is there a way to add multiple 
>fields to index so that when the actual search is performed I can extract 
>the exact match.
>E.g.
>the fields can be
>1) title - abc
>2) name - foo inc,
>3) description - Lorem ipsum dolor sit
>4) URL - www.lorem.ipsum
>
>and so on,
>
> From search when the match for title 'abc' is found then searching for 
> doc.get("name") can return foo inc and so on.
>
>Is this already happening in any other indexing class if not what do I 
>need to add to IndexHTML class to accomplish this?
>
>thanks for all the help gang.
>-H
>
>
>---------------------------------------------------------------------
>To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
>For additional commands, e-mail: lucene-user-help@jakarta.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org