Posted to java-user@lucene.apache.org by Maryam <mk...@yahoo.com> on 2007/03/15 03:25:52 UTC

Indexing HTML pages and phrases

Hi, 

I am wondering whether we can index a phrase (not a term) in
Lucene. Also, I am not sure whether it can index HTML
pages. I need access to the text of some of the
tags, and I am not sure if this can be done in Lucene. I
would be very glad if you could help me with this.

Thanks 



 
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Indexing HTML pages and phrases

Posted by Bhavin Pandya <bh...@rediff.co.in>.
Hi Maryam,

You can index the content of a specific field as UN_TOKENIZED. The whole
field value is then indexed as a single term, so a search on that field
matches only the complete phrase, not individual tokens.
To index HTML pages you can use any HTML parser. This one may be useful to you:
http://lucene.apache.org/java/docs/api/org/apache/lucene/demo/html/HTMLParser.html

Thanks.
Bhavin pandya



Re: Indexing HTML pages and phrases

Posted by Doron Cohen <DO...@il.ibm.com>.
For phrase search there's no need to "detect the phrases" at indexing time:
the position of each "word" is saved in the index and then used at search
time to match phrase queries. (Also see the 'query syntax document'.)
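
To illustrate the idea of matching phrases from stored positions, here is a
minimal sketch in plain Java. It is a toy, not Lucene code and not Lucene's
index format: the class PositionalIndexSketch, its methods, and the naive
tokenizer are all assumptions made up for this example.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Toy illustration only (hypothetical class, not part of Lucene).
class PositionalIndexSketch {
    // term -> (document id -> positions at which the term occurs)
    private final Map<String, Map<Integer, List<Integer>>> postings = new HashMap<>();

    // Naive tokenizer (an assumption): lowercase, split on non-alphanumerics.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Indexing time: only single words and their positions are recorded.
    void addDocument(int docId, String text) {
        List<String> tokens = tokenize(text);
        for (int pos = 0; pos < tokens.size(); pos++) {
            postings.computeIfAbsent(tokens.get(pos), k -> new HashMap<>())
                    .computeIfAbsent(docId, k -> new ArrayList<>())
                    .add(pos);
        }
    }

    // Search time: a document matches if the phrase words occur at
    // consecutive positions. Returns matching document ids in ascending order.
    List<Integer> searchPhrase(String phrase) {
        List<String> words = tokenize(phrase);
        List<Integer> hits = new ArrayList<>();
        if (words.isEmpty()) {
            return hits;
        }
        Map<Integer, List<Integer>> first = postings.get(words.get(0));
        if (first == null) {
            return hits;
        }
        for (Map.Entry<Integer, List<Integer>> entry : first.entrySet()) {
            int docId = entry.getKey();
            for (int start : entry.getValue()) {
                if (matchesAt(docId, start, words)) {
                    hits.add(docId);
                    break;
                }
            }
        }
        Collections.sort(hits);
        return hits;
    }

    // Checks that word i of the phrase occurs at position start + i.
    private boolean matchesAt(int docId, int start, List<String> words) {
        for (int i = 1; i < words.size(); i++) {
            Map<Integer, List<Integer>> byDoc = postings.get(words.get(i));
            if (byDoc == null) {
                return false;
            }
            List<Integer> positions = byDoc.get(docId);
            if (positions == null || !positions.contains(start + i)) {
                return false;
            }
        }
        return true;
    }
}
```

With document 1 "Lucene can index HTML pages" and document 2 "HTML pages and
an index", searchPhrase("index HTML") returns only document 1, even though
both documents contain both words. No phrase was ever detected at indexing time.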

Lucene takes plain text as document input. Extraction of the content text and
properties from (say) an HTML page should be done outside Lucene. (Also see
the 'Lucene FAQ'.)
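
As one possible way to do that extraction outside Lucene, here is a small
sketch that pulls the text of a chosen tag out of an HTML string using the
HTML parser bundled with the JDK (javax.swing.text.html). The class
TagTextExtractorSketch and its textOf method are hypothetical names for this
example; any other HTML parser could be used instead. The returned strings
are plain text that could then be added to a document field.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper, independent of Lucene.
class TagTextExtractorSketch {
    // Returns the trimmed text found inside each occurrence of the wanted tag.
    static List<String> textOf(String html, HTML.Tag wanted) {
        List<String> out = new ArrayList<>();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            private int depth = 0;          // nesting depth inside the wanted tag
            private StringBuilder current;  // text collected for the open tag

            @Override
            public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos) {
                if (t == wanted && depth++ == 0) {
                    current = new StringBuilder();
                }
            }

            @Override
            public void handleText(char[] data, int pos) {
                if (depth > 0) {
                    current.append(data);
                }
            }

            @Override
            public void handleEndTag(HTML.Tag t, int pos) {
                if (t == wanted && depth > 0 && --depth == 0) {
                    out.add(current.toString().trim());
                }
            }
        };
        try {
            // The third argument tells the parser to ignore charset declarations.
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out;
    }
}
```

For example, textOf(html, HTML.Tag.H1) collects the text of the h1 headings
of a page, which could then be indexed in a field of its own.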

Assigning Store.YES to a field added to a document being indexed saves
the text of that field in the index, so that it can be fetched later (at
search time). (Also see the javadocs for org.apache.lucene.document.Field.)

Regards,
Doron
