Posted to dev@lucene.apache.org by dealmaker <vk...@yahoo.com> on 2008/08/13 02:10:31 UTC
How to Query for Documents' Anchor Text?
Hi,
I know that Lucene already has an anchor text feature that indexes/queries
all anchor text that leads to a given document, but that is not what I am
looking for. I want to index/query all the anchor text that appears in all
documents, or in a subset of documents. Is there already some kind of plugin
or parser that does this?
E.g., if I write the query "wep wireless card", it should return all the
anchor text, in all the documents, that is related to wep, wireless and card.
Thanks.
--
View this message in context: http://www.nabble.com/How-to-Query-for-Documents%27-Anchor-Text--tp18954776p18954776.html
Sent from the Lucene - Java Developer mailing list archive at Nabble.com.
---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org
RE: How to Query for Documents' Anchor Text?
Posted by Steven A Rowe <sa...@syr.edu>.
Hi dealmaker,
The java-dev mailing list is devoted to discussion of the *development* of Lucene. In the future, please use the java-user mailing list for questions about *using* Lucene.
If by "anchor text" you mean HTML <a href="...">anchor text</a>, then you must make sure that you index this text in its own field. AFAIK, Lucene has no native capability to extract it, so you must write this functionality yourself.
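As a rough illustration of the "write it yourself" step, here is a minimal plain-Java sketch (no Lucene involved) that pulls the anchor text out of an HTML string so it can be fed into a dedicated field. The class name AnchorTextExtractor and the regex-based approach are my own assumptions for the example; a regex will miss malformed markup and does not decode entities, so a real HTML parser is probably the better choice for production input.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Lucene: collects the text of <a>...</a> elements.
public class AnchorTextExtractor {

    // Matches <a ...>text</a>. The \b keeps tags such as <abbr> from matching;
    // DOTALL lets the anchor text span several lines.
    private static final Pattern ANCHOR = Pattern.compile(
            "<a\\b[^>]*>(.*?)</a\\s*>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Any remaining tag nested inside the anchor, e.g. <b> or <img ...>.
    private static final Pattern TAG = Pattern.compile("<[^>]+>");

    public static List<String> extractAnchorTexts(String html) {
        List<String> texts = new ArrayList<String>();
        Matcher m = ANCHOR.matcher(html);
        while (m.find()) {
            // Drop nested tags, collapse runs of whitespace, trim the ends.
            String text = TAG.matcher(m.group(1)).replaceAll(" ")
                    .replaceAll("\\s+", " ")
                    .trim();
            if (!text.isEmpty()) {
                texts.add(text);
            }
        }
        return texts;
    }

    public static void main(String[] args) {
        String html = "<p>See <a href=\"a.html\">WEP wireless card</a> and "
                + "<a href=\"b.html\"><b>USB</b> adapter</a>.</p>";
        for (String text : extractAnchorTexts(html)) {
            System.out.println(text);
        }
    }
}
```

The strings this returns would then be concatenated, or added one by one, into the separate "anchor text" field of each document.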
To pull out terms from the "anchor text" field once you have a set of documents you want to look at (e.g. as a result of a search), use Lucene's Term Vectors feature.
At index time, use the Field constructor that takes in a Field.TermVector specifier:
<http://lucene.apache.org/java/2_3_2/api/org/apache/lucene/document/Field.html#Field(java.lang.String,%20java.lang.String,%20org.apache.lucene.document.Field.Store,%20org.apache.lucene.document.Field.Index,%20org.apache.lucene.document.Field.TermVector)>
Once you have created an index with Term Vectors, you will be able to access any document's terms along with their frequencies, using IndexReader.getTermFreqVector():
<http://lucene.apache.org/java/2_3_2/api/org/apache/lucene/index/IndexReader.html#getTermFreqVector(int,%20java.lang.String)>
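To make concrete what a term vector holds, here is a small plain-Java sketch that computes the same kind of data (each distinct term in one field of one document, with its frequency) without using Lucene at all. The class name TermFrequencySketch and the crude lowercase-and-split tokenization are assumptions for illustration only; in Lucene the analyzer you index with decides the terms, and the returned TermFreqVector exposes them through getTerms() and getTermFrequencies().

```java
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in for what a term vector stores for one field of one document.
public class TermFrequencySketch {

    public static Map<String, Integer> termFrequencies(String fieldText) {
        // TreeMap keeps the terms in sorted order.
        Map<String, Integer> freqs = new TreeMap<String, Integer>();
        // Assumed tokenization: lowercase, split on anything that is not a letter or digit.
        String[] tokens = fieldText.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+");
        for (String token : tokens) {
            if (token.isEmpty()) {
                continue; // split() yields an empty first token if the text starts with a delimiter
            }
            Integer old = freqs.get(token);
            freqs.put(token, old == null ? 1 : old + 1);
        }
        return freqs;
    }

    public static void main(String[] args) {
        Map<String, Integer> freqs = termFrequencies("WEP wireless card, wireless WEP key");
        for (Map.Entry<String, Integer> e : freqs.entrySet()) {
            System.out.println(e.getKey() + " " + e.getValue());
        }
    }
}
```

With a real index, you would run the search first and then call getTermFreqVector() on each hit's document number and the anchor text field to get the equivalent arrays.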
Steve