Posted to java-user@lucene.apache.org by jake dsouza <ja...@gmail.com> on 2012/04/12 07:45:51 UTC

Indexing TREC GOV2 data in Lucene

Hi all,

I am working on a project on static index pruning and I am using the TREC
GOV2 dataset. I have seen that TREC data can be parsed and that the
necessary Java files are present in the contrib package, but has anyone
used Lucene to index the GOV2 dataset, or is source code available for
doing so?

Regards
Jake Dsouza

Re: Indexing TREC GOV2 data in Lucene

Posted by "Dr. Hany Azzam" <ha...@eecs.qmul.ac.uk>.
Hi,

I am not sure whether there is something in contrib for GOV2, but it really
depends on what you want to parse. If you are only interested in full-text
search, it should be similar to parsing a regular document while being
aware of the TREC-specific delimiters, such as <DOC>.
However, if you want to perform structured search and maintain indexes over
different fields such as titles, this will require some customisation.
Likewise, if you want to store the anchor text separately and perform some
sort of link resolution and page ranking, you will need to customise your
parsing.
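
As a rough illustration of the full-text case, here is a minimal sketch in
plain Java that splits a TREC-style stream into records. It assumes each
record is wrapped in <DOC> ... </DOC> lines and carries its identifier on a
single <DOCNO>...</DOCNO> line; check this against your copy of GOV2. The
names TrecDocSplitter, TrecDoc and split are made up for this example and
are not part of Lucene.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: splits a TREC-style stream on its <DOC> delimiters. */
final class TrecDocSplitter {

    /** One record: the DOCNO identifier plus the remaining raw lines. */
    static final class TrecDoc {
        final String docno; // null if the record had no <DOCNO> line
        final String body;

        TrecDoc(String docno, String body) {
            this.docno = docno;
            this.body = body;
        }
    }

    private static final String OPEN = "<DOCNO>";
    private static final String CLOSE = "</DOCNO>";

    /** Reads the whole stream and returns every complete <DOC>...</DOC> record. */
    static List<TrecDoc> split(BufferedReader in) throws IOException {
        List<TrecDoc> docs = new ArrayList<>();
        StringBuilder body = null; // non-null only while inside a record
        String docno = null;
        String line;
        while ((line = in.readLine()) != null) {
            String t = line.trim();
            if (t.equals("<DOC>")) {
                body = new StringBuilder();
                docno = null;
            } else if (t.equals("</DOC>")) {
                if (body != null) {
                    docs.add(new TrecDoc(docno, body.toString()));
                }
                body = null;
            } else if (body != null) {
                if (docno == null && t.startsWith(OPEN) && t.endsWith(CLOSE)) {
                    // Assumption: the identifier sits on one line of its own.
                    docno = t.substring(OPEN.length(), t.length() - CLOSE.length()).trim();
                } else {
                    body.append(line).append('\n');
                }
            }
        }
        return docs;
    }

    public static void main(String[] args) throws IOException {
        // Invented sample data, only to show the call.
        String sample = "<DOC>\n<DOCNO>D1</DOCNO>\n<html>first page</html>\n</DOC>\n"
                      + "<DOC>\n<DOCNO>D2</DOCNO>\n<html>second page</html>\n</DOC>\n";
        for (TrecDoc d : split(new BufferedReader(new StringReader(sample)))) {
            System.out.println(d.docno + ": " + d.body.length() + " chars");
        }
    }
}
```

Each TrecDoc could then be turned into a Lucene Document, for example with
the identifier as a stored, non-analysed field and the body as an analysed
field; the exact field classes depend on the Lucene version. For a
collection the size of GOV2 you would hand each record to the IndexWriter
as it is read instead of collecting them in a list, and structured fields
or anchor text would need an HTML parser on top of this.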

h.



