Posted to dev@lucene.apache.org by Matt Quail <ma...@ctx.com.au> on 2004/04/07 11:07:20 UTC
looking for a large test corpus for a lucene presentation
Hi all,
I'm doing a presentation to my local JUG on Lucene, and I'm looking for
a "good" set of documents to use as a demonstration.
Ideally it would:
1) be large (10,000 documents or more?),
2) contain some metadata besides "body" (author, date, primary key,
etc.),
3) be freely available.
I was going to use the data from the previous Google programming
contest, but it doesn't seem to be available.
If I can't find anything satisfactory, I'll probably:
- generate a fake white-pages phonebook
- grab documents from Project Gutenberg
My preference is for some "real" data, but I'm happy to generate fake
data if no one has any better ideas.
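For the fake-phonebook fallback, a minimal sketch of a generator might look like the following. Everything in it is an assumption made for illustration: the class name FakePhonebook, the name and street lists, the 555-prefixed phone format, and the four fields (id, name, address, phone). A fixed seed keeps the generated corpus reproducible between runs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Random;

// Hypothetical generator for a fake white-pages corpus; all names and data are made up.
class FakePhonebook {
    static final String[] FIRST = {"Alice", "Bob", "Carol", "David", "Erin", "Frank"};
    static final String[] LAST = {"Nguyen", "Smith", "Jones", "Brown", "Wilson", "Taylor"};
    static final String[] STREET = {"High St", "Station Rd", "Park Ave", "Church Ln"};

    /** Returns `count` entries, each an array of {id, name, address, phone}. */
    static List<String[]> generate(int count, long seed) {
        Random rnd = new Random(seed); // same seed, same corpus
        List<String[]> entries = new ArrayList<>(count);
        for (int i = 0; i < count; i++) {
            String id = String.valueOf(i); // stands in for a primary key
            String name = LAST[rnd.nextInt(LAST.length)] + ", " + FIRST[rnd.nextInt(FIRST.length)];
            String address = (1 + rnd.nextInt(200)) + " " + STREET[rnd.nextInt(STREET.length)];
            String phone = String.format(Locale.ROOT, "555-%04d", rnd.nextInt(10000));
            entries.add(new String[] {id, name, address, phone});
        }
        return entries;
    }

    public static void main(String[] args) {
        // Print a few tab-separated entries as a quick sanity check.
        for (String[] e : generate(3, 42L)) {
            System.out.println(String.join("\t", e));
        }
    }
}
```

Each entry could then be turned into one indexed document, with one field per array slot.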
:D
=Matt
---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-dev-help@jakarta.apache.org
Re: looking for a large test corpus for a lucene presentation
Posted by Magnus Johansson <ma...@technohuman.com>.
I have used some posts from Usenet. There are many
of them, and they contain metadata.
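As a rough illustration of pulling that metadata out, here is a minimal sketch that splits a raw post into header fields and a body. It assumes the usual "Name: value" header block ended by a blank line; the class name UsenetPostSketch and the "body" key are hypothetical, and real posts (encodings, multipart bodies) would need more care.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper: splits a raw post into lower-cased header fields plus "body".
class UsenetPostSketch {
    static Map<String, String> parse(String raw) {
        Map<String, String> fields = new LinkedHashMap<>();
        String[] lines = raw.split("\r?\n", -1);
        String lastKey = null;
        int i = 0;
        // Header block: everything up to the first empty line.
        for (; i < lines.length && !lines[i].isEmpty(); i++) {
            String line = lines[i];
            boolean continuation = line.startsWith(" ") || line.startsWith("\t");
            if (continuation && lastKey != null) {
                // Folded header: append to the previous field's value.
                fields.put(lastKey, fields.get(lastKey) + " " + line.trim());
                continue;
            }
            int colon = line.indexOf(':');
            if (colon <= 0) continue; // not a "Name: value" line; skip it
            lastKey = line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
            fields.put(lastKey, line.substring(colon + 1).trim());
        }
        // Body: everything after the blank separator line (stored last, so it
        // would overwrite a header that happened to be named "body").
        int bodyStart = Math.min(i + 1, lines.length);
        fields.put("body", String.join("\n", Arrays.asList(lines).subList(bodyStart, lines.length)));
        return fields;
    }

    public static void main(String[] args) {
        String post = "From: someone@example.org\nSubject: Test post\n\nHello\nworld";
        System.out.println(parse(post).keySet());
    }
}
```

Fields such as from, subject and date map naturally onto the author and date metadata asked for in the original post.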
/magnus
Re: looking for a large test corpus for a lucene presentation
Posted by Matt Quail <ma...@ctx.com.au>.
> how about http://dmoz.org/rdf
Perfect! And hierarchical data, as well!
Re: looking for a large test corpus for a lucene presentation
Posted by Andrzej Bialecki <ab...@getopt.org>.
Matt Quail wrote:
> Hi all,
>
> I'm doing a presentation to my local JUG on Lucene, and I'm looking for
> a "good" set of documents to use as a demonstration.
>
> Ideally it would be:
> 1) large (10,000 plus?).
> 2) contain some metadata besides "body" (like author, date, primarykey,
> etc).
> 3) freely available.
>
> I was going to use the data from the previous Google programming
> contest, but it doesn't seem to be available.
>
> If I can't find anything satisfactory, I'll probably:
> - generate a fake whitepages phonebook
> - grab documents from project Gutenberg
>
> My preference is for some "real" data, but I'm happy to generate fake
> data if no-one has any better ideas.
>
How about http://dmoz.org/rdf, and specifically content.rdf.u8.gz? You
can find a parser/converter for this format in Nutch, but it's trivial
to write one yourself, as long as you use SAX (unless, of course, you
run it on a Cray or something :-) ).
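To make the SAX suggestion concrete, here is a minimal sketch of a streaming handler. SAX matters here because it hands over one element at a time instead of building the whole file in memory. The element and attribute names (ExternalPage, about, d:Title, d:Description, topic) are my assumption about the dump's layout and should be checked against the real file; the class name, the embedded sample and its namespace URI are made up for illustration. This is not the Nutch parser.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch; element names are assumptions about the dump format.
class DmozSaxSketch {

    /** Collects one field map per ExternalPage element. */
    static class PageHandler extends DefaultHandler {
        final List<Map<String, String>> pages = new ArrayList<>();
        private Map<String, String> current;
        private final StringBuilder text = new StringBuilder();

        @Override
        public void startElement(String uri, String local, String qName, Attributes atts) {
            text.setLength(0); // start collecting text for this element
            if (qName.equals("ExternalPage")) {
                current = new LinkedHashMap<>();
                current.put("url", atts.getValue("about"));
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length); // may be called several times per element
        }

        @Override
        public void endElement(String uri, String local, String qName) {
            if (current == null) return; // outside any ExternalPage
            switch (qName) {
                case "d:Title": current.put("title", text.toString().trim()); break;
                case "d:Description": current.put("description", text.toString().trim()); break;
                case "topic": current.put("topic", text.toString().trim()); break;
                case "ExternalPage": pages.add(current); current = null; break;
                default: break;
            }
        }
    }

    /** For the real dump, wrap the file stream in java.util.zip.GZIPInputStream first. */
    static List<Map<String, String>> parse(InputStream in) throws Exception {
        PageHandler handler = new PageHandler();
        // The default factory is not namespace-aware, so qName keeps its "d:" prefix.
        SAXParserFactory.newInstance().newSAXParser().parse(in, handler);
        return handler.pages;
    }

    // Tiny made-up sample in the assumed layout (the namespace URI is a placeholder).
    static final String SAMPLE =
        "<RDF xmlns:d=\"http://example.org/d#\">"
        + "<ExternalPage about=\"http://example.org/\">"
        + "<d:Title>Example</d:Title>"
        + "<d:Description>An example page.</d:Description>"
        + "<topic>Top/Test</topic>"
        + "</ExternalPage>"
        + "</RDF>";

    public static void main(String[] args) throws Exception {
        InputStream in = new ByteArrayInputStream(SAMPLE.getBytes(StandardCharsets.UTF_8));
        for (Map<String, String> page : parse(in)) {
            System.out.println(page);
        }
    }
}
```

For a full-size dump you would hand each page to the indexer inside endElement rather than keeping the whole list in memory. The topic path (e.g. Top/Test) carries the hierarchy.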
--
Best regards,
Andrzej Bialecki
-------------------------------------------------
Software Architect, System Integration Specialist
CEN/ISSS EC Workshop, ECIMF project chair
EU FP6 E-Commerce Expert/Evaluator
-------------------------------------------------
FreeBSD developer (http://www.freebsd.org)
Re: looking for a large test corpus for a lucene presentation
Posted by Stephane James Vaucher <va...@cirano.qc.ca>.
A few references:
http://www.daviddlewis.com/resources/testcollections/reuters21578/
http://www.daviddlewis.com/resources/testcollections/reuters21578/readme.txt
http://www-2.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/data/
http://www-2.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/data/news20.html
sv