Posted to dev@lucene.apache.org by Jos van der Meer <jm...@aidministrator.nl> on 2003/07/30 15:19:48 UTC

Free, medium size, downloadable corpus of newspaper articles ?

For my experiments with Lucene, I would like to have a publicly available,
free, medium-sized, downloadable corpus of newspaper articles
(the topics do not matter, nor do the publication dates).

I would like to share the results of the experiments, and people
should be able to reproduce and extend them.

Don't send the corpora themselves (..), but please send me their URLs.

Thanks in advance,


jos.van.der.meer@aidministrator.nl
aidministrator nederland bv  -  http://www.aidministrator.nl/
prinses julianaplein 14-b, 3817 cs amersfoort, the netherlands
tel. +31-(0)33-4659987   fax. +31-(0)33-4659987


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-dev-help@jakarta.apache.org


Re: Free, medium size, downloadable corpus of newspaper articles ?

Posted by Peter Becker <pb...@dstc.edu.au>.
[redirected to lucene-user]

Me, too! :-)

We are currently playing with the small Reuters collection (about 21,500 
news items from the 80s), but I don't know if I am allowed to distribute 
it and it is too small anyway -- many of the implications we find are 
based on 1 to 3 documents. I still have a collection of 5 CDs I 
downloaded from Google about 2 years ago, which I still haven't found 
the time to look at -- it probably will be one of the next things to do. 
I can see what their licence says about redistribution.

Another good source for collections is 
http://www.kdnuggets.com/datasets/index.html -- but I haven't found a 
good news corpus yet.

I am definitely interested in comparing some results on whatever 
collection we can find. I'll talk to our IR guys in the next few days to 
see what they know of.

  Peter



Jos van der Meer wrote:

>For my experiments with Lucene, I would like to have a publicly available,
>free, medium-sized, downloadable corpus of newspaper articles
>(the topics do not matter, nor do the publication dates).
>
>I would like to share the results of the experiments, and people
>should be able to reproduce and extend them.
>
>Don't send the corpora themselves (..), but please send me their URLs.


