Posted to openrelevance-dev@lucene.apache.org by Patrick Durusau <pa...@durusau.net> on 2011/06/12 17:31:56 UTC

ORP Background

Greetings!

Still catching up on background and have some questions:

Grant mentioned the ASF email archives the other day in response to a 
question I asked about using those as a corpus.

Reading the background document I see:

> We have started a preliminary crawl of Creative Commons content using 
> Nutch. This is currently hosted on a private machine, but we would 
> like to bring this "in house" to the ASF and have the ASF host both 
> the crawling and the dissemination of the data. This, obviously, will 
> need to be supported by the ASF infrastructure, as it is potentially 
> quite burdensome in terms of disk space and bandwidth.

Is that still an operational assumption for at least one corpus?

I ask because designing for and focusing on a single corpus, perhaps not 
the largest one possible, such as a subset of the email archives, could 
serve as a shakedown run to create processes and test assumptions, not 
to mention demonstrating the viability of the project to others.

Hope everyone is having a great weekend!

Patrick

PS: I know use of the TREC corpus was investigated. I know there are 
other corpus research projects. Has there been an effort to survey 
those for existing corpora with better licensing terms or for likely 
alliances? I am thinking there may be projects that would offer better 
terms in order to have the imprimatur of being part of an ASF umbrella 
corpora project.

-- 
Patrick Durusau
patrick@durusau.net
Chair, V1 - US TAG to JTC 1/SC 34
Convener, JTC 1/SC 34/WG 3 (Topic Maps)
Editor, OpenDocument Format TC (OASIS), Project Editor ISO/IEC 26300
Co-Editor, ISO/IEC 13250-1, 13250-5 (Topic Maps)

Another Word For It (blog): http://tm.durusau.net
Homepage: http://www.durusau.net
Twitter: patrickDurusau