Posted to openrelevance-dev@lucene.apache.org by Patrick Durusau <pa...@durusau.net> on 2011/06/10 01:24:45 UTC
Hosting and the Federal Register: was Re: Corpus suggestion
Grant,
On your comments about the Fed Register and hosting of a collection:
<snip>
>> But are those really representative of all the documents that are encountered in a modern searching context?
> No. Well, the newswire ones probably still emulate current newswire and the Fed Register probably still simulates that kind of stuff, albeit w/ updated language.
>
1) The 1989 Federal Register used on the NIST disks isn't online. Is the
idea to re-use other relevance judgments from prior work? If not, then
the availability of the 1989 Federal Register is a moot point.
2) Other than its presence in the NIST collection, is there some other
reason for choosing the Federal Register as a resource?
The Federal Register is composed of presidential documents, notices of
new regulations (from all departments), and announcements of various
sorts. The regulation part feeds into the Code of Federal Regulations,
and both are available as annual bulk XML:
Federal Register: http://www.gpo.gov/fdsys/bulkdata/FR
Code of Federal Regulations: http://www.gpo.gov/fdsys/bulkdata/CFR
Judging "relevance" against specialized materials is doable, but it will
depend on how many people can be attracted to contribute the "relevance"
judgments that then underlie analysis of the corpus.
The Federal Register caught my eye because I was familiar with it a very
long time ago.
<snip>
>> There are other text collections that could be used but it occurred to me that starting close to home might avoid some of the licensing issues that were troublesome in the past.
> Definitely. What we need is a way of gathering judgments as well as collecting queries, etc.
>
> I think we should also take the public NIST ones and host all of them here as well, along w/ judgments and queries so that it all just works seamlessly.
>
Well, but the NIST, Brown, and other corpus efforts, in addition to
being hobbled by last-century licensing agreements and dated data, were
products of a time when gathering that much electronic data was a
non-trivial task. So it was important to deliver each as a set.
I see no reason why we could not publish a checksum against, say, a
particular year of the Federal Register (if you want to include it) and
allow the data to stay where it is.
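To make the checksum idea concrete, here is a minimal sketch of how a
published fingerprint might be computed, using Python's standard-library
hashlib. The function name and chunk size are my own choices, not
anything agreed in this thread; the point is only that a hex digest
published next to the GPO URL would let anyone verify they hold the
same snapshot without us hosting the data.

```python
import hashlib

def file_checksum(path, algorithm="sha256", chunk_size=65536):
    """Compute a checksum of a file without loading it all into memory.

    Reads the file in fixed-size chunks so it works on multi-gigabyte
    bulk XML archives as well as small files.
    """
    digest = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

One would run this once over the downloaded annual archive and publish
the resulting digest alongside the source URL; a mismatch on someone
else's copy signals a different or corrupted snapshot.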
Unless by "...it just works seamlessly" you are envisioning some hosted
dataset + queries + relevance measures sort of setup.
I am sure that is possible, but it introduces a layer of complexity on
top of identifying datasets, creating relevance measures for those
datasets, and defining the queries that serve as a baseline for
judgments of relevance.
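For what the "queries + relevance measures" piece might look like in
practice: the NIST/TREC collections distribute judgments as plain-text
"qrels" files, one judgment per line in the form
"topic iteration docno relevance". A minimal sketch of a reader for
that format follows; the docno values in the test are made up for
illustration, not real Federal Register identifiers.

```python
def parse_qrels(lines):
    """Parse TREC-style qrels lines into {topic: {docno: relevance}}.

    Each non-blank line has four whitespace-separated fields:
    topic id, iteration (usually ignored), document number, and an
    integer relevance judgment (0 = not relevant, 1 = relevant).
    """
    judgments = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        topic, _iteration, docno, relevance = line.split()
        judgments.setdefault(topic, {})[docno] = int(relevance)
    return judgments
```

A hosted setup would essentially serve files like these next to the
topics and the dataset checksums, so that evaluation "just works" from
one place.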
Hope you are having a great day!
Patrick
--
Patrick Durusau
patrick@durusau.net
Chair, V1 - US TAG to JTC 1/SC 34
Convener, JTC 1/SC 34/WG 3 (Topic Maps)
Editor, OpenDocument Format TC (OASIS), Project Editor ISO/IEC 26300
Co-Editor, ISO/IEC 13250-1, 13250-5 (Topic Maps)
Another Word For It (blog): http://tm.durusau.net
Homepage: http://www.durusau.net
Twitter: patrickDurusau