Posted to openrelevance-dev@lucene.apache.org by Patrick Durusau <pa...@durusau.net> on 2011/06/10 01:24:45 UTC
Hosting and the Federal Register: was Re: Corpus suggestion
Grant,
On your comments about the Fed Register and hosting of a collection:
<snip>
>> But are those really representative of all the documents that are encountered in a modern searching context?
> No. Well, the newswire ones probably still emulate current newswire and the Fed Register probably still simulates that kind of stuff, albeit w/ updated language.
>
1) The 1989 Federal Register used on the NIST disks isn't online. Is the
idea to re-use other relevance judgments from prior work? If not, then
the availability of the 1989 Federal Register is a moot point.
2) Other than its presence in the NIST collection, is there some other
reason for choosing the Federal Register as a resource?
The Federal Register is composed of presidential documents, notices of
new regulations (from all departments), and announcements of various
sorts. The regulation part feeds into the Code of Federal Regulations,
and both are available as annual bulk XML:
Federal Register: http://www.gpo.gov/fdsys/bulkdata/FR
Code of Federal Regulations: http://www.gpo.gov/fdsys/bulkdata/CFR
Judging "relevance" against specialized materials is doable, but it will
depend on how many people can be attracted to contribute the "relevance"
judgments that then underlie analysis of the corpus.
The Federal Register caught my eye because I was familiar with it a very
long time ago.
<snip>
>> There are other text collections that could be used but it occurred to me that starting close to home might avoid some of the licensing issues that were troublesome in the past.
> Definitely. What we need is a way of gathering judgments as well as collecting queries, etc.
>
> I think we should also take the public NIST ones and host all of them here as well, along w/ judgments and queries so that it all just works seamlessly.
>
Well, but the NIST, Brown, and other corpus efforts, in addition to
being hobbled by last-century licensing agreements and dated data, were
products of a time when gathering that much electronic data was a
non-trivial task. So it was important to deliver each as a set.
I see no reason why we could not publish a checksum against, say, a
particular year of the Federal Register (if you want to include it) and
allow the data to stay where it is.
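To make the checksum idea concrete, here is a minimal sketch of how a
published fingerprint might be computed, using Python's standard-library
hashlib. The function name and chunk size are my own choices, not
anything agreed in this thread; the point is only that a hex digest
published next to the GPO URL would let anyone verify they hold the
same snapshot without us hosting the data.

```python
import hashlib

def file_checksum(path, algorithm="sha256", chunk_size=65536):
    """Compute a checksum of a file without loading it all into memory.

    Reads the file in fixed-size chunks so it works on multi-gigabyte
    bulk XML archives as well as small files.
    """
    digest = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

One would run this once over the downloaded annual archive and publish
the resulting digest alongside the source URL; a mismatch on someone
else's copy signals a different or corrupted snapshot.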
Unless by "...it just works seamlessly" you are envisioning some hosted
dataset + queries + relevance measures sort of setup.
I am sure that is possible, but it introduces a layer of complexity on
top of identifying datasets, creating relevance measures for those
datasets, and defining the queries that serve as a baseline for
judgments of relevance.
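For what the "queries + relevance measures" piece might look like in
practice: the NIST/TREC collections distribute judgments as plain-text
"qrels" files, one judgment per line in the form
"topic iteration docno relevance". A minimal sketch of a reader for
that format follows; the docno values in the test are made up for
illustration, not real Federal Register identifiers.

```python
def parse_qrels(lines):
    """Parse TREC-style qrels lines into {topic: {docno: relevance}}.

    Each non-blank line has four whitespace-separated fields:
    topic id, iteration (usually ignored), document number, and an
    integer relevance judgment (0 = not relevant, 1 = relevant).
    """
    judgments = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        topic, _iteration, docno, relevance = line.split()
        judgments.setdefault(topic, {})[docno] = int(relevance)
    return judgments
```

A hosted setup would essentially serve files like these next to the
topics and the dataset checksums, so that evaluation "just works" from
one place.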
Hope you are having a great day!
Patrick
--
Patrick Durusau
patrick@durusau.net
Chair, V1 - US TAG to JTC 1/SC 34
Convener, JTC 1/SC 34/WG 3 (Topic Maps)
Editor, OpenDocument Format TC (OASIS), Project Editor ISO/IEC 26300
Co-Editor, ISO/IEC 13250-1, 13250-5 (Topic Maps)
Another Word For It (blog): http://tm.durusau.net
Homepage: http://www.durusau.net
Twitter: patrickDurusau