Posted to openrelevance-dev@lucene.apache.org by ab...@apache.org on 2010/02/08 17:12:15 UTC

svn commit: r907712 - in /lucene/openrelevance/trunk/collections/ohsumed: ./ src/ src/java/ src/java/org/ src/java/org/apache/ src/java/org/apache/or/ src/java/org/apache/or/collections/ src/java/org/apache/or/collections/ohsumed/

Author: ab
Date: Mon Feb  8 16:12:14 2010
New Revision: 907712

URL: http://svn.apache.org/viewvc?rev=907712&view=rev
Log:
Add TREC-9 / OHSUMED collection.

Added:
    lucene/openrelevance/trunk/collections/ohsumed/
    lucene/openrelevance/trunk/collections/ohsumed/LICENSE.txt   (with props)
    lucene/openrelevance/trunk/collections/ohsumed/OHSU_TREC9_README.txt   (with props)
    lucene/openrelevance/trunk/collections/ohsumed/README.txt   (with props)
    lucene/openrelevance/trunk/collections/ohsumed/build.xml   (with props)
    lucene/openrelevance/trunk/collections/ohsumed/src/
    lucene/openrelevance/trunk/collections/ohsumed/src/java/
    lucene/openrelevance/trunk/collections/ohsumed/src/java/org/
    lucene/openrelevance/trunk/collections/ohsumed/src/java/org/apache/
    lucene/openrelevance/trunk/collections/ohsumed/src/java/org/apache/or/
    lucene/openrelevance/trunk/collections/ohsumed/src/java/org/apache/or/collections/
    lucene/openrelevance/trunk/collections/ohsumed/src/java/org/apache/or/collections/ohsumed/
    lucene/openrelevance/trunk/collections/ohsumed/src/java/org/apache/or/collections/ohsumed/OhsumedCorpusConverter.java   (with props)
    lucene/openrelevance/trunk/collections/ohsumed/src/java/org/apache/or/collections/ohsumed/OhsumedQrelConverter.java   (with props)
    lucene/openrelevance/trunk/collections/ohsumed/src/java/org/apache/or/collections/ohsumed/OhsumedTopicConverter.java   (with props)

Added: lucene/openrelevance/trunk/collections/ohsumed/LICENSE.txt
URL: http://svn.apache.org/viewvc/lucene/openrelevance/trunk/collections/ohsumed/LICENSE.txt?rev=907712&view=auto
==============================================================================
--- lucene/openrelevance/trunk/collections/ohsumed/LICENSE.txt (added)
+++ lucene/openrelevance/trunk/collections/ohsumed/LICENSE.txt Mon Feb  8 16:12:14 2010
@@ -0,0 +1,9 @@
+There is no explicit licensing information at the TREC site; the statement below was
+taken from the original OHSUMED corpus available here:
+
+	http://ir.ohsu.edu/ohsumed/ohsumed.html 
+
+The National Library of Medicine has agreed to make the MEDLINE references in the
+test database available for experimentation, restricted to the following conditions:
+1.  The data will not be used in any non-experimental clinical, library, or other setting.
+2.  Any human users of the data will explicitly be told that the data is incomplete and out-of-date.

Propchange: lucene/openrelevance/trunk/collections/ohsumed/LICENSE.txt
------------------------------------------------------------------------------
    svn:eol-style = native

Added: lucene/openrelevance/trunk/collections/ohsumed/OHSU_TREC9_README.txt
URL: http://svn.apache.org/viewvc/lucene/openrelevance/trunk/collections/ohsumed/OHSU_TREC9_README.txt?rev=907712&view=auto
==============================================================================
--- lucene/openrelevance/trunk/collections/ohsumed/OHSU_TREC9_README.txt (added)
+++ lucene/openrelevance/trunk/collections/ohsumed/OHSU_TREC9_README.txt Mon Feb  8 16:12:14 2010
@@ -0,0 +1,212 @@
+This README file describes all the data files associated with the
+OHSUMED document collection as it was used for the TREC-9
+Filtering Track.  Please see "The TREC-9 Filtering Track Final
+Report" by Stephen Robertson and David A. Hull in the TREC-9
+proceedings for a description of the tasks performed in the track.
+
+(A) Description of the OHSUMED document collection (files: ohsumed.*)
+
+The OHSUMED test collection is a set of 348,566 references from
+MEDLINE, the on-line medical information database, consisting of
+titles and/or abstracts from 270 medical journals over a five-year
+period (1987-1991). The available fields are title, abstract, MeSH
+indexing terms, author, source, and publication type. The National
+Library of Medicine has agreed to make the MEDLINE references in the
+test database available for experimentation, restricted to the
+following conditions:
+
+1. The data will not be used in any non-experimental clinical,
+library, or other setting.
+2.  Any human users of the data will explicitly be told that the data
+is incomplete and out-of-date.
+
+The OHSUMED document collection was obtained by William Hersh
+(hersh@OHSU.EDU) and colleagues for the experiments described in the
+papers below:
+
+Hersh WR, Buckley C, Leone TJ, Hickam DH, OHSUMED: An interactive
+retrieval evaluation and new large test collection for research, 
+Proceedings of the 17th Annual ACM SIGIR Conference, 1994, 192-201.
+
+Hersh WR, Hickam DH, Use of a multi-application computer workstation
+in a clinical setting, Bulletin of the Medical Library Association,
+1994, 82: 382-389.
+
+Here are the field definitions:
+
+ .I      sequential identifier 
+	 (important note: documents should be processed in this order)
+ .U      MEDLINE identifier (UI) 
+	 (<DOCNO> used for relevance judgements)
+ .M      Human-assigned MeSH terms (MH)
+ .T      Title (TI)
+ .P      Publication type (PT)
+ .W      Abstract (AB)
+ .A      Author (AU)
+ .S      Source (SO)
+
+Note: some abstracts are truncated at 250 words and some references
+have no abstracts at all (titles only). We do not have access to the
+full text of the documents.
+
+(B) Description of the topic statements (files: query.*)
+
+There were three different sets of filtering topics for the
+TREC-9 Filtering track: 
+(1) a subset of 63 of the original query set developed by Hersh et al.
+    for their IR experiments (OHSUMED),
+(2) a set of 4904 MeSH terms and their definitions (MSH), and
+(3) a subset of 500 of the MeSH terms (MSH-SMP).
+ 
+The existing OHSUMED topics describe actual information needs, but the
+relevance judgements probably do not have the same coverage provided
+by the TREC pooling process. The MeSH terms do not directly represent
+information needs, rather they are controlled indexing terms. However,
+the assessment should be more or less complete and there are a lot of
+them, so this provides an unusual opportunity to work with a very
+large topic sample.
+
+The topic statements are provided in the standard TREC format and
+consist of <title> and <desc> (= description) fields only. The meaning
+of these fields is slightly different for each query type.
+
+(1) OHSUMED topics (files: query.ohsu.*)
+
+<title> = patient description
+<desc>  = information request
+
+The test collection was built as part of a study assessing the use of
+MEDLINE by physicians in a clinical setting (Hersh and Hickam, above).
+Novice physicians using MEDLINE generated 106 queries. Before they
+searched, they were asked to provide a statement of information about
+their patient as well as their information need. Only a subset of
+these queries was used in the TREC-9 Filtering Track.
+
+(2) MeSH topics (files: query.mesh.*)
+
+<title> = MeSH concept name
+<desc> = MeSH scope note, a definition of the concept (source: MeSH 2000)
+
+The National Library of Medicine has authorized us to use a subset of
+the MeSH 2000 scope notes for Filtering Track experiments with the
+OHSUMED collection. If you wish to use the MeSH scope notes for any
+other purpose, please visit the NLM Web Site,
+		http://www.nlm.nih.gov/mesh/
+sign the attached Memorandum of Understanding, and download the full
+MeSH 2000 database directly from the source.
+
+The subset of the MeSH topics used for the MSH-SMP runs is defined
+by the file "sample.map".  The perl script mesh-sample.prl will
+produce a file containing only the 500 topics in the subset
+from the file containing the full set of 4904 topics.
+
+(3) Use of MeSH term field (.M) during filtering
+
+TREC-9 filtering track participants were allowed to use
+the MeSH term field (.M) during the filtering of the
+OHSU topic set provided the use of the field was noted in 
+the run description.  The entire MeSH term field was *not*
+allowed to be accessed during the filtering of the MeSH topic set.
+Information on the presence or absence of the specific MeSH term
+represented in the filtering topic is contained in the relevant
+document files described below (simulating human judgement).
+
+(C) Description of the relevance judgements (files: qrels.*)
+
+The format of the relevance judgements is slightly different for the
+two topic sets.
+
+(1) OHSUMED relevance judgements (files: qrels.ohsu.*)
+
+Format: <topic-ID> \t <DOCNO> \t <Relevant> \n
+
+<DOCNO> - MEDLINE identifier (.U/UI)
+<Relevant> - 1 = possibly relevant, 2 = definitely relevant
+
+Each query was replicated by four searchers, two physicians
+experienced in searching and two medical librarians.  The results were
+assessed for relevance by a different group of physicians, using a
+three point scale: definitely, possibly, or not relevant.  The list of
+documents explicitly judged to be not relevant is not provided here.
+Over 10% of the query-document pairs were judged in duplicate to
+assess inter-observer reliability.  For evaluation, all documents
+judged here as either possibly or definitely relevant were
+considered relevant.  TREC-9 systems were allowed to distinguish
+between these two categories during the learning process if desired.
+
+(2) MeSH relevance judgments (files: qrels.mesh.*)
+
+Format: <topic-ID> \t <DOCNO> \n
+
+<DOCNO> - MEDLINE identifier (.U/UI)
+
+A document is considered "relevant" to a MeSH "topic" if the MeSH
+concept name is listed in the MeSH term field (.M) of the document.
+Please note that the MeSH concepts form a hierarchy. It is common
+practice to index a document *only* by the most specific MeSH concept
+that is relevant.
+
+(D) Description of the ohsu-trec directories
+
+Here we describe the contents of the three sub-directories of
+ohsu-trec.
+
+(1) pre-test - directory of material for preliminary system testing
+
+ ohsumed.87@ - MEDLINE references from 1987 (note: this is a symbolic
+               link to the actual document file located in trec9-train)
+
+ query.ohsu.test.1-43  - set of 43 OHSU test topics
+ query.mesh.test.1-119 - set of 119 MeSH test topics
+
+ qrels.ohsu.test.87 - relevance judgements for OHSU test topics (1987)
+ qrels.mesh.test.87 - relevance judgements for MeSH test topics (1987)
+
+This directory is intended for people interested in doing some
+preliminary testing of their filtering system on this domain.
+Important note: the test topics available here are *not* an unbiased
+sample of the TREC-9 topics. In particular, they are the ones that
+were specifically rejected from the official runs for a variety of
+reasons (usually because they had too many or too few relevance
+judgments). Therefore, they should not be used for optimizing system
+parameters, just for general tests to make sure that the system is
+functioning properly.
+
+(2) trec9-train - directory of TREC-9 training material
+
+ ohsumed.87 - MEDLINE references from 1987
+
+ query.ohsu.1-63     - set of 63 TREC-9 OHSU topics
+ query.mesh.1-4904   - set of 4904 TREC-9 MeSH topics
+
+ qrels.ohsu.adapt.87 - training qrels / OHSU / adaptive filtering (1987)
+ qrels.ohsu.batch.87 - training qrels / OHSU / batch filtering (1987)
+ qrels.mesh.adapt.87 - training qrels / MeSH / adaptive filtering (1987)
+ qrels.mesh.batch.87 - training qrels / MeSH / batch filtering (1987)
+
+This directory contains all the training material for the TREC-9
+filtering task. Routing systems should use the same data as the batch
+filtering systems. The 1987 OHSUMED documents are intended for
+training purposes only. The batch filtering qrels files contain all
+the evaluated documents for the 1987 collection. The OHSU qrels for
+adaptive filtering contain two documents judged definitely relevant
+for each topic. The MeSH qrels for adaptive filtering contain four
+documents assigned to each topic. In both cases, the training samples
+extracted for adaptive filtering were selected by random sampling.
+TREC-9 participants were allowed to use the 1987 OHSUMED collection
+for generating collection summary statistics (such as IDF) or other
+purposes (for adaptive filtering runs such use had to be declared).
+
+(3) trec9-test  - directory of TREC-9 test material
+
+ ohsumed.88-91 - MEDLINE references from 1988-1991
+
+ qrels.ohsu.88-91 - relevance judgements for OHSU topics
+ qrels.mesh.88-91 - relevance judgements for MeSH topics
+
+This directory contains the documents and relevance judgements used
+to run the official TREC-9 Filtering Track experiments.  TREC-9
+participants were allowed to use the relevance judgement for a
+document only after that document was retrieved.  Relevance
+judgements from documents not retrieved were never accessed
+(except for the final evaluation).

Propchange: lucene/openrelevance/trunk/collections/ohsumed/OHSU_TREC9_README.txt
------------------------------------------------------------------------------
    svn:eol-style = native
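
Editorial note on the field layout described in OHSU_TREC9_README.txt above:
the record structure can be illustrated with a small reader. This is only a
minimal sketch, not part of this commit. It assumes that ".I <n>" carries its
value on the same line while every other marker (.U, .M, .T, .P, .W, .A, .S)
stands alone on a line with its value on the following lines, which is what
OhsumedCorpusConverter in this commit also assumes. The class name
OhsumedRecordSketch and the sample record are made up for illustration.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: splits OHSUMED text into per-record maps keyed by
// field marker (".I", ".U", ".T", ...). Multi-line values are joined with '\n'.
class OhsumedRecordSketch {

  // A marker is a dot followed by one upper-case letter, alone on its line.
  static boolean isMarker(String s) {
    return s.length() == 2 && s.charAt(0) == '.' && Character.isUpperCase(s.charAt(1));
  }

  static List<Map<String, String>> parse(BufferedReader in) throws IOException {
    List<Map<String, String>> records = new ArrayList<Map<String, String>>();
    Map<String, String> current = null;
    String field = null;
    String line;
    while ((line = in.readLine()) != null) {
      String trimmed = line.trim();
      if (line.startsWith(".I ")) {
        // Sequential identifier: value is on the same line, starts a new record.
        current = new LinkedHashMap<String, String>();
        current.put(".I", line.substring(3).trim());
        records.add(current);
        field = null;
      } else if (isMarker(trimmed)) {
        // Any other marker: its value follows on the next line(s).
        field = trimmed;
      } else if (current != null && field != null) {
        String previous = current.get(field);
        current.put(field, previous == null ? line : previous + "\n" + line);
      }
      // Lines seen before the first marker are ignored.
    }
    return records;
  }

  public static void main(String[] args) throws IOException {
    // Made-up sample record, not taken from the corpus.
    String sample = ".I 1\n.U\n12345678\n.T\nAn example title.\n.W\nFirst line.\nSecond line.\n";
    for (Map<String, String> record : parse(new BufferedReader(new StringReader(sample)))) {
      System.out.println(record.get(".U") + " -> " + record.get(".T"));
    }
  }
}
```

Keeping the raw marker as the map key avoids committing to a meaning for each
field; callers can map ".U" to the DOCNO used by the qrels as needed.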

Added: lucene/openrelevance/trunk/collections/ohsumed/README.txt
URL: http://svn.apache.org/viewvc/lucene/openrelevance/trunk/collections/ohsumed/README.txt?rev=907712&view=auto
==============================================================================
--- lucene/openrelevance/trunk/collections/ohsumed/README.txt (added)
+++ lucene/openrelevance/trunk/collections/ohsumed/README.txt Mon Feb  8 16:12:14 2010
@@ -0,0 +1,22 @@
+This is the version of the OHSUMED corpus as used during the TREC-9 filtering track. The corpus can
+be obtained from this page:
+
+	http://trec.nist.gov/data/t9_filtering.html
+
+Please see the original OHSU_TREC9_README.txt for detailed information about the corpus.
+
+The build process builds two corpora from this collection: one that uses the trec9-train/
+data, and the other that uses trec9-test data.
+
+There are two types of topics (queries) in this collection, and they are significantly
+different. The MeSH topics contain just the MeSH concept in the title, which quite often
+doesn't occur in the relevant documents - instead these documents match terms
+from the topic's "description" field. The OHSU topics often use colloquial and
+inconsistent abbreviations such as "60 yo" for "60 year old" (but often also
+"60 y o" or "60 yr old"). In this case as well, the matching terms appear only in
+the description field of the topic and not in the title.
+
+The description of the TREC filtering track underlines that qrels are NOT ranked by
+relevance; instead, they simply list relevant documents in random order. Therefore any
+metrics that assume a ranked retrieval will either require some preprocessing step
+(such as sorting of qrels by relevance+docId) or be inapplicable to this corpus.

Propchange: lucene/openrelevance/trunk/collections/ohsumed/README.txt
------------------------------------------------------------------------------
    svn:eol-style = native
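
Editorial note on the abbreviation problem mentioned in README.txt above: one
possible way to reduce the mismatch between OHSU topic titles and document text
is to normalize the age abbreviations before querying. The following is a
minimal sketch under that assumption, not something this commit does. The class
name AgeAbbreviationSketch is hypothetical, and the pattern covers only the
three variants the README lists ("60 yo", "60 y o", "60 yr old").

```java
import java.util.regex.Pattern;

// Hypothetical preprocessing helper: rewrites "<n> yo", "<n> y o" and
// "<n> yr old" to "<n> year old". Other abbreviations are left untouched.
class AgeAbbreviationSketch {

  private static final Pattern AGE = Pattern.compile(
      "\\b(\\d+)\\s*(?:y\\s?o|yr\\s+old)\\b", Pattern.CASE_INSENSITIVE);

  static String normalize(String text) {
    // $1 is the captured number; the abbreviation itself is replaced.
    return AGE.matcher(text).replaceAll("$1 year old");
  }

  public static void main(String[] args) {
    System.out.println(normalize("60 yo male with anemia"));
  }
}
```

Whether such normalization actually helps retrieval on this corpus is untested
here; it would need to be applied consistently to topics and, if desired, to
the indexed text.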

Added: lucene/openrelevance/trunk/collections/ohsumed/build.xml
URL: http://svn.apache.org/viewvc/lucene/openrelevance/trunk/collections/ohsumed/build.xml?rev=907712&view=auto
==============================================================================
--- lucene/openrelevance/trunk/collections/ohsumed/build.xml (added)
+++ lucene/openrelevance/trunk/collections/ohsumed/build.xml Mon Feb  8 16:12:14 2010
@@ -0,0 +1,97 @@
+<?xml version="1.0"?>
+
+<!--
+    Licensed to the Apache Software Foundation (ASF) under one or more
+    contributor license agreements.  See the NOTICE file distributed with
+    this work for additional information regarding copyright ownership.
+    The ASF licenses this file to You under the Apache License, Version 2.0
+    (the "License"); you may not use this file except in compliance with
+    the License.  You may obtain a copy of the License at
+ 
+        http://www.apache.org/licenses/LICENSE-2.0
+ 
+    Unless required by applicable law or agreed to in writing, software
+    distributed under the License is distributed on an "AS IS" BASIS,
+    WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+    See the License for the specific language governing permissions and
+    limitations under the License.
+ -->
+
+<project name="ohsumed" default="dist">
+
+  <import file="../collections-build.xml"/>
+
+  <property name="t9" location="${build.dir}/download/t9.filtering.tar.gz"/>
+  <available file="${t9}" property="t9.exists"/>
+
+  <target name="fetch" unless="t9.exists">
+    <mkdir dir="${build.dir}/download"/>
+    <get src="http://trec.nist.gov/data/filtering/t9.filtering.tar.gz"
+         dest="${t9}"/>
+  </target>
+
+  <target name="extract" depends="fetch">
+    <untar src="${t9}" dest="${build.dir}/extracted" compression="gzip">
+      <patternset>
+        <include name="ohsu-trec/trec9-*/*"/>
+      </patternset>
+    </untar>
+  </target>
+
+  <target name="dist" depends="compile,extract">
+    <mkdir dir="${dist.dir}"/>
+    <java classname="org.apache.or.collections.ohsumed.OhsumedCorpusConverter">
+      <arg value="${build.dir}/extracted/ohsu-trec/trec9-train/ohsumed.87"/>
+      <arg value="${dist.dir}/train-corpus.gz"/>
+      <classpath refid="classpath"/>
+    </java>
+    <java classname="org.apache.or.collections.ohsumed.OhsumedCorpusConverter">
+      <arg value="${build.dir}/extracted/ohsu-trec/trec9-test/ohsumed.88-91"/>
+      <arg value="${dist.dir}/test-corpus.gz"/>
+      <classpath refid="classpath"/>
+    </java>
+    <java classname="org.apache.or.collections.ohsumed.OhsumedTopicConverter">
+      <arg value="${build.dir}/extracted/ohsu-trec/trec9-train/query.ohsu.1-63"/>
+      <arg value="${dist.dir}/queries-ohsu.txt"/>
+      <classpath refid="classpath"/>
+    </java>
+    <java classname="org.apache.or.collections.ohsumed.OhsumedTopicConverter">
+      <arg value="${build.dir}/extracted/ohsu-trec/trec9-train/query.mesh.1-4904"/>
+      <arg value="${dist.dir}/queries-mesh.txt"/>
+      <classpath refid="classpath"/>
+    </java>
+    <concat destfile="${dist.dir}/queries.txt" fixlastline="yes"
+        encoding="UTF-8" outputencoding="UTF-8">
+      <filelist dir="${dist.dir}" files="queries-ohsu.txt,queries-mesh.txt"/>
+    </concat>
+    <java classname="org.apache.or.collections.ohsumed.OhsumedQrelConverter">
+      <arg value="${build.dir}/extracted/ohsu-trec/trec9-train/qrels.mesh.batch.87"/>
+      <arg value="${dist.dir}/train-judgements-mesh.txt"/>
+      <classpath refid="classpath"/>
+    </java>
+    <java classname="org.apache.or.collections.ohsumed.OhsumedQrelConverter">
+      <arg value="${build.dir}/extracted/ohsu-trec/trec9-train/qrels.ohsu.batch.87"/>
+      <arg value="${dist.dir}/train-judgements-ohsu.txt"/>
+      <classpath refid="classpath"/>
+    </java>
+    <concat destfile="${dist.dir}/train-judgements.txt" fixlastline="yes"
+        encoding="UTF-8" outputencoding="UTF-8">
+      <filelist dir="${dist.dir}" files="train-judgements-ohsu.txt,train-judgements-mesh.txt"/>
+    </concat>
+    <java classname="org.apache.or.collections.ohsumed.OhsumedQrelConverter">
+      <arg value="${build.dir}/extracted/ohsu-trec/trec9-test/qrels.ohsu.88-91"/>
+      <arg value="${dist.dir}/test-judgements-ohsu.txt"/>
+      <classpath refid="classpath"/>
+    </java>
+    <java classname="org.apache.or.collections.ohsumed.OhsumedQrelConverter">
+      <arg value="${build.dir}/extracted/ohsu-trec/trec9-test/qrels.mesh.88-91"/>
+      <arg value="${dist.dir}/test-judgements-mesh.txt"/>
+      <classpath refid="classpath"/>
+    </java>
+    <concat destfile="${dist.dir}/test-judgements.txt" fixlastline="yes"
+        encoding="UTF-8" outputencoding="UTF-8">
+      <filelist dir="${dist.dir}" files="test-judgements-ohsu.txt,test-judgements-mesh.txt"/>
+    </concat>
+  </target>
+
+</project>

Propchange: lucene/openrelevance/trunk/collections/ohsumed/build.xml
------------------------------------------------------------------------------
    svn:eol-style = native

Added: lucene/openrelevance/trunk/collections/ohsumed/src/java/org/apache/or/collections/ohsumed/OhsumedCorpusConverter.java
URL: http://svn.apache.org/viewvc/lucene/openrelevance/trunk/collections/ohsumed/src/java/org/apache/or/collections/ohsumed/OhsumedCorpusConverter.java?rev=907712&view=auto
==============================================================================
--- lucene/openrelevance/trunk/collections/ohsumed/src/java/org/apache/or/collections/ohsumed/OhsumedCorpusConverter.java (added)
+++ lucene/openrelevance/trunk/collections/ohsumed/src/java/org/apache/or/collections/ohsumed/OhsumedCorpusConverter.java Mon Feb  8 16:12:14 2010
@@ -0,0 +1,138 @@
+package org.apache.or.collections.ohsumed;
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+import java.io.BufferedReader;
+import java.io.File;
+import java.io.FileInputStream;
+import java.io.InputStreamReader;
+import java.util.Date;
+import java.util.HashMap;
+import java.util.HashSet;
+import java.util.Map;
+
+import org.apache.or.util.TrecDocument;
+import org.apache.or.util.TrecDocumentWriter;
+
+public class OhsumedCorpusConverter {
+  
+  private static final String OHSU_SEQID = ".I "; // the only single-line field
+  private static final String OHSU_DOCID = ".U";
+  private static final String OHSU_SUBJECT = ".S";
+  private static final String OHSU_MESH = ".M";
+  private static final String OHSU_TITLE = ".T";
+  private static final String OHSU_TYPE = ".P";
+  private static final String OHSU_BODY = ".W";
+  private static final String OHSU_AUTHORS = ".A";
+  
+  private static final HashSet<String> multiLine = new HashSet<String>();
+  static {
+    multiLine.add(OHSU_DOCID);
+    multiLine.add(OHSU_SUBJECT);
+    multiLine.add(OHSU_MESH);
+    multiLine.add(OHSU_TITLE);
+    multiLine.add(OHSU_TYPE);
+    multiLine.add(OHSU_BODY);
+    multiLine.add(OHSU_AUTHORS);
+  }
+  
+  private static TrecDocument doc = new TrecDocument();
+  private static Date date = new Date(); // this corpus does not have a date, use a fake one.
+  
+  public static void main(String[] args) throws Exception {
+    if (args.length < 2) { // both <inputFile> and <outputFile> are required
+      System.err.println("Usage: OhsumedCorpusConverter <inputFile> <outputFile>");
+      System.err.println("\tinputFile\tone of the ohsumed.87 or ohsumed.88-91 files");
+      System.err.println("\toutputFile\toutput to store the converted corpus. NOTE: will be silently overwritten if exists!");
+      System.exit(-1);
+    }
+    BufferedReader in = new BufferedReader(new InputStreamReader(
+            new FileInputStream(args[0]), "UTF-8"));
+    TrecDocumentWriter writer = new TrecDocumentWriter(new File(args[1]));
+    
+    String line = null;
+    String did = null;
+    StringBuilder body = new StringBuilder();
+    HashMap<String, StringBuilder> fields = new HashMap<String, StringBuilder>();
+    String curField = null;
+    while ((line = in.readLine()) != null) {
+      if (line.startsWith(OHSU_SEQID)) { // new document
+        if (!fields.isEmpty()) {
+          writeDocument(fields, writer);
+          fields.clear();
+        }
+        fields.put(OHSU_SEQID, new StringBuilder(line.substring(OHSU_SEQID.length())));
+      } else {
+        if (line.length() > 1 && line.charAt(0) == '.' && Character.isUpperCase(line.charAt(1))) { // field id, for multi-line fields
+          line = line.trim();
+          if (multiLine.contains(line)) {
+            curField = line;
+          } else {
+            System.err.println("Invalid field name: " + line + ", skipping ...");
+            curField = null;
+          }
+          continue;
+        } else {
+          if (curField == null) continue; // value of an unrecognized field, skip it
+          StringBuilder sb = fields.get(curField); // value of the current field
+          if (sb == null) {
+            sb = new StringBuilder();
+            fields.put(curField, sb);
+          } else {
+            sb.append('\n');
+          }
+          sb.append(line);
+        }
+      }
+    }
+    if (!fields.isEmpty()) {
+      writeDocument(fields, writer);
+    }
+    in.close();
+    writer.close();
+  }
+  
+  // for now glue title + body + authors + MeSH terms - this is primitive, but
+  // probably better than ignoring everything except the body ...
+  private static void writeDocument(Map<String, StringBuilder> fields, TrecDocumentWriter writer) throws Exception {
+    // Note: some documents have an empty body, and a record may also lack a title
+    StringBuilder body = fields.get(OHSU_BODY);
+    StringBuilder title = fields.get(OHSU_TITLE);
+    if (title != null) {
+      if (body != null) title.append('\n').append(body);
+      body = title;
+    } else if (body == null) body = new StringBuilder(); // avoid NPE in the appends below
+    StringBuilder authors = fields.get(OHSU_AUTHORS);
+    if (authors != null) {
+      body.append('\n').append(authors);
+    }
+    StringBuilder mesh = fields.get(OHSU_MESH);
+    if (mesh != null) {
+      body.append('\n').append(mesh);
+    }
+    doc.setBody(body);
+    doc.setDate(date);
+    StringBuilder docName = fields.get(OHSU_DOCID);
+    if (docName == null) {
+      System.err.println("-Empty docid - skipping ...");
+      return;
+    }
+    doc.setDocname(docName);
+    writer.write(doc);
+  }
+
+}

Propchange: lucene/openrelevance/trunk/collections/ohsumed/src/java/org/apache/or/collections/ohsumed/OhsumedCorpusConverter.java
------------------------------------------------------------------------------
    svn:eol-style = native

Added: lucene/openrelevance/trunk/collections/ohsumed/src/java/org/apache/or/collections/ohsumed/OhsumedQrelConverter.java
URL: http://svn.apache.org/viewvc/lucene/openrelevance/trunk/collections/ohsumed/src/java/org/apache/or/collections/ohsumed/OhsumedQrelConverter.java?rev=907712&view=auto
==============================================================================
--- lucene/openrelevance/trunk/collections/ohsumed/src/java/org/apache/or/collections/ohsumed/OhsumedQrelConverter.java (added)
+++ lucene/openrelevance/trunk/collections/ohsumed/src/java/org/apache/or/collections/ohsumed/OhsumedQrelConverter.java Mon Feb  8 16:12:14 2010
@@ -0,0 +1,60 @@
+package org.apache.or.collections.ohsumed;
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+import java.io.BufferedReader;
+import java.io.File;
+import java.io.FileInputStream;
+import java.io.InputStreamReader;
+
+import org.apache.or.util.TrecQrel;
+import org.apache.or.util.TrecQrelWriter;
+
+public class OhsumedQrelConverter {
+  
+  public static void main(String[] args) throws Exception {
+    if (args.length < 2) { // both <inputQrels> and <outputQrels> are required
+      System.err.println("Usage: OhsumedQrelConverter <inputQrels> <outputQrels>");
+      System.err.println("\tinputQrels\tone of the qrels.mesh.* or qrels.ohsu.* files from OHSUMED");
+      System.err.println("\toutputQrels\toutput file (will be silently overwritten if exists!)");
+      System.exit(-1);
+    }
+    BufferedReader in = new BufferedReader(new InputStreamReader(
+            new FileInputStream(args[0]), "UTF-8"));
+    TrecQrelWriter writer = new TrecQrelWriter(new File(args[1]));
+    TrecQrel qrel = new TrecQrel();
+    
+    String line = null;
+    while ((line = in.readLine()) != null) {
+      String[] fields = line.trim().split("\\s+"); // trim so leading whitespace does not yield an empty topic id
+      if (fields.length < 2) {
+        System.err.println("-invalid line, skipping: " + line);
+        continue;
+      }
+      qrel.setDocno(fields[1]);
+      qrel.setIter("0");
+      qrel.setQid(fields[0]);
+      if (fields.length > 2) {
+        qrel.setRel(Integer.parseInt(fields[2]));
+      } else {
+        qrel.setRel(1);
+      }
+      writer.write(qrel);
+    }
+    in.close();
+    writer.close();
+  }
+}

Propchange: lucene/openrelevance/trunk/collections/ohsumed/src/java/org/apache/or/collections/ohsumed/OhsumedQrelConverter.java
------------------------------------------------------------------------------
    svn:eol-style = native
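
Editorial note on the qrels handling above: README.txt in this commit points
out that the qrels are not ranked and suggests sorting by relevance+docId as a
possible preprocessing step. The following is a minimal sketch of that idea,
assuming the whitespace-separated layout described in OHSU_TREC9_README.txt
(<topic-ID> <DOCNO> [<Relevant>], with MeSH lines lacking the third column and
defaulting to 1, as OhsumedQrelConverter does). The class QrelSortSketch, its
nested Judgement type and the sample identifiers are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper: parses qrels lines and orders them per topic by
// descending relevance, then by DOCNO, to obtain a deterministic order.
class QrelSortSketch {

  static final class Judgement {
    final String topicId;
    final String docno;
    final int relevance; // 1 = possibly relevant, 2 = definitely relevant

    Judgement(String topicId, String docno, int relevance) {
      this.topicId = topicId;
      this.docno = docno;
      this.relevance = relevance;
    }

    // For evaluation the README treats both grades as relevant.
    boolean isRelevant() {
      return relevance >= 1;
    }
  }

  // Returns null for lines with fewer than two columns.
  static Judgement parseLine(String line) {
    String[] cols = line.trim().split("\\s+");
    if (cols.length < 2) {
      return null;
    }
    int rel = cols.length > 2 ? Integer.parseInt(cols[2]) : 1;
    return new Judgement(cols[0], cols[1], rel);
  }

  static void sort(List<Judgement> judgements) {
    Collections.sort(judgements, new Comparator<Judgement>() {
      public int compare(Judgement a, Judgement b) {
        int byTopic = a.topicId.compareTo(b.topicId); // plain string order
        if (byTopic != 0) {
          return byTopic;
        }
        if (a.relevance != b.relevance) {
          return b.relevance - a.relevance; // higher grade first
        }
        return a.docno.compareTo(b.docno);
      }
    });
  }

  public static void main(String[] args) {
    // Made-up sample lines, not taken from the corpus.
    List<Judgement> list = new ArrayList<Judgement>();
    list.add(parseLine("T1\t100\t1"));
    list.add(parseLine("T1\t200\t2"));
    list.add(parseLine("T1\t050"));
    sort(list);
    for (Judgement j : list) {
      System.out.println(j.topicId + " " + j.docno + " " + j.relevance);
    }
  }
}
```

Note that the resulting order is only a stable convention; it does not turn the
judgements into a ranking produced by a retrieval system.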

Added: lucene/openrelevance/trunk/collections/ohsumed/src/java/org/apache/or/collections/ohsumed/OhsumedTopicConverter.java
URL: http://svn.apache.org/viewvc/lucene/openrelevance/trunk/collections/ohsumed/src/java/org/apache/or/collections/ohsumed/OhsumedTopicConverter.java?rev=907712&view=auto
==============================================================================
--- lucene/openrelevance/trunk/collections/ohsumed/src/java/org/apache/or/collections/ohsumed/OhsumedTopicConverter.java (added)
+++ lucene/openrelevance/trunk/collections/ohsumed/src/java/org/apache/or/collections/ohsumed/OhsumedTopicConverter.java Mon Feb  8 16:12:14 2010
@@ -0,0 +1,80 @@
+package org.apache.or.collections.ohsumed;
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+import java.io.BufferedReader;
+import java.io.File;
+import java.io.FileInputStream;
+import java.io.InputStreamReader;
+
+import org.apache.or.util.TrecTopic;
+import org.apache.or.util.TrecTopicWriter;
+
+public class OhsumedTopicConverter {
+  
+  public static void main(String[] args) throws Exception {
+    if (args.length < 2) { // both <inputTopics> and <outputTopics> are required
+      System.err.println("Usage: OhsumedTopicConverter <inputTopics> <outputTopics>");
+      System.err.println("\tinputTopics\tone of the query.mesh.* or query.ohsu.* files from OHSUMED");
+      System.err.println("\toutputTopics\toutput file (will be silently overwritten if exists!)");
+      System.exit(-1);
+    }
+    BufferedReader in = new BufferedReader(new InputStreamReader(
+            new FileInputStream(args[0]), "UTF-8"));
+    TrecTopicWriter writer = new TrecTopicWriter(new File(args[1]));
+    TrecTopic topic = new TrecTopic();
+    topic.setNarrative(""); // no narratives
+    
+    String line = null;
+    boolean description = false;
+    while ((line = in.readLine()) != null) {
+      String lineT = line.trim();
+      if (lineT.equals("") || lineT.equals("</top>")) {
+        continue;
+      }
+      if (lineT.equals("<top>")) { // output existing topic & reset
+        if (topic.getNumber() != null && !topic.getNumber().equals("")) {
+          writer.write(topic);
+        }
+        topic.setNumber("");
+        topic.setDescription("");
+        topic.setTitle("");
+        continue;
+      }
+      if (lineT.startsWith("<num> Number: ")) {
+        topic.setNumber(lineT.substring(14));
+      } else if (lineT.startsWith("<title> ")) {
+        topic.setTitle(lineT.substring(8)); // use the trimmed line, the prefix was matched on it
+      } else if (lineT.equals("<desc> Description:")) {
+        description = true;
+        continue;
+      } else if (description) {
+        topic.setDescription(line);
+        description = false;
+      } else {
+        System.err.println("Unrecognized line, skipping: '" + line + "'");
+        continue;
+      }
+    }
+    // output last topic if present
+    if (topic.getNumber() != null && !topic.getNumber().equals("")) {
+      writer.write(topic);
+    }
+    in.close();
+    writer.close();
+  }
+}

Propchange: lucene/openrelevance/trunk/collections/ohsumed/src/java/org/apache/or/collections/ohsumed/OhsumedTopicConverter.java
------------------------------------------------------------------------------
    svn:eol-style = native