Posted to openrelevance-dev@lucene.apache.org by rm...@apache.org on 2010/01/02 21:44:24 UTC
svn commit: r895284 - in /lucene/openrelevance/trunk: LICENSE.txt
collections/hamshahri/LICENSE.txt collections/hamshahri/README.txt
collections/tempo/LICENSE.txt collections/tempo/README.txt
Author: rmuir
Date: Sat Jan 2 20:44:23 2010
New Revision: 895284
URL: http://svn.apache.org/viewvc?rev=895284&view=rev
Log:
ORP-3: collections need LICENSE and README
Added:
lucene/openrelevance/trunk/collections/hamshahri/LICENSE.txt (with props)
lucene/openrelevance/trunk/collections/hamshahri/README.txt (with props)
lucene/openrelevance/trunk/collections/tempo/LICENSE.txt (with props)
lucene/openrelevance/trunk/collections/tempo/README.txt (with props)
Modified:
lucene/openrelevance/trunk/LICENSE.txt
Modified: lucene/openrelevance/trunk/LICENSE.txt
URL: http://svn.apache.org/viewvc/lucene/openrelevance/trunk/LICENSE.txt?rev=895284&r1=895283&r2=895284&view=diff
==============================================================================
--- lucene/openrelevance/trunk/LICENSE.txt (original)
+++ lucene/openrelevance/trunk/LICENSE.txt Sat Jan 2 20:44:23 2010
@@ -200,3 +200,7 @@
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
+
+Note: the resources downloaded by these scripts are not Apache works and are
+governed by their own individual licenses. Please read the LICENSE.txt in each
+folder for details.
Added: lucene/openrelevance/trunk/collections/hamshahri/LICENSE.txt
URL: http://svn.apache.org/viewvc/lucene/openrelevance/trunk/collections/hamshahri/LICENSE.txt?rev=895284&view=auto
==============================================================================
--- lucene/openrelevance/trunk/collections/hamshahri/LICENSE.txt (added)
+++ lucene/openrelevance/trunk/collections/hamshahri/LICENSE.txt Sat Jan 2 20:44:23 2010
@@ -0,0 +1,34 @@
+From http://ece.ut.ac.ir/dbrg/Hamshahri/
+
+Copyright
+
+Hamshahri corpus was created in DBRG Lab. at University of Tehran ECE department.
+All rights of the corpus' news are reserved for Hamshahri newspaper. All rights of
+the corpus' data and the tools that are included in this package are reserved for
+University of Tehran - Database Research Group. Usage of this package for any
+research or non-commercial purposes is free with the precondition that you cite the
+related papers below.
+
+Darrudi E., Hejazi M.R., Oroumchian F., Assessment of a Modern Farsi Corpus. In
+Proceedings of the 2nd Workshop on Information Technology & its Disciplines (WITID)
+ 2004, ITRC, Kish Island, Iran.
+
+Hadi Amiri, Abolfazl AleAhmad, Farhad Oroumchian, Caro Lucas, Masoud Rahgozar, Using
+ OWA Fuzzy Operator to Merge Retrieval System Results, The Second Workshop on
+Computational Approaches to Arabic Script-based Languages, LSA 2007 Linguistic
+Institute, Stanford University, USA, 2007.
+
+Abolfazl Aleahmad, Parsia Hakimian, Farzad Mahdikhani and Farhad Oroumchian. N-Gram
+and Local Context Analysis For Persian Text Retrieval. International Symposium on
+Signal Processing and Its Applications, Sharjah U.A.E., 2007.
+
+Farhad Oroumchian, Ehsan Darrudi, Fattane Taghiyareh, Neeyaz Angoshtari. Experiments
+ with persian text compression for web. 13th International World Wide Web
+conference, New York, NY, USA, 2004
+
+Nayyeri, A. Oroumchian, F. FuFaIR: a Fuzzy Farsi Information Retrieval System. IEEE
+ International Conference on Computer Systems and Applications, Sharjah U.A.E., 2006.
+
+Abolfazl AleAhmad, Hadi Amiri, Ehsan Darrudi, Masoud Rahgozar, Farhad Oroumchian.
+Hamshahri: A Standard Persian Text Collection.
+
Propchange: lucene/openrelevance/trunk/collections/hamshahri/LICENSE.txt
------------------------------------------------------------------------------
svn:eol-style = native
Added: lucene/openrelevance/trunk/collections/hamshahri/README.txt
URL: http://svn.apache.org/viewvc/lucene/openrelevance/trunk/collections/hamshahri/README.txt?rev=895284&view=auto
==============================================================================
--- lucene/openrelevance/trunk/collections/hamshahri/README.txt (added)
+++ lucene/openrelevance/trunk/collections/hamshahri/README.txt Sat Jan 2 20:44:23 2010
@@ -0,0 +1,31 @@
+The homepage of this test collection is http://ece.ut.ac.ir/dbrg/Hamshahri/
+
+The Hamshahri corpus contains news documents from the Hamshahri online newspaper
+ (http://www.hamshahrionline.ir/). The documents span from 1996 to 2002, and cover
+82 different categories such as politics, literature, art, economics, etc.
+
+The corpus has the following statistics:
+ Size (MB): 345MB (564MB with tags)
+ # of documents: 166,774
+ # of unique terms: 417,339
+ Average Document Length (words): 380
+ Average Document Length: 1.8KB (note: not unicode characters)
+ Documents range from short news under 1KB to long articles up to 140KB.
+
+There are two sets of queries and relevance judgements provided:
+
+The first contains 65 topics, which were developed by 17 different people.
+These queries have the following statistics:
+ Average length: 2.84 terms
+
+The corresponding judgements were created via a pooling process from 7 different
+retrieval engines using the LM1, LM2, LM3, LM4, VS2, VS4, and VS5 models. 17
+different users (IT students) judged the top 100 documents as either relevant or
+not relevant, according to TREC guidelines.
+
+The second set of queries and relevance judgements contains 58 topics, where only
+the top 20 retrieved documents were judged.
+
+More information regarding this corpus is available from this paper:
+http://ece.ut.ac.ir/dbrg/Hamshahri/Papers/Hamshahri_Description.pdf
+
Propchange: lucene/openrelevance/trunk/collections/hamshahri/README.txt
------------------------------------------------------------------------------
svn:eol-style = native
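
The pooling step described in the Hamshahri README above can be sketched
roughly as follows. This is only an illustration, not the procedure the corpus
authors actually ran: it assumes the pool for a topic is the union of each
engine's top-ranked documents (the usual TREC-style reading, which the README
does not spell out). The function name build_pool and the document ids are
hypothetical; the model names and the depth of 100 come from the README.

```python
from typing import Dict, List, Set


def build_pool(runs: Dict[str, List[str]], depth: int = 100) -> Set[str]:
    """Return the union of the top-`depth` documents of every ranked list."""
    pool: Set[str] = set()
    for ranked_docs in runs.values():
        # Only the first `depth` documents of each run enter the pool.
        pool.update(ranked_docs[:depth])
    return pool


# Hypothetical ranked lists for a single topic (document ids are made up).
runs = {
    "LM1": ["d3", "d1", "d7"],
    "VS2": ["d1", "d9", "d3"],
}

# With depth=2 only the two best documents of each run are pooled.
print(sorted(build_pool(runs, depth=2)))
```

Assessors would then judge each pooled document as relevant or not relevant;
documents outside the pool are conventionally treated as not relevant.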
Added: lucene/openrelevance/trunk/collections/tempo/LICENSE.txt
URL: http://svn.apache.org/viewvc/lucene/openrelevance/trunk/collections/tempo/LICENSE.txt?rev=895284&view=auto
==============================================================================
--- lucene/openrelevance/trunk/collections/tempo/LICENSE.txt (added)
+++ lucene/openrelevance/trunk/collections/tempo/LICENSE.txt Sat Jan 2 20:44:23 2010
@@ -0,0 +1,23 @@
+From http://ilps.science.uva.nl/resources/bahasa
+
+Information Retrieval Resources for Bahasa Indonesia
+
+To support document retrieval in Bahasa Indonesia, we are making available a Porter
+ stemmer for the language, a stop word list, as well as two document collections,
+together with topics and qrels for those topics: the Kompas online collection and
+the Tempo online collection.
+
+Details about the development of those resources and their usage for evaluation
+purposes can be found in the following publications:
+
+ * F.Z. Tala, A Study of Stemming Effects on Information
+ Retrieval in Bahasa Indonesia, M.Sc. Thesis, University of
+ Amsterdam, 2003
+ * F. Tala, J. Kamps, K. Müller, and M. de Rijke. The Impact of
+ Stemming on Information Retrieval in Bahasa Indonesia (Abstract).
+ In: 14th Meeting of Computational Linguistics in the Netherlands
+ (CLIN-2003), 2003.
+
+If you use these resources, please let us know. If you publish results obtained
+using the resources made available here, please cite the Tala, Kamps, Müller and
+de Rijke paper listed above.
Propchange: lucene/openrelevance/trunk/collections/tempo/LICENSE.txt
------------------------------------------------------------------------------
svn:eol-style = native
Added: lucene/openrelevance/trunk/collections/tempo/README.txt
URL: http://svn.apache.org/viewvc/lucene/openrelevance/trunk/collections/tempo/README.txt?rev=895284&view=auto
==============================================================================
--- lucene/openrelevance/trunk/collections/tempo/README.txt (added)
+++ lucene/openrelevance/trunk/collections/tempo/README.txt Sat Jan 2 20:44:23 2010
@@ -0,0 +1,29 @@
+The homepage of this test collection is http://ilps.science.uva.nl/resources/bahasa
+
+The tempo corpus contains daily news documents from the Tempo online daily newspaper
+ (http://www.tempo.com). The documents span from June 2000 to July 2002.
+
+The corpus has the following statistics:
+ Size (MB): 45.57
+ # of documents: 22,944
+ avg. doc length (byte): 1549.59
+ avg. unique words (terms): 155.00
+
+The document collection was parsed to remove all HTML tags, and transformed into
+an SGML-like structure. Manual correction was performed in some cases.
+
+There are 35 queries provided, covering widely known events which happened in
+Indonesia during the timeframe.
+
+The queries have the following statistics:
+ # of queries: 35
+ avg query length (word): 5.2
+ avg # unique words: 5.17
+ avg # of relevant docs per query: 66.971
+
+The set of relevant documents for each query was constructed manually by one person,
+and assessed again by a second person. In the case of disagreement, the document
+was considered not relevant.
+
+More information regarding this corpus is available from this paper:
+http://www.illc.uva.nl/Publications/ResearchReports/MoL-2003-02.text.pdf
Propchange: lucene/openrelevance/trunk/collections/tempo/README.txt
------------------------------------------------------------------------------
svn:eol-style = native
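
The two-assessor rule in the Tempo README above (a document counts as relevant
only if both people agree) could be expressed as the sketch below. It is an
illustration under stated assumptions, not the tooling used to build the
collection: the function name merge_judgements, the dictionary representation
and the document ids are all hypothetical, and a document the second assessor
did not mark is assumed to count as a disagreement.

```python
from typing import Dict


def merge_judgements(first: Dict[str, bool], second: Dict[str, bool]) -> Dict[str, bool]:
    """Keep a document as relevant only if both assessors marked it relevant."""
    merged: Dict[str, bool] = {}
    for doc_id, first_vote in first.items():
        # Missing second vote is treated as "not relevant" (an assumption).
        second_vote = second.get(doc_id, False)
        merged[doc_id] = first_vote and second_vote
    return merged


# Hypothetical votes for one query (document ids are made up).
first_assessor = {"t1": True, "t2": True, "t3": False}
second_assessor = {"t1": True, "t2": False, "t3": False}

final = merge_judgements(first_assessor, second_assessor)
print(final)
```

Here t2 ends up not relevant because the two assessors disagree, which matches
the conservative rule stated in the README.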