Posted to openrelevance-dev@lucene.apache.org by rm...@apache.org on 2010/01/02 21:44:24 UTC

svn commit: r895284 - in /lucene/openrelevance/trunk: LICENSE.txt collections/hamshahri/LICENSE.txt collections/hamshahri/README.txt collections/tempo/LICENSE.txt collections/tempo/README.txt

Author: rmuir
Date: Sat Jan  2 20:44:23 2010
New Revision: 895284

URL: http://svn.apache.org/viewvc?rev=895284&view=rev
Log:
ORP-3: collections need LICENSE and README

Added:
    lucene/openrelevance/trunk/collections/hamshahri/LICENSE.txt   (with props)
    lucene/openrelevance/trunk/collections/hamshahri/README.txt   (with props)
    lucene/openrelevance/trunk/collections/tempo/LICENSE.txt   (with props)
    lucene/openrelevance/trunk/collections/tempo/README.txt   (with props)
Modified:
    lucene/openrelevance/trunk/LICENSE.txt

Modified: lucene/openrelevance/trunk/LICENSE.txt
URL: http://svn.apache.org/viewvc/lucene/openrelevance/trunk/LICENSE.txt?rev=895284&r1=895283&r2=895284&view=diff
==============================================================================
--- lucene/openrelevance/trunk/LICENSE.txt (original)
+++ lucene/openrelevance/trunk/LICENSE.txt Sat Jan  2 20:44:23 2010
@@ -200,3 +200,7 @@
    WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    See the License for the specific language governing permissions and
    limitations under the License.
+
+Note: the resources downloaded by these scripts are not Apache works and are
+governed by their own individual licenses. Please read the LICENSE.txt in each
+folder for details.

Added: lucene/openrelevance/trunk/collections/hamshahri/LICENSE.txt
URL: http://svn.apache.org/viewvc/lucene/openrelevance/trunk/collections/hamshahri/LICENSE.txt?rev=895284&view=auto
==============================================================================
--- lucene/openrelevance/trunk/collections/hamshahri/LICENSE.txt (added)
+++ lucene/openrelevance/trunk/collections/hamshahri/LICENSE.txt Sat Jan  2 20:44:23 2010
@@ -0,0 +1,34 @@
+From http://ece.ut.ac.ir/dbrg/Hamshahri/
+
+Copyright
+
+Hamshahri corpus was created in DBRG Lab. at University of Tehran  ECE department.
+All rights of the corpus' news are reserved for Hamshahri newspaper. All rights of 
+the corpus' data and the tools that are included in this package are reserved for 
+University of Tehran - Database Research Group. Usage of this package for any 
+research or non-commercial purposes is free with the precondition that you cite the 
+related papers below.
+
+Darrudi E., Hejazi M.R., Oroumchian F., Assessment of a Modern Farsi Corpus. In 
+Proceedings of the 2nd Workshop on Information Technology & its Disciplines (WITID)
+ 2004, ITRC, Kish Island, Iran.
+
+Hadi Amiri, Abolfazl AleAhmad, Farhad Oroumchian, Caro Lucas, Masoud Rahgozar, Using
+ OWA Fuzzy Operator to Merge Retrieval System Results, The Second Workshop on 
+Computational Approaches to Arabic Script-based Languages, LSA 2007 Linguistic 
+Institute, Stanford University, USA, 2007.
+
+Abolfazl Aleahmad, Parsia Hakimian, Farzad Mahdikhani and Farhad Oroumchian. N-Gram 
+and Local Context Analysis For Persian Text Retrieval. International Symposium on 
+Signal Processing and Its Applications, Sharjah U.A.E., 2007.
+
+Farhad Oroumchian, Ehsan Darrudi, Fattane Taghiyareh, Neeyaz Angoshtari. Experiments
+ with persian text compression for web. 13th International World Wide Web 
+conference, New York, NY, USA, 2004 
+
+Nayyeri, A. Oroumchian, F. FuFaIR: a Fuzzy Farsi Information Retrieval System. IEEE
+ International Conference on Computer Systems and Applications, Sharjah U.A.E., 2006.
+
+Abolfazl AleAhmad, Hadi Amiri, Ehsan Darrudi, Masoud Rahgozar, Farhad Oroumchian. 
+Hamshahri: A Standard Persian Text Collection.
+

Propchange: lucene/openrelevance/trunk/collections/hamshahri/LICENSE.txt
------------------------------------------------------------------------------
    svn:eol-style = native

Added: lucene/openrelevance/trunk/collections/hamshahri/README.txt
URL: http://svn.apache.org/viewvc/lucene/openrelevance/trunk/collections/hamshahri/README.txt?rev=895284&view=auto
==============================================================================
--- lucene/openrelevance/trunk/collections/hamshahri/README.txt (added)
+++ lucene/openrelevance/trunk/collections/hamshahri/README.txt Sat Jan  2 20:44:23 2010
@@ -0,0 +1,31 @@
+The homepage of this test collection is http://ece.ut.ac.ir/dbrg/Hamshahri/
+
+The Hamshahri corpus contains news documents from the Hamshahri online newspaper
+ (http://www.hamshahrionline.ir/). The documents span from 1996 to 2002, and cover
+82 different categories such as politics, literature, art, economics, etc.
+
+The corpus has the following statistics:
+  Size (MB): 345MB (564MB with tags)
+  # of documents: 166,774
+  # of unique terms: 417,339
+  Average Document Length (words): 380
+  Average Document Length: 1.8KB (note: not unicode characters)
+  Documents range from short news under 1KB to long articles up to 140KB.
+
+There are two sets of queries and relevance judgements provided:
+
+The first contains 65 topics, which were developed by 17 different people.
+These queries have the following statistics:
+  Average length: 2.84 terms
+
+The corresponding judgements were created via a pooling process from 7 different
+retrieval engines using the LM1, LM2, LM3, LM4, VS2, VS4, and VS5 models. 17 
+different users (IT students) judged the top 100 documents as either relevant or
+not relevant, according to TREC guidelines.
+
+The second set of queries and relevance judgements contains 58 topics, where only
+the top 20 retrieved documents were judged.
+
+More information regarding this corpus is available from this paper:
+http://ece.ut.ac.ir/dbrg/Hamshahri/Papers/Hamshahri_Description.pdf
+

Propchange: lucene/openrelevance/trunk/collections/hamshahri/README.txt
------------------------------------------------------------------------------
    svn:eol-style = native

Added: lucene/openrelevance/trunk/collections/tempo/LICENSE.txt
URL: http://svn.apache.org/viewvc/lucene/openrelevance/trunk/collections/tempo/LICENSE.txt?rev=895284&view=auto
==============================================================================
--- lucene/openrelevance/trunk/collections/tempo/LICENSE.txt (added)
+++ lucene/openrelevance/trunk/collections/tempo/LICENSE.txt Sat Jan  2 20:44:23 2010
@@ -0,0 +1,23 @@
+From http://ilps.science.uva.nl/resources/bahasa
+
+Information Retrieval Resources for Bahasa Indonesia
+
+To support document retrieval in Bahasa Indonesia, we are making available a Porter
+ stemmer for the language, a stop word list, as well as two document collections, 
+together with topics and qrels for those topics: the Kompas online collection and 
+the Tempo online collection.
+
+Details about the development of those resources and their usage for evaluation 
+purposes can be found in the following publications:
+
+    * F.Z. Tala, A Study of Stemming Effects on Information
+      Retrieval in Bahasa Indonesia, M.Sc. Thesis, University of
+      Amsterdam, 2003
+    * F. Tala, J. Kamps, K. Müller, and M. de Rijke. The Impact of 
+      Stemming on Information Retrieval in Bahasa Indonesia (Abstract).
+      In: 14th Meeting of Computational Linguistics in the Netherlands 
+      (CLIN-2003), 2003.
+
+If you use these resources, please let us know. If you publish results obtained 
+using the resources made available here, please cite the Tala, Kamps, Müller and 
+de Rijke paper listed above.

Propchange: lucene/openrelevance/trunk/collections/tempo/LICENSE.txt
------------------------------------------------------------------------------
    svn:eol-style = native

Added: lucene/openrelevance/trunk/collections/tempo/README.txt
URL: http://svn.apache.org/viewvc/lucene/openrelevance/trunk/collections/tempo/README.txt?rev=895284&view=auto
==============================================================================
--- lucene/openrelevance/trunk/collections/tempo/README.txt (added)
+++ lucene/openrelevance/trunk/collections/tempo/README.txt Sat Jan  2 20:44:23 2010
@@ -0,0 +1,29 @@
+The homepage of this test collection is http://ilps.science.uva.nl/resources/bahasa
+
+The tempo corpus contains daily news documents from the Tempo online daily newspaper
+ (http://www.tempo.com). The documents span from June 2000 to July 2002.
+
+The corpus has the following statistics:
+  Size (MB): 45.57
+  # of documents: 22,944
+  avg. doc length (byte): 1549.59
+  avg. unique words (terms): 155.00
+
+The document collection was parsed to remove all HTML tags, and transformed into
+an SGML-like structure. Manual correction was performed in some cases.
+
+There are 35 queries provided, covering widely known events which happened in 
+Indonesia during the timeframe.
+
+The queries have the following statistics:
+  # of queries: 35
+  avg query length (word): 5.2
+  avg # unique words: 5.17
+  avg # of relevant docs per query: 66.971
+
+The set of relevant documents for each query was constructed manually by one person,
+and assessed again by a second person. In the case of disagreement, the document
+was considered not relevant.  
+
+More information regarding this corpus is available from this paper:
+http://www.illc.uva.nl/Publications/ResearchReports/MoL-2003-02.text.pdf

Propchange: lucene/openrelevance/trunk/collections/tempo/README.txt
------------------------------------------------------------------------------
    svn:eol-style = native