Posted to user@mahout.apache.org by Grant Ingersoll <gs...@apache.org> on 2009/07/01 15:06:33 UTC
Fwd: [SIG-IRList] CFP: INEX 2009 - Clustering Task for collection Selection

FYI

Begin forwarded message:

> From: Richi Nayak <r....@qut.edu.au>
> Date: June 25, 2009 9:47:42 PM EDT
> To: "IRList@lists.shef.ac.uk" <IR...@lists.shef.ac.uk>
> Cc: Richi Nayak <r....@qut.edu.au>
> Subject: [SIG-IRList] CFP: INEX 2009 - Clustering Task for  
> collection Selection
>
> This is a call for participation in the XML Clustering Task at INEX
> 2009. The INEX 2009 clustering task is an evaluation forum that
> provides a platform to measure the performance of clustering methods
> for collection selection on a large-scale test collection (consisting
> of a set of documents, their labels, a set of information needs
> (queries), and the answers to those information needs).
>
> In the last decade, we have observed a proliferation of approaches  
> for clustering XML documents based on their structure and content.  
> There have been many approaches developed for diverse application  
> domains. Many applications require data objects to be grouped by  
> similarity of content, tags, paths, structure and semantics.
>
> The clustering task in INEX 2009 evaluates unsupervised machine  
> learning in the context of XML information retrieval. This year we  
> are running a novel evaluation task using manual query assessments  
> from the INEX Ad Hoc track.  The clustering track will explicitly  
> test the Jardine and van Rijsbergen cluster hypothesis (1971), which  
> states that documents that cluster together have a similar relevance  
> to a given query. The task is to split the English Wikipedia
> collection, 60 gigabytes in size with around 2.7 million documents
> in XML format, into disjoint clusters for collection selection. If
> the cluster hypothesis holds true, and if suitable clustering can be  
> achieved, then a clustering solution will minimise the number of  
> clusters that need to be searched to satisfy any given query. There  
> are important practical reasons for performing collection selection  
> on a very large corpus. If only a small fraction of clusters (hence  
> documents) need to be searched, then the throughput of an  
> information retrieval system will be greatly improved.
>
> The INEX XML Wikipedia collection is a marked-up version of the  
> Wikipedia documents.  The mark-up includes, for instance, explicit  
> tagging of named entities. To enable participation with minimal
> overhead in data preparation, the collection has been pre-processed
> to provide various representations of the documents. These include,
> for instance, a bag-of-words representation of terms and frequent
> phrases in a document, and frequencies of various XML structures in
> the form of trees, links, named entities, etc. These various
> collection representations will be released by the end of this
> month. In addition, the entire document collection is available in
> XML format and in text-only format if you wish to try different
> representation approaches. A subset of the collection, containing
> about 50,000 documents of the INEX 2009 corpus, will also be
> provided for clustering by teams that are unable to process such a
> large data collection.
>
> The clustering solutions will be evaluated in two ways. Firstly,
> each clustering solution will be evaluated using standard criteria
> such as purity, entropy and F-score to determine the quality of the
> clusters. These evaluation results will be provided online on an
> ongoing basis, along the same lines as Netflix, starting from
> mid-September. Secondly, the clustering solutions will be evaluated
> to determine the quality of the clusters relative to the optimal
> collection selection goal, given a set of queries. Better
> clustering solutions in this context will tend to (on average) group
> together relevant results for (previously unseen) ad hoc queries.
> Real ad hoc retrieval queries and their manual assessment results
> will be utilised in this evaluation. This novel approach evaluates
> the clustering solutions relative to a very specific objective:
> clustering a large document collection in an optimal manner in order
> to satisfy queries while minimising the search space. Results of the
> second evaluation will be released at the INEX workshop in December.
>
> The clustering task in INEX 2009 brings together researchers from  
> Information Retrieval, Data Mining, Machine Learning and XML fields.  
> It allows participants to evaluate clustering methods against a
> real use case and with significant volumes of data. The task is
> designed to facilitate participation with minimal effort by  
> providing not only raw data, but also pre-processed data which can  
> be easily used by existing clustering software.
>
>
> Dr Richi Nayak, School of Information Technology,
> Queensland University of Technology, Brisbane, QLD 4001
> Office: GP S537  Phone: 3138 1976
> Email: r.nayak@qut.edu.au
> http://sky.scitech.qut.edu.au/~nayak/
>
>
>
>
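
For anyone unfamiliar with the "standard criteria" named in the call,
here is a minimal sketch of purity and entropy in Python. It is not
the INEX evaluation code; the function names and the toy data are
made up for illustration, and it assumes the common definitions
(purity as the fraction of documents carrying their cluster's
majority label, entropy as the size-weighted average label entropy
per cluster, in bits). The track may define or normalise them
differently.

```python
import math
from collections import Counter, defaultdict


def _label_counts(clusters, labels):
    """Count how often each label occurs inside each cluster."""
    by_cluster = defaultdict(Counter)
    for cluster_id, label in zip(clusters, labels):
        by_cluster[cluster_id][label] += 1
    return by_cluster


def purity(clusters, labels):
    """Fraction of documents that carry the majority label of their cluster."""
    by_cluster = _label_counts(clusters, labels)
    majority = sum(max(counts.values()) for counts in by_cluster.values())
    return majority / len(labels)


def entropy(clusters, labels):
    """Size-weighted average label entropy over clusters (bits, lower is better)."""
    by_cluster = _label_counts(clusters, labels)
    n = len(labels)
    total = 0.0
    for counts in by_cluster.values():
        size = sum(counts.values())
        h = -sum((k / size) * math.log2(k / size) for k in counts.values())
        total += (size / n) * h
    return total


# Hypothetical toy data: six documents, two clusters, two labels.
toy_clusters = [0, 0, 0, 1, 1, 1]
toy_labels = ["a", "a", "b", "b", "b", "b"]
toy_purity = purity(toy_clusters, toy_labels)
toy_entropy = entropy(toy_clusters, toy_labels)
```

On the toy data, cluster 1 is pure and cluster 0 mixes two labels, so
purity falls below 1 and entropy rises above 0. Both measures need
the document labels, which is why the second, query-based evaluation
in the call is a separate step.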

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)  
using Solr/Lucene:
http://www.lucidimagination.com/search