Posted to user@mahout.apache.org by Grant Ingersoll <gs...@apache.org> on 2009/07/01 15:06:33 UTC
Fwd: [SIG-IRList] CFP: INEX 2009 - Clustering Task for collection Selection
FYI
Begin forwarded message:
> From: Richi Nayak <r....@qut.edu.au>
> Date: June 25, 2009 9:47:42 PM EDT
> To: "IRList@lists.shef.ac.uk" <IR...@lists.shef.ac.uk>
> Cc: Richi Nayak <r....@qut.edu.au>
> Subject: [SIG-IRList] CFP: INEX 2009 - Clustering Task for
> collection Selection
>
> This is a call for participation in the XML Clustering Task at INEX
> 2009. The INEX 2009 clustering task is an evaluation forum that
> provides a platform to measure the performance of clustering methods
> for collection selection on a large-scale test collection (consisting
> of a set of documents, their labels, a set of information needs
> (queries), and the answers to those information needs).
>
> In the last decade, we have observed a proliferation of approaches
> for clustering XML documents based on their structure and content.
> There have been many approaches developed for diverse application
> domains. Many applications require data objects to be grouped by
> similarity of content, tags, paths, structure and semantics.
>
> The clustering task in INEX 2009 evaluates unsupervised machine
> learning in the context of XML information retrieval. This year we
> are running a novel evaluation task using manual query assessments
> from the INEX Ad Hoc track. The clustering track will explicitly
> test the Jardine and van Rijsbergen cluster hypothesis (1971), which
> states that documents that cluster together have a similar relevance
> to a given query. The task is to split the English Wikipedia
> collection (60 gigabytes, around 2.7 million documents in XML
> format) into disjoint clusters for collection selection. If
> the cluster hypothesis holds true, and if suitable clustering can be
> achieved, then a clustering solution will minimise the number of
> clusters that need to be searched to satisfy any given query. There
> are important practical reasons for performing collection selection
> on a very large corpus. If only a small fraction of clusters (hence
> documents) need to be searched, then the throughput of an
> information retrieval system will be greatly improved.
>
> The INEX XML Wikipedia collection is a marked-up version of the
> Wikipedia documents. The mark-up includes, for instance, explicit
> tagging of named entities. To enable participation with minimal
> data-preparation overhead, the collection has been pre-processed to
> provide various representations of the documents: for instance, a
> bag-of-words representation of terms and frequent phrases in a
> document, and frequencies of various XML structures in the form of
> trees, links, named entities, etc. These collection representations
> will be released by the end of this month. In addition, the entire
> document collection is available in XML format and in text-only
> format if you wish to try different representation approaches. A
> subset of the INEX 2009 corpus containing about 50,000 documents
> will also be provided for teams that are unable to cluster such a
> large data collection.
>
> The clustering solutions will be evaluated in two ways. First, each
> clustering solution will be evaluated using standard criteria such
> as purity, entropy and F-score to determine the quality of the
> clusters. These evaluation results will be provided online on an
> ongoing basis, along the same lines as Netflix, starting from
> mid-September. Second, the clustering solutions will be evaluated
> to determine the quality of the clusters relative to the optimal
> collection selection goal, given a set of queries. Better
> clustering solutions in this context will tend to (on average) group
> together relevant results for (previously unseen) ad hoc queries.
> Real ad hoc retrieval queries and their manual assessment results
> will be used in this evaluation. This novel approach evaluates
> the clustering solutions relative to a very specific objective:
> clustering a large document collection in an optimal manner in order
> to satisfy queries while minimising the search space. Results of
> the second evaluation will be released at the INEX workshop in
> December.
>
> The clustering task in INEX 2009 brings together researchers from
> the Information Retrieval, Data Mining, Machine Learning and XML
> fields.
> It allows participants to evaluate clustering methods against a
> real use case and with significant volumes of data. The task is
> designed to facilitate participation with minimal effort by
> providing not only raw data, but also pre-processed data which can
> be easily used by existing clustering software.
>
>
> Dr Richi Nayak, School of Information Technology,
> Queensland University of Technology, Brisbane, QLD 4001
> Office: GP S537 Phone: 3138 1976
> Email: r.nayak@qut.edu.au
> http://sky.scitech.qut.edu.au/~nayak/
>
>
>
> ************************************************
> This SIGIR-IRList message and the SIG-IRList Digest (a moderated IR
> newsletter), are brought to you by SIGIR, distributed from the
> University of Sheffield and edited by Raman Chandrasekar
> (irlist-editor@acm.org).
> o To submit an article, e-mail IRList@lists.shef.ac.uk
> o To subscribe, send mail to sympa@lists.shef.ac.uk , with the
> subject: SUBSCRIBE irlist firstname lastname
> o To unsubscribe, send mail to sympa@lists.shef.ac.uk, with the
> subject: UNSUBSCRIBE irlist email
> [The email address is required only if you want to unsubscribe with
> an address other than the address with which you send the message]
>
> o For more info, visit: http://www.sigir.org/sigirlist/
> o Subscribe to a feed of these messages at http://searchtextmining.spaces.live.com/feed.rss
> These files are not to be sold or used for commercial purposes.
> THE OPINIONS EXPRESSED WITHIN THIS DOCUMENT DO NOT REPRESENT THOSE
> OF THE EDITOR, MICROSOFT CORPORATION OR THE UNIVERSITY OF SHEFFIELD.
> AUTHORS ASSUME FULL RESPONSIBILITY FOR THEIR MATERIAL.
>
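For anyone unfamiliar with the purity and entropy criteria mentioned in the call above, here is a minimal Python sketch of the usual textbook definitions. The message does not say how INEX will compute them, so the exact formulas (in particular the size-weighted entropy), the function names and the toy data are all assumptions made for illustration, not the organisers' evaluation code.

```python
from collections import Counter, defaultdict
from math import log2


def _label_counts(clusters, labels):
    """Group documents by cluster id and count the labels in each cluster."""
    counts = defaultdict(Counter)
    for cluster_id, label in zip(clusters, labels):
        counts[cluster_id][label] += 1
    return counts


def purity(clusters, labels):
    """Fraction of documents that carry the majority label of their cluster."""
    counts = _label_counts(clusters, labels)
    majority_total = sum(max(c.values()) for c in counts.values())
    return majority_total / len(labels)


def entropy(clusters, labels):
    """Size-weighted average of the per-cluster label entropy, in bits.

    Assumed definition: each cluster's entropy is weighted by the share
    of documents it contains. Lower is better; 0 means every cluster
    holds a single label.
    """
    counts = _label_counts(clusters, labels)
    total = len(labels)
    result = 0.0
    for label_counter in counts.values():
        size = sum(label_counter.values())
        cluster_entropy = -sum(
            (n / size) * log2(n / size) for n in label_counter.values()
        )
        result += (size / total) * cluster_entropy
    return result


if __name__ == "__main__":
    # Hypothetical toy data: six documents, two clusters, two labels.
    clusters = [0, 0, 0, 1, 1, 1]
    labels = ["a", "a", "b", "b", "b", "b"]
    print("purity:", purity(clusters, labels))
    print("entropy:", entropy(clusters, labels))
```

In the toy data, five of the six documents carry their cluster's majority label, so purity is 5/6, and only the mixed cluster contributes to the entropy. A real run would read the cluster assignments and the document labels supplied with the collection instead.
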
--------------------------
Grant Ingersoll
http://www.lucidimagination.com/
Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)
using Solr/Lucene:
http://www.lucidimagination.com/search