Posted to dev@nutch.apache.org by Stefan Groschupf <sg...@101tec.com> on 2006/08/21 21:16:12 UTC

Fwd: [webspam-announces] Web Spam Collection Announced

Hi,
Maybe some people will find this posting interesting.
From my point of view, web spam is one of the biggest issues for Nutch
in whole-web crawls.

Greetings,
Stefan


>
> During AIRWeb'06 we announced the availability of the collection.
>
> We are currently planning a Web Spam challenge based on the dataset we
> have built. I assume most of you will be interested in this, so I have
> moved the "webspam-volunteers" list to "webspam-announces". If you do
> not want to be in this new "webspam-announces" list, please send me an
> e-mail.
>
> This was shown during AIRWeb in Seattle:
>
> .............................................................
>
> Web Spam Collection Available
> August 10th, 2006
>
> We are pleased to announce the availability of a public collection for
> research on Web spam. This collection is the result of efforts by a
> team of volunteers:
>
> Thiago Alves    Antonio Gulli            Tamas Sarlos
> Luca Becchetti  Zoltan Gyongyi           Mike Thelwall
> Paolo Boldi     Thomas Lavergn           Belle Tseng
> Paul Chirita    Alex Ntoulas             Tanguy Urvoy
> Mirel Cosulschi Josiane-Xavier Parreira  Wenzhong Zhao
> Brian Davison   Xiaoguang Qi
> Pascal Filoche  Massimo Santini
>
> The corpus is a large set of Web pages in 11,000 .uk hosts
> downloaded in May 2006 by the Laboratory of Web Algorithmics,
> Università degli Studi di Milano. The labelling process was
> coordinated by Carlos Castillo working at the Algorithmic Engineering
> group at Università di Roma "La Sapienza". The project was funded
> by the DELIS project (Dynamically Evolving, Large Scale Information
> Systems).
>
> Volunteers were provided with a set of guidelines and were asked to
> mark a set of hosts as either normal, spam, or borderline. The
> collection includes about 6,700 judgments done by the volunteers and
> can be used for testing link-based and content-based Web spam
> detection and demotion techniques.
>
> More information is available on our Web page, including the
> guidelines given to the human judges, the instructions for obtaining
> the links and contents of the pages in this collection, and the
> contact information for questions and comments.
>
> http://aeserver.dis.uniroma1.it/webspam/
>
> If you use this data set please subscribe to our mailing list by
> sending an e-mail to webspam-announces-subscribe@yahoogroups.com.
>
> --
> Carlos Castillo
> Universita di Roma "La Sapienza"
> Rome, ITALY
>