Posted to dev@mahout.apache.org by Cristian Prodan <pr...@gmail.com> on 2010/04/07 10:09:27 UTC

[GSoC 2010] Proposal to implement SimHash clustering for Mahout

Hello,

I'm posting a draft for my proposal for this year's GSoC. I kindly ask for
your feedback on it.

I have also posted a JIRA ticket with it:
https://issues.apache.org/jira/browse/MAHOUT-365 .

Thank you in advance.
Cristi.

-------------------------------------------------------------------------------------------------------------------------------------

Application for Google Summer of Code 2010 - Mahout Project

Student: Cristian Prodan

1. Synopsis

I will add a MapReduce implementation of the SimHash clustering algorithm
to the Mahout project. This algorithm provides an efficient way of finding
similar or identical files in a large collection.


2. Project

As storage capacities grow, file systems become harder to organize and
manage, and it is easy to lose track of identical copies or older versions
of files in a directory structure. Classic duplicate-detection techniques do
not extend well to files that are merely similar rather than identical. A
typical idea for detecting similar files is to map the features of a file
into a high-dimensional space and then use distance within that space as a
measure of similarity. Comparing all pairs of files this way is O(n^2),
which is not feasible for large collections. If these file-to-vector
mappings are instead reduced to a one-dimensional space, the data points can
be sorted in O(n log n) time - a big increase in detection speed.
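As an illustration of this sorted, one-dimensional approach, here is a
minimal sketch (plain Java, with an assumed numeric threshold) of how
candidate pairs could be enumerated after sorting; the actual interfaces
will be designed during the project:

    import java.util.Arrays;

    public class SortedScanSketch {

      // Sketch only: once files are reduced to one-dimensional keys, sort
      // the keys in O(n log n) and slide a window; only neighbors whose
      // keys differ by at most `threshold` are reported as candidates.
      static void printCandidatePairs(long[] keys, long threshold) {
        long[] sorted = keys.clone();
        Arrays.sort(sorted);
        for (int i = 0; i < sorted.length; i++) {
          for (int j = i + 1; j < sorted.length
              && sorted[j] - sorted[i] <= threshold; j++) {
            System.out.println(sorted[i] + " ~ " + sorted[j]);
          }
        }
      }
    }

Because keys close in value sit next to each other after sorting, the inner
loop only ever scans a short window rather than all n files.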

I will implement the SimHash algorithm presented in detail in [1]. The idea
is the following: using a hash function that hashes similar files to similar
values, file similarity can be determined simply by comparing pre-sorted
hash key values.
I will implement a family of similarity hash functions which do this, as
described in [1]. Furthermore, performance will be enhanced by storing
auxiliary data used to compute the hash keys. This data will be used as a
second filter after a hash key comparison indicates that two files are
potentially similar.
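A minimal sketch of such a similarity-preserving hash follows. It loosely
mirrors the sum-of-tags idea from [1]: each fixed-size chunk contributes a
small tag, and the key is the sum of the tags, so files that differ in only
a few chunks get numerically close keys. The chunk size and the per-chunk
hash used here are illustrative assumptions, not the final design:

    public class SimHashSketch {

      // Small per-chunk tag in [0, 255]: FNV-1a over the chunk, folded
      // down to one byte. Any well-mixed hash would do for this sketch.
      private static int chunkTag(byte[] data, int off, int len) {
        long h = 0xcbf29ce484222325L;
        for (int i = off; i < off + len; i++) {
          h ^= data[i] & 0xff;
          h *= 0x100000001b3L;
        }
        return (int) (h & 0xff);
      }

      // Sum the tags of fixed-size chunks: similar content, similar sum,
      // so near-identical files end up with numerically close keys.
      public static long simhash(byte[] content, int chunkSize) {
        long key = 0;
        for (int off = 0; off < content.length; off += chunkSize) {
          key += chunkTag(content, off,
              Math.min(chunkSize, content.length - off));
        }
        return key;
      }
    }

Editing one chunk changes the key by at most 255, which is exactly the
property the sorted-scan sketch above relies on.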

Properties of the similarity function and the algorithm:
- very similar files map to very similar, or even identical, hash keys;
- the distance between keys should be some measure of the difference between
files. Taken literally, this would make keys proportional to file sizes and
would create false positives; the auxiliary data mentioned above provides an
easy and efficient way of refining the similarity detection;
- the metric used will be a binary metric (SimHash operates at the byte
level);
- given a similarity metric, there needs to be a threshold to determine how
close within the metric files need to be to count as similar.

From a distributed point of view, the algorithm above is well suited to a
MapReduce implementation. The sketch is the following:

I. MAP phase
In this phase we compute the hash for a file, along with auxiliary data
which serves as a second filter in similarity detection.
It outputs (simhash_key, file) pairs, keyed on the hash so that files with
equal keys meet in the same reduce call.

II. REDUCE phase
Once every file has a simhash, group the files with the same simhash key
into a cluster; clusters are sorted by simhash key (the key is an integer).
* The finer similarity check can then be done in main memory.
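A minimal Hadoop skeleton of these two phases is sketched below. The class
names, the input format (one file path per input line) and the call to the
SimHashSketch hash above are assumptions for illustration, not the final
Mahout interfaces:

    import java.io.ByteArrayOutputStream;
    import java.io.IOException;

    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class SimHashJobSketch {

      // MAP phase: read a file, compute its simhash, emit (key, path).
      public static class SimHashMapper
          extends Mapper<LongWritable, Text, LongWritable, Text> {
        @Override
        protected void map(LongWritable offset, Text pathLine, Context ctx)
            throws IOException, InterruptedException {
          FileSystem fs = FileSystem.get(ctx.getConfiguration());
          byte[] content = readAll(fs, new Path(pathLine.toString()));
          long key = SimHashSketch.simhash(content, 64); // sketch above
          ctx.write(new LongWritable(key), pathLine);
        }
      }

      // REDUCE phase: files sharing a key form a candidate cluster; the
      // second-filter check on auxiliary data would run here, in memory.
      public static class SimHashReducer
          extends Reducer<LongWritable, Text, LongWritable, Text> {
        @Override
        protected void reduce(LongWritable key, Iterable<Text> files,
            Context ctx) throws IOException, InterruptedException {
          for (Text file : files) {
            ctx.write(key, file); // one line per (cluster key, member file)
          }
        }
      }

      private static byte[] readAll(FileSystem fs, Path p) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        IOUtils.copyBytes(fs.open(p), buf, 4096, true);
        return buf.toByteArray();
      }
    }

Hadoop's shuffle sorts and groups by the LongWritable key, which gives the
sorted-by-simhash clustering for free.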


3. Deliverables:

1. A batch tool for sim-hashing files. The hashes will be stored either on a
normal file system (for small tests), on HDFS, or in HBase.
2. A tool for answering the following types of queries (sketched below):
- retrieving the set of files similar to a given file;
- retrieving all pairs of similar files.
There will be an option to run this as a standalone program, for tests or
smaller-scale purposes.
3. Documentation and unit tests for the written code.
4. Getting-started tutorials for SimHash.
5. Demos for SimHash applied to the 20newsgroups and Wikipedia data sets.
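For the first query type, a minimal sketch follows, assuming the batch phase
has produced a sorted array of keys; the names and signatures are
illustrative only:

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;

    public class SimilarFileQuerySketch {

      // Sketch of "files similar to a given file": binary-search the probe
      // file's key in the pre-sorted key array, then walk outward in both
      // directions while neighbors stay within the threshold.
      static List<Long> similarKeys(long[] sortedKeys, long probe,
          long threshold) {
        int pos = Arrays.binarySearch(sortedKeys, probe);
        if (pos < 0) {
          pos = -pos - 1; // insertion point when the probe key is absent
        }
        List<Long> result = new ArrayList<Long>();
        for (int i = pos - 1;
            i >= 0 && probe - sortedKeys[i] <= threshold; i--) {
          result.add(sortedKeys[i]);
        }
        for (int i = pos;
            i < sortedKeys.length && sortedKeys[i] - probe <= threshold; i++) {
          result.add(sortedKeys[i]);
        }
        return result;
      }
    }

Each returned key would then be mapped back to its files, and the auxiliary
data used to confirm or reject each candidate.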


4. Benefits for the Mahout community:

1) A distributed tool for efficient similarity detection of files in large
datasets. Algorithms for detecting similar files can also be useful for
classification purposes and as an aid to search. The next release of Mahout
would then contain this functionality.
2) The tool will also be usable at smaller scale, by running it in a
non-distributed mode on small datasets.
3) Good tutorials on how to work with this tool.


5. Roadmap

1).  Community Bonding Period (April 21st to May 24th)
- 1 week: Familiarize myself with Mahout, Hadoop, and the existing
clustering algorithms;
- 2 weeks: Understand the data structures (Vector, ClusterBase, Writable and
other interfaces and classes used in Mahout) and start the initial planning
of the system in terms of interactions and data structures;
- 1 week: Set up a Hadoop cluster of 2-3 nodes;
- During this time, I plan to speak with my mentor and ask him very specific
questions about my implementation plans.

2).  May 24th to Jul 12th
- Week 1: Detailed planning of how the objects will interact (classes,
methods, the Vector interfaces I plan to use, etc.), with feedback from the
mentor.
- Week 2: Write the code for hashing files (a batch process which will work
as an indexer), according to the algorithm, in a TDD style, so that tests
accompany the code from the start. The results will be stored on a normal
file system or HDFS.
- Week 3: Half week: test the program in single-machine (non-distributed)
mode. Start interacting with HBase if time permits.
- Week 4: Interact with HBase; the algorithm will store the hashes in HBase
for fast retrieval.
- Week 5: Test the algorithm on different datasets: local file system,
20newsgroups and Wikipedia.
- Week 6, 7: Add the distributed MapReduce implementation (based on Hadoop)
and write unit tests for it.
- Week 8: Continue with unit tests and test the distributed implementation
on the data sets, for local file system, HDFS and HBase. Fix outstanding
bugs. Analyze results.
- Week 9: Continue with testing and bug fixing and make sure everything is
ready for the mid-term evaluation.

Mid-term evaluation period.

3). Jul 16th - Aug 9th
- Week 10: Extend the batch tool with different options (different families
of hash functions, threshold for similarity). Experiment with different hash
function families; give the user an option to specify hash functions.
- Week 11: Finish the demos for the 20newsgroups and Wikipedia datasets.
Write tests for them. Publish the results from the demos on the wiki.
- Week 12: Write the getting-started guides on how to use the SimHash tool.
- Week 13: Finish everything, fix outstanding bugs, and clean up and
document undocumented code.


6. Biography

My name is Cristi Prodan, I am 23 years old and currently a 2nd-year student
pursuing an MSc degree in Computer Science. During the past year, I have
been interested in the study of machine learning, mainly recommender systems
and clustering techniques. I had heard about Hadoop and Mahout and found the
idea of working with them, if I ever had the chance, an exciting challenge.
My dissertation is on recommender systems, and I will use Mahout at its
core.
When I heard that the Apache Software Foundation would participate in this
year's edition of Google Summer of Code, I was eager to try my hand at
contributing to the Mahout project. As a subscriber to the list since
December 2009, I started interacting with the community and presenting my
ideas. I was mostly interested in the MinHash algorithm [2] (which may also
be used for similarity detection), whose implementation was started by a
contributor. After analyzing his code and seeking advice from other members,
I submitted my first patch to Mahout (on JIRA, MAHOUT-344). During this time
I also skimmed through the wiki and [3].
At university, I have worked extensively with Java. I am also a big fan of
Ruby and Python, and I have recently started digging into Erlang. In the
past two years, in my free time, I did some freelancing, working on various
web applications.


7. References

[1] Caitlin Sadowski, Greg Levin - SimHash: Hash-based Similarity Detection,
December 13, 2007.

[2] Abhinandan Das, Mayur Datar, Ashutosh Garg, Shyam Rajaram - Google News
Personalization: Scalable Online Collaborative Filtering, WWW 2007.

[3] Sean Owen, Robin Anil - Mahout in Action. Manning, 2010 (MEAP).


------------------------------------------------------
This proposal is partly inspired by the one by Konstantin Kafer [
http://drupal.org/files/application.pdf]

