You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@mahout.apache.org by "Sean Owen (JIRA)" <ji...@apache.org> on 2011/05/22 17:30:47 UTC

[jira] [Resolved] (MAHOUT-658) UH-Mine algorithm for frequent pattern mining of uncertain data

     [ https://issues.apache.org/jira/browse/MAHOUT-658?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen resolved MAHOUT-658.
------------------------------

    Resolution: Later

I presume this was a GSoC proposal that did not result in a project. I am going to mark it as "Later" for now as a result. Looks like a fine idea IMHO.

> UH-Mine algorithm for frequent pattern mining of uncertain data
> ---------------------------------------------------------------
>
>                 Key: MAHOUT-658
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-658
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Frequent Itemset/Association Rule Mining
>    Affects Versions: 0.4, 0.5
>            Reporter: Yarco Hayduk
>              Labels: gsoc, gsoc11, gsoc2011, mahout-gsoc-11
>
> Proposal Title: UH-Mine algorithm for frequent pattern mining of uncertain data
> Student Name: Yaroslav Hayduk
> Student E-mail: yarcoh@gmail.com
> Organization/Project: Apache Mahout
> Assigned Mentor:
> Abstract
> Frequent pattern mining detects frequent patterns in data, which in turn permits data analysts determine interesting correlations in the dataset. In uncertain data mining, we suspect, but cannot guarantee, the presence or absence of an item. Currently, Apache
> Mahout does not contain algorithm implementations capable of mining frequent patterns from uncertain data. Hence, I propose to implement UH-Mine during GSoC 2011, used for frequent pattern mining of uncertain data.
> Motivation
> There are many real-life situations, where we observe uncertain data, such as in temperature and wind speed readings, patient diagnosis, and satellite imaging. Due to the probabilistic nature of this data, it takes more time and resources to mine it. Current state of-the-art frequent pattern mining of uncertain data algorithms do not provide sufficient performance results, as most of them are not crafted to execute in parallel.
> In uncertain data mining, we suspect, but cannot guarantee, the presence or absence of an item. For example, when a doctor is 90% sure that a patient suffers from a particular disease. A video lecture by Jian Pei [3] contains a thorough overview of the uncertain data problem domain as well as elaborates on the process of frequent pattern mining of uncertain data.
> Comparison of UF-Growth and UH-Mine
> The UF-Growth [4] algorithm modifies the FP-Growth [2] algorithm in the way the transaction tree is built. FP-Growth uses the FP-Tree, a tree-based data structure, to store a compact representation of the transaction database, which contains frequency information of all frequent items. The FP-Tree, however, does not store existential probabilities, associated with items. Hence, Leung et al. [4] created a data structure, called UF-Tree, which is based on the FP-tree. Each node in the UF-Tree stores an item, its expected support, as well as the number of occurrence of such expected support for each item. To merge the transaction with the child node in UF-Tree, UF-Growth requires both the item and its corresponding existential probability to match. Hence, the algorithm arrives with the UF-Tree, having a much lower compression ratio, that in the FP-Tree, constructed by FP-Growth.
> As such, I propose to adapt and implement the UH-Mine algorithms to the MapReduce programming model. The UH-Struct structure uses the linkage behaviour among transactions corresponding to a branch of the FP-Tree(UF-Tree) without actually creating a projected database.
> This approach is better than FP-Tree even in the deterministic case, when compression from FP-Tree is not high. This turns out to be particularly true for the uncertain case, as discussed earlier. The UH-mine [1] algorithm provides the best trade-offs both in terms of running time and memory usage.
> The UH-Mine algorithm works as follows: it
> 1. prunes the initial DB such that all singleton infrequent items are removed; 
> 2. divides the pruned DB into equal chunks; 
> 3. mines these chunks separately using the UH-Mine(mem) algorithm. To mine frequent patterns, UH-Mine(mem) maintains an H-Struct, which contains pointers to transaction items. At each step, the algorithm adjusts these pointers and does not incur the overheads, associated with the FP(UF)-Tree construction;
> 4. joins the results and 
> 5. scans the pruned DB once again to remove false positives and obtain the actual counts.
> Benefits for the Mahout community
> a) My work adds an algorithm implementation for discovering frequent patterns in uncertain data 
> b) If my algorithm implementation proves to be fast, it would permit future algorithm developers to retrofit my UH-Mine implementation to create H-Mine[5]. H-Mine is an alternative algorithm, which mines frequent patterns from precise data.
> Timeline
> Weeks 1-4: Implement UH-Mine in Java, set up a local Hadoop cluster composed of 3 machines Weeks 5-7: Investigate Mahout code structure, start Adopting UH-Mine to MapReduce. 
> Weeks 8: Summit for mid-term evaluation 
> Weeks 9 - 11: Finish-up with my implementation, refactor code smells, identify and fix performance issues 
> Weeks 11 - 12: Code cleaning, documenting and testing.
> Biography 
> My name is Yarco Hayduk and I am a Comp Sci graduate student at the University of Manitoba, Canada. I'm very interested in concurrency and data mining. Currently, I am working on a similar project, which adopts the UFP-Growth[6] algorithm to MapReduce. I'm using the Parallel FP-Growth algorithm as a starting point for implementing UFP-Growth. I also implemented UH-Mine and H-Mine algorithms in C. Thus, I would use it as a reference for my proposed UH-Mine implementation on MapReduce.
> References
> [1] Jianyong Wang Charu C. Aggarwal, Yan Li and Jing Wang. Frequent pattern mining
> with uncertain data. In KDD '09: 15th ACM SIGKDD International Conference on
> Knowledge Discovery and Data Mining, pages 29--38, Paris, France, June 2010. ACM.
> 10.1145/1557019.1557030.
> [2] Jiawei Han, Jian Pei, and Yiwen Yin. Mining frequent patterns without candidate generation. In SIGMOD '00: 2000 ACM SIGMOD international conference on management
> of data, pages 1--12, Dallas, TX, USA, May 2000. ACM. 10.1145/342009.335372.
> [3] Ming Hua Jian Pei. Mining uncertain and probabilistic data: problems, challenges,
> methods, and applications. http://videolectures.net/kdd08_pei_mupd, Accessed on
> March 11, 2011.
> [4] Carson Kai-Sang Leung, Christopher Lee Carmichael, and Boyu Hao. Efficient mining
> of frequent patterns from uncertain data. In ICDMW '07: 7th IEEE International Conference on Data Mining Workshops, pages 489--494, Omaha, NE, USA, October 2007.
> IEEE. 10.1109/ICDMW.2007.84.
> [5] Jian Pei, Jiawei Han, Hongjun Lu, Shojiro Nishio, Shiwei Tang, and Dongqing Yang.
> H-Mine: Hyper-structure mining of frequent patterns in large databases. In ICDM '01:
> 1st IEEE International Conference on Data Mining, pages 441--448, San Jose, California,
> USA, November 2001. IEEE. 10.1109/ICDM.2001.989550.
> [6] Calin Garboni Toon Calders and Bart Goethals. Efficient pattern mining from uncertain data with sampling. In PAKDD 2010: 14th Pacific-Asia Conference on Knowledge
> Discovery and Data Mining, pages 480--487, Hyderabad, India, June 2010. Springer.
> 10.1007/978-3-642-13657-351.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira