Philip S. Yu

UP-Growth: An Efficient Algorithm for High Utility Itemset Mining
July 26 10:30AM
Mining high utility itemsets from a transactional database refers to the discovery of itemsets with high utility, such as profit. Although a number of relevant approaches have been proposed in recent years, they tend to produce a large number of candidate itemsets, which degrades mining performance in terms of both execution time and space requirements. The situation may become worse when the database contains many long transactions or long high utility itemsets. In this paper, we propose an efficient algorithm, UP-Growth (Utility Pattern Growth), for mining high utility itemsets with a set of techniques for pruning candidate itemsets. Information about high utility itemsets is maintained in a special data structure named UP-Tree (Utility Pattern Tree), so that candidate itemsets can be generated efficiently with only two scans of the database. The performance of UP-Growth was evaluated against state-of-the-art algorithms on different types of datasets. The experimental results show that UP-Growth not only reduces the number of candidates effectively but also substantially outperforms the other algorithms in execution time, especially when the database contains many long transactions.
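To make the problem statement concrete, the following is a minimal sketch of what "high utility itemset" means, assuming utility is measured as purchased quantity times per-unit profit. It is a brute-force enumeration over all itemsets, not UP-Growth and not the UP-Tree; the toy transactions, the profit table, and the names `itemset_utility` and `high_utility_itemsets` are hypothetical and chosen only for illustration.

```python
from itertools import combinations

# Hypothetical toy data: each transaction maps item -> purchased quantity.
transactions = [
    {"a": 1, "b": 2},
    {"a": 2, "c": 1},
    {"b": 1, "c": 3},
    {"a": 1, "b": 1, "c": 1},
]
# Hypothetical per-unit profit of each item.
unit_profit = {"a": 5, "b": 2, "c": 1}


def itemset_utility(itemset, transactions, unit_profit):
    """Sum quantity * unit profit over every transaction that contains the whole itemset."""
    total = 0
    for t in transactions:
        if all(item in t for item in itemset):
            total += sum(t[item] * unit_profit[item] for item in itemset)
    return total


def high_utility_itemsets(transactions, unit_profit, min_utility):
    """Brute force: test every non-empty itemset (exponential in the number of items)."""
    items = sorted(unit_profit)
    result = {}
    for size in range(1, len(items) + 1):
        for itemset in combinations(items, size):
            utility = itemset_utility(itemset, transactions, unit_profit)
            if utility >= min_utility:
                result[itemset] = utility
    return result


found = high_utility_itemsets(transactions, unit_profit, min_utility=16)
for itemset, utility in sorted(found.items()):
    print(itemset, utility)
```

In this toy data the itemset `("a", "b")` reaches the threshold even though its subset `("b",)` does not, so the support-style rule "every subset of a qualifying itemset also qualifies" does not carry over to utility. That is one reason naive approaches end up examining many candidates, which is the cost that pruning techniques aim to reduce.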
k-Support Anonymity Based on Pseudo Taxonomy for Outsourcing of Frequent Itemset Mining
July 26 2:01PM
For any outsourcing service, privacy is a major concern. This paper focuses on outsourcing frequent itemset mining and examines how to protect privacy when attackers have precise knowledge of the supports of some items. We propose a new approach, referred to as k-support anonymity, that protects each sensitive item with k-1 other items of similar support. To achieve k-support anonymity, we introduce a pseudo taxonomy tree and have the third party mine generalized frequent itemsets under the corresponding generalized association rules instead of association rules. The pseudo taxonomy is a construct that facilitates hiding the original items: each original item can map to either a leaf node or an internal node in the taxonomy tree. The rationale for this approach is that, with a taxonomy tree, the k nodes that satisfy k-support anonymity may be any k nodes in the tree with the appropriate supports. This approach can therefore provide more candidates for k-support anonymity with a limited number of fake items, because only the leaf nodes of the taxonomy tree, not the internal nodes, need to appear in the transactions. With ordinary association rule mining, by contrast, the k nodes that satisfy k-support anonymity have to correspond to leaf nodes in the taxonomy tree, which is far more restrictive. The challenge is thus how to generate the pseudo taxonomy tree so that it facilitates k-support anonymity and preserves the original frequent itemsets. The experimental results show that our methods for k-support anonymity achieve very good privacy protection with moderate storage overhead.
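The core condition, that each sensitive item is hidden among k-1 other items of similar support, can be illustrated with a small check. This is only a sketch under stated assumptions: it works on flat items and ignores the pseudo taxonomy tree entirely, and it interprets "similar support" as an absolute difference of at most `tolerance`, which is a hypothetical parameter not taken from the abstract. The function names and the toy transactions are likewise illustrative.

```python
from collections import Counter


def item_supports(transactions):
    """Count, for each item, the number of transactions that contain it."""
    counts = Counter()
    for t in transactions:
        counts.update(set(t))  # set() so an item counts once per transaction
    return counts


def is_k_support_anonymous(supports, sensitive_items, k, tolerance=0):
    """Check that every sensitive item has at least k-1 other items whose
    support differs from its own by at most `tolerance` (assumed similarity test)."""
    for s in sensitive_items:
        similar = [
            item
            for item in supports
            if item != s and abs(supports[item] - supports[s]) <= tolerance
        ]
        if len(similar) < k - 1:
            return False
    return True


# Hypothetical toy database.
transactions = [
    ["milk", "bread"],
    ["milk", "eggs"],
    ["bread", "eggs"],
    ["milk", "bread", "eggs", "caviar"],
]
supports = item_supports(transactions)

print(is_k_support_anonymous(supports, ["milk"], k=3))    # milk shares its support with bread and eggs
print(is_k_support_anonymous(supports, ["caviar"], k=2))  # no other item has caviar's support
```

An attacker who knows the exact support of `caviar` could single it out in this toy database, since no other item has the same support. Making such an item indistinguishable requires adding nodes with matching supports, which is the role the abstract assigns to the pseudo taxonomy.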
Semi-supervised Feature Selection for Graph Classification
July 27 10:30AM
The problem of graph classification has attracted great interest in the last decade. Current research on graph classification assumes the existence of large amounts of labeled training graphs. However, in many applications, the labels of graph data are very expensive or difficult to obtain, while copious amounts of unlabeled graph data are often available. In this paper, we study the problem of semi-supervised feature selection for graph classification and propose a novel solution, called gSSC, to efficiently search for optimal subgraph features using both labeled and unlabeled graphs. Unlike existing feature selection methods in vector spaces, which assume the feature set is given, we perform semi-supervised feature selection for graph data progressively, together with the subgraph feature mining process. We derive a feature evaluation criterion, named gSemi, to estimate the usefulness of subgraph features based upon both labeled and unlabeled graphs. We then propose a branch-and-bound algorithm that efficiently searches for optimal subgraph features by judiciously pruning the subgraph search space. Empirical studies on several real-world tasks demonstrate that our semi-supervised feature selection approach effectively boosts graph classification performance and is very efficient because it prunes the subgraph search space using both labeled and unlabeled graphs.
Tutorial 9: Mining Heterogeneous Information Networks
July 25 2:00PM
With the ubiquity of information networks and their broad applications, there have been numerous studies on the construction, online analytical processing, and mining of information networks in multiple disciplines, including social network analysis, the World-Wide Web, database systems, data mining, machine learning, and networked communication and information systems. Moreover, given the great demand for research in this direction, there is a need for a systematic introduction to methods for analyzing information networks from multiple disciplines. Recently there have been some tutorials on the structures and laws of homogeneous information networks and graphs. However, there are few systematic tutorials on mining a more important kind of network, heterogeneous information networks, which are formed by interconnected, multi-typed nodes and links. In this tutorial, we will present an organized picture of scalable mining of heterogeneous information networks, which complements existing tutorials on knowledge discovery in homogeneous information networks. The tutorial includes the following topics:
  1. introduction: information networks and information network analysis,
  2. data integration, data cleaning and data validation in heterogeneous information networks,
  3. clustering and ranking in heterogeneous information networks,
  4. classification of heterogeneous information networks,
  5. summarization, OLAP and multidimensional analysis in heterogeneous information networks,
  6. evolution of dynamic heterogeneous information networks, and research challenges on mining heterogeneous information networks.