Abstract
Knowledge discovery in databases (KDD) has been studied intensively in recent years. In KDD, inductive classifier learning methods developed in statistics and machine learning have been used to extract classification rules from databases. Although KDD often has to deal with large databases, many previous classifier learning methods are not suitable for them. They were designed under the assumption that any datum in a database is accessible on demand, and they usually need to access each datum several times during learning. Consequently, they require a huge memory space or incur a large I/O cost for accessing storage devices. In this paper, we propose a classifier learning method, called CIDRE, in which data summaries are constructed and classifiers are learned from those summaries. This learning method is realized by means of a clustering method, called the MCF-tree, which is an extension of the CF-tree proposed by Zhang et al. In our method, the size of the memory space occupied by the data summaries can be specified, and the database is swept only once to construct them. In addition, new instances can be inserted into the summaries incrementally. Thus, the method possesses properties that are desirable for dealing with large databases. We also present empirical results, which indicate that our method performs very well in comparison to C4.5 and naive Bayes, and that the extension from the CF-tree to the MCF-tree is indispensable for achieving high classification accuracy.
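The abstract does not define the MCF-tree entries themselves, but as a rough illustration of the kind of bounded, incrementally updatable summary described above, the following minimal Python sketch assumes a BIRCH-style clustering feature (count, linear sum, squared sum), as in Zhang et al.'s CF-tree, augmented with per-class counts so that a classifier can later be learned from the summary alone. The names ClassLabeledCF, insert, and merge are hypothetical and are not taken from the paper.

    import numpy as np

    class ClassLabeledCF:
        """Hypothetical class-labeled clustering feature (not the paper's MCF definition).

        Summarizes a group of instances by count, linear sum, and element-wise
        squared sum (as in a CF-tree entry), plus per-class instance counts.
        """

        def __init__(self, dim, num_classes):
            self.n = 0                                        # number of summarized instances
            self.ls = np.zeros(dim)                           # linear sum of attribute vectors
            self.ss = np.zeros(dim)                           # element-wise squared sum
            self.class_counts = np.zeros(num_classes, dtype=int)

        def insert(self, x, label):
            """Absorb one instance incrementally; the raw instance is not retained."""
            x = np.asarray(x, dtype=float)
            self.n += 1
            self.ls += x
            self.ss += x * x
            self.class_counts[label] += 1

        def merge(self, other):
            """Merge another summary into this one, e.g. to respect a memory bound."""
            self.n += other.n
            self.ls += other.ls
            self.ss += other.ss
            self.class_counts += other.class_counts

        def centroid(self):
            return self.ls / self.n

        def variance(self):
            # Per-attribute variance is recoverable from the stored sums alone.
            return self.ss / self.n - (self.ls / self.n) ** 2

Because insertion and merging only update a fixed number of sums, such summaries can be built in a single sweep of the database within a prescribed amount of memory, which is the property the abstract emphasizes.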