[PNSa05] Combined use of association rules mining and clustering methods
International peer-reviewed conference:
3rd World Conf. on Computational Statistics & Data Analysis, Limassol, Cyprus,
January 2005,
Keywords:
Abstract:
We consider the problem of discovering correlations between variables in large sparse matrices, applied to car manufacturing. Our data consist of more than 80,000 vehicles described by more than 3,000 binary attributes. Each attribute is a boolean variable equal to 1 if the vehicle has the attribute and 0 otherwise. A first idea is to mine association rules to find frequent co-occurrences of attributes. The problem is to generate all association rules whose support and confidence exceed user-specified minimum thresholds. The higher the support, the more frequent the set of variables; the higher the confidence, the fewer counterexamples the rule has. In our case, setting the support and confidence thresholds is particularly tricky. The minimum support has to be very low because vehicle attributes are very rare, unlike items in market-basket data. In addition, a slight variation of the threshold made the number of rules jump from zero to more than one million, and rule complexity increased sharply (for instance, some rules contain 13 attributes). To solve this problem, we propose clustering the variables to build homogeneous groups of attributes and then mining association rules inside each of these groups. We used several clustering methods (partitional and hierarchical, with different similarity coefficients and aggregation strategies). The resulting partitions were compared using the Rand index, which shows that they are rather close. The study shows that the joint use of association rules and clustering methods is more relevant. Indeed, this approach leads to a substantial reduction in the number of rules produced. Furthermore, it appears that complex rules are always generated by the same groups of attributes identified by the clustering.
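The three measures named in the abstract (support, confidence and the Rand index) can be illustrated with a minimal Python sketch. The toy matrix, the attribute names and the function names below are hypothetical and chosen only for illustration; they are not the authors' data or implementation, and the pairwise Rand index shown is the standard unadjusted definition.

```python
from itertools import combinations

# Toy binary data (hypothetical): rows are vehicles, columns are attributes.
ATTRIBUTES = ["a", "b", "c", "d"]
VEHICLES = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]


def support(itemset, rows=VEHICLES, names=ATTRIBUTES):
    """Fraction of rows in which every attribute of `itemset` equals 1."""
    cols = [names.index(a) for a in itemset]
    hits = sum(1 for row in rows if all(row[c] == 1 for c in cols))
    return hits / len(rows)


def confidence(antecedent, consequent, rows=VEHICLES, names=ATTRIBUTES):
    """Support of antecedent and consequent together, divided by the
    support of the antecedent alone."""
    base = support(antecedent, rows, names)
    if base == 0:
        raise ValueError("antecedent never occurs in the data")
    both = support(list(antecedent) + list(consequent), rows, names)
    return both / base


def rand_index(labels_a, labels_b):
    """Share of object pairs on which two partitions agree, i.e. pairs that
    are grouped together in both or separated in both."""
    pairs = list(combinations(range(len(labels_a)), 2))
    agree = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agree / len(pairs)


if __name__ == "__main__":
    print(support(["a", "b"]))       # support of the itemset {a, b}
    print(confidence(["a"], ["b"]))  # confidence of the rule a -> b
    # Two partitions of the four attributes that differ only in cluster names.
    print(rand_index([0, 0, 1, 1], [1, 1, 0, 0]))
```

In the approach described above, rules would be mined separately within each cluster of attributes, so `support` and `confidence` would be evaluated only on itemsets drawn from a single group rather than from all attributes at once.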