# [SAP05a] Some Statistical Aspects of Credit Scoring

**Invited conferences:**
3rd World Conference on Computational Statistics & Data Analysis, Limassol, Cyprus,
January 2005

**Abstract:**
The Basel II regulations have brought new interest in supervised classification methodologies for predicting the probability of default on loans.
Default probabilities may be computed directly, or by means of a score function. An important feature of consumer credit is that the predictors are generally categorical. Logistic regression and linear discriminant analysis are the most frequently used techniques, for they provide easy-to-use scorecards based on additive partial scores. Vapnik's statistical learning theory explains why a prior dimension reduction (e.g. by means of multiple correspondence analysis) improves the robustness of the score function. Ridge regression, linear SVM, and PLS regression are also valuable competitors.
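The scorecard idea can be sketched as follows: with categorical predictors coded as indicators, the coefficient fitted for each category acts as an additive partial score, and the logistic link turns the total score into a default probability. The predictors, categories, and coefficient values below are purely illustrative, not taken from the talk.

```python
import math

# Hypothetical scorecard: one partial score per category of each categorical
# predictor. In logistic regression on one-hot encoded predictors, these
# partial scores are the fitted coefficients (values here are illustrative).
scorecard = {
    "housing":    {"owner": 0.8, "tenant": -0.2, "other": 0.0},
    "employment": {"permanent": 0.6, "temporary": -0.4, "unemployed": -1.1},
}
intercept = -1.0  # hypothetical intercept

def score(applicant):
    """Total score = intercept + sum of the partial scores, one per predictor."""
    return intercept + sum(scorecard[var][cat] for var, cat in applicant.items())

def default_probability(applicant):
    """Logistic link; here a high score means a good risk, so
    P(default) = 1 / (1 + exp(score))."""
    return 1.0 / (1.0 + math.exp(score(applicant)))

applicant = {"housing": "owner", "employment": "permanent"}
s = score(applicant)               # -1.0 + 0.8 + 0.6 = 0.4
p = default_probability(applicant)
```

The additivity is what makes the scorecard easy to use in practice: each answer on the application form contributes a fixed number of points, and only the total matters.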
Density estimation, neural networks, and non-linear SVMs provide direct estimates of the default probability, but they are less widely used because of their lack of interpretability.
Since a probability is also a score, almost all classification methods (including classification trees) may be compared with ROC analysis, which is more informative than the simple misclassification rate. The AUC and Gini's index are related to the well-known non-parametric Wilcoxon-Mann-Whitney test. Some experiments on real data will be presented.
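The link with the Wilcoxon-Mann-Whitney test can be made concrete: the AUC equals the Mann-Whitney U statistic divided by the number of (good, bad) pairs, i.e. the probability that a randomly chosen good customer scores higher than a randomly chosen bad one, and Gini's index is 2·AUC − 1. A minimal sketch, on invented scores:

```python
import itertools

# Hypothetical scores for "good" (non-defaulting) and "bad" (defaulting)
# customers; a higher score means a better risk.
good = [0.9, 0.8, 0.7, 0.4]
bad = [0.6, 0.3, 0.2]

def auc(good, bad):
    """AUC = P(score of a random good > score of a random bad): the
    Mann-Whitney U statistic divided by n_good * n_bad, ties counted 1/2."""
    u = sum(
        1.0 if g > b else 0.5 if g == b else 0.0
        for g, b in itertools.product(good, bad)
    )
    return u / (len(good) * len(bad))

a = auc(good, bad)   # 11 of the 12 pairs are correctly ordered: 11/12
gini = 2 * a - 1     # Gini index = 2*AUC - 1
```

This pairwise formulation is why AUC comparisons apply to any method that outputs a ranking, whether or not its output is a calibrated probability.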
Distinguishing between good and bad customers is not enough, especially for long-term loans. The question is then not only "if" but "when" the customers default. Survival analysis provides new types of scores, but their performance is far more difficult to measure.