How to use Machine Learning in Credit Scoring

Many financial institutions use scoring models to reduce credit risk in credit appraisal and in the granting and supervision of credit. Credit scoring models based on classical statistical theory are widely used. However, these models are less robust when the volume of input data is large: some of the assumptions of classical statistical analysis no longer hold, which reduces prediction accuracy and weakens the model's ability to generalize. In this blog post, we explain how machine learning can be used in credit scoring to achieve more accurate scores from large amounts of data.

According to a large number of empirical studies, machine learning techniques, along with other data-mining algorithms based on computational innovation and transformation, seem to perform better than classical statistical models when fitting data and forecasting. Machine learning algorithms are designed to learn from large amounts of historical data and then make a forecast. Take credit scoring for retail bank loans as an example. The typical business process for providing a loan is: accept loan applications, evaluate the credit risk, decide whether to grant the loan, and supervise the repayment of principal and interest. Two problems arise in this process: how to speed up the credit appraisal, and how to supervise repayment so that adjustments can be made in time once a possible default has been detected.

To solve these two problems, we could build two models: one for the loan origination process and one for the supervising process.

In the origination process, the research population consists of all loan applicants. Using historical application records, a model can be trained to judge whether a new applicant is sufficiently reliable to be granted a loan, given the applicant's characteristic indicators, such as income, marital status, age, and previous defaults.
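As a minimal sketch of such an origination model, the example below trains a logistic regression on synthetic application records with scikit-learn. The data, the feature names, the assumed relationship between features and default, and the 0.5 cut-off are all hypothetical and stand in for a bank's real historical records and credit policy.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 2000

# Synthetic stand-in for historical application records (all hypothetical).
income = rng.normal(50_000, 15_000, n)   # yearly income
age = rng.integers(21, 70, n)            # age in years
married = rng.integers(0, 2, n)          # 1 = married, 0 = not married
prior_defaults = rng.poisson(0.3, n)     # number of previous defaults

# Assumed relationship, used only to generate labels for this sketch:
# lower income and more previous defaults raise the default risk.
logit = -1.5 - (income - 50_000) / 20_000 + 1.2 * prior_defaults - 0.2 * married
y = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)  # 1 = defaulted
X = np.column_stack([income, age, married, prior_defaults])

# Hold out part of the history to check how well the model generalizes.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)
test_accuracy = model.score(X_test, y_test)

# Score a new applicant: income, age, married, previous defaults.
new_applicant = np.array([[38_000, 29, 0, 2]])
p_default = model.predict_proba(new_applicant)[0, 1]
grant_loan = p_default < 0.5  # hypothetical cut-off, not a recommendation

print(f"held-out accuracy: {test_accuracy:.3f}")
print(f"estimated default probability: {p_default:.3f}, grant loan: {grant_loan}")
```

In practice the cut-off would be chosen from the relative cost of a bad loan and a rejected good customer, rather than fixed at 0.5.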

In the supervising process, the research population consists of the applicants who were granted a loan. Using the historical repayment records and characteristics of customers who have completed the entire loan process, we could train another model to judge whether a current customer has a high probability of defaulting. By observing the customer's repayment record for the first few payback periods and any changes in their characteristics, this model revises its assessment based on the updated information. This automated process is more time-efficient and more accurate than traditional approaches.
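The supervising model can be sketched in the same way, with repayment behaviour added to the features. The example below is an illustration under stated assumptions: the data is synthetic, and the three features (income at origination, missed payments in the first three periods, relative income change) and the helper `default_probability` are hypothetical choices, not a prescribed feature set.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n = 3000

# Hypothetical records of customers who completed the whole loan process.
income_k = rng.normal(50, 15, n)          # income at origination, in thousands
missed_early = rng.binomial(3, 0.15, n)   # missed payments in the first 3 periods
income_change = rng.normal(0, 0.1, n)     # relative income change since origination

# Assumed relationship, used only to generate labels for this sketch.
logit = -2.0 - (income_k - 50) / 20 + 1.5 * missed_early - 4.0 * income_change
defaulted = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)

X = np.column_stack([income_k, missed_early, income_change])
monitor = LogisticRegression(max_iter=1000).fit(X, defaulted)


def default_probability(income_k, missed_early, income_change):
    """Re-score one active customer with the latest observations."""
    row = np.array([[income_k, missed_early, income_change]])
    return float(monitor.predict_proba(row)[0, 1])


# The same customer, re-scored as the repayment record accumulates.
p_on_time = default_probability(45, 0, 0.0)      # no missed payments so far
p_two_missed = default_probability(45, 2, -0.2)  # two missed payments, income down 20%

print(f"paying on time:      {p_on_time:.3f}")
print(f"two missed payments: {p_two_missed:.3f}")
```

Re-scoring each customer after every payback period gives the lender an early signal for making adjustments before a default occurs.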

However, many machine learning algorithms are available, so which one is the best? There is no definite answer to this question, because the performance of an algorithm is data-sensitive and depends on the specific data structure. The general way to find an appropriate model for a specific data set, or type of data set, is to apply several widely used and well-proven algorithms to it and compare their performance.
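One common way to run such a comparison is k-fold cross-validation. The sketch below scores four candidate classifiers on the same synthetic, imbalanced data set; the data, the choice of candidates, the hyperparameters, and the use of ROC AUC as the metric are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for a credit data set: roughly 20% of cases are "default".
X, y = make_classification(
    n_samples=1000, n_features=10, n_informative=5,
    weights=[0.8, 0.2], random_state=0,
)

candidates = {
    "CART": DecisionTreeClassifier(max_depth=5, random_state=0),
    "Naive Bayes": GaussianNB(),
    "SVM": make_pipeline(StandardScaler(), SVC()),
    "Logistic regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
}

# Mean ROC AUC over 5 folds for every candidate, on identical data.
scores = {
    name: cross_val_score(clf, X, y, cv=5, scoring="roc_auc").mean()
    for name, clf in candidates.items()
}
best = max(scores, key=scores.get)

for name, auc in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: mean AUC = {auc:.3f}")
print(f"best on this data set: {best}")
```

The ranking only holds for the data set at hand; a different portfolio can produce a different winner, which is why the comparison has to be repeated per data set.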

The two model-building processes described above look similar but produce different models. The repayment supervising model is built in the same way as the loan granting model, but it learns from different historical data: the repayment records and characteristics of past customers who have already finished repaying their loans. A different machine learning algorithm may therefore be needed to suit this data structure.

Today, the most popular machine learning algorithms can be categorized as either single classifiers or ensemble classifiers. Representative single classifiers are CART, Naïve Bayes, SVM, and logistic regression. Ensemble classifiers, which combine single classifiers following the logic of bagging or boosting (and derivatives such as AdaBoost), are also widely used; examples include Random Forests and CART-AdaBoost.
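To illustrate the difference between a single classifier and its ensemble variants, the sketch below evaluates one CART tree next to bagged trees, a random forest, and AdaBoost over decision stumps, all from scikit-learn. The synthetic data and the hyperparameters are assumptions, and no particular ranking is implied for real credit data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    AdaBoostClassifier,
    BaggingClassifier,
    RandomForestClassifier,
)
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for a credit data set (hypothetical).
X, y = make_classification(
    n_samples=600, n_features=10, n_informative=5, random_state=0
)

models = {
    # Single classifier: one CART tree.
    "CART": DecisionTreeClassifier(random_state=0),
    # Bagging: many trees, each fitted on a bootstrap sample, then averaged.
    "Bagged CART": BaggingClassifier(
        DecisionTreeClassifier(), n_estimators=50, random_state=0
    ),
    # Random forest: bagging plus a random feature subset at every split.
    "Random forest": RandomForestClassifier(n_estimators=50, random_state=0),
    # Boosting: shallow trees fitted in sequence, each one focusing on the
    # cases its predecessors misclassified.
    "CART-AdaBoost": AdaBoostClassifier(
        DecisionTreeClassifier(max_depth=1), n_estimators=50, random_state=0
    ),
}

auc = {
    name: cross_val_score(clf, X, y, cv=5, scoring="roc_auc").mean()
    for name, clf in models.items()
}

for name, value in auc.items():
    print(f"{name}: mean AUC = {value:.3f}")
```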

All in all, machine learning is like teaching a freshman in finance how to judge the quality of loans based on historical data, until they are experienced enough to make the decisions themselves. In broader terms, machine learning techniques can be used for all kinds of classification problems. If your business can be framed as a classification problem, why not give it a shot?

In the field of banking and insurance, we develop applications based on machine learning, including credit scoring, risk analytics, fraud detection, and cross-selling.