Loan Approval Prediction - A Comparative Study
Project Overview:
This project tackles the critical financial task of loan approval prediction using machine learning. Utilizing the "Loan Status Prediction" dataset from Kaggle, the study develops and compares predictive models, focusing on the impact of comprehensive data preprocessing, feature engineering, and feature selection techniques on model performance.
Problem Statement:
Financial institutions face the challenge of accurately assessing loan applicant risk to minimize defaults while maximizing approval rates. This project aims to build robust machine learning models that can automate and improve the accuracy of loan approval decisions, addressing common data challenges like missing values, class imbalance, and feature irrelevance.
Dataset:
- Source: Kaggle ("Loan Status Prediction" dataset)
- Size: 614 records, 13 initial features.
- Features: Applicant/Co-applicant income, loan amount, loan term, credit history, gender, marital status, education, employment status, dependents, property area.
- Target Variable: Loan_Status (Approved 'Y' / Rejected 'N').
Key Features & Technologies Used:
- Python Libraries:
- Data Manipulation & Analysis: Pandas, NumPy
- Data Visualization: Matplotlib, Seaborn
- Machine Learning: Scikit-learn (preprocessing, model selection, metrics, and algorithms), imbalanced-learn (imported as imblearn, for SMOTE)
- Techniques:
- Data Cleaning (Imputation: Mode, Median)
- Exploratory Data Analysis (EDA)
- Feature Engineering (Label Encoding, One-Hot Encoding, Derived Features)
- Feature Scaling (StandardScaler)
- Handling Class Imbalance (SMOTE)
- Feature Selection (Recursive Feature Elimination - RFE)
- Model Training & Evaluation (Comparison of two feature sets)
- Algorithms Compared:
- Logistic Regression
- Random Forest
- Support Vector Machine (SVM)
- Decision Tree
- Gaussian Naive Bayes
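As a rough illustration of how the listed algorithms could be compared on two feature sets (all features versus an RFE-selected subset), here is a minimal sketch. It runs on synthetic data from make_classification, not the Kaggle dataset. The choice of 6 selected features, the default hyperparameters, and the compare helper are assumptions made for the example, and SMOTE is left out to keep the sketch dependent on Scikit-learn only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic, mildly imbalanced stand-in for the loan data (assumption).
X, y = make_classification(
    n_samples=300, n_features=12, n_informative=4, n_redundant=2,
    weights=[0.3, 0.7], random_state=42,
)

# RFE wrapped around logistic regression; 6 features is an arbitrary choice.
# Note: fitting the selector on all rows before cross-validation leaks
# information; a Pipeline would avoid that in a real evaluation.
selector = RFE(LogisticRegression(max_iter=1000), n_features_to_select=6)
X_selected = selector.fit_transform(X, y)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=42),
    "SVM": SVC(),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Gaussian Naive Bayes": GaussianNB(),
}


def compare(feature_sets, models, y, cv=5):
    """Return mean cross-validated accuracy per (feature set, model) pair."""
    results = {}
    for set_name, features in feature_sets.items():
        for model_name, model in models.items():
            scores = cross_val_score(model, features, y, cv=cv, scoring="accuracy")
            results[(set_name, model_name)] = scores.mean()
    return results


results = compare({"all features": X, "RFE subset": X_selected}, models, y)
for (set_name, model_name), accuracy in sorted(results.items()):
    print(f"{set_name:>13} | {model_name:<20} | accuracy = {accuracy:.3f}")
```

Accuracy alone can be misleading when the classes are imbalanced, so a fuller comparison would likely also report precision, recall, or F1.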
Methodology:
- Data Loading & Exploration: Loaded the dataset using Pandas, performed initial checks using .info(), .head(), and .describe(). Visualized missing values.
- Data Cleaning & Preprocessing:
- Handled missing values using mode imputation for categorical features and median imputation for numerical features.
- Converted data types where necessary.
- Exploratory Data Analysis (EDA):
- Visualized distributions of categorical (bar plots) and numerical features (distribution plots).
- Analyzed correlations between features using a heatmap, identifying Credit_History as highly correlated with Loan_Status.
- Identified and visualized significant class imbalance in the Loan_Status target variable.
- Feature Engineering & Scaling:
- Applied Label Encoding to binary categorical features (Gender, Married, Education, etc.).
- Used One-Hot Encoding for the 'Dependents' feature after mapping '3+' to 3.
- Created a derived binary feature 'LongTermLoan' based on Loan_Amount_Term.
- Scaled numerical features (ApplicantIncome, CoapplicantIncome, LoanAmount) using StandardScaler so that features with large magnitudes do not dominate scale-sensitive models such as Logistic Regression and SVM.
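The cleaning, encoding, and scaling steps above can be sketched roughly as follows. This is a minimal example on a tiny hand-made sample that reuses the dataset's column names; the values are illustrative. The cutoff for 'LongTermLoan' is not given here, so LONG_TERM_THRESHOLD_MONTHS is a hypothetical placeholder.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Illustrative sample rows (assumption: not taken from the real dataset).
df = pd.DataFrame({
    "Gender": ["Male", "Female", np.nan, "Male"],
    "Married": ["Yes", "No", "Yes", "Yes"],
    "Education": ["Graduate", "Not Graduate", "Graduate", "Graduate"],
    "Dependents": ["0", "3+", "1", np.nan],
    "ApplicantIncome": [5849.0, 4583.0, 3000.0, 2583.0],
    "CoapplicantIncome": [0.0, 1508.0, 0.0, 2358.0],
    "LoanAmount": [np.nan, 128.0, 66.0, 120.0],
    "Loan_Amount_Term": [360.0, 360.0, 180.0, np.nan],
})

categorical = ["Gender", "Married", "Education", "Dependents"]
numerical = ["ApplicantIncome", "CoapplicantIncome", "LoanAmount", "Loan_Amount_Term"]

# Mode imputation for categorical columns, median imputation for numerical ones.
for col in categorical:
    df[col] = df[col].fillna(df[col].mode()[0])
for col in numerical:
    df[col] = df[col].fillna(df[col].median())

# Label-encode the binary categorical columns (classes are sorted, then numbered).
for col in ["Gender", "Married", "Education"]:
    df[col] = LabelEncoder().fit_transform(df[col])

# Map '3+' to 3, then one-hot encode Dependents.
df["Dependents"] = df["Dependents"].str.replace("+", "", regex=False).astype(int)
df = pd.get_dummies(df, columns=["Dependents"], prefix="Dependents")

# Derived binary feature; the 180-month cutoff is a hypothetical choice.
LONG_TERM_THRESHOLD_MONTHS = 180
df["LongTermLoan"] = (df["Loan_Amount_Term"] > LONG_TERM_THRESHOLD_MONTHS).astype(int)

# Standardize the monetary columns to zero mean and unit variance.
scale_cols = ["ApplicantIncome", "CoapplicantIncome", "LoanAmount"]
df[scale_cols] = StandardScaler().fit_transform(df[scale_cols])

print(df.head())
```

In a train/test workflow, the imputation values and the scaler would normally be fitted on the training split only and then applied to the test split, to avoid leaking test information into preprocessing.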