Program Overview

This R script is designed to test different specifications for predicting loan origination outcomes, based on the Home Mortgage Disclosure Act (HMDA) dataset. The goal is to predict the probability of loan origination in the US, focusing on factors like interest rate, debt-to-income ratio, and property value, among others. The analysis also includes model evaluation using performance metrics such as accuracy, AUC, and confusion matrices.

1. Clear Environment and Load Libraries

rm(list = ls()) 

 

packages = c("base64enc","speedglm", "tidyverse", "dplyr","janitor","data.table",
             "openxlsx", "ggplot2", "reshape2","summarytools", "pdftools",  "broom",
             "lmtest","glmnet","margins","doMC","pROC","caret" ,"statar" )

# lapply(packages, install.packages, character.only = TRUE)
lapply(packages, library, character.only = TRUE)

## Loading required package: Matrix

## Loading required package: MASS

## Loading required package: biglm

## Loading required package: DBI

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   4.0.0     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.4     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ tidyr::expand() masks Matrix::expand()
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ✖ tidyr::pack()   masks Matrix::pack()
## ✖ dplyr::select() masks MASS::select()
## ✖ tidyr::unpack() masks Matrix::unpack()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
## 
## Attaching package: 'janitor'
## 
## 
## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test
## 
## 
## 
## Attaching package: 'data.table'
## 
## 
## The following objects are masked from 'package:lubridate':
## 
##     hour, isoweek, mday, minute, month, quarter, second, wday, week,
##     yday, year
## 
## 
## The following objects are masked from 'package:dplyr':
## 
##     between, first, last
## 
## 
## The following object is masked from 'package:purrr':
## 
##     transpose
## 
## 
## 
## Attaching package: 'reshape2'
## 
## 
## The following objects are masked from 'package:data.table':
## 
##     dcast, melt
## 
## 
## The following object is masked from 'package:tidyr':
## 
##     smiths
## 
## 
## 
## Attaching package: 'summarytools'
## 
## 
## The following object is masked from 'package:tibble':
## 
##     view
## 
## 
## Using poppler version 25.05.0
## 
## Loading required package: zoo
## 
## 
## Attaching package: 'zoo'
## 
## 
## The following objects are masked from 'package:data.table':
## 
##     yearmon, yearqtr
## 
## 
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
## 
## 
## Loaded glmnet 4.1-10
## 
## Loading required package: foreach
## 
## 
## Attaching package: 'foreach'
## 
## 
## The following objects are masked from 'package:purrr':
## 
##     accumulate, when
## 
## 
## Loading required package: iterators
## 
## Loading required package: parallel
## 
## Type 'citation("pROC")' for a citation.
## 
## 
## Attaching package: 'pROC'
## 
## 
## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var
## 
## 
## Loading required package: lattice
## 
## 
## Attaching package: 'caret'
## 
## 
## The following object is masked from 'package:purrr':
## 
##     lift

## [[1]]
## [1] "base64enc" "stats"     "graphics"  "grDevices" "utils"     "datasets" 
## [7] "methods"   "base"     
## 
## [[2]]
##  [1] "speedglm"  "biglm"     "DBI"       "MASS"      "Matrix"    "base64enc"
##  [7] "stats"     "graphics"  "grDevices" "utils"     "datasets"  "methods"  
## [13] "base"     
## 
## [[3]]
##  [1] "lubridate" "forcats"   "stringr"   "dplyr"     "purrr"     "readr"    
##  [7] "tidyr"     "tibble"    "ggplot2"   "tidyverse" "speedglm"  "biglm"    
## [13] "DBI"       "MASS"      "Matrix"    "base64enc" "stats"     "graphics" 
## [19] "grDevices" "utils"     "datasets"  "methods"   "base"     
## 
## [[4]]
##  [1] "lubridate" "forcats"   "stringr"   "dplyr"     "purrr"     "readr"    
##  [7] "tidyr"     "tibble"    "ggplot2"   "tidyverse" "speedglm"  "biglm"    
## [13] "DBI"       "MASS"      "Matrix"    "base64enc" "stats"     "graphics" 
## [19] "grDevices" "utils"     "datasets"  "methods"   "base"     
## 
## [[5]]
##  [1] "janitor"   "lubridate" "forcats"   "stringr"   "dplyr"     "purrr"    
##  [7] "readr"     "tidyr"     "tibble"    "ggplot2"   "tidyverse" "speedglm" 
## [13] "biglm"     "DBI"       "MASS"      "Matrix"    "base64enc" "stats"    
## [19] "graphics"  "grDevices" "utils"     "datasets"  "methods"   "base"     
## 
## [[6]]
##  [1] "data.table" "janitor"    "lubridate"  "forcats"    "stringr"   
##  [6] "dplyr"      "purrr"      "readr"      "tidyr"      "tibble"    
## [11] "ggplot2"    "tidyverse"  "speedglm"   "biglm"      "DBI"       
## [16] "MASS"       "Matrix"     "base64enc"  "stats"      "graphics"  
## [21] "grDevices"  "utils"      "datasets"   "methods"    "base"      
## 
## [[7]]
##  [1] "openxlsx"   "data.table" "janitor"    "lubridate"  "forcats"   
##  [6] "stringr"    "dplyr"      "purrr"      "readr"      "tidyr"     
## [11] "tibble"     "ggplot2"    "tidyverse"  "speedglm"   "biglm"     
## [16] "DBI"        "MASS"       "Matrix"     "base64enc"  "stats"     
## [21] "graphics"   "grDevices"  "utils"      "datasets"   "methods"   
## [26] "base"      
## 
## [[8]]
##  [1] "openxlsx"   "data.table" "janitor"    "lubridate"  "forcats"   
##  [6] "stringr"    "dplyr"      "purrr"      "readr"      "tidyr"     
## [11] "tibble"     "ggplot2"    "tidyverse"  "speedglm"   "biglm"     
## [16] "DBI"        "MASS"       "Matrix"     "base64enc"  "stats"     
## [21] "graphics"   "grDevices"  "utils"      "datasets"   "methods"   
## [26] "base"      
## 
## [[9]]
##  [1] "reshape2"   "openxlsx"   "data.table" "janitor"    "lubridate" 
##  [6] "forcats"    "stringr"    "dplyr"      "purrr"      "readr"     
## [11] "tidyr"      "tibble"     "ggplot2"    "tidyverse"  "speedglm"  
## [16] "biglm"      "DBI"        "MASS"       "Matrix"     "base64enc" 
## [21] "stats"      "graphics"   "grDevices"  "utils"      "datasets"  
## [26] "methods"    "base"      
## 
## [[10]]
##  [1] "summarytools" "reshape2"     "openxlsx"     "data.table"   "janitor"     
##  [6] "lubridate"    "forcats"      "stringr"      "dplyr"        "purrr"       
## [11] "readr"        "tidyr"        "tibble"       "ggplot2"      "tidyverse"   
## [16] "speedglm"     "biglm"        "DBI"          "MASS"         "Matrix"      
## [21] "base64enc"    "stats"        "graphics"     "grDevices"    "utils"       
## [26] "datasets"     "methods"      "base"        
## 
## [[11]]
##  [1] "pdftools"     "summarytools" "reshape2"     "openxlsx"     "data.table"  
##  [6] "janitor"      "lubridate"    "forcats"      "stringr"      "dplyr"       
## [11] "purrr"        "readr"        "tidyr"        "tibble"       "ggplot2"     
## [16] "tidyverse"    "speedglm"     "biglm"        "DBI"          "MASS"        
## [21] "Matrix"       "base64enc"    "stats"        "graphics"     "grDevices"   
## [26] "utils"        "datasets"     "methods"      "base"        
## 
## [[12]]
##  [1] "broom"        "pdftools"     "summarytools" "reshape2"     "openxlsx"    
##  [6] "data.table"   "janitor"      "lubridate"    "forcats"      "stringr"     
## [11] "dplyr"        "purrr"        "readr"        "tidyr"        "tibble"      
## [16] "ggplot2"      "tidyverse"    "speedglm"     "biglm"        "DBI"         
## [21] "MASS"         "Matrix"       "base64enc"    "stats"        "graphics"    
## [26] "grDevices"    "utils"        "datasets"     "methods"      "base"        
## 
## [[13]]
##  [1] "lmtest"       "zoo"          "broom"        "pdftools"     "summarytools"
##  [6] "reshape2"     "openxlsx"     "data.table"   "janitor"      "lubridate"   
## [11] "forcats"      "stringr"      "dplyr"        "purrr"        "readr"       
## [16] "tidyr"        "tibble"       "ggplot2"      "tidyverse"    "speedglm"    
## [21] "biglm"        "DBI"          "MASS"         "Matrix"       "base64enc"   
## [26] "stats"        "graphics"     "grDevices"    "utils"        "datasets"    
## [31] "methods"      "base"        
## 
## [[14]]
##  [1] "glmnet"       "lmtest"       "zoo"          "broom"        "pdftools"    
##  [6] "summarytools" "reshape2"     "openxlsx"     "data.table"   "janitor"     
## [11] "lubridate"    "forcats"      "stringr"      "dplyr"        "purrr"       
## [16] "readr"        "tidyr"        "tibble"       "ggplot2"      "tidyverse"   
## [21] "speedglm"     "biglm"        "DBI"          "MASS"         "Matrix"      
## [26] "base64enc"    "stats"        "graphics"     "grDevices"    "utils"       
## [31] "datasets"     "methods"      "base"        
## 
## [[15]]
##  [1] "margins"      "glmnet"       "lmtest"       "zoo"          "broom"       
##  [6] "pdftools"     "summarytools" "reshape2"     "openxlsx"     "data.table"  
## [11] "janitor"      "lubridate"    "forcats"      "stringr"      "dplyr"       
## [16] "purrr"        "readr"        "tidyr"        "tibble"       "ggplot2"     
## [21] "tidyverse"    "speedglm"     "biglm"        "DBI"          "MASS"        
## [26] "Matrix"       "base64enc"    "stats"        "graphics"     "grDevices"   
## [31] "utils"        "datasets"     "methods"      "base"        
## 
## [[16]]
##  [1] "doMC"         "parallel"     "iterators"    "foreach"      "margins"     
##  [6] "glmnet"       "lmtest"       "zoo"          "broom"        "pdftools"    
## [11] "summarytools" "reshape2"     "openxlsx"     "data.table"   "janitor"     
## [16] "lubridate"    "forcats"      "stringr"      "dplyr"        "purrr"       
## [21] "readr"        "tidyr"        "tibble"       "ggplot2"      "tidyverse"   
## [26] "speedglm"     "biglm"        "DBI"          "MASS"         "Matrix"      
## [31] "base64enc"    "stats"        "graphics"     "grDevices"    "utils"       
## [36] "datasets"     "methods"      "base"        
## 
## [[17]]
##  [1] "pROC"         "doMC"         "parallel"     "iterators"    "foreach"     
##  [6] "margins"      "glmnet"       "lmtest"       "zoo"          "broom"       
## [11] "pdftools"     "summarytools" "reshape2"     "openxlsx"     "data.table"  
## [16] "janitor"      "lubridate"    "forcats"      "stringr"      "dplyr"       
## [21] "purrr"        "readr"        "tidyr"        "tibble"       "ggplot2"     
## [26] "tidyverse"    "speedglm"     "biglm"        "DBI"          "MASS"        
## [31] "Matrix"       "base64enc"    "stats"        "graphics"     "grDevices"   
## [36] "utils"        "datasets"     "methods"      "base"        
## 
## [[18]]
##  [1] "caret"        "lattice"      "pROC"         "doMC"         "parallel"    
##  [6] "iterators"    "foreach"      "margins"      "glmnet"       "lmtest"      
## [11] "zoo"          "broom"        "pdftools"     "summarytools" "reshape2"    
## [16] "openxlsx"     "data.table"   "janitor"      "lubridate"    "forcats"     
## [21] "stringr"      "dplyr"        "purrr"        "readr"        "tidyr"       
## [26] "tibble"       "ggplot2"      "tidyverse"    "speedglm"     "biglm"       
## [31] "DBI"          "MASS"         "Matrix"       "base64enc"    "stats"       
## [36] "graphics"     "grDevices"    "utils"        "datasets"     "methods"     
## [41] "base"        
## 
## [[19]]
##  [1] "statar"       "caret"        "lattice"      "pROC"         "doMC"        
##  [6] "parallel"     "iterators"    "foreach"      "margins"      "glmnet"      
## [11] "lmtest"       "zoo"          "broom"        "pdftools"     "summarytools"
## [16] "reshape2"     "openxlsx"     "data.table"   "janitor"      "lubridate"   
## [21] "forcats"      "stringr"      "dplyr"        "purrr"        "readr"       
## [26] "tidyr"        "tibble"       "ggplot2"      "tidyverse"    "speedglm"    
## [31] "biglm"        "DBI"          "MASS"         "Matrix"       "base64enc"   
## [36] "stats"        "graphics"     "grDevices"    "utils"        "datasets"    
## [41] "methods"      "base"

cat("\014")

#Number of Cores for paralell computing  
registerDoMC(cores =detectCores()-2)

path <- "[[FILEPATH]]"
input <- paste(path,"/input",sep="/" )
output <- paste(path,"/output",sep="/" )

source(paste0(path,"/code/stat_binscatter.R"))

2. Load and Clean Data

The HMDA (Home Mortgage Disclosure Act) Public Loan Application Registry (LAR) dataset is a comprehensive collection of data on mortgage applications in the United States, maintained by the Consumer Financial Protection Bureau (CFPB). It includes information from lenders about loan applications, including whether loans were approved, denied, or withdrawn, as well as demographic details about the applicants. The data aims to provide transparency into lending practices, especially with respect to how loans are made to different racial, ethnic, and income groups.

Key elements of the LAR dataset include:

Loan characteristics: Loan amount, interest rate, loan purpose, property type, and whether the loan is conforming or non-conforming.
Applicant information: Demographic data such as race, ethnicity, gender, income, and age.
Geographic data: Location of the property, including metropolitan statistical area (MSA) and census tract.
Action taken: Outcome of the loan application (e.g., approved, denied, withdrawn).
Credit characteristics: Information on credit scores, debt-to-income ratio, and other risk factors.

This dataset is used for research, policy analysis, and regulatory monitoring to ensure fair lending practices and assess patterns in access to credit across different communities, and is available at an annual frequency. The data is available for download here, and the full data field descriptions are available here.

For this exercise, I use data for 2024 in the entire US with the goal to estimate a model that predicts the probability of a loan originating in the USA based on different features. In this project, I focus on Single Family, first lien mortgages in the entire USA. The dependent variable is the application status, which includes the following categories:

1 - Loan originated.
2 - Application approved but not accepted.
3 - Application denied
4 - Application withdrawn by applicant.
5 - File closed for incompleteness.
6 - Purchased loan.
7 - Pre-approval request denied.
8 - Pre-approval request approved but not accepted

I classify a loan as “originated” if it falls under category 1, and 0 otherwise. Also, I perform some data cleaning steps to guarantee better quality data to use in the model:

First, I determine if the application had a co-applicant based on their credit score field. if co_applicant_credit_score_type=10, then I conclude that the application had no co-applicant.
The purpose variable is assigned two values based on the data dictionary: “Purchase” (if purpose = 1) or “Refinancing” (if purpose = 31) .
I focus on family (non-commercial) properties only. Therefore, I keep applications where business_or_commercial_purpose==2 according to the data dictionary.
I also keep loans where its possible to determine whether they are conforming or non conforming GSE loans. I keep these loans because lenders have strong incentives to originate GSE conforming loans, as these loans comply with a defined set of underwriting guidelines set the the U.S Government.
I also keep loans with complete information, such as interest rate, property value and applicant income
Finally, I exclude reverse mortgages and exempt loans.

I also winsorize at 99% (only upper bound) quantitative variables prone to outliers, such as loan amount, income, property value, and loan to value ratios.

data_file <- paste(input,"2024_public_lar_csv.csv", sep = "/") 

hmda_data <- fread(data_file)  %>% clean_names()  
str(hmda_data)

## Classes 'data.table' and 'data.frame':   12229298 obs. of  99 variables:
##  $ activity_year                           : int  2024 2024 2024 2024 2024 2024 2024 2024 2024 2024 ...
##  $ lei                                     : chr  "WWB2V0FCW3A0EE3ZJN75" "WWB2V0FCW3A0EE3ZJN75" "WWB2V0FCW3A0EE3ZJN75" "WWB2V0FCW3A0EE3ZJN75" ...
##  $ derived_msa_md                          : int  14860 35300 99999 29540 99999 25540 11694 25540 20100 33874 ...
##  $ state_code                              : chr  "CT" "CT" "PA" "PA" ...
##  $ county_code                             : int  9190 9170 42107 42071 42057 9110 51059 9110 10001 42017 ...
##  $ census_tract                            : chr  "09190210701" "09170194101" "42107000700" "42071014501" ...
##  $ conforming_loan_limit                   : chr  "C" "C" "C" "C" ...
##  $ derived_loan_product_type               : chr  "Conventional:Subordinate Lien" "Conventional:Subordinate Lien" "Conventional:First Lien" "Conventional:Subordinate Lien" ...
##  $ derived_dwelling_category               : chr  "Single Family (1-4 Units):Site-Built" "Single Family (1-4 Units):Site-Built" "Single Family (1-4 Units):Site-Built" "Single Family (1-4 Units):Site-Built" ...
##  $ derived_ethnicity                       : chr  "Hispanic or Latino" "Not Hispanic or Latino" "Not Hispanic or Latino" "Not Hispanic or Latino" ...
##  $ derived_race                            : chr  "Black or African American" "White" "White" "White" ...
##  $ derived_sex                             : chr  "Male" "Joint" "Male" "Joint" ...
##  $ action_taken                            : int  3 1 3 1 1 3 1 3 1 1 ...
##  $ purchaser_type                          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ preapproval                             : int  2 2 2 2 2 2 2 2 2 2 ...
##  $ loan_type                               : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ loan_purpose                            : int  31 2 2 2 31 2 4 2 2 31 ...
##  $ lien_status                             : int  2 2 1 2 1 2 2 2 2 2 ...
##  $ reverse_mortgage                        : int  2 2 2 2 2 2 2 2 2 2 ...
##  $ open_end_line_of_credit                 : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ business_or_commercial_purpose          : int  2 2 2 2 2 2 2 2 2 2 ...
##  $ loan_amount                             :integer64 105000 85000 25000 125000 55000 15000 455000 355000 ... 
##  $ combined_loan_to_value_ratio            : chr  "79.41" "44.37" "35" "50.55" ...
##  $ interest_rate                           : chr  NA "12.59" NA "9.34" ...
##  $ rate_spread                             : chr  NA "4.31" NA "1.17" ...
##  $ hoepa_status                            : int  3 2 3 2 2 3 2 3 2 2 ...
##  $ total_loan_costs                        : chr  NA NA NA NA ...
##  $ total_points_and_fees                   : chr  NA NA NA NA ...
##  $ origination_charges                     : chr  NA NA NA NA ...
##  $ discount_points                         : chr  NA NA NA NA ...
##  $ lender_credits                          : chr  NA NA NA NA ...
##  $ loan_term                               : chr  "360" "360" "360" "360" ...
##  $ prepayment_penalty_term                 : chr  NA NA NA NA ...
##  $ intro_rate_period                       : chr  "1" "1" "1" "1" ...
##  $ negative_amortization                   : int  2 2 2 2 2 2 2 2 2 2 ...
##  $ interest_only_payment                   : int  2 1 2 1 1 2 1 2 1 1 ...
##  $ balloon_payment                         : int  2 2 2 2 2 2 2 2 2 2 ...
##  $ other_nonamortizing_features            : int  2 2 2 2 2 2 2 2 2 2 ...
##  $ property_value                          : chr  "455000" "695000" "85000" "865000" ...
##  $ construction_method                     : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ occupancy_type                          : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ manufactured_home_secured_property_type : int  3 3 3 3 3 3 3 3 3 3 ...
##  $ manufactured_home_land_property_interest: int  5 5 5 5 5 5 5 5 5 5 ...
##  $ total_units                             : chr  "1" "1" "1" "1" ...
##  $ multifamily_affordable_units            : chr  NA NA NA NA ...
##  $ income                                  : int  86 124 NA 113 34 NA 352 67 NA 379 ...
##  $ debt_to_income_ratio                    : chr  "50%-60%" "42" NA "47" ...
##  $ applicant_credit_score_type             : int  12 12 12 12 12 12 12 12 12 12 ...
##  $ co_applicant_credit_score_type          : int  10 12 10 12 10 10 12 10 10 12 ...
##  $ applicant_ethnicity_1                   : int  1 2 2 2 2 2 2 2 2 2 ...
##  $ applicant_ethnicity_2                   : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ applicant_ethnicity_3                   : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ applicant_ethnicity_4                   : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ applicant_ethnicity_5                   : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ co_applicant_ethnicity_1                : int  5 2 5 2 5 5 2 5 5 2 ...
##  $ co_applicant_ethnicity_2                : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ co_applicant_ethnicity_3                : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ co_applicant_ethnicity_4                : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ co_applicant_ethnicity_5                : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ applicant_ethnicity_observed            : int  2 2 2 2 2 2 2 2 2 2 ...
##  $ co_applicant_ethnicity_observed         : int  4 2 4 2 4 4 2 4 4 2 ...
##  $ applicant_race_1                        : int  3 5 5 5 5 3 2 5 5 5 ...
##  $ applicant_race_2                        : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ applicant_race_3                        : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ applicant_race_4                        : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ applicant_race_5                        : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ co_applicant_race_1                     : int  8 5 8 5 8 8 5 8 8 5 ...
##  $ co_applicant_race_2                     : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ co_applicant_race_3                     : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ co_applicant_race_4                     : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ co_applicant_race_5                     : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ applicant_race_observed                 : int  1 2 2 2 2 2 2 2 2 2 ...
##  $ co_applicant_race_observed              : int  4 2 4 2 4 4 2 4 4 2 ...
##  $ applicant_sex                           : int  1 2 1 1 1 2 2 2 2 1 ...
##  $ co_applicant_sex                        : int  5 1 5 2 5 5 1 5 5 2 ...
##  $ applicant_sex_observed                  : int  1 2 2 2 2 2 2 2 2 2 ...
##  $ co_applicant_sex_observed               : int  4 2 4 2 4 4 2 4 4 2 ...
##  $ applicant_age                           : chr  "35-44" "55-64" ">74" "55-64" ...
##  $ co_applicant_age                        : chr  "9999" "65-74" "9999" "55-64" ...
##  $ applicant_age_above_62                  : chr  "No" "No" "Yes" "Yes" ...
##  $ co_applicant_age_above_62               : chr  NA "Yes" NA "No" ...
##  $ submission_of_application               : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ initially_payable_to_institution        : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ aus_1                                   : int  6 6 6 6 6 6 6 6 6 6 ...
##  $ aus_2                                   : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ aus_3                                   : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ aus_4                                   : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ aus_5                                   : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ denial_reason_1                         : int  3 10 3 10 10 3 10 1 10 10 ...
##  $ denial_reason_2                         : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ denial_reason_3                         : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ denial_reason_4                         : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ tract_population                        : int  5877 2045 3364 5221 4692 4363 3418 4734 5612 5684 ...
##  $ tract_minority_population_percent       : num  80.84 6.26 12.13 7.03 5.58 ...
##  $ ffiec_msa_md_median_family_income       : int  146500 116200 81400 106700 81400 123800 164200 123800 91300 143700 ...
##  $ tract_to_msa_income_percentage          : num  40 132 107 107 112 93 184 90 121 144 ...
##  $ tract_owner_occupied_units              : int  554 834 1338 1600 1514 1241 990 1464 1620 1849 ...
##  $ tract_one_to_four_family_homes          : int  1715 1455 1974 1891 2130 1587 1121 1789 2330 1903 ...
##  $ tract_median_age_of_housing_units       : int  67 59 0 34 41 71 50 56 28 26 ...
##  - attr(*, ".internal.selfref")=<externalptr>

hmda_data_subset <- hmda_data %>% filter(
          grepl("Single Family",
          derived_dwelling_category,ignore.case = FALSE), 
          #state_code=="CA" ,
          derived_loan_product_type=="Conventional:First Lien",
          action_taken!=6 ,
          loan_purpose %in% c(1,31) ,
          conforming_loan_limit %in% c("C","NC"),
          !is.na(income) ,
          !is.na(property_value),
          !is.na(interest_rate),
          business_or_commercial_purpose==2,
          reverse_mortgage==2,
          debt_to_income_ratio !="Exempt",
          ffiec_msa_md_median_family_income!=0,
          applicant_age != "8888"
                           ) %>% 
          #Generate output variable for logistic regression
          mutate(originated      = as.integer(
                                   case_when(action_taken %in% 1~ 1,
                                          action_taken %in% c(2,3,4,5,7,8) ~ 0,
                                                TRUE ~ NA_real_)), 
                 co_applicant    = ifelse(co_applicant_credit_score_type==10,0,1),
                 purpose         = case_when( loan_purpose == 1 ~ "Purchase",
                                              loan_purpose == 31 ~ "Refinance" ),
                preapproval_req  = case_when( preapproval == 1 ~ "Requested",
                                              preapproval == 2 ~ "Not Requested" )
                             ) %>% 
                      rename(age="applicant_age") %>%
                    #Drop variables that wont be used
                     select(-derived_dwelling_category,
                            -state_code,
                            derived_loan_product_type,
                            -balloon_payment,
                            interest_only_payment,
                            -other_nonamortizing_features,
                            -submission_of_application,
                            -prepayment_penalty_term,
                            -intro_rate_period,
                            -applicant_age_above_62,
                            -initially_payable_to_institution,
                            -negative_amortization,
                            -multifamily_affordable_units,
                            -business_or_commercial_purpose,
                            -starts_with("co_applicant"),-starts_with("applicant"),
                            -matches("aus_[1-5]"),
                            -ends_with("credit_score_type"),
                            -matches("denial_reason_[1-5]")) 

#Lets create  features to include on the logistic regression

hmda_data_analysis <- hmda_data_subset %>% 
  #Winsorize upper bound for quantitative variables.
        mutate( 
         loan_term                    = as.numeric(loan_term),
         interest_rate                = as.numeric(interest_rate),
         loan_amount                  = winsorize(
                                        as.numeric(loan_amount),
                                        probs = c(NA, 0.99)),
         total_units                  = as.numeric(total_units),
         property_value               = winsorize(
                                        as.numeric(property_value), 
                                        probs = c(NA, 0.99)),
         combined_loan_to_value_ratio = winsorize(loan_amount/property_value,
                                        probs = c(NA, 0.99)),
         rate_spread                  = as.numeric(rate_spread),
         income                       = winsorize(income*1000,
                                        probs = c(NA, 0.99)), 
         #Income is in thousands USD
         lloan                        = log1p(loan_amount),
         lincome                      = log1p(income),
         lproperty_value              = log1p(property_value),
         income_loan_ratio            = income/loan_amount,
         income_prop_ratio            = income/property_value,
         rel_income                   = income  /ffiec_msa_md_median_family_income,
         age                          = factor(age),
         derived_race                 = factor(derived_race),
         derived_ethnicity            = factor(derived_ethnicity),
         derived_sex                  = factor(derived_sex),
         conforming_loan_limit        = factor(conforming_loan_limit),
         preapproval_req              = factor(preapproval_req) ,
         hoepa_status                 = factor(hoepa_status) ,
         purpose                      = factor(purpose),
         debt_to_income_ratio         = factor(debt_to_income_ratio ),
         derived_msa_md               = factor(derived_msa_md) 
                                                  )

## 0.00 % observations replaced at the bottom

## 1.00 % observations replaced at the top

## 0.00 % observations replaced at the bottom

## 1.00 % observations replaced at the top

## 0.00 % observations replaced at the bottom

## 0.38 % observations replaced at the top

## 0.00 % observations replaced at the bottom

## 1.00 % observations replaced at the top

## Warning: There were 3 warnings in `mutate()`.
## The first warning was:
## ℹ In argument: `property_value = winsorize(as.numeric(property_value), probs =
##   c(NA, 0.99))`.
## Caused by warning in `winsorize()`:
## ! NAs introduced by coercion
## ℹ Run `dplyr::last_dplyr_warnings()` to see the 2 remaining warnings.

#Drop datasets to save memory
rm(hmda_data)
gc()

##             used   (Mb) gc trigger   (Mb)   max used    (Mb)
## Ncells   4744438  253.4   10536911  562.8    6204832   331.4
## Vcells 147496386 1125.4 1272595884 9709.2 1325071957 10109.5

vars_use     <-  c("originated", "derived_ethnicity","derived_race", "derived_sex",
                   "conforming_loan_limit", "preapproval_req", "hoepa_status",
                   "purpose", "debt_to_income_ratio","combined_loan_to_value_ratio",
                   "interest_rate", "rate_spread", "age", "loan_term",
                   "total_units",  "tract_population",
                   "tract_minority_population_percent",
                   "tract_to_msa_income_percentage", "tract_owner_occupied_units",
                   "tract_one_to_four_family_homes",
                   "property_value","income",
                   "tract_median_age_of_housing_units",  "lloan", "lincome", 
                   "lproperty_value",  "income_loan_ratio","income_prop_ratio",
                   "rel_income" ,"derived_msa_md" 
)

reg_data <- hmda_data_analysis  %>% select(all_of(vars_use)) %>% filter(complete.cases(.))

3. Descriptive Plots

Here, I plot some visuals to check whether there is relationship between the mortgage action and some of the features. Before diving in, I create a summary table for all the variables, which allows me to see if there are outliers (after Winsorizing) and provides me with an idea of how the data loo

dfSummary(reg_data)

## Data Frame Summary  
## reg_data  
## Dimensions: 2396648 x 30  
## Duplicates: 6134  
## 
## -----------------------------------------------------------------------------------------------------------------------------------------------
## No   Variable                            Stats / Values                    Freqs (% of Valid)       Graph                  Valid      Missing  
## ---- ----------------------------------- --------------------------------- ------------------------ ---------------------- ---------- ---------
## 1    originated                          Min  : 0                          0 :  103835 ( 4.3%)                             2396648    0        
##      [integer]                           Mean : 1                          1 : 2292813 (95.7%)      IIIIIIIIIIIIIIIIIII    (100.0%)   (0.0%)   
##                                          Max  : 1                                                                                              
## 
## 2    derived_ethnicity                   1. Ethnicity Not Available         302355 (12.6%)          II                     2396648    0        
##      [factor]                            2. Free Form Text Only                684 ( 0.0%)                                 (100.0%)   (0.0%)   
##                                          3. Hispanic or Latino              249231 (10.4%)          II                                         
##                                          4. Joint                            65879 ( 2.7%)                                                     
##                                          5. Not Hispanic or Latino         1778499 (74.2%)          IIIIIIIIIIIIII                             
## 
## 3    derived_race                        1. 2 or more minority races          4361 ( 0.2%)                                 2396648    0        
##      [factor]                            2. American Indian or Alaska        11816 ( 0.5%)                                 (100.0%)   (0.0%)   
##                                          3. Asian                           237070 ( 9.9%)          I                                          
##                                          4. Black or African American       117092 ( 4.9%)                                                     
##                                          5. Free Form Text Only                306 ( 0.0%)                                                     
##                                          6. Joint                            61220 ( 2.6%)                                                     
##                                          7. Native Hawaiian or Other          3058 ( 0.1%)                                                     
##                                          8. Race Not Available              326215 (13.6%)          II                                         
##                                          9. White                          1635510 (68.2%)          IIIIIIIIIIIII                              
## 
## 4    derived_sex                         1. Female                         523035 (21.8%)           IIII                   2396648    0        
##      [factor]                            2. Joint                          974982 (40.7%)           IIIIIIII               (100.0%)   (0.0%)   
##                                          3. Male                           770539 (32.2%)           IIIIII                                     
##                                          4. Sex Not Available              128092 ( 5.3%)           I                                          
## 
## 5    conforming_loan_limit               1. C                              2275160 (94.9%)          IIIIIIIIIIIIIIIIII     2396648    0        
##      [factor]                            2. NC                              121488 ( 5.1%)          I                      (100.0%)   (0.0%)   
## 
## 6    preapproval_req                     1. Not Requested                  2250561 (93.9%)          IIIIIIIIIIIIIIIIII     2396648    0        
##      [factor]                            2. Requested                       146087 ( 6.1%)          I                      (100.0%)   (0.0%)   
## 
## 7    hoepa_status                        1. 1                                  892 ( 0.0%)                                 2396648    0        
##      [factor]                            2. 2                              2161474 (90.2%)          IIIIIIIIIIIIIIIIII     (100.0%)   (0.0%)   
##                                          3. 3                               234282 ( 9.8%)          I                                          
## 
## 8    purpose                             1. Purchase                       2116797 (88.3%)          IIIIIIIIIIIIIIIII      2396648    0        
##      [factor]                            2. Refinance                       279851 (11.7%)          II                     (100.0%)   (0.0%)   
## 
## 9    debt_to_income_ratio                1. <20%                           137690 ( 5.7%)           I                      2396648    0        
##      [factor]                            2. >60%                             8647 ( 0.4%)                                  (100.0%)   (0.0%)   
##                                          3. 20%-<30%                       407143 (17.0%)           III                                        
##                                          4. 30%-<36%                       412599 (17.2%)           III                                        
##                                          5. 36                              80422 ( 3.4%)                                                      
##                                          6. 37                              83699 ( 3.5%)                                                      
##                                          7. 38                              87547 ( 3.7%)                                                      
##                                          8. 39                              92339 ( 3.9%)                                                      
##                                          9. 40                              94555 ( 3.9%)                                                      
##                                          10. 41                             99036 ( 4.1%)                                                      
##                                          [ 9 others ]                      892971 (37.3%)           IIIIIII                                    
## 
## 10   combined_loan_to_value_ratio        Mean (sd) : 0.8 (0.2)             16973 distinct values                      :    2396648    0        
##      [numeric]                           min < med < max:                                                         : . :    (100.0%)   (0.0%)   
##                                          0 < 0.8 < 1                                                              : : :                        
##                                          IQR (CV) : 0.2 (0.3)                                                   . : : :                        
##                                                                                                         . . : : : : : :                        
## 
## 11   interest_rate                       Mean (sd) : 6.8 (1)               4255 distinct values           :                2396648    0        
##      [numeric]                           min < med < max:                                                 :                (100.0%)   (0.0%)   
##                                          0 < 6.6 < 20.5                                                   :                                    
##                                          IQR (CV) : 0.9 (0.1)                                           . :                                    
##                                                                                                         : : .                                  
## 
## 12   rate_spread                         Mean (sd) : 0.3 (1.1)             66529 distinct values                  :        2396648    0        
##      [numeric]                           min < med < max:                                                         :        (100.0%)   (0.0%)   
##                                          -385 < 0.2 < 100                                                         :                            
##                                          IQR (CV) : 0.8 (3.3)                                                     :                            
##                                                                                                                   :                            
## 
## 13   age                                 1. <25                            116099 ( 4.8%)                                  2396648    0        
##      [factor]                            2. >74                             61876 ( 2.6%)                                  (100.0%)   (0.0%)   
##                                          3. 25-34                          686017 (28.6%)           IIIII                                      
##                                          4. 35-44                          626517 (26.1%)           IIIII                                      
##                                          5. 45-54                          404939 (16.9%)           III                                        
##                                          6. 55-64                          316013 (13.2%)           II                                         
##                                          7. 65-74                          185187 ( 7.7%)           I                                          
## 
## 14   loan_term                           Mean (sd) : 345.7 (48.7)          346 distinct values                    :        2396648    0        
##      [numeric]                           min < med < max:                                                         :        (100.0%)   (0.0%)   
##                                          1 < 360 < 480                                                            :                            
##                                          IQR (CV) : 0 (0.1)                                                       :                            
##                                                                                                                   :                            
## 
## 15   total_units                         Mean (sd) : 1 (0.2)               1 : 2360165 (98.5%)      IIIIIIIIIIIIIIIIIII    2396648    0        
##      [numeric]                           min < med < max:                  2 :   29311 ( 1.2%)                             (100.0%)   (0.0%)   
##                                          1 < 1 < 4                         3 :    4526 ( 0.2%)                                                 
##                                          IQR (CV) : 0 (0.2)                4 :    2646 ( 0.1%)                                                 
## 
## 16   tract_population                    Mean (sd) : 4858.1 (2152.4)       8174 distinct values       :                    2396648    0        
##      [integer]                           min < med < max:                                             :                    (100.0%)   (0.0%)   
##                                          0 < 4570 < 30199                                             :                                        
##                                          IQR (CV) : 2423 (0.4)                                      . : .                                      
##                                                                                                     : : :                                      
## 
## 17   tract_minority_population_percent   Mean (sd) : 34 (23.7)             9741 distinct values       :                    2396648    0        
##      [numeric]                           min < med < max:                                             : :                  (100.0%)   (0.0%)   
##                                          0 < 27.1 < 100                                             . : : :                                    
##                                          IQR (CV) : 32.1 (0.7)                                      : : : : : .                                
##                                                                                                     : : : : : : : : . .                        
## 
## 18   tract_to_msa_income_percentage      Mean (sd) : 115.5 (42.5)          346 distinct values        : :                  2396648    0        
##      [numeric]                           min < med < max:                                             : :                  (100.0%)   (0.0%)   
##                                          0 < 110 < 516                                                : :                                      
##                                          IQR (CV) : 48 (0.4)                                          : : .                                    
##                                                                                                     . : : : .                                  
## 
## 19   tract_owner_occupied_units          Mean (sd) : 1261.2 (578.8)        2834 distinct values       :                    2396648    0        
##      [integer]                           min < med < max:                                             : :                  (100.0%)   (0.0%)   
##                                          0 < 1202 < 6276                                              : :                                      
##                                          IQR (CV) : 741 (0.5)                                       . : : .                                    
##                                                                                                     : : : :                                    
## 
## 20   tract_one_to_four_family_homes      Mean (sd) : 1663.2 (690.1)        3511 distinct values       : .                  2396648    0        
##      [integer]                           min < med < max:                                             : :                  (100.0%)   (0.0%)   
##                                          0 < 1613 < 8233                                              : :                                      
##                                          IQR (CV) : 851 (0.4)                                         : :                                      
##                                                                                                     : : : :                                    
## 
## 21   property_value                      Mean (sd) : 541920.1 (433464.2)   266 distinct values        :                    2396648    0        
##      [numeric]                           min < med < max:                                             :                    (100.0%)   (0.0%)   
##                                          5000 < 425000 < 2655000                                    : : .                                      
##                                          IQR (CV) : 380000 (0.8)                                    : : :                                      
##                                                                                                     : : : : .                                  
## 
## 22   income                              Mean (sd) : 170163.9 (153800)     1003 distinct values     : :                    2396648    0        
##      [numeric]                           min < med < max:                                           : :                    (100.0%)   (0.0%)   
##                                          0 < 125000 < 1002000                                       : :                                        
##                                          IQR (CV) : 121000 (0.9)                                    : : :                                      
##                                                                                                     : : : . .                                  
## 
## 23   tract_median_age_of_housing_units   Mean (sd) : 35.6 (18.1)           77 distinct values             : : :            2396648    0        
##      [integer]                           min < med < max:                                               : : : :            (100.0%)   (0.0%)   
##                                          -1 < 35 < 80                                                 . : : : : : .                            
##                                          IQR (CV) : 26 (0.5)                                        : : : : : : : : .                          
##                                                                                                     : : : : : : : : : .                        
## 
## 24   lloan                               Mean (sd) : 12.6 (0.7)            177 distinct values                    :        2396648    0        
##      [numeric]                           min < med < max:                                                       : :        (100.0%)   (0.0%)   
##                                          8.5 < 12.7 < 14.4                                                      : :                            
##                                          IQR (CV) : 0.9 (0.1)                                                 : : : :                          
##                                                                                                             . : : : : .                        
## 
## 25   lincome                             Mean (sd) : 11.7 (1)              1003 distinct values                     :      2396648    0        
##      [numeric]                           min < med < max:                                                           :      (100.0%)   (0.0%)   
##                                          0 < 11.7 < 13.8                                                            :                          
##                                          IQR (CV) : 0.9 (0.1)                                                       :                          
##                                                                                                                   : : :                        
## 
## 26   lproperty_value                     Mean (sd) : 13 (0.7)              266 distinct values                  . :        2396648    0        
##      [numeric]                           min < med < max:                                                       : :        (100.0%)   (0.0%)   
##                                          8.5 < 13 < 14.8                                                        : :                            
##                                          IQR (CV) : 0.9 (0.1)                                                 . : : :                          
##                                                                                                             . : : : : .                        
## 
## 27   income_loan_ratio                   Mean (sd) : 0.5 (0.5)             75358 distinct values    :                      2396648    0        
##      [numeric]                           min < med < max:                                           :                      (100.0%)   (0.0%)   
##                                          0 < 0.4 < 75.8                                             :                                          
##                                          IQR (CV) : 0.3 (1)                                         :                                          
##                                                                                                     :                                          
## 
## 28   income_prop_ratio                   Mean (sd) : 0.4 (0.3)             102519 distinct values   :                      2396648    0        
##      [numeric]                           min < med < max:                                           :                      (100.0%)   (0.0%)   
##                                          0 < 0.3 < 47.8                                             :                                          
##                                          IQR (CV) : 0.2 (0.8)                                       :                                          
##                                                                                                     :                                          
## 
## 29   rel_income                          Mean (sd) : 1.7 (1.5)             135754 distinct values   :                      2396648    0        
##      [numeric]                           min < med < max:                                           :                      (100.0%)   (0.0%)   
##                                          0 < 1.2 < 38.8                                             :                                          
##                                          IQR (CV) : 1.1 (0.9)                                       :                                          
##                                                                                                     : .                                        
## 
## 30   derived_msa_md                      1. 10180                             1019 ( 0.0%)                                 2396648    0        
##      [factor]                            2. 10380                              209 ( 0.0%)                                 (100.0%)   (0.0%)   
##                                          3. 10420                             6043 ( 0.3%)                                                     
##                                          4. 10500                              432 ( 0.0%)                                                     
##                                          5. 10540                              752 ( 0.0%)                                                     
##                                          6. 10580                             6502 ( 0.3%)                                                     
##                                          7. 10740                             6208 ( 0.3%)                                                     
##                                          8. 10780                              587 ( 0.0%)                                                     
##                                          9. 10900                             6640 ( 0.3%)                                                     
##                                          10. 11020                             584 ( 0.0%)                                                     
##                                          [ 408 others ]                    2367672 (98.8%)          IIIIIIIIIIIIIIIIIII                        
## -----------------------------------------------------------------------------------------------------------------------------------------------

Now, lets turn to the plots. I selected some examples that show some interesting patterns:

Property Value and Originations: this plot shows the relationsuip between property values and the origination decision. Altough the data between originated and not originated is evenly distributed across property values, the scatter-plot shows that for properties above $1.8 million, there is a higher concentration in originated mortgages, relative to non-originated mortgages. This is likely because rich customers purchase the most expensive homes.
Income and Originations: this plot shows that applicants earning more than $500,000 are more likely than not to have a mortgage originated.
Loan to Value Ratio and Originations: this plot shows the relationship between loan to value ratio and origination. In principle, this plot shows how common is for a mortgage to be originated as the loan size relative to the property value changes. The closer this value is to 1, the largest share of the property value will be financed by the requested mortgage. As expected, the plot shows a lower mass of denied origination when loan to value ratios are lower than 20% (0.2).

In all instances, I add the geom_jitter( ) option to the scatter-plots so I can see the distribution values across the entire sample.

p1 <-ggplot(reg_data, aes(x=property_value/1e06 ,y = originated)) +    
  geom_jitter(  )+   
  labs(  title = "Property Value vs Originations",  x = "Property Value (Million $)",y = "Origination") + 
  theme_bw()+
  theme( plot.title = element_text(size = 12, face = "bold", hjust = 0.5), 
         # remove the vertical and horizontal grid lines
         panel.grid.major.x = element_blank(),
         panel.grid.minor.x = element_blank(),
         panel.grid.major.y = element_blank(),
         panel.grid.minor.y = element_blank(),
         legend.position="bottom") + 
  guides(color=guide_legend(nrow=1, byrow=TRUE))+
  scale_x_continuous( n.breaks=15  ) 

p2 <-ggplot(reg_data, aes(x=income/1000 ,y = originated)) +    
  geom_jitter(  )+   
  labs(  title = "Income vs Originations",  x = "Income (Thousands $)",y = "Origination") + 
  theme_bw()+
  theme( plot.title = element_text(size = 12, face = "bold", hjust = 0.5), 
         # remove the vertical and horizontal grid lines
         panel.grid.major.x = element_blank(),
         panel.grid.minor.x = element_blank(),
         panel.grid.major.y = element_blank(),
         panel.grid.minor.y = element_blank(),
         legend.position="bottom") + 
  guides(color=guide_legend(nrow=1, byrow=TRUE))+
  scale_x_continuous( n.breaks=15  ) 

p3 <-ggplot(reg_data, aes(x=combined_loan_to_value_ratio,y = originated)) +    
  geom_jitter(  )+   
  labs(  title = "Loan To Value Ratio vs Originations",  x = "Loan To Value Ratio ",y = "Origination") + 
  theme_bw()+
  theme( plot.title = element_text(size = 12, face = "bold", hjust = 0.5), 
         # remove the vertical and horizontal grid lines
         panel.grid.major.x = element_blank(),
         panel.grid.minor.x = element_blank(),
         panel.grid.major.y = element_blank(),
         panel.grid.minor.y = element_blank(),
         legend.position="bottom") + 
  guides(color=guide_legend(nrow=1, byrow=TRUE))+
  scale_x_continuous( n.breaks=15  ) 

print(p1)

print(p2)

print(p3)

4. Running a Baseline Logit regression

This is the baseline logit regression where I include all of the features. This code prepares training and testing samples for a k-fold cross validation, handles class imbalance by weighting observations, and then estimates a logistic regression with a full set of predictors. Here are some considerations that go into defining the model:

Set a seed to guarantee that any random sampling generates the same result each time the script is run. This ensures that the model is replicable across exercises.
Split the data into training and testing sets. Given a seed, the dataset is randomly divided into 80% training and 20% testing, as it is common practice. The training set is used to estimate (“train”) the model, while the test set will later be used for out-of-sample prediction accuracy.
Create10 K-folds for cross validation. A K-fold cross-validation is a method to evaluate a model’s performance by splitting the data into K equal parts (folds). The idea is to train the model on K-1 folds and test it on the remaining fold, repeating the process for each of the K folds, so each fold serves as the test set once. Then, I average the model results from each fold to provide a more reliable estimate of the model’s performance. This helps reduce bias and assess how well the model generalizes to new data.
Addressing Class Imbalance. As it can be seen in the summary section, the HMDA data shows that most applications are originated, so the dependent variable is highly imbalanced, which may create biased in the model where it over-predicts the origination class. To improve the model’s ability to discriminate between originated vs. non-originated applications, I weight the observations in a way such that observations in the minority class receive higher weight, and those in the majority class receive lower weight. Although it may not eliminate the bias completely, it can help in addressing this bias partially.

Taking these considerations into account, I then proceed to estimate the baseline model. The binary nature of the origination variable is appropriate logistic regression, a commonly applied statistical framework. This model describes the relationship between the co-variates (or features) $x_i$ and the probability of an outcome $y_i$ as

\[ P\{Y = 1 \mid X\} = F\left(\beta_0 + \sum_{k=1}^K x_{ik}\beta_k\right) \]

where, in the case of logistic regression,

\[ F(z) = \frac{e^{z}}{1 + e^{z}}, \quad \text{with } z = \beta_0 + \sum_{k=1}^K x_{ik}\beta_k. \]

One quirk of this framework is that the relationship between the predictors and the probability is nonlinear, so the coefficients cannot be directly interpreted as the impact of a particular feature on the dependent variable (origination). In fact, the marginal effect depends on the values taken by each feature. Specifically, the marginal effect of variable $x_l$ is:

\[ \frac{\partial P\{Y = 1 \mid X\}}{\partial x_l} = f\left(\beta_0 + \sum_{k=1}^K x_{ik}\beta_k\right)\beta_l, \]

where the logistic density function is

\[ f(z) = \frac{e^{z}}{(1 + e^{z})^2}. \]

This shows that, although the coefficient cannot be directly be interpreted, it informs the direction of the feature impact on origination. Using this framework, I estimate a logit regression and summarize the results. The baseline logistic regression model includes the following selected features:

Loan pricing variables: interest rate, rate spread.
Loan contract terms: loan_term, total_units, CLTV.
Income and property controls: log loan amount, log income, log property value.
Relative ratios: income-to-loan, income-to-property, relative income.
Applicant characteristics: age.
Loan classification and process variables: conforming limit status, pre-approval, purpose, DTI ratio.

The model is estimated using the logistic link function, which is appropriate when predicting a binary outcome such as loan origination.

#seed
set.seed(100)
#split model into Train and test set
split     <- 0.8
idxt      <- sample(nrow(reg_data), floor(nrow(reg_data)*split)  )
 
train_set <- reg_data[idxt,]
test_set  <- reg_data[-idxt,]

y_train   <- train_set$originated
 
y_test    <- test_set$originated

#create k-folds for model evaluation
kfolds    <- 10
train_set <- train_set %>% mutate(folds= sample(rep(1:kfolds, length.out = nrow(train_set))))
 
#Notice that there is high imbalance of origination. To circumvent this, we wan weight the regression observations 
#to improve the model accuracy 
 
class_counts <- table(train_set$originated)
print(class_counts)

## 
##       0       1 
##   83012 1834306

w_reg     <- ifelse(train_set$originated == 1,
                           class_counts[2]/sum(class_counts),
                           class_counts[1]/sum(class_counts))


model1 <- glm(originated ~  interest_rate + rate_spread + loan_term + 
                            total_units  +    combined_loan_to_value_ratio +
                            lloan+ lincome +lproperty_value +income_loan_ratio + income_prop_ratio +
                            rel_income +
                           age + # derived_sex + derived_race + derived_ethnicity+
                            conforming_loan_limit  +  preapproval_req   +
                            purpose  +   debt_to_income_ratio    , 
                            data = train_set,
                            family = binomial(link="logit"),
                            weights = w_reg )

## Warning in eval(family$initialize): non-integer #successes in a binomial glm!

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

summary(model1)

## 
## Call:
## glm(formula = originated ~ interest_rate + rate_spread + loan_term + 
##     total_units + combined_loan_to_value_ratio + lloan + lincome + 
##     lproperty_value + income_loan_ratio + income_prop_ratio + 
##     rel_income + age + conforming_loan_limit + preapproval_req + 
##     purpose + debt_to_income_ratio, family = binomial(link = "logit"), 
##     data = train_set, weights = w_reg)
## 
## Coefficients:
##                                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                   4.8795279  0.6177527   7.899 2.82e-15 ***
## interest_rate                -0.3510755  0.0162928 -21.548  < 2e-16 ***
## rate_spread                  -0.0654729  0.0108019  -6.061 1.35e-09 ***
## loan_term                     0.0019715  0.0002987   6.601 4.09e-11 ***
## total_units                  -0.1381124  0.0909706  -1.518 0.128962    
## combined_loan_to_value_ratio -0.2663941  0.2981713  -0.893 0.371629    
## lloan                         0.0690528  0.1713605   0.403 0.686972    
## lincome                      -0.0535478  0.0321762  -1.664 0.096072 .  
## lproperty_value               0.2686721  0.1738214   1.546 0.122182    
## income_loan_ratio             0.0590934  0.0594706   0.994 0.320389    
## income_prop_ratio             0.1871901  0.0997586   1.876 0.060596 .  
## rel_income                   -0.0374449  0.0193345  -1.937 0.052783 .  
## age>74                       -0.4091205  0.1188039  -3.444 0.000574 ***
## age25-34                     -0.0603867  0.0819668  -0.737 0.461292    
## age35-44                     -0.1491200  0.0833957  -1.788 0.073759 .  
## age45-54                     -0.2018191  0.0862739  -2.339 0.019321 *  
## age55-64                     -0.2644938  0.0882270  -2.998 0.002719 ** 
## age65-74                     -0.2984472  0.0955807  -3.122 0.001793 ** 
## conforming_loan_limitNC      -0.3103028  0.0984929  -3.151 0.001630 ** 
## preapproval_reqRequested     -1.2575178  0.0489259 -25.702  < 2e-16 ***
## purposeRefinance             -0.1516038  0.0589669  -2.571 0.010141 *  
## debt_to_income_ratio>60%     -0.0046827  0.2167303  -0.022 0.982762    
## debt_to_income_ratio20%-<30%  0.1119832  0.0801958   1.396 0.162602    
## debt_to_income_ratio30%-<36%  0.1235669  0.0827056   1.494 0.135161    
## debt_to_income_ratio36        0.0845910  0.1164888   0.726 0.467733    
## debt_to_income_ratio37        0.0912878  0.1155725   0.790 0.429601    
## debt_to_income_ratio38        0.1407608  0.1156708   1.217 0.223639    
## debt_to_income_ratio39        0.1267745  0.1136363   1.116 0.264586    
## debt_to_income_ratio40        0.1041655  0.1127038   0.924 0.355361    
## debt_to_income_ratio41        0.1239216  0.1120885   1.106 0.268913    
## debt_to_income_ratio42        0.1186308  0.1097539   1.081 0.279750    
## debt_to_income_ratio43        0.1146258  0.1121983   1.022 0.306953    
## debt_to_income_ratio44        0.0782997  0.1093976   0.716 0.474155    
## debt_to_income_ratio45        0.0391938  0.1133361   0.346 0.729479    
## debt_to_income_ratio46       -0.0101036  0.1131872  -0.089 0.928871    
## debt_to_income_ratio47       -0.1509358  0.1079310  -1.398 0.161979    
## debt_to_income_ratio48        0.2893566  0.1192149   2.427 0.015217 *  
## debt_to_income_ratio49        0.2953503  0.1072764   2.753 0.005902 ** 
## debt_to_income_ratio50%-60%   0.0791309  0.1274855   0.621 0.534794    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 51696  on 1917317  degrees of freedom
## Residual deviance: 49233  on 1917279  degrees of freedom
## AIC: 7581.5
## 
## Number of Fisher Scoring iterations: 9

As explained above, although the coefficients are not fully indicative of the effect of the feature on originations (under no endogeneity), the coefficient sign is still informative of the direction of the feature’s predictive capability of the dependent variable (origination). Below, I interpret the coefficient sins for some of these features:

interest_rate: Higher interest rates reduce the probability that a mortgage is originated. The effect is also significant at the 5% significance level.
rate_spread: This variable measures the difference between the covered loan’s annual percentage rate (APR) and the average prime offer rate (APOR) for a comparable transaction. The negative and statistically significant sign showsthat the an increase of this rate spread results in a less likely approval.
loan_term: +0.00197 ***

Longer-term loans (30-year vs 15-year) very slightly increase origination odds.
The effect is small, but statistically significant.
combined_loan_to_value_ratio: Higher loan to value ratio tends to reduce origination likelihood, but it is not statistically significant .
income_loan_ratio: A higher household income relative to loan amount predict that origination is more likely.
income_prop_ratio: Higher income relative to property value raises origination probability.
rel_income: Applicants earning more than area median show a small decrease in odds. Given the multiple income variables included, this may be contaminated by multicolinearity. To deal with this, I will test multiple specifications in the later sections
Age variables: the base category is age 18–24 and all significant age coefficients are negative, meaning older applicants are less likely to be originated, all else constant. These results fit the pattern lenders may perceive higher long-term risk or lower loan profitability for older applicants.
conforming_loan_limitNC: Being non-conforming reduces origination probability.
This is expected because conforming loans are easier to sell/securitize and non-conforming loans carry higher underwriting frictions.
preapproval_reqRequested: Applicants who requested a pre-approval have lower origination probability. This reflects that many pre-approval requests never convert and that borrowers likely shop for rates.
purposeRefinance: Refinance loans are less likely to be originated relative to purchase loans. This is consistent with higher documentation hurdles required to refinance a property, higher rate-trigger sensitivity, and potential borrower attrition during refinancing process.

5. Evaluating Logit Model prediction performance.

I create a custom function, performance_measures(), which will later be used in the cross-validation exercises. This function computes several commonly used evaluation metrics for my logistic regressions. The goal is to have a unified tool that assesses both in-sample and out-of-sample predictive performance. The function returns a comparison table, listing each of the following measures:

Threshold: This is the cutoff value used to classify predicted probabilities into classes:
- If P(Y=1 | X) > threshold, predict 1
- Otherwise, predict 0

Different thresholds change the trade off between sensitivity and specificity.

Sensitivity (True Positive Rate): this metric calculates how well the model identifies actual positives.

\[ \text{Sensitivity} = \frac{\text{True Positives}}{\text{True Positives + False Negatives}} \]

In practical terms, Sensitivity answers this question: Of all actual positive cases, how many did the model correctly identify?
Specificity (True Negative Rate): this metric calculates how well the model identifies actual negatives.

\[ \text{Specificity} = \frac{\text{True Negatives}}{\text{True Negatives + False Positives}} \]

Similar to Sensitivity, Specificity answers this question: Of all actual negative cases, how many did the model correctly classify?
Accuracy: This metric calculates the fraction of all predictions that the model got correct.

\[ \text{Accuracy} = \frac{\text{True Positives + True Negatives}}{\text{Total Observations}} \]

Although helpful, it can be misleading as it can grow substantially if the outcome variable is imbalanced.

Class Error: This metric is the complement of accuracy and represents the proportion of incorrect predictions.

\[ \text{Class Error} = 1 - \text{Accuracy} \]

Separately, I also include the ROC curve in the function. the ROC curve depicts the trade-off between true and false positive rates predicted by different thresholds against those delivered by a random number generator. Consequently, the further away the model’s ROC curve from a 45 degree line, the better it will be at predicting mortgage origination. This trade-off can be summarized by calculating the Area Under the ROC Curve (AUC). By design, the random-number generator produces an area of 0.5, so any value higher than that is indicative of a better fit, up to an area of 1, where the fit is perfect. Therefore, the AUC measures overall discrimination ability of the model.

I declare the function and run all of these measures (in-sample) for the baseline model below:

performance_measures <- function (model, y,pred_data=NULL,threshold=0.5){
  #Calculate performance Measures: accuracy, specificity, sensitivity, confusion matrix, 
  #ROC, AUC, and optimal threshold
  #Inputs:
  #model: a glm  or cv.glmnet object
  #y: dependent variables used to evaluate fit.
  #pred_data: prediction x variables if doing an out of sample check (if NULL, in-sample measures are used for glm)
  #if using a cv.glmnet object, this argument should always be provided using the output of the model.matrix() call
  #threshold: threshold used to classify y=1. Default to 0.5
  
  if(class(model)[1]=="cv.glmnet"){
    if (is.null(pred_data)) stop("pred_data (x-matrix) is necessary for glmnet models.")
    
    probs <- as.numeric(predict(model, newx = pred_data, 
                                type = "response", s = model$lambda.min
    )
    )
    
  }else if (class(model)[1]=="glm"){
    
    if(is.null(pred_data)){
      probs       <- predict(model, type = "response")
    }else{
      probs       <- predict(model, pred_data, type = "response") 
    }
  } else{
    stop("Function only handles glm or cv.glmnet models")
  }
  
  preds       <- factor(ifelse(probs > 0.5, 1, 0), levels = c(0,1))
  yvar        <- factor(y, levels = c(0,1)) 
  # Accuracy  confusion matrix
  accuracy    <- mean(preds == yvar)
  classerror  <-  1- accuracy
  Confmat     <- confusionMatrix(yvar, preds, positive="1")
  #roc  
  roc_obj     <- roc(response = yvar, predictor = probs)
  
  coords          <- coords(roc_obj,threshold, ret = c("threshold", 
                                                       "sensitivity", "specificity") )
  optimal_coords  <- coords(roc_obj, "best", ret = c("threshold", 
                                                     "sensitivity", "specificity"), 
                            best.method = "youden")
  
  # Accuracy + confusion matrix under optimal threshold
  preds_optim       <-  factor(ifelse(probs > optimal_coords$threshold,1,0), 
                               levels = c(0,1))
  accuracy_optim    <- mean(preds_optim == yvar)
  classerror_optim  <-  1- accuracy_optim
  Confmat_optim     <- confusionMatrix(yvar, preds_optim, positive="1")
  
  #put comparison table together
  comparison           <- rbind(c(coords,accuracy,classerror),
                           c(optimal_coords,accuracy_optim,
                             classerror_optim))
  
  rownames(comparison) <- c("Baseline","Optimal")
  colnames(comparison) <- c("Threshold","Sensitivity",
                            "Specificity","Accuracy","Class_Error")
  
  results           <- list(roc      = roc_obj,
                            yhat     = probs,
                            baseline = list(y_preds           = preds,
                                            accuracy          = accuracy,
                                            classerror        = classerror,
                                            confusion_matrix  = Confmat,
                                            performance       = coords),
                            
                            optimal   = list(y_preds          = preds_optim,
                                            accuracy          = accuracy_optim,
                                            classerror        = classerror_optim,
                                            confusion_matrix  = Confmat_optim,
                                            performance       = optimal_coords),
                            
                            comparison = comparison
                            
                            )
return(results)
}

performance <- performance_measures(model=model1,y=y_train )

## Setting levels: control = 0, case = 1

## Setting direction: controls < cases

#print auc
performance$roc$auc

## Area under the curve: 0.7062

#Print confusion matrices
performance$baseline$confusion_matrix$table

##           Reference
## Prediction       0       1
##          0       4   83008
##          1       1 1834305

performance$optimal$confusion_matrix$table

##           Reference
## Prediction       0       1
##          0   42716   40296
##          1  300229 1534077

#Print Comparison
performance$comparison

##          Threshold Sensitivity Specificity Accuracy  Class_Error
## Baseline 0.5       0.9999995   4.81858e-05 0.9567057 0.04329433 
## Optimal  0.9976337 0.8363256   0.5145762   0.8223951 0.1776049

The results above show that the model does a better job as classifying origination (Sensitivity) than denials (Specificity). This is not surprising given the sample imbalance discussed above. Moreover, the optimal threshold is 99% which gives an increase in Specificity of about 51% at a 16.3 percentage point cost in Sensitivity.

6. Cross Validation and Model Choice.

The baseline model include includes a selection of variables. When working with many predictors (especially when some are correlated) it becomes easy for a model to over-fit. Over-fitting occurs when a model captures noise or idiosyncrasies in the training data rather than the true underlying patterns. An over-fit model performs extremely well on the training set but poorly on new, unseen data, which defeats the purpose of prediction. In this section, I implement two methods for model choice: using regularization techniques, and by choosing between different candidate specifications.

6.1 Using Regularization (Ridge and Lasso)

To improve model performance and prevent over fitting, I estimate lasso and ridge models. Both techniques apply regularization, which means they add a penalty to the size of the regression coefficients. This helps reduce overfitting, stabilize the model, and improve prediction accuracy.

Ridge (L2 penalty):
Ridge regression adds a penalty proportional to the sum of squared coefficients in the objective function (in this case, the likelihood function). This method shrinks coefficients toward zero but never forces them to be exactly zero. In logistic regression, this helps handle multicollinearity and produces more stable estimates.
The penalty term is
\[\lambda \sum_j \beta_j^2\]
Lasso (L1 penalty):
Lasso adds a penalty proportional to the sum of the absolute values of the coefficients in the objective function (in this case, the likelihood function). Unlike ridge, lasso can shrink some coefficients exactly to zero, effectively performing variable selection inside the logistic regression model.
The penalty term is
\[\lambda \sum_j |\beta_j|\]

Both ridge and lasso use cross-validation to choose the optimal penalty level $\lambda$. A higher $\lambda$ increases the amount of shrinkage, and a lower $\lambda$ makes the model closer to standard logistic regression. In the sections below, I estimate both models using cv.glmnet() with 10-fold cross-validation.

#Estimate Ridge + Lasso logit
model_formula <- as.formula(originated ~ interest_rate + rate_spread + loan_term + 
                              total_units + combined_loan_to_value_ratio +
                              lloan + lincome + lproperty_value + income_loan_ratio                               + income_prop_ratio +   rel_income +  age +  
                              conforming_loan_limit + preapproval_req +
                              purpose + debt_to_income_ratio  )
#Define design matrix with all variables for both traiining and testing sets
X_train <- model.matrix(
  model_formula,
  data = train_set
)[, -1]  # drop intercept column

X_test <- model.matrix(model_formula,
  data = test_set
)[, -1]  # drop intercept column

#I also use paralell computing to speed the calculation 
ridge_model <- cv.glmnet(  x=X_train, y=y_train, family = "binomial",
                           alpha = 0,         # ridge penalty
                           weights = w_reg,
                           type.measure = "auc" ,  # cross-validated AUC
                           foldid = train_set$fold ,
                           parallel = TRUE
)

## Warning: executing %dopar% sequentially: no parallel backend registered

plot(ridge_model)

best_lambda_ridge <- ridge_model$lambda.min
best_lambda_ridge

## [1] 0.0003049336

coef_ridge <- coef(ridge_model, s = "lambda.min")
 


lasso_model <- cv.glmnet(  x=X_train, y=y_train, family = "binomial",
                         alpha = 1,         # ridge penalty
                         weights = w_reg,
                         type.measure = "auc",   # cross-validated AUC
                         foldid = train_set$fold ,
                         parallel = TRUE
)

plot(lasso_model)

best_lambda_lasso <- lasso_model$lambda.min
best_lambda_lasso

## [1] 6.569596e-06

coef_lasso <- coef(lasso_model, s = "lambda.min")

Both penalized models identify broadly similar core predictors of loan origination but differ in how aggressively they shrink coefficients. Ridge, which uses an ℓ₂ penalty, retains all predictors and shrinks them smoothly toward zero. Its optimal penalty (λ ≈ 0.00030) yields coefficients that remain close to the baseline logit estimates in both sign and magnitude (e.g., interest_rate and rate_spread remain strongly negative, while variables such as purposeRefinance and preapproval_reqRequested continue to exert substantial negative effects). Lasso, using an ℓ₁ penalty, selects a more parsimonious model due to much stronger shrinkage (λ ≈ 6.6e-06). Several debt-to-income indicator variables drop out entirely, suggesting limited predictive value once correlated predictors are jointly considered. The remaining lasso coefficients mirror the baseline model’s direction and relative importance, with particularly strong effects for preapproval_reqRequested, interest_rate, and key age categories.

Comparing across methods, ridge maintains stability across all predictors, which is expected given its suitability for handling multicollinearity without enforcing sparsity. Lasso, in contrast, performs variable selection: categories of debt-to-income and some age buckets collapse to exactly zero, indicating these effects are not needed for predictive performance once the model regularizes correlated features.

Interestingly, both methods preserve the strongest structural relationships from the baseline logit—large negative effects for interest_rate, rate_spread, and preapproval_reqRequested—suggesting that these signals are robust to penalization. The lasso’s trimmed structure highlights which categorical splines truly matter for classification, while ridge emphasizes consistency and coefficient stability. Together, the models indicate that the baseline logistic specification is not overfitting substantially, but lasso offers a more interpretable subset of predictors without sacrificing predictive discrimination (as indicated by the relatively flat AUC curves near the chosen λ values).

6.2 By Intuition.

6.2.1 Define the Candidate Specifications.

I prepare several candidate specifications based on economic intuition. The idea is to compare different sets of predictors (in-sample) to determine which combination provides the best balance between predictive accuracy and interpretability, and then to perform a cross validation using the final model. In mortgage underwriting, different groups of variables play distinct roles:

Pricing variables (such as interest rate and rate spread) capture the cost of borrowing;
Borrower income and loan size variables describe repayment capacity; and
Categorical factors, such as loan purpose, age group, or conforming loan status, reflect regulatory constraints and borrower characteristics that lenders may consider.

By constructing separate specifications (full, numeric-only, categorical-only, and loan-term-focused), I can evaluate how each conceptual grouping contributes to predicting loan origination. Contrary to the regularization techniques described before, this model comparison process helps avoid over-fitting and ensures that the final chosen model is grounded in both statistical performance and economic reasoning.

specifications <- list(
  full              = y_var ~ interest_rate + rate_spread + loan_term +                                   total_units  + combined_loan_to_value_ratio +  
                      lloan+ lincome +lproperty_value +income_loan_ratio +                                income_prop_ratio + rel_income + age + 
                      conforming_loan_limit  +  preapproval_req   + purpose  +                            debt_to_income_ratio,
  
numeric_only      = y_var ~ interest_rate + rate_spread + loan_term + total_units  +                     combined_loan_to_value_ratio + lloan+ lincome +lproperty_value                      +income_loan_ratio+ income_prop_ratio  
                    +rel_income,
categories_only   = y_var ~   age + 
                    conforming_loan_limit  +  preapproval_req   + purpose  +                            debt_to_income_ratio  ,
  
loan_terms_only = y_var ~  interest_rate + rate_spread + loan_term +   total_units   
)


# Cross the dependent variables with specifications
specifications <- as.data.frame(expand.grid(dependent = "originated", 
                                            spec_name = names(specifications),
                                      stringsAsFactors = FALSE ) %>%
                  mutate(
                     formula = map2(
                     spec_name, dependent, ~ {
                     # Get the formula template
                     formula_string <- deparse(specifications[[.x]])
                     # Collapse to a single string
                     formula_string <- paste(formula_string, collapse = " ")
                     # Replace `y_var` with the actual dependent variable
                     formula_string <- sub("y_var", .y, formula_string)
                     # Convert back to formula
                     as.formula(formula_string)
                     })
                   )
                  )

6.2.2 Estimating Model Specifications.

After defining the set of candidate model specifications, the next step is to estimate each of these models and compare their performance. The code below automates this process by looping through every specification, fitting a logistic regression model, and extracting key diagnostic statistics. By doing this programmatically, I can efficiently evaluate how each variable combination performs in predicting loan origination.

For each model, I collect the regression coefficients, standard errors, in-sample AIC, in-sample accuracy, and in-sample ROC AUC, which allows me to assess both statistical fit and predictive power. This systematic comparison helps identify which specification provides the most informative and robust explanation of loan origination outcomes, balancing interpretability with predictive performance.

Models <- specifications  %>%
  mutate(
    # Fit a linear model for each formula
    model = map(formula, ~ glm(.x, data = train_set, weights=w_reg,
                               family = binomial(link="logit"))),
    # Tidy the model results
    tidy_output = map(model, ~ tidy(.x)),
    #retrieve sample size for power calculations
    n = map_dbl(model, ~ nobs(.x)), 
    # Calculate degrees of freedom (DF) for the overall model
    df = map_dbl(model, ~ nobs(.x) - length(coef(.x))),
    #Add AIC
    aic = map_dbl(model, ~ AIC(.x)) ,
    accuracy = map_dbl(model, ~ {
      prob <- fitted(.x)
      pred <- ifelse(prob > 0.5, 1, 0)
      mean(pred == reg_data$originated)
    }),
    # ROC AUC
    auc = map_dbl(model, ~ {
      probs <- predict(.x, type = "response")
      roc_obj <- roc(response = .x$y, predictor = probs)
      auc(roc_obj)
    })) %>%
  unnest(tidy_output)   %>%
 # select(dependent,spec_name, term, estimate, std.error, statistic, p.value, aic,accuracy,auc ) %>% 
  arrange(dependent,spec_name)

## Setting levels: control = 0, case = 1

## Setting direction: controls < cases

## Setting levels: control = 0, case = 1

## Setting direction: controls < cases

## Setting levels: control = 0, case = 1

## Setting direction: controls < cases

## Setting levels: control = 0, case = 1

## Setting direction: controls < cases

## Warning: There were 10 warnings in `mutate()`.
## The first warning was:
## ℹ In argument: `model = map(formula, ~glm(.x, data = train_set, weights =
##   w_reg, family = binomial(link = "logit")))`.
## Caused by warning:
## ! non-integer #successes in a binomial glm!
## ℹ Run `dplyr::last_dplyr_warnings()` to see the 9 remaining warnings.

# Display the regression results
print(Models)

## # A tibble: 84 × 14
##    dependent  spec_name       formula   model term  estimate std.error statistic
##    <chr>      <chr>           <list>    <lis> <chr>    <dbl>     <dbl>     <dbl>
##  1 originated categories_only <formula> <glm> (Int…   6.11      0.0984    62.1  
##  2 originated categories_only <formula> <glm> age>…  -0.318     0.115     -2.76 
##  3 originated categories_only <formula> <glm> age2…   0.189     0.0809     2.34 
##  4 originated categories_only <formula> <glm> age3…   0.131     0.0814     1.61 
##  5 originated categories_only <formula> <glm> age4…   0.0230    0.0842     0.274
##  6 originated categories_only <formula> <glm> age5…  -0.0624    0.0860    -0.726
##  7 originated categories_only <formula> <glm> age6…  -0.122     0.0927    -1.32 
##  8 originated categories_only <formula> <glm> conf…   0.183     0.0813     2.25 
##  9 originated categories_only <formula> <glm> prea…  -1.06      0.0482   -22.0  
## 10 originated categories_only <formula> <glm> purp…  -0.0700    0.0529    -1.32 
## # ℹ 74 more rows
## # ℹ 6 more variables: p.value <dbl>, n <dbl>, df <dbl>, aic <dbl>,
## #   accuracy <dbl>, auc <dbl>

6.2.3 Choose the Final Model.

Before moving to out-of-sample evaluation, I first compare the candidate model specifications using simple in-sample metrics. The goal here is to identify which specification provides the best overall fit according to commonly used statistical criteria. To do this, I extract each model’s AIC, accuracy, and AUC, and summarize them in a compact comparison table. These metrics allow me to rank the specifications based on their explanatory power and predictive performance.

Using the metrics described above, I then select the model with the highest AUC as the “best” in-sample performer and extract both the model object and its formula for later use. Finally, I compute a chi-squared ANOVA test on the best model to assess the joint significance of its predictors.

#keep information criteria and accurancy measures
in_sample_accuracy <- Models%>% group_by(dependent,spec_name) %>% 
             summarise(aic          = unique(aic),
                       accuracy     = unique(accuracy),
                       auc          = unique(auc))

## `summarise()` has grouped output by 'dependent'. You can override using the
## `.groups` argument.

print(in_sample_accuracy)

## # A tibble: 4 × 5
## # Groups:   dependent [1]
##   dependent  spec_name         aic accuracy   auc
##   <chr>      <chr>           <dbl>    <dbl> <dbl>
## 1 originated categories_only 7560.    0.957 0.598
## 2 originated full            7582.    0.957 0.706
## 3 originated loan_terms_only 7516.    0.957 0.659
## 4 originated numeric_only    7527.    0.957 0.657

best_spec_name <- in_sample_accuracy %>% filter(auc==max(auc)) %>% select(spec_name)

## Adding missing grouping variables: `dependent`

Best       <-  Models %>% filter(spec_name== as.character(best_spec_name)[2])

best_model <- Best[1,]$model[[1]]

best_spec <- specifications[which(specifications$spec_name == 
                                    as.character(best_spec_name)[2]),]$formula[[1]]

ANOVA_best   <- anova(best_model,  test = 'Chisq')

## Warning in eval(family$initialize): non-integer #successes in a binomial glm!
## Warning in eval(family$initialize): non-integer #successes in a binomial glm!

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning in eval(family$initialize): non-integer #successes in a binomial glm!

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning in eval(family$initialize): non-integer #successes in a binomial glm!

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning in eval(family$initialize): non-integer #successes in a binomial glm!

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning in eval(family$initialize): non-integer #successes in a binomial glm!
## Warning in eval(family$initialize): non-integer #successes in a binomial glm!
## Warning in eval(family$initialize): non-integer #successes in a binomial glm!
## Warning in eval(family$initialize): non-integer #successes in a binomial glm!
## Warning in eval(family$initialize): non-integer #successes in a binomial glm!
## Warning in eval(family$initialize): non-integer #successes in a binomial glm!
## Warning in eval(family$initialize): non-integer #successes in a binomial glm!

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning in eval(family$initialize): non-integer #successes in a binomial glm!

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning in eval(family$initialize): non-integer #successes in a binomial glm!

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning in eval(family$initialize): non-integer #successes in a binomial glm!

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

print(ANOVA_best)

## Analysis of Deviance Table
## 
## Model: binomial, link: logit
## 
## Response: originated
## 
## Terms added sequentially (first to last)
## 
## 
##                              Df Deviance Resid. Df Resid. Dev  Pr(>Chi)    
## NULL                                       1917317      51696              
## interest_rate                 1  1725.16   1917316      49971 < 2.2e-16 ***
## rate_spread                   1    35.18   1917315      49936 3.013e-09 ***
## loan_term                     1    64.72   1917314      49871 8.615e-16 ***
## total_units                   1     0.45   1917313      49871 0.5027863    
## combined_loan_to_value_ratio  1     7.35   1917312      49864 0.0067137 ** 
## lloan                         1    37.05   1917311      49827 1.151e-09 ***
## lincome                       1     1.16   1917310      49825 0.2821564    
## lproperty_value               1     3.11   1917309      49822 0.0777758 .  
## income_loan_ratio             1     6.04   1917308      49816 0.0139930 *  
## income_prop_ratio             1     1.04   1917307      49815 0.3068054    
## rel_income                    1    12.31   1917306      49803 0.0004504 ***
## age                           6    25.54   1917300      49777 0.0002714 ***
## conforming_loan_limit         1    10.77   1917299      49767 0.0010340 ** 
## preapproval_req               1   500.01   1917298      49267 < 2.2e-16 ***
## purpose                       1     6.62   1917297      49260 0.0101009 *  
## debt_to_income_ratio         18    27.06   1917279      49233 0.0777959 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The in-sample AUC metrics indicate clear performance differences across the models. The model with the highest AUC is the full model (approximately 0.7), which demonstrates the strongest ability to correctly distinguish between positive and negative outcomes, making it the most reliable classifier in this set. Models with noticeably lower AUC values provide weaker discrimination and are therefore less suitable for prediction in this context.

The ANOVA results further confirm that these differences are statistically significant. The low p-value indicates that the observed variation in AUC in the full model is unlikely to be due to random chance. In other words, at conventional significance levels, we can reject the null hypothesis of equal mean performance and conclude that at least one model’s predictive accuracy is meaningfully different from the others.

I then run a cross validation for the full model and compare the out of sample capabilities for the regularization specifications, and this model.

6.2.4 K-Fold Cross Validation in Final Model.

Using the full specification, I perform a K-fold cross validation using the same folds I used for the ridge and lasso methods. The codes below show each of the stages of the K-fold cross validation process. The procedure can be summarized as:

Split the data into K-Folds
Estimate the model for each K-Fold and Calculate Each Accuracy Measure.
Average Each Performance Measure and Conclude.

The code below implements this procedure

results_cross <- data.frame(fold = 1:kfolds, Threshold=NA, Sensitivity=NA, 
                            Specificity=NA, Accuracy=NA,  Class_Error=NA,
                            auc=NA)

models        <- vector("list", kfolds)
for (i in 1:kfolds) {
  
  #1. for each fold i, estimate the model using the data of all folds but i (so i is the test set).
  Train        <- train_set %>% filter(folds != i )
  Test         <- train_set %>% filter(folds == i )
  
  w_reg_train  <- ifelse(train_set$originated == 1,
                      class_counts[2]/sum(class_counts),
                      class_counts[1]/sum(class_counts))
  #2. Fit logistic regression
  models[[i]] <- glm(best_spec, data = Train, family = binomial(link="logit"))
  
  #Using the estimated model, use the data from fold i to test out of sample predictions and calculate performance measures.
  
  performance <- performance_measures(model=models[[i]],y=Test$originated,pred_data=Test)
    
  results_cross[i,2:ncol(results_cross)] <- c(as.numeric(performance$comparison["Optimal",]),
                                              as.numeric(performance$roc$auc))
}

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Setting levels: control = 0, case = 1

## Setting direction: controls < cases

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Setting levels: control = 0, case = 1
## Setting direction: controls < cases

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Setting levels: control = 0, case = 1
## Setting direction: controls < cases

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Setting levels: control = 0, case = 1
## Setting direction: controls < cases

## Setting levels: control = 0, case = 1

## Setting direction: controls < cases

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Setting levels: control = 0, case = 1
## Setting direction: controls < cases

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Setting levels: control = 0, case = 1
## Setting direction: controls < cases

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Setting levels: control = 0, case = 1
## Setting direction: controls < cases

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Setting levels: control = 0, case = 1
## Setting direction: controls < cases

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Setting levels: control = 0, case = 1
## Setting direction: controls < cases

#3. average the chosen performance measures and conclude

average_performance <- results_cross %>% select(-fold) %>%
                        summarise(across(everything(), mean, na.rm = TRUE))

## Warning: There was 1 warning in `summarise()`.
## ℹ In argument: `across(everything(), mean, na.rm = TRUE)`.
## Caused by warning:
## ! The `...` argument of `across()` is deprecated as of dplyr 1.1.0.
## Supply arguments directly to `.fns` through an anonymous function instead.
## 
##   # Previously
##   across(a:b, mean, na.rm = TRUE)
## 
##   # Now
##   across(a:b, \(x) mean(x, na.rm = TRUE))

print(average_performance)

##   Threshold Sensitivity Specificity  Accuracy Class_Error       auc
## 1 0.9481891   0.8458955   0.5076569 0.8312502   0.1687498 0.7087089

The cross-validated performance metrics indicate that the model delivers reasonably strong and stable predictive performance across folds. The optimal threshold averages around 0.95, suggesting the model should only classify an observation as a positive outcome when it is highly confident (given the sample imbalance). At this threshold, the model achieves solid sensitivity, correctly identifying about 84.6% of positive cases, while specificity is more modest at roughly 50.8%, implying that nearly half of negative cases are misclassified as positives. Despite this imbalance, overall accuracy remains strong at about 83%, and the corresponding class error of 16.9% indicates a relatively low misclassification rate in aggregate.

Moreover, the AUC of approximately 0.71 across the 10 folds shows that the model has moderate discriminative ability: it separates originated from non-originated cases better than chance but still faces overlap in predicted probabilities across classes. Overall, these results that the model generalizes well, prioritizes correctly identifying positive outcomes, and maintains acceptable predictive accuracy in out-of-sample settings.

6.3 Comparing Final Results.

As a final step, I evaluate how well the selected specifications perform on new, unseen data. A model that fits the training data perfectly may still perform poorly if the model is over-fit. I compare the out-of-sample predictive performance of the three methods: (1) the manually selected “best” model based on economic intuition, (2) the Ridge regression model, and (3) the Lasso regression model. Using the performance_measures function I defined above, I compute accuracy, sensitivity, specificity, optimal thresholds, and AUC for each model on the test set. The resulting comparison table allows me to identify which modeling strategy generalizes best beyond the sample it was trained on.

#Now compare out of sample for the test set
out_performance   <- performance_measures(model=best_model,
                                          y=y_test,pred_data=test_set)

## Setting levels: control = 0, case = 1

## Setting direction: controls < cases

ridge_performance <- performance_measures(model=ridge_model,y=y_test,
                                          pred_data=X_test)

## Setting levels: control = 0, case = 1
## Setting direction: controls < cases

lasso_performance <- performance_measures(model=lasso_model,y=y_test,
                                          pred_data=X_test)

## Setting levels: control = 0, case = 1
## Setting direction: controls < cases

#Compare
Comparison <- rbind( c(out_performance$comparison["Optimal",], 
                       out_performance$roc$auc),
                     
                     c(ridge_performance$comparison["Optimal",],
                       ridge_performance$roc$auc),
                     
                     c(lasso_performance$comparison["Optimal",],
                       lasso_performance$roc$auc)
                     )

rownames(Comparison) <- c("Manual","Ridge","Lasso")
colnames(Comparison)[ncol(Comparison)] <- "AUC"

print(Comparison)

##        Threshold Sensitivity Specificity Accuracy  Class_Error AUC      
## Manual 0.9974892 0.8519499   0.4941651   0.8364071 0.1635929   0.7039944
## Ridge  0.9975674 0.8453546   0.5011766   0.8304029 0.1695971   0.7043865
## Lasso  0.9975691 0.8465236   0.5001681   0.8314773 0.1685227   0.7039384

Across the three models, out-of-sample performance is nearly identical, but the ridge model shows a very slight edge, driven primarily by its highest AUC (0.7044), which indicates marginally better ranking ability among the models. Although sensitivity, specificity, and overall accuracy differ only minimally across the Manual, Ridge, and Lasso specifications, the ridge regression achieves the best trade-off by maintaining competitive classification metrics while offering the strongest discrimination power. These differences are small in absolute terms, but the AUC advantage makes ridge the most reliable predictor out of sample.

Testing Different Specifications for Prediction

Juan Espinosa

2025-12-09