Spring 2024 Spark! in Five featuring Zeyu Chen
From Suprawee Tepsuporn, September 14, 2024

Thanks for the excellent presentations from all the previous speakers. I'm the last one to go. My name is Zeyu Chen, I'm from the Institute for Financial Services Analytics, and my research advisors are Professor Bintong Chen and Professor Wei, who are also here today.

Before I get started, I want to ask everyone a question: how many of you have applied for a mortgage loan, or plan to apply for one in the future? Please raise your hand. Okay, so most of you.

On this plot, the horizontal axis is the year and the vertical axis is the number of mortgage loan applications. As you can see, mortgage loan applications in the United States peaked in 2021 at over 20 million. Given such a large number of applications, we obviously cannot rely on human reviewers alone; it would take them forever to finish the review. So banks today rely on machine learning to predict the default risk of a loan and to automate the decision-making process.

Now I want to show you a flow chart of how banks make this possible. Banks start from past loan data collected from previous applicants. They feed this data into a machine learning model, which gives them an automated decision-making process. When a new applicant submits their profile to the system, it can automatically generate a suggested decision: accept the loan application or deny it.

Given this automated process, should we have concerns? Of course we should, especially fairness concerns, and the government has similar concerns. Under the Equal Credit Opportunity Act, banks' lending decisions must not discriminate against protected groups based on attributes such as gender or race.

So why can a machine learning model become unfair? Because the data itself may not be fair. Let's look at what banks have in their past loan data. First, they have non-protected variables X, such as your occupation or your address (say, your zip code); banks are allowed to use these. They also have protected variables P, such as your gender or your race, which must not be used in the decision-making process. And they have the outcome variable Y, the outcome of each past loan: was it defaulted on or successfully repaid?

As I mentioned, the non-protected variables may be used and the protected variables may not. However, prohibiting the use of the protected variables P does not guarantee fairness. Can you imagine why? Because some variables in X are highly correlated with the protected variables: your zip code may reflect your race, and your occupation may reflect your gender. So what we need to do is select a subset of variables from X that have low correlation with P, so they are fair, and at the same time high correlation with Y, so they remain accurate for the banks.
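To make this screening idea concrete, here is a minimal Python sketch on toy data. All column names and values are hypothetical, and absolute correlation is used only as a simple stand-in for the fairness and accuracy measures discussed in the talk.

```python
import numpy as np
import pandas as pd

# Toy stand-in for a bank's past loan data (all names and values hypothetical).
rng = np.random.default_rng(0)
n = 1000
loans = pd.DataFrame({
    "protected_group": rng.integers(0, 2, n),      # P: an encoded protected attribute
    "zip_code_income": rng.normal(50, 10, n),      # proxy-like variable
    "credit_history_years": rng.normal(8, 3, n),   # fair and predictive variable
    "num_credit_accounts": rng.integers(1, 10, n),
})
# Make the proxy variable track P, and make the outcome Y depend mostly on credit history.
loans["zip_code_income"] += 15 * loans["protected_group"]
loans["defaulted"] = (loans["credit_history_years"] + rng.normal(0, 3, n) < 6).astype(int)

P = loans["protected_group"]                               # protected variable (never a model input)
Y = loans["defaulted"]                                     # loan outcome
X = loans.drop(columns=["protected_group", "defaulted"])   # candidate non-protected variables

# Screen each candidate: low |corr with P| means "fairer", high |corr with Y| means "more useful".
screen = pd.DataFrame({
    "abs_corr_with_P": X.corrwith(P).abs(),
    "abs_corr_with_Y": X.corrwith(Y).abs(),
})
print(screen.sort_values("abs_corr_with_Y", ascending=False))
```

In this toy data, the credit-history variable comes out weakly related to P but strongly related to Y, which is exactly the kind of variable we want to keep.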
Examples of such variables are your credit history or your number of credit accounts; these are both fair and accurate. So how do we select these variables from X? The challenge is that if X has 100 variables, there are 2 to the power of 100 possible subsets, because each variable has two choices: keep it or remove it. With such a large number of possibilities, the problem cannot be solved directly, so we need to rely on approximations.

The approximation some banks currently use works like this. They start with a model that uses all the variables they have. This model is the most accurate one, but of course it can be unfair. Then, at each step, they remove the one variable with the highest correlation with the protected variable P; for example, here the darkest variable on the right is the most unfair one, so we remove it and the model becomes fairer. They repeat this process until the model is fair. But can you see the problem with this process? The problem is that the selection only considers the correlation with P; it ignores the correlation with Y, and therefore ignores accuracy.

What we propose, and have already completed, is a better approximation. We start from an empty model with no variables in it. An empty model treats everyone the same way, so it is completely fair, but at the same time it is useless, because it simply accepts everyone or denies everyone. Then, at each step, we add the one variable with the best balance between its correlation with P and its correlation with Y, because those are the two objectives we want to balance. For example, here we add the orange variable because it has the best balance, and we repeat this process, adding one variable at a time, stopping just before the model would become unfair. This way we end up with the most accurate fair model (a simplified sketch of both procedures appears below, after the take-home messages).

So what is the take-home message today? First, compared to the banks' current approximation, our approximation achieves a better balance between applicants' fair treatment and banks' ability to predict loan risk, so we improve things on both sides, for applicants and for banks. Second, this is not just for the lending industry or loan applications. The method can be applied to other domains such as criminal justice, where machine learning is now used to make automated bail decisions. Our method can also be used in those domains, and it can ultimately help create a more fair and equitable society. Thank you.
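To make the two greedy procedures concrete, here is a minimal Python sketch. The absolute-correlation scores and the per-variable fairness cap are simplifying assumptions for illustration; the actual method evaluates formal fairness measurements on the trained model rather than raw correlations.

```python
import pandas as pd

def abs_corr(a: pd.Series, b: pd.Series) -> float:
    """Absolute Pearson correlation, used here as a stand-in score."""
    return float(abs(a.corr(b)))

def banks_backward_elimination(X: pd.DataFrame, P: pd.Series, fairness_cap: float) -> list:
    """Current-practice sketch: start from all variables and repeatedly drop the one
    most correlated with the protected variable P until the selection looks fair."""
    selected = list(X.columns)
    while selected and max(abs_corr(X[v], P) for v in selected) > fairness_cap:
        worst = max(selected, key=lambda v: abs_corr(X[v], P))
        selected.remove(worst)
    return selected

def proposed_forward_selection(X: pd.DataFrame, P: pd.Series, Y: pd.Series,
                               fairness_cap: float) -> list:
    """Proposed sketch: start empty and greedily add the variable with the best balance
    of high correlation with the outcome Y and low correlation with P, stopping before
    the selection would violate the fairness cap."""
    selected, remaining = [], list(X.columns)
    while remaining:
        allowed = [v for v in remaining if abs_corr(X[v], P) <= fairness_cap]
        if not allowed:
            break  # any further addition would make the selection unfair
        best = max(allowed, key=lambda v: abs_corr(X[v], Y) - abs_corr(X[v], P))
        selected.append(best)
        remaining.remove(best)
    return selected

# With the toy DataFrame from the earlier sketch:
#   banks_backward_elimination(X, P, fairness_cap=0.1)
#   proposed_forward_selection(X, P, Y, fairness_cap=0.1)
```

The key difference is that the forward pass scores each candidate on both objectives, whereas the backward pass removes variables by looking at the correlation with P alone.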
Yes, please. What's the definition of fairness here? There are actually a lot of definitions of fairness, and some of them are quite technical, so we consider three fairness measurements that are commonly used in the literature. I will introduce one here: take, for example, the female group and the male group; we want these two groups to have the same probability of being accepted for the loan. That is one measurement (a small sketch of it appears at the end of this transcript), and we also consider two other, more complicated fairness measurements.

Please. Is this new methodology acceptable under the current regulations? Yes. We talked with some folks from the regulatory side, and they told us that when they audit banks' practice today, they look at the model inputs to make sure the protected variables are not fed into the model, and they also audit the outcomes, evaluating the predictions to see whether they are fair with respect to the protected groups. Our method does not input the protected variables, since we select only from the non-protected variables, and, as I mentioned, we achieve a better balance between the two objectives, so we can satisfy the regulatory requirements.

Please. In your method, you mentioned that you gradually add more variables into your model. Does this mean you have to retrain it every time? Actually, no. As I mentioned, banks have past loan data; we only need to train our model on that past loan data once, and we select the best subset of non-protected variables based on it.

Have you considered scaling this up to higher-dimensional data? Our method is actually designed specifically for high-dimensional data. Not for images, but in our evaluation, the more variables there are in the non-protected set X, the better the balance our method achieves compared to existing methods, so it can handle the high-dimensional case.

Okay, please. I'm curious how you deal with biases in the past loan data that you're using. As I mentioned on a previous slide, where do those biases come from? Machine learning simply learns every pattern it can from the past loan data, so the technique becomes biased when the past loan data itself contains biases. What we want is a machine learning technique that, while learning from the past loan data, mitigates the biases in that data, so that it learns the fair patterns but not the unfair ones.

Please. Can you do something about reject inference in your setting? Sorry, pardon me? Reject inference, the fact that you only observe outcomes for loans that were actually given out; if someone never received a loan, you don't really have a counterfactual. Yes, that is a problem we were thinking about during this project. Currently we don't have access to that kind of credit loan data, because it is highly confidential. But someone who does not get a loan from one agency may get a loan from another agency, so if we had a larger pool of data and could see a person's credit history across different agencies, that would be a way to address this problem. Okay. Thank you.
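For reference, the fairness measurement introduced above (equal acceptance probability across groups) is commonly called demographic parity. Below is a minimal, hypothetical sketch of how one might check it; it is an illustration only, not the exact measurement used in the work.

```python
import pandas as pd

def demographic_parity_gap(decisions: pd.Series, group: pd.Series) -> float:
    """Difference in acceptance rates between protected groups.

    `decisions` holds 1 = accept / 0 = deny, and `group` holds the protected
    attribute (e.g., 0 = male, 1 = female). A gap near 0 means both groups
    are accepted at roughly the same rate.
    """
    rates = decisions.groupby(group).mean()
    return float(rates.max() - rates.min())

# Example: acceptance rates of 0.50 vs. 0.25 give a parity gap of 0.25.
decisions = pd.Series([1, 1, 0, 0, 1, 0, 0, 0])
group = pd.Series([0, 0, 0, 0, 1, 1, 1, 1])
print(demographic_parity_gap(decisions, group))  # 0.25
```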