U.S. Permanent Visa Approval Prediction

Chandramouli Yalamanchili
Updated - 06/03/2021 [Created - 03/28/2021]
View Project Code on GitHub

Introduction

With ever-changing immigration policies and minimum wage requirements, work visa employees and their employers face challenges from the uncertainty of visa extension approvals, especially given the seemingly never-ending permanent visa queue for applicants from a few nations. Given this situation, it is worth researching how different attributes affect work visa applications for immigrant employees from different nations and sectors.
In this project, I analyzed the data to identify the important factors that could affect a U.S. permanent visa application. I also built a logistic regression model to predict the approval of a U.S. permanent visa application based on selected features.

Project Motivation

Project Details

Dataset Details

Technology used

Exploratory Data Analysis

I applied several data cleaning steps in this project, both to derive proper insights from the data and to prepare it for modeling. Below I document the steps I performed for data preparation and for data analysis through graphs.

1. Data Overview

Feature/Column                       Count
case_number                          0
case_status                          0
class_of_admission                   25,845
country_of_citizenship               59
decision_date                        0
employer_name                        12
employer_num_employees               135,349
employer_name.1                      12
employer_state                       42
foreign_worker_info_birth_country    135,300
job_info_work_city                   102
job_info_work_state                  103
pw_job_title_9089                    392
pw_level_9089                        27,627
pw_soc_title                         2,336
pw_amount_9089                       2,216
pw_unit_of_pay_9089                  1,572
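
Per-column counts like the ones in the table above can be produced with pandas. The sketch below assumes the counts are numbers of missing values per column, which the table does not state explicitly. The small DataFrame holds made-up illustration rows, not the project's data; only the column names come from the table.

```python
import pandas as pd

# Made-up illustration rows; only the column names come from the table above.
df = pd.DataFrame(
    {
        "case_number": ["A-001", "A-002", "A-003"],
        "case_status": ["Certified", "Denied", "Certified"],
        "class_of_admission": ["H-1B", None, None],
        "employer_state": ["CA", "TX", None],
    }
)

# Count missing values in each column
# (assumption: this is what the "Count" column reports).
missing_counts = df.isna().sum()
print(missing_counts)
```

Columns with a count of zero, such as case_number here, have a value in every row.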

2. Data cleanup needed for graph analysis

I performed the data cleanup steps below to extract the data in the format needed for graph analysis.

I also modified the extracted feature values to normalize them, to fill in missing values, or to extract a more meaningful value.

3. Graph Analysis - Histogram Plots

I plotted some histograms, as shown in Figure 1, to understand the data from different perspectives. Below are some of my observations from the histograms.

Histogram Plots
Figure 1: Histogram charts including both raw and normalized values for salary & no_of_employees.

4. Graph Analysis - Bar charts

I plotted bar charts for some of the other features to understand the data from a different perspective. Below are my observations from the bar charts.

Bar charts
Figure 2: Bar charts for entry visa type, citizenship country, Case status, and Year.

5. Graph Analysis - Correlation matrix

I used a visualizer to examine the relationships between different features using Pearson ranking. Below are my observations:

Correlation Matrix Plots
Figure 3: Correlation between different features.
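
For readers who want to reproduce a Pearson ranking numerically, the pairwise coefficients can be computed with pandas. This is a minimal sketch, not necessarily how Figure 3 was produced: the visualizer used for the figure is not named here, the values are made up, and the column names are hypothetical stand-ins borrowed from the figure captions.

```python
import pandas as pd

# Made-up numeric values; the column names are hypothetical stand-ins.
df = pd.DataFrame(
    {
        "salary": [60000, 75000, 90000, 120000, 150000],
        "no_of_employees": [10, 50, 200, 1000, 5000],
        "year": [2012, 2013, 2014, 2015, 2016],
    }
)

# Pairwise Pearson correlation coefficients between the numeric columns.
corr = df.corr(method="pearson")
print(corr.round(2))
```

Each coefficient lies between -1 and 1. The diagonal is always 1 because every feature is perfectly correlated with itself, and the matrix is symmetric.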

6. Graph Analysis - Parallel coordinates plot

I have compared several numeric parameters in the data using the parallel coordinates plot.

Parallel coordinates plot
Figure 4: Parallel coordinates plot.

7. Graph Analysis - Stacked bar plots

I compared various features using stacked bar charts that show the case status counts for each feature. Below are my observations from this step.

Stacked Bar Plots
Figure 5: Stacked bar charts.

Data Preparation

I took the steps below as part of data preparation to get the dataset ready for modeling:

Modeling

Model Details

Model Performance

  1. Confusion Matrix - As shown in Figure 6, the number of TP (True Positives) is high, but the model failed to identify the denied cases accurately: only 99 of the 3,823 denied cases were correctly predicted.
Confusion Matrix
Figure 6: Confusion Matrix.
  2. Logistic regression classification report - Consistent with the confusion matrix, the other evaluation metrics show the model's poor performance on the denied class, as shown in Figure 7.
Classification report
Figure 7: Logistic regression classification report.
  3. ROC Curves - The ROC curves show better performance, as all of the curves lie above the dotted line, which represents random guessing.
ROC Curves
Figure 8: Logistic regression ROC Curves.
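
To put the confusion matrix result in perspective, the recall for the denied class can be worked out from the two counts quoted above. The sketch reads the sentence as 3,823 actual denied cases, of which 99 were predicted as denied; the variable names are illustrative only.

```python
# Counts taken from the confusion matrix description above.
denied_total = 3823    # actual denied cases in the evaluated data
denied_correct = 99    # denied cases the model predicted as denied

# Denied cases that the model labeled as certified instead.
denied_missed = denied_total - denied_correct

# Recall for the denied class: share of actual denied cases that were caught.
denied_recall = denied_correct / denied_total

print(f"missed denied cases: {denied_missed}")      # missed denied cases: 3724
print(f"denied-class recall: {denied_recall:.1%}")  # denied-class recall: 2.6%
```

A recall of roughly 2.6% means the model catches very few of the denied applications, which matches the weak denied-class numbers described for the classification report.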

Acknowledgement

Thanks to Bellevue University and all professors for their continuous guidance and support throughout the data science course. Thanks to Professor Fadi Alsaleem for providing continuous constructive feedback, and to my peers for the valuable inputs and discussions that helped me build this project.

I would also like to thank all the authors of the reference papers and articles.

Conclusion

The graph analysis of the U.S. permanent visa applications dataset gave me good insight into the data and helped me understand it from different perspectives. It also surfaced some interesting facts, one being the very high number of approved but expired cases. One encouraging finding from this analysis is that there are very few denials.

Through this project, I also learned that it is better to apply normalization after completing the graph analysis used to understand the data, and before feeding the data into any model. I also noticed that converting categorical features to numeric ones through one-hot encoding is probably not ideal when a feature has many possible values, as in this case.
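
The two preparation techniques mentioned above can be sketched with pandas as follows. This is an illustration under assumptions, not the project's actual code: min-max scaling is assumed because the kind of normalization is not specified, the rows are made up, and the column names are hypothetical stand-ins.

```python
import pandas as pd

# Made-up rows; column names are hypothetical stand-ins for the dataset's features.
df = pd.DataFrame(
    {
        "salary": [60000.0, 90000.0, 150000.0, 75000.0],
        "country_of_citizenship": ["INDIA", "CHINA", "CANADA", "INDIA"],
    }
)

# Min-max normalization (an assumed choice): rescale salary into [0, 1].
salary_min, salary_max = df["salary"].min(), df["salary"].max()
df["salary_norm"] = (df["salary"] - salary_min) / (salary_max - salary_min)

# One-hot encoding: one new indicator column per distinct category value.
encoded = pd.get_dummies(df, columns=["country_of_citizenship"])

# The number of added columns equals the number of distinct values,
# which is why features with many possible values make the table very wide.
n_new_columns = encoded.shape[1] - (df.shape[1] - 1)
print(encoded.columns.tolist())
print(f"one-hot columns added: {n_new_columns}")
```

With three distinct countries the encoding adds three columns. A feature with hundreds of distinct values would add hundreds, which illustrates the concern about one-hot encoding raised above.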

Finally, I built a logistic regression model to predict whether a U.S. permanent visa will be granted based on the provided data. Overall, the model predicts the certified cases well, but it also predicts too many of the denied cases as certified.

References

  1. US Permanent Visa Applications - Kaggle - https://www.kaggle.com/jboysen/us-perm-visas
