Instrumental Variable in Logistic Regression Model

In the last post, "The Difference of Occupation Choice Among Graduates with Different Majors and Degrees", I built a logistic model with the log of salary as one of the independent variables. But we should be concerned about its endogeneity: salary may be related to omitted variables that also affect the probability of working at an educational institution, and, because it is the current salary, the occupation itself may affect it. So instruments are needed to address the possible endogeneity problem in the regression analysis. In the dataset, a variable "satis" measuring satisfaction with salary seems to be a good possible instrument. For this variable, 4 means very satisfied, 3 means somewhat satisfied, 2 means somewhat unsatisfied, and 1 means very unsatisfied.

Let's first take a look at the model without an IV. The dependent variable is whether the graduate works at an educational institution, and the independent variables are degree, major, number of years after graduation, gender, log of salary, and citizenship. There are no interaction terms or weights in the model. The SAS code and estimation results are as follows. We find that the log of salary has a significant negative relationship with the probability of working at an educational institution.

proc logistic data = proj_jobchoice descending;
class degree (ref = '1') ndgmemg (ref = '1') citizen (ref = '1') gender / param = ref;
model jobedu = degree ndgmemg gradyr gender lnsalary citizen;

run;

IV Logistic

But what if we use a logistic model with satisfaction with salary as an IV? In SAS, we have to take two steps to estimate a logistic model with instrumental variables. First, we estimate a linear regression model (the reduced-form model), with the endogenous variable as the dependent variable and the IV and the other variables as the independent variables. Then, we obtain the residuals of this linear model and put them into the logistic model (the full model) as a new independent variable. Because PROC REG has no CLASS statement for categorical variables, we use PROC GLM to run the linear model. The OUTPUT statement writes a new dataset (out = ) that can contain the predicted values (p = ) and the residuals (residual = ). The SAS code and estimation results are as follows.

proc glm data = proj_jobchoice;
class degree ndgmemg citizen gender;
model lnsalary = satis degree ndgmemg gradyr gender citizen / solution;
output out = temp residual = vhat;
run;
proc logistic data = temp descending;
class degree (ref = '1') ndgmemg (ref = '1') citizen (ref = '1') gender / param = ref;
model jobedu = degree ndgmemg gradyr gender lnsalary vhat citizen;

run;
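
Written out, the two steps above amount to the following equations. This is only a sketch: the notation is an assumption made here for illustration, with $w_i$ standing for the exogenous controls (degree, major, years after graduation, gender, citizenship), the Greek letters being coefficient names that do not appear in the SAS output, and jobedu = 1 taken as the modeled event.

```latex
\begin{align}
\text{lnsalary}_i &= \pi_0 + \pi_1\,\text{satis}_i + w_i'\pi_2 + v_i, \\
\hat v_i &= \text{lnsalary}_i - \widehat{\text{lnsalary}}_i, \\
\operatorname{logit}\Pr(\text{jobedu}_i = 1) &= \beta_0 + \beta_1\,\text{lnsalary}_i + w_i'\beta_2 + \rho\,\hat v_i .
\end{align}
```

Here $\hat v_i$ is the variable vhat created by the OUTPUT statement. Note that satis enters only the first equation. A coefficient $\rho$ on vhat that is clearly different from zero is usually read as evidence that lnsalary is endogenous.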

2SLS

Then how do we use 2SLS to run the IV logistic model? Following the two stages of 2SLS, we first estimate a linear model with the endogenous variable as the dependent variable and the IV and the other variables as the independent variables. After obtaining the predicted values from this linear model, we use them in place of the endogenous variable as an independent variable in the second-stage logistic model. The SAS code and estimation results are as follows.

proc glm data = proj_jobchoice;
class degree ndgmemg citizen gender;
model lnsalary = satis degree ndgmemg gradyr gender citizen / solution;
output out = temp p = xhat residual = vhat;
run;
proc logistic data = temp descending;
class degree (ref = '1') ndgmemg (ref = '1') citizen (ref = '1') gender / param = ref;
model jobedu = degree ndgmemg gradyr gender xhat citizen;

run;
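
For reference, the 2SLS-style version can be written as below. Again this is a sketch in assumed notation, not SAS output: $w_i$ collects the exogenous controls, hats on the $\pi$ coefficients denote first-stage estimates, and jobedu = 1 is taken as the modeled event.

```latex
\begin{align}
\widehat{\text{lnsalary}}_i &= \hat\pi_0 + \hat\pi_1\,\text{satis}_i + w_i'\hat\pi_2, \\
\operatorname{logit}\Pr(\text{jobedu}_i = 1) &= \beta_0 + \beta_1\,\widehat{\text{lnsalary}}_i + w_i'\beta_2 .
\end{align}
```

Here $\widehat{\text{lnsalary}}_i$ is the variable xhat. In a linear second stage, substituting the prediction and adding the residual give the same coefficient on the endogenous variable, but with a logistic second stage they are not algebraically equivalent, so the estimates from the two approaches need not agree.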
