
Diagnostics and Remedial Measures for Outlying and Influential Cases

An Example using Zillow Data


To identify outliers (outlying Y observations), we can look at the semistudentized residuals, studentized residuals, and studentized deleted residuals.

a. semistudentized residuals: $\frac{e_i}{\sqrt{MSE}}$.
b. studentized residuals: $\frac{e_i}{\sqrt{MSE(1-h_{ii})}}$, where $h_{ii}$ is the $i$th diagonal element of the hat matrix.
c. studentized deleted residuals: the deleted residual is $d_i = Y_i - \hat{Y}_{i(i)} = \frac{e_i}{1-h_{ii}}$, where $\hat{Y}_{i(i)}$ is the fitted value for case $i$ from the regression refitted without observation $i$. That refit also gives $MSE_{(i)}$, so that $s^2\{d_i\} = MSE_{(i)}(1+X'_i(X'_{(i)}X_{(i)})^{-1}X_i)=\frac{MSE_{(i)}}{1-h_{ii}}$ and
$t_i = \frac{d_i}{s\{d_i\}}=\frac{e_i}{\sqrt{MSE_{(i)}(1-h_{ii})}}\sim t(n-p-1)$.
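
To make the three definitions concrete, here is a minimal Python sketch (not the SAS workflow used in this post) for a simple linear regression with one predictor, so $p = 2$. The toy data and the function names `fit_line` and `residual_diagnostics` are assumptions for illustration only. The sketch relies on the identity $(n-p)MSE = (n-p-1)MSE_{(i)} + \frac{e_i^2}{1-h_{ii}}$, so the regression does not have to be refitted for each deleted case.

```python
import math

def fit_line(x, y):
    """Ordinary least squares for y = b0 + b1*x; returns (b0, b1)."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    return ybar - b1 * xbar, b1

def residual_diagnostics(x, y):
    """Return (semistudentized, studentized, studentized deleted) residuals."""
    n, p = len(x), 2  # p counts the intercept and the slope
    b0, b1 = fit_line(x, y)
    xbar = sum(x) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    e = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    # Leverage h_ii for simple linear regression
    h = [1 / n + (xi - xbar) ** 2 / sxx for xi in x]
    sse = sum(ei ** 2 for ei in e)
    mse = sse / (n - p)
    semi = [ei / math.sqrt(mse) for ei in e]
    stud = [ei / math.sqrt(mse * (1 - hi)) for ei, hi in zip(e, h)]
    deleted = []
    for ei, hi in zip(e, h):
        # MSE with case i removed, obtained without refitting
        mse_i = (sse - ei ** 2 / (1 - hi)) / (n - p - 1)
        deleted.append(ei / math.sqrt(mse_i * (1 - hi)))
    return semi, stud, deleted

# Toy data (hypothetical): the sixth observation lies far above the line.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [1.1, 2.0, 2.9, 4.2, 5.0, 9.0, 7.1, 7.9]
semi, stud, deleted = residual_diagnostics(x, y)
for i, (a, b, c) in enumerate(zip(semi, stud, deleted)):
    print(i, round(a, 3), round(b, 3), round(c, 3))
```

The studentized deleted residual reacts most strongly to the outlying case, because that case is excluded from the error variance estimate it is compared against.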

Outlying X observations (high-leverage cases) can be identified with the hat matrix. An observation is usually considered a high-leverage case if $h_{ii} > 2p/n$. Another suggested guideline is that $h_{ii}$ exceeding 0.5 indicates very high leverage, whereas a value between 0.2 and 0.5 indicates moderate leverage.
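
As a small illustration of the $2p/n$ rule, the following Python sketch computes leverages for a simple linear regression, where $h_{ii} = \frac{1}{n} + \frac{(X_i-\bar{X})^2}{\sum_j (X_j-\bar{X})^2}$. The data and the helper names `leverages` and `flag_high_leverage` are hypothetical; with several predictors, the diagonal of the full hat matrix would be needed instead.

```python
def leverages(x):
    """Diagonal of the hat matrix for y = b0 + b1*x."""
    n = len(x)
    xbar = sum(x) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    return [1 / n + (xi - xbar) ** 2 / sxx for xi in x]

def flag_high_leverage(x, p=2):
    """Indices of cases with h_ii above the 2p/n guideline."""
    cutoff = 2 * p / len(x)
    return [i for i, h in enumerate(leverages(x)) if h > cutoff]

# Toy predictor values (hypothetical): the last one sits far from the rest.
x = [1, 2, 3, 4, 5, 6, 7, 20]
h = leverages(x)
print([round(hi, 3) for hi in h])
print(flag_high_leverage(x))
```

The leverages always sum to $p$, so the cutoff $2p/n$ is simply twice the average leverage.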

Influential cases can be identified by DFFITS, Cook's Distance, or DFBETAS.

a. DFFITS: $\frac{\hat{Y}_i - \hat{Y}_{i(i)}}{\sqrt{MSE_{(i)}h_{ii}}}=t_i\left(\frac{h_{ii}}{1-h_{ii}}\right)^{1/2}$. A suggested guideline is to consider a case influential if the absolute value of DFFITS exceeds 1 for small to medium datasets and $2\sqrt{p/n}$ for large datasets.
b. Cook's Distance: $D_i = \frac{(\hat{Y}-\hat{Y}_{(i)})'(\hat{Y}-\hat{Y}_{(i)})}{pMSE}=\frac{e_i^2}{pMSE}\left[\frac{h_{ii}}{(1-h_{ii})^2}\right]$, where $\hat{Y}$ and $\hat{Y}_{(i)}$ are the vectors of fitted values with and without case $i$. $D_i$ is compared with the $F(p, n-p)$ distribution: a case is considered to have little influence if $D_i$ falls below the 10th or 20th percentile, and substantial influence if it is above the 50th percentile.
c. DFBETAS: $\frac{b_k-b_{k(i)}}{\sqrt{MSE_{(i)}c_{kk}}}$, where $c_{kk}$ is the $k$th diagonal element of $(X'X)^{-1}$. A case is considered influential if the absolute value exceeds 1 for small to medium datasets or $2/\sqrt{n}$ for large datasets.
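
The closed forms for DFFITS and Cook's Distance only need $e_i$, $h_{ii}$, $MSE$, and $MSE_{(i)}$. Here is a hedged Python sketch for a simple linear regression ($p = 2$); the data and the function names `fit_with_leverage` and `dffits_and_cooks` are assumptions for illustration, and DFBETAS is left out to keep the example short.

```python
import math

def fit_with_leverage(x, y):
    """OLS for y = b0 + b1*x; returns (b0, b1, h) with h the leverages."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    h = [1 / n + (xi - xbar) ** 2 / sxx for xi in x]
    return ybar - b1 * xbar, b1, h

def dffits_and_cooks(x, y, p=2):
    """Return (DFFITS, Cook's D) for every case, using the closed forms."""
    n = len(x)
    b0, b1, h = fit_with_leverage(x, y)
    e = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    sse = sum(ei ** 2 for ei in e)
    mse = sse / (n - p)
    dffits, cooks = [], []
    for ei, hi in zip(e, h):
        # MSE with case i deleted, obtained without refitting
        mse_i = (sse - ei ** 2 / (1 - hi)) / (n - p - 1)
        t_i = ei / math.sqrt(mse_i * (1 - hi))  # studentized deleted residual
        dffits.append(t_i * math.sqrt(hi / (1 - hi)))
        cooks.append(ei ** 2 / (p * mse) * hi / (1 - hi) ** 2)
    return dffits, cooks

# Toy data (hypothetical): the last case has high leverage and is off the trend.
x = [1, 2, 3, 4, 5, 6, 7, 20]
y = [1.2, 1.9, 3.1, 4.0, 5.2, 5.8, 7.1, 10.0]
dffits, cooks = dffits_and_cooks(x, y)
# Small dataset, so use the |DFFITS| > 1 guideline
flagged = [i for i, d in enumerate(dffits) if abs(d) > 1]
print(flagged)
print([round(c, 3) for c in cooks])
```

Both measures combine the size of the residual with the leverage, which is why a high-leverage case that departs from the trend of the other points dominates them.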

Robust regression is an important remedial measure for reducing the influence of outlying cases.

Now I want to use a Zillow dataset on housing prices as an example to show how to identify outliers, high-leverage cases, and influential cases, and how to run a robust regression in SAS. In the linear regression model, the dependent variable is the log of the housing price, and the independent variables are lot size, waterfront (0,1), age of the house, land value, new construction (0,1), central air (0,1), heat type (type 2, 3 and 4), living area, number of bedrooms, number of bathrooms, and total number of rooms.

I use PROC REG to estimate this linear regression model, and save the residuals, studentized residuals, studentized deleted residuals, leverage, DFFITS, and Cook's distance in a new output dataset "residuals" with the residual=, student=, rstudent=, h=, dffits=, and cookd= keywords of the OUTPUT statement.

proc reg data=zillow plots=all;
model log_price = lot_size waterfront age land_value new_construct central_air heat_type3 heat_type4 living_area bedrooms bathrooms rooms / vif spec lackfit dwprob;
output out=residuals residual=resid student=stu_resid rstudent=del_resid h=leverage DFFITS=dfits cookd=cookd;
run;

With the plots option in PROC REG, we can get the outlier and leverage plot, the DFFITS plot, and the Cook's D plot.




Now I want to print the outliers and count them, using the common guideline that a case is an outlier if the absolute value of its studentized residual is greater than 2. In this case, there are 62 outliers.

proc print data=residuals;
var stu_resid del_resid leverage dfits cookd;
where abs(stu_resid) > 2;
run;
proc sql;
select count(*) from residuals where abs(stu_resid) > 2;

quit;

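The numbers of high-leverage and influential cases can also be counted with PROC SQL, using the cutoffs $h_{ii} > 2p/n$ and $|DFFITS| > 2\sqrt{p/n}$. The macro variable p below holds the number of predictors (12), so the number of parameters including the intercept is &p+1. In this case, there are 139 high-leverage cases and 91 influential cases.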

%let p = 12;
proc sql;
select count(*) into :n from zillow;
quit;
proc sql;
select count(*) from residuals where leverage > (2*(&p+1)/&n);
quit;
proc sql;
select count(*) from residuals where abs(dfits) > (2*sqrt((&p+1)/&n));

quit;

Then I want to use robust regression to deal with this dataset, which has many outliers and influential cases.

proc robustreg data=zillow method=m (wf=huber) plots=all;
model log_price = lot_size waterfront age land_value new_construct central_air heat_type3 heat_type4 living_area bedrooms bathrooms rooms / diagnostics;
output out = robust weight=wgt;

run;
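
To show the idea behind M-estimation with Huber weights, here is a minimal Python sketch of iteratively reweighted least squares for a single predictor. It is not a reproduction of what PROC ROBUSTREG does internally: the tuning constant `c=1.345`, the MAD-based scale estimate, the stopping rule, the toy data, and the function names `weighted_line` and `huber_fit` are all assumptions made for illustration.

```python
import statistics

def weighted_line(x, y, w):
    """Weighted least squares for y = b0 + b1*x; returns (b0, b1)."""
    sw = sum(w)
    xw = sum(wi * xi for wi, xi in zip(w, x)) / sw
    yw = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - xw) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - xw) * (yi - yw) for wi, xi, yi in zip(w, x, y))
    b1 = sxy / sxx
    return yw - b1 * xw, b1

def huber_fit(x, y, c=1.345, max_iter=100, tol=1e-10):
    """Huber M-estimate via iteratively reweighted least squares (sketch)."""
    w = [1.0] * len(x)
    b0, b1 = weighted_line(x, y, w)  # start from ordinary least squares
    for _ in range(max_iter):
        e = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
        med = statistics.median(e)
        # Robust scale: median absolute deviation, rescaled for normal errors
        scale = statistics.median([abs(ei - med) for ei in e]) / 0.6745
        if scale == 0:
            break
        # Huber weights: 1 inside [-c, c], c/|u| outside
        w = [1.0 if abs(ei / scale) <= c else c / abs(ei / scale) for ei in e]
        new_b0, new_b1 = weighted_line(x, y, w)
        converged = abs(new_b0 - b0) + abs(new_b1 - b1) < tol
        b0, b1 = new_b0, new_b1
        if converged:
            break
    return b0, b1, w

# Toy data (hypothetical): roughly y = 2 + 0.5*x, with the last response inflated.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [2.5, 3.1, 3.4, 4.0, 4.6, 4.9, 5.5, 6.1, 6.4, 17.0]
ols_b0, ols_b1 = weighted_line(x, y, [1.0] * len(x))
rob_b0, rob_b1, weights = huber_fit(x, y)
print("OLS slope:", round(ols_b1, 3))
print("Huber slope:", round(rob_b1, 3))
print("weights:", [round(wi, 3) for wi in weights])
```

The final weights play a role similar to the weight=wgt variable saved by the OUTPUT statement above: cases with large scaled residuals get a weight below 1, so they pull the fitted line less than they do in ordinary least squares.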

Robust Regression

Linear Regression

