Skip to main content

Posts

Showing posts from April, 2017

Statistics Guide

Ridge Regression

Ridge regression can be used to deal with the multicollinearity. In this example, I want to study on the factors having influences on the beef consumption, using the time series data including beef consumption, price of beef, pork, chicken, and fish from 1975 to 2015. The dependent variable is beef consumption, and independent variables are the real price of beef, pork, chicken, and fish, and CPI. We can use ridge option in PROC REG for ridge regression, by setting the value of ridge parameter. The results will be stored in the data file "outest=". proc reg data =beef outvif outest =b ridge = 0 to 0.05 by .005 ; model beef = year pricebeef_real pricepork_real pricebroilers_real pricefish_real cpi / vif lackfit dwprob spec ; run ; proc print data =b; run ; The first plot is the estimation results of linear regression. We can find that there exists severe multicollinearity, according to the values of VIF. The second plot shows that the VIF decl...

Trend Removal Using the Hodrick-Prescott (HP) Filter

Hodrick-Prescott filter (see Hodrick and Prescott (1997)) is a popular tool in macroeconomics for fitting smooth trend to time series. In SAS, we can use PROC UCM to realize the HP filter.  The dataset considered in this example consists of quarterly real GDP for the United States from 1947-2016  (b illions of chained 2009 dollars ,  seasonally adjusted annual rate ). The data can be download from this link  https://fred.stlouisfed.org/series/GDPC1   %macro hp(input= ,date= ,int= ,var= ,par= ,out= ); proc ucm data=&input; id &date interval=∫ model &var; irregular plot=smooth; level var= 0 noest plot=smooth; slope var=&par noest; estimate PROFILE; forecast plot=(decomp) outfor=&out; run; %mend ; % hp (input=gdp,date=year,int=qtr,var=gdp,par= 0.000625 ,out=result); I use SAS MACROS to define a function for HP filter. "input" is the data file you use, "date" is the variable for time, "int...

Diagnostics and Remedial Measures for Outlying and Influential Cases

An Example using Zillow Data In order to identify outliers (outlying Y), we can take look at the values of semistudentized residuals, studentized residuals, studentized deleted residuals. a. semistudentized residuals: $\frac{e_i}{\sqrt{MSE}}$. b. studentized residuals: $\frac{e_i}{\sqrt{MSE(1-h_{ii})}}$, where $h_{ii}$ is diagonal element of Hat matrix. c. studentized deleted residuals: $d_i = Y_i - \hat{Y_{i(i)}} = \frac{e_i}{1-h_{ii}}$, redo the regression without observation i, to get $MSE_{(i)}$, so that $s^2\{d_i\} = MSE_{(i)}(1+X'_i(X'_{(i)}X_{(i)})^{-1}X_i)=\frac{MSE_{(i)}}{1-h_{ii}}$ $t_i = \frac{d_i}{s\{d_i\}}=\frac{e_i}{\sqrt{MSE_{(i)}(1-h_{ii})}}\sim t(n-p-1)$ Outlying X ( leverage ) can be identified by using Hat matrix. An observation is usually considered to be a leverage if $h_{ii} > 2p/n$. Another suggested guideline is that $h_{ii}$ exceeding 0.5 indicates very high leverage, whereas between 0.2 and 0.5 indicates moderate leverage. Influential c...

PROC GLM Example: Linear Regression with Dummies

Absenteeism . Data on 77 employees of the ABX Company have been collected. The dependent variable is absenteeism (ABSENT). The possible explanatory variables are COMPLX = measure of job complexity SENIOR = seniority SATIS = response to "How satisfied are you with your foreman?" In this example, use SENINV = 1/SENIOR, which is the reciprocal of the seniority variable, and COMPLX as two of the explanatory variables. The variable SATIS should be transformed into indicator variables (1 is very dissatisfied, 2 is somewhat dissatisfied, 3 is neither satisfied nor dissatisfied, 4 is somewhat satisfied, 5 is very satisfied). Q1 . Is there a difference in average absenteeism for employees in different supervisor satisfaction groups? Q2 . Using the model chosen, what would be your estimate of the average absenteeism rate for all employees with COMPLX = 60 and SENIOR = 30 who were very dissatisfied with their supervisor? What if they were very satisfied with their supervisor...

LeetCode Database Questions by MySQL

175. Combine Two Tables Table:  Person +-------------+---------+ | Column Name | Type | +-------------+---------+ | PersonId | int | | FirstName | varchar | | LastName | varchar | +-------------+---------+ PersonId is the primary key column for this table. Table:  Address +-------------+---------+ | Column Name | Type | +-------------+---------+ | AddressId | int | | PersonId | int | | City | varchar | | State | varchar | +-------------+---------+ AddressId is the primary key column for this table. Write a SQL query for a report that provides the following information for each person in the Person table, regardless if there is an address for each of those people: FirstName, LastName, City, State My Solution: # Write your MySQL query statement below select FirstName, LastName, City, State from Person left join Address on Person.PersonId = Address.PersonId; 176. Second Highest Salary Write a SQL query to get the ...