Subdata Selection Approach for Generalized Linear Models

Using subdata for analysis is almost inevitable when the full data are very big. The theory of optimal design, developed for constructing small experimental designs, can be brought in to propose an Information-Based Optimal Subdata Selection (IBOSS) method for data reduction, based on maximizing information matrices. I develop an algorithm implementing this method for generalized linear models. In numerical studies with simulated and real datasets, the IBOSS method using this algorithm performs significantly better in model estimation and computational efficiency, compared with the uniform sampling method and with estimation on the full data.

2. Algorithm

2.1 Framework

Assume that the full data $(\mathbf{x}_1,y_1), (\mathbf{x}_2,y_2), \ldots, (\mathbf{x}_n,y_n)$ follow a generalized linear model,
\[E(\mathbf{Y}) = \boldsymbol{\mu} = g^{-1}(\mathbf{X}\boldsymbol{\beta}),\]
where the full data of size $n$ include $p$ covariates, collected in the design matrix $\mathbf{X} = (\mathbf{x}_1, \ldots, \mathbf{x}_n)^T$.
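The derivation is cut off in this preview, but the flavor of the approach can be seen in the linear-model IBOSS rule of Wang, Yang and Stufken (2019), on which the GLM version builds: to obtain subdata of size $k$, take, covariate by covariate, the $r = k/(2p)$ not-yet-selected points with the smallest values and the $r$ with the largest. A minimal Python sketch, using simulated data of my own for illustration rather than the post's code:

import numpy as np

def iboss_select(X, k):
    # D-optimality-motivated IBOSS rule for linear models: for each
    # covariate, keep the r points with the smallest and the r points
    # with the largest values among those not yet selected
    n, p = X.shape
    r = k // (2 * p)
    remaining = np.arange(n)
    selected = []
    for j in range(p):
        order = remaining[np.argsort(X[remaining, j])]
        take = np.concatenate([order[:r], order[-r:]])
        selected.append(take)
        remaining = np.setdiff1d(remaining, take)
    return np.concatenate(selected)

# toy comparison against uniform subsampling on simulated linear data
rng = np.random.default_rng(0)
n, p, k = 100_000, 5, 1_000
X = rng.standard_normal((n, p))
beta = np.ones(p)
y = X @ beta + rng.standard_normal(n)

idx_iboss = iboss_select(X, k)
idx_unif = rng.choice(n, size=k, replace=False)

for name, idx in [("IBOSS", idx_iboss), ("uniform", idx_unif)]:
    est = np.linalg.lstsq(X[idx], y[idx], rcond=None)[0]
    print(name, "estimation error:", np.linalg.norm(est - beta))

Because the selected points sit at the extremes of each covariate, the subdata typically estimate the slope coefficients with noticeably smaller error than a uniform subsample of the same size.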
Recent posts

Face completion with multi-output estimators using Python scikit-learn

This example compares the performance of extremely randomized trees, k-nearest neighbors, linear regression, and ridge regression in model estimation and prediction for face completion, i.e. predicting the lower half of a face given the upper half. The dataset is "fetch_olivetti_faces", which comes with the sklearn library. From the scikit-learn library, we should import:

extremely randomized trees: ensemble.ExtraTreesRegressor
k-nearest neighbors: neighbors.KNeighborsRegressor
linear regression: linear_model.LinearRegression
ridge regression: linear_model.RidgeCV

The example code is as follows:

# face completion
print(__doc__)

import numpy as np
import matplotlib.pyplot as plt

from sklearn.datasets import fetch_olivetti_faces
from sklearn.utils.validation import check_random_state
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import RidgeCV
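The preview stops after the imports. A sketch of the remaining steps, following the structure of scikit-learn's face-completion example (the split of subjects into train and test and the half-face slicing follow that example; the estimator settings are illustrative):

# continues from the imports above
data = fetch_olivetti_faces()
targets = data.target
faces = data.images.reshape((len(data.images), -1))

# train on the first 30 subjects, test on the rest
train = faces[targets < 30]
test = faces[targets >= 30]

n_pixels = faces.shape[1]
X_train = train[:, :(n_pixels + 1) // 2]  # upper halves of the faces
y_train = train[:, n_pixels // 2:]        # lower halves of the faces
X_test = test[:, :(n_pixels + 1) // 2]
y_test = test[:, n_pixels // 2:]

# each estimator handles the multi-output target (all lower-half pixels) directly
ESTIMATORS = {
    "Extra trees": ExtraTreesRegressor(n_estimators=10, max_features=32,
                                       random_state=0),
    "K-nn": KNeighborsRegressor(),
    "Linear regression": LinearRegression(),
    "Ridge": RidgeCV(),
}
y_predict = {}
for name, estimator in ESTIMATORS.items():
    estimator.fit(X_train, y_train)
    y_predict[name] = estimator.predict(X_test)
    print(name, "MSE:", np.mean((y_predict[name] - y_test) ** 2))

Plotting a few test faces next to each estimator's completed lower half, as the original example does, makes the comparison visual.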

LeetCode Algorithm Questions in Python

4. Median of Two Sorted Arrays

There are two sorted arrays nums1 and nums2 of size m and n respectively. Find the median of the two sorted arrays.

Example 1:
nums1 = [1, 3]
nums2 = [2]
The median is 2.0

Example 2:
nums1 = [1, 2]
nums2 = [3, 4]
The median is (2 + 3)/2 = 2.5

The idea is to pick up the numbers in nums2 that are smaller than a number in nums1, until we reach the median. Pay attention:
1. consider three cases: odd, even, and empty arrays;
2. the max of nums1 may be smaller than the min of nums2;
3. use float() when doing the division.

My code in Python is as follows:

class Solution(object):
    def findMedianSortedArrays(self, nums1, nums2):
        """
        :type nums1: List[int]
        :type nums2: List[int]
        :rtype: float
        """
        m = len(nums1)
        n = len(nums2)
        mn = m + n
        i = 0
        acc = 0
        med2 = 0
        if m == 0 or n == 0:
            nums1.extend(nums2)
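The preview truncates the solution after the empty-array branch. Below is a complete sketch in the same spirit, walking through the merged order until the middle position is reached; this is my own completion, not the original code:

class Solution(object):
    def findMedianSortedArrays(self, nums1, nums2):
        # walk the merged order, keeping the last two values seen,
        # until the middle position of the combined length is reached
        total = len(nums1) + len(nums2)
        i = j = 0
        prev = cur = 0
        for _ in range(total // 2 + 1):
            prev = cur
            if i < len(nums1) and (j >= len(nums2) or nums1[i] <= nums2[j]):
                cur = nums1[i]
                i += 1
            else:
                cur = nums2[j]
                j += 1
        if total % 2 == 1:
            return float(cur)       # odd combined length: the middle element
        return (prev + cur) / 2.0   # even: average of the two middle elements

This runs in O(m + n) time; the canonical solution instead partitions the two arrays with binary search to reach the O(log(m + n)) bound the problem asks for.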

Display the Records Which Have N or More Consecutive Rows with Amount More Than K

Example: Human Traffic of Stadium

X city built a new stadium; each day many people visit it, and the stats are saved with these columns: id, date, people. Please write a query to display the records that have 3 or more consecutive rows with at least 100 people each. For example, the table stadium:

+------+------------+-----------+
| id   | date       | people    |
+------+------------+-----------+
| 1    | 2017-01-01 | 10        |
| 2    | 2017-01-02 | 109       |
| 3    | 2017-01-03 | 150       |
| 4    | 2017-01-04 | 99        |
| 5    | 2017-01-05 | 145       |
| 6    | 2017-01-06 | 1455      |
| 7    | 2017-01-07 | 199       |
| 8    | 2017-01-08 | 188       |
+------+------------+-----------+

For the sample data above, the output is:

+------+------------+-----------+
| id   | date       | people    |
+------+------------+-----------+
| 5    | 2017-01-05 | 145       |
| 6    | 2017-01-06 | 1455      |
| 7    | 2017-01-07 | 199       |
| 8    | 2017-01-08 | 188       |
+------+------------+-----------+
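The post's SQL query is cut off in this preview. For illustration, the same logic expressed in Python with pandas: keep the qualifying rows, label runs of consecutive ids, then keep runs of length 3 or more (the DataFrame below simply recreates the sample table):

import numpy as np
import pandas as pd

stadium = pd.DataFrame({
    "id": range(1, 9),
    "date": pd.date_range("2017-01-01", periods=8).strftime("%Y-%m-%d"),
    "people": [10, 109, 150, 99, 145, 1455, 199, 188],
})

# keep rows meeting the threshold, then label runs: consecutive ids
# share the same value of id minus their position within this subset
ok = stadium[stadium["people"] >= 100].copy()
ok["run"] = ok["id"].to_numpy() - np.arange(len(ok))
result = ok.groupby("run").filter(lambda g: len(g) >= 3).drop(columns="run")
print(result)

On the sample data, ids 2 and 3 form a run of only two qualifying rows and are dropped, while ids 5 through 8 form a run of four and are kept.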

Statistics Guide

Ridge Regression

Ridge regression can be used to deal with multicollinearity. In this example, I study the factors that influence beef consumption, using time series data on beef consumption and on the prices of beef, pork, chicken, and fish from 1975 to 2015. The dependent variable is beef consumption, and the independent variables are the real prices of beef, pork, chicken, and fish, and the CPI. We can use the RIDGE= option in PROC REG for ridge regression, setting the values of the ridge parameter; the results are stored in the dataset named by OUTEST=.

proc reg data=beef outvif outest=b ridge=0 to 0.05 by .005;
    model beef = year pricebeef_real pricepork_real pricebroilers_real pricefish_real cpi / vif lackfit dwprob spec;
run;

proc print data=b;
run;

The first plot shows the estimation results of ordinary linear regression; the VIF values indicate severe multicollinearity. The second plot shows that the VIF declines as the ridge parameter increases.
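For readers without SAS, a rough Python sketch of the same diagnostics: VIFs for the predictors, then a ridge trace over a grid of penalties. The file name and column names mirror the SAS example and are hypothetical, and sklearn's alpha is parameterized differently from SAS's ridge k, so the grid is only indicative:

import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler
from statsmodels.stats.outliers_influence import variance_inflation_factor

beef = pd.read_csv("beef.csv")  # hypothetical export of the beef dataset
predictors = ["year", "pricebeef_real", "pricepork_real",
              "pricebroilers_real", "pricefish_real", "cpi"]
X = StandardScaler().fit_transform(beef[predictors])
y = beef["beef"].to_numpy()

# variance inflation factors on the standardized predictors
for i, name in enumerate(predictors):
    print(name, "VIF:", variance_inflation_factor(X, i))

# ridge trace: how the coefficients shrink as the penalty grows
for alpha in np.arange(0.0, 0.051, 0.005) * len(y):
    coefs = Ridge(alpha=alpha).fit(X, y).coef_
    print(round(float(alpha), 3), np.round(coefs, 3))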

Trend Removal Using the Hodrick-Prescott (HP) Filter

The Hodrick-Prescott filter (see Hodrick and Prescott (1997)) is a popular tool in macroeconomics for fitting a smooth trend to a time series. In SAS, we can use PROC UCM to realize the HP filter. The dataset considered in this example consists of quarterly real GDP for the United States from 1947 to 2016 (billions of chained 2009 dollars, seasonally adjusted annual rate). The data can be downloaded from this link: https://fred.stlouisfed.org/series/GDPC1

%macro hp(input=, date=, int=, var=, par=, out=);
proc ucm data=&input;
    id &date interval=&int;
    model &var;
    irregular plot=smooth;
    level var=0 noest plot=smooth;
    slope var=&par noest;
    estimate PROFILE;
    forecast plot=(decomp) outfor=&out;
run;
%mend;

%hp(input=gdp, date=year, int=qtr, var=gdp, par=0.000625, out=result);

I use SAS macros to define a function for the HP filter: "input" is the dataset you use, "date" is the variable for time, "int" is the time interval, "var" is the series to be filtered, "par" is the smoothing parameter (0.000625 = 1/1600 for quarterly data), and "out" is the output dataset.
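The same filter is available outside SAS. For instance, a minimal Python sketch with statsmodels, where the CSV file and column names are a hypothetical export of the FRED series, and lamb=1600 is the conventional smoothing parameter for quarterly data, matching par=0.000625 = 1/1600 above:

import pandas as pd
from statsmodels.tsa.filters.hp_filter import hpfilter

# hypothetical CSV export of the FRED GDPC1 series
gdp = pd.read_csv("GDPC1.csv", index_col="DATE", parse_dates=True)["GDPC1"]

# lamb=1600 is the standard choice for quarterly series
cycle, trend = hpfilter(gdp, lamb=1600)

print(trend.tail())  # smooth trend component
print(cycle.tail())  # cyclical component (series minus trend)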