When the full data set is very large, analysis almost inevitably has to be carried out on subdata. The theory of optimal design, developed for constructing small experimental data sets, can be applied to data reduction through the proposed Information-Based Optimal Subdata Selection (IBOSS) method, which selects subdata by maximizing the information matrix. I develop an algorithm that implements this method for generalized linear models. In numerical studies with simulated and real datasets, the IBOSS method using this algorithm performs significantly better in model estimation and computational efficiency than the uniform sampling method and estimation on the full data.

2. Algorithm

2.1 Framework

Assume that the full data $(\mathbf{x}_1,y_1), (\mathbf{x}_2,y_2), \ldots, (\mathbf{x}_n,y_n)$ follow a generalized linear model,
\[E(\mathbf{Y}) = \boldsymbol{\mu} = g^{-1}(\mathbf{X}\boldsymbol{\beta})\]
where the full data of size $n$ include $p$ covariates, $\mathbf{X} = (\mathbf{x}_1, ....
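To make the idea of information-based selection concrete, here is a minimal sketch for the simplest special case only: a Gaussian linear model (identity link), where the information matrix is proportional to $\mathbf{Z}^T\mathbf{Z}$ for the design matrix $\mathbf{Z}$ with an intercept column. It is not the generalized linear model algorithm developed in this post, which the excerpt does not show. The selection rule (take the smallest and largest values of each covariate in turn) is the heuristic usually described for linear regression and is assumed here; the function names, the simulated data, and the sizes `n`, `p`, `k` are all illustrative.

```python
import numpy as np


def information_matrix(X):
    """Information matrix of a Gaussian linear model, up to the factor 1/sigma^2."""
    Z = np.column_stack([np.ones(len(X)), X])  # prepend an intercept column
    return Z.T @ Z


def uniform_subdata(X, k, rng):
    """Indices of k rows drawn uniformly at random without replacement."""
    return rng.choice(len(X), size=k, replace=False)


def extreme_value_subdata(X, k):
    """Hypothetical helper: for each covariate in turn, pick the r smallest and
    r largest values among rows not yet selected, with r = k // (2 * p)."""
    n, p = X.shape
    r = k // (2 * p)
    available = np.ones(n, dtype=bool)
    chosen = []
    for j in range(p):
        idx = np.flatnonzero(available)        # rows still eligible
        order = idx[np.argsort(X[idx, j])]     # eligible rows sorted by covariate j
        picked = np.concatenate([order[:r], order[-r:]])
        chosen.append(picked)
        available[picked] = False              # never select a row twice
    return np.concatenate(chosen)


# Simulated covariates (an assumption; not the datasets used in the post).
rng = np.random.default_rng(0)
n, p, k = 100_000, 5, 1_000
X = rng.standard_normal((n, p))

idx_unif = uniform_subdata(X, k, rng)
idx_ext = extreme_value_subdata(X, k)

# Compare log-determinants of the two subdata information matrices;
# a larger value means more information under the D-optimality criterion.
_, logdet_unif = np.linalg.slogdet(information_matrix(X[idx_unif]))
_, logdet_ext = np.linalg.slogdet(information_matrix(X[idx_ext]))
print(f"uniform: {logdet_unif:.2f}  extreme-value: {logdet_ext:.2f}")
```

For other link functions the information matrix also depends on weights that vary with $\boldsymbol{\beta}$, so this extreme-value rule should not be assumed to carry over unchanged.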
This example compares extremely randomized trees, k-nearest neighbors, linear regression, and ridge regression in model estimation and prediction for face completion, i.e. predicting the lower half of a face given its upper half. The dataset is the Olivetti faces, loaded with fetch_olivetti_faces from sklearn.datasets.

From the scikit-learn library, we import:

extremely randomized trees: ensemble.ExtraTreesRegressor
k-nearest neighbors: neighbors.KNeighborsRegressor
linear regression: linear_model.LinearRegression
ridge regression: linear_model.RidgeCV

The example code is as follows:

    # face completion
    print(__doc__)

    import numpy as np
    import matplotlib.pyplot as plt

    from sklearn.datasets import fetch_olivetti_faces
    from sklearn.utils.validation import check_random_state

    from sklearn.ensemble import ExtraTreesRegressor
    from sklearn.neighbors import KNeighborsRegressor
    from sklearn.linear_model import LinearRegression
    from sklearn.linear_model import RidgeCV
    ...
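The core of face completion, splitting each flattened image into an upper half (predictors) and a lower half (targets), can be sketched with NumPy alone. This is a rough illustration under stated assumptions, not the example's actual code: small synthetic images stand in for the Olivetti faces so that nothing has to be downloaded, an ordinary least-squares fit via `np.linalg.lstsq` stands in for the scikit-learn estimators, and the names `split_halves`, `make_images`, and `mixing` are hypothetical.

```python
import numpy as np


def split_halves(images):
    """Split flattened images of shape (n_samples, n_pixels) into halves."""
    n_pixels = images.shape[1]
    upper = images[:, : (n_pixels + 1) // 2]  # predictors: top rows of each image
    lower = images[:, n_pixels // 2 :]        # targets: bottom rows of each image
    return upper, lower


rng = np.random.default_rng(0)
half = 32  # assumption: 8x8 synthetic images, so 32 pixels per half
mixing = rng.standard_normal((half, half)) / half


def make_images(n):
    """Synthetic images whose lower half is a noisy linear map of the upper half."""
    top = rng.random((n, half))
    bottom = top @ mixing + 0.01 * rng.standard_normal((n, half))
    return np.hstack([top, bottom])


train, test = make_images(200), make_images(5)
X_train, y_train = split_halves(train)
X_test, y_test = split_halves(test)

# Multi-output least squares with an intercept: one coefficient column
# per lower-half pixel.
A_train = np.column_stack([np.ones(len(X_train)), X_train])
coef, *_ = np.linalg.lstsq(A_train, y_train, rcond=None)
y_pred = np.column_stack([np.ones(len(X_test)), X_test]) @ coef

# Stitch the known upper half to the predicted lower half.
completed = np.hstack([X_test, y_pred])
mse = np.mean((y_pred - y_test) ** 2)
print(completed.shape, mse)
```

Any scikit-learn regressor that supports multi-output targets could take the place of the least-squares step through its `fit` and `predict` methods, with the same `X_train`, `y_train`, and `X_test` arrays.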