When the full data are very large, analysis almost inevitably has to be carried out on subdata. The theory of optimal design, developed for constructing small experimental datasets, can be brought in to build the Information-Based Optimal Subdata Selection (IBOSS) method for data reduction, which selects subdata by maximizing the information matrix. I develop algorithms that implement this method for generalized linear models. In numerical studies with simulated and real datasets, the IBOSS method delivers significantly better estimation accuracy than the uniform subsampling method and much better computational efficiency than estimation with the full data.
2. Algorithm
2.1 Framework
Assume that the full data $(\mathbf{x}_1,y_1), (\mathbf{x}_2,y_2), ..., (\mathbf{x}_n,y_n)$ follows a generalized linear model,\[E(\mathbf{Y}) = \boldsymbol{\mu} = g^{-1}(\mathbf{X}\boldsymbol{\beta})\]
where the full data of size $n$ include $p$ covariates, $\mathbf{X} = (\mathbf{x}_1, ..., \mathbf{x}_n)'$ with $\mathbf{x}_i = (1, x_{i1}, ..., x_{ip})'$, and $\mathbf{Y} = (y_1,...,y_n)'$; $\beta_0$ is the intercept parameter and $\boldsymbol{\beta}_1$ is the $p$-dimensional vector of slope parameters, so $\boldsymbol{\beta} = (\beta_0, \boldsymbol{\beta}_1')' = (\beta_0, \beta_1, ..., \beta_p)'$; $E(\mathbf{Y})$ is the expected value of $\mathbf{Y}$, and $g$ is the link function of the generalized linear model.
By calculating the negative expected second derivatives of the log-likelihood function, $-E\left(\frac{\partial^2 \ln \mathcal{L}}{\partial \beta_j \partial \beta_k}\right)$, we find that the information matrix of a generalized linear model can be written as
\[ \mathbf{M} = \mathbf{X}'\mathbf{V}\mathbf{X} \]
where $\mathbf{V} = \mathrm{diag}(\Psi(\mathbf{x}_1'\boldsymbol{\beta}), ..., \Psi(\mathbf{x}_n'\boldsymbol{\beta}))$ is a diagonal matrix, and the $(j,k)$ element of the information matrix is
\[\sum_{i=1}^n (\Psi(\mathbf{x}_i'\boldsymbol{\beta})x_{i,j}x_{i,k})\]
Here $\Psi(\mathbf{x}_i'\boldsymbol{\beta})$ is a function of $\mathbf{x}_i'\boldsymbol{\beta}$ whose form depends on the family and link of the GLM.
For example, for a logistic model with $P(y_i) = \mu_i^{y_i}(1-\mu_i)^{1-y_i}$, $y_i \in \{0,1\}$, and $E(y_i) = \mu_i = \frac{\exp(\boldsymbol{x}_i'\boldsymbol{\beta})}{1+\exp(\boldsymbol{x}_i'\boldsymbol{\beta})}$, the log-likelihood function is
\[\ln \mathcal{L} = \sum_{i=1}^n [y_i \ln \mu_i + (1-y_i)\ln (1-\mu_i)]\]
Since the derivative of $\mu_i = \frac{\exp(\boldsymbol{x}_i'\boldsymbol{\beta})}{1+\exp(\boldsymbol{x}_i'\boldsymbol{\beta})}$ with respect to a parameter $\beta_j$ is $\mu_i(1-\mu_i)x_{i,j}$, the first derivative of the log-likelihood with respect to $\beta_j$ is
\[\frac{\partial \ln \mathcal{L}}{\partial \beta_j} = \sum_{i=1}^n [(y_i - \mu_i)x_{i,j}]\]
and the second derivative is
\[\frac{\partial^2 \ln \mathcal{L}}{\partial \beta_j \partial \beta_k} = -\sum_{i=1}^n [\mu_i(1-\mu_i)x_{i,j}x_{i,k}]\]
Thus, the $\Psi(\mathbf{x}_i'\boldsymbol{\beta})$ function of a logistic model is $\mu_i(1-\mu_i)$.
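To make the role of $\Psi$ concrete, here is a minimal R sketch that computes $\Psi(\mathbf{x}_i'\boldsymbol{\beta}) = \mu_i(1-\mu_i)$ for the logistic model and assembles the information matrix $\mathbf{M} = \mathbf{X}'\mathbf{V}\mathbf{X}$; the function names `psi_logistic` and `info_matrix` are illustrative, not part of any existing package.

```r
# Psi for the logistic (binomial family, logit link) model: mu_i * (1 - mu_i)
psi_logistic <- function(X, beta) {
  eta <- drop(X %*% beta)          # linear predictor x_i' beta
  mu  <- 1 / (1 + exp(-eta))       # inverse logit
  mu * (1 - mu)
}

# Information matrix M = X' V X with V = diag(psi)
info_matrix <- function(X, psi) {
  crossprod(X * psi, X)            # row i of X is scaled by psi_i
}

# Small example: intercept plus two covariates
set.seed(1)
X    <- cbind(1, matrix(rnorm(20), ncol = 2))
beta <- c(0.1, 0.1, 0.1)
M    <- info_matrix(X, psi_logistic(X, beta))
```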
The strategy of IBOSS is to choose a subdata whose information matrix is maximized, so that it consists of the most informative data points. The "maximum" of a matrix must be defined through some optimality criterion; in this paper we develop algorithms under the D- and T-optimality criteria.
2.2 D-Optimality
Under D-optimality, we need to select a subdata of size $k$ such that the determinant of its information matrix $\mathbf{M}_s = \sum_{i=1}^k \Psi(\mathbf{x}_i'\boldsymbol{\beta})\mathbf{x}_i\mathbf{x}_i'$ is maximized.
Theorem 1 (D-optimality). For any subdata of size $k$, the determinant of its information matrix satisfies
\[|\mathbf{M}_s| \leq \frac{(\sum_{i=1}^k\Psi(\mathbf{x}_i'\boldsymbol{\beta}))^{p+1}}{4^p}\prod_{j=1}^p (x_{(n)j}-x_{(1)j})^2 \]
where $x_{(n)j} = \max\{x_{1j}, ..., x_{nj}\}$ and $x_{(1)j} = \min\{x_{1j}, ..., x_{nj}\}$.
Theorem 1 gives us guidance for extracting an optimal subdata, although two difficulties arise in making the determinant of the information matrix close to this upper bound. First, the bound is proportional to $(\sum_{i=1}^k\Psi(\mathbf{x}_i'\boldsymbol{\beta}))^{p+1}\prod_{j=1}^p (x_{(n)j}-x_{(1)j})^2$, so we not only need to select data points with extreme order statistics in each covariate $j$, but also have to make the sum of $\Psi(\mathbf{x}_i'\boldsymbol{\beta})$ over the subdata large. Second, the value of $\Psi(\mathbf{x}_i'\boldsymbol{\beta})$ depends on the unknown parameters $\boldsymbol{\beta}$, which we do not know before model estimation.
One way to overcome the second difficulty is to use a ``guess'' of the parameters, as researchers do for locally optimal designs. For example, the unknown parameters can be assigned the estimates from a simple random subsample. A Bayesian approach is an alternative that can incorporate some uncertainty about this guess. We can also employ a multistage approach, which starts with a small initial design to obtain some information about the parameters and then carries out the estimation in the next stage using this information (Silvey, 1980; Sitter and Forbes, 1997). In the algorithm proposed in this paper, we first assign the value of $\boldsymbol{\beta}$ by estimating the model on a uniform subsample. Our solution to the first difficulty is a little more subtle. Note that
\[(\sum_{i=1}^k\Psi(\mathbf{x}_i'\boldsymbol{\beta}))^{p+1}\prod_{j=1}^p (x_{(n)j}-x_{(1)j})^2 = \prod_{j=1}^p \left[(\sum_{i=1}^k\Psi(\mathbf{x}_i'\boldsymbol{\beta}))^{\frac{p+1}{2p}}(x_{(n)j}-x_{(1)j})\right]^2\]
and $x_{(n)j}-x_{(1)j} = x^{\ast}_{(n)j}-x^{\ast}_{(1)j}$, where $x^{\ast}_{(n)j} = x_{(n)j} - \bar{x}_j$, $x^{\ast}_{(1)j} = x_{(1)j} - \bar{x}_j$. We can select the extreme order statistics of $\tilde{x}_{1j}, ..., \tilde{x}_{nj}$, instead of $x_{1j}, ..., x_{nj}$, where
\[\tilde{x}_{ij} = \Psi(\mathbf{x}_i'\boldsymbol{\beta})^{\frac{p+1}{2p}}x^{\ast}_{ij} = \Psi(\mathbf{x}_i'\boldsymbol{\beta})^{\frac{p+1}{2p}}(x_{ij} - \bar{x}_j), \ i = 1,...,n, j = 1,...,p\]
By centering the data first, we make it more likely that the extreme values $\tilde{x}_{(n)j}$ and $\tilde{x}_{(1)j}$ also correspond to points with large $\Psi(\mathbf{x}_i'\boldsymbol{\beta})$. With these strategies, the algorithm under D-optimality is as follows.
Algorithm 1 (D-optimality algorithm)
1. Draw a random subsample of size $k_0$, according to the probabilities $\pi_i = 1/n$, and estimate $\hat{\boldsymbol{\beta}}^{\text{UNIF}}$ based on the subsample.
2. Replace $\boldsymbol{\beta}$ with $\hat{\boldsymbol{\beta}}^{\text{UNIF}}$ and calculate $\Psi(\mathbf{x}_i'\hat{\boldsymbol{\beta}}^{\text{UNIF}})$, $i = 1, ..., n$, using $\hat{\boldsymbol{\beta}}^{\text{UNIF}}$ as a surrogate for the unknown parameters.
3. Center the data, $x^{\ast}_{ij} = x_{ij} - \bar{x}_j$, and calculate $\tilde{x}_{ij} = \Psi(\mathbf{x}_i'\hat{\boldsymbol{\beta}}^{\text{UNIF}})^{\frac{p+1}{2p}}x^{\ast}_{ij}$, $i = 1, ..., n$, $j = 1, ..., p$.
4. Include $r$ data points with the $r$ smallest $\tilde{x}_{i1}$ values and $r$ data points with the $r$ largest $\tilde{x}_{i1}$ values, where $r = k/(2p)$.
5. For $j = 2, ..., p$, exclude data points previously selected, and from the remainder select $r$ data points with the smallest $\tilde{x}_{ij}$ values and $r$ data points with the largest $\tilde{x}_{ij}$ values for the subdata.
6. With the subdata extracted, estimate the parameters of the model.
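The following R code is a readable sketch of Algorithm 1 rather than the optimized implementation described later: the extreme $\tilde{x}_{ij}$ values are found with order() instead of a partition-based selection, and the names `iboss_d` and `psi_fun` are my own conventions. Here `psi_fun` is assumed to map the linear predictor $\mathbf{x}_i'\boldsymbol{\beta}$ to $\Psi(\mathbf{x}_i'\boldsymbol{\beta})$ for the chosen family and link, e.g. `function(eta) { mu <- 1/(1 + exp(-eta)); mu * (1 - mu) }` for the logistic model.

```r
# D-optimality IBOSS (Algorithm 1), a readable sketch.
# X: n x p covariate matrix (no intercept column); y: response vector
# k: subdata size; k0: size of the pilot uniform subsample
# family: a glm family object, e.g. binomial()
# psi_fun: function of the linear predictor returning Psi(x_i' beta)
iboss_d <- function(X, y, k, k0, family, psi_fun) {
  n <- nrow(X); p <- ncol(X); r <- floor(k / (2 * p))

  # Step 1: pilot estimate from a uniform subsample
  idx0     <- sample(n, k0)
  beta_hat <- coef(glm(y[idx0] ~ X[idx0, , drop = FALSE], family = family))

  # Step 2: Psi evaluated at the pilot estimate
  psi <- psi_fun(drop(cbind(1, X) %*% beta_hat))

  # Step 3: center the covariates and form x-tilde
  Xc <- scale(X, center = TRUE, scale = FALSE)
  Xt <- psi^((p + 1) / (2 * p)) * Xc

  # Steps 4-5: for each covariate, take the r smallest and r largest
  # x-tilde values among points not yet selected
  selected <- integer(0)
  for (j in seq_len(p)) {
    remaining <- setdiff(seq_len(n), selected)
    ord <- remaining[order(Xt[remaining, j])]
    selected <- c(selected, head(ord, r), tail(ord, r))
  }

  # Step 6: fit the model on the selected subdata
  glm(y[selected] ~ X[selected, , drop = FALSE], family = family)
}
```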
2.3 T-Optimality
The trace of the information matrix of the full data is\[\mathrm{tr}(\mathbf{M}) = \sum_{j=0}^p \sum_{i=1}^n \Psi(\mathbf{x}_i'\boldsymbol{\beta})x_{i,j}^2 = \sum_{i=1}^n \Psi(\mathbf{x}_i'\boldsymbol{\beta})\mathbf{x}_i' \mathbf{x}_i\]
where $x_{i,0} = 1$ is the intercept term; therefore, under the T-optimality criterion, we can maximize the trace of the information matrix of the subdata by extracting the data points with the largest values of $\Psi(\mathbf{x}_i'\boldsymbol{\beta}) \sum_{j=1}^p \breve{x}_{i,j}^2$, where $\breve{x}_{i,j}$ is the standardized covariate value defined in Algorithm 2 below.
When using T-optimality, the data should be standardized; otherwise the subdata may concentrate too heavily in one direction. As in Algorithm 1, we use the estimates from a uniform random subsample in place of the unknown parameters to calculate $\Psi(\mathbf{x}_i'\boldsymbol{\beta})$. The algorithm under T-optimality is as follows.
Algorithm 2 (T-optimality algorithm)
1. Draw a random subsample of size $k_0$, according to the probabilities $\pi_i = 1/n$, and estimate $\hat{\boldsymbol{\beta}}^{\text{UNIF}}$ based on the subsample.
2. Replace $\boldsymbol{\beta}$ with $\hat{\boldsymbol{\beta}}^{\text{UNIF}}$ and calculate $\Psi(\mathbf{x}_i'\hat{\boldsymbol{\beta}}^{\text{UNIF}})$, $i = 1, ..., n$, using $\hat{\boldsymbol{\beta}}^{\text{UNIF}}$ as a surrogate for the unknown parameters.
3. Standardize the data, $\breve{x}_{ij} = \frac{x_{ij} - \bar{x}_j}{\mathrm{sd}(x_j)}$, $i = 1,...,n$, $j = 1,...,p$, where $\mathrm{sd}(x_j)$ is the sample standard deviation of the $j$th covariate.
4. Select the $k$ data points with the largest values of $\Psi(\mathbf{x}_i'\hat{\boldsymbol{\beta}}^{\text{UNIF}})\breve{\mathbf{x}}_i'\breve{\mathbf{x}}_i$.
5. With the subdata extracted, estimate the parameters of the model.
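A matching R sketch of Algorithm 2, under the same assumptions as before (illustrative names, a full sort in place of partition-based selection, and `psi_fun` as defined for Algorithm 1):

```r
# T-optimality IBOSS (Algorithm 2), a readable sketch.
iboss_t <- function(X, y, k, k0, family, psi_fun) {
  n <- nrow(X)

  # Steps 1-2: pilot estimate and Psi at that estimate
  idx0     <- sample(n, k0)
  beta_hat <- coef(glm(y[idx0] ~ X[idx0, , drop = FALSE], family = family))
  psi      <- psi_fun(drop(cbind(1, X) %*% beta_hat))

  # Step 3: standardize each covariate
  Xs <- scale(X)                      # (x_ij - xbar_j) / sd(x_j)

  # Step 4: keep the k points with the largest psi_i * ||x-breve_i||^2
  score    <- psi * rowSums(Xs^2)
  selected <- order(score, decreasing = TRUE)[seq_len(k)]

  # Step 5: fit the model on the selected subdata
  glm(y[selected] ~ X[selected, , drop = FALSE], family = family)
}
```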
A generalized linear model is usually estimated by an iterative algorithm such as gradient descent or Newton's method, which requires $O(np^2)$ computing time on the full data. Uniform subsampling requires $O(n)$ time to extract a random subsample and $O(k_0p^2)$ time for estimation. Both IBOSS algorithms need the estimates from uniform subsampling, and additionally need $O(np)$ time to center (or standardize) the data and calculate $\Psi(\mathbf{x}_i'\boldsymbol{\beta})^{\frac{p+1}{2p}}x^{\ast}_{ij}$ or $\Psi(\mathbf{x}_i'\boldsymbol{\beta})\breve{\mathbf{x}}_i'\breve{\mathbf{x}}_i$. In steps 4 and 5 of Algorithm 1 and step 4 of Algorithm 2, we use a partition-based selection algorithm to find the required smallest or largest elements, because its average-case running time is $O(n)$; the selection step therefore takes $O(np)$ time in Algorithm 1 and $O(n)$ time in Algorithm 2. In total, the time complexity of Algorithm 1 is $O(n + k_0p^2 + np + np + kp^2)$ and that of Algorithm 2 is $O(n + k_0p^2 + np + n + kp^2)$. The algorithms are faster than estimation on the full data when $n \gg k_0 + k$.
We implement each subdata selection method in the R programming language, calling the C++ standard library function std::nth_element through Rcpp for the partition-based selection step, in order to improve the efficiency of the R code.
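As one possible way to expose std::nth_element to R, the sketch below uses Rcpp::cppFunction; it assumes a working C++ toolchain, and the helper name `kth_largest` is mine rather than part of the implementation described above.

```r
library(Rcpp)

# Partition-based selection: the value of the k-th largest element of v,
# found in O(n) average time with std::nth_element (no full sort).
cppFunction(
  includes = c("#include <algorithm>", "#include <functional>"),
  code = '
double kth_largest(NumericVector v, int k) {
  std::vector<double> x(v.begin(), v.end());
  std::nth_element(x.begin(), x.begin() + (k - 1), x.end(),
                   std::greater<double>());
  return x[k - 1];
}')

# Usage sketch: keep the k points with the largest scores
# cutoff   <- kth_largest(score, k)   # ties may admit slightly more than k
# selected <- which(score >= cutoff)
```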
3. Numerical Experiments
In this section, we evaluate the estimation and computational performance of the IBOSS method for generalized linear models, using simulated and real datasets.
3.1 Simulation Studies
The simulated full data of size $n$ are generated from a generalized linear model. The response $\mathbf{Y}$ is drawn from a particular distribution (normal, binomial, Poisson, or gamma), and the mean $\boldsymbol{\mu}$ is related to $\mathbf{X}\boldsymbol{\beta}$ through the link function $g$. The parameter $\boldsymbol{\beta} = 0.1 \times \boldsymbol{1}_{11\times 1}$ is an 11-dimensional vector. $\mathbf{X}$ includes $p = 10$ covariates plus a constant, with the covariates following one of the distributions below.
Case 1: $\mathbf{X}$ has a multivariate normal distribution, $N(\boldsymbol{1}, \boldsymbol{\Sigma})$;
Case 2: $\mathbf{X}$ has a multivariate $t$ distribution with $2$ degrees of freedom, $t_2(\boldsymbol{1}, \boldsymbol{\Sigma})$;
Case 3: $\mathbf{X}$ has a folded normal distribution, $\mathbf{X} = |\mathbf{K}|$, and $\mathbf{K} \sim N(\boldsymbol{1}, \boldsymbol{\Sigma})$.
where $\Sigma_{ij}$, the $(i,j)$ element of $\boldsymbol{\Sigma}$, equals $1$ if $i=j$ and $0.5$ if $i \neq j$.
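As a sketch of how such data can be generated in R (the helper name gen_X is illustrative, and a Poisson model with log link is used as the example response):

```r
library(MASS)   # mvrnorm

p     <- 10
Sigma <- matrix(0.5, p, p); diag(Sigma) <- 1
beta  <- rep(0.1, p + 1)    # intercept plus p slopes

# Covariates for the three simulation cases
gen_X <- function(n, case = c("normal", "t2", "folded")) {
  case <- match.arg(case)
  switch(case,
    normal = mvrnorm(n, mu = rep(1, p), Sigma = Sigma),
    t2     = 1 + mvrnorm(n, mu = rep(0, p), Sigma = Sigma) /
                 sqrt(rchisq(n, df = 2) / 2),           # multivariate t_2
    folded = abs(mvrnorm(n, mu = rep(1, p), Sigma = Sigma)))
}

# Responses from a Poisson model with log link, as one example
n <- 1e5
X <- gen_X(n, "normal")
y <- rpois(n, lambda = exp(drop(cbind(1, X) %*% beta)))
```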
The simulation is repeated $S = 100$ times, and the mean squared errors of the estimates are calculated as $\text{MSE}_{\beta_0} = S^{-1}\sum_{s=1}^S (\hat{\beta}_0^{(s)}-\beta_0)^2$ and $\text{MSE}_{\boldsymbol{\beta}_1} = S^{-1}\sum_{s=1}^S \parallel \hat{\boldsymbol{\beta}}_1^{(s)} - \boldsymbol{\beta}_1 \parallel^2$. Figures 1 and 2 compare the MSEs of the slope estimates across methods. We consider four approaches: D-optimality IBOSS, T-optimality IBOSS, uniform subsampling, and the full data approach. The true parameter value is $\boldsymbol{\beta} = 0.1 \times \boldsymbol{1}_{11\times 1}$, and $k_0$ is fixed at $1000$. In Figure 1, the subdata size is fixed at $k = 1000$ and the full data size $n$ varies; in Figure 2, the full data size is fixed at $n = 10^5$ and the subdata size $k$ varies.
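For reference, the MSE of the slope estimates can be computed with a loop of the following form, reusing the gen_X and iboss_d sketches from the earlier sections (a hypothetical setup with a Poisson model and log link, for which $\Psi(\mathbf{x}_i'\boldsymbol{\beta}) = \exp(\mathbf{x}_i'\boldsymbol{\beta})$):

```r
S    <- 100
sse1 <- numeric(S)                     # squared error of the slope estimates
for (s in seq_len(S)) {
  X   <- gen_X(n, "normal")
  y   <- rpois(n, exp(drop(cbind(1, X) %*% beta)))
  fit <- iboss_d(X, y, k = 1000, k0 = 1000, family = poisson(),
                 psi_fun = function(eta) exp(eta))  # Psi for Poisson, log link
  sse1[s] <- sum((coef(fit)[-1] - beta[-1])^2)      # ||beta1_hat - beta1||^2
}
mse_beta1 <- mean(sse1)
```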
As seen in Figure 1, the D-optimality IBOSS method performs better than the uniform subsampling method in estimating generalized linear models, and its advantage is more pronounced when the covariates follow the heavier-tailed $t$ distribution. The performance of the IBOSS method also differs across generalized linear models. For the Poisson model (family: Poisson, link: log), the IBOSS estimates are almost as accurate as those from the full data when $\mathbf{X}$ follows a multivariate $t$ distribution. Unlike uniform subsampling, whose MSEs remain almost constant, the D-optimality IBOSS method produces MSEs of the slope estimates that decrease as the full data size increases for a fixed subdata size. The T-optimality IBOSS method also outperforms uniform subsampling, but is worse than the D-optimality method.
As the subdata size $k$ increases, the estimation performance of the IBOSS method improves. Figure 2 shows the MSEs of the slope estimates against the subdata size $k$, with $k = 500, 1000, 2000, 3000, 4000$, and $5000$. The full data size is fixed at $n = 10^5$, and $k_0 = 1000$ is fixed as well. In general, the MSEs of both the IBOSS method and uniform subsampling decrease with the subdata size. From Figure 2, the D-optimality IBOSS method always performs better than uniform subsampling. The T-optimality IBOSS method does not significantly improve the estimation of the gamma regression model (family: gamma, link: inverse) when $\mathbf{X}$ follows a folded multivariate normal distribution; nevertheless, as the subdata size increases, its performance improves and its advantage appears quickly.
Except for some particular generalized linear models, such as the Gaussian family with identity link, the Poisson family with square-root link, and the gamma family with log link, for which $\Psi(\mathbf{x}_i'\boldsymbol{\beta})$ does not depend on the unknown parameters, we need to replace $\boldsymbol{\beta}$ with $\hat{\boldsymbol{\beta}}^{\text{UNIF}}$ from a uniform subsample of size $k_0$ in order to calculate $\Psi(\mathbf{x}_i'\boldsymbol{\beta})$ before subdata selection. Since the subsample size $k_0$ affects the accuracy of the pilot estimate, one may be concerned that the choice of $k_0$ could significantly affect the estimation performance of the IBOSS method. Figure 3 shows the MSEs of the slope estimates from the IBOSS method for different choices of $k_0$, and hence different guesses of the unknown parameters. The performance of the IBOSS method is very stable, even when the pilot estimates $\hat{\boldsymbol{\beta}}^{\text{UNIF}}$ have very large MSEs because of a small subsample size $k_0$. This tells us that when using the IBOSS method, we do not need to extract a large pilot subsample to obtain a highly accurate guess of the unknown parameters.
When we use the IBOSS method for subdata selection, we need to know the particular model to be estimated, because the exact form of $\Psi(\mathbf{x}_i'\boldsymbol{\beta})$ depends on the family and link of the GLM. Figure 4 shows the consequences of choosing an incorrect $\Psi(\mathbf{x}_i'\boldsymbol{\beta})$ by comparing the MSEs of the slope estimates. For the binomial family with logit link, using the $\Psi(\mathbf{x}_i'\boldsymbol{\beta})$ of another link still gives very similar results, but simply setting $\Psi = 1$ makes the estimates significantly worse. For the Poisson family with log link, if the $\Psi(\mathbf{x}_i'\boldsymbol{\beta})$ of the identity link is chosen by mistake, the estimates can be even worse than those from uniform subsampling.
3.2 Computational Efficiency
Beyond better estimation accuracy, we now examine the computational efficiency of the IBOSS approach for generalized linear models, using a logistic model as a representative example. Table 2 shows the CPU times for different combinations of the full data size $n$ and the number of covariates $p$, with the subdata size fixed at $k = 1000$ and $\mathbf{X}$ following a multivariate normal distribution. All comparisons are carried out on a MacBook Pro running OS X with a 2.9 GHz Intel Core i7 processor and 8 GB of 1600 MHz DDR3 memory. All methods require longer computing time as the full data size or the number of covariates grows, and the IBOSS method has a significant advantage in computational efficiency over estimation with the full data. These results are consistent with the discussion of computational complexity in the Algorithm section.
3.3 Real Data
In this section, we use a data set cited in Econometric Analysis, 8th edition (Greene, 2012), to evaluate the performance of the IBOSS approach for generalized linear models. The data set comes from Greene (1992), a study of default on credit card loans, and has 10,499 observations. We are interested in modeling customers' number of derogatory reports ($y$) using default status ($x_1$), age ($x_2$), number of dependents ($x_3$), home ownership ($x_4$), ratio of monthly credit card expenditure to yearly income ($x_5$), and log of spending ($x_6$). To model this count variable, we fit a Poisson regression model using the D-optimality and T-optimality IBOSS methods with $k_0 = 500$ and subdata size $k = 1000$, as well as the uniform subsampling method and the full data. The estimation results are summarized in Table 3, which shows better estimation performance for the IBOSS methods than for uniform subsampling.
We also calculate the average standard errors of the slope estimates for each method, using 100 bootstrap samples, to compare the IBOSS method with uniform subsampling for this Poisson model. We fix $k_0 = 500$ and consider $k = 500, 1000, 1500, 2000$. As shown in Figure 5 (left), the IBOSS method clearly dominates uniform subsampling, with much smaller average standard errors. We also examine the impact of the choice of $k_0$ on the performance of the IBOSS method by fixing the subdata size at $k = 1000$. Figure 5 (right) shows that the average standard errors of the IBOSS estimates are very stable, even when the pilot estimates come from a small uniform subsample and themselves have large standard errors.
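As an illustration of how these bootstrap standard errors can be obtained, the sketch below reuses the iboss_d function sketched in Section 2.2; the data frame `credit` and its column names are placeholders for the Greene (1992) data rather than the actual variable names.

```r
# Average bootstrap standard error of the slope estimates for the IBOSS fit
B <- 100
boot_coefs <- replicate(B, {
  idx <- sample(nrow(credit), replace = TRUE)            # bootstrap resample
  Xb  <- as.matrix(credit[idx, c("default", "age", "dependents",
                                 "ownrent", "exp_inc", "log_spend")])
  yb  <- credit$reports[idx]                             # derogatory reports
  fit <- iboss_d(Xb, yb, k = 1000, k0 = 500, family = poisson(),
                 psi_fun = function(eta) exp(eta))
  coef(fit)[-1]
})
avg_se <- mean(apply(boot_coefs, 1, sd))                 # average over slopes
```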