Generalized Linear Models (GLMs)

The term general linear model (GLM) usually refers to conventional linear regression models for a continuous response variable given continuous and/or categorical predictors.

It includes multiple linear regression, as well as ANOVA and ANCOVA (with fixed effects only).

The form is

$$y_i \sim N(x_i^T\beta, \sigma^2)$$

where $x_i$ contains known covariates and $\beta$ contains the coefficients to be estimated.

These models are fit by least squares and weighted least squares.
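As a minimal sketch in R (simulated data; all variable names are illustrative), both the ordinary and weighted least squares fits use the built-in lm():

```r
# Simulate y_i ~ N(x_i' beta, sigma^2) with one continuous and one
# categorical predictor, then fit by (weighted) least squares
set.seed(1)
n  <- 100
x1 <- rnorm(n)
x2 <- factor(sample(c("a", "b"), n, replace = TRUE))
y  <- 1 + 2 * x1 + 0.5 * (x2 == "b") + rnorm(n)

fit_ols <- lm(y ~ x1 + x2)               # ordinary least squares
w       <- runif(n, 0.5, 2)              # illustrative known weights
fit_wls <- lm(y ~ x1 + x2, weights = w)  # weighted least squares
coef(fit_ols)                            # estimates of beta
```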

The term generalized linear model (GLIM or GLM) refers to a larger class of models popularized by McCullagh and Nelder (1983; 2nd edition 1989).

In these models, the response variable $y_i$ is assumed to follow an exponential family distribution with mean $\mu_i$, which is assumed to be some (often nonlinear) function of $x_i^T\beta$.

Some would call these “nonlinear” because $\mu_i$ is often a nonlinear function of the covariates, but McCullagh and Nelder consider them to be linear because the covariates affect the distribution of $y_i$ only through the linear combination $x_i^T\beta$.
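To make this concrete, here is a small numerical sketch in R, using the logit link as one example: the mean depends on the covariates only through the linear predictor, but nonlinearly so.

```r
# Under a logit link, mu = exp(eta) / (1 + exp(eta)), where eta = x'beta:
# mu depends on the covariates only through eta, but not linearly
eta <- seq(-4, 4, by = 2)
mu  <- plogis(eta)   # inverse logit (base R)
cbind(eta, mu)       # equal steps in eta give unequal steps in mu
```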

Generalized linear models (GLMs) are a broad class of models that include linear regression, ANOVA, Poisson regression, log-linear models, etc. The table below, following Agresti, provides a good summary of GLMs:

| Model                | Random      | Link              | Systematic  |
|----------------------|-------------|-------------------|-------------|
| Linear Regression    | Normal      | Identity          | Continuous  |
| ANOVA                | Normal      | Identity          | Categorical |
| ANCOVA               | Normal      | Identity          | Mixed       |
| Logistic Regression  | Binomial    | Logit             | Mixed       |
| Loglinear            | Poisson     | Log               | Categorical |
| Poisson Regression   | Poisson     | Log               | Mixed       |
| Multinomial Response | Multinomial | Generalized Logit | Mixed       |

There are three components to any GLM:

  • Random Component – refers to the probability distribution of the response variable (Y); e.g., the normal distribution for Y in linear regression, or the binomial distribution for Y in binary logistic regression. Also called a noise model or error model; it describes how random error is added to the prediction that comes out of the link function.

  • Systematic Component – specifies the explanatory variables (X1, X2, ..., Xk) in the model, more specifically their linear combination in creating the so-called linear predictor; e.g., β0 + β1x1 + β2x2, as we have seen in linear regression, or as we will see in logistic regression in this lesson.

  • Link Function, η or g(μ) – specifies the link between the random and systematic components. It says how the expected value of the response relates to the linear predictor of explanatory variables; e.g., η = g(E(Yi)) = E(Yi) for linear regression, or η = logit(π) for binary logistic regression.
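In R's glm(), these three components correspond to the family, the formula, and the family's link argument, respectively. A minimal sketch with simulated data (names are illustrative):

```r
set.seed(2)
x <- rnorm(200)

# Random component: binomial; systematic component: ~ x; link: logit
y_bin     <- rbinom(200, size = 1, prob = plogis(-0.5 + 1.2 * x))
fit_logit <- glm(y_bin ~ x, family = binomial(link = "logit"))

# Same systematic component with a Poisson random component and log link
y_pois   <- rpois(200, lambda = exp(0.2 + 0.7 * x))
fit_pois <- glm(y_pois ~ x, family = poisson(link = "log"))
```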

Assumptions:

  • The data Y1, Y2, ..., Yn are independently distributed, i.e., cases are independent.

  • The dependent variable Yi does NOT need to be normally distributed, but it is typically assumed to follow a distribution from the exponential family (e.g., binomial, Poisson, multinomial, normal, ...).

  • GLM does NOT assume a linear relationship between the dependent variable and the independent variables, but it does assume a linear relationship between the transformed response, in terms of the link function, and the explanatory variables; e.g., for binary logistic regression, logit(π) = β0 + βX.

  • Independent (explanatory) variables can even be power terms or other nonlinear transformations of the original independent variables.

  • The homogeneity of variance does NOT need to be satisfied. In fact, it is not even possible in many cases given the model structure, and overdispersion (when the observed variance is larger than what the model assumes) may be present; see the sketch after this list.

  • Errors need to be independent but NOT normally distributed.

  • It uses maximum likelihood estimation (MLE) rather than ordinary least squares (OLS) to estimate the parameters, and thus relies on large-sample approximations.

  • Goodness-of-fit measures rely on sufficiently large samples; a common heuristic is that no more than 20% of the expected cell counts should be less than 5.
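As an informal check of the overdispersion point above (a sketch, not a formal test): for a Poisson model, the residual deviance should be roughly equal to its degrees of freedom, so a ratio well above 1 suggests overdispersion.

```r
set.seed(3)
x <- rnorm(200)
# Negative binomial responses are overdispersed relative to the Poisson
y <- rnbinom(200, size = 1, mu = exp(0.2 + 0.7 * x))

fit <- glm(y ~ x, family = poisson(link = "log"))
deviance(fit) / df.residual(fit)  # values well above 1 flag overdispersion
```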

Summary of advantages of GLMs over traditional (OLS) regression

  • We do not need to transform the response Y to have a normal distribution

  • The choice of link is separate from the choice of random component, so we have more flexibility in modeling.

  • If the link produces additive effects, then we do not need constant variance.

  • The models are fitted via maximum likelihood estimation, so the estimators enjoy the optimal large-sample properties of MLEs.

  • All the inference tools and model checking that we will discuss for log-linear and logistic regression models apply to other GLMs too; e.g., Wald and likelihood-ratio tests, deviance, residuals, confidence intervals, and overdispersion; see the sketch after this list.

  • There is often a single procedure in a software package that can fit all the models listed above, e.g., PROC GENMOD in SAS or glm() in R, with options to vary the three components.
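For instance, the shared inference tools are all available from a single fitted glm() object (a sketch continuing the logistic fit from earlier; output omitted):

```r
summary(fit_logit)                       # Wald tests for each coefficient
anova(fit_logit, test = "Chisq")         # likelihood-ratio (deviance) tests
confint(fit_logit)                       # profile-likelihood confidence intervals
residuals(fit_logit, type = "deviance")  # deviance residuals for model checking
```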

But there are some limitations of GLMs too, such as:

  • Linearity – the systematic component can contain only a linear predictor in the parameters

  • Responses must be independent
