Generalized Linear Models (GLMs)
The term general linear model (GLM) usually refers to conventional linear regression models for a continuous response variable given continuous and/or categorical predictors.
It includes multiple linear regression, as well as ANOVA and ANCOVA (with fixed effects only).
The form is yi = xi′β + εi, where xi contains known covariates, β contains the coefficients to be estimated, and εi is a random error term.
These models are fit by least squares and weighted least squares.
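For concreteness, here is a minimal sketch in R of fitting such a model by ordinary least squares with lm(); the data are simulated, so all variable names are purely illustrative.

```r
# Simulated example (illustrative names): a continuous response with one
# continuous and one categorical predictor, fit by ordinary least squares.
set.seed(1)
n     <- 100
x     <- rnorm(n)                                         # continuous covariate
group <- factor(sample(c("A", "B"), n, replace = TRUE))   # categorical covariate
y     <- 1 + 2 * x + 0.5 * (group == "B") + rnorm(n)      # y = x'beta + error

fit <- lm(y ~ x + group)   # least-squares fit of the general linear model
summary(fit)               # estimated coefficients and standard errors
```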
The term generalized linear model (GLIM or GLM) refers to a larger class of models popularized by McCullagh and Nelder (1983; 2nd edition 1989).
In these models, the response variable yi is assumed to follow an exponential family distribution with mean μi, which is assumed to be some (often nonlinear) function of xi′β.
Some would call these “nonlinear” because μi is often a nonlinear function of the covariates, but McCullagh and Nelder consider them to be linear, because the covariates affect the distribution of yi only through the linear combination xi′β.
Generalized linear models (GLMs) are a broad class of models that includes linear regression, ANOVA, Poisson regression, log-linear models, etc. The table at the end of this section provides a good summary of GLMs, following Agresti. Every GLM consists of three components:
Random Component – refers to the probability distribution of the response variable Y; e.g., a normal distribution for Y in linear regression, or a binomial distribution for Y in binary logistic regression. Also called the noise or error model, it describes how random variation enters around the prediction that comes out of the link function.
Systematic Component – specifies the explanatory variables (X1, X2, ..., Xk) in the model, more specifically their linear combination in creating the so-called linear predictor; e.g., β0 + β1x1 + β2x2, as we have seen in linear regression and as we will see in logistic regression in this lesson.
Link Function, η or g(μ) – specifies the link between the random and systematic components. It says how the expected value of the response relates to the linear predictor of explanatory variables; e.g., η = g(E(Yi)) = E(Yi) for linear regression, or η = logit(π) for logistic regression (see the sketch after this list).
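As a rough illustration, the three components map directly onto a glm() call in R; the data below are simulated purely to make the call concrete, so the names dat, y, x1, and x2 are hypothetical.

```r
# Hypothetical binary-response data, simulated only to make the call concrete.
set.seed(1)
dat   <- data.frame(x1 = rnorm(50), x2 = rnorm(50))
dat$y <- rbinom(50, size = 1, prob = plogis(0.2 + 0.7 * dat$x1 - 0.4 * dat$x2))

# Random component:     binomial distribution for the binary response y
# Systematic component: linear predictor beta0 + beta1*x1 + beta2*x2
# Link function:        logit, so that logit(pi) = beta0 + beta1*x1 + beta2*x2
fit <- glm(y ~ x1 + x2, family = binomial(link = "logit"), data = dat)

# With a normal random component and the identity link, the same call
# reduces to ordinary linear regression:
# glm(y ~ x1 + x2, family = gaussian(link = "identity"), data = dat)
```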
A GLM makes the following assumptions:
The data Y1, Y2, ..., Yn are independently distributed, i.e., cases are independent.
The dependent variable Yi does NOT need to be normally distributed, but it typically assumes a distribution from an exponential family (e.g., binomial, Poisson, multinomial, normal, ...).
A GLM does NOT assume a linear relationship between the dependent variable and the independent variables, but it does assume a linear relationship between the transformed response, in terms of the link function, and the explanatory variables; e.g., for binary logistic regression, logit(π) = β0 + βX.
Independent (explanatory) variables can even be power terms or other nonlinear transformations of the original independent variables.
Homogeneity of variance does NOT need to be satisfied; in fact, it is often not even possible given the model structure, and overdispersion (when the observed variance is larger than what the model assumes) may be present (a quick check is sketched after this list).
Errors need to be independent but NOT normally distributed.
It uses maximum likelihood estimation (MLE) rather than ordinary least squares (OLS) to estimate the parameters, and thus relies on large-sample approximations.
Goodness-of-fit measures rely on sufficiently large samples; a heuristic rule is that no more than 20% of the expected cell counts should be less than 5.
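On the overdispersion point above, one common heuristic check compares the Pearson chi-square statistic (or the residual deviance) to its residual degrees of freedom; a ratio well above 1 suggests overdispersion. A sketch with simulated Poisson counts (illustrative only):

```r
# Simulated count data (illustrative); in practice use your own response and predictors.
set.seed(3)
n     <- 150
x     <- runif(n)
count <- rpois(n, lambda = exp(0.5 + 1.5 * x))

fit <- glm(count ~ x, family = poisson(link = "log"))

# Ratio of the Pearson chi-square statistic to the residual degrees of
# freedom; values substantially greater than 1 suggest overdispersion.
sum(residuals(fit, type = "pearson")^2) / df.residual(fit)

# Deviance-based analogue:
deviance(fit) / df.residual(fit)
```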
Compared with traditional (OLS) regression, GLMs offer several advantages:
We do not need to transform the response Y to have a normal distribution.
The choice of link is separate from the choice of random component, so we have more flexibility in modeling.
If the link produces additive effects, then we do not need constant variance.
The models are fitted via maximum likelihood estimation, so the estimators have optimal large-sample properties.
All the inference tools and model checking that we will discuss for log-linear and logistic regression models apply to other GLMs too; e.g., Wald and likelihood-ratio tests, deviance, residuals, confidence intervals, and overdispersion checks (see the sketch below).
There is often one procedure in a software package to capture all the models listed above, e.g., PROC GENMOD in SAS or glm() in R, with options to vary the three components.
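In R, these inference and model-checking tools are available directly for any glm() fit; the following is a self-contained sketch with simulated binary data, so the variable names are again illustrative rather than part of the lesson's examples.

```r
# A small simulated binary-response example, purely illustrative.
set.seed(2)
n   <- 200
x1  <- rnorm(n)
x2  <- rnorm(n)
p   <- plogis(-0.5 + 1.2 * x1 - 0.8 * x2)   # true success probabilities
y   <- rbinom(n, size = 1, prob = p)
fit <- glm(y ~ x1 + x2, family = binomial(link = "logit"))

summary(fit)                       # Wald z-tests for individual coefficients
anova(fit, test = "Chisq")         # likelihood-ratio (analysis of deviance) tests
confint(fit)                       # profile-likelihood confidence intervals
deviance(fit)                      # residual deviance (goodness of fit)
residuals(fit, type = "deviance")  # deviance residuals for model checking
```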
But there are some limitations of GLMs too, such as:
The systematic component is restricted to a linear predictor; covariates affect the response only through the linear combination β0 + β1x1 + ... + βkxk (although the xs themselves may be transformed).
Responses must be independent.
| Model | Random component | Link function | Covariates (systematic component) |
|---|---|---|---|
| Linear Regression | Normal | Identity | Continuous |
| ANOVA | Normal | Identity | Categorical |
| ANCOVA | Normal | Identity | Mixed |
| Logistic Regression | Binomial | Logit | Mixed |
| Loglinear | Poisson | Log | Categorical |
| Poisson Regression | Poisson | Log | Mixed |
| Multinomial response | Multinomial | Generalized Logit | Mixed |
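In R's glm(), the Random and Link columns of this table correspond to family objects; a brief, illustrative sketch follows (the commented call uses hypothetical variable names).

```r
# Family objects corresponding to the Random and Link columns above:
gaussian(link = "identity")   # linear regression, ANOVA, ANCOVA
binomial(link = "logit")      # binary logistic regression
poisson(link = "log")         # log-linear models and Poisson regression

# Each is passed to glm() via its family argument, e.g.
# glm(y ~ x1 + x2, family = binomial(link = "logit"), data = dat)
# Multinomial (generalized logit) responses fall outside glm() itself and are
# typically fit with, e.g., nnet::multinom() or VGAM::vglm().
```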