# Generalized Linear Models (GLMs)

The term *general linear model* (GLM) usually refers to conventional linear regression models for a continuous response variable given continuous and/or categorical predictors.

It includes multiple linear regression, as well as ANOVA and ANCOVA (with fixed effects only).

The form is

$$
y_i \sim N(x_i^T\beta, \sigma^2)
$$

where $$x_i$$ contains known covariates and $$\beta$$ contains the coefficients to be estimated.

These models are fit by ordinary or weighted least squares.
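As a concrete illustration of the least-squares fit, here is a minimal sketch in plain Python for the one-predictor case, using the closed-form solution of the normal equations. The data and function name are made up for illustration, not taken from the text.

```python
# Minimal sketch: fitting the general linear model y_i ~ N(x_i' beta, sigma^2)
# by ordinary least squares, via the closed-form solution of the normal
# equations for a design matrix with an intercept column and one predictor.

def ols_fit(X, y):
    """Return (intercept, slope) for a two-column design matrix [1, x]."""
    n = len(y)
    sx = sum(row[1] for row in X)
    sy = sum(y)
    sxx = sum(row[1] ** 2 for row in X)
    sxy = sum(row[1] * yi for row, yi in zip(X, y))
    # Closed-form simple linear regression coefficients
    slope = (n * sxy - sx * sy) / (n * sxx - sx ** 2)
    intercept = (sy - slope * sx) / n
    return intercept, slope

X = [[1.0, x] for x in [0, 1, 2, 3]]
y = [1.0, 3.0, 5.0, 7.0]           # exactly linear: y = 1 + 2x
b0, b1 = ols_fit(X, y)
print(b0, b1)                      # -> 1.0 2.0
```

With real data one would of course use a linear-algebra routine for the general multi-predictor normal equations; the toy closed form above is only meant to make the least-squares criterion concrete.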

The term ***generalized linear model*** (GLIM or GLM) refers to a larger class of models popularized by McCullagh and Nelder (1983; 2nd edition 1989).

In these models, the response variable $$y_i$$ is assumed to follow an exponential family distribution with mean $$\mu_i$$, which is assumed to be some (often nonlinear) function of $$x_i^T\beta$$.

Some would call these “nonlinear” because $$\mu_i$$ is often a nonlinear function of the covariates, but McCullagh and Nelder consider them to be linear, because the covariates affect the distribution of $$y_i$$ only through the linear combination $$x_i^T\beta$$.
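This distinction can be made concrete in a few lines of plain Python: the covariates enter only through the linear predictor, and the (nonlinear) mean comes from applying the inverse link to it. The coefficient values below are hypothetical, and the log link of Poisson regression is assumed as the example.

```python
import math

# Sketch of the GLM mean structure: covariates affect the distribution only
# through the linear predictor eta_i = x_i' beta; the mean is mu_i = g^{-1}(eta_i).
# Illustrative (made-up) coefficients; log link (Poisson regression) assumed.

beta = [0.5, 1.2]                        # hypothetical coefficients

def linear_predictor(x, beta):
    # The only way the covariates enter the model
    return sum(xj * bj for xj, bj in zip(x, beta))

def mean_log_link(x, beta):
    # Inverse of the log link: mu = exp(eta), nonlinear in x
    return math.exp(linear_predictor(x, beta))

eta = linear_predictor([1.0, 2.0], beta)   # 0.5 + 1.2*2 = 2.9
mu = mean_log_link([1.0, 2.0], beta)       # exp(2.9), nonlinear in the covariates
print(eta, mu)
```

Changing the link (and hence the inverse link) changes how the mean depends on the covariates, but the linear predictor itself stays the same.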

Generalized linear models (GLMs) are thus a broad class of models that include linear regression, ANOVA, Poisson regression, log-linear models, etc. The table below provides a good summary of GLMs, following Agresti:

| **Model**            | **Random**  | **Link**          | **Systematic** |
| -------------------- | ----------- | ----------------- | -------------- |
| Linear Regression    | Normal      | Identity          | Continuous     |
| ANOVA                | Normal      | Identity          | Categorical    |
| ANCOVA               | Normal      | Identity          | Mixed          |
| Logistic Regression  | Binomial    | Logit             | Mixed          |
| Loglinear            | Poisson     | Log               | Categorical    |
| Poisson Regression   | Poisson     | Log               | Mixed          |
| Multinomial response | Multinomial | Generalized Logit | Mixed          |


## There are three components to any GLM:

* ***Random Component*** – refers to the probability distribution of the response variable (*Y*); e.g., a normal distribution for *Y* in linear regression, or a binomial distribution for *Y* in binary logistic regression. Also called the noise model or error model; it describes how random variation is added around the prediction that comes out of the link function.
* ***Systematic Component*** – specifies the explanatory variables ($$x_1, x_2, \dots, x_k$$) in the model, or more specifically their linear combination, which forms the so-called *linear predictor*; e.g., $$\beta_0 + \beta_1 x_1 + \beta_2 x_2$$, as we have seen in linear regression and as we will see in logistic regression in this lesson.
* ***Link Function***, $$\eta$$ or $$g(\mu)$$ – specifies the link between the random and systematic components. It says how the expected value of the response relates to the linear predictor of the explanatory variables; e.g., $$\eta = g(E(Y_i)) = E(Y_i)$$ for linear regression, or $$\eta = \text{logit}(\pi)$$ for logistic regression.
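The three components above can be sketched in plain Python for binary logistic regression: the random component is Bernoulli, the systematic component is a linear predictor, and the logit link connects the two. The coefficient values are made up for illustration.

```python
import math

# Sketch of the three GLM components for binary logistic regression:
# random component Bernoulli(pi), systematic component eta = b0 + b1*x,
# link g(mu) = logit(mu). Coefficient values below are hypothetical.

def logit(mu):
    # Link function: maps the mean (a probability) onto the real line
    return math.log(mu / (1.0 - mu))

def inv_logit(eta):
    # Inverse link: maps the linear predictor back to a probability
    return 1.0 / (1.0 + math.exp(-eta))

b0, b1 = -1.0, 0.5               # hypothetical systematic component
eta = b0 + b1 * 4.0              # linear predictor at x = 4
pi = inv_logit(eta)              # mean of the Bernoulli random component
print(round(pi, 4))              # -> 0.7311
assert abs(logit(pi) - eta) < 1e-12   # link and inverse link agree
```

Swapping `logit`/`inv_logit` for the identity function (with a normal random component) recovers linear regression, which is exactly the flexibility the table above summarizes.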

## *Assumptions*:

* The data *Y*1, *Y*2, ..., *Yn* are independently distributed, i.e., cases are independent.
* The dependent variable *Yi* does NOT need to be normally distributed, but it typically assumes a distribution from an exponential family (e.g., binomial, Poisson, multinomial, normal, ...).
* GLM does NOT assume a linear relationship between the dependent variable and the independent variables, but it does assume a linear relationship between the transformed response, in terms of the link function, and the explanatory variables; e.g., for binary logistic regression $$\text{logit}(\pi) = \beta_0 + \beta x$$.
* Independent (explanatory) variables can even be power terms or other nonlinear transformations of the original independent variables.
* The homogeneity of variance does NOT need to be satisfied. In fact, it is not even possible in many cases given the model structure, and *overdispersion* (when the observed variance is larger than what the model assumes) may be present.
* Errors need to be independent but NOT normally distributed.
* It uses maximum likelihood estimation (MLE) rather than ordinary least squares (OLS) to estimate the parameters, and thus relies on large-sample approximations.
* Goodness-of-fit measures rely on sufficiently large samples, where a heuristic rule is that no more than 20% of the expected cell counts are less than 5.
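Since the parameters are estimated by maximum likelihood rather than OLS, a short sketch of how that estimation actually proceeds may help. The following is a hedged, plain-Python illustration of the Newton-Raphson (equivalently, iteratively reweighted least squares) iteration for a one-predictor binary logistic GLM; the toy data and function name are invented for the example.

```python
import math

# Sketch of maximum-likelihood fitting for a binary logistic GLM with one
# predictor, via Newton-Raphson / IRLS:
#   beta <- beta + (X'WX)^{-1} X'(y - mu),   W = diag(mu_i * (1 - mu_i)).
# Toy data below are illustrative, not from the text.

def fit_logistic(xs, ys, iters=25):
    b0 = b1 = 0.0
    for _ in range(iters):
        # Accumulate the score vector u and the 2x2 matrix X'WX
        u0 = u1 = h00 = h01 = h11 = 0.0
        for x, y in zip(xs, ys):
            mu = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))  # inverse logit
            w = mu * (1.0 - mu)                          # IRLS weight
            u0 += y - mu
            u1 += (y - mu) * x
            h00 += w
            h01 += w * x
            h11 += w * x * x
        # Newton step: solve the 2x2 system (X'WX) delta = X'(y - mu)
        det = h00 * h11 - h01 * h01
        b0 += (h11 * u0 - h01 * u1) / det
        b1 += (h00 * u1 - h01 * u0) / det
    return b0, b1

xs = [0, 1, 2, 3, 4, 5]
ys = [0, 0, 1, 0, 1, 1]      # overlapping 0s and 1s, so the MLE exists
b0, b1 = fit_logistic(xs, ys)
print(b0, b1)
```

At convergence the score equations are satisfied, so the fitted probabilities sum to the observed number of successes, a defining property of the MLE for a model with an intercept. Note that with completely separated data the iteration would diverge, one of the large-sample caveats mentioned above.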

## **Summary of advantages of GLMs over traditional (OLS) regression**

* We do not need to transform the response *Y* to have a normal distribution.
* The choice of link function is separate from the choice of random component, so we have more flexibility in modeling.
* If the link produces additive effects, then we do not need constant variance.
* The models are fitted via maximum likelihood estimation, so the estimators enjoy the optimal large-sample properties of MLEs.
* All the inference tools and model checking that we will discuss for log-linear and logistic regression models apply for other GLMs too; e.g., Wald and Likelihood ratio tests, Deviance, Residuals, Confidence intervals, Overdispersion.
* There is often a single procedure in a software package that captures all the models listed above, e.g., *PROC GENMOD* in SAS or *glm()* in R, with options to vary the three components.

But GLMs have some limitations too, such as:

* The systematic component is restricted to a linear predictor; effects of the covariates enter only through a linear function.
* Responses must be independent
