What is Poisson distribution in statistics?

13.3 Poisson Regression

The Poisson distribution is widely used as a model for count data. As discussed in Section 2.3, it is frequently appropriate when the counts are of events in specific regions of time or space. Dependent variables that might be modeled using Poisson regression include the

number of fatal auto accidents during a year at intersections as a function of lane width,

number of service interruptions during a month for a network server as a function of usage, and

number of fire ant colonies in an acre of land as a function of tree density.

There is no fixed upper limit on the possible number of events. Recalling the properties of the Poisson, there is a single parameter, μ, which is the expected number of events. It is essential that μ be positive, and the regression function must enforce this.

Case Study 13.1

Warner (2007) studied the proportions of robberies where a gun was used as a function of the characteristics of the neighborhoods where the robberies occurred. Using logistic regression, the author related the probability the robbery would involve a gun to several independent variables:

Disadvantaged, a blend of poverty level, % female-headed households, and other economic variables, with higher scores indicating poorer neighborhoods

Percent population that are young black males

Faith in police, a score obtained by surveying residents of the neighborhood, with high scores indicating a greater trust of police

Perceived oppositional values, a score obtained by surveying residents of the neighborhood, with high scores indicating more opposition to mainstream attitudes toward drugs and crime

Table 13.5 shows the results of the bivariate (dependent variable against a single independent variable) logistic regressions. Also shown are the means (M) and standard deviations (SD) for the independent variables.

Table 13.5. Results of Logistic Regressions for Gun Use in Robberies

| Ind. Variable | M (SD) | β̂ (std. error) | z | Odds Ratio |
|---|---|---|---|---|
| Disadvantaged | 0.00 (1.00) | 0.33 (0.15) | 2.16 | 1.39 |
| % young black male | 3.26 (2.11) | 0.14 (0.08) | 1.75 | 1.14 |
| Faith in police | 3.40 (0.17) | -3.37 (1.13) | -2.97 | 0.03 |
| Perceived opp. values | 40.70 (7.30) | -0.02 (0.02) | -1.15 | 0.98 |

Disadvantaged shows a significant positive association; that is, the more disadvantaged a neighborhood, the greater the probability a robbery will involve a gun. Faith in police shows a significant negative association; that is, the greater the neighborhood's faith in police, the less the probability the robbery will involve a gun. Neither of the other independent variables showed a significant association.

The standard deviations of the independent variables are important in understanding each variable's impact. Disadvantaged has SD = 1.0, so a moderately low score might be 1 point below the mean and a moderately high score 1 point above the mean. Moving from a moderately low score to a moderately high score on Disadvantaged would multiply the odds of gun use by exp(2 × 1 × 0.33) = 1.93. By contrast, Faith in police, whose coefficient is much larger in absolute value, has SD of only 0.17. Moving from a moderately high score to a moderately low score on Faith in police would multiply the odds by exp(2 × 0.17 × 3.37) = 3.14. Thus, the difference in impact of the two variables is not as large as we would think if we only examined the coefficients. Note that we have described a shift from high to low for Faith in police, which reverses the sign of its negative coefficient and makes the two independent variables easier to compare.
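The two odds multipliers quoted above can be checked with a few lines of Python. This is only an arithmetic sketch: the coefficients and standard deviations are those reported in Table 13.5, the Faith in police coefficient is taken as negative (consistent with the negative association described in the text), and the helper name `odds_factor` is an illustrative choice, not something from the study.

```python
import math

# Coefficients and standard deviations as reported in Table 13.5.
beta_disadvantaged, sd_disadvantaged = 0.33, 1.00
beta_faith, sd_faith = -3.37, 0.17  # sign assumed negative, per the text

def odds_factor(beta, sd, n_sd):
    """Factor by which the odds are multiplied when x moves by n_sd standard deviations."""
    return math.exp(beta * n_sd * sd)

# Low-to-high shift (+2 SD) on Disadvantaged.
shift_disadvantaged = odds_factor(beta_disadvantaged, sd_disadvantaged, n_sd=2.0)
# High-to-low shift (-2 SD) on Faith in police; the negative shift cancels the negative sign.
shift_faith = odds_factor(beta_faith, sd_faith, n_sd=-2.0)

print(f"Disadvantaged, low to high:   odds x {shift_disadvantaged:.2f}")
print(f"Faith in police, high to low: odds x {shift_faith:.2f}")
```

Putting both shifts on a common scale of two standard deviations is what makes the variables comparable; the raw coefficients alone differ by a factor of about ten.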

Poisson regression assumes each $y_i$ follows a Poisson distribution with mean $\mu_i$, where

$$\ln(\mu_i) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_m x_m.$$

The linear expression may take on either positive or negative values, but $\mu_i = \exp(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_m x_m)$ will always be positive. Note that the link function is the logarithmic function. The proper method of fitting this model is via ML. Likelihood ratio tests replace F tests, and Wald $\chi^2$ tests replace t tests.

Example 13.3

A hospital tracks the number of hypoglycemic incidents among diabetic patients recovering from cardiovascular surgery. The dependent variable is the number of incidents experienced by a patient during the first 72 hours post-surgery. The research question is whether a patient's age can be related to the frequency of incidents. An artificial data set illustrating this situation is given in Table 13.6.

Table 13.6. Number of Incidents of Hypoglycemia

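| OBS | AGE | HYPOG | OBS | AGE | HYPOG |
|---|---|---|---|---|---|
| 1 | 52 | 0 | 16 | 62 | 2 |
| 2 | 74 | 1 | 17 | 66 | 1 |
| 3 | 57 | 1 | 18 | 71 | 0 |
| 4 | 73 | 0 | 19 | 50 | 1 |
| 5 | 72 | 1 | 20 | 64 | 2 |
| 6 | 53 | 0 | 21 | 65 | 0 |
| 7 | 72 | 2 | 22 | 67 | 0 |
| 8 | 75 | 0 | 23 | 75 | 1 |
| 9 | 57 | 0 | 24 | 57 | 0 |
| 10 | 69 | 0 | 25 | 56 | 1 |
| 11 | 63 | 1 | 26 | 55 | 0 |
| 12 | 73 | 1 | 27 | 70 | 0 |
| 13 | 67 | 1 | 28 | 67 | 0 |
| 14 | 64 | 3 | 29 | 66 | 1 |
| 15 | 76 | 0 | 30 | 74 | 0 |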

Solution

We will model each person's number of incidents as having a Poisson distribution where the expected number (μ) is a function of AGE,

$$\ln(\mu) = \beta_0 + \beta_1\,\mathrm{AGE}.$$

The SAS System's PROC GENMOD was used to fit this model, and a portion of the output is shown in Table 13.7. Several portions of this printout deserve comment.

Table 13.7. Poisson Regression Results for Hypoglycemia

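The GENMOD Procedure

Criteria For Assessing Goodness of Fit

| Criterion | DF | Value | Value/DF |
|---|---|---|---|
| Deviance | 28 | 31.1272 | 1.1117 (*1*) |
| Scaled Deviance | 28 | 31.1272 | 1.1117 |
| Pearson Chi-Square | 28 | 27.9940 | 0.9998 |
| Scaled Pearson X2 | 28 | 27.9940 | 0.9998 |
| Log Likelihood | | -28.1089 (*2*) | |

Algorithm converged.

Analysis Of Parameter Estimates

| Parameter | DF | Estimate | Standard Error | Wald 95% Lower Limit | Wald 95% Upper Limit | Chi-Square | Pr > ChiSq |
|---|---|---|---|---|---|---|---|
| Intercept | 1 | -0.3486 | 1.9394 | -4.1496 | 3.4525 | 0.03 | 0.8574 |
| age | 1 | -0.0009 | 0.0295 | -0.0586 | 0.0569 | 0.00 | 0.9764 (*3*) |
| Scale | 0 | 1.0000 | 0.0000 | 1.0000 | 1.0000 | | |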

Note: The scale parameter was held fixed.

(*1*) Deviance is a measure of lack of fit for the proposed model versus a saturated model that essentially includes one parameter for every observation. The saturated model represents a kind of gold standard. Comparing the Deviance (31.1272) to the critical values for a chi-squared distribution with degrees of freedom as shown on the printout (28) gives a very rough lack of fit test. That is, large values of deviance indicate the model is not a good fit. As a rough rule-of-thumb, we expect deviance divided by its degrees of freedom to be in the vicinity of 1. The value here is 1.1117, indicating that the model fits the data reasonably well.

(*2*) ln(L) of the current model (-28.1089), which is useful when constructing customized likelihood ratio tests comparing full and reduced models.

(*3*) The $\chi^2$ tests for each individual independent variable, analogous to the individual t tests in ordinary regression. For AGE, $\chi^2 = 0.00$, with p value = 0.9764.

There is no significant evidence that age is related to the frequency of hypoglycemic incidents.
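The fit summarized in Table 13.7 can be approximated without SAS. The sketch below fits ln(μ) = β0 + β1·AGE to the Table 13.6 data by Newton-Raphson in plain Python. It is an illustration under stated assumptions rather than a description of how PROC GENMOD works internally: the function name `fit_poisson`, the starting values, and the stopping tolerance are all choices made here. The log-likelihood kernel omits the ln(y!) terms, which do not depend on the coefficients.

```python
import math

# (AGE, HYPOG) pairs for the 30 patients in Table 13.6.
DATA = [
    (52, 0), (74, 1), (57, 1), (73, 0), (72, 1), (53, 0), (72, 2), (75, 0),
    (57, 0), (69, 0), (63, 1), (73, 1), (67, 1), (64, 3), (76, 0), (62, 2),
    (66, 1), (71, 0), (50, 1), (64, 2), (65, 0), (67, 0), (75, 1), (57, 0),
    (56, 1), (55, 0), (70, 0), (67, 0), (66, 1), (74, 0),
]

def fit_poisson(x, y, max_iter=25, tol=1e-10):
    """Newton-Raphson ML fit of ln(mu) = b0 + b1*x. Returns (b0, b1, se_b0, se_b1)."""
    b0, b1 = math.log(sum(y) / len(y)), 0.0  # start at the intercept-only fit
    for _ in range(max_iter):
        mu = [math.exp(b0 + b1 * xi) for xi in x]
        # Score vector: gradient of the log likelihood.
        g0 = sum(yi - mi for yi, mi in zip(y, mu))
        g1 = sum(xi * (yi - mi) for xi, yi, mi in zip(x, y, mu))
        # Information matrix (2 x 2, symmetric).
        h00 = sum(mu)
        h01 = sum(mi * xi for xi, mi in zip(x, mu))
        h11 = sum(mi * xi * xi for xi, mi in zip(x, mu))
        det = h00 * h11 - h01 * h01
        step0 = (h11 * g0 - h01 * g1) / det
        step1 = (h00 * g1 - h01 * g0) / det
        b0, b1 = b0 + step0, b1 + step1
        if max(abs(step0), abs(step1)) < tol:
            break
    # Standard errors from the inverse information matrix of the final iteration.
    return b0, b1, math.sqrt(h11 / det), math.sqrt(h00 / det)

ages = [a for a, _ in DATA]
counts = [c for _, c in DATA]
b0, b1, se0, se1 = fit_poisson(ages, counts)
mu_hat = [math.exp(b0 + b1 * a) for a in ages]

# Deviance: 2 * sum[y*ln(y/mu) - (y - mu)], with y*ln(y) taken as 0 when y = 0.
deviance = 2 * sum(
    (yi * math.log(yi / mi) if yi > 0 else 0.0) - (yi - mi)
    for yi, mi in zip(counts, mu_hat)
)
df = len(counts) - 2
wald_age = (b1 / se1) ** 2
# Log-likelihood kernel: sum[y*ln(mu) - mu], without the ln(y!) constants.
loglik_kernel = sum(yi * math.log(mi) - mi for yi, mi in zip(counts, mu_hat))

print(f"intercept = {b0:.4f} (se {se0:.4f})")
print(f"age       = {b1:.4f} (se {se1:.4f}), Wald chi-square = {wald_age:.2f}")
print(f"deviance  = {deviance:.4f} on {df} df, ratio = {deviance / df:.4f}")
print(f"log-likelihood kernel = {loglik_kernel:.4f}")
```

If the data are entered as above, the estimates, the deviance, and the log-likelihood kernel should land close to the corresponding entries in Table 13.7. With one predictor the Newton step only needs a 2 × 2 matrix inverse, which is why no linear algebra library is required here.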

Sometimes each observation is a count from a region that varies greatly in size. For example, we might have y = number of flaws in a Mylar sheet, but some sheets are quite large and others are small. In this situation, size is an important part of the expected count. The independent variables are assumed to influence the rate per unit of size, denoted λ. The rate must be positive. Given a set of independent variables $x_1, x_2, \ldots, x_m$, we model

$$\ln(\lambda) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_m x_m.$$

If observation $y_i$ comes from an observational unit with size $s_i$, then $y_i$ has the Poisson distribution with expected value $\mu_i = \lambda_i s_i$ and

$$\ln(\mu_i) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_m x_m + \ln(s_i).$$

At first glance, the term $\ln(s_i)$ may seem like just another independent variable in the Poisson regression. However, its coefficient is identically 1, so that no parameter need be estimated for it. This is called an offset variable, and Poisson regression software will generally allow you to indicate such a size marker. Sometimes size is only specified up to a constant of proportionality. That is, we might not know exactly the size of units $i$ and $i'$, but we know that unit $i$ is twice the size of unit $i'$. This suffices, as the unknown proportionality constant will become an additive constant once logarithms are computed, and be combined with the intercept $\beta_0$.

Example 13.4

Bailer et al. (1997) published an article showing how Poisson regression could be an important tool in safety research. Table 13.8 shows their counts of fatalities in the agriculture, forestry, and fishing industries and estimates of the number of workers in those industries. Figure 13.4 graphs the rates per 1000 workers (number of fatalities×1000/number of workers). We would like to see that fatality rates are declining, but is there any evidence that this is so?

FIGURE 13.4. Fatality Rates among Workers for Example 13.4.

Table 13.8. Fatalities and Number of Workers

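| Year | Fatalities | Workers | Year | Fatalities | Workers |
|---|---|---|---|---|---|
| 1983 | 511 | 2850803 | 1988 | 506 | 2649044 |
| 1984 | 530 | 2767829 | 1989 | 491 | 2665645 |
| 1985 | 566 | 2667323 | 1990 | 464 | 2614612 |
| 1986 | 499 | 2679587 | 1991 | 484 | 2666477 |
| 1987 | 529 | 2709966 | 1992 | 468 | 2581603 |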

Solution

We will model the number of fatalities each year as a Poisson variable with mean $\mu_i = \lambda_i s_i$, where $\lambda_i$ is the rate of fatalities per worker in year i, and $s_i$ is the number of workers in these industries during year i. To model a trend in time, we use

$$\ln(\lambda_i) = \beta_0 + \beta_1 i,$$

where $i = \text{year} - 1982$. The link function is the logarithmic function, and the offset variable is $\ln(s_i)$. The SAS System's PROC GENMOD yielded $\hat\beta_1 = -0.0073$ with a standard error of 0.0049 and Wald's $\chi^2 = 2.21$, p value = 0.1373. Hence, there is no significant evidence of a linear trend in the fatality rate over this time period. In a second analysis where fatalities and number of workers were subdivided by gender and age, the authors found that rates were decreasing significantly among male workers, but increasing among female workers.
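As a rough check on this example, the trend model with an offset, ln(μ_i) = β0 + β1·i + ln(s_i) with i = year − 1982, can be fitted to the Table 13.8 counts in plain Python. The Newton-Raphson routine below is a sketch written for this illustration (the name `fit_rate_trend` and its defaults are assumptions, not the authors' or SAS's procedure). It also refits the model with size expressed in thousands of workers, to illustrate that a proportionality constant in the size variable only shifts the intercept.

```python
import math

# (year, fatalities, workers) as listed in Table 13.8.
TABLE_13_8 = [
    (1983, 511, 2850803), (1984, 530, 2767829), (1985, 566, 2667323),
    (1986, 499, 2679587), (1987, 529, 2709966), (1988, 506, 2649044),
    (1989, 491, 2665645), (1990, 464, 2614612), (1991, 484, 2666477),
    (1992, 468, 2581603),
]

def fit_rate_trend(t, y, size, max_iter=25, tol=1e-10):
    """ML fit of ln(mu_i) = b0 + b1*t_i + ln(size_i). Returns (b0, b1, se_b1)."""
    offset = [math.log(s) for s in size]  # coefficient fixed at 1, never estimated
    b0, b1 = math.log(sum(y) / sum(size)), 0.0  # start from the constant-rate fit
    for _ in range(max_iter):
        mu = [math.exp(b0 + b1 * ti + oi) for ti, oi in zip(t, offset)]
        # Score vector.
        g0 = sum(yi - mi for yi, mi in zip(y, mu))
        g1 = sum(ti * (yi - mi) for ti, yi, mi in zip(t, y, mu))
        # Information matrix (2 x 2, symmetric).
        h00 = sum(mu)
        h01 = sum(mi * ti for ti, mi in zip(t, mu))
        h11 = sum(mi * ti * ti for ti, mi in zip(t, mu))
        det = h00 * h11 - h01 * h01
        step0 = (h11 * g0 - h01 * g1) / det
        step1 = (h00 * g1 - h01 * g0) / det
        b0, b1 = b0 + step0, b1 + step1
        if max(abs(step0), abs(step1)) < tol:
            break
    return b0, b1, math.sqrt(h00 / det)

t = [year - 1982 for year, _, _ in TABLE_13_8]
fatalities = [f for _, f, _ in TABLE_13_8]
workers = [w for _, _, w in TABLE_13_8]

b0, b1, se_b1 = fit_rate_trend(t, fatalities, workers)
wald_chi2 = (b1 / se_b1) ** 2

# Same fit with size in thousands of workers: the slope is unchanged and
# the intercept absorbs the constant ln(1000).
b0_k, b1_k, _ = fit_rate_trend(t, fatalities, [w / 1000 for w in workers])

print(f"slope = {b1:.4f}, se = {se_b1:.4f}, Wald chi-square = {wald_chi2:.2f}")
print(f"intercept shift = {b0_k - b0:.4f}, ln(1000) = {math.log(1000):.4f}")
```

The slope, standard error, and Wald statistic should come out near the values quoted from PROC GENMOD. The intercept shift equals ln(1000) because rescaling every s_i by the same constant adds the same constant to every offset.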

13.3.1 Choosing Between Logistic and Poisson Regression

When data are presented as results for individual observations, as in Example 13.2 and Example 13.3, the choice between logistic regression and Poisson regression is usually clear. In Example 13.2, the dependent variable was whether or not a city had adopted TIF, which happened to be coded as 0s and 1s but could have been Y/N or any other abbreviation. At the individual level, the dichotomous variable is whether or not a success has occurred. This is the type of dependent variable where logistic regression is helpful as we attempt to model the probability of a success.

In Example 13.3, the dependent variable is truly quantitative, the number of hypoglycemic incidents experienced by a patient. This number is most often 0 or 1, but is not necessarily one of these two values. In fact, the data in Table 13.6 contain four individuals who had more than one incident (observations 7, 14, 16, and 20), though another sample might not have had any. This is the type of situation where Poisson regression is helpful as we attempt to model the expected number of incidents per patient.

The choice is somewhat less distinct when the data have been aggregated for groups of similar individuals, as in Example 13.1 and Example 13.4. In Example 13.4, we treated the dependent variable as $y_i$ = number of fatalities in year i, assumed to have a Poisson distribution with number of workers as an offset variable. However, we could also treat $y_i$ as a binomial random variable with $n_i$ = number of workers. After all, a worker cannot have more than one fatality! In fact, these two approaches would give very similar fitted values for $\mu_{y|x} = p_x$, because the Poisson and binomial are very similar when n is large and p is very small.

By contrast, Example 13.1 can only use logistic regression. First, at the individual level, our data is whether or not a mouse developed a tumor. This is a binary dependent variable. If, at every concentration, the probability of a tumor stayed small, we could still use Poisson regression if it were more convenient. Our dependent variable would be the number of mice with tumors within each sample at a given concentration. However, for this data set, p ranges from small to large. The approximation of the binomial via the Poisson deteriorates. Moreover, the link function for the logistic regression will keep all the fitted values for the probability between 0 and 1. The link function for Poisson regression will keep them greater than 0, but is likely to return some greater than 1.
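The claim that the Poisson approximates the binomial well only when p is small can be illustrated numerically. The sketch below compares the Binomial(n, p) probabilities with those of a Poisson with mean np and reports the largest gap. The values n = 50, p = 0.02, and p = 0.60 are hypothetical, chosen only for illustration, and the function names are assumptions of this example.

```python
import math

def binomial_pmf(k, n, p):
    """P(Y = k) for Y ~ Binomial(n, p)."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

def poisson_pmf(k, mu):
    """P(Y = k) for Y ~ Poisson(mu)."""
    return math.exp(-mu) * mu**k / math.factorial(k)

def max_pmf_gap(n, p):
    """Largest absolute difference between the two pmfs over k = 0..n."""
    mu = n * p
    return max(abs(binomial_pmf(k, n, p) - poisson_pmf(k, mu)) for k in range(n + 1))

gap_rare = max_pmf_gap(n=50, p=0.02)    # small p: the distributions nearly coincide
gap_common = max_pmf_gap(n=50, p=0.60)  # large p: the approximation deteriorates

print(f"max gap with p = 0.02: {gap_rare:.4f}")
print(f"max gap with p = 0.60: {gap_common:.4f}")
```

The gap grows with p because the binomial variance np(1 − p) falls further below the Poisson variance np as p increases.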

As best we can, the choice between logistic regression and Poisson regression should match the nature of the dependent variable at the level of the individual observation. In certain cases, however, where the proportion of successes out of the total number of trials is quite small, we may analyze the data either way. Be aware, however, that the regression coefficients are giving different information. For Poisson regression, they reflect the influence of an independent variable on ln(p), but for logistic regression they reflect the influence on ln(odds).

Case Study 13.2

Darby et al. (2009) studied auto collision records for over 16,000 employees of a large British telecommunications firm. Each of these employees was the driver of a company car or van, and the dependent variable in question was each person's number of collisions, in a company vehicle, during the past three years. In addition to the more traditional risk factors of gender and age group, the authors attempted to assess whether certain personality traits were associated with a change in the rate of accidents. For data on personality traits, they had each employee's answers on a questionnaire given to them at the time they were approved to drive a company vehicle.

Since the data is in the form of counts (many of them zeroes), the authors chose Poisson regression as the primary means of analysis. Since some workers drove very little during the week and others a great deal, ln(# hours driven per week) was used as an offset variable. With this sample size, the authors were able to fit a model with a large number of independent variables. We cite a few of their results.

Dummy variables were used to code different age categories, with the 50+ age category acting as the reference group. For ages 21 to 25, $\hat\beta = 0.366$ with p value < 0.001. To interpret this, consider two persons with all independent variables equal except that one is in the 21 to 25 age category and the other is in the 50+ age category. The logarithms of their fitted accident rates will differ by $\ln(\hat\lambda_{21\text{-}25}) - \ln(\hat\lambda_{50+}) = 0.366$. Hence, $\hat\lambda_{21\text{-}25} / \hat\lambda_{50+} = \exp(0.366) = 1.44$. That is, the fitted rate in the 21 to 25 age category is 44% higher than that in the 50+ age category, all other variables being equal.

Persons scoring high on the aggressive/impulsive personality trait had a substantially higher rate of accidents ($\hat\beta = 0.529$, p value < 0.001). The structured personality trait had no significant relationship with accident rate ($\hat\beta = 0.14$, p value = 0.823).
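Because the link is logarithmic, each coefficient converts to a rate ratio by exponentiation. The short sketch below applies this to the two significant coefficients quoted above. How the aggressive/impulsive trait was coded is not stated here, so its ratio is per unit of whatever coding the authors used; the variable names are illustrative.

```python
import math

beta_age_21_25 = 0.366   # ages 21 to 25 versus the 50+ reference group
beta_aggressive = 0.529  # aggressive/impulsive trait (coding not stated in the text)

# Under a log link, exp(beta) is the multiplicative change in the fitted rate.
rate_ratio_age = math.exp(beta_age_21_25)
rate_ratio_aggressive = math.exp(beta_aggressive)

for label, ratio in [("age 21-25 vs 50+", rate_ratio_age),
                     ("aggressive/impulsive", rate_ratio_aggressive)]:
    print(f"{label}: rate ratio {ratio:.2f}, i.e. {100 * (ratio - 1):.0f}% higher")
```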
