Getting Started With Negative Binomial Regression: Step by Step Guide

Negative Binomial Regression is a technique for modeling count variables. It resembles multiple regression, except that the dependent variable, Y, is assumed to follow the negative binomial distribution. The values of Y are therefore non-negative integers such as 0, 1, 2, and so on.

The method extends Poisson regression by relaxing the assumption that the mean is equal to the variance. The traditional negative binomial regression model, known as “NB2,” is based on the Poisson-gamma mixture distribution.

This model generalizes Poisson regression by adding a gamma-distributed noise variable that has a mean of one and a scale parameter “v.”
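
The Poisson-gamma mixture can be illustrated with a small simulation. The sketch below is not from the original article. It assumes the gamma noise has shape v and scale 1/v (one common way to give it a mean of one), and the values of `mu` and `nu` are arbitrary. Under that assumption the variance of the counts should be close to mu + mu²/v, which is larger than the mean.

```python
import math
import random
import statistics

def poisson_draw(rng, lam):
    """Draw one Poisson(lam) variate with Knuth's method (adequate for small lam)."""
    threshold = math.exp(-lam)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count

def nb2_draw(rng, mu, nu):
    """Draw one count from a Poisson whose mean is mu times gamma noise.

    Assumption: the noise is Gamma(shape=nu, scale=1/nu), so it has
    mean 1 and variance 1/nu.
    """
    eps = rng.gammavariate(nu, 1.0 / nu)
    return poisson_draw(rng, mu * eps)

rng = random.Random(42)
mu, nu = 5.0, 2.0  # hypothetical values chosen for illustration
draws = [nb2_draw(rng, mu, nu) for _ in range(50_000)]

sample_mean = statistics.fmean(draws)
sample_var = statistics.variance(draws)
theoretical_var = mu + mu ** 2 / nu  # NB2 variance under the assumption above

print(f"sample mean     = {sample_mean:.2f} (target {mu})")
print(f"sample variance = {sample_var:.2f} (theory {theoretical_var})")
```

A plain Poisson variable with the same mean would have a variance of about 5, so the gap between the two printed values is the over-dispersion that the extra parameter captures.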

Here are a few examples of problems suited to Negative Binomial Regression:

  • School administrators study the attendance behavior of high school juniors at two schools. The outcome is the number of days a student was absent, and the factors that might influence it include the program in which the student is enrolled.
  • A health researcher studies how many times senior citizens visited a hospital in the last 12 months, based on each individual’s characteristics and the health plan that he or she purchased.

Example of Negative Binomial Regression 

Suppose we have attendance data on 314 high school students from two urban schools, stored in a file named nb_data.dta. The response variable of interest is the number of days absent, “daysabs.” The variable “math” gives each student’s math score, and the variable “prog” indicates the program in which the student is enrolled.

Each variable has 314 observations, and the distributions of the variables look reasonable. For the outcome variable, the unconditional mean is lower than the variance.

Next, consider the description of the variables in the dataset. A table of the average number of days absent by program type suggests that “prog” is a good candidate for predicting the outcome, because the mean of the outcome variable varies by “prog.” In addition, the variance within each level of “prog” is higher than the mean within that level. These are the conditional means and variances. The difference between them suggests that over-dispersion is present, so a negative binomial model is appropriate.
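
The comparison of means and variances within each program can be sketched as follows. The `rows` below are made up for illustration and are not taken from nb_data.dta, and `mean_variance_by_group` is a hypothetical helper, not part of any library.

```python
from collections import defaultdict
import statistics

def mean_variance_by_group(rows):
    """Return {group: (mean, sample variance)} for (group, count) pairs."""
    groups = defaultdict(list)
    for group, count in rows:
        groups[group].append(count)
    return {
        group: (statistics.fmean(values), statistics.variance(values))
        for group, values in groups.items()
    }

# Made-up (prog, daysabs) pairs for illustration only.
rows = [
    (1, 0), (1, 4), (1, 19), (1, 9),
    (2, 2), (2, 0), (2, 11), (2, 7),
    (3, 0), (3, 1), (3, 5), (3, 2),
]

summary = mean_variance_by_group(rows)
for prog, (mean, var) in sorted(summary.items()):
    flag = "over-dispersed" if var > mean else "not over-dispersed"
    print(f"prog={prog}: mean={mean:.2f}, variance={var:.2f} ({flag})")
```

A variance well above the mean in every group is the pattern that points toward a negative binomial model rather than a Poisson model.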

A researcher can consider several analysis methods for this type of data. A few of them are described below:

1. Negative binomial regression

Negative Binomial Regression is used when count data are over-dispersed, that is, when the conditional variance exceeds the conditional mean. It can be considered a generalization of Poisson regression because both methods have the same mean structure, but negative binomial regression has an additional parameter that models the over-dispersion. When the conditional distribution of the outcome variable is over-dispersed, Poisson regression tends to understate the standard errors, so the confidence intervals from negative binomial regression are likely to be wider, and more realistic, than those from Poisson regression.

2. Poisson regression

Poisson regression is often used for modeling count data. It has a number of extensions that are useful for count models.

3. OLS regression

Count outcome variables are sometimes log-transformed and then analyzed with OLS regression. This approach has several problems. Data are lost because the log of zero is undefined, and the model does not account for the dispersion in the data.

4. Zero-inflated models

These models try to account for excess zeros in the data.

Analysis Using the Negative Binomial Regression

The Stata command “nbreg” is used to estimate a Negative Binomial Regression model. The “i.” before the variable “prog” indicates that it is a factor (categorical) variable, so its levels are included in the model as indicator variables.

  • The output begins with an iteration log. It starts by fitting a Poisson model, then a null (intercept-only) model, and then the negative binomial model. Estimation uses maximum likelihood and iterates until the change in the log likelihood is sufficiently small. The final log likelihood is used to compare models.
  • The next block of information is the header.
  • Just below the header are the Negative Binomial Regression coefficients for each variable, along with their standard errors, z-scores, p-values, and 95% confidence intervals. The coefficient for the “math” variable is -0.006 and is statistically significant. This means that for a one-unit increase in “math,” the expected log count of days absent decreases by 0.006. Likewise, the coefficient for the indicator variable 2.prog is the expected difference in log count between group 2 and the reference group.
  • The log-transformed over-dispersion parameter is estimated and displayed along with its untransformed value. In a Poisson model, the untransformed value is zero.
  • Below the coefficients table is a likelihood ratio test of whether the over-dispersion parameter is zero, which compares the negative binomial model with the Poisson model. The model can be explored further with the “margins” command.
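
Because the coefficients are on the log-count scale, exponentiating them often makes them easier to read. The sketch below converts the “math” coefficient of -0.006 quoted above into an incidence rate ratio. The helper name `incidence_rate_ratio` is hypothetical.

```python
import math

def incidence_rate_ratio(coef):
    """Convert a log-count coefficient into a multiplicative effect on the expected count."""
    return math.exp(coef)

math_coef = -0.006  # coefficient for "math" quoted in the text
irr = incidence_rate_ratio(math_coef)
percent_change = (irr - 1.0) * 100.0

print(f"IRR = {irr:.4f}, percent change per unit = {percent_change:.2f}%")
# prints: IRR = 0.9940, percent change per unit = -0.60%
```

In other words, each additional point on “math” is associated with roughly a 0.6% decrease in the expected number of days absent, holding “prog” fixed.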

Process of Doing Negative Binomial Regression Analysis in Python 

To carry out the regression in Python, first import the required packages. These are listed below:

  • import statsmodels.api as sm
  • import matplotlib.pyplot as plt
  • import numpy as np
  • from patsy import dmatrices
  • import pandas as pd

Considerations for Negative Binomial Regression 

There are a few things to consider when applying Negative Binomial Regression. These include:

  • Negative Binomial Regression is not recommended for small samples.
  • Excess zeros can be one cause of over-dispersion. They may be generated by an additional data-generating process. In that case, a zero-inflated model is recommended.
  • If the data-generating process does not allow any zeros, a zero-truncated model is recommended.
  • Count data often have an associated exposure variable, which indicates the number of times the event could have occurred. This variable should be incorporated into the Negative Binomial Regression model. This is done with the exp() option.
  • The outcome variable in a Negative Binomial Regression cannot take negative values, and the exposure variable cannot have the value 0.
  • The command “glm” can also be used to run a Negative Binomial Regression, by specifying the log link and the negative binomial family.
  • The command “glm” is required for obtaining the residuals, which are used to check the other assumptions of the Negative Binomial Regression model.
  • Several measures of pseudo-R-squared exist. They all attempt to provide information similar to that provided by R-squared in OLS regression, but none of them can be interpreted exactly as R-squared is interpreted in OLS regression.

Conclusion 

This article discussed Negative Binomial Regression. We have seen that it resembles multiple regression and is a generalization of Poisson regression. The method has several applications, and it can be carried out in Python or in R.

Several case studies also show its application in fields such as aging. The classical regression models for count data are Poisson Regression, Negative Binomial Regression, and Geometric Regression. These methods belong to the family of generalized linear models and are included in almost all statistical packages, such as R.

If you want to excel in machine learning and explore the field of data, you can check out the Executive PG Programme in Machine Learning & AI offered by upGrad. If you are a working professional who dreams of becoming an expert in machine learning, come and gain the experience of being trained by experts. More details can be found on our website. For any queries, our team can assist you promptly.