In data science, working with variables is routine, since information is collected in the form of variables. These variables are analyzed to understand the processes of a business or a research study. While working with the data, some tasks must be performed to establish the relationships between the variables under analysis. One widely used method for understanding the behavior of variables is regression analysis. Linear and logistic regression are the two types of regression analysis applied in most studies. However, knowledge of other forms of regression often remains limited.
The type of regression method changes with the types of variables used in the study. Regression analysis can handle different types of variables, including dependent variables with multiple levels. Analyzing such variables requires specialized computational techniques, and several machine learning algorithms, such as random forest, decision tree, and Naive Bayes, are available for the purpose. These algorithms can seem complicated at first, but once the logistic regression technique is well understood, grasping how they work becomes easier. This article focuses on ordinal regression.
One application discussed later in this article uses an ordinal logistic regression model to describe the factors that affect the degree of nodal involvement in patients with oral cancer, together with the validation of that model.
Ordinal regression is used to predict the value of an ordinal dependent variable given one or more independent variables. An ordinal variable is a variable whose values lie on an arbitrary scale where only the relative ordering of the values is meaningful. Ordinal regression is often considered an intermediate technique between classification and regression.
Ordinal regression is also known as ordinal logistic regression. It is essentially an extension of binomial logistic regression. An ordinal regression model predicts a dependent variable with multiple ordered categories from one or more independent variables. It can also be used to examine how the independent variables, and interactions between them, relate to the dependent variable. To understand the concept more clearly, let us consider an example.
Suppose a survey has been conducted in which respondents were asked whether they agreed or disagreed with a statement. The two-way responses did not serve the study well, so finer response categories were introduced: strongly disagree, disagree, agree, and strongly agree. Arranging the categories in order made the nature of the responses easier to understand. This is what ordinal logistic regression captures: the categories of the dependent variable follow a definite order.
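The ordering in the survey example can be made explicit in code. The sketch below is only an illustration in Python (the article's own workflow uses R); the response list is hypothetical, and the names LEVELS and RANK are introduced here for the example.

```python
# Ordered response categories from the survey example, lowest to highest.
LEVELS = ["strongly disagree", "disagree", "agree", "strongly agree"]

# Map each label to its rank so that comparisons respect the order.
RANK = {label: position for position, label in enumerate(LEVELS)}

# Hypothetical responses, invented for illustration only.
responses = ["agree", "strongly disagree", "strongly agree", "disagree", "agree"]

# Integer codes that preserve the ordering of the categories.
codes = [RANK[r] for r in responses]

# Sorting by rank gives the intended order; plain alphabetical sorting does not.
sorted_by_rank = sorted(responses, key=RANK.__getitem__)
sorted_alphabetically = sorted(LEVELS)

print(codes)
print(sorted_by_rank)
print(sorted_alphabetically)
```

The codes carry only the order of the categories. The gaps between them are not assumed to be equal, which is why an ordinal model is used rather than ordinary linear regression on the codes.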
After fitting an ordinal logistic regression (OLR) model, the user can determine which independent variables have a statistically significant effect on the dependent variable. For categorical independent variables, the user can interpret the odds that one group has a higher or lower value on the dependent variable than another group. For continuous independent variables, the OLR method shows how a one-unit increase or decrease in the variable changes the odds of the dependent variable falling in a higher category.
The method can be used in many domains of study. Because of this wide applicability, the model is popular in data analytics. The method is also sometimes referred to as the proportional odds model.
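To make the proportional odds model concrete, the following minimal Python sketch (for illustration only; the article itself works in R) turns a linear predictor and a set of cut points into category probabilities. It assumes the parameterization P(Y <= j) = logistic(theta_j - eta), which is the one used by polr in the MASS package. The cut points and predictor values are hypothetical.

```python
import math


def logistic(z):
    """Standard logistic function."""
    return 1.0 / (1.0 + math.exp(-z))


def category_probabilities(eta, cutpoints):
    """Return P(Y = j) for each ordered category.

    eta is the linear predictor (x times beta); cutpoints are the
    increasing thresholds theta_1 < ... < theta_(J-1).
    """
    # Cumulative probabilities P(Y <= j); the last category always reaches 1.
    cumulative = [logistic(theta - eta) for theta in cutpoints] + [1.0]
    probabilities = []
    previous = 0.0
    for value in cumulative:
        probabilities.append(value - previous)
        previous = value
    return probabilities


# Hypothetical cut points for a three-category outcome.
cutpoints = [-1.0, 1.0]

low = category_probabilities(0.0, cutpoints)   # smaller linear predictor
high = category_probabilities(1.5, cutpoints)  # larger linear predictor

print([round(p, 3) for p in low])
print([round(p, 3) for p in high])
```

A larger linear predictor shifts probability mass toward the higher categories, while the cut points stay fixed.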
Machine learning techniques can also be used to carry out ordinal regression, which in machine learning is sometimes called ranking learning. The technique is often implemented as a generalized linear model (GLM). Various software packages support this analysis, including ORCA (a MATLAB framework) and R packages such as ordinal and MASS.
Statistical Models in Ordinal Logistic Regression
Several ordinal logistic regression models are available for handling ordinal outcomes. The models differ in how they form the logits. Examples are the proportional odds, continuation ratio, and adjacent category models. Each has its own advantages and limitations, and users can choose among them according to their needs. The proportional odds model uses all the categories at every cut point, whereas the adjacent category and continuation ratio models use only part of the data in each comparison. In applications such as biomedical and epidemiological studies, the proportional odds model is used most often, although the continuation ratio model also appears in some cases. The choice ultimately depends on the purpose of the statistical analysis.
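The three model families differ in which categories enter each logit. The Python sketch below (illustrative only) computes the three kinds of logits from a single probability vector. The sign and direction conventions shown in the comments are one common choice and are an assumption here, since other texts define them the other way round. The probabilities are hypothetical.

```python
import math


def cumulative_logits(p):
    """log[P(Y <= j) / P(Y > j)]: every category appears in every logit."""
    return [math.log(sum(p[:j]) / sum(p[j:])) for j in range(1, len(p))]


def adjacent_category_logits(p):
    """log[P(Y = j+1) / P(Y = j)]: only two neighbouring categories are compared."""
    return [math.log(p[j] / p[j - 1]) for j in range(1, len(p))]


def continuation_ratio_logits(p):
    """log[P(Y = j) / P(Y > j)]: categories below j are left out."""
    return [math.log(p[j - 1] / sum(p[j:])) for j in range(1, len(p))]


# Hypothetical probabilities for three ordered categories.
p = [0.5, 0.3, 0.2]

cumulative = cumulative_logits(p)
adjacent = adjacent_category_logits(p)
continuation = continuation_ratio_logits(p)

print([round(v, 3) for v in cumulative])
print([round(v, 3) for v in adjacent])
print([round(v, 3) for v in continuation])
```

With these conventions the first continuation ratio logit equals the first cumulative logit, and the last one is the negative of the last adjacent category logit, which shows how the later comparisons rest on fewer categories.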
Assumptions to be Made in the OLR
To carry out OLR in software such as SPSS, a few assumptions need to be satisfied. The assumptions are listed below:
- The dependent variable should be measured at the ordinal level.
- There should be one or more independent variables. They can be continuous, categorical, or ordinal, and categorical variables include dichotomous ones.
- There should be no multicollinearity between the independent variables. Multicollinearity arises when two or more independent variables are highly correlated with each other.
- The model should have proportional odds, meaning that each independent variable has the same effect at every cut point of the dependent variable.
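The proportional odds assumption in the last item can be illustrated numerically. The Python sketch below is only a demonstration of what the assumption means, not a test of it on real data. It assumes the parameterization P(Y <= j) = logistic(theta_j - beta * x), and the coefficient and cut points are hypothetical.

```python
import math


def odds_above(x, beta, theta):
    """Odds of Y > j versus Y <= j at cut point theta, for predictor value x."""
    p_at_or_below = 1.0 / (1.0 + math.exp(-(theta - beta * x)))
    return (1.0 - p_at_or_below) / p_at_or_below


beta = 0.7                    # hypothetical coefficient
cutpoints = [-0.5, 0.8, 2.0]  # hypothetical cut points

# Odds ratio for a one-unit increase in x, computed separately at each cut point.
odds_ratios = [
    odds_above(1.0, beta, theta) / odds_above(0.0, beta, theta)
    for theta in cutpoints
]

print([round(r, 4) for r in odds_ratios])
print(round(math.exp(beta), 4))
```

Every cut point gives the same odds ratio, exp(beta). If the ratios estimated separately at each cut point in real data differ markedly, the proportional odds assumption is in doubt.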
Ordinal Logistic Regression (OLR) in R
A few libraries must be loaded before performing OLR in R, including MASS, which provides the polr function used to build the model. The steps then proceed as follows:
1. Loading the data: Once the libraries are loaded, the data needs to be loaded.
2. Understanding the data: The dataset contains a variable “apply” that acts as the dependent variable. It has three levels: “unlikely”, “somewhat likely”, and “very likely”, with “very likely” the highest and “unlikely” the lowest. Because these categories are ordered, ordinal logistic regression can be applied. The variable pared (0/1) indicates whether at least one parent has a graduate degree, and public (0/1) indicates the type of institution.
The “polr” command is used to build the ordinal logistic regression model. Hess=TRUE is specified so that the fitted model returns the observed information matrix (the Hessian) from the optimization, which is needed to obtain the standard errors of the estimates.
The output shows the usual table of regression coefficients, including the standard error and t value of each coefficient. It also includes the estimated intercepts, the residual deviance, and the AIC. AIC stands for Akaike information criterion; when comparing models fitted to the same data, a lower AIC indicates a better model.
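The residual deviance and AIC both come from the model's log-likelihood. The Python sketch below (illustrative only, not the polr implementation) evaluates the log-likelihood of a proportional odds model on a tiny hypothetical dataset, then derives the deviance as -2 times the log-likelihood and the AIC as the deviance plus twice the number of parameters. The data and parameter values are invented and are not fitted estimates.

```python
import math


def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))


def category_probabilities(eta, cutpoints):
    """P(Y = j) under P(Y <= j) = logistic(theta_j - eta)."""
    cumulative = [logistic(theta - eta) for theta in cutpoints] + [1.0]
    return [c - prev for c, prev in zip(cumulative, [0.0] + cumulative[:-1])]


def log_likelihood(xs, ys, beta, cutpoints):
    """Sum of log P(Y = observed category) over all observations."""
    return sum(
        math.log(category_probabilities(beta * x, cutpoints)[y])
        for x, y in zip(xs, ys)
    )


# Hypothetical toy data: one binary predictor, outcome coded 0 < 1 < 2.
xs = [0.0, 0.0, 1.0, 1.0, 1.0, 0.0]
ys = [0, 1, 1, 2, 2, 0]

# Hypothetical parameter values (not the result of an optimization).
cutpoints = [-0.3, 1.4]
ll_good = log_likelihood(xs, ys, 1.2, cutpoints)
ll_poor = log_likelihood(xs, ys, -1.2, cutpoints)

n_parameters = 1 + len(cutpoints)  # one slope plus the cut points
deviance = -2.0 * ll_good
aic = deviance + 2.0 * n_parameters

print(round(ll_good, 3), round(ll_poor, 3))
print(round(deviance, 3), round(aic, 3))
```

Fitting the model amounts to searching for the slope and cut points that maximize this log-likelihood. In the toy data, the positive slope matches the pattern in the data better and therefore gives the higher log-likelihood.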
Next, metrics such as the odds ratios, confidence intervals (CI), and p-values are calculated.
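These three quantities can be derived from a coefficient and its standard error. The Python sketch below is a minimal illustration using Wald-type intervals and a normal approximation for the p-value, which is an assumption and may differ from the profile-likelihood intervals that R can report. The coefficient is back-derived from the odds ratio of 2.85 quoted in the interpretation section; the standard error is hypothetical, and wald_summary is a helper name introduced here.

```python
import math
from statistics import NormalDist


def wald_summary(coefficient, standard_error, level=0.95):
    """Odds ratio, Wald confidence interval, and two-sided p-value."""
    normal = NormalDist()
    critical = normal.inv_cdf(0.5 + level / 2.0)
    z_value = coefficient / standard_error
    p_value = 2.0 * (1.0 - normal.cdf(abs(z_value)))
    return {
        "odds_ratio": math.exp(coefficient),
        "ci_low": math.exp(coefficient - critical * standard_error),
        "ci_high": math.exp(coefficient + critical * standard_error),
        "p_value": p_value,
    }


coefficient = math.log(2.85)  # log of the odds ratio quoted in the article
standard_error = 0.27         # hypothetical value, for illustration only

summary = wald_summary(coefficient, standard_error)
for name, value in summary.items():
    print(name, round(value, 4))
```

Exponentiating the coefficient and the interval endpoints moves them from the log-odds scale to the odds ratio scale, which is the scale used in the interpretation.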
Interpretation of the Output
You can interpret the output of the ordinal regression in the following manner:
- For a one-unit increase in parental education, i.e., going from 0 (low) to 1 (high), the odds of “very likely” versus “somewhat likely” or “unlikely” combined are 2.85 times greater.
- For a one-unit increase in the student’s GPA, the odds of moving from “unlikely” to “somewhat likely” or “very likely” are multiplied by 1.85.
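Because an odds ratio multiplies odds rather than probabilities, its effect on a probability depends on the starting point. The Python sketch below (illustrative only) applies the GPA odds ratio of 1.85 quoted above to a hypothetical baseline probability. The helper shift_probability and the baseline value are assumptions made for the example.

```python
def shift_probability(probability, odds_ratio, units=1.0):
    """Apply an odds ratio for a change of `units` to a baseline probability."""
    odds = probability / (1.0 - probability)
    new_odds = odds * odds_ratio ** units
    return new_odds / (1.0 + new_odds)


baseline = 0.40  # hypothetical probability of being above "unlikely"

after_one_unit = shift_probability(baseline, 1.85)
after_two_units = shift_probability(baseline, 1.85, units=2)

print(round(after_one_unit, 4))
print(round(after_two_units, 4))
```

Each additional unit multiplies the odds by the same factor, but the resulting change in probability shrinks as the probability approaches 1.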
The model can then be refined for better prediction, for example by adding interaction terms. Once this is done, the model can be plotted.
Ordinal Logistic Regression Examples
Ordinal logistic regression can be applied in many situations. A few examples are listed below.
- Suppose a marketing firm investigates the factors that influence the size of soda people order at fast food outlets. The sizes can be small, medium, large, or extra-large. Several factors might influence the size ordered, such as whether the customer has also ordered a sandwich or French fries, and the customer’s age.
- A study might analyze the factors that influence medaling in swimming at the Olympics. The factors in this case might be the hours of practice, the age of the swimmer, and the diet. The outcome might also depend on how popular swimming is in the swimmer’s home country.
In an applied study of nodal involvement in oral squamous cell carcinoma (OSCC), the data used to develop the models were provided by the Department of Surgical Oncology, Dr. BRA-IRCH, AIIMS, New Delhi, India. All OSCC patients who underwent complete surgery, including neck dissection, from 1995 through 2013 were included. For the model’s validation, additional data from 204 patients gathered prospectively between 2014 and 2015 were taken into account.
As a pioneering effort in the field of OSCC, a stepwise multivariable regression approach was used to evaluate the factors associated with the degree of nodal involvement. The results are reported as odds ratios with the corresponding 95% confidence intervals (CI). The ordinal models were evaluated and compared on how well they accounted for the ordinal form of the outcome. In addition, the prospectively acquired data were used to validate the performance of the established model.
Under a multivariable proportional odds model, pain at the time of presentation, submucous fibrosis, a palpable neck node, oral site, and degree of differentiation were found to be strongly associated with the extent of nodal involvement. In addition, the partial proportional odds model revealed that tumour size was also relevant.
Examples of ordinal logistic regression
Example 1: A marketing research company is looking at the variables that affect the soda size that customers order at a fast food restaurant (small, medium, large, or extra large). These variables could include the sandwich purchased (chicken or burger), whether or not fries were also ordered, and the customer’s age. Although the outcome variable, soda size, is clearly ordered, the differences between the sizes are not consistent. Small and medium are separated by 10 ounces, medium and large by 8 ounces, and large and extra large by 12 ounces.
Example 2: A researcher is curious about the factors that affect Olympic swimming medaling. Hours of training, diet, age, and the popularity of swimming in the athlete’s home country are all pertinent factors. According to the researcher, the distances between medals are not equal; the gap between bronze and silver, for instance, is not the same as the gap between silver and gold.
Example 3: A study examines the variables that affect students’ decisions to apply to graduate programs. Students in their junior year of college are asked whether they are unlikely, somewhat likely, or very likely to apply to graduate school. Thus, the outcome variable has three categories. Data on the parents’ level of education, the type of undergraduate institution (public vs. private), and the current GPA are also gathered. According to the researchers, the “distances” between these three points may not be equal. The “distance” between “unlikely” and “somewhat likely,” for instance, may be shorter than that between “somewhat likely” and “very likely.”
This article has discussed the statistical technique of ordinal regression and how to implement it in R. The technique is an extension of the simple logistic model to categorical dependent variables with ordered categories. It returns information on the contribution of each independent variable. The benefit of OLR over the multinomial regression model is that the ranking of the dependent variable’s categories is preserved while the contribution of each independent variable is reported. Also, because the independent variables may be measured on different scales, they can be normalized before fitting an OLR model.