Predictive Modeling on Airline Customer Segmentation

Fitted logistic regression, decision tree, and random forest models in R to analyze about 130,000 survey responses. Identified the best model for predicting customer satisfaction and the variables most useful to airline companies for decision-making.

Predictive Analytics Project

Business Objective

Dataset

Actions

Result

To apply multiple predictive models to predict customer satisfaction and to make practical business recommendations to airline companies.

Airline passenger demographic and survey data consisting of 130,000 unique passenger responses with 23 columns, including gender, type of travel, and satisfaction with inflight wifi service.

1. Cleaned the Dataset

My team used R to convert categorical variables into binary variables and to group related variables.
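The binary conversion can be sketched as follows. This is a minimal Python illustration, not the team's R code; the column names, category labels, and the helper `to_binary` are assumptions made up for the example.

```python
# Hypothetical column names and category values, for illustration only.
rows = [
    {"Gender": "Female", "Type of Travel": "Business travel"},
    {"Gender": "Male", "Type of Travel": "Personal Travel"},
    {"Gender": "Female", "Type of Travel": "Personal Travel"},
]


def to_binary(rows, column, positive_value):
    """Return a 0/1 list: 1 where row[column] equals positive_value, else 0."""
    return [1 if row[column] == positive_value else 0 for row in rows]


is_female = to_binary(rows, "Gender", "Female")
is_business = to_binary(rows, "Type of Travel", "Business travel")

print(is_female)    # [1, 0, 1]
print(is_business)  # [1, 0, 0]
```

A two-level variable needs only one 0/1 column, since the second level is implied by the zeros.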

2. Evaluated the Correlation

Because part of the dataset is survey data, we wanted to know whether the survey responses would influence the final passenger satisfaction result. We therefore produced a correlation map of all variables to examine the relationships among the independent variables and between each of them and the satisfaction variable. Based on it, we decided to remove the variable "Departure Delay in Minutes" to avoid multicollinearity.
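A correlation screen of this kind can be sketched in a few lines. The Python below is an illustration only, not the team's R workflow: the column names, the toy values, and the 0.9 threshold are all assumptions chosen for the example.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def highly_correlated_pairs(columns, threshold=0.9):
    """List (name_1, name_2, r) for every pair of columns with |r| >= threshold."""
    names = list(columns)
    pairs = []
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            r = pearson(columns[first], columns[second])
            if abs(r) >= threshold:
                pairs.append((first, second, round(r, 3)))
    return pairs


# Made-up toy values; "delay_a" and "delay_b" are hypothetical columns.
columns = {
    "delay_a": [0, 10, 20, 30, 40],
    "delay_b": [1, 11, 19, 32, 41],
    "wifi_rating": [3, 5, 1, 4, 2],
}

flagged = highly_correlated_pairs(columns)
print(flagged)  # only the (delay_a, delay_b) pair is flagged
```

When two predictors are flagged like this, keeping only one of them is a common way to reduce multicollinearity.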

3. Identified Plausible Predictive Models

Based on the business objective, we identified three plausible predictive models: Logistic Regression, Classification and Regression Trees (CART), and Random Forest. We used them to evaluate which variables are important for predicting passenger satisfaction.

The Random Forest model had the highest overall accuracy in predicting passenger satisfaction. In addition, the variables ranked highly by the random forest's variable importance are ones that companies can act on to improve their flight services. The significant variables identified by the models are directly predictive of customer satisfaction and are therefore especially important to address in order to improve passenger satisfaction.

Results of Analysis

Below are the parts of the team project that I was responsible for.

CART Model Methodology

  • I fitted a Classification and Regression Trees (CART) model of binary satisfaction against the other variables to see which factors influence passenger satisfaction the most.

  • Based on the correlations among the independent variables, I used every variable except "Departure Delay in Minutes" to predict binary satisfaction, which avoids multicollinearity.

  • According to the initial output of the default CART model, the complexity parameter (CP) value started to decrease between tree sizes of 5 and 8.

  • I then compared the variables used in tree construction and considered the tree size in light of its business implications.

  • As a result, I selected the tree of size 6 as the best CART model.
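For readers unfamiliar with how a CART model decides which variable to split on, the sketch below scores candidate splits by their decrease in Gini impurity, a common CART splitting criterion. It is a Python illustration under stated assumptions: whether the R model above used Gini is not stated here, and the feature names and toy labels are hypothetical.

```python
def gini(labels):
    """Gini impurity of a list of binary (0/1) labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    return 2 * p * (1 - p)  # equals 1 - p^2 - (1 - p)^2 for two classes


def split_gain(feature, labels):
    """Decrease in Gini impurity from splitting on a binary (0/1) feature."""
    left = [y for x, y in zip(feature, labels) if x == 0]
    right = [y for x, y in zip(feature, labels) if x == 1]
    n = len(labels)
    weighted = len(left) / n * gini(left) + len(right) / n * gini(right)
    return gini(labels) - weighted


# Made-up toy data: 1 = satisfied, 0 = not satisfied.
satisfied = [1, 1, 1, 0, 0, 0, 1, 0]
features = {
    "feature_a": [1, 1, 1, 0, 0, 0, 1, 0],  # separates the classes perfectly
    "feature_b": [1, 0, 1, 0, 1, 0, 1, 0],  # only partly informative
}

best = max(features, key=lambda name: split_gain(features[name], satisfied))
print(best)  # feature_a
```

A tree grows by repeating this choice within each resulting group; a larger tree size means more such splits, which is why tree size trades accuracy against how easy the model is to explain.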

CART Model Results

  • The accuracy of this CART model is 89.77%, well above the 56.1% accuracy of the baseline model.

  • The variables used in tree construction were type of travel, inflight wifi service, online boarding, inflight entertainment, and checkin service.

  • Online boarding is the most important indicator of passenger satisfaction, followed by inflight wifi service.
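The comparison between model accuracy and baseline accuracy can be sketched as below. This Python example assumes the baseline is a majority-class predictor, which is a common choice but is an assumption here; the toy labels and predictions are made up and are not the project's data.

```python
from collections import Counter


def accuracy(actual, predicted):
    """Share of observations where the prediction matches the actual label."""
    correct = sum(a == p for a, p in zip(actual, predicted))
    return correct / len(actual)


def majority_baseline_accuracy(actual):
    """Accuracy of always predicting the most frequent class (assumed baseline)."""
    most_common_count = Counter(actual).most_common(1)[0][1]
    return most_common_count / len(actual)


# Made-up toy labels: 1 = satisfied, 0 = not satisfied.
actual = [0, 0, 0, 1, 1, 0, 1, 0, 1, 0]
predicted = [0, 0, 1, 1, 1, 0, 1, 0, 0, 0]

model_acc = accuracy(actual, predicted)
baseline_acc = majority_baseline_accuracy(actual)
print(model_acc, baseline_acc)  # 0.8 0.6
```

A model is only useful to the extent that its accuracy exceeds this kind of baseline, since the baseline needs no predictors at all.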

© 2022 by Peggy Liang
