Reducing Commercial Aviation Fatalities | Kaggle Competition

12 min read · Jan 7, 2021

Overview
Business Problem
Dataset Analysis
Mapping the Real World Problem into ML problem
Performance Metric
Exploratory Data Analysis
Feature Engineering
Data preprocessing
Modeling
Future improvements
Result
References

This Kaggle competition was organized by Booz Allen Hamilton, a company that has been solving problems for business, government, and military leaders for over 100 years. The competition is about reducing aviation fatalities: we have to predict the cognitive state of a pilot from the physiological data provided. So let’s start exploring the problem!

Overview

Most flight fatalities are attributed to a loss of airplane state awareness. Safety is one of the most important concerns for today’s travelers, and flight accidents can cost many lives. So our challenge is to build a model that detects troubling events from the aircrew’s physiological data.

The most frequent causes for these aviation accidents include:

Pilot error
Mechanical error
Runway problems
Air traffic failure
Climate problems

In this problem we focus mainly on the first cause, i.e. aviation accidents caused by pilot error, and on how to reduce them.

Business Problem

Air travel is one of the most important modes of transport, so passenger safety is a primary concern for most airlines. As part of that, pilots undergo rigorous training to handle different situations, and one important ability a pilot must have is multitasking.

Most flight fatalities and accidents caused by pilot error are due to the loss of airplane state awareness. Airplane state awareness (ASA) is a pilot performance attribute: the pilot should be able to recognize and respond quickly to any change in the state of the airplane. Loss of ASA may lead to many dangerous situations and may result in loss of airplane control. It is mainly caused by a loss of attention on the part of pilots who may be distracted, sleepy, or in other dangerous cognitive states. Because flying is a stressful environment, loss of awareness is a common risk.

In this dataset, you are provided with real physiological data from eighteen pilots who were subjected to various distracting events. The benchmark training set is comprised of a set of controlled experiments collected in a non-flight environment, outside of a flight simulator. The test set (abbreviated LOFT = Line Oriented Flight Training) consists of a full flight (take off, flight, and landing) in a flight simulator.

The pilots experienced distractions intended to induce one of the following three cognitive states:

Channelized Attention (CA) is, roughly speaking, the state of being focused on one task to the exclusion of all others. This is induced in benchmarking by having the subjects play an engaging puzzle-based video game.
Diverted Attention (DA) is the state of having one’s attention diverted by actions or thought processes associated with a decision. This is induced by having the subjects perform a display monitoring task. Periodically, a math problem showed up which had to be solved before returning to the monitoring task.
Startle/Surprise (SS) is induced by having the subjects watch movie clips with jump scares.

The aim is to build a model that can estimate the state of mind of the pilot in real-time using the physiological data given. When the pilot enters into any one of the above mentioned dangerous cognitive states, he/she should be alerted, thereby preventing any possible accident.

Dataset Analysis

The dataset is provided as CSV files (both the train and the test set).

Now, let’s analyze each attribute in the dataset.

The main sensors used to collect the physiological data are EEG, ECG, respiration, and galvanic skin response.

id - (test.csv and sample_submission.csv only) A unique identifier for a crew + time combination. You must predict probabilities for each id.
crew - a unique id for a pair of pilots. There are 9 crews in the data.
experiment - One of CA, DA, SS or LOFT. The first three comprise the training set; the last is the test set. In the training data, the output is one of four labels: Baseline (no event), CA, DA, or SS. For example, if the experiment is CA, the output is either CA or Baseline. The test data is taken from a full flight simulator; the experiment is called LOFT (Line Oriented Flight Training), in which the pilot is trained in a simulator that recreates the environment of a real flight. In the test data the experiment is always LOFT, and the output can be any of the four states at a given time.
time - seconds into the experiment
seat - is the pilot in the left (0) or right (1) seat
eeg_fp1, eeg_f7, eeg_f8, eeg_t4, eeg_t6, eeg_t5, eeg_t3, eeg_fp2, eeg_o1, eeg_p3, eeg_pz, eeg_f3, eeg_fz, eeg_f4, eeg_c4, eeg_p4, eeg_poz, eeg_c3, eeg_cz, eeg_o2 - EEG recordings, one column for each of the 20 electrodes.
ecg - 3-point Electrocardiogram signal. The sensor had a resolution/bit of .012215 µV and a range of -100mV to +100mV. The data are provided in microvolts.
r - Respiration, a measure of the rise and fall of the chest. The sensor had a resolution/bit of .2384186 µV and a range of -2.0V to +2.0V. The data are provided in microvolts.
gsr - Galvanic Skin Response, a measure of electrodermal activity. The sensor had a resolution/bit of .2384186 µV and a range of -2.0V to +2.0V. The data are provided in microvolts.
event - The state of the pilot at the given time: one of A = baseline, B = SS, C = CA, D = DA

Mapping the Real World Problem into ML problem

This is a multiclass classification problem with classes A, B, C, and D. For each id, we need to predict the probability that the pilot’s state belongs to each of the four classes.

Performance Metric

The problem we are handling is a multiclass classification problem with 4 classes.

The evaluation metric used in this competition is multiclass log loss:

log loss = -(1/N) * Σ_{i=1..N} Σ_{j=1..M} y_ij * log(p_ij)

where N is the total number of data points and M is the number of classes.

y_ij is 1 if data point i actually belongs to class j, and 0 otherwise.

p_ij is the predicted probability of data point i belonging to class j.
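
To make the metric concrete, here is a minimal sketch of multiclass log loss in plain Python. Because y_ij is 1 only for the true class, the double sum reduces to the log of the probability given to the true class. The function name, the clipping constant eps, and the toy numbers are illustrative assumptions, not part of the competition code.

```python
import math

def multiclass_log_loss(y_true, probs, eps=1e-15):
    """Mean negative log-probability assigned to the true class.

    y_true: list of class indices (0..M-1), one per data point.
    probs:  list of per-class probability lists, one row per data point.
    """
    total = 0.0
    for label, row in zip(y_true, probs):
        # Clip to avoid log(0) when a model assigns zero probability.
        p = min(max(row[label], eps), 1.0 - eps)
        total += math.log(p)
    return -total / len(y_true)

# Four classes in the order A, B, C, D (toy data).
y_true = [0, 2, 3]
uniform = [[0.25, 0.25, 0.25, 0.25]] * 3
confident = [[0.9, 0.05, 0.03, 0.02],
             [0.1, 0.1, 0.7, 0.1],
             [0.05, 0.05, 0.1, 0.8]]

print(multiclass_log_loss(y_true, uniform))    # ln(4), about 1.386
print(multiclass_log_loss(y_true, confident))  # lower: true classes get high probability
```

A model that always predicts 0.25 for every class scores ln(4), which is a handy reference point when reading log loss values for this 4-class problem.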

Exploratory Data Analysis (EDA)

The first step in exploring the train data is to check whether it is balanced or imbalanced. For this purpose we used a countplot.

From the plot we can see that the train data is imbalanced.
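
The same check can be done numerically by computing the share of rows per class. The sketch below uses only the standard library; the toy labels stand in for the event column and do not reflect the real class shares. With pandas, `train["event"].value_counts(normalize=True)` would give the same numbers, assuming the data is loaded into a DataFrame named train.

```python
from collections import Counter

def class_shares(events):
    """Return {label: fraction of rows} for the event column."""
    counts = Counter(events)
    n = len(events)
    return {label: counts[label] / n for label in sorted(counts)}

# Toy labels standing in for the 'event' column; the real shares differ.
events = ["A"] * 6 + ["C"] * 3 + ["D"] * 1
shares = class_shares(events)
print(shares)
```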

Time

The plot for event B shows that its values lie at the low and high ends of the time axis, with very few values in the mid-range.
For the other classes, the time feature is well distributed.
From the distplot, we can say that the time into the experiment mostly lies between 0 and 360 sec.
From the plot, we can see that there is a large difference between the train and test experiment times.

ECG (electrocardiogram)

From the box plot, we observe that the ECG data has some outliers. But we cannot simply remove them, because these extreme values might be useful in predicting the event. When the ECG value is high (more than 10000 microvolts), the pilot is more likely to be in the DA state. Similarly, when the value is strongly negative, the pilot is likely to be in the CA state.
From the pdf of ECG, we can see that the values range from about -20k to +35k uV. Far fewer values fall in the 20k to 25k uV range than elsewhere.

GSR (galvanic skin response)

The pdf of GSR ranges from approximately 0 to 2000 uV.
GSR plays some role in determining the event. For example, if the GSR value is very high, the pilot is more likely to be in the SS state. If the sensor output is very low, the pilot is probably in the DA state.

Respiration

The pdf of the respiration signal looks similar, and the signal typically ranges from 400 to 850.
This sensor output does not separate the events at all. This might be because of noise in the data.

EEG (electroencephalogram)

We can also observe that almost all the events lie in nearly the same range, so we can’t simply pick a threshold value and use it to classify the events.
From the box plot, most of the features overlap heavily.
Both train and test EEG data follow an approximately normal distribution, but the test data has a bigger peak at zero and more variance.

Feature Engineering

Now let’s derive some features from existing features.

Deriving Heart Beat Information From ECG

In the given dataset, ECG is recorded in microvolts. If we can extract the heart rate from this ECG data, it should be useful for modelling. For this purpose there is a powerful Python library called biosppy, which does bio-signal processing.

The heart rate output by the biosppy module is available only at particular time steps, not at every time stamp in our dataset. So we need to estimate the missing values, and interpolation is used for this purpose.
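
A minimal sketch of this step is shown below. It assumes the rows passed in belong to one pilot in one experiment and are sorted by time, and that the sampling rate is known (the value is an assumption to check against the data description). The helper names are hypothetical; the biosppy call is the library's `ecg.ecg` pipeline, which returns sparse `heart_rate_ts` and `heart_rate` arrays measured from the start of the signal.

```python
import numpy as np

def interpolate_to_rows(rate_times, rates, row_times):
    """Linearly interpolate sparse rate estimates onto row_times.

    np.interp holds the first/last estimate constant outside the
    range covered by rate_times, so every row gets a value.
    """
    return np.interp(row_times, rate_times, rates)

def heart_rate_per_row(ecg_uv, row_times, sampling_rate):
    """Heart rate (bpm) aligned to every row of one sorted recording."""
    from biosppy.signals import ecg  # third-party; pip install biosppy
    out = ecg.ecg(signal=ecg_uv, sampling_rate=sampling_rate, show=False)
    # biosppy timestamps start at 0, so shift the dataset times to match.
    relative = np.asarray(row_times, dtype=float) - row_times[0]
    return interpolate_to_rows(out["heart_rate_ts"], out["heart_rate"], relative)

# Interpolation on toy values: estimates at 0 s and 2 s, rows every second.
print(interpolate_to_rows([0.0, 2.0], [60.0, 80.0], [0.0, 1.0, 2.0, 3.0]))
```

Linear interpolation is one simple choice here; the rows after the last detected beat simply keep the last estimate.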

Deriving Respiration Rate Information From Respiration (‘r’)

Respiration is the rise and fall of the chest; it represents the muscular activity of the abdomen and diaphragm. When a person is stressed, the respiration rate will be high. To obtain the respiration rate we use the same biosppy module.

Here too we need interpolation to get the respiration rate at all time stamps.

Deriving Frequency Bands from EEG data

The electroencephalogram (EEG) is a recording of the electrical activity of the brain from the scalp. The recorded waveforms reflect the cortical electrical activity.
Signal intensity: EEG activity is quite small, measured in microvolts (µV).
Signal frequency: the main frequencies of the human EEG waves are:

Delta: has a frequency below 3 Hz. These tend to be the slowest waves and the highest in amplitude. It is normal as the dominant rhythm in infants up to one year and in stages 3 and 4 of sleep. It may occur focally with subcortical lesions and in general distribution with diffuse lesions, metabolic encephalopathy, hydrocephalus or deep midline lesions. It is usually most prominent frontally in adults (e.g. FIRDA: Frontal Intermittent Rhythmic Delta) and posteriorly in children (e.g. OIRDA: Occipital Intermittent Rhythmic Delta).
Theta: has a frequency of 4 to 8 Hz and is classified as “slow” activity. It is perfectly normal in children up to 13 years and in sleep but abnormal in awake adults. It can be seen as a manifestation of focal subcortical lesions; it can also be seen in generalized distribution in diffuse disorders such as metabolic encephalopathy or some instances of hydrocephalus.
Alpha: has a frequency between 8 and 13 Hz. It is usually best seen in the posterior regions of the head on each side, being higher in amplitude on the dominant side. It appears when closing the eyes and relaxing, and disappears when opening the eyes or alerting by any mechanism (thinking, calculating). It is the major rhythm seen in normal relaxed adults. It is present during most of life, especially after the thirteenth year.
Beta: beta activity is “fast” activity. It has a frequency of 14 Hz and above. It is usually seen on both sides in symmetrical distribution and is most evident frontally. It is accentuated by sedative-hypnotic drugs, especially the benzodiazepines and the barbiturates. It may be absent or reduced in areas of cortical damage. It is generally regarded as a normal rhythm. It is the dominant rhythm in patients who are alert or anxious or have their eyes open.
Gamma (>25 Hz): High value indicates Anxiety, high arousal, stress, etc. A very low value indicates ADHD, depression, learning disabilities, etc.

Here too, we use the biosppy module and interpolation.
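
To illustrate what a band feature measures, here is a small NumPy sketch that computes the average spectral power of one EEG window in each band using a plain FFT periodogram. This is a stand-in for illustration, not biosppy's implementation; the band edges are rounded from the ranges listed above and the 256 Hz sampling rate is an assumption.

```python
import numpy as np

# Band edges in Hz, rounded from the ranges above (assumed, not biosppy's exact edges).
BANDS = {"theta": (4, 8), "alpha": (8, 13), "beta": (13, 25), "gamma": (25, 40)}

def band_powers(signal, sampling_rate, bands=BANDS):
    """Average spectral power of one EEG window in each band."""
    signal = np.asarray(signal, dtype=float)
    signal = signal - signal.mean()               # drop the DC offset
    freqs = np.fft.rfftfreq(signal.size, d=1.0 / sampling_rate)
    power = np.abs(np.fft.rfft(signal)) ** 2 / signal.size
    return {
        name: float(power[(freqs >= lo) & (freqs < hi)].mean())
        for name, (lo, hi) in bands.items()
    }

# A synthetic 10 Hz sine should put most of its power in the alpha band.
fs = 256                                          # assumed sampling rate in Hz
t = np.arange(2 * fs) / fs                        # a 2-second window
powers = band_powers(np.sin(2 * np.pi * 10 * t), fs)
print(max(powers, key=powers.get))                # alpha
```

Sliding such a window along each EEG channel gives band values at window centres, which is why interpolation back to every time stamp is still needed afterwards.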

Deriving potential difference from electrodes

In this experiment, 20 electrodes were placed on the scalp. There are different arrangements (montages) for placing the electrodes on the head. The potential difference between two electrodes reflects the activity going on in that region.

There are different montage systems, but the commonly used one is the second one in the above figure.
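
As a sketch, a potential-difference feature is just the difference between two electrode columns. The pairs below follow the left temporal chain of a longitudinal bipolar montage and are used only as an illustration; the pairs in the original solution may differ, and the function name and sample values are hypothetical.

```python
def bipolar_differences(row, pairs):
    """Potential difference for each (first, second) electrode pair.

    row:   mapping from EEG column name to its value in microvolts.
    pairs: iterable of (first, second) column-name tuples.
    """
    return {f"{a}_minus_{b}": row[a] - row[b] for a, b in pairs}

# Illustrative pairs only; not necessarily the montage used in the solution.
PAIRS = [("eeg_fp1", "eeg_f7"), ("eeg_f7", "eeg_t3"),
         ("eeg_t3", "eeg_t5"), ("eeg_t5", "eeg_o1")]

sample = {"eeg_fp1": 12.0, "eeg_f7": 7.5, "eeg_t3": -1.0,
          "eeg_t5": 3.0, "eeg_o1": 3.0}
diffs = bipolar_differences(sample, PAIRS)
print(diffs)
```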

Correlation Matrix: removing unimportant features from dataset

As the final step of feature engineering, I computed the correlation matrix of the features with the event. From the correlation matrix we find that:

The effect of crew and ecg on predicting the event is high.
The feature seat is weakly correlated with the event, so we can remove it.
The potential differences have more impact on predicting the events than the raw EEG electrode data.
The theta feature has a considerable correlation with the event, so it will be useful in predicting it.
The alpha low frequency band derived from the EEG has more effect in determining the events.
The beta frequency band derived from the EEG has some effect in determining the events, but the correlation is not very high. Its effect might be similar to that of the 20 raw EEG electrode signals.

First of all, we remove unimportant features such as ‘seat’ and ‘experiment’ (we want to predict an event without knowing the experiment). After that, we also remove the derived ‘beta’ and ‘gamma’ frequency bands, because they have low correlation with the event.
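
A correlation screen of this kind can be sketched as follows. It assumes the event labels A to D are encoded as integers 0 to 3; the threshold, the function name, and the synthetic features are hypothetical and only demonstrate the mechanics.

```python
import numpy as np

def weakly_correlated(features, target, threshold):
    """Names of features whose |Pearson r| with target is below threshold.

    features: mapping from feature name to a 1-D array of values.
    target:   1-D array of integer-encoded event labels.
    """
    target = np.asarray(target, dtype=float)
    weak = []
    for name, values in features.items():
        r = np.corrcoef(np.asarray(values, dtype=float), target)[0, 1]
        if abs(r) < threshold:
            weak.append(name)
    return weak

rng = np.random.default_rng(0)
event = rng.integers(0, 4, size=2000)                 # A..D encoded as 0..3
features = {
    "informative": event + rng.normal(0, 0.5, 2000),  # tracks the label
    "noise": rng.normal(0, 1, 2000),                  # unrelated to it
}
to_drop = weakly_correlated(features, event, threshold=0.2)
print(to_drop)
```

Encoding the classes as integers imposes an arbitrary order on them, so a screen like this is only a rough first filter.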

Preprocessing

Here I standardized the numerical features, using StandardScaler for this purpose.
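
The computation behind this step is simple: subtract each column's training mean and divide by its training standard deviation. The NumPy sketch below mirrors that under the assumption that the statistics are fitted on the train set and then reused on the test set; the function names and toy arrays are illustrative.

```python
import numpy as np

def fit_standardizer(train):
    """Per-column mean and standard deviation of the training matrix."""
    mean = train.mean(axis=0)
    std = train.std(axis=0)          # population std (ddof=0), as StandardScaler uses
    std[std == 0] = 1.0              # leave constant columns unscaled
    return mean, std

def standardize(data, mean, std):
    """Apply previously fitted statistics to any matrix with the same columns."""
    return (data - mean) / std

train = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
test = np.array([[2.0, 400.0]])

mean, std = fit_standardizer(train)
train_scaled = standardize(train, mean, std)
test_scaled = standardize(test, mean, std)
print(train_scaled)   # each column now has mean 0 and std 1
print(test_scaled)    # scaled with the training statistics
```

Reusing the training statistics on the test set keeps both on the same scale, which matters here because the train and test distributions already differ.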

Modeling

For the modelling stage, I first fit a random model and noted its log loss, then trained a linear model, logistic regression. I got a considerable reduction in log loss compared to the random model. Since the dataset contains about 105 features, ensemble models are expected to perform better, so I then trained LightGBM.
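
The progression can be sketched roughly as below. The random model here draws random scores and normalises them into probabilities; this is one reasonable reading of "random model", not necessarily the exact baseline used. The LightGBM hyperparameters are placeholders, and the labels are synthetic stand-ins for the encoded events.

```python
import numpy as np

def log_loss(y_true, probs, eps=1e-15):
    """Multiclass log loss for integer labels and row-wise probabilities."""
    p_true = np.clip(probs[np.arange(len(y_true)), y_true], eps, 1 - eps)
    return float(-np.log(p_true).mean())

def random_model_probs(n_rows, n_classes, rng):
    """Random scores normalised so each row sums to 1."""
    scores = rng.random((n_rows, n_classes))
    return scores / scores.sum(axis=1, keepdims=True)

def fit_lightgbm(X_train, y_train):
    """Gradient-boosted trees; the hyperparameters are placeholders."""
    from lightgbm import LGBMClassifier  # third-party; pip install lightgbm
    model = LGBMClassifier(objective="multiclass", n_estimators=200, learning_rate=0.1)
    return model.fit(X_train, y_train)

rng = np.random.default_rng(42)
y_val = rng.integers(0, 4, size=10_000)           # stand-in for encoded events
random_probs = random_model_probs(len(y_val), 4, rng)
baseline = log_loss(y_val, random_probs)
print(f"random-model log loss: {baseline:.3f}")
# A trained model would be scored the same way:
# log_loss(y_val, fit_lightgbm(X_train, y_train).predict_proba(X_val))
```

Any trained model should beat the random-model score; if it does not, something is wrong in the pipeline.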

Future Improvements

Noise removal could be applied to obtain cleaner physiological data.
Try more hyperparameter tuning and fit other complex ensemble models.
Try deep learning models.

Results

I got the minimum log loss when I trained on the data using LightGBM.

You can find my complete solution in my GitHub repository, and if you have any suggestions, please contact me via LinkedIn.