Statistical Analysis to Uncover Factors Influencing Healthcare Insurance Billings

13 min read · Jul 13, 2023

Hi! Here are the sections covered in this article:

Introduction
Descriptive Statistics
Conditional Probability
Continuous Variables Analysis
Correlation Analysis
Statistical Testing
Conclusion
References

Introduction

Insurance is important and worth considering as part of anyone’s planning for the future, especially now that we have experienced how frightening a pandemic can be. Indonesia’s Financial Services Authority (OJK) said that the insurance trend in Indonesia rose during the pandemic, particularly from the end of 2020 into 2021, which creates even greater potential and opportunities for insurtech.

The basic mechanism applied in insurance is that insurance users are required to regularly pay a certain amount of money (premium) to the insurance company. The premium will be managed and processed by the insurance company to pay for the users’ health bills.

Problems arise because many insurance companies find it difficult to determine the premium amount, since many factors can influence and raise a user’s risk profile.

So in this article I will conduct an analysis focused on statistical methods to find the relationship between specific variables and the health bills each user receives.

Data Source Description

The data used in this project is personal health billing data, which consists of 1,338 customer records with 7 variables. These variables are:

age: Age of the main person getting insurance
sex: Gender of the person getting insurance, female or male
bmi: Body mass index, which tells us if the person’s weight is relatively high or low compared to their height. It’s a number calculated from the person’s weight and height (kg/m²), and the ideal range is 18.5 to 24.9.
children: Number of kids covered by the insurance/Number of dependents
smoker: Whether the person is a smoker or not
region: The area where the person lives in the country, could be northeast, southeast, southwest, or northwest
charges: The amount of money billed by the insurance for individual medical costs
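To make the bmi variable concrete, here is a minimal sketch of how the index is computed from weight and height. The function names, the example person, and the category cut-offs are assumptions for illustration; the dataset itself already ships with a ready-made bmi column.

```python
def bmi(weight_kg: float, height_m: float) -> float:
    """Body mass index in kg/m^2: weight divided by height squared."""
    if height_m <= 0:
        raise ValueError("height must be positive")
    return weight_kg / height_m ** 2


def bmi_category(value: float) -> str:
    """Label a BMI value using commonly cited cut-offs (assumed here)."""
    if value < 18.5:
        return "underweight"
    if value < 25:
        return "normal"
    if value < 30:
        return "overweight"
    return "obese"


# Hypothetical person: 70 kg and 1.75 m tall.
example = bmi(70.0, 1.75)
print(round(example, 2), bmi_category(example))
```

The same cut-offs can be reused later to split customers into normal-weight and obese groups.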

Analysis

Now we’re gonna dive into the analysis phase to uncover some juicy insights that will eventually lead us to a grand conclusion. Get ready, ’cause this one’s gonna have a heavy dose of statistics. Buckle up!

Descriptive Statistics

Descriptive statistics is all about summarizing and organizing a set of sample data so that we can describe its main features. It’s like getting the lowdown on our data real quick. This stuff is super important because it helps us understand the ins and outs, the nitty-gritty details, and the overall patterns of our data. And guess what? That’s gonna be our compass for making decisions and picking the right models. So yeah, descriptive statistics is where the magic begins!

In this data, the average customer age is 39.2 years. That’s the sweet spot, you know! At this age, according to some articles, people tend to be more financially and emotionally stable, so it’s no surprise that our age data huddles around that number. Oh, and by the way, the youngest customer is 18 years old, while the oldest is a seasoned 64.

Healthcare bills are in the spotlight here, and it turns out that the average healthcare charges for smokers are a whopping 3.8 times higher than the average for non-smokers. This suggests that smoking has a large impact on healthcare costs. Customers with a BMI above 30 also tend to have higher bills, about 1.5 times those of customers with a BMI below 25. The gap is much smaller than for smoking, but BMI still appears to have some influence on healthcare charges.
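A ratio like this comes from comparing group means. Here is a minimal sketch of that calculation. The rows below are invented for illustration and are not taken from the real dataset, and the helper name mean_charges is an assumption.

```python
from statistics import mean

# Hypothetical (smoker, charges) rows; not taken from the real data.
rows = [
    ("yes", 30000.0), ("yes", 34000.0),
    ("no", 8000.0), ("no", 6000.0), ("no", 10000.0),
]


def mean_charges(rows, smoker):
    """Average charges for customers whose smoker flag matches."""
    return mean(charges for flag, charges in rows if flag == smoker)


ratio = mean_charges(rows, "yes") / mean_charges(rows, "no")
print(f"smokers are billed {ratio:.1f}x the non-smoker average")
```

The same pattern works for the BMI comparison by swapping the smoker flag for a BMI group label.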

When I looked at the distribution of healthcare charges for each region, I found some interesting insights. It seems like there’s a sense of uniformity or fairness in healthcare insurance bills across all regions based on the distribution patterns and similar median values. It suggests that regional factors don’t have a significant impact on the cost of healthcare services. So, it looks like where you live doesn’t play a big role in determining the size of your healthcare bill.

This can also be seen from each region’s share of total healthcare charges, which is fairly evenly distributed.

The largest share is in the southeast, which accounts for 30.21% of total healthcare charges across all regions. Taken together, these two pieces of information suggest that there are no large regional differences in healthcare utilization, treatment costs, or population characteristics. So we need to consider other factors, such as disease prevalence, health profiles, or each customer’s individual habits.
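The two regional views used here, the median bill per region and each region’s share of total charges, can be sketched as follows. The rows are made up for illustration, so the printed shares will not match the article’s figures.

```python
from collections import defaultdict
from statistics import median

# Hypothetical (region, charges) rows; invented for illustration only.
rows = [
    ("southeast", 12000.0), ("southeast", 18000.0),
    ("northeast", 9000.0), ("northeast", 11000.0),
    ("southwest", 8000.0), ("southwest", 12000.0),
    ("northwest", 14000.0), ("northwest", 16000.0),
]

# Group the charges by region.
by_region = defaultdict(list)
for region, charges in rows:
    by_region[region].append(charges)

total = sum(charges for _, charges in rows)
share = {region: sum(values) / total for region, values in by_region.items()}
medians = {region: median(values) for region, values in by_region.items()}

for region in by_region:
    print(f"{region}: {share[region]:.1%} of charges, median {medians[region]:,.0f}")
```

One thing to keep in mind is that a region’s share of total charges also depends on how many customers live there, so it is worth reading it next to the median.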

Conditional Probability

We’ve identified several variables that I believe have an impact on healthcare bills. It’s time to calculate conditional probabilities for various scenarios that reflect the characteristics of each customer so that we can get a more complete picture of the factors influencing insurance charges.

When we want to find the probability of an event occurring given that another event has already occurred, it’s called conditional probability. The formula is P(A | B) = P(A ∩ B) / P(B).

There are a few things to consider about conditional probability:

P(A | B) is the probability that event A occurs given that event B has occurred, and A ∩ B is the event that both A and B occur.
Event B must have a probability greater than 0.
If events A and B are independent, B carries no information about A and P(A | B) = P(A). Conditioning is only informative when B affects the probability of A.

To make conditional probability easier to calculate, we can work with counts of outcomes instead of probabilities.

Events A and B are subsets of the same sample space S, so P(A ∩ B) = n(A ∩ B) / n(S) and P(B) = n(B) / n(S). Dividing one by the other cancels n(S), which leaves a much simpler formula: P(A | B) = n(A ∩ B) / n(B).

Now let’s look at young customers and ask: if someone under 30 years old is a smoker, how likely are they to have high health insurance bills? Written as a conditional probability, the question is P(high bill | age < 30 ∩ smoker).

The result is quite surprising: nearly 99% of young smokers have high health insurance bills. For comparison, I calculated the same probability for young non-smokers, and only 0.084% of them have high health insurance bills. From this, we can conclude that young customers (under 30) who smoke show higher health risks or medical costs than non-smokers in the same age range, which affects the size of their health insurance bills.
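A conditional probability like this can be computed by plain counting. The sketch below uses invented customers and an assumed threshold of $16.7k for a “high” bill, and it checks that the ratio of probabilities and the ratio of counts give the same answer. All names are hypothetical.

```python
from fractions import Fraction

# Hypothetical customers as (age, smoker, charges); invented for illustration.
customers = [
    (22, True, 33000), (27, True, 36000), (25, True, 15000),
    (24, False, 3000), (29, False, 4000), (45, True, 40000),
    (51, False, 9000), (38, False, 21000),
]
HIGH = 16700  # assumed threshold for a "high" bill


def is_high(customer):
    """Event A: the bill is above the threshold."""
    return customer[2] > HIGH


def is_young_smoker(customer):
    """Event B: under 30 years old and a smoker."""
    return customer[0] < 30 and customer[1]


def prob(event, space):
    """Share of outcomes in `space` for which `event` holds."""
    return Fraction(sum(1 for c in space if event(c)), len(space))


# P(A | B) = P(A and B) / P(B)
p_both = prob(lambda c: is_high(c) and is_young_smoker(c), customers)
p_ratio = p_both / prob(is_young_smoker, customers)

# Equivalent shortcut: count only inside B, n(A and B) / n(B).
young_smokers = [c for c in customers if is_young_smoker(c)]
p_count = prob(is_high, young_smokers)

print(p_ratio, p_count)
```

Using Fraction keeps the arithmetic exact, which makes it easy to see that both routes agree.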

Next, I’m going to analyze how likely it is for men to have high health insurance bills if they are obese. It’s important to note that whether someone is normal weight, overweight, or obese is assessed using BMI. Generally, someone is considered obese if their BMI is 30 or above, and a normal weight corresponds to a BMI of 18.5 to 24.9.

Just like before, this question can be written as a conditional probability: P(high bill | male ∩ obese).

The calculation shows that obese men have a 35.8% chance of having high health insurance bills, while men with a normal weight have a 30% chance of having above-average bills. That is a difference of about 6 percentage points, which is not large enough for me to conclude that BMI is the main factor influencing the size of health insurance bills.

Continuous Variables Analysis

In the world of statistics, we are familiar with the terms Cumulative Distribution Function (CDF) and Probability Density Function (PDF). The CDF of a random variable X is denoted as Fₓ(x), which is a function from R to [0,1] defined as Fₓ(x) = P(X ≤ x). If we visualize the CDF graph for a discrete random variable, it will look like a staircase that goes up.

However, for continuous random variables, the CDF doesn’t have “steps”. It rises smoothly across the entire range of values, without the sudden jumps seen for discrete random variables. Where the PDF gives the density at each point, the CDF is the cumulative integral of the PDF, denoted by the formula F(x) = ∫[−∞, x] f(t) dt.

Now, let’s try to find the probability values using the CDF to determine which is more likely: a smoker with a BMI above 25 or a smoker with a BMI below 25 to have health insurance bills above $16.7k.

It turns out that a smoker with a BMI above 25 is more likely to have health insurance bills above $16.7k, with a probability of 99.99%. Remember that we previously found that young smokers have a very high chance of high healthcare bills, almost 99% (98.8% to be specific). So even though this analysis focused on the BMI variable, the biggest contributing factor appears to be the smoker variable. Oh, by the way, the average healthcare bill is around $13.2k, so $16.7k is already quite high.
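A probability of the form P(X > x) is just 1 − F(x). Here is a minimal sketch using an empirical CDF built directly from a sample. This is an assumption on my side for illustration: a CDF from a fitted distribution would give different numbers, and the charges below are invented.

```python
from bisect import bisect_right


def ecdf(sample):
    """Return F, where F(x) is the share of observations <= x."""
    ordered = sorted(sample)
    n = len(ordered)

    def F(x):
        # bisect_right counts how many sorted values are <= x.
        return bisect_right(ordered, x) / n

    return F


# Hypothetical charges for one group of customers.
charges = [2000, 5000, 9000, 12000, 17000, 21000, 35000, 42000]
F = ecdf(charges)

threshold = 16700
print("P(X <= 16.7k) =", F(threshold))
print("P(X >  16.7k) =", 1 - F(threshold))
```

Building one F per group (for example smokers with BMI above 25 and smokers with BMI below 25) and comparing 1 − F(16700) answers the same kind of question as above.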

Now, let’s take a look at the PDF graph for this case, and here are the results.

As we can see, the mean values of the two distributions are quite different, further confirming our previous conclusion that a smoker with a BMI > 25 has a higher probability of having high healthcare bills compared to a smoker with a BMI < 25.

It’s also important to note that neither distribution is normal. This matters because several of the statistical tests we will run rely on certain assumptions, one of which is normality.

Correlation Analysis

Correlation analysis is a statistical method for measuring the strength and direction of the relationship between two random variables. With a strong positive correlation, when one variable increases, the other is very likely to increase as well.

The strength of the relationship is measured using the correlation coefficient, which ranges from -1 to 1. A value of 1 indicates a perfect positive relationship, a value of 0 indicates no linear relationship, and a value of -1 indicates a perfect negative relationship.

Remember! We are only measuring the strength of the relationship between two variables, not causation. This means we cannot conclude that one variable causes another to increase. Believing that correlation implies causation is a common misconception.

For correlation analysis, we typically use the Pearson correlation test, which is a parametric test. As a parametric statistical test, it has certain assumptions that need to be met, including:

Normality assumption
Linearity assumption
Absence of outliers in the data
The data should be of interval or ratio type

However, since we already know that the charges variable is not normally distributed even after transformation, we need a non-parametric alternative. I’m using the Spearman correlation test, which measures the strength of a monotonic relationship based on the ranks of the data rather than the raw values.
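To show what “based on ranks” means, here is a small from-scratch sketch: Spearman’s coefficient is simply the Pearson coefficient computed on ranks, with tied values sharing their average rank. The function names and the toy data are assumptions for illustration. In practice a library routine such as scipy.stats.spearmanr would normally do this for you.

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1 to n; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical data: charges always rise with age, but not in a straight line.
age = [18, 25, 33, 41, 52, 64]
charges = [1200, 1500, 2500, 6000, 20000, 90000]
print(round(spearman(age, charges), 3), round(pearson(age, charges), 3))
```

On data like this, Spearman reports a perfect monotonic relationship while Pearson comes out lower, because the relationship is not linear.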

By using the formula, we can calculate the strength of the relationship between the numeric variables. Here are the results.

The variable with the strongest relationship with healthcare charges is age, with a correlation coefficient of 0.5, which is a moderate relationship. This aligns with our previous analysis of age groups.

However, one variable in the data caught my attention: the children variable. We would not expect the number of children to be an important factor in healthcare charges. Whether someone has 1 or 5 children, the healthcare charges in this data are individual bills and not family-related. Yet the number of children has the second highest correlation with healthcare charges, which suggests that some other characteristic of these customers is behind it.

I tried visualizing the relationship between healthcare charges and age using a scatter plot, considering that age has a strong correlation with healthcare charges. Here are the results.

The graph on the left shows clusters, namely the low-bill cluster, normal-bill cluster, and high-bill cluster. Interestingly, all age groups are evenly spread across the range of healthcare charges, and what sets them apart is the characteristic of being a smoker or non-smoker. Smokers tend to have normal and high healthcare charges, while non-smokers mostly fall into the low-bill cluster, although there are a few in the normal-bill cluster.

The graph on the right indicates that the high-bill cluster is dominated by customers who are obese (30 ≤ BMI ≤ 34.99) and extremely obese (BMI ≥ 35). The normal-bill cluster is dominated by customers who are overweight (25 ≤ BMI ≤ 29.99) and of normal weight (18.5 ≤ BMI ≤ 24.99). The low-bill cluster consists of a mix of all BMI groups.

To further strengthen our argument about smokers having high healthcare charges, I attempted to visualize the relationship between healthcare charges and BMI, and here are the results.

The graph clearly depicts the difference in healthcare charges between smokers and non-smokers.

Statistical Testing

Feeling tired already? Alright, this is our final step before we finally get some rest. Honestly, I’m writing this at night when my body is already tired and sleepy, haha.

Based on the series of analyses we have conducted, there are two general hypotheses that we want to test:

Smokers have higher healthcare charges compared to non-smokers.
Healthcare charges are higher for individuals with obesity compared to those with normal BMI.

Pay attention to the statements we want to test, as they are all related to healthcare charges, which is a continuous variable. This means that there are at least two methods for hypothesis testing:

Parametric test: Independent t-test
Non-parametric test: Mann-Whitney U test

A parametric test is not appropriate here because our data is not normally distributed, so we use the Mann-Whitney U test instead. Although it is a non-parametric test, there are still some assumptions that need to be met, including:

The variable being compared is ordinal or continuous.
Normality is not required, which is why the test suits our data.
The data consists of two independent, randomly selected samples.
The sample size is sufficient, typically more than 5 observations in each group.

These assumptions have been met, so let’s proceed with the hypothesis testing.
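Before looking at the results, here is a rough from-scratch sketch of what a one-sided Mann-Whitney U test computes. It uses the normal approximation without continuity or tie correction, so it is an illustration under those assumptions and not a replacement for a library routine such as scipy.stats.mannwhitneyu. The function name and the sample data are hypothetical.

```python
from statistics import NormalDist


def mann_whitney_greater(x, y):
    """One-sided test of H1: values in x tend to be larger than values in y.

    Returns (U for x, approximate p-value). Normal approximation only,
    with no continuity or tie correction.
    """
    n1, n2 = len(x), len(y)
    # U counts the (x, y) pairs where x wins; a tie counts as half a win.
    u = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in x for b in y)
    mu = n1 * n2 / 2                               # mean of U under H0
    sigma = (n1 * n2 * (n1 + n2 + 1) / 12) ** 0.5  # std. dev. of U under H0
    p_value = 1 - NormalDist().cdf((u - mu) / sigma)
    return u, p_value


# Hypothetical charges; invented for illustration only.
smokers = [21000, 33000, 36000, 40000, 45000, 47000]
non_smokers = [2000, 3000, 4000, 6000, 9000, 11000, 23000]

u, p = mann_whitney_greater(smokers, non_smokers)
print(f"U = {u}, p = {p:.4f}")
if p < 0.05:
    print("Reject H0: the first group tends to have higher charges.")
```

A large U means that in most pairs the smoker’s bill beats the non-smoker’s bill, and the p-value tells us how surprising that would be if both groups came from the same distribution.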

Case 1

The hypotheses for the first case are as follows:

Null Hypothesis (H0): There is no difference in healthcare charges between smokers and non-smokers.
Alternative Hypothesis (H1): Healthcare charges for smokers are higher than for non-smokers.

From the Mann-Whitney test, we obtained a U statistic of 284133.0 and a p-value of 2.635e-130. The p-value is far below the significance level α = 0.05, so there is strong evidence to reject the null hypothesis (H0). In other words, healthcare charges for smokers are higher than for non-smokers.

Case 2

The hypothesis for the second case is:

Null Hypothesis (H0): There is no difference in healthcare charges between customers with obesity and customers with a normal BMI.
Alternative Hypothesis (H1): Healthcare charges for customers with obesity are higher than for customers with a normal BMI.

From the Mann-Whitney test, we obtained a U statistic of 88964.0 and a p-value of 0.00104. The p-value is below the significance level α = 0.05, so there is sufficient evidence to reject the null hypothesis (H0). In other words, healthcare charges for individuals with obesity are higher than for individuals with normal weight.

Conclusion

FINALLY, WE’VE REACHED THE END!!!! There are several things we can answer from our initial question about the factors that influence healthcare charges.

The smoker variable is the factor most strongly associated with differences in healthcare insurance charges. This may be due to health profiles that tend to be unfavorable, resulting in increased healthcare costs.
The BMI variable is also associated with higher healthcare charges, although the relationship between these two variables is not very strong.
The age variable has the highest rank correlation with healthcare charges (0.5), yet every age group is spread across the low, normal, and high bill clusters. Age mainly gives insight into the distribution pattern of customer groups and supports the main variables, smoker and BMI, in explaining the magnitude of healthcare charges.

Thank you for reading until the end. I hope this gave you some interesting insights. I’ve learned a lot through this project, especially in the field of statistics. Once again, thank you so much!

References

1. Let’s Code: Descriptive Statistics Program for CSV Data from Scratch! | by Rivan Hasri | Medium
2. Continuous Random Variable. Topics Covered- CDF , PDF of Continuous… | by Tanav Bajaj | Medium
3. Body mass index — Wikipedia
4. Calculate Your BMI — Standard BMI Calculator (nih.gov)
5. Every statistical test to check feature dependence | by Karun Thankachan | Towards Data Science
6. What is a Monotonic Relationship? | DiscoverPhDs
7. Testing Linear Regression Assumptions in Python — Jeff Macaluso
8. Pearson vs Spearman correlations: practical applications | SurveyMonkey
9. Correlation (Pearson, Kendall, Spearman) — Statistics Solutions
10. Mann-Whitney U Test: Assumptions and Example | Technology Networks
11. Penjelasan Uji Mann Whitney U Test — Lengkap (statistikian.com)

Now, it’s 11:55 PM :)