Home Bayesian Linear Regression Analysis (Insurance Premium Data)
Post
Cancel

Bayesian Linear Regression Analysis (Insurance Premium Data)

Kaggle


Configuration

Import Library

1
2
library(rstan)
library(ggplot2)
  • Apply STAN for analysis

Data Preprocessing

1
2
3
4
5
6
insdata=read.csv("insurance.csv")
insdata$sex <- ifelse(insdata$sex == "male", 0, 1)
insdata$smoker <- ifelse(insdata$smoker == "no", 0, 1)
# sex: male=0, female=1; smoker: no=0, yes=1

insdata=insdata[,c('age','sex','bmi','children','smoker','charges')]
  • Translate the ‘sex’ variable into a binary format: male=0, female=1
  • Translate the ‘smoker’ variable into a numeric format: False=0, True=1
  • The ‘region’ variable is excluded from the analysis as it is categorical.
  • Dependent variable: $log(charges)$, Independent variable: age, sex, bmi, children, smoker


Analysis

Linear regression with non-informative prior

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
model.reg1 = "
data {              
  int<lower=0> n;
  int<lower=0> p;
  matrix[n,p] X;
  vector[n] y;
}
parameters {            
  real alpha;
  vector[p] beta;
  real<lower=0> sigma_sq;  // error scale
}
model { 
  y ~ normal(X * beta + alpha, sqrt(sigma_sq));
  target += -log(sigma_sq);
}
"

sm=stan_model(model_code=model.reg1)
fit1 = sampling(
  object=sm,
  data=list(n=nrow(X), p=ncol(X), X=X, y=y),
  chains = 4,
  warmup = 2500,
  iter = 5000,
  cores = 4
)

beta.hat=get_posterior_mean(fit1,par="beta")[,"mean-all chains"]
alpha.hat=get_posterior_mean(fit1,par="alpha")[,"mean-all chains"]

aggregate(y-(alpha.hat+as.matrix(X)%*%beta.hat), list(group), FUN=var)
  • Since there is no information about the population, a non-informative prior is employed for conducting linear regression analysis.
  • MCMC sample count = $chains * (iter-warmup) / thin = 4 * (5000-2500)/1 = 10,000$


1
print(fit1)

non_regression_result_table

  • With a chain count of 4 and all parameters having an Rhat value of 1, the MCMC has converged.


1
plot(fit1)

non_regression_result_img

  • It is possible to examine the posterior mean of each parameter.
    – age: 0.03, sex: 0.08, bmi: 0.01, children: 0.10, smoker: 1.55
  • It is possible to examine the 95% CI of each parameter.
    – age: (0.03, 0.04), sex: (0.03, 0.12), bmi: (0.01, 0.01), children: (0.08, 0.12), smoker: (1.49, 1.61)


1
traceplot(fit1,pars=c("alpha","beta"))

non_regression_parameter

  • During the chain iteration process, parameters display no persistent long-term trends in their fluctuations. Mean values maintain stability as they converge towards a stationary distribution. This trait serves as an indicator of the MCMC chain’s favorable quality.


1
pairs(fit1,pars=c("sigma_sq","beta[1]"))

non_porterior_density

  • Distribution of Posterior Distribution Density


Linear regression with informative prior

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
model.reg2="
data {
  int<lower=0> n;
  int<lower=1> p;
  int<lower=1> J;
  int<lower=1,upper=J> G[n];
  matrix[n, p] X;
  vector[n] y;
  vector[p] b0;
  matrix[p,p] Sig0;
}
parameters {
  real alpha;
  vector[p] beta;
  vector<lower=0>[J] sigma_sq; 
}
model {
  alpha ~ normal(0, 100);
  beta ~ multi_normal(b0,Sig0);
  for (j in 1:J) {
    sigma_sq[j] ~ inv_gamma(0.001, 0.001);
  }
  for (i in 1:n) {
    y[i] ~ normal(alpha+X[i]*beta, sqrt(sigma_sq[G[i]]));
  }
}
"

sm=stan_model(model_code=model.reg2)
fit2 = sampling(
  object=sm,
  data=list(n=nrow(X), p=ncol(X), J=length(table(group)), G=group, X=X, y=y, b0=rep(0,ncol(X)),Sig0=1000*diag(ncol(X))),
  chains = 4,
  warmup = 2500,
  iter = 5000,
  cores = 4
)
  • The age groups were segmented into their respective 10s, 20s, 30s, 40s, and 50s categories, and a regression model was fitted using an informative prior.
  • MCMC sample count = $chains * (iter-warmup) / thin = 4 * (5000-2500)/1 = 10,000$


1
print(fit2,digits_summary=8)

non_regression_result_img

  • It is possible to examine the posterior mean of each parameter.
    – age: 0.03, sex: 0.04, bmi: 0.01, children: 0.07, smoker: 1.32
  • It is possible to examine the 95% CI of each parameter.
    – age: (0.031, 0.035), sex: (0.002, 0.077), bmi: (0.007, 0.013), children: (0.055, 0.086), smoker: (1.273, 1.376)
  • With a chain count of 4 and Rhat values close to 1, the MCMC has converged.


1
plot(fit2, pars="sigma_sq", show_density = TRUE)

non_regression_result_img

  • Distribution of Posterior Distribution Density


결과

The final regression equation $ log(charges) = 7.144 + 0.033 * 𝑎𝑔𝑒 + 0.040* 𝑠𝑒𝑥 + 0.010 * 𝑏𝑚𝑖 + 0.071 * 𝑐h𝑖𝑙𝑑𝑟𝑒𝑛 + 1.324 * 𝑠𝑚𝑜𝑘𝑒𝑟 $

This post is licensed under CC BY 4.0 by the author.

문자열 검색 알고리즘: 브루트-포스, 보이어-무어, KMP

-