Feature Articles

Model-Based and Nonparametric Approaches to Clustering for Data Compression in Actuarial Applications

Pages 107-146 | Published online: 04 Nov 2016
 

Abstract

Actuaries use clustering as a data compression step to make massive or nested stochastic simulations practical to run. A large data set of assets or liabilities is partitioned into a user-defined number of clusters, each of which is compressed to a single representative policy. The representative policies can then simulate the behavior of the entire portfolio over a large range of stochastic scenarios. Such processes are becoming increasingly important in understanding product behavior and assessing reserving requirements in a big-data environment. This article proposes a variety of clustering techniques that can be used for this purpose. Initialization methods for performing clustering compression are also compared, including principal components, factor analysis, and segmentation. A variety of methods for choosing a cluster's representative policy is considered. A real data set comprising variable annuity policies, provided by Milliman, is used to test the proposed methods. It is found that the compressed data sets produced by the new methods, namely, model-based clustering, Ward's minimum variance hierarchical clustering, and k-medoids clustering, can replicate the behavior of the uncompressed (seriatim) data more accurately than those obtained by the existing Milliman method. This is verified within sample by comparing location variable totals of the representative policies with those of the uncompressed data at the five levels of compression of interest. More crucially, it is also verified out of sample by comparing the distributions of the present values of several variables after 20 years across 1000 simulated scenarios based on the compressed and seriatim data, using Kolmogorov-Smirnov goodness-of-fit tests and weighted sums of squared differences.

Notes

Ratchet—one means by which benefit bases for variable annuity policyholders can grow. “Ratchet” generally means that a policyholder's benefit base will reset to the maximum of the current value or a set of previous values (as money grows in equity/bond funds). The frequency of these “resets” is specified in policyholder contracts.

Rollup—another means by which benefit bases for variable annuity policyholders can grow. “Rollup” generally implies that a policyholder's benefit base will grow at a specified rate of interest until a specified time or policyholder action, again specified in the policyholder contract.

ROP—stands for “return of premium,” a standard guarantee in variable annuity contracts where the policyholder is generally guaranteed a benefit base equivalent to the initial premium he or she paid.
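
The three benefit-base mechanisms described in these notes can be illustrated with a minimal sketch. It is not taken from the article: the function names, the use of the account value at each reset date as the ratchet comparison, and the annually compounded rollup rate are all simplifying assumptions, and real contracts specify these details differently.

```python
def ratchet_base(initial_base, account_values_at_resets):
    """Hypothetical ratchet: at each reset date the benefit base becomes
    the larger of the current base and the account value on that date."""
    base = initial_base
    path = []
    for account_value in account_values_at_resets:
        base = max(base, account_value)  # the base never falls
        path.append(base)
    return path


def rollup_base(initial_base, annual_rate, years):
    """Hypothetical rollup: the benefit base grows at a fixed rate,
    compounded annually (an assumption), for a given number of years."""
    return initial_base * (1.0 + annual_rate) ** years


def rop_base(initial_premium):
    """Return of premium: the benefit base equals the initial premium."""
    return initial_premium


# Account value rises, falls, then rises again; the ratcheted base only steps up.
print(ratchet_base(100, [110, 95, 120]))  # [110, 110, 120]
```

Under these assumptions a ratchet depends on the path of fund performance, a rollup depends only on elapsed time, and ROP is constant.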

One thousand economic scenarios was the maximum number available for the purposes of the analysis in this article. However, the methods detailed have also been tested and shown to work well across 4000 scenarios in a related piece of research using clustering for mixed actuarial data in conjunction with Aegon. See http://mathsci.ucd.ie//docserve?id=146 and use PIN = 6317 for full details.

Under the k-means approach, observations are assigned to the cluster whose mean is closest in squared Euclidean distance, cluster means are then recalculated, and the process is repeated until no observations change cluster membership. Since the arithmetic mean of the observations in a cluster is the least squares estimate of the true cluster mean, this approach minimizes the total within-cluster sum of squares at each step. In turn, this is equivalent to applying Ward's minimum variance approach. If the k-means algorithm achieves the global minimum within-cluster variance and not just a locally optimal partition, the cluster membership (and corresponding parameter estimates) will be the same as if the EM algorithm is used to fit the EII model-based clustering approach and arrives at the maximum likelihood estimates of the cluster means (in fact, Friedman (1989) refers to the EII model as the nearest-means classifier). However, k-means requires an initial partition (usually random) of observations into the desired number of clusters, and different initializations can produce different clustering solutions at convergence. This is not the case with Ward's method, which will always produce the same partition at convergence.
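
The assign-and-recompute loop in this note, and its sensitivity to initialization, can be sketched as follows. This is an illustrative pure-Python version, not the implementation used in the article. The function names are hypothetical, and the sketch starts from explicit initial cluster means rather than a random initial partition, so that the effect of the starting point is reproducible.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_point(points):
    """Coordinate-wise arithmetic mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def kmeans(points, initial_means, max_iter=100):
    """Alternate assignment and mean updates until memberships stop changing."""
    means = list(initial_means)
    k = len(means)
    labels = None
    for _ in range(max_iter):
        # Assign each observation to the cluster with the closest mean.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, means[j]))
            for p in points
        ]
        if new_labels == labels:  # no observation changed cluster
            break
        labels = new_labels
        # Recalculate each cluster mean from its current members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old mean if a cluster is empty
                means[j] = mean_point(members)
    return labels, means


def within_cluster_ss(points, labels, means):
    """Total within-cluster sum of squares for a given partition."""
    return sum(squared_distance(p, means[lab]) for p, lab in zip(points, labels))


# Four corners of a 4-by-1 rectangle: two starting points, two converged answers.
corners = [(0.0, 0.0), (0.0, 1.0), (4.0, 0.0), (4.0, 1.0)]
good = kmeans(corners, [(0.0, 0.5), (4.0, 0.5)])
poor = kmeans(corners, [(2.0, 0.0), (2.0, 1.0)])
print(good[0], within_cluster_ss(corners, *good))  # [0, 0, 1, 1] 1.0
print(poor[0], within_cluster_ss(corners, *poor))  # [0, 1, 0, 1] 16.0
```

Both runs stop because no observation changes cluster, but the second is only a locally optimal partition, with a within-cluster sum of squares sixteen times larger than the first.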
