Viewpoint

A Review of the Performance Measurement of Long-Term Mutual Funds

Abstract

We review the major models of mutual fund performance: (1) using return data to evaluate equity funds—from single to multi-index models, (2) measuring passive portfolio performance, (3) using holdings-based performance measures, (4) measuring timing ability, and (5) measuring bond fund performance. We conclude with a discussion of issues affecting performance measurement: data sources and bias, missing factors, and improvements to benchmarks.

Disclosure: The authors report no conflicts of interest.

The Financial Analysts Journal has a long history of discussing efficient markets, active versus passive management, and mutual fund performance. Now is a good time to review the major models of mutual fund performance, to take stock of where we are, and to examine the future of performance measurement.

The sheer size of the industry around the world and in the United States has made the choice of models of key importance. Worldwide, as of the end of 2018, 118,978 mutual funds were managing $46.7 trillion of assets. The US mutual fund industry had $21.4 trillion under management [$3.4 trillion of the total was invested in exchange-traded funds (ETFs)]. The US asset management industry was larger in terms of assets under management than that of any other country or region. Mutual funds were of key importance to the US investor; more than 43% of US families owned mutual funds. The US investor interested in a long-term mutual fund had to choose among 3,200 equity funds and 2,114 bond funds. The need for guidance in making this important choice has led to the development of a rich literature on how to measure performance of mutual funds.Footnote1 The purpose of this article is to review that literature.

We start with a discussion of performance measurement for active equity mutual funds—that is, mutual funds that attempt to select securities to add value above a set of benchmarks, which we will describe. This section includes a discussion of single-index models and a discussion of various types of multi-index models. We then turn to performance measurement for passive portfolios, evaluation methods that rely on mutual fund holdings in addition to returns, and the question of whether mutual funds have timing ability. Up to this point, we will have been discussing measures of performance for equity funds. Next, we discuss the measures that have been developed for bond funds. Finally, we deal with some controversial issues and point out some future directions for research.

Using Return Data to Evaluate Equity Funds

The performance of equity funds has received the most attention in the literature, probably because investors place almost twice as much of their capital in these funds as they do in bond funds. In this section, we start with a discussion of the attempt to use a single index to evaluate funds. We then turn to a discussion of several types of multi-index models. Next, we examine issues involved in using holdings data to measure performance and models of investment timing.

Single-Factor Performance Measures.

The first attempts to take risk into account in measuring fund performance involved the introduction of the single-factor model by Sharpe (1966) and Jensen (1968). Jensen’s alpha is the intercept of a time-series regression of the return on a mutual fund minus the riskless rate of interest ($R_{it} - R_{ft}$) against the excess return on the market portfolio ($R_{mt} - R_{ft}$). This intercept is defined from the following equation:

$$R_{it} - R_{ft} = \alpha_i + \beta_i (R_{mt} - R_{ft}) + e_{it}. \tag{1}$$

The theoretical foundation of this model is the capital asset pricing model (CAPM). Alpha measures how much better a fund did than the CAPM implies it should do. An alternative explanation is how much better a fund did than a combination of the market portfolio and lending and borrowing at the riskless rate of interest such that the combination has the same risk, βi, as the particular fund being examined.
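
The regression in Equation 1 can be illustrated with a minimal sketch. The function name `jensen_alpha` and all return figures are hypothetical and chosen only for illustration; the fund series is constructed without noise so that the recovered alpha and beta are known in advance.

```python
from statistics import mean

def jensen_alpha(fund_returns, market_returns, riskfree_returns):
    """Return (alpha, beta) from an OLS regression of fund excess
    returns on market excess returns, as in Equation 1."""
    y = [r - rf for r, rf in zip(fund_returns, riskfree_returns)]
    x = [m - rf for m, rf in zip(market_returns, riskfree_returns)]
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    beta = sxy / sxx                 # slope: sensitivity to the market
    alpha = y_bar - beta * x_bar     # intercept: Jensen's alpha
    return alpha, beta

# Hypothetical monthly data (assumptions, not taken from any study).
market = [0.02, -0.01, 0.03, 0.00, 0.01, -0.02]
riskfree = [0.001] * 6
# Fund built to have alpha = 0.002 per month and beta = 1.2, with no noise.
fund = [0.001 + 0.002 + 1.2 * (m - 0.001) for m in market]

alpha, beta = jensen_alpha(fund, market, riskfree)
print(alpha, beta)
```

With real data the residual term is not zero, so the estimated alpha carries sampling error and would normally be reported with a standard error or t-statistic.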

The Sharpe ratio is the return on a portfolio minus the riskless rate divided by the extra risk taken. Risk is measured by the standard deviation of the fund’s return. Maximizing the Sharpe ratio is equivalent to picking the fund that has the highest excess return per unit of risk. In evaluating a fund, the Sharpe ratio of the fund is often compared with the Sharpe ratio of the market. Footnote2 If the Sharpe ratio of a fund is higher than the Sharpe ratio of the market, the fund’s alpha will be positive.

Thus, Jensen’s alpha and the Sharpe ratio will identify the same funds as high performing, but the two measures will assign different rankings to individual funds.
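
As a rough sketch of the comparison described above, the Sharpe ratio of a fund can be computed alongside that of the market. The helper name `sharpe_ratio` and the return series are hypothetical; the ratio is left in per-period (here monthly) units rather than annualized.

```python
from statistics import mean, stdev

def sharpe_ratio(returns, riskfree_returns):
    """Average excess return divided by the standard deviation of the
    portfolio's return (sample standard deviation)."""
    excess = [r - rf for r, rf in zip(returns, riskfree_returns)]
    return mean(excess) / stdev(returns)

# Hypothetical monthly returns (assumptions for illustration only).
riskfree = [0.001] * 6
market = [0.02, -0.01, 0.03, 0.00, 0.01, -0.02]
fund = [0.015, 0.0, 0.02, 0.005, 0.01, -0.005]

sharpe_market = sharpe_ratio(market, riskfree)
sharpe_fund = sharpe_ratio(fund, riskfree)
print(sharpe_fund, sharpe_market, sharpe_fund > sharpe_market)
```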

The academic profession has moved beyond evaluating performance on the basis of a single measure for risk. The investment community, however, has not. Jensen’s alpha and the Sharpe ratio are reported by Morningstar, used by many asset managers, and referred to frequently in the press.

Jensen’s alpha is usually based on a regression with a market index, and the Sharpe ratio is usually compared with the Sharpe ratio for the market, but a fund manager could substitute for the market portfolio an index that purports to better represent the investment policy of the fund. The index used is often based on the classification of the fund by a popular service such as Morningstar or Lipper. Sometimes, it is based on the fund’s primary “prospectus benchmark.” Footnote3

A number of studies have found that the prospectus benchmark or an external benchmark is not the best way to characterize a fund and leads to biases in alpha and fund performance. For example, Sensoy (2009) and Elton, Gruber, and Blake (2014) found that the prospectus benchmark is chosen in a way that overstates performance. Cremers, Petajisto, and Zitzewitz (2012) found that funds typically hold a large number of stocks that do not match the benchmark.

Although most authors agree that more than one measure of risk is necessary to capture management performance, several argue that performance relative to the single-index model best captures investor behavior. Barber, Huang, and Odean (2016) and Berk and van Binsbergen (2016) showed that the single-index model based on the market portfolio explains investor decisions better than the multifactor models discussed in the following paragraphs. Berk and van Binsbergen claimed that this finding is proof that the CAPM is the best model for estimating fund performance. Cremers, Fulkerson, and Riley (2019) provided some support for the use of the prospectus benchmark because investors tend to use that benchmark when making investment decisions, even when the benchmark is a poor fit to what the fund is doing.

Mutual funds have multifaceted behavior and hold various types of stocks. Recognition of this diversity and the desire to capture these multiple dimensions of risk led to the development of multi-index models.

Multi-Index Models.

In this section, we explore the general form of multi-index models and then discuss some specific models found in the literature of financial economics.

General model.

Both Jensen’s alpha and the Sharpe ratio have generalized forms in a multi-index world. The generalized Jensen measure has its theoretical underpinnings in arbitrage pricing theory (APT). The return-generating function or risk model can be written in a form analogous to Equation 1:

$$R_{it} = \alpha_i + \sum_{k=1}^{K} \beta_{ik} I_{kt} + e_{it}, \tag{2}$$

where the I’s are “pervasive influences” and the betas are the sensitivity of portfolio i to each of the pervasive influences. Footnote4 Note that the alpha from this equation can be interpreted as how much better the fund did than an APT model said it should do. It can also be interpreted as how much better a fund did than a combination of indexes that has the same risk as the fund. Both of these interpretations have played a role in the development of subsequent literature.

Similarly, in Sharpe (1994), the Sharpe ratio was redefined as the generalized Sharpe ratio, which is the ratio of the average return on the fund minus the average return on its benchmark divided by the standard deviation of the difference between the returns on the fund and its benchmark. The generalized Sharpe ratio is often called the “information ratio.” The return on the benchmark for any fund may be estimated in several ways, but it is usually estimated as the sum of the products of the betas and the I’s in Equation 2.
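
The generalized Sharpe ratio can be sketched as follows, with the benchmark built as the sum of the products of betas and index returns. The function names, the betas, and the return series are hypothetical; in practice the betas would first be estimated from a regression.

```python
from statistics import mean, stdev

def benchmark_returns(betas, index_returns):
    """Benchmark return each period: sum over k of beta_k * I_kt."""
    periods = len(index_returns[0])
    return [sum(b * idx[t] for b, idx in zip(betas, index_returns))
            for t in range(periods)]

def information_ratio(fund_returns, bench_returns):
    """Mean of (fund - benchmark) divided by the sample standard
    deviation of that difference."""
    active = [f - b for f, b in zip(fund_returns, bench_returns)]
    return mean(active) / stdev(active)

# Hypothetical inputs (assumptions for illustration only).
betas = [0.7, 0.3]
indexes = [
    [0.02, -0.01, 0.03, 0.00],   # index 1 returns
    [0.01, 0.00, -0.01, 0.02],   # index 2 returns
]
fund = [0.019, -0.006, 0.018, 0.009]

bench = benchmark_returns(betas, indexes)
ir = information_ratio(fund, bench)
print(ir)
```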

Although APT stipulates the structure of Equation 2, it provides little guidance as to the definition of the betas or the I’s in the equation. We now turn to identifying and measuring these variables.

Estimating the I’s and betas on statistical grounds.

Roll and Ross (1980) used maximum likelihood factor analysis to estimate statistically the set of I’s and associated betas from Equation 2. Examining the variance–covariance matrix of a set of securities or funds, factor analysis finds a set of k indexes such that once their influence has been removed, the covariance of security returns is as close to zero as possible. The I’s are called “factors,” and the betas are called “factor loadings.” Each of the k factors represents a particular portfolio of stocks. The larger k (the number of factors chosen) is for any set of securities, the smaller will be the covariance between the residuals, so the value of k is important. An analyst usually selects k by looking at the marginal impact on the covariance of the residuals of using (k + 1) factors rather than k factors. What value of k to select is to some extent a matter of judgment. Roll and Ross concluded that five factors were sufficient to describe the variance–covariance matrix for several samples of stocks.

The advantage of this approach is that each of the I’s is the return on a portfolio of stocks. The factor loadings are the betas in Equation 2. Thus, the model is specified with no need to use theory to choose indexes. Connor and Korajczyk (1986) and Lehmann and Modest (1988) refined the mathematical technique of extracting factors from historical returns. Lehmann and Modest applied the factors (indexes) they found to evaluating mutual funds. Song and Zhao (2018) applied factor analysis to mutual funds rather than to stocks. Their reasoning was that if one wants to search for factors that explain mutual fund returns, one should extract the factors directly from the matrix of returns on mutual funds. Based on similar reasoning, Elton, Gruber, and Blake (1999) also extracted factors from mutual fund returns.

This approach is appealing because of the absence of a need to specify indexes on a priori grounds, but it has not been widely used because of the difficulty in identifying the economic meaning of the factors, the fact that the factors are unique only up to a linear transformation, and most importantly, the lack of stability of the composition of the factors over time. Also of key importance in using the approach is that after the first factor, the remaining factors are portfolios of stocks that involve a large amount of short selling. Therefore, it would be difficult, if not impossible, for most mutual funds to duplicate these factors.

Estimating factors on the basis of characteristics of mutual funds.

A number of models have been produced that estimate the I’s as returns on portfolios of the types of securities that are hypothesized to capture the relevant broad factors influencing a fund’s returns. These models differ in two ways. The first difference is that some authors used indexes constructed by commercial index providers whereas other authors constructed their own indexes (which we refer to as “author-constructed indexes”). The second difference is that some authors used an equity index minus the US T-bill rate to define any I, but the majority of authors used the difference between two indexes.

Many models discussed in the study of mutual funds use commonly available indexes or portfolios to capture additional influences beyond the market that affect returns. Sharpe (1992) produced the first complete model to do so. He used 12 commercial indexes to benchmark the performance of all mutual funds: those holding both debt and equity securities in foreign as well as domestic markets. In addition to using regression analysis to capture the sensitivities (betas) of a fund to each of the indexes, Sharpe (1992) constrained the betas for each fund to be greater than or equal to zero and to add up to 1.0. These requirements allowed the betas to be directly interpreted as portfolio weights. Therefore, the style of any fund could be determined from its betas. This method has become known as “return-based style analysis.” Although the constraints mean that the model does not fit the data as well as the unconstrained regression (i.e., it has a lower $R^2$), it may produce a more meaningful benchmark because most mutual funds do not or cannot sell securities short.
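
The constrained fit behind return-based style analysis can be sketched with a deliberately simple brute-force search. This is not Sharpe's estimation procedure: it assumes only three hypothetical indexes, searches a coarse grid of long-only weights that sum to 1.0, and minimizes the sum of squared differences between fund and benchmark returns with no intercept. A quadratic programming routine would be the usual choice for 12 indexes.

```python
def style_weights(fund, indexes, steps=20):
    """Grid search over non-negative weights summing to 1 (three indexes
    only) that minimize the squared gap between the fund's returns and
    the weighted index returns."""
    best_w, best_sse = None, float("inf")
    for i in range(steps + 1):
        for j in range(steps + 1 - i):
            w = (i / steps, j / steps, (steps - i - j) / steps)
            sse = 0.0
            for t, r in enumerate(fund):
                fitted = sum(wk * idx[t] for wk, idx in zip(w, indexes))
                sse += (r - fitted) ** 2
            if sse < best_sse:
                best_w, best_sse = w, sse
    return best_w

# Hypothetical index returns (assumptions for illustration only).
large_cap = [0.02, -0.01, 0.03, 0.00, 0.01]
small_cap = [0.04, -0.03, 0.02, 0.01, -0.01]
bonds = [0.003, 0.004, -0.002, 0.005, 0.001]
# Fund constructed as a 60/30/10 mix so the true style is known.
fund = [0.6 * l + 0.3 * s + 0.1 * b
        for l, s, b in zip(large_cap, small_cap, bonds)]

weights = style_weights(fund, [large_cap, small_cap, bonds])
print(weights)
```

Because the weights are non-negative and sum to 1.0, they can be read directly as the passive mix of indexes that best mimics the fund.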

Elton, Gruber, and Blake (1999) also used publicly available indexes to build performance measures for equity funds. Out of seven commercial indexes, they identified four indexes that capture most of the influences affecting fund returns to use to study the performance of domestic equity funds.

Commercial indexes continue to be used in mutual fund studies, but they are often used in conjunction with the three-factor model, which we now discuss.

Estimating factors on the basis of security attributes.

The next big innovation in performance measurement was the establishment of a multifactor model, proposed by Fama and French (1992). Their criterion for selecting any particular factor was that it must be priced in the APT framework. In other words, it must capture the cross-section of stock returns. They investigated several influences that had been identified in the financial economics literature as affecting stock returns and, after empirical investigation, settled on three: the market, company size, and company growth. They investigated several proxies for each of these variables and, unable to find available indexes for size and growth, they constructed their own. For size (small minus big, or SMB), they constructed a portfolio by taking a long position in the 30% of stocks with the smallest market capitalization and a short position in the 30% of stocks with the largest capitalization. Similarly, for growth (high minus low, or HML), they took a long position in a portfolio of high book-to-market stocks and a short position in a portfolio of stocks with low book-to-market ratios. This model is often augmented by a variable introduced by Carhart (1997). Drawing on the work of Jegadeesh and Titman (1993), who found that stocks that had 12 months of high past returns tended to have high future returns, Carhart formulated the momentum factor. The momentum factor is the return on a portfolio of the 30% of stocks with the highest return in the past 12 months minus the return on the 30% of stocks with the lowest return. Footnote5
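
The long-minus-short construction described above can be sketched for a single period. The sketch is a simplification under stated assumptions: portfolios are equal weighted, the 30% cutoffs are applied to the whole hypothetical universe, and the stock data are invented. The published factors use more detailed sorting and weighting rules.

```python
from statistics import mean

def long_short_return(stocks, sort_key, percent=30, long_low=True):
    """Equal-weighted return spread between the bottom and top
    `percent` of stocks when ranked on `sort_key`."""
    ranked = sorted(stocks, key=lambda s: s[sort_key])
    n = max(1, len(ranked) * percent // 100)
    low = mean(s["ret"] for s in ranked[:n])     # lowest-ranked group
    high = mean(s["ret"] for s in ranked[-n:])   # highest-ranked group
    return low - high if long_low else high - low

# Hypothetical universe: market cap, past 12-month return, this period's return.
caps = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
past = [0.10, -0.05, 0.20, 0.00, 0.05, 0.15, -0.10, 0.30, -0.20, 0.25]
rets = [0.04, 0.03, 0.05, 0.02, 0.01, 0.02, 0.03, 0.01, 0.02, 0.00]
stocks = [{"cap": c, "past_ret": p, "ret": r}
          for c, p, r in zip(caps, past, rets)]

# Size: long the smallest 30%, short the largest 30%.
smb = long_short_return(stocks, "cap", long_low=True)
# Momentum: long the 30% with the highest past return, short the lowest 30%.
mom = long_short_return(stocks, "past_ret", long_low=False)
print(smb, mom)
```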

The prominence of the Fama–French three-factor model and the Carhart four-factor model has led to a number of changes in and additions to these models in an attempt to improve their performance.Footnote6 A number of papers have pointed out that the Fama–French and Carhart models fail to correctly price certain anomalies, fail to find zero alphas on passive indexes, and fail to find reasonable alphas for major sectors of the market.

Perhaps the most damaging case against the Fama–French and Carhart models is that their use to evaluate many commercial indexes results in statistically significant positive alphas for some (e.g., the S&P 500 Index) but statistically significant negative alphas for others.Footnote7 This problem is of paramount importance. If an active fund is holding a portfolio that is virtually a duplicate of the S&P 500 but performs slightly worse, the performance will be designated positive because the positive alpha on the S&P 500 itself will swamp the negative increment in alpha resulting from managerial performance.
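
A purely hypothetical numerical illustration of this swamping effect follows. The figures are assumptions, not estimates from the literature: the factor model assigns the index itself an alpha of 0.8% per year, and the manager's decisions subtract 0.3% per year relative to the index.

```latex
\alpha_{\text{fund}} \approx \alpha_{\text{index}} + \alpha_{\text{manager}}
  = 0.8\% + (-0.3\%) = +0.5\% \text{ per year}
```

The fund would be reported as outperforming even though the manager detracted value relative to the index it nearly duplicates.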

Because of the prominence of these factor models and their failure to value certain types of mutual funds correctly, more than 75 articles and working papers have modified the Fama–French and Carhart models. These papers can be divided into those that add factors or indexes and those that reformulate the Carhart or Fama–French model by redefining the factors. These papers try to evaluate the new models in several ways: Do they produce higher correlations with past returns? Do they lead to performance that is less affected by alphas on indexes? Do they generate alphas that are not affected by specific anomalies in the literature? Do they lead to more economically defensible levels of alpha for mutual funds in general? Do they produce better forecasts of future performance?

The set of modifications of the factor models that have appeared in the literature is too large to deal with here in detail. Instead, we have selected a few examples of the types of modification that exist.

Several models attempt to improve the Carhart or Fama–French model simply by adding a new index. Hunter, Kandel, Kandel, and Wermers (2014) showed that adding an active peer benchmark to the Carhart model improves its performance. The active peer benchmark is measured as the average return before expenses on all mutual funds that follow the same index as the fund being examined. This addition should alleviate the problem of mispricing in the peer group index.

A number of authors simply added a variable to capture an anomaly—the mispricing of stock funds with certain characteristics. For example, Jordan and Riley (2015) added a volatility factor to account for the fact that funds that hold securities with low volatility tend to have higher returns and higher Carhart alphas than funds that hold securities with high volatility. They also added a dummy variable for the January effect (i.e., an abnormal increase in stock prices in January).

Moreno and Rodriguez (2009) added a variable measuring coskewness to the Carhart model. They defined this variable as the return on assets with high coskewness minus the return on assets with low coskewness. Other authors have added measures of systematic and idiosyncratic risk (Ang, Hodrick, Xing, and Zhang 2006), investment growth (Cooper, Gulen, and Schill 2008), gross profitability (Novy-Marx 2013), and liquidity (Huang, Liu, Rhee, and Wu 2012). Other variables have also been added to some form of the three-factor or four-factor model. Although these studies help explain the anomaly they intend to explain (and sometimes increase the explanatory power of the model), they are somewhat suspect because they often use data on the observed return or alpha spread to explain the fact that the spread exists. Nevertheless, we expect that more and more studies will be published to explain present and future anomalies and mispriced sectors of the stock market.

Another approach is to accept the Fama–French model or the Carhart model and reformulate the variables to better explain returns. This approach was taken by Cremers et al. (2012). The authors made several changes in the formation of the variables in the Fama–French and Carhart models but maintained the general forms of the models. For US funds, they redefined the market portfolio as the return on a portfolio consisting only of US stocks. They then split the size variable into two variables: mid cap minus large cap and small cap minus mid cap. Finally, they measured the growth premium (HML) for each of the three size categories separately. They kept the momentum factor, although it did not have an impact on their results. This seven-factor model performed well in comparison with the traditional form of the Carhart four-factor model. Having found a seven-factor model that satisfied many of the criticisms of the Fama–French model, the researchers then proceeded to examine replicating the seven factors with commercially available indexes. They used the Carhart momentum index, the S&P 500, the Russell Midcap Index, the Russell 2000 Index, and the value component minus the growth component for each index (e.g., the S&P 500 Value Index minus the S&P 500 Growth Index). They found that commercially available indexes outperformed the author-constructed indexes for the seven-index model. Cremers et al. is one of the few modern studies showing the superiority of commercially available indexes. We discuss other studies using commercially available indexes in a later section.

Almost all of the articles reviewed to this point involved adding variables to or reformulating the definition of the variables in the Fama–French or Carhart model. Indeed, those two models have become the standards to use in mutual fund studies. Fama and French (2015) developed a new five-factor model, however, which may well become the new standard or base model for mutual fund studies.

Fama and French (2015) reexamined the Miller and Modigliani (1961) valuation equation for the firm and showed that the average stock return should be positively related to higher profitability and negatively related to investment. They showed that the book-to-market factor is related to profitability and investment, so this factor may be partly redundant. Further motivation for introduction of the new “profitability” and “investment” variables was provided by Novy-Marx (2013), who identified a proxy for expected profitability that is strongly related to expected return, and by Aharoni, Grundy, and Zeng (2013), who found a relationship between investment and expected return. Footnote8 They hypothesized that book-to-market data might not capture all of the influence of future profitability and investment. They added two variables to the three-factor model: (1) RMW is the return on a diversified portfolio of stocks with high profitability (robust, R) minus the return on a portfolio of stocks with low profitability (weak, W), and (2) CMA, the investment variable, is the return on a diversified portfolio of low-investment stocks (conservative, C) minus the return on a portfolio of high-investment stocks (aggressive, A).Footnote9 RMW and CMA were expected to have a positive impact on returns.

Using the five possible factors just described, Fama and French (2015) tested alternative forms of the three-, four-, and five-factor models. Although the five-factor model produced the best results, it did not improve the results produced by a four-factor model formed by deleting HML, the growth premium. The pattern of returns captured by HML seems to be adequately captured by the remaining variables. The four factors might suffice for the evaluation of mutual fund returns, but the authors stated that for style analysis, the book-to-market variable is still valuable because managers often tilt their portfolios to capture the book-to-market premium.

We believe that the five-factor model and the three-factor model will remain the most popular index-benchmarking models in the academic literature. Elton, Gruber, and de Souza (2019b) were among the first to compare the three-factor model with the five-factor model in measuring mutual fund performance. They found little difference in the performance of the two models. We expect more papers to be published in the future that compare, modify, or add additional indexes to the Fama–French three- and five-factor models.

All of the multi-index models discussed here compare the return on a mutual fund with the return on a set of indexes or on a set of differences between indexes. These author-constructed indexes are meant to represent influences in the pricing of stocks that offer a positive return. The theory behind their use in a performance model is that if these influences are known to have a return premium, asset managers should not be given credit for capturing this premium. In many cases, however, duplicating the indexes used in the performance model is difficult for managers. The most difficult case is one in which the model contains differences between two indexes. For example, the Fama–French size variable requires managers to sell short a dollar of large-capitalization stocks for each dollar of small-capitalization stocks that they hold. Not only is this action impractical; it is not possible for the large majority of mutual funds because they have either legal or self-imposed constraints on short sales. Furthermore, the cost of buying the author-constructed indexes is high. These indexes involve a lot of turnover over time, which would result in high transaction costs for a manager trying to replicate the index. These problems are discussed in Li, Chow, Pickard, and Garg (2019).

These reasons have led to efforts to define the model in terms of tradable assets, with an emphasis on the case in which the tradable assets cannot be sold short.

Using traded assets.

In the prior section, we described the use of factor models to evaluate portfolios. If the factors measure systematic risk, then according to APT, they should earn a return premium. Furthermore, the factors should price all assets in the market, including mutual funds. If a fund earns a return above what is earned by sensitivity to the factors, then this return represents superior security selection.

An alternative to using factor models to measure performance is to examine whether an active fund outperforms passive funds containing similar assets. The conclusion from the current literature is that active funds, on average, have significant negative risk-adjusted performance when performance is measured by factor models. Footnote10 Comparing an active fund’s performance with the performance of passive portfolios allows analysts to determine whether the principal conclusion of the mutual fund literature concerning the underperformance of active managers holds when trading costs and expenses are included. In the current markets, the obvious instruments to use for such a comparison are ETFs and index funds.

Elton, Gruber, Das, and Hlavka (1993) initially selected indexes for comparison that captured the characteristics of the funds being examined. They were reacting to an article by Ippolito (1989) showing that when fund performance was measured relative to the S&P 500, the funds had, on average, positive performance and the funds that charged more did better. Ippolito’s sample included a number of small-cap stock funds and funds with long- and intermediate-term debt. If the debt was T-bills, the effect on performance would not matter, but longer-term bonds have, on average, higher returns, so not including them results in a higher intercept. Thus, Elton et al. (1993) included a small-cap stock and bond index as well as the S&P 500. Footnote11 The indexes they used were available as index funds—at least for institutional investors. In evaluating the mutual funds, they estimated the cost of the index funds and evaluated active mutual funds compared with passive portfolios of index funds. Their article was the first to evaluate mutual funds by examining whether they do better than passive portfolios. These comparisons reversed the Ippolito results.

As discussed earlier, Cremers et al. (2012) also proposed evaluating funds by using indexes that are followed by passive portfolios. Cremers (2017) argued that this approach is a major advantage of their model.

Berk and van Binsbergen (2015) used 11 Vanguard funds to measure performance. To deal with orthogonality, they regressed the return on each of the 11 index funds against the returns on the other 10 index funds. The indexes then became the intercept plus the residual. Cremers et al. (2012) and Berk and van Binsbergen (2015) used comparisons that included short sales. Index funds, however, cannot be shorted—but ETFs can. Berk and van Binsbergen were not concerned with whether any ETFs matched the index. If one wants to construct a portfolio that can be shorted, this aspect is important. For the purposes of measuring the performance of active funds, it is less important.

Elton, Gruber, and de Souza (2019a) were the first authors to use passive portfolios that could be held long or short to evaluate mutual fund performance. They found five ETFs that capture most of the variation in returns of all available ETFs. These five were a market ETF, a large-cap value ETF, a large-cap growth ETF, a small-cap growth ETF, and a mid-cap value ETF. The authors found that the combination of these five ETFs that most closely matched the return pattern for each of the active mutual funds they were studying outperformed the fund, on average. Furthermore, when many of the suggestions for picking better-performing funds were evaluated by some of the measures we discussed previously, they did not improve performance beyond the matching portfolio of ETFs.

Conditional betas.

Wayne Ferson, with a number of co-authors, has argued that returns on factors are somewhat predictable over time, so managers should not be given credit for changing betas on these factors in response to the known predictable relationships. Footnote12 To account for this issue, these authors expressed betas as a function of variables, such as the dividend yield and the one-month T-bill rate, that have been shown to affect returns. The impact of this approach is that the equation measuring performance has the standard terms plus an additional term for each predictable variable.
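
For a single index and a single conditioning variable, the idea can be written out as a sketch. The notation here is an assumption introduced for illustration: $z_{t-1}$ denotes the lagged (demeaned) value of a predictive variable such as the dividend yield, and $b_{0i}$ and $b_{1i}$ are the coefficients of a beta that is linear in that variable.

```latex
\beta_i(z_{t-1}) = b_{0i} + b_{1i}\, z_{t-1}

R_{it} - R_{ft} = \alpha_i + b_{0i}\,(R_{mt} - R_{ft})
  + b_{1i}\, z_{t-1}\,(R_{mt} - R_{ft}) + e_{it}
```

The first two terms on the right are the standard terms of Equation 1. The interaction term is the additional term for the predictable variable, and one such term would be added for each conditioning variable and each index.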

In this section, we have examined the major models used to evaluate equity mutual funds. Although we haven’t examined every model that has been proposed, we have discussed all the types of models and the principal models of each type.

Measuring Performance for Passive Portfolios

Passive funds are generally the easiest type to evaluate because they have a well-defined single index they are trying to match. If the fund is a Wilshire 5000 Index fund, for example, then the appropriate measure for evaluating performance is alpha from a single-index model with the Wilshire 5000 as the index. Because some issues are unique to ETFs, we first discuss general issues and then discuss the special issues related to ETFs. Footnote13

General Issues.

The way an index is constructed when one is appraising a passive investment does not affect evaluation techniques but can affect performance measurement and how the results are interpreted. The primary issue is how the index handles dividends. Does the index assume daily reinvestment, or are dividends cumulated and reinvested monthly? In addition, many European countries require that part of the dividend be withheld. How does the index deal with this issue? For example, MSCI indexes assume Luxembourg withholding rules. Footnote14

The second issue to consider in evaluation is tracking error. Tracking error is the difference between the fund return and the index return. Index funds and ETFs don’t track an index exactly every period. A superior performing fund has two characteristics: (1) The variability of the tracking error is small, and (2) the tracking error has low autocorrelation over time (i.e., low correlation between errors in adjacent periods). Low autocorrelation means that the tracking error, rather than growing, converges to a constant. Pope and Yadav (1994) studied the impact of the relationship between the frequency of the data used to estimate tracking error and the investor’s time horizon. They showed that longer horizons relative to estimation frequency lead to significant tracking-error bias.
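
The two characteristics of tracking error named above can be computed with a short sketch. The function name `tracking_stats` and the return series are hypothetical; the autocorrelation is the usual lag-1 sample autocorrelation of the period-by-period return differences.

```python
from statistics import mean, stdev

def tracking_stats(fund_returns, index_returns):
    """Return (variability, lag-1 autocorrelation) of the tracking
    error, defined as fund return minus index return each period."""
    diffs = [f - i for f, i in zip(fund_returns, index_returns)]
    d_bar = mean(diffs)
    denom = sum((d - d_bar) ** 2 for d in diffs)
    num = sum((diffs[t] - d_bar) * (diffs[t - 1] - d_bar)
              for t in range(1, len(diffs)))
    return stdev(diffs), num / denom

# Hypothetical monthly returns (assumptions for illustration only).
index = [0.010, -0.020, 0.015, 0.005, -0.010, 0.020]
fund = [0.011, -0.021, 0.016, 0.004, -0.009, 0.019]

te_vol, te_autocorr = tracking_stats(fund, index)
print(te_vol, te_autocorr)
```

In this invented series the errors alternate in sign, so they offset one another from period to period instead of accumulating.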

Exchange-Traded Funds.

The unique issue for ETFs that does not affect index funds is differences between net asset value (NAV) and price. ETFs trade on an exchange, just as any stock does. Thus, whereas index funds are bought and sold at NAV, ETFs are bought and sold at whatever price they command on the exchange, which can differ from NAV. This difference can help or hurt an investor. If the investor buys at a price below NAV and sells at a price above NAV, the investor is helped. The reverse, of course, hurts the investor. Because an investor cannot know ahead of time what price the ETF will be sold at relative to NAV, this characteristic is best thought of as a risk. For actively traded ETFs, the difference between NAV and price is generally small, but for thinly traded ETFs and international ETFs, it can be large.Footnote15 An excellent discussion of the differences between price and NAV, as well as the potential arbitrage profits from these differences, was provided by Petajisto (2017).
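
The effect of the gap between price and NAV on an investor's return can be sketched as follows. Both function names and all prices are hypothetical, and distributions and trading costs are ignored.

```python
def premium_to_nav(price, nav):
    """Premium (positive) or discount (negative) of price relative to NAV."""
    return (price - nav) / nav

def round_trip_effect(buy_price, buy_nav, sell_price, sell_nav):
    """Investor's return from trading at market prices minus the return
    on NAV over the same holding period."""
    return sell_price / buy_price - sell_nav / buy_nav

# Hypothetical trade: bought at a 0.5% discount, sold at a small premium.
buy_discount = premium_to_nav(99.5, 100.0)
sell_premium = premium_to_nav(110.5, 110.0)
effect = round_trip_effect(99.5, 100.0, 110.5, 110.0)
print(buy_discount, sell_premium, effect)
```

A positive effect means the investor was helped by the price and NAV difference; reversing the discount and premium would make it negative.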

Index funds and ETFs are the fastest-growing segments of the mutual fund market. Conveniently, these funds are the easiest types to evaluate.

Evaluating Equity Funds by Using Holdings-Based Measures

An alternative to using factor-based measures of portfolio performance is to use measures based on a mutual fund’s holdings. Most of the holdings-based approaches measure performance before expenses. Thus, they measure managers’ performance, not investors’ performance.Footnote16

Proponents of this approach argue that it has several advantages over factor-based methodologies: First, the analyst does not have to determine the factors affecting returns. Footnote17 Second, if the fund is changing its style over time (e.g., shifting to more small-cap stocks), the change does not affect holdings-based models, but it does make accurately estimating factor models more difficult. Third, holdings-based evaluations can more easily determine whether the manager has differential skill at selecting industries or sectors.

Holdings-based models also have a disadvantage. Because mutual funds are required to report holdings only quarterly, these measures miss anything that happens within a quarter—and intraquarter trades can add substantial value. Puckett and Yan (2011) examined intraquarter trades for a large sample of institutional investors. They found that such trades add 20–26 basis points per year to the average fund’s performance. Kacperczyk, Sialm, and Zheng (2008) examined the impact of unobserved actions on mutual fund performance. They compared the return difference between the actual performance of the fund over a quarter and what the performance would have been had the managers held the stocks they owned at the beginning of the quarter. This comparison measured the performance of stocks that were bought during the quarter relative to the performance of stocks that were sold during the quarter. The authors called the difference “the return gap.” They found strong persistence over time, with some funds adding substantial value and others realizing substantial loss. Elton, Gruber, Blake, Krasny, and Ozelge (2010) found (1) that 18.5% of trades are missed if one uses quarterly holdings data and (2) that the results of a number of studies would be reversed if monthly holdings data were used.

Several holdings-based methods are used to measure performance. The first was developed by Grinblatt and Titman (1989). They measured performance by examining the difference between the return on a portfolio of the stocks currently held by the fund at their current weights and the return on the same portfolio of stocks with their prior weights. In equation form,

$$GT = \sum_{i=1}^{n} \left( W_{i,t-1} - W_{i,t-k-1} \right) R_{i,t}, \tag{3}$$

where

GT = the Grinblatt and Titman measure of performance

$W_{i,t-1}$ = weight on security i at the beginning of period t

$W_{i,t-k-1}$ = beginning-of-the-period weight on security i at k periods ago

$R_{i,t}$ = return on security i in period t

and the summation is over all securities in either portfolio.

Equation 3 can be expanded in a number of ways, so that, for example, the effect of timing can be compared with the effect of security selection. Note that any difference between the riskiness of the current portfolio and the riskiness of the prior portfolio is not being measured. Also note that a manager who selected a well-performing sector but had no security selection ability within the sector would not be getting credit for the sector selection. In contrast, a manager who selected a poorly performing sector but selected the better-performing securities in that sector would look like the better manager.Footnote18
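For a single period, the Grinblatt and Titman calculation is the sum over securities of the change in weight times the security's return. A minimal sketch follows, assuming weights and returns are stored in dictionaries keyed by a security identifier; the names and numbers are hypothetical.

```python
def grinblatt_titman(current_weights, prior_weights, returns):
    """Sum over securities of (current weight - prior weight) * return.

    A security absent from one of the two portfolios is given a weight of zero there,
    so the sum runs over all securities in either portfolio.
    """
    securities = set(current_weights) | set(prior_weights)
    return sum(
        (current_weights.get(s, 0.0) - prior_weights.get(s, 0.0)) * returns[s]
        for s in securities
    )

# Hypothetical holdings and period returns, for illustration only.
w_now = {"A": 0.5, "B": 0.3, "C": 0.2}
w_prior = {"A": 0.3, "B": 0.3, "D": 0.4}
r = {"A": 0.04, "B": 0.01, "C": -0.02, "D": 0.00}

gt = grinblatt_titman(w_now, w_prior, r)
```

Averaging this quantity over many periods would give a fund-level estimate; as noted in the text, nothing in the calculation adjusts for a difference in risk between the two portfolios.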

Another version of a holdings-based measure is the characteristics-based measure developed by Daniel, Grinblatt, Titman, and Wermers (1997). They argued that size, book-to-market value (BM), and momentum explain most of the cross-sectional variation in stock returns. Therefore, they computed these three variables for each stock each June. They then divided all the stocks into five groups by size. Stocks in each of the five size groups were next ranked by BM. The result was 25 groups that differed by size and BM. Stocks in each of the 25 groups were then ranked by momentum and divided into five groups. The result was 125 groups with different size, BM, and momentum values.

Daniel et al. (1997) computed the characteristics-based performance measure as follows: First, they computed the value-weighted return for each of the 125 groups. Next, they determined for each stock in the mutual fund at the beginning of the period what group the stock belonged to. Then, they calculated the mutual fund’s performance as the difference in return between each stock in the portfolio and the group’s return for that stock, weighted by the amount the stock represented in the portfolio, summed over all stocks. To measure the return on the stock portion of the portfolio, they scaled the weights so as to add up to 1.0. This measure also can be decomposed to measure how much return comes from a style shift and how much comes from timing.
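The sorting and benchmarking steps can be sketched as follows. This is an illustrative simplification, not the exact procedure of Daniel et al. (1997): breakpoints here are simple equal-count quintiles of whatever universe is supplied, and all function and field names are hypothetical.

```python
from collections import defaultdict

def quintile_split(keys, score):
    """Map each key to a quintile 0..4 by ascending score, as evenly as possible."""
    ordered = sorted(keys, key=score)
    n = len(ordered)
    return {k: (rank * 5) // n for rank, k in enumerate(ordered)}

def assign_groups(stocks):
    """Sequential sort: size, then BM within size, then momentum within size and BM.

    stocks maps a stock id to {"size": ..., "bm": ..., "mom": ...}.
    Returns a map from stock id to a (size, BM, momentum) group label.
    """
    groups = {}
    size_q = quintile_split(stocks, lambda s: stocks[s]["size"])
    for sq in range(5):
        in_size = [s for s in stocks if size_q[s] == sq]
        bm_q = quintile_split(in_size, lambda s: stocks[s]["bm"])
        for bq in range(5):
            in_bm = [s for s in in_size if bm_q[s] == bq]
            mom_q = quintile_split(in_bm, lambda s: stocks[s]["mom"])
            for s in in_bm:
                groups[s] = (sq, bq, mom_q[s])
    return groups

def group_returns(groups, market_caps, returns):
    """Value-weighted return of each group."""
    cap_sum, ret_sum = defaultdict(float), defaultdict(float)
    for s, g in groups.items():
        cap_sum[g] += market_caps[s]
        ret_sum[g] += market_caps[s] * returns[s]
    return {g: ret_sum[g] / cap_sum[g] for g in cap_sum}

def characteristic_selectivity(fund_weights, groups, returns, benchmark):
    """Weighted sum of (stock return - matched group return), stock weights rescaled to 1."""
    total = sum(fund_weights.values())
    return sum(
        (w / total) * (returns[s] - benchmark[groups[s]])
        for s, w in fund_weights.items()
    )

# Hypothetical four-stock universe with group labels assigned by hand.
example_groups = {"A": (0, 0, 0), "B": (0, 0, 0), "C": (4, 4, 4), "D": (4, 4, 4)}
caps = {"A": 1.0, "B": 3.0, "C": 2.0, "D": 2.0}
rets = {"A": 0.08, "B": 0.04, "C": 0.02, "D": -0.02}
bench = group_returns(example_groups, caps, rets)
fund = {"A": 0.45, "C": 0.45}  # remaining 10% is assumed to be cash
cs = characteristic_selectivity(fund, example_groups, rets, bench)
```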

Since the publication of Daniel et al. (1997), a number of other factors have been shown to affect returns, and these could be incorporated into the characteristics-based model. Also, note that unless the incorporated characteristics completely determine risk, the portfolio being evaluated and the benchmark portfolio need not have the same risk. Finally, note that unlike factor models, the characteristics-based model does not assume a linear relationship between sensitivity to a factor and return.

An alternative application is to use holdings data to estimate the betas in factor models. We know that portfolio betas are a weighted average of the betas on the securities that compose the portfolio, where the weight is the fraction of the portfolio any security represents. Thus, one can run a time-series regression for each security in the portfolio and use it to estimate the portfolio betas. If the fund changes betas over time by, for example, increasing its holdings of small-cap stocks, estimates of betas obtained by regressing the fund’s return against the factors’ returns will produce a single value that does not capture the different values over time. Estimating betas by using holdings data will capture the change. Treynor, Priest, Fisher, and Higgins (1968) were the first to propose this technique as a way to estimate the market beta. Elton, Gruber, and Blake (2011) used this procedure to estimate the betas in the Fama–French and Carhart models. The authors found substantial variation in betas over time. Furthermore, they found that estimates of mutual fund performance obtained by using estimates of betas from security betas and holdings data better predicted future performance than estimates from a time-series regression of the mutual fund returns against factor returns.
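The weighted-average relationship makes the calculation straightforward once security betas are available. The sketch below assumes the security betas have already been estimated from time-series regressions; the tickers, betas, and weights are hypothetical.

```python
def portfolio_beta(weights, security_betas):
    """Portfolio beta as the holdings-weighted average of security betas."""
    return sum(w * security_betas[s] for s, w in weights.items())

# Hypothetical security betas on a size factor.
size_betas = {"LARGE1": -0.2, "LARGE2": -0.1, "SMALL1": 0.9}

# Hypothetical holdings at three successive report dates:
# the fund is shifting toward the small-cap stock.
holdings_by_date = [
    {"LARGE1": 0.5, "LARGE2": 0.4, "SMALL1": 0.1},
    {"LARGE1": 0.4, "LARGE2": 0.3, "SMALL1": 0.3},
    {"LARGE1": 0.3, "LARGE2": 0.2, "SMALL1": 0.5},
]

beta_path = [portfolio_beta(w, size_betas) for w in holdings_by_date]
```

Here the estimated size beta rises at each report date, a pattern that a single full-period regression of fund returns on factor returns would collapse into one number.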

Holdings-based models have not been used as often as return-based models, nor have their properties been as extensively analyzed—probably for two reasons: First, holdings data are reported less frequently than return data, which can introduce bias. Second, holdings-based models are more difficult to implement.

Measuring Timing Ability

Timing involves a fund manager changing the sensitivity of a portfolio to a factor over time in response to changing beliefs about the return on the factor. For example, if the factor is the market, then the manager will increase the beta on the market if he or she believes the market will have a higher return in the next period than in the current period. Timing does not refer to market timing alone but can be present with any of the factors discussed earlier.

Timing can be measured from return data and from holdings data. Most timing measures that use return data are variants of the Treynor and Mazuy (1966) procedure or the Henriksson and Merton (1981) model. Treynor and Mazuy noted that if a manager has timing ability, then on average, the manager increases beta when returns on the market rise and decreases beta when returns on the market fall. Their procedure introduces curvature when the fund's return is plotted against the market return. Thus, for every factor for which timing is being measured, two terms are used: the return on the factor (maybe in excess-return form) and a squared return on the factor. Footnote19 If timing is present, the coefficient on the squared term (which measures curvature) should be positive.

Henriksson and Merton (1981) argued that a manager with timing ability should have a high beta on the factor in “good” markets and a low beta on the factor in “poor” markets. Whereas Treynor and Mazuy (1966) envisioned continuous change as expectations about a factor’s return change, Henriksson and Merton viewed the change as more discrete, with beta taking one of two values. They defined good markets as those that give a return above the riskless rate and bad markets as those with returns below the riskless rate. Their model has two terms for the market return: the standard one and a second term equal to the market return multiplied by a dummy that equals 1 if the market is good and 0 if it is bad. The coefficient on the second term is the difference in beta between good and bad markets. If the manager has timing ability, this coefficient should be significantly positive. Note that other definitions of good and bad markets could be used. A particularly appealing definition for a good market is one with returns above the average return on the market.
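Both return-based timing regressions can be sketched with ordinary least squares. The code below is an illustration only: it uses a small hand-written solver to stay self-contained, it takes returns in excess of the riskless rate (so a "good" market is one with a positive excess return), and it reports point estimates without the significance tests an actual study would need. All names and data are hypothetical.

```python
def ols(X, y):
    """Least-squares coefficients from the normal equations (Gauss-Jordan elimination).

    X is a list of rows; include a column of 1.0 for the intercept.
    """
    k = len(X[0])
    # Augmented matrix [X'X | X'y].
    A = [
        [sum(row[i] * row[j] for row in X) for j in range(k)]
        + [sum(row[i] * yi for row, yi in zip(X, y))]
        for i in range(k)
    ]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                f = A[r][col] / A[col][col]
                A[r] = [a - f * b for a, b in zip(A[r], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]

def treynor_mazuy(fund_excess, factor_excess):
    """Returns [alpha, beta, gamma]; gamma > 0 indicates curvature consistent with timing."""
    X = [[1.0, x, x * x] for x in factor_excess]
    return ols(X, fund_excess)

def henriksson_merton(fund_excess, factor_excess):
    """Returns [alpha, beta_bad, beta_good - beta_bad] using a good-market dummy."""
    X = [[1.0, x, x if x > 0 else 0.0] for x in factor_excess]
    return ols(X, fund_excess)

# Hypothetical excess market returns and two artificial funds built to show each pattern.
market = [-0.06, -0.03, -0.01, 0.02, 0.04, 0.07, -0.02, 0.05]
fund_tm = [0.001 + 0.9 * x + 2.0 * x * x for x in market]
fund_hm = [0.6 * x + 0.5 * max(x, 0.0) for x in market]

tm_coeffs = treynor_mazuy(fund_tm, market)
hm_coeffs = henriksson_merton(fund_hm, market)
```

Extending either function to several factors amounts to adding, for each factor, its return and the corresponding squared or dummy-interacted term as further columns of X.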

Although Henriksson and Merton (1981) discussed measuring only market timing, the same idea can be applied to other factors. Thus, using the three-factor Fama–French model, one could measure market timing, the timing of a move into small-cap or large-cap stocks, and the timing of increases or decreases in growth exposure. Measuring timing on all of these factors would involve six terms: the three standard ones and three additional terms, each equal to a factor return multiplied by a dummy that equals 1 when the return on that factor is high.

Elton, Gruber, and Blake (2012) and Daniel et al. (1997) used holdings data to measure timing. Recall that the beta on any factor in a portfolio is a weighted average of the betas on the securities that compose the portfolio, where the weight is the fraction of the portfolio that any security represents. This procedure allows one to estimate betas at each point in time, and these betas can be used as input in standard timing models. Recall also that Henriksson and Merton (1981) assumed that a fund had two betas on the market: one in good markets and one in bad markets. Similarly, Treynor and Mazuy (1966) assumed a relatively continuous change in beta as factor returns changed.

Estimating betas from holdings data allows for any kind of pattern of beta changes with factor returns. Timing is measured as the deviation of the beta from its target at the beginning of any period times the return over the period, averaged over all periods. Footnote20 Authors have used various definitions of the target beta. Elton, Gruber, and Blake (2012) used the average beta over all periods, but several other measures are possible. For a plan sponsor, for example, the target beta could be an agreed-upon beta to be used in normal times.
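The holdings-based timing measure can be sketched as follows, assuming that beginning-of-period betas have already been computed from holdings. The default target is the average beta over all periods, one of the choices mentioned in the text; the function name and the data are hypothetical.

```python
from statistics import mean

def holdings_timing(begin_betas, factor_returns, target_beta=None):
    """Average over periods of (beginning-of-period beta - target beta) * factor return."""
    if target_beta is None:
        target_beta = mean(begin_betas)  # average beta over all periods
    return mean([(b - target_beta) * r for b, r in zip(begin_betas, factor_returns)])

# Hypothetical market betas at the start of each period and the market's return in that period.
betas = [0.8, 1.2, 0.9, 1.1]
market_returns = [-0.02, 0.03, -0.01, 0.02]

timing = holdings_timing(betas, market_returns)
timing_vs_policy = holdings_timing(betas, market_returns, target_beta=0.9)
```

The second call shows how an agreed-upon policy beta, such as one set by a plan sponsor, could replace the average as the target.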

When comparing these timing measures, we have a strong preference for measures using holdings data. The use of holdings data allows an analyst to capture the effect of much more complicated patterns over time than does the use of structural return models. We also find simply looking at the detailed pattern over time to be informative. An important aspect is that timing should be considered for all factors. Even if the manager is timing only the changes in the market, changes in the market beta are likely to lead to changes in the sensitivity of the portfolio to other factors.

Bond Fund Performance Measurement

Equity mutual funds have been studied extensively; bond mutual funds have received much less attention. Techniques for evaluating bond mutual funds are similar, however, to those used in evaluating stock funds. As with evaluating equity funds, we can divide the techniques into return-based measures and holdings-based measures.

The first return-based measure was developed by Blake, Elton, and Gruber (1993). They compared a single-index model (the benchmark being a general bond index or a subindex that most closely matched the fund category) with two three-index models and a six-index model. The six indexes matched the principal types of securities held by the fund: an intermediate-term government bond index, a long-term government bond index, an intermediate-term corporate bond index, a long-term corporate bond index, a high-yield bond index, and an index of mortgage-backed securities (MBS).

Recall that when we compared models used for evaluating equity funds, our conclusions varied a lot depending on the model used. The same is not true for bond funds. The three- and six-index models resulted in little difference in the measured performance of the fund being evaluated. As long as the models each included a general index, a high-yield index, and either an MBS or term-structure index, the performance results were similar.

Chen, Ferson, and Peters (2010) measured performance net of timing. They used indexes based on the term structure, credit spreads, liquidity spreads, mortgage spreads, exchange rates, and two equity variables—dividend yield and equity volatility. Finally, Houweling and Zundert (2017) used factors similar to the Fama–French and Carhart factors.

Holdings data have also been used to measure bond fund performance. Cici and Gibson (2012) used a procedure similar to the one Daniel et al. (1997) used for common stocks. Cici and Gibson argued that the two characteristics that matter in bond performance are credit rating and duration. Thus, to compute portfolio performance for all bonds, they computed the difference between the performance of bonds in the portfolio and bonds not in the portfolio but with the same duration and credit risk, weighted by the fraction of the portfolio the bond represented, summed over all bonds.

Timing can be measured by looking at changes in duration and credit rating times the performance of these categories over the period. Cici and Gibson (2012) studied only corporate bond funds, so for studying mutual funds that hold other types of bonds, more characteristics would need to be included.

Another type of bond fund evaluation using holdings data that has been proposed uses the Grinblatt and Titman (1993) measure. Moneta (2015) measured bond mutual fund performance as the change in the percentage held in any bond from the prior period times the return in the next period, summed over all bonds. Note that risk is not controlled for with this measure.

Most bonds are infrequently traded. Therefore, most bond prices and bond returns are estimated from a model and not actually observed. This indefiniteness is probably a major reason why bond funds have not been examined as frequently as stock funds.

Issues and Future Directions

From time to time in this article, we have commented on how portfolio evaluation is likely to change. In this section, we discuss some issues that occur with implementation and provide additional observations on problems with mutual fund evaluation.

Data Sources and Bias.

The two principal types of data used in mutual fund studies are return data and holdings data. Return data are used in almost all studies examining mutual fund performance. Both types of data have issues that can affect the conclusions drawn from these studies.

Problems with return data.

The first problem with return data is incubator bias. Mutual fund companies often start a number of incubator funds that are subsidized by the fund family and not open to the public. They do so to build up a history before they attempt to market a fund. Funds with a good history are opened to the public; those with a poor history are merged into another fund or liquidated. When the funds that perform well are included in the databases, they usually come in with their full history, not just the history after they became available to the public. This practice introduces a bias. Evans (2010) estimated that the risk-adjusted return on successful funds was 3.5%, much higher than the risk-adjusted return on funds that had been in existence for some length of time, which illustrates the size of the bias.

Incubator bias can be controlled for in mutual fund studies in two ways. First, when a fund is available to the public, it gets a ticker; therefore, eliminating all data before the fund received a ticker would control for the bias. A second, quick-and-dirty method of controlling for the bias is to eliminate the first three years of data for all funds. This practice eliminates most incubator bias but, of course, at the cost of eliminating early data for nonincubator funds.

A second bias concerns the incompleteness of data for small funds. Funds that are less than $15 million in size and have fewer than 1,000 investors do not have to report daily NAVs. When they enter a database as successful funds, they usually enter with a history, whereas unsuccessful funds meeting size and investor criteria never enter. This backfill bias can be eliminated by omitting funds with less than $15 million in assets.
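The screens described above are simple filters on a fund's return history. The following sketch assumes each fund's history is a list of (date, return) pairs and that the date on which the fund received a ticker is known; the function names and data layout are hypothetical.

```python
from datetime import date

def drop_pre_ticker(observations, ticker_date):
    """Keep only observations dated on or after the fund received a ticker."""
    return [(d, r) for d, r in observations if d >= ticker_date]

def drop_first_years(observations, years=3):
    """Quick screen: drop the first `years` years of data, measured from the first observation."""
    if not observations:
        return []
    first = min(d for d, _ in observations)
    cutoff = (first.year + years, first.month)
    return [(d, r) for d, r in observations if (d.year, d.month) >= cutoff]

def passes_size_screen(total_net_assets, minimum=15_000_000):
    """Backfill screen: keep funds with at least $15 million in assets."""
    return total_net_assets >= minimum

# Hypothetical fund history, for illustration only.
history = [
    (date(2015, 1, 31), 0.05),
    (date(2016, 6, 30), 0.04),
    (date(2018, 1, 31), 0.01),
    (date(2019, 1, 31), 0.00),
]
after_ticker = drop_pre_ticker(history, date(2016, 1, 1))
after_three_years = drop_first_years(history)
```

The first screen removes only the incubation period, whereas the second also discards early data for funds that were never incubated, which is the cost noted in the text.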

Finally, some widely used databases, such as CRSP, do not include all funds. Thus, most mutual fund studies need the qualifier that they apply only to funds included in the database being analyzed.

Problems with holdings data.

Holdings data are reported by Thomson and Morningstar, and these two databases have big differences. The Thomson database includes only data for traded equity. It excludes data for nontraded equity, equity holdings that cannot be identified, options, bonds, convertibles, and futures. Investigators using the Thomson database normally account for the missing data by making one of two assumptions. Some treat the listed assets as the complete portfolio. Others treat the missing assets as cash. Both treatments have problems. Investigators report that missing assets make up about 10% of the portfolio. Great variation can be found, however, among funds. The missing assets are likely to lead to misestimates of beta for many funds. The problem is particularly acute for timing studies. Elton, Gruber, and Blake (2011) analyzed the problem of missing holdings data and found that they identified very different funds as having superior performance if they used a complete set of holdings data.
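The two treatments of missing assets lead to different beta estimates for the same reported holdings. The sketch below assumes reported weights are fractions of total fund assets and that cash has a beta of zero; the function names and numbers are hypothetical.

```python
def beta_listed_as_complete(reported_weights, betas):
    """Treat the listed assets as the whole portfolio: rescale reported weights to sum to 1."""
    total = sum(reported_weights.values())
    return sum((w / total) * betas[s] for s, w in reported_weights.items())

def beta_missing_as_cash(reported_weights, betas):
    """Treat the missing assets as cash: the unreported fraction contributes a beta of zero."""
    return sum(w * betas[s] for s, w in reported_weights.items())

# Hypothetical fund with 10% of assets missing from the holdings data.
reported = {"A": 0.5, "B": 0.4}
market_betas = {"A": 1.2, "B": 0.9}

beta_complete = beta_listed_as_complete(reported, market_betas)
beta_cash = beta_missing_as_cash(reported, market_betas)
```

Neither estimate need equal the true beta, which depends on what the missing assets actually are.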

Missing Factors.

When evaluating the performance of US equity funds, analysts should be aware that many of the funds have added a lot of non-US securities to their portfolios. Based on Morningstar holdings data, as of November 2019, about 15% of US equity funds had more than 10% of their portfolios in foreign securities and about 35% had more than 5% invested in foreign securities. Little research is available to guide analysts in determining the correct model for evaluating funds with substantial foreign investment.

One argument that can be made is that markets are fully integrated, so any given factor model prices both US and foreign markets. Thus, a factor model derived from US data is applicable for evaluating French or Hong Kong securities. The realized return on factors varies dramatically, however, among markets. The use of US factor returns on French stocks could result in large positive or negative alphas depending on the factor returns for the United States versus those for France.

When evaluating funds in foreign markets, most authors use a standard model developed for the US market and fit it with data from the non-US market they are evaluating. If they fit a model with French data, however, problems of foreign investment again arise. An examination of French equity funds listed in Morningstar shows that no fund invests only in French stocks. These funds’ average investment in French stocks is less than 50%. Thus, if realized factor returns are different among European markets, the model will be poorly specified. We do not have an answer to this problem, but much more research is needed.

Another possible missing factor is bonds. Standard factor models do not include bond returns. If the only debt instrument a fund holds is short-term money market debt, standard models are appropriate for that fund because most standard models make this assumption. If an equity fund holds long-term bonds in its portfolio, however, the standard performance measure will be biased. Because the model assumes that the only debt instrument the fund holds is money market debt, the difference in return between long-term bonds and short-term bonds times the percentage invested in long-term bonds is captured in the alpha of the performance measure. At one time, the missing bond element was a common feature of equity mutual funds. Recently, however, it has become much less important. Morningstar data for November 2019 show that only about 1% of equity mutual funds held more than 5% in long-term bonds and only slightly more than 0.5% of funds labeled as equity funds held more than 10% in long-term bonds. Thus, for most equity funds, the bond element is no longer an issue. If one is evaluating balanced funds, however, it remains an issue.
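The size of the bias described above is a one-line calculation. The sketch below is illustrative only; the function name and the input values are hypothetical.

```python
def alpha_bias_from_bonds(weight_long_bonds, long_bond_return, short_term_return):
    """Return spread of long-term over short-term bonds times the weight in long-term bonds."""
    return weight_long_bonds * (long_bond_return - short_term_return)

# Hypothetical: 10% in long-term bonds, which returned 6% while money market debt returned 2%.
bias = alpha_bias_from_bonds(0.10, 0.06, 0.02)  # roughly 0.004, or 40 basis points
```

In a period when long-term bonds underperform short-term debt, the same calculation gives a negative bias.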

Goetzmann, Ingersoll, Spiegel, and Welch (2007) showed how standard performance measures can be manipulated to make a manager appear to have superior skill when the manager does not. The trick is accomplished by using options. For most mutual funds, options are not a possible investment.Footnote21 For hedge funds, however, the issue of options being used to manipulate performance is important, and models have been developed to deal with it (see, for example, Agarwal and Naik 2001). These models are potentially useful for evaluating mutual funds with significant option positions.

Improving Benchmarks.

What does the future hold for the development of improved benchmarks? We have described three approaches to defining benchmarks for performance measurement: factor analysis, APT (priced factors), and choosing tradable indexes.

On the one hand, factor analysis provides a way to simultaneously define benchmarks that capture the time-series and cross-sectional components of performance. It requires no a priori specification of appropriate individual influences. On the other hand, the economic meaning of the indexes or portfolios based on factors is unclear, and the composition of the indexes is unstable over time. Although factor analysis gives us some idea of the number of indexes that must be incorporated into a benchmark, we do not expect the factors defined by this method to be used in future models of performance evaluation.

Both of the other methods—the APT approach and choosing tradable indexes—will continue to play a role in evaluating mutual funds. They share a weakness—that is, the set of factors or indexes must initially be an a priori selection—but both allow for the initial choices to be refined. Both start with an idea as to which characteristics are important. The APT model says that the only factors that are important are those that are priced in the market—that is, factors that have a nonzero return both cross-sectionally and in time series. The theory is that managers should not be given any credit for performance attributable to the sensitivity of their portfolios to priced factors.

Proponents of index models argue that if actively managed funds can be replicated by ETFs or index funds, the mutual fund managers should get credit only for the extra return the fund earns above a portfolio of indexes that has the same risk as the fund.

Problems afflict both methodologies. In terms of the APT model, the question is, where do you stop adding factors? The literature contains articles presenting dozens of priced factors. Take an obscure one, such as coskewness. Should managers have their performance reduced because they hold a portfolio with positive coskewness? If the investor gets a high return but a low alpha because of high positive coskewness, does the investor care? Is the Fama–French three-factor model sufficient, or the Carhart model, or the Fama–French five-factor model, or the Cremers seven-factor model? We believe that over time, the profession will find more and more priced factors. What should be included in our models?

A similar problem exists with respect to traded indexes. Where do we stop? How many indexes should we choose? As more and more passive ETFs join the market, we will have more and more choices for indexes.

Although we don’t know exactly where to stop, we have some guidance. The factors that are priced in the APT setting can guide us to the types of traded indexes we need to include in a benchmark. Nowhere is this issue made clearer than in Cremers et al. (2012). These authors first developed a seven-factor version of the Fama–French five-factor model, but then, they found a set of seven commercial indexes that did a better job of explaining portfolio performance than their seven-factor model.

Elton, Gruber, and de Souza (2019a) showed how cluster analysis applied to the correlation structure of a large set of indexes can combine them into a smaller set. Then, they showed, following the procedure outlined in Fama and French (2018), how the set can be further reduced to a set that satisfies sufficient and necessary conditions. Footnote22

We expect that the search for the appropriate factor model and tradable index model will continue. As new factors or indexes are proposed, we expect to see more research comparing these models in an attempt to determine which have the most desirable properties. We also expect more models, particularly those using tradable assets that do not allow short selling, to be developed. Many mutual funds cannot sell short, and comparing their performance against a benchmark that has the same restriction is appropriate.

Notes

1 Although most of the literature has developed performance models to apply to US mutual funds, the tools described in this article can be applied to other types of asset portfolios in the United States and the rest of the world.

2 The statistical properties of the Sharpe ratio were first documented by Lo (2002), although the statistical properties of Jensen’s alpha had been well understood before that time.

3 Since 1998, the US SEC has required every fund to state its primary benchmark in the fund prospectus.

4 The connection between pervasive influences and factors will become clear in subsequent sections. In addition, both alphas and betas may vary over time, but this aspect is discussed in a later section.

5 Note that these variables are formulated so that they are expected to have positive returns, on average, over time.

6 The prominence of these models is at least partly caused by the fact that the index data can easily be downloaded from Kenneth French’s website (https://mba.tuck.dartmouth.edu/pages/faculty/ken.french/).

7 This flaw has been pointed out by several authors (e.g., Cremers et al. 2012).

8 The earliest evidence that the three-factor model did not account for profitability and investment was provided by Titman, Wei, and Xie (2004). Recently, Hsu, Kalesnik, and Kose (2019) set out to search for a definition of “quality,” a term used by practitioners. They found that profitability and investment were the strongest signals of quality and had an impact on return. Ball, Gerakos, Linnainmaa, and Nikolaev (2020) argued that a retained-earnings-to-market factor better predicts the cross-section of stock returns than does book to market.

9 Earlier, Novy-Marx (2013) advocated the use of gross profits divided by assets as a substitute for book to market in performance measures.

10 For a summary of the evidence, see Elton and Gruber (2013).

11 The origins of this approach lie in Sharpe (1992).

12 See Ferson and Schadt (1996) and Ferson and Qian (2006) for examples.

13 For active ETFs, the models discussed in earlier sections of this article are appropriate.

15 For empirical results for S&P index funds, see Elton, Gruber, and Busse (2004). For evaluation of the most popular ETF (Spider), see Elton, Gruber, Comer, and Li (2002). For comparison of the performance of index funds and ETFs, see Elton, Gruber, and de Souza (2019c). When comparing ETFs and index funds, one other consideration needs to be discussed: capital gains. Because ETFs can be created or deleted by exchanging stocks in the index for shares of the ETFs and these exchanges are tax free, ETFs have much lower capital gains. Thus, ETFs have a tax advantage over index funds.

16 Of course, one can subtract expenses from pre-expense returns to get an estimate of returns to investors.

17 Holdings evaluations based on characteristics (i.e., portfolios constructed to match the characteristics of the stocks held by a mutual fund) do need to make these assumptions.

18 At the time of a manager’s switch to a poor performing sector, the measure would correctly penalize the switch.

19 For an early use of the Treynor–Mazuy (1966) procedure, see Bello and Janjigian (1997).

20 Daniel et al. (1997) had a similar measure for timing. They measured timing as the difference in factor return in period t times the beta at the beginning of period t minus a similar measure 12 months earlier.

21 Some funds that actively rebalance toward a fixed bond–stock ratio buy stocks in a declining stock market and sell stocks in a rising stock market. This strategy replicates a short option position and should, as pointed out by Stephen Brown in his role as executive editor of the Financial Analysts Journal, produce higher Sharpe ratios and greater tail risk.

22 The sufficient condition is that the set should produce alphas near zero for a large set of passive portfolios. The necessary condition is that each index should have a significant alpha when regressed against all others in the set.

References
