ABSTRACT
A series of influential papers by Hoberg and Phillips measure the similarity of pairs of companies based on a textual analysis of their business descriptions and show these measures to be useful in a variety of research contexts in finance. Hoberg and Phillips derive the similarity measures from a comparison of word lists extracted from extensive business descriptions contained in US companies’ electronic 10-K filings. Unfortunately, this method is of little use in non-US settings, where lengthy English-language company self-descriptions are not available on a consistent basis. Instead, we use semantic fingerprinting to extract such similarity measures from much shorter but globally available third-party company descriptions. We show that our approach significantly predicts stock return correlations even after controlling for past correlations and for membership in the same industry. Remarkably, company similarity measures based on brief third-party company descriptions predict stock return correlations significantly better than those based on much longer company self-descriptions.
JEL CLASSIFICATION:
Acknowledgement
We are grateful to Sandrine Foldvari, David Le Bris, Xiaojuan Liu and participants at the ICMA Econometrics and Financial Data Science workshop for comments, and to Thomson Reuters for providing us with a history of company descriptions.
Disclosure statement
No potential conflict of interest was reported by the authors.
Notes
1 A zipped file with all the company descriptions we use can be downloaded from https://goo.gl/Bh8bgY.
2 http://www.cortical.io/keyword-extraction.html shows how text is converted into semantic fingerprints, and http://www.cortical.io/similarity-explorer.html shows how textual similarity is compared. See IKSS and Webber (2015) for more details on semantic fingerprinting.
4 For Alcoa, the value in Column 10 is missing because the only company in the same sector is Du Pont, for which HP similarities are missing; the values in Columns 12 and 13 are missing because 2013 TR company descriptions are unavailable for Alcoa. Similarly, for Du Pont, the value in Column 10 is missing because HP similarities are unavailable for this company, and the values in Columns 12 and 13 are missing because Du Pont’s only GICS sector peer is Alcoa, for which, as noted above, 2013 TR company descriptions are unavailable.
5 We have also implemented the regression specifications of separately for pairs of firms belonging to the same sector as well as for different-sector firm pairs. The same-sector sample consists of only 40 firm pairs, making our tests less powerful. Nonetheless, our similarity measures based on both long and short TR descriptions are significant at the 10% level. For the different-sector sample, the corresponding significance easily clears the 1% threshold.
6 While we do not have access to other company descriptions on a historical basis, to demonstrate that different providers’ descriptions are broadly similar, in we showed that the correlation between cosine similarities derived from then-current (as of 3/2017) short TR descriptions and from Yahoo descriptions is high, at 0.676. To expand this point to further sources of company descriptions, we have collected current (as of 4/2018) company descriptions from another key provider of market data, Factset, and also updated our short TR descriptions to that date. We find that the TR/Factset correlation is 0.645. These high correlations suggest that a variety of third-party company descriptions are a potentially valuable alternative to extracting self-descriptions from 10-K filings.
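As an illustration of the comparisons described in Notes 2 and 6, the following minimal sketch computes the cosine similarity between two semantic fingerprints and the Pearson correlation between two sets of pairwise similarities. It assumes that a fingerprint can be represented as the set of active positions of a sparse binary vector; the function names and the toy fingerprints are hypothetical and are not taken from the paper or from the Cortical.io API.

```python
import math


def cosine_similarity(fp_a, fp_b):
    """Cosine similarity of two binary fingerprints.

    Each fingerprint is a set of active bit positions (an assumption made
    for this sketch). For binary vectors the dot product is the size of
    the overlap and each norm is the square root of the number of active bits.
    """
    if not fp_a or not fp_b:
        return 0.0
    overlap = len(fp_a & fp_b)
    return overlap / math.sqrt(len(fp_a) * len(fp_b))


def pearson(xs, ys):
    """Pearson correlation between two equally long lists of numbers."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)


# Toy fingerprints for two hypothetical company descriptions.
firm_a = {1, 2, 3}
firm_b = {2, 3, 4}
similarity = cosine_similarity(firm_a, firm_b)
```

Under these assumptions, comparing two providers amounts to computing `cosine_similarity` for every firm pair under each provider's descriptions and passing the two resulting lists, in the same pair order, to `pearson`.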