Original Articles

Outlier-free merging of homogeneous groups of pre-classified observations under contamination

Pages 2997-3020 | Received 18 May 2017, Accepted 02 Jul 2017, Published online: 17 Jul 2017
 

ABSTRACT

We study the problem of merging homogeneous groups of pre-classified observations from a robust perspective motivated by the anti-fraud analysis of international trade data. This problem may be seen as a clustering task which exploits preliminary information on the potential clusters, available in the form of group-wise linear regressions. Robustness is then needed because of the sensitivity of likelihood-based regression methods to deviations from the postulated model. Through simulations run under different contamination scenarios, we assess the impact of outliers both on group-wise regression fitting and on the quality of the final clusters. We also compare alternative robust methods that can be adopted to detect the outliers and thus to clean the data. One major conclusion of our study is that the use of robust procedures for preliminary outlier detection is generally recommended, except perhaps when contamination is weak and the identification of cluster labels is more important than the estimation of group-specific population parameters. We also apply the methodology to find homogeneous groups of transactions in one empirical example that illustrates our motivating anti-fraud framework.

2010 MATHEMATICS SUBJECT CLASSIFICATION:

Acknowledgements

The authors are grateful to Domenico Perrotta and Marco Riani both for their specific comments and for broader discussion of the topic addressed in this work. They also thank one referee and Prof. Andrei Volodin for several helpful suggestions on a previous version of this manuscript.

Disclosure statement

No potential conflict of interest was reported by the authors.

Notes

1. The complete description for this CN code is: slag, ash and residues (other than from the manufacture of iron or steel), containing mainly copper.

2. The average number of nominated outliers is given by $n(1-\pi_0)\,\mathrm{size} + n\pi_0\,\mathrm{power}$.
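
As a rough illustration of the expression in Note 2, the sketch below evaluates it in Python, reading it as n(1 - π0) × size + nπ0 × power. It assumes that π0 denotes the contamination rate, that size is the per-observation false-positive rate of the outlier test and that power is its detection rate; these interpretations, the function name and the numerical values are illustrative assumptions and are not taken from the article.

```python
def expected_nominated_outliers(n, pi0, size, power):
    """Expected number of observations flagged as outliers.

    Assumed meaning of the arguments (not stated in the note itself):
      n     -- total number of observations
      pi0   -- contamination rate, i.e. proportion of true outliers
      size  -- probability of flagging an uncontaminated observation
      power -- probability of flagging a contaminated observation
    """
    for name, value in (("pi0", pi0), ("size", size), ("power", power)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must lie in [0, 1]")
    false_alarms = n * (1.0 - pi0) * size   # flagged among the clean units
    true_detections = n * pi0 * power       # flagged among the contaminated units
    return false_alarms + true_detections


# Hypothetical values chosen only to illustrate the arithmetic.
total = expected_nominated_outliers(n=1000, pi0=0.1, size=0.01, power=0.8)
print(round(total, 6))
```

With these hypothetical values, the 900 uncontaminated observations contribute about 9 expected false alarms and the 100 contaminated ones about 80 expected detections, for a total of roughly 89 nominated outliers.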

Additional information

Funding

The work described in this article has been conducted under the support of the Administrative Arrangement SI2.741170 between the European Anti-Fraud Office (OLAF) and the Joint Research Centre of the European Commission (AMT5, Phase I and Phase II projects), in the framework of the Hercule III Programme on the protection of the financial interests of the European Union.
