ABSTRACT
This paper shows how data science can improve empirical research in economics by leveraging large datasets and extracting information that would otherwise be unsuitable for a traditional econometric approach. As a test-bed for our framework, we use machine learning algorithms to create a new holistic measure of innovation following a 2012 Italian law aimed at boosting new high-tech firms. We adopt this measure to analyse the impact of innovativeness on a large population of Italian firms that entered the market at the beginning of the 2008 global crisis. The methodological contribution is organised in three steps. First, we train seven supervised learning algorithms to recognise innovative firms on 2013 firmographics data and select a combination of those models with the best predictive power. Second, we apply this combination to the 2008 dataset and predict which firms would have been labelled as innovative according to the definition of the 2012 law. Finally, we adopt this new indicator as the regressor in a survival model to explain firms' ability to remain in the market after 2008. The results suggest that innovative firms are more likely to survive than the rest of the sample, but the survival premium is likely to depend on location.
Disclosure statement
No potential conflict of interest was reported by the author(s).
Notes
1 Although a stream of literature tries to develop models that overcome the trade-off between the prediction error due to a simple model and the variance of estimates in out-of-sample predictions (Pearl and Mackenzie 2018), in statistical learning, the trade-off is still binding.
2 The debate on the use of patents dates back at least to Pavitt (1985).
3 For a review and future perspectives on history-friendly models see Capone et al. (2019).
4 The 221/2012 Legislative Decree was adopted, and went into force, on 17 December 2012.
5 See website: http://startup.registroimprese.it/startup/index.html
6 Alternatively, as a measure of performance, we can compare the area under the ROC curve (AUC). For further details on the interpretation of ROC curves, see Alpaydin (2014).
7 We assigned weights of 0.77 and 0.23 to BAG and ANN, respectively, according to a function which maximises the separation between the predicted probabilities for INNs and NOINNs and the area under the ROC curve (AUC). As a robustness check, we also tested mixes of different algorithms, but there was no substantial improvement in performance. See Appendix 2 for further details.
8 We estimate the variance with Greenwood's formula using the Delta method, and we use the log-minus-log transformation for the confidence interval (Borgan and Liestøl 1990).
9 Note that NAs are too widespread across variables and observations, so multiple imputation would add extra, unjustified variability to the observed variables. Even if we limit multiple imputation to some crucial variables, we still do not have enough complete observations in the dataset to complete the missing values.
10 Note that management variables, which contain a huge amount of unstandardised text, are discarded from the beginning of the data construction process.
11 Note that, without this last MVA step, there would only have been 18,078 firms left in the 2008 sample, representing less than 28% of the initial number of 2008 start-ups.
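The variance and confidence-interval computation mentioned in note 8 can be illustrated with a minimal sketch. It assumes a Kaplan-Meier survival estimate, which the note does not name explicitly, and uses the standard textbook form of Greenwood's formula with a log-minus-log interval. The function name `kaplan_meier_loglog`, its arguments, and the toy data are hypothetical and are not taken from the paper.

```python
import math


def kaplan_meier_loglog(durations, events, z=1.959964):
    """Kaplan-Meier estimate with Greenwood variance and log-minus-log CI.

    durations: observed times (exit or censoring), one per firm.
    events:    1 if the firm exited at that time, 0 if censored.
    z:         normal quantile (default is roughly a 95% interval).
    Returns a list of (time, survival, lower, upper) at each exit time.
    """
    data = sorted(zip(durations, events))
    n_at_risk = len(data)
    surv = 1.0
    greenwood_sum = 0.0  # running sum of d_i / (n_i * (n_i - d_i))
    out = []
    i = 0
    while i < len(data):
        t = data[i][0]
        exits = 0    # firms exiting at time t
        leaving = 0  # firms leaving the risk set at t (exits + censored)
        while i < len(data) and data[i][0] == t:
            exits += data[i][1]
            leaving += 1
            i += 1
        if exits > 0:
            surv *= 1.0 - exits / n_at_risk
            if 0.0 < surv < 1.0:
                greenwood_sum += exits / (n_at_risk * (n_at_risk - exits))
                # Delta method on theta = log(-log S):
                # Var(theta) ~ greenwood_sum / (log S)^2
                se_theta = math.sqrt(greenwood_sum) / abs(math.log(surv))
                lower = surv ** math.exp(z * se_theta)
                upper = surv ** math.exp(-z * se_theta)
            else:
                # Degenerate case (S = 0): no interval can be computed.
                lower = upper = surv
            out.append((t, surv, lower, upper))
        n_at_risk -= leaving
    return out


# Toy data, for illustration only: five firms, two of them censored.
toy_curve = kaplan_meier_loglog([1, 2, 3, 4, 5], [1, 0, 1, 1, 0])
for time, s, lo, hi in toy_curve:
    print(f"t={time}: S={s:.3f}, CI=({lo:.3f}, {hi:.3f})")
```

A useful property of the log-minus-log transformation is that the resulting interval always stays within [0, 1], unlike a symmetric interval built directly on the survival probability.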