
Building a drug development database: challenges in reliable data availability

Pages 74-78 | Received 28 Jan 2016, Accepted 16 Jul 2016, Published online: 24 Aug 2016

Abstract

Context: Policy and legislative efforts to improve the biomedical innovation process must rely on a detailed and thorough analysis of drug development and industry output.

Objective: As part of our efforts to build a publicly-available database on the characteristics of drug development, we present work undertaken to test methods for compiling data from public sources. These initial steps are designed to explore challenges in data extraction, completeness and reliability. Specifically, filing dates for Investigational New Drug (IND) applications with the U.S. Food and Drug Administration (FDA) were chosen as the initial objective data element to be collected.

Materials and methods: FDA’s Drugs@FDA database and the Federal Register (FR) were used to collect IND dates for the 587 New Molecular Entities (NMEs) approved between 1994 and 2014. When available, the following data were captured: approval date, IND number, IND date and source of information.

Results: At least one IND date was available for 445 (75.8%) of the 587 NMEs. The Drugs@FDA database provided IND dates for 303 (51.6%) NMEs and the FR provided IND dates for 297 (50.6%) NMEs. Out of the 445 NMEs for which an IND date was obtained, 274 (61.6%) had more than one date reported.

Discussion: The key finding of this paper is considerable inconsistency in the availability and reporting of data elements, in this case IND application filing dates assembled from publicly-available sources.

Conclusion: Our team will continue to focus on finding ways to collect relevant information to measure the impact of drug innovation.

Introduction

The last year has seen a number of high-profile policy and legislative efforts in the United States aimed at improving the drug development process [1–3]. They have been crafted to tackle underlying issues related to the outsize time, cost and inefficiencies associated with the traditional drug development process. To realize their full potential, these well-intentioned efforts would benefit from a deeper understanding of opportunities for improvement throughout this process, as well as from more research on the impact of reforms intended to influence how drugs are developed and approved. But such research is hampered by the limited availability of standard, consistent and routinely collected measures of progress in pharmaceutical innovation.

While new drugs and biologics have made a tremendous impact on the treatment of many diseases in the past several decades, research and development spending by the pharmaceutical industry has grown many-fold. At the same time, the annual number of new drugs approved has varied from year to year but has not risen in step with the steady growth of R&D spending [4]. Many analysts have attempted to explain this apparent decline in the productivity of pharmaceutical research and development, but a significant challenge facing such analyses has been the paucity of metrics available to assess the health of different aspects of the complex ecosystem that propels pharmaceutical innovation [5–10].

Gathering real data to measure progress in reducing the time, cost and inefficiency of development, as well as interpreting those metrics, poses many challenges. To better understand the impact that policies have had or are having on the productivity of the innovation enterprise, we argue that more comprehensive historical data and trend analysis will be required.

In a previous article, our research teams proposed a publicly-available database designed to help address these important issues and enable better analyses of biomedical innovation [11]. We outlined a database that centralizes data elements related to approved New Molecular Entities (NMEs), their development and downstream impact on patient care; this database can eventually be extended to incorporate data on drug candidates that do not successfully navigate the development and approval process. By building a more comprehensive set of data and measures related to these characteristics, analyses could move beyond more traditional approaches that tend to focus on only a few aspects of drug development and associated trends (e.g. tracking annual approvals, changes in research and development spending, etc.). In short, we hope to shed better light on the factors that enable and contribute to successful product development and on the underlying drivers of innovation.

For initial database development, and as a way to test approaches to collecting data from public sources, we decided to explore and better characterize trends in development timelines to uncover the factors and policies that hamper timely approval and patient access. A clearer picture of variability in development timelines may help to identify common bottlenecks or barriers to a more efficient process, and in turn inform policy making to overcome these challenges. Creating this picture will require collection of key dates associated with early compound development, initiation of clinical trial phases, and milestone dates during regulatory review. While the development ecosystem is increasingly global, for the purposes of this initial round of data development our focus has been on identifying and collecting data elements for drugs successfully approved by the U.S. Food and Drug Administration (FDA).

In an ideal scenario, we would be able to access dates from initiation of first-in-human testing all the way through to approval for each NME in the database. For purposes of assessing the immediate availability of relevant data, however, we decided to focus first on one objective data element – date of Investigational New Drug (IND) submission in the US. While this is far from conclusive in establishing the start date of clinical development, the IND date provides an initial reasonable proxy for the start of a clinical development program in the US and is an important first step in collecting broader data on development timelines.

It should be noted that there are some potential challenges to using IND dates for timeline analysis purposes given that they may not be indicative of previous international development efforts or may not be inclusive of other modes of early development (e.g. expanded access programs). Still, using IND dates for initial research will enable us to troubleshoot data collection activities and explore additional data elements that, used in conjunction with INDs, could create composite measures for tracking development timelines. Our hope was that IND dates would be accurately and reliably captured and reported, given their association with a Federal agency and approved (i.e. nonproprietary, non-failure) products.

The process described in this paper not only contributes to building our database in earnest, but also provides guideposts for a number of questions related to database development. First, how easy will it be to extract key information related to drug development from public sources? Second, how will we identify and overcome gaps in that data? And third, how reliable will the publicly-available data turn out to be? Below, we walk through our findings and outline the challenges to finding adequate and accurate data from public resources.

Methods

List of NMEs

The first step in building this database was to list the NMEs approved by the FDA since 1994. For this purpose, we used lists of NMEs available in various published papers for the years 1994–2011, and the list of approved NMEs published on the FDA’s website for the years 2012–2014 [7,9,12]. Following Lanthier’s methodology, we focused on novel therapeutic products and excluded diagnostic drugs, drugs approved under an abbreviated regulatory pathway, and products intended for use solely by US military personnel. Our final data set consisted of 587 NME approvals.

Sources for IND dates

IND dates were extracted from two sources: FDA’s Drugs@FDA database and the Federal Register (FR). The methodology to extract IND dates from the Drugs@FDA database was as follows: we downloaded all documents available in the database for each NME. After conversion into searchable documents, we performed a search in each document for the word “IND” and collected the date as well as the IND number, when these data points were present. Data were collected as shown in the relevant document. Information about whether it was the submission date, receipt date or safe to proceed date was not collected because it was not systematically mentioned next to the IND date. For those NMEs that had more than one IND date, we collected all the dates we could find. To search for IND dates on the FR website, we first downloaded the list of links corresponding to all drugs for which a company requested a patent extension using “Determination of Regulatory Review Period for Purposes of Patent Extension” as the search term; the IND date is systematically collected in the FR for the extension process. We accessed the patent extension request and identified the IND date and number. As with the Drugs@FDA database, the FR contained more than one IND date for some NMEs, in which cases we collected all dates.
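The keyword search described above can be illustrated with a minimal Python sketch. It is not the tool used in this work: the regular expressions, the 120-character search window and the function names are assumptions made for illustration, and real review documents use many more number and date formats than the two handled here.

```python
import re
from datetime import datetime

MONTHS = ("January|February|March|April|May|June|July|"
          "August|September|October|November|December")

# Hypothetical patterns: "IND 12345", "IND No. 12,345", "IND #12345".
IND_NUMBER = re.compile(r"\bIND\s*(?:No\.|#)?\s*(\d{2,3},?\d{3})\b")

# Two assumed date layouts: "5 March 2001" and "March 5, 2001".
DATE = re.compile(
    rf"\b(?:(\d{{1,2}}) ({MONTHS}) (\d{{4}})|({MONTHS}) (\d{{1,2}}), (\d{{4}}))\b"
)


def parse_date(match):
    """Turn a DATE match into a datetime.date, whichever layout matched."""
    if match.group(1):  # day-first layout
        day, month, year = match.group(1), match.group(2), match.group(3)
    else:               # month-first layout
        day, month, year = match.group(5), match.group(4), match.group(6)
    return datetime.strptime(f"{day} {month} {year}", "%d %B %Y").date()


def extract_ind_mentions(text, window=120):
    """Return (ind_number, date_or_None) for each IND number found in text.

    The date is the first one found within `window` characters after the
    IND number; its type (submission, receipt, safe to proceed) is unknown.
    """
    mentions = []
    for m in IND_NUMBER.finditer(text):
        number = m.group(1).replace(",", "")
        nearby = DATE.search(text[m.end(): m.end() + window])
        mentions.append((number, parse_date(nearby) if nearby else None))
    return mentions
```

Because such a search attaches any nearby date to an IND number regardless of context, a document can yield several dates for one NME, and each hit still needs manual review.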

Data analysis

Data were compiled in an Excel database. The following information was captured, when available: approval date, IND number, IND date, source of information (FR versus Drugs@FDA database). Analysis was done as a count of IND dates obtained for each NME and per data source. When more than one IND date was found, the time difference between these dates was calculated.
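As a rough illustration of this analysis, the sketch below counts the date entries recorded for each NME, lists the sources they came from, and computes the difference in days between the earliest and latest date. The data were compiled in Excel; the record layout `(nme, source, ind_date)` and the function name `summarize` are assumptions introduced here.

```python
from collections import defaultdict
from datetime import date


def summarize(records):
    """Summarize IND date entries per NME.

    records: iterable of (nme, source, ind_date) tuples, one per date entry.
    Returns {nme: {"n_dates", "sources", "spread_days"}}.
    """
    dates = defaultdict(list)   # nme -> every date entry, duplicates kept
    sources = defaultdict(set)  # nme -> sources that reported a date
    for nme, source, ind_date in records:
        dates[nme].append(ind_date)
        sources[nme].add(source)

    summary = {}
    for nme, entries in dates.items():
        summary[nme] = {
            "n_dates": len(entries),
            "sources": sorted(sources[nme]),
            # Difference between the most extreme dates; 0 for a single entry.
            "spread_days": (max(entries) - min(entries)).days,
        }
    return summary
```

Keeping duplicate entries matters here: an NME with the same date reported by both sources still counts as having more than one date reported, with a spread of zero days.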

Limitations

It should be noted that this work has a number of limitations. First, IND submission represents the opening of clinical development in the US. A proportion of drug development occurs outside of the US and will not be captured by the IND date. In addition, when a drug’s sponsor changes, a new IND may be submitted. Keeping this limitation in mind, we still decided to proceed with collecting IND dates because they represent an objective and highly relevant data point in the US. Secondly, we did not collect information regarding the type of IND date, which depending on reporting could represent the date the sponsor submitted the application, the date FDA acknowledged receipt of the application, or the date FDA allowed the IND to proceed. The search tool we used was designed to retrieve the IND number and the IND date only. The tool allowed us to collect all dates associated with an IND, regardless of the context for each date. This is why more than one date was retrieved for some drugs. Thirdly, our effort concentrated on collecting IND dates that correspond to the IND submitted for the NME’s specific indication and dosage form/route of administration. In some instances, however, several IND numbers and/or dates were available in the documents associated with a single NME. Some of these IND dates corresponded to the earliest IND submitted for the compound for any indication. As the purpose of the work was to determine what type of information was accessible in publicly-available sources, we collected any IND date we found in documents associated with an NME, regardless of whether it was for the NME’s specific indication or for some earlier development. This aspect is discussed further below.

Results

IND date availability

At least one IND date was available for 445 (75.8%) of the 587 NMEs. As expected, a lower percentage of IND dates was recovered for earlier approval years (Figure 1). For the 142 (24.2%) NMEs for which IND dates could not be found, there was either no mention of the IND date in the documents reviewed, or the date was redacted. We were unable to identify any consistent pattern in the redaction of IND dates.

Figure 1. IND date availability per year. Notes: IND dates were identified and collected from two primary public sources: the Drugs@FDA database and the Federal Register. The availability of dates varied over time, both by total availability of IND dates year-to-year and availability by source.


IND date per data source

The Drugs@FDA database provided IND dates for 303 (51.6%) NMEs and the FR provided IND dates for 297 (50.6%) NMEs. Looking at the overlap between the two data sources, 26.4% of the 587 NMEs had IND dates from both the Drugs@FDA database and the FR, 25.2% from the Drugs@FDA database only and 24.2% from the FR only, with no IND date identified for the remaining 24.2% (Figure 2).
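The overlap percentages follow from the reported counts by inclusion-exclusion. The short Python sketch below reproduces that arithmetic; the function name and argument names are illustrative assumptions, and only the four input counts come from the text.

```python
def source_overlap(total, n_drugs_fda, n_fr, n_any):
    """Split NMEs by source of IND date using inclusion-exclusion.

    total: all NMEs; n_drugs_fda / n_fr: NMEs with a date in each source;
    n_any: NMEs with a date in at least one source.
    """
    both = n_drugs_fda + n_fr - n_any
    counts = {
        "both": both,
        "drugs_fda_only": n_drugs_fda - both,
        "fr_only": n_fr - both,
        "none": total - n_any,
    }
    # Percentages of all NMEs, rounded to one decimal place.
    shares = {k: round(100 * v / total, 1) for k, v in counts.items()}
    return counts, shares


counts, shares = source_overlap(total=587, n_drugs_fda=303, n_fr=297, n_any=445)
```

With the reported inputs this gives 155 NMEs with dates in both sources, 148 in the Drugs@FDA database only, 142 in the FR only and 142 with no date.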

Figure 2. IND date availability per data source. Notes: IND dates were identified and collected from two primary public sources: the Drugs@FDA database and the Federal Register. The availability of dates varied over time, both by total availability of IND dates year-to-year and availability by source.


Access to data

IND dates were not referenced in a systematic way in the Drugs@FDA database documents. IND dates were found either in the medical review, correspondence review, pharmacology review, chemistry review, risk review or other areas. As a result, a resource-intensive process was required to locate the desired information. The information was more systematically captured in the FR and was easy to locate.

Multiple IND dates

Out of the 445 NMEs for which an IND date was obtained, 274 (61.6%) had more than one date reported. The split per data source is as follows: of the 148 NMEs whose IND dates were found in the Drugs@FDA database only, 15% had more than one date. Of the 142 NMEs whose IND dates were found only in the FR, 68% had more than one date. The NMEs with IND dates found in both the Drugs@FDA database and the FR all had, by definition, more than one date reported. Table 1 shows the number of IND dates per data source.

Table 1. Number of NMEs by date entry and data source.

Time gap between IND dates

For NMEs with multiple IND dates reported, we calculated the time difference between the most extreme dates and grouped them in six categories, from exact match to over 10 years difference. As shown in Table 2, dates were identical for 8% of NMEs with multiple IND dates, and 45% of NMEs had dates no more than one month apart. This most likely represents the difference between when the IND application was submitted and when it was allowed to proceed. The remaining 47% of NMEs had gaps of greater than one month, up to as long as 12 years. As also shown in Table 2, exact matches between two dates were mainly seen in dates obtained through the FR. Gaps of more than five years between available IND dates were seen mainly when the IND dates were sourced from both the Drugs@FDA database and the FR.
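The grouping step can be sketched in Python as follows. The text names only some of the category boundaries (exact match, one month, five years, over 10 years), so the one-year boundary, the day-count approximations (31 days for a month, 365 days for a year) and the labels below are assumptions chosen for illustration.

```python
from datetime import date

# Assumed upper bounds in days for each category except the last.
BUCKETS = [
    (0, "exact match"),
    (31, "up to 1 month"),
    (365, "1 month to 1 year"),
    (5 * 365, "1 to 5 years"),
    (10 * 365, "5 to 10 years"),
]
LAST_BUCKET = "over 10 years"


def gap_category(dates):
    """Categorize the gap between the earliest and latest date for one NME."""
    gap_days = (max(dates) - min(dates)).days
    for upper_bound, label in BUCKETS:
        if gap_days <= upper_bound:
            return label
    return LAST_BUCKET
```

Using the most extreme dates means that an NME with three or more reported dates is classified by its widest discrepancy, not by the typical distance between entries.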

Table 2. Gaps between multiple dates reported for the same NME.

Discussion

Our rationale in selecting IND dates for this initial exercise in database development was to select an objective data element that would present minimal methodologic issues and could be obtained from public-domain sources. This exercise revealed, however, that IND dates are not systematically documented in the Drugs@FDA database, are sometimes redacted for reasons that do not appear to be clear or consistent, and sometimes present significant discrepancies. The FR provided more consistent IND data, but only on drugs with patent extension requests. Though dates are more readily available for recent years, the process of identifying and assembling them from various sources and subsections within those sources proved time consuming and resource intensive, potentially indicating additional challenges in accessing other data that are arguably public and nonproprietary.

Reconciling multiple IND dates for individual NMEs is another potential data collection challenge. In the 53% of cases where gaps in data are less than one month, we assume the gap is due to administrative reasons. Wider discrepancies in IND dates may potentially relate to IND transfers from one organization to another or initiation of development in a different indication, among other potential factors.

Lomitapide is a good illustration of this situation, with two different INDs (IND 50820 and IND 77775) and two dates (18 June 1996 and 16 May 2007) identified. Information compiled in the FDA’s Administrative and Correspondence documents clarifies the history of the drug and lists the various owners of the molecule (Bristol-Myers Squibb; Daniel Rader, University of Pennsylvania; and Aegerion) and its registration for various indications (Heterozygous Familial Hypercholesterolemia (HeFH) and Homozygous Familial Hypercholesterolemia (HoFH)). Such explanations allow a better understanding of the regulatory history of lomitapide’s development, and are helpful in deciding which IND date to use. Unfortunately, the existence of such a clear explanation is the exception, not the rule. In most cases where multiple IND dates were found, the explanation is either absent or redacted. In those cases, thorough desk research and discussions with experts will be necessary to understand development history and determine which date to use.

This exercise is also useful in establishing methods for the data mining and extraction that will have to be done on a large scale to continue populating a database on drug development. As we intend to broaden this database beyond successfully approved NMEs and to unapproved products, having a means by which we can easily and quickly gather reliable data will be key. Work related to populating IND dates, however, demonstrates that a high proportion of discrepancies between data elements pulled from multiple sources will necessitate rigorous data review processes for ensuring accuracy.

Initial work also further underscores the need for a constellation approach to innovation metrics. In many cases, it may not be possible to rely on a single data element such as IND date to adequately characterize the metric of interest, such as development time in this instance.

Given ongoing policy efforts to improve drug development, there remains a real need for systematically available data that can better contribute to such composites, characterize the innovation process and measure whether policies are having their intended effect. While retrospective data collection and analysis like we have proposed can eventually mine and compile such data from an assortment of public and private sources, policy makers themselves could begin to address the dearth of such detailed information by identifying new opportunities for centralizing and reporting data moving forward. The internal practices at FDA that lead to some of the inconsistencies in reporting described above could be addressed through more systematic and prospective data collection, more standardized review documents or new data fields populated by front-line review staff. This should be done in a way that does not divert internal resources or create additional workflow burden, but rather through more routine capture of data elements as part of already-established processes.

Tackling these issues through FDA reporting may require changes to underlying regulations. Ongoing PDUFA VI negotiations could serve as a vehicle for potential reporting requirements tied not to typical Agency performance measures, but instead to the development characteristics of the products themselves. Similarly, many public reporting challenges would be most easily remedied by better, more consistent disclosure of basic information related to drug development from manufacturers themselves. Moving forward, we strongly recommend that critical and objective information related to drug development, such as the IND dates we discuss above, be systematically captured during registration and made public in a consistent format.

Conclusion

As our teams continue work on the development of a comprehensive, publicly-available data resource on drug development and innovation, we look forward to finding novel ways for tackling these and other data challenges as they arise.

Disclosure statement

The authors report no declarations of interest.

References