This vignette gives a very brief overview of the current package. To start, we load the package into R.
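The loading step can be sketched as follows. The package name is not stated in this text, so `ctrialsgov` is an assumption based on the `ctgov_` prefix of the functions used later.

```r
# Assumption: the package is installed under the name "ctrialsgov";
# adjust the name if your installation differs.
library(ctrialsgov)
```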
In the next few sections, we see how to set up the data set, query it, and then visualize the output.
Before querying the ClinicalTrials.gov data, we need to load a pre-processed version of the data into R. There are three ways to do this. If you have installed a local copy of the data set in PostgreSQL, the data can be created from scratch with the following block of code (it takes a couple of minutes to finish):
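library(DBI)
library(RPostgreSQL)

# connect to the local PostgreSQL database named "aact"
drv <- dbDriver("PostgreSQL")
con <- DBI::dbConnect(drv, dbname = "aact")

# build the pre-processed data set from the database tables
ctgov_create_data(con)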
Alternatively, we can download a static version of the data from GitHub and load it into R without needing to set up a local version of the database. The download is cached locally so that it can be re-loaded without downloading each time. To download and load this data, use the following:
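A minimal sketch of this step is given here. The loader name `ctgov_load_cache()` does not appear in this text and is an assumption; the call needs network access the first time it runs.

```r
library(ctrialsgov)

# Assumption: ctgov_load_cache() is the loader for the cached GitHub snapshot.
# The first call downloads the file; later calls reuse the local copy.
ctgov_load_cache()

# with no arguments, ctgov_query() returns the full loaded data set
all_studies <- ctgov_query()
nrow(all_studies)
```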
Finally, we can load a small sample dataset (2% of the total) that is included with the package itself using the following:
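As a sketch, and assuming the sample loader is named `ctgov_load_sample()` (the name is not given in this text), the call looks like this:

```r
library(ctrialsgov)

# Assumption: ctgov_load_sample() loads the sample bundled with the package.
ctgov_load_sample()

# with no arguments, ctgov_query() returns everything that was loaded
sample_studies <- ctgov_query()
dim(sample_studies)
```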
This is the version of the data that is used in most of the tests, examples, and in this vignette.
The primary function for querying the dataset is called ctgov_query. It can be called after using any of the functions in the previous section. We will see a few examples here; see the help pages for a complete list of options.
There are a number of fields in the data that use exact matches of categories. Here, for example, we find the interventional studies:
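The call that produces output of this kind is probably of the following form. The argument name `study_type` and the value `"Interventional"` are assumptions inferred from the `study_type` column in the printed result, and `ctgov_load_sample()` is an assumed loader name; the help page for `ctgov_query` lists the accepted fields and values.

```r
library(ctrialsgov)

ctgov_load_sample()  # assumed name of the sample-data loader

# Assumption: categorical fields are passed by column name and
# matched exactly against the given value.
interventional <- ctgov_query(study_type = "Interventional")
interventional
```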
## # A tibble: 2,403 × 32
## nct_id start_date phase enrollment brief_title official_title
## <chr> <date> <chr> <int> <chr> <chr>
## 1 NCT04999163 2021-12-31 N/A 50 Aortix Ther… Aortix Therap…
## 2 NCT05002153 2021-11-30 N/A 300 The Role of… The Role of M…
## 3 NCT04472702 2021-11-30 N/A 45 Fluoroscopi… Fluoroscopic …
## 4 NCT05032157 2021-11-30 Phase 3 450 A Phase 3 S… A Multicenter…
## 5 NCT04471142 2021-11-08 N/A 270 Effectivene… Effectiveness…
## 6 NCT04772651 2021-11-01 N/A 108 Mediterrane… Mediterranean…
## 7 NCT04390451 2021-11-01 Phase 1 54 Initial Tes… Initial Testi…
## 8 NCT04696861 2021-11-01 N/A 60 Telehealth … Telehealth to…
## 9 NCT03954431 2021-10-31 Phase 1/Phase 2 100 High-Resolu… Study of High…
## 10 NCT04273022 2021-10-31 N/A 20 Effect of E… The Effect of…
## # ℹ 2,393 more rows
## # ℹ 26 more variables: primary_completion_date <date>, study_type <chr>,
## # rec_status <chr>, completion_date <date>, last_update <date>,
## # description <chr>, eudract_num <chr>, other_id <chr>, allocation <chr>,
## # intervention_model <chr>, observational_model <chr>, primary_purpose <chr>,
## # time_perspective <chr>, masking_description <chr>,
## # intervention_model_description <chr>, sampling_method <chr>, …
Or, all of the interventional studies that have a primary industry sponsor:
## # A tibble: 640 × 32
## nct_id start_date phase enrollment brief_title official_title
## <chr> <date> <chr> <int> <chr> <chr>
## 1 NCT04999163 2021-12-31 N/A 50 Aortix Ther… Aortix Therap…
## 2 NCT05032157 2021-11-30 Phase 3 450 A Phase 3 S… A Multicenter…
## 3 NCT05029856 2021-10-04 Phase 1/Phase 2 240 Evaluation … A Randomized,…
## 4 NCT04963179 2021-09-30 N/A 154 PREvention … PREvention of…
## 5 NCT04875975 2021-09-30 Phase 2 68 A Study to … A Randomized,…
## 6 NCT04909879 2021-09-30 Phase 2 100 Study of Al… Treatment of …
## 7 NCT04925674 2021-09-29 Phase 1 60 Study of HE… Phase Ic Clin…
## 8 NCT04935177 2021-09-17 Phase 3 64 Renal Funct… An Open-label…
## 9 NCT04956744 2021-08-31 Phase 1 30 A Study to … A Phase 1, Do…
## 10 NCT04920253 2021-08-31 N/A 180 Real World … Real World Ev…
## # ℹ 630 more rows
## # ℹ 26 more variables: primary_completion_date <date>, study_type <chr>,
## # rec_status <chr>, completion_date <date>, last_update <date>,
## # description <chr>, eudract_num <chr>, other_id <chr>, allocation <chr>,
## # intervention_model <chr>, observational_model <chr>, primary_purpose <chr>,
## # time_perspective <chr>, masking_description <chr>,
## # intervention_model_description <chr>, sampling_method <chr>, …
A few fields have continuous values that can be searched by giving a vector with two values. The query returns the studies whose value falls between the lower bound (first value) and the upper bound (second value), with both bounds included. Here, we find the studies that have between 40 and 42 patients enrolled in them:
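A sketch of this query uses the `enrollment_range` argument that appears in the combined example later in this vignette. The loader `ctgov_load_sample()` is an assumed name.

```r
library(ctrialsgov)

ctgov_load_sample()  # assumed name of the sample-data loader

# lower bound first, upper bound second
small_band <- ctgov_query(enrollment_range = c(40, 42))

# the enrollment values present in the result
sort(unique(small_band$enrollment))
```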
## # A tibble: 125 × 32
## nct_id start_date phase enrollment brief_title official_title
## <chr> <date> <chr> <int> <chr> <chr>
## 1 NCT04188119 2021-09-30 Phase 2 42 A Proof of … A Proof of Co…
## 2 NCT04992975 2021-08-31 <NA> 40 Brain Iron … Brain Iron To…
## 3 NCT05001854 2021-08-31 Phase 2/Phase 3 40 Hemodynamic… Evaluation of…
## 4 NCT04749355 2021-08-14 Phase 2 40 Phase 2, Op… A Phase 2, Op…
## 5 NCT04648319 2021-04-15 Phase 2 40 A Study of … A Pilot Study…
## 6 NCT04744779 2021-03-31 N/A 40 Office Base… Effectiveness…
## 7 NCT04841174 2021-03-30 N/A 40 The Effect … The Effect of…
## 8 NCT04808180 2021-03-25 N/A 40 Clinical Ef… Effects of Bi…
## 9 NCT04746105 2021-02-24 Phase 1 40 A Clinical … A Study to Ev…
## 10 NCT04355780 2021-01-08 <NA> 40 Immunologic… Immunologic F…
## # ℹ 115 more rows
## # ℹ 26 more variables: primary_completion_date <date>, study_type <chr>,
## # rec_status <chr>, completion_date <date>, last_update <date>,
## # description <chr>, eudract_num <chr>, other_id <chr>, allocation <chr>,
## # intervention_model <chr>, observational_model <chr>, primary_purpose <chr>,
## # time_perspective <chr>, masking_description <chr>,
## # intervention_model_description <chr>, sampling_method <chr>, …
Setting one end of the range to a missing value (NA) leaves that end unbounded. For example, the following finds any studies with 1000 or more patients:
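One way to write this, assuming the missing end is given as `NA` and the sample loader is named `ctgov_load_sample()`:

```r
library(ctrialsgov)

ctgov_load_sample()  # assumed name of the sample-data loader

# Assumption: NA as the second value means "no upper bound"
large_studies <- ctgov_query(enrollment_range = c(1000, NA))

# smallest enrollment that passed the filter
min(large_studies$enrollment, na.rm = TRUE)
```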
## # A tibble: 204 × 32
## nct_id start_date phase enrollment brief_title official_title
## <chr> <date> <chr> <int> <chr> <chr>
## 1 NCT05033782 2021-12-01 <NA> 1500 Impact of the Modif… Impact of the…
## 2 NCT05033548 2021-10-10 <NA> 4000 Technology Enabled … Technology En…
## 3 NCT04982614 2021-10-01 Phase 4 1400 HPV Vaccination in … A Multi-site,…
## 4 NCT05033678 2021-08-16 <NA> 8000 Implementation of T… Teledermoscop…
## 5 NCT04917185 2021-06-30 N/A 1000 EA for PAAS: A pRCT Electro-acupu…
## 6 NCT04839757 2021-06-03 <NA> 1400 Dengue Vaccine Stra… Preparing for…
## 7 NCT04889924 2021-06-01 N/A 1666 ALND vs RDT in Posi… Axillary Lymp…
## 8 NCT04472845 2021-03-30 N/A 1018 HYPofractionated Ad… HYPofractiona…
## 9 NCT04735744 2021-02-15 <NA> 1315 Evaluation of Allie… Evaluation of…
## 10 NCT04626973 2021-01-15 N/A 3048 Effects of Ezetimib… Effects of Ez…
## # ℹ 194 more rows
## # ℹ 26 more variables: primary_completion_date <date>, study_type <chr>,
## # rec_status <chr>, completion_date <date>, last_update <date>,
## # description <chr>, eudract_num <chr>, other_id <chr>, allocation <chr>,
## # intervention_model <chr>, observational_model <chr>, primary_purpose <chr>,
## # time_perspective <chr>, masking_description <chr>,
## # intervention_model_description <chr>, sampling_method <chr>, …
Similarly, we can give a range of dates. These are given in the form of strings as “YYYY-MM-DD”:
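A sketch with the `date_range` argument from the combined example later in this vignette. The two dates are chosen for illustration and need not be the ones behind the printed output; `ctgov_load_sample()` is an assumed loader name.

```r
library(ctrialsgov)

ctgov_load_sample()  # assumed name of the sample-data loader

# dates are passed as "YYYY-MM-DD" strings: earliest first, latest second
jan_2020 <- ctgov_query(date_range = c("2020-01-01", "2020-02-01"))

# earliest and latest start date in the result
range(jan_2020$start_date, na.rm = TRUE)
```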
## # A tibble: 34 × 32
## nct_id start_date phase enrollment brief_title official_title
## <chr> <date> <chr> <int> <chr> <chr>
## 1 NCT04224597 2020-02-01 <NA> 48 Evaluation … Evaluation of…
## 2 NCT04255524 2020-02-01 N/A 200 Choroidal C… OCTA to Quant…
## 3 NCT04336605 2020-02-01 <NA> 25000 Killing Pai… Killing Pain …
## 4 NCT04218669 2020-02-01 N/A 105 The Approac… A Clinical Ra…
## 5 NCT04409613 2020-02-01 N/A 59 Cost-Effect… Cost-Effectiv…
## 6 NCT04424576 2020-01-31 <NA> 60 Ovarian Mor… Trajectory of…
## 7 NCT04115397 2020-01-31 Phase 4 80 Bisphosphon… Towards Effic…
## 8 NCT04497064 2020-01-30 <NA> 585 Breakfast K… Breakfast Kno…
## 9 NCT03892785 2020-01-27 Phase 3 200 MEthotrexat… MEthotrexate …
## 10 NCT03710122 2020-01-23 Phase 2/Phase 3 102 Vancomycin … A Prospective…
## # ℹ 24 more rows
## # ℹ 26 more variables: primary_completion_date <date>, study_type <chr>,
## # rec_status <chr>, completion_date <date>, last_update <date>,
## # description <chr>, eudract_num <chr>, other_id <chr>, allocation <chr>,
## # intervention_model <chr>, observational_model <chr>, primary_purpose <chr>,
## # time_perspective <chr>, masking_description <chr>,
## # intervention_model_description <chr>, sampling_method <chr>, …
Finally, we can also search free text fields using keywords. The following, for example, finds any study that includes the phrase “lung cancer” (ignoring case) in the description field:
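A sketch using the `description_kw` argument, which also appears in the combined example later in this vignette. The loader `ctgov_load_sample()` is an assumed name.

```r
library(ctrialsgov)

ctgov_load_sample()  # assumed name of the sample-data loader

# keyword search over the free-text description field
lung <- ctgov_query(description_kw = "lung cancer")

# number of matching studies
nrow(lung)
```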
## # A tibble: 59 × 32
## nct_id start_date phase enrollment brief_title official_title
## <chr> <date> <chr> <int> <chr> <chr>
## 1 NCT04814056 2021-06-01 Phase 4 15 To Evaluate the Eff… An Open-Label…
## 2 NCT04629027 2021-03-03 <NA> 80 Evaluation System f… Establishment…
## 3 NCT04179305 2020-10-25 N/A 58 Giving Information … Giving Inform…
## 4 NCT04452877 2020-08-19 Phase 2 20 A Study of Dabrafen… An Open-Label…
## 5 NCT04422392 2020-07-13 Phase 2 107 Neoadjuvant PD-1 An… Neoadjuvant P…
## 6 NCT04120454 2020-03-16 Phase 2 34 Ramucirumab and Pem… An Investigat…
## 7 NCT04332367 2019-12-19 Phase 2 59 Carboplatin, Taxane… Phase II, Sin…
## 8 NCT04309955 2019-12-01 N/A 60 Modified Versus Tra… Randomized Cl…
## 9 NCT04151940 2019-09-26 <NA> 40 PET/CT Changes Duri… An Observatio…
## 10 NCT04081688 2019-08-21 Phase 1 15 Atezolizumab and Va… A Phase I Tri…
## # ℹ 49 more rows
## # ℹ 26 more variables: primary_completion_date <date>, study_type <chr>,
## # rec_status <chr>, completion_date <date>, last_update <date>,
## # description <chr>, eudract_num <chr>, other_id <chr>, allocation <chr>,
## # intervention_model <chr>, observational_model <chr>, primary_purpose <chr>,
## # time_perspective <chr>, masking_description <chr>,
## # intervention_model_description <chr>, sampling_method <chr>, …
We can also search for two terms at once. By default, the query returns studies that match at least one of the terms:
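A sketch of a two-term search. The second term, "colon cancer", is a placeholder chosen for illustration because the original terms are not shown in this text; `ctgov_load_sample()` is an assumed loader name.

```r
library(ctrialsgov)

ctgov_load_sample()  # assumed name of the sample-data loader

# a single term, for comparison
lung_only <- ctgov_query(description_kw = "lung cancer")

# two terms: by default a study matches if either term is found
# ("colon cancer" is an illustrative placeholder)
either_term <- ctgov_query(description_kw = c("lung cancer", "colon cancer"))

c(lung_only = nrow(lung_only), either_term = nrow(either_term))
```

Because a study needs to match only one term, the two-term result can never be smaller than the single-term result.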
## # A tibble: 59 × 32
## nct_id start_date phase enrollment brief_title official_title
## <chr> <date> <chr> <int> <chr> <chr>
## 1 NCT04814056 2021-06-01 Phase 4 15 To Evaluate the Eff… An Open-Label…
## 2 NCT04629027 2021-03-03 <NA> 80 Evaluation System f… Establishment…
## 3 NCT04179305 2020-10-25 N/A 58 Giving Information … Giving Inform…
## 4 NCT04452877 2020-08-19 Phase 2 20 A Study of Dabrafen… An Open-Label…
## 5 NCT04422392 2020-07-13 Phase 2 107 Neoadjuvant PD-1 An… Neoadjuvant P…
## 6 NCT04120454 2020-03-16 Phase 2 34 Ramucirumab and Pem… An Investigat…
## 7 NCT04332367 2019-12-19 Phase 2 59 Carboplatin, Taxane… Phase II, Sin…
## 8 NCT04309955 2019-12-01 N/A 60 Modified Versus Tra… Randomized Cl…
## 9 NCT04151940 2019-09-26 <NA> 40 PET/CT Changes Duri… An Observatio…
## 10 NCT04081688 2019-08-21 Phase 1 15 Atezolizumab and Va… A Phase I Tri…
## # ℹ 49 more rows
## # ℹ 26 more variables: primary_completion_date <date>, study_type <chr>,
## # rec_status <chr>, completion_date <date>, last_update <date>,
## # description <chr>, eudract_num <chr>, other_id <chr>, allocation <chr>,
## # intervention_model <chr>, observational_model <chr>, primary_purpose <chr>,
## # time_perspective <chr>, masking_description <chr>,
## # intervention_model_description <chr>, sampling_method <chr>, …
But the match_all flag can be set to search for both terms at the same time (here, that returns no matches):
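A sketch comparing the two modes. It assumes `match_all` takes a logical value; the terms are illustrative placeholders and `ctgov_load_sample()` is an assumed loader name.

```r
library(ctrialsgov)

ctgov_load_sample()  # assumed name of the sample-data loader

# illustrative search terms, not taken from the original vignette
terms <- c("lung cancer", "colon cancer")

# default: a study matches if at least one term is found
either <- ctgov_query(description_kw = terms)

# Assumption: match_all = TRUE requires every term to be found
both <- ctgov_query(description_kw = terms, match_all = TRUE)

c(either = nrow(either), both = nrow(both))
```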
## # A tibble: 0 × 32
## # ℹ 32 variables: nct_id <chr>, start_date <date>, phase <chr>,
## # enrollment <int>, brief_title <chr>, official_title <chr>,
## # primary_completion_date <date>, study_type <chr>, rec_status <chr>,
## # completion_date <date>, last_update <date>, description <chr>,
## # eudract_num <chr>, other_id <chr>, allocation <chr>,
## # intervention_model <chr>, observational_model <chr>, primary_purpose <chr>,
## # time_perspective <chr>, masking_description <chr>, …
Other keyword fields include official_title_kw, source_kw, and criteria_kw.
Any of the options can be combined as needed.
ctgov_query(
  description_kw = "cancer",
  enrollment_range = c(100, 200),
  date_range = c("2019-01-01", "2020-02-01")
)
## # A tibble: 5 × 32
## nct_id start_date phase enrollment brief_title official_title
## <chr> <date> <chr> <int> <chr> <chr>
## 1 NCT04035447 2020-01-22 N/A 120 Symptom Management f… Improving Sym…
## 2 NCT04227327 2020-01-07 Phase 2 121 Study Evaluating Abe… A Phase 2, Op…
## 3 NCT04404244 2020-01-01 <NA> 100 Extraordinary Measur… Extraordinary…
## 4 NCT03902600 2019-03-12 <NA> 115 Moderately Hypofract… Moderately Hy…
## 5 NCT03813953 2019-02-20 N/A 160 The Effect of Analge… The Effect of…
## # ℹ 26 more variables: primary_completion_date <date>, study_type <chr>,
## # rec_status <chr>, completion_date <date>, last_update <date>,
## # description <chr>, eudract_num <chr>, other_id <chr>, allocation <chr>,
## # intervention_model <chr>, observational_model <chr>, primary_purpose <chr>,
## # time_perspective <chr>, masking_description <chr>,
## # intervention_model_description <chr>, sampling_method <chr>, gender <chr>,
## # minimum_age <dbl>, maximum_age <dbl>, population <chr>, criteria <chr>, …
Finally, we can also pass an existing data set, such as the result of an earlier query, to the query function, rather than starting with the full data set. This is useful when you want to combine queries in a more complex way. For example, this is equivalent to the above:
library(dplyr)

ctgov_query() %>%
  ctgov_query(description_kw = "cancer") %>%
  ctgov_query(enrollment_range = c(100, 200)) %>%
  ctgov_query(date_range = c("2019-01-01", "2020-02-01"))
## # A tibble: 5 × 32
## nct_id start_date phase enrollment brief_title official_title
## <chr> <date> <chr> <int> <chr> <chr>
## 1 NCT04035447 2020-01-22 N/A 120 Symptom Management f… Improving Sym…
## 2 NCT04227327 2020-01-07 Phase 2 121 Study Evaluating Abe… A Phase 2, Op…
## 3 NCT04404244 2020-01-01 <NA> 100 Extraordinary Measur… Extraordinary…
## 4 NCT03902600 2019-03-12 <NA> 115 Moderately Hypofract… Moderately Hy…
## 5 NCT03813953 2019-02-20 N/A 160 The Effect of Analge… The Effect of…
## # ℹ 26 more variables: primary_completion_date <date>, study_type <chr>,
## # rec_status <chr>, completion_date <date>, last_update <date>,
## # description <chr>, eudract_num <chr>, other_id <chr>, allocation <chr>,
## # intervention_model <chr>, observational_model <chr>, primary_purpose <chr>,
## # time_perspective <chr>, masking_description <chr>,
## # intervention_model_description <chr>, sampling_method <chr>, gender <chr>,
## # minimum_age <dbl>, maximum_age <dbl>, population <chr>, criteria <chr>, …
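As an illustration of a more complex combination, the sketch below builds the union of two queries, which a single call cannot express, and then narrows that union with a further query. The variable names are hypothetical and `ctgov_load_sample()` is an assumed loader name; `bind_rows()` and `distinct()` come from dplyr.

```r
library(ctrialsgov)
library(dplyr)

ctgov_load_sample()  # assumed name of the sample-data loader

# two separate queries whose union cannot be written as a single call
mid_sized_cancer <- ctgov_query(
  description_kw = "cancer",
  enrollment_range = c(100, 200)
)
very_large <- ctgov_query(enrollment_range = c(1000, NA))

# union of the two result sets, keeping each study once
combined <- bind_rows(mid_sized_cancer, very_large) %>%
  distinct(nct_id, .keep_all = TRUE)

# pass the combined table back in to narrow it by start date
recent <- ctgov_query(combined, date_range = c("2019-01-01", "2020-02-01"))
recent
```

Passing the intermediate table as the first argument keeps every later filter restricted to the studies already selected.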