Title: | Query Data from U.S. National Library of Medicine's Clinical Trials Database |
---|---|
Description: | Tools to create and query a database from the U.S. National Library of Medicine's Clinical Trials database <https://clinicaltrials.gov/>. Functions provide access to a variety of techniques for searching the data using range queries, categorical filtering, and full-text keyword search. Minimal graphical tools are also provided for interactively exploring the constructed data. |
Authors: | Taylor Arnold [aut, cre], Auston Wei [aut], Michael J. Kane [aut] |
Maintainer: | Taylor Arnold <[email protected]> |
License: | MIT + file LICENSE |
Version: | 0.2.5 |
Built: | 2024-11-12 03:10:56 UTC |
Source: | https://github.com/cran/ctrialsgov |
Cancer clinical trials based on a query where: 'study_type' is "Interventional"; 'sponsor_type' is "Industry"; 'date_range' covers trials from 2021-01-01 or newer; the 'description' includes the keyword "cancer"; 'phase' is reported (not NA); 'primary_purpose' is "Treatment"; 'minimum_enrollment' is 100.
This function must be run prior to other functions in the package. It creates a parsed and cached version of the clinical trials dataset in memory in R. This makes other function calls relatively efficient.
ctgov_create_data(con, verbose = TRUE)
con |
a DBI connection object to the database |
verbose |
logical flag; should progress messages be printed? defaults to TRUE |
does not return any value; used only for side effects
Taylor B. Arnold, [email protected]
Create a Gantt Labeler for Timeline Tooltips
ctgov_gantt_labeller(x)
x |
the data.frame object returned from a query. |
a string that can be used as a label in ggplotly
Takes a keyword and vector of text and returns instances where the keyword is found within the text.
ctgov_kwic( term, text, names = NULL, n = Inf, ignore_case = TRUE, use_color = FALSE, width = 20L, output = c("cat", "character", "data.frame") )
term |
search term as a string |
text |
vector of text to search |
names |
optional vector of names corresponding to the text |
n |
number of results to return; default is Inf |
ignore_case |
should search ignore case? default is TRUE |
use_color |
should printed results include ANSI color escape sequences? these are set to FALSE by default |
width |
how many characters to show as context |
output |
what kind of output to provide; the default ("cat") prints the results using cat |
either nothing, a character vector, or a data frame, depending on the requested output type
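The keyword-in-context idea can be illustrated with a small base R sketch. The helper `kwic_sketch()` below is hypothetical and is not the package's implementation; it treats `term` as a regular expression and marks the match with `|` separators, both of which are assumptions made for illustration.

```r
# Hypothetical keyword-in-context helper written in base R.
# It is NOT ctgov_kwic(); it only illustrates the general technique.
kwic_sketch <- function(term, text, width = 20L, ignore_case = TRUE) {
  out <- character(0)
  for (i in seq_along(text)) {
    # Find every match of `term` (used as a regular expression) in this element.
    m <- gregexpr(term, text[i], ignore.case = ignore_case)[[1]]
    if (is.na(m[1]) || m[1] == -1L) next  # skip NA input and non-matches
    len <- attr(m, "match.length")
    for (j in seq_along(m)) {
      start <- m[j]
      end <- start + len[j] - 1L
      # Keep up to `width` characters of context on each side of the match.
      left <- substr(text[i], max(1L, start - width), start - 1L)
      hit <- substr(text[i], start, end)
      right <- substr(text[i], end + 1L, end + width)
      out <- c(out, sprintf("%s|%s|%s", left, hit, right))
    }
  }
  out
}

docs <- c(
  "A trial of a new cancer vaccine.",
  "No match here.",
  "Cancer screening study."
)
res <- kwic_sketch("cancer", docs, width = 10L)
print(res)
```

Setting `ignore_case = FALSE` in this sketch would drop the match in the third string, since it begins with a capital letter.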
This function downloads a saved version of the full clinical trials dataset from the package's development repository on GitHub (~150MB) and loads it into R for querying. The data will be cached so that it can be re-loaded without downloading. We try to update the cache frequently so this is a convenient way of grabbing the data if you do not need the most up-to-date version of the database.
ctgov_load_cache(force_download = FALSE)
force_download |
logical flag; should the cache be re-downloaded if it already exists? defaults to FALSE |
does not return any value; used only for side effects
Taylor B. Arnold, [email protected]
This function loads a sample dataset for testing and prototyping purposes. After running it, all of the functions in the package can be used with this sample data. It consists of a 2.5 percent random sample of clinical trials from ClinicalTrials.gov at the time of the package creation.
ctgov_load_sample()
does not return any value; used only for side effects
Taylor B. Arnold, [email protected]
Plot a Timeline for a Set of Clinical Trials
ctgov_plot_timeline( x, start_date = "start_date", completion_date = "primary_completion_date", label_column = "nct_id", color = label_column, tooltip = ctgov_gantt_labeller(x) )
x |
the data.frame object returned from a query. |
start_date |
the start date column name. (Default is "start_date") |
completion_date |
the column name for the date the trial is set to be complete. (Default is "primary_completion_date") |
label_column |
the column denoting the labels for the y-axis. (Default is "nct_id") |
color |
the column to be used for coloring. (Default is label_column) |
tooltip |
the tooltips for each of the trials. (Default is 'ctgov_gantt_labeller(x)'). |
ctgov_gantt_labeller
This function selects a subset of the clinical trials data by using a variety of different search parameters. These include free text search keywords, range queries for the continuous variables, and exact matches for categorical fields. The function ctgov_query_terms shows the categorical levels for the latter. The function will either take the entire dataset loaded into the package environment or a previously queried input.
ctgov_query( data = NULL, description_kw = NULL, sponsor_kw = NULL, brief_title_kw = NULL, official_title_kw = NULL, criteria_kw = NULL, intervention_kw = NULL, intervention_desc_kw = NULL, outcome_kw = NULL, outcome_desc_kw = NULL, conditions_kw = NULL, population_kw = NULL, date_range = NULL, enrollment_range = NULL, minimum_age_range = NULL, maximum_age_range = NULL, study_type = NULL, allocation = NULL, intervention_model = NULL, observational_model = NULL, primary_purpose = NULL, time_perspective = NULL, masking_description = NULL, sampling_method = NULL, phase = NULL, gender = NULL, sponsor_type = NULL, ignore_case = TRUE, match_all = FALSE )
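data |
a dataset to search over; set to NULL (the default) to use the full dataset loaded into the package environment |
description_kw |
character vector of keywords to search in the description field. Set to NULL (the default) to avoid searching this field |
sponsor_kw |
character vector of keywords to search in the sponsor (the company that submitted the study). Set to NULL (the default) to avoid searching this field |
brief_title_kw |
character vector of keywords to search in the brief title field. Set to NULL (the default) to avoid searching this field |
official_title_kw |
character vector of keywords to search in the official title field. Set to NULL (the default) to avoid searching this field |
criteria_kw |
character vector of keywords to search in the criteria field. Set to NULL (the default) to avoid searching this field |
intervention_kw |
character vector of keywords to search in the intervention names field. Set to NULL (the default) to avoid searching this field |
intervention_desc_kw |
character vector of keywords to search in the intervention description field. Set to NULL (the default) to avoid searching this field |
outcome_kw |
character vector of keywords to search in the outcome measures field. Set to NULL (the default) to avoid searching this field |
outcome_desc_kw |
character vector of keywords to search in the outcome description field. Set to NULL (the default) to avoid searching this field |
conditions_kw |
character vector of keywords to search in the conditions field. Set to NULL (the default) to avoid searching this field |
population_kw |
character vector of keywords to search in the population field. Set to NULL (the default) to avoid searching this field |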
date_range |
character vector of length two, formatted as "YYYY-MM-DD", describing the earliest and latest dates to include in the results. Use a missing value for either value to leave that end of the range open. Set to NULL (the default) to search all dates |
enrollment_range |
numeric of length two describing the smallest and largest enrollment sizes to include in the results. Use a missing value for either value to avoid filtering on that end. Set to NULL (the default) to avoid filtering |
minimum_age_range |
numeric of length two describing the smallest and largest minimum age (in years) to include in the results. Use a missing value for either value to avoid filtering on that end. Set to NULL (the default) to avoid filtering |
maximum_age_range |
numeric of length two describing the smallest and largest maximum age (in years) to include in the results. Use a missing value for either value to avoid filtering on that end. Set to NULL (the default) to avoid filtering |
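study_type |
character vector of study types to include in the output. Set to NULL (the default) to avoid filtering on this field |
allocation |
character vector of allocations to include in the output. Set to NULL (the default) to avoid filtering on this field |
intervention_model |
character vector of intervention models to include in the output. Set to NULL (the default) to avoid filtering on this field |
observational_model |
character vector of observational models to include in the output. Set to NULL (the default) to avoid filtering on this field |
primary_purpose |
character vector of primary purposes to include in the output. Set to NULL (the default) to avoid filtering on this field |
time_perspective |
character vector of time perspectives to include in the output. Set to NULL (the default) to avoid filtering on this field |
masking_description |
character vector of maskings to include in the output. Set to NULL (the default) to avoid filtering on this field |
sampling_method |
character vector of sampling methods to include in the output. Set to NULL (the default) to avoid filtering on this field |
phase |
character vector of phases to include in the output. Set to NULL (the default) to avoid filtering on this field |
gender |
character vector of genders to include in the output. Set to NULL (the default) to avoid filtering on this field |
sponsor_type |
character vector of sponsor types to include in the output. Set to NULL (the default) to avoid filtering on this field |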
ignore_case |
logical. Should the search ignore capitalization? The default is TRUE |
match_all |
logical. Should results be required to match all the keywords? The default is FALSE |
a tibble object queried from the loaded database
Taylor B. Arnold, [email protected]
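A typical call combines keyword and range filters and then refines the result by passing it back in as `data`. The sketch below uses only arguments from the signature above; the specific keywords and cut-offs are illustrative assumptions, and the code is guarded so that it only runs when the ctrialsgov package is installed.

```r
# Illustrative usage sketch; the keyword and range values are assumptions.
refined <- NULL
if (requireNamespace("ctrialsgov", quietly = TRUE)) {
  library(ctrialsgov)

  # Load the bundled sample so that queries have data to search.
  ctgov_load_sample()

  # Keyword plus range filters; NA leaves the upper end of a range open.
  res <- ctgov_query(
    description_kw = "cancer",
    date_range = c("2015-01-01", NA),
    enrollment_range = c(100, NA)
  )

  # Refine a previous result by passing it as the `data` argument.
  # With match_all = FALSE, matching any one keyword is enough.
  refined <- ctgov_query(
    res,
    brief_title_kw = c("lung", "breast"),
    match_all = FALSE
  )

  # Allowed values for the categorical filters (study_type, phase, ...).
  print(names(ctgov_query_terms()))
}
```

Values passed to the categorical arguments should be taken from the list returned by ctgov_query_terms, since those fields are matched exactly.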
Returns a list showing the available category levels for querying the data with the ctgov_query function.
ctgov_query_terms()
a named list of allowed categorical values for the query
This function sets the schema in which the clinical trials tables reside.
Get the current schema with either of the following.
ctgov_schema() ctgov_get_schema()
Set the current schema with either of the following.
ctgov_schema(<SCHEMA NAME>) ctgov_set_schema(<SCHEMA NAME>)
A return of "" from the get functions indicates a schema is not specified.
ctgov_schema(schema = NULL)
schema |
the name of the schema. (Default is NULL - None) |
no return value; used for side effects
Takes one or more vectors of text and returns a similarity matrix.
ctgov_text_similarity( ..., max_terms = 10000, tolower = TRUE, min_df = 0, max_df = 1 )
... |
one or more vectors of text to search; must all be the same length |
max_terms |
maximum number of terms to consider for keywords |
tolower |
should the text be converted to lower case before terms are extracted? |
min_df |
minimum proportion of documents that a term should be present in to be included in the keywords |
max_df |
maximum proportion of documents that a term should be present in to be included in the keywords |
a distance matrix
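One common way to compare documents is to weight term counts by TF-IDF and take the cosine similarity between the resulting vectors. The base R sketch below illustrates that general technique; the helpers `tfidf_matrix()` and `cosine_similarity()`, the tokenizer, and the smoothed IDF formula are all assumptions for illustration and may differ from what the package computes.

```r
# Hypothetical helpers; not the implementation of ctgov_text_similarity().
tfidf_matrix <- function(text) {
  # Lower-case the text and split on anything that is not a letter or digit.
  tokens <- strsplit(tolower(text), "[^a-z0-9]+")
  tokens <- lapply(tokens, function(tk) tk[nzchar(tk)])
  vocab <- sort(unique(unlist(tokens)))

  # Term-frequency matrix: one row per document, one column per term.
  tf <- matrix(0, nrow = length(tokens), ncol = length(vocab),
               dimnames = list(NULL, vocab))
  for (i in seq_along(tokens)) {
    tab <- table(tokens[[i]])
    if (length(tab) > 0L) tf[i, names(tab)] <- as.numeric(tab)
  }

  # Inverse document frequency, with +1 so shared terms keep some weight.
  df <- colSums(tf > 0)
  idf <- log(nrow(tf) / df) + 1
  sweep(tf, 2, idf, `*`)
}

cosine_similarity <- function(m) {
  norms <- sqrt(rowSums(m^2))
  norms[norms == 0] <- 1  # avoid dividing by zero for empty documents
  unit <- m / norms       # scale every row to unit length
  unit %*% t(unit)
}

docs <- c(
  "aspirin for heart disease",
  "aspirin for heart disease",
  "knee surgery recovery"
)
sim <- cosine_similarity(tfidf_matrix(docs))
print(round(sim, 2))
```

In this toy input the first two documents are identical and share no terms with the third, so their mutual similarity is 1 and their similarity to the third is 0.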
Takes one or more vectors of text and returns a vector of keywords.
ctgov_tfidf( ..., max_terms = 10000, tolower = TRUE, nterms = 5L, min_df = 0, max_df = 1 )
... |
one or more vectors of text to search; must all be the same length |
max_terms |
maximum number of terms to consider for keywords |
tolower |
should the text be converted to lower case before terms are extracted? |
nterms |
number of keyword terms to include |
min_df |
minimum proportion of documents that a term should be present in to be included in the keywords |
max_df |
maximum proportion of documents that a term should be present in to be included in the keywords |
a character vector of detected keywords
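The roles of `nterms`, `min_df`, and `max_df` can be seen in a small base R sketch of TF-IDF keyword extraction. The helper `top_terms_sketch()` is hypothetical; its tokenizer, its IDF formula, its alphabetical tie-breaking, and the use of `|` to join keywords are assumptions for illustration rather than the package's behavior.

```r
# Hypothetical helper; not the implementation of ctgov_tfidf().
top_terms_sketch <- function(text, nterms = 2L, min_df = 0, max_df = 1) {
  tokens <- strsplit(tolower(text), "[^a-z0-9]+")
  tokens <- lapply(tokens, function(tk) tk[nzchar(tk)])
  n_docs <- length(tokens)

  # Document frequency: in how many documents does each term occur?
  df_tab <- table(unlist(lapply(tokens, unique)))
  df <- setNames(as.numeric(df_tab), names(df_tab))

  # Keep only terms whose document proportion lies in [min_df, max_df].
  prop <- df / n_docs
  keep <- names(df)[prop >= min_df & prop <= max_df]
  idf <- log(n_docs / df[keep])

  vapply(tokens, function(tk) {
    tk <- tk[tk %in% keep]
    if (length(tk) == 0L) return("")
    tf <- table(tk)
    score <- setNames(as.numeric(tf) * idf[names(tf)], names(tf))
    # Highest score first; ties are broken alphabetically.
    ord <- order(-score, names(score))
    top <- names(score)[ord][seq_len(min(nterms, length(score)))]
    paste(top, collapse = "|")
  }, character(1))
}

docs <- c("aspirin aspirin trial", "insulin trial", "statin trial")
kw <- top_terms_sketch(docs, nterms = 1L)
print(kw)
```

Because "trial" appears in every document, its IDF is zero in this sketch and it never outranks the drug names; lowering `max_df` below 1 removes it from consideration entirely.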
Convert a ctrialsgov Visualization to Plotly
ctgov_to_plotly(p, ...)
p |
the plot returned by 'ctgov_plot_timeline()'. |
... |
currently not used. |
a Plotly object
Does a Term Appear in a Vector of Strings?
has_term(s, pattern, ignore_case = TRUE)
s |
the vector of strings. |
pattern |
the pattern to search for. |
ignore_case |
should the case be ignored? Default TRUE |
a single logical value
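A check of this kind can be written in a line of base R. The helper `has_term_sketch()` below is a hypothetical stand-in that returns a single logical value, as documented above; whether the package uses `grepl()` internally is an assumption.

```r
# Hypothetical base R stand-in; not the package's has_term().
has_term_sketch <- function(s, pattern, ignore_case = TRUE) {
  # TRUE if the pattern is found in at least one element of `s`.
  any(grepl(pattern, s, ignore.case = ignore_case))
}

conditions <- c("Lung Cancer", "Asthma")
found <- has_term_sketch(conditions, "cancer")
found_exact <- has_term_sketch(conditions, "cancer", ignore_case = FALSE)
print(c(found, found_exact))
```

With the default `ignore_case = TRUE` the lower-case pattern matches "Lung Cancer"; with case sensitivity turned on it does not.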
Data frame containing a 2.5 percent random sample of clinical trials.