Type: | Package |
Title: | miRNA Text Mining in Abstracts |
Version: | 1.3.4 |
Author: | Julian Friedrich [aut, cre], Hans-Peter Hammes [aut], Guido Krenning [aut] |
Maintainer: | Julian Friedrich <julian.friedrich@medma.uni-heidelberg.de> |
Description: | Providing tools for microRNA (miRNA) text mining. miRetrieve summarizes miRNA literature by extracting, counting, and analyzing miRNA names, thus aiming at gaining biological insights into a large amount of text within a short period of time. To do so, miRetrieve uses regular expressions to extract miRNAs and tokenization to identify meaningful miRNA associations. In addition, miRetrieve uses the latest miRTarBase version 8.0 (Hsi-Yuan Huang et al. (2020) "miRTarBase 2020: updates to the experimentally validated microRNA–target interaction database" <doi:10.1093/nar/gkz896>) to display field-specific miRNA-mRNA interactions. The most important functions are available as a Shiny web application under https://miretrieve.shinyapps.io/miRetrieve/. |
License: | GPL-3 |
Encoding: | UTF-8 |
LazyData: | true |
Depends: | R (≥ 3.1.0) |
Imports: | dplyr (≥ 1.0.7), forcats (≥ 0.5.1), ggplot2 (≥ 3.3.5), magrittr (≥ 2.0.1), openxlsx (≥ 4.2.4), plotly (≥ 4.9.4.1), purrr (≥ 0.3.4), readr (≥ 2.0.1), readxl (≥ 1.3.1), rlang (≥ 0.4.11), scales (≥ 1.1.1), stringr (≥ 1.4.0), textclean (≥ 0.9.3), tidyr (≥ 1.1.3), tidytext (≥ 0.3.1), topicmodels (≥ 0.2.12), wordcloud (≥ 2.6), xml2 (≥ 1.3.2), zoo (≥ 1.8-9) |
RoxygenNote: | 7.1.1 |
Suggests: | kableExtra, knitr, reshape2, rmarkdown, testthat |
NeedsCompilation: | no |
Packaged: | 2021-09-18 17:09:56 UTC; Julian |
Repository: | CRAN |
Date/Publication: | 2021-09-18 17:30:02 UTC |
Add topic column to data frame
Description
Add topic column to a data frame.
Usage
add_col_topic(df, col.topic = "Topic", topic.name = "Topic1")
Arguments
df |
Data frame which the topic column is added to. |
col.topic |
String. Name of the topic column to be created. |
topic.name |
String. Topic name to be contained in col.topic. |
Details
Add a topic column to a data frame. This topic column is named col.topic and contains the string topic.name.
Value
Data frame with a topic column added.
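As a rough illustration of the behaviour described above, the following base R sketch adds a constant topic column to a toy data frame. The helper add_topic_sketch() and the toy data are hypothetical; this is not the package's own implementation.

```r
# Hypothetical helper mimicking the documented behaviour of add_col_topic();
# a base R sketch, not miRetrieve's own code.
add_topic_sketch <- function(df, col.topic = "Topic", topic.name = "Topic1") {
  # Every row receives the same topic name in the new column.
  df[[col.topic]] <- rep(topic.name, nrow(df))
  df
}

# Toy data frame with made-up PubMed-IDs.
toy <- data.frame(PMID = c(1, 2, 3),
                  Abstract = c("a", "b", "c"),
                  stringsAsFactors = FALSE)
toy_topic <- add_topic_sketch(toy, col.topic = "Topic", topic.name = "CRC")
```

Labelling each data frame this way before combining several of them keeps the origin of every abstract visible in later comparisons.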
See Also
Keywords - animals.
Description
Keywords to identify abstracts using animal models.
Usage
animal_keywords
Format
An object of class character of length 12.
Assign topics based on precalculated scores
Description
Assign topics to abstracts based on precalculated scores.
Usage
assign_topic(
df,
col.topic,
threshold,
topic.names = NULL,
col.topic.name = "Topic",
col.pmid = "PMID",
discard = FALSE
)
Arguments
df |
Data frame containing precalculated topic scores and PubMed-IDs. |
col.topic |
Character vector. Vector with column names containing precalculated topic scores. |
threshold |
Integer vector. Vector containing thresholds for topic
columns. Positions in |
topic.names |
Character vector. Optional. Vector containing names of new
topics. Positions in |
col.topic.name |
String. Name of the new topic column. |
col.pmid |
String. Column containing PubMed-IDs. |
discard |
Boolean. If |
Details
Assign topics to abstracts based on precalculated scores. assign_topic() compares different precalculated topic scores and assigns the abstract to the topic with the highest score. If there is a tie between topic scores, the abstract is assigned to all topics in question. If an abstract matches no topic, it is assigned to the topic "Unknown".
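The assignment rule above (highest score wins, ties go to all tied topics, no match becomes "Unknown") can be sketched in base R as follows. The helper assign_topic_sketch() is hypothetical, and two points are assumptions rather than documented behaviour: thresholds are matched to score columns by position, and a score "matches" when it is greater than or equal to its threshold.

```r
# Hypothetical sketch of the documented assignment rule (base R only).
# Assumptions: threshold[i] belongs to col.topic[i]; ">=" counts as a match.
assign_topic_sketch <- function(df, col.topic, threshold, col.pmid = "PMID") {
  rows <- lapply(seq_len(nrow(df)), function(i) {
    # Named vector of this abstract's scores, one entry per topic column.
    scores <- vapply(col.topic, function(cl) df[[cl]][i], numeric(1))
    eligible <- scores >= threshold
    if (!any(eligible)) {
      topics <- "Unknown"
    } else {
      best <- max(scores[eligible])
      # Ties are kept: every eligible topic with the top score is assigned.
      topics <- names(scores)[eligible & scores == best]
    }
    data.frame(PMID = df[[col.pmid]][i], Topic = topics,
               stringsAsFactors = FALSE)
  })
  do.call(rbind, rows)
}

# Toy scores for three made-up abstracts.
scores <- data.frame(PMID = c(11, 12, 13),
                     animal = c(3, 2, 0),
                     patient = c(1, 2, 0))
result <- assign_topic_sketch(scores, c("animal", "patient"),
                              threshold = c(1, 1))
```

In this toy input, abstract 12 has tied scores and therefore appears once per tied topic, while abstract 13 reaches no threshold and is labelled "Unknown".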
Value
Data frame with topics based on precalculated topic scores.
See Also
calculate_score_topic(), plot_score_topic(), add_col_topic()
Other score functions: calculate_score_animals(), calculate_score_biomarker(), calculate_score_patients(), calculate_score_topic(), plot_score_animals(), plot_score_biomarker(), plot_score_patients(), plot_score_topic()
Assign topics based on LDA model
Description
Assign topics to abstracts based on an LDA model.
Usage
assign_topic_lda(df, lda_model, topic.names, col.pmid = PMID)
Arguments
df |
Data frame to assign topics to. Should be the same data frame that the LDA model was fitted on. |
lda_model |
LDA-model. |
topic.names |
Character vector. Vector containing names of the
new topics. Must have the same length as the number of topics |
col.pmid |
Symbol. Column containing PubMed-IDs. |
Details
Assign topics to abstracts based on an LDA model. To identify the subject of a topic, use plot_lda_term().
Value
Data frame with topics assigned to each abstract based on an LDA model.
See Also
fit_lda(), plot_lda_term(), assign_topic()
Other LDA functions: fit_lda(), plot_lda_term(), plot_perplexity()
Keywords - biomarkers.
Description
Keywords to identify abstracts reporting about miRNAs as biomarkers.
Usage
biomarker_keywords
Format
An object of class character of length 18.
Calculate animal model scores for abstracts
Description
Calculate animal model score for each abstract to indicate possible use of animal models.
Usage
calculate_score_animals(
df,
keywords = animal_keywords,
case = FALSE,
threshold = NULL,
indicate = FALSE,
discard = FALSE,
col.abstract = Abstract
)
Arguments
df |
Data frame containing abstracts. |
keywords |
Character vector. Vector containing keywords. The score is
calculated based on these keywords. How much weight a keyword in |
case |
Boolean. If |
threshold |
Integer. Optional. Threshold to decide if an abstract is
considered to use animal models or not. If |
indicate |
Boolean. If |
discard |
Boolean. If |
col.abstract |
Symbol. Column containing abstracts. |
Details
Calculate animal model score for each abstract to indicate possible use of animal models. This score is added to the data frame as an additional column Animal_score, containing the calculated animal model score.
To decide which abstracts are considered to contain animal models, a threshold can be set via the threshold argument. Furthermore, an additional column can be added, verbally indicating the use of animal models in an abstract.
Choosing the right threshold can be facilitated using plot_score_animals().
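The general idea of a keyword score with a threshold can be sketched in base R as below. This is only an illustration: the helper score_keywords_sketch(), the toy keywords, and the indicator column name are hypothetical, and the sketch assumes one point per keyword occurrence, which may differ from the weighting the package actually applies.

```r
# Hypothetical keyword score: one point per keyword occurrence.
# The weighting used by calculate_score_animals() itself may differ.
score_keywords_sketch <- function(abstracts, keywords, case = FALSE) {
  if (!case) {
    # Case-insensitive matching: lower-case both sides first.
    abstracts <- tolower(abstracts)
    keywords <- tolower(keywords)
  }
  vapply(abstracts, function(txt) {
    per_keyword <- vapply(keywords, function(kw) {
      # gregexpr() returns -1 when there is no match, so count positions > 0.
      sum(gregexpr(kw, txt, fixed = TRUE)[[1]] > 0)
    }, numeric(1))
    sum(per_keyword)
  }, numeric(1), USE.NAMES = FALSE)
}

toy <- data.frame(
  PMID = c(1, 2),
  Abstract = c("miR-21 was measured in mice and rats.",
               "Patient serum samples were analysed."),
  stringsAsFactors = FALSE
)
keywords_toy <- c("mice", "rats", "mouse")  # stand-ins, not animal_keywords

toy$Animal_score <- score_keywords_sketch(toy$Abstract, keywords_toy)

# Apply a threshold of 1: add a verbal indicator, then optionally discard.
toy$Animal_indicated <- ifelse(toy$Animal_score >= 1, "Yes", "No")
kept <- toy[toy$Animal_score >= 1, ]
```

The last two lines correspond loosely to the roles of the indicate and discard arguments: one adds a label, the other filters rows.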
Value
Data frame with calculated animal model scores.
If discard = FALSE, adds extra columns to the original data frame with the calculated animal model scores.
If discard = TRUE, only abstracts with animal models are kept.
See Also
Other score functions: assign_topic(), calculate_score_biomarker(), calculate_score_patients(), calculate_score_topic(), plot_score_animals(), plot_score_biomarker(), plot_score_patients(), plot_score_topic()
Calculate biomarker scores for abstracts
Description
Calculate biomarker score for each abstract to indicate possible use of miRNAs as biomarkers.
Usage
calculate_score_biomarker(
df,
keywords = biomarker_keywords,
case = FALSE,
threshold = NULL,
indicate = FALSE,
discard = FALSE,
col.abstract = Abstract
)
Arguments
df |
Data frame containing abstracts. |
keywords |
Character vector. Vector containing keywords. The score is
calculated based on these keywords. How much weight a keyword in |
case |
Boolean. If |
threshold |
Integer. Optional. Threshold to decide whether use of miRNAs as
biomarkers is present in an abstract or not. If |
indicate |
Boolean. If |
discard |
Boolean. If |
col.abstract |
Symbol. Column containing abstracts. |
Details
Calculate biomarker score for each abstract to indicate possible use of miRNAs as biomarkers. This score is added to the data frame as an additional column Biomarker_score, containing the calculated biomarker score.
To decide which abstracts are considered to report the use of miRNAs as biomarkers, a threshold can be set via the threshold argument. Furthermore, an additional column can be added, verbally indicating the general use of miRNAs as biomarkers in an abstract.
Choosing the right threshold can be facilitated using plot_score_biomarker().
Value
Data frame with calculated biomarker scores.
If discard = FALSE, adds extra columns to the original data frame with calculated biomarker scores.
If discard = TRUE, only abstracts with miRNAs as biomarkers are kept.
See Also
Other score functions: assign_topic(), calculate_score_animals(), calculate_score_patients(), calculate_score_topic(), plot_score_animals(), plot_score_biomarker(), plot_score_patients(), plot_score_topic()
Calculate patients scores for abstracts
Description
Calculate patients score for each abstract to indicate possible use of patient material.
Usage
calculate_score_patients(
df,
keywords = patients_keywords,
case = FALSE,
threshold = NULL,
indicate = FALSE,
discard = FALSE,
col.abstract = Abstract
)
Arguments
df |
Data frame containing abstracts. |
keywords |
Character vector. Vector containing keywords. The score is
calculated based on these keywords. How much weight a keyword in |
case |
Boolean. If |
threshold |
Integer. Optional. Threshold to decide if use of patient tissue is
present in an abstract or not. If |
indicate |
Boolean. If |
discard |
Boolean. If |
col.abstract |
Symbol. Column containing abstracts. |
Details
Calculate patient score for each abstract to indicate possible use of patient material. This score is added to the data frame as an additional column Patient_score, containing the calculated patient score.
To decide which abstracts are considered to contain patient material, a threshold can be set via the threshold argument. Furthermore, an additional column can be added, verbally indicating the general use of patient material.
Choosing the right threshold can be facilitated using plot_score_patients().
Value
Data frame with calculated patient scores.
If discard = FALSE, adds extra columns to the original data frame with the calculated patient tissue scores.
If discard = TRUE, only abstracts with use of patient tissue are kept.
See Also
Other score functions: assign_topic(), calculate_score_animals(), calculate_score_biomarker(), calculate_score_topic(), plot_score_animals(), plot_score_biomarker(), plot_score_patients(), plot_score_topic()
Calculate scores of a self-chosen topic
Description
Calculate score of a self-chosen topic for each abstract to identify abstracts possibly corresponding to the topic of interest.
Usage
calculate_score_topic(
df,
keywords,
case = FALSE,
col.score = "topic_score",
col.indicate = NULL,
threshold = NULL,
discard = FALSE,
col.abstract = Abstract
)
Arguments
df |
Data frame containing abstracts. |
keywords |
Character vector. Vector containing keywords. The score is
calculated based on these keywords. How much weight a keyword in |
case |
Boolean. If |
col.score |
String. Name of |
col.indicate |
String. Optional. Name of indicating column. If a string
is provided, an extra column is added to |
threshold |
Integer. Optional. Threshold to decide if abstract
corresponds to topic of interest. If |
discard |
Boolean. If |
col.abstract |
Symbol. Column containing abstracts. |
Details
Calculate score of a self-chosen topic for each abstract to identify abstracts possibly corresponding to the topic of interest. This score is added to the data frame as an additional column, called topic_score by default, containing the calculated topic score. If there is more than one topic of interest, the column topic_score should be appropriately renamed.
To decide which abstracts are considered to correspond to the topic of interest, a threshold can be set via the threshold argument. Furthermore, an additional column can be added, verbally indicating if the abstract corresponds to the topic.
Choosing the right threshold can be facilitated using plot_score_topic().
Value
Data frame with calculated topic scores.
If discard = FALSE, adds extra columns to the original data frame with the calculated topic scores.
If discard = TRUE, only abstracts corresponding to the topic of interest are kept.
See Also
assign_topic(), plot_score_topic()
Other score functions: assign_topic(), calculate_score_animals(), calculate_score_biomarker(), calculate_score_patients(), plot_score_animals(), plot_score_biomarker(), plot_score_patients(), plot_score_topic()
Combine data frames into one data frame
Description
Combine data frames into one data frame.
Usage
combine_df(...)
Arguments
... |
Data frames to combine into one data frame. Data frames must have the same number of columns and the same column names. |
Details
Combine data frames into one data frame. combine_df() accepts several data frames that are combined into one data frame. Data frames to be combined must have the same number of columns and the same column names.
Value
Combined data frame.
See Also
Other combine functions:
combine_mir()
Combine miRNA vectors into one
Description
Combine miRNA vectors into one.
Usage
combine_mir(...)
Arguments
... |
Character vectors. Character vectors containing miRNA names. |
Details
Combine miRNA vectors into one. miRNA names occurring more than once are reduced to one instance.
Value
Combined character vector containing miRNA names.
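The deduplicating behaviour described in Details can be pictured with a one-line base R sketch. The helper combine_mir_sketch() is hypothetical and only mirrors the documented outcome, not the package's code.

```r
# Hypothetical sketch: concatenate all vectors, then drop repeated names.
combine_mir_sketch <- function(...) {
  unique(c(...))
}

mirs_a <- c("miR-21", "miR-155")
mirs_b <- c("miR-155", "miR-34")
combined <- combine_mir_sketch(mirs_a, mirs_b)
```

Here "miR-155" occurs in both inputs but appears only once in the combined vector.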
See Also
Other combine functions:
combine_df()
Combine data frames containing stop words
Description
Combine data frames containing stop words into one data frame.
Usage
combine_stopwords(...)
Arguments
... |
Data frames with stop words. Data frames must have two columns named "word" and "lexicon". |
Details
Combine data frames containing stop words into one data frame. Provided data frames must have two columns named "word" and "lexicon".
Value
Combined data frame with stop words.
See Also
generate_stopwords(), stopwords_miretrieve, tidytext::stop_words
Other stopword functions: generate_stopwords()
Compare count of miRNA names between different topics
Description
Compare count of miRNA names between different topics.
Usage
compare_mir_count(
df,
mir,
topic = NULL,
normalize = TRUE,
col.topic = Topic,
col.mir = miRNA,
col.pmid = PMID,
title = NULL
)
Arguments
df |
Data frame containing columns for miRNA names, topics, and PubMed-IDs. |
mir |
Character vector. Vector specifying which miRNA names to compare. |
topic |
Character vector. Optional. Vector specifying which topics to compare. |
normalize |
Boolean. If |
col.topic |
Symbol. Column containing topic names. |
col.mir |
Symbol. Column containing miRNA names. |
col.pmid |
Symbol. Column containing PubMed-IDs. |
title |
String. Plot title. |
Details
Compare count of miRNA names between different topics by plotting the number of abstracts mentioning the miRNA in a topic. This count can either be normalized, thus plotting the proportion of abstracts mentioning a miRNA name compared to all abstracts of a topic, or it can be not normalized, thus plotting the absolute number of abstracts mentioning a miRNA per topic.
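The difference between the normalized and the absolute count can be made concrete with a small base R sketch. The helper count_per_topic_sketch() and the toy data are hypothetical; the sketch only reproduces the two quantities described above and does not draw the bar plot.

```r
# Toy data: one row per miRNA mention per abstract (made-up values).
toy <- data.frame(
  PMID  = c(1, 1, 2, 3, 4, 5),
  miRNA = c("miR-21", "miR-155", "miR-21", "miR-21", "miR-155", "miR-21"),
  Topic = c("CRC", "CRC", "CRC", "Panc", "Panc", "Panc"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: abstracts mentioning `mir` per topic, either as an
# absolute number or as a proportion of all abstracts of that topic.
count_per_topic_sketch <- function(df, mir, normalize = TRUE) {
  topics <- unique(df$Topic)
  vapply(topics, function(tp) {
    in_topic <- df[df$Topic == tp, ]
    n_mir <- length(unique(in_topic$PMID[in_topic$miRNA == mir]))
    if (normalize) n_mir / length(unique(in_topic$PMID)) else n_mir
  }, numeric(1))
}

absolute <- count_per_topic_sketch(toy, "miR-21", normalize = FALSE)
relative <- count_per_topic_sketch(toy, "miR-21", normalize = TRUE)
```

In the toy data both topics have two abstracts mentioning miR-21, yet the proportions differ (2 of 2 versus 2 of 3), which is why normalizing matters when topics differ in size.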
Value
Bar plot comparing the count of miRNA names between different topics.
See Also
compare_mir_count_log2()
, compare_mir_count_unique()
Other compare functions:
compare_mir_count_log2()
,
compare_mir_count_unique()
,
compare_mir_terms_log2()
,
compare_mir_terms_scatter()
,
compare_mir_terms_unique()
,
compare_mir_terms()
Compare log2-frequency count of miRNA names between two topics
Description
Compare log2-frequency count of miRNA names between two topics.
Usage
compare_mir_count_log2(
df,
mir,
topic = NULL,
normalize = TRUE,
col.topic = Topic,
col.mir = miRNA,
col.pmid = PMID,
title = NULL
)
Arguments
df |
Data frame containing miRNA names, topics, and PubMed-IDs. |
mir |
Character vector. Vector specifying which miRNA names to compare. |
topic |
Character vector. Optional. Vector specifying which
topics to compare. If |
normalize |
Boolean. If |
col.topic |
Symbol. Column containing topics. |
col.mir |
Symbol. Column containing miRNA names. |
col.pmid |
Symbol. Column containing PubMed-IDs. |
title |
String. Plot title. |
Details
Compare log2-frequency count of miRNA names between two topics by plotting the log2-ratio of the miRNA count in two topics. The miRNA count per topic can either be normalized, thus taking the proportion of abstracts mentioning a miRNA name compared to all abstracts in a topic, or not normalized, thus taking the absolute number of abstracts mentioning a miRNA in a topic. The log2-plot is greatly inspired by the article “tidytext: Text Mining and Analysis Using Tidy Data Principles in R.” by Silge and Robinson.
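A log2-ratio of this kind can be computed as in the following base R sketch. The helper log2_ratio_sketch() and the proportions are hypothetical, and the direction of the ratio (first topic divided by second) is an assumption made for illustration.

```r
# Hypothetical helper: log2-ratio of per-topic miRNA counts or proportions.
# Assumption: the ratio is topic A over topic B.
log2_ratio_sketch <- function(count_a, count_b) {
  log2(count_a / count_b)
}

# Made-up normalized counts (proportion of abstracts per topic).
prop_a <- c("miR-21" = 0.5, "miR-155" = 0.25)
prop_b <- c("miR-21" = 0.125, "miR-155" = 0.25)

ratio <- log2_ratio_sketch(prop_a, prop_b)
```

A value of 0 means the miRNA is equally frequent in both topics, and each unit corresponds to a twofold difference, so a four times higher proportion in topic A gives a ratio of 2.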
Value
List containing bar plot comparing the log2-frequency count of miRNA names between two topics and its corresponding data frame.
References
Silge, Julia, and David Robinson. 2016. “tidytext: Text Mining and Analysis Using Tidy Data Principles in R.” JOSS 1 (3). The Open Journal. https://doi.org/10.21105/joss.00037.
See Also
compare_mir_count(), compare_mir_count_unique()
Other compare functions: compare_mir_count_unique(), compare_mir_count(), compare_mir_terms_log2(), compare_mir_terms_scatter(), compare_mir_terms_unique(), compare_mir_terms()
Compare top count of unique miRNA names per topic
Description
Compare top count of unique miRNA names per topic.
Usage
compare_mir_count_unique(
df,
top = 5,
topic = NULL,
normalize = TRUE,
colour = "steelblue3",
col.topic = Topic,
col.mir = miRNA,
col.pmid = PMID,
title = NULL
)
Arguments
df |
Data frame containing miRNA names, topics, and PubMed-IDs. |
top |
Integer. Specifies number of top unique miRNAs to plot. |
topic |
Character vector. Optional. Vector specifying which
topics to compare. If |
normalize |
Boolean. If |
colour |
String. Colour of bar plot. |
col.topic |
Symbol. Column containing topics. |
col.mir |
Symbol. Column containing miRNA names. |
col.pmid |
Symbol. Column containing PubMed-IDs. |
title |
String. Plot title. |
Details
Compare top count of unique miRNA names per topic by plotting the miRNA count of unique miRNAs per topic. Per topic, the unique miRNAs are identified and their count is plotted. The miRNA count can either be normalized, thus taking the proportion of abstracts mentioning a miRNA name compared to all abstracts in a topic, or not normalized, thus taking the absolute number of abstracts mentioning a miRNA in a topic.
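One way to identify miRNAs that are unique to a topic is sketched below in base R. The helper unique_mir_sketch() and the toy data are hypothetical, and "unique" is taken here to mean that a miRNA name occurs in exactly one topic.

```r
# Hypothetical helper: keep only miRNAs that occur in exactly one topic.
unique_mir_sketch <- function(df) {
  # Number of distinct topics in which each miRNA name appears.
  topics_per_mir <- tapply(df$Topic, df$miRNA, function(x) length(unique(x)))
  unique_names <- names(topics_per_mir)[topics_per_mir == 1]
  df[df$miRNA %in% unique_names, ]
}

toy <- data.frame(
  PMID  = c(1, 2, 3, 4),
  miRNA = c("miR-21", "miR-21", "miR-200", "miR-10"),
  Topic = c("CRC", "Panc", "CRC", "Panc"),
  stringsAsFactors = FALSE
)
unique_rows <- unique_mir_sketch(toy)
```

In the toy data miR-21 is shared by both topics and is dropped, leaving one topic-specific miRNA per topic whose counts could then be ranked and plotted.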
Value
Bar plot comparing frequency of unique miRNA count per topic.
See Also
compare_mir_count(), compare_mir_count_log2()
Other compare functions: compare_mir_count_log2(), compare_mir_count(), compare_mir_terms_log2(), compare_mir_terms_scatter(), compare_mir_terms_unique(), compare_mir_terms()
Compare count of terms associated with a miRNA name over various topics
Description
Compare count of top terms associated with a miRNA name over various topics.
Usage
compare_mir_terms(
df,
mir,
top = 20,
token = "words",
...,
topic = NULL,
shared = TRUE,
normalize = TRUE,
stopwords = stopwords_miretrieve,
stopwords_ngram = TRUE,
position = "dodge",
col.mir = miRNA,
col.abstract = Abstract,
col.topic = Topic,
col.pmid = PMID,
title = NULL
)
Arguments
df |
Data frame containing miRNA names, abstracts, topics, and PubMed-IDs. |
mir |
String. miRNA name of interest. |
top |
Integer. Number of top terms to plot. |
token |
String. Specifies how abstracts shall be split up. Taken from
|
... |
Additional arguments for tokenization, if necessary. |
topic |
Character vector. Optional. Specifies topics to plot.
If |
shared |
Boolean. If |
normalize |
Boolean. If |
stopwords |
Data frame containing stop words. |
stopwords_ngram |
Boolean. Specifies if stop words shall be removed
from abstracts when using ngrams. Only applied when |
position |
Character vector. Vector containing either "dodge" or "facet". Determines if bar plots are on top of or next to each other. |
col.mir |
Symbol. Column containing miRNA names. |
col.abstract |
Symbol. Column containing abstracts. |
col.topic |
Symbol. Column containing topic names. |
col.pmid |
Symbol. Column containing PubMed-IDs. |
title |
String. Plot title. |
Details
Compare count of top terms associated with a miRNA name over various topics.
miRNA names and topics must be in a data frame df, while terms are taken from abstracts contained in df.
Number of top terms to plot is regulated by top. Terms can either be evaluated as their raw count, i.e. in how many abstracts they are mentioned in conjunction with the miRNA name, or as their relative count, i.e. in how many abstracts containing the miRNA they are mentioned compared to all abstracts containing the miRNA.
compare_mir_terms() is based on the tools available in the tidytext package.
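The raw and relative term counts described above can be approximated in base R as follows. This sketch is hypothetical: term_count_sketch() uses a simple regular-expression split instead of the tidytext tokenizer, and the stop word list is a toy vector rather than stopwords_miretrieve.

```r
# Hypothetical sketch: in how many abstracts does each term occur?
# Uses a plain regex split, not the tidytext tokenizer.
term_count_sketch <- function(abstracts, stopwords = character(0),
                              normalize = TRUE) {
  tokens <- lapply(strsplit(tolower(abstracts), "[^a-z0-9-]+"), function(x) {
    x <- unique(x[nzchar(x)])   # count each term once per abstract
    x[!x %in% stopwords]        # drop stop words
  })
  counts <- table(unlist(tokens))
  counts <- setNames(as.vector(counts), names(counts))
  if (normalize) counts / length(abstracts) else counts
}

# Made-up abstracts that all mention the miRNA of interest.
abstracts <- c("miR-21 promotes invasion",
               "miR-21 inhibits apoptosis and invasion")

raw <- term_count_sketch(abstracts, stopwords = "and", normalize = FALSE)
rel <- term_count_sketch(abstracts, stopwords = "and", normalize = TRUE)
```

The relative count divides by the number of abstracts containing the miRNA, so a term present in every such abstract has a value of 1.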
Value
Bar plot comparing the count of terms associated with a miRNA name over various topics.
See Also
compare_mir_terms_log2(), compare_mir_terms_scatter()
Other compare functions: compare_mir_count_log2(), compare_mir_count_unique(), compare_mir_count(), compare_mir_terms_log2(), compare_mir_terms_scatter(), compare_mir_terms_unique()
Compare log2-frequency count of terms associated with a miRNA name
Description
Compare log2-frequency count of terms associated with a miRNA name over two topics.
Usage
compare_mir_terms_log2(
df,
mir,
top = 20,
token = "words",
...,
topic = NULL,
shared = TRUE,
normalize = TRUE,
stopwords = stopwords_miretrieve,
stopwords_ngram = TRUE,
col.mir = miRNA,
col.abstract = Abstract,
col.topic = Topic,
col.pmid = PMID,
title = NULL
)
Arguments
df |
Data frame containing miRNA names, abstracts, topics, and PubMed-IDs. |
mir |
String. miRNA name of interest. |
top |
Integer. Number of top terms to plot. |
token |
String. Specifies how abstracts shall be split up. Taken from
|
... |
Additional arguments for tokenization, if necessary. |
topic |
Character vector. Optional. Specifies which topics to plot.
Must have length two.
If |
shared |
Boolean. If |
normalize |
Boolean. If |
stopwords |
Data frame containing stop words. |
stopwords_ngram |
Boolean. Specifies if stop words shall be removed
from abstracts when using ngrams. Only applied when |
col.mir |
Symbol. Column containing miRNA names. |
col.abstract |
Symbol. Column containing abstracts. |
col.topic |
Symbol. Column containing topic names. |
col.pmid |
Symbol. Column containing PubMed-IDs. |
title |
String. Plot title. |
Details
Compare log2-frequency count of terms associated with a miRNA name over two topics by plotting the log2-ratio of the term count associated with a miRNA name over two topics.
miRNA names and topics must be in a data frame df, while terms are taken from abstracts contained in df.
Number of top terms to plot is regulated by top. Terms can either be evaluated as their raw count, i.e. in how many abstracts they are mentioned in conjunction with the miRNA name, or as their relative count, i.e. in how many abstracts containing the miRNA they are mentioned compared to all abstracts containing the miRNA.
compare_mir_terms_log2() is based on the tools available in the tidytext package.
The log2-plot is greatly inspired by the article “tidytext: Text Mining and Analysis Using Tidy Data Principles in R.” by Silge and Robinson.
Value
List containing bar plot comparing the log2-frequency of terms associated with a miRNA over two topics and its corresponding data frame.
References
Silge, Julia, and David Robinson. 2016. “tidytext: Text Mining and Analysis Using Tidy Data Principles in R.” JOSS 1 (3). The Open Journal. https://doi.org/10.21105/joss.00037.
See Also
compare_mir_terms(), compare_mir_terms_scatter()
Other compare functions: compare_mir_count_log2(), compare_mir_count_unique(), compare_mir_count(), compare_mir_terms_scatter(), compare_mir_terms_unique(), compare_mir_terms()
Compare shared terms associated with a miRNA name
Description
Compare shared terms associated with a miRNA name over two topics.
Usage
compare_mir_terms_scatter(
df,
mir,
top = 1000,
token = "words",
...,
topic = NULL,
stopwords = stopwords_miretrieve,
stopwords_ngram = TRUE,
html = TRUE,
colour.point = "red",
colour.term = "black",
col.mir = miRNA,
col.abstract = Abstract,
col.topic = Topic,
col.pmid = PMID,
title = NULL
)
Arguments
df |
Data frame containing miRNA names, abstracts, topics, and PubMed-IDs. |
mir |
String. miRNA name of interest. |
top |
Integer. Number of top terms to plot. |
token |
String. Specifies how abstracts shall be split up. Taken from
|
... |
Additional arguments for tokenization, if necessary. |
topic |
Character vector. Optional. Specifies which topics to plot.
Must have length two.
If |
stopwords |
Data frame containing stop words. |
stopwords_ngram |
Boolean. Specifies if stop words shall be removed
from abstracts when using ngrams. Only applied when |
html |
Boolean. Specifies if plot is returned as an HTML-widget or static. |
colour.point |
String. Colour of points for scatter plot. |
colour.term |
String. Colour of terms for scatter plot. |
col.mir |
Symbol. Column containing miRNAs. |
col.abstract |
Symbol. Column containing abstracts. |
col.topic |
Symbol. Column containing topic names. |
col.pmid |
Symbol. Column containing PubMed-IDs. |
title |
String. Plot title. |
Details
Compare shared terms associated with a miRNA name over two topics. These terms are displayed as a scatter plot, which is either interactive as an HTML-widget, or static. This is regulated via the html argument.
miRNA names and topics must be in a data frame df, while terms are taken from abstracts contained in df.
Number of top terms to choose is regulated by top. Terms are evaluated as their raw count and plotted on a log10-scale.
compare_mir_terms_scatter() is based on the tools available in the tidytext package.
The term-plot is greatly inspired by “tidytext: Text Mining and Analysis Using Tidy Data Principles in R.” by Silge and Robinson.
Value
Scatter plot comparing shared terms of a miRNA between two topics.
References
Silge, Julia, and David Robinson. 2016. “tidytext: Text Mining and Analysis Using Tidy Data Principles in R.” JOSS 1 (3). The Open Journal. https://doi.org/10.21105/joss.00037.
See Also
compare_mir_terms(), compare_mir_terms_log2()
Other compare functions: compare_mir_count_log2(), compare_mir_count_unique(), compare_mir_count(), compare_mir_terms_log2(), compare_mir_terms_unique(), compare_mir_terms()
Compare terms uniquely associated with a miRNA name
Description
Compare terms uniquely associated with a miRNA name over topics.
Usage
compare_mir_terms_unique(
df,
mir,
top = 20,
token = "words",
...,
topic = NULL,
stopwords = stopwords_miretrieve,
stopwords_ngram = TRUE,
normalize = TRUE,
colour = "steelblue3",
col.mir = miRNA,
col.abstract = Abstract,
col.topic = Topic,
col.pmid = PMID,
title = NULL
)
Arguments
df |
Data frame containing miRNA names, abstracts, topics, and PubMed-IDs. |
mir |
String. miRNA name of interest. |
top |
Integer. Number of top terms to plot. |
token |
String. Specifies how abstracts shall be split up. Taken from
|
... |
Additional arguments for tokenization, if necessary. |
topic |
Character vector. Optional. Specifies which topics to plot.
If |
stopwords |
Data frame containing stop words. |
stopwords_ngram |
Boolean. Specifies if stop words shall be removed
from abstracts when using ngrams. Only applied when |
normalize |
Boolean. If |
colour |
String. Colour of bar plot. |
col.mir |
Symbol. Column containing miRNAs. |
col.abstract |
Symbol. Column containing abstracts. |
col.topic |
Symbol. Column containing topic names. |
col.pmid |
Symbol. Column containing PubMed-IDs. |
title |
String. Plot title. |
Details
Compare terms uniquely associated with a miRNA name over topics.
miRNA names and topics must be in a data frame df, while terms are taken from abstracts contained in df.
Number of top terms to choose is regulated by top. Terms are evaluated either as the number of times they are mentioned in all abstracts with the miRNA name of interest, or as that number relative to all abstracts with the miRNA name of interest.
compare_mir_terms_unique() is based on the tools available in the tidytext package.
Value
Bar plot containing unique miRNA-terms associations per topic.
See Also
compare_mir_terms(), compare_mir_terms_log2(), compare_mir_terms_scatter()
Other compare functions: compare_mir_count_log2(), compare_mir_count_unique(), compare_mir_count(), compare_mir_terms_log2(), compare_mir_terms_scatter(), compare_mir_terms()
Count miRNA names in a data frame
Description
Count occurrence of miRNA names in a data frame.
Usage
count_mir(df, col.mir = miRNA)
Arguments
df |
Data frame containing miRNA names. |
col.mir |
Symbol. Column containing miRNA names. |
Details
Count occurrence of miRNA names in a data frame. The count of miRNA names is returned as a separate data frame, only listing the miRNA names and their respective frequency.
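A frequency table of this kind can be produced in base R as sketched below. The helper count_sketch() and the output column names are hypothetical; the sketch counts rows per miRNA name and sorts them by frequency.

```r
# Hypothetical helper: frequency of each miRNA name, most frequent first.
count_sketch <- function(df, col = "miRNA") {
  freq <- table(df[[col]])
  out <- data.frame(miRNA = names(freq), n = as.vector(freq),
                    stringsAsFactors = FALSE)
  out[order(-out$n), ]
}

toy <- data.frame(
  PMID  = c(1, 2, 2, 3),
  miRNA = c("miR-21", "miR-21", "miR-155", "miR-21"),
  stringsAsFactors = FALSE
)
counts <- count_sketch(toy)
```

The result lists only the miRNA names and their frequencies, separate from the input data frame, as described above.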
Value
Data frame. Data frame containing miRNA names and their respective frequency.
See Also
plot_mir_count(), count_mir_threshold(), plot_mir_count_threshold()
Other count functions: count_mir_threshold(), count_snp(), plot_mir_count_threshold(), plot_mir_count()
Count occurrence of miRNA names above threshold
Description
Count occurrence of miRNA names above a threshold.
Usage
count_mir_threshold(df, threshold = 1, col.mir = miRNA, col.pmid = PMID)
Arguments
df |
Data frame containing miRNA names and PubMed-IDs. |
threshold |
Integer or float. If |
col.mir |
Symbol. Column containing miRNA names. |
col.pmid |
Symbol. Column containing PubMed-IDs. |
Details
Count occurrence of miRNA names above a threshold. This threshold can either be an absolute value, e.g. 3, or a float between 0 and 1, e.g. 0.2.
If threshold is an absolute value, the number of distinct miRNA names mentioned in at least threshold abstracts is returned.
If threshold is a float between 0 and 1, the number of distinct miRNA names mentioned in at least that fraction of all abstracts in df is returned.
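The two threshold modes can be illustrated with the following base R sketch. The helper count_threshold_sketch() is hypothetical, and it assumes that any threshold below 1 is read as a fraction of all abstracts while values of 1 or more are read as absolute numbers of abstracts.

```r
# Hypothetical helper: number of distinct miRNA names reaching a threshold.
# Assumption: threshold < 1 is a fraction of abstracts, otherwise absolute.
count_threshold_sketch <- function(df, threshold = 1) {
  # Count each miRNA at most once per abstract.
  pairs <- unique(df[, c("PMID", "miRNA")])
  n_abstracts <- table(pairs$miRNA)
  if (threshold < 1) {
    # Convert the fraction into a minimum number of abstracts.
    threshold <- threshold * length(unique(df$PMID))
  }
  sum(n_abstracts >= threshold)
}

# Five made-up abstracts: miR-21 in four, miR-155 in two, miR-34 in one.
toy <- data.frame(
  PMID  = c(1, 2, 3, 4, 1, 5, 2),
  miRNA = c(rep("miR-21", 4), rep("miR-155", 2), "miR-34"),
  stringsAsFactors = FALSE
)

n_abs_1  <- count_threshold_sketch(toy, threshold = 1)
n_abs_2  <- count_threshold_sketch(toy, threshold = 2)
n_frac   <- count_threshold_sketch(toy, threshold = 0.5)
```

With five abstracts, a threshold of 0.5 requires at least 2.5 abstracts, so only miR-21 qualifies in this toy example.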
Value
Integer with the number of distinct miRNA names in df.
See Also
plot_mir_count_threshold(), count_mir(), plot_mir_count()
Other count functions: count_mir(), count_snp(), plot_mir_count_threshold(), plot_mir_count()
Count SNPs in a data frame
Description
Count occurrence of SNPs in a data frame.
Usage
count_snp(df, col.snp = SNPs, col.pmid = PMID)
Arguments
df |
Data frame containing SNPs and PubMed IDs. |
col.snp |
Symbol. Column containing SNPs. |
col.pmid |
Symbol. Column containing PubMed IDs. |
Details
Count occurrence of SNPs in a data frame. The count of SNPs is returned as a separate data frame, only listing the SNPs and their respective frequency.
Value
Data frame. Data frame containing SNPs and their respective frequency.
See Also
extract_snp(), get_snp(), subset_snp()
Other count functions: count_mir_threshold(), count_mir(), plot_mir_count_threshold(), plot_mir_count()
Count targets in data frame
Description
Count occurrence of targets in a data frame.
Usage
count_target(df, col.target = Target, add.df = TRUE)
Arguments
df |
Data frame containing a column with targets. |
col.target |
Symbol. Column containing targets. |
add.df |
Boolean. If |
Details
Count occurrence of targets in a data frame. The count of targets can either be returned as a separate data frame, only listing the targets and their respective frequency, or it can be added to the data frame provided as an extra column.
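The two output shapes described above can be sketched in base R as follows. The helper count_target_sketch(), its add_to_input argument, and the Target_count column name are hypothetical; how they map onto the add.df argument is an assumption and is not confirmed here.

```r
# Hypothetical helper showing both output shapes for target counts.
count_target_sketch <- function(df, add_to_input = TRUE) {
  freq <- table(df$Target)
  if (add_to_input) {
    # Look up each row's target frequency and append it as a new column.
    df$Target_count <- as.vector(freq[as.character(df$Target)])
    df
  } else {
    # Separate data frame listing each target once with its frequency.
    data.frame(Target = names(freq), Target_count = as.vector(freq),
               stringsAsFactors = FALSE)
  }
}

toy <- data.frame(
  miRNA  = c("miR-21", "miR-155", "miR-21"),
  Target = c("PTEN", "PTEN", "TP53"),
  stringsAsFactors = FALSE
)
with_column <- count_target_sketch(toy, add_to_input = TRUE)
separate    <- count_target_sketch(toy, add_to_input = FALSE)
```

Adding the count as a column keeps the miRNA-target pairs intact, whereas the separate data frame is more convenient for a quick overview.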
Value
Data frame, either with the targets and their frequency as a new data frame, or with the frequency of targets added as a new column to the input data frame df.
See Also
join_targets(), plot_target_count()
Other target functions: join_mirtarbase(), join_targets(), plot_target_count(), plot_target_mir_scatter()
Dataset of PubMed data of miRNAs in Colorectal Cancer
Description
A dataset of PubMed abstracts on miRNAs in colorectal cancer.
Usage
df_crc
Format
A data frame.
Source
https://pubmed.ncbi.nlm.nih.gov/
miRTarBase version 8.0
Description
The most recent miRTarBase version 8.0, containing miRNA stem, capitalized targets, and PMIDs.
Usage
df_mirtarbase
Format
A data frame with the columns "miRNA_tarbase", "Target", and "PMID".
Details
miRTarBase was published in
Hsi-Yuan Huang, Yang-Chi-Dung Lin, Jing Li, et al., miRTarBase 2020: updates to the experimentally validated microRNA–target interaction database, Nucleic Acids Research, Volume 48, Issue D1, 08 January 2020, Pages D148–D154, https://doi.org/10.1093/nar/gkz896
Source
https://miRTarBase.cuhk.edu.cn:443/
Dataset of PubMed data of miRNAs in Pancreatic Cancer
Description
A dataset of PubMed abstracts of miRNAs in pancreatic cancer.
Usage
df_panc
Format
A data frame.
Source
https://pubmed.ncbi.nlm.nih.gov/
Test dataset of PubMed abstracts
Description
Test dataset of 20 PubMed abstracts.
Usage
df_test
Format
A data frame.
Source
https://pubmed.ncbi.nlm.nih.gov/
Extract miRNA names from abstracts in data frame
Description
Extract miRNA names from abstracts in a data frame.
Usage
extract_mir_df(
df,
threshold = 1,
col.abstract = Abstract,
extract_letters = FALSE
)
Arguments
df |
Data frame containing abstracts. |
threshold |
Integer. Specifies how often a miRNA must be mentioned in an abstract to be extracted. |
col.abstract |
Symbol. Column containing abstracts. |
extract_letters |
Boolean. If |
Details
Extract miRNA names from abstracts in a data frame. miRNA names can
either be extracted with their stem only, e.g. miR-23, or with their trailing
letter, e.g. miR-23a. miRNA names are adapted to the most recent miRBase
version (e.g. miR-97, miR-102, miR-180(a/b) become miR-30a, miR-29a,
and miR-172(a/b), respectively). Additionally, how often a miRNA must be
mentioned in an
abstract to be extracted can be regulated via the threshold
argument.
Ultimately, abstracts not containing any miRNA names
are silently dropped.
As many abstracts do not adhere to the miRNA nomenclature,
it is recommended to extract only the miRNA stem with
extract_letters = FALSE
.
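The interplay of stem extraction, trailing letters, and the threshold can be sketched in base R. The function extract_mir_sketch() and its regular expression are hypothetical simplifications and not the pattern used by miRetrieve; the sketch also omits the adaptation of names to the most recent miRBase version.

```r
extract_mir_sketch <- function(text, threshold = 1, extract_letters = FALSE) {
  # Hypothetical pattern: "miR-" plus digits and an optional trailing letter.
  hits <- regmatches(
    text,
    gregexpr("miR-[0-9]+[a-z]?", text, ignore.case = TRUE)
  )[[1]]
  if (length(hits) == 0) {
    return(character(0))
  }
  # Split each hit into its number and its (possibly empty) trailing letter.
  number <- sub("^.{4}([0-9]+).*$", "\\1", hits)
  letter <- tolower(sub("^.{4}[0-9]+", "", hits))
  mir <- paste0("miR-", number, if (extract_letters) letter else "")
  # Keep only names mentioned at least `threshold` times in the text.
  counts <- table(mir)
  names(counts)[as.vector(counts >= threshold)]
}

abstract <- "miR-21 and miR-21a were increased, whereas miR-155 was reduced."
stems <- extract_mir_sketch(abstract)
stems_twice <- extract_mir_sketch(abstract, threshold = 2)
with_letters <- extract_mir_sketch(abstract, extract_letters = TRUE)
```

With the stem only, miR-21 and miR-21a are counted as the same miRNA, which is why miR-21 still passes a threshold of 2 in this example.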
Value
Data frame with miRNA names extracted from abstracts.
See Also
Other extract functions:
extract_mir_string()
,
extract_snp()
Extract miRNA names from string
Description
Extract miRNA names from a string.
Usage
extract_mir_string(string, threshold = 1, extract_letters = FALSE)
Arguments
string |
String. String to search for miRNA names. |
threshold |
Integer. Specifies how often a miRNA must be mentioned in |
extract_letters |
Boolean. If |
Details
Extract miRNA names from a string. miRNA names can either be extracted with their stem only, e.g. miR-23, or with their trailing letter, e.g. miR-23a. Furthermore, miRNA names are adapted to the most recent miRBase version (e.g. miR-97, miR-102, miR-180(a/b) become miR-30a, miR-29a, and miR-172(a/b), respectively).
Value
Character vector containing miRNA names, if miRNA names are present in the string. If no miRNA names are present in the string, a message is returned saying "No miRNA found.".
See Also
Other extract functions:
extract_mir_df()
,
extract_snp()
Extract SNPs from abstracts in data frame
Description
Extract SNPs from abstracts in a data frame.
Usage
extract_snp(
df,
pattern = snp_pattern,
col.abstract = Abstract,
indicate = FALSE,
discard = FALSE
)
Arguments
df |
Data frame containing abstracts. |
pattern |
String. Regex pattern to identify SNPs. |
col.abstract |
Symbol. Column containing abstracts. |
indicate |
Boolean. If |
discard |
Boolean. If |
Details
Extract SNPs from abstracts in a data frame. SNPs are added to the data frame in a separate column. Furthermore, an optional column can indicate if SNPs are generally present in an abstract.
Value
Data frame. If discard = FALSE
, return the data frame with
an additional column for SNPs.
If discard = TRUE
, return only abstracts containing SNPs.
See Also
count_snp()
,
get_snp()
,
subset_snp()
Other extract functions:
extract_mir_df()
,
extract_mir_string()
Fit LDA-model
Description
Fit LDA-model with k
topics.
Usage
fit_lda(
df,
k,
stopwords = stopwords_miretrieve,
method = "gibbs",
control = NULL,
seed = 42,
col.abstract = Abstract,
col.pmid = PMID
)
Arguments
df |
Data frame containing abstracts and PubMed-IDs. |
k |
Integer. Number of topics to fit. Must be >=2. |
stopwords |
Data frame containing stop words. |
method |
String. Either |
control |
Control parameters for LDA modeling. For more information,
see the documentation of the |
seed |
Integer. Seed for reproducibility. |
col.abstract |
Column containing abstracts. |
col.pmid |
Column containing PubMed-ID. |
Details
Fit LDA-model with k
topics from a data frame.
fit_lda()
is based on LDA()
from the package
topicmodels.
Value
LDA-model.
See Also
Other LDA functions:
assign_topic_lda()
,
plot_lda_term()
,
plot_perplexity()
Generate data frame containing stop words
Description
Generate a data frame containing stop words.
Usage
generate_stopwords(stopwords, combine_with = NULL)
Arguments
stopwords |
Character vector. Vector containing stop words. |
combine_with |
Data frame containing stop words. Optional.
Data frame provided here must have only two columns, namely
|
Details
Generate data frame containing stop words from a character vector. This data
frame consists of two columns, namely word
, containing the stop words, and
lexicon
, containing the string "self-defined".
Additionally, the created data frame can be combined with other stop words
containing data frames, e.g. tidytext::stop_words
or
stopwords_miretrieve
.
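The structure of such a stop word data frame can be sketched in base R. The helper make_stopwords_sketch() and the data frame existing are hypothetical and only mirror the two-column layout described above; they are not the implementation of generate_stopwords().

```r
# Build a stop word data frame with the columns "word" and "lexicon".
make_stopwords_sketch <- function(stopwords, combine_with = NULL) {
  new_stopwords <- data.frame(
    word = stopwords,
    lexicon = "self-defined",
    stringsAsFactors = FALSE
  )
  if (is.null(combine_with)) {
    return(new_stopwords)
  }
  # Stack the new stop words on top of an existing word/lexicon data frame.
  rbind(new_stopwords, combine_with[, c("word", "lexicon")])
}

# Hypothetical existing stop word data frame with the same two columns.
existing <- data.frame(
  word = c("the", "of"),
  lexicon = "example",
  stringsAsFactors = FALSE
)
combined <- make_stopwords_sketch(c("expression", "cell"), combine_with = existing)
```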
Value
Data frame containing stop words.
References
Silge, Julia, and David Robinson. 2016. “tidytext: Text Mining and Analysis Using Tidy Data Principles in R.” JOSS 1 (3). The Open Journal. https://doi.org/10.21105/joss.00037.
See Also
combine_stopwords()
, stopwords_miretrieve, tidytext::stop_words
Other stopword functions:
combine_stopwords()
Identify top miRNA names distinct for one topic compared to another topic
Description
Identify top miRNA names distinct for one topic compared to another topic in a data frame.
Usage
get_distinct_mir_df(
df,
distinct,
top = 5,
topic = NULL,
col.topic = Topic,
col.mir = miRNA,
col.pmid = PMID
)
Arguments
df |
Data frame containing at least two topics and miRNA names. |
distinct |
String. Name of the topic for which the top distinct miRNAs shall be
identified. |
top |
Integer. Number of top miRNA names to extract for both topics. |
topic |
String. Vector of strings containing topic names to compare
miRNA names for. If |
col.topic |
Symbol. Column containing topic names. |
col.mir |
Symbol. Column containing miRNA names. |
col.pmid |
Symbol. Column containing PubMed-IDs. |
Details
Get top distinct miRNA names of one topic compared to another topic in a
data frame.
get_distinct_mir_df()
compares the top miRNA names of two topics and
returns the miRNA names that are exclusive for distinct
.
Value
Character vector containing miRNA names distinct for distinct
compared to the second topic provided in topic
.
See Also
Other get functions:
get_distinct_mir_vec()
,
get_mir()
,
get_pmid()
,
get_shared_mir_df()
,
get_shared_mir_vec()
,
get_snp()
Identify miRNA names distinct for one vector compared to another vector
Description
Identify miRNA names distinct for one vector compared to another vector.
Usage
get_distinct_mir_vec(mirna.vec.1, mirna.vec.2)
Arguments
mirna.vec.1 |
Character vector. First vector containing miRNA names. |
mirna.vec.2 |
Character vector. Second vector containing miRNA names. |
Details
Get distinct miRNA names of one vector compared to another vector.
get_distinct_mir_vec()
compares two vectors containing miRNA names and
returns the miRNA names that are exclusive to mirna.vec.1
.
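Conceptually, this comparison is a set difference. A minimal sketch with base R's setdiff() and made-up miRNA names follows; it illustrates the idea and may differ from how the package computes the result.

```r
# Hypothetical input vectors.
mirna.vec.1 <- c("miR-21", "miR-155", "miR-34")
mirna.vec.2 <- c("miR-21", "miR-126")

# miRNA names found in the first vector but not in the second.
distinct_1 <- setdiff(mirna.vec.1, mirna.vec.2)
```

Note that the operation is not symmetric: swapping the two vectors would return "miR-126" instead.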
Value
Character vector containing miRNA names distinct for mirna.vec.1
compared to mirna.vec.2
.
See Also
Other get functions:
get_distinct_mir_df()
,
get_mir()
,
get_pmid()
,
get_shared_mir_df()
,
get_shared_mir_vec()
,
get_snp()
Get miRNA names from a data frame
Description
Get miRNA names from a data frame. These miRNA names can either be the most frequent ones, or the ones exceeding a threshold.
Usage
get_mir(
df,
top = NULL,
threshold = NULL,
topic = NULL,
col.mir = miRNA,
col.pmid = PMID,
col.topic = Topic
)
Arguments
df |
Data frame containing miRNA names. If |
top |
Integer. Optional. Specifies number of most frequent miRNA names
to return. If neither |
threshold |
Integer or float. Optional. If |
topic |
String. Optional. Character vector specifying which topics to obtain miRNA names from. |
col.mir |
Symbol. Column containing miRNA names. |
col.pmid |
Symbol. Column containing PubMed-IDs. |
col.topic |
Symbol. Column containing topic names. |
Details
Get miRNA names from a data frame. These miRNA names can either be the most
frequent ones, or the ones exceeding a threshold. Furthermore, if the data
frame contains abstracts of different topics, only the miRNA names of
specific topics can be obtained by setting the topic
argument.
To get the most frequent miRNA names, set the top argument. top determines how many of the most frequent miRNA names are returned, according to their rank. Ties among the most frequently mentioned miRNAs are treated as the same rank, e.g. if miR-126, miR-34, and miR-29 were all mentioned most often with the same frequency, they would all be returned by specifying top = 1, top = 2, or top = 3.
To get the miRNA names exceeding a threshold, set the threshold argument. threshold can either be an absolute value, e.g. 3, or a float between 0 and 1, e.g. 0.2. If threshold is an absolute value, get_mir() returns only the miRNA names mentioned in at least threshold abstracts. If threshold is a float between 0 and 1, get_mir() returns only the miRNA names mentioned in at least that fraction of all abstracts. threshold requires the data frame to have a column with PubMed-IDs.
If neither top
nor threshold
is set, top
is automatically set to 5
.
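The difference between an absolute and a fractional threshold can be sketched in base R. The helper mir_above_threshold() and the example data are hypothetical; in particular, treating every value below 1 as a fraction of all abstracts is an assumption of this sketch, not a documented rule of get_mir().

```r
# Hypothetical data: one row per miRNA mention, identified by PubMed-ID.
df <- data.frame(
  PMID = c(1, 1, 2, 3, 3, 4, 5),
  miRNA = c("miR-21", "miR-155", "miR-21", "miR-21", "miR-34", "miR-155", "miR-21"),
  stringsAsFactors = FALSE
)

mir_above_threshold <- function(df, threshold) {
  # Count each miRNA once per abstract.
  pairs <- unique(df[, c("PMID", "miRNA")])
  n_abstracts <- table(pairs$miRNA)
  # Assumption: values below 1 are read as a fraction of all abstracts.
  cutoff <- if (threshold < 1) threshold * length(unique(df$PMID)) else threshold
  names(n_abstracts)[as.vector(n_abstracts >= cutoff)]
}

absolute <- mir_above_threshold(df, 2)    # mentioned in at least 2 abstracts
relative <- mir_above_threshold(df, 0.5)  # mentioned in at least half of 5 abstracts
```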
Value
Character vector containing miRNA names.
See Also
Other get functions:
get_distinct_mir_df()
,
get_distinct_mir_vec()
,
get_pmid()
,
get_shared_mir_df()
,
get_shared_mir_vec()
,
get_snp()
Get PubMed-IDs of a data frame
Description
Get PubMed-IDs of a data frame.
Usage
get_pmid(df, col.pmid = PMID, copy = TRUE)
Arguments
df |
Data frame containing PubMed-IDs. |
col.pmid |
Symbol. Column containing PubMed-IDs. |
copy |
Boolean. If |
Details
Get PubMed-IDs of a data frame. get_pmid()
either returns a character vector containing PubMed-IDs or copies the
PubMed-IDs to the clipboard. If PubMed-IDs
are copied to the clipboard, they can be used e.g. to search for abstracts on
PubMed.
Value
Copy to clipboard or character vector.
If copy = TRUE
, get_pmid()
copies
PubMed-IDs to clipboard.
If copy = FALSE
, get_pmid()
returns a character
vector, containing PubMed-IDs.
See Also
Other get functions:
get_distinct_mir_df()
,
get_distinct_mir_vec()
,
get_mir()
,
get_shared_mir_df()
,
get_shared_mir_vec()
,
get_snp()
Get top miRNA names in common between two topics of a data frame
Description
Get top miRNA names in common between two topics of a data frame.
Usage
get_shared_mir_df(
df,
top = 5,
topic = NULL,
col.topic = Topic,
col.mir = miRNA,
col.pmid = PMID
)
Arguments
df |
Data frame containing at least two topics and miRNA names. |
top |
Integer. Number of top miRNA names to extract for both topics. |
topic |
String. Vector of strings containing topic names to compare
miRNA names for. If |
col.topic |
Symbol. Column containing topic names. |
col.mir |
Symbol. Column containing miRNA names. |
col.pmid |
Symbol. Column containing PubMed-IDs. |
Details
Get top miRNA names in common between two topics of a data frame.
get_shared_mir_df()
compares the top miRNA names of two topics
in a data frame and returns the miRNA names in common.
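The comparison can be sketched in base R as "take the top miRNA names per topic, then intersect". The helper top_mir_sketch(), the topic names, and the example data are hypothetical, and counting one mention per row is an assumption of this sketch rather than a statement about the package internals.

```r
# Hypothetical data: one row per miRNA mention in an abstract.
df <- data.frame(
  Topic = rep(c("CRC", "Pancreas"), times = c(6, 5)),
  miRNA = c(
    rep("miR-21", 3), rep("miR-155", 2), "miR-34",   # topic "CRC"
    rep("miR-21", 2), rep("miR-126", 2), "miR-155"   # topic "Pancreas"
  ),
  PMID = 1:11,
  stringsAsFactors = FALSE
)

# Top miRNA names of one topic; tied miRNAs share the same rank.
top_mir_sketch <- function(df, topic, top) {
  counts <- table(df$miRNA[df$Topic == topic])
  names(counts)[rank(-as.vector(counts), ties.method = "min") <= top]
}

# miRNA names among the top 2 of both topics.
shared <- intersect(
  top_mir_sketch(df, "CRC", top = 2),
  top_mir_sketch(df, "Pancreas", top = 2)
)
```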
Value
Character vector containing miRNA names in common between two topics.
See Also
Other get functions:
get_distinct_mir_df()
,
get_distinct_mir_vec()
,
get_mir()
,
get_pmid()
,
get_shared_mir_vec()
,
get_snp()
Get miRNA names in common between two vectors
Description
Get miRNA names in common between two vectors.
Usage
get_shared_mir_vec(mirna.vec.1, mirna.vec.2)
Arguments
mirna.vec.1 |
Character vector. First vector containing miRNA names. |
mirna.vec.2 |
Character vector. Second vector containing miRNA names. |
Details
Get miRNA names in common between two vectors.
get_shared_mir_vec()
compares two vectors containing miRNA names and
returns the miRNA names that are in both vectors.
Value
Character vector containing miRNA names in common between two vectors.
See Also
Other get functions:
get_distinct_mir_df()
,
get_distinct_mir_vec()
,
get_mir()
,
get_pmid()
,
get_shared_mir_df()
,
get_snp()
Get SNPs from a data frame
Description
Get SNPs from a data frame.
Usage
get_snp(df, row = NULL, top = NULL, col.snp = SNPs, col.pmid = PMID)
Arguments
df |
Data frame containing SNPs. If |
row |
Integer. Optional. Specifies row from which SNP shall be obtained. Works best
with a data frame listing counts only as from |
top |
Integer. Optional. Specifies number of most frequent SNPs to return. |
col.snp |
Symbol. Column containing SNPs. |
col.pmid |
Symbol. Column containing PubMed IDs. Necessary if the data frame provided is not a count data frame. |
Details
Get SNPs from a data frame.
If a data frame containing SNP counts as from count_snp() is provided, these SNPs are specified by the row they are listed in. To get the SNPs by row, set the row argument.
If a data frame with PubMed IDs is provided, these SNPs are specified by their top occurrence. To get the SNPs by frequency, set the top argument.
If neither row
nor top
is provided, row
is automatically set to 1
.
Value
String or character vector containing SNPs.
See Also
extract_snp()
,
count_snp()
,
subset_snp()
Other get functions:
get_distinct_mir_df()
,
get_distinct_mir_vec()
,
get_mir()
,
get_pmid()
,
get_shared_mir_df()
,
get_shared_mir_vec()
Indicate if a miRNA name is contained in an abstract
Description
Indicate if a miRNA name is contained in an abstract with "Yes"/"No".
Usage
indicate_mir(df, indicate.mir, col.mir = miRNA)
Arguments
df |
Data frame containing miRNA names. |
indicate.mir |
Character vector. Vector containing miRNA names to indicate. |
col.mir |
Symbol. Column containing miRNA names. |
Details
Indicate if a miRNA name is contained in an abstract with "Yes"/"No".
This requires miRNA names already to be extracted, e.g. with extract_mir_df()
,
and to be stored in a separate column, specified by col.mir
.
indicate_mir()
adds another column to a data frame which bears the name
of the miRNA(s) of interest. Within this column, a "Yes" or "No" specifies
if this miRNA name is contained in the corresponding abstract.
Value
Data frame with as many columns added as miRNA names given
in indicate.mir
.
Per column, a "Yes" or "No" indicates if the miRNA name of interest
is present in the
corresponding abstract.
See Also
extract_mir_df()
, indicate_term()
Other indicate functions:
indicate_term()
Indicate if a term is contained in abstracts
Description
Indicate if a term is contained in abstracts.
Usage
indicate_term(
df,
term,
threshold = 1,
case = FALSE,
discard = FALSE,
col.abstract = Abstract
)
Arguments
df |
Data frame containing abstracts. |
term |
Character vector. Vector containing terms to indicate. |
threshold |
Integer. Sets how often a term must be in an abstract to be considered "present". |
case |
Boolean. If |
discard |
Boolean. If |
col.abstract |
Symbol. Column containing abstracts. |
Details
Indicate if a term is contained in an abstract. Terms provided can either
be case sensitive or insensitive. Per term, a new column is added to the data
frame indicating if the term is present in an abstract. Furthermore, whether a term
is considered "present" in an abstract can be regulated via the threshold
argument. threshold
determines how often a term must be in an abstract
to be considered "present".
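How a threshold on term occurrences might work can be sketched in base R. The helper count_term() is hypothetical, the sketch assumes that case = TRUE means case-sensitive matching, and it stores a logical flag in the new column; the values that indicate_term() actually writes are not specified in this section.

```r
# Count how often a term occurs in each abstract.
count_term <- function(abstracts, term, case = FALSE) {
  if (!case) {
    # Case-insensitive matching: compare everything in lower case.
    abstracts <- tolower(abstracts)
    term <- tolower(term)
  }
  lengths(regmatches(abstracts, gregexpr(term, abstracts, fixed = TRUE)))
}

# Hypothetical abstracts.
df <- data.frame(
  Abstract = c(
    "Inflammation drives fibrosis; inflammation persists.",
    "No relevant term here."
  ),
  stringsAsFactors = FALSE
)

# A term counts as "present" only if it occurs at least `threshold` times.
threshold <- 2
df$inflammation <- count_term(df$Abstract, "inflammation") >= threshold
```

With case-sensitive matching, the capitalized "Inflammation" in the first abstract would not be counted, so the same abstract would fall below a threshold of 2.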
Value
Data frame. If discard = FALSE
, the original data frame with additional
columns per term is returned. If discard = TRUE
, only abstracts containing the
terms in term
are returned.
See Also
Other indicate functions:
indicate_mir()
Add miRNA targets from miRTarBase version 8.0
Description
Add miRNA targets from miRTarBase version 8.0 to a data frame.
Usage
join_mirtarbase(
df,
col.pmid.df = PMID,
col.topic.df = NULL,
filter_na = TRUE,
reduce = FALSE
)
Arguments
df |
Data frame containing PubMed-IDs that the miRNA targets shall be joined to. |
col.pmid.df |
Symbol. Column containing PubMed-IDs in |
col.topic.df |
Symbol. Optional. Only important if |
filter_na |
Boolean. If |
reduce |
Boolean. If |
Details
Add miRNA targets from miRTarBase version 8.0 to a data frame.
join_mirtarbase()
can return two different data frames, regulated by reduce
:
If reduce = FALSE, join_mirtarbase() adds targets from miRTarBase 8.0 to the data frame in a new column. These targets then correspond to the targets determined in the research paper, but do not necessarily correspond to the miRNA names mentioned in the abstract.
If reduce = TRUE, join_mirtarbase() adds targets from miRTarBase 8.0 to the data frame in a new column. However, an altered data frame is returned, containing the PubMed-IDs, targets, and miRNAs from miRTarBase 8.0.
miRTarBase was published in
Hsi-Yuan Huang, Yang-Chi-Dung Lin, Jing Li, et al., miRTarBase 2020: updates to the experimentally validated microRNA–target interaction database, Nucleic Acids Research, Volume 48, Issue D1, 08 January 2020, Pages D148–D154, https://doi.org/10.1093/nar/gkz896
Value
Data frame containing miRNA targets.
See Also
Other target functions:
count_target()
,
join_targets()
,
plot_target_count()
,
plot_target_mir_scatter()
Add miRNA targets from an xlsx-file to a data frame
Description
Add miRNA targets from an external xlsx-file to a data frame.
Usage
join_targets(
df,
excel_file,
col.pmid.excel,
col.target.excel,
col.mir.excel = NULL,
col.pmid.df = PMID,
col.topic.df = NULL,
filter_na = TRUE,
stem_mir_excel = TRUE,
reduce = FALSE
)
Arguments
df |
Data frame containing PubMed-IDs that the miRNA targets shall be joined to. |
excel_file |
xlsx-file. xlsx-file containing miRNA targets and PubMed-IDs. |
col.pmid.excel |
String. Column containing PubMed-IDs of the
|
col.target.excel |
String. Column containing targets of the
|
col.mir.excel |
String. Optional. Column containing miRNAs of the
|
col.pmid.df |
Symbol. Column containing PubMed-IDs in |
col.topic.df |
Symbol. Optional. Only important if |
filter_na |
Boolean. If |
stem_mir_excel |
Boolean. If |
reduce |
Boolean. If |
Details
Add miRNA targets from an external xlsx-file to a data frame. To add the targets to the
data frame, the xlsx-file and the data frame need to have one column in
common, such as PubMed-IDs.
join_targets()
can return two different data frames, regulated by reduce
:
If reduce = FALSE, join_targets() adds targets from an xlsx-file to the data frame in a new column. These targets then correspond to the targets determined in the research paper, but do not necessarily correspond to the miRNA names mentioned in the abstract.
If reduce = TRUE, join_targets() adds targets from an xlsx-file to the data frame in a new column. However, an altered data frame is returned, containing the PubMed-IDs, targets, and miRNAs from the xlsx-file. For reduce = TRUE to work, the xlsx-file provided must contain a column with miRNA names.
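Joining on a shared PubMed-ID column can be sketched with base R's merge(). The two data frames below are hypothetical stand-ins for the input data frame and the content of an xlsx-file, and the two results only approximate the shapes described for reduce = FALSE and reduce = TRUE; the actual output of join_targets() may differ.

```r
# Hypothetical abstracts and a hypothetical target table sharing the column PMID.
df <- data.frame(
  PMID = c(101, 102, 103),
  Abstract = c("Abstract A", "Abstract B", "Abstract C"),
  stringsAsFactors = FALSE
)
targets <- data.frame(
  PMID = c(101, 101, 103),
  Target = c("PTEN", "PDCD4", "VEGFA"),
  miRNA = c("miR-21", "miR-21", "miR-126"),
  stringsAsFactors = FALSE
)

# Analogue of reduce = FALSE: keep the abstracts and add a Target column.
# Abstracts without a matching target get NA in this sketch.
joined <- merge(df, targets[, c("PMID", "Target")], by = "PMID", all.x = TRUE)

# Analogue of reduce = TRUE: keep only PubMed-IDs, targets, and miRNAs
# from the target table for PubMed-IDs present in df.
reduced <- targets[targets$PMID %in% df$PMID, c("PMID", "Target", "miRNA")]
```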
Value
Data frame containing miRNA targets.
See Also
Other target functions:
count_target()
,
join_mirtarbase()
,
plot_target_count()
,
plot_target_mir_scatter()
Stop words for n-grams
Description
Vector containing stop words for n-grams, based on tidytext::stop_words.
Usage
ngram_stopwords
Format
Character vector.
Source
tidytext::stop_words
Keywords - patients.
Description
Keywords to identify abstracts investigating miRNAs in patients.
Usage
patients_keywords
Format
An object of class character
of length 10.
Plot terms associated with LDA-fitted topics
Description
Plot terms associated with LDA-fitted topics.
Usage
plot_lda_term(lda_model, top.terms = 10, title = NULL)
Arguments
lda_model |
LDA-model. |
top.terms |
Integer. Top terms to plot per topic. |
title |
String. Plot title. |
Details
Plot terms associated with LDA-fitted topics. For each topic in the LDA-model,
the top terms are plotted. Plotting top.terms
for each topic can help
in identifying its subject.
Value
Bar plot with top terms per topic.
See Also
Other LDA functions:
assign_topic_lda()
,
fit_lda()
,
plot_perplexity()
Plot count of most frequently mentioned miRNA names
Description
Plot count of most frequently mentioned miRNA names in a data frame.
Usage
plot_mir_count(
df,
top = 10,
colour = "steelblue3",
col.mir = miRNA,
title = NULL
)
Arguments
df |
Data frame containing miRNA names. |
top |
Integer. Specifies number of most frequent miRNA names to plot. |
colour |
String. Colour of bar plot. |
col.mir |
Symbol. Column containing miRNA names. |
title |
String. Plot title. |
Details
Plot count of most frequently mentioned miRNA names in a data frame. How many
most frequently mentioned miRNAs are plotted is determined via the top
argument. Ties among the most frequently mentioned miRNAs are treated as
the same rank, e.g. if miR-126, miR-34, and miR-29 were all mentioned
the most often, they would all be plotted by specifying top = 1
, top = 2
,
or top = 3
.
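The tie handling in the example above can be reproduced in base R with competition ranking (ties.method = "min"). This is one ranking scheme that is consistent with the example; whether the package ranks in exactly this way is an assumption, and the counts and the helper top_by_rank() are hypothetical.

```r
# Hypothetical mention counts with a three-way tie at the top.
counts <- c("miR-126" = 5, "miR-34" = 5, "miR-29" = 5, "miR-21" = 2)

# Tied miRNAs share the lowest rank of their group: ranks are 1, 1, 1, 4.
top_by_rank <- function(counts, top) {
  names(counts)[rank(-counts, ties.method = "min") <= top]
}

top_1 <- top_by_rank(counts, 1)
top_3 <- top_by_rank(counts, 3)
top_4 <- top_by_rank(counts, 4)
```

Here top = 1, top = 2, and top = 3 all select the three tied miRNAs, and miR-21 is only added once top reaches 4.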
Value
Bar plot with the most frequently mentioned miRNA names in df
.
See Also
count_mir()
, count_mir_threshold()
, plot_mir_count_threshold()
Other count functions:
count_mir_threshold()
,
count_mir()
,
count_snp()
,
plot_mir_count_threshold()
Plot occurrence count of miRNA names over different thresholds
Description
Plot occurrence count of distinct miRNA names over different thresholds.
Usage
plot_mir_count_threshold(
df,
start = 1,
end = 5,
bins = NULL,
colour = "steelblue3",
col.mir = miRNA,
col.pmid = PMID,
title = NULL
)
Arguments
df |
Data frame containing columns with miRNAs and PubMed-IDs. |
start |
Integer or float. Must be greater than 0 and smaller than
|
end |
Integer or float. Must be greater than 0 and greater than
|
bins |
Integer. Optional. Only necessary if |
colour |
String. Colour of bar plot. |
col.mir |
Symbol. Column containing miRNAs. |
col.pmid |
Symbol. Column containing PubMed-IDs. |
title |
String. Plot title. |
Details
Plot occurrence of distinct miRNA names over different thresholds.
These thresholds can either be absolute values or floating values between 0
and 1.
If the thresholds are absolute values, the number of distinct miRNA names mentioned in at least n abstracts is plotted, where n covers the range of thresholds defined by start and end.
If the thresholds are floating values, bins must be specified as well. Then the number of distinct miRNA names mentioned in at least n abstracts is plotted over bins, where n covers the range of thresholds between start and end.
Overall, plotting can help in identifying if the abstracts
at hand mention different miRNAs in a balanced way, or if there are few miRNAs
dominating the field.
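For absolute thresholds, the quantity being plotted can be sketched in base R. The example data are hypothetical, and the sketch only computes the numbers behind such a bar plot; it does not reproduce the plot or the handling of bins.

```r
# Hypothetical data: one row per miRNA mention, identified by PubMed-ID.
df <- data.frame(
  PMID = c(1, 1, 2, 3, 3, 4, 5),
  miRNA = c("miR-21", "miR-155", "miR-21", "miR-21", "miR-34", "miR-155", "miR-21"),
  stringsAsFactors = FALSE
)

# Number of abstracts each miRNA is mentioned in (once per abstract).
pairs <- unique(df[, c("PMID", "miRNA")])
n_abstracts <- table(pairs$miRNA)

# For each threshold n from start to end, count the distinct miRNAs
# mentioned in at least n abstracts.
thresholds <- 1:4
n_mir <- sapply(thresholds, function(n) sum(n_abstracts >= n))
names(n_mir) <- thresholds
```

A steep drop from one threshold to the next, as in this toy example, would hint at a few miRNAs dominating the abstracts.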
Value
Bar plot counting the occurrence of miRNA names above different thresholds.
See Also
count_mir_threshold()
, count_mir()
, plot_mir_count()
Other count functions:
count_mir_threshold()
,
count_mir()
,
count_snp()
,
plot_mir_count()
Plot development of miRNA name mentioning over time
Description
Plot development of miRNA name mentioning over time.
Usage
plot_mir_development(
df,
mir,
start = NULL,
end = NULL,
linetype = "miRNA",
alpha = 0.8,
width = 0.3,
col.mir = miRNA,
col.year = Year,
title = NULL
)
Arguments
df |
Data frame containing miRNA names and publication years. |
mir |
Character vector. Vector containing miRNA names to plot. |
start |
Numeric. Optional. Specifies start year. If |
end |
Numeric. Optional. Specifies end year. If |
linetype |
String. Specifies linetype. |
alpha |
Float. Opacity of lines. |
width |
Float. Width of dodging lines. |
col.mir |
Symbol. Column containing miRNA names. |
col.year |
Symbol. Column containing year. |
title |
String. Plot title. |
Details
Plot how often a miRNA name was mentioned per year.
Value
Line plot displaying how often a miRNA name was mentioned per year.
See Also
Other miR development functions:
plot_mir_new()
Plot number of newly mentioned miRNA names/year
Description
Plot number of newly mentioned miRNA names/year.
Usage
plot_mir_new(
df,
threshold = 1,
start = NULL,
end = NULL,
colour = "steelblue3",
col.mir = miRNA,
col.year = Year,
title = NULL
)
Arguments
df |
Data frame containing miRNA names and publication years. |
threshold |
Integer. Specifies how often a miRNA must be mentioned in a year to be considered "mentioned". |
start |
Integer. Optional. Beginning of publication period.
If |
end |
Integer. Optional. End of publication period.
If |
colour |
String. Colour of bar plot. |
col.mir |
Symbol. Column containing miRNA names. |
col.year |
Symbol. Column containing publication year. |
title |
String. Plot title. |
Details
Plot how many miRNAs are mentioned for the first time in different years.
Whether a miRNA is considered to be "mentioned" in a year can be regulated
via the threshold
argument. If, for example, threshold
is set to 3, but a
miRNA is mentioned only twice in a year, it is not considered
to be "mentioned" for this year.
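One reading of this logic can be sketched in base R: count mentions per miRNA and year, keep the years in which the threshold is met, and take the earliest such year per miRNA. The helper new_mir_per_year() and the example data are hypothetical and are not the implementation of plot_mir_new().

```r
# Hypothetical data: one row per miRNA mention with its publication year.
df <- data.frame(
  Year = c(2010, 2010, 2011, 2011, 2011, 2012),
  miRNA = c("miR-21", "miR-155", "miR-155", "miR-155", "miR-21", "miR-34"),
  stringsAsFactors = FALSE
)

new_mir_per_year <- function(df, threshold = 1) {
  # Mentions per miRNA and year.
  per_year <- as.data.frame(
    table(miRNA = df$miRNA, Year = df$Year),
    stringsAsFactors = FALSE
  )
  # A miRNA is "mentioned" in a year only if the threshold is met.
  mentioned <- per_year[per_year$Freq >= threshold, ]
  # First year in which each miRNA counts as mentioned.
  first_year <- tapply(as.integer(mentioned$Year), mentioned$miRNA, min)
  # Number of newly mentioned miRNAs per year.
  table(first_year)
}

new_1 <- new_mir_per_year(df, threshold = 1)
new_2 <- new_mir_per_year(df, threshold = 2)
```

With threshold = 2, only miR-155 qualifies, and it counts as new in 2011 rather than 2010 because it was mentioned just once in 2010.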
Value
Bar plot displaying the number of newly mentioned miRNA names/year.
See Also
Other miR development functions:
plot_mir_development()
Plot count of top terms associated with a miRNA name
Description
Plot count of top terms associated with a miRNA name.
Usage
plot_mir_terms(
df,
mir,
top = 20,
tf.idf = FALSE,
token = "words",
...,
stopwords = stopwords_miretrieve,
stopwords_ngram = TRUE,
normalize = TRUE,
colour = "steelblue3",
col.mir = miRNA,
col.abstract = Abstract,
col.pmid = PMID,
title = NULL
)
Arguments
df |
Data frame containing miRNA names, abstracts, and PubMed-IDs. |
mir |
String. miRNA name of interest. |
top |
Integer. Number of top terms to plot. |
tf.idf |
Boolean. If |
token |
String. Specifies how abstracts shall be split up. Taken from
|
... |
Additional arguments for tokenization, if necessary. |
stopwords |
Data frame containing stop words. |
stopwords_ngram |
Boolean. Specifies if stop words shall be removed
from abstracts when using ngrams. Only applied when |
normalize |
Boolean. If |
colour |
String. Colour of bar plot. |
col.mir |
Symbol. Column containing miRNA names. |
col.abstract |
Symbol. Column containing abstracts. |
col.pmid |
Symbol. Column containing PubMed-IDs. |
title |
String. Plot title. |
Details
Plot count of top terms associated with a miRNA name.
Top terms associated with mir
have to be in df
as abstracts.
Number of top terms to plot is regulated via the top
argument.
Terms can either be evaluated as their count or in a tf-idf fashion.
If terms are evaluated as their count, they can either be
evaluated as their raw count, i.e. in how many abstracts they are mentioned
in conjunction with the miRNA name, or as their relative count, i.e.
in how many abstracts containing the miRNA they are mentioned compared to all
abstracts containing the miRNA.
If terms are evaluated in a tf-idf fashion, miRNA names are considered as
separate documents and
terms often associated with one miRNA, but not with other miRNAs get
more weight.
plot_mir_terms()
is based on the tools available in the tidytext package.
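The tf-idf weighting with miRNA names as documents can be sketched in base R using the common definition (term frequency times the natural log of the inverse document frequency). The term counts below are made up, and the sketch is independent of the tidytext functions the package relies on.

```r
# Hypothetical term counts: rows are terms, columns are miRNAs ("documents").
term_counts <- matrix(
  c(4, 4,   # "expression" is associated with both miRNAs
    3, 0,   # "apoptosis" only with miR-21
    0, 2),  # "angiogenesis" only with miR-126
  nrow = 3, byrow = TRUE,
  dimnames = list(
    c("expression", "apoptosis", "angiogenesis"),
    c("miR-21", "miR-126")
  )
)

# Term frequency: share of each term within one miRNA's terms.
tf <- sweep(term_counts, 2, colSums(term_counts), "/")

# Inverse document frequency: log(number of miRNAs / miRNAs with the term).
idf <- log(ncol(term_counts) / rowSums(term_counts > 0))

# Terms shared by all miRNAs get a weight of 0; specific terms get more weight.
tf_idf <- tf * idf
```

In this toy example, "expression" is the most frequent term for both miRNAs but receives a tf-idf of 0, whereas "apoptosis" becomes the top term for miR-21.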
Value
Bar plot displaying the count of the top terms associated with a miRNA name.
See Also
plot_wordcloud()
, tidytext::unnest_tokens()
Other miR term functions:
plot_wordcloud()
Plot perplexity score of various LDA models
Description
Plot perplexity score of various LDA models.
Usage
plot_perplexity(
df,
start = 2,
end = 5,
stopwords = stopwords_miretrieve,
method = "gibbs",
control = NULL,
col.abstract = Abstract,
col.pmid = PMID,
title = NULL
)
Arguments
df |
Data frame containing abstracts and PubMed-IDs. |
start |
Integer. Minimum amount of |
end |
Integer. Maximum amount of |
stopwords |
Data frame containing stop words. |
method |
String. Either |
control |
Control parameters for LDA modeling. For more information,
see the documentation of the |
col.abstract |
Column containing abstracts. |
col.pmid |
Column containing PubMed-ID. |
title |
String. Plot title. |
Details
Plot perplexity score of various LDA models. plot_perplexity()
fits
different LDA models for k
topics in the range
between start
and end
. For each
LDA model, the perplexity score is plotted against the corresponding value of
k
.
Plotting the perplexity score of various LDA models
can help in identifying the optimal number of topics to fit an LDA model for.
plot_perplexity()
is based on LDA()
from the package
topicmodels.
Value
Elbow plot displaying perplexity scores of different LDA models.
See Also
Other LDA functions:
assign_topic_lda()
,
fit_lda()
,
plot_lda_term()
Plot frequency of animal model scores in abstracts
Description
Plot frequency of animal model scores in abstracts.
Usage
plot_score_animals(
df,
keywords = animal_keywords,
case = FALSE,
bins = NULL,
colour = "steelblue3",
col.abstract = Abstract,
col.pmid = PMID,
title = NULL
)
Arguments
df |
Data frame containing abstracts. |
keywords |
Character vector. Vector containing keywords. The animal
model score is calculated based on these keywords. How much weight a keyword
in |
case |
Boolean. If |
bins |
Integer. Specifies how many bins are used to plot
the distribution. If |
colour |
String. Colour of histogram. |
col.abstract |
Symbol. Column containing abstracts. |
col.pmid |
Symbol. Column containing PubMed-IDs. |
title |
String. Plot title. |
Details
Plots a frequency distribution of animal model scores in abstracts of a
data frame. The animal model score is influenced by the choice of
terms in keywords
.
Plotting the distribution can help in deciding if the
terms are well-chosen, or in choosing the right threshold to decide
which abstracts are considered to contain animal models.
Value
Histogram displaying the distribution of animal scores in abstracts.
See Also
Other score functions:
assign_topic()
,
calculate_score_animals()
,
calculate_score_biomarker()
,
calculate_score_patients()
,
calculate_score_topic()
,
plot_score_biomarker()
,
plot_score_patients()
,
plot_score_topic()
Plot frequency of biomarker scores in abstracts
Description
Plot frequency of biomarker scores in abstracts.
Usage
plot_score_biomarker(
df,
keywords = biomarker_keywords,
case = FALSE,
bins = NULL,
colour = "steelblue3",
col.abstract = Abstract,
col.pmid = PMID,
title = NULL
)
Arguments
df |
Data frame containing abstracts. |
keywords |
Character vector. Vector containing keywords. The biomarker
score is calculated based on these keywords. How much weight a keyword
in |
case |
Boolean. If |
bins |
Integer. Specifies how many bins are used to plot
the distribution. If |
colour |
String. Colour of histogram. |
col.abstract |
Symbol. Column containing abstracts. |
col.pmid |
Symbol. Column containing PubMed-IDs. |
title |
String. Plot title. |
Details
Plots a frequency distribution of biomarker scores in abstracts of a
data frame. The biomarker score is influenced by the choice of
terms in keywords
.
Plotting the distribution can help in deciding if the
terms are well-chosen, or in choosing the right threshold to decide
which abstracts are considered to describe the use of miRNAs as biomarkers.
Value
Histogram displaying the distribution of biomarker scores in abstracts.
See Also
Other score functions:
assign_topic()
,
calculate_score_animals()
,
calculate_score_biomarker()
,
calculate_score_patients()
,
calculate_score_topic()
,
plot_score_animals()
,
plot_score_patients()
,
plot_score_topic()
Plot frequency of patient scores in abstracts
Description
Plot frequency of patient scores in abstracts.
Usage
plot_score_patients(
df,
keywords = patients_keywords,
case = FALSE,
bins = NULL,
colour = "steelblue3",
col.abstract = Abstract,
col.pmid = PMID,
title = NULL
)
Arguments
df |
Data frame containing abstracts. |
keywords |
Character vector. Vector containing keywords. The score is
calculated based on these keywords. How much weight a keyword in |
case |
Boolean. If |
bins |
Integer. Specifies how many bins are used to plot
the distribution. If |
colour |
String. Colour of histogram. |
col.abstract |
Symbol. Column containing abstracts. |
col.pmid |
Symbol. Column containing PubMed-IDs. |
title |
String. Plot title. |
Details
Plots a frequency distribution of patient scores in abstracts of a
data frame. The patient score is influenced by the choice of
terms in keywords
.
Plotting the distribution can help in deciding if the
terms are well-chosen, or in choosing the right threshold to decide
which abstracts are considered to contain patient material.
Value
Histogram displaying the distribution of patient scores in abstracts.
See Also
Other score functions: assign_topic(), calculate_score_animals(), calculate_score_biomarker(), calculate_score_patients(), calculate_score_topic(), plot_score_animals(), plot_score_biomarker(), plot_score_topic()
Plot frequency of self-chosen topic scores in abstracts
Description
Plot frequency of self-chosen topic scores in abstracts.
Usage
plot_score_topic(
df,
keywords,
case = FALSE,
name.topic = "TOPIC",
bins = NULL,
colour = "steelblue3",
col.abstract = Abstract,
col.pmid = PMID,
title = NULL
)
Arguments
df |
Data frame containing abstracts. |
keywords |
Character vector. Vector containing keywords. How much weight
a keyword in |
case |
Boolean. If |
name.topic |
String. Name of the topic. |
bins |
Integer. Specifies how many bins are used to plot
the distribution. If |
colour |
String. Colour of histogram. |
col.abstract |
Symbol. Column containing abstracts. |
col.pmid |
Symbol. Column containing PubMed-IDs. |
title |
String. Plot title. |
Details
Plots a frequency distribution of self-chosen topic scores in abstracts of a
data frame. The topic score is influenced by the choice of
terms in keywords. Plotting the distribution can help in choosing the right
threshold to decide which abstracts correspond to the self-chosen
topic.
Value
Histogram displaying the distribution of self-chosen topic scores in abstracts.
See Also
calculate_score_topic(), assign_topic()
Other score functions: assign_topic(), calculate_score_animals(), calculate_score_biomarker(), calculate_score_patients(), calculate_score_topic(), plot_score_animals(), plot_score_biomarker(), plot_score_patients()
Plot count of miRNA targets
Description
Plot count of miRNA targets.
Usage
plot_target_count(
df,
top = NULL,
threshold = NULL,
colour = "steelblue3",
col.target = Target,
title = NULL
)
Arguments
df |
Data frame with miRNA targets. |
top |
Numeric. Specifies number of top targets to be plotted. |
threshold |
Numeric. Specifies how often a target must be in |
colour |
String. Colour of bar plot. |
col.target |
Symbol. Column containing miRNA targets. |
title |
String. Plot title. |
Details
Plot count of miRNA targets as a bar plot. How many
targets are plotted is determined either by the top or by the threshold argument.
If top is given, targets with the highest count are plotted.
Ties among targets with the highest count are treated as
the same rank, e.g. if PTEN, AKT, and VEGFA all had the highest count,
they would all be plotted by specifying top = 1, top = 2, or top = 3.
If threshold is given, only targets with a count of at least threshold are plotted.
If neither top nor threshold is given, top is automatically set to 5.
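The selection rules described above can be sketched in base R. This is only an illustration, not the package's implementation: the helper select_targets() and the toy counts are hypothetical, ties are assumed to be resolved with minimum ranks, and letting top take precedence when both arguments are given is likewise an assumption.

```r
# Hypothetical helper: choose targets by `top` (ties share a rank)
# or by `threshold` (minimum count).
select_targets <- function(counts, top = NULL, threshold = NULL) {
  # Documented default: neither argument given -> top = 5
  if (is.null(top) && is.null(threshold)) top <- 5
  if (!is.null(top)) {
    # Minimum ranks: all targets tied for the highest count get rank 1
    # (assumption: `top` wins if both arguments are supplied)
    keep <- rank(-counts, ties.method = "min") <= top
  } else {
    keep <- counts >= threshold
  }
  counts[keep]
}

# Toy target counts, made up for illustration
counts <- c(PTEN = 7, AKT = 7, VEGFA = 7, TP53 = 4, MYC = 2)

names(select_targets(counts, top = 1))        # PTEN, AKT, VEGFA
names(select_targets(counts, top = 3))        # PTEN, AKT, VEGFA
names(select_targets(counts, threshold = 4))  # PTEN, AKT, VEGFA, TP53
```

With minimum ranks, the three tied targets occupy ranks 1 to 3 together, which is why top = 1, top = 2, and top = 3 all return the same set in this sketch.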
Value
Bar plot with target counts.
See Also
count_target(), join_targets()
Other target functions: count_target(), join_mirtarbase(), join_targets(), plot_target_mir_scatter()
Plot targets and corresponding miRNAs as a scatter plot
Description
Plot targets and corresponding miRNAs as a scatter plot.
Usage
plot_target_mir_scatter(
df,
mir = NULL,
target = NULL,
top = NULL,
threshold = NULL,
filter_for = "target",
col.target = Target,
col.mir = miRNA,
col.topic = Topic,
col.pmid = PMID,
title = NULL,
height = 0.05,
width = 0.05,
alpha = 0.6
)
Arguments
df |
Data frame containing targets and miRNA names. |
mir |
String or character vector. Specifies which miRNAs to plot. |
target |
String or character vector. Specifies which targets to plot. |
top |
Numeric. Specifies number of top targets/miRNA names to be plotted. |
threshold |
Numeric. Specifies how often a target/miRNA name must be in
|
filter_for |
String. Must either be |
col.target |
Symbol. Column containing miRNA targets. |
col.mir |
Symbol. Column containing miRNA names. |
col.topic |
Symbol. Column containing topic names. |
col.pmid |
Symbol. Column containing PubMed-IDs. |
title |
String. Plot title. |
height |
Double. Specifies height of jitter. |
width |
Double. Specifies width of jitter. |
alpha |
Double. Specifies opacity of points. |
Details
Plot targets and corresponding miRNAs as a scatter plot.
With filter_for, it can be determined if the focus shall be
on the top targets to plot their corresponding miRNAs,
or if the focus
shall be on the top miRNA names to plot their corresponding targets.
What "top targets" or "top miRNA names" mean can be determined via the
top and threshold arguments.
- If top is given, df is filtered for the most frequent targets/miRNA names.
- If threshold is given, df is filtered for all targets/miRNA names mentioned at least threshold times.
- If neither top nor threshold is given, top is automatically set to 5.
By plotting miRNAs
against their targets, it is visualized if one miRNA regulates many targets,
or if one target is regulated by many miRNAs. Furthermore, the miRNA-target
interactions are labelled according to their topic in col.topic, thereby
facilitating comparison of miRNA-target interactions across different topics.
Value
Scatter plot with targets and corresponding miRNAs.
See Also
Other target functions: count_target(), join_mirtarbase(), join_targets(), plot_target_count()
Create wordcloud of terms associated with a miRNA name
Description
Create wordcloud of terms associated with a miRNA name.
Usage
plot_wordcloud(
df,
mir,
min.freq = 1,
max.terms = 20,
tf.idf = FALSE,
token = "words",
...,
stopwords = stopwords_miretrieve,
stopwords_ngram = TRUE,
colours = "black",
random.colour = TRUE,
ordered.colour = FALSE,
col.mir = miRNA,
col.abstract = Abstract,
col.pmid = PMID
)
Arguments
df |
Data frame containing miRNA names, abstracts, and PubMed-IDs. |
mir |
String. miRNA name of interest. |
min.freq |
Integer. Specifies least number of times a term must be associated with
|
max.terms |
Integer. Maximum number of terms to plot. |
tf.idf |
Boolean. If |
token |
String. Specifies how abstracts shall be split up. Taken from
|
... |
Additional arguments for tokenization, if necessary. |
stopwords |
Data frame containing stop words. |
stopwords_ngram |
Boolean. Specifies if stop words shall be removed
from abstracts when using ngrams. Only applied when |
colours |
Vector of strings. Colours for wordcloud. |
random.colour |
Boolean. Taken from |
ordered.colour |
Boolean. Taken from |
col.mir |
Symbol. Column containing miRNA names. |
col.abstract |
Symbol. Column containing abstracts. |
col.pmid |
Symbol. Column containing PubMed-IDs. |
Details
Create wordcloud of terms associated with a miRNA name.
miRNA names must be in a data frame df, while terms are taken
from abstracts contained in df.
The number of terms to plot is regulated by max.terms, while min.freq regulates
the minimum number of times a term must be mentioned to be plotted.
Terms can either be evaluated by their raw count, i.e. how often they are
mentioned in conjunction with the miRNA of interest, or weighted in a tf-idf
fashion. If tf.idf = TRUE, miRNA names are considered as separate documents,
and terms often associated with one miRNA but not with other miRNAs get
more weight.
plot_wordcloud() is based on the tools available in the wordcloud package.
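To make the tf-idf weighting more concrete, the following base R sketch computes tf-idf by hand with each miRNA name treated as one document. It uses the common definition (term frequency times the natural log of the inverse document frequency); whether plot_wordcloud() computes it in exactly this way is an assumption, and the term counts are made up for illustration.

```r
# Toy term counts per miRNA (made up); each miRNA name is one "document".
term_counts <- data.frame(
  mirna = c("miR-21", "miR-21", "miR-21", "miR-155", "miR-155"),
  term  = c("cancer", "fibrosis", "apoptosis", "cancer", "inflammation"),
  n     = c(4, 3, 1, 5, 5)
)

# Term frequency: share of a term among all terms of its miRNA
term_counts$tf <- term_counts$n /
  ave(term_counts$n, term_counts$mirna, FUN = sum)

# Inverse document frequency: ln(number of miRNAs / miRNAs mentioning the term)
n_mirnas <- length(unique(term_counts$mirna))
docs_with_term <- ave(rep(1, nrow(term_counts)), term_counts$term, FUN = sum)
term_counts$idf <- log(n_mirnas / docs_with_term)

# Final weight
term_counts$tf_idf <- term_counts$tf * term_counts$idf

# Highest-weighted terms first
term_counts[order(-term_counts$tf_idf), ]
```

In this toy data, "cancer" is mentioned with both miRNAs and therefore receives a weight of zero, whereas "inflammation" and "fibrosis", each tied to a single miRNA, rank highest.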
Value
Wordcloud of terms associated with a miRNA name.
See Also
plot_mir_terms(), wordcloud::wordcloud(), tidytext::unnest_tokens()
Other miR term functions: plot_mir_terms()
Convert PubMed-file from PubMed into a data frame
Description
Convert PubMed-file from PubMed into a data frame.
Usage
read_pubmed(pubmed_file, topic = NULL)
Arguments
pubmed_file |
PubMed-file as .txt, downloaded from PubMed. |
topic |
String. Optional. If provided, adds a "Topic" column containing
|
Details
Convert a PubMed-file from PubMed into a data frame. The PubMed-file should contain PubMed-IDs, abstracts from research articles, abstract titles, publication years, abstract languages, and article types. The data frame created holds at least six columns, namely
- PMID, containing the PubMed-ID,
- Year, containing the publication year,
- Title, containing the title of the abstract,
- Abstract, containing the actual abstract,
- Language, containing the language(s) of the paper,
- Type, containing the article type.
If topic is provided, a "Topic" column is added, assigning all abstracts in
the resulting data frame to topic.
read_pubmed() is faster than read_pubmed_jats() and thus recommended.
Value
Data frame containing PubMed-IDs, abstracts, abstract titles, publication years, languages, and article types.
See Also
Other external data functions: read_pubmed_jats(), save_excel(), save_plot()
Convert JATS-file from PubMed into a data frame
Description
Convert JATS-file from PubMed into a data frame.
Usage
read_pubmed_jats(jats_file, topic = NULL)
Arguments
jats_file |
JATS-file, downloaded from PubMed. |
topic |
String. Optional. If provided, adds a "Topic" column containing
|
Details
Converts a JATS-file from PubMed into a data frame. The JATS-file should contain PubMed-IDs, abstracts from research articles, abstract titles, publication years, abstract languages, and article types. The data frame created holds at least six columns, namely
- PMID, containing the PubMed-ID,
- Year, containing the publication year,
- Title, containing the title of the abstract,
- Abstract, containing the actual abstract,
- Language, containing the language(s) of the paper,
- Type, containing the article type.
If topic is provided, a "Topic" column is added, assigning all abstracts in
the resulting data frame to topic.
read_pubmed() is faster than read_pubmed_jats() and thus recommended.
Value
Data frame containing PubMed-IDs, abstracts, abstract titles, publication years, languages, and article types.
See Also
Other external data functions: read_pubmed(), save_excel(), save_plot()
Save data frame(s) as xlsx-file
Description
Save data frame(s) locally as an xlsx-file.
Usage
save_excel(..., excel_file = "miRetrieve_data.xlsx")
Arguments
... |
Data frame(s) to save. |
excel_file |
String. File name that |
Details
Saves data frame(s) locally as an xlsx-file. If more than one data frame is provided, the data frames are saved in one xlsx-file with one sheet per data frame.
Wrapper function of write.xlsx() from openxlsx.
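The "one sheet per data frame" behaviour can be illustrated by calling openxlsx::write.xlsx() directly with a named list. This is a sketch of the underlying openxlsx mechanism, not of save_excel() itself: the data frames, the sheet names, and the temporary file are chosen for illustration, and how save_excel() names its sheets is not specified here.

```r
# Runs only if openxlsx is installed
if (requireNamespace("openxlsx", quietly = TRUE)) {
  # Toy data frames, made up for illustration
  mirnas  <- data.frame(miRNA = c("miR-21", "miR-155"), Mentioned_n = c(12, 8))
  targets <- data.frame(Target = c("PTEN", "VEGFA"), Mentioned_n = c(5, 3))

  # A named list yields one sheet per data frame; names become sheet names
  out_file <- tempfile(fileext = ".xlsx")
  openxlsx::write.xlsx(list(miRNAs = mirnas, Targets = targets), file = out_file)

  # Inspect which sheets were written
  sheets <- openxlsx::getSheetNames(out_file)
  print(sheets)
}
```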
Value
xlsx-file, locally saved.
See Also
Other external data functions: read_pubmed_jats(), read_pubmed(), save_plot()
Save the last generated figure
Description
Save the last generated figure locally.
Usage
save_plot(
plot_file,
width = NULL,
height = NULL,
units = "in",
dpi = 300,
device = NULL
)
Arguments
plot_file |
String. File name that the figure
shall be saved to. Can end in either ".png", ".tiff",
".pdf", ".jpeg", or ".bmp". For more information, see the documentation
of |
width |
Integer. Optional. Plot width. If |
height |
Integer. Optional. Plot height. If |
units |
String. Units for |
dpi |
Integer. Resolution for raster graphics such as .png-files. |
device |
String or function. Specifies which device to use (such as
"pdf" or |
Details
Saves the last generated figure locally. Wrapper
function of ggsave() from ggplot2. For further details, please
see ?ggplot2::ggsave.
Value
Plot, locally saved.
See Also
Other external data functions: read_pubmed_jats(), read_pubmed(), save_excel()
Stop words for text mining with common PubMed 2-grams
Description
Data frame containing PubMed 2-gram stop words, manually curated from PubMed abstracts.
Usage
stopwords_2gram
Format
Tibble.
- word: Column containing stop words. Pulled from various PubMed abstracts.
- lexicon: Column specifying lexicon.
Source
Manually created from various PubMed abstracts.
Stop words for text mining with miRetrieve
Description
Data frame containing English stop words, PubMed stop words, and common 2-gram stop words. English stop words are based on tidytext::stop_words, while PubMed stop words are manually curated from PubMed abstracts.
Usage
stopwords_miretrieve
Format
Tibble.
- word: Column containing stop words. Pulled from various PubMed abstracts.
- lexicon: Column specifying lexicon.
Source
tidytext::stop_words; manually created from various PubMed abstracts.
Stop words for text mining from PubMed abstracts
Description
Data frame containing PubMed stop words, manually curated from PubMed abstracts.
Usage
stopwords_pubmed
Format
Tibble.
- word: Column containing stop words. Pulled from various PubMed abstracts.
- lexicon: Column specifying lexicon.
Source
Manually created from various PubMed abstracts.
Subset data frame for a term
Description
Subset data frame for a term in a specified column.
Usage
subset_df(df, col.filter, filter_for = "Yes")
Arguments
df |
Data frame to subset. |
col.filter |
String. Name of column to filter. |
filter_for |
String. Term to filter for. |
Details
Subset data frame for a term in a specified column.
subset_df() filters a data frame for a certain term in a specified column. All
rows containing the term in the specified column are kept, while the other
rows are silently dropped.
Here, col.filter is a string rather than
a symbol to facilitate filtering in columns that carry special characters
such as '-' in their name.
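Why a string is convenient for column names with special characters can be seen in a small base R sketch. It mimics the idea of the filter rather than calling subset_df() itself; the toy data frame is made up, and comparing for an exact match is an assumption about how the term is matched.

```r
# Toy data frame with a hyphenated column name (made up for illustration)
df <- data.frame(
  PMID = c(1, 2, 3),
  `miR-21` = c("Yes", "No", "Yes"),
  check.names = FALSE  # keep the hyphen in the column name
)

# As a string, the column name needs no special quoting
col.filter <- "miR-21"
filter_for <- "Yes"

# Keep rows where the column equals the term (exact match assumed)
subset_rows <- df[df[[col.filter]] == filter_for, , drop = FALSE]
subset_rows$PMID  # 1, 3
```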
Value
Data frame, subset for rows where filter_for was present in col.filter.
See Also
indicate_term(), indicate_mir(), extract_snp()
Other subset functions: subset_mir_threshold(), subset_mir(), subset_research(), subset_review(), subset_snp(), subset_year()
Subset data frame for specific miRNA names
Description
Subset data frame for specific miRNA names only.
Usage
subset_mir(df, mir.retain, col.mir = miRNA)
Arguments
df |
Data frame containing miRNA names. |
mir.retain |
Character vector. Vector specifying which miRNA names to keep.
miRNA names in |
col.mir |
Symbol. Column containing miRNA names. |
Details
Subset data frame for specific miRNA names only.
Value
Data frame containing only specified miRNA names.
If no miRNA name in mir.retain matches a miRNA name in col.mir, subset_mir() stops
with a warning saying "No miRNA name in 'mir.retain' matches a miRNA name in 'col.mir'. Could not filter for miRNA name.".
See Also
get_mir(), subset_mir_threshold()
Other subset functions: subset_df(), subset_mir_threshold(), subset_research(), subset_review(), subset_snp(), subset_year()
Subset data frame for miRNA names exceeding a threshold
Description
Subset data frame for miRNA names whose frequency exceeds a threshold.
Usage
subset_mir_threshold(df, threshold = 1, col.mir = miRNA, col.pmid = PMID)
Arguments
df |
Data frame containing miRNA names and PubMed-IDs. |
threshold |
Integer or float. If |
col.mir |
Symbol. Column containing miRNA names. |
col.pmid |
Symbol. Column containing PubMed-IDs. |
Details
Subset data frame for miRNA names whose frequency exceeds a threshold.
This threshold can either
be an absolute value, e.g. 3, or a float between 0 and 1, e.g. 0.2.
If threshold is an absolute value, subset_mir_threshold() retains
miRNA names mentioned in at least threshold abstracts.
If threshold is a float between 0 and 1, subset_mir_threshold() retains
miRNA names mentioned in at least the fraction threshold
of all abstracts in df.
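The two readings of threshold can be sketched in base R as follows. The helper keep_frequent_mirnas() and the toy data are hypothetical, and treating values strictly between 0 and 1 as a fraction of all distinct PubMed-IDs is an assumption about the exact cutoff, not the package's code.

```r
# Hypothetical helper; `mentions` holds one row per (PMID, miRNA) pair.
keep_frequent_mirnas <- function(mentions, threshold = 1) {
  n_abstracts <- length(unique(mentions$PMID))
  # Number of distinct abstracts mentioning each miRNA
  per_mir <- tapply(mentions$PMID, mentions$miRNA,
                    function(x) length(unique(x)))
  # Assumption: values strictly between 0 and 1 are read as a fraction
  cutoff <- if (threshold > 0 && threshold < 1) {
    threshold * n_abstracts
  } else {
    threshold
  }
  keep <- names(per_mir)[per_mir >= cutoff]
  mentions[mentions$miRNA %in% keep, , drop = FALSE]
}

# Toy data: 5 abstracts, made up for illustration
mentions <- data.frame(
  PMID  = c(1, 1, 2, 3, 4, 5),
  miRNA = c("miR-21", "miR-155", "miR-21", "miR-21", "miR-155", "miR-126")
)

unique(keep_frequent_mirnas(mentions, threshold = 2)$miRNA)    # miR-21, miR-155
unique(keep_frequent_mirnas(mentions, threshold = 0.5)$miRNA)  # miR-21
```

Here threshold = 0.5 translates into "at least 2.5 of 5 abstracts", so only miR-21 (3 abstracts) is retained.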
Value
Data frame, subset for miRNA names whose frequency exceeds a threshold.
See Also
Other subset functions: subset_df(), subset_mir(), subset_research(), subset_review(), subset_snp(), subset_year()
Subset data frame for abstracts of research articles
Description
Subset data frame for abstracts of research articles only.
Usage
subset_research(df, col.type = Type)
Arguments
df |
Data frame containing article types. |
col.type |
Symbol. Column containing article types. |
Details
Subset data frame for abstracts of research articles only. At the same time, abstracts from other article types such as Review, Letter, etc. are dropped.
Value
Data frame containing abstracts of research articles only.
See Also
subset_review(), subset_year()
Other subset functions: subset_df(), subset_mir_threshold(), subset_mir(), subset_review(), subset_snp(), subset_year()
Subset data frame for abstracts of review articles
Description
Subset data frame for abstracts of review articles only.
Usage
subset_review(df, col.type = Type)
Arguments
df |
Data frame containing article types. |
col.type |
Symbol. Column containing article types. |
Details
Subset data frame for abstracts of review articles only. At the same time, abstracts from other article types such as Journal Article, Letter, etc. are dropped.
Value
Data frame containing abstracts of review articles only.
See Also
subset_research(), subset_year()
Other subset functions: subset_df(), subset_mir_threshold(), subset_mir(), subset_research(), subset_snp(), subset_year()
Subset data frame for specific SNPs
Description
Subset data frame for specific SNPs only.
Usage
subset_snp(df, snp.retain, col.snp = SNPs)
Arguments
df |
Data frame containing SNPs. |
snp.retain |
Character vector. Vector specifying which SNPs to keep.
SNPs in |
col.snp |
Symbol. Column containing SNPs. |
Details
Subset data frame for specific SNPs only.
Value
Data frame containing only specified SNPs.
If no SNP in snp.retain matches a SNP in col.snp, subset_snp() stops
with a warning saying "No SNP in 'snp.retain' matches a SNP in 'col.snp'. Could not filter for SNP.".
See Also
extract_snp(), count_snp(), get_snp()
Other subset functions: subset_df(), subset_mir_threshold(), subset_mir(), subset_research(), subset_review(), subset_year()
Subset data frame for abstracts published in a specific period
Description
Subset data frame for abstracts published in a specific period only.
Usage
subset_year(df, col.year = Year, start = NULL, end = NULL)
Arguments
df |
Data frame containing publication years. |
col.year |
Symbol. Column containing publication years. |
start |
Integer. Optional. Beginning of
publication period.
If |
end |
Integer. Optional. End of
publication period.
If |
Details
Subset data frame for abstracts published in a specific period only. All abstracts published outside this period are silently dropped.
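The idea of a publication-period filter can be sketched in base R. The helper filter_years() is hypothetical and not the package's implementation; in particular, treating the bounds as inclusive and leaving a side open when start or end is NULL are assumptions made for this sketch.

```r
# Hypothetical helper mirroring the idea of a publication-period filter.
filter_years <- function(df, start = NULL, end = NULL) {
  # Assumption: a missing bound leaves that side of the period open
  if (is.null(start)) start <- min(df$Year)
  if (is.null(end)) end <- max(df$Year)
  # Assumption: both bounds are inclusive
  df[df$Year >= start & df$Year <= end, , drop = FALSE]
}

# Toy data, made up for illustration
abstracts <- data.frame(
  PMID = 1:5,
  Year = c(2008, 2012, 2015, 2019, 2021)
)

filter_years(abstracts, start = 2012, end = 2019)$PMID  # 2, 3, 4
filter_years(abstracts, start = 2015)$PMID              # 3, 4, 5
```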
Value
Data frame containing abstracts published in a specific period only.
See Also
subset_research(), subset_review()
Other subset functions: subset_df(), subset_mir_threshold(), subset_mir(), subset_research(), subset_review(), subset_snp()