---
title: "drugfindR"
output: BiocStyle::html_document
vignette: >
  %\VignetteIndexEntry{drugfindR}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set( # nolint: extraction_operator_linter.
    collapse = TRUE,
    comment = "#>"
)
```

```{r setup}
library(tidyverse)
library(drugfindR)
```

## Introduction

`drugfindR` provides end-users with a convenient method for accessing
the Library of Integrated Network-Based Cellular Signatures (LINCS).
The LINCS project aims to create a network-based understanding of biology
by systematically cataloging changes in cellular processes, namely gene
expression, that occur when cells are exposed to a variety of perturbing
agents.
iLINCS is an integrated web-based platform designed for the analysis of omics
data and signatures of cellular perturbagens.
While the iLINCS analysis workflows integrate vast omics data resources and a
range of analytic and visual tools into a comprehensive platform, `drugfindR`
is advantageous in that it is scriptable and usable from within R without
relying on the iLINCS web platform. `drugfindR` also possesses the capability
of running all input signatures simultaneously, which makes investigating
a particular gene or drug extremely efficient. From the output data generated
by `drugfindR`, end-users may understand how the overexpression or knockdown
of a specific gene affects the expression of genes within the same cellular
system, identify downstream molecular consequences of gene perturbation within
a system, and investigate candidate drugs that may be repurposed for other
physiological reasons.

## Installation

`drugfindR` can be installed from GitHub using the `devtools` package:

```{r install}
#| eval: FALSE
devtools::install_github("CogDisResLab/drugfindR")
```

## Use Cases

`drugfindR` has multiple features that make interfacing with the iLINCS
database and analyzing LINCS data simple and efficient. However, the package is
explicitly designed for two primary use cases:

1. Using an input transcriptomic signature to identify candidate drugs in the
iLINCS database
2. Identifying drugs or other genes that are similar (or opposite) in function
to a given drug or gene.

## Package Design

This package provides two different ways to achieve these use cases. First,
there is a set of five functions that can be deployed in a pipeline for the
results. Then, there are two functions `investigateTarget()` and
`investigateSignature()` that perform the entire pipeline in one function call
with sensible defaults.

### Pipeline Components

The five pipeline functions are:

1. `getSignature()`: This function takes a LINCS ID and returns the
corresponding signature.
2. `prepareSignature()`: This function takes a transcriptomic signature and
prepares it for analysis by `drugfindR`.
3. `filterSignature()`: This function takes a signature and filters it to given
thresholds.
4. `getConcordants()`: This function takes a signature and returns the
concordant signatures from the iLINCS database.
5. `consensusConcordants()`: This function takes a list of concordant signatures
and returns a list of consensus signature.

## Use Case 1: Identifying Candidate Drugs from an Input Signature

For this case, we will use one of the signatures that was used in the paper
["Identification of candidate repurposable drugs to combat COVID - 19 using a
signature - based approach" by O'Donovan, Imami, et al]
(https://www.nature.com/articles/s41598-021-84044-9).

In that paper, the authors used the available gene expression data from cells
infected with SARS-CoV-2 to identify potential
drugs that could be repurposed to treat COVID-19. We will use one of the
signatures that they have provided in their paper to showcase how `drugfindR`
can be used to identify candidate drugs from an input signature. We will use
the `dCovid_diffexp.tsv` signature from the paper.

### Step 1: Get the Signature

This signature is available with the package. Our first step is to download the
signature so we can work with it. The `read_tsv()` function from the `readr`
package can be used to read the signature into R from a remote URL or a local
file.

```{r load_signature}
# Load the signature from the paper

diffexp <- read_tsv(
    system.file("extdata", "dCovid_diffexp.tsv",
        package = "drugfindR"
    )
)

# Take a look at the signature

head(diffexp) |>
    knitr::kable()
```

We can see that the signature has `ncol(diffexp)` columns and `nrow(diffexp)`
rows. The names of the columns are typical of what you would get from
[edgeR](https://bioconductor.org/packages/release/bioc/html/edgeR.html) or
[DESeq2](https://bioconductor.org/packages/release/bioc/html/DESeq2.html).

### Step 2: Prepare the Signature

The next step is to prepare the signature for analysis by `drugfindR`. This
step is necessary because the signature can be in many different formats,
with different names for columns. iLICNS needs columns to be in a specific
order and with specific names. The `prepareSignature()` function takes care of
this for us.

`prepareSignature()` takes three optional arguments:

1. `geneColumn`: The name of the column in the input that contains the gene
names. The default is `"Symbol"`.
2. `logfcColumn`: The name of the column in the input that contains the log
fold change values. The default is `"logFC"`.
3. `pvalColumn`: The name of the column in the input that contains the p-values.
The default is `"PValue"`.

```{r prepareSignature}
# Prepare the signature for analysis
# The only thing that is different from the defaults is the gene_column
# However, we will specify all three arguments for clarity

signature <- prepareSignature(diffexp,
    geneColumn = "hgnc_symbol",
    logfcColumn = "logFC", pvalColumn = "PValue"
)

# Take a look at the signature

head(signature) |>
    knitr::kable()
```

We can see that the signature has been reordered and renamed. The first column
is now `names(signature)[1]`, the second column is now `names(signature)[2]`,
and the third column is now `names(signature)[3]`, which is what iLINCS expects.

### Step 3: Filter the Signature

Now that we have the signature in the correct format and filtered to the L1000
genes, we can filter it to the thresholds that we want. This filter step is
necessary because we would like to use the genes that have a high enough change
for it to matter.

The `filterSignature()` function can filter based on logFC values in two ways:

1. Absolute Threshold: You can give an absolute threshold (
or a pair of absolute thresholds) for the logFC values. Any genes
that do not meet the threshold will be removed from the signature.

2. Percentile Threshold: You can give a percentile threshold (or a pair of
percentile thresholds) for the logFC values. Any genes that do not meet the
threshold will be removed from the signature.

The `filterSignature()` function takes three arguments:

1. `signature`: The signature to filter.
2. `direction`: This argument specifies whether to filter for upregulated genes,
downregulated genes, or both. The default is `"any"`.
3. One of `threshold` or `prop`: The threshold argument is used to specify an
absolute threshold (or a pair of absolute thresholds) for the logFC values.
The prop argument is used to specify a percentile threshold (or a pair of
percentile thresholds) for the logFC values. They can not be specified together.

```{r filterSignatureUp}
# Filter the signature to only include genes that are upregulated by at least
# 1.5 logFC

filteredSignatureUp <- filterSignature(signature,
    direction = "up",
    threshold = 1.5
)

filteredSignatureUp |>
    head() |>
    knitr::kable()
```

```{r filterSignature_dn}
# Filter the signature to only include genes that are downregulated by at least
# 1.5 logFC
filteredSignatureDn <- filterSignature(signature,
    direction = "down",
    threshold = 1.5
)

filteredSignatureDn |>
    head() |>
    knitr::kable()
```

### Step 4: Get the Concordant Signatures

Now that we have the filtered signatures for both upregulated and downregulated
genes, we can get the concordant signatures from the iLINCS database. The
`getConcordants()` function takes a signature and returns the concordant
signatures from the iLINCS database.
It also requires specification of the database to target for the concordant
signatures.

The `getConcordants()` function takes the following arguments:

1. `signature`: The signature to get concordant signatures for.
2. `ilincsLibrary`: The iLINCS library to target for concordant signatures.
This can be one of c("OE", "KD", "CP"), standing for overexpression, knockdown,
and chemical perturbagens, respectively.
3. `direction`: This argument specifies whether the input signature is
upregulated or downregulated. This is useful to annotate the output.
This is `NULL` by default.

```{r getConcordants}
# Get the concordant signatures for the upregulated signature

upConcordants <- getConcordants(filteredSignatureUp, ilincsLibrary = "CP")

upConcordants |>
    head() |>
    knitr::kable()

# Get the concordant signatures for the downregulated signature

dnConcordants <- getConcordants(filteredSignatureDn, ilincsLibrary = "CP")

dnConcordants |>
    head() |>
    knitr::kable()
```

### Step 5: Get the list of Consensus Concordant Signatures

Now that we have the concordant signatures for both the upregulated and
downregulated signatures, we can get the list of consensus concordant
signatures. The `consensusConcordants()` function takes a list of concordant
signatures and returns a list of consensus signatures.
This function also takes a number of optional arguments that can be used to
control the consensus list generation.

By default the consensus list performs the following steps:

1. Combine the list of concordant signatures into a single data frame.
2. For each individual signature origin (Gene or Drug), choose the one with the
largest absolute concordance value.

Additionally, we can filter by the cell line to only include the cell lines of
interest.

The `consensusConcordants()` function takes the following arguments:

1. `...`: One or Two (see paired) Data Frames with the concordants
2. `paired`: A logical value indicating whether the input is a single data
frame with paired signatures or two data frames with unpaired signatures. The
default is `FALSE`.
3. `cellLines`: A character vector of cell lines to filter the consensus list
to. The default is `NULL`, which means no filtering.
4. `cutoff`: The absolute cutoff value of similarity to use when filtering the
consensus list. The default is `0.321`.

```{r consensusConcordants}
# Get the consensus concordant signatures for the upregulated signature

consensus <- consensusConcordants(upConcordants, dnConcordants,
    paired = TRUE, cutoff = 0.2
)

consensus |>
    head() |>
    knitr::kable()
```

### Alternate One-Step Method

The above method breaks down the entire method into five steps. However,
`drugfindR` also provides two functions that perform the entire
pipeline in one function call with sensible defaults. These functions are
`investigateTarget()` and `investigateSignature()`.

For this use case, `investigateSignature()` is the function that we want to use.
It takes the following required arguments:

1. `expr`: The signature to investigate.
2. `outputLib`: The iLINCS library to target for concordant signatures.
This can be one of c("OE", "KD", "CP"), standing for overexpression, knockdown,
and chemical perturbagens, respectively.
3. `filterThreshold`: The absolute threshold (or a pair of absolute thresholds)
for the logFC values. Any genes that do not meet the threshold will be removed
from the signature.
4. `filterProp`: The percentile threshold (or a pair of percentile thresholds)
for the logFC values. Any genes that do not meet the threshold will be removed
from the signature.

Other arguments that have sensible defaults are:

1. `similarityThreshold`: The absolute cutoff value of similarity to use when
filtering the consensus list. The default is `0.2`.
2. `paired`: A logical value indicating whether the to split the input
dataframe in up and downregulated signatures. The default is `TRUE`.
3. `outputCellLines`: A character vector of cell lines to filter the consensus
list to. The default is `NULL`, which means no filtering.
4. `geneColumn`: The name of the column in the input that contains the gene
names. The default is `"Symbol"`.
5. `logfcColumn`: The name of the column in the input that contains the log
fold change values. The default is `"logFC"`.
6. `pvalColumn`: The name of the column in the input that contains the p-values.
The default is `"PValue"`.
7. `sourceName`: The name of the source of the signature. The default is
`"Input"`.
8. `sourceCellLine`: The cell line of the source of the signature.
The default is `"NA"`.
9. `sourceTime`: The time of the source of the signature. The default is `"NA"`.
10. `sourceConcentration`: The concentration of the source of the signature.
The default is `"NA"`.


```{r investigateSignature}
investigated <- investigateSignature(diffexp,
    outputLib = "CP", filterThreshold = 1.5,
    geneColumn = "hgnc_symbol", logfcColumn = "logFC",
    pvalColumn = "PValue"
)

investigated |>
    head() |>
    knitr::kable()
```

## Environment Setup

```{r sessionInfo}
devtools::session_info()
```