---
title: "Introduction to ggvariant"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Introduction to ggvariant}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.width = 7,
  fig.height = 4.5,
  out.width = "100%"
)
```

## Overview

`ggvariant` provides a simple, ggplot2-native toolkit for visualising genomic variant data. Whether you are a wet-lab biologist working from an Excel export or an experienced bioinformatician loading VCF files directly, `ggvariant` gets you to a publication-ready plot in just a few lines of R.

This vignette walks through a complete workflow:

1. Loading variant data (VCF file or data frame)
2. Exploring variants with a lollipop plot
3. Summarising consequences across samples and genes
4. Visualising the mutational spectrum

---

## Installation

```{r install, eval = FALSE}
# Install from CRAN
install.packages("ggvariant")

# Or install the development version from GitHub
# remotes::install_github("yourname/ggvariant")
```

```{r load}
library(ggvariant)
```

---

## Loading variant data

### Option 1: From a VCF file

`read_vcf()` parses standard VCF v4.x files, including gzipped files and multi-sample VCFs, and returns a tidy data frame called a `gvf` object. Functional annotations from SnpEff (`ANN`) or VEP (`CSQ`) INFO fields are extracted automatically.

```{r read-vcf}
vcf_file <- system.file("extdata", "example.vcf", package = "ggvariant")
variants <- read_vcf(vcf_file)
head(variants)
```

The result is a plain data frame with one row per variant per sample, with columns for chromosome, position, alleles, consequence, gene, and sample name. Because it is a standard data frame, you can filter, subset, and manipulate it with any R tools you already know.

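
As a minimal sketch of that kind of manipulation, the block below filters and tabulates a toy data frame with base R. The column names (`chrom`, `pos`, `ref`, `alt`, `consequence`, `gene`, `sample`) are assumptions based on the column description above, and every value is invented for illustration rather than taken from the bundled example file.

```r
# Toy stand-in for a variant data frame; all values are made up for illustration.
toy <- data.frame(
  chrom       = c("chr17", "chr17", "chr12", "chr17"),
  pos         = c(7674220, 7675088, 25245350, 7673803),
  ref         = c("C", "C", "C", "G"),
  alt         = c("T", "T", "A", "A"),
  consequence = c("missense_variant", "missense_variant",
                  "missense_variant", "synonymous_variant"),
  gene        = c("TP53", "TP53", "KRAS", "TP53"),
  sample      = c("TUMOR_S1", "TUMOR_S2", "TUMOR_S1", "TUMOR_S2"),
  stringsAsFactors = FALSE
)

# Keep only the TP53 variants that are not synonymous
tp53_coding <- toy[toy$gene == "TP53" &
                     toy$consequence != "synonymous_variant", ]

# Count the remaining variants per sample
per_sample <- table(tp53_coding$sample)
print(per_sample)
```

The same pattern should carry over to the output of `read_vcf()`, provided its column names match the ones assumed here.
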

### Option 2: From a data frame or Excel export

If your variants are in a spreadsheet or the output of another tool, use `coerce_variants()` to map your column names onto the format `ggvariant` expects. You only need to specify the columns that differ from the defaults.

```{r coerce, eval = FALSE}
# Example: data exported from a custom pipeline or Excel
my_df <- read.csv("my_variants.csv")

variants <- coerce_variants(my_df,
  chrom       = "Chr",
  pos         = "Position",
  ref         = "Ref_Allele",
  alt         = "Alt_Allele",
  consequence = "Variant_Class",
  gene        = "Hugo_Symbol",
  sample      = "Tumor_Sample"
)
```

Any extra columns in your data frame are carried over automatically, so you never lose information.

---

## Lollipop plot

The lollipop plot shows where variants fall along a gene, coloured by consequence. It is particularly useful for identifying mutational hotspots: positions that are recurrently mutated across samples.

```{r lollipop-basic}
plot_lollipop(variants, gene = "TP53")
```

### Adding protein domain annotations

Overlaying known protein domains helps interpret *where* variants fall functionally.
Provide a data frame with `name`, `start`, and `end` columns (in amino acid coordinates):

```{r lollipop-domains}
tp53_domains <- data.frame(
  name  = c("Transactivation", "DNA-binding", "Tetramerization"),
  start = c(1, 102, 323),
  end   = c(67, 292, 356)
)

# Linearly rescale genomic positions onto protein coordinates 1..393.
# This is a rough approximation for illustration only.
tp53 <- variants[variants$gene == "TP53", ]
tp53$pos <- round(
  (tp53$pos - min(tp53$pos)) / (max(tp53$pos) - min(tp53$pos)) * (393 - 1)
) + 1

plot_lollipop(tp53, gene = "TP53", domains = tp53_domains, protein_length = 393)
```

### Colouring by sample

To see which sample each variant comes from instead of its consequence, change `color_by`:

```{r lollipop-sample}
plot_lollipop(variants, gene = "TP53", color_by = "sample")
```

### Customising further

Because every `ggvariant` function returns a standard `ggplot` object, you can add any `ggplot2` layers on top:

```{r lollipop-custom}
library(ggplot2)

plot_lollipop(variants, gene = "KRAS") +
  labs(subtitle = "KRAS mutations across TUMOR_S1 and TUMOR_S2") +
  theme(legend.position = "bottom")
```

---

## Consequence summary

`plot_consequence_summary()` gives an overview of what *types* of variants are present (missense, frameshift, synonymous, and so on), broken down by sample or gene.

### By sample

```{r consequence-sample}
plot_consequence_summary(variants)
```

Each bar represents one sample, stacked by consequence type. This immediately reveals whether two samples have similar or very different mutational profiles.

### Proportional view

To compare samples with different total variant counts fairly, use `position = "fill"`:

```{r consequence-fill}
plot_consequence_summary(variants, position = "fill")
```

### By gene

To see which genes carry the most variants and what types they are:

```{r consequence-gene}
plot_consequence_summary(variants, group_by = "gene", top_n = 7)
```

TP53 stands out immediately as the most mutated gene, a pattern typical of many cancer cohorts.

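
The tallies behind these bar charts can be approximated with base R, which may help when checking a plot against the raw data. The block below is a sketch on an invented toy data frame with assumed column names (`sample`, `gene`, `consequence`); it is not a description of how `plot_consequence_summary()` is implemented.

```r
# Toy data; sample names, genes, and consequences are made up for illustration.
toy <- data.frame(
  sample      = c("TUMOR_S1", "TUMOR_S1", "TUMOR_S1", "TUMOR_S2"),
  gene        = c("TP53", "TP53", "KRAS", "TP53"),
  consequence = c("missense_variant", "frameshift_variant",
                  "missense_variant", "missense_variant"),
  stringsAsFactors = FALSE
)

# Stacked view: raw counts of each consequence type per sample
counts <- table(toy$sample, toy$consequence)

# Proportional view: rescale each sample (row) so its counts sum to 1
props <- prop.table(counts, margin = 1)

# Per-gene view: rank genes by total variant count and keep the top 2
gene_totals <- sort(table(toy$gene), decreasing = TRUE)
top_genes <- names(gene_totals)[seq_len(min(2, length(gene_totals)))]

print(counts)
print(round(props, 2))
print(top_genes)
```

Here `prop.table(counts, margin = 1)` rescales each sample's row to sum to 1, which is the quantity a filled bar in ggplot2 displays.
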

---

## Mutational spectrum

The mutational spectrum shows the relative frequency of each of the six single-base substitution (SBS) classes (C>A, C>G, C>T, T>A, T>C, T>G), normalised to the pyrimidine base (so A>G is represented as T>C, matching COSMIC convention).

```{r spectrum}
plot_variant_spectrum(variants)
```

A dominant C>T signature, as seen here, is characteristic of UV damage or age-related deamination, both common in many tumour types.

### Faceted by sample

To compare mutational processes between samples side by side:

```{r spectrum-facet}
plot_variant_spectrum(variants, facet_by_sample = TRUE)
```

---

## Interactive plots

All plot functions support `interactive = TRUE`, which wraps the output in a `plotly` interactive plot. This is ideal for sharing with collaborators who don't use R: save the plot as an HTML file and send it.

```{r interactive, eval = FALSE}
# Requires the plotly package
# install.packages("plotly")

p <- plot_lollipop(variants, gene = "TP53", interactive = TRUE)
p  # opens in RStudio viewer or browser

# Save as a standalone HTML file
htmlwidgets::saveWidget(p, "TP53_lollipop.html")
```

---

## Colour palettes and theming

### Access the built-in palettes

```{r palette}
# See the consequence colour palette
gv_palette("consequence")

# See the COSMIC SBS spectrum palette
gv_palette("spectrum")
```

### Apply the theme to your own plots

`theme_ggvariant()` is exported so you can apply the same clean look to any ggplot2 figure in your analysis:

```{r theme, eval = FALSE}
ggplot(my_data, aes(x, y)) +
  geom_point() +
  theme_ggvariant()
```

---

## Summary

| Function | Input | Output |
|---|---|---|
| `read_vcf()` | VCF file path | `gvf` data frame |
| `coerce_variants()` | Any data frame | `gvf` data frame |
| `plot_lollipop()` | `gvf` + gene name | Lollipop `ggplot` |
| `plot_consequence_summary()` | `gvf` | Stacked bar `ggplot` |
| `plot_variant_spectrum()` | `gvf` | SBS spectrum `ggplot` |
| `gv_palette()` | palette type | Named colour vector |
| `theme_ggvariant()` | (none) | `ggplot2` theme |

All plot functions return a `ggplot` object, so you can extend them freely with standard `ggplot2` syntax, and you can pass `interactive = TRUE` to any of them to get a `plotly` interactive version.

---

## Session information

```{r session}
sessionInfo()
```