A variation graph represents the mutual alignment of a set of sequences in a compact, visually-accessible structure. They are used in diverse applications in genome sequence analysis, comparative genomics, and linguistics. A growing community of researchers explores their applications to pangenomics and high-throughput genomics.
The nodes of these labeled graphs represent fragments of consecutive sequences (perhaps DNA, protein, text, or locations). Their edges describe allowed or observed transitions between nodes found in actual sequences of interest. Paths record walks of sequences through the nodes of the graph.
The promise of differential privacy in pangenomics
Human genomes are both useful and difficult to utilize. Knowing what sequences (alleles or haplotypes) exist in the population can help us observe and understand new genomes that we assemble. This is critically useful in fields like medicine, population genetics, and agriculture. However, it can be difficult to share genomic information due to issues of privacy.
Researchers have deeply explored techniques to produce differentially-private summarizations of some genomic information, such as SNPs and other small variants, but thus far no methods approach the question of differential privacy of arbitrary genome sequence and assemblies. As variation graphs are becoming a standard for the representation of pangenomes, which new sequencing technology is allowing us to rapidly accumulate for humans and other species, they are an ideal target for the application of privacy-preserving data release.
Differentially-private mobility graphs
Location data can be represented in a variation graph by encoding quantized location/times in nodes, and expressing individual movements as paths through these nodes. Just as a differentially-private variation graph can help us understand common haplotypes tracing through the graph, a differentially-private model of collective mobility information can help us to understand common flows of people through space and time.
These flows are traditionally difficult for researchers and public planners to access and share, due to the risk of identification and tracking of individiuals. However, in light of the COVID-19 pandemic we see critical importance in making them available. Tools in privvg will support application to mobility data, but work remains to be done to transform the data into formats directly compatible with those used in pangenomics.
We envision a future in which individuals who assemble their genomes for personal or medical reasons could release (via a trusted third party) useful information from their own genome by mixing their genome and others in a privacy-preserving pangenomic variation graph.
As a synthetic database, this differentially-private variation graph could be used directly as a reference system by tools that work on pangenomes. This trusted third-party arrangement matches contemporary practice in genomics. Our implementation simply promises the release of larger-scale information about haplotypic structures and allele frequencies that will become available with ubiquitous genome assembly.
The same approach can be applied to mobility data, textual data, and other highly-repetitive collections of sequential data.
Compact, distributable data models
Methods developed for variation graphs allow the efficient compression of collections of genome and redundant sequence data. A data structure (using the graph Burrows-Wheeler transform or GBWT) built from the 1000 Genomes Project sequences can store genomes using only 1 bit per 1000 base-pair of genome sequence. These approaches are not only effcient, but they are self indexes that support efficient queries to find and match sequences.
Allowing for the redistribution of succinct data structures representing collections of genomes avoids the need to control data access or operate centralized servers. If they are compact, then researchers will only need minimal resources to use them.