Science Reviews - Biology, 2024, 3(2), 13-21 Martina Elena Tarozzi
The accurate discovery of single nucleotide
variants from billions of short reads remains a challenging
step in bioinformatics because library preparation, sequencing
and data processing are all error-prone procedures. These issues
become even more apparent when the objects of study are
low-frequency somatic mutations or when the input DNA is of
lower quality. Most variant callers use statistical
methods (such as logistic regression, hidden
Markov models, naïve Bayes) to model error
sources and to distinguish whether differences be-
tween experimental reads and the reference ge-
nome are caused by true genetic variants or errors.
In recent years, deep learning has been applied to
variant calling on NGS data: a common approach is to
frame the problem as one of image recognition, in which
a deep neural network, trained on read-pileup images of
true genotype calls, analyzes sequencing data rendered as
pileup images to compute the genotype likelihoods at each
locus. Two of
the first and most popular tools of this kind are
DeepVariant (42) and DeNovoCNN (43), with the
latter specifically used to address the identification
of de novo mutations. Both tools showed higher accuracy
than classical methods. An alternative approach is presented
in HELLO (44), which achieves comparable performance with
deep neural networks that examine aligned reads to
predict the status (ref or alt) of each candidate allele,
given the support for that allele relative to the support
for the remaining alleles at the genomic site.
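As an illustration of the statistical approach described above, the following minimal sketch computes diploid genotype likelihoods at a single biallelic site from the counts of reads supporting the reference and the alternate allele. It assumes independent reads and one uniform, symmetric per-base error rate. The function names, the error rate and the read counts are illustrative assumptions and do not reproduce the model of any specific variant caller.

```python
import math

def genotype_log_likelihoods(n_ref, n_alt, error_rate=0.01):
    """Log-likelihood of the observed read counts under each diploid genotype.

    Simplifying assumptions: reads are independent and every base has the
    same probability (error_rate) of being read as the other allele.
    """
    # Probability that a single read shows the alt allele under each genotype.
    p_alt = {
        "0/0": error_rate,        # alt reads arise only from errors
        "0/1": 0.5,               # both alleles are sampled equally often
        "1/1": 1.0 - error_rate,  # ref reads arise only from errors
    }
    # Binomial coefficient, shared by all genotypes.
    log_binom = math.log(math.comb(n_ref + n_alt, n_alt))
    return {
        gt: log_binom + n_alt * math.log(p) + n_ref * math.log(1.0 - p)
        for gt, p in p_alt.items()
    }

def call_genotype(n_ref, n_alt, error_rate=0.01):
    """Return the genotype with the highest likelihood."""
    lls = genotype_log_likelihoods(n_ref, n_alt, error_rate)
    return max(lls, key=lls.get)
```

With these assumptions, call_genotype(14, 16) favours the heterozygous genotype, whereas a variant supported by only 5 of 100 reads is still assigned 0/0. This hints at why low-frequency somatic mutations require dedicated models rather than a plain diploid one.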
Tertiary bioinformatic analysis: Variant effect prediction
Variant effect predictors (VEPs) are computational
tools that predict the functional
significance of a single nucleotide variant
(SNV). The growing use of NGS technologies for
advanced diagnostics has increased the need to bet-
ter classify variants of uncertain significance. VEPs
rely on different types of prior knowledge, such as
protein sequence and structural information, evolu-
tionary sequence conservation, functional experi-
ments, epigenomic data and association studies to
produce an effect score for the variant. In supervised
VEPs, the algorithm is trained on a set of SNVs labelled as
benign or damaging according to previous knowledge, and
learns to perform a classification task. Using this prior knowledge, these
methods compute a score expressing the predicted
effect of the variant. Examples of well-performing
supervised VEPs on human samples are SNPs&GO
(45), PolyPhen2 (46) and DEOGEN2 (47). Unsuper-
vised methods do not use any labelled data and
usually rely exclusively on the evolutionary conser-
vation of the genomic locus. This group also in-
cludes deep learning methods, like DeepSequence
(48), ranked by a recent benchmark study as the
top-performing tool among 46 tested on deep mutational
scanning data (49). An example of a semi-supervised
deep learning method is Illumina's PrimateAI (50),
which has performed well in the study
of rare diseases.
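The reliance of unsupervised predictors on evolutionary conservation can be illustrated with a very simple proxy: the Shannon entropy of one column of a multiple sequence alignment, rescaled so that 1 means fully conserved. The sketch below is only a toy version of the idea. The function name, the normalization by a 20-letter amino acid alphabet and the toy alignment are assumptions made for illustration, and the tools cited above use far richer models.

```python
import math
from collections import Counter

def column_conservation(alignment, position, alphabet_size=20):
    """Conservation in [0, 1] of one alignment column (1 = fully conserved).

    Gaps ("-") are ignored; a column containing only gaps scores 0.
    """
    residues = [seq[position] for seq in alignment if seq[position] != "-"]
    if not residues:
        return 0.0
    total = len(residues)
    # Shannon entropy (in bits) of the residue frequencies in this column.
    entropy = -sum(
        (count / total) * math.log2(count / total)
        for count in Counter(residues).values()
    )
    return 1.0 - entropy / math.log2(alphabet_size)

# Hypothetical toy alignment of homologous protein fragments.
toy_alignment = ["MKTAY", "MKSAY", "MKTGY", "MRTVY"]
scores = [column_conservation(toy_alignment, i) for i in range(5)]
```

In this toy example the first and last columns are invariant and receive the maximum score, while the fourth column, which varies most, receives the lowest. Under the conservation assumption, a substitution at a high-scoring position would be considered more likely to be damaging.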
Visualization of high dimensional datasets
NGS data are high-dimensional because a very large number
of features (e.g., genes or genomic positions) is measured
simultaneously in each sample. The huge amount of
information contained in these data can be an obstacle
to the identification of their most
meaningful features. Dimensionality reduction
techniques such as PCA, t-SNE and UMAP are used
to identify latent components in the data that are
not easily accessible due to the high number of var-
iables. Data are thus transformed into a lower di-
mensionality while maintaining the relationships
between data points (e.g., samples) as much as pos-
sible. These methods are extremely versatile. For ex-
ample, they can be used in the pre-processing of
bulk RNA-seq data to identify possible outliers and
relevant covariates (51), to search for recurrent patterns
in targeted DNA sequencing data across different
classes of samples (52), or to visualize single-cell
RNA sequencing data. In this context, dimensional-
ity reduction techniques coupled with clustering al-
gorithms are used for cell-type identification tasks,
identifying groups of cells that share similar expres-
sion profiles. Another application of these methods
is lineage trajectory inference (53), which involves
reconstructing the position of each individual
cell on the lineage trajectory from scRNA-seq
profiles collected at different time points, allowing for the
study of dynamic processes such as the cell cycle,
cell differentiation and cell activation.
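The dimensionality reduction idea discussed in this section can be sketched with PCA, the simplest of the three techniques mentioned. The example below finds the first principal component of a small samples-by-features matrix by power iteration on the covariance matrix, using only the Python standard library so that it is self-contained. The function name, the iteration count and the toy expression values are illustrative assumptions; in practice an established library implementation would be used.

```python
import math
import random

def first_principal_component(rows, n_iter=200, seed=0):
    """Return (component, scores) for the leading PCA axis.

    rows: list of samples, each a list of feature values.
    component: unit vector of feature loadings.
    scores: projection of each mean-centered sample onto the component.
    """
    n, d = len(rows), len(rows[0])
    # Center every feature on its mean.
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (features x features).
    cov = [
        [sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
        for i in range(d)
    ]
    # Power iteration: repeated multiplication converges to the
    # eigenvector with the largest eigenvalue.
    rng = random.Random(seed)
    v = [rng.random() for _ in range(d)]
    for _ in range(n_iter):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    scores = [sum(c[j] * v[j] for j in range(d)) for c in centered]
    return v, scores

# Hypothetical expression values: 6 samples x 4 genes, two sample groups.
samples = [
    [10.0, 9.5, 1.0, 1.2],
    [9.8, 10.1, 1.1, 0.9],
    [10.2, 9.9, 0.8, 1.0],
    [1.0, 1.3, 10.0, 9.7],
    [0.9, 1.1, 9.9, 10.2],
    [1.2, 0.8, 10.1, 9.9],
]
component, scores = first_principal_component(samples)
```

Here the four-dimensional samples are reduced to a single coordinate each, and the sign of that coordinate separates the two groups of samples. The same principle underlies the use of these techniques for outlier detection and for cell-type identification when combined with clustering.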
6. Future Perspectives
In this review, we summarized the crucial as-
pects and timeline of NGS technologies, bioinfor-
matics and AI, highlighted how they are connected
in a holistic process, and explained the potential
revolutionary insights that can be gained from their