Genomic background sequences systematically outperform synthetic ones in de novo motif discovery for ChIP-seq data

Standard

Genomic background sequences systematically outperform synthetic ones in de novo motif discovery for ChIP-seq data. / Raditsa, Vladimir V.; Tsukanov, Anton V.; Bogomolov, Anton G. et al.

In: NAR Genomics and Bioinformatics, Vol. 6, No. 3, lqae090, 01.09.2024.

Research output: Contribution to journal › Article › peer-review

BibTeX

@article{e42ae456d0a8477286882a75f0dc52b4,

title = "Genomic background sequences systematically outperform synthetic ones in de novo motif discovery for ChIP-seq data",

abstract = "Efficient de novo motif discovery from the results of genome-wide mapping of transcription factor binding sites (ChIP-seq) depends on the choice of background nucleotide sequences. The foreground sequences (ChIP-seq peaks) contain not only specific motifs of target transcription factors, but also motifs overrepresented throughout the genome, such as simple sequence repeats. We performed a massive comparison of the 'synthetic' and 'genomic' approaches to generating background sequences for de novo motif discovery. The 'synthetic' approach shuffled nucleotides in peaks, while the 'genomic' approach selected sequences from the reference genome, either randomly or only from gene promoters, according to the fraction of A/T nucleotides in each sequence. We compiled benchmark collections of ChIP-seq datasets for mouse, human and Arabidopsis, and performed de novo motif discovery. We showed that the genomic approach provides both more robust detection of the known motifs of target transcription factors and more stringent exclusion of simple sequence repeats as possible non-specific motifs. The advantage of the genomic approach over the synthetic approach was greater in plants than in mammals. We developed the AntiNoise web service (https://denovosea.icgbio.ru/antinoise/), which implements a genomic approach to extract genomic background sequences for twelve eukaryotic genomes.",

author = "Raditsa, {Vladimir V.} and Tsukanov, {Anton V.} and Bogomolov, {Anton G.} and Levitsky, {Victor G.}",

note = "The bioinformatics data analysis was performed in part on the equipment of the Bioinformatics Shared Access Center within the framework of State Assignment Kurchatov Genomic Center of ICG SB RAS [FWNR-2022-0020]. Author contributions: V.V.R.: Formal analysis, Visualization. A.V.T.: Formal analysis. A.G.B.: Software, Writing-review \& editing. V.G.L.: Conceptualization, Investigation, Methodology, Validation, Software, Supervision, Visualization, Writing-original draft, Writing-review \& editing. Russian Science Foundation project",

year = "2024",

month = sep,

day = "1",

doi = "10.1093/nargab/lqae090",

language = "English",

volume = "6",

journal = "NAR Genomics and Bioinformatics",

issn = "2631-9268",

publisher = "Oxford University Press",

number = "3",

}

RIS

TY - JOUR

T1 - Genomic background sequences systematically outperform synthetic ones in de novo motif discovery for ChIP-seq data

AU - Raditsa, Vladimir V.

AU - Tsukanov, Anton V.

AU - Bogomolov, Anton G.

AU - Levitsky, Victor G.

N1 - The bioinformatics data analysis was performed in part on the equipment of the Bioinformatics Shared Access Center within the framework of State Assignment Kurchatov Genomic Center of ICG SB RAS [FWNR-2022-0020]. Author contributions: V.V.R.: Formal analysis, Visualization. A.V.T.: Formal analysis. A.G.B.: Software, Writing-review & editing. V.G.L.: Conceptualization, Investigation, Methodology, Validation, Software, Supervision, Visualization, Writing-original draft, Writing-review & editing. Russian Science Foundation project

PY - 2024/9/1

Y1 - 2024/9/1

N2 - Efficient de novo motif discovery from the results of genome-wide mapping of transcription factor binding sites (ChIP-seq) depends on the choice of background nucleotide sequences. The foreground sequences (ChIP-seq peaks) contain not only specific motifs of target transcription factors, but also motifs overrepresented throughout the genome, such as simple sequence repeats. We performed a massive comparison of the 'synthetic' and 'genomic' approaches to generating background sequences for de novo motif discovery. The 'synthetic' approach shuffled nucleotides in peaks, while the 'genomic' approach selected sequences from the reference genome, either randomly or only from gene promoters, according to the fraction of A/T nucleotides in each sequence. We compiled benchmark collections of ChIP-seq datasets for mouse, human and Arabidopsis, and performed de novo motif discovery. We showed that the genomic approach provides both more robust detection of the known motifs of target transcription factors and more stringent exclusion of simple sequence repeats as possible non-specific motifs. The advantage of the genomic approach over the synthetic approach was greater in plants than in mammals. We developed the AntiNoise web service (https://denovosea.icgbio.ru/antinoise/), which implements a genomic approach to extract genomic background sequences for twelve eukaryotic genomes.

UR - https://www.scopus.com/record/display.uri?eid=2-s2.0-85199899892&origin=inward&txGid=59a2b34d07a8c52767d6bcba4a4b940e

UR - https://www.mendeley.com/catalogue/ea26f521-eb3b-3651-bcc0-7266c319c3ec/

U2 - 10.1093/nargab/lqae090

DO - 10.1093/nargab/lqae090

M3 - Article

C2 - 39071850

VL - 6

JO - NAR Genomics and Bioinformatics

JF - NAR Genomics and Bioinformatics

SN - 2631-9268

IS - 3

M1 - lqae090

ER -

ID: 60848847