Case Study: Fruit Fly (Drosophila melanogaster)

The numbers of dngs detected by the three selected studies on fly dngs (see Fig. 2) vary greatly. This discrepancy most likely stems from the different methodologies, databases and thresholds used. To better assess the similarities and differences, we annotated the studies with our DENOFO toolkit and applied the pairwise comparator to each pair of studies (see Tab. 2).

Figure 2: Number of dngs detected in Drosophila melanogaster by three different studies (Roginski et al. 2024; Zheng & Zhao 2022; Grandchamp et al. 2023).

Table 2: Similarities and differences between studies on fly dngs

| Feature | Roginski et al. (2024) | Zheng & Zhao (2022) | Grandchamp et al. (2023) |
|---|---|---|---|
| Input | genome | transcriptome | transcriptome |
| ORF location | | intergenic, antisense, intronic, overlapping | intergenic, antisense, intronic, overlapping |
| Specificity | lineage | lineage | species |
| Homology filter | protein sequences | protein sequences | protein sequences |
| E-value | 0.0001 | | 0.1 |
| Database | NCBI nr | custom | custom |
| Synteny | gene anchors | | gene anchors |
| Translational evidence | | ribosome profiling, MS | |
| Selection | codeml | | dN/dS |
| Enabling mutations | | | yes |

Roginski et al. (2024) and Grandchamp et al. (2023), the two studies with the biggest difference in the number of detected dngs in flies, share only two methodological similarities across all reported standardised features: the use of protein sequences for homology filtering and the use of gene anchors for synteny detection (see Fly_similarities_Roginski-2024_vs_Grandchamp-2023.txt in Supplementary Output Files below, representing the output of the denofo-comparator). The two studies that are closest in the number of detected dngs, Zheng & Zhao (2022) and Grandchamp et al. (2023), overlap in the type of input data (transcriptome), in which transcripts are considered (intergenic, antisense, intronic, overlapping) and in the use of protein sequences for homology filtering. Hence, the discrepancy in the number of detected dngs must stem from differences in the data analysis, such as the use of ribosome profiling and mass spectrometry (MS) for translational evidence in Zheng & Zhao (2022).
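The pairwise comparison described above can be illustrated with a minimal Python sketch. It is not the denofo-comparator implementation; it only shows the idea of comparing standardised features. The dictionary layout and the function names `shared_features` and `differing_features` are assumptions made for this example, and the feature values are transcribed from Table 2, with `None` standing for a feature that a study does not report.

```python
from itertools import combinations

# Feature values transcribed from Table 2 (None = not reported in the table).
# The data layout is an assumption for illustration, not the dngf format.
ANNOTATIONS = {
    "Roginski et al. (2024)": {
        "Input": "genome",
        "ORF location": None,
        "Specificity": "lineage",
        "Homology filter": "protein sequences",
        "E-value": "0.0001",
        "Database": "NCBI nr",
        "Synteny": "gene anchors",
        "Translational evidence": None,
        "Selection": "codeml",
        "Enabling mutations": None,
    },
    "Zheng & Zhao (2022)": {
        "Input": "transcriptome",
        "ORF location": "intergenic, antisense, intronic, overlapping",
        "Specificity": "lineage",
        "Homology filter": "protein sequences",
        "E-value": None,
        "Database": "custom",
        "Synteny": None,
        "Translational evidence": "ribosome profiling, MS",
        "Selection": None,
        "Enabling mutations": None,
    },
    "Grandchamp et al. (2023)": {
        "Input": "transcriptome",
        "ORF location": "intergenic, antisense, intronic, overlapping",
        "Specificity": "species",
        "Homology filter": "protein sequences",
        "E-value": "0.1",
        "Database": "custom",
        "Synteny": "gene anchors",
        "Translational evidence": None,
        "Selection": "dN/dS",
        "Enabling mutations": "yes",
    },
}


def shared_features(a, b):
    """Features that both studies report with the same value."""
    return {k: v for k, v in a.items() if v is not None and b.get(k) == v}


def differing_features(a, b):
    """Features whose values differ, or that only one study reports."""
    return {k: (a[k], b.get(k)) for k in a if a[k] != b.get(k)}


# Compare every pair of studies, as the pairwise comparator does conceptually.
for (name_a, a), (name_b, b) in combinations(ANNOTATIONS.items(), 2):
    print(f"{name_a} vs {name_b}")
    print("  shared:   ", sorted(shared_features(a, b)))
    print("  differing:", sorted(differing_features(a, b)))
```

Note that a purely label-based comparison such as this one treats two "custom" databases as identical, although the underlying databases may well differ between studies; the table entries, not this sketch, remain the reference.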

It is both difficult and time-consuming to manually compare and assess the datasets and methodologies of published data on detected dngs. The standardised annotation with the DENOFO toolkit provides a convenient overview of the overlaps between datasets. Beyond revealing such surprising similarities in methodology, DENOFO also allows the differences between studies to be compared, which can help us identify why studies using the same input data come to different conclusions. To analyse the differences in more detail, we focus here on the two studies that are farthest apart in the number of detected dngs, Roginski et al. (2024) and Grandchamp et al. (2023) (see Fly_differences_Roginski-2024_vs_Grandchamp-2023.txt in Supplementary Output Files below, representing the output of the denofo-comparator).

We identify differences in the input data, which is a transcriptome in Grandchamp et al. (2023) but an annotated genome in Roginski et al. (2024). While Roginski et al. identify lineage-specific dngs, those of Grandchamp et al. are strictly species-specific. This alone can already explain a large difference in the number of detected dngs. Selected thresholds are also easy to identify here as a source of differences: Roginski et al. chose a much stricter e-value threshold for homology filtering (0.0001) than Grandchamp et al. (0.1). A stricter e-value threshold results in fewer candidates being filtered out, which is at odds with the much lower number of dngs identified by Roginski et al. Homology filtering was based on a custom database of Drosophila and Dipteran proteomes in Grandchamp et al., while Roginski et al. used the NCBI nr database. The database of more closely related species in Grandchamp et al. may have led to the higher number of dngs, as candidates that were retained there could have been filtered out in Roginski et al. due to matches in more distantly related species.

Apart from revealing methodological differences that lead to discrepancies in the numbers of identified dngs, the annotation and comparison through DENOFO allows the extraction of further information contained in the studies that might be relevant for readers. For example, the differences in the denofo-comparator output show that Grandchamp et al. report information about enabling mutations, which is missing in Roginski et al. Both studies report evolutionary information on selection, but in different forms: Roginski et al. studied selection with codeml, while Grandchamp et al. report selection information based on dN/dS values.
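The effect of the e-value threshold discussed above can be made concrete with a small sketch. The candidate names and best-hit e-values below are hypothetical numbers chosen for illustration, not data from either study, and the rule "discard a candidate if its best hit has an e-value at or below the threshold" is an assumption about how such a homology filter is typically applied.

```python
# Hypothetical best-hit e-values for five candidate ORFs (illustrative only).
# None means that the homology search returned no hit at all.
BEST_HIT_EVALUE = {
    "orf1": 1e-30,
    "orf2": 5e-4,
    "orf3": 0.05,
    "orf4": 2.0,
    "orf5": None,
}


def retained_candidates(best_hits, threshold):
    """Keep candidates without a hit at or below the e-value threshold.

    Assumed rule: a hit with e-value <= threshold counts as detectable
    homology, so the candidate is filtered out as not de novo.
    """
    return sorted(
        name
        for name, evalue in best_hits.items()
        if evalue is None or evalue > threshold
    )


strict = retained_candidates(BEST_HIT_EVALUE, 0.0001)  # threshold of Roginski et al.
lenient = retained_candidates(BEST_HIT_EVALUE, 0.1)    # threshold of Grandchamp et al.

print("retained at 0.0001:", strict)
print("retained at 0.1:   ", lenient)
```

On the same set of candidates and the same database, the stricter threshold keeps at least as many candidates as the lenient one. A lower number of dngs under the stricter threshold therefore points to other factors, such as the input data, the specificity level or the database searched.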

Conclusions

The numbers of detected dngs differ by orders of magnitude between the selected studies on both human and fly dngs. This large discrepancy might stem from the different methodologies, databases and thresholds applied. Yet it can be difficult to assess which exact methods were applied and where they differ. With the information provided by the standardised annotation format, processed for easy comparison by the DENOFO tools, we can gain insights into the specific differences and similarities between these studies. Moreover, the more studies are annotated in this way, the more we can learn about the impact of specific methodological differences, because the studies can then be compared and analysed with feasible effort.

Supplementary Output Files (Fruit Fly)

Study annotation files in dngf format:

Pairwise similarities and differences produced by the DENOFO comparator are available as text files:

References

  • Roginski P, Grandchamp A, Quignot C, Lopes A. De Novo Emerged Gene Search in Eukaryotes with DENSE. Genome Biology and Evolution. 2024 Aug;16(8):evae159.

  • Grandchamp A, Kühl L, Lebherz M, Brüggemann K, Parsch J, Bornberg-Bauer E. Population genomics reveals mechanisms and dynamics of de novo expressed open reading frame emergence in Drosophila melanogaster. Genome Research. 2023 Jun 1;33(6):872–90.

  • Zheng EB, Zhao L. Protein evidence of unannotated ORFs in Drosophila reveals diversity in the evolution and properties of young proteins. eLife. 2022 Sep 30;11:e78772.