Mercurial > repos > iuc > omark

COMPLETENESS ASSESSMENT
------------
#This benchmark gives an estimate of the completeness of the gene set based on the presence or not of conserved genes of the target lineage.
#Conserved genes are defined using Hierarchical Orthologous Groups (HOGs) defined at a certain taxonomic clade, which is a proxy for the ancestral gene repertoire of this clade. HOGs are considered conserved if they have at least one gene in >80% of the extant species.
#Because representatives of these groups are expected to be present in the target species repertoire, the proportion of missing HOGs proxies the proportion of missing genes in the total gene repertoire of the target proteome.
#Ancestral genes used for this benchmark were in single copy in the selected ancestral lineage, but no assumption is made regarding their propensity to duplicate - they are not universal single copy genes. This benchmark reports the proportion of those genes that are found in multiple copies in target proteomes, and whether it corresponds to a known duplication event in descendants of this gene family (Expected) or not (Unexpected).

The clade used was: Hominidae
Number of conserved HOGs: 17786

#Results on conserved HOGs:
Single: 39 (0.22%)
Duplicated: 0 (0.00%)
Duplicated, Unexpected: 0 (0.00%)
Duplicated, Expected: 0 (0.00%)
Missing: 17747 (99.78%)


CONSISTENCY ASSESSMENT
-------------------------
#This benchmark gives the proportion of annotated protein-coding genes in the query proteome that likely correspond to an actual protein-coding gene by comparing to the known gene families of the selected ancestral lineage.

##High-level categories
#Genes in the "Consistent" category correspond to a gene family known to exist in the selected lineage. Genes in the "Inconsistent" or “Contaminants'' categories correspond to known gene families from different lineages. Such genes are deemed contaminants if more genes than expected by chance correspond to the same species. They are deemed Inconsistent if they correspond to other species seemingly at random. Genes are classified in the “Unknown” category if they do not share enough similarity with known gene families: they may be orphan genes or erroneous protein sequences.

##Subcategories
#Partial hit proteins are those that share similarity with proteins in known gene families on only part of their sequence: they can indicate poorly defined gene models, structurally divergent genes, or erroneous annotation.
#Fragmented proteins are those whose length is smaller than the proteins from the gene families they share similarity with (<50% median length): they are likely fragmentend sequences or erroneous annotations.

Number of proteins in the whole proteome: 44

#Consistent lineage placements
Total Consistent: 43 (97.73%)
Consistent, partial hits: 0 (0.00%)
Consistent, fragmented: 0 (0.00%)

#Inconsistent lineage placements
Total Inconsistent: 1 (2.27%)
Inconsistent, partial hits: 0 (0.00%)
Inconsistent, fragmented: 0 (0.00%)

#Contaminants
Total Contaminants: 0 (0.00%)
Contaminants, partial hits: 0 (0.00%)
Contaminants, fragmented: 0 (0.00%)

#Unknown
Total Unknown: 0 (0.00%)


SPECIES COMPOSITION
-------------------
#This benchmark gives an estimate of the species composition of the dataset, according to HOGs placement. It reports the clades most consistent with the taxonomic distribution of gene families where coding-genes for the query proteomes were placed. The species to which most of the proteins in the query proteome are consistent with is called "Main species." The others are potential contaminants.
#This section also lists the numbers of proteins that can be associated to each of these clades, based on the taxonomic placement of the gene families they share similarity with.


##Detected species

#Main species
Clade: Homo sapiens
Number of associated query proteins: 44 (100.00%)
author	iuc
date	Tue, 12 Mar 2024 08:33:56 +0000
parents	ce13b4c42256
children