COXPRESdb

ver. 8.1

Last update: Aug. 24, 2018

Supportability

(1) Purpose

Although the strength of coexpression is represented by the MR (Mutual Rank) value, an observed coexpression may be an artifact of a particular platform. A measure called supportability is therefore introduced to quantify the reproducibility of a coexpressed gene list of interest.


History on COXPRESdb

Supportability was previously called reliability. With the refinement in 2014, we renamed reliability to supportability for the following two reasons.

  • We are trying to evaluate the quality of coexpression data from various aspects. Reliability is not a suitable name for only one of these measures.
  • Coexpression supported by another platform is reliable, but the converse is not true: coexpression without any support can occur simply because an appropriate reference is not available.

Period            | Term           | Calculation of coex list similarity | Null distribution                                        | Threshold ☆ | ☆☆   | ☆☆☆
2012-08 ~ 2014-08 | Reliability    | COXSIM(100)                         | Common to all platforms, virtually including 10000 genes | E-04        | E-12 | E-20
2014-08 ~ now     | Supportability | COXSIM(1%)                          | For each platform                                        | E-04        | E-16 | E-32



(2) Basic idea

When a gene list is repeatedly observed in independent platforms, the coexpressed gene list can be regarded as reliable.

There are two possible ways to compare coexpression for reliability assessment. One is comparison of gene pairs (A), and the other is comparison of gene lists (B).



We employ the B-type (gene list) comparison because pseudo coexpression is mainly caused by inappropriate probes with weak hybridization or cross-hybridization. Such a probe affects not just one gene pair but all gene pairs that involve the problematic guide gene.




(3) Degree of coincidence of two coexpressed gene lists

Basic idea

We introduced a similarity measure, COXSIM, which is the weighted concordance rate between the coexpressed gene list from a guide gene g of interest (listg) and that from a reference guide gene r (listr). COXSIM is a function of guide gene g, reference guide gene r, and threshold k.

Here, n(i, listg, listr) is the number of common genes (orthologous genes when the platforms belong to different species) found in the top i genes of the two coexpressed gene lists.
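The description above can be made concrete with a short Python sketch. The exact weighting is not spelled out in the text here, so the sketch assumes that COXSIM(g, r, k) is the mean over i = 1..k of the concordance rate n(i, listg, listr) / i, which gives more weight to agreement at high ranks. The function names and the toy gene identifiers are illustrative only, and any ortholog mapping is assumed to have been applied already so that both lists share one identifier space.

```python
def n_common(i, list_g, list_r):
    """Number of genes shared by the top-i entries of both ranked lists."""
    return len(set(list_g[:i]) & set(list_r[:i]))


def coxsim(list_g, list_r, k):
    """Assumed form: mean over i = 1..k of n(i, list_g, list_r) / i."""
    if k < 1:
        raise ValueError("k must be at least 1")
    return sum(n_common(i, list_g, list_r) / i for i in range(1, k + 1)) / k


# Toy ranked lists (hypothetical gene IDs), strongest coexpression first.
list_g = ["A", "B", "C", "D"]
list_r = ["A", "C", "B", "E"]

score = coxsim(list_g, list_r, k=4)
print(score)
```

Under this assumed form, identical lists score 1 and lists with no shared genes in the top k score 0.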



Excluding orphan genes for the gene list comparison

Some genes in the platform for listg do not have corresponding genes in the platform for listr. When such genes appear at high ranks in listg, the coincidence of the two lists decreases. To avoid this effect, genes that lack corresponding genes in the reference platform are excluded from listg, leaving listg→r. In the same way, genes in listr that lack corresponding genes in the platform for listg are excluded, resulting in listr→g. We then compare the top k coexpressed genes in listg→r with the reference gene list, listr→g.
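The filtering step can be sketched as follows. This is an illustration under stated assumptions: both lists are full rankings of their platforms, the ortholog map `g_to_r` is a hypothetical plain dictionary with one-to-one correspondence, and all gene names are made up.

```python
def exclude_orphans(list_g, list_r, g_to_r):
    """Drop genes that have no counterpart in the other platform.

    g_to_r maps a gene in the platform of list_g to its counterpart in the
    platform of list_r (assumed one-to-one for this sketch).
    Returns (list_g_to_r, list_r_to_g), both in their original rank order.
    list_g_to_r is expressed in reference-platform identifiers so that the
    two lists can be compared directly.
    """
    r_genes = set(list_r)
    # Keep genes of list_g whose counterpart exists in the reference platform.
    list_g_to_r = [g_to_r[g] for g in list_g
                   if g in g_to_r and g_to_r[g] in r_genes]
    # Keep genes of list_r that are the counterpart of some gene in list_g.
    mapped = {g_to_r[g] for g in list_g if g in g_to_r}
    list_r_to_g = [r for r in list_r if r in mapped]
    return list_g_to_r, list_r_to_g


# Hypothetical example: g2 and r9 are orphans and are removed.
list_g = ["g1", "g2", "g3", "g4"]
list_r = ["r3", "r1", "r9", "r4"]
g_to_r = {"g1": "r1", "g3": "r3", "g4": "r4"}

list_g_to_r, list_r_to_g = exclude_orphans(list_g, list_r, g_to_r)
```

After filtering, both lists contain the same set of genes and differ only in rank order, so a low-ranked match is never caused by a gene that is simply missing from one platform.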


Selection of k

For k, we use 1% of the number of genes in listg→r.

We previously used k = 100, meaning that we checked the gene correspondence of the top 100 coexpressed genes, in accordance with the default display of a coexpressed gene list on COXPRESdb. However, a common threshold for all platforms results in a different stringency for each platform. For example, the Sce platform for S. cerevisiae has 4,461 genes for coexpression analysis, whereas the Hsa platform for human has 19,803 genes. A particular gene is therefore four to five times more likely to fall within the top k by chance in Sce than in Hsa, so a fixed k overestimates the significance of the coincidence of the gene lists for Sce. Therefore, we changed the cutoff from the top k genes to the top 1% of all genes in listg→r.
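The arithmetic behind this argument can be checked with a few lines of Python. The platform sizes are the ones quoted above and stand in for the length of listg→r, which is what k is actually derived from. The helper name `top_fraction_k` and its rounding rule (nearest integer, at least 1) are assumptions made for this sketch.

```python
def top_fraction_k(n_genes, fraction=0.01):
    """k as a fraction of the list length (the rounding rule is an assumption)."""
    return max(1, round(n_genes * fraction))


sce_genes = 4461   # Sce platform size quoted in the text
hsa_genes = 19803  # Hsa platform size quoted in the text

# Chance that one particular gene lands in a random top-100 list.
p_sce = 100 / sce_genes
p_hsa = 100 / hsa_genes
ratio = p_sce / p_hsa  # how much more permissive a fixed k = 100 is for Sce

# With the 1% rule, k scales with the list length instead.
k_sce = top_fraction_k(sce_genes)
k_hsa = top_fraction_k(hsa_genes)

print(f"fixed k=100: Sce {p_sce:.4f}, Hsa {p_hsa:.4f}, ratio {ratio:.2f}")
print(f"1% rule: k_Sce = {k_sce}, k_Hsa = {k_hsa}")
```

With a fixed k of 100 the chance inclusion rate differs by a factor of about 4.4 between the two platforms, whereas with the 1% rule it is about 0.01 for both.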


(4) Selection of the most appropriate reference guide gene / gene list

Since the best reference guide gene is unknown, we checked all possible reference guide genes. The reference guide gene set R is composed of all available orthologous genes for different species. When multiple platforms are available for the species including the guide gene g, the same gene in the other platforms is also included in the reference guide gene set R. The COXSIM values are calculated between the target guide gene g and every reference gene r in R. The reference gene rmax that gives the maximum COXSIM value is regarded as the best reference guide gene.
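Selecting rmax amounts to taking the maximum over the reference set R. The sketch below assumes the COXSIM values have already been computed and stored in a dictionary keyed by reference gene; the keys and scores shown are hypothetical.

```python
def best_reference(coxsim_by_reference):
    """Return (r_max, maxCOXSIM) from a mapping of reference gene -> COXSIM."""
    if not coxsim_by_reference:
        # No reference gene is available, so nothing can be selected.
        return None, None
    r_max = max(coxsim_by_reference, key=coxsim_by_reference.get)
    return r_max, coxsim_by_reference[r_max]


# Hypothetical COXSIM values for one guide gene g against its reference set R,
# keyed by (platform, gene) pairs.
scores = {
    ("Hsa2", "geneX"): 0.41,      # same gene in another platform of the species
    ("Mmu", "geneX_orth"): 0.57,  # ortholog in another species
    ("Rno", "geneX_orth"): 0.22,
}

r_max, max_coxsim = best_reference(scores)
```

Because the maximum is taken over several candidates, the significance of maxCOXSIM has to be judged against a suitable null distribution rather than read directly from the raw value.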


(5) Calculation of p-value

To assess statistical significance, the maxCOXSIM value is compared with a null distribution generated for the same number of genes as in listg→r.
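How the null distribution is generated is not described in this section. As a rough illustration only, the sketch below estimates a p-value by Monte Carlo: it compares a fixed ranking with random rankings of the same length and counts how often the random score reaches the observed one. The COXSIM form (mean over i = 1..k of the top-i concordance rate), the function names, and all parameter values are assumptions, and the sketch ignores the fact that maxCOXSIM is a maximum over several references.

```python
import random


def coxsim(list_g, list_r, k):
    """Assumed COXSIM form: mean over i = 1..k of (shared genes in top i) / i."""
    return sum(len(set(list_g[:i]) & set(list_r[:i])) / i
               for i in range(1, k + 1)) / k


def empirical_p_value(observed, n_genes, k, n_draws=2000, seed=0):
    """Share of random list pairs of length n_genes scoring >= observed."""
    rng = random.Random(seed)
    genes = list(range(n_genes))
    hits = 0
    for _ in range(n_draws):
        shuffled = genes[:]
        rng.shuffle(shuffled)
        # A fixed ranking against a random one models "no relationship".
        if coxsim(genes, shuffled, k) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_draws + 1)


p_high = empirical_p_value(observed=0.9, n_genes=500, k=5)
p_zero = empirical_p_value(observed=0.0, n_genes=500, k=5)
```

A simulation with a few thousand draws cannot resolve p-values anywhere near thresholds such as 1E-16 or 1E-32, so reaching those levels requires a different way of evaluating the null distribution than the one sketched here.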



(6) Discretization of maxCOXSIM significance

On COXPRESdb, the significance level, which we call the supportability, is shown as a number of stars according to the following p-value thresholds.

p-value threshold | Representation
1E-04             | ☆
1E-16             | ☆☆
1E-32             | ☆☆☆
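The discretization can be written as a small lookup. The sketch assumes that a p-value at or below a threshold earns the corresponding number of stars; whether the comparison is inclusive is not stated in the text, and the function name is illustrative.

```python
def supportability_stars(p_value):
    """Map a p-value to a star string using the thresholds 1E-04, 1E-16, 1E-32."""
    # Most stringent threshold first, so the highest level reached wins.
    thresholds = [(1e-32, "☆☆☆"), (1e-16, "☆☆"), (1e-4, "☆")]
    for cutoff, stars in thresholds:
        if p_value <= cutoff:  # inclusive comparison is an assumption
            return stars
    return ""  # no star


for p in (1e-40, 1e-20, 1e-5, 0.01):
    print(p, supportability_stars(p) or "no star")
```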



(7) Result

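Platform | No star |    ☆ |   ☆☆ |   ☆☆☆ | NoEval | Total | COXPRESdb | Rel0 & NoEval
Hsa      |    1846 | 2996 | 1904 | 12023 |   1511 | 18769 |     20280 |          3357
Hsa2     |     821 | 2209 | 1692 | 14190 |    876 | 18912 |     19788 |          1697
Hsa3     |     668 | 1986 | 2386 | 12774 |   2002 | 17814 |     19816 |          2670
Mmu      |    1435 | 3371 | 3133 |  9412 |   3608 | 17351 |     20959 |          5043
Mmu2     |     945 | 3734 | 3413 |  9480 |   1881 | 17572 |     19453 |          2826
Rno      |    2252 | 4129 | 2453 |  3559 |   1358 | 12393 |     13751 |          3610
Cfa      |    4003 | 4965 | 2383 |  2822 |   2038 | 14173 |     16211 |          6041
Mcc      |    3202 | 4900 | 2220 |  1819 |   3658 | 12141 |     15799 |          6860
Dre      |    4304 | 1742 |  438 |   239 |   3389 |  6723 |     10112 |          7693
Gga      |    4930 | 4350 | 1288 |  1111 |   2078 | 11679 |     13757 |          7008
Dme      |     576 | 2121 | 2789 |  5782 |   1358 | 11268 |     12626 |          1934
Dme2     |     707 | 2199 | 2805 |  5874 |   1514 | 11585 |     13099 |          2221
Cel      |    3137 | 1119 |  235 |   121 |  12644 |  4612 |     17256 |         15781
Sce      |    1923 |  427 |   21 |     0 |   2090 |  2371 |      4461 |          4013
Spo      |    2182 |  358 |   28 |     0 |   2313 |  2568 |      4881 |          4495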

Limitation

To use supportability as a measure of reliability, the comparison should be made between species that share the same genetic system for the function of interest. Namely, a comparison between two platforms in the same species or in closely related species is required. The table shows the distribution of supportability levels in each platform: human, mouse and Drosophila have a higher ratio of three stars, whereas the two yeast species and the nematode have a lower ratio. This does not directly mean that the quality of the coexpression data for yeast and nematode is low; these species simply do not have closely related platforms.
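The ratios discussed above can be recomputed from the Result table. The sketch below copies the four star-class counts per platform and, as an assumption, uses their sum (the Total column) as the denominator, since the text does not state which denominator is meant.

```python
# Counts per platform from the Result table:
# (no star, one star, two stars, three stars)
counts = {
    "Hsa":  (1846, 2996, 1904, 12023),
    "Hsa2": (821, 2209, 1692, 14190),
    "Hsa3": (668, 1986, 2386, 12774),
    "Mmu":  (1435, 3371, 3133, 9412),
    "Mmu2": (945, 3734, 3413, 9480),
    "Rno":  (2252, 4129, 2453, 3559),
    "Cfa":  (4003, 4965, 2383, 2822),
    "Mcc":  (3202, 4900, 2220, 1819),
    "Dre":  (4304, 1742, 438, 239),
    "Gga":  (4930, 4350, 1288, 1111),
    "Dme":  (576, 2121, 2789, 5782),
    "Dme2": (707, 2199, 2805, 5874),
    "Cel":  (3137, 1119, 235, 121),
    "Sce":  (1923, 427, 21, 0),
    "Spo":  (2182, 358, 28, 0),
}


def three_star_share(row):
    """Fraction of evaluated genes (sum of the four classes) with three stars."""
    total = sum(row)
    return row[3] / total if total else 0.0


shares = {platform: three_star_share(row) for platform, row in counts.items()}

# List platforms from the highest to the lowest three-star share.
for platform, share in sorted(shares.items(), key=lambda item: item[1], reverse=True):
    print(f"{platform:5s} {share:6.1%}")
```

The ordering follows the availability of closely related platforms: species with a second platform or a close relative in the database rank high, while the yeasts and the nematode rank low.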