Supportability

Last update: Aug. 24, 2018

(1) Purpose

Although the strength of coexpression is represented by the MR value, the coexpression may be an artifact of a particular platform. A measure called supportability is introduced to quantify the reproducibility of a coexpressed gene list of interest.

History on COXPRESdb

Supportability was previously called reliability. With the refinement in 2014, we renamed reliability to supportability for the following two reasons.

We are trying to evaluate the quality of coexpression data from various aspects. The term reliability is not suitable as the name of just one such measure.
Coexpression supported by another platform is reliable, but the converse is not true. Namely, coexpression without any support can occur when an appropriate reference is not available.

Period	Term	Calculation of coex list similarity	Null distribution	Threshold ☆	Threshold ☆☆	Threshold ☆☆☆
2012-08 ~ 2014-08	Reliability	COXSIM(100)	Common to all platforms (a virtual platform of 10,000 genes)	E-04	E-12	E-20
2014-08 ~ now	Supportability	COXSIM(1%)	For each platform	E-04	E-16	E-32

(2) Basic idea

When a coexpressed gene list is repeatedly observed in independent platforms, it can be regarded as reliable.

Example of well-supported coexpressed gene list: DHCR7 (Hsa)
Example of less-supported coexpressed gene list: CCND3 (Hsa)

There are two possible ways to compare coexpression for reliability assessment: comparison of gene pairs (A) and comparison of gene lists (B).

We employ the B-type (gene list) comparison because pseudo-coexpression is mainly caused by inappropriate probes with weak hybridization or cross-hybridization. It therefore affects not just one gene pair but all gene pairs involving the problematic guide gene.

(3) Degree of coincidence of two coexpressed gene lists

Basic idea

We introduce a similarity measure, COXSIM, which is the weighted concordance rate between the coexpressed gene list from a guide gene g of interest (list_g) and that from a reference guide gene r (list_r). COXSIM is a function of guide gene g, reference guide gene r and threshold k.

In the definition of COXSIM, n(i, list_g, list_r) is the number of common genes (orthologous genes when the platforms belong to different species) found in the top i genes of both coexpressed gene lists.
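The exact weighting used in COXSIM is not spelled out in the text here, so the following Python sketch assumes one plausible form: the average of n(i, list_g, list_r) / i over i = 1..k. The function names `common_in_top` and `coxsim` are illustrative and are not taken from COXPRESdb.

```python
def common_in_top(i, list_g, list_r):
    """n(i, list_g, list_r): genes shared by the top i entries of both ranked lists."""
    return len(set(list_g[:i]) & set(list_r[:i]))


def coxsim(list_g, list_r, k):
    """Weighted concordance rate of two ranked gene lists.

    ASSUMPTION: the weighting is the mean of n(i) / i for i = 1..k.
    The source text does not give the formula explicitly.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    return sum(common_in_top(i, list_g, list_r) / i for i in range(1, k + 1)) / k


# Toy ranked lists, already expressed in a common gene namespace.
identical = coxsim(["A", "B", "C"], ["A", "B", "C"], 3)
swapped = coxsim(["A", "B", "C"], ["B", "A", "C"], 3)
disjoint = coxsim(["A", "B", "C"], ["X", "Y", "Z"], 3)
```

Under this assumed form, identical lists score 1, disjoint lists score 0, and a disagreement at the top rank costs more than one further down the list.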

Excluding orphan genes for the gene list comparison

However, some genes in the platform for list_g do not have corresponding genes in the platform for list_r. When such genes appear at high ranks in list_g, the coincidence of the two lists decreases. To avoid this effect, genes that lack corresponding genes in the reference platform are excluded from list_g, leaving list_g→r. In the same way, genes in list_r that lack corresponding genes in the platform for list_g are excluded, resulting in list_r→g. Subsequently, we compare the top k coexpressed genes in list_g→r with the reference gene list, list_r→g.
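The exclusion step can be sketched in Python as follows. The gene identifiers, the one-to-one mapping `g_to_r`, and the helper `exclude_orphans` are hypothetical and serve only to illustrate the filtering; real ortholog relations may be many-to-many.

```python
def exclude_orphans(ranked_genes, genes_with_counterpart):
    """Keep the ranking order but drop genes that have no counterpart."""
    return [gene for gene in ranked_genes if gene in genes_with_counterpart]


# Hypothetical one-to-one correspondence: gene in platform g -> gene in platform r.
g_to_r = {"g1": "r1", "g3": "r3", "g4": "r4"}

list_g = ["g1", "g2", "g3", "g4"]  # g2 has no counterpart in platform r
list_r = ["r3", "r9", "r1", "r4"]  # r9 has no counterpart in platform g

list_g_to_r = exclude_orphans(list_g, set(g_to_r))           # list_g→r
list_r_to_g = exclude_orphans(list_r, set(g_to_r.values()))  # list_r→g

# Express list_g→r in reference-platform identifiers so the two lists are comparable.
list_g_to_r_as_r = [g_to_r[gene] for gene in list_g_to_r]
```

After filtering, both lists contain only genes that exist in both platforms, so a missing counterpart can no longer lower the coincidence.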

Selection of k

For k, we use 1% of the number of genes in list_g→r.

We previously used 100 for k, meaning that we checked the gene correspondence of the top 100 coexpressed genes, in accordance with the default representation of a coexpressed gene list on COXPRESdb. However, using a common threshold for all platforms makes the stringency of the coexpression threshold differ between platforms. For example, the Sce platform for S. cerevisiae has 4,461 genes for coexpression analysis, whereas the Hsa platform for human has 19,803 genes. In the former, a particular gene is four to five times more likely to fall within the top k by chance, so a fixed k overestimates the significance of the coincidence of the gene lists. Therefore, we changed the number of genes examined from a fixed top k to the top 1% of all genes in list_g→r.
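The arithmetic behind this choice can be checked with a short Python sketch. It uses the full platform sizes quoted above as a stand-in for the length of list_g→r, which is normally smaller; the floor rounding and the minimum of 1 in `top_one_percent` are assumptions, as the text does not state a rounding rule.

```python
def top_one_percent(n_genes):
    """k as 1% of the list length (floor division and minimum of 1 are assumptions)."""
    return max(1, n_genes // 100)


def chance_in_top_k(k, n_genes):
    """Probability that one particular gene lands in the top k of a random ranking."""
    return k / n_genes


N_SCE, N_HSA = 4461, 19803  # platform sizes quoted in the text

# With a fixed k = 100, the yeast platform is several times more permissive.
fixed_ratio = chance_in_top_k(100, N_SCE) / chance_in_top_k(100, N_HSA)

# With k = 1% of the list, the chance is about 0.01 on both platforms.
k_sce, k_hsa = top_one_percent(N_SCE), top_one_percent(N_HSA)
chance_sce = chance_in_top_k(k_sce, N_SCE)
chance_hsa = chance_in_top_k(k_hsa, N_HSA)
```

Here `fixed_ratio` is 19803 / 4461, which lies between four and five, in line with the statement above.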

(4) Selection of the most appropriate reference guide gene / gene list

Since the best reference guide gene is unknown, we checked all possible reference guide genes. The reference guide gene set R is composed of all available orthologous genes for different species. When multiple platforms are available for the species including the guide gene g, the same gene in the other platforms is also included in the reference guide gene set R. The COXSIM values are calculated between the target guide gene g and every reference gene r in R. The reference gene r_max that gives the maximum COXSIM value is regarded as the best reference guide gene.
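Selecting r_max amounts to taking the maximum over the reference set R. The Python sketch below shows this under stated assumptions: `best_reference` and `top_k_overlap` are hypothetical names, and `top_k_overlap` is a simple stand-in similarity (plain top-k overlap), not COXSIM itself. Any similarity function with the same signature could be passed instead.

```python
def top_k_overlap(list_g, list_r, k):
    """Stand-in similarity: fraction of the top k genes shared by both lists."""
    return len(set(list_g[:k]) & set(list_r[:k])) / k


def best_reference(list_g, reference_lists, k, similarity=top_k_overlap):
    """Return (r_max, max similarity) over all reference guide genes in R.

    reference_lists maps each reference guide gene r to its ranked gene list.
    Returns (None, None) when no reference is available.
    """
    if not reference_lists:
        return None, None
    scores = {r: similarity(list_g, list_r, k) for r, list_r in reference_lists.items()}
    r_max = max(scores, key=scores.get)
    return r_max, scores[r_max]


# Toy data: two candidate reference guide genes for one guide gene.
list_g = ["A", "B", "C", "D"]
R = {
    "r1": ["A", "X", "B", "Y"],
    "r2": ["B", "A", "C", "Z"],
}
r_max, max_score = best_reference(list_g, R, k=3)
```

In this toy case r2 shares all of its top three genes with list_g and is therefore chosen as the best reference guide gene.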

(5) Calculation of p-value

To assess statistical significance, the maxCOXSIM value is compared with the null distribution generated for the same number of genes as in list_g→r.

(6) Discretization of maxCOXSIM significance

On COXPRESdb, the significance level, which we call the supportability, is shown as a number of stars according to the following p-value thresholds.

p-value threshold	Representation
1E-04	☆
1E-16	☆☆
1E-32	☆☆☆
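The mapping from p-value to stars can be written as a small Python function. This is a sketch: the function name is hypothetical, and treating each threshold as inclusive (p-value less than or equal to the threshold) is an assumption, since the table does not say whether the boundary itself counts.

```python
# Thresholds from the table above, strictest first: (p-value threshold, stars).
STAR_THRESHOLDS = [(1e-32, 3), (1e-16, 2), (1e-04, 1)]


def supportability_stars(p_value):
    """Number of stars for a maxCOXSIM p-value (inclusive thresholds are an assumption)."""
    for threshold, stars in STAR_THRESHOLDS:
        if p_value <= threshold:
            return stars
    return 0  # no star


examples = [supportability_stars(p) for p in (1e-40, 1e-20, 1e-05, 0.01)]
```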

(7) Result

Platform	No star	☆	☆☆	☆☆☆	NoEval	Total (evaluated)	COXPRESdb (all genes)	No star + NoEval
Hsa	1846	2996	1904	12023	1511	18769	20280	3357
Hsa2	821	2209	1692	14190	876	18912	19788	1697
Hsa3	668	1986	2386	12774	2002	17814	19816	2670
Mmu	1435	3371	3133	9412	3608	17351	20959	5043
Mmu2	945	3734	3413	9480	1881	17572	19453	2826
Rno	2252	4129	2453	3559	1358	12393	13751	3610
Cfa	4003	4965	2383	2822	2038	14173	16211	6041
Mcc	3202	4900	2220	1819	3658	12141	15799	6860
Dre	4304	1742	438	239	3389	6723	10112	7693
Gga	4930	4350	1288	1111	2078	11679	13757	7008
Dme	576	2121	2789	5782	1358	11268	12626	1934
Dme2	707	2199	2805	5874	1514	11585	13099	2221
Cel	3137	1119	235	121	12644	4612	17256	15781
Sce	1923	427	21	0	2090	2371	4461	4013
Spo	2182	358	28	0	2313	2568	4881	4495

Limitation

To use supportability as a measure of reliability, the comparison should be made between species that share the same genetic system for the function of interest. Namely, a comparison between two platforms in the same species or in closely related species is required. The table above shows the distribution of supportability levels in each platform: human, mouse and Drosophila have a higher proportion of three-star genes, whereas the two yeast species and the nematode have a lower proportion. This does not directly mean that the quality of the coexpression data for yeast and nematode is low, because these species simply do not have closely related platforms.
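The proportions mentioned above can be recomputed from the result table. The Python sketch below copies the three-star count and the number of evaluated genes for a few platforms from that table. Using the evaluated genes (the sum of the no-star to three-star columns) as the denominator is an assumption, as the text does not say which total the ratio refers to.

```python
# (three-star genes, evaluated genes) copied from the result table.
RESULTS = {
    "Hsa": (12023, 18769),
    "Mmu": (9412, 17351),
    "Dme": (5782, 11268),
    "Cel": (121, 4612),
    "Sce": (0, 2371),
    "Spo": (0, 2568),
}

# ASSUMPTION: the ratio is taken over evaluated genes, excluding NoEval.
three_star_fraction = {
    platform: three_star / evaluated
    for platform, (three_star, evaluated) in RESULTS.items()
}

for platform, fraction in three_star_fraction.items():
    print(f"{platform}\t{fraction:.1%}")
```

With these numbers, more than half of the evaluated genes in Hsa, Mmu and Dme reach three stars, while the fraction is below 5% for Cel and zero for Sce and Spo.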