rpretability of results obtained from genetic optimisations; we do not intend to speculate about those reasons at this point and leave this for further study. We note, however, that the enrichments obtained for the optimised signature are fundamentally different from, and much more significant than, those for an equal number of randomly selected probesets.

Conclusion

We established a baseline for achievable target prediction accuracy using a simple guilt by association method based on correlation of transcriptional profiles. The main objective of this study, however, is not target prediction per se but an investigation of how this can be achieved with gene signatures of varying nature and length. Two distinct groups of transcriptional signatures (expression data driven and based on biological interaction networks) were analysed for their performance.
No striking differences between these groups were found. The optimisation of transcriptional signatures by a genetic algorithm led to the best performing signatures and indicated that a maximum size of approximately 128 probesets is optimal. A signature of this size therefore extracted a maximum of biological variation of the investigated cellular systems. The genes of this optimised signature were predominantly found in pathways relating to oxidative phosphorylation and ubiquinone metabolism. This indicated that these biological processes might be the most generic way to capture compound perturbation of cells. We furthermore showed that it is possible to optimise very small signatures for a particular purpose.
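The guilt by association idea summarised above can be illustrated with a minimal sketch. It assumes Pearson correlation as the similarity measure and a single nearest neighbour as the predictor; the text does not specify either choice. The function name `predict_target`, the toy expression values and the target labels "A" and "B" are hypothetical and serve only as an illustration.

```python
import numpy as np

def predict_target(profiles, targets, signature, query_idx):
    """Assign the query compound the target of its most correlated neighbour.

    profiles  : (n_compounds, n_probesets) array of expression values
    targets   : known target label for each compound (hypothetical labels)
    signature : indices of the probesets that form the signature
    query_idx : row index of the compound whose target is predicted
    """
    sub = profiles[:, signature]        # keep only the signature probesets
    corr = np.corrcoef(sub)             # Pearson correlation between compounds (rows)
    scores = corr[query_idx].copy()
    scores[query_idx] = -np.inf         # a compound must not vote for itself
    nearest = int(np.argmax(scores))
    return targets[nearest], float(scores[nearest])

# Toy data: four compounds, six probesets; the signature uses the first four.
profiles = np.array([
    [1.0, 2.0, 3.0, 4.0, 0.0, 9.0],
    [1.1, 2.1, 2.9, 4.2, 5.0, 1.0],
    [4.0, 3.0, 2.0, 1.0, 0.5, 8.0],
    [3.9, 3.2, 1.8, 1.1, 7.0, 2.0],
])
targets = ["A", "A", "B", "B"]
signature = [0, 1, 2, 3]

predicted, similarity = predict_target(profiles, targets, signature, 0)
```

Changing `signature` changes which probesets contribute to the correlation, which is the quantity a signature optimisation would vary.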
Given that both groups of signatures (expression based and network based) perform similarly, it is to be expected that a combination of both can lead to better signatures.

Methods and materials

Expression data and compound annotations

Our analyses are based on gene expression data from the Broad Institute's Connectivity Map 2 (CMAP2). Several cell lines were treated with a total of 1,309 different compounds, and whole genome expression levels were determined using Affymetrix gene chips. The cell lines with most measurements in CMAP2 were the human breast epithelial adenocarcinoma cell line MCF7, the prostate adenocarcinoma cell line PC3 and the human promyelocytic leukaemia cell line HL60. Expression levels were measured using the human Affymetrix chips HG-U133A.
The compounds were tested in batches with replicates, resulting in a total of 6,100 experiments. The combination of a compound, applied concentration, cell line and microarray platform used is referred to as a treatment instance. We used a total of 22,267 probesets that were present in all treatment instances. CMAP2 data were downloaded from the Broad Institute's website and processed in R using the affy package. Robust multichip average (RMA) expression values were calculated for each treatment instance, and the expression values of each batch containing more than five treatment instances were then mean centred on a probeset level using the