Each computational strategy produces text files containing the results, along with graphical output generated on the fly. These data are stored in a unique directory with a randomly generated name for at least 24 hours so that the user can view and download them as necessary. Afterwards, all uploaded data, intermediate analysis data, and output are destroyed.
The site curators will not examine uploaded data under any circumstances unless the user requests it. All data is deleted after 24 hours. The data is viewable by anyone with the correct unique directory code; if this code is kept private, the data is secure, because our server does not permit browsing of these directories. The Cleaver site otherwise maintains all of the security that the PharmGKB server employs.
All Cleaver statistical procedures expect microarray data in a tab-delimited text file. Tab-delimited files can be created in programs such as Excel by saving the file as tab-delimited text. All files should have the following format:
| GeneName | Condition1 | Condition2 | Condition3 | Condition4 | Condition5 |
| Gene1 | X11 | X12 | X13 | X14 | X15 |
| Gene2 | X21 | X22 | X23 | X24 | X25 |
| Gene3 | X31 | X32 | X33 | X34 | X35 |
The term “Gene1” is a description or name of a gene; “Condition1” is a name or description of an experimental condition, strain, or specimen. The terms Xij represent data for the ith gene and the jth experiment. The word GeneName is a string that may describe the whole data file. The gene names and the condition names should be unique.
Examples of acceptable data files are available at :
http://smi-web.stanford.edu/projects/helix/pubs/pda/
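As an illustration of the file layout described above, the following minimal Python sketch reads such a table. The function name `read_expression_table` and the choice to represent an empty cell as `None` are assumptions made for this example; they are not part of Cleaver itself.

```python
import csv
import io

def read_expression_table(handle):
    """Parse a tab-delimited table: one header row, then one row per gene."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    conditions = header[1:]            # the first cell (e.g. "GeneName") labels the file
    genes, values = [], []
    for row in reader:
        if not row:
            continue                   # skip blank lines
        genes.append(row[0])
        cells = row[1:] + [""] * (len(conditions) - len(row[1:]))
        # An empty cell (two tabs in a row) is kept as None, meaning "missing".
        values.append([float(c) if c.strip() else None for c in cells])
    if len(set(genes)) != len(genes) or len(set(conditions)) != len(conditions):
        raise ValueError("gene names and condition names should be unique")
    return genes, conditions, values

# Hypothetical two-gene, two-condition file with one missing value.
sample = "GeneName\tCondition1\tCondition2\nGene1\t1.5\t2.0\nGene2\t\t0.5\n"
genes, conditions, values = read_expression_table(io.StringIO(sample))
```

Running a file through a check like this before uploading can catch duplicate names or non-numeric cells early.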
Each of Cleaver’s statistical procedures offers several data transformation options. In many cases it may be best to transform data before using the Cleaver procedures. The transformations are offered as a convenience to the user. Cleaver offers the following transforms:
1. Column Normalize – normalizes the intensities of a given array to mean zero and variance one across all genes.
2. Row Normalize – normalizes the intensities of a given gene to mean zero and variance one across all conditions.
3. Row Rank – replaces each intensity with the rank of the condition among all conditions for a given gene.
4. Column Rank – replaces each intensity with the rank of the gene among all genes in a given array.
5. Log – takes the natural logarithm of the data. Zero and negative values have no real logarithm, so make sure all values are positive before applying this transform.
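The five transforms listed above can be sketched in Python as follows. This is an illustration, not Cleaver's implementation: the source does not say whether population or sample variance is used, how ties are ranked, or whether rank 1 is the smallest or largest value, so the choices below (population variance, ties broken by position, rank 1 = smallest) are assumptions. Rows are genes and columns are arrays.

```python
import math
from statistics import mean, pstdev

def transpose(m):
    return [list(col) for col in zip(*m)]

def row_normalize(m):
    """Scale each row (gene) to mean 0 and variance 1 across conditions."""
    out = []
    for row in m:
        mu, sd = mean(row), pstdev(row)   # a constant row (sd == 0) would divide by zero
        out.append([(x - mu) / sd for x in row])
    return out

def column_normalize(m):
    """Scale each column (array) to mean 0 and variance 1 across genes."""
    return transpose(row_normalize(transpose(m)))

def row_rank(m):
    """Replace each value by its rank within its row (1 = smallest)."""
    out = []
    for row in m:
        order = sorted(range(len(row)), key=lambda j: row[j])
        ranks = [0] * len(row)
        for rank, j in enumerate(order, start=1):
            ranks[j] = rank
        out.append(ranks)
    return out

def column_rank(m):
    """Replace each value by its rank within its column (1 = smallest)."""
    return transpose(row_rank(transpose(m)))

def log_transform(m):
    """Natural logarithm; math.log raises ValueError for zero or negative input."""
    return [[math.log(x) for x in row] for row in m]
```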
The classification protocol combines the positive, negative, and test files into one pooled data set, to which the transforms are applied.
Up to three transforms can be applied in any desired order; use the drop-down menus to indicate which transform to apply first, second, and third. The user is cautioned to choose transforms carefully. For example:
1. A log transform after a normalize transform will result in taking logs of negative values.
2. When analyzing only a subset of your data, transforms must be applied judiciously. For example, if analyzing only a few genes from your arrays, column normalizations or ranks, in which all genes are compared to each other, may give misleading results.
3. Doing a row transform followed by a column transform, or vice versa, may be appropriate under some circumstances, but may also obscure information in the data set.
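The first caution above can be demonstrated with a short Python sketch. The helper names `normalize` and `safe_log` and the intensity values are made up for illustration.

```python
import math
from statistics import mean, pstdev

def normalize(values):
    """Scale values to mean 0 and variance 1 (population variance assumed)."""
    mu, sd = mean(values), pstdev(values)
    return [(v - mu) / sd for v in values]

def safe_log(values):
    """Natural log that refuses zero or negative input."""
    if any(v <= 0 for v in values):
        raise ValueError("log is undefined for zero or negative values")
    return [math.log(v) for v in values]

intensities = [120.0, 340.0, 560.0]    # hypothetical raw intensities

# Log first, then normalize: every step is well defined.
log_then_norm = normalize(safe_log(intensities))

# Normalize first, then log: the normalized values include negatives.
try:
    safe_log(normalize(intensities))
    norm_then_log_ok = True
except ValueError:
    norm_then_log_ok = False
```

Any data set normalized to mean zero must contain values at or below zero, so the second ordering cannot succeed.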
The statistical procedures can fill in missing values. We encourage users to eliminate missing values before using our site, but simple facilities for dealing with missing values are offered as a convenience. Missing values in the data file should be left as empty cells. (In terms of tabs, this means that there should be two tabs in a row rather than the usual tab, value, tab.) The following options are available:
1. Zero Fill – all missing values are automatically set to zero.
2. Average Over Genes – all missing values are imputed as the average of all of the expression values for the associated gene.
3. Average Over Arrays – all missing values are imputed as the average of all of the expression values for the associated array.
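The three fill options above might look like this in Python. This is a sketch under stated assumptions: missing cells are represented as `None`, rows are genes, columns are arrays, and a row or column with no observed values falls back to zero. The source does not specify that last case.

```python
from statistics import mean

def fill_missing(matrix, method):
    """Fill None cells using 'zero', 'gene' (row average) or 'array' (column average)."""
    def average(values):
        present = [v for v in values if v is not None]
        return mean(present) if present else 0.0   # assumption: nothing observed -> 0

    row_avg = [average(row) for row in matrix]
    col_avg = [average(col) for col in zip(*matrix)]

    filled = []
    for i, row in enumerate(matrix):
        new_row = []
        for j, value in enumerate(row):
            if value is not None:
                new_row.append(value)
            elif method == "zero":
                new_row.append(0.0)
            elif method == "gene":
                new_row.append(row_avg[i])
            elif method == "array":
                new_row.append(col_avg[j])
            else:
                raise ValueError("unknown method: " + repr(method))
        filled.append(new_row)
    return filled

data = [[1.0, None, 3.0],
        [4.0, 6.0, None]]
by_gene = fill_missing(data, "gene")     # [[1.0, 2.0, 3.0], [4.0, 6.0, 5.0]]
by_array = fill_missing(data, "array")   # [[1.0, 6.0, 3.0], [4.0, 6.0, 3.0]]
```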
Classification requires at least two different files and can be performed on genes or on arrays. The “Positive Array File” contains the data associated with the positive cases; likewise, the “Negative Array File” contains the data associated with the negative cases. For example, to determine which genes are ribosomal, enter the data from known ribosomal genes across a number of experiments as the positive file and the data from known non-ribosomal genes as the negative file. The user can choose to classify on arrays or on genes. If classifying on arrays, the number of genes must be identical in all files; if classifying on genes, as in the ribosomal example, the number of experiments must be identical in all files. The test file contains the cases that the algorithms attempt to classify based on the data in the other files.
Penalty Parameter
The penalty parameter dictates how strongly the problem is constrained. Typically, a penalty of zero may be best if the number of training examples far exceeds the number of features. In other cases, where there are many features and few examples, setting the penalty to a non-zero value may be desirable. You may need to explore this parameter to obtain optimal results.
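To give some intuition for why a non-zero penalty helps when features outnumber examples, here is a small Python sketch. It assumes a linear discriminant whose covariance matrix has the penalty added to its diagonal (a ridge-style penalty). The text above does not name the classifier Cleaver uses, so this is a stand-in illustration only, and the function names and numbers are hypothetical.

```python
def solve_2x2(a, b):
    """Solve the 2x2 linear system a * w = b with Cramer's rule."""
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    if abs(det) < 1e-12:
        raise ValueError("matrix is singular; the problem is under-constrained")
    return [(b[0] * a[1][1] - a[0][1] * b[1]) / det,
            (a[0][0] * b[1] - a[1][0] * b[0]) / det]

def discriminant_weights(cov, mean_diff, penalty):
    """Weights w = (cov + penalty * I)^-1 * (mean_pos - mean_neg)."""
    penalized = [[cov[0][0] + penalty, cov[0][1]],
                 [cov[1][0], cov[1][1] + penalty]]
    return solve_2x2(penalized, mean_diff)

# Two perfectly correlated features give a singular covariance matrix,
# as happens when there are too few examples for the number of features.
cov = [[1.0, 1.0],
       [1.0, 1.0]]
mean_diff = [2.0, 2.0]

weights = discriminant_weights(cov, mean_diff, penalty=1.0)
# With penalty=0.0 the same call raises ValueError.
```

With plenty of training examples the covariance estimate is usually well conditioned already, which is consistent with the advice to use a zero penalty in that case.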
Instead of providing a test file, the user can opt to use the cross-validate option. Use this option to understand how accurate the classification is. The procedure attempts to classify each of the cases in the Positive and Negative files. Since the “correct” answer is known, accuracy can be calculated. The accuracy of the classification procedure will depend heavily on the nature of the data and the problem.
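The idea behind cross-validation can be sketched in Python as follows. Two assumptions are made for illustration: each case is held out one at a time (leave-one-out), and a simple nearest-centroid rule stands in for the classifier. The text above specifies neither.

```python
import math

def centroid(cases):
    return [sum(col) / len(cases) for col in zip(*cases)]

def score(case, positives, negatives):
    """Positive score: the case lies closer to the positive centroid."""
    return math.dist(case, centroid(negatives)) - math.dist(case, centroid(positives))

def leave_one_out_accuracy(positives, negatives):
    """Classify each known case using all the other cases; needs >= 2 cases per class."""
    correct = 0
    for i, case in enumerate(positives):
        rest = positives[:i] + positives[i + 1:]
        if score(case, rest, negatives) > 0:
            correct += 1
    for i, case in enumerate(negatives):
        rest = negatives[:i] + negatives[i + 1:]
        if score(case, positives, rest) < 0:
            correct += 1
    return correct / (len(positives) + len(negatives))

# Hypothetical, clearly separated cases.
positives = [[5.0, 5.1], [4.8, 5.3], [5.2, 4.9]]
negatives = [[1.0, 0.9], [0.8, 1.2], [1.1, 1.0]]
accuracy = leave_one_out_accuracy(positives, negatives)
```

Because the held-out case never contributes to its own class centroid, the accuracy estimate is not inflated by the case "recognizing itself".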
Output
The immediate output of the program is a graphic. A link to the raw data in text form is also available. The display shows the score assigned to each tested case. If cross-validation is used, an accuracy estimate is presented, and the correct assignment for each cross-validation case is shown in brackets (“[+]” for positive and “[-]” for negative). Cases are sorted by classification score. The graphic displays only the twenty most predictive features, each with a score indicating its predictive value. Expression values are scaled into red (relatively high) and green (relatively low) colors.
A commonly used clustering technique in many scientific areas is the so-called K-means clustering algorithm. Using this approach the user can cluster data based on some specified metric into a number of clusters. Users can cluster arrays or genes as desired into a pre-specified number of clusters. The algorithm has a randomized starting point – so results may vary from run to run.
File Format
As indicated above, the single data file must contain array data in tab-delimited format. The user can choose to cluster on the arrays or on the genes in the file by clicking the appropriate button at the top of the page.
A number of metrics are provided for analysis. A metric is a measure of dissimilarity between data cases (genes or arrays, depending on which we are clustering). The following metrics are offered:
1. Euclidean Distance – each data case is treated as a vector, and distances are calculated accordingly. The distance between two cases x and y is calculated as follows:
$$\sqrt{\sum_i (x_i - y_i)^2}$$
where i indexes over all features.
2. Manhattan Distance – similar in nature to the Euclidean metric, but calculated slightly differently:
$$\sum_i |x_i - y_i|$$
3. Correlation Based – a correlation coefficient, r, is first calculated between two cases x and y, and the distance is then estimated with the formula:
$$\sqrt{1 - r^2}$$
4. Variance Scaled – On each iteration of the k-means algorithm, pooled within-cluster variances are calculated for each of the features. A Euclidean-like measure is then calculated in which the distance is normalized by the variance of each feature. For example, if clustering genes, this technique discounts the effect of different experiments having different variances.
5. Covariance Scaled – This metric should be used ONLY when the number of features is much smaller than the number of cases. For example, if attempting to cluster 10 array cases based on 1000 gene features, this strategy is unacceptable and will cause erroneous results; however, if attempting to cluster 1000 genes using 10 array experiments as features, this metric is acceptable. On each iteration of the k-means algorithm, a pooled within-cluster covariance matrix is calculated for all features. This matrix is used to discount both the variances of features and the interrelationships between features in clustering. This metric is known as the Mahalanobis distance.
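The first three metrics in the list above translate directly into Python. This sketch follows the formulas as given; the function names are chosen for illustration, and r is assumed to be the Pearson correlation coefficient. The variance-scaled and covariance-scaled metrics depend on the current cluster assignments and are not shown.

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def manhattan(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))

def pearson_r(x, y):
    """Pearson correlation; undefined (division by zero) if either case is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def correlation_distance(x, y):
    r = pearson_r(x, y)
    # The clamp guards against rounding pushing r*r slightly above 1.
    return math.sqrt(max(0.0, 1.0 - r * r))
```

Because the correlation-based formula uses r squared, two perfectly anti-correlated cases (r = -1) are at distance zero, just like two perfectly correlated ones.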
The expected number of clusters must be assigned by the user in advance. We also ask the user to specify the maximum number of iterations before beginning the analysis. The algorithm usually converges within a small number of iterations (20-30). However, sometimes there may be a small number of cases that cannot be definitively assigned to a particular cluster; in these cases the iteration limit prevents the algorithm from running indefinitely.
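The roles of the cluster count, the random starting point, and the iteration limit can be seen in a minimal k-means sketch in Python. It uses Euclidean distance and picks k of the cases at random as initial centers; both choices, and the `seed` parameter, are assumptions for illustration rather than a description of Cleaver's code. It requires Python 3.8 or later for `math.dist`.

```python
import math
import random

def kmeans(cases, k, max_iter=30, seed=None):
    """Return (assignment, centers) after at most max_iter iterations."""
    rng = random.Random(seed)
    centers = [list(c) for c in rng.sample(cases, k)]   # randomized starting point
    assignment = None
    for _ in range(max_iter):
        # Assign every case to its nearest center.
        new_assignment = [
            min(range(k), key=lambda c: math.dist(case, centers[c]))
            for case in cases
        ]
        if new_assignment == assignment:
            break                                       # converged: nothing moved
        assignment = new_assignment
        # Move each center to the mean of its members.
        for c in range(k):
            members = [case for case, a in zip(cases, assignment) if a == c]
            if members:                                 # an empty cluster keeps its old center
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centers

cases = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
         [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]]
assignment, centers = kmeans(cases, k=2, max_iter=30, seed=1)
```

Different seeds can produce different cluster labels, and occasionally different clusters, which is why results may vary from run to run.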
Graphical Output
The graphical output displays up to 25 cases for each cluster: the 20 most representative cases and the 5 least representative cases. A score indicating how representative each case is of its cluster is shown along the side. The feature vectors are also displayed; red indicates relatively low expression and green indicates relatively high expression.
The text output lists the clusters and the cases associated with each cluster. The center of each cluster is also reported. Also provided are two measures of how representative a case is of its cluster, that is, how close the case is to the cluster center. One measure is the distance from the case to the center of its own cluster divided by the distance to the center of the next closest cluster (this measure is used for the graphical output); the other is the distance to the center of its own cluster divided by the average distance to the centers of all other clusters.
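The two representativeness measures described above can be written as a short Python function. Euclidean distance is assumed here for simplicity, and the function name and example values are hypothetical; the measures themselves would use whichever metric was selected for clustering.

```python
import math

def representativeness(case, centers, own):
    """Return (ratio to next closest center, ratio to average of other centers)."""
    d_own = math.dist(case, centers[own])
    others = [math.dist(case, c) for i, c in enumerate(centers) if i != own]
    ratio_next = d_own / min(others)
    ratio_avg = d_own / (sum(others) / len(others))
    return ratio_next, ratio_avg

centers = [[0.0, 0.0], [10.0, 0.0], [-20.0, 0.0]]
ratio_next, ratio_avg = representativeness([2.0, 0.0], centers, own=0)
# Distances: 2 to its own center, 8 and 22 to the others.
# ratio_next = 2 / 8, ratio_avg = 2 / 15
```

For both measures, a smaller value means the case sits closer to its own cluster center relative to the other clusters.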
Principal Component Analysis (PCA) is an analytical technique that is often used for dimensionality reduction. The implementation of PCA that we provide allows users to view the location of genes relative to each other in a reduced experiment space. Genes are plotted with respect to the two orthogonal linear combinations of experiments that capture the most variance.
Graphical Output
Each gene in the provided data file is plotted in a two-dimensional graphic. The total percentage of variance contained in the top two dimensions is indicated at the top of the page; the closer this percentage is to 100, the more completely the graphic represents the information in the data. To see which gene a spot represents, click on the spot, and the gene name will appear in a text box at the top of the page.
Text Output
The text output contains information about each of the components. Each component is a linear combination of the array conditions. For each component, the output provides the coefficients of the linear sum, the amount of variance contributed by the component, and the value of each gene along the component.
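The quantities reported in the text output (component coefficients, variance contributed, and the value of each gene along each component) can be illustrated with a small pure-Python PCA. This sketch finds the top two components of the covariance matrix by power iteration with deflation; that numerical method, the function names, and the sample data are choices made for this example and not necessarily what Cleaver uses.

```python
import math

def center_columns(rows):
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[x - m for x, m in zip(row, means)] for row in rows]

def covariance_matrix(centered):
    n, p = len(centered), len(centered[0])
    return [[sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(p)]
            for a in range(p)]

def leading_eigenpair(matrix, iterations=500):
    """Power iteration: returns (eigenvalue, unit eigenvector) of largest magnitude."""
    p = len(matrix)
    v = [float(j + 1) for j in range(p)]          # arbitrary non-zero start
    norm = math.sqrt(sum(x * x for x in v))
    v = [x / norm for x in v]
    for _ in range(iterations):
        w = [sum(matrix[a][b] * v[b] for b in range(p)) for a in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    mv = [sum(matrix[a][b] * v[b] for b in range(p)) for a in range(p)]
    return sum(x * y for x, y in zip(v, mv)), v   # Rayleigh quotient, vector

def pca_top_two(rows):
    """rows: genes x conditions. Returns (components, explained fractions, scores)."""
    centered = center_columns(rows)
    cov = covariance_matrix(centered)
    p = len(cov)
    total = sum(cov[j][j] for j in range(p))      # total variance
    components, explained = [], []
    for _ in range(2):
        value, vector = leading_eigenpair(cov)
        components.append(vector)
        explained.append(value / total)
        # Deflate: remove the component just found before searching for the next.
        cov = [[cov[a][b] - value * vector[a] * vector[b] for b in range(p)]
               for a in range(p)]
    scores = [[sum(x * c for x, c in zip(row, comp)) for comp in components]
              for row in centered]
    return components, explained, scores

# Hypothetical data: 4 genes measured under 3 conditions.
rows = [[-3.0, -3.0, 1.0],
        [-1.0, -1.0, -1.0],
        [1.0, 1.0, -1.0],
        [3.0, 3.0, 1.0]]
components, explained, scores = pca_top_two(rows)
```

Here `components` holds the coefficients of each linear combination of conditions, `explained` the fraction of total variance each component carries, and `scores` the coordinates at which each gene would be plotted.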