Gerard JP van Westen: October 2011

(I am seeing about 5-10 views a day on the Pipeline Pilot pages; please be so kind as to acknowledge or cite my blog when you use these tools and guides.)

Why would we want such a thing?
During the time I have been using Pipeline Pilot (PP), I found it inconvenient that there was no component to calculate the correlation coefficient between two properties present in the data stream (for instance when performing external validation of a model).

Therefore I have written a component to do just that. One feature I find useful is the option to include both an upper and a lower error margin line, which allows a quick visual inspection of your model's reliability.

While the latest version (8.5) has a component called "Regression Model Evaluation Viewer" that calculates an RMSE and R², this component has some downsides:

- The component calculates the modeled values internally, so it cannot be used to calculate the correlation between two sets of values obtained from external sources.
- The component only calculates the R² and RMSE, while for a proper evaluation R₀² and k-slope are also required.

My component is available on my website and is compatible with PP 8.5 and up.

It has been tested with up to approximately 20,000 records and works fine. In addition, the parameters that are also calculated by the 'Regression Model Evaluation Viewer' and the 'R-statistics fit plots' come out identical.

So what does it do?
The component calculates correlation parameters according to Tropsha (2010) ¹ between two properties present in the stream. These properties are defined as 'Activity' (Y-values) and 'Model' (X-values). These have to be present in the stream and therefore need to be pre-calculated in the case of a model. In addition, a scatter plot containing all values is output. Both the parameters and the plot are output as reporting items.

The following values are calculated:

RMS Error (RMSE)
R² (R2)
R₀² (R2_zero)
R₀²' (R2_zero_acc)
k-Slope (Slope_K)
k-Slope' (Slope_K_acc)
% Difference between R² and R₀² (Perc_Diff_R2_with_R2_zero)
% Difference between R² and R₀²' (Perc_Diff_R2_with_R2_zero_acc)
Absolute difference between R₀² and R₀²' (Absolute_diff_R2_zero_and_R2_zero_acc)
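
For readers who want to check these values outside PP, here is a minimal Python sketch. It is not the code of the component: the formulas are my assumed reading of the usual Golbraikh/Tropsha definitions (regression through the origin in both directions), the function name `correlation_parameters` is made up for this example, and which of the two slopes the component calls Slope_K and which Slope_K_acc is a guess. The "% difference" is returned as a fraction, which is how the example values further down appear to be reported.

```python
import math

def correlation_parameters(activity, model):
    """Validation statistics for observed ('Activity', Y) vs predicted ('Model', X).

    Assumed formulas, not taken from the Pipeline Pilot component itself.
    Fails with ZeroDivisionError if either series is constant or all zero.
    """
    y, x = list(activity), list(model)
    n = len(y)
    if n != len(x) or n < 2:
        raise ValueError("need two equally long series with at least 2 values")
    mean_y, mean_x = sum(y) / n, sum(x) / n
    ss_y = sum((b - mean_y) ** 2 for b in y)   # total sum of squares of Y
    ss_x = sum((a - mean_x) ** 2 for a in x)   # total sum of squares of X
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sum_xy = sum(a * b for a, b in zip(x, y))

    rmse = math.sqrt(sum((b - a) ** 2 for a, b in zip(x, y)) / n)
    r2 = s_xy ** 2 / (ss_x * ss_y)             # squared Pearson correlation
    k = sum_xy / sum(a * a for a in x)         # slope of Y = k * X
    k_acc = sum_xy / sum(b * b for b in y)     # slope of X = k' * Y
    r2_zero = 1 - sum((b - k * a) ** 2 for a, b in zip(x, y)) / ss_y
    r2_zero_acc = 1 - sum((a - k_acc * b) ** 2 for a, b in zip(x, y)) / ss_x
    return {
        "RMSE": rmse,
        "R2": r2,
        "R2_zero": r2_zero,
        "R2_zero_acc": r2_zero_acc,
        "Slope_K": k,
        "Slope_K_acc": k_acc,
        "Perc_Diff_R2_with_R2_zero": (r2 - r2_zero) / r2,
        "Perc_Diff_R2_with_R2_zero_acc": (r2 - r2_zero_acc) / r2,
        "Absolute_diff_R2_zero_and_R2_zero_acc": abs(r2_zero - r2_zero_acc),
    }

# Hypothetical data: activity is exactly twice the modeled value.
stats = correlation_parameters(activity=[2.0, 4.0, 6.0, 8.0], model=[1.0, 2.0, 3.0, 4.0])
```

With data like this the correlation is perfect while the RMSE is not zero, which is exactly why the slopes are worth looking at next to R².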

Additional Settings:

Under 'Plot Parameters', variables for the x-y scatter plot can be defined. Furthermore, the range of the upper and lower error lines can be set (default 0.5 from the line of unity).
- 'Auto_range': when set to 'True', the axis scale is automatically fitted to the range of the data. When set to 'False' (default), a range can be entered manually for 'Activity' (y-values) and 'Model' (x-values); the default range is 2.0 - 12.0.
- 'Uncertainty' defines the margin between the line of unity and the uncertainty lines (default 0.5 units away from the line of unity).
- If 'Uncertainty_in_plot' is set to 'True' (default), two lines indicating the lower and upper error margins are drawn in the plot.
If 'Output_Records' is set to 'True', all values are output unchanged to the 'Fail' port, while the plot and correlation parameters are output to the 'Pass' port.
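
To put a number on what the two uncertainty lines show, one could count how many points fall between them. The sketch below is only an illustration in Python: the component itself just draws the lines, and the function name `within_uncertainty` and the sample values are made up.

```python
def within_uncertainty(activity, model, uncertainty=0.5):
    """Fraction of records lying between y = x - uncertainty and y = x + uncertainty.

    'activity' holds the Y-values, 'model' the X-values; the default margin of
    0.5 mirrors the default of the 'Uncertainty' parameter described above.
    """
    pairs = list(zip(activity, model))
    if not pairs:
        raise ValueError("no records")
    inside = sum(1 for y, x in pairs if abs(y - x) <= uncertainty)
    return inside / len(pairs)

# Hypothetical values: two of the four predictions are within 0.5 units.
fraction = within_uncertainty([5.0, 6.0, 7.0, 8.0], [5.2, 6.6, 7.0, 9.0])
```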

The examples were made with the example protocol "08 Calculate logP using the R_logP_SVM Model", listed under Examples/R Statistics/Learning and Clustering/R Learn Models...

RMSE	R2_zero	R2	R2_zero_acc	Slope_K_acc	Slope_K	Perc_Diff_R2_with_R2_zero	Perc_Diff_R2_with_R2_zero_acc	Absolute_diff_R2_zero_and_R2_zero_acc
0.679	0.839	0.839	0.827	0.997	0.928	0.000	0.015	0.012

If for some reason you are having trouble with the component, please contact me!

¹ Tropsha, A. (2010). Predictive Quantitative Structure-Activity Relationships Modeling. In: Handbook of Chemoinformatics Algorithms, J. Faulon and A. Bender (eds.).

Updated!

Over the last few years I have been using R to create my models. However, the interface running on top of R (doing the data shaping and fingerprint folding) was Pipeline Pilot. This works quite nicely and efficiently (one could think of better solutions, but for my work this setup suffices).

When there are errors in your data, though, things go wrong. Not all error messages are as intuitive as you would like, and the Pipeline Pilot help can't really help here either, so over the last few years I have kept a list of error messages and what they mean in practice. I have listed it here so that anyone else struggling with an unknown error might find it. This is also convenient for myself, as these things are retrieved quicker online than on network share xxx :).

The organisation is as follows: each entry starts with the actual error message received (trimmed), followed by one or more possible causes and, where known, a solution.

Related to SVM as performed in the “e1071” package:

Error in svm.default(x, y, scale = scale, ..., na.action = na.action) :
dependent variable has to be of factor or integer type for classification mode.

Calls: doCV -> modelfunc -> svm -> svm.formula -> svm.default

Fingerprint properties are not recognized as fingerprints

Set property type of properties to learn from to “fingerprint” (like 'SciTegic.value.IntegerFingerprintValue')
Set option convert fingerprints to “Fixed-Length array of bits”
Possibly due to merge there are array properties present (multiple values for one property)

Error in …. Subscript out of bounds

The property to learn is incorrect

Two values present in one property where there should be one
Only actives are present

No properties present to learn from

(Possibly through ignore properties)

Empty beginning of file

The property to learn is incorrect.

Either not present in the stream
The name is misspelled

Missing properties in file

Problem with the fingerprints that are being input in a learned model.

The ‘change fingerprints to fixed length bit size’ step is executed incorrectly
This specific property is missing
The property type has not been set to fingerprint ('SciTegic.value.IntegerFingerprintValue')

"Error in svd(x, nu = 0) : 0 extent dimensions"

When performing a PCA, (multiple) properties are not considered to be numeric.

Decimal comma instead of dot

“Error in svm.default(x,y,scale,…..): C <= 0!”

The allocation of a cost value is incorrect.

Decimal comma instead of dot
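
Decimal commas are easy to fix before the data reaches R. A minimal Python sketch, assuming the values arrive as text; the helper name `to_float` is made up for this example and it is not part of Pipeline Pilot or R:

```python
def to_float(text):
    """Parse a number that may use a decimal comma, e.g. '0,5' -> 0.5.

    Assumption: the comma is a decimal separator. Values that contain both a
    comma and a dot are rejected instead of guessed, and a thousands
    separator such as '1,000' would be misread as 1.0.
    """
    cleaned = text.strip()
    if "," in cleaned and "." in cleaned:
        raise ValueError(f"ambiguous number format: {text!r}")
    return float(cleaned.replace(",", "."))

# Hypothetical cost value entered with a decimal comma.
cost = to_float("0,5")
```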

“Error in matrix(ret$dec, nrow = nrow(newdata), byrow = TRUE, dimnames = list(rowns, : matrix: invalid 'ncol' value (< 0) Execution halted”

Properties to learn from are defined incorrectly

“allpropertiesonfirstdata” instead of “user set” when not all properties are present in all records
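
A quick way to see whether all properties are present in all records is to compare each record against the union of all property names. This is only a Python sketch on records represented as dictionaries; the function name `properties_missing_per_record` and the example property names are made up.

```python
def properties_missing_per_record(records):
    """Map record index -> sorted list of property names that record lacks.

    Records are plain dicts here (an assumption for this sketch); records
    that have every property are left out of the result.
    """
    all_props = set()
    for rec in records:
        all_props.update(rec)
    missing = {}
    for i, rec in enumerate(records):
        lacking = all_props - set(rec)
        if lacking:
            missing[i] = sorted(lacking)
    return missing

# Hypothetical stream: the second record has no 'ALogP' property.
report = properties_missing_per_record([{"MW": 180.2, "ALogP": 1.3}, {"MW": 94.1}])
```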

Error in svm.default(x, y, scale = scale, ..., na.action = na.action) : Need numeric dependent variable for regression. In addition: Warning message: data length exceeds size of matrix

Property to learn from contains non-numeric characters
Continuous model selected for classification data

Error in cor(preds[[1]], preds[[2]], method = "pearson") : missing observations in cov/cor. In addition: Warning messages: 1-5: data length exceeds size of matrix

Non numeric properties are used to learn from in regression.

Use ‘IgnoreProperties’ to exclude non numeric properties

Possibly, property should be changed to ('SciTegic.value.IntegerFingerprintValue') while using regression.

Error in c(1e-05/nx, 0.001/nx, 1/nx, ) : argument 4 is empty

The list of gamma values to be sampled ends with a comma rather than a value

Remove comma at the end or add value
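
If the gamma values come from a parameter typed by hand, the trailing comma can also be tolerated when the list is parsed. A small Python sketch; `parse_gamma_list` is a made-up helper and the input string is my guess at what was typed to trigger the error above:

```python
def parse_gamma_list(text):
    """Split a comma-separated list of gamma values, ignoring empty entries.

    An entry is empty when the list ends with a comma or contains two commas
    in a row; anything else that is not a number still raises ValueError.
    """
    parts = [part.strip() for part in text.split(",")]
    return [float(part) for part in parts if part]

# Hypothetical input with the offending trailing comma.
gammas = parse_gamma_list("1e-05, 0.001, 1,")
```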

Error in svm.default(x, y, scale = scale, ..., na.action = na.action) : NA/NaN/Inf in foreign function call (arg 4) Calls: doCV ... modelfunc -> svm -> svm.formula -> svm.default -> .C

Property to learn from non-numeric

‘Inf’ rather than numeric
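
Values like these are easiest to catch before the learner is started. The Python sketch below lists the records whose property to learn is missing, non-numeric, or not finite; the records are plain dictionaries and both the function name `find_bad_values` and the property name 'pIC50' are made up for the example.

```python
import math

def find_bad_values(records, prop):
    """Return (index, value) pairs where `prop` is missing, non-numeric,
    or not a finite number (Inf, -Inf, NaN)."""
    bad = []
    for i, rec in enumerate(records):
        value = rec.get(prop)
        try:
            number = float(value)
        except (TypeError, ValueError):
            # None (property missing) or text such as 'n/a'
            bad.append((i, value))
            continue
        if not math.isfinite(number):
            bad.append((i, value))
    return bad

# Hypothetical stream with one good and three bad records.
problems = find_bad_values(
    [{"pIC50": 6.2}, {"pIC50": "Inf"}, {"pIC50": "n/a"}, {}], "pIC50"
)
```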

Error in withCallingHandlers(expr, warning = function(w) invokeRestart("muffleWarning")) : invalid multibyte string at '<b2>II' (or at '<a0>hydra) Calls: readxy -> cleandata -> FactorOrNumber Execution halted

Array property present formatted as blabla[1], blabla[2], etc.

Flatten to single properties (e.g. create one property per value in the array property, named after that value, and turn its binary flag on).
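
The flattening can be sketched as follows in Python, with records as dictionaries and the array property as a list. This illustrates the idea only, it is not a Pipeline Pilot component; the function name `flatten_array_property` and the property names are made up.

```python
def flatten_array_property(records, prop):
    """Replace the array property `prop` by one 0/1 property per value.

    New properties are named '<prop>_<value>'. Every record gets every flag,
    so all records end up with the same set of properties.
    """
    vocabulary = sorted({value for rec in records for value in rec.get(prop, [])})
    flattened = []
    for rec in records:
        present = set(rec.get(prop, []))
        flat = {key: val for key, val in rec.items() if key != prop}
        for value in vocabulary:
            flat[f"{prop}_{value}"] = 1 if value in present else 0
        flattened.append(flat)
    return flattened

# Hypothetical records with an array property called 'target'.
flat_records = flatten_array_property(
    [{"id": 1, "target": ["A", "B"]}, {"id": 2, "target": ["B"]}], "target"
)
```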

Related to decision tree forests as performed in the “randomForest” package:

Error in randomForest.default(xy[-1], y, ntree = 500, mtry = mtry, importance = imp) : NA/NaN/Inf in foreign function call (arg 2) Calls: randomForest -> randomForest.default -> .C

Property to learn from non-numeric

‘Inf’ rather than numeric

Error in randomForest.default(xy[-1], y, ntree = 70, mtry = mtry, importance = imp, :
NA not permitted in predictors

Property to learn from is numeric when classifying

Error in comps[c1, c2] <- round(roc12, digits = 4) : replacement has length zero Calls: print -> genroc

One of the classes might be present only once, making out-of-bag validation impossible
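
Counting the class labels before learning shows this problem immediately. A minimal Python sketch; the function name `rare_classes` and the threshold of two records per class are my own choices for illustration:

```python
from collections import Counter

def rare_classes(labels, min_count=2):
    """Return the classes that occur fewer than `min_count` times, with their counts."""
    counts = Counter(labels)
    return {label: n for label, n in counts.items() if n < min_count}

# Hypothetical label column with a single active.
too_rare = rare_classes(["active", "inactive", "inactive", "inactive"])
```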

Error in read.table(file = file, header = header, sep = sep, quote = quote, :
empty beginning of file Calls: readxy -> read.csv -> read.table

Property to learn from is missing from the data

Possibly removed using keep / remove properties

Error in `rownames<-`(`*tmp*`, value = row.names(x)) : attempt to set rownames on object with no dimensions Calls: randomForest ... randomForest.default -> is.na -> is.na.data.frame -> rownames<-

One of the observations has an incomplete set of variables; one or more descriptors are missing (n/a)

Error in predict.randomForest(model, x, type = "response") : New factor levels not present in the training data Calls: predict -> predict.randomForest

One of the observations has a level for one of the variables that was not observed in the training set (e.g. present in the training set: 0, 1, 2, 3; value in the test set: 6)

Make sure each level is seen in the training set
Alternatively, use continuous variables rather than categorical ones to describe the data points
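
To find out which level is the culprit, the levels in the test set can be compared with those in the training set before predicting. A Python sketch with records as dictionaries; the function name `unseen_levels` and the property name 'ring_count' are made up:

```python
def unseen_levels(train_records, test_records, categorical_props):
    """Map property name -> sorted levels that occur in the test set only.

    Assumes the levels of one property are mutually comparable (all numbers
    or all text), otherwise the sorting fails.
    """
    unseen = {}
    for prop in categorical_props:
        train_levels = {rec[prop] for rec in train_records if prop in rec}
        test_levels = {rec[prop] for rec in test_records if prop in rec}
        new_levels = test_levels - train_levels
        if new_levels:
            unseen[prop] = sorted(new_levels)
    return unseen

# Hypothetical case matching the example above: 0-3 in training, 6 in the test set.
train = [{"ring_count": level} for level in (0, 1, 2, 3)]
test = [{"ring_count": 2}, {"ring_count": 6}]
culprits = unseen_levels(train, test, ["ring_count"])
```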

Error in withCallingHandlers(expr, warning = function(w) invokeRestart("muffleWarning")) : invalid multibyte string at '<b2>II' (or at '<a0>hydra) Calls: readxy -> cleandata -> FactorOrNumber Execution halted

Array property present formatted as blabla[1], blabla[2], etc.

Flatten to single properties (e.g. create one property per value in the array property, named after that value, and turn its binary flag on).

I hope this helps anyone who is stuck (and that this page gets indexed by Google, although that probably isn't the case).

Gerard JP van Westen

Monday, 24 October 2011

Component for calculating the R^2 zero with PP

Thursday, 20 October 2011

R-Statistics Error messages in Pipeline Pilot