Monday 24 October 2011

Component for calculating the R^2 zero with PP

(I am seeing about 5-10 views a day on the Pipeline Pilot pages, please be so kind to acknowledge / cite my blog when you use these tools and guides)

Why would we want such a thing?
While using PP, I found it inconvenient that there was no component to calculate the correlation coefficient between two properties present in the data stream (for instance, when performing external validation of a model).

Therefore I have written a component to do just that. One feature I find useful is the option to include both an upper and a lower error margin line, which allows a quick visual inspection of your model's reliability.

The latest version (8.5) does include a component called "Regression Model Evaluation Viewer", which calculates an RMSE and R², but this component has some downsides:
  1. The component calculates the modeled values internally, so it cannot be used to calculate the correlation between two sets of values obtained from external sources.
  2. The component only calculates the R² and RMSE, while a proper evaluation also requires R0² and the k-slope.

My component is available on my website and is compatible with PP 8.5 and up; it can be found here.

It has been tested with up to approx. 20,000 records and works fine. In addition, the parameters that it has in common with the 'Regression Model Evaluation Viewer' and 'R-statistics fit plots' come out identical.



So what does it do?
The component calculates correlation parameters according to Tropsha (2010) [1] between two properties present in the stream. These properties are defined as 'Activity' (Y-values) and 'Model' (X-values). Both have to be present in the stream, so in the case of a model the predictions need to be calculated beforehand. In addition, a scatter plot containing all values is output. Both the parameters and the plot are output as reporting items.

The following values are calculated:
  1. RMS Error (RMSE)
  2. R² (R2)
  3. R0² (R2_zero)
  4. R0²' (R2_zero_acc)
  5. k-Slope (Slope_K)
  6. k-Slope' (Slope_K_acc)
  7. % Difference between R² and R0² (Perc_Diff_R2_with_R2_zero)
  8. % Difference between R² and R0²' (Perc_Diff_R2_with_R2_zero_acc)
  9. Absolute difference between R0² and R0²' (Absolute_diff_R2_zero_and_R2_zero_acc)
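
The post does not spell out the formulas behind these values, so the following is only a sketch of how they could be computed, written in Python purely for illustration. It assumes a common reading of the regression-through-the-origin criteria (y = k·x and x = k'·y), and it assumes the two "% difference" values are reported as fractions of R², which matches the scale of the example output further down. The function name `correlation_parameters` is hypothetical and this is not the component's actual implementation.

```python
import math

def correlation_parameters(activity, model):
    """Sketch of the fit statistics; activity = y-values, model = x-values."""
    n = len(activity)
    if n != len(model) or n < 2:
        raise ValueError("need two equally long lists with at least 2 values")
    pairs = list(zip(model, activity))
    mean_x = sum(model) / n
    mean_y = sum(activity) / n
    sxx = sum((x - mean_x) ** 2 for x in model)
    syy = sum((y - mean_y) ** 2 for y in activity)
    if sxx == 0 or syy == 0:
        raise ValueError("both properties must vary")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)

    rmse = math.sqrt(sum((y - x) ** 2 for x, y in pairs) / n)
    r2 = sxy ** 2 / (sxx * syy)  # squared Pearson correlation

    # Regression through the origin in both directions (assumed definitions).
    sum_xy = sum(x * y for x, y in pairs)
    k = sum_xy / sum(x * x for x in model)          # y = k * x
    k_acc = sum_xy / sum(y * y for y in activity)   # x = k' * y
    r2_zero = 1 - sum((y - k * x) ** 2 for x, y in pairs) / syy
    r2_zero_acc = 1 - sum((x - k_acc * y) ** 2 for x, y in pairs) / sxx

    return {
        "RMSE": rmse,
        "R2": r2,
        "R2_zero": r2_zero,
        "R2_zero_acc": r2_zero_acc,
        "Slope_K": k,
        "Slope_K_acc": k_acc,
        "Perc_Diff_R2_with_R2_zero": (r2 - r2_zero) / r2,
        "Perc_Diff_R2_with_R2_zero_acc": (r2 - r2_zero_acc) / r2,
        "Absolute_diff_R2_zero_and_R2_zero_acc": abs(r2_zero - r2_zero_acc),
    }
```

Calling `correlation_parameters(activity_values, model_values)` on two equally long lists returns a dictionary keyed by the property names listed above.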

Additional Settings:
  • Under 'Plot Parameters', the variables for the x-y scatter plot can be defined, as well as the range of the upper and lower error lines.
    • 'Auto_range': when set to 'True', the axes are scaled automatically to the range of the data. When set to 'False' (default), a range can be entered manually for 'Activity' (y-values) and 'Model' (x-values); the default is 2.0 - 12.0.
    • 'Uncertainty' defines the margin between the line of unity and the uncertainty lines (default: 0.5 units from the line of unity).
    • If 'Uncertainty_in_plot' is set to 'True' (default), two lines indicating the lower and upper error margins are drawn in the plot.
  • If 'Output_Records' is set to 'True', all records are output unchanged to the 'Fail' port, while the plot and correlation parameters are output to the 'Pass' port.
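
The uncertainty lines are a visual aid, but the same idea can be put into a number. The sketch below (Python, for illustration only) counts the fraction of records that fall between the two lines, i.e. within the 'Uncertainty' margin of the line of unity. The function name `within_uncertainty` is hypothetical and this count is not something the component itself reports.

```python
def within_uncertainty(activity, model, uncertainty=0.5):
    """Fraction of records with |activity - model| <= uncertainty (line of unity: y = x)."""
    if len(activity) != len(model) or not activity:
        raise ValueError("need two equally long, non-empty lists")
    inside = sum(1 for y, x in zip(activity, model) if abs(y - x) <= uncertainty)
    return inside / len(activity)
```
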
The example below was made with the example protocol "08 Calculate logP using the R_logP_SVM Model", listed under Examples/R Statistics/Learning and Clustering/R Learn Models...

| RMSE | R2_zero | R2 | R2_zero_acc | Slope_K_acc | Slope_K | Perc_Diff_R2_with_R2_zero | Perc_Diff_R2_with_R2_zero_acc | Absolute_diff_R2_zero_and_R2_zero_acc |
|---|---|---|---|---|---|---|---|---|
| 0.679 | 0.839 | 0.839 | 0.827 | 0.997 | 0.928 | 0.000 | 0.015 | 0.012 |

If for some reason you are having trouble with the component, please contact me!



  1. Tropsha, A. (2010). Predictive Quantitative Structure-Activity Relationships Modeling. In: J. Faulon and A. Bender (eds.), Handbook of Chemoinformatics Algorithms.

Thursday 20 October 2011

R-Statistics Error messages in Pipeline Pilot

Updated!

Over the last few years I have been using R to create my models. However, the interface running on top of R (doing the data shaping and fingerprint folding) was Pipeline Pilot. This works quite nicely and efficiently (one could think of better solutions, but for my work this setup suffices).



When there are errors in your data, though, things go wrong. Not all error messages are as intuitive as you would like, and the Pipeline Pilot help can't really help here either. So over the last few years I have kept a list of error messages and what they mean in practice. I have listed it here so that anyone else struggling with an unknown error might find it. It is also convenient for myself, as these things are retrieved quicker online than on network share xxx :).

The list is organised as follows: each top-level bullet is the actual error message received (trimmed), the bullets indented beneath it give a possible cause, and the bullets indented one level further give a solution.

Related to SVM as performed in the “e1071” package:
  • Error in svm.default(x, y, scale = scale, ..., na.action = na.action) : dependent variable has to be of factor or integer type for classification mode. Calls: doCV -> modelfunc -> svm -> svm.formula -> svm.default
    • Fingerprint properties are not recognized as fingerprints
      • Set the property type of the properties to learn from to “fingerprint” (like 'SciTegic.value.IntegerFingerprintValue')
      • Set the option 'convert fingerprints' to “Fixed-Length array of bits”
      • Possibly, array properties (multiple values for one property) are present as the result of a merge

  • Error in …. Subscript out of bounds
    • The property to learn is incorrect
      • Two values are present in one property where there should be one
      • Only actives are present
    • No properties are present to learn from
      • Possibly removed through 'ignore properties'

  • Empty beginning of file
    • The property to learn is incorrect.
      • Either not present in the stream
      • The name is misspelled

  • Missing properties in file
    • Problem with the fingerprints that are being input into a learned model.
      • The ‘change fingerprints to fixed length bit size’ step is executed incorrectly
      • This specific property is missing
      • The property type has not been set to fingerprint ('SciTegic.value.IntegerFingerprintValue')

  • "Error in svd(x, nu = 0) : 0 extent dimensions"
    • When performing a PCA, (multiple) properties are not considered to be numeric.
      • Decimal comma instead of dot

  • “Error in svm.default(x,y,scale,…..): C <= 0!”
    • The allocation of a cost value is incorrect.
      • Decimal comma instead of dot
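
Both of the errors above come down to a number written with a decimal comma. A minimal sketch of a pre-check in Python (for illustration only; `to_number` is a hypothetical helper, not part of Pipeline Pilot or R) that accepts a decimal comma and converts it before the value is handed on:

```python
def to_number(text):
    """Parse a numeric string that may use a decimal comma instead of a dot.

    Raises ValueError when the text is still not a number after the swap,
    e.g. when it also contains a thousands separator ("1.234,5").
    """
    cleaned = text.strip().replace(",", ".")
    return float(cleaned)
```

For example, `to_number("0,5")` returns `0.5`, so a cost value typed as "0,5" no longer ends up being read as something that is not a positive number.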

  • “Error in matrix(ret$dec, nrow = nrow(newdata), byrow = TRUE, dimnames = list(rowns,  : matrix: invalid 'ncol' value (< 0) Execution halted”
    • Properties to learn from are defined incorrectly
      • “allpropertiesonfirstdata” is selected instead of “user set” while not all properties are present in all records
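
To find out whether this applies to your data, it helps to list the properties that are missing from at least one record. A minimal sketch in Python, assuming the records have been exported as a list of dictionaries (property name to value); `incomplete_properties` is a hypothetical helper name:

```python
def incomplete_properties(records):
    """Return the sorted names of properties that are absent from at least one record."""
    all_names = set().union(*(r.keys() for r in records))
    return sorted(n for n in all_names if any(n not in r for r in records))
```

An empty result means every property is present in every record.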

  • Error in svm.default(x, y, scale = scale, ..., na.action = na.action) : Need numeric dependent variable for regression. In addition: Warning message: data length exceeds size of matrix
    • Property to learn from contains non-numeric characters
    • Continuous model selected for classification data

  • Error in cor(preds[[1]], preds[[2]], method = "pearson") : missing observations in cov/cor. In addition: Warning messages: 1-5: data length exceeds size of matrix
    • Non numeric properties are used to learn from in regression.
      • Use ‘IgnoreProperties’ to exclude non numeric properties
    • Possibly, the property type should be changed to 'SciTegic.value.IntegerFingerprintValue' while using regression.

  • Error in c(1e-05/nx, 0.001/nx, 1/nx, ) : argument 4 is empty
    • The list of gamma values to be sampled ends with a comma rather than a value
      • Remove the comma at the end or add a value
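
A stray comma like this is easy to catch before the list is handed to R. A minimal sketch in Python (illustration only; `parse_value_list` is a hypothetical helper) that splits the comma-separated text and reports the 1-based positions of empty entries, which is how R counts the empty argument in the message above:

```python
def parse_value_list(text):
    """Split a comma-separated list of values and reject empty entries."""
    parts = [p.strip() for p in text.split(",")]
    empty = [i + 1 for i, p in enumerate(parts) if p == ""]
    if empty:
        raise ValueError(f"empty entries at positions {empty}")
    return parts
```

The entries are kept as text because they may be expressions such as `1e-05/nx` rather than plain numbers.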

  • Error in svm.default(x, y, scale = scale, ..., na.action = na.action) : NA/NaN/Inf in foreign function call (arg 4) Calls: doCV ... modelfunc -> svm -> svm.formula -> svm.default -> .C
    • Property to learn from is non-numeric
      • ‘Inf’ rather than a numeric value
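
Offending values of this kind can be located up front. A minimal sketch in Python, assuming the values of one property have been exported as a list (illustration only; `find_non_finite` is a hypothetical helper name):

```python
import math

def find_non_finite(values):
    """Return the indices of values that are missing, non-numeric, NaN or infinite."""
    bad = []
    for i, value in enumerate(values):
        try:
            ok = math.isfinite(float(value))
        except (TypeError, ValueError):
            # None and text such as "abc" cannot be converted at all.
            ok = False
        if not ok:
            bad.append(i)
    return bad
```
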

  • Error in withCallingHandlers(expr, warning = function(w) invokeRestart("muffleWarning")) : invalid multibyte string at '<b2>II' (or at '<a0>hydra) Calls: readxy -> cleandata -> FactorOrNumber Execution halted
    • Array property present formatted as blabla[1], blabla[2], etc.
      • Flatten to single properties (e.g. turn on a binary flag for each value present in the array property, with the flag properties named after the values).
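
What such a flattening could look like is sketched below in Python, assuming the records are available as dictionaries in which the array property holds a list of values. This is an illustration of the idea only, not the Pipeline Pilot component that does it; `flatten_array_property` and the `name_value` naming scheme are hypothetical.

```python
def flatten_array_property(records, name):
    """Replace the list-valued property `name` by one 0/1 flag per distinct value."""
    values = sorted({v for r in records for v in r.get(name, [])})
    flattened = []
    for record in records:
        flat = {k: v for k, v in record.items() if k != name}
        present = set(record.get(name, []))
        for value in values:
            flat[f"{name}_{value}"] = 1 if value in present else 0
        flattened.append(flat)
    return flattened
```

Every record ends up with the same set of single-valued flag properties, so nothing of the form blabla[1], blabla[2] is left in the stream.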


Related to decision tree forests as performed in the “randomForest” package:
  • Error in randomForest.default(xy[-1], y, ntree = 500, mtry = mtry, importance = imp) : NA/NaN/Inf in foreign function call (arg 2) Calls: randomForest -> randomForest.default -> .C
    • Property to learn from is non-numeric
      • ‘Inf’ rather than a numeric value

  • Error in randomForest.default(xy[-1], y, ntree = 70, mtry = mtry, importance = imp,  : NA not permitted in predictors
    • Property to learn from is numeric when classifying

  • Error in comps[c1, c2] <- round(roc12, digits = 4) : replacement has length zero Calls: print -> genroc
    • One of the classes might be present only once, making out-of-bag validation impossible
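
A quick count of the class labels shows whether this is the case. A minimal sketch in Python (illustration only; `rare_classes` is a hypothetical helper and the threshold of two is an assumption based on the cause described above):

```python
from collections import Counter

def rare_classes(labels, minimum=2):
    """Return the sorted class labels that occur fewer than `minimum` times."""
    counts = Counter(labels)
    return sorted(label for label, n in counts.items() if n < minimum)
```
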

  • Error in read.table(file = file, header = header, sep = sep, quote = quote,  : empty beginning of file Calls: readxy -> read.csv -> read.table
    • Property to learn from is missing from the data
      • Possibly removed using keep / remove properties


  • Error in `rownames<-`(`*tmp*`, value = row.names(x)) : attempt to set rownames on object with no dimensions Calls: randomForest ... randomForest.default -> is.na -> is.na.data.frame -> rownames<-
    • One of the observations has an incomplete set of variables; one or more descriptors are missing (n/a)

  • Error in predict.randomForest(model, x, type = "response") : New factor levels not present in the training data Calls: predict -> predict.randomForest
    • One of the observations has a level for one of the variables that was not observed in the training set (e.g. present in the training set: 0, 1, 2, 3; value in the test set: 6)
      • Make sure each level is seen in the training set
      • Alternatively, use continuous variables to describe the data points rather than categorical ones

  • Error in withCallingHandlers(expr, warning = function(w) invokeRestart("muffleWarning")) : invalid multibyte string at '<b2>II' (or at '<a0>hydra) Calls: readxy -> cleandata -> FactorOrNumber Execution halted
    • Array property present formatted as blabla[1], blabla[2], etc.
      • Flatten to single properties (e.g. turn on a binary flag for each value present in the array property, with the flag properties named after the values).
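
For the "New factor levels not present in the training data" error in the list above, the offending levels can be found by comparing the two sets before predicting. A minimal sketch in Python, assuming training and test records are available as lists of dictionaries (illustration only; `unseen_levels` is a hypothetical helper name):

```python
def unseen_levels(train, test, variables):
    """Map each variable to the sorted levels that occur in `test` but never in `train`."""
    result = {}
    for var in variables:
        seen = {r[var] for r in train if var in r}
        new = {r[var] for r in test if var in r} - seen
        if new:
            result[var] = sorted(new)
    return result
```

With training levels 0, 1, 2, 3 and a test value of 6, as in the example given with that error, the function reports 6 for that variable.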

I hope this helps anyone who is stuck (and that this page gets indexed by Google, although that probably isn't the case).