Monday, 24 October 2011

Component for calculating the R^2 zero with PP

(I am seeing about 5-10 views a day on the Pipeline Pilot pages. Please be so kind as to acknowledge or cite my blog when you use these tools and guides.)

Why would we want such a thing?
During the time I have been using PP, I found it inconvenient that there was no component to calculate the correlation coefficient between two properties present in the data stream (for instance when performing external validation of a model).

Therefore I have written a component to do just that. One of the features I find useful is the option to include both an upper and a lower error margin line, which allows a quick visual inspection of your model's reliability.

While in the latest version (8.5) there is a component called "Regression Model Evaluation Viewer" which calculates an RMSE and R2, this component has some downsides.
  1. The component calculates the modeled values internally, so it cannot be used to calculate the correlation between two sets of values obtained from external sources.
  2. The component only calculates the R2 and RMSE, while for a proper evaluation R02 and k-slope are also required.

My component is available on my website and is compatible with PP 8.5 and up; it can be found here.

It has been tested with up to approx. 20,000 records and works fine. In addition, the parameters it shares with the 'Regression Model Evaluation Viewer' and 'R-statistics fit plots' come out identical to the values those components report.



So what does it do?
The component calculates correlation parameters according to Tropsha (2010) [1] between two properties present in the stream. These properties are defined as 'Activity' (Y-values) and 'Model' (X-values). They have to be present in the stream and therefore need to be pre-calculated in the case of a model. In addition, a scatter plot containing all values is output. Both the parameters and the plot are output as reporting items.

The following values are calculated:
  1. RMS Error (RMSE)
  2. R2 (R2)
  3. R02 (R2_zero)
  4. R02' (R2_zero_acc)
  5. k-Slope (Slope_K)
  6. k-Slope ' (Slope_K_acc)
  7. % Difference between R2 and R02 (Perc_Diff_R2_with_R2_zero)
  8. % Difference between R2 and R02' (Perc_Diff_R2_with_R2_zero_acc)
  9. Absolute difference between R02 and R02' (Absolute_diff_R2_zero_and_R2_zero_acc)
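
For readers who want to reproduce these numbers outside Pipeline Pilot, the nine values can be sketched in plain Python as below. This is not the component's own script, which the post does not show. The formulas are an assumption: one common reading of the regression-through-origin statistics, with k = sum(xy)/sum(x^2), k' = sum(xy)/sum(y^2), R02 = 1 - sum((y - k*x)^2)/sum((y - mean(y))^2) and the mirrored expression for R02'. The "% difference" values are assumed to be the fractions (R2 - R02)/R2. The function name `correlation_parameters` is hypothetical; only the dictionary keys are taken from the list above.

```python
import math


def correlation_parameters(activity, model):
    """Sketch of the nine statistics; 'activity' = Y-values, 'model' = X-values.

    Assumed formulas (not confirmed by the original post), see lead-in text.
    """
    n = len(activity)
    if n != len(model) or n < 2:
        raise ValueError("need two equally long lists with at least 2 values")

    mean_y = sum(activity) / n
    mean_x = sum(model) / n

    # Centered sums of squares and cross products (for the Pearson R2).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(model, activity))
    sxx = sum((x - mean_x) ** 2 for x in model)
    syy = sum((y - mean_y) ** 2 for y in activity)
    r2 = sxy ** 2 / (sxx * syy)

    # Slopes of the two regressions forced through the origin.
    sum_xy = sum(x * y for x, y in zip(model, activity))
    k = sum_xy / sum(x * x for x in model)          # activity = k * model
    k_acc = sum_xy / sum(y * y for y in activity)   # model = k' * activity

    # Coefficients of determination for the through-origin fits.
    r2_zero = 1 - sum((y - k * x) ** 2 for x, y in zip(model, activity)) / syy
    r2_zero_acc = 1 - sum((x - k_acc * y) ** 2 for x, y in zip(model, activity)) / sxx

    rmse = math.sqrt(sum((y - x) ** 2 for x, y in zip(model, activity)) / n)

    return {
        "RMSE": rmse,
        "R2": r2,
        "R2_zero": r2_zero,
        "R2_zero_acc": r2_zero_acc,
        "Slope_K": k,
        "Slope_K_acc": k_acc,
        "Perc_Diff_R2_with_R2_zero": (r2 - r2_zero) / r2,
        "Perc_Diff_R2_with_R2_zero_acc": (r2 - r2_zero_acc) / r2,
        "Absolute_diff_R2_zero_and_R2_zero_acc": abs(r2_zero - r2_zero_acc),
    }
```

The sketch does not guard against constant input columns (which would give a division by zero), and its results may differ from the component's if the component uses a different variant of the formulas.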

Additional Settings:
  • Under 'Plot Parameters', variables for the x-y scatter plot can be defined, including the distance of the upper and lower error lines from the line of unity.
    • 'Auto_range': when set to 'True', the axis scales are derived automatically from the data. When set to 'False' (default), a range can be entered manually for 'Activity' (y-values) and 'Model' (x-values); the default is 2.0 - 12.0.
    • 'Uncertainty' defines the margin between the line of unity and the uncertainty lines (default: 0.5 units).
    • If 'Uncertainty_in_plot' is set to 'True' (default), two lines marking the lower and upper error margins are drawn in the plot.
  • If 'Output_Records' is set to 'True', all records are output unchanged to the 'Fail' port, while the plot and correlation parameters are output to the 'Pass' port.
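
To make the geometry of the error lines concrete, here is a small Python sketch. It is an illustration only, not part of the component: the function names `uncertainty_lines` and `fraction_within` are hypothetical, and the fraction of points inside the band is my own numeric counterpart to the visual inspection, not something the component reports. The defaults mirror the settings listed above (range 2.0 - 12.0, uncertainty 0.5).

```python
def uncertainty_lines(axis_min=2.0, axis_max=12.0, uncertainty=0.5):
    """Return (start, end) points, as (x, y) tuples, of the three plot lines."""
    return {
        # Line of unity: Activity == Model.
        "unity": ((axis_min, axis_min), (axis_max, axis_max)),
        # Upper and lower error lines, shifted vertically by the margin.
        "upper": ((axis_min, axis_min + uncertainty), (axis_max, axis_max + uncertainty)),
        "lower": ((axis_min, axis_min - uncertainty), (axis_max, axis_max - uncertainty)),
    }


def fraction_within(activity, model, uncertainty=0.5):
    """Fraction of records whose point lies on or between the two error lines."""
    if len(activity) != len(model) or not activity:
        raise ValueError("need two equally long, non-empty lists")
    inside = sum(1 for y, x in zip(activity, model) if abs(y - x) <= uncertainty)
    return inside / len(activity)
```

For example, `fraction_within([5.0, 6.0, 7.0], [5.2, 6.6, 7.0])` counts two of the three points as inside the default band, because the second point deviates by 0.6.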
The example output below was generated with the example protocol "08 Calculate logP using the R_logP_SVM Model", listed under Examples/R Statistics/Learning and Clustering/R Learn Models...


Parameter                                Value
RMSE                                     0.679
R2_zero                                  0.839
R2                                       0.839
R2_zero_acc                              0.827
Slope_K_acc                              0.997
Slope_K                                  0.928
Perc_Diff_R2_with_R2_zero                0.000
Perc_Diff_R2_with_R2_zero_acc            0.015
Absolute_diff_R2_zero_and_R2_zero_acc    0.012

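
As a quick sanity check on the values reported above, the three difference columns can be recomputed from the rounded R2 values. This is a sketch under the assumption that the "% difference" columns are the fractions (R2 - R02)/R2 and (R2 - R02')/R2; the post does not state the exact formula.

```python
# Rounded values taken from the example output above.
r2 = 0.839
r2_zero = 0.839
r2_zero_acc = 0.827

# Assumed definitions of the three difference columns.
perc_diff_zero = (r2 - r2_zero) / r2
perc_diff_zero_acc = (r2 - r2_zero_acc) / r2
abs_diff = abs(r2_zero - r2_zero_acc)

print(round(perc_diff_zero, 3), round(perc_diff_zero_acc, 3), round(abs_diff, 3))
```

The first and third values match the reported 0.000 and 0.012. The second comes out at roughly 0.014 rather than the reported 0.015, a gap that would be expected if the component works from unrounded values.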
If for some reason you are having trouble with the component, please contact me!



  1. Tropsha, A. (2010). Predictive Quantitative Structure-Activity Relationships Modeling. In: J. Faulon and A. Bender (eds.), Handbook of Chemoinformatics Algorithms.
