Thursday 10 November 2011

Response to 'Liegen doen we allemaal, en wel voortdurend' ('We all lie, and we do so constantly'), published in Trouw

For the original article, see here.

Dear Asha,

I must say I was highly surprised by the tone you strike in this letter. A tone I have also encountered from other people in your field, even on national television. Reading your text as a scientist, I am inclined to think that some of the people working in your field do not regard it as a science. You demonstrate here a serious lack of scientific ethics, and even contempt for those who do possess them!

You open your argument with a claim that equates the little white lie in a passing social interaction with a scientific publication, based purely on the fact that both are produced by human beings. In other words, you claim that in a professional setting we may expect no more of people than the level of amateurs gossiping during the coffee break. To extend your analogy: suppose someone paints his room and uses the wrong, poorly covering paint. By that reasoning we should expect the same from professional painters and find it not at all strange, because in both cases we are dealing with human beings.

You certainly had my attention, and I was curious what you would put forward to support this claim.

You then claim that 70% (!!!!) of psychologists lie about their data. Without citing a source, you brand the lion's share of your field as liars! Let me refer you to this page [1] of the KNAW concerning scientific integrity. Realise that a scientist possesses expertise in a particular area, and for that reason has a duty to act ethically. As an expert you are in a position to mislead people easily, for the simple reason that you hold a position of authority. That you excuse proven, ethically reprehensible behaviour by smearing the majority of your own field is, of course, hardly commendable.

Science studies and explains the world across its various disciplines. It rests on three important pillars:
  1. The scientific method
  2. Publication of results
  3. Reproducibility of results
The core of the scientific method is seeking answers to research questions. It is important to first formulate a question and then investigate it systematically, because this prevents you from seeing spurious patterns in your data.

Secondly, it is important to publish your data, in order to inform your peers of your insights so that the field as a whole makes progress. New results will raise new questions, which are investigated in turn.

Thirdly, reproducibility of your findings is very important. One can safely state that if something is not reproducible, it is not an established fact. After all, if you cannot reproduce something under identical conditions, something is fundamentally different and your hypothesis does not explain your results. In other words, if something is not reproducible, you have overlooked something.

This implies that a publication requires you to include your raw data. How else can your peers reproduce your results, and how else can they interpret them and use them for their own hypotheses?

"In een vakgebied waar de weg naar succes geplaveid dient te worden met publicaties in topbladen als Nature en Science, en die bladen alleen papers accepteren met ronkende resultaten, is het niet verwonderlijk dat je als onderzoeker de werkelijkheid zo nu en dan een handje wilt helpen."

How can anyone publish fabricated results and watch, without remorse, the careers of other scientists come to rest on them? If you fabricate data, new PhD students and peers become its victims, something that manifests itself in the Stapel affair. How is someone ever to build a name when he or she has only published with Stapel? How is such a person ever to practise the profession again?

The core of science is studying and explaining reality; "..occasionally giving reality a helping hand.." is not science. Anyone who does so is, by definition, not practising science.

"De wetenschap is het meest gebaat bij een cultuur waarin onderzoekers niet worden afgestraft voor hun mislukkingen maar juist worden aangemoedigd om er openhartig over te zijn. Dat bereiken we niet door degenen die fouten maken aan de schandpaal te nagelen."

There is a big difference between failures (that is, rejected hypotheses) and deliberate lying and deception. Negative results are indeed results too, but this person did not publish his negative results; he concealed them. He faked results, and in doing so did not advance science but inflicted a great deal of damage! If you think he is being pilloried for failing to produce results, you are missing the point.

Finally, I would like to point you to 'Good Clinical Practice' (GCP) [2]. GCP is the standard that clinical research has to meet; it covers all the rules you mention and more. Perhaps it would be good to apply the rules of GCP in (social) psychology as well. Bear in mind that this is already mandatory in part of psychology, namely the part charged with research into interventions and treatment methods. It would enforce a minimum standard for studies and rid a troubled field of doubt.

[1] Theme page on Scientific Integrity, KNAW website (www.knaw.nl), accessed 10-11-2011
[2] International Conference on Harmonisation Topic E6 (R1), Guideline for Good Clinical Practice, European Medicines Agency website (www.emea.europa.eu), accessed 10-11-2011

Friday 4 November 2011

Component to calculate Matthews Correlation, Sensitivity, Specificity, PPV and NPV with PP

(I am seeing about 5-10 views a day on the Pipeline Pilot pages; please be so kind as to acknowledge / cite my blog when you use these tools and guides)

Why would we want such a thing?
As with the regression validation parameters, I found that PP lacked a component to calculate correlation parameters between two properties in the data stream for classification.

Therefore I have written a component to do just that. One of the features I find useful is the option to include a bar chart that displays the values of the calculated properties on a scale between 0 and 1. This allows a quick visual inspection of your model reliability. When applied to the 'KNN classification of Estrogen Antagonists' from the example protocols, it looks like this:


In addition it outputs the parameters in a shaded table: 


The component calculates these parameters between two properties. Therefore, when using it for external validation of a model, the modeled values have to be pre-calculated.
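
For reference, all five statistics are simple functions of the confusion-matrix counts. A minimal R sketch of the underlying definitions (standalone; the names are placeholders, not the component's internals):

  # Minimal sketch of the reported statistics from confusion-matrix counts.
  # 'measured' and 'predicted' are placeholder vectors holding the two class labels.
  classification_stats <- function(measured, predicted, positive) {
    tp <- sum(measured == positive & predicted == positive)   # true positives
    tn <- sum(measured != positive & predicted != positive)   # true negatives
    fp <- sum(measured != positive & predicted == positive)   # false positives
    fn <- sum(measured == positive & predicted != positive)   # false negatives
    mcc <- (tp * tn - fp * fn) /
      sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))     # Matthews correlation
    c(MCC = mcc,
      Sensitivity = tp / (tp + fn),
      Specificity = tn / (tn + fp),
      PPV = tp / (tp + fp),
      NPV = tn / (tn + fn))
  }

Note that the Matthews correlation itself runs from -1 to 1, while the other four statistics run from 0 to 1.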

The component requires that you give the names of the properties carrying the measured value, the modeled value and the classes that were modeled. Currently it can only be used for two-class classification. In addition, you can choose to also output the original, unmodified records through the fail port, while the correlation plot and table are output through the pass port. I have not written a 'how to use' in the help, but will do so next week. In the meantime, the component can be found 

here.

Monday 24 October 2011

Component for calculating the R^2 zero with PP

(I am seeing about 5-10 views a day on the Pipeline Pilot pages; please be so kind as to acknowledge / cite my blog when you use these tools and guides)

Why would we want such a thing?
During the time I have been using PP, I found it inconvenient that there was no component to calculate the correlation coefficient between two properties present in the data stream (for instance, when performing external validation of a model).

Therefore I have written a component to do just that. One of the features I find useful is the option to include both an upper and a lower error-margin line, allowing a quick visual inspection of your model's reliability.

While in the latest version (8.5) there is a component called "Regression Model Evaluation Viewer" which calculates an RMSE and R2, this component has some downsides.
  1. The component calculates the modeled values internally, so it cannot be used to calculate the correlation between two sets of values obtained from external sources.
  2. The component only calculates the R2 and RMSE, while for a proper evaluation R02 and k-slope are also required.

My component is on my website and compatible with PP 8.5 and up; it can be found 

here.

It has been tested up to a maximum of approx. 20,000 records and works fine. In addition, the parameters it reports match those calculated by the 'Regression Model Evaluation Viewer' and 'R-statistics fit plots' components.



So what does it do?
The component calculates correlation parameters according to Tropsha (2010) [1] between two properties present in the stream. These properties are defined as 'Activity' (Y-values) and 'Model' (X-values). These have to be present in the stream and therefore need to be pre-calculated in the case of a model. In addition, a scatter plot containing all values is output. Both the parameters and the plot are output as reporting items.
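
For clarity, here is a minimal R sketch of the main formulas as I read them in Tropsha (2010), with y holding the 'Activity' values and yhat the 'Model' values. This illustrates the definitions; it is not the component's internal code:

  # Sketch of the core validation parameters (y = Activity, yhat = Model).
  validation_params <- function(y, yhat) {
    rmse    <- sqrt(mean((y - yhat)^2))            # RMS error
    r2      <- cor(y, yhat)^2                      # squared Pearson correlation
    k       <- sum(y * yhat) / sum(yhat^2)         # slope of regression through the origin
    r2_zero <- 1 - sum((y - k * yhat)^2) / sum((y - mean(y))^2)
    c(RMSE = rmse, R2 = r2, Slope_K = k, R2_zero = r2_zero,
      Perc_Diff_R2_with_R2_zero = (r2 - r2_zero) / r2)
  }

The accented variants (R2_zero_acc and Slope_K_acc) use the same formulas with the roles of y and yhat swapped.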

The following values are calculated:
  1. RMS Error (RMSE)
  2. R2 (R2)
  3. R02 (R2_zero)
  4. R02' (R2_zero_acc)
  5. k-Slope (Slope_K)
  6. k-Slope ' (Slope_K_acc)
  7. % Difference between R2 and R02 (Perc_Diff_R2_with_R2_zero)
  8. % Difference between R2 and R02' (Perc_Diff_R2_with_R2_zero_acc)
  9. Absolute difference between R02 and R02' (Absolute_diff_R2_zero_and_R2_zero_acc)

Additional Settings:
  • Under 'Plot Parameters', variables for the x-y scatter plot can be defined. Furthermore, the range of the upper and lower error lines can be set (default 0.5 from the line of unity).
    • 'Auto_range': when set to 'True', the axis scale is matched automatically to the range of the data; when set to 'False' (default), a range can be entered manually for 'Activity' (y-value) and 'Model' (x-value) (default is 2.0 - 12.0).
    • 'Uncertainty' defines the margin between the line of unity and the uncertainty lines (default 0.5 units away from line of unity).
    • If 'Uncertainty_in_plot' is set to 'True' (default) then two lines indicating a lower and upper error line are drawn in the plot.
  • If 'Output_Records' is set to 'True' all values are output unchanged to the 'Fail' port while the plot and correlation parameters are output to the 'Pass' port.
The examples were made with the example protocol "08 Calculate logP using the R_logP_SVM Model", listed under Examples/R Statistics/Learning and Clustering/R Learn Models...


RMSE: 0.679
R2_zero: 0.839
R2: 0.839
R2_zero_acc: 0.827
Slope_K_acc: 0.997
Slope_K: 0.928
Perc_Diff_R2_with_R2_zero: 0.000
Perc_Diff_R2_with_R2_zero_acc: 0.015
Absolute_diff_R2_zero_and_R2_zero_acc: 0.012
If for some reason you are having trouble with the component, please contact me!



  [1] Tropsha, A. (2010). Predictive Quantitative Structure-Activity Relationships Modeling. In: Faulon, J. and Bender, A. (eds.), Handbook of Chemoinformatics Algorithms.

Thursday 20 October 2011

R-Statistics Error messages in Pipeline Pilot

Updated!
(I am seeing about 5-10 views a day on the Pipeline Pilot pages; please be so kind as to acknowledge / cite my blog when you use these tools and guides).

Over the last years I have been using R to create my models. However, the interface running on top of R (doing the data shaping and fingerprint folding) was Pipeline Pilot. This works quite nicely and efficiently (one could think of better solutions, but for my work this setup suffices).



When there are errors in your data, though, things go wrong, and not all error messages are as intuitive as you would like. The Pipeline Pilot help can't really help here either, so over the last years I have kept a list of error codes and what they mean in practice. I have listed it here so that anyone else struggling with an unknown error might find it. It is also convenient for myself, as these things are retrieved more quickly online than on network share xxx :).

The organisation is as follows: the top-level bullet (in italics) is the actual error message received (trimmed), the second level gives a possible cause, and the third level a solution.

Related to SVM as performed in the “e1071” package:
  • Error in svm.default(x, y, scale = scale, ..., na.action = na.action) : dependent variable has to be of factor or integer type for classification mode. Calls: doCV -> modelfunc -> svm -> svm.formula -> svm.default
    • Fingerprint properties are not recognized as fingerprints
      • Set property type of properties to learn from to “fingerprint” (like 'SciTegic.value.IntegerFingerprintValue')
      • Set option convert fingerprints to “Fixed-Length array of bits”
      • Possibly, due to a merge, array properties are present (multiple values for one property)
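
On the R side, the usual trigger for this message is that the class labels reach svm() as plain text or numbers instead of a factor. A standalone e1071 sketch of the fix (toy data, not the script that PP generates):

  # e1071 classification needs a factor response; convert the labels first.
  library(e1071)
  x <- matrix(rnorm(40), ncol = 2)                 # toy descriptor matrix
  labels <- rep(c("active", "inactive"), 10)       # class labels as plain text
  y <- as.factor(labels)                           # the fix: make it a factor
  model <- svm(x, y, type = "C-classification")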

  • Error in …. Subscript out of bounds
    • The property to learn is incorrect
      • Two values present in one property where there should be one
      • Only actives are present
    • No properties present to learn from
      • Possibly excluded through 'Ignore Properties'

  • Empty beginning of file
    • The property to learn is incorrect.
      • Not present in the stream
      • Name misspelled

  • Missing properties in file
    • Problem with the fingerprints that are being input in a learned model.
      • The ‘change fingerprints to fixed length bit size’ step was executed incorrectly
      • This specific property is missing
      • Set property type to fingerprint has not been performed ('SciTegic.value.IntegerFingerprintValue')

  • "Error in svd(x, nu = 0) : 0 extent dimensions"
    • When performing a PCA, (multiple) properties are not considered to be numeric.
      • Decimal comma instead of dot
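
A quick R-side repair for the decimal-comma case, assuming the offending column arrives as text:

  # Replace decimal commas by dots before treating the column as numeric.
  raw_values <- c("1,25", "3,50", "0,75")
  num_values <- as.numeric(gsub(",", ".", raw_values, fixed = TRUE))
  num_values   # 1.25 3.50 0.75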

  • “Error in svm.default(x,y,scale,…..): C <= 0!”
    • The allocation of a cost value is incorrect.
      • Decimal comma instead of dot

  • “Error in matrix(ret$dec, nrow = nrow(newdata), byrow = TRUE, dimnames = list(rowns,  :   matrix: invalid 'ncol' value (< 0) Execution halted”
    • Properties to learn from defined incorrectly
      • “allpropertiesonfirstdata” instead of “user set” when not all properties are present in all records

  • Error in svm.default(x, y, scale = scale, ..., na.action = na.action) :           Need numeric dependent variable for regression. In addition: Warning message:data length exceeds size of matrix
    • Property to learn from contains non-numeric characters
    • Continuous model selected for classification data

  • Error in cor(preds[[1]], preds[[2]], method = "pearson") : missing observations in cov/cor. In addition: Warning messages: 1-5: data length exceeds size of matrix
    • Non numeric properties are used to learn from in regression.
      • Use ‘IgnoreProperties’ to exclude non numeric properties
    • Possibly, the property type should be changed to 'SciTegic.value.IntegerFingerprintValue' when using regression.
  • Error in c(1e-05/nx, 0.001/nx, 1/nx, ) : argument 4 is empty
    • The list of gamma values to be sampled ends with a comma rather than a value
      • Remove the trailing comma or add the missing value
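
In R terms the offending construct is simply a trailing comma inside c(), which leaves an empty argument:

  nx <- 1024
  # gamma <- c(1e-05/nx, 0.001/nx, 1/nx, )   # fails: argument 4 is empty
  gamma <- c(1e-05/nx, 0.001/nx, 1/nx)       # correct: no trailing comma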

  • Error in svm.default(x, y, scale = scale, ..., na.action = na.action) :  NA/NaN/Inf in foreign function call (arg 4) Calls: doCV ... modelfunc -> svm -> svm.formula -> svm.default -> .C
    • Property to learn from non-numeric
      • ‘Inf’ present rather than a numeric value

  • Error in withCallingHandlers(expr, warning = function(w) invokeRestart("muffleWarning")) : invalid multibyte string at '<b2>II' (or at '<a0>hydra) Calls: readxy -> cleandata -> FactorOrNumber Execution halted
    • Array property present formatted as blabla[1], blabla[2], etc.
      • Flatten to single properties (e.g. turn on the binary flag so that the properties contained in the array property are named by value)


Related to decision tree forests as performed in the “randomForest” package:
  • Error in randomForest.default(xy[-1], y, ntree = 500, mtry = mtry, importance = imp) :   NA/NaN/Inf in foreign function call (arg 2) Calls: randomForest -> randomForest.default -> .C
    • Property to learn from non-numeric
      • ‘Inf’ present rather than a numeric value

  • Error in randomForest.default(xy[-1], y, ntree = 70, mtry = mtry, importance = imp, : NA not permitted in predictors
    • The property to learn from is numeric when classifying

  • Error in comps[c1, c2] <- round(roc12, digits = 4) : replacement has length zero Calls: print -> genroc
    • One of the classes might be present only once, making out-of-bag validation impossible

  • Error in read.table(file = file, header = header, sep = sep, quote = quote, : empty beginning of file Calls: readxy -> read.csv -> read.table
    • Property to learn from is missing from the data
      • Possibly removed using keep / remove properties


  • Error in `rownames<-`(`*tmp*`, value = row.names(x)) :  attempt to set rownames on object with no dimensions  Calls: randomForest ... randomForest.default -> is.na -> is.na.data.frame -> rownames<-
    • One of the observations has an incomplete set of variables; one or more descriptors are missing (N/A) 
  • Error in predict.randomForest(model, x, type = "response") :  New factor levels not present in the training data  Calls: predict -> predict.randomForest
    • One of the observations has a level for one of the variables that was not observed in the training set (e.g. present in the training set: 0, 1, 2, 3; value in the test set: 6)
      • Make sure every level is seen in the training set (see the sketch after this list)
      • Alternatively, use continuous rather than categorical variables to describe the data points
  • Error in withCallingHandlers(expr, warning = function(w) invokeRestart("muffleWarning")) : invalid multibyte string at '<b2>II' (or at '<a0>hydra) Calls: readxy -> cleandata -> FactorOrNumber Execution halted
    • Array property present formatted as blabla[1], blabla[2], etc.
      • Flatten to single properties (e.g. turn on the binary flag so that the properties contained in the array property are named by value)
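
As a quick check for the 'new factor levels' error above, comparing the levels of the training and test sets works. A self-contained sketch mirroring the 0,1,2,3-versus-6 example:

  # Find factor levels in the test set that the training set never saw.
  train_cat <- factor(c("0", "1", "2", "3"))
  test_cat  <- factor(c("1", "6"))                       # "6" was never seen in training
  unseen <- setdiff(levels(test_cat), levels(train_cat))
  unseen                                                 # "6": predict() would fail on this level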

Hope this helps anyone who gets stuck (and that this page gets indexed by Google, though that probably isn't the case).

Saturday 17 September 2011

Setting up a Virtual Firewall using VMware

Last week I was setting up a virtual firewall to keep my home network safe. My old setup looked like this:



The server is actually a dual-core machine with 4 GB of memory; still, performance was a little slow at times. It turns out ISA might be responsible for that. So my annoyance with ISA Server hogging all resources was the main reason to find a better solution. (I think the SQL Server that gets installed with ISA interfered with the CSS dedicated server.) In addition, ISA is incompatible with IPv6, which is a deal breaker.

But there was another problem. I have a 120 Mbit connection and I often back up > 1 TB from my domain controller (which is also a file server) to my backup machine. So, as you can imagine, if I am downloading (since the server is also the gateway) while backing up, both processes slow down because they share the single LAN NIC. The best option would therefore be a solution using a single physical machine (hey, I do have to pay for the power I use) with 3 NICs: one as the WAN NIC, one as a LAN NIC for the firewall and one as a LAN NIC for the domain controller.

So I decided to go with a more elegant approach, but I did want to hold on to Server 2003 as a DC. I downloaded VMware Server 2.0 (which should do the job) and created a Linux virtual machine. The plan was to bridge one virtual NIC to the physical WAN NIC and one virtual NIC to the physical LAN NIC, while the third NIC, connecting the DC, would be on its own.

It should look like this:


Where the small server represents the virtual machine with two dedicated NICs.


Now I thought this should not be difficult at all, and it turns out it isn't, but you should be aware of what VMware actually means when they say bridging. When you bridge a virtual NIC to a physical NIC, you do not bridge NIC to NIC; you bridge a virtual switch, to which the virtual NIC is connected, to the physical NIC. Therefore the physical NIC can still have its own IP address and be reached from the network, in parallel with the virtual NIC. This is a security risk. :)  The solution is to disable the IP protocols on the physical NICs if you want them to be available only to the virtual NIC.
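
For reference, the resulting mapping also shows up in the virtual machine's configuration file. A sketch of what the relevant .vmx entries can look like in a setup like this (the VMnet numbers below come from my layout, so treat them as an example rather than defaults):

  ethernet0.present = "TRUE"
  ethernet0.connectionType = "custom"
  ethernet0.vnet = "VMnet2"
  ethernet1.present = "TRUE"
  ethernet1.connectionType = "custom"
  ethernet1.vnet = "VMnet3"

Here ethernet0 sits on VMnet2 (bridged to the physical WAN NIC) and ethernet1 on VMnet3 (bridged to the physical LAN NIC).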

So here is the layout I used in the virtual network editor. Note that I did not change the default nets, VMNET0, VMNET1 and VMNET8. If you don't need more than 7 networks (which is not very likely in a home setting) I recommend letting them be. :)

Here the Realtek adapter is the WAN adapter (NIC1) and the PRO/100S is the LAN adapter (NIC2). (After this initial test I replaced it with a gigabit adapter.)

Also, to prevent any unwanted bridging, I turned off all automatic bridging:

The funny thing is that you can only do this by excluding all adapters while keeping the automatic-bridging tickbox ticked. If you untick it, any non-bridged network adapters will automatically be bridged to VMNET0. I know, weird... The default VMNETs I disabled in the host OS (simple and quick):


And I created a virtual machine (Firewall), bridging one virtual NIC to VMNET2 (WAN) and one virtual NIC to VMNET3 (LAN, exclusive to the firewall).




Originally I had hoped to use PFsense. However, it turns out that PFsense is incompatible with the UBEE modem provided to me by Ziggo, though only when running PFsense on VMware. As I could not change the virtualisation solution (see above), nor my provider or my modem, I changed the firewall OS to IPFire, which is also quite resource efficient.

After that I disabled the IP protocols and the Client for Microsoft Networks on the physical NICs, keeping only the VMware bridging protocol enabled:


Finally, the host only saw one NIC and everything worked (because I was running in a test environment, the link is 100 Mbps :P)!


The final result is that my server is faster (running a dedicated CSS server next to its other tasks without breaking a sweat) and I am able to back up at full speed while downloading!

I never figured out why a virtual PFsense install is incompatible with Ziggo + UBEE, but I am not the only one who has come across this problem. The strange thing is that the WAN adapter in PFsense gets an IP address via DHCP, but no traffic is possible (no HTTP, no ICMP, nothing).





Saturday 16 July 2011

Improving the peer review system..

As we all know, the peer-review system has its pros and cons but, quite frankly, it is the best thing we've got to ensure high-quality publications...

...What if we could improve the current system a little...?

In my opinion the peer-review system can be improved with just the following two steps.

Firstly, manuscripts should be sent to the referees with the authors' names removed. This way, it is just about the science. Secondly, and more importantly, the names of the referees who reviewed a paper should be added to the list of authors. This has several major advantages: the referee gets rewarded for the effort of reviewing a paper (which, as we all know, can take a lot of time and effort), and he will take care not to let bad science be published, as his own name is now connected to the paper.