I'd welcome some help getting started in data mining... Every discipline has best practices. Are there best practices I should follow when predicting something rare like fraud?
I have 13 million cases, 3000 frauds, in a year. I'm using decision trees right now...
Should I oversample the 3000 frauds to 3,000,000 frauds? Should I undersample the ~13 million non-frauds? Should I use the data as is?
Should I train on 6.5 million cases and test against the other 6.5 million? What percentage should I hold back for testing? Something really small, or the default 70/30?
Should I do anything special with the algorithm parameters?
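To put rough numbers on the options above, here is a small Python sketch of the arithmetic only. It assumes the 13 million figure is the total number of cases (so non-frauds are the remainder) and that any train/test split is stratified, meaning both classes are split at the same rate. The function names are made up for illustration and do not come from any data mining tool.

```python
# Counts from the post. Assumption: "13 million cases" is the total,
# so the non-fraud count is the remainder.
TOTAL_CASES = 13_000_000
FRAUDS = 3_000
NON_FRAUDS = TOTAL_CASES - FRAUDS


def fraud_rate(frauds: int, non_frauds: int) -> float:
    """Fraction of all cases that are fraud."""
    return frauds / (frauds + non_frauds)


def stratified_counts(frauds: int, non_frauds: int, test_fraction: float) -> dict:
    """Per-class (frauds, non_frauds) counts for train and test,
    assuming both classes are split at the same test_fraction."""
    test_frauds = round(frauds * test_fraction)
    test_non_frauds = round(non_frauds * test_fraction)
    return {
        "train": (frauds - test_frauds, non_frauds - test_non_frauds),
        "test": (test_frauds, test_non_frauds),
    }


# Fraud rate with the data as is, and after oversampling frauds to 3,000,000.
as_is = fraud_rate(FRAUDS, NON_FRAUDS)
oversampled = fraud_rate(3_000_000, NON_FRAUDS)

# How many frauds land on each side of a 70/30 and a 50/50 split.
split_70_30 = stratified_counts(FRAUDS, NON_FRAUDS, 0.30)
split_50_50 = stratified_counts(FRAUDS, NON_FRAUDS, 0.50)

print(f"fraud rate as is:       {as_is:.4%}")
print(f"fraud rate oversampled: {oversampled:.2%}")
print("70/30 split (frauds, non-frauds):", split_70_30)
print("50/50 split (frauds, non-frauds):", split_50_50)
```

Under those assumptions the data as is contains about 0.023% frauds, oversampling to 3,000,000 frauds would bring that to roughly 19%, a 70/30 split leaves 900 frauds for testing, and a 50/50 split leaves 1,500.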