I'm getting acquainted with data mining and I'm definitely not a stats guy, but this struck me as odd. I'm using Microsoft Decision Trees on a dataset with 13 million rows. I'm trying to predict a YES occurrence which only happened 3,000 times in the
The first thing I'm doing is just playing with the data, meaning using all of it or limiting it to certain values of certain attributes, how holdback affects things, etc.
I built a model with a couple dozen attributes and no over- or undersampling, based on the entire 13 million rows, then fed a dataset of only the 3,000 YES occurrences into a DMX prediction query. When I rolled up the results into buckets of 0-9%,
10-19%, etc. and plotted a graph, I got a smile, meaning the opposite of a normal distribution curve. The average probability was almost dead center at 53%. By rolling up the data according to interesting attributes I can skew the smile to an
average of about 65% so far.
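
To make the roll-up concrete, here is a rough Python sketch of the bucketing and averaging described above. It assumes the predicted YES probabilities have already been exported from the DMX query as plain floats between 0 and 1. The function names and the sample values are hypothetical and only there for illustration; they are not the real results.

```python
from collections import Counter


def bucket_probabilities(probs):
    """Count probabilities in ten buckets: 0-9%, 10-19%, ..., 90-100%."""
    counts = Counter()
    for p in probs:
        # A probability of exactly 1.0 is folded into the last bucket.
        idx = min(int(p * 10), 9)
        counts[idx] += 1
    labels = [f"{i * 10}-{i * 10 + 9}%" for i in range(9)] + ["90-100%"]
    return {labels[i]: counts.get(i, 0) for i in range(10)}


def mean_probability(probs):
    """Plain arithmetic mean of the predicted probabilities."""
    return sum(probs) / len(probs)


# Hypothetical predicted probabilities for known YES cases (made-up values).
sample = [0.05, 0.08, 0.12, 0.50, 0.55, 0.91, 0.95, 0.99]

if __name__ == "__main__":
    for label, count in bucket_probabilities(sample).items():
        print(f"{label:>8}: {'#' * count}")
    print(f"mean probability: {mean_probability(sample):.3f}")
```

With values piled up at both ends like this, the histogram shows the "smile" shape while the mean still lands near the middle, so the average by itself says little about how the predictions are spread out.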
My question is: Is this even a valid test? Is the average probability for a collection of known YES cases a valid indicator of anything?