I am currently trying to figure out how the implementation of k-means in SSAS works, especially how the distance between two data points (with continuous and discrete variables) is measured.
I found that when there are only continuous variables, the squared Euclidean distance is used.
I also found that, for discrete variables only, the distance is described as 1 minus the probability of that value in the cluster.
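To make my current understanding concrete, here is a small Python sketch of the two measures as I described them above. The function names and the dictionary of per-cluster probabilities are my own assumptions for illustration; this is not the actual SSAS code.

```python
def continuous_distance(point, centroid):
    """Squared Euclidean distance over the continuous attributes."""
    return sum((x - c) ** 2 for x, c in zip(point, centroid))


def discrete_distance(value, cluster_probabilities):
    """1 - P(value | cluster).

    Assumption: a value never seen in the cluster has probability 0,
    so its distance is 1.
    """
    return 1.0 - cluster_probabilities.get(value, 0.0)


if __name__ == "__main__":
    # Two continuous attributes: (1 - 4)^2 + (2 - 6)^2
    print(continuous_distance([1.0, 2.0], [4.0, 6.0]))
    # One discrete attribute whose value "red" makes up 60% of the cluster
    print(discrete_distance("red", {"red": 0.6, "blue": 0.4}))
```

The continuous term is unbounded, while the discrete term always stays between 0 and 1.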
My questions are:
- How does the algorithm work with continuous AND discrete variables? Is the distance calculated by summing the Euclidean distance (e.g. 13.2) and the probability-based distance of a discrete value in a cluster (e.g. 0.4)?
- Is it necessary to scale the continuous variables to a smaller range, maybe -1 to 1? How does SSAS handle this?
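To illustrate what I mean by both questions, here is a sketch in Python. The simple sum in `mixed_distance` is only my guess at how the two parts might be combined, and `scale_to_unit_range` is a plain min-max rescaling that I would apply myself. Both names are hypothetical, and I do not know whether SSAS does anything like this internally.

```python
def scale_to_unit_range(values):
    """Min-max rescaling of one continuous column to [-1, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Constant column: map every value to the middle of the range.
        return [0.0 for _ in values]
    return [2.0 * (v - lo) / (hi - lo) - 1.0 for v in values]


def mixed_distance(continuous_part, discrete_part):
    """Guessed combination: plain sum of the two partial distances."""
    return continuous_part + discrete_part


if __name__ == "__main__":
    print(scale_to_unit_range([10.0, 20.0, 30.0]))
    # With unscaled data, the continuous part (13.2) dwarfs the discrete part (0.4).
    print(mixed_distance(13.2, 0.4))
```

If the plain sum is really what happens, the discrete attribute would barely influence the cluster assignment unless the continuous variables are scaled first, which is why I am asking about scaling.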
Unfortunately, I could not find sufficient documentation on this.