Thursday, 12 April 2012

Working Log April 12

A power law is a special kind of mathematical relationship between two quantities. When the frequency of an event varies as a power of some attribute of that event (e.g. its size), the frequency is said to follow a power law. For instance, the number of cities having a certain population size is found to vary as a power of the size of the population, and hence follows a power law. There is evidence that the distributions of a wide variety of physical, biological, and man-made phenomena follow a power law, including the sizes of earthquakes, craters on the moon and of solar flares,[1] the foraging pattern of various species,[2] the sizes of activity patterns of neuronal populations,[3] the frequencies of words in most languages, frequencies of family names, the species richness in clades of organisms,[4] the sizes of power outages and wars,[5] and many other quantities.



K-Means Clustering Overview


Overview
K-Means clustering generates a specific number of disjoint, flat (non-hierarchical) clusters. It is well suited to generating globular clusters. The K-Means method is numerical, unsupervised, non-deterministic and iterative.

K-Means Algorithm Properties
  • There are always K clusters.
  • There is always at least one item in each cluster.
  • The clusters are non-hierarchical and they do not overlap.
  • Every member of a cluster is closer to its cluster than any other cluster because closeness does not always involve the 'center' of clusters.

The K-Means Algorithm Process
  • The dataset is partitioned into K clusters and the data points are randomly assigned to the clusters resulting in clusters that have roughly the same number of data points.
  • For each data point:
  • Calculate the distance from the data point to each cluster.
  • If the data point is closest to its own cluster, leave it where it is. If the data point is not closest to its own cluster, move it into the closest cluster.
  • Repeat the above step until a complete pass through all the data points results in no data point moving from one cluster to another. At this point the clusters are stable and the clustering process ends.
  • The choice of initial partition can greatly affect the final clusters that result, in terms of inter-cluster and intracluster distances and cohesion.

K-Means Clustering in GeneLinker™
The version of the K-Means algorithm used in GeneLinker™ differs from the conventional K-Means algorithm in that GeneLinker™ does not compute the centroid of the clusters to measure the distance from a data point to a cluster. Instead, the algorithm uses a specified linkage distance metric. The use of the Average Linkage distance metric most closely corresponds to conventional K-Means, but can produce different results in many cases.

Advantages to Using this Technique
  • With a large number of variables, K-Means may be computationally faster than hierarchical clustering (if K is small).
  • K-Means may produce tighter clusters than hierarchical clustering, especially if the clusters are globular.

Disadvantages to Using this Technique
  • Difficulty in comparing quality of the clusters produced (e.g. for different initial partitions or values of K affect outcome).
  • Fixed number of clusters can make it difficult to predict what K should be.
  • Does not work well with non-globular clusters.
  • Different initial partitions can result in different final clusters. It is helpful to rerun the program using the same as well as different K values, to compare the results achieved.



The Euclidean distance between points p and q is the length of the line segment connecting them (\overline{\mathbf{p}\mathbf{q}}).
In Cartesian coordinates, if p = (p1p2,..., pn) and q = (q1q2,..., qn) are two points in Euclidean n-space, then the distance from p to q, or from q to p is given by
\mathrm{d}(\mathbf{p},\mathbf{q}) = \mathrm{d}(\mathbf{q},\mathbf{p}) = \sqrt{(q_1-p_1)^2 + (q_2-p_2)^2 + \cdots + (q_n-p_n)^2} = \sqrt{\sum_{i=1}^n (q_i-p_i)^2}.