K-means clustering and its real use-case in security domain

Abhinav Chaudhary
6 min readJul 20, 2021

--

What is clustering?

Clustering is one of the most common exploratory data analysis technique used to get an intuition about the structure of the data. It can be defined as the task of identifying subgroups in the data such that data points in the same subgroup (cluster) are very similar while data points in different clusters are very different.
Unlike supervised learning, clustering is considered an unsupervised learning method since we don’t have the ground truth to compare the output of the clustering algorithm to the true labels to evaluate its performance. We only want to try to investigate the structure of the data by grouping the data points into distinct subgroups.

What is K-means clustering?

K-means clustering is a very famous and powerful unsupervised machine learning algorithm. It is used to solve many complex unsupervised machine learning problems.

K means is one of the most popular Unsupervised Machine Learning Algorithms Used for Solving Classification Problems. K Means segregates the unlabeled data into various groups, called clusters, based on having similar features, common patterns.

K-means Algorithm

K-means Algorithm is an Iterative algorithm that divides a group of n datasets into k subgroups /clusters based on the similarity and their mean distance from the centroid of that particular subgroup/ formed.Or K-Means clustering is an unsupervised learning algorithm. There is no labeled data for this clustering, unlike in supervised learning. K-Means performs the division of objects into clusters that share similarities and are dissimilar to the objects belonging to another cluster.

The term ‘K’ is a number. You need to tell the system how many clusters you need to create. For example, K = 2 refers to two clusters. There is a way of finding out what is the best or optimum value of K for a given data.

For a better understanding of k-means, let’s take an example from cricket. Imagine you received data on a lot of cricket players from all over the world, which gives information on the runs scored by the player and the wickets taken by them in the last ten matches. Based on this information, we need to group the data into two clusters, namely batsman and bowlers.

What are the basic steps for K-means clustering?

Step 1: Choose the number of clusters k.

Step 2: Select k random points from the data as centroids.

Step 3: Assign all the points to the closest cluster centroid.

Step 4: Re-compute the centroids of newly formed clusters.

Step 5: Repeat steps 3 and 4.

How to choose the value of K?

One of the most challenging tasks in this clustering algorithm is to choose the right values of k. What should be the right k-value? How to choose the k-value? Let us find the answer to these questions. If you are choosing the k values randomly, it might be correct or may be wrong. If you will choose the wrong value then it will directly affect your model performance. So there are two methods by which you can select the right value of k.

  1. Elbow Method.
  2. Silhouette Method.

Elbow Method

Elbow is one of the most famous methods by which you can select the right value of k and boost your model performance. We also perform the hyper parameter tuning to chose the best value of k. Let us see how this elbow method works.

It is an empirical method to find out the best value of k. it picks up the range of values and takes the best among them. It calculates the sum of the square of the points and calculates the average distance.

When the value of k is 1, the within-cluster sum of the square will be high. As the value of k increases, the within-cluster sum of square value will decrease.

Silhouette Method

The silhouette method is somewhat different. The elbow method it also picks up the range of the k values and draws the silhouette graph. It calculates the silhouette coefficient of every point. It calculates the average distance of points within its cluster a (i) and the average distance of the points to its next closest cluster called b (i).

Note : The a (i) value must be less than the b (i) value, that is ai<<bi.

Now, we have the values of a (i) and b (i). we will calculate the silhouette coefficient by using the below formula.

Note that for silhouette coeficient equal to -1 is the worst case scenario.

Use-case of K-means Clustering in Security Domain

The K-means clustering algorithm is used to find groups which have not been explicitly labeled in the data. This can be used to confirm business assumptions about what types of groups exist or to identify unknown groups in complex data sets.

The Cluster has a special meaning which refers to a special group of crime i.e. a lot of crimes in a given specified regions. These clusters can be represented using a geo plot of the crime described on the map of the police jurisdiction.

  1. Crime Data Mining for Indian Police Information System
Photo by Maxim Hopman on Unsplash

This is all about India’s crime analysis system. It gives ways to enhance the currently existing system in Indian Police System called Crime Criminal Information System (CCIS). It suggests dividing the database according to respective states, using classification, to make the data easier to analyze. In our project, we are subdividing the data into different types of crime allowing the user to get information of those crimes easily (e.g. percentage of the particular crime in a particular year, the hotspot of that particular crime)

2. An Optimal KD Model for Crime Pattern Detection Based on Semantic Link Analysis-A Data Mining Tool

The system finds the critical path of the serial killers who are striking over again and again and determines links between their crime activities locations occurred in the past, travel record, and background history, etc. These findings increase the chance of finding these repeat offenders. The formation to integrate information from various crime incidents and also multiple sources and also discover regular patterns for the structure, organization, operation and information in criminal databases. If a particular criminal uses a pattern of path to commit consecutive crimes, then the next crime location of this serial killer can be predicted from the pattern observed. e.g.: in DHOOM 2, the last location of crime of criminal HRITHIK ROSHAN was predicted by the pattern he formed from his previous crime locations.

3. Evolving Data Mining Algorithms on the Prevailing Crime Trend

The crime data is divided into days of the week, to observe Spatial temporal distribution of crime. To the clustered results, a classification algorithm was applied to predict the future crime pattern. It enables us to build a model on predicting next records using previous year’s data.

RESULTS OF CRIME PATTERN ANALYSIS
The different clusters or the crime patterns are color-coded. For each group, the legend provides the total number of crimes incidents included in the group along with the significant attributes that characterize the group.

--

--

No responses yet