OperationalResearch.org

Customer Segmentation with K-means

What is Customer Segmentation?


Customer segmentation is the practice of dividing a company's customers into groups whose members resemble one another, for example in what, how often, or how much they buy. The goal is to identify specific customer groups and target them with tailored marketing messages and products. Instead of a one-size-fits-all approach, segmentation allows businesses to:
  • Personalize Marketing: Create targeted campaigns that resonate with specific customer needs and preferences.

  • Improve Product Development: Develop products and services that cater to the demands of different segments.

  • Optimize Pricing Strategies: Set different price points for various customer groups based on their willingness to pay.

  • Enhance Customer Retention: Identify and nurture the most valuable customer segments.

What is K-Means Clustering?

K-means is a popular and straightforward unsupervised machine learning algorithm used for clustering. "Unsupervised" means that the algorithm learns patterns from unlabeled data. In our case, we won't tell the algorithm which customers belong to which group; it will figure it out on its own.

Here's a simplified breakdown of how K-means works:

  1. Choose the number of clusters (K): You first need to decide how many customer segments you want to create. Let's say you choose K=3.
  2. Initialize Centroids: The algorithm randomly selects K data points from your dataset as the initial "centroids" or centers of the clusters.
  3. Assign Data Points: Each data point (customer) is assigned to the nearest centroid based on a distance metric, usually the Euclidean distance.
  4. Update Centroids: Once all data points are assigned to a cluster, the algorithm recalculates the position of the K centroids by taking the mean of all data points within each cluster.
  5. Repeat: Steps 3 and 4 are repeated until the centroids no longer move significantly, meaning the clusters have stabilized.
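
The five steps above can be sketched in a few lines of NumPy. This is a minimal illustration rather than a production implementation: the function names `assign_to_nearest` and `kmeans`, the tolerance `tol`, the iteration cap `n_iters`, and the toy two-dimensional data are all assumptions made for the example, not part of the dataset used later in this tutorial.

```python
import numpy as np

def assign_to_nearest(X, centroids):
    """Return the index of the nearest centroid and the distance to it for every point."""
    # Pairwise Euclidean distances, shape (n_points, k).
    distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return distances.argmin(axis=1), distances.min(axis=1)

def kmeans(X, k, n_iters=100, tol=1e-6, seed=0):
    """Minimal K-means sketch (hypothetical helper, not a library API)."""
    rng = np.random.default_rng(seed)
    # Step 2: pick k distinct data points as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Step 3: assign each point to its nearest centroid.
        labels, _ = assign_to_nearest(X, centroids)
        # Step 4: move each centroid to the mean of its assigned points
        # (an empty cluster keeps its previous centroid).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 5: stop once the centroids no longer move significantly.
        shift = np.linalg.norm(new_centroids - centroids)
        centroids = new_centroids
        if shift < tol:
            break
    # Final assignment and sum of squared distances for the returned centroids.
    labels, nearest = assign_to_nearest(X, centroids)
    inertia = float((nearest ** 2).sum())
    return centroids, labels, inertia

# Toy data: three made-up groups of 50 "customers" with two features each.
rng = np.random.default_rng(42)
X = np.vstack([
    rng.normal(loc=center, scale=0.5, size=(50, 2))
    for center in [(0, 0), (10, 10), (0, 10)]
])

centroids, labels, inertia = kmeans(X, k=3)
print(centroids)
print(np.bincount(labels, minlength=3))
```

Because the initial centroids are chosen at random, different seeds can lead to different final clusters. Library implementations typically run several initializations and keep the best result for that reason.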

The "K" in K-means represents the number of clusters you choose. The algorithm's objective is to minimize the sum of the squared distances between the data points and their respective cluster centroids.
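
Written out, that objective takes the following form. The symbols are notation chosen here for illustration: $C_j$ is the set of data points assigned to cluster $j$, $\mu_j$ is its centroid, and $x_i$ is a single data point (customer).

```latex
J = \sum_{j=1}^{K} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2,
\qquad
\mu_j = \frac{1}{\lvert C_j \rvert} \sum_{x_i \in C_j} x_i
```

The assignment step lowers $J$ by moving each point to its nearest centroid, and the update step lowers it by replacing each centroid with the mean of its cluster, so $J$ never increases from one iteration to the next.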


The Dataset: Online Retail from UCI

For this tutorial, we'll use the "Online Retail" dataset, available for download from the UCI Machine Learning Repository. This dataset contains transactional data from a UK-based online retail company.

Dataset Features

| Variable Name | Role | Type | Description | Units | Missing Values |
|---|---|---|---|---|---|
| InvoiceNo | ID | Categorical | A 6-digit integral number uniquely assigned to each transaction. If it starts with 'C', it's a cancellation. | - | No |
| StockCode | ID | Categorical | A 5-digit integral number uniquely assigned to each distinct product. | - | No |
| Description | Feature | Categorical | Product name. | - | No |
| Quantity | Feature | Integer | The quantities of each product (item) per transaction. | - | No |
| InvoiceDate | Feature | Date | The day and time when each transaction was generated. | - | No |
| UnitPrice | Feature | Continuous | Product price per unit. | Sterling | No |
| CustomerID | Feature | Categorical | A 5-digit integral number uniquely assigned to each customer. | - | No |
| Country | Feature | Categorical | The name of the country where each customer resides. | - | No |

Step-by-Step Implementation of K-Means for Customer Segmentation


Let's start by importing the necessary Python libraries and loading the dataset:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Load the dataset (pandas needs the openpyxl package installed to read .xlsx files)
df = pd.read_excel("Online Retail.xlsx")
df.head()
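
Before clustering, it may be worth checking the loaded data against the feature table: the table reports no missing values and notes that invoice numbers starting with 'C' are cancellations, and both points are easy to verify. The sketch below shows one possible check on a tiny made-up sample that only borrows the dataset's column names. The values and the variable names `sample`, `missing_counts`, `is_cancellation`, and `clean` are illustrative assumptions, and whether to drop such rows is a modeling choice rather than a required step.

```python
import pandas as pd

# Tiny made-up sample with the same column names as the dataset (illustrative only).
sample = pd.DataFrame({
    "InvoiceNo": ["536365", "536366", "C536379", "536380"],
    "Quantity": [6, 2, -1, 3],
    "UnitPrice": [2.55, 1.85, 27.50, 4.25],
    "CustomerID": [17850.0, None, 14527.0, 13047.0],
})

# Count missing values per column before relying on any of them.
missing_counts = sample.isna().sum()
print(missing_counts)

# Flag cancellations: invoice numbers that start with 'C'.
is_cancellation = sample["InvoiceNo"].astype(str).str.startswith("C")

# Keep completed purchases that can be tied to a customer.
clean = sample[~is_cancellation & sample["CustomerID"].notna()]
print(clean)
```

The same three checks can be applied to `df` once the real file is loaded, replacing `sample` with `df`.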
