Parallel K-Means for image clustering using NVidia Cuda

!! The code is available on my gitHub toskyRocker account!!

K-Means is commonly used for cluster analysis and data mining. For a better insight of this algorithm I suggest to read this.
A brief explanation of how it works is shown below.

K-Means scheme

  1. K centroids, one of each cluster, are chosen with an heuristic method.
  2. Each element of the dataset is assigned to the closest cluster.
  3. Once every element is assigned to some cluster,  for each cluster the new centroid is computed.
  4. The algorithm stops when 2. and 3. are repeated J number of times.
    K-Means Iteration
    K-Means Iteration

Kmeans can be used to reduce the number of color of an RGB image for many purposes and for this reason the scheme described above is expressed in a different domain.

 

K-Means on RGB image

In our domain the dataset is represented by the entire image containing its pixels. The K centroids are the RGB colours of the image that will be computed iteration by iteration and, at the end, it will define the final colors of the entire image. K, as already mentioned, is defined by the user.

RGB representation
RGB representation

How it works

In order to get familiarity with k-means, a first prototype of the algorithm is implemented in python.
The libraries needed for this example code are PIL and Numpy.

python kmeans.py my_Picture.jpg no_Of_Clusters no_Of_Iterations

This scripts needs the following args:
{my_Picture.jpg, no_Of_Clusters, no_Of_Iterations}

my_Picture.jpg stands for the image I want to be clusterized and must be a .jpg one.
No_Of_Clusters defines the number of clusters (and obviously the centroids, respectively) of the clusterized output image.
No_Of_Iterations defines how many iterations K-Means will perform

'''
Andrea Toscano
Universita' degli Studi di Milano
MS Computer Science
K-Means applied on RGB Images

With terminal move where your photos and kmeans.py are saved
Then execute kmeans.py script following this pattern.

name of your photo clusters no. iterations no.
EXAMPLE :$ python kmeans.py my_Picture.jpg No_Of_Clusters No_Of_Iterations
'''

# Import the modules
import sys, os, math, time
from PIL import Image, ImageFilter, ImageDraw

####################
# GLOBAL VARIABLES #
####################
# Change these values if you want
MAX_CLUSTERS = 50
MAX_ITERATIONS = 50

# Stores w, h of the image
IMAGE_SIZE = []
EXTENSION = ".jpg"

# key: Cluster number, value: rgb value
CENTROIDS = {}
NEW_CENTROIDS = {}

#list of [x,y,n_cluster] for each pixel
CLUSTER_NO = []

# Time
T0 = 0
T1 = 0

########################################
# Finding starting colors (centroids) #
########################################
def getPalette(nameFile):

global IMAGE_SIZE
global CENTROIDS
# Dim of each color palette
dimEachColorPalette = 200
image = 0
try:
image = Image.open(nameFile+EXTENSION)
except IOError:
print "IOError: Unable to open the image. "
sys.exit()

IMAGE_SIZE = image.size
# Dim image
print "Size Original Image: " + str(IMAGE_SIZE)
# Dim of the resized image
resize = int(0.5*IMAGE_SIZE[1])
# Resize with antialias filter
resizedImage = image.resize((resize, resize), Image.ANTIALIAS)
# Get Palette
result = resizedImage.convert('P', palette=Image.ADAPTIVE, colors=N_CLUSTERS)
# Add Alpha channel -> PNG
result.putalpha(0)
# Get a list of tuple (count, color)
colors = result.getcolors(resize*resize)

# Save colors to file
pal = Image.new('RGB', (dimEachColorPalette*numColors, dimEachColorPalette))
draw = ImageDraw.Draw(pal)

posx = 0
arrayColors = []
#print "Printing Initial Centroids: "
for count, col in colors:
#print count, col
# Draw a rectangle with palette colors
draw.rectangle([posx, 0, posx+dimEachColorPalette, dimEachColorPalette], fill=col)
# List of the palette Colors without alpha channel aka Initial centers
arrayColors.append(col[:-1])
posx = posx + dimEachColorPalette

del draw

CENTROIDS = dict(zip(range(0, N_CLUSTERS), arrayColors))
#print CENTROIDS

pal.save(nameFile+"_cl"+str(N_CLUSTERS)+"_palette"+".png", "PNG")
return image

########################################################################
# Euclidean distance between each pixel of the image and each centroid #
# Returns the number of centroid which is the nearest to a given pixel #
########################################################################
def findCluster(pixel):
# Distance calculus
dist = lambda pixel, clusterValue: math.sqrt( (clusterValue[1][0]-pixel[0])**2 + (clusterValue[1][1]-pixel[1])**2 + (clusterValue[1][2]-pixel[2])**2)
# Vector with distances
centroidDistances = ([dist(pixel, cv) for cv in CENTROIDS.items()])
#print centroidDistances
# Finding the nearest centroid and saving its number
myCluster = centroidDistances.index(min([x for x in centroidDistances ]))
#print myCluster
return myCluster

def findNewMean(image):

global NEW_CENTROIDS

# Count how many pixel there are in each cluster
count = [0] * N_CLUSTERS
rgbValues = lambda: [0] * 3
# Dictionary that saves the sum of r, g, b values of the pixel
rgb4EachCluster = {key : rgbValues() for key in range(N_CLUSTERS)}

# For each triple [x,y,n_cluster]
# Add +1 to the counter of the cluster where the pixel belongs to
# Increment r,g,b values in the correct entry of the dict
for triple in CLUSTER_NO:
count[triple[2]] += 1
rgb4EachCluster[triple[2]][0] += image[triple[0],triple[1]][0]
rgb4EachCluster[triple[2]][1] += image[triple[0],triple[1]][1]
rgb4EachCluster[triple[2]][2] += image[triple[0],triple[1]][2]

#print "printing rgb4Cluster: " + str(rgb4EachCluster)
#print "printing count: " + str(count)
pos = 0

# Finding the new centroids as
# sum(red_Color_of_total_pixels_in_a_given_centroid)/n_pixel_in_that_cluster
# the same with g and b
for pos in range(N_CLUSTERS):
value = [0,0,0]
if count[pos] == 0:
value[0] = rgb4EachCluster[pos][0]
value[1] = rgb4EachCluster[pos][1]
value[2] = rgb4EachCluster[pos][2]
#print rgb4EachCluster[pos][0],count[pos],value[0]
#print rgb4EachCluster[pos][1],count[pos],value[1]
#print rgb4EachCluster[pos][2],count[pos],value[2]
else:
value[0] = rgb4EachCluster[pos][0]/count[pos]
value[1] = rgb4EachCluster[pos][1]/count[pos]
value[2] = rgb4EachCluster[pos][2]/count[pos]
#print rgb4EachCluster[pos][0],count[pos],value[0]
#print rgb4EachCluster[pos][1],count[pos],value[1]
#print rgb4EachCluster[pos][2],count[pos],value[2]
NEW_CENTROIDS[pos] = value
#print "Printing NEW CENTROIDS: " + str(NEW_CENTROIDS)

####################################
# Swaps NEW_CENTROIDS -> CENTROIDS #
####################################
def swapOldNewCentroids():
global CENTROIDS
tempDict = {k:v for k,v in CENTROIDS.items()}
CENTROIDS = {}
CENTROIDS = {k:v for k,v in NEW_CENTROIDS.items()}

###################################
# Saves the new clusterized image #
###################################
def saveNewImage(name):

outputImage = Image.new("RGB",IMAGE_SIZE)
print "Saving..."
draw = ImageDraw.Draw(outputImage)

for element in CLUSTER_NO:
c = tuple(NEW_CENTROIDS.get(element[2]))
draw.point((element[0],element[1]), fill=c)
#outputImage.putdata(imageArray)
outputImage.save(name+"_cl"+str(N_CLUSTERS)+"_it"+str(N_ITERATIONS)+EXTENSION,"JPEG",quality=100)
print "That's the end"

########################
# Main k-means routine #
########################
def kmeans(image):

global CLUSTER_NO
global NEW_CENTROIDS
# Loading the image
loadedImage = image.load()
#print loadedImage[0,0][0]
x = 0
y = 0
# Finding initial classification
for iter in range(N_ITERATIONS):
print "#Iteration : " + str(iter)
CLUSTER_NO = []
NEW_CENTROIDS = {}

# ClusterTuple has [x,y, n_cluster] for each pixel in the image
# n_cluster is the number of cluster where the pixel belongs to
for x in range(IMAGE_SIZE[0]):
for y in range(IMAGE_SIZE[1]):
clusterTuple = [0] * 3
clusterTuple[0] = x # x position of the pixel
clusterTuple[1] = y # y position of the pixel
clusterTuple[2] = findCluster(loadedImage[x,y]) # cluster no. where the pixel belongs to
CLUSTER_NO.append(clusterTuple)

# Finds the new Cluster mean
findNewMean(loadedImage)
# Swaps the Dictionaries
swapOldNewCentroids()

#print "Printing CLUSTER NO " + str(CLUSTER_NO)
#print "Printing NEW CENTROIDS: " + str(NEW_CENTROIDS)
#print "Printing OLD CENTROIDS: " + str(CENTROIDS)
#print ""
#print "stamp Cluster no"
#print CLUSTER_NO

##################
# Main procedure #
##################
def procedure(nameImage):

# Finds initial centroids
image = getPalette(nameImage)
# Computes kmeans algorithm
# Taking K-Means time
T0=time.time()
kmeans(image)
T1=time.time()
print "Time: " + str(T1-T0) + " seconds."
# Saves the new clusterized image
saveNewImage(nameImage)

########
# Main #
########
if __name__ == "__main__":

global N_ITERATIONS
global N_CLUSTERS
# Not considering .py namefile
args = sys.argv[1:]
# Name input image
inputFile = args[0]
print "File Name: " + str(inputFile)
try:
numColors = int(args[1])
iterations = int(args[2])
N_ITERATIONS = iterations
N_CLUSTERS = numColors

except:
print "Error: #Colors or #Iteration are not correct"
sys.exit()

print "#Colors: " + str(numColors) + " #Iterations: " + str(iterations)

nameFile,extension = os.path.splitext(inputFile)
if (extension.lower() == ".jpg" or extension.lower() == ".jpeg") and numColors > 1 and numColors <= MAX_CLUSTERS and iterations > 0 and iterations <= MAX_ITERATIONS :
procedure(nameFile)
else:
print "Unable to load the image, probably your parameters are incorrect"
sys.exit()

Handling Jpg images

Before calling CUDA program, some preprocessing is needed in order to choose the starting centroids and to better represent an RGB image in C++.
First of all, an RGB .jpg image is converted into a .raw file containing the stream of RGB values (from 0 to 255) that represent the single image (sorted: from the top left to the bottom right side of the image). This way of “serialising pixels simplifies the image reading procedure on our CUDA code, since we won’t need to import any extra library to read .jpg images.

ImageConverter.py works as shown below.
To get a raw file from a jpg one the args must be:

python imageConverter.py image_name.jpg

Where image_name.jpg is our input image. This script returns image_name.raw that will be the input of our CUDA program.
To get the jpg file from a raw one the args must be:

python imageConverter.py image_name.raw Width Height

As mentioned before, image_name.raw is our image represented by the stream of its RGB values. This conversion is used to convert the output of the CUDA program (that returns a raw file) into a jpg one.

ImageConverter.py is shown below.

'''
Andrea Toscano
Universita' degli Studi di Milano
MS Computer Science
Jpg - Raw Converter (uint8 stream)

With OSX Terminal:
Install Pip (Python package manager):
sudo easy_install pip

Install PIL, Numpy:
pip install PIL
pip install numpy

Using Terminal move where .py and images are stored.

USE CASES
to get .raw file from .jpg file
:$ python imageConverter.py imageName.jpg
to get .jpg file from .raw file
:$ python imageConverter.py imageName.raw Width Height

'''
#Import modules
import sys, os, math, time
import numpy as np
from PIL import Image

#Global Variables
#Width
W = 0
#Height
H = 0
# File Name without extension
global NAMEFILE
#Current file extension
global EXTENSION

#Used to specify raw file extension
global RAW

def fromJpgToUint8():
global W,H
img = Image.open(NAMEFILE+EXTENSION)
W = img.size[0]
H = img.size[1]
print "Width: " + str(W) + " Height: " + str(H)

#print type(img)
array = np.asarray(img)
#print type(array)
#print array.dtype

# Debugging..
# (300,400,3) means 400x300 with RGB channels
#print "Array shape: " + str(array.shape)
# Starting from top left bottom

#print "Array: "
#print array
# 'uint8'
#print "Array Type: "+ str(array.dtype)
# Saving the array in .raw format
print "Saving .raw ..."
array.tofile(NAMEFILE+RAW)
print "Saved!"

def fromUint8ToJpg():
# Reading .raw file
array = np.fromfile(NAMEFILE+EXTENSION, dtype = "uint8")
#print array
#print len(array)
# Reshaping 1d Array into 2d Array
outputImage = array.reshape(H,W,3)
image = Image.fromarray(outputImage)
print "Saving .jpg ...."
# Saving new image
image.save(NAMEFILE+"_PYoutput"+".jpg" ,quality=100)
print "Saved!"

if __name__ == "__main__":
global NAMEFILE, EXTENSION, RAW, W, H
RAW = ".raw"
args = sys.argv[1:]
inputFile = args[0]
print "File Name: " + str(inputFile)
try:
NAMEFILE,EXTENSION = os.path.splitext(inputFile)
if (EXTENSION.lower() == ".jpg" or EXTENSION.lower() == ".jpeg"):
fromJpgToUint8()
elif EXTENSION.lower() == RAW:
W = int(args[1])
H = int(args[2])
fromUint8ToJpg()
else:
print "Error: Unknown image format."
except IOError:
print "Unable to load your image."

Finding the starting centroids

K-means performance strictly depends on the starting centroids. The better we choose the initial colours the better result we will obtain. For this reason the initial centroids are found analysing the histogram of the image.
This step is obtained using python PIL library as it follows:
Palette.py receives as args the jpg image and the number of clusters given by the user.

python palette.py image_name.jpg no_Of_Cluster

The output of the script (image_palette.raw) will be a raw file containing the stream of clusters’ RGB values. This file will be used as a secondd input of the CUDA program.
Palette.py

'''
Andrea Toscano
Universita' degli Studi di Milano
MS Computer Science
nameFile n Clusters
USAGE :$ python palette.py photoName.jpg 10
'''

# Import the modules
import sys, os, math, time
import numpy as np
from PIL import Image

# GLOBAL VARIABLES
# Change these values if you want
MAX_CLUSTERS = 50

# Stores w, h of the image
IMAGE_SIZE = []
EXTENSION = ".jpg"

# Finding the starting colors (centroids)
def getPalette(nameFile):

global IMAGE_SIZE
global CENTROIDS
# Dim of each color palette
dimEachColorPalette = 200
image = 0
try:
image = Image.open(nameFile+EXTENSION)
except IOError:
print "IOError: Unable to open the image. "
sys.exit()

IMAGE_SIZE = image.size
# Dim image
print "Size Original Image: " + str(IMAGE_SIZE)
# Dim of the resized image
resize = int(0.5*IMAGE_SIZE[1])
# Resize with antialias filter
resizedImage = image.resize((resize, resize), Image.ANTIALIAS)
# Get Palette
result = resizedImage.convert('P', palette=Image.ADAPTIVE, colors=N_CLUSTERS)
# Add Alpha channel -> PNG
result.putalpha(0)
# Get a list of tuple (count, color)
colors = result.getcolors(resize*resize)
colors = [col[1][:-1] for col in colors]

array = np.asarray(colors, dtype=np.uint8)
print "Printing Initial Centroids: "
print array
array.tofile(nameFile+"_palette"+str(N_CLUSTERS)+".raw")
print nameFile+"_palette"+str(N_CLUSTERS)+".raw" + " saved."

# Main
if __name__ == "__main__":

global N_CLUSTERS
# Not considering .py namefile
args = sys.argv[1:]
# Name input image
inputFile = args[0]
print "File Name: " + str(inputFile)
try:
N_CLUSTERS = int(args[1])
except:
print "Error: #Initial Colors > 1"
sys.exit()

print "#Initial Centroids: " + str(N_CLUSTERS)

nameFile,extension = os.path.splitext(inputFile)
if (extension.lower() == ".jpg" or extension.lower() == ".jpeg") and N_CLUSTERS > 1 and N_CLUSTERS <= MAX_CLUSTERS:
getPalette(nameFile)
else:
print "Unable to load the image, probably your parameters are incorrect"
print "USAGE: palette.py photoName.jpg n "
print "palette.py: name of this script "
print "photoName.jpg: name of jpg input image"
print "n: number of Clusters"
sys.exit()
[/sourcecode]

K-Means implementation on CUDA

We know all that’s needed to perform the parallel computation of K-Means on Cuda.
The implementation is very simples and requires only a .cu file.
It takes as input:

image.raw output.raw width length image_palette.raw no_Of_Clusters no_Of_Iterations

  1. image.raw is our input image represented by the stream of its RGB values.
  2. output.raw is the output of CUDA algorithm. It represents the clusterized image and needs to be converted to jpg.
  3. width of the image.
  4. length of the image
  5. image_palette.raw represents the RGB initial centroids
  6. no_Of_Clusters defines the number of clusters used in the algorithm. Must match with those contained in image_palette.raw
  7. no_Of_Iterations defines the number of iterations of K-Means

The full code is here:

/**
* Universita' degli Studi di Milano
* GPU Computing
*
* Title: Parallel K-Means applied on RGB Images.
* Created: JULY 13, 2014 11:33AM
* Author: Andrea Toscano
*
*
*/
#include <stdio.h>
#include <stdlib.h>
#include <stdbool.h>
#include <math.h>
#include "timer.h"

// Number of threads
#define BLOCK_SIZE 16
#define GRID_SIZE 256

//Useful to read Error from CUDA Calls
#define CUDA_CALL(x) {if((x) != cudaSuccess){ \
printf("CUDA error at %s:%d\n",__FILE__,__LINE__); \
printf(" %s\n", cudaGetErrorString(cudaGetLastError())); \
exit(EXIT_FAILURE);}}

// nCentroids and size on device
__constant__ int dev_nCentroids;
__constant__ int dev_size;

// global variables
int PALETTE_BYTES = 0; // nCentroids * sizeof(int)
int IMAGE_BYTES = 0; // width * height * sizeof(int)

//**********************************
//R,G,B Centroid's triple on device
// nCentroids on GPU is HARDCODED remind to update it manually!
__constant__ int dev_RedCentroid[20];
__constant__ int dev_GreenCentroid[20];
__constant__ int dev_BlueCentroid[20];
//**********************************
/*
* get RGB values from Initial Centroids (RAW format)
*
*/
bool loadPalette(char* filename, int nCentroids, int* redCentroid, int* greenCentroid, int* blueCentroid) {

FILE *imageFile;
int length = 0;

imageFile = fopen(filename,"r");
if (imageFile == NULL) {
return false;
} else {
for (int i = 0; i < nCentroids; i++) {

// R,G,B Centroid triple, nCentroids long
redCentroid[i] = fgetc(imageFile);
greenCentroid[i] = fgetc(imageFile);
blueCentroid[i] = fgetc(imageFile);
printf("%d, %d, %d\n",redCentroid[i], greenCentroid[i], blueCentroid[i] );
length++;
}
fclose(imageFile);
printf("\n");
//printf("Palette Length: %d\n", length);
return true;
}
}

/*
* Image file loader (RAW format)
* Each pixel is identified by a triple r, g, b
* These values are taken from a raw file and put into three monodimensional arrays whose length is width*height
*/
bool loadRawImage(char* filename, int* r, int* g, int* b, int size) {
FILE *imageFile;
imageFile = fopen(filename, "r");

if (imageFile == NULL) {
return false;
} else {
for (int i = 0; i < size; i++) {

r[i] = fgetc(imageFile);
g[i] = fgetc(imageFile);
b[i] = fgetc(imageFile);
}
fclose(imageFile);

/*for(int j = 0; j < h * w; j++) {
printf("%d, %d, %d ", r[j], g[j], b[j]);
}*/
return true;
}
}

/*
* Image file writer (RAW format)
*/
bool writeRawImage(char* filename, int* labelArray, int* redCentroid, int* greenCentroid, int* blueCentroid, int size){
FILE *imageFile;
imageFile = fopen(filename, "wb");

if(imageFile == NULL) {
return false;
} else {
for (int i = 0; i < size; i++) {
fputc((char) redCentroid[labelArray[i]], imageFile);
fputc((char) greenCentroid[labelArray[i]], imageFile);
fputc((char) blueCentroid[labelArray[i]], imageFile);
}
fclose(imageFile);
return true;
}
}

/*******************************************************************/

/*
* Clears arrays before each kernel getClusterLabel iteration
*
*/
__global__ void clearPaletteArrays(int *dev_sumRed,int *dev_sumGreen,int *dev_sumBlue, int* dev_pixelClusterCounter, int* dev_tempRedCentroid, int* dev_tempGreenCentroid, int* dev_tempBlueCentroid ) {

// 1 block, 16x16 threads
int threadID = threadIdx.x + threadIdx.y * blockDim.x;

if(threadID < dev_nCentroids) {

// nCentroids long
dev_sumRed[threadID] = 0;
dev_sumGreen[threadID] = 0;
dev_sumBlue[threadID] = 0;
dev_pixelClusterCounter[threadID] = 0;
dev_tempRedCentroid[threadID] = 0;
dev_tempGreenCentroid[threadID] = 0;
dev_tempBlueCentroid[threadID] = 0;
}
}// end clearPaletteArrays

/*
* Clear label array before each kernel getClusterLabel iteration
*/
__global__ void clearLabelArray(int *dev_labelArray){

// Global thread index
int threadID = (threadIdx.x + blockIdx.x * blockDim.x) + (threadIdx.y + blockIdx.y * blockDim.y) * blockDim.x * gridDim.x;

// labelArray is "size" long
if(threadID < dev_size) {
dev_labelArray[threadID] = 0;
}
}// end clearLabelArray

/*
* Finds the minimum distance between each triple dev_Red[i] dev_Green[i] dev_Blue[i] and all centroids.
* Then saves the equivalent centroid label in dev_labelArray.
* labelArray is "width*height" long, monodimensional array
*
* INPUT : pixel triple arrays dev_Red, dev_Green, dev_Blue. labelArray that will contains the label for each pixel triple
*/
__global__ void getClusterLabel(int *dev_Red,int *dev_Green,int *dev_Blue,int *dev_labelArray) {

// Global thread index
int threadID = (threadIdx.x + blockIdx.x * blockDim.x) + (threadIdx.y + blockIdx.y * blockDim.y) * blockDim.x * gridDim.x;

//default min value of distance
float min = 500.0, value;
//will be label
int index = 0;

if(threadID < dev_size) {
// Finding the nearest centroid to current triple identified by threadID thread
for(int i = 0; i < dev_nCentroids; i++) {
// Performing Euclidean distance, Saving current value
value = sqrtf(powf((dev_Red[threadID]-dev_RedCentroid[i]),2.0) + powf((dev_Green[threadID]-dev_GreenCentroid[i]),2.0) + powf((dev_Blue[threadID]-dev_BlueCentroid[i]),2.0));

if(value < min){
// saving new nearest centroid
min = value;
// Updating his index
index = i;
}
}// end for
// Writing to global memory the index of the nearest centroid
// for dev_Red[threadID], dev_Green[threadID], dev_Blue[threadID] pixel triple
dev_labelArray[threadID] = index;

}// end if
}// end getClusterLabel

/*
* Summing Red, Green, Blue values per cluster
* Counting how many pixels there are in each cluster
*
*/

__global__ void sumCluster(int *dev_Red,int *dev_Green,int *dev_Blue,int *dev_sumRed,int *dev_sumGreen,int *dev_sumBlue,int *dev_labelArray,int *dev_pixelClusterCounter) {

// Global thread index
int threadID = (threadIdx.x + blockIdx.x * blockDim.x) + (threadIdx.y + blockIdx.y * blockDim.y) * blockDim.x * gridDim.x;

if(threadID < dev_size) {
int currentLabelArray = dev_labelArray[threadID];
int currentRed = dev_Red[threadID];
int currentGreen = dev_Green[threadID];
int currentBlue = dev_Blue[threadID];
// Writing to global memory needs a serialization. Many threads are writing into the same few locations
atomicAdd(&dev_sumRed[currentLabelArray], currentRed);
atomicAdd(&dev_sumGreen[currentLabelArray], currentGreen);
atomicAdd(&dev_sumBlue[currentLabelArray], currentBlue);
atomicAdd(&dev_pixelClusterCounter[currentLabelArray], 1);
}
}// end sumCluster

/*
* Calculates the new R,G,B values of the centroids dividing the sum of color (for each channel) by the number of pixels in that cluster
* New values are stored in global memory since the current R,G,B values of the centroids are in read-only constant memory.
*/
__global__ void newCentroids(int *dev_tempRedCentroid, int *dev_tempGreenCentroid, int *dev_tempBlueCentroid,int* dev_sumRed, int *dev_sumGreen,int *dev_sumBlue, int* dev_pixelClusterCounter) {

// 1 block , 16*16 threads
int threadID = threadIdx.x + threadIdx.y * blockDim.x;

if(threadID < dev_nCentroids) {
int currentPixelCounter = dev_pixelClusterCounter[threadID];
int sumRed = dev_sumRed[threadID];
int sumGreen = dev_sumGreen[threadID];
int sumBlue = dev_sumBlue[threadID];

//new RGB Centroids' values written in global memory
dev_tempRedCentroid[threadID] = (int)(sumRed/currentPixelCounter);
dev_tempGreenCentroid[threadID] = (int)(sumGreen/currentPixelCounter);
dev_tempBlueCentroid[threadID] = (int)(sumBlue/currentPixelCounter);
}

}// end newCentroids

/*******************************************************************/

int main(int argc, char *argv[]) {

// init device
cudaSetDevice(0);
cudaDeviceSynchronize();
cudaThreadSynchronize();

//input raw file, output raw file, input palette raw file containing RGB values of initial centroids
char *inputFile, *outputFile, *palette;
//Pixels' r,g,b values. Centroid's r,g,b values
int *red, *green, *blue, *redCentroid, *greenCentroid, *blueCentroid;

// ref to GPU Pixels'RGB values, Centroids' RGB values
int *dev_Red, *dev_Green, *dev_Blue, *dev_tempRedCentroid, *dev_tempGreenCentroid, *dev_tempBlueCentroid;
// array containing ref to GPU label array variable
int *labelArray, *dev_labelArray;

// local variables for storing image width, height
// number of cluster, number of iterations, linear size of the image ( = width * height)
int width, height, nCentroids, nIterations,size;
//int IMAGE_BYTES, PALETTE_BYTES;

// ref to array where pixels' count are stored
int *pixelClusterCounter, *dev_pixelClusterCounter;
// ref to arrays where sum of RGB values for each cluster are stored
int *sumRed, *sumGreen, *sumBlue;
int *dev_sumRed, *dev_sumGreen, *dev_sumBlue;

// checking arguments
if (argc > 7) {
inputFile = argv[1];
outputFile = argv[2];
width = atoi(argv[3]);
height = atoi(argv[4]);
palette = argv[5];
nCentroids = atoi(argv[6]); // remind to update hardcoded nCentroids above
if(nCentroids > 256)
nCentroids = 256;
nIterations = atoi(argv[7]);
if(nIterations > 15)
nIterations = 15;

} else {
printf(" USAGE: kmeans.cu <inputfile.raw> <outputfile.raw> nRows nCols palette nCentroids nItarations \n");
printf(" <inputfile.raw>: input .raw file (sequence of bytes)\n");
printf(" <outputfile.raw>: output .raw file\n");
printf(" nRows: the number of rows of the image\n");
printf(" nCols: the number of columns of the image\n");
printf(" palette: RGB initial Centroids");
printf(" nCentroids: number of clusters");
printf(" nIterations: number of iterations of K-Means");

return 0;
}

// Setting image and palette size in bytes
IMAGE_BYTES = width * height * sizeof(int);
PALETTE_BYTES = nCentroids * sizeof(int);
size = width * height;

/* TODO useful for Texture implementation
int ra[width][height];
int ga[width][height];
int ba[width][height];
*/

printf("Image: %s\n",inputFile);
printf("Width: %d, Height: %d\n", width, height);
printf("#Clusters: %d, #Iterations: %d\n", nCentroids, nIterations);

// allocate memory on CPU
red = static_cast<int *>(malloc(IMAGE_BYTES));
green = static_cast<int *>(malloc(IMAGE_BYTES));
blue = static_cast<int *>(malloc(IMAGE_BYTES));
redCentroid = static_cast<int *>(malloc(PALETTE_BYTES));
greenCentroid = static_cast<int *>(malloc(PALETTE_BYTES));
blueCentroid = static_cast<int *>(malloc(PALETTE_BYTES));
labelArray = static_cast<int *>(malloc(IMAGE_BYTES));
sumRed = static_cast<int*>(malloc(PALETTE_BYTES));
sumGreen = static_cast<int*>(malloc(PALETTE_BYTES));
sumBlue = static_cast<int*>(malloc(PALETTE_BYTES));
pixelClusterCounter = static_cast<int*>(malloc(PALETTE_BYTES));

// Setting initial centroids
printf("Initial Centroids: \n");
if(loadPalette(palette, nCentroids, redCentroid, greenCentroid, blueCentroid)) {
} else {
printf("Unable to set Initial Centroids.\n");
}

// Loading image in r, g, b arrays
printf("Image loading...\n");
if (loadRawImage(inputFile, red, green, blue, size)) {
printf("Image loaded!\n");
} else {
printf("NOT loaded!\n");
return -1;
}

printf("\n");

/* TODO Useful for Texture implementation
int ri = 0;
int co = 0;

for( ri = 0; ri < width; ri++){
for( co = 0; co < height; co++) {
int x = ri+co*width;
ra[ri][co] = red[x];
ga[ri][co] = green[x];
ba[ri][co] = blue[x];
}
}
*/

if(IMAGE_BYTES == 0 || PALETTE_BYTES == 0) {
return -1;
}

// allocate memory on GPU
CUDA_CALL(cudaMalloc((void**) &dev_Red, IMAGE_BYTES));
CUDA_CALL(cudaMalloc((void**) &dev_Green, IMAGE_BYTES));
CUDA_CALL(cudaMalloc((void**) &dev_Blue, IMAGE_BYTES));
CUDA_CALL(cudaMalloc((void**) &dev_tempRedCentroid, PALETTE_BYTES));
CUDA_CALL(cudaMalloc((void**) &dev_tempGreenCentroid, PALETTE_BYTES));
CUDA_CALL(cudaMalloc((void**) &dev_tempBlueCentroid, PALETTE_BYTES));
CUDA_CALL(cudaMalloc((void**) &dev_labelArray, IMAGE_BYTES));
CUDA_CALL(cudaMalloc((void**) &dev_sumRed, PALETTE_BYTES));
CUDA_CALL(cudaMalloc((void**) &dev_sumGreen, PALETTE_BYTES));
CUDA_CALL(cudaMalloc((void**) &dev_sumBlue, PALETTE_BYTES));
CUDA_CALL(cudaMalloc((void**) &dev_pixelClusterCounter, PALETTE_BYTES));

// copy host CPU memory to GPU
CUDA_CALL(cudaMemcpy(dev_Red, red, IMAGE_BYTES, cudaMemcpyHostToDevice));
CUDA_CALL(cudaMemcpy(dev_Green, green, IMAGE_BYTES, cudaMemcpyHostToDevice));
CUDA_CALL(cudaMemcpy(dev_Blue, blue, IMAGE_BYTES, cudaMemcpyHostToDevice));
CUDA_CALL(cudaMemcpy(dev_tempRedCentroid, redCentroid,PALETTE_BYTES,cudaMemcpyHostToDevice ));
CUDA_CALL(cudaMemcpy(dev_tempGreenCentroid, greenCentroid,PALETTE_BYTES,cudaMemcpyHostToDevice ));
CUDA_CALL(cudaMemcpy(dev_tempBlueCentroid, blueCentroid,PALETTE_BYTES,cudaMemcpyHostToDevice ));
CUDA_CALL(cudaMemcpy(dev_labelArray, labelArray, IMAGE_BYTES, cudaMemcpyHostToDevice));
CUDA_CALL(cudaMemcpy(dev_sumRed, sumRed, PALETTE_BYTES, cudaMemcpyHostToDevice));
CUDA_CALL(cudaMemcpy(dev_sumGreen, sumGreen, PALETTE_BYTES, cudaMemcpyHostToDevice));
CUDA_CALL(cudaMemcpy(dev_sumBlue, sumBlue, PALETTE_BYTES, cudaMemcpyHostToDevice));
CUDA_CALL(cudaMemcpy(dev_pixelClusterCounter, pixelClusterCounter, PALETTE_BYTES, cudaMemcpyHostToDevice));
CUDA_CALL(cudaMemcpyToSymbol(dev_RedCentroid, redCentroid, PALETTE_BYTES));
CUDA_CALL(cudaMemcpyToSymbol(dev_GreenCentroid, greenCentroid, PALETTE_BYTES));
CUDA_CALL(cudaMemcpyToSymbol(dev_BlueCentroid, blueCentroid, PALETTE_BYTES));
CUDA_CALL(cudaMemcpyToSymbol(dev_nCentroids,&nCentroids, sizeof(int)));
CUDA_CALL(cudaMemcpyToSymbol(dev_size, &size, sizeof(int)));

// Clearing centroids on host
for(int i = 0; i < nCentroids; i++) {
redCentroid[i] = 0;
greenCentroid[i] = 0;
blueCentroid[i] = 0;
}

// Defining grid size

int BLOCK_X, BLOCK_Y;
BLOCK_X = ceil(width/BLOCK_SIZE);
BLOCK_Y = ceil(height/BLOCK_SIZE);
if(BLOCK_X > GRID_SIZE)
BLOCK_X = GRID_SIZE;
if(BLOCK_Y > GRID_SIZE)
BLOCK_Y = GRID_SIZE;

//2D Grid
//Minimum number of threads that can handle width¡height pixels
dim3 dimGRID(BLOCK_X,BLOCK_Y);
//2D Block
//Each dimension is fixed
dim3 dimBLOCK(BLOCK_SIZE,BLOCK_SIZE);

//Starting timer
GpuTimer timer;
timer.Start();

printf("Launching K-Means Kernels.. \n");
//Iteration of kmeans algorithm
for(int i = 0; i < nIterations; i++) {

// Passing image RGB components, palette RGB components, label Array, number of Clusters
// Init arrays' values to 0
// Kernel needs only 1 block since nClusters
clearPaletteArrays<<<1, dimBLOCK>>>(dev_sumRed, dev_sumGreen, dev_sumBlue, dev_pixelClusterCounter, dev_tempRedCentroid, dev_tempGreenCentroid, dev_tempBlueCentroid);

// Init labelarray values to 0
clearLabelArray<<<dimGRID, dimBLOCK>>>(dev_labelArray);

// Calculates the distance from each pixel and all centroids
// Then saves the equivalent label in dev_labelArray
getClusterLabel<<< dimGRID, dimBLOCK >>> (dev_Red, dev_Green, dev_Blue,dev_labelArray);

//Sums RGB values in each Cluster
sumCluster<<<dimGRID, dimBLOCK>>> (dev_Red, dev_Green, dev_Blue, dev_sumRed, dev_sumGreen, dev_sumBlue, dev_labelArray,dev_pixelClusterCounter);

//Finds new RGB Centroids' values
newCentroids<<<1,dimBLOCK >>>(dev_tempRedCentroid, dev_tempGreenCentroid, dev_tempBlueCentroid, dev_sumRed, dev_sumGreen, dev_sumBlue, dev_pixelClusterCounter);

//Old RGB Centroids' values are in constant memory
//Updated RGB Centroids' values are in global memory
//We need a swap
CUDA_CALL(cudaMemcpy(redCentroid, dev_tempRedCentroid, PALETTE_BYTES,cudaMemcpyDeviceToHost));
CUDA_CALL(cudaMemcpy(greenCentroid, dev_tempGreenCentroid, PALETTE_BYTES,cudaMemcpyDeviceToHost));
CUDA_CALL(cudaMemcpy(blueCentroid, dev_tempBlueCentroid, PALETTE_BYTES,cudaMemcpyDeviceToHost));
//Uploading in constant memory updated RGB Centroids' values
CUDA_CALL(cudaMemcpyToSymbol(dev_RedCentroid, redCentroid, PALETTE_BYTES));
CUDA_CALL(cudaMemcpyToSymbol(dev_GreenCentroid, greenCentroid, PALETTE_BYTES));
CUDA_CALL(cudaMemcpyToSymbol(dev_BlueCentroid, blueCentroid, PALETTE_BYTES));
timer.Stop();
}

// DEBUG
CUDA_CALL(cudaMemcpy(labelArray, dev_labelArray, IMAGE_BYTES, cudaMemcpyDeviceToHost));
CUDA_CALL(cudaMemcpy(sumRed, dev_sumRed, PALETTE_BYTES, cudaMemcpyDeviceToHost));
CUDA_CALL(cudaMemcpy(sumGreen, dev_sumGreen, PALETTE_BYTES, cudaMemcpyDeviceToHost));
CUDA_CALL(cudaMemcpy(sumBlue, dev_sumBlue, PALETTE_BYTES, cudaMemcpyDeviceToHost));
CUDA_CALL(cudaMemcpy(pixelClusterCounter, dev_pixelClusterCounter, PALETTE_BYTES, cudaMemcpyDeviceToHost));

printf("Kmeans code ran in: %f msecs.\n", timer.Elapsed());
printf("\n");

// labelArray DEBUG
int counter = 0;

printf("Label Array:\n");
for(int i = 0; i < (size); i++) {
//printf("%d\n", labelArray[i]);
counter++;
}
printf("printing counter %d\n", counter);
counter = 0;

printf("Sum Arrays:\n");
for(int j = 0; j < nCentroids; j++) {
printf("r: %u g: %u b: %u \n", sumRed[j], sumGreen[j], sumBlue[j]);
counter++;
}

printf("\n");

printf("Pixels per centroids:\n");
for(int k = 0; k < nCentroids; k++){
printf("%d centroid: %d pixels\n", k, pixelClusterCounter[k]);
}

printf("\n");

printf("New centroids:\n");
for(int i = 0; i < nCentroids; i++) {

printf("%d, %d, %d \n", redCentroid[i], greenCentroid[i], blueCentroid[i]);
}

// writing...
printf("Image writing...\n");

if (writeRawImage(outputFile,labelArray, redCentroid, greenCentroid, blueCentroid, size)) {
printf("Image written!\n");
} else {
printf("NOT written!\n");
return -1;
}

free(red);
free(green);
free(blue);
free(redCentroid);
free(greenCentroid);
free(blueCentroid);
free(labelArray);
free(sumRed);
free(sumGreen);
free(sumBlue);
free(pixelClusterCounter);

CUDA_CALL(cudaFree(dev_Red));
CUDA_CALL(cudaFree(dev_Green));
CUDA_CALL(cudaFree(dev_Blue));
CUDA_CALL(cudaFree(dev_tempRedCentroid));
CUDA_CALL(cudaFree(dev_tempGreenCentroid));
CUDA_CALL(cudaFree(dev_tempBlueCentroid));
CUDA_CALL(cudaFree(dev_labelArray));
CUDA_CALL(cudaFree(dev_sumRed));
CUDA_CALL(cudaFree(dev_sumGreen));
CUDA_CALL(cudaFree(dev_sumBlue));
CUDA_CALL(cudaFree(dev_pixelClusterCounter));

printf("That's the end.\n");
return 0;
}

Reading the RGB raw image

Let’s examine the code from the very first step.
LoadRawImage routine shows how all R, G, and B values of the raw image are added in r[], g[] and b[] arrays, respectively.
LoadPalette routine reads image_palette-raw and copies its values into three arrays like loadRawImage function.

Cuda Kernels

Memory space is allocated on the GPU and the grid is defined on a 2D space. Each block in the grid is 2D too.
The maximum number of threads for each block is set to 16 and the dimension of the grid is related to the width and height of the image.


BLOCK_X = ceil(width/BLOCK_SIZE);
BLOCK_Y = ceil(height/BLOCK_SIZE);

Since the centroids are updated only at the end of each iteration, I chose to put them in constant memory (only read). They are updated at the beginning of the new loop with the values found at the previous iteration ok K-Means.
As the sequential version, the parallel algorithm consists in 3 main routines:

  1. getClusterLabel: this function receives the r,g,b arrays and returns the nearest centroid for each pixel. Have a look to the code for the global thread index identification.
  2. sumCluster: for all pixels in a cluster it sums separately their R,G,B in 3 different arrays so that each output array represents the sum for a single R,G,B component.
  3. newCentroids: For each cluster it simply divides the sum (for each component obtained in the previous step) by the number of pixels contained in each cluster

The result is copied back to CPU and the last routine saves the output image into a raw file.
Now you can run again imageConverter.py specifying that you want a jpg from a raw file:

python imageConverter.py imageName.raw Width Height

Note: you have also to specify the width and height of the image to convert the raw stream into a new picture.

Recap

  1. Choose a jpg and run imageConverter.py to get the raw representation of the image.
  2. Once you have decided the number of clusters of k-means, run palette.py with the same image to get the initial centroids.
  3. Now you can run CUDAkmeans.cu in order to get the clusterized image (raw)
  4. Run again imageConverter.py to convert the raw file into a jpg one.

You can find some examples on my GitHub repo.

Image clusterized with K-Means on Cuda
 Image clusterized with K-Means on Cuda

Comments (13)

  1. ruzi

    Nov 20, 2016 at 6:11 am

    sir, the code is awesome, but when i run the cudakmeans, the output show me some bug/error, it produce -1 in every centroid, am i doing it wrong? because when i run the python code, the code produce the right answer
    and pardon with my english

    • AndrewTosky

      Nov 20, 2016 at 11:34 am

      Hi Ruzi,
      first of all, thank you for reading my blog 🙂 .
      Concerning your problem,
      if you’re using NSight as IDE, could you please debug my code to see where it fails?
      You have to consider in addition that I developed it a couple of years ago with an outdated CUDA version, (likely 5.5).
      Which CUDA version are you using?

      Thanks
      Andrea

      • ruzi

        Nov 20, 2016 at 1:04 pm

        wow thanks for responding to sir, i’m new to this cuda thing
        i run the code with cmd, because when i use VS the program stop when saving the output.raw, i tried to trace it with some printf, what i found that the gaussian value always -1, and i assume the cudamemcpy might not copying the value.
        i want to debug it, but the program always say access violation
        reading location

        im using cuda 8.0, maybe if i downgrade my cuda it will work?

        • AndrewTosky

          Nov 20, 2016 at 11:16 pm

          Hi Ruzi,
          it must surely be a version issue since I tested my code with both cmd and Nsight (Linux) on GeForce and Tesla Nvidia architectures.
          I’d suggest you not to downgrade but to understand where the problem is..maybe trying a different allocation method.
          Are you sure all data are valid just before passing them to the GPU? Are you printing matrixes’ sizes and so on?

          Cheers
          Andrea

          • ruzi

            Nov 22, 2016 at 3:35 pm

            i still identifying the error, but i found something, the allocation is okay (my bad), because actually it copying the value of the initial centroids, so the problem might be in the iteration (still no clue)

            does the code have some limitation on jpg dimension?
            does cuda need to be config first?

            Thanks

          • AndrewTosky

            Nov 22, 2016 at 11:22 pm

            Hi Ruzi, my CUDA code has no limitations since it handles a text file holding a stream of int values whose range is [0,255]. Then it interprets them as a RGB matrix, as you can see in the first lines of my CUDA program.

            Python code, on the contrary, relies on the fact that input image’s extension is “jpg” (line 27 of Python code) . You can easily make it generic, if you want 🙂

          • ruzi

            Nov 23, 2016 at 2:14 am

            thanks sir still helping me 🙂

            yeah because of the raw file right? but i found something weird, in load raw image function, i print it by r g b, at the beginning it was okay, it print the r g b number, but when it reach certain loop of print, all the values become -1, i use 3 dif image, and all of them have different -1 start, 1 jpg start at iteration 63, and the other somewhere around 400 and 300.
            i think this is the main problem with the running

          • AndrewTosky

            Nov 24, 2016 at 10:04 pm

            Hi Ruzi,

            yes the .raw file is just a simple solution to read an image without using opencv or any other external library.

            Are you declaring image width and height correctly as command line input of CUDA program?

            Do the images you used have the same size? I’m trying to understand whether the program fails at the same point or not

          • ruzi

            Nov 25, 2016 at 5:48 am

            i have image with width:236 height:244 i change it to raw, to check if the raw correct, i change it into jpg again, width and height still the same, but through that process, the file size change.

            and when i start the command, i use
            test3.exe 4.raw 444.raw 236 244 4_palette10.raw 10 10
            and
            test3.exe 4.raw 444.raw 244 236 4_palette10.raw 10 10

            both still producing -1

          • AndrewTosky

            Dec 05, 2016 at 11:23 pm

            Here debugging becomes essential.. otherwise it will be difficult to sort it out.

  2. ruzi

    Dec 06, 2016 at 6:21 am

    Thanks still keeping up with me,
    i debug it, but when i run the debug, it’s having error on location for write raw method.

    Btw do i need to config the NVIDIA?
    my step -> install the Visual Studio -> install CUDA -> add cl.exe to the environment so i can run it in CMD, am I missing something? thanks

    • ruzi

      Dec 06, 2016 at 8:45 am

      Hello, the code now working sir, i just need to change the fopen form r into rb. Thanks for everything, Sir. Your code is awesome 🙂

      • AndrewTosky

        Dec 09, 2016 at 9:53 pm

        Setup Visual Studio for CUDA and debug the program properly, since the CUDA version I used is now outdated.
        That’t the only suggestion I can give you 🙂

Leave a Reply to AndrewTosky

This site uses Akismet to reduce spam. Learn how your comment data is processed.