Go Back
Colosseo Roma clusterized with K-Means Cuda
04th, May
Parallel K-Means for image clustering using NVidia Cuda

!! The code is available on my gitHub toskyRocker account!!

K-Means is commonly used for cluster analysis and data mining. For a better insight of this algorithm I suggest to read this.
A brief explanation of how it works is shown below.

K-Means scheme

  1. K centroids, one of each cluster, are chosen with an heuristic method.
  2. Each element of the dataset is assigned to the closest cluster.
  3. Once every element is assigned to some cluster,  for each cluster the new centroid is computed.
  4. The algorithm stops when 2. and 3. are repeated J number of times.

    K-Means Iteration

    K-Means Iteration

Kmeans can be used to reduce the number of color of an RGB image for many purposes and for this reason the scheme described above is expressed in a different domain.

K-Means on RGB image

In our domain the dataset is represented by the entire image containing its pixels. The K centroids are the RGB colours of the image that will be computed iteration by iteration and, at the end, it will define the final colors of the entire image. K, as already mentioned, is defined by the user.

RGB representation

RGB representation

How it works

In order to get familiarity with k-means, a first prototype of the algorithm is implemented in python.
The libraries needed for this example code are PIL and Numpy.
This scripts needs the following args:

>>python kmeans.py my_Picture.jpg no_Of_Clusters no_Of_Iterations

my_Picture.jpg stands for the image I want to be clusterized and must be a .jpg one.
No_Of_Clusters defines the number of clusters (and obviously the centroids, respectively) of the clusterized output image.
No_Of_Iterations defines how many iterations K-Means will perform.

'''
Andrea Toscano
Universita' degli Studi di Milano
MS Computer Science
K-Means applied on RGB Images

With terminal move where your photos and kmeans.py are saved
Then execute kmeans.py script following this pattern. 

                              name of your photo  clusters no.     iterations no.
EXAMPLE :$   python kmeans.py my_Picture.jpg       No_Of_Clusters    No_Of_Iterations
'''

# Import the modules
import sys, os, math, time
from PIL import Image, ImageFilter, ImageDraw

####################
# GLOBAL VARIABLES #
####################
# Change these values if you want
MAX_CLUSTERS = 50
MAX_ITERATIONS = 50

# Stores w, h of the image
IMAGE_SIZE = []
EXTENSION = ".jpg"

# key: Cluster number, value: rgb value
CENTROIDS = {}
NEW_CENTROIDS = {}

#list of [x,y,n_cluster] for each pixel
CLUSTER_NO = []

# Time
T0 = 0
T1 = 0

########################################
# Finding  starting colors (centroids) #
########################################
def getPalette(nameFile):

	global IMAGE_SIZE
	global CENTROIDS
	# Dim of each color palette
	dimEachColorPalette = 200
	image = 0
	try:
		image = Image.open(nameFile+EXTENSION)
	except IOError:
		print "IOError: Unable to open the image. "
		sys.exit()

	IMAGE_SIZE = image.size
	# Dim image
	print "Size Original Image: " + str(IMAGE_SIZE)
	# Dim of the resized image
	resize = int(0.5*IMAGE_SIZE[1])
    # Resize with antialias filter
	resizedImage = image.resize((resize, resize), Image.ANTIALIAS)
    # Get Palette
	result = resizedImage.convert('P', palette=Image.ADAPTIVE, colors=N_CLUSTERS)
	# Add Alpha channel -> PNG
	result.putalpha(0)
	# Get a list of tuple (count, color)
	colors = result.getcolors(resize*resize)

    # Save colors to file
	pal = Image.new('RGB', (dimEachColorPalette*numColors, dimEachColorPalette))
	draw = ImageDraw.Draw(pal)

	posx = 0
	arrayColors = []
	#print "Printing Initial Centroids: "
	for count, col in colors:
        #print count, col
        # Draw a rectangle with palette colors
		draw.rectangle([posx, 0, posx+dimEachColorPalette, dimEachColorPalette], fill=col)
        # List of the palette Colors without alpha channel aka Initial centers
		arrayColors.append(col[:-1])
		posx = posx + dimEachColorPalette

 	del draw

	CENTROIDS = dict(zip(range(0, N_CLUSTERS), arrayColors))
	#print CENTROIDS

	pal.save(nameFile+"_cl"+str(N_CLUSTERS)+"_palette"+".png", "PNG")
	return image

########################################################################
# Euclidean distance between each pixel of the image and each centroid #
# Returns the number of centroid which is the nearest to a given pixel #
########################################################################
def findCluster(pixel):
	# Distance calculus
    dist = lambda pixel, clusterValue: math.sqrt( (clusterValue[1][0]-pixel[0])**2 + (clusterValue[1][1]-pixel[1])**2 + (clusterValue[1][2]-pixel[2])**2)
    # Vector with distances
    centroidDistances = ([dist(pixel, cv) for cv in CENTROIDS.items()])
    #print centroidDistances
    # Finding the nearest centroid and saving its number
    myCluster = centroidDistances.index(min([x for x in centroidDistances ]))
    #print myCluster
    return myCluster

def findNewMean(image):

	global NEW_CENTROIDS

	# Count how many pixel there are in each cluster
	count = [0] * N_CLUSTERS
	rgbValues = lambda: [0] * 3
	# Dictionary that saves the sum of r, g, b values of the pixel
	rgb4EachCluster = {key : rgbValues() for key in range(N_CLUSTERS)}

	# For each triple [x,y,n_cluster]
	# Add +1 to the counter of the cluster where the pixel belongs to
	# Increment r,g,b values in the correct entry of the dict
	for triple in CLUSTER_NO:
		count[triple[2]] += 1
		rgb4EachCluster[triple[2]][0] += image[triple[0],triple[1]][0]
		rgb4EachCluster[triple[2]][1] += image[triple[0],triple[1]][1]
		rgb4EachCluster[triple[2]][2] += image[triple[0],triple[1]][2]

	#print "printing rgb4Cluster: " + str(rgb4EachCluster)
	#print "printing count: " + str(count)
	pos = 0

	# Finding the new centroids as
	# sum(red_Color_of_total_pixels_in_a_given_centroid)/n_pixel_in_that_cluster
	# the same with g and b
	for pos in range(N_CLUSTERS):
		value = [0,0,0]
		if count[pos] == 0:
			value[0] = rgb4EachCluster[pos][0]
			value[1] = rgb4EachCluster[pos][1]
			value[2] = rgb4EachCluster[pos][2]
			#print rgb4EachCluster[pos][0],count[pos],value[0]
			#print rgb4EachCluster[pos][1],count[pos],value[1]
			#print rgb4EachCluster[pos][2],count[pos],value[2]
		else:
			value[0] = rgb4EachCluster[pos][0]/count[pos]
			value[1] = rgb4EachCluster[pos][1]/count[pos]
			value[2] = rgb4EachCluster[pos][2]/count[pos]
			#print rgb4EachCluster[pos][0],count[pos],value[0]
			#print rgb4EachCluster[pos][1],count[pos],value[1]
			#print rgb4EachCluster[pos][2],count[pos],value[2]
		NEW_CENTROIDS[pos] = value
	#print "Printing NEW CENTROIDS: " + str(NEW_CENTROIDS)

####################################
# Swaps NEW_CENTROIDS -> CENTROIDS #
####################################
def swapOldNewCentroids():
	global CENTROIDS
	tempDict = {k:v for k,v in CENTROIDS.items()}
	CENTROIDS = {}
	CENTROIDS = {k:v for k,v in NEW_CENTROIDS.items()}

###################################
# Saves the new clusterized image #
###################################
def saveNewImage(name):

	outputImage = Image.new("RGB",IMAGE_SIZE)
	print "Saving..."
	draw = ImageDraw.Draw(outputImage)

	for element in CLUSTER_NO:
		c = tuple(NEW_CENTROIDS.get(element[2]))
		draw.point((element[0],element[1]), fill=c)
	#outputImage.putdata(imageArray)
	outputImage.save(name+"_cl"+str(N_CLUSTERS)+"_it"+str(N_ITERATIONS)+EXTENSION,"JPEG",quality=100)
	print "That's the end"

########################
# Main k-means routine #
########################
def kmeans(image):

	global CLUSTER_NO
	global NEW_CENTROIDS
	# Loading the image
	loadedImage = image.load()
	#print loadedImage[0,0][0]
	x = 0
	y = 0
	# Finding initial classification
	for iter in range(N_ITERATIONS):
		print "#Iteration : " + str(iter)
		CLUSTER_NO = []
		NEW_CENTROIDS = {}

		# ClusterTuple has [x,y, n_cluster] for each pixel in the image
		# n_cluster is the number of cluster where the pixel belongs to
		for x in range(IMAGE_SIZE[0]):
			for y in range(IMAGE_SIZE[1]):
				clusterTuple = [0] * 3
				clusterTuple[0] = x # x position of the pixel
				clusterTuple[1] = y # y position of the pixel
				clusterTuple[2] = findCluster(loadedImage[x,y]) # cluster no. where the pixel belongs to
				CLUSTER_NO.append(clusterTuple)

		# Finds the new Cluster mean
		findNewMean(loadedImage)
		# Swaps the Dictionaries
		swapOldNewCentroids()

		#print "Printing CLUSTER NO " + str(CLUSTER_NO)
		#print "Printing NEW CENTROIDS: " + str(NEW_CENTROIDS)
		#print "Printing OLD CENTROIDS: " + str(CENTROIDS)
		#print ""
    #print "stamp Cluster no"
    #print CLUSTER_NO

##################
# Main procedure #
##################
def procedure(nameImage):

	# Finds initial centroids
	image = getPalette(nameImage)
	# Computes kmeans algorithm
	# Taking K-Means time
	T0=time.time()
	kmeans(image)
	T1=time.time()
	print "Time: " + str(T1-T0) + " seconds."
	# Saves the new clusterized image
	saveNewImage(nameImage)

########
# Main #
########
if __name__ == "__main__":

    global N_ITERATIONS
    global N_CLUSTERS
	# Not considering .py namefile
    args = sys.argv[1:]
   	# Name input image
    inputFile = args[0]
    print "File Name: " + str(inputFile)
    try:
        numColors = int(args[1])
        iterations = int(args[2])
        N_ITERATIONS = iterations
        N_CLUSTERS = numColors

    except:
        print "Error: #Colors or #Iteration are not correct"
        sys.exit()

    print "#Colors: " + str(numColors) + " #Iterations: " + str(iterations)

    nameFile,extension = os.path.splitext(inputFile)
    if (extension.lower() == ".jpg" or extension.lower() == ".jpeg") and numColors > 1 and numColors <= MAX_CLUSTERS and iterations > 0 and iterations <= MAX_ITERATIONS :
        procedure(nameFile)
    else:
		print "Unable to load the image, probably your parameters are incorrect"
		sys.exit()

Handling Jpg images

Before calling CUDA program, some preprocessing is needed in order to choose the starting centroids and to better represent an RGB image in C++.
First of all, an RGB .jpg image is converted into a .raw file containing the stream of RGB values (from 0 to 255) that represent the single image (sorted: from the top left to the bottom right side of the image). This way of “serialising pixels simplifies the image reading procedure on our CUDA code, since we won’t need to import any extra library to read .jpg images.

ImageConverter.py works as shown below.
To get a raw file from a jpg one the args must be:

>> python imageConverter.py image_name.jpg

Where image_name.jpg is our input image. This script returns image_name.raw that will be the input of our CUDA program.
To get the jpg file from a raw one the args must be:

python imageConverter.py image_name.raw Width Height

As mentioned before image_name.raw<\em> is our image represented by the stream of its RGB values. This conversion is used to convert the output of the CUDA program (that returns a raw file) into a jpg one.

ImageConverte.py is shown below.

'''
Andrea Toscano
Universita' degli Studi di Milano
MS Computer Science
Jpg - Raw Converter (uint8 stream)

With OSX Terminal:
	Install Pip (Python package manager):
		sudo easy_install pip

	Install PIL, Numpy:
		pip install PIL
		pip install numpy

Using Terminal move where .py and images are stored.

USE CASES
to get .raw file from .jpg file
:$ python imageConverter.py imageName.jpg
to get .jpg file from .raw file
:$ python imageConverter.py imageName.raw Width Height

'''
#Import modules
import sys, os, math, time
import numpy as np
from PIL import Image

#Global Variables
#Width
W = 0
#Height
H = 0
# File Name without extension
global NAMEFILE
#Current file extension
global EXTENSION

#Used to specify raw file extension
global RAW

def fromJpgToUint8():
	global W,H
	img = Image.open(NAMEFILE+EXTENSION)
	W = img.size[0]
	H = img.size[1]
	print "Width: " + str(W) + " Height: " + str(H)

	#print type(img)
	array = np.asarray(img)
	#print type(array)
	#print array.dtype

	# Debugging..
	# (300,400,3) means 400x300 with RGB channels
	#print "Array shape: " + str(array.shape)
	# Starting from top left bottom

	#print "Array: "
	#print array
	# 'uint8'
	#print "Array Type: "+ str(array.dtype)
	# Saving the array in .raw format
	print "Saving .raw ..."
	array.tofile(NAMEFILE+RAW)
	print "Saved!"

def fromUint8ToJpg():
	# Reading .raw file
	array = np.fromfile(NAMEFILE+EXTENSION, dtype = "uint8")
	#print array
	#print len(array)
	# Reshaping 1d Array into 2d Array
	outputImage = array.reshape(H,W,3)
	image = Image.fromarray(outputImage)
	print "Saving .jpg ...."
	# Saving new image
	image.save(NAMEFILE+"_PYoutput"+".jpg" ,quality=100)
	print "Saved!"

if __name__ == "__main__":
	global NAMEFILE, EXTENSION, RAW, W, H
	RAW = ".raw"
	args = sys.argv[1:]
	inputFile = args[0]
	print "File Name: " + str(inputFile)
	try:
		NAMEFILE,EXTENSION = os.path.splitext(inputFile)
		if  (EXTENSION.lower() == ".jpg" or EXTENSION.lower() == ".jpeg"):
			fromJpgToUint8()
		elif EXTENSION.lower() == RAW:
			W = int(args[1])
			H = int(args[2])
			fromUint8ToJpg()
		else:
			print "Error: Unknown image format."
	except IOError:
		print "Unable to load your image."

Finding the starting centroids

K-means performance strictly depends on the starting centroids. The better we choose the initial colours the better result we will obtain. For this reason the initial centroids are found analysing the histogram of the image.
This step is obtained using python PIL library as it follows:
Palette.py receives as args the jpg image and the number of clusters given by the user.

&gt;&gt;python palette.py image_name.jpg  no_Of_Cluster 

The output of the script (image_palette.raw) will be a raw file containing the stream of clusters’ RGB values. This file will be used as a secondd input of the CUDA program.
Palette.py

'''
Andrea Toscano
Universita' degli Studi di Milano
MS Computer Science
							   nameFile          n Clusters
USAGE :$   python palette.py photoName.jpg  		10
'''

# Import the modules
import sys, os, math, time
import numpy as np
from PIL import Image

# GLOBAL VARIABLES
# Change these values if you want
MAX_CLUSTERS = 50

# Stores w, h of the image
IMAGE_SIZE = []
EXTENSION = ".jpg"

# Finding the starting colors (centroids)
def getPalette(nameFile):

	global IMAGE_SIZE
	global CENTROIDS
	# Dim of each color palette
	dimEachColorPalette = 200
	image = 0
	try:
		image = Image.open(nameFile+EXTENSION)
	except IOError:
		print "IOError: Unable to open the image. "
		sys.exit()

	IMAGE_SIZE = image.size
	# Dim image
	print "Size Original Image: " + str(IMAGE_SIZE)
	# Dim of the resized image
	resize = int(0.5*IMAGE_SIZE[1])
    # Resize with antialias filter
	resizedImage = image.resize((resize, resize), Image.ANTIALIAS)
    # Get Palette
	result = resizedImage.convert('P', palette=Image.ADAPTIVE, colors=N_CLUSTERS)
	# Add Alpha channel -&gt; PNG
	result.putalpha(0)
	# Get a list of tuple (count, color)
	colors = result.getcolors(resize*resize)
	colors = [col[1][:-1] for col in colors]

	array = np.asarray(colors, dtype=np.uint8)
	print "Printing Initial Centroids: "
	print array
	array.tofile(nameFile+"_palette"+str(N_CLUSTERS)+".raw")
	print nameFile+"_palette"+str(N_CLUSTERS)+".raw" + " saved."

# Main
if __name__ == "__main__":

    global N_CLUSTERS
	# Not considering .py namefile
    args = sys.argv[1:]
   	# Name input image
    inputFile = args[0]
    print "File Name: " + str(inputFile)
    try:
        N_CLUSTERS = int(args[1])
    except:
        print "Error: #Initial Colors &gt; 1"
        sys.exit()

    print "#Initial Centroids: " + str(N_CLUSTERS) 

    nameFile,extension = os.path.splitext(inputFile)
    if (extension.lower() == ".jpg" or extension.lower() == ".jpeg") and N_CLUSTERS &gt; 1 and N_CLUSTERS &lt;= MAX_CLUSTERS:
    	getPalette(nameFile)
    else:
		print "Unable to load the image, probably your parameters are incorrect"
		print "USAGE:  palette.py photoName.jpg  		n    	 "
		print "palette.py: name of this script "
		print "photoName.jpg: name of jpg input image"
		print "n: number of Clusters"
		sys.exit()

K-Means implementation on CUDA

We know all that’s needed to perform the parallel computation of K-Means on Cuda.
The implementation is very simples and requires only a .cu file.
It takes as input:

image.raw output.raw width length image_palette.raw no_Of_Clusters no_Of_Iterations
  1. image.raw is our input image represented by the stream of its RGB values.
  2. output.raw is the output of CUDA algorithm. It represents the clusterized image and needs to be converted to jpg.
  3. width of the image.
  4. length of the image
  5. image_palette.raw represents the RGB initial centroids
  6. no_Of_Clusters defines the number of clusters used in the algorithm. Must match with those contained in image_palette.raw
  7. no_Of_Iterations defines the number of iterations of K-Means

The full code is here:

/**
 * Universitˆ degli Studi di Milano
 * GPU Computing
 *
 * Title: Parallel K-Means applied on RGB Images.
 * Created: JULY 13, 2014 11:33AM
 * Author: Andrea Toscano
 *
 *
 */
#include &lt;stdio.h&gt;
#include &lt;stdlib.h&gt;
#include &lt;stdbool.h&gt;
#include &lt;math.h&gt;
#include "timer.h"

// Number of threads
#define BLOCK_SIZE 16
#define GRID_SIZE 256

//Useful to read Error from CUDA Calls
#define CUDA_CALL(x) {if((x) != cudaSuccess){ \
  printf("CUDA error at %s:%d\n",__FILE__,__LINE__); \
  printf("  %s\n", cudaGetErrorString(cudaGetLastError())); \
  exit(EXIT_FAILURE);}}

// nCentroids and size on device
__constant__ int dev_nCentroids;
__constant__ int dev_size;

// global variables
int PALETTE_BYTES = 0; // nCentroids * sizeof(int)
int IMAGE_BYTES = 0;  // width * height * sizeof(int)

//**********************************
//R,G,B Centroid's triple on device
// nCentroids on GPU is HARDCODED remind to update it manually!
__constant__ int dev_RedCentroid[20];
__constant__ int dev_GreenCentroid[20];
__constant__ int dev_BlueCentroid[20];
//**********************************
/*
 *  get RGB values from Initial Centroids (RAW format)
 *
 */
bool loadPalette(char* filename, int nCentroids, int* redCentroid, int* greenCentroid, int*  blueCentroid) {

	FILE *imageFile;
	int length = 0;

	imageFile = fopen(filename,"r");
	if (imageFile == NULL) {
		return false;
	} else {
		for (int i = 0; i &lt; nCentroids; i++) {

			// R,G,B Centroid triple, nCentroids long
			redCentroid[i] = fgetc(imageFile);
			greenCentroid[i] = fgetc(imageFile);
			blueCentroid[i] = fgetc(imageFile);
			printf("%d, %d, %d\n",redCentroid[i], greenCentroid[i], blueCentroid[i] );
			length++;
		}
		fclose(imageFile);
		printf("\n");
		//printf("Palette Length: %d\n", length);
		return true;
	}
}

/*
 *  Image file loader (RAW format)
 *  Each pixel is identified by a triple r, g, b
 *  These values are taken from a raw file and put into three monodimensional arrays whose length is width*height
 */
bool loadRawImage(char* filename, int* r, int* g, int* b, int size) {
	FILE *imageFile;
	imageFile = fopen(filename, "r");

	if (imageFile == NULL) {
		return false;
	} else {
		for (int i = 0; i &lt; size; i++) {

			r[i] = fgetc(imageFile);
			g[i] = fgetc(imageFile);
			b[i] = fgetc(imageFile);
		}
		fclose(imageFile);

		/*for(int j = 0; j &lt; h * w; j++) {
			printf("%d, %d, %d ", r[j], g[j], b[j]);
		}*/
		return true;
	}
}

/*
 *  Image file writer (RAW format)
 */
bool writeRawImage(char* filename, int* labelArray, int* redCentroid, int* greenCentroid, int* blueCentroid, int size){
	FILE *imageFile;
	imageFile = fopen(filename, "wb");

	if(imageFile == NULL) {
		return false;
	} else {
		for (int i = 0; i &lt; size; i++) {
			fputc((char) redCentroid[labelArray[i]], imageFile);
			fputc((char) greenCentroid[labelArray[i]], imageFile);
			fputc((char) blueCentroid[labelArray[i]], imageFile);
		}
		fclose(imageFile);
		return true;
	}
}

/*******************************************************************/

/*
 *  Clears arrays before each kernel getClusterLabel iteration
 *
 */
__global__ void clearPaletteArrays(int *dev_sumRed,int *dev_sumGreen,int *dev_sumBlue, int* dev_pixelClusterCounter, int* dev_tempRedCentroid, int* dev_tempGreenCentroid, int* dev_tempBlueCentroid ) {

	// 1 block, 16x16 threads
	int threadID = threadIdx.x + threadIdx.y * blockDim.x;

	if(threadID &lt; dev_nCentroids) {

		// nCentroids long
		dev_sumRed[threadID] = 0;
		dev_sumGreen[threadID] = 0;
		dev_sumBlue[threadID] = 0;
		dev_pixelClusterCounter[threadID] = 0;
		dev_tempRedCentroid[threadID] = 0;
		dev_tempGreenCentroid[threadID] = 0;
		dev_tempBlueCentroid[threadID] = 0;
	}
}// end clearPaletteArrays

/*
 *  Clear label array before each kernel getClusterLabel iteration
 */
__global__ void clearLabelArray(int *dev_labelArray){

	// Global thread index
	int threadID = (threadIdx.x + blockIdx.x * blockDim.x) + (threadIdx.y + blockIdx.y * blockDim.y) * blockDim.x * gridDim.x;

	// labelArray is "size" long
	if(threadID &lt; dev_size) {
		dev_labelArray[threadID] = 0;
	}
}// end clearLabelArray

/*
 * Finds the minimum distance between each triple dev_Red[i] dev_Green[i] dev_Blue[i] and all centroids.
 * Then saves the equivalent centroid label in dev_labelArray.
 * labelArray is   "width*height" long, monodimensional array
 *
 * INPUT : pixel triple arrays dev_Red, dev_Green, dev_Blue. labelArray that will contains the label for each pixel triple
 */
__global__ void getClusterLabel(int *dev_Red,int *dev_Green,int *dev_Blue,int *dev_labelArray) {

	// Global thread index
	int threadID = (threadIdx.x + blockIdx.x * blockDim.x) + (threadIdx.y + blockIdx.y * blockDim.y) * blockDim.x * gridDim.x;

	//default min value of distance
	float min = 500.0, value;
	//will be label
	int index = 0;

	if(threadID &lt; dev_size) {
		// Finding the nearest centroid to current triple identified by threadID thread
		for(int i = 0; i &lt; dev_nCentroids; i++) {
			// Performing Euclidean distance, Saving current value
			value = sqrtf(powf((dev_Red[threadID]-dev_RedCentroid[i]),2.0) + powf((dev_Green[threadID]-dev_GreenCentroid[i]),2.0) + powf((dev_Blue[threadID]-dev_BlueCentroid[i]),2.0));

			if(value &lt; min){
				// saving new nearest centroid
				min = value;
				// Updating his index
				index = i;
			}
		}// end for
		// Writing to global memory the index of the nearest centroid
		// for dev_Red[threadID], dev_Green[threadID], dev_Blue[threadID] pixel triple
		dev_labelArray[threadID] = index;

	}// end if
}// end getClusterLabel

/*
 *  Summing Red, Green, Blue values per cluster
 *  Counting how many pixels there are in each cluster
 *
 */

__global__ void sumCluster(int *dev_Red,int *dev_Green,int *dev_Blue,int *dev_sumRed,int *dev_sumGreen,int *dev_sumBlue,int *dev_labelArray,int *dev_pixelClusterCounter) {

	// Global thread index
	int threadID = (threadIdx.x + blockIdx.x * blockDim.x) + (threadIdx.y + blockIdx.y * blockDim.y) * blockDim.x * gridDim.x;

	if(threadID &lt; dev_size) {
		int currentLabelArray = dev_labelArray[threadID];
		int currentRed = dev_Red[threadID];
		int currentGreen = dev_Green[threadID];
		int currentBlue = dev_Blue[threadID];
		// Writing to global memory needs a serialization. Many threads are writing into the same few locations
		atomicAdd(&amp;dev_sumRed[currentLabelArray], currentRed);
		atomicAdd(&amp;dev_sumGreen[currentLabelArray], currentGreen);
		atomicAdd(&amp;dev_sumBlue[currentLabelArray], currentBlue);
		atomicAdd(&amp;dev_pixelClusterCounter[currentLabelArray], 1);
	}
}// end sumCluster

/*
 *  Calculates the new R,G,B values of the centroids dividing the sum of color (for each channel) by the number of pixels in that cluster
 *	New values are stored in global memory since the current R,G,B values of the centroids are in read-only constant memory.
 */
__global__ void newCentroids(int *dev_tempRedCentroid, int *dev_tempGreenCentroid, int *dev_tempBlueCentroid,int* dev_sumRed, int *dev_sumGreen,int *dev_sumBlue, int* dev_pixelClusterCounter) {

	// 1 block , 16*16 threads
	int threadID = threadIdx.x + threadIdx.y * blockDim.x;

	if(threadID &lt; dev_nCentroids) {
		int currentPixelCounter = dev_pixelClusterCounter[threadID];
		int sumRed = dev_sumRed[threadID];
		int sumGreen = dev_sumGreen[threadID];
		int sumBlue = dev_sumBlue[threadID];

		//new RGB Centroids' values written in global memory
		dev_tempRedCentroid[threadID] = (int)(sumRed/currentPixelCounter);
		dev_tempGreenCentroid[threadID] = (int)(sumGreen/currentPixelCounter);
		dev_tempBlueCentroid[threadID] = (int)(sumBlue/currentPixelCounter);
	}

}// end newCentroids

/*******************************************************************/

int main(int argc, char *argv[]) {

		// init device
		cudaSetDevice(0);
		cudaDeviceSynchronize();
		cudaThreadSynchronize();

		//input raw file, output raw file, input palette raw file containing RGB values of initial centroids
		char *inputFile, *outputFile, *palette;
		//Pixels' r,g,b values. Centroid's r,g,b values
		int *red, *green, *blue, *redCentroid, *greenCentroid, *blueCentroid;

		// ref to GPU  Pixels'RGB values, Centroids' RGB values
		int *dev_Red, *dev_Green, *dev_Blue, *dev_tempRedCentroid, *dev_tempGreenCentroid, *dev_tempBlueCentroid;
		// array containing ref to GPU label array variable
		int *labelArray, *dev_labelArray;

		// local variables for storing image width, height
		// number of cluster, number of iterations, linear size of the image ( = width * height)
		int width, height, nCentroids, nIterations,size;
		//int IMAGE_BYTES, PALETTE_BYTES;

		// ref to array where pixels' count are stored
		int *pixelClusterCounter, *dev_pixelClusterCounter;
		// ref to arrays where sum of RGB values for each cluster are stored
		int *sumRed, *sumGreen, *sumBlue;
		int *dev_sumRed, *dev_sumGreen, *dev_sumBlue;

		// checking arguments
		if (argc &gt; 7) {
			inputFile = argv[1];
			outputFile = argv[2];
			width = atoi(argv[3]);
			height = atoi(argv[4]);
			palette = argv[5];
			nCentroids = atoi(argv[6]);  // remind to update hardcoded nCentroids above
			if(nCentroids &gt; 256)
				nCentroids = 256;
			nIterations = atoi(argv[7]);
			if(nIterations &gt; 15)
				nIterations = 15;

		} else {
			printf("  USAGE: kmeans.cu &lt;inputfile.raw&gt; &lt;outputfile.raw&gt; nRows nCols palette nCentroids nItarations \n");
			printf("           &lt;inputfile.raw&gt;: input .raw file (sequence of bytes)\n");
			printf("          &lt;outputfile.raw&gt;: output .raw file\n");
			printf("                  nRows: the number of rows of the image\n");
			printf("                  nCols: the number of columns of the image\n");
			printf("			      palette: RGB initial Centroids");
			printf("			   nCentroids: number of clusters");
			printf("			  nIterations: number of iterations of K-Means");

			return 0;
		}

		// Setting image and palette size in bytes
		IMAGE_BYTES = width * height * sizeof(int);
		PALETTE_BYTES = nCentroids * sizeof(int);
		size = width * height;

		/* TODO useful for Texture implementation
		int ra[width][height];
		int ga[width][height];
		int ba[width][height];
		*/

		printf("Image: %s\n",inputFile);
		printf("Width: %d, Height: %d\n", width, height);
		printf("#Clusters: %d, #Iterations: %d\n", nCentroids, nIterations);

		// allocate memory on CPU
		red = static_cast&lt;int *&gt;(malloc(IMAGE_BYTES));
		green = static_cast&lt;int *&gt;(malloc(IMAGE_BYTES));
		blue = static_cast&lt;int *&gt;(malloc(IMAGE_BYTES));
		redCentroid = static_cast&lt;int *&gt;(malloc(PALETTE_BYTES));
		greenCentroid = static_cast&lt;int *&gt;(malloc(PALETTE_BYTES));
		blueCentroid = static_cast&lt;int *&gt;(malloc(PALETTE_BYTES));
		labelArray = static_cast&lt;int *&gt;(malloc(IMAGE_BYTES));
		sumRed = static_cast&lt;int*&gt;(malloc(PALETTE_BYTES));
		sumGreen = static_cast&lt;int*&gt;(malloc(PALETTE_BYTES));
		sumBlue = static_cast&lt;int*&gt;(malloc(PALETTE_BYTES));
		pixelClusterCounter = static_cast&lt;int*&gt;(malloc(PALETTE_BYTES));

		// Setting initial centroids
		printf("Initial Centroids: \n");
		if(loadPalette(palette, nCentroids, redCentroid, greenCentroid, blueCentroid)) {
		} else {
			printf("Unable to set Initial Centroids.\n");
		}

		// Loading image in r, g, b arrays
		printf("Image loading...\n");
		if (loadRawImage(inputFile, red, green, blue, size)) {
			printf("Image loaded!\n");
		} else {
			printf("NOT loaded!\n");
			return -1;
		}

		printf("\n");

		/* TODO Useful for Texture implementation
		int ri = 0;
		int co = 0;

		for( ri = 0; ri &lt; width; ri++){
			for( co = 0; co &lt; height; co++) {
			int x = ri+co*width;
			ra[ri][co] = red[x];
			ga[ri][co] = green[x];
			ba[ri][co] = blue[x];
			}
		}
*/

		if(IMAGE_BYTES == 0 || PALETTE_BYTES == 0) {
			return -1;
		}

		// allocate memory on GPU
		CUDA_CALL(cudaMalloc((void**) &amp;dev_Red, IMAGE_BYTES));
		CUDA_CALL(cudaMalloc((void**) &amp;dev_Green, IMAGE_BYTES));
		CUDA_CALL(cudaMalloc((void**) &amp;dev_Blue, IMAGE_BYTES));
		CUDA_CALL(cudaMalloc((void**) &amp;dev_tempRedCentroid, PALETTE_BYTES));
		CUDA_CALL(cudaMalloc((void**) &amp;dev_tempGreenCentroid, PALETTE_BYTES));
		CUDA_CALL(cudaMalloc((void**) &amp;dev_tempBlueCentroid, PALETTE_BYTES));
		CUDA_CALL(cudaMalloc((void**) &amp;dev_labelArray, IMAGE_BYTES));
		CUDA_CALL(cudaMalloc((void**) &amp;dev_sumRed, PALETTE_BYTES));
		CUDA_CALL(cudaMalloc((void**) &amp;dev_sumGreen, PALETTE_BYTES));
		CUDA_CALL(cudaMalloc((void**) &amp;dev_sumBlue, PALETTE_BYTES));
		CUDA_CALL(cudaMalloc((void**) &amp;dev_pixelClusterCounter, PALETTE_BYTES));

		// copy host CPU memory to GPU
		CUDA_CALL(cudaMemcpy(dev_Red, red, IMAGE_BYTES, cudaMemcpyHostToDevice));
		CUDA_CALL(cudaMemcpy(dev_Green, green, IMAGE_BYTES, cudaMemcpyHostToDevice));
		CUDA_CALL(cudaMemcpy(dev_Blue, blue, IMAGE_BYTES, cudaMemcpyHostToDevice));
		CUDA_CALL(cudaMemcpy(dev_tempRedCentroid, redCentroid,PALETTE_BYTES,cudaMemcpyHostToDevice ));
		CUDA_CALL(cudaMemcpy(dev_tempGreenCentroid, greenCentroid,PALETTE_BYTES,cudaMemcpyHostToDevice ));
		CUDA_CALL(cudaMemcpy(dev_tempBlueCentroid, blueCentroid,PALETTE_BYTES,cudaMemcpyHostToDevice ));
		CUDA_CALL(cudaMemcpy(dev_labelArray, labelArray, IMAGE_BYTES, cudaMemcpyHostToDevice));
        CUDA_CALL(cudaMemcpy(dev_sumRed, sumRed, PALETTE_BYTES, cudaMemcpyHostToDevice));
		CUDA_CALL(cudaMemcpy(dev_sumGreen, sumGreen, PALETTE_BYTES, cudaMemcpyHostToDevice));
		CUDA_CALL(cudaMemcpy(dev_sumBlue, sumBlue, PALETTE_BYTES, cudaMemcpyHostToDevice));
		CUDA_CALL(cudaMemcpy(dev_pixelClusterCounter, pixelClusterCounter, PALETTE_BYTES, cudaMemcpyHostToDevice));
		CUDA_CALL(cudaMemcpyToSymbol(dev_RedCentroid, redCentroid, PALETTE_BYTES));
		CUDA_CALL(cudaMemcpyToSymbol(dev_GreenCentroid, greenCentroid, PALETTE_BYTES));
		CUDA_CALL(cudaMemcpyToSymbol(dev_BlueCentroid, blueCentroid, PALETTE_BYTES));
		CUDA_CALL(cudaMemcpyToSymbol(dev_nCentroids,&amp;nCentroids, sizeof(int)));
		CUDA_CALL(cudaMemcpyToSymbol(dev_size, &amp;size, sizeof(int)));

		// Clearing centroids on host
		for(int i = 0; i &lt; nCentroids; i++) {
			redCentroid[i] = 0;
			greenCentroid[i] = 0;
			blueCentroid[i] = 0;
		}

		// Defining grid size

		int BLOCK_X, BLOCK_Y;
		BLOCK_X = ceil(width/BLOCK_SIZE);
		BLOCK_Y = ceil(height/BLOCK_SIZE);
		if(BLOCK_X &gt; GRID_SIZE)
			BLOCK_X = GRID_SIZE;
		if(BLOCK_Y &gt; GRID_SIZE)
			BLOCK_Y = GRID_SIZE;

		//2D Grid
		//Minimum number of threads that can handle width¡height pixels
	 	dim3 dimGRID(BLOCK_X,BLOCK_Y);
	 	//2D Block
	 	//Each dimension is fixed
		dim3 dimBLOCK(BLOCK_SIZE,BLOCK_SIZE);

		//Starting timer
		GpuTimer timer;
		timer.Start();

		printf("Launching K-Means Kernels..	\n");
		//Iteration of kmeans algorithm
		for(int i = 0; i &lt; nIterations; i++) {

			// Passing image RGB components, palette RGB components, label Array, number of Clusters
			// Init  arrays' values to 0
			// Kernel needs only 1 block since nClusters
			clearPaletteArrays&lt;&lt;&lt;1, dimBLOCK&gt;&gt;&gt;(dev_sumRed, dev_sumGreen, dev_sumBlue, dev_pixelClusterCounter, dev_tempRedCentroid, dev_tempGreenCentroid, dev_tempBlueCentroid);

			// Init labelarray values to 0
			clearLabelArray&lt;&lt;&lt;dimGRID, dimBLOCK&gt;&gt;&gt;(dev_labelArray);

			// Calculates the distance from each pixel and all centroids
			// Then saves the equivalent label in dev_labelArray
			getClusterLabel&lt;&lt;&lt; dimGRID, dimBLOCK &gt;&gt;&gt; (dev_Red, dev_Green, dev_Blue,dev_labelArray);

			//Sums RGB values in each Cluster
			sumCluster&lt;&lt;&lt;dimGRID, dimBLOCK&gt;&gt;&gt; (dev_Red, dev_Green, dev_Blue, dev_sumRed, dev_sumGreen, dev_sumBlue, dev_labelArray,dev_pixelClusterCounter);

			//Finds new RGB Centroids' values
			newCentroids&lt;&lt;&lt;1,dimBLOCK &gt;&gt;&gt;(dev_tempRedCentroid, dev_tempGreenCentroid, dev_tempBlueCentroid, dev_sumRed, dev_sumGreen, dev_sumBlue, dev_pixelClusterCounter);

			//Old RGB Centroids' values are in constant memory
			//Updated RGB Centroids' values are in global memory
			//We need a swap
			CUDA_CALL(cudaMemcpy(redCentroid, dev_tempRedCentroid, PALETTE_BYTES,cudaMemcpyDeviceToHost));
			CUDA_CALL(cudaMemcpy(greenCentroid, dev_tempGreenCentroid, PALETTE_BYTES,cudaMemcpyDeviceToHost));
			CUDA_CALL(cudaMemcpy(blueCentroid, dev_tempBlueCentroid, PALETTE_BYTES,cudaMemcpyDeviceToHost));
			//Uploading in constant memory updated RGB Centroids' values
			CUDA_CALL(cudaMemcpyToSymbol(dev_RedCentroid, redCentroid, PALETTE_BYTES));
			CUDA_CALL(cudaMemcpyToSymbol(dev_GreenCentroid, greenCentroid, PALETTE_BYTES));
			CUDA_CALL(cudaMemcpyToSymbol(dev_BlueCentroid, blueCentroid, PALETTE_BYTES));
			timer.Stop();
		}

		// DEBUG
		CUDA_CALL(cudaMemcpy(labelArray, dev_labelArray, IMAGE_BYTES, cudaMemcpyDeviceToHost));
		CUDA_CALL(cudaMemcpy(sumRed, dev_sumRed, PALETTE_BYTES, cudaMemcpyDeviceToHost));
		CUDA_CALL(cudaMemcpy(sumGreen, dev_sumGreen, PALETTE_BYTES, cudaMemcpyDeviceToHost));
		CUDA_CALL(cudaMemcpy(sumBlue, dev_sumBlue, PALETTE_BYTES, cudaMemcpyDeviceToHost));
		CUDA_CALL(cudaMemcpy(pixelClusterCounter, dev_pixelClusterCounter, PALETTE_BYTES, cudaMemcpyDeviceToHost));

		printf("Kmeans code ran in: %f msecs.\n", timer.Elapsed());
		printf("\n");

		// labelArray DEBUG
		  int counter = 0;

		printf("Label Array:\n");
		for(int i = 0; i &lt; (size); i++) {
			//printf("%d\n", labelArray[i]);
			counter++;
		}
		printf("printing counter %d\n", counter);
		counter = 0;

		printf("Sum Arrays:\n");
		for(int j = 0; j &lt; nCentroids; j++) {
			printf("r: %u g: %u b: %u \n", sumRed[j], sumGreen[j], sumBlue[j]);
			counter++;
		}

		printf("\n");

		printf("Pixels per centroids:\n");
		for(int k = 0; k &lt; nCentroids; k++){
			printf("%d centroid: %d pixels\n", k, pixelClusterCounter[k]);
		}

		printf("\n");

		printf("New centroids:\n");
		for(int i = 0; i &lt; nCentroids; i++) {

			printf("%d, %d, %d \n", redCentroid[i], greenCentroid[i], blueCentroid[i]);
		}

		// writing...
		printf("Image writing...\n");

		if (writeRawImage(outputFile,labelArray, redCentroid, greenCentroid,  blueCentroid,  size)) {
			printf("Image written!\n");
		} else {
			printf("NOT written!\n");
			return -1;
		}

		free(red);
		free(green);
		free(blue);
		free(redCentroid);
		free(greenCentroid);
		free(blueCentroid);
		free(labelArray);
		free(sumRed);
		free(sumGreen);
		free(sumBlue);
		free(pixelClusterCounter);

		CUDA_CALL(cudaFree(dev_Red));
		CUDA_CALL(cudaFree(dev_Green));
		CUDA_CALL(cudaFree(dev_Blue));
		CUDA_CALL(cudaFree(dev_tempRedCentroid));
		CUDA_CALL(cudaFree(dev_tempGreenCentroid));
		CUDA_CALL(cudaFree(dev_tempBlueCentroid));
		CUDA_CALL(cudaFree(dev_labelArray));
		CUDA_CALL(cudaFree(dev_sumRed));
		CUDA_CALL(cudaFree(dev_sumGreen));
		CUDA_CALL(cudaFree(dev_sumBlue));
		CUDA_CALL(cudaFree(dev_pixelClusterCounter));

		printf("That's the end.\n");
		return 0;
}

Reading the RGB raw image

Let’s examine the code from the very first step.
LoadRawImage routine shows how all R, G, and B values of the raw image are added in r[], g[] and b[] arrays, respectively.
LoadPalette routine reads image_palette-raw and copies its values into three arrays like loadRawImage function.

Cuda Kernels

Memory space is allocated on the GPU and the grid is defined on a 2D space. Each block in the grid is 2D too.
The maximum number of threads for each block is set to 16 and the dimension of the grid is related to the width and height of the image.

BLOCK_X = ceil(width/BLOCK_SIZE);
BLOCK_Y = ceil(height/BLOCK_SIZE); 

Since the centroids are updated only at the end of each iteration, I chose to put them in constant memory (only read). They are updated at the beginning of the new loop with the values found at the previous iteration ok K-Means.
As the sequential version, the parallel algorithm consists in 3 main routines:

  1. getClusterLabel: this function receives the r,g,b arrays and returns the nearest centroid for each pixel. Have a look to the code for the global thread index identification.
  2. sumCluster: for all pixels in a cluster it sums separately their R,G,B in 3 different arrays so that each output array represents the sum for a single R,G,B component.
  3. newCentroids: For each cluster it simply divides the sum (for each component obtained in the previous step) by the number of pixels contained in each cluster

The result is copied back to CPU and the last routine saves the output image into a raw file.
Now you can run again imageConverter.py specifying that you want a jpg from a raw file:

&gt;&gt;python imageConverter.py imageName.raw Width Height

Note: you have also to specify the width and height of the image to convert the raw stream into a new picture.

Recap

  1. Choose a jpg and run imageConverter.py to get the raw representation of the image.
  2. Once you have decided the number of clusters of k-means, run palette.py with the same image to get the initial centroids.
  3. Now you can run CUDAkmeans.cu in order to get the clusterized image (raw)
  4. Run again imageConverter.py to convert the raw file into a jpg one.

You can find some examples on my GitHub repo.

Image clusterized with K-Means on Cuda

Image clusterized with K-Means on Cuda

November 20th, 2016 at 06:11 by ruzi

Reply



sir, the code is awesome, but when i run the cudakmeans, the output show me some bug/error, it produce -1 in every centroid, am i doing it wrong? because when i run the python code, the code produce the right answer
and pardon with my english

November 20th, 2016 at 11:34 by AndrewTosky

Reply



Hi Ruzi,
first of all, thank you for reading my blog 🙂 .
Concerning your problem,
if you’re using NSight as IDE, could you please debug my code to see where it fails?
You have to consider in addition that I developed it a couple of years ago with an outdated CUDA version, (likely 5.5).
Which CUDA version are you using?

Thanks
Andrea

November 20th, 2016 at 13:04 by ruzi

Reply



wow thanks for responding to sir, i’m new to this cuda thing
i run the code with cmd, because when i use VS the program stop when saving the output.raw, i tried to trace it with some printf, what i found that the gaussian value always -1, and i assume the cudamemcpy might not copying the value.
i want to debug it, but the program always say access violation
reading location

im using cuda 8.0, maybe if i downgrade my cuda it will work?

November 20th, 2016 at 23:16 by AndrewTosky

Reply



Hi Ruzi,
it must surely be a version issue since I tested my code with both cmd and Nsight (Linux) on GeForce and Tesla Nvidia architectures.
I’d suggest you not to downgrade but to understand where the problem is..maybe trying a different allocation method.
Are you sure all data are valid just before passing them to the GPU? Are you printing matrixes’ sizes and so on?

Cheers
Andrea

November 22nd, 2016 at 15:35 by ruzi



i still identifying the error, but i found something, the allocation is okay (my bad), because actually it copying the value of the initial centroids, so the problem might be in the iteration (still no clue)

does the code have some limitation on jpg dimension?
does cuda need to be config first?

Thanks

November 22nd, 2016 at 23:22 by AndrewTosky



Hi Ruzi, my CUDA code has no limitations since it handles a text file holding a stream of int values whose range is [0,255]. Then it interprets them as a RGB matrix, as you can see in the first lines of my CUDA program.

Python code, on the contrary, relies on the fact that input image’s extension is “jpg” (line 27 of Python code) . You can easily make it generic, if you want 🙂

November 23rd, 2016 at 02:14 by ruzi



thanks sir still helping me 🙂

yeah because of the raw file right? but i found something weird, in load raw image function, i print it by r g b, at the beginning it was okay, it print the r g b number, but when it reach certain loop of print, all the values become -1, i use 3 dif image, and all of them have different -1 start, 1 jpg start at iteration 63, and the other somewhere around 400 and 300.
i think this is the main problem with the running

November 24th, 2016 at 22:04 by AndrewTosky



Hi Ruzi,

yes the .raw file is just a simple solution to read an image without using opencv or any other external library.

Are you declaring image width and height correctly as command line input of CUDA program?

Do the images you used have the same size? I’m trying to understand whether the program fails at the same point or not

November 25th, 2016 at 05:48 by ruzi



i have image with width:236 height:244 i change it to raw, to check if the raw correct, i change it into jpg again, width and height still the same, but through that process, the file size change.

and when i start the command, i use
test3.exe 4.raw 444.raw 236 244 4_palette10.raw 10 10
and
test3.exe 4.raw 444.raw 244 236 4_palette10.raw 10 10

both still producing -1

December 5th, 2016 at 23:23 by AndrewTosky



Here debugging becomes essential.. otherwise it will be difficult to sort it out.

December 6th, 2016 at 06:21 by ruzi

Reply



Thanks still keeping up with me,
i debug it, but when i run the debug, it’s having error on location for write raw method.

Btw do i need to config the NVIDIA?
my step -> install the Visual Studio -> install CUDA -> add cl.exe to the environment so i can run it in CMD, am I missing something? thanks

December 6th, 2016 at 08:45 by ruzi

Reply



Hello, the code now working sir, i just need to change the fopen form r into rb. Thanks for everything, Sir. Your code is awesome 🙂

December 9th, 2016 at 21:53 by AndrewTosky

Reply



Setup Visual Studio for CUDA and debug the program properly, since the CUDA version I used is now outdated.
That’t the only suggestion I can give you 🙂