Faculty Advisor

Dr. Mohamed Y. Eltabakh

Faculty Advisor

Dr. Dmitry Korkin

Identifier

etd-042518-133732

Abstract

Clustering is the process of grouping a set of objects so that objects in the same group are more similar to each other than to those in other groups. It is used in many fields, including machine learning, image recognition, pattern recognition, and knowledge discovery. In the era of Big Data, the computing power of a distributed environment can be leveraged to cluster large datasets. Clustering can be performed by various algorithms, but these generally have high time complexities, so for large datasets both scalability and the parameters of the execution environment become issues that need to be addressed. A brute-force implementation is therefore not scalable to large datasets even in a distributed environment, which calls for approximation techniques or optimizations. We study three clustering techniques, CURE, DBSCAN, and k-means, in a distributed environment such as Hadoop. For each algorithm we examine its performance trade-offs and bottlenecks and then propose enhancements, optimizations, or an approximation technique to make it scalable in Hadoop. Finally, we evaluate each algorithm's performance and its suitability to datasets of different sizes and distributions.
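The thesis text itself is not part of this record, so the following is only a minimal, hypothetical sketch of the k-means-on-Hadoop pattern the abstract alludes to, not the thesis's actual implementation: the map step assigns each input point to its nearest current centroid, and the reduce step averages each cluster's points to produce updated centroids; a driver program would rerun the job until the centroids converge. The class names and the "kmeans.centroids" configuration key are invented for illustration.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class KMeansIteration {

  // Map: assign each point (a comma-separated coordinate line) to its
  // nearest centroid. Centroids are assumed to be broadcast to mappers
  // via the job Configuration under the hypothetical key
  // "kmeans.centroids", formatted as "x1,y1;x2,y2;...".
  public static class AssignMapper
      extends Mapper<LongWritable, Text, IntWritable, Text> {
    private double[][] centroids;

    @Override
    protected void setup(Context context) {
      String[] parts = context.getConfiguration()
          .get("kmeans.centroids").split(";");
      centroids = new double[parts.length][];
      for (int i = 0; i < parts.length; i++) {
        String[] dims = parts[i].split(",");
        centroids[i] = new double[dims.length];
        for (int d = 0; d < dims.length; d++) {
          centroids[i][d] = Double.parseDouble(dims[d]);
        }
      }
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      String[] dims = value.toString().split(",");
      double[] p = new double[dims.length];
      for (int d = 0; d < dims.length; d++) {
        p[d] = Double.parseDouble(dims[d]);
      }
      // Find the centroid with the smallest squared Euclidean distance.
      int best = 0;
      double bestDist = Double.MAX_VALUE;
      for (int i = 0; i < centroids.length; i++) {
        double dist = 0;
        for (int d = 0; d < p.length; d++) {
          double diff = p[d] - centroids[i][d];
          dist += diff * diff;
        }
        if (dist < bestDist) { bestDist = dist; best = i; }
      }
      context.write(new IntWritable(best), value);
    }
  }

  // Reduce: recompute each centroid as the mean of its assigned points.
  public static class RecomputeReducer
      extends Reducer<IntWritable, Text, IntWritable, Text> {
    @Override
    protected void reduce(IntWritable clusterId, Iterable<Text> points,
        Context context) throws IOException, InterruptedException {
      double[] sum = null;
      long count = 0;
      for (Text point : points) {
        String[] dims = point.toString().split(",");
        if (sum == null) sum = new double[dims.length];
        for (int d = 0; d < dims.length; d++) {
          sum[d] += Double.parseDouble(dims[d]);
        }
        count++;
      }
      StringBuilder sb = new StringBuilder();
      for (int d = 0; d < sum.length; d++) {
        if (d > 0) sb.append(',');
        sb.append(sum[d] / count);
      }
      context.write(clusterId, new Text(sb.toString()));
    }
  }
}

Each iteration is a full MapReduce job, which illustrates the scalability concern the abstract raises: the per-iteration job overhead and repeated passes over the data are exactly the kind of bottleneck that motivates approximation or optimization.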

Publisher

Worcester Polytechnic Institute

Degree Name

MS

Department

Computer Science

Project Type

Thesis

Date Accepted

2018-04-25

Accessibility

Restricted (WPI community only)

Subjects

clustering, Hadoop

Available for download on Saturday, April 25, 2020
