Dr. Mohamed Y. Eltabakh
Clustering is the process of grouping a set of objects such that objects in the same group are more similar to each other than to those in other groups. It is used in many fields, including machine learning, image recognition, pattern recognition, and knowledge discovery. In this era of Big Data, we can leverage the computing power of a distributed environment to cluster large datasets. Clustering can be achieved through various algorithms, but these generally have high time complexity, so for large datasets both scalability and the parameters of the execution environment become issues that need to be addressed. A brute-force implementation is therefore not scalable over large datasets even in a distributed environment, which calls for approximation techniques or optimizations to make it scalable. We study three clustering techniques, CURE, DBSCAN, and k-means, in a distributed environment such as Hadoop. For each algorithm we analyze its performance trade-offs and bottlenecks, and then propose enhancements, optimizations, or approximation techniques to make it scalable in Hadoop. Finally, we evaluate each algorithm's performance and suitability for datasets of different sizes and distributions.
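To make the clustering setting concrete, the sketch below shows a minimal single-machine k-means, one of the three algorithms the thesis studies. This is an illustrative assumption, not the thesis's Hadoop implementation: the assignment and update steps shown here are the ones that get distributed as map and reduce phases, and the naive first-k-points initialization stands in for the more careful seeding a real system would use.

```python
# Minimal k-means sketch (single machine). The per-point assignment step and
# the per-cluster mean update are the parts a MapReduce version distributes.
def kmeans(points, k, iters=20):
    # Naive deterministic init for the sketch: take the first k points.
    # (A real implementation would use random sampling or k-means++.)
    centroids = [tuple(p) for p in points[:k]]
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        # Assignment step: each point joins its nearest centroid's cluster.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(
                range(k),
                key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])),
            )
            clusters[nearest].append(p)
        # Update step: recompute each centroid as the mean of its cluster.
        for i, cluster in enumerate(clusters):
            if cluster:
                centroids[i] = tuple(sum(v) / len(cluster) for v in zip(*cluster))
    return centroids, clusters

# Two well-separated groups of points; k=2 recovers them.
pts = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.1, 4.9)]
cents, cls = kmeans(pts, 2)
```

On this toy input the algorithm converges in a couple of iterations; the interesting question the thesis addresses is what happens when the assignment step must scan millions of points per iteration and centroids must be aggregated across nodes.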
Worcester Polytechnic Institute
All authors have granted to WPI a nonexclusive royalty-free license to distribute copies of the work. Copyright is held by the author or authors, with all rights reserved, unless otherwise noted. If you have any questions, please contact email@example.com.
Restricted: WPI community only
Shetty, Kartik, "Evaluating Clustering Techniques over Big Data in Distributed Infrastructures" (2018). Masters Theses (All Theses, All Years). 1226.
Available for download on Saturday, April 25, 2020