MapReduce is a distributed programming framework designed to ease the development of scalable data-intensive applications for large clusters of commodity machines.
Most machine learning and data mining applications involve iterative computations
over large datasets, such as the Web hyperlink structures and social network graphs.
Yet, the MapReduce model does not efficiently support this important class of applications. The architecture of MapReduce, most critically its dataflow techniques and
task scheduling, is completely unaware of the nature of iterative applications; tasks
are scheduled according to a policy that optimizes the execution for a single iteration which wastes bandwidth, I/O, and CPU cycles when compared with an optimal
execution for a consecutive set of iterations.
This work presents iHadoop, a modified MapReduce model, and an associated
implementation, optimized for iterative computations. The iHadoop model schedules
iterations asynchronously. It connects the output of one iteration to the next, allowing
both to process their data concurrently. iHadoop's task scheduler exploits inter-
iteration data locality by scheduling tasks that exhibit a producer/consumer relation
on the same physical machine allowing a fast local data transfer. For those iterative
applications that require satisfying certain criteria before termination, iHadoop runs the check concurrently during the execution of the subsequent iteration to further
reduce the application's latency.
This thesis also describes our implementation of the iHadoop model, and evaluates its performance against Hadoop, the widely used open source implementation of
MapReduce. Experiments using different data analysis applications over real-world
and synthetic datasets show that iHadoop performs better than Hadoop for iterative
algorithms, reducing execution time of iterative applications by 25% on average. Furthermore, integrating iHadoop with HaLoop, a variant Hadoop implementation that
caches invariant data between iterations, reduces execution time by 38% on average.
|Date of Award||Aug 2011|
|Original language||English (US)|
- Computer, Electrical and Mathematical Sciences and Engineering
|Supervisor||Mohamed Ramadan (Supervisor)|