Last updated: 28-01-2016
In recent months, a university colleague, Gino Farisano, and I have been working on a Hadoop KMeans project for document clustering. The work was done for the Advanced Operating Systems exam, under the supervision of prof. Cattaneo and Ph.D. student Roscigno.
While working on Bisecting KMeans, we read that the existing implementations (at least, those we found) write the documents of each newly found cluster to a new HDFS directory. This waste of disk space hurts both the implementation and the performance of the algorithm.
Fortunately, we developed an algorithm which doesn't write anything new to HDFS.
Note: working on documents is like working on a sparse matrix, where every document is a row and every word is a column. For the implementation we simply used a HashMap<Word, Value>.
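To make the representation concrete, here is a minimal sketch of what such a Document class might look like; the class name matches the one mentioned later on, but the fields and methods (add, normalize, distance, and the Euclidean metric) are illustrative assumptions, not the actual project code.

import java.util.HashMap;
import java.util.Map;

class Document {
    private final String key;                                   // document (or centroid) name
    private final Map<String, Double> words = new HashMap<>();  // word -> weight (sparse row)

    Document(String key) { this.key = key; }

    // Accumulate another document into this one (used to merge documents).
    void add(Document other) {
        for (Map.Entry<String, Double> e : other.words.entrySet())
            words.merge(e.getKey(), e.getValue(), Double::sum);
    }

    // Scale the vector to unit length so it can serve as a centroid.
    void normalize() {
        double norm = 0.0;
        for (double v : words.values()) norm += v * v;
        norm = Math.sqrt(norm);
        if (norm == 0.0) return;
        for (Map.Entry<String, Double> e : words.entrySet())
            e.setValue(e.getValue() / norm);
    }

    // Euclidean distance between two sparse vectors; the real metric may differ.
    double distance(Document other) {
        double sum = 0.0;
        for (Map.Entry<String, Double> e : words.entrySet()) {
            double diff = e.getValue() - other.words.getOrDefault(e.getKey(), 0.0);
            sum += diff * diff;
        }
        for (Map.Entry<String, Double> e : other.words.entrySet())
            if (!words.containsKey(e.getKey())) sum += e.getValue() * e.getValue();
        return Math.sqrt(sum);
    }

    String key()                { return key; }
    Map<String, Double> value() { return words; }
}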
Hadoop Pseudo Code
For every bisecting step, two jobs are executed:
- The first job selects two random documents from the biggest cluster
- The second job is the KMeans job, which can be executed #iterations times.
At the end of each bisecting step, we have to remove from the centroid "array list" the centroid with the most documents and add the two new centroids; the sketch below illustrates this driver loop, followed by the pseudocode of the MapReduce job.
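The following is only a sketch of that driver loop, under the assumption that centroids are kept in an in-memory list; the job-launching methods (runRandomSeedJob, runKMeansJob) and indexOfBiggestCluster are hypothetical names left abstract, since the real Hadoop job code is not shown here.

import java.util.ArrayList;
import java.util.List;

abstract class BisectingDriver {

    // Launches the job that picks two random documents from the biggest cluster (hypothetical).
    abstract Document[] runRandomSeedJob(int biggest, List<Document> centroids) throws Exception;

    // Launches one iteration of the KMeans job on the biggest cluster (hypothetical).
    abstract Document[] runKMeansJob(int biggest, List<Document> centroids, Document[] rnd) throws Exception;

    // Returns the index of the centroid with the most documents (hypothetical).
    abstract int indexOfBiggestCluster(List<Document> centroids);

    List<Document> run(Document initialCentroid, int k, int iterations) throws Exception {
        List<Document> centroids = new ArrayList<>();
        centroids.add(initialCentroid);                  // start with a single cluster

        while (centroids.size() < k) {
            int biggest = indexOfBiggestCluster(centroids);

            // Job 1: two random documents from the biggest cluster become the seeds.
            Document[] rnd = runRandomSeedJob(biggest, centroids);

            // Job 2: the KMeans job, executed #iterations times.
            for (int it = 0; it < iterations; it++)
                rnd = runKMeansJob(biggest, centroids, rnd);

            // Replace the split centroid with the two new ones; nothing new is written to HDFS.
            centroids.remove(biggest);
            centroids.add(rnd[0]);
            centroids.add(rnd[1]);
        }
        return centroids;
    }
}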
The input of the mapper (a row) is a pair <K, V>, where K is the document name and V is a list of pairs <Word, Value>.
As you can see in the Java code at the end of the page, we use the Document class, which merges documents and normalizes the newly found centroid.
Mapper
Mapper(row <K,V>):
    indexNear = -1; distMin = inf
    // Find the centroid nearest to the document
    for Centroids as index => centroid {
        dist = distance(row, centroid)
        if dist < distMin {
            indexNear = index
            distMin = dist
        }
    }
    // biggerCentroid is read in setup()
    if (indexNear == biggerCentroid) {
        indexNear = -1; distMin = inf
        for RndCentroids as ind => centroid {
            dist = distance(row, centroid)
            if dist < distMin {
                indexNear = ind
                distMin = dist
            }
        }
        // indexNear is now the random centroid nearest to the document
        send(indexNear, row)
    }
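In actual Hadoop code, the mapper above could look roughly like the sketch below. Passing biggerCentroid and the centroids through the job Configuration, the property names, and the Document.parse/parseAll helpers are assumptions made for illustration, not the project's real implementation.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class NearestCentroidMapper extends Mapper<Text, Text, IntWritable, Text> {

    private Document[] centroids;     // all current centroids
    private Document[] rndCentroids;  // the two random centroids of the cluster being split
    private int biggerCentroid;       // index of the biggest cluster

    @Override
    protected void setup(Context context) {
        Configuration conf = context.getConfiguration();
        biggerCentroid = conf.getInt("bkm.bigger.centroid", -1);           // assumed property name
        centroids = Document.parseAll(conf.get("bkm.centroids"));          // assumed helper
        rndCentroids = Document.parseAll(conf.get("bkm.rnd.centroids"));   // assumed helper
    }

    @Override
    protected void map(Text key, Text value, Context context)
            throws IOException, InterruptedException {
        Document row = Document.parse(key.toString(), value.toString());   // assumed helper

        // Nearest centroid among all current centroids.
        int indexNear = nearest(row, centroids);

        // Only documents belonging to the cluster being split are re-assigned and emitted.
        if (indexNear == biggerCentroid)
            context.write(new IntWritable(nearest(row, rndCentroids)), value);
    }

    private static int nearest(Document row, Document[] cs) {
        int best = -1;
        double distMin = Double.POSITIVE_INFINITY;
        for (int i = 0; i < cs.length; i++) {
            double dist = row.distance(cs[i]);
            if (dist < distMin) {
                best = i;
                distMin = dist;
            }
        }
        return best;
    }
}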
Combiner
Combiner(K, List<V> input):
    Document aggreg = new Document(K);
    for Document row in input {
        aggreg.add(row)
    }
    send(aggreg.key(), aggreg.value())
Reducer
Reducer(K, List<V> input):
    Document aggreg = new Document(K);
    for Document row in input {
        aggreg.add(row)
    }
    aggreg.normalize()
    send(aggreg.key(), aggreg.value())
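In Hadoop terms, the reducer might look like the sketch below; the combiner would be identical except that it does not call normalize(). Shipping the rows as Text and the Document.parse helper are assumptions for illustration.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class CentroidReducer extends Reducer<IntWritable, Text, IntWritable, Text> {

    @Override
    protected void reduce(IntWritable key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        Document aggreg = new Document(key.toString());

        // Merge every document assigned to this centroid.
        for (Text value : values)
            aggreg.add(Document.parse(key.toString(), value.toString()));  // assumed helper

        // Normalizing the merged vector gives the new centroid.
        aggreg.normalize();
        context.write(key, new Text(aggreg.value().toString()));
    }
}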
So, the output of the Reducer is the new centroid computed by the MapReduce job.
Java Code
This is the sequential BisecKMeans Java code. The Hadoop code will be released at the end of our university work.
Hadoop Code
Here there’s the code or what‘s left 😛