
Master's Thesis: Distributed solutions for the analysis of biomedical data in a Virtual Data Center environment

I have been away from the blog for a while because of my Master's thesis work: benchmarking distributed biomedical applications on a private GARR cluster. In short, I had two applications for k-mer extraction, one in Hadoop and one in Spark, and I tested their performance on a cluster of servers built on converged infrastructure.

 

Further reading:

How to compile Hadoop 2.7.1 on Raspberry Pi B+

One day a friend of mine gave me an old Raspberry Pi B+, and I found online this tutorial by The University of Glasgow's Raspberry Pi Project Blog. I tried to compile Hadoop on it so that the native libraries would be loaded (you can check this with hadoop checknative -a).

But why are the native libraries (libhadoop, zlib, snappy, lz4, bzip2 and openssl) so important? The answer comes from the Hadoop documentation:

“Hadoop has native implementations of certain components for performance reasons and for non-availability of Java implementations.”

The Raspberry Pi B+ is not very fast, so using as much native code as possible helps! (The Raspberry Pi Zero was also released: it is about 40% faster than the B+, but it has no Ethernet support.)

So I started compiling it on the device itself. I started from this tutorial by Headamage for the Raspberry Pi 2 and, after some days of experimenting, I successfully compiled Hadoop!
First: I used an external USB disk, partly because compiling a project of this size can wear out your SD card. You need at least 1.5 GB of free space on an ext2 partition (or another file system that supports symbolic links) just to compile Hadoop without protobuf, so I suggest 3 GB of free disk space for everything.

Bisecting KMeans on Hadoop without wasting disk space

Over the last few months, a university colleague, Gino Farisano, and I have been working on a Hadoop KMeans project for document clustering. This work was done for the Advanced Operating Systems exam, under the supervision of Prof. Cattaneo and Ph.D. student Roscigno.
While working on Bisecting KMeans, we read that the existing algorithms (at least, those we found) write, for each new cluster found, its documents to a new HDFS directory. This waste of disk space hurts both the implementation and the performance of the algorithm.

Luckily, we have developed an algorithm which doesn't write anything new to HDFS.

Note: working on documents is like working on a sparse matrix, where every document is a row and every word is a column. For the implementation we simply used a HashMap<Word, Value>.
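As a minimal sketch of this sparse-row idea (class and method names are my own, not the project's, and I assume Euclidean distance), a document can be a HashMap from word to weight, so distance only needs to touch the non-zero entries:

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// A document (or centroid) as a sparse row: word -> weight.
// Hypothetical helper class, not the project's actual code.
public class SparseRow {
    public final Map<String, Double> weights = new HashMap<>();

    public void put(String word, double value) {
        weights.put(word, value);
    }

    // Euclidean distance over the union of the two word sets;
    // words absent from a row contribute an implicit 0.
    public static double distance(SparseRow a, SparseRow b) {
        Set<String> words = new HashSet<>(a.weights.keySet());
        words.addAll(b.weights.keySet());
        double sum = 0.0;
        for (String w : words) {
            double d = a.weights.getOrDefault(w, 0.0)
                     - b.weights.getOrDefault(w, 0.0);
            sum += d * d;
        }
        return Math.sqrt(sum);
    }
}
```

Any distance that only depends on the non-zero coordinates works the same way on this representation.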

Hadoop Pseudo Code

For every bisecting step, two jobs are executed:

  1. The first job selects two random documents from the biggest cluster.
  2. The second job is the KMeans job, which can be executed #iterations times.

At the end of each bisecting step, we remove from the centroid list the centroid with the most documents and add the two new centroids. This is the pseudocode.

The input of the mapper (a row) is a pair <K, V>, where K is the document name and V is a list of <Word, Value> pairs.

As you can see in the Java code (at the end of the page), we use the Document class, which merges documents and normalizes the newly found centroid.

Mapper

Mapper(row <K,V>):
indexNear = -1; distMin = inf;

//Find the centroid nearest to the document
for Centroids as index => centroid {
  dist = distance(row, centroid)
  if dist < distMin {
    indexNear = index
    distMin = dist
  }
}
//biggerCentroid is read in setup()
if (indexNear == biggerCentroid) {
  indexNear = -1; distMin = inf;
  for RndCentroids as index => centroid {
     dist = distance(row, centroid)
     if dist < distMin {
       indexNear = index
       distMin = dist
     }
  }
  //indexNear is the random centroid nearest to the document
  send(indexNear, row)
}
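The argmin loop in the pseudocode above can be sketched in Java as follows; the dense double[] representation and the distance signature are my simplifications, not the project's actual API:

```java
// Sketch of the mapper's nearest-centroid search.
// Hypothetical helper, assuming dense vectors and Euclidean distance.
public class NearestCentroid {
    public static double distance(double[] row, double[] centroid) {
        double sum = 0.0;
        for (int i = 0; i < row.length; i++) {
            double d = row[i] - centroid[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    // Returns the index of the centroid nearest to the row,
    // or -1 if the centroid list is empty.
    public static int nearest(double[] row, double[][] centroids) {
        int indexNear = -1;
        double distMin = Double.POSITIVE_INFINITY;
        for (int index = 0; index < centroids.length; index++) {
            double dist = distance(row, centroids[index]);
            if (dist < distMin) {
                indexNear = index;
                distMin = dist;
            }
        }
        return indexNear;
    }
}
```

The same loop runs twice in the mapper: once over all centroids, and once over the two random centroids when the document belongs to the biggest cluster.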

Combiner

Combiner(K, List<V> input):
Document aggreg = new Document(K);
for Document row in input {
  aggreg.add(row)
}
send(aggreg.key(), aggreg.value())

Reducer

Reducer(K, List<V> input):
Document aggreg = new Document(K);
for Document row in input {
   aggreg.add(row)
}
aggreg.normalize()
send(aggreg.key(), aggreg.value())
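A minimal sketch of the Document class the combiner and reducer rely on; its internals are my guess from the pseudocode (add() sums sparse rows and counts them, normalize() divides by the count so the mean becomes the new centroid), not the project's released code:

```java
import java.util.HashMap;
import java.util.Map;

// Minimal sketch of the Document aggregator used in the combiner/reducer.
// Method names mirror the pseudocode; the internals are assumptions.
public class Document {
    private final String key;
    private final Map<String, Double> value = new HashMap<>();
    private int count = 0; // number of documents merged so far

    public Document(String key) { this.key = key; } // empty aggregate

    // Build a Document for a single input row (count = 1).
    public static Document of(String key, Map<String, Double> weights) {
        Document d = new Document(key);
        d.value.putAll(weights);
        d.count = 1;
        return d;
    }

    // Merge another partial aggregate: sum weights, accumulate counts.
    public void add(Document other) {
        for (Map.Entry<String, Double> e : other.value.entrySet()) {
            value.merge(e.getKey(), e.getValue(), Double::sum);
        }
        count += other.count;
    }

    // Divide by the number of merged documents: the mean is the new centroid.
    public void normalize() {
        if (count == 0) return;
        for (Map.Entry<String, Double> e : value.entrySet()) {
            e.setValue(e.getValue() / count);
        }
        count = 1;
    }

    public String key() { return key; }
    public Map<String, Double> value() { return value; }
}
```

Because add() is associative, the same merge logic is safe to run in the combiner before the reducer, as the pseudocode does.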

So, the output of the Reducer is the new centroid computed by the MapReduce job.

Java Code

This is the sequential BisecKMeans Java code. The Hadoop code will be released at the end of our university work.

 

Hadoop Code

Here is the code, or what's left of it 😛