
Master's Thesis: Distributed solutions for the analysis of biomedical data in a Virtual Data Center environment

I have been away from the blog for a while because of my Master's thesis work: benchmarking distributed biomedical applications on a private GARR cluster. In short, I had two applications for k-mer extraction, one in Hadoop and one in Spark, and I tested their performance on a cluster of servers built on converged infrastructure.

 

Further reading:

How to compile Hadoop 2.7.1 on Raspberry Pi B+

One day a friend of mine gave me an old Raspberry Pi B+, and I found online this tutorial by The University of Glasgow's Raspberry Pi Project Blog. I tried to compile Hadoop on it so that the native libraries would be loaded (you can check this with hadoop checknative -a).

But why are the native libraries (libhadoop, zlib, snappy, lz4, bzip2 and openssl) so important? The answer comes from the Hadoop documentation:

“Hadoop has native implementations of certain components for performance reasons and for non-availability of Java implementations.”

The Raspberry Pi B+ is not very fast, so using as much native code as possible helps! (The Raspberry Pi Zero was also released: it is about 40% faster than the B+, but it has no Ethernet support.)

So I started compiling it on the device itself. I started from this tutorial by Headamage for the Raspberry Pi 2 and, after some days of experimenting, I successfully compiled Hadoop!
First: I used an external USB disk, partly because compiling a project of this size can wear out your SD card. You need at least 1.5 GB of free space on an ext2 partition (or another file system that supports symbolic links) just to compile Hadoop without protobuf, so I suggest 3 GB of free disk space for everything.

Bisecting KMeans on Hadoop without wasting disk space

Over the last few months, a university colleague, Gino Farisano, and I have been working on a Hadoop KMeans project for document clustering. This work was done for the Advanced Operating Systems exam, under the supervision of Prof. Cattaneo and Ph.D. student Roscigno.
While working on Bisecting KMeans, we read that the existing algorithms (at least, those we found) write, for each new cluster found, its documents to a new HDFS directory. This waste of disk space hurts both the implementation and the performance of the algorithm.

Luckily, we have developed an algorithm which doesn't write anything new to HDFS.

Note: working on documents is like working on a sparse matrix, where every document is a row and every word is a column. For the implementation we simply used a HashMap<Word, Value>.
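As a minimal sketch of this sparse-row idea (class and method names are my own, not the project's, and I assume Euclidean distance), a document can be a HashMap from word to weight, so distance only needs to touch the non-zero entries:

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// A document (or centroid) as a sparse row: word -> weight.
// Hypothetical helper class, not the project's actual code.
public class SparseRow {
    public final Map<String, Double> weights = new HashMap<>();

    public void put(String word, double value) {
        weights.put(word, value);
    }

    // Euclidean distance over the union of the two word sets;
    // words absent from a row contribute an implicit 0.
    public static double distance(SparseRow a, SparseRow b) {
        Set<String> words = new HashSet<>(a.weights.keySet());
        words.addAll(b.weights.keySet());
        double sum = 0.0;
        for (String w : words) {
            double d = a.weights.getOrDefault(w, 0.0)
                     - b.weights.getOrDefault(w, 0.0);
            sum += d * d;
        }
        return Math.sqrt(sum);
    }
}
```

Any distance that only depends on the non-zero coordinates works the same way on this representation.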

Hadoop Pseudo Code

For every bisecting step, two jobs are executed:

  1. The first job selects two random documents from the biggest cluster.
  2. The second job is the KMeans job, which can be executed #iterations times.

At the end of each bisecting step, we remove from the centroid list the centroid with the most documents and add the two new centroids. This is the pseudocode.

The input of the mapper (a row) is a pair <K, V>, where K is the document name and V is a list of <Word, Value> pairs.

As you can see in the Java code (at the end of the page), we use the Document class, which merges documents and normalizes the newly found centroid.

Mapper

Mapper(row <K,V>):
indexNear = -1; distMin = inf;

//Find the centroid nearest to the document
for Centroids as index => centroid {
  dist = distance(row, centroid)
  if dist < distMin {
    indexNear = index
    distMin = dist
  }
}
//biggerCentroid is read in setup()
if (indexNear == biggerCentroid) {
  indexNear = -1; distMin = inf;
  for RndCentroids as index => centroid {
     dist = distance(row, centroid)
     if dist < distMin {
       indexNear = index
       distMin = dist
     }
  }
  //indexNear is the random centroid nearest to the document
  send(indexNear, row)
}
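The argmin loop in the pseudocode above can be sketched in Java as follows; the dense double[] representation and the distance signature are my simplifications, not the project's actual API:

```java
// Sketch of the mapper's nearest-centroid search.
// Hypothetical helper, assuming dense vectors and Euclidean distance.
public class NearestCentroid {
    public static double distance(double[] row, double[] centroid) {
        double sum = 0.0;
        for (int i = 0; i < row.length; i++) {
            double d = row[i] - centroid[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    // Returns the index of the centroid nearest to the row,
    // or -1 if the centroid list is empty.
    public static int nearest(double[] row, double[][] centroids) {
        int indexNear = -1;
        double distMin = Double.POSITIVE_INFINITY;
        for (int index = 0; index < centroids.length; index++) {
            double dist = distance(row, centroids[index]);
            if (dist < distMin) {
                indexNear = index;
                distMin = dist;
            }
        }
        return indexNear;
    }
}
```

The same loop runs twice in the mapper: once over all centroids, and once over the two random centroids when the document belongs to the biggest cluster.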

Combiner

Combiner(K, List<V> input):
Document aggreg = new Document(K);
for Document row in input {
  aggreg.add(row)
}
send(aggreg.key(), aggreg.value())

Reducer

Reducer(K, List<V> input):
Document aggreg = new Document(K);
for Document row in input {
   aggreg.add(row)
}
aggreg.normalize()
send(aggreg.key(), aggreg.value())
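A minimal sketch of the Document class the combiner and reducer rely on; its internals are my guess from the pseudocode (add() sums sparse rows and counts them, normalize() divides by the count so the mean becomes the new centroid), not the project's released code:

```java
import java.util.HashMap;
import java.util.Map;

// Minimal sketch of the Document aggregator used in the combiner/reducer.
// Method names mirror the pseudocode; the internals are assumptions.
public class Document {
    private final String key;
    private final Map<String, Double> value = new HashMap<>();
    private int count = 0; // number of documents merged so far

    public Document(String key) { this.key = key; } // empty aggregate

    // Build a Document for a single input row (count = 1).
    public static Document of(String key, Map<String, Double> weights) {
        Document d = new Document(key);
        d.value.putAll(weights);
        d.count = 1;
        return d;
    }

    // Merge another partial aggregate: sum weights, accumulate counts.
    public void add(Document other) {
        for (Map.Entry<String, Double> e : other.value.entrySet()) {
            value.merge(e.getKey(), e.getValue(), Double::sum);
        }
        count += other.count;
    }

    // Divide by the number of merged documents: the mean is the new centroid.
    public void normalize() {
        if (count == 0) return;
        for (Map.Entry<String, Double> e : value.entrySet()) {
            e.setValue(e.getValue() / count);
        }
        count = 1;
    }

    public String key() { return key; }
    public Map<String, Double> value() { return value; }
}
```

Because add() is associative, the same merge logic is safe to run in the combiner before the reducer, as the pseudocode does.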

So, the output of the Reducer is the new centroid computed by the MapReduce job.

Java Code

This is the sequential BisecKMeans Java code. The Hadoop code will be released at the end of our university work.

 

Hadoop Code

Here is the code, or what's left of it 😛