Large scale thematic mapping by supervised machine learning on big data distributed cluster computing frameworks
Authors: Javier Lozano Silva, Naiara Aginako Bengoa, Ekaitz Zulueta Guerrero, Pedro Iriondo Bengoa
Date: 28.07.2015
Abstract
The petabyte-scale data volumes in Earth Observation (EO) archives cannot be managed efficiently with serial processes running on large isolated servers. Distributed storage and processing based on ‘big data’ cloud computing frameworks need to be considered as part of the solution. This contribution describes a parallelized data processing approach for EO image analysis that is based on the MapReduce paradigm and implemented on the Apache Spark framework. The thematic mapping approach presented is based on a serial implementation of a probabilistic k-Nearest Neighbor supervised classifier that produces high-quality results. The algorithm itself, as is often the case in machine learning, is inherently parallelizable, yet it needs to be revised in order to handle large data volumes with acceptable performance. Since the algorithm is coded in a high-level scripting language, classifying a 25 Megapixel image takes about a minute. Extrapolating these values to regional extents yields unacceptable running times. As a concrete example, generating metric thematic maps of the Basque Country in the north of Spain would require analyzing about 150 Megapixels of data, with running times on the order of 20 hours. While parallelizing and distributing the analysis offers evident advantages, porting classical machine learning algorithms designed for limited data volumes to large-scale remote sensing coverages requires significant effort. Adopting a parallelization approach based on the MapReduce paradigm can be beneficial in this respect. In this contribution, we present a methodology for parallelizing machine learning algorithms on local and cloud-based cluster computing environments for the efficient analysis of large geospatial EO coverages.
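As an illustration of how a probabilistic k-Nearest Neighbor classifier can be phrased in MapReduce terms, the following is a minimal sketch in plain Python. It is not the authors' Spark implementation. The function names, the squared Euclidean distance, the toy two-band samples, and the choice to partition the labelled training set (rather than the image) are all assumptions made for the example. The map step finds the k nearest labelled samples inside each partition, the reduce step merges those candidate lists, and the class probabilities are taken as the fraction of the k neighbours that carry each label.

```python
import heapq
from collections import Counter
from functools import reduce


def local_top_k(partition, pixel, k):
    """Map step (hypothetical helper): k nearest labelled samples in one partition."""
    scored = (
        (sum((a - b) ** 2 for a, b in zip(features, pixel)), label)
        for features, label in partition
    )
    return heapq.nsmallest(k, scored)


def merge_top_k(left, right, k):
    """Reduce step (hypothetical helper): keep the k nearest of two candidate lists."""
    return heapq.nsmallest(k, left + right)


def knn_class_probabilities(partitions, pixel, k):
    """Estimate class probabilities for one pixel from its k nearest neighbours."""
    candidates = [local_top_k(p, pixel, k) for p in partitions]          # map
    neighbours = reduce(lambda a, b: merge_top_k(a, b, k), candidates)   # reduce
    votes = Counter(label for _, label in neighbours)
    return {label: count / len(neighbours) for label, count in votes.items()}


# Toy training set split into two partitions, as a cluster would hold it.
partitions = [
    [((0.1, 0.2), "water"), ((0.9, 0.8), "urban")],
    [((0.2, 0.1), "water"), ((0.8, 0.9), "urban"), ((0.15, 0.15), "water")],
]

probs_k3 = knn_class_probabilities(partitions, (0.0, 0.0), k=3)
probs_k5 = knn_class_probabilities(partitions, (0.0, 0.0), k=5)
print(probs_k3)
print(probs_k5)
```

Because each partition only returns its own k best candidates, the amount of data exchanged in the reduce step is bounded by k times the number of partitions, regardless of the training set size. On a framework such as Apache Spark, the same two functions would play the roles of the per-partition map and the associative merge.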
BIB_text
title = {Large scale thematic mapping by supervised machine learning on big data distributed cluster computing frameworks},
pages = {1504-1507},
isbn = {978-1-4799-7929-5},
isi = {1},
date = {2015-07-28},
year = {2015},
}