PolygloNet: Multilingual Approach for Scene Text Recognition Without Language Constraints
Autores: Alex Solè Gómez
Fecha: 23.05.2022
Abstract
The detection and recognition of text instances in camera-captured images or videos generate rich and precise semantic information for interpreting and describing the scene. However, recognizing text in the wild remains a challenging problem despite its value. Apart from the inherent problems in text detection and recognition tasks, knowing the language to be recognized is required in unsupervised forensic applications where multilingual information is frequently employed. This work proposes an unconstrained multilingual text recognition pipeline for scene text detection and recognition. The pipeline consists of multiple text recognition experts contributing to determining the output text sentence. Each expert translates the visual information into a candidate text sentence. Finally, our post-processing model, named PolygloNet, encodes and aggregates all text sentences to generate the optimal text sequence. This model can select the most likely text sequence and correct spelling errors produced in the recognition stage. Experimental results show that our model can produce accurate recognition results on relevant datasets (MLT2017 [2] and MLT2019 [1]).
BIB_text
author = {Alex Solè Gómez},
title = {PolygloNet: Multilingual Approach for Scene Text Recognition Without Language Constraints},
pages = {479-490},
keywds = {
Scene text detection (STD), Multilingual text recognition, Text analysis, Long short-term memory (LSTM)
}
abstract = {
The detection and recognition of text instances in camera-captured images or videos generate rich and precise semantic information for interpreting and describing the scene. However, recognizing text in the wild remains a challenging problem despite its value. Apart from the inherent problems in text detection and recognition tasks, knowing the language to be recognized is required in unsupervised forensic applications where multilingual information is frequently employed. This work proposes an unconstrained multilingual text recognition pipeline for scene text detection and recognition. The pipeline consists of multiple text recognition experts contributing to determining the output text sentence. Each expert translates the visual information into a candidate text sentence. Finally, our post-processing model, named PolygloNet, encodes and aggregates all text sentences to generate the optimal text sequence. This model can select the most likely text sequence and correct spelling errors produced in the recognition stage. Experimental results show that our model can produce accurate recognition results on relevant datasets (MLT2017 [2] and MLT2019 [1]).
}
isbn = {978-3-031-06429-6},
date = {2022-05-23},
}