Title
Scene Text Recognition with Permuted Autoregressive Sequence Models
Authors
Abstract
Context-aware STR methods typically use internal autoregressive (AR) language models (LMs). Inherent limitations of AR models motivated two-stage methods which employ an external LM. The conditional independence of the external LM on the input image may cause it to erroneously rectify correct predictions, leading to significant inefficiencies. Our method, PARSeq, learns an ensemble of internal AR LMs with shared weights using Permutation Language Modeling. It unifies context-free non-AR and context-aware AR inference, and iterative refinement using bidirectional context. Using synthetic training data, PARSeq achieves state-of-the-art (SOTA) results in STR benchmarks (91.9% accuracy) and more challenging datasets. It establishes new SOTA results (96.0% accuracy) when trained on real data. PARSeq is optimal on accuracy vs parameter count, FLOPs, and latency because of its simple, unified structure and parallel token processing. Due to its extensive use of attention, it is robust on arbitrarily oriented text, which is common in real-world images. Code, pretrained weights, and data are available at: https://github.com/baudm/parseq.
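The core idea of Permutation Language Modeling mentioned in the abstract can be illustrated with attention masks: each decoding order (a permutation of token positions) induces a mask in which a position may attend only to the positions decoded before it, so one set of decoder weights trained over many orders behaves like an ensemble of AR LMs. The following is a minimal NumPy sketch of that masking rule, not the paper's actual implementation; the helper name `perm_attn_mask` is hypothetical.

```python
import numpy as np

def perm_attn_mask(perm):
    """Build a content attention mask for one decoding order.

    perm[i] is the token position decoded at step i. Under
    permutation language modeling, the query for position perm[i]
    may attend only to positions decoded earlier, perm[0..i-1].
    Returns a (T, T) boolean mask where mask[q, k] = True means
    query position q may attend to key position k.
    """
    T = len(perm)
    mask = np.zeros((T, T), dtype=bool)
    for i, q in enumerate(perm):
        for k in perm[:i]:  # positions already decoded
            mask[q, k] = True
    return mask

# The left-to-right order reproduces the standard causal AR mask
# (strictly lower-triangular: each position sees its left context).
causal = perm_attn_mask([0, 1, 2, 3])

# A shuffled order yields a different mask; training with shared
# weights across many such orders is what lets a single decoder
# support context-free, AR, and bidirectional (cloze) inference.
shuffled = perm_attn_mask([2, 0, 3, 1])
```

With the identity permutation the mask matches ordinary left-to-right decoding, while a cloze-style refinement pass corresponds to letting each position attend to all others, so both inference modes fall out of the same masking scheme.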