---
title: "Optical Character Recognition with imagerExtra"
author: "Shota Ochi"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Optical Character Recognition with imagerExtra}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(warning = FALSE, message = FALSE, cache = FALSE, comment = NA, verbose = TRUE, fig.width = 5, fig.height = 5, dev = 'jpeg', dev.args = list(quality = 50))
is_available_tesseract <- requireNamespace("tesseract", quietly = TRUE)
```

You need the R package tesseract, which provides bindings to a powerful optical character recognition (OCR) engine, to do OCR with imagerExtra. See the [installation guide of tesseract](https://github.com/ropensci/tesseract#installation) if you haven't installed tesseract.

The ocr function of tesseract works best on images with high contrast, little noise, and horizontal text. It does not perform well on degraded images, as shown below.

```{r, fig.height = 3, eval = is_available_tesseract}
library(imagerExtra)
plot(papers, main = "Original")
OCR(papers) %>% print
OCR_data(papers) %>% print
```

The OCR and OCR_data functions are wrappers for the ocr and ocr_data functions of tesseract. We can see that OCR and OCR_data failed to recognize the text "Hello". We need to clean the image before applying OCR.

```{r, fig.height = 3, eval = is_available_tesseract}
hello <- DenoiseDCT(papers, 0.01) %>% ThresholdAdaptive(., 0.1, range = c(0,1))
plot(hello, main = "Hello")
OCR(hello) %>% print
OCR_data(hello) %>% print
```

We can see that the text "Hello" was recognized. Using tesseract in combination with imagerExtra enables us to extract text from degraded images.
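
If you apply the same preprocessing to several images, it can be convenient to bundle the cleaning and recognition steps into a helper. The chunk below is a minimal sketch, not part of imagerExtra itself: the helper name `clean_and_ocr` is made up for illustration, and the parameter values are simply the ones used above, which may need tuning for other images.

```{r, eval = FALSE}
# Minimal sketch of a reusable cleaning + OCR helper.
# The values 0.01 (DenoiseDCT) and 0.1 (ThresholdAdaptive) are the
# illustrative settings used above; adjust them for your own images.
clean_and_ocr <- function(im) {
  cleaned <- DenoiseDCT(im, 0.01) %>%
    ThresholdAdaptive(., 0.1, range = c(0, 1))
  OCR(cleaned)
}

clean_and_ocr(papers) %>% print
```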