PyOCR optical character recognition

Tags:

Python

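PyOCR is an optical character recognition (OCR) tool wrapper for Python. It makes it easier to use OCR tools from Python programs. PyOCR can act as a wrapper for Google's Tesseract-OCR or for Cuneiform. It can read every image type supported by Pillow, including jpeg, png, gif, bmp, and tiff. It also supports bounding box data.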

Installation

1. pip
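Some errors may come up along the way. If calls from the command line work but calls from your code do not, you can set the path information in the code with sys.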

pip install pyocr
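
One possible reading of the sys-based workaround above is to put the install directory on sys.path before importing pyocr. The following is a minimal sketch under that assumption; the helper name `ensure_on_sys_path` and the example directory are hypothetical and not part of pyocr.

```python
import os
import sys


def ensure_on_sys_path(directory):
    """Prepend `directory` to sys.path once; return True if it was added."""
    directory = os.path.abspath(directory)
    if directory in sys.path:
        return False
    sys.path.insert(0, directory)
    return True


# Hypothetical location; replace it with wherever pip placed pyocr.
# ensure_on_sys_path("/path/to/site-packages")
# import pyocr
```

If the failure is instead that the OCR executable itself cannot be found, a different fix is probably needed, so treat this only as a starting point.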

2. Manual installation

mkdir -p ~/git ; cd ~/git
git clone https://gitlab.gnome.org/World/OpenPaperwork/pyocr.git
cd pyocr
make install  # will run 'python ./setup.py install'

Usage

1. Initialization
Initialize the tools here; every later function call relies on this. You can also set lang or other configuration here.

from PIL import Image
import pyocr
import sys
import pyocr.builders

class Image2Text:
    def __init__(self):
        tools = pyocr.get_available_tools()
        if len(tools) == 0:
            print("No OCR tool found")
            sys.exit(1)
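        # Keep the first available tool on the instance so the methods below can use it
        self.tool = tools[0]
        # Looking up langs is optional; define it as needed
        # langs = self.tool.get_available_languages()
        # List all available languages
        # print("Available languages: %s" % ", ".join(langs))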
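
Once the list of available languages is known, choosing one can be as simple as the sketch below. The helper `pick_language` and its default preference order are hypothetical, and the sample list stands in for what `get_available_languages()` would return on a real installation.

```python
def pick_language(available, preferred=("chi_sim", "eng")):
    """Return the first preferred language that the OCR tool reports.

    Falls back to the first available language, or None if there is none.
    """
    for lang in preferred:
        if lang in available:
            return lang
    return available[0] if available else None


# With a real tool: langs = tool.get_available_languages()
langs = ["eng", "osd", "chi_sim"]  # sample data for illustration only
print(pick_language(langs))
```

The chosen value can then be passed as the lang argument of image_to_string.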

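2. Syntax
The syntax is similar to pytesseract. Besides converting a simple image to text, pyocr can also return partial content, boundaries, boxes, lines, and even orientation.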

# txt is a Python string
    def revertByText(self):
        # Define lang before using it in the message and in the OCR call
        lang = 'chi_sim'
        print("Will use lang '%s'" % (lang))
        txt = self.tool.image_to_string(
            Image.open('pic/中文.png'),
            lang=lang,
            builder=pyocr.builders.TextBuilder()
        )
        print(txt)

    '''
    list of box objects. For each box object:
    box.content is the word in the box
    box.position is its position on the page (in pixels)
    
    Beware that some OCR tools (Tesseract for instance)
    may return empty boxes
    '''

    def revertByBox(self):
        word_boxes = self.tool.image_to_string(
            Image.open('pic/中文.png'),
            lang="chi_sim",
            builder=pyocr.builders.WordBoxBuilder()
        )
        print(word_boxes)


    '''
    list of line objects. For each line object:
    line.word_boxes is a list of word boxes (the individual words in the line)
    line.content is the whole text of the line
    line.position is the position of the whole line on the page (in pixels)

    Each word box object has an attribute 'confidence' giving the confidence
    score provided by the OCR tool. Confidence score depends entirely on
    the OCR tool. Only supported with Tesseract and Libtesseract (always 0
    with Cuneiform).
    
    Beware that some OCR tools (Tesseract for instance) may return boxes
    with an empty content.

    '''
    def revertByline(self):
        line_and_word_boxes = self.tool.image_to_string(
            Image.open('pic/中文.png'), lang="chi_sim",
            builder=pyocr.builders.LineBoxBuilder()
        )
        print(line_and_word_boxes)


    # digits is a python string
    def revertBydigits(self):
        digits = self.tool.image_to_string(
            Image.open('pic/中文.png'),
            lang='chi_sim',
            builder=pyocr.tesseract.DigitBuilder()
        )
        print(digits)
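
The docstrings above warn that some boxes may come back empty and that each word box carries a confidence score. The sketch below shows one way to handle both. The helpers `non_empty_boxes` and `mean_confidence` are hypothetical, the objects are stand-ins that only mimic the documented attributes, and the ((left, top), (right, bottom)) shape of position is an assumption.

```python
from types import SimpleNamespace


def non_empty_boxes(boxes):
    """Drop boxes whose content is empty or whitespace only."""
    return [box for box in boxes if box.content and box.content.strip()]


def mean_confidence(line):
    """Average the confidence of the non-empty word boxes in a line."""
    words = non_empty_boxes(line.word_boxes)
    if not words:
        return 0.0
    return sum(word.confidence for word in words) / len(words)


# Stand-in objects that mimic the attributes described in the docstrings.
words = [
    SimpleNamespace(content="hello", position=((10, 10), (60, 30)), confidence=90),
    SimpleNamespace(content="", position=((65, 10), (70, 30)), confidence=0),
    SimpleNamespace(content="world", position=((75, 10), (130, 30)), confidence=80),
]
line = SimpleNamespace(content="hello world", word_boxes=words,
                       position=((10, 10), (130, 30)))

for box in non_empty_boxes(words):
    # Assumed layout: two corner points in pixels
    (left, top), (right, bottom) = box.position
    print(box.content, left, top, right, bottom)
print(mean_confidence(line))
```

With real results, the same helpers would take the lists returned by WordBoxBuilder and LineBoxBuilder in place of the stand-ins.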

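3. Extensions
Other kinds of output are supported as well, including PDF recognition and conversion, orientation detection, and so on. Note that the calls below come from pytesseract rather than pyocr.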

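# These examples use pytesseract rather than pyocr
import pytesseract
from PIL import Image

# Get a searchable PDF (returned as bytes)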
pdf = pytesseract.image_to_pdf_or_hocr('test.png', extension='pdf')

# Get HOCR output
hocr = pytesseract.image_to_pdf_or_hocr('test.png', extension='hocr')

# Get verbose data including boxes, confidences, line and page numbers
print(pytesseract.image_to_data(Image.open('test.png')))

# Get information about orientation and script detection
print(pytesseract.image_to_osd(Image.open('test.png')))
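
By default, image_to_data returns tab-separated text with a header row, so the standard csv module is enough to work with it. The following is a rough sketch: `parse_tsv` and `confident_words` are hypothetical helpers, the threshold is arbitrary, and the sample is hand-written with a reduced set of columns instead of real OCR output.

```python
import csv
import io


def parse_tsv(data):
    """Parse tab-separated text with a header row into a list of dicts."""
    reader = csv.DictReader(io.StringIO(data), delimiter="\t",
                            quoting=csv.QUOTE_NONE)
    return list(reader)


def confident_words(rows, min_conf=60.0):
    """Keep the text of rows that have text and a 'conf' of at least min_conf."""
    kept = []
    for row in rows:
        text = (row.get("text") or "").strip()
        if text and float(row["conf"]) >= min_conf:
            kept.append(text)
    return kept


# Hand-written sample with fewer columns than real output, for illustration.
sample = (
    "level\tleft\ttop\twidth\theight\tconf\ttext\n"
    "4\t0\t0\t200\t30\t-1\t\n"
    "5\t10\t5\t60\t20\t96\tHello\n"
    "5\t80\t5\t70\t20\t41\tw0rld\n"
)
print(confident_words(parse_tsv(sample)))
```

With real output, the string returned by image_to_data would be passed to parse_tsv in place of the sample.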