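Recognizing Digits and Text in Images with Python

Goal

  I want to recognize the digits and characters written in an image file!

  And if possible, I want to do it in Python!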

Reference links

  Pythonで画像内の数字認識 - Qiita (Recognizing digits in images with Python)

  日本語OCRのtesseract-ocrを使ってやってみた | JProgramer (Trying out tesseract-ocr for Japanese OCR)

Environment

  Windows 7

  Python 2.7.13 :: Anaconda custom (64-bit)

Procedure

  1. Set up tesseract

    0) Sub-goal

        Before bringing Python into it, first get tesseract itself to OCR an image file!

    1) Download

        https://sourceforge.net/projects/tesseract-ocr-alt/files/

        tesseract-ocr-setup-3.02.02.exe

        tesseract-ocr-3.02.jpn.tar.gz

    2) Install

        tesseract-ocr-setup-3.02.02.exe

        ⇒ Install with the default settings

        tesseract-ocr-3.02.jpn.tar.gz

        ⇒ Extract the archive

          Copy the jpn.traineddata file into the tessdata folder under the tesseract installation folder
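The copy step above can also be scripted. This is only a minimal sketch: the helper name `install_traineddata` and both example paths in the comments are assumptions made for illustration, so substitute the folders you actually used.

```python
# -*- coding: utf-8 -*-
import os
import shutil


def install_traineddata(src_file, tessdata_dir):
    """Copy a .traineddata file into Tesseract's tessdata folder.

    Returns the full path of the copied file.
    """
    if not os.path.isfile(src_file):
        raise IOError('traineddata file not found: %s' % src_file)
    if not os.path.isdir(tessdata_dir):
        raise IOError('tessdata folder not found: %s' % tessdata_dir)
    dest = os.path.join(tessdata_dir, os.path.basename(src_file))
    shutil.copyfile(src_file, dest)
    return dest


# Example (both paths are assumptions; adjust them to your machine):
# install_traineddata(
#     r'C:\temp\tesseract-ocr\tessdata\jpn.traineddata',
#     r'C:\Program Files (x86)\Tesseract-OCR\tessdata')
```

Writing under Program Files may require an elevated Command Prompt.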

    3) Verify

        Run the following from the Command Prompt,

        > tesseract sample.png result -l jpn

        then check the output written to result.txt.
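If you would rather not open result.txt by hand each time, the same check can be driven from a small script. This is a rough sketch, not part of the original steps. It assumes the tesseract executable is on PATH, and the function names are made up for illustration.

```python
# -*- coding: utf-8 -*-
import io
import subprocess


def build_tesseract_command(image_path, output_base, lang='jpn'):
    """Build the same command line as the manual check."""
    return ['tesseract', image_path, output_base, '-l', lang]


def run_tesseract(image_path, output_base, lang='jpn'):
    """Run tesseract and return the text it wrote to <output_base>.txt."""
    # Assumes the tesseract executable can be found on PATH.
    subprocess.check_call(build_tesseract_command(image_path, output_base, lang))
    # tesseract writes its text output as UTF-8.
    with io.open(output_base + '.txt', encoding='utf-8') as f:
        return f.read()


# Example (needs tesseract installed and sample.png in the current folder):
# text = run_tesseract('sample.png', 'result')
```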

  2. Set up pytesseract

    0) Sub-goal

        Next, make tesseract usable from Python!

    1) Install

        Run the following from the Command Prompt

        > pip install pytesseract

        > conda install -y pillow

    2) Verify

        Run the following in the Python console and confirm that no errors occur

        >>> import pytesseract

        >>> import PIL
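The import check above can be wrapped in a tiny script that reports which packages are missing instead of stopping at the first error. A minimal sketch; `check_imports` is a hypothetical helper name, not something pytesseract or Pillow provides.

```python
# -*- coding: utf-8 -*-
import importlib


def check_imports(module_names):
    """Return a dict mapping each module name to True if it imports cleanly."""
    results = {}
    for name in module_names:
        try:
            importlib.import_module(name)
            results[name] = True
        except ImportError:
            results[name] = False
    return results


# Example:
# for name, ok in sorted(check_imports(['pytesseract', 'PIL']).items()):
#     print('%s: %s' % (name, 'OK' if ok else 'MISSING'))
```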

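  3. Try it out

    1) Source code

        The code is listed at the bottom of this page

    2) Run

        Prepare an image file (the script reads sample.jpg), then just run it!

        > python sample.py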

Good Luck !!!

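# -*- coding: utf-8 -*-
from __future__ import print_function

import pytesseract
from PIL import Image

# Image to read; place it next to this script.
imgfile = 'sample.jpg'
img = Image.open(imgfile)
# Run OCR with the Japanese language data (jpn.traineddata).
text = pytesseract.image_to_string(img, lang="jpn")
# On Python 2 the result may be UTF-8 bytes; decode before printing.
if isinstance(text, bytes):
    text = text.decode('utf-8')
print(text)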
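For the digit-recognition half of the goal, one possible variation is to ask tesseract for digits only and then filter the result in Python as a safety net. This is a sketch under two assumptions: that the `digits` config file shipped under tessdata/configs is present in your installation, and that the hypothetical helpers `keep_digits` and `read_digits` suit your images. It has not been checked against the exact versions listed above.

```python
# -*- coding: utf-8 -*-


def keep_digits(text):
    """Drop every character that is not 0-9."""
    return ''.join(ch for ch in text if ch in '0123456789')


def read_digits(image_path):
    """OCR an image and return only the digits that were recognized."""
    # Imported here so keep_digits() can be used without OCR installed.
    import pytesseract
    from PIL import Image

    img = Image.open(image_path)
    # 'digits' names a config file in tessdata/configs (assumed to exist).
    text = pytesseract.image_to_string(img, config='digits')
    if isinstance(text, bytes):
        text = text.decode('utf-8')
    # Filter again in Python in case stray characters slip through.
    return keep_digits(text)


# Example (needs tesseract, pytesseract, Pillow and an image file):
# print(read_digits('sample.jpg'))
```

Because the filtering happens in Python, `keep_digits` still cleans up the result even if the config file has no effect in your tesseract version.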