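Abstract
With the spread of computers, data, literature, archives, and books are increasingly produced in digital form. A very large stock of paper data, literature, archives, and books predates this shift, however, and keeping that content on paper is both inconvenient and risky. A paper document cannot be regenerated: once the paper is damaged, the content recorded on it is lost. Paper is also hard to distribute. Converting paper materials into electronic form is therefore necessary. Optical Character Recognition (OCR) is a technology that recognizes content printed or written on paper as characters and stores them on a computer. It plays a vital role in fields such as text entry and book digitization.
Several factors affect the success rate of OCR, such as the background of the image file and the font of the characters being recognized. This paper studies how preprocessing the image file and training a font library can raise the recognition success rate. The work mainly covers the following:
(1) Developing a Python-based OCR tool.
(2) Applying grayscale conversion, binarization, and noise reduction to the image to reduce interference from the background and from non-character content, thereby improving recognition accuracy.
(3) Training a font library so that the developed OCR tool is more accurate and can also recognize fonts and characters beyond ordinary printed typefaces.
Keywords: OCR technology; informatization; paper documents; text entry; grayscale processing; binarization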
Abstract
With the spread of computers, data, literature, archives, and books are increasingly produced in digital form. A very large stock of paper data, literature, archives, and books predates this shift, however, and keeping that content on paper is both inconvenient and risky. A paper document cannot be regenerated: once the paper is damaged, the content recorded on it is lost. Paper is also hard to distribute, so converting paper materials into electronic form is necessary. Optical Character Recognition (OCR) is a technology that recognizes content printed or written on paper as characters and stores them on a computer. It plays an important role in fields such as text entry and book digitization.
Several factors affect the success rate of OCR, such as the background of the image file and the font of the characters being recognized. This paper studies how preprocessing the image file and training a font library can raise the recognition success rate. The work mainly covers the following:
(1) Developing a Python-based OCR tool.
(2) Reducing interference from the background and from non-character content in the image through grayscale conversion, binarization, and noise reduction, to improve recognition accuracy.
(3) Training the font library to improve recognition accuracy and to enable the developed OCR tool to recognize special fonts and characters in addition to ordinary printed fonts.
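The preprocessing chain named in item (2) can be illustrated with a minimal sketch. This is not the tool developed in this work, and the abstract does not say which algorithms or libraries it uses. The function names, the fixed threshold of 128, the luminance weights, and the 3x3 median filter are all assumptions chosen for illustration, and the image is represented as plain nested lists so the example runs with the Python standard library alone.

```python
from statistics import median_high


def to_grayscale(rgb):
    """Convert rows of (r, g, b) tuples into rows of 0-255 luminance values.

    The 0.299 / 0.587 / 0.114 weights are a common luminance approximation
    (an assumption here, not taken from the source).
    """
    return [[round(0.299 * r + 0.587 * g + 0.114 * b) for r, g, b in row]
            for row in rgb]


def binarize(gray, threshold=128):
    """Map each pixel to 0 (dark, e.g. ink) or 255 (light, e.g. paper).

    A fixed global threshold is the simplest choice; the value 128 is
    hypothetical and would normally be tuned or computed per image.
    """
    return [[255 if p >= threshold else 0 for p in row] for row in gray]


def median_denoise(img):
    """Replace each pixel with the median of its 3x3 neighbourhood.

    Windows are clipped at the image border. median_high keeps the result
    binary when a clipped window holds an even number of pixels.
    """
    h, w = len(img), len(img[0])
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            window = [img[j][i]
                      for j in range(max(0, y - 1), min(h, y + 2))
                      for i in range(max(0, x - 1), min(w, x + 2))]
            row.append(median_high(window))
        out.append(row)
    return out


def preprocess(rgb, threshold=128):
    """Run grayscale conversion, binarization, then noise reduction."""
    return median_denoise(binarize(to_grayscale(rgb), threshold))


if __name__ == "__main__":
    white, black = (255, 255, 255), (0, 0, 0)
    # A 5x5 white page with one isolated dark speck in the middle.
    page = [[white] * 5 for _ in range(5)]
    page[2][2] = black
    for line in preprocess(page):
        print(line)
```

In this sketch an isolated dark pixel is treated as noise and removed by the median filter, while a stroke several pixels wide keeps its interior. A real pipeline would more likely rely on an image library and an adaptive threshold, and the cleaned image would then be passed to the recognition step.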