Talk:Optical character recognition/Archives/2013

This page is an archive of past discussions. Do not edit the contents of this page. If you wish to start a new discussion or revive an old one, please do so on the current talk page.

Missing an overview of where OCR fits into a document processing solution

Key to a good OCR rate is the quality of the input images and how they are pre-processed. This needs to be added to the article. For example, thresholding low-resolution images of text is critical for good OCR results. This leads into topics such as background removal, background normalization, Otsu thresholding, median filtering, demosaicing, etc. A simple chart of OCR recognition rates for various scan DPI settings would help. Commercial products like ABBYY FineReader suggest that characters should be at least 20 pixels high to be OCR'd with good results.

A chart giving the resulting character size in pixels for a given character point size and scan DPI would also help. E.g., 75 dpi scans of 10-point text produce horrible results, whilst 300 dpi scans of 10-point text produce excellent results. —Preceding unsigned comment added by 98.197.217.81 (talk) 21:00, August 25, 2007 (UTC)
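For what it's worth, the chart asked for above can be sketched from the definition of the typographic point (1/72 inch). This is only a rough sketch: the function names are made up for illustration, and it computes the nominal em-box height, not the height of the actual glyphs.

```python
POINTS_PER_INCH = 72  # one typographic point is 1/72 inch


def em_height_px(point_size, dpi):
    """Nominal em-box height in pixels for a font size scanned at a given DPI."""
    return point_size / POINTS_PER_INCH * dpi


def size_chart(point_sizes, dpis):
    """Build {point size: {dpi: em-box height in pixels, rounded to 0.1}}."""
    return {
        pt: {dpi: round(em_height_px(pt, dpi), 1) for dpi in dpis}
        for pt in point_sizes
    }


# Print a small chart: one row per point size, one column per scan DPI.
dpis = (75, 150, 300, 600)
print("pt  " + "".join(f"{d:>8}" for d in dpis))
for pt, row in size_chart((8, 10, 12), dpis).items():
    print(f"{pt:<4}" + "".join(f"{row[d]:>8}" for d in dpis))
```

By this arithmetic, 10-point text has an em box of about 10.4 pixels at 75 dpi and about 41.7 pixels at 300 dpi. Actual letters are smaller than the em box (lowercase letters are typically around half of it), so the 75 dpi case falls well short of the 20-pixel figure quoted above.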

Zip codes

The web page pete's history gives 1965 as the year the United States Postal Service first used OCR to read zip codes.--Rethunk

Open source programs

Are there any open source OCR programs available?

yes. THE KING 12:53, 5 May 2005 (UTC)

I see that http://simpleocr.com/ is free for "personal use"; is it really open source?

Section about software

Kooka - default scanning application in KDE. It uses GOCR for OCR
Tesseract is an open source OCR, initially developed by HP, and released under the Apache License, Version 2.0. It can be compiled using MSVC 6.0 or GCC (~120000 LOC)
Clara - [1], [2] (~50000 LOC)
GOCR - (~20000 LOC + Unpaper + Socrates) - GOCR included in Debian and other distributions (not for Windows)
Ocrad - [3] - (~9900 LOC) - "is an OCR [...] program based on a feature extraction method".
Simple OCR - freeware application available, as well as royalty free SDK and source code.
ISRI Software - some experimental OCR tools
OCRchie - dormant since 1996
OOCR - an OCR program still in development, under the GPL.
phpOCR - a base implementation for an OCR tool in PHP
Kognition - [4]

CJK Support?

This article doesn't mention anything about OCR support for Chinese, Japanese, and Korean, though that information would be very valuable, especially if there is free software with CJK support. Theshibboleth 00:11, 10 May 2006 (UTC)

Seconded. I'm disappointed in you all! Astarica 09:37, 6 September 2007 (UTC)

In a way it does since it refers to OCR in Unicode. Although the info is very specific and somewhat cryptic for a general audience. D'Artagnol (talk) 22:10, 20 March 2009 (UTC)

I added mention that this exists. More details welcome. -- Beland (talk) 06:10, 28 March 2013 (UTC)

MICR

The reference to MICR seems strangely disjointed, as though it is written in the context of human reading rather than machine reading. I am minded to amend it. Would anyone object? Tom 00:00, 5 June 2006 (UTC)

Merge

I am proposing the merge. Neither article is unduly long and it would be much more convenient to the reader to have all the relevant information in one place. BlueValour 17:22, 5 November 2006 (UTC)

Agreed, it should be in this article under a subsection, makes it easier to find. —Preceding unsigned comment added by 207.81.148.242 (talk • contribs)

Section "Optical Character Recognition in Unicode"

It's not clear at all from the article, what those characters are used for. 83.79.33.140 19:47, 18 March 2007 (UTC)

I can't find a definitive source, but it appears to me that the codes for OCR DASH and OCR CUSTOMER ACCOUNT NUMBER are swapped. According to http://theorem.ca/~mvcorks/cgi-bin/unicode.pl.cgi?start=2440&end=245f , OCR DASH is 0x2448 and OCR CUSTOMER ACCOUNT NUMBER is 0x2449. —Preceding unsigned comment added by 202.164.193.248 (talk) 07:02, 6 July 2008 (UTC)
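One way to check which name goes with which code point is to ask the Unicode Character Database that ships with Python. A minimal sketch (the helper name `ocr_name` is invented here); it only reports the names the database assigns, and says nothing about whether the glyphs in a given font match those names.

```python
import unicodedata


def ocr_name(code_point):
    """Return the Unicode character name, or a placeholder if none is assigned."""
    return unicodedata.name(chr(code_point), "<unassigned>")


# List the assigned characters at the start of the OCR block (U+2440 to U+244A).
for cp in range(0x2440, 0x244B):
    print(f"U+{cp:04X}  {chr(cp)}  {ocr_name(cp)}")
```

In the database, U+2448 is named OCR DASH and U+2449 is named OCR CUSTOMER ACCOUNT NUMBER, which agrees with the page linked above.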

OCR for mathematical documents

Searching a bit on the web for a taste of OCR for maths led me to this page: http://www.inftyproject.org Although it's labelled 'free software', going by the license it's obviously just freeware. Anyone know of free/open source alternatives? I'm surprised that there isn't any major software project for this, with (cheap) tablet PCs around the corner and Google's plans to digitise the planet being applied to books.

Most mathematical formulas have been set using TeX, so it shouldn't be that difficult to scan it back in again correctly, right? Merctio 23:01, 11 April 2007 (UTC)

For the InftyReader/open source alternatives: I believe there are no alternatives yet. The OSS world is still struggling with straight Latin. For the InftyEditor: actually quite common, f.ex. OpenOffice Math. For the rest: mixing audio with math text, I've never heard of the idea, and I couldn't envision it by myself, except possibly as my really mad ideas of combo-TV-garden-rake. I thought math was simply unspeakable! Said: Rursus ☺ ★ 09:56, 19 July 2007 (UTC)

Wrong word?

Should this say "handwritten" instead of "hand-printed?"

"These methods cannot be used in software that scans paper documents, so accurate recognition of hand-printed documents is still largely an open problem."

Matthias Röder 10:56, 6 July 2007 (UTC)

Since rephrased. -- Beland (talk) 06:09, 28 March 2013 (UTC)

MAP

Ossware map, feel free to modify:

Inwiki: GOCR, Ocrad - cmdline?, OCRopus - new, merger?
Exwiki: Tesseract on g8gle - used by OCRopus, OCRopus on g8gle - nice link to transsurf, Leptonica - unknown whatis,
Known: ClaraOCR - almost no info there, inactive since 2003.

Said: Rursus ☺ ★ 07:01, 19 July 2007 (UTC)

Is this related to a project of some sort? It's not really appropriate for this talk page... Chris Cunningham 11:57, 19 July 2007 (UTC)

Tesseract

It's nice that Tesseract is free etc., but trying to use it seems rather tech-challenging at this point. Does anyone offer it to try as a free online conversion tool? FreeOCR may be a more user-friendly version, but they may all require 2K/XP for the Windows version, so older OSes are out of luck. The only free online OCR I can find is scanR, but using it seems quite awkward (must email jpegs, get activation codes, etc.) -69.87.203.15 12:48, 2 October 2007 (UTC)

Citations?

This article does not have any citations. —Preceding unsigned comment added by 70.126.48.91 (talk) 00:37, 2 December 2007 (UTC)

unknown characters

Where do these OCR characters come from:

⑅
⑊
⑃

? They don't seem to be defined in the relevant standards. --Abdull (talk) 23:14, 15 February 2008 (UTC)

Maybe you weren't looking in Unicode version 6.1? Can't explain all of them, but as for their ultimate provenance some come from MICR and some from OCR-A font. Maybe the remainder are from OCR B? The Unicode documentation I could find unfortunately doesn't really trace ancestry. -- Beland (talk) 06:06, 28 March 2013 (UTC)

Character 0x244B

Why is 0x244B declared as "classified"? —Preceding unsigned comment added by AzaToth (talk • contribs) 03:40, 9 March 2008 (UTC)

It just says "reserved" now. -- Beland (talk) 05:51, 28 March 2013 (UTC)

Strongly suggest a 'software - last release date' column in table

The software list is misleading, given that many of the open source OCR packages have not had a release in many years and that some of them are in pre-alpha status (Tesseract). —Preceding unsigned comment added by 98.197.209.187 (talk • contribs)

Missing

- Optical mark recognition link
- Glyph recognition with user interaction (e.g., training an OCR package to learn to OCR Latin texts)
- Document preprocessing before OCR (deskew, threshold, etc.)
- OCR test results to give a basic understanding of scan quality, character size and OCR effectiveness
- Mention output formats for OCR documents (plain text, PDF text on top of the original image, etc.)
- Voting techniques for character recognition (i.e., comparing all letter 'e' on a page to help classify unknown glyphs as the letter 'e')
—Preceding unsigned comment added by 98.197.209.187 (talk • contribs)

I added some content along these lines; more details are welcome. -- Beland (talk) 05:50, 28 March 2013 (UTC)
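As a concrete illustration of the "threshold" preprocessing step listed above, here is a minimal pure-Python sketch of Otsu's method for choosing a global threshold on 8-bit grayscale pixels. The function names and the convention that dark pixels are ink are assumptions made for this example; it is not taken from any particular OCR package.

```python
def otsu_threshold(pixels):
    """Return t in 0..255 maximizing between-class variance.

    Pixels with value <= t form the dark class, the rest the light class.
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(value * count for value, count in enumerate(hist))

    w_dark = 0      # number of pixels in the dark class so far
    sum_dark = 0    # sum of gray values in the dark class so far
    best_t, best_var = 0, -1.0
    for t in range(256):
        w_dark += hist[t]
        sum_dark += t * hist[t]
        if w_dark == 0:
            continue            # no dark pixels yet
        w_light = total - w_dark
        if w_light == 0:
            break               # every pixel is already in the dark class
        mean_dark = sum_dark / w_dark
        mean_light = (sum_all - sum_dark) / w_light
        var_between = w_dark * w_light * (mean_dark - mean_light) ** 2
        if var_between > best_var:
            best_t, best_var = t, var_between
    return best_t


def binarize(pixels, t):
    """Mark each pixel True if it is ink (assumed: dark text on light paper)."""
    return [p <= t for p in pixels]


# Toy "scan": 40 dark ink pixels and 60 light background pixels.
scan = [20] * 30 + [30] * 10 + [220] * 60
t = otsu_threshold(scan)
print(t, sum(binarize(scan, t)))
```

A single global threshold like this assumes fairly even lighting; scans with uneven backgrounds usually need the background normalization mentioned in the earlier thread first.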

This article doesn't even mention Cyrillic OCR!!!

The HP scanner I bought for about $50 five years ago came bundled with software that can OCR Cyrillic text about as well as Roman. Apparently Russians have been making use of these capabilities to put huge amounts of writing from the tsarist and soviet periods online, in honor of "samizdat" traditions!

Apparently the newest versions of HP's bundled software also OCR Greek, Chinese (simplified or traditional), Arabic, Hebrew and Korean. The only really big omission in contemporary terms seems to be Indic scripts (including variants used outside the subcontinent for Tibetan, Burmese, Thai, Laotian and Cambodian).

This article really seems behind the times in not going beyond OCR of the Roman alphabet and its variations. LADave (talk) 02:20, 25 May 2008 (UTC)

Wow!!! Find some reliable sources and add it to the article. Of course, Cyrillic really is a variation of the Roman alphabet (well, the Latin-Greek-Cyrillic superalphabet), especially from the perspective of OCR.--Prosfilaes (talk) 13:26, 25 May 2008 (UTC)

Well, I mentioned that OCR does exist for non-Latin writing systems. I agree it would be interesting to have more details about notable differences. -- Beland (talk) 05:50, 28 March 2013 (UTC)

Uses of OCR

Does anyone know any uses of OCR? I'm getting a bit stuck on this! Adam Hillman (talk) 14:45, 3 October 2008 (UTC)

I have seen it used in "reading pens" and to make searchable PDF files when scanning, and much more, I guess. --83.253.216.123 (talk) 20:01, 22 January 2009 (UTC)

Wow, come on. There are so many. In general, the largest uses of OCR today are related to document management in large institutions, for storage and management of paperless processes. For instance: claims processing (going from paper health insurance claims to digital claims management without the need for manual data entry). There are many others listed if you search "document OCR paperless applications" on Google.com. One example is legal services[[5]] D'Artagnol (talk) 22:20, 20 March 2009 (UTC)

I added two applications in the intro paragraphs OsamaBinLogin (talk) 20:26, 24 October 2009 (UTC)

Simplicity

Could somebody please rewrite this article so the average reader can understand it? Sincerely, GeorgeLouis (talk) 00:12, 28 October 2008 (UTC)

If you could be specific about what parts you can't understand, that would help us a lot. kbnklvkkfh

I added some stuff to the intro. Hopefully that adds more context. OsamaBinLogin (talk) 20:26, 24 October 2009 (UTC)

Adobe Acrobat

Adobe Acrobat Professional also has OCR, missing from the list —Preceding unsigned comment added by Cowicide (talk • contribs) 10:00, 25 November 2008 (UTC)

I can confirm that; you can choose to have it when you scan documents. It takes quite some time, so if you have a lot of documents to scan and don't need it, turn it off. It makes the files bigger too, but adds features to them.

See Scanning options - Make Searchable (Run OCR) at: http://help.adobe.com/en_US/Acrobat/9.0/3D/WS58a04a822e3e50102bd615109794195ff-7f71.w.html --83.253.216.123 (talk) 19:54, 22 January 2009 (UTC)

OCR feature in Adobe Acrobat is provided by ReadIRIS, which is already listed in OCR Software. Please read http://www.irislink.com/Documents/pdf/200609191402/adobe_en.pdf Ankit (talk) 03:59, 11 October 2009 (UTC)

Thx Ankit. From your source we can now clearly state that Adobe is using I.R.I.S.’ OCR technology. So, no need to add Adobe to the list. Ivan Akira (talk) 07:59, 11 October 2009 (UTC)

Zonal OCR

Zonal OCR should probably be merged here, no? Rd232 ^talk 01:01, 13 January 2009 (UTC)

I just added it. -- Beland (talk) 05:40, 28 March 2013 (UTC)

Removing non-notable and promotional links again

In November I removed all the entries from the OCR software section which did not have their own articles or were obviously promotional. Unfortunately, once again the table is full of indiscriminate examples which in some cases appear blatantly promotional. I'm going to remove these all again in the future. Chris Cunningham (not at work) - talk 19:23, 13 March 2009 (UTC)

Mac OS support

TypeReader does not appear to sell a Mac OS compatible version any longer.

OmniPage does offer a Mac OS version, but it hasn't been updated in years. It lists the system requirements as Mac OS 9 or Mac OS X 10.1. There is no mention on the Nuance web page showing system requirements of whether or not it works with Mac OS 10.2 or later (current Mac OS X is 10.5).

I believe both of those should either have Mac OS removed from the supported OS columns or a footnote added saying Mac OS support is deprecated or discontinued. 70.251.228.236 (talk) 17:03, 22 March 2009 (UTC) 2009-03-21. Geoff Strickler

A solved problem?

OCR is not a solved problem! I have yet to see an OCR program that doesn't make at least 5 errors per page! If it is solved why do we need reCaptcha? 83.44.126.70 (talk) 17:40, 17 November 2009 (UTC)

Proposal to Split - software tables

I strongly support the proposal, made in October 2009, to split the OCR software table into a separate article. I suggest that both tables, OCR software and OCR software language support, be moved into a separate page, along with the relevant talk sections. I also suggest that the page be entitled Comparison of OCR software and placed into the category Software comparisons. Artemgy (talk) 08:21, 28 November 2009 (UTC)

I also suggest splitting the OCR software table. Vcgupta 20:05, 28 December 2009

See list of optical character recognition software (2009). Bwrs (talk) 16:36, 9 January 2010 (UTC)

Typical accuracy rates are inaccurate and need a citation

The accuracy rate in industrial applications is less than 95%. The article suggests a 99% accuracy, which might be achieved in a lab under unrealistic conditions. I suggest rephrasing this sentence and researching the accuracy rate under different circumstances. —Preceding unsigned comment added by Mudx77 (talk • contribs) 09:42, 24 January 2010 (UTC)
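To make the percentages above concrete, here is a rough sketch of one common way to compute character accuracy (edit distance against a ground-truth transcription) and of what a given rate means per page. The function names are invented for this example, and the figure of 2000 characters per page is an assumption, not a number from the article.

```python
def edit_distance(a, b):
    """Levenshtein distance: insertions, deletions and substitutions from a to b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        prev = cur
    return prev[-1]


def char_accuracy(truth, ocr_text):
    """Character accuracy as 1 - (edit distance / length of the ground truth)."""
    if not truth:
        raise ValueError("ground truth must not be empty")
    return 1 - edit_distance(truth, ocr_text) / len(truth)


def expected_errors_per_page(accuracy, chars_per_page):
    """Expected number of character errors on a page at a given accuracy."""
    return (1 - accuracy) * chars_per_page


print(round(char_accuracy("welcome", "wclcome"), 3))
print(round(expected_errors_per_page(0.99, 2000)))  # assumed 2000 chars/page
print(round(expected_errors_per_page(0.95, 2000)))
```

Under that page-size assumption, 99% character accuracy still leaves roughly 20 wrong characters per page, and 95% leaves roughly 100, so the two figures describe very different experiences for a reader.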

Did GF Handel REALLY live over 200 years and invent an OCR algorithm?

Now, I'm no expert in this field, but I don't think that's possible. I ask someone to investigate that and fix it, please! Zylorian (talk) 03:08, 4 January 2011 (UTC)

Since turned into a redlink for Paul W. Handel. -- Beland (talk) 05:37, 28 March 2013 (UTC)

IT technology

Doesn't "With IT technology development" seem strange? IT technology??? Is it okay to say information technology technology? — Preceding unsigned comment added by 75.70.113.11 (talk) 00:00, 23 December 2011 (UTC)

Definitely! Though this has since been removed from the article. -- Beland (talk) 05:36, 28 March 2013 (UTC)

Does not describe different aspects/problems/algorithms/approaches to OCR

For example: identifying paragraphs, lines, and word borders; using directed acyclic graphs (DAGs) of possible letter recognitions, i.e. encoding the different possible character sequences for a word's image (dam/darn, case/ease, and more complicated examples such as [[vv/w][c/e][t/i/l][c/e][o/0][rn/m][c/e]] for "welcome", which are most compactly described by a DAG); how individual characters are identified (at high dpi: tracing the outlines of the characters; at low dpi: pattern matching (dunno, autocorrelation, neural networks, ...?)); identifying images; ...

There is also no description of the current state of approaches to different kinds of characters: what methods work better for low dpi/high dpi, handwritten/typeset, kinds of alphabets, dealing with layout, ... — Preceding unsigned comment added by 83.134.181.240 (talk) 14:09, 9 January 2012 (UTC)

I found the same information lacking, so I scraped some together and added it. -- Beland (talk) 05:35, 28 March 2013 (UTC)
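To illustrate the DAG idea from the first comment in this thread, here is a minimal sketch using the dam/darn example. Nodes stand for cut positions in the word image and each edge carries one character hypothesis; the edge list, the function names and the one-word lexicon are all made up for illustration and are not how any specific OCR engine represents its hypotheses.

```python
def read_paths(edges, start, end):
    """Enumerate every string spelled by a path from start to end in a DAG."""
    outgoing = {}
    for src, dst, label in edges:
        outgoing.setdefault(src, []).append((dst, label))

    results = []

    def walk(node, prefix):
        if node == end:
            results.append(prefix)
            return
        for dst, label in outgoing.get(node, []):
            walk(dst, prefix + label)

    walk(start, "")
    return results


# Hypothetical segmentation: the last blob is either one "m" or "r" + "n".
edges = [
    (0, 1, "d"),
    (1, 2, "a"),
    (2, 4, "m"),
    (2, 3, "r"),
    (3, 4, "n"),
]
candidates = read_paths(edges, 0, 4)

# Hypothetical lexicon used to choose between the readings.
lexicon = {"darn"}
accepted = [word for word in candidates if word in lexicon]
print(candidates, accepted)
```

Listing every path is only workable for short words, since the number of readings grows quickly with the number of ambiguous segments. A practical system would presumably attach scores to the edges and search for the best path instead of enumerating them all.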