ASP.NET Issues: OCR in C# using Google's Tessnet2 (Fetch Text From image in C#)

September 12, 2011

I am experimenting with OCR (Optical Character Recognition), which means reading text from an image. I searched the web extensively and found two solutions:

1.) Using Tessnet2 (a .NET wrapper for Tesseract, the open-source OCR engine sponsored by Google)
2.) Using MODI (Microsoft Office Document Imaging Library)

I have not been able to try MODI so far, because this library ships with MS Office 2007 or XP, which I cannot get hold of yet.

I tried Tessnet2 and it gave me about 98% correct results, but it only read letters; it could not read digits.

Below are the steps I followed to use it.

1.) Download the Tessnet2 binary from the link below
http://www.pixel-technology.com/freeware/tessnet2/

2.) Add a reference to tessnet2_32.dll (for a 32-bit OS) or tessnet2_64.dll (for a 64-bit OS) in your Visual Studio project

3.) Download the language data definition file (tesseract-2.00.eng.tar.gz; I used English) from the link below
http://code.google.com/p/tesseract-ocr/downloads/list

4.) Unzip the downloaded archive and keep all of its files in a directory named 'tessdata'.
Place this directory in your application's bin\Debug folder.
For example, in my case I put it here: "D:\TanviDoc\OCRApp\OCRApp\bin\Debug\tessdata"

5.) Below is the sample code to do OCR

using System;
using System.Collections.Generic;
using System.Drawing;

namespace TesseractConsole
{
    class Program
    {
        static void Main(string[] args)
        {
            // Load the image to recognize; change the path to your own file.
            using (Bitmap bmp = new Bitmap(@"C:\Documents and Settings\lak\Desktop\quotes_7a.jpg"))
            {
                tessnet2.Tesseract ocr = new tessnet2.Tesseract();

                // Uncomment to restrict recognition to digits (set it before calling Init).
                // ocr.SetVariable("tessedit_char_whitelist", "0123456789");

                // null: use the 'tessdata' directory placed next to the executable (step 4).
                ocr.Init(null, "eng", false);

                // To OCR only part of the image, pass a region instead of Rectangle.Empty:
                // List<tessnet2.Word> r1 = ocr.DoOCR(bmp, new Rectangle(792, 247, 130, 54));
                List<tessnet2.Word> r1 = ocr.DoOCR(bmp, Rectangle.Empty);

                // Print the recognized text line by line.
                int lc = tessnet2.Tesseract.LineCount(r1);
                for (int i = 0; i < lc; i++)
                {
                    Console.WriteLine("Line {0} = {1}", i, tessnet2.Tesseract.GetLineText(r1, i));
                }

                // Print every recognized word with its confidence value.
                foreach (tessnet2.Word word in r1)
                    Console.WriteLine("{0}:{1}", word.Confidence, word.Text);
            }
        }
    }
}

6.) Run the program; you'll see the image converted into text.
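
If the program fails at this point, the setup from steps 2 and 4 is the first thing I would check. The sketch below is a hypothetical helper (SetupCheck, DllNameFor and HasLanguageData are my own names, not part of Tessnet2). It assumes that the referenced DLL has to match the bitness of the running process, and that the English data files are named eng.* as they are in the tesseract-2.00.eng archive.

```csharp
using System;
using System.IO;

namespace TesseractConsole
{
    // Hypothetical helper for checking the setup; not part of Tessnet2.
    static class SetupCheck
    {
        // Maps a pointer size to the DLL name from step 2.
        // IntPtr.Size is 4 in a 32-bit process and 8 in a 64-bit one.
        public static string DllNameFor(int pointerSize)
        {
            return pointerSize == 8 ? "tessnet2_64.dll" : "tessnet2_32.dll";
        }

        // True if baseDir\tessdata exists and holds at least one file
        // whose name starts with "<lang>." (assumed naming, e.g. eng.unicharset).
        public static bool HasLanguageData(string baseDir, string lang)
        {
            string dir = Path.Combine(baseDir, "tessdata");
            return Directory.Exists(dir)
                && Directory.GetFiles(dir, lang + ".*").Length > 0;
        }

        static void Main()
        {
            // The folder the executable runs from, e.g. bin\Debug.
            string baseDir = AppDomain.CurrentDomain.BaseDirectory;

            Console.WriteLine("Process is {0}-bit, so reference {1}",
                IntPtr.Size * 8, DllNameFor(IntPtr.Size));

            Console.WriteLine(HasLanguageData(baseDir, "eng")
                ? "Found eng.* files in tessdata"
                : "No eng.* files in " + Path.Combine(baseDir, "tessdata"));
        }
    }
}
```

Running this from the same bin\Debug folder as the OCR program shows which DLL matches the process and whether the tessdata folder is where Init(null, ...) will look for it.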

All the text came out correctly, except that I got 'In' in place of 'On'.

This technique did not convert digits from the image to text, which I still have to figure out.
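
One thing worth trying for digits is the character whitelist that appears commented out in the sample code, together with the third argument of Init, which I understand to be a numeric-mode flag. The following is an untested sketch under those assumptions, not a confirmed fix. The class name DigitsOnly and taking the image path from the command line are placeholders of mine, and the project still needs the Tessnet2 reference and the tessdata folder from the steps above.

```csharp
using System;
using System.Collections.Generic;
using System.Drawing;

namespace TesseractConsole
{
    // Hypothetical variant of the sample program that asks for digits only.
    static class DigitsOnly
    {
        static void Main(string[] args)
        {
            if (args.Length == 0)
            {
                Console.WriteLine("Usage: DigitsOnly <image file>");
                return;
            }

            using (Bitmap bmp = new Bitmap(args[0]))
            {
                tessnet2.Tesseract ocr = new tessnet2.Tesseract();

                // Limit the characters the engine may output; set before Init.
                ocr.SetVariable("tessedit_char_whitelist", "0123456789");

                // Assumption: the third argument switches on numeric mode.
                ocr.Init(null, "eng", true);

                List<tessnet2.Word> words = ocr.DoOCR(bmp, Rectangle.Empty);
                foreach (tessnet2.Word word in words)
                    Console.WriteLine("{0}:{1}", word.Confidence, word.Text);
            }
        }
    }
}
```

If the output is still wrong, passing a Rectangle that covers only the numbers, as in the commented-out DoOCR call in the sample code, may be worth a try as well.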

I also still have to experiment with MODI (Microsoft Office Document Imaging Library) and figure out which of the two is better.

5 comments:

Anonymous, October 26, 2011 at 5:33 AM
Hi Tanvi, I'm also having problems extracting digits from images. I wonder whether you have solved the issue and could help me.

Tanvi, October 26, 2011 at 7:51 AM
Hi, no. I tried a lot but couldn't get it to work. Then I used the Microsoft Office Imaging Library, and its results are very good.
If you want to use it, you can check the post below:

http://www.dotnetissues.com/search/label/OCR%20using%20MODI

Anonymous, October 26, 2011 at 8:46 AM
Thank you very much, I've read the other post about MODI just after asking you.

Anonymous, April 30, 2012 at 2:39 AM
MODI is far better than Tessnet. It is also easy to use.
It reads both digits and characters.

DEAN

Anuska, February 13, 2016 at 5:29 AM
How can I use it without installing Microsoft Office? Any clues?
