September 12, 2011

OCR in C# using Tessnet2 (Fetch Text from an Image in C#)

I am experimenting with OCR (Optical Character Recognition),
which means reading text from an image. I searched a lot over the web
and found two solutions:


1.) Using Tessnet2, a .NET wrapper around Google's Tesseract OCR engine
2.) Using MODI (Microsoft Office Document Imaging Library)


I couldn't try MODI so far, because this library comes with MS Office 2007 or XP, which I have not been able to get hold of yet.


I tried Tessnet2 and it gave me about 98% correct results, but it only read letters; it couldn't read digits.


Below are the steps I followed to use it.


1.) Download the Tessnet2 binary from the link below
http://www.pixel-technology.com/freeware/tessnet2/


2.) Add a reference to Tessnet2_32.dll (for a 32-bit OS) or Tessnet2_64.dll (for a 64-bit OS)
in your Visual Studio project






3.) Download the language data definition file (tesseract-2.00.eng.tar.gz; I did it for English) from the link below
http://code.google.com/p/tesseract-ocr/downloads/list


4.) Unzip the downloaded archive and keep all the files in a directory named 'tessdata'.
    Place this directory in your App/bin/Debug folder,
    e.g. in my case I put it here: "D:\TanviDoc\OCRApp\OCRApp\bin\Debug\tessdata"


5.) Below is the sample code to do OCR





using System;
using System.Collections.Generic;
using System.Drawing;

namespace TesseractConsole
{
    class Program
    {
        static void Main(string[] args)
        {
            // Load the image to recognize (change the path to your own image).
            using (Bitmap bmp = new Bitmap(@"C:\Documents and Settings\lak\Desktop\quotes_7a.jpg"))
            {
                tessnet2.Tesseract ocr = new tessnet2.Tesseract();
                // Uncomment to restrict recognition to digits only:
                // ocr.SetVariable("tessedit_char_whitelist", "0123456789");
                // null for the data path: the 'tessdata' directory from step 4 is used.
                ocr.Init(null, "eng", false);
                // To OCR only one region of the image, pass a rectangle instead:
                // List<tessnet2.Word> r1 = ocr.DoOCR(bmp, new Rectangle(792, 247, 130, 54));
                // Rectangle.Empty means the whole image is processed.
                List<tessnet2.Word> r1 = ocr.DoOCR(bmp, Rectangle.Empty);

                // Print the recognized text line by line.
                int lc = tessnet2.Tesseract.LineCount(r1);
                for (int i = 0; i < lc; i++)
                    Console.WriteLine("Line {0} = {1}", i, tessnet2.Tesseract.GetLineText(r1, i));

                // Print every recognized word with its confidence value.
                foreach (tessnet2.Word word in r1)
                    Console.WriteLine("{0}:{1}", word.Confidence, word.Text);
            }
        }
    }
}




6.) Run it and you'll find the image converted into text.
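
If the program fails when it reaches Init, one thing worth checking is whether the 'tessdata' directory from step 4 really ended up next to the executable. The sketch below is only an illustration: TessdataCheck and HasLanguageData are made-up names, not part of Tessnet2, and it assumes the language files are named "eng.*" (for example eng.unicharset), as they are in the English archive from step 3.

```csharp
using System;
using System.IO;

static class TessdataCheck
{
    // Hypothetical helper (not part of Tessnet2).
    // Returns true if 'tessdataDir' exists and contains at least one file
    // whose name starts with "<lang>." (e.g. "eng.unicharset").
    public static bool HasLanguageData(string tessdataDir, string lang)
    {
        return Directory.Exists(tessdataDir)
            && Directory.GetFiles(tessdataDir, lang + ".*").Length > 0;
    }
}
```

It could be called before ocr.Init, for example with Path.Combine(AppDomain.CurrentDomain.BaseDirectory, "tessdata") as the directory and "eng" as the language, printing a warning when it returns false.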


I got all the data correct, except that I got 'In' in place of 'On'.


The above technique did not convert digits from the image to text, which I still have to figure out.
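
One direction that may be worth trying for the digits is the whitelist variable that appears commented out in the sample code above. The sketch below sets tessedit_char_whitelist before calling Init, so that only the characters 0-9 are allowed in the output. It is an untested sketch, not a confirmed fix: the class name DigitsOnlyOcr and taking the image path from the first command-line argument are assumptions made for the example, and it only makes sense for images (or regions) that contain nothing but numbers.

```csharp
using System;
using System.Collections.Generic;
using System.Drawing;

class DigitsOnlyOcr
{
    static void Main(string[] args)
    {
        // Assumption: the image path is passed as the first argument.
        using (Bitmap bmp = new Bitmap(args[0]))
        {
            tessnet2.Tesseract ocr = new tessnet2.Tesseract();
            // Allow only digits in the result; set this before Init.
            ocr.SetVariable("tessedit_char_whitelist", "0123456789");
            // Same Init call as in the sample code (tessdata from step 4).
            ocr.Init(null, "eng", false);

            // Process the whole image and print each recognized word.
            List<tessnet2.Word> words = ocr.DoOCR(bmp, Rectangle.Empty);
            foreach (tessnet2.Word word in words)
                Console.WriteLine("{0}:{1}", word.Confidence, word.Text);
        }
    }
}
```

For an image that mixes text and numbers, the same idea could be combined with the Rectangle overload of DoOCR from the sample code, so that the whitelist is applied only to the region that holds the digits.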


I also still have to experiment with MODI (Microsoft Office Document Imaging Library) and figure out which one is better.

5 comments:

  1. Hi Tanvi, I'm also having problems extracting digits from images, and I wonder whether you have solved the issue and could help me.

  2. Hi, no, I tried a lot but couldn't do that. Then
    I used the Microsoft Office Document Imaging Library, and the results are very good.
    If you want to use it, you can check the post below

    http://www.dotnetissues.com/search/label/OCR%20using%20MODI

  3. Thank you very much, I've read the other post about MODI just after asking you.

  4. MODI is far better than Tessnet. It is also easy to use.
    It reads both digits and characters.

    DEAN

  5. Without installing Microsoft Office, how can I use it?
    Any clues?
