
Extract Text from Images with Tesseract OCR in C#

Optical Character Recognition (OCR) converts printed or handwritten text in images into machine-readable text. Among the OCR engines available, Tesseract, an open-source engine whose development Google sponsored for many years, stands out for its accuracy and its open-source license. In this blog post, we’ll show how to use Tesseract OCR in your C# projects to read text from images.

Prerequisites:

To follow this tutorial, you’ll need:

  1. A basic understanding of C# programming.
  2. Visual Studio or a similar C# development environment.
  3. An image containing text that you want to extract.

Let’s dive in and start extracting text from images!

Step 1: Install Tesseract4 NuGet Package

The first step is to install the Tesseract4 NuGet package in your C# project. This package provides a .NET wrapper for the Tesseract OCR engine. You can install it through Visual Studio’s NuGet Package Manager or from the Package Manager Console:

Install-Package Tesseract4 -Version 4.1.1

Step 2: Download Language Data Files

Tesseract OCR requires language data files to recognize text in different languages. To download the appropriate file for your target language, visit one of the language data repositories that the tesseract-ocr project hosts on GitHub, and choose the one that best suits your needs: “tessdata_best” for the highest accuracy, “tessdata_fast” for faster performance, or the standard “tessdata” for a balance between the two.

For example, if you want to use English, download the “eng.traineddata” file from your chosen repository and place it in a folder named “tessdata” within your project directory.
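
A missing or misplaced language file is a common reason for the engine failing to start, so it can help to check for it up front. The sketch below is only an illustration: TessdataLocator and Resolve are hypothetical names, and it assumes the “tessdata” folder is copied next to the compiled executable (for example by setting “Copy to Output Directory” on the file in Visual Studio).

```csharp
using System;
using System.IO;

static class TessdataLocator
{
    // Hypothetical helper: returns the "tessdata" folder under baseDirectory,
    // or throws if the requested language file is not inside it.
    public static string Resolve(string baseDirectory, string language)
    {
        string dataPath = Path.Combine(baseDirectory, "tessdata");
        string trainedData = Path.Combine(dataPath, language + ".traineddata");

        if (!File.Exists(trainedData))
        {
            throw new FileNotFoundException(
                "Missing language data file: " + trainedData, trainedData);
        }

        return dataPath;
    }
}

class Program
{
    static void Main()
    {
        try
        {
            // AppContext.BaseDirectory is the folder that contains the running executable.
            string dataPath = TessdataLocator.Resolve(AppContext.BaseDirectory, "eng");
            Console.WriteLine("Using tessdata folder: " + dataPath);
        }
        catch (FileNotFoundException ex)
        {
            Console.WriteLine(ex.Message);
        }
    }
}
```

The returned path is the value you would pass to the engine as the data path in the next step.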

Step 3: Write the Code

With the Tesseract4 package installed and the language data file in place, it’s time to write the C# code to extract text from an image. Add the following code snippet to your C# project:

using System;
using Tesseract;

namespace TesseractOCRExample
{
    class Program
    {
        static void Main(string[] args)
        {
            string imagePath = "path/to/your/image.jpg"; // Replace with your image file path
            string dataPath = "path/to/your/tessdata"; // Replace with your tessdata folder path

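            // Initialize the engine with the tessdata folder and the English language data
            using (var engine = new TesseractEngine(dataPath, "eng", EngineMode.Default))
            {
                // Load the image from disk
                using (var img = Pix.LoadFromFile(imagePath))
                {
                    // Run recognition on the image
                    using (var page = engine.Process(img))
                    {
                        string text = page.GetText();
                        Console.WriteLine("Detected text:\n{0}", text);
                        Console.ReadKey(); // Keep the console window open until a key is pressed
                    }
                }
            }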
        }
    }
}

Make sure to replace path/to/your/image.jpg with the path to your image file and path/to/your/tessdata with the path to your “tessdata” folder.
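
If you plan to call the OCR code from more than one place, it may be convenient to wrap it in a method. The following is a minimal sketch that uses only the engine calls from the listing above; OcrHelper and ExtractText are hypothetical names, and the up-front file check is an addition for clearer error messages.

```csharp
using System;
using System.IO;
using Tesseract;

static class OcrHelper
{
    // Hypothetical helper: runs the same calls as the listing above
    // and returns the recognized text instead of printing it.
    public static string ExtractText(string imagePath, string dataPath, string language = "eng")
    {
        // Fail early with a clear message if the image path is wrong.
        if (!File.Exists(imagePath))
        {
            throw new FileNotFoundException("Image file not found: " + imagePath, imagePath);
        }

        using (var engine = new TesseractEngine(dataPath, language, EngineMode.Default))
        using (var img = Pix.LoadFromFile(imagePath))
        using (var page = engine.Process(img))
        {
            return page.GetText();
        }
    }
}

class Program
{
    static void Main()
    {
        try
        {
            // Placeholder paths, as in the listing above.
            string text = OcrHelper.ExtractText("path/to/your/image.jpg", "path/to/your/tessdata");
            Console.WriteLine("Detected text:\n{0}", text);
        }
        catch (FileNotFoundException ex)
        {
            Console.WriteLine(ex.Message);
        }
    }
}
```

Note that this sketch creates a new engine on every call, which keeps the example short. If you process many images, you would probably want to create the engine once and reuse it.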

Step 4: Run Your Program

Now you’re ready to run your C# program. Upon execution, Tesseract OCR will extract the text from the provided image and display it in the console.

Keep in mind that the quality of text recognition depends on factors such as image quality and font type. If the results are poor, you might need to preprocess the image first, for example by converting it to grayscale, increasing its contrast, or scaling up small text.
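
To give a rough idea of what preprocessing can look like, here is a minimal sketch of two common steps: grayscale conversion and fixed-threshold binarization. It works on a raw pixel buffer so that it needs no imaging library. The Preprocess class, the interleaved RGB layout, and the threshold of 128 are all assumptions made for illustration; loading pixels from an image file and saving them back is left out.

```csharp
using System;

static class Preprocess
{
    // Converts interleaved 8-bit RGB samples (R, G, B, R, G, B, ...)
    // into one gray byte per pixel using integer luma weights.
    public static byte[] ToGrayscale(byte[] rgb)
    {
        if (rgb.Length % 3 != 0)
        {
            throw new ArgumentException("Expected 3 bytes per pixel.", nameof(rgb));
        }

        var gray = new byte[rgb.Length / 3];
        for (int i = 0; i < gray.Length; i++)
        {
            int r = rgb[3 * i];
            int g = rgb[3 * i + 1];
            int b = rgb[3 * i + 2];
            // Weights approximate 0.299 R + 0.587 G + 0.114 B.
            gray[i] = (byte)((299 * r + 587 * g + 114 * b) / 1000);
        }
        return gray;
    }

    // Maps each gray value to pure white (255) or pure black (0).
    public static byte[] Binarize(byte[] gray, byte threshold)
    {
        var result = new byte[gray.Length];
        for (int i = 0; i < gray.Length; i++)
        {
            result[i] = gray[i] >= threshold ? (byte)255 : (byte)0;
        }
        return result;
    }
}

class Program
{
    static void Main()
    {
        // Four pixels: white, black, light gray, dark gray.
        byte[] rgb = { 255, 255, 255, 0, 0, 0, 200, 200, 200, 40, 40, 40 };
        byte[] blackAndWhite = Preprocess.Binarize(Preprocess.ToGrayscale(rgb), 128);
        Console.WriteLine(string.Join(",", blackAndWhite)); // prints 255,0,255,0
    }
}
```

A single fixed threshold is the simplest possible choice and tends to struggle with uneven lighting, so treat it as a starting point rather than a recommendation.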

Congratulations! You now know how to integrate Tesseract OCR into your next C# project.