Tesseract.js

Pure JavaScript OCR for more than 100 languages


Tesseract.js is a JavaScript library that extracts words in almost any language from images. (Demo)

Image Recognition

[Image: fancy demo gif]

Video Real-time Recognition

[Video: Tesseract.js Video]


Tesseract.js wraps an Emscripten port of the Tesseract OCR Engine.
It works in the browser, using webpack or plain script tags with a CDN, and on the server with Node.js.
After you install it, using it is as simple as:

```js
import Tesseract from 'tesseract.js';

Tesseract.recognize(
  'https://tesseract.projectnaptha.com/img/eng_bw.png',
  'eng',
  { logger: m => console.log(m) }
).then(({ data: { text } }) => {
  console.log(text);
});
```

Or, more imperatively:

```js
import { createWorker } from 'tesseract.js';

(async () => {
  // createWorker is async in v4, so it is awaited inside the async function
  const worker = await createWorker({
    logger: m => console.log(m)
  });
  await worker.loadLanguage('eng');
  await worker.initialize('eng');
  const { data: { text } } = await worker.recognize('https://tesseract.projectnaptha.com/img/eng_bw.png');
  console.log(text);
  // Free the worker's resources once recognition is done
  await worker.terminate();
})();
```

Check out the docs for a full explanation of the API.
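The `logger` option in both examples above receives a message object for each progress update. As a minimal sketch, assuming each message carries a `status` string and a `progress` fraction between 0 and 1 (check the docs for the exact shape), a small formatter could turn those messages into readable progress lines. The name `formatProgress` is hypothetical and not part of the library.

```javascript
// Hypothetical helper: format a logger message as "status: NN%".
// Assumes `m.status` is a string and `m.progress` is a number in [0, 1].
function formatProgress(m) {
  const fraction = typeof m.progress === 'number' ? m.progress : 0;
  const percent = Math.round(fraction * 100);
  return `${m.status}: ${percent}%`;
}

// Example with a hand-written message object (not real library output).
console.log(formatProgress({ status: 'recognizing text', progress: 0.5 }));
```

It could then be passed as `{ logger: m => console.log(formatProgress(m)) }` in place of the plain `console.log(m)` logger.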

Major changes in v4

Version 4 includes many new features and bug fixes; see this issue for a full list. Several highlights are below.

- Added rotation preprocessing options (including auto-rotate) for significantly better accuracy
- Processed images (rotated, grayscale, binary) can now be retrieved
- Improved support for parallel processing (schedulers)
- Breaking changes:
  - createWorker is now async
  - getPDF function replaced by pdf recognize option
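To illustrate the scheduler support and the now-async `createWorker` mentioned above, here is a rough sketch of recognizing several images in parallel. The function name `recognizeAll` and its `workerCount` parameter are hypothetical; the library module is passed in as the `tesseract` argument (an assumption made here so the sketch can also be run against a stand-in), and the calls `createScheduler`, `addWorker`, `addJob`, and `terminate` should be checked against the docs for your version.

```javascript
// Sketch: recognize several images in parallel using a scheduler.
// `tesseract` is the imported tesseract.js module (passed in as an assumption
// of this sketch); `recognizeAll` and `workerCount` are hypothetical names.
async function recognizeAll(tesseract, images, lang = 'eng', workerCount = 2) {
  const scheduler = tesseract.createScheduler();
  try {
    for (let i = 0; i < workerCount; i += 1) {
      const worker = await tesseract.createWorker(); // async as of v4
      await worker.loadLanguage(lang);
      await worker.initialize(lang);
      scheduler.addWorker(worker);
    }
    // Each job is queued and handed to a worker as one becomes free.
    const results = await Promise.all(
      images.map((image) => scheduler.addJob('recognize', image))
    );
    return results.map(({ data: { text } }) => text);
  } finally {
    // Terminating the scheduler also terminates the workers added to it.
    await scheduler.terminate();
  }
}
```

With the default import from the first example, this might be called as `recognizeAll(Tesseract, ['a.png', 'b.png'])`, where the file names are placeholders.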

Major changes in v3

- Significantly faster performance
   - Runtime reduction of 84% for Browser and 96% for Node.js when recognizing the example images
- Upgrade to Tesseract v5.1.0 (using Emscripten 3.1.18)
- Added SIMD-enabled build for supported devices
- Added support:
   - Node.js version 18
- Removed support:
   - ASM.js version, any other old versions of Tesseract.js-core (<3.0.0)
   - Node.js versions 10 and 12

Major changes in v2

- Upgrade to Tesseract v4.1.1 (using Emscripten 1.39.10 upstream)
- Support for multiple languages at the same time, e.g. `eng+chi_tra` for English and Traditional Chinese
- Supported image formats: png, jpg, bmp, pbm
- Support for WebAssembly (falls back to ASM.js when the browser doesn't support it)
- Support for TypeScript
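The multi-language string mentioned above is a list of language codes joined with `+`. A minimal sketch of building it from an array follows; the helper name `joinLanguages` is hypothetical and not part of the library.

```javascript
// Hypothetical helper: build a multi-language string such as "eng+chi_tra".
function joinLanguages(codes) {
  if (!Array.isArray(codes) || codes.length === 0) {
    throw new TypeError('expected a non-empty array of language codes');
  }
  return codes.join('+');
}

console.log(joinLanguages(['eng', 'chi_tra'])); // prints "eng+chi_tra"
```

The resulting string would then be used wherever a single language code such as `'eng'` appears in the examples above.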

Read a story about v2: Why I refactor tesseract.js v2?
Check the support/1.x branch for version 1

Installation

Tesseract.js works with a `