- How To Ocr An Existing Pdf Converter
- How To Ocr An Existing Pdf
- How To Ocr An Existing Pdf Online
- How To Ocr An Existing Pdf Format
OCR an existing PDF
To OCR a document in Nitro Pro 7:
- Open the PDF document you want to OCR.
- On the Edit tab, click the OCR button in the Text/Images panel.
- In the Recognize Text using OCR dialog, specify the text language and page options.
To OCR a document in Nitro Pro:
- Open the PDF document you want to OCR.
- On the Review tab, click the OCR button in the Documents panel.
- In the Optical Character Recognition (OCR) dialog, choose whether the output text should be Searchable or Searchable and Editable.
- Click the Options button to select a target page range, and click Advanced to configure OCR preferences.
I am looking for an offline scriptable tool that makes an existing PDF file searchable by running OCR on it, replacing the original non-searchable file with the searchable version, and can run unattended.
E.g., www.pdfscannerapp.com - does exactly what I need, but it's GUI only - not scriptable.
I am aware that Evernote makes PDF files searchable, but they remain searchable only when within Evernote.
I am not looking for perfect OCR, even a moderately acceptable OCR is fine, but I would prefer a small utility rather than a bulky software package.
(I am aware of a similar, but different question on AD: Looking for Software to Scan or Convert to Searchable and Signable PDF - however, I don't need to sign or fill PDFs, and my requirement is that the solution is scriptable)
EDIT:
1) Several utilities allow structured text extraction, however in order to be extracted, the text must be there; I am mainly referring to PDFs that are wrapped bitmaps, as is the case with plain PDFs generated by scanners.
2) I am not necessarily looking for a free solution, and I would be more than happy to pay for a good utility that just does what I need, but I am not looking for bulky applications with a million features that include an OCR feature but whose cost does not justify buying them just for the OCR functionality.
3) As stated above, I am not looking for perfect OCR, just a moderately acceptable OCR. Unfortunately, in my experience, Tesseract is really below that threshold. I define 'moderately acceptable' as an OCR that can, say, OCR a utility bill so that at least the account number (customer number) is recognized correctly.
EDIT: 'scriptable' or 'automatable', that is, able to be triggered automatically and run unattended without human input whatsoever.
magma
13 Answers
It's not entirely clear to me what your requirements are for being able to 'script' this from the 'command line'.
If you are talking about automation, then that is possible with any number of utilities.
ABBYY FineReader Express + Keyboard Maestro + Hazel
I use ABBYY FineReader Express + Keyboard Maestro + Hazel like so:
- Hazel monitors a given folder for any new PDFs
- if a PDF is found, it is opened in 'ABBYY FineReader Express'
- Keyboard Maestro then automates the process of turning the PDF into a Searchable PDF (OCR) and saves the file to a different directory.
Now, if you don't own Hazel and Keyboard Maestro already, your initial costs are going to rise pretty quickly (although I depend on both so much I consider them a bargain).
PDFPen + AppleScript + Folder Actions
You could do something similar with PDFPen (or PDFPenPro) and folder actions and AppleScript. See https://gist.github.com/prenagha/1355037 for one example.
Marco Arment did a survey of OCR apps for Mac and found that PDFPen had great results and was easy to automate.
A Google search for 'PDFpen applescript OCR' will turn up a number of alternatives.
TJ Luoma
What you want is Tesseract OCR. It's an open-source OCR engine that is maintained by Google and supports a variety of platforms. It also has a native command-line interface. It's exactly what you're looking for, and it is available from the MacPorts project as well as Homebrew.
Project Home: https://github.com/tesseract-ocr
How to install on OS X: http://blog.matt-swain.com/post/26419042500/installing-tesseract-ocr-on-mac-os-x-lion
Usage Example:
tesseract -l eng input.pdf output
CousinCocaine
Daniel Kocevski
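A note on the usage example above: Tesseract reads image files (PNG, TIFF, JPEG and similar) rather than PDFs, so a scanned PDF generally has to be turned into page images first. The following is a minimal sketch, not the answerer's own script, of OCRing one page image into a searchable PDF with Tesseract's built-in `pdf` output config. The helper names `ocr_output_base` and `ocr_image_to_pdf` and the file paths are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: derive the base name handed to Tesseract.
# Tesseract appends the extension itself (".pdf" for the pdf config).
ocr_output_base() {
    # Strip the last extension: "scans/bill.png" -> "scans/bill"
    printf '%s\n' "${1%.*}"
}

# Hypothetical helper: OCR one page image into a searchable PDF.
# "pdf" is Tesseract's built-in config for PDF output with a text layer.
ocr_image_to_pdf() {
    image=$1
    lang=${2:-eng}
    tesseract "$image" "$(ocr_output_base "$image")-ocr" -l "$lang" pdf
}

# Usage (requires tesseract and the language data to be installed):
#   ocr_image_to_pdf scans/bill.png        # writes scans/bill-ocr.pdf
```

The name helper is kept separate so the output location can be changed without touching the Tesseract call.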
Disclaimer: NOT AN OCR SOLUTION (but this answer is still useful for extracting text from a PDF)
There is an Apache Software Foundation project called Apache Tika:
A toolkit detects and extracts metadata and structured text content from various documents using existing parser libraries
They support PDF text extraction using PDFBox:
allows creation of new PDF documents, manipulation of existing documents and the ability to extract content from documents. Apache PDFBox also includes several command line utilities
And they recently also added support for OCR (via Tesseract).
For a text-based solution, PDFBox makes it very simple to extract text from a PDF:
- Download the pdfbox-app package from https://pdfbox.apache.org/downloads.html
- Run the ExtractText command on it: java -jar pdfbox-app-x.y.z.jar ExtractText myNiceBook.pdf myNiceBook.txt
It also has some other nice options that you can see in ExtractText docs.
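To run the same ExtractText command unattended over a whole folder, a loop along these lines could work. This is a sketch under assumptions: the `PDFBOX_JAR` variable and the helper names `txt_name_for` and `extract_all` are hypothetical, and `x.y.z` stands for whichever version of the jar was downloaded.

```shell
#!/bin/sh
# Path to the downloaded jar; x.y.z is a placeholder for the actual version.
PDFBOX_JAR=${PDFBOX_JAR:-pdfbox-app-x.y.z.jar}

# Hypothetical helper: "docs/myNiceBook.pdf" -> "docs/myNiceBook.txt"
txt_name_for() {
    printf '%s\n' "${1%.pdf}.txt"
}

# Hypothetical helper: run ExtractText on every PDF in a directory.
extract_all() {
    dir=${1:-.}
    for pdf in "$dir"/*.pdf; do
        [ -e "$pdf" ] || continue      # the glob matched nothing
        java -jar "$PDFBOX_JAR" ExtractText "$pdf" "$(txt_name_for "$pdf")"
    done
}

# Usage: extract_all "$HOME/Documents/scans"
```

As the disclaimer says, this only recovers text that is already present in the PDF; scanned pages without a text layer still need an OCR step.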
brutuscat
I would recommend DEVONThink Pro Office. It is an excellent application and has very good AppleScript support. Alas only the 'Pro Office' version has the OCR capability - so you'll have to shell out £100 ($150).
It would be overkill if you're only using it for scripted OCR - but it's a very good app.
[edit] - ah just re-read your post - it would definitely be overkill!
If you just want OCR from the shell, you could try talking to ABBYY, whose engine DEVON licenses:
Diggory
You can make your existing PDF searchable by converting it into a text file. For that you need at least ImageMagick, Ghostscript (for PDF conversion) and the Tesseract OCR tool.
A command-line example:
This can be extended further to your needs.
To install the required tools on OS X, you can use Homebrew:
On Linux, use apt-get or yum instead of brew.
For more OCR tools, check: OCR on Linux systems
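The ImageMagick, Ghostscript and Tesseract pipeline described in this answer could look roughly like the sketch below. It is an illustration under assumptions rather than the answerer's original commands: the helper names `text_name_for` and `pdf_to_text` are hypothetical, 300 dpi is just a common choice for OCR, and ImageMagick 7 installs the command as `magick` instead of `convert`. On macOS the tools can be installed with `brew install imagemagick ghostscript tesseract`.

```shell
#!/bin/sh
# Hypothetical helper: "bill.pdf" -> "bill.txt"
text_name_for() {
    printf '%s\n' "${1%.pdf}.txt"
}

# Hypothetical helper: rasterize a scanned PDF and OCR every page to plain text.
pdf_to_text() {
    pdf=$1
    lang=${2:-eng}
    out=$(text_name_for "$pdf")
    work=$(mktemp -d) || return 1

    # ImageMagick delegates PDF reading to Ghostscript.
    # Flattening onto white avoids transparent pages confusing the OCR.
    convert -density 300 "$pdf" -background white -alpha remove \
        "$work/page-%04d.png" || return 1

    : > "$out"                          # start with an empty text file
    for page in "$work"/page-*.png; do
        # Tesseract writes "<base>.txt" for the base name it is given.
        tesseract "$page" "${page%.png}" -l "$lang" || return 1
        cat "${page%.png}.txt" >> "$out"
    done
    rm -rf "$work"
}

# Usage: pdf_to_text bill.pdf        # writes bill.txt next to the PDF
```

The zero-padded page numbers keep the pages in order when the shell expands the glob.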
kenorb
A solution that is easy to set up and produces an output PDF with the same quality as the input file, at a reasonable size, is OCRmyPDF:
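Since the question asks for the original file to be replaced unattended, one possible way to wrap OCRmyPDF is sketched below. It assumes `ocrmypdf` is installed and on the PATH; the helper names `tmp_name_for` and `ocr_in_place` and the folder path are hypothetical. Writing to a temporary file first means the original is only overwritten when OCR succeeds.

```shell
#!/bin/sh
# Hypothetical helper: name of the temporary file written next to the original.
tmp_name_for() {
    printf '%s\n' "${1%.pdf}.ocr-tmp.pdf"
}

# Hypothetical helper: OCR a PDF and replace the original only on success.
ocr_in_place() {
    pdf=$1
    tmp=$(tmp_name_for "$pdf")
    # --skip-text leaves pages that already contain text untouched.
    if ocrmypdf --skip-text "$pdf" "$tmp"; then
        mv "$tmp" "$pdf"
    else
        rm -f "$tmp"
        return 1
    fi
}

# Unattended use over a folder (hypothetical path):
#   for f in "$HOME/Scans"/*.pdf; do ocr_in_place "$f"; done
```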
user127022
Stack Overflow has related questions under PDF-parsing covering things such as PDFBox and Apache Tika, which uses PDFBox. The Ruby code below extracts text from a PDF. You need good enough resolution for this type of code to work robustly, so get a scanner with sufficiently high resolution and then see whether some of the software works.
Examples
SO threads
[Edit]
I am not sure whether I have understood your problem now. You want to add an OCR layer to different kinds of material such as random photos, screenshots, PDFs without an OCR layer and so on? I don't know the solution, but I am sure someone does, so I asked a specific question about how to do it with Automator and some OCR software:
hhh
For this type of self-directed application, I'm a big fan of Hazel.
It makes it extremely easy to script actions without needing to learn a more command-line-oriented tool like Perl or Python, and paired with the OCR engine of your choice (mine is currently PDFpen Pro) you should have no problems getting your files processed with minimal fuss.
Both of these are paid software, but the utility of both extends far past this one case. In my situation, with the labor involved in digitizing my past scanned records (and ongoing paper), the time I would have spent programming this elsewhere far outweighs the price of these tools, and now that I own both, I can do many other tasks with them.
bmike
PDFScannerApp does have unofficial scripting support. Contact the author for the Automator action.
kenorb
ndf
I use Adobe Acrobat to OCR in batch. My duplex scanner can OCR after scanning, but the OCR technology in Acrobat is more accurate in my opinion. I just point it to the folder of PDFs that have no OCR, and Acrobat re-saves each PDF as a searchable PDF that now includes a text layer. If I wanted to OCR via the command line, I don't know of a way, but I can automate the GUI end by using AutoHotkey. It is not as reliable or as fast as the command line, but it does the job after you set up a workflow action to minimize the GUI interaction.
For Mac, AppleScript does what AutoHotkey does on the PC, although I haven't tried it on my Mac yet.
AutoHotkey comes with a recorder, so most of the script writing is done for you, with a little bit of editing for refinement and perhaps looping if you want that.
I've been experimenting with OCRing images but haven't fully automated the process through Acrobat yet. The command line is ideal, but I haven't found a quality OCR engine that exceeds Acrobat, so I stick with Acrobat for now.
Sun
I stumbled upon this recently: http://ocrkit.com/faq.html
You have to pay after 14 days though
Charlton
I got high quality Drag & Drop conversion working using Docker.
If you:
- install Docker for your Mac and
- then create a new Automator app
- with these contents inside a 'Run a Shell Script' action. Choose Pass Input: 'as arguments', Shell: /bin/bash, script text:
You should then be good to drag-and-drop PDFs onto it, and you'll get a similarly named PDF with '-ocr' appended to the file name.
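The script text itself is not reproduced in this answer. A minimal sketch of what such a 'Run a Shell Script' body might look like follows. It is an assumption, not the answerer's script: it relies on the jbarlow83/ocrmypdf image using ocrmypdf as its entry point, mounts the file's folder at a hypothetical `/data` path inside the container, and uses a hypothetical helper `ocr_name_for` to build the '-ocr' file name.

```shell
#!/bin/bash
# Hypothetical helper: "/Users/me/scan.pdf" -> "scan-ocr.pdf"
ocr_name_for() {
    base=$(basename "$1")
    printf '%s\n' "${base%.pdf}-ocr.pdf"
}

# Automator's default PATH often lacks the folder where the docker CLI lives.
PATH=/usr/local/bin:$PATH

# Automator passes each dropped file as one argument.
for f in "$@"; do
    dir=$(cd "$(dirname "$f")" && pwd)     # absolute folder to mount
    docker run --rm -v "$dir:/data" --workdir /data jbarlow83/ocrmypdf \
        --force-ocr "$(basename "$f")" "$(ocr_name_for "$f")"
done
```

Mounting only the folder that contains the dropped file keeps the container's view of the disk small, and the output lands next to the original.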
I imagine it could easily be modified to return a file to Automator to copy somewhere as well. More details are available for the fine OCRmyPDF Docker package and the main tool (also mentioned in a different answer).
You can test it in Automator itself with 'Get specified Finder items' action as input to this.
The first time it runs, it may take more time, as it will need to download the Docker images for OCRmyPDF (invisibly). In Terminal, you can alternatively run docker pull jbarlow83/ocrmypdf beforehand to speed up the first run. A typical run takes about 10 seconds per high-DPI page, and the results can be read aloud by text-to-speech even if there are tables or diagrams. Before OCRing, I crop using Sejda so that nonsense margin words from other pages are removed.
The --force-ocr argument tells the tool to ignore and overwrite any earlier OCR attempts, which in my case are usually only partial and useless.
thadk
OCRKit has both AppleScript support and a CLI. From their help page:
AppleScript
You can also script OCRKit to integrate it into your specific workflow, for example to process incoming files via a shared folder, from an MFP copy machine, etc., and simply tell OCRKit to open and thus process them via AppleScript:
Command line
Since OCRKit version 2.5, direct command line scripting is supported. This greatly simplifies the use of OCRKit in batch processing, allows setting more options, and is also more robust and cross-platform than AppleScript.
Since OCRKit version 16.9 additional command line options are supported:
-r, --recursive directory
Scan directory recursively for new files. Skips files from OCRKit, with text layer or vector graphics.
--pattern 'regex'
Pattern used to match filenames during recursive scans. Defaults to %.pdf$; the recommendation for TIFF is %.tiff?$
--log file
Write log file information and statistics during recursive scan to file.
--password secret
Use secret password to decrypt PDF files during batch processing.
--test-run [ fast ]
Run batch processing in test mode only, to test PDF files or to obtain the page count for estimating total processing time. 'fast' will only check the first page of each file, instead of going through all pages for image and vector analysis.
--tag name
Use extended attribute name to tag the processing state of files during batch processing. macos:OCRKit (%s) will use native macOS Finder tags instead, or simply macos:OCRKit to leave out the state attribute. The state attribute values are, in order: started, analyzed, processed; a file can also be marked encrypted.
xilopaint