PDF To XML and forcing OCR for text extraction from scanned images inside PDF

Updated over 12 months ago

For Zapier, Integromat, and other plugins, insert the custom profile into the profiles field. For API calls, pass the profile as a string in the profiles parameter.

Extracting XML from a scanned PDF can fail in the special case where the file contains both scanned images and long text objects (for example, “Generated by Foxit PDF Creator ….”). Optical Character Recognition (OCR) runs automatically only when a document contains no text, so a file like this is handled as a text document and the text inside its scanned images is not extracted. We can force OCR for such documents with a custom profile that sets the OCRMode option.

The following profile forces OCR.

{ "OCRMode": "TextFromImagesOnly" }
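As a minimal sketch of the "profile as a string" rule for API calls, the Python snippet below serializes the profile above and places it in the profiles parameter. The helper name build_request_body, the "url" input parameter, and the example file URL are assumptions made for illustration; only the profiles parameter and the OCRMode profile come from this article. No request is sent; the snippet only builds and prints the body.

```python
import json

# Profile from this article: run OCR on images even though the PDF has text objects.
FORCE_OCR_PROFILE = {"OCRMode": "TextFromImagesOnly"}


def build_request_body(source_url: str, profile: dict) -> dict:
    """Build a request body whose `profiles` value is a JSON string, not a nested object.

    `source_url` and the "url" key are hypothetical names used for illustration.
    """
    return {
        "url": source_url,                # assumed name of the input-file parameter
        "profiles": json.dumps(profile),  # the profile is passed as a string
    }


body = build_request_body("https://example.com/scanned.pdf", FORCE_OCR_PROFILE)
print(json.dumps(body, indent=2))
```

Serializing with json.dumps avoids hand-escaping quotes, which is an easy way to end up with an unbalanced profile string.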

We can also combine profiles, as shown below.

private const string Profiles = "{ \"profiles\": [ { \"profile1\": { \"DetectNewColumnBySpacesRatio\": \"2.0\" } }, { \"profile2\": { \"OCRMode\": \"TextFromImagesOnly\" } } ] }";

Applies To:

  • /pdf/convert/to/csv

  • /pdf/convert/to/xml

  • /pdf/convert/to/json

  • /pdf/convert/to/xls

  • /pdf/convert/to/xlsx
