Convert PDF Pages To Text With Python


A simple guide to extracting text from a PDF. This is an extension of the "Convert PDF pages to JPEG with python" post.

Objectives:

  • Extract text from PDF

Required Tools:

  • Poppler for Windows: Poppler is a PDF rendering library. It includes the pdftoppm utility.
  • Poppler for Mac: if Homebrew is already installed, you can use brew install poppler.
  • pdftotext: Python module that uses Poppler to convert PDF to text.

Steps

    • Install Poppler. On Windows, add “xxx/bin/” to the PATH environment variable.
    • Run pip install pdftotext.
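
To confirm that the PATH step worked, a quick check from Python can help. This is a minimal sketch, not part of the original post: the helper name find_poppler_tool is hypothetical, and it assumes the Poppler bin folder contains the pdftoppm utility mentioned above. It only uses shutil.which from the standard library.

```python
import shutil


def find_poppler_tool(name="pdftoppm"):
    """Return the full path of a Poppler command-line tool, or None if it is not on PATH."""
    # shutil.which searches the directories listed in the PATH environment variable.
    return shutil.which(name)


path = find_poppler_tool()
if path is None:
    print("pdftoppm not found; check that Poppler's bin folder is on PATH")
else:
    print(f"Found pdftoppm at {path}")
```

If the tool is not found right after editing PATH, opening a new terminal usually picks up the change.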

Usage (sample code from the pdftotext GitHub repository)

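import pdftotext

# Load your PDF
with open("Target.pdf", "rb") as f:
    pdf = pdftotext.PDF(f)

# Save all text to a txt file, with a blank line between pages.
# An explicit encoding avoids errors on systems whose default is not UTF-8.
with open("output.txt", "w", encoding="utf-8") as f:
    f.write("\n\n".join(pdf))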
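
The sample above joins every page into one file. Since the pdftotext.PDF object can be iterated page by page, the same idea can be extended to write one text file per page. The following is a rough sketch under stated assumptions: the helper names write_pages and pdf_to_page_files and the page_001.txt naming scheme are made up for illustration and do not come from the pdftotext project.

```python
from pathlib import Path


def write_pages(pages, out_dir):
    """Write each page's text to out_dir/page_NNN.txt and return the created paths."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for number, text in enumerate(pages, start=1):
        path = out / f"page_{number:03d}.txt"
        path.write_text(text, encoding="utf-8")
        paths.append(path)
    return paths


def pdf_to_page_files(pdf_path, out_dir):
    """Hypothetical wrapper: extract a PDF with pdftotext and write one file per page."""
    import pdftotext  # imported here so write_pages can be used without it

    with open(pdf_path, "rb") as f:
        pdf = pdftotext.PDF(f)
        return write_pages(pdf, out_dir)
```

For example, pdf_to_page_files("Target.pdf", "pages") would create pages/page_001.txt, pages/page_002.txt, and so on, one per page of the PDF.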

Further notes
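
pdftotext reads the text layer of a PDF and does not perform OCR, so pages that are only scanned images typically come back empty. A small check like the one below can flag such pages. It is only a sketch: the function name pages_without_text is hypothetical, and treating a whitespace-only page as "no text" is an assumption that may not suit every document.

```python
def pages_without_text(pages):
    """Return the 1-based numbers of pages whose extracted text is empty or only whitespace."""
    return [
        number
        for number, text in enumerate(pages, start=1)
        if not text.strip()
    ]


# Usage with pdftotext (requires the module and a real PDF file):
#     import pdftotext
#     with open("Target.pdf", "rb") as f:
#         pdf = pdftotext.PDF(f)
#     print(pages_without_text(pdf))
```

Pages reported by this check would need a different approach, such as OCR, to yield any text.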
