Convert PDF Pages To Text With Python

TheJoker September 8, 2019, 9:42pm 1

A simple guide to extracting text from a PDF. This is an extension of the “Convert PDF Pages To JPEG With Python” post.

Objectives:

Extract text from PDF

Required Tools:

Poppler for Windows: Poppler is a PDF rendering library. It includes the pdftoppm utility.
Poppler for Mac: if Homebrew is already installed, run brew install poppler
pdftotext: a Python module built on Poppler that converts PDF to text.

Steps

- Install Poppler. On Windows, add “xxx/bin/” to the PATH environment variable.
- Run pip install pdftotext
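
To confirm the first step worked, a quick check like the sketch below can help. It only looks for Poppler's command-line tools on PATH using the standard library; the function name find_poppler_tools is made up for this illustration, and finding the tools does not by itself guarantee that the pdftotext Python module will install cleanly.

```python
import shutil


def find_poppler_tools(names=("pdftotext", "pdftoppm")):
    """Map each Poppler command-line tool name to its full path.

    The value is None when the tool is not visible on PATH.
    (Hypothetical helper, not part of Poppler or the pdftotext module.)
    """
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    for tool, path in find_poppler_tools().items():
        print(f"{tool}: {path or 'not found on PATH'}")
```

If a tool shows up as not found on Windows, the bin folder is probably missing from PATH, or the terminal was opened before PATH was changed.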

Usage (sample code from the pdftotext GitHub repository)

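import pdftotext

# Load your PDF (open in binary mode)
with open("Target.pdf", "rb") as pdf_file:
    pdf = pdftotext.PDF(pdf_file)

# Save all text to a txt file, with a blank line between pages.
# Writing as UTF-8 avoids encoding errors on systems with a different default.
with open("output.txt", "w", encoding="utf-8") as out_file:
    out_file.write("\n\n".join(pdf))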
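
The sample above joins every page into one file. If one text file per page is more useful, something like the sketch below might work. It relies on the pdftotext.PDF object being iterable page by page, as in the sample; the names save_pages and convert, and the page_001.txt naming scheme, are invented for this example.

```python
from pathlib import Path


def save_pages(pages, out_dir, prefix="page"):
    """Write each page's text to its own UTF-8 file and return the paths written.

    `pages` can be any iterable of strings, such as a pdftotext.PDF object.
    (Hypothetical helper, not part of the pdftotext module.)
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for number, text in enumerate(pages, start=1):
        path = out / f"{prefix}_{number:03d}.txt"
        path.write_text(text, encoding="utf-8")
        written.append(path)
    return written


def convert(pdf_path, out_dir):
    """Load a PDF with pdftotext and save one text file per page."""
    # Imported here so save_pages can be used without pdftotext installed.
    import pdftotext

    with open(pdf_path, "rb") as f:
        pdf = pdftotext.PDF(f)
    return save_pages(pdf, out_dir)
```

Calling convert("Target.pdf", "pages") would then create the pages folder and fill it with one file per page.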

Further notes

https://github.com/jalan/pdftotext
