Convert PDF to TXT File Using Python

Last Updated : 12 Apr, 2025

Suppose we have a PDF file and want to extract its text into a plain .txt file so that the content can be read, edited, or processed later. For example, a PDF containing articles or reports can be converted to plain text with a few lines of Python. In this article, we use a sample file.pdf to explore three libraries that can do this: pdfplumber, PyPDF2 and fitz (PyMuPDF).

Using pdfplumber

pdfplumber is a Python library that provides advanced capabilities for extracting text, tables and metadata from PDF files. It is especially useful when working with PDFs that have a complex layout or contain structured data like tables.

Python

import pdfplumber

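# Open the PDF and the output file together; both are closed automatically.
with pdfplumber.open("file.pdf") as pdf, open("output.txt", "w", encoding="utf-8") as f:
    for page in pdf.pages:
        t = page.extract_text()
        # Skip pages that yield no text (None or an empty string).
        if t:
            f.write(t + '\n')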

Explanation: This code uses pdfplumber to open "file.pdf" and "output.txt" simultaneously. It iterates through each page of the PDF using pdf.pages, extracts text with extract_text(), and if text exists, writes it to the output file followed by a newline to separate the content of each page.
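
Since a single newline between pages makes page boundaries hard to spot in the output, one option is to label each page explicitly. The following is a minimal sketch, not part of the original example: format_pages and pdf_to_txt are hypothetical helper names, and the "--- Page N ---" marker is an arbitrary choice. pdfplumber is imported inside pdf_to_txt so that the formatting helper can be used on its own.

```python
from typing import Iterable, Optional


def format_pages(texts: Iterable[Optional[str]]) -> str:
    """Join per-page text, labelling each page and skipping empty ones."""
    parts = []
    for number, text in enumerate(texts, start=1):
        # extract_text() may return None or "" for pages without text.
        if text and text.strip():
            parts.append(f"--- Page {number} ---\n{text.strip()}\n")
    return "\n".join(parts)


def pdf_to_txt(pdf_path: str, txt_path: str) -> None:
    """Extract text with pdfplumber and write it with page markers."""
    import pdfplumber  # imported here so format_pages has no dependency

    with pdfplumber.open(pdf_path) as pdf:
        texts = [page.extract_text() for page in pdf.pages]
    with open(txt_path, "w", encoding="utf-8") as f:
        f.write(format_pages(texts))
```

Calling pdf_to_txt("file.pdf", "output.txt") would then write the same text as before, with a marker line ahead of every non-empty page. Page numbers in the markers follow the PDF, so a skipped page shows up as a gap in the numbering.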

Using PyPDF2

PyPDF2 is a pure-Python library for reading and writing PDF files. It is widely used for basic PDF manipulation, including text extraction, merging, splitting, and rotating pages. However, it may not handle complex layouts or structured data as precisely as pdfplumber. Note that PyPDF2 is no longer developed under that name: the project continues as pypdf, which offers the same PdfReader interface used here.

Python

from PyPDF2 import PdfReader

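# Load the PDF, then open the output file as UTF-8 text.
reader = PdfReader("file.pdf")
with open("output.txt", "w", encoding="utf-8") as f:
    for page in reader.pages:
        t = page.extract_text()
        # Write only pages that produced some text.
        if t:
            f.write(t + '\n')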

Explanation: This code creates a PdfReader object to read "file.pdf", opens "output.txt" in write mode with UTF-8 encoding, and loops through each page to extract text using extract_text(). If text is found, it writes it to the output file with a newline for separation.
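
The "if text is found" check matters in practice: pages that consist only of scanned images usually yield no text at all, and the code above skips them silently. The sketch below reports which pages came back empty so they can be handled separately, for example with OCR. It is an illustration under assumptions, not the article's method: load_pdf_reader, empty_page_numbers and convert are hypothetical names, and the import fallback assumes that either pypdf or PyPDF2 is installed.

```python
from typing import List, Optional


def load_pdf_reader():
    """Return a PdfReader class, preferring pypdf over the older PyPDF2."""
    try:
        from pypdf import PdfReader
    except ImportError:
        from PyPDF2 import PdfReader
    return PdfReader


def empty_page_numbers(texts: List[Optional[str]]) -> List[int]:
    """Return the 1-based numbers of pages whose extracted text is empty."""
    return [n for n, t in enumerate(texts, start=1) if not (t and t.strip())]


def convert(pdf_path: str, txt_path: str) -> List[int]:
    """Write the extracted text and return the pages that had none."""
    PdfReader = load_pdf_reader()
    reader = PdfReader(pdf_path)
    texts = [page.extract_text() for page in reader.pages]
    with open(txt_path, "w", encoding="utf-8") as f:
        for t in texts:
            if t:
                f.write(t + "\n")
    return empty_page_numbers(texts)
```

A caller could use the return value as a simple check, for instance printing a warning when convert("file.pdf", "output.txt") returns a non-empty list.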

Using fitz

fitz is the import name of the PyMuPDF library, which allows high-performance PDF and eBook manipulation (newer PyMuPDF releases also accept import pymupdf). It is known for its speed and accuracy in text extraction, especially for PDFs that have a complex graphical layout or embedded fonts.

Python

import fitz  # PyMuPDF

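# Open the PDF as a context manager so the document is closed when done.
with fitz.open("file.pdf") as doc, open("output.txt", "w", encoding="utf-8") as f:
    for page in doc:
        # get_text() returns the page's plain text (an empty string if none).
        f.write(page.get_text() + '\n')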

Explanation: This code uses fitz (PyMuPDF) to open "file.pdf" and reads it page by page. For each page, it extracts the text using get_text() and writes it to "output.txt", adding a newline after each page’s content to keep them separated.
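
For pages with a complex layout, get_text() can also return the text as positioned blocks by passing "blocks" as its argument, which allows the reading order to be controlled. The sketch below sorts blocks top to bottom and then left to right before joining them. It assumes the tuple layout (x0, y0, x1, y1, text, block_no, block_type) with block_type 0 marking text blocks; blocks_to_text and pdf_to_txt are hypothetical helper names.

```python
from typing import List, Tuple

# Assumed layout of one item from page.get_text("blocks"):
# (x0, y0, x1, y1, text, block_no, block_type), with block_type 0 for text.
Block = Tuple[float, float, float, float, str, int, int]


def blocks_to_text(blocks: List[Block]) -> str:
    """Order text blocks top-to-bottom, then left-to-right, and join them."""
    text_blocks = [b for b in blocks if b[6] == 0]  # drop image blocks
    text_blocks.sort(key=lambda b: (b[1], b[0]))    # sort by y0, then x0
    return "\n".join(b[4].strip() for b in text_blocks)


def pdf_to_txt(pdf_path: str, txt_path: str) -> None:
    """Write each page's blocks to the output file in sorted order."""
    import fitz  # PyMuPDF; imported lazily so blocks_to_text has no dependency

    with fitz.open(pdf_path) as doc, open(txt_path, "w", encoding="utf-8") as f:
        for page in doc:
            f.write(blocks_to_text(page.get_text("blocks")) + "\n")
```

This simple sort suits single-column pages. A multi-column page would need the blocks grouped by column first, because sorting by vertical position alone interleaves the columns.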