Welcome to pdfminer.six’s documentation!

We fathom PDF.

Pdfminer.six is a Python package for extracting information from PDF documents.

Check out the source on GitHub.

Content

This documentation is organized into four sections (according to the Diátaxis documentation framework). The Tutorials section helps you set up and use pdfminer.six for the first time; read it if you are new to pdfminer.six. The How-to guides offer specific recipes for solving common problems. Take a look at the Topics if you want more background information on how pdfminer.six works internally. The API Reference provides detailed API documentation for all the common classes and functions in pdfminer.six.

Features

  • Parse all objects from a PDF document into Python objects.

  • Analyze and group text in a human-readable way.

  • Extract text, images (JPG, JBIG2 and Bitmaps), table-of-contents, tagged contents and more.

  • Support for (almost all) features from the PDF-1.7 specification.

  • Support for Chinese, Japanese and Korean (CJK) languages as well as vertical writing.

  • Support for various font types (Type1, TrueType, Type3, and CID).

  • Support for RC4 and AES encryption.

  • Support for AcroForm interactive form extraction.

Installation instructions

  • Install Python 3.8 or newer.

  • Install pdfminer.six.

$ pip install pdfminer.six

  • (Optionally) install extra dependencies for extracting images.

$ pip install 'pdfminer.six[image]'

  • Use the command-line interface to extract text from a PDF.

$ pdf2txt.py example.pdf

  • Or use it with Python.

from pdfminer.high_level import extract_text

text = extract_text("example.pdf")
print(text)
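
If you need the grouped text blocks rather than one long string, the following is a minimal sketch built on `extract_pages` from `pdfminer.high_level`. The helper names `iter_text_blocks` and `text_blocks_from_pdf` are hypothetical and not part of pdfminer.six. The sketch also assumes that checking for a `get_text()` method is enough to tell text containers apart from other layout elements. Checking `isinstance(element, LTTextContainer)` is the stricter alternative.

```python
from typing import Iterable, Iterator, List


def iter_text_blocks(pages: Iterable[Iterable[object]]) -> Iterator[str]:
    """Yield the text of every layout element that exposes get_text().

    Hypothetical helper: `pages` is any iterable of page layouts, such as
    the generator returned by pdfminer.high_level.extract_pages.
    """
    for page_layout in pages:
        for element in page_layout:
            # Assumption: text containers expose get_text(); elements
            # without that method (for example lines or figures) are skipped.
            get_text = getattr(element, "get_text", None)
            if callable(get_text):
                yield get_text()


def text_blocks_from_pdf(path: str) -> List[str]:
    """Run layout analysis on a PDF and return its text blocks in page order."""
    # Imported lazily so iter_text_blocks stays usable on its own.
    from pdfminer.high_level import extract_pages

    return list(iter_text_blocks(extract_pages(path)))


# Usage (requires pdfminer.six and an existing file):
#     for block in text_blocks_from_pdf("example.pdf"):
#         print(block.strip())
```

Keeping the iteration separate from the pdfminer.six call makes the filtering logic easy to try on stand-in objects before pointing it at a real document.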

Contributing

We welcome contributions to pdfminer.six! Before you start, take a look at the contribution guide.