Welcome to pdfminer.six’s documentation!
We fathom PDF.
Pdfminer.six is a Python package for extracting information from PDF documents.
Check out the source on GitHub.
Content
This documentation is organized into four sections (according to the Diátaxis
documentation framework). The
Tutorials section helps you set up and use pdfminer.six for the first
time. Read this section if you are new to pdfminer.six.
The How-to guides offer specific recipes for solving common problems.
Take a look at the Topics if you want more background information on
how pdfminer.six works internally. The API Reference provides
detailed API documentation for all the common classes and functions in
pdfminer.six.
Features
Parse all objects from a PDF document into Python objects.
Analyze and group text in a human-readable way.
Extract text, images (JPG, JBIG2 and Bitmaps), table-of-contents, tagged
contents and more.
Support for (almost all) features from the PDF-1.7 specification.
Support for Chinese, Japanese and Korean (CJK) languages as well as vertical writing.
Support for various font types (Type1, TrueType, Type3, and CID).
Support for RC4 and AES encryption.
Support for AcroForm interactive form extraction.
Installation instructions
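- Install pdfminer.six:
    $ pip install pdfminer.six
- Optionally, install the extra dependencies for extracting images:
    $ pip install 'pdfminer.six[image]'
- Extract text from a PDF with the command-line interface:
    $ pdf2txt.py example.pdf
- Or extract text from Python:
    from pdfminer.high_level import extract_text
    text = extract_text("example.pdf")
    print(text)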
Contributing
We welcome contributors to pdfminer.six! Before you start, take
a look at the contribution guide.