Extract elements from a PDF using Python

The high level functions can be used to achieve common tasks. In this case, we can use extract_pages:

from pdfminer.high_level import extract_pages
for page_layout in extract_pages("test.pdf"):
    for element in page_layout:
        print(element)

Each element will be an LTTextBox, LTFigure, LTLine, LTRect or an LTImage. Some of these can be iterated further, for example iterating though an LTTextBox will give you an LTTextLine, and these in turn can be iterated through to get an LTChar. See the diagram here: Layout analysis algorithm.

Let’s say we want to extract all of the text. We could do:

from pdfminer.high_level import extract_pages
from pdfminer.layout import LTTextContainer
for page_layout in extract_pages("test.pdf"):
    for element in page_layout:
        if isinstance(element, LTTextContainer):
            print(element.get_text())

Or, we could extract the fontname or size of each individual character:

from pdfminer.high_level import extract_pages
from pdfminer.layout import LTTextContainer, LTChar
for page_layout in extract_pages("test.pdf"):
    for element in page_layout:
        if isinstance(element, LTTextContainer):
            for text_line in element:
                for character in text_line:
                    if isinstance(character, LTChar):
                        print(character.fontname)
                        print(character.size)