Extract elements from a PDF using Python
The high-level functions cover common tasks. To get the layout elements of each page, we can use extract_pages:
from pdfminer.high_level import extract_pages
for page_layout in extract_pages("test.pdf"):
    for element in page_layout:
        print(element)
Each element will be an LTTextBox, LTFigure, LTLine, LTRect or an LTImage. Some of these can be iterated further: for example, iterating through an LTTextBox will give you an LTTextLine, and these in turn can be iterated through to get an LTChar. See the diagram under "Layout analysis algorithm".
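The containment hierarchy described above can be explored with a small recursive walk. The following is a minimal sketch, not part of pdfminer.six: count_element_types and print_layout_summary are hypothetical helper names, and the walk assumes that container objects are iterable while leaf objects such as LTChar are not.

```python
from collections import Counter
from collections.abc import Iterable


def count_element_types(layout_obj, counts=None):
    """Recursively tally the class names of layout_obj and everything inside it."""
    if counts is None:
        counts = Counter()
    counts[type(layout_obj).__name__] += 1
    # Assumption: containers (pages, text boxes, text lines, figures) are
    # iterable, while leaf objects such as characters, lines and rects are not.
    if isinstance(layout_obj, Iterable):
        for child in layout_obj:
            count_element_types(child, counts)
    return counts


def print_layout_summary(pdf_path):
    """Print one tally of layout class names per page of the given PDF."""
    # Imported here so count_element_types can be used without pdfminer.six.
    from pdfminer.high_level import extract_pages

    for page_layout in extract_pages(pdf_path):
        print(page_layout.pageid, dict(count_element_types(page_layout)))
```

Calling print_layout_summary("test.pdf") prints, for each page, how many objects of each layout class it contains. The concrete class names (for example LTTextBoxHorizontal rather than LTTextBox) depend on the document.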
Let’s say we want to extract all of the text. We could do:
from pdfminer.high_level import extract_pages
from pdfminer.layout import LTTextContainer
for page_layout in extract_pages("test.pdf"):
    for element in page_layout:
        if isinstance(element, LTTextContainer):
            print(element.get_text())
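The string returned by get_text() keeps the line breaks of the layout, so it is often post-processed before use. Here is a minimal sketch of one way to do that. The helpers clean_lines and page_lines are hypothetical, not pdfminer.six functions, and pageid is assumed to be the attribute that identifies the page on the page layout object.

```python
def clean_lines(raw_text):
    """Split text into lines, dropping surrounding whitespace and empty lines."""
    return [line.strip() for line in raw_text.splitlines() if line.strip()]


def page_lines(pdf_path):
    """Yield (page id, list of cleaned text lines) for each page of the PDF."""
    # Imported here so clean_lines can be used without pdfminer.six installed.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTTextContainer

    for page_layout in extract_pages(pdf_path):
        lines = []
        for element in page_layout:
            if isinstance(element, LTTextContainer):
                lines.extend(clean_lines(element.get_text()))
        yield page_layout.pageid, lines
```

Iterating over page_lines("test.pdf") then gives one list of non-empty lines per page, which may be easier to search or store than the raw printed output.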
Or, we could extract the font name or size of each individual character:
from pdfminer.high_level import extract_pages
from pdfminer.layout import LTTextContainer, LTChar
for page_layout in extract_pages("test.pdf"):
    for element in page_layout:
        if isinstance(element, LTTextContainer):
            for text_line in element:
                for character in text_line:
                    if isinstance(character, LTChar):
                        print(character.fontname)
                        print(character.size)
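Printing the font of every character produces a lot of output, so a tally per font is often more useful. The sketch below is one possible way to build it. The helpers summarize_fonts and iter_char_fonts are hypothetical names, and rounding sizes to one decimal place is an assumption made here so that nearly identical sizes are grouped together.

```python
from collections import Counter


def summarize_fonts(font_sizes, ndigits=1):
    """Count characters per (font name, rounded size) pair."""
    return Counter((name, round(size, ndigits)) for name, size in font_sizes)


def iter_char_fonts(pdf_path):
    """Yield (fontname, size) for every LTChar found in the PDF's text boxes."""
    # Imported here so summarize_fonts can be used without pdfminer.six.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTChar, LTTextContainer

    for page_layout in extract_pages(pdf_path):
        for element in page_layout:
            if isinstance(element, LTTextContainer):
                for text_line in element:
                    for character in text_line:
                        # Text lines also hold non-LTChar items; skip those.
                        if isinstance(character, LTChar):
                            yield character.fontname, character.size
```

For example, summarize_fonts(iter_char_fonts("test.pdf")).most_common(3) returns the three most frequent font and size combinations in the document.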