High-level functions API

extract_text

pdfminer.high_level.extract_text(pdf_file: PurePath | str | IOBase, password: str = '', page_numbers: Container[int] | None = None, maxpages: int = 0, caching: bool = True, codec: str = 'utf-8', laparams: LAParams | None = None) → str

Parse and return the text contained in a PDF file.

Parameters:

pdf_file – Either a file path or a file-like object for the PDF file to be worked on.
password – For encrypted PDFs, the password to decrypt.
page_numbers – List of zero-indexed page numbers to extract.
maxpages – The maximum number of pages to parse. The default of 0 means no limit.
caching – Whether resources should be cached.
codec – Text decoding codec.
laparams – An LAParams object from pdfminer.layout. If None, uses some default settings that often work well.

Returns:

A string containing all of the extracted text.

extract_text_to_fp

pdfminer.high_level.extract_text_to_fp(inf: BinaryIO, outfp: TextIO | BinaryIO, output_type: str = 'text', codec: str = 'utf-8', laparams: LAParams | None = None, maxpages: int = 0, page_numbers: Container[int] | None = None, password: str = '', scale: float = 1.0, rotation: int = 0, layoutmode: str = 'normal', output_dir: str | None = None, strip_control: bool = False, debug: bool = False, disable_caching: bool = False, **kwargs: Any) → None

Parse text from the file-like object inf and write it to the file-like object outfp.

Takes many optional arguments, but the defaults are reasonable. Beware of laparams: passing an empty LAParams is not the same as passing None!

Parameters:

inf – a file-like object to read the PDF structure from, such as a file object returned by the builtin open() function in binary mode, or a BytesIO.
outfp – a file-like object to write the text to.
output_type – May be ‘text’, ‘xml’, ‘html’, ‘hocr’, ‘tag’. Only ‘text’ works properly.
codec – Text decoding codec.
laparams – An LAParams object from pdfminer.layout. The default is None, which may not lay out the text correctly.
maxpages – The maximum number of pages to parse; parsing stops after this many pages.
page_numbers – zero-indexed page numbers to operate on.
password – For encrypted PDFs, the password to decrypt.
scale – Scale factor
rotation – Rotation factor
layoutmode – Default is ‘normal’, see pdfminer.converter.HTMLConverter
output_dir – If given, creates an ImageWriter for extracted images.
strip_control – Whether to strip control characters from the output.
debug – Output more logging data.
disable_caching – Whether to disable caching of resources.
other – Any additional keyword arguments (**kwargs).

Returns:

Nothing; the function reads from inf and writes to outfp. Pass a StringIO as outfp to get the text as a string.

extract_pages

pdfminer.high_level.extract_pages(pdf_file: PurePath | str | IOBase, password: str = '', page_numbers: Container[int] | None = None, maxpages: int = 0, caching: bool = True, laparams: LAParams | None = None) → Iterator[LTPage]

Extract and yield LTPage objects.

Parameters:

pdf_file – Either a file path or a file-like object for the PDF file to be worked on.
password – For encrypted PDFs, the password to decrypt.
page_numbers – List of zero-indexed page numbers to extract.
maxpages – The maximum number of pages to parse. The default of 0 means no limit.
caching – Whether resources should be cached.
laparams – An LAParams object from pdfminer.layout. If None, uses some default settings that often work well.

Returns:

An iterator of LTPage objects, one per parsed page.
