High-level functions API

extract_text

pdfminer.high_level.extract_text(pdf_file: PurePath | str | IOBase, password: str = '', page_numbers: Container[int] | None = None, maxpages: int = 0, caching: bool = True, codec: str = 'utf-8', laparams: LAParams | None = None) → str

Parse and return the text contained in a PDF file.

Parameters:
  • pdf_file – Either a file path or a file-like object for the PDF file to be worked on.

  • password – For encrypted PDFs, the password to decrypt.

  • page_numbers – List of zero-indexed page numbers to extract.

  • maxpages – The maximum number of pages to parse. The default, 0, means no limit.

  • caching – Whether resources should be cached.

  • codec – Text decoding codec.

  • laparams – An LAParams object from pdfminer.layout. If None, uses some default settings that often work well.

Returns:

a string containing all of the text extracted.

extract_text_to_fp

pdfminer.high_level.extract_text_to_fp(inf: BinaryIO, outfp: TextIO | BinaryIO, output_type: str = 'text', codec: str = 'utf-8', laparams: LAParams | None = None, maxpages: int = 0, page_numbers: Container[int] | None = None, password: str = '', scale: float = 1.0, rotation: int = 0, layoutmode: str = 'normal', output_dir: str | None = None, strip_control: bool = False, debug: bool = False, disable_caching: bool = False, **kwargs: Any) → None

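Parse text from the file-like object inf and write it to the file-like object outfp.

Takes many optional arguments, but the defaults are reasonable. Beware of laparams: passing an empty LAParams object is not the same as passing None! With None, no layout analysis is performed, whereas LAParams() runs layout analysis with its default settings.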

Parameters:
  • inf – a file-like object to read PDF structure from, such as a file handler (using the builtin open() function) or a BytesIO.

  • outfp – a file-like object to write the text to.

  • output_type – May be ‘text’, ‘xml’, ‘html’, ‘hocr’, ‘tag’. Only ‘text’ works properly.

  • codec – Text decoding codec.

  • laparams – An LAParams object from pdfminer.layout. Defaults to None, in which case the text may not be laid out correctly.

  • maxpages – The maximum number of pages to parse.

  • page_numbers – Zero-indexed page numbers to operate on.

  • password – For encrypted PDFs, the password to decrypt.

  • scale – Scale factor.

  • rotation – Rotation, in degrees, added to each page's own rotation.

  • layoutmode – Default is ‘normal’, see pdfminer.converter.HTMLConverter

  • output_dir – If given, creates an ImageWriter for extracted images.

  • strip_control – Strip control characters from the extracted text.

  • debug – Output more logging data.

  • disable_caching – Disable caching of resources.

  • other – Any additional keyword arguments (**kwargs).

Returns:

nothing; the function reads from one stream and writes to the other. Pass an io.StringIO as outfp to obtain the result as a string.

extract_pages

pdfminer.high_level.extract_pages(pdf_file: PurePath | str | IOBase, password: str = '', page_numbers: Container[int] | None = None, maxpages: int = 0, caching: bool = True, laparams: LAParams | None = None) → Iterator[LTPage]

Extract and yield LTPage objects.

Parameters:
  • pdf_file – Either a file path or a file-like object for the PDF file to be worked on.

  • password – For encrypted PDFs, the password to decrypt.

  • page_numbers – List of zero-indexed page numbers to extract.

  • maxpages – The maximum number of pages to parse. The default, 0, means no limit.

  • caching – Whether resources should be cached.

  • laparams – An LAParams object from pdfminer.layout. If None, uses some default settings that often work well.

Returns:

an iterator of LTPage objects, one per extracted page.