High-level functions API

extract_text

pdfminer.high_level.extract_text(pdf_file: Union[pathlib.PurePath, str, io.IOBase], password: str = '', page_numbers: Optional[Container[int]] = None, maxpages: int = 0, caching: bool = True, codec: str = 'utf-8', laparams: Optional[pdfminer.layout.LAParams] = None) → str

Parse and return the text contained in a PDF file.

Parameters:
  • pdf_file – Either a file path or a file-like object for the PDF file to be worked on.
  • password – For encrypted PDFs, the password to decrypt.
  • page_numbers – List of zero-indexed page numbers to extract.
  • maxpages – The maximum number of pages to parse. The default of 0 means no limit.
  • caching – Whether resources should be cached.
  • codec – Text decoding codec.
  • laparams – An LAParams object from pdfminer.layout. If None, uses some default settings that often work well.
Returns:

a string containing all of the text extracted.

extract_text_to_fp

pdfminer.high_level.extract_text_to_fp(inf: BinaryIO, outfp: Union[TextIO, BinaryIO], output_type: str = 'text', codec: str = 'utf-8', laparams: Optional[pdfminer.layout.LAParams] = None, maxpages: int = 0, page_numbers: Optional[Container[int]] = None, password: str = '', scale: float = 1.0, rotation: int = 0, layoutmode: str = 'normal', output_dir: Optional[str] = None, strip_control: bool = False, debug: bool = False, disable_caching: bool = False, **kwargs) → None

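Parse the text of the PDF read from inf and write it to the file-like object outfp.

Takes many optional arguments, but the defaults are reasonable. Beware of laparams: passing an empty LAParams is not the same as passing None! An LAParams object enables layout analysis with its default settings, whereas None skips layout analysis.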

Parameters:
  • inf – A binary file-like object to read the PDF structure from, such as a file handle (from the builtin open() function, opened in binary mode) or a BytesIO.
  • outfp – a file-like object to write the text to.
  • output_type – May be ‘text’, ‘xml’, ‘html’, ‘hocr’, ‘tag’. Only ‘text’ works properly.
  • codec – Text decoding codec.
  • laparams – An LAParams object from pdfminer.layout. Default is None, in which case the text may not be laid out correctly.
  • maxpages – Stop parsing after this many pages.
  • page_numbers – Zero-indexed page numbers to operate on.
  • password – For encrypted PDFs, the password to decrypt.
  • scale – Scale factor.
  • rotation – Rotation, in degrees, added to each page's own rotation.
  • layoutmode – Default is ‘normal’, see pdfminer.converter.HTMLConverter
  • output_dir – If given, creates an ImageWriter for extracted images.
  • strip_control – If True, strip control characters from ‘xml’ output.
  • debug – Output more logging data.
  • disable_caching – If True, disable caching of resources.
  • other – Any additional keyword arguments (**kwargs).
Returns:

Nothing; the function reads from one stream and writes to the other. Pass an io.StringIO as outfp to collect the output as a string.

extract_pages

pdfminer.high_level.extract_pages(pdf_file: Union[pathlib.PurePath, str, io.IOBase], password: str = '', page_numbers: Optional[Container[int]] = None, maxpages: int = 0, caching: bool = True, laparams: Optional[pdfminer.layout.LAParams] = None) → Iterator[pdfminer.layout.LTPage]

Extract and yield LTPage objects.

Parameters:
  • pdf_file – Either a file path or a file-like object for the PDF file to be worked on.
  • password – For encrypted PDFs, the password to decrypt.
  • page_numbers – List of zero-indexed page numbers to extract.
  • maxpages – The maximum number of pages to parse. The default of 0 means no limit.
  • caching – Whether resources should be cached.
  • laparams – An LAParams object from pdfminer.layout. If None, uses some default settings that often work well.
Returns: