High-level functions API
extract_text
pdfminer.high_level.extract_text(pdf_file: Union[pathlib.PurePath, str, io.IOBase], password: str = '', page_numbers: Optional[Container[int]] = None, maxpages: int = 0, caching: bool = True, codec: str = 'utf-8', laparams: Optional[pdfminer.layout.LAParams] = None) → str
Parse and return the text contained in a PDF file.
Parameters: - pdf_file – Either a file path or a file-like object for the PDF file to be worked on.
- password – For encrypted PDFs, the password to decrypt.
- page_numbers – List of zero-indexed page numbers to extract.
- maxpages – The maximum number of pages to parse. The default, 0, means no limit.
- caching – Whether resources should be cached.
- codec – Text decoding codec.
- laparams – An LAParams object from pdfminer.layout. If None, uses some default settings that often work well.
Returns: a string containing all of the text extracted.
extract_text_to_fp
pdfminer.high_level.extract_text_to_fp(inf: BinaryIO, outfp: Union[TextIO, BinaryIO], output_type: str = 'text', codec: str = 'utf-8', laparams: Optional[pdfminer.layout.LAParams] = None, maxpages: int = 0, page_numbers: Optional[Container[int]] = None, password: str = '', scale: float = 1.0, rotation: int = 0, layoutmode: str = 'normal', output_dir: Optional[str] = None, strip_control: bool = False, debug: bool = False, disable_caching: bool = False, **kwargs) → None
Parses text from the inf file-like object and writes it to the outfp file-like object.
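Most of the arguments are optional, and the defaults are reasonable. Beware of laparams: passing an empty LAParams() is not the same as passing None. An empty LAParams() enables layout analysis with its default settings, whereas None skips layout analysis.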
Parameters: - inf – a file-like object to read PDF structure from, such as a file handler (using the builtin open() function) or a BytesIO.
- outfp – a file-like object to write the text to.
- output_type – May be ‘text’, ‘xml’, ‘html’, ‘hocr’, ‘tag’. Only ‘text’ works properly.
- codec – Text decoding codec
- laparams – An LAParams object from pdfminer.layout. Default is None but may not layout correctly.
- maxpages – The maximum number of pages to parse. The default, 0, means no limit.
- page_numbers – zero-indexed page numbers to operate on.
- password – For encrypted PDFs, the password to decrypt.
- scale – Scale factor (zoom) applied when generating HTML output.
- rotation – Rotation, in degrees, added to each page before processing.
- layoutmode – Default is ‘normal’, see pdfminer.converter.HTMLConverter
- output_dir – If given, creates an ImageWriter for extracted images.
- strip_control – Whether to strip control characters from the extracted text.
- debug – Output more logging data.
- disable_caching – Whether to disable caching of resources.
- other –
Returns: nothing; the function reads from inf and writes to outfp. Pass an io.StringIO as outfp to get the result as a string.
extract_pages
pdfminer.high_level.extract_pages(pdf_file: Union[pathlib.PurePath, str, io.IOBase], password: str = '', page_numbers: Optional[Container[int]] = None, maxpages: int = 0, caching: bool = True, laparams: Optional[pdfminer.layout.LAParams] = None) → Iterator[pdfminer.layout.LTPage]
Extract and yield LTPage objects.
Parameters: - pdf_file – Either a file path or a file-like object for the PDF file to be worked on.
- password – For encrypted PDFs, the password to decrypt.
- page_numbers – List of zero-indexed page numbers to extract.
- maxpages – The maximum number of pages to parse. The default, 0, means no limit.
- caching – Whether resources should be cached.
- laparams – An LAParams object from pdfminer.layout. If None, uses some default settings that often work well.
Returns: an iterator over LTPage objects, one per extracted page.