Search for text strings inside a PDF

pdfsearch

pdfsearch finds text strings inside a PDF. It relies on the pdftotext command-line tool from the Poppler library.

Installation

You must install Poppler and make sure its executables (including pdftotext) are on your search path.

Windows: http://blog.alivate.com.au/poppler-windows

Linux: sudo apt install poppler-utils

Mac: brew install poppler
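
To confirm that Poppler is reachable before going further, a quick check along these lines may help. It is only a sketch using the standard library; the helper name find_pdftotext is made up for this example and is not part of pdfsearch.

```python
import shutil


def find_pdftotext():
    """Return the full path to the pdftotext executable, or None if it is not on PATH."""
    # Hypothetical helper for illustration; pdfsearch does not provide it.
    return shutil.which("pdftotext")


path = find_pdftotext()
if path is None:
    print("pdftotext not found: install Poppler and add it to your search path")
else:
    print(f"pdftotext found at {path}")
```

If the first message is printed, fix the Poppler installation before continuing with the remaining steps.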

Then install the tqdm dependency:

pip install tqdm

Then install pdfsearch itself:

git clone http://git.wrl.unsw.edu.au:3000/danh/pdfsearch.git
cd pdfsearch
python setup.py install

Or put a copy of pdfsearch.py in your working directory.

Usage

>>> from pdfsearch import search_pdf

>>> pdf_name = 'animal-farm.pdf'
>>> search_patterns = ['horse', 'pig']
>>> context_length = 2
>>> results = search_pdf(pdf_name, search_patterns, context_length)

>>> import pandas as pd
>>> pd.DataFrame(results)

document         page   pattern  word          context
--------------------------------------------------------------------------
animal-farm.pdf     1   horse    cart-horses,  The two cart-horses, Boxer
animal-farm.pdf     1   horse    horses        two ordinary horses put together.
animal-farm.pdf     1   horse    horses        the horses came
animal-farm.pdf     1   pig      pig,          a majestic-looking pig, with a
animal-farm.pdf     1   pig      pigs,         the pigs, who
animal-farm.pdf     1   pig      pigeons       window-sills, the pigeons fluttered up
animal-farm.pdf     1   pig      pigs          the pigs and began
animal-farm.pdf     2   horse    horses        The two horses had just
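
For readers curious how a search like this can be built on pdftotext, here is a minimal sketch. It is not the actual pdfsearch implementation: the function names pdf_pages and find_matches are invented for this example, matching is simplified to always ignore case, and the context window may differ from what search_pdf returns. It relies on two pdftotext behaviours: passing '-' as the output file writes the text to stdout, and pages are separated by form feed characters.

```python
import re
import subprocess


def pdf_pages(pdf_name):
    """Return the text of each page of a PDF (requires pdftotext on PATH)."""
    # '-' sends the extracted text to stdout instead of a file.
    completed = subprocess.run(
        ["pdftotext", pdf_name, "-"],
        capture_output=True,
        check=True,
    )
    text = completed.stdout.decode("utf-8", errors="replace")
    # pdftotext ends every page with a form feed, so the last item is
    # usually an empty string; it contains no words and is harmless.
    return text.split("\f")


def find_matches(pages, patterns, context_length, document=""):
    """Search a list of page strings and return one dict per matching word."""
    results = []
    for page_number, text in enumerate(pages, start=1):
        words = text.split()
        for pattern in patterns:
            # Simplification: always case-insensitive in this sketch.
            regex = re.compile(pattern, re.IGNORECASE)
            for i, word in enumerate(words):
                if regex.search(word):
                    # Keep context_length words on each side of the match.
                    start = max(0, i - context_length)
                    stop = i + context_length + 1
                    results.append({
                        "document": document,
                        "page": page_number,
                        "pattern": pattern,
                        "word": word,
                        "context": " ".join(words[start:stop]),
                    })
    return results


# Demonstration on in-memory text, so no PDF is needed:
demo_pages = ["The two cart-horses, Boxer and Clover", "the pigeons fluttered up"]
for row in find_matches(demo_pages, ["horse", "pig"], 2, "demo.pdf"):
    print(row["page"], row["pattern"], row["word"], "|", row["context"])
```

With a real file, the two pieces would be combined as find_matches(pdf_pages('animal-farm.pdf'), ['horse', 'pig'], 2, 'animal-farm.pdf'). Because each pattern is searched inside whole words, 'pig' also matches 'pigeons', as in the table above.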

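Search patterns should be lowercase unless case matters: a lowercase pattern matches text in any case, while a pattern containing uppercase letters matches only text with exactly that case. For example: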

'ph' will match:

  • 'photograph'
  • 'PHANTOM'
  • 'pH'

'pH' will match:

  • 'pH'
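
The matching rule illustrated above can be reproduced with Python's re module. The sketch below is an assumption about one way to get this behaviour, not a copy of the pdfsearch source; compile_pattern is a hypothetical helper name.

```python
import re


def compile_pattern(pattern):
    """Compile a pattern that ignores case only if it is written in lowercase."""
    # Assumed rule: any uppercase letter in the pattern makes it case-sensitive.
    flags = 0 if any(c.isupper() for c in pattern) else re.IGNORECASE
    return re.compile(pattern, flags)


words = ["photograph", "PHANTOM", "pH"]
for pattern in ["ph", "pH"]:
    regex = compile_pattern(pattern)
    print(pattern, [w for w in words if regex.search(w)])
# ph ['photograph', 'PHANTOM', 'pH']
# pH ['pH']
```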