Search for text strings inside a pdf

pdfsearch

pdfsearch finds text strings inside a PDF. It relies on the pdftotext utility from the Poppler library.
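To give a feel for what relying on pdftotext involves, here is a minimal sketch of pulling per-page text out of a PDF. It is not taken from pdfsearch itself: `split_pages` and `extract_pages` are hypothetical helper names, and the sketch assumes the default pdftotext behaviour of separating pages with a form feed character.

```python
import subprocess


def split_pages(raw_text):
    """Split pdftotext output into a list of page texts."""
    # By default pdftotext separates pages with a form feed ("\f").
    pages = raw_text.split("\f")
    # A form feed after the last page leaves an empty trailing item; drop it.
    if pages and pages[-1] == "":
        pages.pop()
    return pages


def extract_pages(pdf_name):
    """Run pdftotext on pdf_name and return one string per page.

    Hypothetical helper for illustration; not part of the pdfsearch API.
    """
    completed = subprocess.run(
        ["pdftotext", pdf_name, "-"],  # "-" writes the text to stdout
        capture_output=True,
        check=True,
    )
    return split_pages(completed.stdout.decode("utf-8", errors="replace"))
```

Calling `extract_pages('animal-farm.pdf')` would need Poppler installed as described under Installation.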

Installation

You must install Poppler and add it to your search path, so that pdftotext can be run from the command line.

Windows: http://blog.alivate.com.au/poppler-windows

Linux: sudo apt install poppler-utils

Mac: brew install poppler
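A quick way to confirm that Poppler ended up on the search path is to ask Python where pdftotext lives. This is only a convenience check using the standard library; `find_pdftotext` is a made-up name, not something pdfsearch provides.

```python
import shutil


def find_pdftotext():
    """Return the full path to pdftotext, or None if it is not on the search path."""
    return shutil.which("pdftotext")


path = find_pdftotext()
if path is None:
    print("pdftotext not found: install Poppler and add it to your search path")
else:
    print(f"pdftotext found at {path}")
```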

Then install tqdm:

pip install tqdm

Then install pdfsearch from source:

git clone http://git.wrl.unsw.edu.au:3000/danh/pdfsearch.git
cd pdfsearch
python setup.py install

Or put a copy of pdfsearch.py in your working directory.

Usage

>>> from pdfsearch import search_pdf

>>> pdf_name = 'animal-farm.pdf'
>>> search_patterns = ['horse', 'pig']
>>> context_length = 2
>>> results = search_pdf(pdf_name, search_patterns, context_length)

>>> import pandas as pd
>>> pd.DataFrame(results)

document         page   pattern  word          context
--------------------------------------------------------------------------
animal-farm.pdf     1   horse    cart-horses,  The two cart-horses, Boxer
animal-farm.pdf     1   horse    horses        two ordinary horses put together.
animal-farm.pdf     1   horse    horses        the horses came
animal-farm.pdf     1   pig      pig,          a majestic-looking pig, with a
animal-farm.pdf     1   pig      pigs,         the pigs, who
animal-farm.pdf     1   pig      pigeons       window-sills, the pigeons fluttered up
animal-farm.pdf     1   pig      pigs          the pigs and began
animal-farm.pdf     2   horse    horses        The two horses had just
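Note that matching is on substrings, so 'pig' also picks up 'pigeons'. The sketch below shows one way to drop such hits and count the rest per page. The rows are copied by hand from the table above, and the list-of-dicts shape is an assumption made for illustration; the actual structure returned by search_pdf may differ.

```python
from collections import Counter

# Sample rows copied from the table above (shape assumed, context column omitted).
rows = [
    {"page": 1, "pattern": "horse", "word": "cart-horses,"},
    {"page": 1, "pattern": "horse", "word": "horses"},
    {"page": 1, "pattern": "horse", "word": "horses"},
    {"page": 1, "pattern": "pig", "word": "pig,"},
    {"page": 1, "pattern": "pig", "word": "pigs,"},
    {"page": 1, "pattern": "pig", "word": "pigeons"},
    {"page": 1, "pattern": "pig", "word": "pigs"},
    {"page": 2, "pattern": "horse", "word": "horses"},
]

# Drop substring hits that are not the word we meant ('pigeons' for 'pig').
wanted = [row for row in rows if not row["word"].lower().startswith("pigeon")]

# Count the remaining hits for each (page, pattern) pair.
counts = Counter((row["page"], row["pattern"]) for row in wanted)
for (page, pattern), n in sorted(counts.items()):
    print(f"page {page}: {pattern} x {n}")
```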

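Search patterns should be lowercase unless case sensitivity is important: a lowercase pattern matches text in any case, while a pattern containing uppercase letters matches only that exact case. For example: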

'ph' will match:

  • 'photograph'
  • 'PHANTOM'
  • 'pH'

'pH' will match:

  • 'pH'
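The rule illustrated by these examples can be sketched as follows. This is an illustration of the described behaviour, not the code pdfsearch uses internally, and `pattern_matches` is a hypothetical name.

```python
def pattern_matches(pattern, word):
    """Return True if pattern occurs in word under the case rule described above.

    Hypothetical helper: an all-lowercase pattern is compared
    case-insensitively; any other pattern is compared case-sensitively.
    """
    if pattern == pattern.lower():
        # Lowercase pattern: ignore the case of the word.
        return pattern in word.lower()
    # Pattern contains uppercase letters: the case must match exactly.
    return pattern in word


for word in ["photograph", "PHANTOM", "pH"]:
    print(word, pattern_matches("ph", word), pattern_matches("pH", word))
```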