# pdfsearch #

**pdfsearch** finds text strings inside a PDF. It relies on `pdftotext` from the [Poppler](https://poppler.freedesktop.org/) library.

## Installation ##

You must install Poppler and add it to your search path.

**Windows**: http://blog.alivate.com.au/poppler-windows

**Linux**: `sudo apt install poppler-utils`

**Mac**: `brew install poppler`

Then install `tqdm`:

```
pip install tqdm
```

Then install pdfsearch:

```
git clone http://git.wrl.unsw.edu.au:3000/danh/pdfsearch.git
cd pdfsearch
python setup.py install
```

Or put a copy of `pdfsearch.py` in your working directory.

## Usage ##

```python
>>> from pdfsearch import search_pdf
>>> pdf_name = 'animal-farm.pdf'
>>> search_patterns = ['horse', 'pig']
>>> context_length = 2
>>> results = search_pdf(pdf_name, search_patterns, context_length)

>>> import pandas as pd
>>> pd.DataFrame(results)

document         page  pattern  word          context
--------------------------------------------------------------------------
animal-farm.pdf  1     horse    cart-horses,  The two cart-horses, Boxer
animal-farm.pdf  1     horse    horses        two ordinary horses put together.
animal-farm.pdf  1     horse    horses        the horses came
animal-farm.pdf  1     pig      pig,          a majestic-looking pig, with a
animal-farm.pdf  1     pig      pigs,         the pigs, who
animal-farm.pdf  1     pig      pigeons       window-sills, the pigeons fluttered up
animal-farm.pdf  1     pig      pigs          the pigs and began
animal-farm.pdf  2     horse    horses        The two horses had just
```

Search patterns should be lowercase unless case sensitivity is important: a lowercase pattern is matched regardless of case, while a pattern with uppercase letters is matched exactly. For example:

`'ph'` will match:

- `'photograph'`
- `'PHANTOM'`
- `'pH'`

`'pH'` will match:

- `'pH'`
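
The case rule above can be illustrated with a minimal sketch. The helper `matches` is hypothetical and is not part of pdfsearch; it only assumes, based on the examples above, that an all-lowercase pattern is compared case-insensitively and any other pattern is compared exactly.

```python
def matches(pattern: str, word: str) -> bool:
    """Hypothetical helper illustrating the case rule (not pdfsearch's own code)."""
    if pattern == pattern.lower():
        # All-lowercase pattern: ignore the case of the word
        return pattern in word.lower()
    # Pattern contains uppercase letters: require an exact-case match
    return pattern in word


for word in ['photograph', 'PHANTOM', 'pH']:
    print(word, matches('ph', word), matches('pH', word))
```

Under these assumptions, `'ph'` matches all three words, while `'pH'` matches only `'pH'`.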