# pdfsearch #

**pdfsearch** finds text strings inside a PDF and reports each match with its page number and surrounding context.

## Installation ##

```sh
pip install PyPDF2
pip install tqdm
```

Then clone the repository and install the package:

```sh
git clone http://git.wrl.unsw.edu.au:3000/danh/pdfsearch.git
cd pdfsearch
python setup.py install
```

Alternatively, place a copy of `pdfsearch.py` in your working directory.

## Usage ##

```python
>>> from pdfsearch import search_pdf

>>> pdf_name = 'animal-farm.pdf'
>>> search_patterns = ['horse', 'pig']
>>> context_length = 2
>>> results = search_pdf(pdf_name, search_patterns, context_length)

>>> import pandas as pd
>>> pd.DataFrame(results)

document         page  pattern  word          context
------------------------------------------------------------------------------------
animal-farm.pdf  1     horse    cart-horses,  The two cart-horses, Boxer
animal-farm.pdf  1     horse    horses        two ordinary horses put together.
animal-farm.pdf  1     horse    horses        the horses came
animal-farm.pdf  1     pig      pig,          a majestic-looking pig, with a
animal-farm.pdf  1     pig      pigs,         the pigs, who
animal-farm.pdf  1     pig      pigeons       window-sills, the pigeons fluttered up
animal-farm.pdf  1     pig      pigs          the pigs and began
animal-farm.pdf  2     horse    horses        The two horses had just
```
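
The results can also be summarised without pandas. The sketch below assumes that `search_pdf` returns a list of dicts keyed by the column names shown above; the return type is not documented here, so treat that as an assumption. The `sample_results` records are copied from rows of the table, and `count_hits` is a hypothetical helper, not part of pdfsearch.

```python
from collections import Counter

# Assumption: each result is a dict with the keys shown in the table above.
# These records are copied from the example output, not produced by search_pdf.
sample_results = [
    {'document': 'animal-farm.pdf', 'page': 1, 'pattern': 'horse',
     'word': 'cart-horses,', 'context': 'The two cart-horses, Boxer'},
    {'document': 'animal-farm.pdf', 'page': 1, 'pattern': 'pig',
     'word': 'pig,', 'context': 'a majestic-looking pig, with a'},
    {'document': 'animal-farm.pdf', 'page': 1, 'pattern': 'pig',
     'word': 'pigeons', 'context': 'window-sills, the pigeons fluttered up'},
    {'document': 'animal-farm.pdf', 'page': 2, 'pattern': 'horse',
     'word': 'horses', 'context': 'The two horses had just'},
]


def count_hits(results):
    """Count matches for each (pattern, page) pair (hypothetical helper)."""
    return Counter((r['pattern'], r['page']) for r in results)


hits = count_hits(sample_results)
for (pattern, page), n in sorted(hits.items()):
    print(f'{pattern!r} on page {page}: {n} match(es)')
```

If `search_pdf` turns out to return a different structure, only the key lookups inside `count_hits` would need to change.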

Search patterns should be lowercase unless case sensitivity matters. A lowercase pattern matches regardless of case, while a pattern containing uppercase letters matches only that exact case. For example:

`'ph'` will match:
- `'photograph'`
- `'PHANTOM'`
- `'pH'`

`'pH'` will match:
- `'pH'`
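
The matching rule illustrated above can be sketched in a few lines of Python. This is only an illustration of the behaviour shown in the examples, not pdfsearch's actual implementation; `pattern_matches` is a hypothetical name.

```python
def pattern_matches(pattern, word):
    """Return True if `pattern` occurs in `word` under the case rule above."""
    if pattern == pattern.lower():
        # All-lowercase pattern: compare case-insensitively.
        return pattern in word.lower()
    # Pattern contains uppercase letters: require an exact-case match.
    return pattern in word


words = ['photograph', 'PHANTOM', 'pH']
print([w for w in words if pattern_matches('ph', w)])  # all three words
print([w for w in words if pattern_matches('pH', w)])  # only 'pH'
```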