# pdfsearch #
**pdfsearch** finds text strings inside a PDF.
## Installation ##
```
pip install PyPDF2
pip install tqdm
```
Then clone the repository and install the package:
```
git clone http://git.wrl.unsw.edu.au:3000/danh/pdfsearch.git
cd pdfsearch
python setup.py install
```
Alternatively, put a copy of `pdfsearch.py` in your working directory.
## Usage ##
```python
>>> from pdfsearch import search_pdf
>>> pdf_name = 'animal-farm.pdf'
>>> search_patterns = ['horse', 'pig']
>>> context_length = 2
>>> results = search_pdf(pdf_name, search_patterns, context_length)
>>> import pandas as pd
>>> pd.DataFrame(results)
document         page  pattern  word          context
------------------------------------------------------------------------------------
animal-farm.pdf  1     horse    cart-horses,  The two cart-horses, Boxer
animal-farm.pdf  1     horse    horses        two ordinary horses put together.
animal-farm.pdf  1     horse    horses        the horses came
animal-farm.pdf  1     pig      pig,          a majestic-looking pig, with a
animal-farm.pdf  1     pig      pigs,         the pigs, who
animal-farm.pdf  1     pig      pigeons       window-sills, the pigeons fluttered up
animal-farm.pdf  1     pig      pigs          the pigs and began
animal-farm.pdf  2     horse    horses        The two horses had just
```
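Search patterns should be lowercase unless case sensitivity is important: a lowercase pattern matches text in any case, while a pattern containing uppercase letters matches only that exact case. For example:

`'ph'` will match:

- `'photograph'`
- `'PHANTOM'`
- `'pH'`

`'pH'` will match:

- `'pH'`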
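
The matching rule in the examples above can be illustrated with a small standalone sketch. This is not the code in `pdfsearch.py`; `pattern_matches` is a hypothetical helper, and it assumes the rule is "ignore case unless the pattern contains an uppercase letter", which is consistent with the examples but not confirmed by the source.

```python
import re


def pattern_matches(pattern: str, word: str) -> bool:
    """Hypothetical helper (not part of pdfsearch).

    Assumed rule: an all-lowercase pattern ignores case, while a pattern
    containing any uppercase letter must match the exact case.
    """
    case_sensitive = any(ch.isupper() for ch in pattern)
    flags = 0 if case_sensitive else re.IGNORECASE
    # re.escape treats the pattern as a literal string, not a regex.
    return re.search(re.escape(pattern), word, flags) is not None


words = ['photograph', 'PHANTOM', 'pH']
for pattern in ('ph', 'pH'):
    hits = [w for w in words if pattern_matches(pattern, w)]
    print(pattern, hits)
```

Under that assumption, `'ph'` matches all three words and `'pH'` matches only `'pH'`.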