diff --git a/README.md b/README.md
new file mode 100644
index 0000000..e48e19c
--- /dev/null
+++ b/README.md
@@ -0,0 +1,54 @@
+# pdfsearch #
+
+**pdfsearch** finds text strings inside a pdf.
+
+## Installation ##
+
+```sh
+pip install PyPDF2
+pip install tqdm
+```
+
+Then
+
+```sh
+git clone http://git.wrl.unsw.edu.au:3000/danh/pdfsearch.git
+cd pdfsearch
+python setup.py install
+```
+
+Or put a copy of `pdfsearch.py` in your working directory.
+
+## Usage ##
+
+```python
+>>> from pdfsearch import search_pdf
+
+>>> pdf_name = 'animal-farm.pdf'
+>>> search_patterns = ['horse', 'pig']
+>>> context_length = 2
+>>> results = search_pdf(pdf_name, search_patterns, context_length)
+
+>>> pd.DataFrame(results)
+
+document         page  pattern  word          context
+--------------------------------------------------------------------------
+animal-farm.pdf  1     horse    cart-horses,  The two cart-horses, Boxer
+animal-farm.pdf  1     horse    horses        two ordinary horses put together.
+animal-farm.pdf  1     horse    horses        the horses came
+animal-farm.pdf  1     pig      pig,          a majestic-looking pig, with a
+animal-farm.pdf  1     pig      pigs,         the pigs, who
+animal-farm.pdf  1     pig      pigeons       window-sills, the pigeons fluttered up
+animal-farm.pdf  1     pig      pigs          the pigs and began
+animal-farm.pdf  2     horse    horses        The two horses had just
+```
+
+Search patterns should be lowercase, unless case-sensitivity is important. For example:
+
+`'ph'` will match:
+ - `'photograph'`
+ - `'PHANTOM'`
+ - `'pH'`
+
+`'pH'` will match:
+ - `'pH'`
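
The case rule described at the end of the README can be illustrated with a small sketch. This is not taken from `pdfsearch.py`, whose source is not part of this diff; `pattern_matches` is a hypothetical helper, and plain substring matching is an assumption made only to reproduce the `'ph'` / `'pH'` examples above.

```python
def pattern_matches(pattern: str, word: str) -> bool:
    """Hypothetical helper (not part of pdfsearch): smart-case substring match."""
    if pattern == pattern.lower():
        # All-lowercase pattern: compare against the lowercased word,
        # so the match ignores case.
        return pattern in word.lower()
    # Pattern contains an uppercase character: require an exact-case match.
    return pattern in word


words = ['photograph', 'PHANTOM', 'pH']

# Lowercase pattern matches every word, regardless of case.
print([w for w in words if pattern_matches('ph', w)])

# Mixed-case pattern matches only the word with the same casing.
print([w for w in words if pattern_matches('pH', w)])
```

Under these assumptions, the first call keeps all three words and the second keeps only `'pH'`, which agrees with the lists in the README.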