puceny is a simplified search engine prototype that indexes and searches documents from a local directory. It supports reading text from files (such as .txt, .md, .html, .pdf), building an inverted index, and then performing keyword queries using a simple scoring method. Additionally, there's a Flask-based web UI to interact with the index.
- Indexing Local Documents: The system crawls a local directory, extracting text from supported file formats:
  - `.txt`/`.md`: Read directly as text
  - `.html`: Parsed using BeautifulSoup to extract text
  - `.pdf`: Parsed using PyPDF2 to extract text
  - Other file types are skipped or return empty text
- Inverted Index: Each token is processed by an Analyzer (tokenization, lowercasing, stopword removal) and mapped to the documents in which it appears, along with positional data.
- Segment-Based Index: New documents are committed into immutable segments. These can later be merged into a single segment for optimization.
- Simple Query Interface: The `Searcher` uses an in-memory inverted index to quickly retrieve documents matching query terms. A basic scoring function (TF-IDF-like) is implemented.
- Flask Web Frontend: A Flask server provides a simple webpage where users can enter queries, rebuild the index, and view highlighted snippets of matched documents.
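
To make the inverted index idea concrete, here is a minimal sketch of an analyzer and a positional index. It is an illustration only, not the code in `puceny.py`: the names `analyze`, `build_index`, and `STOPWORDS`, and the stopword list itself, are assumptions made for this example.

```python
import re
from collections import defaultdict

# Hypothetical stopword list for illustration; the project's Analyzer has its own defaults.
STOPWORDS = {"a", "an", "and", "the", "of", "to", "in", "is"}


def analyze(text):
    """Lowercase, split on non-alphanumerics, and drop stopwords.

    Returns (position, token) pairs. Positions count every token,
    including removed stopwords, so distances between terms are preserved.
    """
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [(pos, tok) for pos, tok in enumerate(tokens) if tok not in STOPWORDS]


def build_index(docs):
    """Build a mapping of token -> {doc_id: [positions]}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, tok in analyze(text):
            index[tok].setdefault(doc_id, []).append(pos)
    return index


docs = {
    "a.txt": "The quick brown fox",
    "b.txt": "A quick search of the index",
}
index = build_index(docs)
print(index["quick"])  # {'a.txt': [1], 'b.txt': [1]}
```

Looking up a query term is then a single dictionary access, and the stored positions are what make phrase matching or snippet extraction possible later.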
- Python 3.7+
- Packages:
  - `beautifulsoup4` (for HTML parsing)
  - `PyPDF2` (for PDF parsing)
  - `Flask` (for the web UI)

Install the required packages:

    pip install beautifulsoup4 PyPDF2 flask

- `puceny.py`: The main indexing and searching implementation (Analyzer, IndexWriter, IndexReader, Searcher, etc.).
- `app.py`: A Flask application that:
  - Builds or rebuilds the index from a specified data directory.
  - Provides a web interface to perform searches.
- `DATA_DIR`: The directory containing the documents to be indexed.

In app.py, set the following variables before running:
- `DATA_DIR`: Path to the local directory containing your documents.
- `INDEX_DIR`: Directory where the index files will be stored.

Both DATA_DIR and INDEX_DIR should be absolute paths. You can edit these directly in the code:

    DATA_DIR = "/path/to/your/data"
    INDEX_DIR = "puceny_index_cs61a"

The first time you run the app, if no index exists, it will automatically be created. To manually rebuild the index, you can press the "Rebuild Index" button on the web interface.
When rebuilding, the console will show the indexing progress and, after completion, print the total number of documents indexed.

- Ensure `DATA_DIR` and `INDEX_DIR` are correctly set.
- Run: `python app.py`
- Open `http://localhost:5050` in your browser.

- Rebuild Index: Click the "Rebuild Index" button at the top of the page to re-crawl the data directory and rebuild the index. This is useful after you have updated, added, or removed documents.
- Search: Enter one or more keywords in the search box and submit.
  - The results page will show matched documents along with highlighted snippets of the text where your search terms appear.
  - Document file paths are shown, and you can click on them to access the file directly (if supported by your environment).
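
As a rough illustration of how highlighted snippets can be produced, the sketch below cuts a window around the first matching term and wraps each match in `<mark>` tags. It is not the implementation in `app.py`: the function name `highlight_snippet`, the `width` parameter, and the choice of `<mark>` are assumptions made for this example.

```python
import html
import re


def highlight_snippet(text, terms, width=40):
    """Return an HTML-escaped snippet around the first match, with matches in <mark>."""
    if not terms:
        return html.escape(text[: 2 * width])
    # Match whole words only, ignoring case.
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(t) for t in terms) + r")\b",
        re.IGNORECASE,
    )
    first = pattern.search(text)
    if first is None:
        return html.escape(text[: 2 * width])

    # Keep up to `width` characters of context on each side of the first match.
    start = max(0, first.start() - width)
    end = min(len(text), first.end() + width)
    window = text[start:end]

    # Escape the text between matches so document content cannot inject markup.
    parts = []
    last = 0
    for m in pattern.finditer(window):
        parts.append(html.escape(window[last:m.start()]))
        parts.append("<mark>" + html.escape(m.group(0)) + "</mark>")
        last = m.end()
    parts.append(html.escape(window[last:]))
    return "".join(parts)


print(highlight_snippet("Index it", ["index"]))  # <mark>Index</mark> it
```

Escaping before inserting the tags matters because indexed HTML or Markdown files may themselves contain angle brackets.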

- Performance: The code preloads the entire index into memory to speed up queries. For very large datasets, consider merging segments or optimizing the data structures further.
- Stopwords: You can edit the default stopwords list in `Analyzer` if you need language-specific or domain-specific adjustments.
- Scoring: The `search_with_scores` method in `Searcher` uses a simple IDF formula. You can replace or enhance this with TF-IDF, BM25, or other scoring methods.
- File Types: Add more file parsers for other formats if needed.

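
For readers who want to try BM25 as a replacement scorer, a minimal standalone sketch follows. It does not reflect how `search_with_scores` is structured: the function `bm25_scores`, its input format (document id mapped to a list of already-analyzed tokens), and the defaults `k1=1.5` and `b=0.75` are assumptions chosen for this example.

```python
import math
from collections import Counter


def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    """Rank documents against the query using Okapi BM25.

    `docs` maps doc_id -> list of analyzed tokens (a hypothetical input format).
    Returns (doc_id, score) pairs, best match first.
    """
    n_docs = len(docs)
    if n_docs == 0:
        return []
    avg_len = sum(len(toks) for toks in docs.values()) / n_docs
    term_freqs = {doc_id: Counter(toks) for doc_id, toks in docs.items()}

    scores = {}
    for term in query_terms:
        # Document frequency: number of documents containing the term.
        df = sum(1 for tf in term_freqs.values() if term in tf)
        if df == 0:
            continue
        # The "+ 1" inside the log keeps the IDF non-negative.
        idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
        for doc_id, tf in term_freqs.items():
            f = tf[term]
            if f == 0:
                continue
            # Length normalization: long documents need more hits to score as high.
            norm = f + k1 * (1 - b + b * len(docs[doc_id]) / avg_len)
            scores[doc_id] = scores.get(doc_id, 0.0) + idf * f * (k1 + 1) / norm
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


docs = {
    "a.txt": ["quick", "brown", "fox"],
    "b.txt": ["quick", "search", "index", "index"],
    "c.txt": ["slow", "crawl"],
}
ranked = bm25_scores(["index", "quick"], docs)
print([doc_id for doc_id, _ in ranked])  # ['b.txt', 'a.txt']
```

Compared with a plain IDF score, BM25 also accounts for how often a term occurs in a document and for document length, with `k1` and `b` controlling the strength of those two effects.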
- This is a prototype, not a production-ready search engine.
- The scoring and query parsing are rudimentary.
- No authentication or security checks are enforced beyond directory checks.
Copyright (c) 2024
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

