A high-performance Java-based search engine utilizing multithreading for efficient text indexing and searching. This project implements a robust inverted index data structure with support for exact and partial search capabilities, web crawling, and thread-safe operations.
This search engine is designed to efficiently index and search through large collections of text documents and web pages. It employs a multithreaded architecture to maximize performance on modern hardware and can handle concurrent indexing and search operations.
Key features include:
- Multithreaded Indexing: Efficiently processes documents using a thread pool to maximize CPU utilization
- Thread-Safe Data Structures: Custom implementations ensure data integrity during concurrent operations
- Web Crawling: Support for HTML document processing and web link traversal
- Query Processing: Fast exact and partial search capabilities with relevance ranking
- JSON Output: Search results and index data can be exported in a clean JSON format
The system is built around these core components:

InvertedIndex: The core inverted index data structure, which:
- Maps words to their locations (file paths and positions within those files)
- Maintains word frequency counts
- Supports search operations with relevance ranking
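The nested structure described above can be sketched as a word-to-location-to-positions mapping. This is a hypothetical simplification for illustration; the class and method names here are assumptions, not the project's actual API:

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Minimal sketch of an inverted index: word -> location -> positions.
// Names and structure are illustrative, not taken from the actual project.
public class SimpleInvertedIndex {
    private final Map<String, Map<String, Set<Integer>>> index = new TreeMap<>();
    private final Map<String, Integer> counts = new TreeMap<>();

    public void add(String word, String location, int position) {
        index.computeIfAbsent(word, w -> new TreeMap<>())
             .computeIfAbsent(location, l -> new TreeSet<>())
             .add(position);
        counts.merge(location, 1, Integer::sum);
    }

    public boolean contains(String word) {
        return index.containsKey(word);
    }

    public int wordCount(String location) {
        return counts.getOrDefault(location, 0);
    }
}
```

Using sorted `TreeMap`/`TreeSet` collections keeps words, locations, and positions in order, which makes deterministic JSON output straightforward.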
- ThreadSafeInvertedIndex: Thread-safe version of the inverted index using read-write locks
- MultiReaderLock: Custom implementation of a read-write lock allowing multiple concurrent readers
- WorkQueue: Thread pool implementation for managing worker threads
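The read-write locking pattern these components rely on looks roughly like the following. This sketch substitutes the standard `ReentrantReadWriteLock` for the project's custom `MultiReaderLock`, and the guarded class is a made-up example rather than project code:

```java
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Sketch of the read-write locking pattern: many readers may hold the
// lock at once, but a writer requires exclusive access.
public class GuardedSet {
    private final Set<String> data = new HashSet<>();
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();

    public void add(String value) {
        lock.writeLock().lock();
        try {
            data.add(value);
        } finally {
            lock.writeLock().unlock();
        }
    }

    public boolean contains(String value) {
        lock.readLock().lock();
        try {
            return data.contains(value);
        } finally {
            lock.readLock().unlock();
        }
    }
}
```

Because searches vastly outnumber index updates in a search engine, letting readers proceed in parallel is where a multi-reader lock pays off over a single mutex.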
- InvertedIndexBuilder: Single-threaded implementation for document processing
- MultithreadedInvertedIndexBuilder: Parallel implementation using worker threads
- QueryProcessor: Processes search queries for single-threaded operations
- ThreadSafeQueryProcessor: Thread-safe implementation for concurrent query processing
- Support for both exact and partial matching search algorithms
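The difference between exact and partial matching over a sorted set of indexed words can be sketched as follows (class and method names are assumptions for illustration, not the project's API):

```java
import java.util.List;
import java.util.TreeSet;
import java.util.stream.Collectors;

public class SearchModes {
    // Exact search: the query must equal an indexed word.
    public static boolean exact(TreeSet<String> words, String query) {
        return words.contains(query);
    }

    // Partial search: any indexed word starting with the query matches.
    // tailSet jumps straight to the first candidate in the sorted set,
    // so only genuine prefix matches are scanned.
    public static List<String> partial(TreeSet<String> words, String query) {
        return words.tailSet(query).stream()
                .takeWhile(word -> word.startsWith(query))
                .collect(Collectors.toList());
    }
}
```

Keeping the index keys sorted is what makes partial search cheap: all words sharing a prefix are contiguous, so no full scan is needed.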
- WebCrawler: Processes HTML pages and extracts links
- HtmlFetcher: Fetches web content with redirect handling
- HtmlCleaner: Strips HTML tags and cleans content for indexing
- LinkFinder: Identifies and normalizes links in HTML content
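Tag stripping of the kind `HtmlCleaner` performs can be approximated with regular expressions. This is a deliberately simplified sketch; the actual class likely also handles HTML entities, comments, and malformed markup:

```java
public class TagStripper {
    // Remove <script>/<style> blocks first (their text content is not
    // prose), then drop any remaining tags, then collapse whitespace.
    // Illustrative only; not the project's HtmlCleaner implementation.
    public static String strip(String html) {
        String noBlocks = html.replaceAll("(?is)<(script|style)[^>]*>.*?</\\1>", " ");
        String noTags = noBlocks.replaceAll("(?s)<[^>]*>", " ");
        return noTags.replaceAll("\\s+", " ").trim();
    }
}
```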
The system is configured and run via command-line arguments:
```
java -cp ".:lib/*" edu.usfca.cs272.Driver [arguments]
```
| Flag | Description | Default |
|---|---|---|
| `-text` | Path to the file or directory to index | Required |
| `-index` | Path for the JSON file to output the inverted index | `index.json` |
| `-counts` | Path for the JSON file to output word counts | `counts.json` |
| `-results` | Path for the JSON file to output search results | `results.json` |
| `-query` | Path to the file containing search queries | None |
| `-threads` | Number of worker threads to use | 5 |
| `-html` | Seed URL for web crawling | None |
| `-partial` | Use partial search instead of exact search | False |
Index a directory of text files using 8 threads:

```
java -cp ".:lib/*" edu.usfca.cs272.Driver -text /path/to/texts -threads 8 -index index.json
```

Perform searches from a query file using the previously built index:

```
java -cp ".:lib/*" edu.usfca.cs272.Driver -query /path/to/queries.txt -results results.json
```

Crawl a website and build an index:

```
java -cp ".:lib/*" edu.usfca.cs272.Driver -html https://example.com -index web-index.json
```
- The multithreaded implementation shows significant performance improvements over single-threaded operations, especially for large document collections
- The optimal number of threads depends on your hardware (typically matching the number of available CPU cores)
- Web crawling performance depends heavily on network conditions and the target website's responsiveness
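One common way to pick a sensible default worker count at runtime, in line with the core-matching advice above (a sketch; the project may instead use its fixed default of 5):

```java
public class ThreadDefaults {
    // Match the worker count to the number of logical cores, which is a
    // reasonable default for CPU-bound work such as indexing.
    public static int defaultThreads() {
        return Runtime.getRuntime().availableProcessors();
    }
}
```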
Requirements:
- Java 17 or higher
- Apache Commons Text
- Apache Log4j2
- OpenNLP Snowball Stemmer