Microsoft Windows search is not fast, and it does not give good search results either. So I thought about writing my own search engine for the desktop. It should crawl the file system, extract the content and metadata, and finally give results as good as Google's.
I also wanted to test some new technologies like JavaFX with embedded HTML5, Apache Lucene as a full text search engine, Apache Tika as the content extraction framework and other stuff. But before we dive deep into the internals, let's take a look at the frontend:
JavaFXDesktopSearch also comes with a visualization of the current full text index. It provides a clickable Sunburst diagram for this purpose. Basically it looks as follows:
Under the hood it uses d3js.org to visualize the Lucene index. Quite nice and fast, just try it out. The project is hosted at github.com/mirkosertic/FXDesktopSearch. FXDesktopSearch is deployed by JavaFX based native installers. The original version was deployed by WebStart, but WebStart support was dropped due to Oracle's changes to its security policies. Now JavaFXDesktopSearch can be installed using native installers, and the right Java runtime is bundled with it. Check out the releases on Google Drive.
Of course I want to say thank you to In-SideFX for the cool Undecorator tool, which can be found here.
Under the hood
I use a multithreaded pipes and filters architecture for file indexing. The FileSystemCrawler searches for files and puts them on the ContentExtractionQueue. The ContentExtractor takes entries from the ContentExtractionQueue, extracts the content and metadata with Apache Tika and puts the content on the IndexWriterQueue. The LuceneIndexHandler takes content from the IndexWriterQueue and updates the Apache Lucene full text index.
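To make the pipeline a bit more concrete, here is a minimal sketch of the same three stages connected by two blocking queues. It is not the project's actual code: the class name PipelineSketch, the run method, the queue capacities and the end-of-stream markers are all assumptions made for illustration, and the Tika extraction and the Lucene index update are replaced by simple placeholders (a fake content string and a map) so the example runs without any dependencies.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical sketch of a pipes and filters pipeline; names are illustrative only.
class PipelineSketch {

    // End-of-stream markers ("poison pills"), compared by identity so that
    // no real path or entry can be mistaken for them.
    private static final String NO_MORE_FILES = new String("<end>");
    private static final Map.Entry<String, String> NO_MORE_CONTENT = Map.entry("<end>", "");

    static Map<String, String> run(List<String> paths) {
        BlockingQueue<String> contentExtractionQueue = new LinkedBlockingQueue<>(100);
        BlockingQueue<Map.Entry<String, String>> indexWriterQueue = new LinkedBlockingQueue<>(100);
        Map<String, String> index = new ConcurrentHashMap<>(); // stands in for the Lucene index

        // Stage 1: the crawler puts every file it finds on the first queue.
        Thread crawler = new Thread(() -> {
            try {
                for (String path : paths) {
                    contentExtractionQueue.put(path);
                }
                contentExtractionQueue.put(NO_MORE_FILES);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        // Stage 2: the extractor turns a path into content (placeholder for Tika).
        Thread extractor = new Thread(() -> {
            try {
                while (true) {
                    String path = contentExtractionQueue.take();
                    if (path == NO_MORE_FILES) {
                        break;
                    }
                    indexWriterQueue.put(Map.entry(path, "content of " + path));
                }
                indexWriterQueue.put(NO_MORE_CONTENT);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        // Stage 3: the index handler stores the content (placeholder for Lucene).
        Thread indexHandler = new Thread(() -> {
            try {
                while (true) {
                    Map.Entry<String, String> entry = indexWriterQueue.take();
                    if (entry == NO_MORE_CONTENT) {
                        break;
                    }
                    index.put(entry.getKey(), entry.getValue());
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        crawler.start();
        extractor.start();
        indexHandler.start();
        try {
            crawler.join();
            extractor.join();
            indexHandler.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Pipeline was interrupted", e);
        }
        return index;
    }

    public static void main(String[] args) {
        System.out.println(run(List.of("notes.txt", "report.pdf")));
    }
}
```

The bounded queues are the interesting part of this design: if content extraction is slower than crawling, the crawler simply blocks on put until there is room again, so memory use stays limited without any extra coordination between the stages.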
The JavaFX/HTML5 hybrid is a very powerful thing. It enables us to create cool user interfaces with full support of the whole Java stack using the described Gateway approach. Also, the HTML application could be deployed standalone without Desktop interaction, for instance to support mobile devices like tablets or smartphones.