---
layout: post
title: Building a goede search engine
tags:
- rust
---

This weekend I finally got around to building a little Rust "full text search engine" based on the educational post written by my [Scribd](https://tech.scribd.com) colleague [Bart](https://bart.degoe.de): [Building a full-text search engine in 150 lines of Python code](https://bart.degoe.de/building-a-full-text-search-engine-150-lines-of-code/). Bart did a great job writing an accessible post which introduces some common search concepts using Python. My objective wasn't necessarily to write something _faster_ or _better_, but to use the exercise as Rust practice. My day job is no longer writing code, so the opportunity to work out my Rust muscles on a problem with fixed scope was too good to pass up. In this post I want to share some things I learned in the process of duplicating Bart's work.

The exercise revolves around parsing, indexing, and searching a dump of abstracts from the English language Wikipedia. This meant consuming a rather large gzipped dump of XML, which led me to three instrumental crates:

* [flate2](https://github.com/rust-lang/flate2-rs), which has the decoder support to read a Gzip (`.gz`) file.
* [quick_xml](https://github.com/tafia/quick-xml), which allows for parsing XML from strings or streams.
* [rust-stemmers](https://crates.io/crates/rust-stemmers), which helps with [stemming](https://en.wikipedia.org/wiki/Stemming), a concept which was new to me.

Initially I spent a **lot** of time just trying to get the file loaded. It seemed that no matter what I tried, I could not get the file to deserialize properly. I would look at snippets of the file in `zcat` and it all _looked_ like valid XML to me. I copied some of the entries out of the `.xml.gz` file and created a simple test data set which loaded perfectly fine, but for whatever reason I could not properly load the entire dataset generated by Wikipedia.
On a guess I unzipped the file and then re-gzipped it with the `gunzip` and `gzip` command line tools, and _that_ loaded properly. Unfortunately, it looks like I had found a [bug in flate2](https://github.com/rust-lang/flate2-rs/issues/265). Oops.

Once I was able to actually load the Wikipedia data set, the rest of the exercise went quite smoothly. I needed to refer to the Rust docs quite a bit, and had to borrow some suggestions from Stack Overflow when it came time to find the intersection of multiple `HashSet` structs, since Rust's `HashSet::intersection` only works on two sets. After a couple hours of evening tinkering, **[goedesearch](https://github.com/rtyler/goedesearch)** was born.

## Performance

As one might expect, converting a Python project to Rust did introduce performance improvements. Some aspects of the Python code are intentionally slow in order to make it easier to introduce and consider piece by piece. To give you an idea of the inherent performance improvements of Rust, just the act of deserializing a gzipped XML file into memory took 4-5 minutes in my initial attempt, where the Python version took 40 or so. Since I work on an `aarch64` machine, I do wonder how well optimized, or not, Python on that architecture may be. Once the file was loaded, the search performance of the two implementations is close enough to not really be worth discussing.

The inherent Rust performance improvements aside, I have still done some minor optimizations of my Rust code to further enhance the exercise. For example, my initial version of the code used quick_xml's [serde](https://serde.rs) support to [deserialize entries](https://github.com/rtyler/goedesearch/blob/41cbd31362a91e33208319979bf6c8b634f0dd38/src/main.rs#L261-L272), but I have since moved away from that for performance reasons. The original flow was:

1. Deserialize the entire `.xml.gz` into a `Vec`.
2. Each entry in the data set is deserialized into an `Article` and then indexed.
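The workaround for the pairwise-only `intersection` can be sketched as a fold over the sets — this is the shape of the Stack Overflow suggestion, with a hypothetical `intersect_all` helper of my own naming rather than goedesearch's actual code:

```rust
use std::collections::HashSet;

/// Intersect any number of sets by folding pairwise intersections.
/// (Hypothetical helper; the element type u64 stands in for document ids.)
fn intersect_all(sets: &[HashSet<u64>]) -> HashSet<u64> {
    match sets.split_first() {
        None => HashSet::new(),
        Some((first, rest)) => rest.iter().fold(first.clone(), |acc, s| {
            // `intersection` only pairs two sets, so re-collect at each step
            acc.intersection(s).copied().collect()
        }),
    }
}

fn main() {
    let a: HashSet<u64> = [1, 2, 3, 4].into_iter().collect();
    let b: HashSet<u64> = [2, 3, 4, 5].into_iter().collect();
    let c: HashSet<u64> = [3, 4, 5, 6].into_iter().collect();

    let mut common: Vec<u64> = intersect_all(&[a, b, c]).into_iter().collect();
    common.sort();
    assert_eq!(common, vec![3, 4]);
    println!("{:?}", common);
}
```

Each fold step allocates a fresh set, which is fine for a handful of query terms; a more careful version could intersect starting from the smallest set to shrink the working set early.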
Collapsing those two steps into one speeds things up, but the whole process is still fairly CPU-bound. My next steps are to explore some crates I have been watching for distributing the indexing work across multiple cores on the machine. I am optimistic that Rust's safe concurrency support will allow me to make minimal changes to my code to take this indexing operation from dominating a single core to utilizing every core available. Ultimately, I would love for the loading/indexing problem to be I/O bound rather than CPU bound.

----

Implementing goedesearch is a great example of how taking a project in one language and porting it to another can help teach you more about a domain. The full-text search engine is really a toy that is not meant for serious use, but playing is one of the key ways that we learn! Bart suggested checking out [tantivy](https://github.com/tantivy-search/tantivy) for a more _serious_ Rust search implementation. The folks behind tantivy also gave a [FOSDEM 2019 presentation](https://archive.fosdem.org/2019/schedule/event/deepdive_tantivy/) which may be of interest.

As for me, I don't have any grand plans for implementing a search engine, in Rust or otherwise, but I'm grateful to have had a fun little project to relax with this weekend! Should you want to review it, all of the code can be found [in this repository](https://github.com/rtyler/goedesearch) on GitHub.