Add the post

2021-04-11 11:45:54 -07:00 · 2021-04-11 11:45:54 -07:00 · a9ae8cb4ad
parent 941825f38a
commit a9ae8cb4ad
1 changed files with 89 additions and 0 deletions
--- a/_posts/2021-04-11-building-a-goede-search-engine.md
+++ b/_posts/2021-04-11-building-a-goede-search-engine.md
@ -0,0 +1,89 @@
+---
+layout: post
+title: Building a goede search engine
+tags:
+- rust
+---
+
+This weekend I finally got around to building a little Rust "full text search
+engine" based on the educational post written by my
+[Scribd](https://tech.scribd.com) colleague [Bart](https://bart.degoe.de):
+titled [Building a full-text search engine in 150 lines of Python
+code](https://bart.degoe.de/building-a-full-text-search-engine-150-lines-of-code/).
+Bart did a great job writing an accessible post which introduced some common
+search concepts using Python, my objective wasn't necessarily to write
+something _faster_ or _better_ but to use the exercise as Rust practice. My day
+job is no longer writing code so the opportunity for a problem with fixed scope
+which would work out my Rust muscles was too good to pass up. In this post I
+want to share some things which I've learned in the process of duplicating
+Bart's work.
+
+
+The exercise revolves around parsing, indexing, and searching a dump of abstracts from the English language Wikipedia. This meant consuming a rather large Gzipped dump of XML which led me to two instrumental crates:
+
+* [flate2](https://github.com/rust-lang/flate2-rs) which has the decoder support to read a Gzip (`.gz`) file.
+* [quick_xml](https://github.com/tafia/quick-xml) which allows for parsing XML from strings or streams.
+* [rust-stemmer](https://crates.io/crates/rust-stemmers) which helps with [stemming](https://en.wikipedia.org/wiki/Stemming), a concept which was new to me.
+
+Initially I spent a **lot** of time just trying to get the file loaded. It
+seemed that no matter what I tried I just could not get the file to deserialize
+properly. I would look at snippets of the file in `zcat` and it all _looked_
+like valid XML to me. I copied some of the entries out of the `.xml.gz` file
+and created a simple test data set which loaded perfectly fine, but for
+whatever reason I could not properly load the entire dataset generated by
+Wikipedia. On a guess I unzipped the file and then re-gzipped it with the
+`gunzip` and `gzip` command line tools and _that_ loaded properly.
+Unfortunately it looks like I had found a [bug in
+flate2](https://github.com/rust-lang/flate2-rs/issues/265). Oops.
+
+Once I was able to actually load the Wikipedia data set, the rest of the
+exercise went quite smoothly. I needed to refer to Rust docs quite a bit, had
+to borrow some suggestions from Stack Overflow when it came time to find the
+intersection of multiple `HashSet` structs, since Rust's `HashSet.intersection`
+will only work on two sets.
+
+After a couple hours of evening tinkering, **[goedesearch](https://github.com/rtyler/goedesearch)** was born.
+
+## Performance
+
+As one might expect, converting a Python project to Rust did introduce
+performance improvements. Some aspects of the Python code are intentionally
+slow in order to make it easier to introduce and consider piece by piece. 
+
+To give you an idea of the inherent performance improvements of Rust, just the
+act of deserializing a gzipped XML file into memory in my initial attempt took
+4-5 minutes where the Python version took 40 or so. Since I work on an
+`aarch64` machine, I wonder how optimized, or not Python on that architecture
+may be. Once the file was loaded, the search performance of the two
+implementations is close enough to not be really worth discussion.
+
+The inherent Rust performance improvements aside, I still have done some minor optimizations of my Rust code to further enhance the exercise. For example, my initial version of the code used quick_xml's [serde](https://serde.rs) support to [deserialize entries](https://github.com/rtyler/goedesearch/blob/41cbd31362a91e33208319979bf6c8b634f0dd38/src/main.rs#L261-L272), but I have since moved away from that for performance reasons. The original flow was:
+
+1. Deserialize the `.xml.gz` into a `Vec<Article`.
+2. Then iterate through that vector and run the indexing process.
+
+I re-factored this code to use a more streaming approach where each
+`<article/>` in the data set is deserialized into an `Article` and then
+indexed. This speeds up the time by reducing two steps but the whole process is
+still fairly CPU-bound.
+
+My next steps are to explore some crates I have been watching for distributing
+the indexing work across multiple cores on the machine. I am optimistic that the safe concurrency support in Rust allows me to make minimal changes to my code to take this indexing operation from dominating a single core, to utilizing every code available. Ultimately, I would love for the loading/indexing problem to be I/O bound rather than CPU bound.
+
+----
+
+
+Implementing goedesearch is a great example of how taking a project in one
+language, and porting it to another, can help teach you more about a domain.
+The full-text search engine is really a toy that is not meant for serious use,
+but playing is one of the key ways that we learn! Bart suggested checking out
+[tantivy](https://github.com/tantivy-search/tantivy) for a more _serious_ Rust
+search implementation. The folks behind tantivy also gave a [FOSDEM 2019
+presentation](https://archive.fosdem.org/2019/schedule/event/deepdive_tantivy/)
+which may be of interest.
+
+
+For me, I don't have any great plans for implementing a search engine, in Rust or otherwise, but I'm grateful to have had a fun little project to relax with this weekend!
+
+
+Should you want to review it, all of the code can be found [in this repository](https://github.com/rtyler/goedesearch) on GitHub.