Add the post

This commit is contained in:
R Tyler Croy 2021-04-11 11:45:54 -07:00
parent 941825f38a
commit a9ae8cb4ad
No known key found for this signature in database
GPG Key ID: E5C92681BEF6CEA2
1 changed files with 89 additions and 0 deletions

View File

@ -0,0 +1,89 @@
---
layout: post
title: Building a goede search engine
tags:
- rust
---
This weekend I finally got around to building a little Rust "full text search
engine" based on the educational post written by my
[Scribd](https://tech.scribd.com) colleague [Bart](https://bart.degoe.de):
titled [Building a full-text search engine in 150 lines of Python
code](https://bart.degoe.de/building-a-full-text-search-engine-150-lines-of-code/).
Bart did a great job writing an accessible post which introduced some common
search concepts using Python, my objective wasn't necessarily to write
something _faster_ or _better_ but to use the exercise as Rust practice. My day
job is no longer writing code so the opportunity for a problem with fixed scope
which would work out my Rust muscles was too good to pass up. In this post I
want to share some things which I've learned in the process of duplicating
Bart's work.
The exercise revolves around parsing, indexing, and searching a dump of abstracts from the English language Wikipedia. This meant consuming a rather large Gzipped dump of XML which led me to two instrumental crates:
* [flate2](https://github.com/rust-lang/flate2-rs) which has the decoder support to read a Gzip (`.gz`) file.
* [quick_xml](https://github.com/tafia/quick-xml) which allows for parsing XML from strings or streams.
* [rust-stemmer](https://crates.io/crates/rust-stemmers) which helps with [stemming](https://en.wikipedia.org/wiki/Stemming), a concept which was new to me.
Initially I spent a **lot** of time just trying to get the file loaded. It
seemed that no matter what I tried I just could not get the file to deserialize
properly. I would look at snippets of the file in `zcat` and it all _looked_
like valid XML to me. I copied some of the entries out of the `.xml.gz` file
and created a simple test data set which loaded perfectly fine, but for
whatever reason I could not properly load the entire dataset generated by
Wikipedia. On a guess I unzipped the file and then re-gzipped it with the
`gunzip` and `gzip` command line tools and _that_ loaded properly.
Unfortunately it looks like I had found a [bug in
flate2](https://github.com/rust-lang/flate2-rs/issues/265). Oops.
Once I was able to actually load the Wikipedia data set, the rest of the
exercise went quite smoothly. I needed to refer to Rust docs quite a bit, had
to borrow some suggestions from Stack Overflow when it came time to find the
intersection of multiple `HashSet` structs, since Rust's `HashSet.intersection`
will only work on two sets.
After a couple hours of evening tinkering, **[goedesearch](https://github.com/rtyler/goedesearch)** was born.
## Performance
As one might expect, converting a Python project to Rust did introduce
performance improvements. Some aspects of the Python code are intentionally
slow in order to make it easier to introduce and consider piece by piece.
To give you an idea of the inherent performance improvements of Rust, just the
act of deserializing a gzipped XML file into memory in my initial attempt took
4-5 minutes where the Python version took 40 or so. Since I work on an
`aarch64` machine, I wonder how optimized, or not Python on that architecture
may be. Once the file was loaded, the search performance of the two
implementations is close enough to not be really worth discussion.
The inherent Rust performance improvements aside, I still have done some minor optimizations of my Rust code to further enhance the exercise. For example, my initial version of the code used quick_xml's [serde](https://serde.rs) support to [deserialize entries](https://github.com/rtyler/goedesearch/blob/41cbd31362a91e33208319979bf6c8b634f0dd38/src/main.rs#L261-L272), but I have since moved away from that for performance reasons. The original flow was:
1. Deserialize the `.xml.gz` into a `Vec<Article`.
2. Then iterate through that vector and run the indexing process.
I re-factored this code to use a more streaming approach where each
`<article/>` in the data set is deserialized into an `Article` and then
indexed. This speeds up the time by reducing two steps but the whole process is
still fairly CPU-bound.
My next steps are to explore some crates I have been watching for distributing
the indexing work across multiple cores on the machine. I am optimistic that the safe concurrency support in Rust allows me to make minimal changes to my code to take this indexing operation from dominating a single core, to utilizing every code available. Ultimately, I would love for the loading/indexing problem to be I/O bound rather than CPU bound.
----
Implementing goedesearch is a great example of how taking a project in one
language, and porting it to another, can help teach you more about a domain.
The full-text search engine is really a toy that is not meant for serious use,
but playing is one of the key ways that we learn! Bart suggested checking out
[tantivy](https://github.com/tantivy-search/tantivy) for a more _serious_ Rust
search implementation. The folks behind tantivy also gave a [FOSDEM 2019
presentation](https://archive.fosdem.org/2019/schedule/event/deepdive_tantivy/)
which may be of interest.
For me, I don't have any great plans for implementing a search engine, in Rust or otherwise, but I'm grateful to have had a fun little project to relax with this weekend!
Should you want to review it, all of the code can be found [in this repository](https://github.com/rtyler/goedesearch) on GitHub.