---
layout: post
title: Building a goede search engine
tags:
- rust
---

This weekend I finally got around to building a little Rust "full text search
engine" based on an educational post written by my
[Scribd](https://tech.scribd.com) colleague [Bart](https://bart.degoe.de),
titled [Building a full-text search engine in 150 lines of Python
code](https://bart.degoe.de/building-a-full-text-search-engine-150-lines-of-code/).
Bart did a great job writing an accessible post which introduces some common
search concepts using Python. My objective wasn't necessarily to write
something _faster_ or _better_, but to use the exercise as Rust practice. My
day job is no longer writing code, so the opportunity for a problem with fixed
scope which would work out my Rust muscles was too good to pass up. In this
post I want to share some of the things I learned while duplicating Bart's
work.

The exercise revolves around parsing, indexing, and searching a dump of
abstracts from the English language Wikipedia. This meant consuming a rather
large gzipped dump of XML, which led me to three instrumental crates:

* [flate2](https://github.com/rust-lang/flate2-rs) which has the decoder support to read a Gzip (`.gz`) file.
* [quick_xml](https://github.com/tafia/quick-xml) which allows for parsing XML from strings or streams.
* [rust-stemmers](https://crates.io/crates/rust-stemmers) which helps with [stemming](https://en.wikipedia.org/wiki/Stemming), a concept which was new to me.
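
To give a sense of how the first two crates fit together, here is a minimal
sketch of streaming the gzipped dump, not the goedesearch code itself. It
follows the quick_xml 0.22-era API (newer releases renamed `read_event` to
`read_event_into`), and simply counts the `<doc>` elements the abstracts dump
uses:

```rust
use std::fs::File;
use std::io::BufReader;

use flate2::read::GzDecoder;
use quick_xml::events::Event;
use quick_xml::Reader;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Chain a gzip decoder onto the file handle so the decompressed XML
    // is streamed rather than held in memory all at once. Note that
    // `GzDecoder` reads a single gzip member; flate2 also provides
    // `MultiGzDecoder` for files with several concatenated members.
    let file = File::open("enwiki-latest-abstract.xml.gz")?;
    let mut reader = Reader::from_reader(BufReader::new(GzDecoder::new(file)));

    let mut buf = Vec::new();
    let mut docs = 0usize;

    loop {
        // Pull XML events off the stream one at a time.
        match reader.read_event(&mut buf)? {
            Event::Start(ref e) if e.name() == b"doc" => docs += 1,
            Event::Eof => break,
            _ => {}
        }
        buf.clear();
    }

    println!("counted {} <doc> elements", docs);
    Ok(())
}
```
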
Initially I spent a **lot** of time just trying to get the file loaded. It
seemed that no matter what I tried, I just could not get the file to
deserialize properly. I would look at snippets of the file in `zcat` and it
all _looked_ like valid XML to me. I copied some of the entries out of the
`.xml.gz` file and created a simple test data set, which loaded perfectly
fine, but for whatever reason I could not properly load the entire dataset
generated by Wikipedia. On a hunch I unzipped the file and then re-gzipped it
with the `gunzip` and `gzip` command line tools, and _that_ loaded properly.
Unfortunately, it looks like I had found a [bug in
flate2](https://github.com/rust-lang/flate2-rs/issues/265). Oops.

Once I was able to actually load the Wikipedia data set, the rest of the
exercise went quite smoothly. I needed to refer to the Rust docs quite a bit,
and had to borrow some suggestions from Stack Overflow when it came time to
find the intersection of multiple `HashSet` structs, since Rust's
`HashSet::intersection` only works on two sets at a time.
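
The workaround amounts to folding pairwise intersections across the
collection. Roughly, as a sketch (`intersect_all` here is an illustrative
helper, not a function from the actual code):

```rust
use std::collections::HashSet;

// Intersect an arbitrary number of sets by folding pairwise
// intersections, since `HashSet::intersection` takes exactly one
// other set at a time.
fn intersect_all(mut sets: Vec<HashSet<u64>>) -> HashSet<u64> {
    match sets.pop() {
        Some(first) => sets
            .iter()
            .fold(first, |acc, s| acc.intersection(s).copied().collect()),
        None => HashSet::new(),
    }
}

fn main() {
    let a: HashSet<u64> = [1, 2, 3].iter().copied().collect();
    let b: HashSet<u64> = [2, 3, 4].iter().copied().collect();
    let c: HashSet<u64> = [3, 4, 5].iter().copied().collect();

    let common = intersect_all(vec![a, b, c]);
    let expected: HashSet<u64> = [3].iter().copied().collect();
    assert_eq!(common, expected);
}
```
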
After a couple hours of evening tinkering,
**[goedesearch](https://github.com/rtyler/goedesearch)** was born.

## Performance

As one might expect, converting a Python project to Rust did introduce
performance improvements. Some aspects of the Python code are intentionally
slow in order to make them easier to introduce and consider piece by piece.

To give you an idea of the inherent performance improvements of Rust, just the
act of deserializing a gzipped XML file into memory took 4-5 minutes in my
initial attempt, whereas the Python version took 40 or so. Since I work on an
`aarch64` machine, I wonder how well optimized, or not, Python may be on that
architecture. Once the file was loaded, the search performance of the two
implementations was close enough that it is not really worth discussing.

The inherent Rust performance improvements aside, I have still done some minor
optimizations of my Rust code to further enhance the exercise. For example, my
initial version of the code used quick_xml's [serde](https://serde.rs) support
to [deserialize
entries](https://github.com/rtyler/goedesearch/blob/41cbd31362a91e33208319979bf6c8b634f0dd38/src/main.rs#L261-L272),
but I have since moved away from that for performance reasons. The original
flow was:

1. Deserialize the `.xml.gz` into a `Vec<Article>`.
2. Then iterate through that vector and run the indexing process.

I refactored this code to use a more streaming approach, where each
`<article/>` in the data set is deserialized into an `Article` and then
immediately indexed. Collapsing those two steps into one sped things up, but
the whole process is still fairly CPU-bound.
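
To illustrate the shape of that change, here is a sketch with stand-in types
rather than the real XML parser (none of these names come from the
goedesearch code):

```rust
// Stand-in types; the real code parses `Article` values out of the
// XML event stream and maintains an inverted index.
struct Article {
    text: String,
}

struct Index {
    terms: usize,
}

impl Index {
    fn new() -> Self {
        Index { terms: 0 }
    }

    fn add(&mut self, article: &Article) {
        self.terms += article.text.split_whitespace().count();
    }
}

fn main() {
    // Pretend this iterator yields articles straight off the
    // decompressed XML stream.
    let parsed = (0..3).map(|i| Article {
        text: format!("article number {}", i),
    });

    // Before: collect everything into a Vec, then index it in a
    // second pass.
    //
    //     let articles: Vec<Article> = parsed.collect();
    //     for article in &articles { index.add(article); }
    //
    // After: index each article as soon as it is deserialized, so the
    // intermediate Vec and its allocations disappear.
    let mut index = Index::new();
    for article in parsed {
        index.add(&article);
    }

    println!("indexed {} terms", index.terms);
}
```
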
My next steps are to explore some crates I have been watching for distributing
the indexing work across multiple cores on the machine. I am optimistic that
the safe concurrency support in Rust will allow me to make minimal changes to
my code to take this indexing operation from dominating a single core to
utilizing every core available. Ultimately, I would love for the
loading/indexing problem to be I/O-bound rather than CPU-bound.
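
One crate in that category is [rayon](https://github.com/rayon-rs/rayon),
though I should stress this is a sketch of the general approach rather than
code from goedesearch. The idea would be to build per-thread partial indexes
and merge them at the end:

```rust
use rayon::prelude::*;
use std::collections::HashMap;

// Build per-thread partial inverted indexes with `fold`, then merge
// them with `reduce`, keeping every core busy.
fn build_index(articles: &[String]) -> HashMap<String, Vec<usize>> {
    articles
        .par_iter()
        .enumerate()
        .fold(HashMap::new, |mut idx: HashMap<String, Vec<usize>>, (id, text)| {
            for token in text.split_whitespace() {
                idx.entry(token.to_lowercase()).or_default().push(id);
            }
            idx
        })
        .reduce(HashMap::new, |mut left, right| {
            for (token, mut ids) in right {
                left.entry(token).or_default().append(&mut ids);
            }
            left
        })
}

fn main() {
    let articles = vec![
        "the quick brown fox".to_string(),
        "the lazy dog".to_string(),
    ];
    let index = build_index(&articles);
    assert_eq!(index["the"], vec![0, 1]);
}
```

The fold/reduce split is the appealing part: each thread mutates only its own
`HashMap`, so no locking is needed until the final merge.
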
----

Implementing goedesearch is a great example of how taking a project in one
language and porting it to another can help teach you more about a domain.
The full-text search engine is really a toy that is not meant for serious use,
but playing is one of the key ways that we learn! Bart suggested checking out
[tantivy](https://github.com/tantivy-search/tantivy) for a more _serious_ Rust
search implementation. The folks behind tantivy also gave a [FOSDEM 2019
presentation](https://archive.fosdem.org/2019/schedule/event/deepdive_tantivy/)
which may be of interest.

I don't have any great plans for implementing a search engine, in Rust or
otherwise, but I'm grateful to have had a fun little project to relax with
this weekend!

Should you want to review it, all of the code can be found [in this
repository](https://github.com/rtyler/goedesearch) on GitHub.