Still learning Rust
It’s been a little over a month since I started by foray into learning the Rust programming language… I’m a professional Python programmer by day so my time to concentrate on learning a new language is pretty limited. I just finished up chapter 8 of “The Rust Programming Language” and the suggested programming tasks are definitely worth doing. The three simple programs that it recommends are:
- Given a list of integers, use a vector and return the mean (the average value), median (when sorted, the value in the middle position), and mode (the value that occurs most often; a hash map will be helpful here) of the list.
- Convert strings to pig latin. The first consonant of each word is moved to the end of the word and “ay” is added, so “first” becomes “irst-fay.” Words that start with a vowel have “hay” added to the end instead (“apple” becomes “apple-hay”). Keep in mind the details about UTF-8 encoding!
- Using a hash map and vectors, create a text interface to allow a user to add employee names to a department in a company. For example, “Add Sally to Engineering” or “Add Amir to Sales.” Then let the user retrieve a list of all people in a department or all people in the company by department, sorted alphabetically.
I implemented solutions to each of these over the course of a few mornings during my “morning coffee”. I recommend reserving some time in the morning before you “go to work” to do something like this. It really has made a difference in my almost immediate effectiveness at work. In the past I used this time to do coding katas or scan stack-overflow for C/C++ questions that would require me to write some code and think. Now I am going through the Rust Book or writing articles like this one. Anyway, I highly recommend doing something like this in the morning.
Calculating statistics for a sequence of integers
The first one was pretty easy. I implemented it as an evaluate
function that returns a simple struct:
pub struct Statistics {
length: usize,
max: i32,
mean: i32,
median: f64,
min: i32,
mode: i32,
unique: usize,
}
The function took a vector of i32
s and returned a Statistics
instance. The learning that I took away from the implementation is that coercion from i32
to usize
to f64
is more rigorous than it is in C. It reminded me very much of doing explicit type transformations in Ada. The requirement for explicitly handling of numeric overflow is a bit tedious but such faults can result in extreme failures such as the explosion of a rocket in the Ariane 5–501 case. Interestingly enough the underlying software in that case was Ada and the failure was caused by “unsafe” operations that were not protected. The linked after-action report is interesting reading if you have a few minutes.
Where I ran into this in the first project was in the calculation of the mean. It was a portion of a loop over the input vector that looked like:
pub fn evaluate(v: Vec<i32>) -> Statistics {
let mut result = Statistics::zero();
for elm in &v {
result.mean += elm;
// more stuff here
}
if vals.is_empty() {
// handle empty input -- also avoid div by zero
} else {
result.mean /= i32::try_from(v.len()).expect("I'm sloppy");
// ...
}
result
}
The need to explicitly use the try_from
method defined by the TryFrom trait was a fun new thing to learn about. I’d like to point out that I was led to the correct solution by the error messages generated by the compiler. Again, the ergonomics of the Rust environment are truly exceptional. I spent the first decade or so of my career as an embedded engineer writing scads of rather funky C and using a dozen or so different compiler toolchains. None of them had useful error messages. Mind you in C, you would have also had to run a static analyzer before you would even see such a possible error. There is a rumor that the earlier versions of C compilers included linting functionality and it was separated out to a separate process. Maybe that was a mistake.
The remainder of the exercise was relatively unremarkable. I’m going to take a second and reflect on a reoccurring theme in my journey so far:
The ergonomics of the Rust environment are truly exceptional
I started thinking about the ergonomics of a programming environment whilst playing with Smalltalk back in college and more so when I learned about Test Driven Development in Smalltalk which would go something like this:
- invoke the new method on an instance — note that the method does not exist anywhere at this point
- the runtime detects the error and pops open the debugger for something like
Instance of MyClass did not understand #someNewMethod
- you press the button to create the new method and an editor is opened
- once the method is created, you can continue
Think about that for a moment. Instead of creating the methods that you expect to need when you write the code, you simply call them and create them when they are needed. It is a very different process for writing code. But Smalltalk is very different in that way.
The existence of a Read-Evaluate-Print-Loop (REPL) in Python, Ruby, and other interpreted languages seriously affected the ergonomics of the environment. I learned and still learn Python interfaces by installing a new package, popping open a REPL, importing the module, looking at its exports using help()
, and playing with it. Eventually I will look at the reference but if I can discover how to use the module easily enough then I’m even happier.
Hopefully I’ve painted a picture of what I mean when I talk about the ergonomics of the environment. A lot of thought went into the development of the ergonomic aspect of the Rust programming language … or they got really lucky! I can’t speak to how it got this way but I can say that it is exceptional. The fact that the compiler error messages led me down a path of discovery that resulted in a much greater understanding of the language without buying a book or listening to a talk is quite exceptional.
The last example of this is at the API level. The HashMap interface for lookup requires that you deal with the fact that the key might not exist. The entry
method returns an Entry
enumeration instance which is either Occupied
or Vacant
. The Entry
type includes a method called or_insert
that either returns a reference to the value in the case of Occupied
or inserts a new value for the key in the case of Vacant
and returns a reference to the value.
Let’s look at some code that maintains a HashMap that counts the occurrences of each value:
let count: &mut i32 = occur_map.entry(elm).or_insert(0);
*count += 1;
Here’s what that would look like in Python:
try:
occur_map[elm]
except KeyError:
occur_map[elm] = 0
else:
occur_map[elm] += 1
Mind you that I did not use the defaultdict
class that is offered in the Python Standard Library quite intentionally. Notice that the Rust implementation changes what is a lookup error into a fluent API. I could have written it using get
instead which would look something like:
let count = match occur_map.get(elm) {
Some(&count) => count + 1,
None => 1
};
occur_map.insert(elm, count);
That looks more like the Python example where you concentrate on handling the error instead of using the more elegant Entry API. In full disclosure, I started with something closer to this before I remembered the entry
method that was discussed in The Book.
In retrospect, Python famously tries to have “one obvious way to do it” when it comes to APIs. Maybe having more than one way to do something is not that bad. In this case having two methods to find an element in the map makes the intent quite clear.
Pig-Latin translator
This one required some digging into how characters and strings differ in representation and, more importantly, how to go from one to the other. The largest surprise was that using String
as an accumulator and the push_str
method was easier to read than trying to use a vector and joining on the empty string (which is what I would have done in C++ or Python).
I still do not fully understand/appreciate the difference between str
, String
, and literals. For example, look at the following partial snippet:
fn convert_word(word: &str) -> String {
let mut chrs = word.chars();
match chrs.next() {
Some(first) => {
if first.is_ascii_alphabetic() {
let mut result = String::new();
match first {
'a' | 'e' | 'i' | 'o' | 'u' => {/*...*/},
_ => {
result.push_str(chrs.as_string());
result.push_str("-");
result.push_str(&first.to_string());
result.push_str("ay");
}
};
result
} else { /*...*/ }
}
// ...
The sequence of push_str
calls exposes one of the uglier parts of Rust that I have seen so far. How many ways do I really need to say append this string? Mind you that I am probably doing something very inappropriate here… I am still learning the language.
Simple employee database
This project was quite the endeavor. I decided to model the solution as a HashMap
that maps from department to a list of Employee
structures. The Employee
only had one attribute but I want to play with collections of non-primitive items.
Before I got to the store and retrieve portion of the project, I decided to use a match
statement to parse the incoming commands. It took me a few tries to get the referencing correct. Here’s what I ended up with:
let cmd : Vec<_> = command_str.split_whitespace().collect();
match (cmd.get(0), cmd.get(1), cmd.get(2), cmd.get(3)) {
(Some(&"add"), Some(name), Some(&"to"), Some(department)) => {
db.add_person(*name, *department);
},
(Some(&"remove"), Some(name), Some(&"from"),
Some(department)) =>
{
if db.remove_person(*name, *department) {
println!("Removed {} from {}", name, department);
} else {
println!("{} is not in {}", name, department);
}
},
(Some(&"list"), Some(&"all"), None, None) => {/*...*/},
(Some(&"list"), Some(department), None, None) => {/*...*/},
(Some(&"quit"), None, None, None) => {/*...*/},
_ => {/*...*/},
}
The Vec
has a full type of Vec<&str>
so it contains references into the result of splitting the string which is interesting. If I were writing this in C, then I probably would have modified command_str
to terminate the tokens and keep an array of pointers to the token starts. I assumed that this is what is going on here. After writing Python for a few years, I almost forgot about the amount of thought that efficient referencing requires. This is also why the match targets are str
references instead of literals and the calls to the database methods dereference the parameters.
It took me a few tries before I settled on using split_whitespace
to parse the tokens. I needed to provide a type hint to Rust since it could not infer what I was trying to do with the result of collect
. This taught me a little lesson … sometimes it is better to provide less detail. I tried to use Vec<String>
and other such things to no avail. Once I told Rust that I wanted a Vec
it took care of filling in the rest — in other words, Vec<_>
was the minimal amount of information that Rust needed.
Before I get into the implementation behind add, remove, and the other methods, I want to mention the usage of match
as a command parser. My first inclination was to write a quick purpose-built top-down parser like:
if cmd.len() == 1 && cmd[0] == "quit" {
break;
} else if cmd.len() == 2 && cmd[0] == "list" {
if cmd[1] == "all" {
db.list_employees();
} else {
db.list_department(&cmd[1]);
}
} else if cmd.len() == 4 && cmd[0] == "add" && cmd[2] == "to" {
todo!("add an employee");
} else if cmd.len() == 4 && cmd[0] == "remove" && cmd[2] == "from" {
todo!("remove an employee");
} else {
todo!("show help");
}
It is compact and easy to read if you’ve spent a few years staring at C++. After all, switch
in C++ isn’t capable of dispatching based on strings. I do like the expressiveness of the match
statement in this case. Much cleaner as long as you don’t include too much code in the branches.
My second attempt was to use an enum
with the functionality attached as methods. The application looked like:
enum Command {
AddEmployee {name: String, department: String},
RemoveEmployee {name: String, department: String},
ListEmployees,
ListDepartment {department: String},
Exit,
NoMatch {command: String},
}enum Result {
Exit,
ShowHelp,
Ok,
}impl for Command {
fn parse(command: str) -> Self {
todo!("parse");
}
fn action(&self) -> Result {
match self {
Command::Exit => Result::Exit,
Command::NoMatch {command} => {
println!("{} is not a valid command");
Result::ShowHelp
},
_ => { todo!() },
}
}
}fn main() {
loop {
let cmd = Command::parse(prompt("Enter command: "));
match cmd.action() {
Exit => break,
ShowHelp => todo!("show help"),
}
}
}
After implementing the majority of the functionality, I decided that the separation between parse
and action
was silly. It felt artificial and required the additional mental leap of mapping Command
values into Result
values. I moved the match logic to the top-level and removed the enums altogether. I like where I landed.
This little project is also where I discovered the todo!()
macro. Another little case of the ergonomics that I keep going on about. The macro panics after printing out the message. This is a very quick way to sketch out the structure of your program and fill it in later. I use raise NotImplementedError(msg)
for this in Python but there is something very nice about todo!(msg)
. Maybe it is the lack of looking like an error. It also helps that the language provides it so my editor can “understand” the meaning behind it.
The last part of the project was the implementation of the database-like backend. I implemented add and remove functionality first. This is when I learned about implementing PartialEq
and Eq
for my Employee
struct. It took me more than a little time to figure out that this required an empty implementation of Eq
… I’m still not completely sure why this is. I added the following implementations which made the compiler happy:
impl PartialEq for Employee {
fn eq(&self, other: &Self) -> bool { self.name.eq(&other.name) }
}impl Eq for Employee {}
This seemed pretty clunky to me but I left it alone for the time being. It wasn’t until I was writing this article that I noticed that the comment in the Rust library code mentions using #[derive]
for this. It turns out that I could have replaced the implementations of equality and ordering with:
#[derive(Eq, Ord, PartialOrd, PartialEq)]
struct Employee {
name: String,
}
The project description required that the output of the two list commands is sorted. Vec
has a sort method so I was going to use that. The “add” implementation was simple. I used the same map.entry(key).or_insert(Vec::new())
and added the new employee instance. The “remove” method was a little more interesting. I tried a number of different approaches before I landed on:
self.teams
.entry(department)
.and_modify(|members| {
if let Some(index) = members.iter().position(
|e| e.name == name)
{
members.remove(index);
}
});
I really wanted to write something like:
if let Occupied(team) = self.teams.entry(department) {
if let Some(index) = team.position(|e| e.name == name) {
team.remove(index);
}
}
But I could not get the mutability markers to work out. I looked at the Entry
API again and noticed the and_modify()
method which is precisely what I needed. Another lesson learned when you run into mutability problems, look for a helper method. The collection implementations tend to have specialized methods to do what you need. In retrospect, this makes sense since the ownership tracking for keys and values in maps is tricky without reference counting.
Once I implemented add and remove, I moved on to displaying the employees in a department. This was a pretty simple loop that printed out the vector contents. Printing out all of the employees was simply another loop over the teams. Since the database methods handled the display of the employees, it was easy enough to collect all of the employees into a temporary vector, sort it, and print each element.
I could have stopped here but I didn’t. I wanted to pull the actual display out of the “database” implementation and offer methods to retrieve the sorted employees for a department or all of the employees sorted. This turned out to be a completely different problem related to ownership since I needed to return an iterator over a temporary vector for the “all employees” case. I looked ahead in the Rust Book and figured out that I needed to link the lifetime of the temporary with the lifetime of the “database”.
fn all_employees<'a>(&'a self) -> Vec<&'a Employee> { /*...*/ }
The <'a>
declares that there is a lifetime marker named a
. The &'a self
in the parameter list associates the lifetime marker with the lifetime of self
which is exactly what I wanted to say — the returned vector contains references to Employee instances that are owned by the database. The return value specification of <&'a Employee>
says precisely this. I guess that all of the 'a
markers have sensible definitions after all. I did end up returning a vector from all_employees
and an Iter
from employees_for_department
. I gave up after trying to figure out how to return an Iter
to a temporary vector. I suspect that moving the lifetime markers around would make this possible.
In retrospect…
I’m still very much enjoying Rust. Thinking about lifetimes and ownership requires a little thought but not too much. I have a feeling that a lot of the stumbling about will resolve once I understand lifetimes and the patterns that are present in the library. For example, it didn’t occur to me to look beyond the simple get
and insert
methods of the map since those are the only things required if you have reference counting in place. Once my brain adjusts to thinking in terms of explicit reference and ownership passing again I can work with this. After all, I managed to learn my way through boost::shared_ptr
, boost::unique_ptr
, boost::ptr_from_shared
, and all of the fun of pointer management in C++. The parallel behavior is not lost on me though.