This post is part of Learn Languages, a series in which I learn different features of programming languages using bioinformatics applications. Note: The purpose of these posts is to reinvent the wheel. Disclaimer: Use the code snippets contained here at your own risk. They have been optimized for maximum enjoyment and learning of the objectives herein. No guarantees regarding optimizations around speed, cleverness, or pedantry are provided.
Objectives
Rust would normally be overkill for a simple application like this. However, remember that the objective is not to design the optimal solution, but to use this application as an excuse to find my way around Rust. I tried to use sensible Rust conventions, but if the borrow checker starts to nag, I will happily use clone and go on with my day (if you know what I mean).
- How to work with files
- How to extract gzipped files
- Function pattern matching
- Pipelines
Implementation
The code for this exercise can be found in this GitHub repository.
use flate2::read::GzDecoder;
use std::collections::HashMap;
use std::fmt;
use std::fs::File;
use std::io::{BufRead, BufReader};
type Attrs = HashMap<String, String>;
// Structs in Rust are quite ergonomic and straightforward.
// Building new structs is also not too bad (see below) for this case.
struct Gene {
chromosome: String,
start: i32,
end: i32,
strand: String,
attrs: Attrs,
}
// Here I used a HashMap to see how to work with keys that are only known at runtime.
// With a stricter GFF spec, we could enforce fields and use a struct.
// However, the GFF specification is very loose, and the attribute set for
// each kind of feature is different, so implementing that would
// make the code unnecessarily complex for the problem at hand: learning.
fn parse_attrs(attrs_line: String) -> Attrs {
let attrs: Attrs = attrs_line
.split(";")
.map(|x| {
let mut split = x.split("=");
let key = split.next().unwrap().to_string();
let value = split.next().unwrap().to_string();
(key, value)
})
.collect();
attrs
}
// impl is like adding methods to a type.
// In this case, it is an associated function used to generate a new instance of Gene.
impl Gene {
fn new(gff_line: String) -> Self {
let parts = gff_line.split("\t").collect::<Vec<&str>>();
let chromosome = parts[0].to_string();
let start = parts[3].parse::<i32>().unwrap();
let end = parts[4].parse::<i32>().unwrap();
let strand = parts[6].to_string();
let attrs_line = parts[8].to_string();
let attrs = parse_attrs(attrs_line);
Self {
chromosome,
start,
end,
strand,
attrs,
}
}
}
struct Region {
name: String,
length: u64,
}
struct Annotation {
gff_version: u16,
regions: Vec<Region>,
genes: Vec<Gene>,
}
// There is a "feature" of Rust that is quite frustrating to people coming from other languages.
// Usually, printing variables of different types works quite similarly across many languages. You
// just use their "print()" (or similar) function, and the language decides how it is going to print
// that variable. In Rust, we have to implement the Display trait and specify how we are going to
// present the data when calling `println!` or similar macros.
// Even though it is a bit cumbersome, it actually makes sense. If different types hold different
// kinds of data, they likely require different ways of representing it. By keeping that logic in
// the Display trait, we make the code easier to reason about, and we don't mix "business logic"
// with representation logic.
// Case in point: when printing a GFF annotation, we don't actually want to print every single gene.
// It is better to display general information about it, and a few genes as an example.
impl fmt::Display for Annotation {
fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
writeln!(f, "GFF Version: {}", self.gff_version)?;
writeln!(f, "Regions: {}", self.regions.len())?;
writeln!(f, "Regions longer than 2 Mb:")?;
for region in &self.regions {
if region.length > 2000000 {
writeln!(f, " {} ({})", region.name, region.length)?;
}
}
writeln!(f, "Total number of genes: {}", self.genes.len())?;
writeln!(f, "First 5 genes:")?;
for (i, gene) in self.genes.iter().enumerate() {
if i >= 5 {
break;
}
let gene_id = gene.attrs.get("ID").map_or("NA", |v| v);
writeln!(
f,
" {}. {}:{}-{} ({})",
gene_id, gene.chromosome, gene.start, gene.end, gene.strand
)?;
}
Ok(())
}
}
// Here, we are using one implementation block to add functions that set values in the fields of
// the struct. It is quite nice to be able to group related functionality in the same
// implementation block. The Ok()/Err() pattern is also quite nice, as it makes validation
// quite intuitive.
impl Annotation {
fn new() -> Self {
Self {
gff_version: 0,
regions: Vec::new(),
genes: Vec::new(),
}
}
fn set_version(&mut self, gff_line: String) {
let version_str = gff_line.split(" ").nth(1).unwrap().to_string();
let version_num = match version_str.parse::<u16>() {
Ok(version_num) => version_num,
Err(e) => panic!("Error parsing version number: {e}"),
};
self.gff_version = version_num;
}
fn add_region(&mut self, gff_line: String) {
let fields = gff_line.split_whitespace().collect::<Vec<&str>>();
let length = match fields[3].parse::<u64>() {
Ok(length) => length,
Err(e) => panic!("Error parsing length: {e}"),
};
let name = fields[1].to_string();
self.regions.push(Region { name, length });
}
fn add_gene(&mut self, gff_line: String) {
self.genes.push(Gene::new(gff_line));
}
}
// Reading files works as one would expect. Of note, the ? operator is quite convenient, as it
// returns the error to the caller if the operation fails.
fn read_file() -> Result<BufReader<GzDecoder<File>>, std::io::Error> {
let file = File::open("../../data/gff3_parsing/Homo_sapiens.GRCh38.114.gff3.gz")?;
let gz_decoder = GzDecoder::new(file);
let reader = BufReader::new(gz_decoder);
Ok(reader)
}
fn main() {
let file_reader = match read_file() {
Ok(file) => file,
Err(e) => panic!("Error reading file: {e}"),
};
let mut annotation = Annotation::new();
for line in file_reader.lines() {
// the match expression reminds me of Elixir's pattern matching. It is very legible, as it
// tells the app what to do in different cases. In this example, it processes a line with
// different functions, depending on information in the line that tells us what data is
// contained in the line.
match line {
Ok(line) => {
match line {
line if line.starts_with("##gff-version") => {
annotation.set_version(line);
}
line if line.starts_with("##sequence-region") => {
annotation.add_region(line);
}
line if line.contains("\tensembl_havana\tgene\t") => {
annotation.add_gene(line);
}
_ => {
// Ignore other lines
}
}
}
Err(e) => panic!("Error reading line: {e}"),
}
}
println!("{}", annotation);
}
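The attribute parser above unwraps every key and value, so a field without an `=` (for example, after a trailing `;`) would panic, and a value that itself contains `=` would be cut short. Below is a minimal sketch of a more tolerant variant. The name `parse_attrs_lenient` and the sample attribute string are made up for illustration; this is one possible approach, not what the repository does.

```rust
use std::collections::HashMap;

type Attrs = HashMap<String, String>;

// Hypothetical alternative to `parse_attrs`: borrows the line instead of
// taking ownership, and skips fields that lack an `=` instead of panicking.
fn parse_attrs_lenient(attrs_line: &str) -> Attrs {
    attrs_line
        .split(';')
        // `split_once` splits on the first `=` only, so a value that
        // contains `=` is kept whole. Fields without `=` yield `None`
        // and are dropped by `filter_map`.
        .filter_map(|field| field.split_once('='))
        .map(|(key, value)| (key.trim().to_string(), value.trim().to_string()))
        .collect()
}

fn main() {
    // Made-up attribute column with an empty field and a field without `=`.
    let attrs = parse_attrs_lenient("ID=gene:X1;Name=EXAMPLE;note=a=b;;broken");
    println!("{:?}", attrs.get("ID"));
    println!("{:?}", attrs.get("note"));
}
```

Silently dropping malformed fields is a design choice that suits an exploratory script; a stricter parser would probably report them instead.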
Notes
I was surprised at how intuitive Rust was. Granted, I am aware that this is because I did not need to worry about the borrow checker, thread safety, or who knows what else. I think that ignoring optimizations and focusing on learning was especially successful in this Rust exercise. If I had kept waiting in tutorial hell until I understood every possible way of maximizing performance in Rust, I would have been stuck there for a long time. By taking the approach in this blog series, I now know a few things about Rust, and that gives me the confidence to bootstrap my learning process with the language.
Language-wise, Rust felt quite ergonomic. The places where the code was separated (implementation blocks, the Display trait, etc.) made the code more readable, and I am sure that translates into more maintainable code. The Ok()/Err() pattern took me a bit to get used to, but it was not that bad. Rust has great tooling, and the LSP takes you very close to the steps you have to take to comply with its patterns.
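To make the Ok()/Err() pattern a bit more concrete, here is a minimal sketch of how the version parsing could hand errors back to the caller instead of panicking. The `VersionError` type and the `parse_version` function are hypothetical names introduced for this example; they are not part of the code in the repository.

```rust
use std::fmt;
use std::num::ParseIntError;

// Hypothetical error type covering the two ways the directive can be wrong.
#[derive(Debug)]
enum VersionError {
    Missing,
    Invalid(ParseIntError),
}

impl fmt::Display for VersionError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            VersionError::Missing => write!(f, "no version number after the directive"),
            VersionError::Invalid(e) => write!(f, "invalid version number: {e}"),
        }
    }
}

// Lets `?` convert a ParseIntError into a VersionError automatically.
impl From<ParseIntError> for VersionError {
    fn from(e: ParseIntError) -> Self {
        VersionError::Invalid(e)
    }
}

// Hypothetical counterpart of `set_version` that returns a Result.
fn parse_version(gff_line: &str) -> Result<u16, VersionError> {
    let raw = gff_line
        .split_whitespace()
        .nth(1)
        .ok_or(VersionError::Missing)?;
    let version = raw.parse::<u16>()?;
    Ok(version)
}

fn main() {
    // The caller decides what a failure means; here we just report it.
    match parse_version("##gff-version 3") {
        Ok(version) => println!("GFF version: {version}"),
        Err(e) => eprintln!("Could not parse version: {e}"),
    }
}
```

With this shape, the decision to stop the program moves to `main`, which can print a message, skip the line, or fall back to a default.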
All in all, I think Rust’s reputation of being “obtuse” or “very hard” to code in is overblown. There are indeed complex coding patterns in Rust, especially when one wants to keep the borrow checker happy while writing concurrent, thread-safe applications. However, the language is very capable of delivering a lot of value without those concepts, and we can build skills towards them as we spend more time with the language. This exercise made me want to deepen my understanding of Rust, and I am excited to see where this journey takes me.