This post is part of Mirlangs, a series where I learn different features of programming languages using bioinformatics applications. Note: The purpose of these posts is to reinvent the wheel. Disclaimer: Use the code snippets contained here at your own risk. They have been optimized for maximum enjoyment and learning of the objectives herein. No guarantees regarding optimizations around speed, cleverness, or pedantry are provided.
Preamble
OK, before anything else, let’s just acknowledge how cool Elixir is! Even though it is not the most common language for bioinformatics, it has a lot of potential for data science in general. This is not to be taken as an expert opinion, as I am using this series to learn Elixir, but I can see how it could be positioned as a good alternative to Python or R for bioinformatics because of its concurrency model, its robustness to failures, and the ease with which it can be used to build scalable applications.
Elixir is a functional language. That explains a bit of how weird it looks at first. The idea is that functions avoid side effects and data is immutable. This means that the data passed to a function is not modified; rather, a new data structure is created. This is very different from Python, where functions often modify the data passed to them in place. Wait, what? Isn’t copying data expensive? Well, yes, but one can argue that in many bioinformatics tasks we create new data at most steps of an analysis, so we are going to copy data anyway. Moreover, because the data is immutable, concurrency becomes a lot easier, as race conditions are way less likely. As a bonus, Elixir is very easy to scale horizontally. This means that we can distribute a computation across multiple machines, and the code would look very similar to concurrent code on a single machine.
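To make the immutability point concrete, here is a minimal standalone sketch (my own toy example, not part of the parser below):
counts = [1, 2, 3]
doubled = Enum.map(counts, fn x -> x * 2 end)
IO.inspect(doubled)  #=> [2, 4, 6]
IO.inspect(counts)   #=> [1, 2, 3], the original list is untouched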
With that out of the way, let’s look at this implementation.
Objectives
I am still learning Elixir, so I will focus more on practicing basic syntax and data structures, and spend less time optimizing or deeply exploring the best features for this task. These are the features of the language that I want to focus on:
- How to work with files
- How to extract gzipped files
- Function pattern matching
- Pipelines
Implementation
defmodule ParseGff3 do
def main(_args) do
ParseGff3.parse()
end
def stream_file(filepath) do
# Cases. In Elixir, we can use function guards to define a function several times, each clause matching a condition in its guard. In this case, I chose a case expression instead. It looks similar to Rust's match expression. For functions with just a few lines, I like this better, as it is less verbose.
case Path.extname(filepath) do
".gz" ->
File.stream!(filepath, [:compressed])
".gff3" ->
File.stream!(filepath)
ext ->
IO.puts("The file #{filepath} requires an .gz or .gff3 extension, not #{ext}")
exit(1)
end
end
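# For comparison, the guard-based alternative mentioned above could look like
# this hypothetical sketch (binary_part/3 and byte_size/1 are allowed in guards):
# def stream_file(path) when binary_part(path, byte_size(path) - 3, 3) == ".gz" do
#   File.stream!(path, [:compressed])
# end
# def stream_file(path), do: File.stream!(path)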
def parse_attrs(attrs_line) do
# List comprehensions. Elixir can do list comprehensions. In this case, `into: %{}` tells Elixir to create a map, and the little pipeline below sends a `{key, value}` tuple into the resulting map.
for attr <- String.split(attrs_line, ";"), into: %{} do
String.split(attr, "=") |> List.to_tuple()
end
end
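# For example, given a typical GFF3 attribute string:
# parse_attrs("ID=gene:ENSG00000186092;Name=OR4F5")
# #=> %{"ID" => "gene:ENSG00000186092", "Name" => "OR4F5"}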
# Having the same function name with different arguments is what other languages call function overloading; in Elixir, these are multiple function clauses. The way it works is similar to a `switch` statement in other languages: the arguments are matched against the structure of the arguments in each function definition, in the order they are defined. If the match succeeds, that clause executes, and if it does not, the next clause is tried. I like this, because it makes function definitions more concise, as we can have a definition for each case.
# Function pattern matching. In Elixir, the `=` sign is not necessarily "equal", but a pattern matching operator. There is a whole discussion about it in several books, and it is outside the scope of this article. For now, what is relevant is that it can be used to declare variables, but also to destructure and structurally match function arguments. For example, `extract_annotation` will receive an argument; if that argument is a list with 9 elements, it will destructure the argument, assign the elements to `chrom`, `_source`, etc., and return a map with those values. If the argument cannot be destructured into 9 elements, the next `extract_annotation` will be matched. In that case, `extract_annotation` just passes the data along.
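# A quick illustration of `=` as a match operator (not part of the parser):
# {a, b} = {1, 2}          # binds a = 1 and b = 2
# [x | rest] = [1, 2, 3]   # binds x = 1 and rest = [2, 3]
# {a, b} = {1, 2, 3}       # raises MatchError, the shapes do not match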
def extract_annotation([
chrom,
_source,
feature_type,
n_start,
n_end,
_score,
strand,
_phase,
attrs
]) do
%{
region: chrom,
feature_type: feature_type,
start: String.to_integer(n_start),
end: String.to_integer(n_end),
strand: strand,
attrs: parse_attrs(attrs)
}
end
def extract_annotation(data), do: data
def parse do
# Pipelines. If you are familiar with bash pipes, this will feel natural. In a nutshell, a pipeline sends the result of the previous function as the first argument of the next function. This is very convenient, as in a lot of data processing tasks, we take the output of one procedure to further process it with different programs.
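# For example, "  ACGT \n" |> String.trim() |> String.downcase() returns "acgt",
# and is equivalent to String.downcase(String.trim("  ACGT \n")).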
Path.expand("../../../data/gff3_parsing/Homo_sapiens.GRCh38.114.gff3.gz", __DIR__)
# Streams. Streams in Elixir are lazy, so they don't load the whole file into memory. Instead, they read the file line by line, and send each line to the next function. This is very convenient for large files, as we don't need to load the whole file into memory. Most functions from the `Enum` module have implementations in `Stream`.
|> stream_file()
|> Stream.filter(fn x -> !String.starts_with?(x, "#") end)
|> Stream.filter(fn x -> String.contains?(x, "ensembl_havana") end)
|> Stream.map(fn x -> String.trim(x, "\n") |> String.split("\t") end)
|> Stream.map(&extract_annotation/1)
# Finishing with `Enum`. Return values from `Stream` functions are lazy: they do not produce any materialized values unless we trigger the computation. In this case, sending the stream to `Enum.to_list/1` forces the computation and returns a list with all the values.
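# e.g. Stream.map([1, 2, 3], &(&1 * 2)) just returns a lazy %Stream{};
# piping it into Enum.to_list/1 runs the computation and yields [2, 4, 6].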
|> Enum.to_list()
end
end
Notes
One step further, with Structs and imports
In this case, we use the functions `stream_file` and `extract_annotation` from the first implementation to do the basic processing as in the first iteration, but we will add the functionality to parse metadata, and use pattern matching with structs to send different data lines to different functions.
# Structs. In Elixir, structs are similar to structs in Rust or Go. Basically, we declare a module with named fields. This is usually not strictly required, but it can give the code more clarity, and it is very useful when matching functions that have to interact with specific structs. In this case, we have two named structs, `GFFfeature` and `RegionHeader`, that will be handled by different clauses of `collapse_regions` depending on what kind of struct is passed to the function.
defmodule RegionHeader do
defstruct [:region, :start, :end]
end
defmodule GFFfeature do
defstruct [:region, :feature_type, :start, :end, :strand, :attrs]
def new(map) do
struct(__MODULE__, map)
end
end
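# For instance (illustrative values only):
# h = %RegionHeader{region: "1", start: 1, end: 248_956_422}
# match?(%RegionHeader{}, h)  #=> true
# h.region                    #=> "1"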
defmodule ParseAllGff3 do
def main(_args) do
parse_file()
end
# Matching functions to the beginning of a string. In order to execute a function clause when a string starts with a specific prefix, we use the pattern `"##" <> rest` in the function signature. The `<>` operator concatenates strings; in a pattern, Elixir tries to reconstruct the argument with the `<>` operator, and the match succeeds only if the first part of the string is `"##"`. As a bonus, the variable `rest` can be used in the body of the function.
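# For example, "##gff-version 3\n" matches `"##" <> rest`, binding
# rest = "gff-version 3\n"; a data line like "1\tensembl_havana\t..." does not
# match, so it falls through to the later parse/1 clauses.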
def parse("##" <> rest) do
[ix | body] = String.trim(rest, "\n") |> String.split()
case ix do
"sequence-region" ->
[region, n_start, n_end] = body
%RegionHeader{
region: region,
start: String.to_integer(n_start),
end: String.to_integer(n_end)
}
"gff-version" ->
[ver | _] = body
%{gff_version: ver}
_ ->
nil
end
end
# List processing. In Elixir, lists can be destructured by using the syntax `[head | tail]`. This will assign the first element of the list to `head`, and the rest of the list to `tail`. Similarly, one can add elements efficiently to a list with the pattern `[new_element | old_list]`. The issue with this is that the new element is added to the beginning of the list, so we have to reverse the list at the end. This pattern is so common that Elixir has optimized it, so there is no big performance penalty for using this approach.
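# e.g. [head | tail] = [1, 2, 3] binds head = 1 and tail = [2, 3], while
# acc = [0 | [1, 2]] builds [0, 1, 2] in constant time; Enum.reverse/1
# restores the original order at the end (see parse_file/0 below).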
def parse("#!" <> rest) do
[ix | body] = String.trim(rest, "\n") |> String.split()
[first | rest] = body
joined = Enum.reduce(rest, first, fn x, a -> a <> " " <> x end)
%{(ix |> String.replace("-", "_") |> String.to_atom()) => joined}
end
def parse(line) do
line
|> String.trim("\n")
|> String.split("\t")
|> ParseGff3.extract_annotation()
|> GFFfeature.new()
end
def collapse_regions(%GFFfeature{} = data, acc) do
put_in(acc.features, [data | acc.features])
end
def collapse_regions(%RegionHeader{} = data, acc) do
new_region = %{data.region => %{start: data.start, end: data.end}}
all_regions = Map.merge(acc.regions, new_region)
put_in(acc.regions, all_regions)
end
def collapse_regions(%{} = data, acc) do
meta = Map.merge(data, acc.metadata)
put_in(acc.metadata, meta)
end
def collapse_regions(nil, acc) do
acc
end
def parse_file() do
parsed_features =
Path.expand("../../../data/gff3_parsing/Homo_sapiens.GRCh38.114.gff3.gz", __DIR__)
|> ParseGff3.stream_file()
# Map and anonymous functions. Anonymous functions can be defined in different ways. In this case, the syntax `&parse/1` means "convert the function `parse` that accepts one argument (`/1`) into an anonymous function", and now we can use it in our pipeline.
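# e.g. Enum.map(["1", "2"], &String.to_integer/1) returns [1, 2], and is
# equivalent to Enum.map(["1", "2"], fn x -> String.to_integer(x) end).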
|> Stream.map(&parse/1)
|> Enum.reduce(%{regions: %{}, features: [], metadata: %{}}, &collapse_regions/2)
Map.update!(parsed_features, :features, &Enum.reverse/1)
end
end
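As a quick sanity check, this is how one could run it from IEx (a hedged sketch: it assumes the modules are compiled and that the compressed GFF3 file exists at the path above):
iex> result = ParseAllGff3.parse_file()
iex> Map.keys(result)
[:features, :metadata, :regions]
iex> result.metadata.gff_version
"3"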
Conclusion
I have to concede that Elixir required me to change a lot of patterns I was used to from other languages. For example, list processing can be a little annoying at first, because I have to think about processing the first element first, and take into account that the last “element” is an empty list, while in Python and other languages I can just iterate over the list. That said, this pattern in Elixir can be an advantage, as it makes recursion, which is optimized in Elixir, easier to implement, and takes advantage of other more advanced features of the language.
The concept of pipes is so ergonomic! It is very easy to chain data transformation functions together, and see the results at each step when building the application. It is also very convenient that I can group pipes together as a function, which makes the code more pleasant to read.
Function overloading with pattern matching is my favorite feature so far! This makes the code very readable: instead of having a lot of `if` statements or `switch` blocks for different data types, I can just write a separate function clause for each type of data. Additionally, pattern matching makes arguments easier to parse and more expressive in the code, as I know how the arguments should look right in the function definition.
The elephant in the room in this application is performance. This implementation runs slower than an equivalent Python script, and Python is already a slow language. However, it is important to note that the purpose here was not to build performant code, but just to get a sense of the basic features of the language. Elixir is more than capable of scaling to large datasets with great performance, and I will be using it in future projects.
All in all, I am very excited to continue developing in Elixir. Once I learn more advanced features, I will be able to use it for production-ready ETL applications, and even server-side bioinformatics tools.