Learn Python's data structures by building a GFF3 parser

This post is part of learn languages. A series where I learn different features of programing languages using bioinformatics applications. Note: The purpose of these posts is to reinvent the wheel. Disclaimer: Use the code snippets contained here at your own risk. They have been optimized for maximum enjoyment and learning of the objectives herein. No guarantees regarding optimizations around speed, cleverness, or pedantry are provided.

Objectives

This exercise will test the following features of python

The output of this implementation will be a list of dictionaries with annotation data. With a file parsed like this, we will be able to more efficiently query the data. Moreover, having all the components of the annotation in data structures will allow other applications to access the parsed data. For example, when building an application that plots GFF features along the genome.

Implementation

The code for this exercise can be found in this Github repository

from pathlib import Path
from gzip import open as gopen


def parse_attrs(attrs_line) -> dict:
    """
    Converts a GFF3 attributes field into a dictionary
    :param attrs_line: GFF3-formatted annotation attributes
    :return: dictionary of attributes
    """
    # List and dictionary comprehensions are powerful tools to easily parse collections of data into
    # dictionary and list data structures. **If** they are short and concise, they are quite easy to read.
    # However, for complex processing logic, use a for-loop
    items = [x.split("=") for x in attrs_line.split(";")]
    ann_raw = {x[0]: x[1] for x in items}
    return ann_raw


def parse_gff3(in_file: Path, source):
    # Working with gzipped files is very similar to working with normal text files.
    # Since both functions use a similar function signature, I can just use the appropriate
    # function with the same arguments, and it will just work
    # (Note, this might be different when working with binary files)
    open_fx = gopen if in_file.suffix == ".gz" else open
    parsed_features = []
    with open_fx(in_file, "rt") as file_h:
        for line in file_h:
            if not line.startswith("#"):
                if source in line:
                    # I can use list destructuring to assign the elements of a list
                    # (the result of the `split` method) to individual variables.
                    (
                        chrom,
                        _source,
                        feature_type,
                        start,
                        end,
                        _score,
                        strand,
                        _phase,
                        attrs,
                        # String manipulation is straight-forward. Most string methods can be chained.
                        # In that way, I can keep adding methods to my parsing functionality and see the result seach time.
                        # This pattern makes string-parsing easy in Python
                    ) = line.strip().split("\t")
                    attrs_parsed = parse_attrs(attrs)
                    annotation = {
                        "region": chrom,
                        "feature_type": feature_type,
                        # Casting variables to other data types is easy. In this case, we can assume that the columns
                        # four and five will always be integers, however, in non-trivial cases, casting variables to
                        # other data types can crash the script or silently assign data to the wrong type.
                        "start": int(start),
                        "end": int(end),
                        "strand": strand,
                        "attrs": attrs_parsed,
                    }
                    # Adding elements to a list is quite easy. However, keep in mind that Python will generate a new
                    # list and copy all elements in the list plus the new element every time a list is updated, which
                    # will have performance issues with long lists. Ideally, if we know the size of the list in advance,
                    # it is ideal to initialize a list with empty items and assign elements to the list rather than
                    # appending new element to it. If the size of the list is not known in advance, several techniques
                    # exist to ameliorate the performance issues that appending to a list bring.
                    parsed_features.append(annotation)
    return parsed_features


if __name__ == "__main__":
    # ensembl_havana are highly curated features. These are the ones we will capture
    annotation_source = "ensembl_havana"
    gff_file = Path("../data/gff3_parsing/Homo_sapiens.GRCh38.114.gff3.gz")
    gff_dict = parse_gff3(gff_file, annotation_source)

Taking it up a notch. Using Python classes

The implementation shown above will return a list of dictionaries that can be queried using functions that traverse the list, and access the features’ annotation dictionaries. This might be all we need for basic uses, but if we want this to be part of a stable and maintainable API, we can use Python classes to encapsulate features, lists of features, and functionality associated with them. In this iteration, we are actually going to use the code from the last section to populate our class attributes. At this point, only basic methods are going to be created. In a future installment, we will implement more functionality to the GFF parser

from parse_gff import parse_gff3
from pathlib import Path
from dataclasses import dataclass
from gzip import open as gzopen

# Python uses exceptions (as opposed to errors), which can be customised, and then raised as any other exception
class BadGffVersion(Exception):
    def __init__(
        self,
        message="GFF is not version 3, or there is no version information in the header of the file",
    ):
        super().__init__(message)

# `dataclass` classes are a very quick and easy way to declare classes that are mostly used to store data
@dataclass
class GffFeature:
    region: str
    feature_type: str
    start: int
    end: int
    strand: str
    attrs: dict


class Gff:
    def __init__(self, gff_file: Path) -> None:
        self.gff_file = gff_file
        self.meta = self.parse_meta()
        self.feats = [
            GffFeature(
                x["region"],
                x["feature_type"],
                x["start"],
                x["end"],
                x["strand"],
                x["attrs"],
            )
            for x in parse_gff3(gff_file, "ensembl_havana")
        ]
        self.n_features = len(self.feats)
        self.n_regions = len({x.region for x in self.feats})

    def parse_meta(self):
        open_fx = gzopen if self.gff_file.suffix == ".gz" else open
        meta_d = dict()
        with open_fx(self.gff_file, "rt") as f:
            for line in f:
                if line.startswith("#"):
                    # One can capture part of a list in a variable as part of a list destructuring.
                    # Here, the first element is assigned to `key`, and the rest to `line_data`.
                    key, *line_data = line.lstrip("##").lstrip("#!").split()
                    key = key.replace("-", "_")
                    if key not in meta_d:
                        # Ternary statements have a different syntax from other languages, but it is not that bad
                        # it goes something like this:
                        # `var_from_ternary = value_if_true if boolean_test else value_if_false`
                        # In python we can break expressions in several lines by enclosing it in parentheses
                        meta_d[key] = (
                            " ".join(line_data)
                            if line.startswith("#!") or "gff_version" == key
                            else [line_data]
                        )
                    else:
                        meta_d[key].append(line_data)
                else:
                    break

        if "sequence_region" in meta_d:
            fixed_regions = [None] * len(meta_d["sequence_region"])
            for ix, region_data in enumerate(meta_d["sequence_region"]):
                region, start, end = region_data
                region_dict = dict(region=region, start=int(start), end=int(end))
                fixed_regions[ix] = region_dict

        meta_d["regions"] = fixed_regions
        del meta_d["sequence_region"]
        if meta_d.get("gff_version", None) != "3":
            # After defining a custom exception, it can be raised like any other exception
            raise BadGffVersion
        return meta_d
    # The `__str__` method is a very easy way to polish the way classes are printed.
    # In this case, running `print(gff_instance)` will show the string below
    def __str__(self):
        info = (
            f"GFF file: {self.gff_file}\n"
            "Summary\n"
            f"{self.n_features} features in {self.n_regions} regions"
        )
        return info


if __name__ == "__main__":
    gff_in = Path("../data/gff3_parsing/Homo_sapiens.GRCh38.114.gff3.gz")
    gff = Gff(gff_in)
    print(gff)

Conclusion

Python is a very easy language to learn and mold to our needs (some would say, to a fault). The data structures available in Python are quite straight-forward to work with, and the OOP-readiness of the language makes it very easy to create objects that model both data and methods that process it, which is quite useful in bioinformatics. Other than being a bit finicky about indentation (in Python one has to indent all nested blocks with the same number of spaces), Python is very easy to write. However, it has some drawbacks. It’s dynamic nature make bugs related to undefined behavior due to using incorrect types are difficult to catch, especially in big projects. There is some progress in type hinting, but in my opinion, it still feels like a work-around, which is parsed by different language servers very inconsistently. Additionally, memory allocation is done automatically, this means that one has to be very careful knowing when an operation returns a copy of an object or if the object is being updated in-place.