Season 20 · 18 Episodes · 1h 7m · 2026

Biopython Fundamentals

v1.87 — 2026 Edition. A comprehensive 18-episode guide to using Biopython for sequence analysis, parsing biological data formats, running BLAST, handling 3D structures, working with phylogenetic trees, and more.

Scientific Computing Bioinformatics
1
Introduction to Biopython & The Seq Object
Discover the foundation of Biopython: the Seq object. We explore how sequence objects differ from standard Python strings and learn how to perform biological operations like reverse complement and translation.
3m 33s
2
Rich Sequence Data: The SeqRecord Object
Learn how to wrap sequences in rich metadata using the SeqRecord object. We cover how identifiers, names, descriptions, and dictionary annotations are stored alongside the raw sequence.
4m 14s
3
Reading and Writing Files with SeqIO
Master sequence file conversions and batch processing with Bio.SeqIO. This episode explains the difference between reading single-record files and parsing multi-record datasets.
3m 21s
4
Extracting Genes with SeqFeature
Dive into the complex world of sequence features. We explain how Biopython represents gene coordinates, strands, and fuzzy locations using the SeqFeature object.
3m 39s
5
Pairwise Sequence Alignment
Learn how to compare two sequences directly using the Bio.Align module. We discuss the PairwiseAligner, substitution scoring, and gap penalties for both global and local alignments.
3m 57s
6
Handling Multiple Sequence Alignments
Transition from pairwise to multiple sequence alignments. This episode covers parsing alignment files with AlignIO and treating alignments like 2D arrays to slice out specific columns.
3m 43s
7
Querying NCBI Databases Programmatically
Automate your literature and sequence searches. Discover how to query NCBI databases using Entrez.esearch and retrieve exact IDs without using a web browser.
4m 04s
8
Running BLAST over the Internet
Trigger remote BLAST searches directly from Python. Learn how to use qblast to send sequences to the NCBI servers and safely save the raw XML results.
3m 28s
9
Parsing Natively: Unpacking BLAST XML
Make sense of complex BLAST outputs. This episode walks through parsing BLAST XML files into native Python objects to extract alignments, High-scoring Segment Pairs (HSPs), and E-values.
3m 32s
10
Navigating 3D Structures with Bio.PDB
Step into three dimensions. We explore the PDB module, parsing macromolecular structures, and understanding the Structure-Model-Chain-Residue-Atom (SMCRA) architecture.
3m 40s
11
Measuring Protein Geometry
Calculate spatial relationships in proteins. This episode covers calculating interatomic distances and using NeighborSearch to find atoms within a specific radius.
4m 00s
12
Phylogenetic Trees in Python
Parse, manipulate, and draw evolutionary trees with Bio.Phylo. We cover reading Newick files, tree traversal, and isolating specific clades.
3m 44s
13
Sequence Motif Analysis
Uncover hidden patterns in DNA. Discover how to create sequence motifs, build Position-Weight Matrices (PWMs), and scan target sequences for transcription factor binding sites.
4m 00s
14
Swiss-Prot and ExPASy Integration
Access the gold standard of protein databases. We detail how to fetch records via Bio.ExPASy and parse dense Swiss-Prot flat files to extract curated protein metadata.
3m 14s
15
Visualizing Genomes with GenomeDiagram
Turn raw GenBank files into publication-quality images. Learn how GenomeDiagram constructs circular and linear genome maps by layering tracks and feature arrows.
3m 29s
16
Population Genetics with Bio.PopGen
Analyze genetic variation across populations. This episode introduces Bio.PopGen to parse Genepop files and easily extract allele frequencies and heterozygosity metrics.
3m 49s
17
Biochemical Pathways with KEGG
Connect the metabolic dots. Learn how to parse KEGG enzyme and pathway records to trace biochemical reactions and chemical compound structures.
3m 56s
18
Cluster Analysis for Gene Expression
Group genes by their behavior. In this final episode, we cover the Bio.Cluster module, applying K-means and hierarchical clustering to microarray expression data.
3m 52s

Episodes

1

Introduction to Biopython & The Seq Object

3m 33s

Discover the foundation of Biopython: the Seq object. We explore how sequence objects differ from standard Python strings and learn how to perform biological operations like reverse complement and translation.

Download
Hi, this is Alex from DEV STORIES DOT EU. Biopython Fundamentals, episode 1 of 18. Treating a DNA sequence as a standard text string works fine until you need its opposite strand. Simple string manipulation in Python falls apart when you realize biological transcription is not just about replacing letters, but about directionality and distinct genetic codes. This is exactly why we use Introduction to Biopython and the Seq Object. At its core, a biological sequence is a string of letters representing molecules like DNA, RNA, or proteins. In Biopython, the central data structure is the Seq object. It wraps standard Python string behavior but attaches biological awareness directly to the data. Because it behaves like a normal string, you can do everything you normally do with text. You can check the length of a sequence. You can count how many times a specific pattern appears. You can slice it to extract a smaller piece. When you slice a Seq object, the result is not a standard text string. It is a new Seq object. This ensures you never accidentally strip away the biological methods when you parse a sequence into smaller fragments. This is the part that matters. Standard strings do not understand biology. DNA is double-stranded, and sequences are read in a specific direction. If you have a sequence of coding DNA and you need the opposite strand, a simple text reversal is biologically incorrect. You must reverse the order of the letters and swap each base for its structural pair. The Seq object handles this with a single reverse complement method, returning a biologically accurate opposite strand. Beyond structure, the Seq object handles the central dogma of molecular biology directly. Consider a short coding DNA sequence. To convert this into messenger RNA, you call the transcribe method on your sequence. Under the hood, this handles the specific letter substitutions, replacing thymine with uracil, and gives you back an RNA sequence. 
If you start with RNA and need the DNA equivalent, a back transcribe method reverses that process. Once you have your sequence ready, you generally want the resulting protein. You call the translate method. This automatically groups the sequence letters into triplets, evaluates them as codons, and returns a new Seq object containing the amino acid sequence. Biology is rarely uniform, and the standard genetic code does not apply everywhere. Organisms like certain bacteria, or organelles like vertebrate mitochondria, read codons differently. The translate method accounts for this through translation tables. Instead of writing custom logic for a mitochondrial sequence, you simply pass the translation table argument to the method. You can provide the official NCBI table number, such as table two for vertebrate mitochondria, or the table name as text. The Seq object recalculates the amino acids using those specific organism rules. The true power of the Seq object is that it prevents you from writing brittle, error-prone text parsing functions for established biological processes. If you want to help support the show, you can search for DevStoriesEU on Patreon. That is all for this one. Thanks for listening, and keep building!
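The operations walked through in this episode can be sketched in a few lines. The short coding sequence below is invented for illustration:

```python
from Bio.Seq import Seq

coding_dna = Seq("ATGGCCATAGTA")  # hypothetical 12-letter coding sequence

# String-like behavior: length, counting, slicing
length = len(coding_dna)
fragment = coding_dna[0:3]  # slicing returns a new Seq object, not a str

# Biological behavior: opposite strand, transcription, translation
opposite = coding_dna.reverse_complement()  # reverse AND swap base pairs
mrna = coding_dna.transcribe()              # thymine (T) becomes uracil (U)
protein = coding_dna.translate()            # standard genetic code
mito = coding_dna.translate(table=2)        # vertebrate mitochondrial code
```

With table 2, the ATA codon is read as methionine rather than isoleucine, so the two translations differ even for this tiny sequence.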
2

Rich Sequence Data: The SeqRecord Object

4m 14s

Learn how to wrap sequences in rich metadata using the SeqRecord object. We cover how identifiers, names, descriptions, and dictionary annotations are stored alongside the raw sequence.

Download
Hi, this is Alex from DEV STORIES DOT EU. Biopython Fundamentals, episode 2 of 18. A string of ten thousand DNA letters is technically a sequence, but analytically, it is useless. Without knowing the species it came from, the molecule type, or the reading quality, you just have raw text. Binding that crucial metadata to the raw sequence is the job of Rich Sequence Data: The SeqRecord Object. The SeqRecord is a container. It takes a basic sequence and wraps it with the identifiers and descriptive information needed to actually use it in a bioinformatics pipeline. Every SeqRecord relies on a few core attributes. The first is dot seq, which holds the actual sequence data itself. Surrounding that are three standard string attributes. Dot id contains the primary identifier, typically an accession number from a public database. Dot name holds a shorter common name or clone identifier. Dot description contains human-readable text explaining what the sequence actually represents. Sometimes three string attributes are not enough. When you need to store complex or custom metadata, you use the dot annotations attribute. This is a standard Python dictionary. It handles record-level metadata. You can store anything here. If you want to track how a sequence was verified, you simply add a new key called evidence to the dictionary and set its value to experimental. Here is the key insight. Not all metadata applies to the entire sequence as a whole. Sometimes you need data assigned to every single letter. This is what dot letter annotations does. It is another dictionary, but it comes with a strict rule. Any list, array, or string you assign as a value in this dictionary must be exactly the same length as the sequence itself. The classic use case is storing Phred quality scores from a sequencing machine. If your dummy DNA sequence is exactly twenty letters long, you can assign a list of twenty integers to the dot letter annotations dictionary under the key phred quality. 
If you try to assign nineteen or twenty-one integers, Biopython will immediately throw an error. You can easily build a SeqRecord from scratch. First, you define a sequence object. Next, you pass that sequence into the SeqRecord constructor. You can pass the identifier, name, and description right there as arguments. Once the object exists, you can access its dictionaries to manually inject your custom evidence strings or your lists of per-letter quality scores. In practice, you rarely build these entirely by hand. You usually get them by parsing files, and Biopython handles different file formats by mapping data to the SeqRecord attributes in specific ways. Consider a standard FASTA file. FASTA is a very basic format. When Biopython reads it, it grabs the first word after the greater-than symbol on the header line and assigns it to the dot id attribute. The dot name attribute is simply given the exact same value as the identifier. The entire header line is stored in the dot description attribute. Because FASTA has no structured metadata, the dot annotations dictionary remains completely empty. GenBank files provide a sharp contrast. They contain rich, structured data. When parsing GenBank, Biopython assigns the accession and version number to the dot id. The locus name goes to the dot name attribute. The definition line maps to the dot description. Crucially, Biopython actively populates the dot annotations dictionary. It extracts the taxonomy details, the molecule type, the data file division, and the list of published references, placing them all neatly into the dictionary under standardized keys. The sequence tells you the underlying biological code, but the SeqRecord object is what ultimately gives that code its context. Thanks for spending a few minutes with me. Until next time, take it easy.
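Building a SeqRecord from scratch looks roughly like this. The identifier, name, and metadata values below are all invented:

```python
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

seq = Seq("ACGTACGTACGTACGTACGT")  # exactly twenty letters
record = SeqRecord(seq, id="AB123456.1", name="demo_clone",
                   description="hypothetical demo record")

# Record-level metadata: a plain dictionary, any keys you like
record.annotations["evidence"] = "experimental"

# Per-letter metadata: the value MUST match the sequence length exactly
record.letter_annotations["phred_quality"] = [40] * 20
# Assigning a list of 19 or 21 scores here would raise an error immediately
```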
3

Reading and Writing Files with SeqIO

3m 21s

Master sequence file conversions and batch processing with Bio.SeqIO. This episode explains the difference between reading single-record files and parsing multi-record datasets.

Download
Hi, this is Alex from DEV STORIES DOT EU. Biopython Fundamentals, episode 3 of 18. Your bioinformatics pipeline crashes midway through a run. You check the logs and realize a downstream script expected a FASTA file, but the upstream tool passed it a GenBank file. You now have to rewrite a parser from scratch just to extract the sequence data. Or, you could use Bio dot SeqIO. SeqIO is the standard Biopython module for sequence input and output. It handles the parsing of various biological file formats so you do not have to write custom regular expressions or string manipulation functions. You provide the file path and the format string, and SeqIO handles the underlying file structure. When reading files, SeqIO gives you two distinct functions, and picking the wrong one is a common source of bugs. The first is read. You use the read function when you are absolutely certain your file contains exactly one sequence record. You pass it the file path and the format, like fasta, and it returns a single sequence record object. If the file is empty, or if it contains two or more records, the read function will immediately raise an exception and stop your script. It is strictly for one-to-one mapping. More often, you deal with files containing many sequences. For this, you use the parse function. Unlike read, parse does not load everything into memory at once. Instead, it returns an iterator. This is the part that matters. An iterator yields one sequence record at a time, which keeps your memory footprint low even if you are processing massive multi-gigabyte genomic datasets. Consider a specific scenario. You have a GenBank file containing data for several orchid species. You want to extract these records, but you only care about sequences that are longer than one hundred base pairs. Furthermore, the next tool in your pipeline requires FASTA format, not GenBank. You can handle the reading, filtering, and format conversion sequentially. 
First, you create an empty list to hold your filtered sequences. Next, you set up a loop using the parse function, pointing it to your orchid GenBank file and specifying genbank as the input format string. Inside the loop, the parse function hands you one sequence record object on each pass. You check the length of the sequence attached to that record. If the length is greater than one hundred, you append the record to your list. If it is shorter, you simply ignore it and move to the next iteration. Now you have a list of valid, filtered records in memory. To save them, you use the write function. The write function requires three parameters: the sequence of records you want to save, the output file path, and the output format string. You pass in your list of filtered orchid records, specify a new file name, and provide fasta as the format string. That is all it takes. SeqIO automatically extracts the relevant identifiers and sequences from your GenBank records and formats them correctly as FASTA text. You do not need to manually build the header lines or format the sequence blocks. By pairing parse with write, you translate between file formats effortlessly. The single most useful takeaway here is that SeqIO acts as a standardized bridge. By forcing all file formats into a universal sequence record structure during the parse step, it completely decouples the format you read from the format you eventually write. That is all for this one. Thanks for listening, and keep building!
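A minimal version of the read-filter-write loop, using a tiny in-memory FASTA file in place of the orchid GenBank data. The record names and lengths are invented:

```python
from io import StringIO
from Bio import SeqIO

# Two made-up records: one short (4 bp) and one long (120 bp)
fasta_data = ">seq1 short record\nACGT\n>seq2 long record\n" + "ACGT" * 30 + "\n"

kept = []
for rec in SeqIO.parse(StringIO(fasta_data), "fasta"):  # yields one record at a time
    if len(rec.seq) > 100:
        kept.append(rec)

out = StringIO()
count = SeqIO.write(kept, out, "fasta")  # returns the number of records written
```

With a real file you would pass a path instead of a StringIO handle, and "genbank" as the input format string for GenBank data; the loop itself does not change.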
4

Extracting Genes with SeqFeature

3m 39s

Dive into the complex world of sequence features. We explain how Biopython represents gene coordinates, strands, and fuzzy locations using the SeqFeature object.

Download
Hi, this is Alex from DEV STORIES DOT EU. Biopython Fundamentals, episode 4 of 18. You are manually slicing a sequence string to pull out a gene on the reverse strand. You calculate the coordinates, reverse the string, complement the bases, and run your script. Half the time, you end up with an off by one error or meaningless data. Trying to do this sequence math by hand is a classic trap. Instead, you should be letting the parser do the work by extracting genes with SeqFeature. A SeqFeature object describes a specific region or landmark on a parent sequence. When you parse a rich file format, Biopython populates your sequence record with a list of these feature objects. Every feature has three main attributes you will use constantly. First is the type attribute. This is simply a string telling you what the feature represents, such as gene, CDS, or mRNA. Second is the qualifiers attribute. This is a dictionary containing all the metadata attached to that specific feature. If you need the gene name, the locus tag, or a note about the translation product, you retrieve it from the qualifiers dictionary. Third, and most importantly, is the location attribute. This dictates exactly where the feature lives mathematically on the parent sequence. It comes in two main forms. A SimpleLocation handles continuous, uninterrupted stretches of sequence. It contains an exact start position, an exact end position, and the strand. The strand is represented as an integer. It is one for the forward strand, and minus one for the reverse strand. Biology often resists being simple. Spliced genes in eukaryotes are broken up into multiple exons across the genome. To handle this, the location attribute will instead be a CompoundLocation. A CompoundLocation is essentially a collection of SimpleLocations joined together. It groups those fragmented exon coordinates and treats them as a single logical unit. Here is the key insight. 
You do not need to look at these location coordinates and write your own Python slice notation to get the sequence data out of the parent record. You use the extract method attached to the feature itself. To use it, you take your SeqFeature object, call the extract method, and pass the full parent sequence as the argument. The feature looks at its own location data and does the heavy lifting for you. If the feature is a SimpleLocation on the forward strand, it slices the sequence exactly as you would expect. If the feature is on the reverse strand, extract automatically handles the slice, generates the reverse complement of that sequence, and returns the biologically accurate result. You do not have to flip strings or remember to complement the bases. It does even more work for you when dealing with a CompoundLocation. If you call extract on a spliced gene, Biopython goes to the parent sequence, extracts every individual exon based on their respective SimpleLocations, automatically reverse complements them if the gene is on the minus strand, and stitches them all together in the correct order. It hands you back one continuous, clean piece of sequence ready for downstream translation or analysis. Relying on the extract method eliminates the manual index math that causes silent data errors in bioinformatics pipelines. I would like to take a moment to thank you for listening — it helps us a lot. Have a great one!
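Here is a sketch of the extract workflow with an invented reverse-strand gene on a made-up parent sequence:

```python
from Bio.Seq import Seq
from Bio.SeqFeature import SeqFeature, SimpleLocation

parent = Seq("ATGGCCATTGTAATGGGCCG")

# Hypothetical gene on the minus strand, covering bases 5 through 18
feature = SeqFeature(SimpleLocation(5, 18, strand=-1), type="gene",
                     qualifiers={"gene": ["demoA"]})

gene_seq = feature.extract(parent)  # slices AND reverse-complements for you
```

For a spliced gene you would build a CompoundLocation from several SimpleLocations; extract would then stitch the exons together in the correct order automatically.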
5

Pairwise Sequence Alignment

3m 57s

Learn how to compare two sequences directly using the Bio.Align module. We discuss the PairwiseAligner, substitution scoring, and gap penalties for both global and local alignments.

Download
Hi, this is Alex from DEV STORIES DOT EU. Biopython Fundamentals, episode 5 of 18. Classic dynamic programming alignment algorithms are notoriously slow when written in pure Python. If you try to process thousands of sequence pairs, your script will stall for hours. Biopython solves this by dropping down to C for the heavy lifting, giving you millisecond performance. This episode is about Pairwise Sequence Alignment using the PairwiseAligner object. To compare exactly two sequences, you instantiate a PairwiseAligner. This single object acts as the configuration engine for the entire process. By default, it performs a global alignment, meaning it forces an end-to-end match between the two sequences. If you only want to find the best matching subsequence hidden somewhere inside a larger string, you switch the aligner mode attribute from global to local. The aligner needs rules to decide what constitutes a good match. You set these rules directly as attributes on the aligner object. You assign a positive number to the match score attribute, and a negative number to the mismatch score attribute. But biological sequences also mutate by inserting or deleting entire chunks of letters, which creates gaps in the alignment. Here is the key insight. Biology strongly favors one single, long deletion over several small, scattered deletions. To model this physical reality, you configure two separate gap penalties. You assign a large negative value to the open gap score to penalize the creation of a new gap. Then, you assign a much smaller negative value to the extend gap score. This mathematical imbalance forces the aligner to group gaps together whenever possible. Once your aligner is configured, you call its align method and pass in your two sequences. The method does not just return a single string or a final number. It returns an iterator of alignment objects. Multiple paths through a dynamic programming matrix can yield the exact same highest score. 
The aligner gives you access to all of these mathematically optimal alignments. Because it returns an iterator, it evaluates lazily. It only computes the next optimal path when you ask for it, which protects your system from memory overloads if highly repetitive sequences generate thousands of equally valid alignments. Consider a concrete example using two short DNA sequences. Sequence one is A C C G T. Sequence two is A C G, meaning it is missing the second C and the final T. First, you create the aligner object. You set the match score to two, and the mismatch score to minus one. You set the open gap score to minus five, and the extend gap score to minus one. Then, you pass both sequences to the align method. When you loop over the returned iterator and print the results, Biopython automatically formats the output. It stacks the sequences visually, drawing vertical lines between matching bases and dashes where a letter is missing. You will see exactly how the aligner placed gap characters in the second sequence to line up the matching letters, maximizing the total score based on the severe open gap penalty you defined. Remember that configuring an aligner is essentially tuning a mathematical model; the output is only biologically meaningful if your match, mismatch, and gap scores accurately reflect the specific evolutionary relationship between your input sequences. That is all for this one. Thanks for listening, and keep building!
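The concrete example from this episode translates directly to code. With these scores, the single internal-gap solutions win out over two scattered gaps:

```python
from Bio import Align

aligner = Align.PairwiseAligner()   # global mode by default
aligner.match_score = 2
aligner.mismatch_score = -1
aligner.open_gap_score = -5         # expensive to start a gap
aligner.extend_gap_score = -1       # cheap to lengthen an existing one

alignments = aligner.align("ACCGT", "ACG")  # lazy iterator of optimal paths
best = alignments[0]
print(best)        # stacked sequences, | for matches, - for gaps
print(best.score)
```

Switching aligner.mode to "local" would instead report only the best-scoring subsequence match.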
6

Handling Multiple Sequence Alignments

3m 43s

Transition from pairwise to multiple sequence alignments. This episode covers parsing alignment files with AlignIO and treating alignments like 2D arrays to slice out specific columns.

Download
Hi, this is Alex from DEV STORIES DOT EU. Biopython Fundamentals, episode 6 of 18. You open an alignment file in a text editor and it looks perfectly structured, like a simple grid. But try to parse out just the third position across fifty different species, and suddenly you are writing nested loops and keeping track of string indices. Handling Multiple Sequence Alignments does not have to be this frustrating. To bring an alignment into your code, you use the Bio AlignIO module. Assuming your alignment file is already generated, you rely on the read function. You pass this function the file path and a string naming the format, like phylip or stockholm. The read function processes the text file and hands you back a MultipleSeqAlignment object. Here is the key insight. The MultipleSeqAlignment object behaves exactly like a two-dimensional array. If you have ever worked with a NumPy matrix, the logic is identical. The rows in this grid are your individual sequences, usually representing different organisms or genes. The columns are the specific nucleotide or amino acid positions aligned across all those sequences. Because it acts like a 2D array, you use standard Python slicing syntax to manipulate both dimensions simultaneously. If you only need to grab a specific sequence, you slice the rows. Accessing index zero of your alignment returns the very first sequence in the file. This row is returned as a standard SeqRecord object, complete with the sequence ID, the description, and the biological sequence data itself. You can also loop over the alignment object directly, and it will hand you each SeqRecord row by row. That handles the sequences. Now for the columns, which is where the 2D logic really saves you time. Suppose you loaded a Stockholm alignment and you want to isolate a highly conserved start region. You only care about the first ten positions across every single species. 
Instead of iterating through fifty sequences and slicing fifty individual strings, you apply a 2D slice directly to the alignment object. You ask for all rows by using a single colon for the first dimension, followed by a comma, and then the slice zero to ten for the second dimension. The syntax is simply open bracket, colon, comma, colon ten, close bracket. This single operation returns a brand new MultipleSeqAlignment object containing only those first ten columns for every sequence. All the original sequence IDs and metadata are preserved automatically in the new, smaller grid. There is a slight difference in behavior depending on how wide your slice is. When you slice a range of columns, Biopython maintains the grid structure and returns an alignment object. But if you extract exactly one single column—say, all rows at column index five—the object returns a single plain text string containing just those characters from top to bottom. This text string makes it exceptionally easy to calculate conservation scores or identify single nucleotide polymorphisms at a specific site without unpacking any objects. You can combine these row and column slices to extract any sub-block you need. You could request rows two through five, and columns fifty through sixty, isolating a specific domain for a subset of species. Treat your alignments as matrices, let the MultipleSeqAlignment object handle the string tracking, and your data extraction code will shrink to almost nothing. That is your lot for this one. Catch you next time!
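The 2D slicing behavior can be shown with a tiny invented three-species alignment loaded from an in-memory handle:

```python
from io import StringIO
from Bio import AlignIO

# Invented alignment: three species, twelve aligned columns
data = ">sp1\nACGTACGTACGT\n>sp2\nACGTTCGTACGA\n>sp3\nACGAACGTACGT\n"
alignment = AlignIO.read(StringIO(data), "fasta")

first_ten = alignment[:, 0:10]  # all rows, columns 0-9 -> a new alignment
column = alignment[:, 5]        # a single column -> a plain string, top to bottom
sub_block = alignment[1:3, 4:8] # rows 1-2, columns 4-7 -> a smaller alignment
```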
7

Querying NCBI Databases Programmatically

4m 04s

Automate your literature and sequence searches. Discover how to query NCBI databases using Entrez.esearch and retrieve exact IDs without using a web browser.

Download
Hi, this is Alex from DEV STORIES DOT EU. Biopython Fundamentals, episode 7 of 18. You need to download ten thousand sequence records. If you write a standard web scraper to hit the NCBI website, your IP address will be blocked before you finish the first hundred. The National Center for Biotechnology Information explicitly forbids automated scraping. To pull data legally and reliably, you need to query NCBI databases programmatically using Bio dot Entrez. Before making any network requests, you have to follow the NCBI API guidelines. First, you must tell them who you are. You do this by setting the Entrez dot email variable to your actual email address. If your script goes out of control and overloads their servers, they will use this email to contact you before permanently cutting off your access. Second, you must respect their rate limits. NCBI restricts unauthenticated users to three requests per second. Biopython handles this automatically by enforcing a delay in the background. However, if you attempt to launch multiple scripts in parallel to bypass this delay, NCBI will detect the abuse and ban your IP. Play by the rules. The workflow for extracting data from Entrez usually requires a two-step process: finding the unique identifiers, and then downloading the actual files. You start by searching with the Entrez dot esearch function. Suppose you want to find the matK gene in orchids. You call esearch and pass it the database name, which is nucleotide, along with your search term, like orchid organism AND matK gene. This search function does not return biological sequence data. It returns an HTTP response that acts like an open text file containing an XML document. Here is the key insight. You do not need to write a custom XML parser or extract strings manually. You take that network file handle and pass it directly into the Entrez dot read function. This parses the XML and translates it into a standard Python dictionary. 
Inside that dictionary, you simply look up the key called IdList. This gives you a Python list containing the exact GenBank identifiers that match your orchid query. Once you have that list of identifiers, you move to the second step: downloading the full records. For this, you use the Entrez dot efetch function. You pass it the same nucleotide database name, along with an identifier from your search results. You also need to specify the format of the data you want back. To get a standard GenBank text file, you set the retrieval type argument to gb and the retrieval mode argument to text. Just like the search function, efetch returns a network stream, not a raw string. Because this stream mimics a local file, you do not need to save the downloaded data to your hard drive first. You can hand the network handle directly to SeqIO dot read, specify that the format is genbank, and it immediately parses the stream into a Biopython SeqRecord object. You now have the sequence and all its biological annotations loaded into memory. When you are done, close the handle to free up network resources. The single most important habit when querying NCBI is treating your API responses as file streams rather than text strings. Passing those streams directly into Entrez dot read or SeqIO dot read avoids unnecessary memory overhead and keeps your automation pipelines perfectly clean. If you find these episodes helpful, you can support the show by searching for DevStoriesEU on Patreon. Thanks for listening, happy coding everyone!
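The two-step esearch-then-efetch workflow can be sketched as a small helper. The email address is a placeholder you must replace, the search term is illustrative, and the function is deliberately not called here because it performs live network requests against NCBI:

```python
from Bio import Entrez, SeqIO

Entrez.email = "you@example.org"  # required by NCBI; use your real address

def fetch_first_match(term, db="nucleotide"):
    """Search an NCBI database, then download the first matching GenBank record."""
    # Step 1: find identifiers matching the query
    handle = Entrez.esearch(db=db, term=term)
    result = Entrez.read(handle)   # parses the XML stream into a dictionary
    handle.close()
    ids = result["IdList"]         # list of matching GenBank identifiers

    # Step 2: fetch the full record and parse the stream directly, no temp file
    handle = Entrez.efetch(db=db, id=ids[0], rettype="gb", retmode="text")
    record = SeqIO.read(handle, "genbank")
    handle.close()
    return record

# record = fetch_first_match("Cypripedioideae[Orgn] AND matK[Gene]")  # network call
```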
8

Running BLAST over the Internet

3m 28s

Trigger remote BLAST searches directly from Python. Learn how to use qblast to send sequences to the NCBI servers and safely save the raw XML results.

Download
Hi, this is Alex from DEV STORIES DOT EU. Biopython Fundamentals, episode 8 of 18. Staring at an auto-refreshing web page while a sequence aligns against a global database is a terrible use of your afternoon. Automating that search in the background means you can do something useful while Python does the waiting. Today we are covering running BLAST over the internet using Biopython. The function you need is called qblast, located in the Bio dot Blast dot NCBIWWW module. It acts as a wrapper around the NCBI BLAST API, allowing you to submit queries directly from your script exactly as you would through their web portal. To trigger a search, qblast requires three mandatory arguments. The first is the blast program you want to execute. For example, if you are comparing a nucleotide sequence against a nucleotide database, you pass the string blastn. If you are working with proteins, you pass blastp. The second argument is the specific database you are searching against. The most common choice for nucleotides is the string nt, which stands for the non-redundant nucleotide database. If you were searching proteins, you would use nr. The third argument is the sequence query itself. Biopython makes this input highly flexible. You can provide a plain string of the raw sequence, a multi-line string formatted as a FASTA record, a Biopython SeqRecord object, or an exact sequence identifier like an NCBI accession number. If you want to run a basic search, you call qblast, give it blastn as your program, nt as your database, and your FASTA string as the query. Python will pause execution here while it negotiates with the NCBI servers, waiting however long it takes for the remote alignment to finish. When the remote server finishes the job, qblast returns a file-like object containing your results. Under the hood, the function requests these results in XML format by default. 
You should ensure it stays that way, because XML is the standard format Biopython relies on for downstream processing. Here is the key insight. Do not immediately pass this returned web handle into a parsing function. Running a remote BLAST search is computationally expensive for the NCBI servers and time-consuming for you. If you parse the live data stream directly from the web handle and your script encounters an error a few lines down, that sequence data is lost from memory. To fix your code and try again, you would have to execute the exact same remote search from scratch, wasting their bandwidth and your time. The correct workflow is to permanently save the raw output immediately. The moment qblast returns the web handle, use standard Python file operations to open a new local file in write mode. Read everything out of the web handle, write it directly into your new local XML file, and then close the web handle. You now have a safe, static copy of your BLAST search sitting on your hard drive. You can open that local XML file tomorrow, next week, or fifty times in a row while debugging your application, without ever pinging the NCBI API a second time. Always decouple your network retrieval from your data processing by writing web responses to disk first. That is all for this one. Thanks for listening, and keep building!
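A minimal sketch of that save-first workflow. The actual qblast call is shown as a comment because it needs network access; an in-memory handle stands in for the web handle here, and the file name is invented for the example:

```python
import os
import tempfile
from io import StringIO

def save_blast_result(handle, path):
    """Read everything from the result handle, write it to disk, close the handle."""
    with open(path, "w") as out:
        out.write(handle.read())
    handle.close()

# The real call described in the episode (network-bound, so shown as a comment):
#   from Bio.Blast import NCBIWWW
#   result_handle = NCBIWWW.qblast("blastn", "nt", fasta_string)
#   save_blast_result(result_handle, "my_blast.xml")

# Offline demonstration: an in-memory handle stands in for the web handle.
xml_path = os.path.join(tempfile.gettempdir(), "my_blast.xml")
fake_handle = StringIO("<BlastOutput>...</BlastOutput>")
save_blast_result(fake_handle, xml_path)

with open(xml_path) as local_file:
    saved = local_file.read()   # safe to re-read as often as you like
print(saved)
```

Once the XML is on disk, every later parse hits the local file instead of the NCBI API.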
9

Parsing Natively: Unpacking BLAST XML

3m 32s

Make sense of complex BLAST outputs. This episode walks through parsing BLAST XML files into native Python objects to extract alignments, High-scoring Segment Pairs (HSPs), and E-values.

Download
Hi, this is Alex from DEV STORIES DOT EU. Biopython Fundamentals, episode 9 of 18. Writing regular expressions to scrape plain-text output files is a bioinformatic rite of passage. It is also an anti-pattern. The moment a search tool updates its visual formatting, your script breaks. To get structural guarantees, you need to rely on structured data. This episode covers Parsing Natively: Unpacking BLAST XML. Biopython handles this using the NCBIXML module within the Bio dot Blast package. Instead of saving your search results as a human-readable text block, you save them as an XML file. When you open this file and pass the handle to the parse function in the NCBIXML module, it does not load a massive text blob into memory. Instead, it returns an iterator. This iterator yields structured objects one at a time, keeping your memory footprint low even if your search returned massive amounts of data. To use this efficiently, you must understand the three-level hierarchy Biopython uses to model the results. The top level is the BLAST Record. One record corresponds to exactly one sequence you submitted as a query. If you submitted a single sequence, the iterator yields one record. If you submitted fifty sequences, it yields fifty records. Inside each BLAST Record, you will find a list of Alignments. An alignment represents a single database hit. It tells you that a specific sequence in the database matched your query, along with metadata about that database entry. Inside each Alignment is the third level: High-scoring Segment Pairs, universally known as HSPs. This is the part that matters. An alignment simply indicates that two sequences are related. The HSP contains the mathematical and structural proof of that relationship. A single alignment can contain multiple HSPs if the sequences share multiple distinct regions of similarity separated by unaligned gaps. Extracting the data means writing three nested loops. First, you iterate over the records in the parsed XML. 
Second, you iterate over the alignments inside each record. Third, you iterate over the HSPs inside each alignment. Once you are inside that innermost loop, you have access to the actual match data. This is where you apply your statistical filters. The most common filter is the e-value, which represents the number of hits of similar quality you would expect to see by chance. You access this via the expect attribute on the HSP object. You write a conditional check: if the HSP dot expect is less than your threshold, say zero point zero four, you process the match. If it is higher, the match is too weak, and you ignore it. For the hits that pass your filter, the HSP object holds the exact alignment strings. The query attribute contains your input sequence with any gaps introduced by the algorithm. The sbjct attribute contains the matched database sequence. The match attribute sits conceptually between them, containing the representation of the alignment, mapping out the exact matches, positive substitutions, or gaps. The real advantage here is stability. By letting the NCBIXML module unpack the file, you stop relying on brittle text scraping and start interacting with defined data objects that map perfectly to the biological reality of records, alignments, and segment pairs. Thanks for listening, happy coding everyone!
10

Navigating 3D Structures with Bio.PDB

3m 40s

Step into three dimensions. We explore the PDB module, parsing macromolecular structures, and understanding the Structure-Model-Chain-Residue-Atom (SMCRA) architecture.

Download
Hi, this is Alex from DEV STORIES DOT EU. Biopython Fundamentals, episode 10 of 18. Proteins are often stored as a flat, one-dimensional string of letters, but they function in a complex three-dimensional reality. Mapping a 1D sequence to a 3D coordinate space requires a strict, predictable hierarchy. Navigating 3D Structures with Bio.PDB provides exactly that map. When you download a file from the Protein Data Bank, it comes as a massive block of text. Historically, this was the dot pdb format. Today, the standard is the dot cif format. Writing your own code to read these text blocks line by line is an exercise in managing edge cases. Bio.PDB solves this by converting the text into an object-oriented tree. To do this, you use either PDBParser or MMCIFParser. You create an instance of the parser, call its get structure method, provide an arbitrary name for your structure, and pass in your file path. What you get back is a Structure object. This Structure object follows a strict data architecture known as SMCRA. That stands for Structure, Model, Chain, Residue, Atom. Every 3D file parsed by Biopython is organized into these five nested levels. The root level is the Structure. A Structure contains one or more Models. If the protein was solved using X-ray crystallography, there is usually just one model. If it was solved using NMR, there might be dozens of models representing different structural fluctuations. Usually, you just grab the first one, which is located at index zero. Inside a Model, you have Chains. Many proteins are complexes made of multiple interacting polypeptide chains. These are typically labeled with single uppercase letters, like chain A and chain B. Inside a Chain, you find Residues. These are the individual amino acids that make up the sequence. This level also holds ligands and water molecules attached to that specific chain. Finally, inside a Residue, you have Atoms. Here is the key insight. 
You can navigate this entire tree using simple iteration or dictionary-style access. If you write a for loop over a chain, it yields residues. If you iterate over a residue, it yields atoms. If you know exactly what you are looking for, you can drill straight down without looping. If you need the spatial coordinates of the alpha carbon in the hundredth residue of chain A, you just stack your lookups. You take your structure object, ask for model zero, then chain A, then residue one hundred, then the atom named CA. Consider a practical scenario: extracting the 3D coordinates of all atoms in a single specific residue. First, you initialize your parser and load the file into a variable. Next, you isolate the target residue. You create a variable and assign it the result of looking up model zero, chain A, and residue fifty from your structure. Now, you write a standard for loop. For each atom in the target residue, you request its coordinates. Biopython returns these as a NumPy array representing the exact X, Y, and Z positions in space. You print the atom name and its coordinate array. In just a few lines of code, you have bypassed thousands of lines of raw text and extracted the exact spatial data you need. The SMCRA hierarchy ensures that every atom in every structure file is found using the exact same logic. That is your lot for this one. Catch you next time!
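Here is a small sketch of that drill-down, assuming Biopython is installed. A two-atom PDB record is embedded as text so the example runs without a download; the structure name "demo" and all coordinates are invented:

```python
from io import StringIO
from Bio.PDB import PDBParser

# A tiny PDB record embedded for illustration; normally you pass a file path.
pdb_text = """\
ATOM      1  N   ALA A  50      11.104   6.134  -6.504  1.00  0.00           N
ATOM      2  CA  ALA A  50      12.560   6.351  -6.541  1.00  0.00           C
"""
parser = PDBParser(QUIET=True)
structure = parser.get_structure("demo", StringIO(pdb_text))

# Drill down the SMCRA hierarchy: model 0 -> chain A -> residue 50.
residue = structure[0]["A"][50]
for atom in residue:
    print(atom.get_name(), atom.coord)   # coord is a NumPy [x, y, z] array
```

The same stacked lookups work on any parsed structure, which is the point of the SMCRA contract.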
11

Measuring Protein Geometry

4m 00s

Calculate spatial relationships in proteins. This episode covers calculating interatomic distances and using NeighborSearch to find atoms within a specific radius.

Download
Hi, this is Alex from DEV STORIES DOT EU. Biopython Fundamentals, episode 11 of 18. If you write a naive nested loop to calculate the distance between every atom in a protein and every atom in a ligand, your script will freeze. Spatial indexing solves this instantly, and that is what we are covering today with Measuring Protein Geometry. The most fundamental geometric operation in Bio dot PDB is finding the distance between two points in three-dimensional space. Biopython overrides the standard subtraction operator for Atom objects. If you take one Atom object and subtract a second Atom object, the result is a float representing the Euclidean distance between them. The unit is always Angstroms. You do not need to extract the coordinates, square them, or take the square root yourself. Just subtract one atom from the other, and you have your exact distance. This subtraction trick is perfect for checking a specific, known interaction, like a single suspected hydrogen bond. But it breaks down when you need to explore unknown proximity. Suppose you have a small molecule ligand bound to a large receptor, and you need to identify every protein residue within a five Angstrom radius of that ligand. A typical protein contains thousands of atoms. Checking every single ligand atom against every single protein atom is computationally punishing. The time complexity grows quadratically, and your script grinds to a halt. To resolve this, Bio dot PDB provides the Neighbor Search module. Instead of comparing pairs one by one, Neighbor Search builds a spatial index. Under the hood, it constructs a KD-tree that maps where everything is in three-dimensional space, allowing for extremely fast proximity queries. You construct a Neighbor Search object by passing it a flat list of all the atoms you want to query against. Usually, you extract every atom from your target structure, cast that generator to a standard list, and hand it to the Neighbor Search constructor.
This initialization step takes a fraction of a second, but it does all the heavy computational lifting of organizing the spatial data up front. Once the index is built, you can query it. The primary search method requires two arguments: a target coordinate, and a radius in Angstroms. When you run the search, it rapidly traverses the spatial tree and returns only the entities located inside that spherical boundary. Here is the key insight. The search does not just return atoms. Because Bio dot PDB relies on the SMCRA architecture, Neighbor Search understands the structural hierarchy. You can pass an optional level parameter to your search query. If you specify the letter R for residue, the index will look at all the atoms within your five Angstrom radius, determine which parent residues those atoms belong to, and return a clean list of unique Residue objects. You bypass the atom level in your results entirely. To map our ligand binding pocket, you follow a simple flow. First, initialize the Neighbor Search with all atoms in the protein chain. Second, iterate through just the atoms in your ligand. For each ligand atom, call the search method with a five Angstrom radius and request the residue level. Finally, collect those returned residues into a Python set to automatically remove any duplicates. You just mapped the binding site in milliseconds. If you need to find internal clashes within a single folded chain, Neighbor Search also includes a search all method. You simply provide a radius, and it returns every single pair of atoms in the entire structure that sit closer together than that specified distance, completely avoiding the nested loop problem. Treating your molecular data as a searchable geographic index rather than a nested list changes your code from slow and structural to fast and spatial. That is all for this one. Thanks for listening, and keep building!
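Both tricks can be shown in a few lines, assuming Biopython is installed. The atoms here are built by hand with invented names and coordinates so the example runs without a structure file:

```python
import numpy as np
from Bio.PDB import NeighborSearch
from Bio.PDB.Atom import Atom

# Hand-built atoms stand in for a parsed structure (all values invented).
def make_atom(name, x, y, z, serial):
    return Atom(name, np.array([x, y, z], dtype=float), 0.0, 1.0, " ",
                f" {name} ", serial, element=name[0])

a = make_atom("N", 0.0, 0.0, 0.0, 1)
b = make_atom("CA", 3.0, 4.0, 0.0, 2)
c = make_atom("O", 40.0, 0.0, 0.0, 3)

# Subtracting two Atom objects yields the Euclidean distance in Angstroms.
print(a - b)   # 5.0, the classic 3-4-5 triangle

# NeighborSearch builds the KD-tree once, then answers radius queries quickly.
ns = NeighborSearch([a, b, c])
nearby = ns.search(np.array([0.0, 0.0, 0.0]), 6.0)   # atoms within 6 A of origin
print([atom.get_name() for atom in nearby])          # N and CA, but not O
```

With atoms that have parent residues, passing level="R" to search would return Residue objects instead, as described above.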
12

Phylogenetic Trees in Python

3m 44s

Parse, manipulate, and draw evolutionary trees with Bio.Phylo. We cover reading Newick files, tree traversal, and isolating specific clades.

Download
Hi, this is Alex from DEV STORIES DOT EU. Biopython Fundamentals, episode 12 of 18. If you have ever looked at a raw Newick file, you know it is just a dense, nested nightmare of parentheses and commas. Writing your own string parser to figure out which species is related to which is a miserable exercise. What you need is a way to load that text directly into a traversable data structure. That is exactly what Bio dot Phylo provides for Phylogenetic Trees in Python. Bio dot Phylo handles parsing, manipulating, and drawing evolutionary trees. The primary entry point is the read function. You pass it a file path and a format string, like newick or phyloxml. The read function consumes that file and returns a single tree object. If your file contains multiple trees, you use the parse function instead to get an iterable sequence of trees. Once you load the file, you work with two distinct components: the Tree object and the Clade object. The Tree object represents the entire structure as a whole. It holds global metadata, such as the name of the tree and whether it is rooted or unrooted. But the actual branching data does not live directly in the Tree object. It lives in Clade objects. A Clade represents a specific node and every descendant that branches from it. The Tree object has a root attribute, which is simply the starting Clade for the entire hierarchy. Every branch off that root is another Clade, which contains its own sub-clades, cascading all the way down to the tips. Navigating this hierarchy is where the module shines. You do not write recursive loops manually. To search the structure, you call the find clades method on your tree. This method searches through all the nested branches and returns an iterable of clades matching the properties you specify. You can search by name, by branch length, or even by a custom evaluation function. 
If you only care about the absolute ends of the branches, representing the current living species, you use the get terminals method. This immediately returns a flat list of the leaf clades. You can also calculate structural metrics instantly. Calling the total branch length method on the tree sums up the lengths of every single branch in the entire structure, giving you the total evolutionary distance. Consider a scenario where you load a massive mammalian evolutionary tree. You want to isolate just the branch containing primates to run a localized analysis. You call find clades and search for the target name Primates. This returns the specific Clade object representing the common ancestor of all primates. Because a Clade naturally contains all its descendants, you now have a complete, isolated subtree. You can pass that primate clade into other analysis functions exactly as you would a full tree. Sometimes you just need to look at the structure to verify your extraction worked. Bio dot Phylo includes a draw ascii function. You pass your tree or your isolated clade to this function, and it prints a plain text representation directly to your terminal. It is not meant for publication, but it gives you immediate visual feedback on the topology without requiring you to configure external plotting libraries. Here is the key insight. Translating nested brackets into native objects means topology becomes an API, allowing you to slice evolutionary histories with standard method calls. If you found this episode helpful and want to support the show, you can search for DevStoriesEU on Patreon. That is all for this one. Thanks for listening, and keep building!
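The whole flow, from reading a tree to isolating a clade, fits in a short sketch. The Newick string is built in memory with invented names and branch lengths; with a real file you would pass its path to Phylo.read:

```python
from io import StringIO
from Bio import Phylo

# A small Newick tree in memory (names and branch lengths invented).
newick = "((Homo:6,Pan:6)Primates:70,Mus:76)Mammals;"
tree = Phylo.read(StringIO(newick), "newick")

tips = [leaf.name for leaf in tree.get_terminals()]
print(tips)                            # the living species at the branch tips

# Isolate the primate subtree by name; the clade carries all its descendants.
primates = next(tree.find_clades(name="Primates"))
print(len(primates.get_terminals()))   # 2

print(tree.total_branch_length())      # sum of every branch: 158.0 here

Phylo.draw_ascii(tree)                 # quick plain-text topology check
```

The primates clade can now be handed to any analysis function exactly like a full tree.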
13

Sequence Motif Analysis

4m 00s

Uncover hidden patterns in DNA. Discover how to create sequence motifs, build Position-Weight Matrices (PWMs), and scan target sequences for transcription factor binding sites.

Download
Hi, this is Alex from DEV STORIES DOT EU. Biopython Fundamentals, episode 13 of 18. If you try to find a transcription factor binding site using a strict string match, you will miss a vast majority of the real biological signals. DNA binding sites wobble, and proteins tolerate variation at specific sequence positions. To find them reliably, you have to stop looking for fixed strings and start looking for probability profiles, which is exactly what sequence motif analysis allows you to do. In Biopython, you handle this with the motifs module. You start with a known set of aligned sequences that represent your binding site. Let us use the classic TATA box as an example. You might have a handful of known short sequences like T A T A A A, T A T A A T, and T A T A A C. You define these as standard Seq objects and pass them as a single list to the create function in the motifs module. Biopython takes this list of instances and packages them into a single motif object. The immediate result of creating this motif object is a count matrix. At every position along your sequence length, Biopython tallies the total occurrences of A, C, G, and T. If you look at the first position of our TATA box examples, the count for T is three, and the count for the other three letters is zero. This matrix is the literal foundation of all downstream analysis, but raw counts have a severe mathematical limitation. If a nucleotide never appears at a certain position in your small reference sample, its probability becomes an absolute zero. Later, when you multiply probabilities to score a sequence, a single zero wipes out the entire score. To fix this vulnerability, you convert the count matrix into a Position-Weight Matrix, or PWM. Here is the key insight. You generate the PWM by applying pseudocounts. A pseudocount adds a tiny baseline fraction to every possible nucleotide at every position. 
It mathematically acknowledges that just because you have not seen a cytosine at position two yet, that does not mean a substitution there is biologically impossible. By smoothing out the zeros, the position-weight matrix turns your raw tallies into robust probability distributions. Probabilities alone are still difficult to use for sequence scanning. You need to know if a stretch of DNA matches your motif better than it matches random background sequence. You handle this by converting your PWM into a Position-Specific Scoring Matrix, or PSSM. The PSSM calculates log-odds scores based on a background nucleotide distribution. It compares the probability of a letter appearing because it is part of your motif against the probability of it appearing purely by random chance. A positive score means the sequence actively looks like your motif. A negative score means it looks like background noise. Once you have your PSSM, you can hunt for unannotated binding sites in a long target sequence. You call the calculate method on the PSSM and pass in your target DNA sequence. This method effectively slides your motif across the entire target, scoring every possible window of that length. It returns an array of numerical scores representing the log-odds value at every sequence position. You iterate over these scores and look for values that cross a defined strict threshold. If the score is high enough, you have likely found a functional TATA box. A strict string match demands perfection, but biological systems operate on probability, and moving from raw counts to scoring matrices captures that reality. That is all for this one. Thanks for listening, and keep building!
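The count-to-PWM-to-PSSM pipeline looks like this in code, assuming Biopython is installed. The pseudocount of 0.5, the target sequence, and the threshold are all arbitrary choices for the sketch:

```python
from Bio import motifs
from Bio.Seq import Seq

# Three known TATA-box instances, as in the episode.
instances = [Seq("TATAAA"), Seq("TATAAT"), Seq("TATAAC")]
m = motifs.create(instances)

# Count matrix: position 0 saw T three times, the other letters zero times.
print(m.counts["T"][0])   # 3

# Pseudocounts smooth away the zeros (0.5 per letter is an arbitrary choice),
# turning raw counts into a position-weight matrix of probabilities.
pwm = m.counts.normalize(pseudocounts=0.5)

# Log-odds against the background distribution gives the scoring matrix.
pssm = pwm.log_odds()

# Slide the motif across a target; one log-odds score per possible window.
target = Seq("GCGTATAATGCGC")
scores = pssm.calculate(target)
for pos, score in enumerate(scores):
    if score > 3.0:   # threshold chosen purely for illustration
        print(pos, round(float(score), 2))
```

The window starting at position 3 is an exact TATAAT and scores strongly positive; background-like windows come out negative.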
14

Swiss-Prot and ExPASy Integration

3m 14s

Access the gold standard of protein databases. We detail how to fetch records via Bio.ExPASy and parse dense Swiss-Prot flat files to extract curated protein metadata.

Download
Hi, this is Alex from DEV STORIES DOT EU. Biopython Fundamentals, episode 14 of 18. GenBank is great when you are working with nucleic acids. But when you need expertly curated, hand-reviewed data on a protein, like its exact active sites or where it crosses a cell membrane, standard sequence databases fall short. You need a specialized protein database, and that brings us to Swiss-Prot and ExPASy integration. Swiss-Prot is a high-quality, manually annotated protein sequence database. ExPASy is the server portal that provides access to it. Biopython connects these two pieces. It uses the ExPASy module to download the data over the internet, and the SwissProt module to parse it into a structured object. First, you need to pull the raw record from the internet. You do this using the ExPASy module, specifically a function called get_sprot_raw. You pass it a Swiss-Prot accession ID. For example, if you want the record for the human hemoglobin subunit alpha, you pass its accession ID, which is P69905. Calling get_sprot_raw with this ID opens a network connection to the server and returns a handle to the raw text data. Once you have that raw text handle, you hand it off to the SwissProt module. You call the read function and pass in your handle. This parses the flat file and returns a Swiss-Prot record object. Here is the key insight. This record object is not just a sequence string. It is a deeply structured container of expert metadata. Because Swiss-Prot is manually curated, the fields on this object are exceptionally rich. You can access the organism attribute to confirm exactly which species the protein belongs to. You can check the sequence_length attribute to get the amino acid count instantly, without needing to measure the sequence string yourself. The most valuable part of the record is the features attribute. This is a nested list containing the known structural and functional domains of the protein.
When you iterate through these features, you will find specific annotations. You will see features detailing the exact amino acid coordinates for active sites, metal binding regions, or transmembrane domains. Instead of running a prediction algorithm, you are reading definitive, laboratory-confirmed coordinates straight from a curated database. In practice, your code flow looks like this. You call the get_sprot_raw function from ExPASy with your accession ID. You take the resulting handle and pass it into the read function from SwissProt. This gives you your record object. From there, you can read the organism, check the sequence_length, or loop through the features list to extract the functional sites. Always remember to close the network handle once you are done reading the data to free up system resources. When you are building pipelines that rely on biological function rather than just sequence similarity, targeting Swiss-Prot guarantees you are basing your logic on verified, human-curated data. That is all for this one. Thanks for listening, and keep building!
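A sketch of the parse step, assuming Biopython is installed. A heavily pared-down Swiss-Prot-style record (truncated to 20 residues; the real P69905 entry is 141 AA) is embedded so the code runs offline; the real fetch via ExPASy is shown as a comment:

```python
from io import StringIO
from Bio import SwissProt

# Real usage, network-bound, as described in the episode:
#   from Bio import ExPASy
#   handle = ExPASy.get_sprot_raw("P69905")
#   record = SwissProt.read(handle)
#   handle.close()

# Minimal offline stand-in for the flat file (fields truncated/invented):
sp_text = """\
ID   HBA_HUMAN               Reviewed;          20 AA.
AC   P69905;
DE   RecName: Full=Hemoglobin subunit alpha;
OS   Homo sapiens (Human).
SQ   SEQUENCE   20 AA;  2139 MW;  0000000000000000 CRC64;
     MVLSPADKTN VKAAWGKVGA
//
"""
record = SwissProt.read(StringIO(sp_text))
print(record.entry_name, record.accessions, record.sequence_length)
for feature in record.features:   # curated sites would be listed here
    print(feature)
```

On the full downloaded record, the features loop is where the active sites, metal binding regions, and transmembrane domains appear.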
15

Visualizing Genomes with GenomeDiagram

3m 29s

Turn raw GenBank files into publication-quality images. Learn how GenomeDiagram constructs circular and linear genome maps by layering tracks and feature arrows.

Download
Hi, this is Alex from DEV STORIES DOT EU. Biopython Fundamentals, episode 15 of 18. When you submit a paper describing a newly sequenced bacterial plasmid, reviewers rarely want to read a plain text table of gene coordinates. They expect a cleanly annotated, colored, circular map that makes the genomic structure obvious at a single glance. Visualizing Genomes with GenomeDiagram is how you generate those maps directly from your sequence data. Before you start drawing, you must ensure the external ReportLab library is installed. GenomeDiagram relies entirely on ReportLab under the hood to calculate the geometry and render the final images. Without it, the graphical export will fail. The architecture of GenomeDiagram is strictly hierarchical. You construct your visualization by nesting objects from the top down. At the absolute top is the Diagram object. This is your blank canvas. Inside the Diagram, you add one or more Tracks. A Track represents a specific plotting area. On a linear map, a Track is a horizontal row. On a circular plasmid map, a Track is a single concentric ring. Inside a Track, you add FeatureSets. A FeatureSet is a logical grouping of sequence data. You might use one FeatureSet for coding sequences and a different FeatureSet for regulatory regions on the exact same ring. Finally, inside the FeatureSets, you place the individual sequence features themselves. Here is the key insight. You do not have to manually calculate angles or pixel coordinates to place your genes. GenomeDiagram reads the start position, stop position, and strand orientation natively from standard Biopython feature objects. To put this into practice, suppose you have a parsed GenBank record for a small plasmid. First, you initialize an empty Diagram. Then, you create a new Track inside it, and a new FeatureSet inside that Track. Next, you iterate over the features in your parsed GenBank record. You check if the current feature type is a CDS, meaning a coding sequence. 
When you find a CDS, you pass it to your FeatureSet to add it to the visualization. During this step, you assign visual styling. You can tell the feature to render as an arrow, which automatically points clockwise or counter-clockwise based on the gene strand. You can also assign distinct colors here. A common trick is to use a simple loop counter to alternate between two colors so that adjacent genes do not visually blend into one massive block. Once your hierarchy is fully populated with data, you instruct the Diagram to draw itself. You call the draw method and set the format to circular. You can define the page dimensions, the background color, and the track scaling. After the drawing is constructed in memory, you simply call the write method, passing a filename ending in either PDF or PNG. ReportLab instantly calculates the correct arc lengths, positions the colored arrows along the circular track, and exports a publication-ready file. The primary advantage of this tiered structure is reproducibility. Because the layout rules are strictly separated from your raw sequence parsing, you can write one script to uniformly style an entire database of plasmids without ever touching a graphic design tool. Thanks for tuning in. Until next time!
16

Population Genetics with Bio.PopGen

3m 49s

Analyze genetic variation across populations. This episode introduces Bio.PopGen to parse Genepop files and easily extract allele frequencies and heterozygosity metrics.

Download
Hi, this is Alex from DEV STORIES DOT EU. Biopython Fundamentals, episode 16 of 18. You have raw field genotype data and you need statistical metrics, but bridging that gap means writing yet another custom script to parse text files. Converting that field data into a format ready for analysis is tedious unless you use a parser built exactly for the standard Genepop format. That is where Population Genetics with Bio.PopGen comes in. Genepop is a standard software package and file format used to calculate metrics like Hardy-Weinberg equilibrium, F-statistics, and linkage disequilibrium. Its text-based input format, usually saved as a dot gen file, is rigid. It relies on strict line breaks, specific keywords, and glued-together integer codes for alleles. Writing your own logic to extract allele frequencies across multiple loci and populations is prone to off-by-one errors. The Bio.PopGen module avoids this entirely by reading the Genepop format directly into a structured Python object. Take a concrete scenario. You are studying two isolated populations of fish. You have typed each fish at three microsatellite loci. Your raw data sits in a standard dot gen file. The top of the file has a title, followed by the names of your three markers on separate lines. Next comes the keyword Pop, marking the start of the first fish population. Below that, each line represents one fish, showing an identifier followed by its alleles at all three loci. After the last fish in that group, another Pop keyword appears, marking the start of the second population. To process this, you import the GenePop module from Bio.PopGen. You open your text file and pass the open file handle to the GenePop dot read function. This function processes the text and returns a single Record object. This is the part that matters. The Record object mirrors the biological hierarchy of your study. 
If you check the loci list attribute on the Record, it returns a simple sequence containing the names of your three microsatellite markers. This confirms the parser successfully read the header. Then, you can look at the populations attribute. This attribute holds a list where each element represents an entire population from your file. Since you have two Pop keywords in your file, this list contains exactly two elements. If you look closely at one of those populations, you find a list of the individuals it contains. Every individual is stored as a pairing of their string identifier and a list of their genotypes. For a diploid organism, a genotype at a single locus is parsed as a pair of alleles. The Genepop text file often stores these alleles as a single combined string of digits, like zero one zero two. The parser automatically splits this into distinct alleles, representing zero one and zero two as separate entities. You do not have to write string slicing logic to pull the maternal and paternal alleles apart. With the file parsed, you have immediate access to marker names, population counts, and individual allele data. You can iterate through the populations to count specific allele frequencies, or pass the Record object straight into Biopython routines that interact with the Genepop software. You can trigger calculations for expected heterozygosity or measure genetic differentiation between your two isolated groups. The value of the Bio.PopGen parser is not just reading a text file, but transforming a flat list of strings into a strict biological hierarchy of loci, populations, and individuals that statistical algorithms demand. That is all for this one. Thanks for listening, and keep building!
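A runnable sketch of the fish scenario, assuming Biopython is installed. The Genepop file is embedded as a string with invented marker names and genotypes; real code would open a dot gen file from disk:

```python
from io import StringIO
from Bio.PopGen.GenePop import read

# A tiny Genepop file: two fish populations typed at three loci.
# Alleles are glued two-digit codes, e.g. 0102 means alleles 01 and 02.
gen_text = """\
Two lake populations, three microsatellite loci
Locus1
Locus2
Locus3
Pop
fishA , 0101 0202 0102
fishB , 0102 0202 0101
Pop
fishC , 0202 0101 0102
fishD , 0101 0102 0202
"""
rec = read(StringIO(gen_text))
print(rec.loci_list)             # marker names from the header lines
print(len(rec.populations))      # 2, one per Pop keyword

name, genotypes = rec.populations[0][0]
print(name, genotypes)           # alleles already split into pairs
```

Note that the parser has already separated each glued code into distinct alleles, so no string slicing is needed.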
17

Biochemical Pathways with KEGG

3m 56s

Connect the metabolic dots. Learn how to parse KEGG enzyme and pathway records to trace biochemical reactions and chemical compound structures.

Download
Hi, this is Alex from DEV STORIES DOT EU. Biopython Fundamentals, episode 17 of 18. You sequenced a gene and found a variant. Knowing the sequence is rarely enough. To understand the actual biological damage, you need to know exactly which metabolic networks the resulting enzyme disrupts. Biochemical Pathways with KEGG provides exactly that map. The Kyoto Encyclopedia of Genes and Genomes maintains extensive databases detailing biological systems, compounds, and reactions. They distribute this data as standard plain text flat files. These files use a specific structure where a keyword sits on the left margin, and the associated data sits on the right. The problem is that data values, like a long list of synonyms or complex pathway names, frequently wrap unpredictably across multiple lines. Writing custom string manipulation to handle that wrapping is fragile. The Bio dot KEGG module exists to handle the formatting quirks and translate those text files directly into predictable Python objects. To extract basic molecular data, you use the Compound dot parse function inside the KEGG module. You pass it an open file handle pointing to a KEGG compound text file. Instead of loading a massive database dump into memory all at once, the parser returns an iterator. Each time you advance the iterator, it yields a single Python object representing one compound. From this object, you can access the entry attribute to get the unique identifier, which typically looks like a capital C followed by numbers. Because compounds frequently have multiple common names in the literature, the name attribute returns a list of strings containing all recognized synonyms. You also get direct access to the molecular formula through the formula attribute. Compounds are just the static materials in your system. Enzymes are what actually drive the reactions. To analyze them, you switch to the Enzyme dot parse function, feeding it a KEGG enzyme flat file.
Just like the compound parser, it yields discrete objects one by one. You check the entry attribute to get the standard EC number, and the name attribute for the list of enzyme designations. This is the part that matters. An enzyme does not operate in a vacuum, and the enzyme object reflects that through its pathway attribute. The parser reads the multi-line text block detailing the enzyme interactions and structures it into a clean list. Each item in that list is a tuple whose final element is the human-readable name of the metabolic pathway, preceded by the database qualifier and the KEGG pathway identifier. You can script a complete mapping process in just a few steps. First, you open your downloaded enzyme data file and pass it to the Enzyme dot parse function. You create a loop to read through the returned records. When you hit a record where the entry attribute matches the EC number of your mutated enzyme, you grab its pathway attribute. You can then iterate over that list of tuples, extracting the pathway names to immediately see if the mutation affects glycolysis, the citrate cycle, or something else entirely. The true power of the KEGG parsers is not just reading text, but converting isolated biological components into interconnected network coordinates, allowing you to trace a localized gene variant all the way to a systemic metabolic failure. If you find these episodes helpful and want to support the show, you can search for DevStoriesEU on Patreon. That is all for this one. Thanks for listening, and keep building!
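That EC-number-to-pathway lookup might be sketched like this. The flat-file excerpt and the target EC number are made up for illustration; in practice you would loop over a full downloaded enzyme file:

```python
from io import StringIO

from Bio.KEGG import Enzyme

# A trimmed KEGG enzyme flat-file record (made-up excerpt for
# illustration); a real file would hold thousands of such records.
KEGG_ENZYME_TEXT = """\
ENTRY       EC 2.7.1.1                  Enzyme
NAME        hexokinase
PATHWAY     PATH: ec00010  Glycolysis / Gluconeogenesis
            PATH: ec00051  Fructose and mannose metabolism
///
"""

target_ec = "2.7.1.1"  # the EC number of our hypothetical mutated enzyme

for record in Enzyme.parse(StringIO(KEGG_ENZYME_TEXT)):
    if record.entry == target_ec:
        # Each pathway entry is a tuple; the last element is the
        # human-readable pathway name, the earlier ones identifiers.
        for *ids, pathway_name in record.pathway:
            print(pathway_name)
```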
18

Cluster Analysis for Gene Expression

3m 52s

Group genes by their behavior. In this final episode, we cover the Bio.Cluster module, applying K-means and hierarchical clustering to microarray expression data.

Download
Hi, this is Alex from DEV STORIES DOT EU. Biopython Fundamentals, episode 18 of 18. You just ran an RNA-seq experiment and now you are staring at an expression matrix with ten thousand rows. Finding meaning in that raw data is impossible until you group those genes into a handful of distinct biological responses. That is exactly what Cluster Analysis for Gene Expression does, using the Bio dot Cluster module. When you analyze microarray or RNA-seq data, you are tracking the activity of genes across different conditions. Take a concrete scenario. You have expression data for one thousand genes measured across four heat-shock time points. You want to identify which genes behave similarly. Perhaps fifty of them spike at hour two and drop at hour four. Genes with matching up and down regulation patterns often share biological pathways or regulatory mechanisms. Finding these cohorts manually across thousands of rows is a non-starter. The Bio dot Cluster module contains a C-based core designed specifically to process this type of matrix data efficiently. Your first step is getting the data into Biopython. Instead of writing custom parsing logic, you pass an open handle to your tab-separated text file to the module's read function. This returns a specialized Record object. This object holds your numerical expression matrix, but it also keeps track of your gene identifiers and the names of your experimental conditions. With the data loaded, you apply a clustering algorithm. The most common approach is K-means, executed using the kcluster function. K-means partitions your items into a predetermined number of groups, represented by the variable k. You pass the data matrix and your chosen k value into the kcluster function. The function then returns an array assigning every single gene to a specific cluster ID, an error value that measures how tightly packed those clusters are, and a count of how often that best solution was found.
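Loading and partitioning such a matrix might look like this minimal sketch, using a tiny made-up dataset in place of real expression data:

```python
from io import StringIO

from Bio import Cluster

# A tiny tab-separated expression matrix in the layout Bio.Cluster's
# read function expects: a header row of condition labels, then one
# row per gene. (Made-up toy data: g1/g2 rise over the heat-shock
# time points, g3/g4 fall.)
MATRIX = (
    "GENE\th0\th2\th4\n"
    "g1\t1.0\t5.0\t9.0\n"
    "g2\t1.2\t5.1\t9.3\n"
    "g3\t9.0\t5.0\t1.0\n"
    "g4\t8.8\t4.9\t1.1\n"
)

record = Cluster.read(StringIO(MATRIX))
print(record.geneid)  # gene identifiers from the first column
print(record.expid)   # condition names from the header row

# Partition the genes into k = 2 groups; npass reruns the random
# initialisation and keeps the lowest-error solution.
clusterid, error, nfound = Cluster.kcluster(record.data, nclusters=2, npass=10)
print(clusterid)  # one cluster ID per gene
```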
Because the algorithm starts with random assignments, you typically run it multiple times and keep the solution that yields the lowest error; the kcluster function automates this through its npass argument, which reruns the random initialization that many times and returns the best result. Here is the key insight. The algorithm groups genes based on a distance metric, and you must choose that metric carefully. If you use standard Euclidean distance, the algorithm groups genes that have similar absolute expression levels. But if you want to group genes by the shape of their expression curve over time, regardless of their starting baseline, you must instruct the kcluster function to use a correlation-based distance metric instead. K-means requires you to guess the number of clusters upfront. If you do not know how many distinct responses exist in your heat-shock data, you can use hierarchical clustering instead, via the treecluster function. Rather than forcing genes into flat buckets, treecluster builds a hierarchy. It finds the two genes with the most similar expression patterns and links them. It then finds the next closest gene or pair and links those, building a branching tree structure called a dendrogram. Once the tree connects all one thousand genes, you can observe the overall structure of the data and slice the branches at whatever level makes biological sense. Since this is the final episode of our Biopython Fundamentals series, I highly recommend opening the official Biopython documentation and trying these clustering functions on a real dataset. You can also visit devstories dot eu to suggest topics you want covered in future series. The ultimate goal of running these algorithms is not just sorting numbers into buckets, but uncovering the shared biological logic that forces those genes to activate together. That is all for this one. Thanks for listening, and keep building!
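As a closing sketch of the distance-metric choice and the hierarchical alternative, again with made-up numbers: g1 and g2 share the same rising curve shape at very different baselines, so a correlation-based distance groups them together where Euclidean distance would not.

```python
import numpy as np

from Bio import Cluster

# Made-up toy matrix: g1/g2 share a rising *shape* at different
# baselines, g3 falls.
data = np.array(
    [
        [1.0, 2.0, 3.0],     # g1: rising, low baseline
        [11.0, 12.0, 13.0],  # g2: same rising shape, high baseline
        [3.0, 2.0, 1.0],     # g3: falling
    ]
)

# dist="e" (Euclidean) groups by absolute level; dist="c" (Pearson
# correlation) groups by curve shape regardless of baseline.
clusterid, error, nfound = Cluster.kcluster(
    data, nclusters=2, dist="c", npass=10
)

# Hierarchical alternative: build the dendrogram, then cut it into
# however many flat clusters make biological sense.
tree = Cluster.treecluster(data, dist="c")
flat = tree.cut(2)  # one cluster ID per gene
print(clusterid, flat)
```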