8 June 2021 – In 1992, renowned biochemist Cyrus Chothia made a back-of-the-envelope calculation that the bewildering number of protein sequences emerging from genome sequencing studies could be grouped into just 1,000 ‘families’. New EMBO Member Alex Bateman was a PhD student in Chothia’s group and happened to be an inveterate collector. “This seemed a very doable number, and I have dedicated my career to building a complete and accurate classification of all proteins,” says Bateman, who is Head of Protein Sequence Resources at EMBL-EBI in Hinxton, UK. “Proteins are the fundamental materials from which all organisms are built, and there are billions of genes coding for them. But they fall into a relatively small number of families, which can be linked through similarities in sequence, structure and function. By understanding each of these families, we can learn a huge amount about molecular machinery.”
Bateman has led the development of the Pfam database, which has grown to nearly 20,000 entries in the past two decades, and other resources such as UniProt, Rfam, and RNAcentral at EMBL-EBI. Ultimately, he wants to create a ‘periodic table’ of protein families. “We will only be able to do experiments on a small fraction of genes and proteins: to understand the functions of those we don’t do experiments on, we need to transfer knowledge from the ones we do,” says Bateman, who also leads a research group that studies bacterial cell surface proteins that mediate host colonization. “Our work can help scientists identify the proteins encoded in a genome sequence, learn how they interact, and understand their function. It can be used for anything from identifying potential vaccine candidates to supporting the development of CRISPR systems.”
At one stage, Bateman feared his mission of classifying all protein families might not be complete before he retires – but he says that thanks to machine learning, the field could be on the cusp of another transformation. “Machine learning presents the opportunity to take a sequence and accurately predict a protein structure, which provides a powerful way of finding similarities,” he says. “But within individual families there is immense complexity, and untold numbers of puzzles remain. Databases and tools provided by EMBL-EBI and others create a road-like infrastructure for biologists to reach the answers more quickly. I am excited and honoured to be elected an EMBO Member. EMBO creates an amazing network and I hope to use my membership to both inspire and learn from both young and established researchers around the world.”