Reflections on integrating bioinformatics into the undergraduate curriculum: The Lancaster experience

Bioinformatics is an essential discipline for biologists. It also has a reputation for being difficult for those without a strong quantitative and computer science background. At Lancaster University, we have developed modules for the integration of bioinformatics skills training into our undergraduate biology degree portfolio. This article describes those modules, situating them in the context of the accumulated quarter century of literature on bioinformatics education. The constant evolution of bioinformatics as a discipline is emphasized, drawing attention to the continual necessity to revise and upgrade those skills being taught, even at undergraduate level. Our overarching aim is to equip students both with a portfolio of skills in the currently most essential bioinformatics tools and with the confidence to continue their own bioinformatics skills development at postgraduate or professional level.


| WHAT IS BIOINFORMATICS?
Most of the readers of this article will probably know the answer to the above question and, if they read further, may wonder why I feel it necessary to offer a potted history of the field. I do this because the main contention of this article is that bioinformatics teaching is in a greater state of flux than other branches of biological science education, and that we can only decide what we need to teach now in bioinformatics by considering what was taught in the past. In the light of these issues, I then present the new curriculum for undergraduate bioinformatics at Lancaster University, outlining how it has developed since 2013 and how I think it is likely to develop into the middle of the next decade.
As the name suggests, bioinformatics might be regarded as anything that can be done on a computer that is of relevance to biology. An occasionally undignified scramble for precedence as the inventor of the word "bioinformatics" was ended by the eventual collective acknowledgment that the first usage was by Hogeweg in 1978. 1 In practice, however, bioinformatics does not have such a wide definition. The first papers to use the word in its modern sense appeared around 1993 or 1994, for instance those by Boguski 2 and Harper, 3 and since then there have been several narrower areas where labor in the field known as bioinformatics has been concentrated. These have varied over the years as funding priorities and intellectual fashions have waxed and waned but, despite this, bioinformatics has been accepted for at least the last two decades as an essential discipline within biology. Consequently, the lack of bioinformatics skills among biology graduates is regularly lamented by both the pharmaceutical industry, which has historically been one of the major career destinations for those interested in bioinformatics, and by UK central government as part of a more general anxiety concerning lack of quantitative skills among British graduates. In the words of one report from 2017: "Data analytics, especially bioinformatics, appear to be particularly vulnerable". 4 National initiatives in the United Kingdom to stimulate "Science, Technology, Engineering and Mathematics" have regularly included development of bioinformatics skills as one of their key goals. 5 However, due to the rapidity of technological advancement in biology and transformation of the field into a "big data" science, 6,7 it has not always been clear exactly what bioinformatics skills need to be developed among biology graduates. 
Prior to the launch of the Human Genome Project (HGP) in 1990, bioinformatics was seen very much as an eccentric alternative occupation for those whose careers as laboratory-based researchers had foundered. Despite roots going back to the 1950s, and a modestly thriving literature, bioinformatics was a backwater of science. Suddenly in the mid-1990s, it became hugely in vogue, and the rebranding of Oxford University Press's journal Computer Applications in the Biosciences as Bioinformatics in 1998 marked a coming-of-age moment. The late 1990s saw the simultaneous mass desertion of academia by bioinformaticians for higher-paid jobs in the pharmaceutical industry (an industry eager to put the data of the HGP to its own use) and the rapid development of 1-year Masters-level courses in bioinformatics by those who remained. The bioinformatics "gold rush" had arrived. For a flavor of the time, see Brass. 8 For a more detailed account, see Leendert den Besten. 9 The turn of the millennium saw the peak of this first wave of bioinformatics. The bursting of the "dot.com" bubble on March 11, 2000, and the further stock market slump following September 11, 2001, confronted many biotech companies with a withdrawal of investor capital and consequent liquidation or hostile merger. These events occurred just as the HGP was drawing to a close and its results were becoming public domain. A bruised pharmaceutical industry began to move away from the analysis of the genome itself ("target discovery") to specific drug design projects on what had been discovered ("target validation" and "lead discovery"). 10 Those sequence analysts who survived the initial financial crash in industry now found themselves elbowed aside by other bioinformaticians specializing in the analysis of three-dimensional protein structures and how these interact with drug molecules: the subdiscipline of computer-assisted drug design, or "docking" (as the drugs "dock" into small crevices in the proteins). Crucially, dockers often had more of a background in chemistry than the molecular biology-trained sequence analysts.
Meanwhile, in academic bioinformatics, attention during the first 5 years of the new millennium turned away from genome sequencing and became oriented toward gene expression analysis using microarrays. 11 Although the major genome projects of the late 90s had been massive undertakings by the standards of previous molecular biology, the advent of microarray genomics and other "-omics" technologies, such as proteomics, brought bioinformatics for the first time into the territory of a "big data" science. Omics practitioners, confronted with the problem of making sense of all their data, reached out to the biochemical discipline of metabolic control theory, which had for many years been wrestling with the problems of how to model far smaller-scale biochemical networks. The result was the birth of "systems biology", 12 and an influx of statisticians, mathematicians, and computer scientists into biology. For a short while, it seemed as if most academic bioinformaticians were intent on rebranding themselves as systems biologists or systems bioinformaticians. Network analysis tools became the new center of attention. However, just as this new mainstream in bioinformatics was becoming established, it was once again undermined, not this time by market forces and international politics, but by technological developments.
In the late 1990s, while the HGP was still underway, novel sequencing technologies began to be developed, with an eye to faster and cheaper sequence analysis on a grand scale: "deep sequencing." Many of these technologies were highly innovative and initially beset with multiple technical and engineering problems. However, by the end of the first decade of the twenty-first century, these difficulties began to be solved and deep sequencing entered the research mainstream. 13,14 Even microarray analysis, although barely a decade old, began to be edged out by deep sequencing-based transcriptomics as the preferred method for studying gene activity. 15 As the third decade of the present century approaches, another technological shift is underway, as long-read sequencing technologies begin to edge out the short-read technologies of the first wave of deep sequencing. 16 Table 1 summarizes the rapid development of bioinformatics during this time, identifying the main trends in molecular biology and how they have impacted bioinformatics. It is evident that anyone trained in bioinformatics in the 1990s or even in the 2000s will be seriously in need of a refresher course. Table 1 also demonstrates how bioinformatics has always been both a discipline that creates new software and one in which that software is put to use. Those who wish to have a career as bioinformaticians need to learn how to write computer programs and, furthermore, to be prepared to learn new computer languages every few years as these are adopted into the field. Bioinformatics has benefitted over the years from influxes of computer science graduates, particularly at times of transition, for example, when microarrays, systems biology, or deep sequencing made their first appearances, each with a whole raft of new problems to be solved. Not all bioinformaticians, however, are full-time software developers.
Many spend most of their time using existing software tools to analyze data produced in the lab, and need to know only enough programming to be able to organize their data workflows. This distinction between the "pure" bioinformaticians engaged in software development, and the "applied" bioinformaticians engaged in data analysis is often based on undergraduate degree background: computer scientists being the former and biologists the latter. Teaching bioinformatics in a mixed-background Masters-level course often feels like a struggle to explain biology to computer scientists while simultaneously explaining computing to biologists. The focus of this article, however, is on teaching bioinformatics to biology undergraduates. This is a narrower remit, but one which presents its own challenges. Table 1 may also be read as an exercise in the bioinformatics subdiscipline of "workbenching," the heyday of which happened around the turn of the millennium. Workbenchers focused on defining a minimum toolkit for bioinformatics, a suite of "must-have" programs. For an example of this approach, see Baker et al. 49 Workbenchers saw their contribution as helping other bioinformaticians to adopt common working methods and shared tool sets, to make starting out in the field easier and to encourage reproducibility and sharing of results. The peak of the field was achieved with the release of Bio-Linux, 50 which provided in a single download an entire bioinformatics-oriented operating system preinstalled with hundreds of tools. After the appearance of Bio-Linux, workbenching evaporated as an area of research interest. However, since the last update of Bio-Linux was version 8 in 2014, the necessity for workbenching studies is beginning to arise once more. In applying a workbench ethos to bioinformatics curriculum development, I follow in the footsteps of Greene and Donovan. 51 Before describing this in detail, I shall briefly review previously published bioinformatics curricula and discuss the philosophy behind them.

| THE EMERGENCE OF BIOINFORMATICS CURRICULA
[Table 1, software entries recoverable from the extracted text: BLAST, 20 Artemis, 21 MEGA, 22 DAMBE, 23 EMBOSS, 24 ACEDB, 25 DNASP, 26 PAML, 27 Simplot, 28 RasMol, 29 HMMER, 30 Pfam, 31 GeneWise 32 Jalview, 33 BioConductor, 34 Cytoscape, 35 Chimera, 36 Swiss-Model 37 BWA, 38 Bowtie, 39 TopHat, 40 DataMonkey, 41 Galaxy 42]
Although, as mentioned above, bioinformatics in its modern sense was well underway by the mid-90s, it took a while for articles on bioinformatics curriculum development to be written. Altman's 1998 paper 52 was the first. Many of these initial efforts were possibly responses to the ad hoc nature of the first bioinformatics Masters courses during the 90s "gold rush" era, and the need to inform universities where there were no actual bioinformaticians among the staff, about what was needed if their graduate product was to be fit for purpose in industry. One early influential paper by Hughey and Karplus 53 reviewed the experience of the first 5 years of undergraduate bioinformatics teaching at University of California, Santa Cruz, culminating in a degree major in the subject. Dubay et al. 54 were the first to describe a Masters curriculum. One of the most striking things in these pioneering papers is their description of the heavy mathematics and engineering pre-requisites for entry to the final year of the course, which would exclude most prospective bioinformatics students in the UK. Some curricula were specifically aimed at computer science students 55,56 or emphasized the need for a strong computer science grounding. 57 A second surprising feature is how theoretical the courses are, but it must be remembered that they were constructed in an era when far less bioinformatics software had been written, and the emphasis was on teaching students to program new tools rather than master existing ones. The next few years after Hughey and Karplus's seminal 2001 paper saw a huge surge in similar descriptive and discursive considerations of bioinformatics teaching (e.g., Zadeh 58 ).
Zatz 59 produced something almost equivalent to a "which guide" to bioinformatics courses. A workbenching perspective was represented by Greene and Donovan, 51 and Rustad 60 explored whether special tools are needed for bioinformatics education. Tusch et al. 61 were the first to discuss the technical infrastructure needed to run such a course. Most papers were written from a U.S. perspective, but bioinformatics education became a global phenomenon and Shamsir et al., 62 Tastan Bishop et al., 63 and Richard et al. 64 provided views from other continents. The precursors of today's mixed "Bioinformatics and …" courses also began to appear in the 5 years after the turn of the millennium, and these also became subjects for discussion in the burgeoning bioinfocurricular literature. For instance, see LeBlanc and Dyer 65 on the "Genomics" course at Wheaton College, and Pham et al. 66 on the University of Wisconsin-Parkside's "Molecular Biology and Bioinformatics" undergraduate course. Governmental bodies and professional societies also began to take an interest 67,68 and as early as 2003, discussions began to appear of how to do it all online, 69-72 and for those with no prior experience. 73 One interesting trend 74-77 is to choose to emphasize structural bioinformatics, perhaps with an eye to continued demand for drug development "dockers" within the pharmaceutical industry. At the other end of the spectrum, Wightman and Hark 78 emphasize the positive impact bioinformatics education has on the mathematical skills of biologists otherwise disinclined to numeracy.
Debate concerning which methods really are the best has had to wait for more recent publications, where a variety of education research perspectives have been presented, such as the core competencies approach, 79,80 case study-based learning, 81 peer-assisted and team-based learning, 82-84 and the use of the popular hobbyist 4273pi hardware system. 85 Now bioinformatics education has sufficient scholarly groundwork to be considered a field in its own right and reviews have begun to appear. 86

| THE LANCASTER UNDERGRADUATE BIOINFORMATICS CURRICULUM
The scarcity of bioinformatics provision in the undergraduate curriculum was lamented in 2005 by Hack and Kendal. 87 At Lancaster University, bioinformatics only began to appear in the undergraduate biology curriculum in academic year 2013-2014. In writing about the integration of bioinformatics into the undergraduate curriculum, I follow in the footsteps of various authors. 55,57,58,74,76,78,81-84,88-93 My own efforts to stand on the shoulders of these giants began initially in a single module, BIOL273 DNA Technology. This module had been running for several years and was a techniques-based course focused on teaching second-year undergraduates the basic skills required in gene cloning, polymerase chain reaction and DNA sequencing. To introduce bioinformatics, two of the laboratory sessions were replaced with bioinformatics computer workshops. In the following academic year, bioinformatics content was added to BIOL113 Genetics and BIOL313 Protein Biochemistry, again by removing some of the existing material to make space for bioinformatics workshops. These module contributions constituted the undergraduate bioinformatics component for the academic years 2014-2015 to 2016-2017 inclusive. In academic year 2017-2018, two major changes were introduced: BIOL313 was redesigned and rebranded as Proteins: Structure, Function and Evolution, removing the remnants of classical protein biochemistry from the course to make way for greater bioinformatics content, and a fourth-year course, BIOL445 Bioinformatics, was initiated. This latter course was the first module at Lancaster devoted entirely to bioinformatics. Lancaster University fourth-year modules have a very mixed group of students, divided approximately equally into undergraduates on 4-year extended undergraduate degrees (MSci), postgraduates on taught Masters degrees (MSc) and postgraduates in the first year of a 4-year joint PhD program with the Liverpool School of Tropical Medicine (LSTM).
Many of the last category are medical or veterinary graduates with several years of professional experience. Those in the second category are divided fairly equally between overseas students, often from China, and our own undergraduates who have opted to stay for an MSc after graduation. BIOL445 is also unusual in that the entire content is delivered in a single week, rather than the 5- or 10-week courses normal at Lancaster. The compression is designed to minimize student travel between Lancaster and Liverpool for the joint LSTM PhD students.
Finally, in the academic year 2018-2019, bioinformatics content was withdrawn from BIOL273 DNA Technology, replaced by material on CRISPR and synthetic biology. A new module BIOL275 Bioinformatics was introduced. Just as BIOL445 was the first Lancaster course dedicated entirely to bioinformatics, BIOL275 was the first offered at exclusively undergraduate level. Table 2 summarizes the bioinformatics content of the modules mentioned earlier, and illustrates how the bulk of the bioinformatics delivery at Lancaster takes place in second and fourth years. For the majority of undergraduates who are only on 3-year degrees, bioinformatics is introduced in first year, studied intensively in second year, and then applied to the subject of protein evolution in third year. Those staying for the fourth year receive the same experience as the Masters students. The first 3 years are designed to develop progression from point-and-click internet-focused bioinformatics in first year, through advanced internet-focused bioinformatics and basic Windows stand-alone tool use in second year, to a more advanced command of the tools and their application to a specific problem in protein evolution in the third year. For biochemistry undergraduates, all levels are compulsory. Students from other degree programs are only compelled to enroll for BIOL113 Genetics. This can mean that occasionally students may appear in the third year class without the second year grounding. However, as the tools used within BIOL313 Proteins: Structure, Function, and Evolution are a subset focused on protein evolution, the time required to catch up with the rest of the class is limited. The fourth year partly sits within this learning arc insofar as, for the undergraduates on 4-year degrees, it represents a return from the narrower focus of the third year bioinformatics teaching to the general scope and emphasis on mastery of tools introduced in second year.
However, since postgraduate students of various types must also be catered for in fourth year, some of whom will be complete beginners, a certain amount of crash course introduction must also be delivered in that module. Whether fourth year undergraduates find this a welcome refresher or an annoying distraction largely depends on the extent to which they absorbed their second year course. We therefore deliver bioinformatics across our degree programs as an almost equal mixture of dedicated modules (second and fourth years) and integration (first and third years). Our general trajectory has been away from integration toward dedicated modules, with the removal of bioinformatics from BIOL273 DNA Technology in 2018-2019, and the transformation in 2017-2018 of BIOL313 Protein Biochemistry into a strongly bioinformatics-oriented Proteins: Structure, Function, and Evolution. We therefore do not follow the trend of integrating bioinformatics teaching as a minor component of several modules (e.g., Furge et al., 94 or for an extreme example the integration of bioinformatics into 10 courses at University of Wisconsin-La Crosse 95 ). Table 3 summarizes the software training in our two applications-based modules.

| TECHNICAL DELIVERY OF TEACHING AND LEARNING
Likić 91 emphasized the introduction of programming skills and the need to go beyond "internet bioinformatics." My own experience at Lancaster (and in previous bioinformatics teaching in Glasgow) is that teaching biology students a programming language from scratch requires more time than is available. Within a dedicated Masters course on bioinformatics, programming is of course essential, and several languages need to be mastered (Table 1), even if only those currently in vogue are chosen. However, a decision not to include programming skills in undergraduate bioinformatics need not confine us to internet-focused techniques. The large quantity of open-source or closed-but-free tools in the field means that there is ample scope for developing expertise that goes beyond simple knowledge of the best bioinformatics websites (although that is important and is included in first- and second-year teaching). Lancaster University deploys AppsAnywhere (https://www.appsanywhere.com) as an interface to deliver a large range of software to all Windows PCs fully connected to the university network, including computer lab PCs, staff office PCs, and the personal devices of students. Lancaster University is a Windows-only desktop environment, which precludes the deployment of some popular classic Macintosh applications such as MacClade. 104 We use VMware Horizon (https://www.vmware.com/uk/products/horizon.html) to deliver a virtual Bio-Linux server. The Bio-Linux file system is shared with Windows, allowing students to work on the same files within both Windows and Bio-Linux (cf. Floriano 105 ).

| EVOLUTION OF LEARNING OBJECTIVES AND ASSESSMENT METHODS OVER TIME
The extensive changes to course content and delivery described earlier have also necessitated change in the learning objectives over the years. At Lancaster, a cascade system of learning objectives is used, starting with overarching objectives for degree programs, then devolving more specific learning objectives to each module, with the bottom level consisting of detailed objectives for each teaching session. Approval of new teaching, or of changes to existing teaching, is governed at the module level. Consideration of learning objectives for bioinformatics teaching at Lancaster must therefore take account of the fact that first- and third-year teaching are embedded within modules (BIOL113 Genetics and BIOL313 Proteins: Structure, Function, and Evolution) where most or some of the content, respectively, is not bioinformatics, and therefore the learning objectives must be congruent with the broader aims of the module. With the modules entitled Bioinformatics (BIOL275 and BIOL445), there is considerably more room to specify relevant learning objectives in more detail. See "Data Availability Statement" below for a link to the handouts for the various courses on which lists of learning objectives may be found. These have varied from year to year as the emphasis of teaching has evolved. To give one particular example, in BIOL113 Genetics the 2014-2015 bioinformatics content covered recognition of common sequence formats, retrieval of sequences from GenBank, BLAST searching, multiple alignment, and phylogenetic tree building in MEGA. These session-specific detailed learning objectives report upwards to the module learning objectives for BIOL113, among which are two bioinformatics-focused objectives: (1) to become aware of bioinformatics as a discipline and (2) to be able to perform a set of basic bioinformatics techniques.
The specific bioinformatics workshop content in BIOL113 has changed on two occasions since 2014-2015, requiring adjustments to the detailed sessional learning objectives but without any need to change the overarching module-level objective pertinent to the bioinformatics content. Similar adjustments have been made to BIOL313 over the years, changing sessional learning objectives while maintaining relevance to those of the module as a whole. In the dedicated bioinformatics modules, by contrast, module-level learning objectives often appear directly at sessional level, sharpened or elaborated as appropriate.
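The first of the BIOL113 skills listed above, recognition of common sequence formats, lends itself to a short illustration. The following Python sketch is not taken from the Lancaster course materials, which do not require programming at undergraduate level. It simply assumes that a file's format can usually be guessed from the start of its first non-blank line. The function name `guess_sequence_format` and the `SIGNATURES` table are hypothetical, and the list of formats is deliberately incomplete.

```python
"""Guess the format of a sequence file from its first non-blank line."""

# Leading text that conventionally opens each format (illustrative subset).
SIGNATURES = [
    (">", "FASTA"),          # header line of a FASTA record
    ("@", "FASTQ"),          # header line of a FASTQ read
    ("LOCUS", "GenBank"),    # first line of a GenBank flat file
    ("ID   ", "EMBL"),       # first line of an EMBL flat file
    ("CLUSTAL", "Clustal"),  # Clustal alignment header
    ("#NEXUS", "NEXUS"),     # NEXUS file header
    ("#MEGA", "MEGA"),       # MEGA file header
]


def guess_sequence_format(text: str) -> str:
    """Return a format label for the given file contents, or 'unknown'."""
    for line in text.splitlines():
        if not line.strip():
            continue  # skip leading blank lines
        upper = line.upper()  # headers such as #NEXUS are case-insensitive
        for prefix, name in SIGNATURES:
            if upper.startswith(prefix):
                return name
        return "unknown"  # first non-blank line matched no signature
    return "unknown"  # empty input
```

Given the full text of a file, the function returns a label such as "FASTA" or "GenBank". A first-line check of this kind is only a heuristic: it mirrors what a student does by eye, whereas a real parser would validate the whole file.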
Assessment is also governed at the module level (Table 2). BIOL275 Bioinformatics is part of a series of techniques-focused second-year modules, which includes BIOL273 DNA Technology in which bioinformatics was previously taught, that are all assessed via equally weighted multiple-choice test and practical report. BIOL313 Proteins: Structure, Function, and Evolution is assessed via an exam, in which two of the four essay choices are on bioinformatics and students must write one bioinformatics essay, and a practical report, weighted 60:40, respectively. A similar 60:40 exam/report structure is used for BIOL445 Bioinformatics. In the first run of BIOL445, the exam was a mixture of problem-solving questions and essays, but in subsequent runs only essay questions have been used. This change resulted from an observation in the first run of BIOL445 that the marks distribution for problem-solving questions was strongly bimodal, which skewed the overall exam marks distribution away from the bell-curve ideal.

| THE FUTURE OF BIOINFORMATICS TEACHING
The future of bioinformatics teaching is difficult to predict. The only things that can confidently be said are that bioinformatics will continue to be of central importance to biology education in general, and that bioinformatics teaching a decade from now will look very different to that of today. Table 1 provides a guide to what would have been taught in each of what I conjecture to be the five eras of the discipline. Many of the earlier era columns of Table 1 contain software of continued usefulness in the present day, whereas other mentioned software has reached obsolescence (compare Tables 1-3). A particularly rapid turnover is evident in the field of sequence assembly. The decade spent developing tools for short-read deep sequencing assembly, and the corresponding time spent teaching those tools, may soon seem an archaic epoch if the latest long-read sequencing technologies fulfill their initial promise. A movement away from the recent years of intense focus on sequence assembly may produce a situation reminiscent of the early 2000s, with systems biology and the omics field beginning to figure once more as a main research orientation of bioinformatics. What is new now in 2020 that was not around in 2005 is the potential for bringing virtual reality, artificial intelligence, and internet-of-things approaches into bioinformatics. I speculate that the first of these, especially as applied to protein structure and electron microscopy, would seem to be the most likely to break through soon into the mainstream. Perhaps bioinformatics classes in the year 2030 will be delivered to students encased in headsets, spinning detailed simulations of proteins and cells before their virtual eyes.
In the meantime, students need to have certain fundamental skills, and they need to have skills that are in demand. Some of those skills are challenging to acquire, especially for those who have not had much previous experience of thinking abstractly, or of thinking quantitatively. There are several places where "threshold concepts," as defined by Meyer and Land, 106 need to be grasped. Given the fickle nature of the employment market in bioinformatics, students also need to have a foundation that will enable them to build new bioinformatics skills once graduated and in the workplace. As with so much in higher education, it is the ability to learn to the highest level, rather than what is actually learned, that is the key.

ACKNOWLEDGMENTS
I thank all the students, undergraduate and postgraduate, who have participated in my bioinformatics classes at Lancaster University since 2013, and who have been so forthcoming with their feedback. I also thank the Lancaster University Organizational and Educational Development (OED) team for indirectly prompting me to write this article.

DATA AVAILABILITY STATEMENT
Selected bioinformatics laboratory class protocols and instructional videos from the courses mentioned are available under CC-BY at https://doi.org/10.17635/lancaster/researchdata/308.