Design and debugging of ultrastable engineered genetic systems

by

Yongjin Park

B.S. Chemical and Biomolecular Engineering, KAIST, 2012

M.S. Biological Engineering, Massachusetts Institute of Technology, 2017

SUBMITTED TO THE DEPARTMENT OF BIOLOGICAL ENGINEERING IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF

DOCTOR OF PHILOSOPHY IN BIOLOGICAL ENGINEERING AT THE

MASSACHUSETTS INSTITUTE OF TECHNOLOGY

@Massachusetts

Signature redacted

Signature of author:

Yongjin Park Department of Biological Engineering July 24th, 2019

Signature redacted

Certified by: Christopher Voigt Biological Engineering Thesis Advisor

Signature redacted

Accepted by: Paul Blainey Biological Engineering Chair, Graduate Program Committee

Table of Contents

ABSTRACT

ACKNOWLEDGEMENTS

PREFACE

INTRODUCTION

CHAPTER 1. FUNCTIONAL OPTIMIZATION OF REFACTORED NITROGENASE CLUSTERS USING RNA-SEQ

1.1 INTRODUCTION

1.2 MATERIALS AND METHODS

1.2.1 Plasmids, Strains and media

1.2.2 DNA assembly and verification

1.2.3 Nitrogenase activity assay

1.2.4 Strand-specific RNA-seq

1.2.5 Relative quantitation of nif protein levels with proteomic analysis

1.3 RESULTS

1.3.1 Combinatorial optimization of refactored nifUSVWZM cluster

1.3.2 Screening and analysis of the nifUSVWZM library

1.3.3 Robustness to changes in RNAP concentration

1.3.4 Transcriptome diversity in the library of refactored nif clusters

1.3.5 Characterization of Part Behavior from Transcriptomics Data

1.3.6 Transfer and improvement of the refactored gene cluster in E. coli

1.4 DISCUSSION

1.5 FIGURES

1.6 SUPPLEMENTARY NOTES

CHAPTER 2. DESIGN AND DEBUGGING INTERNAL WORKINGS OF GENETIC CIRCUITS USING RNA-SEQ

2.1 INTRODUCTION

2.2 MATERIALS AND METHODS

2.2.1 Strain, media, and inducers

2.2.2 Circuit induction

2.2.3 Flow cytometry analysis

2.2.4 RNA-seq library preparation and sequencing

2.2.5 Processing of sequencing data

2.2.6 Genetic circuit design and simulations

2.2.7 Numerical fitting

2.2.8 Fitting of promoters in series and estimating their individual activities

2.2.9 Calculation of the predicted transcription profiles

2.2.10 Measurement of doubling times

2.2.11 Characterization of the modified parts and circuits

2.2.12 Data availability

2.3 RESULTS

2.3.1 Data collection

2.3.2 Conversion of raw RNA-seq reads to transcription profiles

2.3.3 Generation of transcription profiles for part characterization

2.3.4 Genetic part characterization from transcription profiles

2.3.5 Characterization of genetic devices from transcription profiles

2.3.6 Characterization of a combinatorial logic circuit

2.3.7 Characterization of devices internal to the circuit

2.3.8 Part substitution to correct antisense transcription

2.3.10 Environmental robustness

2.4 DISCUSSION

2.5 FIGURES

2.6 SUPPLEMENTARY INFORMATION

CHAPTER 3. AUTOMATED DESIGN OF ULTRASTABLE GENETIC CIRCUITS USING GENOME LANDING PADS

3.1 INTRODUCTION

3.2 MATERIALS AND METHODS

3.2.1 Strains and media

3.2.2 Flow cytometry analysis

3.2.3 Double terminator library characterization

3.2.4 Construction of Tn5 transposon library

3.2.5 Characterization of Tn5 transposon library

3.2.6 Construction of genomic landing pads

3.2.7 Removing antibiotic markers from genome integrated landing pads

3.2.8 Construction of NOT-gate library

3.2.9 Integration of genetic circuit components into the genomic landing pads

3.2.10 Phage transduction

3.2.11 Thiamine dependence growth test

3.2.12 Sensor characterization

3.2.13 NOT/NOR-gate characterization

3.2.14 Design and characterization of genome circuits

3.2.15 RNA-seq library preparation

3.2.16 Processing of sequencing data

3.3 RESULTS

3.3.1 Genetic landing pad construction and characterization

3.3.2 Genetic device (sensor, NOT/NOR gates) construction and characterization

3.3.3 NOR gate design for genetic circuits on the genome

3.3.4 Genetic circuit design and implementation on the genome

3.4 DISCUSSION

3.5 FIGURES

3.6 SUPPLEMENTARY INFORMATION

CONCLUSIONS AND FUTURE DIRECTIONS

Abstract

Engineered genetic systems in bacteria have tremendous potential for biotech applications ranging from living therapeutics to the controlled production of chemicals. Engineering such systems is challenging because they often consist of multiple genes and genetic parts (>4 genes or >45 genetic parts) that interact with each other as intertwined networks. These networks are invisible, which makes the design and debugging of these genetic systems particularly challenging. Additionally, expressing a large number of genes burdens the host cell and reduces the long-term stability of these genetic systems. Here we address these two problems by (i) adapting high-throughput RNA-seq to visualize the inner workings of these engineered genetic systems and (ii) developing a robust and efficient genome engineering platform that enables the implementation of long-term stable engineered genetic systems on the genome. First, we applied a high-throughput RNA-sequencing method, RNA tag-seq, to analyze the behavior of engineered genetic systems. We analyzed two systems with RNA-seq: (i) a library of 84 refactored nitrogenase clusters, where each cluster consists of six genes with varying levels of expression, and (ii) a genetic circuit that consists of eight interacting genes. With this analysis, we studied the design parameters for these genetic systems and identified various unexpected failure modes. Swapping a problematic genetic part identified in the RNA-seq profile allowed us to effectively debug unwanted circuit expression profiles. To reduce the cellular burden from expressing these genetic systems, we developed a reliable and efficient genome engineering platform on the E. coli MG1655 K-12 genome. We built three genome landing pads, each of which consists of an att (phage attachment) site insulated with ultra-strong bidirectional terminators. Landing pad locations were determined by Tn5 transposon library screening, which identified genomic locations that showed high gene expression levels without interfering with endogenous gene expression. We also developed a set of plasmids that integrate genetic circuits into these landing pads via simple transformation. With these landing pads, seven orthogonal sensors and eight orthogonal TetR-homolog NOT gates were engineered on the genome to have up to 640-fold changes in output promoter activity upon induction. Utilizing these sensors and gates, we successfully implemented 3-input genome circuits that are stably maintained without antibiotics for more than two weeks in rich media with continuous daily ON/OFF state cycling. We expect this platform could facilitate the design and debugging of long-term stable engineered genetic systems.

Thesis supervisor: Christopher A. Voigt,

Daniel IC Wang Professor of Advanced Biotechnology, Department of Biological Engineering, MIT

"Pain is inevitable. Suffering is optional" Murakami Haruki

Acknowledgements

First and foremost, I want to thank my thesis advisor Chris Voigt. With immense patience, he took a 22-year-old boy and watched him grow into a responsible professional. I have been really lucky to learn science, strategic thinking and an undefeatable spirit from him. When I needed a mentor, he came to me as a mentor. When I needed an advocate for my research, he became the best patron of it. When I needed an objective critic of my research, he never stopped asking important questions until we could reach a satisfactory conclusion. Without his advice and mentoring, I would never have been able to stand where I am right now. In that regard, I also thank Forest and Eric for being on my thesis committee. They have provided wonderful comments on the direction of my research and have been critical but encouraging mentors. I should also mention that if it were not for Barbara and Terry's seamless care, this forty-person lab would never run without chaos. I really appreciate their support.

With Chris's effort to bring wonderful synthetic biologists into the group, I was very fortunate to share my time in graduate school with wonderful postdocs and fellow graduate students. I thank all the Voigt lab members who have been willing to share time with me for intriguing discussions. I especially thank Amin Espah Borujeni, who has been working on three papers with me, providing academic and moral support. I also thank postdocs Johan, Thomas, Mao, Michael, Nick, Marcello, Jonghyeon and Minhyung for their help and wonderful mentorship. I will especially miss the time with my fellow graduate students Amar (and Alicia), Felix, Emerson, and Andee as well. Last but not least, colleagues in MIT SBC such as Jin, Giyoung, Nicholas and Jaewook will also be deeply missed.

With all my heart, I want to express my deepest gratitude to my parents for their endless emotional support. They have been great mentors, patrons and friends throughout my life. I have been immensely fortunate to be their son. I thank my sisters for their companionship no matter how tough the going gets. I am also very grateful to my dear wife Moonkyung for our 16-year-long friendship and unfathomable love. As an insightful, patient, strong and witty partner, she has shone the light on the darkest hours of my life. She has been the best thing I have ever found during my Ph.D. training and my life. I have seen my life through her.

Lastly, I would also like to acknowledge the funding sources that enabled this work. I especially thank the Samsung Scholarship and the Defense Advanced Research Projects Agency (DARPA) for allowing me to work on the projects related to this thesis.

Preface

The scientific results described in Chapters 1 and 2 of this study include parts of publications in which I participated as a co-author. For Chapter 1, I was mainly involved in developing RNA-seq-based transcriptomics assays for studying the behavior of multi-gene genetic systems. By analyzing the data acquired from high-throughput RNA-seq, I aimed to learn design parameters for multi-gene genetic systems. For Chapter 2, I built, characterized and debugged multi-gene genetic circuits using RNA-seq. Thomas Gorochowski and Amin Espah Borujeni worked together with me as co-first authors of this publication. Lastly, the majority of the written content and figures presented in Chapter 3 are based on a manuscript that has been submitted to a peer-reviewed journal. All of the publications included in this thesis follow the copyright policy of the publisher. I clearly note that the writing in Chapters 1-3 was a joint effort with Chris Voigt and the other co-workers in the author lists.

Introduction

One of the salient goals in synthetic biology is to harness the intriguing abilities of living organisms by designing engineered genetic systems. The applications of these engineered genetic systems include efficiently producing molecules with therapeutic potential, forming micro- and nano-scale patterns, and tightly controlling the timing and level of gene expression 8-10 in vivo.

To achieve this goal, synthetic biology has renovated practices in genetic engineering by enabling the design-based, bottom-up implementation of engineered genetic systems 11,12. To create engineered genetic systems de novo, independently characterized and annotated "genetic parts" with defined DNA sequences and functions are concatenated with each other 1. When genetic parts are concatenated to initiate and terminate the transcription of a gene, this set of genetic parts becomes a transcription unit 14. For instance, a transcription unit can consist of a promoter, where RNA polymerase (RNAP) binds the DNA to initiate transcription; a ribozyme insulator 15 that buffers contextual effects on gene expression; a ribosome binding site, where ribosomes bind to initiate translation 16; an open reading frame (ORF); and a terminator that terminates transcription 17-19.
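The part-based view of a transcription unit described above can be made concrete as an ordered list of typed parts. The following is a minimal sketch only: the class, the part-type labels and the part names are assumptions made for illustration and are not a data model used in this thesis.

```python
from dataclasses import dataclass
from typing import List

# 5'-to-3' order of part types in a canonical transcription unit, following the
# description in the text. The labels themselves are illustrative choices.
CANONICAL_ORDER = ["promoter", "ribozyme_insulator", "rbs", "orf", "terminator"]


@dataclass(frozen=True)
class Part:
    name: str       # hypothetical part identifier
    part_type: str  # one of the labels in CANONICAL_ORDER


def is_canonical_unit(parts: List[Part]) -> bool:
    """Return True if the part types appear exactly in the canonical order."""
    return [p.part_type for p in parts] == CANONICAL_ORDER


# Hypothetical transcription unit expressing a fluorescent reporter.
unit = [
    Part("P_example", "promoter"),
    Part("Ribo_example", "ribozyme_insulator"),
    Part("RBS_example", "rbs"),
    Part("gfp", "orf"),
    Part("T_example", "terminator"),
]
```

Representing a unit this way makes simple design checks, such as a missing terminator or a misplaced insulator, explicit before any DNA is assembled.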

These transcription units can then be connected to create more complex engineered genetic systems, which often contain multiple transcription units (>4 genes). Such genetic systems are essential for advanced applications such as sophisticated cell-to-cell communication 20-22, coordinated enzymatic pathway expression and the logic-based control of gene expression 10. In this thesis, two examples of engineered genetic systems that consist of multiple genes are studied: refactored nitrogenase gene clusters (nif) from Klebsiella oxytoca M5al 23,24 and genetic circuits 25.

I. Engineering of refactored nif clusters

In bacteria, multiple genes encoding proteins related to a specific function form clusters in a contiguous locus on the genome 26. The nitrogenase-encoding gene cluster (nif) from Klebsiella oxytoca M5al contains 20 genes (nifJHDKTYENXUSVWZMFLABQ) organized into seven operons in a 24 kb region of the genome 27. The genes in the nif cluster encode nitrogenase enzyme subunits, cofactors and regulatory proteins that cooperate to fix diatomic nitrogen gas into ammonia. Engineering of the nif cluster has been hindered by redundant regulation, unknown regulatory sequences, and non-essential genes in the cluster 28. To address this issue, 16 genes in this nif cluster were refactored by replacing endogenous DNA sequences with synthetic genetic parts to eliminate potential sequence-based endogenous regulatory interactions and to enable orthogonal control of the engineered genetic system 23,29. The first version of the refactored nitrogenase cluster, however, attained only 6% of the wild-type Klebsiella oxytoca nitrogenase enzyme activity, potentially due to the challenges in controlling gene expression levels. This thesis presents the effort to engineer the refactored nif cluster by leveraging its plasticity and engineerability 30. By constructing and characterizing diverse variants of the refactored nif cluster, this thesis aimed to devise a new paradigm for designing and implementing genetic systems with a large number of genes.

II. Engineering of genetic circuits

Another example of the engineered genetic systems studied in this thesis is the genetic circuit. Genetic circuits are networks of regulatory proteins in living cells that cooperate to control gene expression with clearly defined inputs, logic and outputs. To design genetic circuits, a set of defined biological sensors is used as inputs to detect changes in the environment in which the hosts harboring the genetic circuits are located. While a variety of biological sensors have been developed, they generally consist of two components: a DNA-binding sensor protein and an output promoter 31. A sensor protein recognizes external stimuli (e.g., changes in chemical inducer concentrations) and binds a specific DNA sequence, known as an "operator," adjacent to the output promoter region 3. When the sensor protein recognizes an external stimulus (e.g., induction or a temperature change), the DNA-binding property of the sensor protein changes, thereby adjusting the activity of the output promoter. In genetic circuits, such changes in sensor output promoter activity are wired to logic circuits. Logic circuits process changes in the environment and convert them to designed outputs. Diverse modalities of gene expression regulation have been used to create logic circuits: cis and trans RNA interactions 36-38, DNA-protein interactions 39-41, recombinase-based DNA arrangement 41-43 and protein-based interactions 44,45. Note that this thesis focuses on DNA-protein interactions, particularly using TetR-homolog repressors to create logic circuits 39. The output promoter of the logic circuit is then connected to the desired output of the genetic circuit. Outputs are therefore controlled by combinations of inputs and logic circuits. Output promoters can be connected to a variety of cellular processes that need to be controlled by genetic circuits 46-49. In this thesis, however, most of the outputs are promoters expressing fluorescent proteins, which are then analyzed with flow cytometry to obtain the distribution of gene expression.
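Repressor-based gates of the kind described above are often summarized with a Hill-type response function that maps input promoter activity to output promoter activity. The sketch below assumes that functional form and uses invented parameter values in arbitrary units; it is meant to illustrate how a NOT gate and a two-input NOR gate behave, not to reproduce the model or the parameters used in this thesis.

```python
def not_gate(x: float, ymin: float, ymax: float, k: float, n: float) -> float:
    """Hill-type repression: output promoter activity for input activity x."""
    return ymin + (ymax - ymin) * k**n / (k**n + x**n)


def nor_gate(x1: float, x2: float, **params: float) -> float:
    """NOR gate sketched as a NOT gate acting on the summed input activities."""
    return not_gate(x1 + x2, **params)


# Hypothetical gate parameters and input levels (arbitrary units).
PARAMS = dict(ymin=0.05, ymax=5.0, k=0.5, n=2.5)
LOW, HIGH = 0.01, 10.0

# Output for each of the four input combinations (0 = LOW, 1 = HIGH).
truth_table = {
    (a, b): nor_gate(HIGH if a else LOW, HIGH if b else LOW, **PARAMS)
    for a in (0, 1)
    for b in (0, 1)
}
```

With these illustrative numbers the output is high only when both inputs are low, which is the defining behavior of a NOR gate.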

This thesis presents efforts to improve genetic circuit engineering in two ways. The first, presented in Chapter 2, is to enable transcriptomics-based, high-throughput debugging of failed genetic circuits. The second is to create a reliable and efficient genome engineering platform with which all three modules of a genetic circuit (input, output and logic circuit) can be implemented on the genome without severely burdening cells. The rationale for these efforts is discussed throughout the rest of this introduction.

Refactored nif clusters and genetic circuits are two examples of sophisticated engineered genetic systems. Both systems consist of a large number of transcription units whose expression levels have to be precisely orchestrated for the genetic system to function fully. Challenges associated with the increasing complexity of engineered genetic systems generally fall into one of three categories: (i) understanding environmental and compositional contexts and incorporating that understanding into the genetic design, (ii) debugging failing genetic systems with complex inner workings and (iii) reducing the cellular burden caused by excessive exogenous gene expression.

Significant progress in facilitating the design and implementation of engineered genetic systems has been achieved by understanding genetic contexts and applying that understanding to genetic design. Genetic contexts include environmental and compositional contexts. Environmental context refers to the composition of the environment in which the host cells harboring engineered genetic systems are located. The environmental context includes the pH 1, temperature 52 and nutrient composition 11 (e.g., media types, sugar sources, etc.). These factors alter gene expression by affecting the host cell's physiology and growth rate. They also affect the activity of many environment-responsive promoters (e.g., the cI promoter with temperature change) and change gene expression modalities. To minimize the variance in gene expression due to changes in environmental context, gene expression levels from different environmental contexts are normalized using a defined gene expression cassette (a measurement standard). Absolute measurement of protein, RNA and DNA levels in living cells under different conditions will further improve the designability of genetic systems across environmental contexts.
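The normalization described above amounts to expressing each measurement relative to the reference cassette measured under the same condition. A minimal sketch follows, with invented fluorescence values and a hypothetical function name; the background subtraction is an added assumption, not a step stated in the text.

```python
def normalize_to_standard(sample: float, standard: float, background: float = 0.0) -> float:
    """Express a measurement relative to a reference cassette from the same condition."""
    denominator = standard - background
    if denominator <= 0:
        raise ValueError("standard must exceed background")
    return (sample - background) / denominator


# Invented example: the same construct measured in two media (arbitrary units).
measurements = {
    "medium_A": {"sample": 1200.0, "standard": 600.0, "background": 100.0},
    "medium_B": {"sample": 556.0, "standard": 280.0, "background": 50.0},
}
relative = {name: normalize_to_standard(**m) for name, m in measurements.items()}
```

In this toy case the raw values differ roughly two-fold between media, yet the normalized activity is the same, which is the behavior a measurement standard is intended to provide.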

Compositional context refers to changes in gene expression patterns that result from changes in the DNA sequence context. Such changes include alterations in gene order, genetic architecture and RNA secondary structure. Various genetic parts that serve as insulators have been developed to minimize fluctuations in gene expression caused by changes in compositional context 15,58,59. Computer software that harnesses the quantitative understanding of genetic contexts has also been developed to facilitate the design of engineered genetic systems 60,61.

The second challenge in engineering genetic systems with multiple transcription units is debugging failed genetic systems. Genetic parts in a genetic system interact with each other to create networks of interacting parts. Increasing the number of genetic parts in these systems exponentially increases the number of such interactions and complicates the debugging process 25. The current state of the art for debugging failing genetic systems uses fluorescent proteins as probes to indirectly monitor part and system behavior. However, this approach is inherently indirect and low-throughput, and it rarely identifies the specific genetic part that has failed. It is therefore valuable to be able to monitor the performance of a genetic system directly and to pinpoint the failing genetic parts. In Chapters 1 and 2 of this thesis, we developed experimental and computational approaches to apply high-throughput transcriptomics (RNA-seq) to the analysis of engineered genetic systems. Transcripts from engineered genetic systems were sequenced, aligned to the reference sequence, and then analyzed to elucidate the behavior of genetic systems in situ. Transcriptome data analysis revealed common failure modes of engineered genetic systems, such as unexpected transcription (both sense and antisense) from part junctions, transcriptional read-through between genes and the global impact on the host transcriptome of expressing engineered genetic systems. In Chapter 1, we applied this approach to analyze 84 variants of refactored nif clusters to elucidate design rules for multi-protein enzyme pathways and to monitor part behaviors. In Chapter 2, this approach was applied to analyze and debug a partially failing 3-input, 5-gate genetic circuit (0x58). With this technique, eight different input states of the circuit were analyzed. The analysis revealed the inner workings of the circuit and provided the detailed behavior of genetic parts and devices (gates). By visualizing the behavior of genetic circuits at the level of single genetic parts, we identified unexpected antisense transcription affecting gene expression and were able to debug the transcription profile of the circuit by simply replacing a genetic part.
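To illustrate how aligned reads can expose failure modes such as read-through and antisense transcription, the sketch below builds a strand-specific coverage profile from read intervals and estimates the fraction of transcription that continues past a terminator. The read representation, the window size and the read-through metric are assumptions chosen for this example; they are not the analysis pipeline described in the later chapters.

```python
from typing import Iterable, List, Tuple

Read = Tuple[int, int, str]  # (start, end, strand); 0-based, end-exclusive


def strand_coverage(reads: Iterable[Read], length: int, strand: str) -> List[int]:
    """Count, for every position, the reads on one strand that cover it."""
    coverage = [0] * length
    for start, end, read_strand in reads:
        if read_strand != strand:
            continue
        for pos in range(max(start, 0), min(end, length)):
            coverage[pos] += 1
    return coverage


def mean(values: List[int]) -> float:
    return sum(values) / len(values)


def read_through(coverage: List[int], term_start: int, term_end: int, window: int = 10) -> float:
    """Fraction of upstream coverage that persists downstream of a terminator."""
    upstream = mean(coverage[term_start - window:term_start])
    downstream = mean(coverage[term_end:term_end + window])
    return downstream / upstream if upstream > 0 else float("nan")


# Toy data: four sense reads end before the terminator (positions 50-60),
# one sense read continues through it, and one antisense read overlaps the region.
reads = [(0, 50, "+")] * 4 + [(0, 80, "+"), (40, 80, "-")]
sense = strand_coverage(reads, 100, "+")
antisense = strand_coverage(reads, 100, "-")
leak = read_through(sense, 50, 60)
```

Here one of five sense transcripts passes the terminator, and the antisense profile is non-zero only where the opposing read lies, which is the kind of signal that points to a specific part junction.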

The third challenge associated with engineered genetic systems is "cellular burden" 62. The quantity of RNA polymerases, nucleotides, σ factors and other essential coenzymes in each cell is limited 63. Because conserving these cellular resources for the expression of essential genes is critical to the survival of host cells, high-level exogenous gene expression from plasmids drains such resources, perturbs the balance of host resource allocation and alters host physiology 64-69. Such changes in physiology eventually alter gene expression from the engineered genetic system by mutating the genetic system itself or the host's genomic DNA 66,70-74.

In Chapter 3, this thesis presents an approach to reduce the cellular burden of expressing exogenous engineered genetic systems. Historically, plasmids have been widely used for implementing engineered genetic systems due to the ease of genetic manipulation. However, gene expression from high-copy-number plasmids can readily overload the cell with expression burden 63,75. While the genome can be an attractive alternative because of its stability and lower copy number, the inefficiency of genome engineering and many unknown contextual effects have deterred the use of the genome as a test-bed for engineered genetic systems 76. To address this issue, we developed genomic landing pads, each consisting of a phage attachment site 77 (att) insulated by a pair of ultra-strong bidirectional terminators, and engineering toolkits compatible with these landing pads. We implemented genetic circuits on the landing pads on the genome. The resulting circuits showed a reduction in cellular burden of up to >10-fold, which is expected to lead to enhanced stability.

All in all, this thesis focuses on overcoming challenges in implementing large, complex engineered genetic systems. To overcome these challenges, we developed synthetic biology toolkits that provide (i) a high-throughput analysis pipeline for engineered genetic systems and (ii) an efficient and reliable genome engineering platform for genetic circuit implementation.

Chapter 1. Functional optimization of refactored nitrogenase clusters using RNA-seq

1.1 Introduction

Biology is able to build intricate materials and chemicals that require precise dynamic and spatial control over many genes. However, engineering large systems that are composed of many genetic parts is not straightforward. First, the design process is time consuming, as software is focused on combining parts at the primary DNA-sequence level. Second, it takes months to prototype a design. Although DNA synthesis is routine for individual genes 78 and, indeed, has been used to build entire megabase genomes 79, it remains too expensive to be used to simultaneously synthesize many large alternative designs. In practice, this means that it is only feasible to build a small set of alternative designs for testing, so it may take considerable time to find a design that works.

This is further complicated when working with a large system that is encoded by natural genetics, which are the product of evolutionary forces and exhibit redundant, overlapping regulatory elements 23,27,29,80. Further, even for well-characterized systems, not all of the regulation or regulatory parts (e.g., promoters) are known. Starting with such a system, design choices cannot be cleanly implemented without triggering a web of secondary effects. For example, a desired change in gene order may be tolerable in itself, but if there are promoters internal to the ORFs, then this could create transcriptional interference. Overlapping genetic elements also thwart part substitutions; if genes are translationally coupled, this complicates codon optimization or the substitution of an RBS, because these changes will have a secondary impact on neighboring genes.

An engineering approach to clean up a natural genetic system is to refactor it 81. The goal is to create a highly modular system, where every genetic part is defined, and the native regulation is replaced by synthetic genetic circuits 82. Refactoring works towards several goals. First, it enables complex, multi-gene pathways to be removed from the control of the host and placed under the control of synthetic genetic sensors and circuits. This eliminates the influence of the many environmental and cellular inputs that can influence a system and enables it to be controlled with an inducible switch or more complex circuitry. Second, it facilitates the large-scale part swapping and engineering that is required for species transfer. Each species speaks a different regulatory language and refactoring simplifies the conversion of the code from one to the other (codon optimizing each gene, converting ribosome binding sites, etc).

Nitrogen fixation is a key process in agriculture involving the conversion of atmospheric N2 to ammonia, and since the 1970s it has been a goal in biotechnology to move this function into cereal crops to reduce the use of chemically derived fertilizer. In Klebsiella, the native cluster contains 20 genes encoded in 7 operons in 25 kb, including regulatory proteins, the components of the nitrogenase enzyme, chaperones, electron transport proteins, and enzymes directing the biosynthesis of the iron-molybdenum cofactor (FeMo-co) and other metalloclusters 27. Under the right conditions, it is highly induced, with 30% of protein synthesis dedicated to nifHDK, and nifH alone makes up 10% of cell weight 83. The activity and expression level of nitrogenase is optimized to avoid the production of H2 84. Thus, the challenge is not increasing activity in Klebsiella, but rather transferring a system with this activity to a new host, such as a plant chloroplast or root-associated organism 85. The transfer process will require part substitutions for the new host and subsequent optimization, and the native system is not organized in a way that allows this to be done easily.

To reduce the genetic barriers to transfer, we previously refactored the nif gene cluster by systematically eliminating native regulation and converting the system to a modular set of well-defined and characterized genetic parts. This involved the removal of all non-coding DNA as well as non-essential and regulatory genes. The 16 remaining genes were "codon randomized" to eliminate regulation internal to the open reading frame. These genes were organized into artificial operons, placed under the control of T7 RNAP promoters and terminators, and synthetic RBSs were selected to optimize the expression levels. Finally, a "controller" plasmid was constructed that contains synthetic sensors and circuits, whose output is an attenuated T7* RNAP 23,81. T7 RNAP and promoters were selected because these are transcriptionally orthogonal from the host and can be used in many organisms. Thus, the cluster could be transferred and optimized using the same promoter set.

Starting with this refactored cluster, combinatorial design and DNA assembly are applied to build permutations of operons and clusters (Figure 1a). Combinatorial design is a field of mathematics that studies the arrangement of elements of a finite set into patterns according to specified constraints 86. As applied to synthetic biology, this allows a design to be articulated as a set of parts and formalized constraints between parts. This is written as a EUGENE file 87 where constraints are articulated using a formal semantic language that can capture any descriptor of a genetic system made by an experimentalist (gene A before B, all genes must have RBSs, etc). This approach enables one design to capture many potential DNA constructs.
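The idea that one constrained design captures many constructs can be illustrated by enumerating gene orders that satisfy ordering rules of the "gene A before B" kind. The sketch below is written in plain Python rather than in the EUGENE language, and the gene subset and the two constraints are hypothetical choices made only for the example.

```python
from itertools import permutations
from typing import Callable, List, Sequence, Tuple

Constraint = Callable[[Sequence[str]], bool]


def before(a: str, b: str) -> Constraint:
    """Constraint: gene a must appear upstream of gene b."""
    return lambda order: order.index(a) < order.index(b)


def enumerate_designs(genes: Sequence[str], constraints: List[Constraint]) -> List[Tuple[str, ...]]:
    """List every gene order that satisfies all constraints."""
    return [
        order for order in permutations(genes)
        if all(rule(order) for rule in constraints)
    ]


genes = ["nifU", "nifS", "nifV", "nifW"]
# Hypothetical constraints chosen only to illustrate the idea.
designs = enumerate_designs(genes, [before("nifU", "nifS"), before("nifV", "nifW")])
```

Four genes allow 24 orders, and each of the two independent ordering rules halves that number, so this single specification stands for six distinct constructs.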

There are a variety of molecular cloning techniques that enable designs to be realized. Type IIs endonuclease-dependent methods 88 as well as restriction enzyme-independent methods 88,89 allow for the placement of parts into a user-defined sequence. When scars can be tolerated, these methods are very efficient at building combinatorial libraries. Part assembly has been used to build large libraries as a means of screening for pathway improvements or the diversification of the product chemistry. To date, the application of combinatorial genetic design has only rarely addressed variables other than part strength. For example, libraries are often built by substituting a set of parts at a specific location, such as a set of promoters of different strength 86,90.

The dream of cellular analytics is to be able to debug failed genetic systems by understanding everything that is happening within the cell, including proteins, transcripts, and metabolites. However, the high cost of these techniques typically only allows the most successful constructs to be analyzed to retrospectively explain the impact of design choices. Improvements in molecular barcoding methods and data processing have resulted in significant cost savings by allowing cDNA from many strains to be pooled and sequenced in a single Illumina lane. By attaching molecular barcodes early in the RNA-processing protocol, even high-information-content experiments like strand-specific RNA-seq can be multiplexed and performed in a cost-effective manner 91.
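Pooling works because each read carries a barcode that identifies the strain it came from, so reads can be sorted back into samples after sequencing. The following is a simplified sketch of that sorting step. It assumes the barcode occupies the first few bases of each read and must match exactly; the barcode length, barcode sequences and sample names are invented, and real protocols may place and handle barcodes differently.

```python
from collections import defaultdict
from typing import Dict, Iterable, List

BARCODE_LENGTH = 4  # hypothetical; the real length depends on the protocol


def demultiplex(reads: Iterable[str], barcodes: Dict[str, str]) -> Dict[str, List[str]]:
    """Assign each read to a sample by its leading barcode and trim the barcode."""
    by_sample: Dict[str, List[str]] = defaultdict(list)
    for read in reads:
        tag, insert = read[:BARCODE_LENGTH], read[BARCODE_LENGTH:]
        sample = barcodes.get(tag, "unassigned")
        by_sample[sample].append(insert)
    return dict(by_sample)


barcodes = {"ACGT": "strain_1", "TGCA": "strain_2"}  # invented barcodes
pooled = ["ACGTGGGATTC", "TGCACCCTTAG", "ACGTAAACCCG", "NNNNGGGGGGG"]
sorted_reads = demultiplex(pooled, barcodes)
```

Reads whose barcode is not recognized are kept in a separate "unassigned" bin rather than discarded, which makes it easy to check how much of a lane was lost to barcode errors.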

Here, we combine these approaches to build a parallelized design-build-test-learn pipeline. Starting with the refactored nif cluster, combinatorial assembly is used to build libraries with diverse architectures and regulatory parts. A large library is built for the 6-gene nifUSVWZM operon (0.7 Mb total) and from it we identify optimized operons whose architectures differ significantly from wild-type. The plasticity of the cluster as a whole is confirmed by building large libraries where the order, operon occupancy, and orientation are changed for all 16 genes (1.9 Mb total). The refactored cluster is transferred from Klebsiella to E. coli and activity is improved by building a library (0.9 Mb total) with simultaneous RBS substitutions across all 16 genes, which would be impossible with the native cluster. RNA-seq is used to extract new constraints that can be incorporated into the next round of design. This pipeline is applicable to genetic engineering challenges beyond nitrogen fixation. For example, the field of natural products is often challenged with the transfer of large, multi-gene constructs from one organism to another, which can require the large-scale substitution of genetic parts and optimization in the new host.

1.2 Materials and Methods

1.2.1 Plasmids, Strains and media.

Escherichia coli DH5α was used for routine cloning and plasmid propagation. E. coli MG1655 was used as a heterologous host for screening libraries of full refactored nitrogen fixation gene clusters. Klebsiella oxytoca M5al 92 was used to determine wild-type nitrogenase activity levels, and the knockout mutant strain K. oxytoca NF10 (ΔnifUSVWZM) 2 was used to screen synthetic nifUSVWZM operons. The starting plasmid for the 16-gene refactored nif cluster was pCV-RBS20; this differs from the construct in Temme and co-workers (SBa_000534) by a corrected point mutation in nifZ and the presence of 4 bp scars used for MoClo assembly, neither of which impacts activity. Prior to this study, pCV27083, the refactored nifUSVWZM operon under control of the T7 promoter, was reported to recover 25% of wild-type nitrogenase activity. Re-measuring this strain resulted in a significantly different activity of 60% of wild-type.

Luria-Bertani (LB) medium (10 g/L tryptone, 5 g/L yeast extract, 10 g/L NaCl; VWR cat. #90003-350) with appropriate antibiotic supplementation was used for strain maintenance and plasmid construction in E. coli strains. LB-Lennox medium (10 g/L tryptone, 5 g/L yeast extract, 5 g/L NaCl; Invitrogen cat. #12780-052) was used for strain maintenance in K. oxytoca strains. All nitrogen fixation assays were performed in minimal medium (25 g/L Na2HPO4, 3 g/L KH2PO4, 0.25 g/L MgSO4•7H2O, 1 g/L NaCl, 0.1 g/L CaCl2•2H2O, 2.9 mg/L FeCl3, 0.25 mg/L Na2MoO4•2H2O, and 20 g/L sucrose). Growth medium is defined as minimal medium supplemented with 6 mL/L of 22% ammonium acetate (filter sterilized). De-repression medium is defined as minimal medium supplemented with 1.5 mL/L of 10% serine (filter sterilized). Phosphates were dissolved in distilled water and autoclaved separately from the remaining ingredients to prevent precipitation, and sterile medium components were freshly mixed before each use. Antibiotic selection was performed with spectinomycin (100 mg/L; MP Biomedicals cat. #021 5899305), kanamycin (50 mg/L; Gold Bio cat. #K-120-5), ampicillin (100 mg/L; Affymetrix cat. #11259 5), and/or chloramphenicol (33 mg/L; VWR cat. #AAB20841-14).

Isopropyl-β-D-1-thiogalactopyranoside (IPTG; Gold Bio cat. #12481C25 259) was added to the medium for induction at various levels. Blue-white screening of colonies resulting from DNA assembly reactions was performed on LB-agar plates (1.5% Bacto agar; VWR cat. #90000-760) supplemented with 0.15 mM IPTG, 60 mg/L 5-bromo-4-chloro-indolyl-β-D-galactopyranoside (Roche cat. #10 745 740 001), and appropriate antibiotics.

1.2.2 DNA assembly and verification.

The promoter parts, RBS/CDS parts, and terminator parts that entered into the pipeline at the highest level of the assembly tree were themselves constructed using standard cloning techniques including isothermal assembly 89,93 and PCR-ligation. Parts with an identification number beginning with SBa are also deposited in the SynBERC registry of parts (registry.synberc.org). Part characterization is described in Supplementary Note 1. All promoter parts are flanked by sequences "GGAG" (upstream) and "TACT" (downstream), RBS/CDS parts are flanked by sequences "AATG" (upstream) and "AGGT" (downstream), and terminator parts (TPs) are flanked by sequences "TACT" (upstream) and "AATG" (downstream). These four-bp sequences correspond to 5'-overhanging single-stranded cohesive ends when digested with restriction enzymes BbsI (promoter and RBS/CDS parts) or BsaI (terminator parts). Application of the Scarless Stitching method to create a seamless junction between any combination of promoter part and RBS/CDS part proceeds as follows: 20 fmol each of promoter part plasmid, RBS/CDS part plasmid, pMJS20BC, and pMJS23AD are mixed with 5 U BbsI (New England Biolabs, Ipswich, MA, cat. #R0539S) and 5 U T4 DNA Ligase (Promega, Madison, WI, cat. #M1794) in a total of 10 µl of 1x Promega T4 DNA Ligase Buffer and incubated at 37 °C for 4.5 hours. Next, a 10 µl solution containing 5 U MlyI (New England Biolabs, cat. #R0610S) and 5 U T4 DNA Ligase in 1x Promega T4 DNA Ligase Buffer is added to each reaction and incubated for an additional 30 min at 37 °C. Reactions are terminated by incubating at 50 °C for 5 min and 80 °C for 10 min. Constructed plasmids are transformed into E. coli and prepared for sequence confirmation by Sanger sequencing using standard techniques. The efficiency of this method was established by reconstructing a GFP coding sequence from two halves. Scarless stitching of a promoter-RBS/CDS construct to a terminator part follows a similar protocol to that described above, with pMJS25DB and pMJS24AC replacing pMJS20BC and pMJS23AD, and BsaI (New England Biolabs cat. #R0535S) replacing BbsI. For unknown reasons, the efficiency of this second round is significantly worse than the first, with single base pair deletions present at the part junction in over 70% of the sequenced constructs. Constructs containing a promoter part, RBS/CDS part, and terminator part are considered "cistron parts."
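The 20 fmol of each plasmid used in the stitching reaction corresponds to a mass that depends on plasmid length. A minimal sketch of that conversion follows, assuming an average molar mass of about 650 g/mol per base pair of double-stranded DNA (a common approximation, not a value taken from this protocol). The function name and the plasmid lengths are hypothetical placeholders, not the actual sizes of the part plasmids.

```python
# Approximate average molar mass of one base pair of double-stranded DNA (g/mol).
# This is a rule-of-thumb value, not a number taken from the protocol above.
AVG_BP_MASS_G_PER_MOL = 650.0


def fmol_to_ng(fmol: float, length_bp: int) -> float:
    """Return the mass in ng of `fmol` femtomoles of a dsDNA molecule of `length_bp` base pairs."""
    grams = fmol * 1e-15 * length_bp * AVG_BP_MASS_G_PER_MOL
    return grams * 1e9


# Hypothetical plasmid lengths (bp), for illustration only.
example_lengths_bp = {
    "promoter part plasmid": 2500,
    "RBS/CDS part plasmid": 3500,
}

if __name__ == "__main__":
    for name, length_bp in example_lengths_bp.items():
        print(f"{name}: {fmol_to_ng(20, length_bp):.1f} ng for 20 fmol")
```

Because the required mass scales linearly with length, longer part plasmids need proportionally more DNA to reach the same 20 fmol.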

Sequence-verified cistron parts are PCR amplified to give each construct specific cohesive ends upon BbsI digestion that dictate the orientation and relative position in the overall assembly. PCR products are cloned into Level 1 plasmids (pCV27069) with the appropriate flanking cohesive ends using a Golden Gate assembly reaction16,22,23. At this stage each part is sequence verified. Fourteen of the 48 cistron parts contained a 1-2 bp deletion at the beginning of the terminator part, but as the first 6 bp of the terminator parts are not part of the hairpin structure and are not expected to affect termination efficiency37, these were still carried further in the library assembly. Three (nifUSVWZM library) or four (monocistron library) Level 1 plasmids are combined by BsaI digestion/ligation into Level 2 plasmids (pCV27070) using a Golden Gate assembly reaction to form intermediate assembly plasmids dubbed half-clusters or quarter-clusters for the nifUSVWZM or monocistron libraries, respectively. Finally, Level 2 plasmids are combined by BbsI digestion/ligation into the expression vector pMJS2001AC to form Level 3 plasmids containing the 6 genes of the nifUSVWZM operon or the 16 genes of the complete refactored nif gene cluster.

Level 2 and Level 3 plasmids are verified by colony-multiplex PCR using primers that anneal to the CDS sequences of each gene. Colonies are picked into 10 µl of sterile H2O and boiled at 100 °C for 10 min. Boil preps are centrifuged to pellet cell debris, and 0.5 µl of supernatant is used as template in 5 µl PCR reactions using Phusion High-Fidelity DNA Polymerase (New England Biolabs, cat. #M0530L) with standard reaction conditions and the following heat cycle in a Bio-Rad C1000 Touch Thermal Cycler (Hercules, CA): 98 °C for 30 s, 35 cycles of 98 °C for 10 s, 60 °C for 30 s, and 72 °C for 15 s, followed by 72 °C for 10 min. PCR reactions are analyzed by agarose gel electrophoresis or on a QIAxcel (Qiagen, Germantown, MD) with a DNA Screening cartridge and 320 s separation time. The Golden Gate assembly of cistron parts into larger constructs proceeds through a cut-and-paste type mechanism and is likely less error prone than polymerase-dependent cloning techniques. Multiplexed PCR verification tests whether product constructs contain each of the desired parts. Performing a multiplex PCR reaction produced a characteristic pattern of products that could be analyzed by agarose gel electrophoresis or capillary electrophoresis. Test assemblies of complete refactored nitrogenase gene clusters with the gene order and orientation unchanged revealed the efficiency of the four-piece Golden Gate reactions to be >80% (not shown). Because expected PCR product profiles for gene clusters with permuted gene order and orientation are unique and complex, we screened for correct constructs by checking at least three colonies from each reaction by multiplex PCR. Correct constructs were selected as those producing identical product profiles in 3/3 or 2/3 replicates. Finally, 30 members of the nifUSVWZM library (#1-20 plus the best and worst five performing constructs) were sequence verified by Sanger sequencing. As intermediate parts from the assembly were present in several of the sequence-verified constructs, we could infer when a sequence error was present in an intermediate construct. In such cases, all final constructs bearing these intermediate parts (USV-2, frameshift in stop codon of nifV; USV-10, frameshift in nifV; and USV-9, duplication of nifU cistron part) were inferred to carry the same error. Constructs for which mutations were directly observed or could be inferred based on the assembly hierarchy include: #3, 4, 17, 18, 19, 20, 26, 33, 34, 38, 45, 46, 50, 57, 58, 62, 63, 69, 70, 74, 76, 81, and 83. Only the remaining 62 constructs were included in the analyses reported here, unless specifically noted (for example in characterizing part behavior via RNA-seq).

The RBS-swapping library was constructed by directly cloning cistron-level parts into Level 1 MoClo plasmids. Level 2 plasmids containing each possible combination of the four-gene quarter clusters (HDKY; ENJB; QFUS; VWZM) were constructed by type IIS digestion/ligation as described above. Thirty-nine of the forty Level 3 plasmid reactions yielded colonies that were verified by multiplexed PCR to contain each of the nif genes. The top two gene clusters in the library were sequence-verified by Sanger sequencing.

1.2.3 Nitrogenase activity assay.

Nitrogenase activity is determined in vivo via the previously described acetylene reduction assay 23,94. Each strain is grown in 2 ml growth medium (supplemented with required antibiotics) in 15 mL culture tubes for 14 hours in an incubated shaker (30 °C, 250 rpm). Cultures are diluted in 2 ml derepression medium (supplemented with required antibiotics and inducers) to a final OD600 of 0.5 in 10 ml glass vials with PTFE/silicone septa screw caps (Supelco Analytical, Bellefonte, PA, cat. #SU860103). Headspace in the bottles was repeatedly evacuated and flushed with N2 gas using a vacuum manifold equipped with a copper catalyst O2 trap. After a 5 hour incubation at 30 °C and 250 rpm in an incubated shaker, headspace was replaced with 1 atm argon. Acetylene was freshly generated from CaC2 in a Burris bottle, and 1 ml was injected into each bottle to start the reaction. Cultures were incubated at 30 °C, 250 rpm for 15 hr before the assay was quenched by the addition of 500 µl of 4 M NaOH to each vial. Ethylene production was analyzed by gas chromatography on an Agilent 7890A GC system (Agilent Technologies, Inc. Santa Clara, CA USA) equipped with a PAL headspace autosampler and flame ionization detector as follows. 250 µL headspace preincubated to 35 °C was sampled and separated on a GS-CarbonPLOT column (0.32 mm x 30 m, 3 micron; Agilent) at 60 °C and a He flow rate of 1.8 ml/min. Detection occurred in a FID heated to 300 °C with a gas flow of 35 ml/min H2 and 400 ml/min air. Under these conditions, acetylene eluted at 3.0 min post injection and ethylene at 3.7 min. Ethylene production was quantified by integrating the 3.7 min peak using Agilent GC/MSD ChemStation Software. Cell growth is determined in identical conditions, with 500 µl of culture sampled five hours post induction and diluted 1:1 with minimal medium to return cultures to within the linear range for optical density (OD600) measurement. Optical density is measured on a Varian 50 Bio UV-Vis spectrophotometer.

To generate the T7* RNAP expression vs. normalized nitrogenase activity plots for Figure 2d, raw nitrogen fixation activities at each level of induction were first corrected for T7* RNAP-independent activity by subtracting the latter value from each data point, with a lower bound of 0 (i.e. corrected activity levels were not allowed to be negative). Next, nitrogen fixation was normalized to the maximum activity of each gene cluster across the range of induction levels assayed. For a rough measure of cluster robustness, we integrated under a third-order polynomial best-fit curve and report this value as a construct's robustness.
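The correction, normalization, and integration steps described above can be sketched as follows. This is only an illustration under stated assumptions: the function and argument names are hypothetical, the x-axis is taken to be the T7* RNAP expression level in whatever units were plotted, and the integral is taken over the measured range of that axis. None of these details are specified in the text.

```python
import numpy as np


def robustness(expression, raw_activity, background):
    """Area under a cubic fit of background-corrected, max-normalized activity.

    expression   : T7* RNAP expression level at each induction level (assumed x-axis)
    raw_activity : raw nitrogenase activity at each induction level
    background   : T7* RNAP-independent activity to subtract
    """
    x = np.asarray(expression, dtype=float)
    # Subtract the RNAP-independent activity, with a lower bound of 0.
    corrected = np.clip(np.asarray(raw_activity, dtype=float) - background, 0.0, None)
    peak = corrected.max()
    if peak == 0:
        return 0.0  # no activity above background at any induction level
    # Normalize to the maximum activity of this cluster across induction levels.
    normalized = corrected / peak
    # Third-order polynomial best fit, then its definite integral over the measured range.
    coeffs = np.polyfit(x, normalized, deg=3)
    antiderivative = np.polyint(coeffs)
    return float(np.polyval(antiderivative, x.max()) - np.polyval(antiderivative, x.min()))
```

With this definition, a cluster whose normalized activity stays near 1 across all induction levels gives an area close to the width of the x-range, while a cluster that is active at only one induction level gives a much smaller value.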

1.2.4 Strand-specific RNA-seq

For RNA-sequencing samples, total RNA is harvested from each of the nifUSVWZM library strains cultured in nitrogenase assay conditions as well as wild-type K. oxytoca M5al grown with and without IPTG. RNA preparation is initiated following 5.5 hours of growth in inducing conditions. From 8 ml of culture, cells are spun down at 4 °C at 21,000 relative centrifugal force (rcf) for 3 minutes. After centrifugation, supernatant is discarded and cell pellets are flash frozen in liquid nitrogen for storage at -80 °C. RNA is isolated with the PureLink RNA Mini Kit (Life Technologies, Carlsbad, CA) according to the manufacturer's instructions and further purified and concentrated with RNA Clean & Concentrator-5 (Zymo Research) to assure sample quality. Purified RNA samples were submitted for deep sequencing at the Broad Institute (Cambridge, MA).

Strand-specific RNA-seq libraries were created by the Broad Technology Labs specialized service facility using the Tag-seq method91. Briefly, individual sample RNA was fragmented, and the 3' end was tagged with a DNA oligonucleotide containing a sample tag and a partial 5' Illumina adapter. Uniquely tagged RNAs were then pooled and carried through rRNA depletion (Ribo-Zero™ Magnetic Kit (Bacteria); Epicentre, Madison, WI), cDNA synthesis, and ligation to a second oligonucleotide containing a partial Illumina 3' adapter, then amplified with full-length barcoded Illumina adapter primers to tag pools and generate strand-specific, sequence-ready RNA-seq libraries. The ninety-six libraries were created as three pools of 32 samples. Each pool was split and sequenced on two lanes of an Illumina HiSeq 2500. A reference genome for each design was assembled by combining the WT genomic sequence with a plasmid sequence predicted based on the specific design. Reads were trimmed of barcodes and aligned to associated reference genomes using BWA version 0.7.4 with the default settings95. Strand-specific RPKM values were calculated using custom scripts which used the Bamtools API96. Read depth profiles were computed using the "mpileup -d 20000" function from the SAMtools suite 17. For both computations, counts were added together from the replicate lanes. Experimental error in RPKM values was calculated to be minimal using values obtained from biological replicates of eight strains (Supplementary Note 2). Expression levels from the two sets of data are highly correlated (R2 = 0.92 to the x = y line).
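The custom RPKM scripts are not reproduced here. As a rough illustration only, the calculation can be sketched using the standard RPKM definition (reads per kilobase of gene per million mapped reads), with counts from the two replicate lanes summed first. The function name and inputs are hypothetical, and the per-gene counts are assumed to be already restricted to reads mapping to the gene's coding strand.

```python
def rpkm(gene_reads_by_lane, total_reads_by_lane, gene_length_bp):
    """Standard RPKM for one gene, pooling counts across replicate lanes.

    gene_reads_by_lane  : strand-specific read counts for the gene, one per lane
    total_reads_by_lane : total mapped reads in each lane
    gene_length_bp      : gene length in base pairs
    """
    gene_reads = sum(gene_reads_by_lane)    # counts added together from the lanes
    total_reads = sum(total_reads_by_lane)
    # reads / (length in kb) / (total mapped reads in millions)
    return gene_reads * 1e9 / (gene_length_bp * total_reads)
```

For instance, 1,000 reads on a 1,000 bp gene out of 10 million mapped reads corresponds to an RPKM of 100.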

To calculate the average pairwise ratio of the expression levels of refactored genes from the nifUSVWZM library, the fold change between each of the fifteen possible pairwise combinations (i.e. U-S, U-V, U-W, etc.) was computed and represented as a fold change value >1. These pairwise ratios were averaged to generate the final metric D. The formal equation to calculate this metric is:

$$D = \frac{1}{15} \sum_{i<j} \max\left(\frac{X_i}{X_j}, \frac{X_j}{X_i}\right) \qquad (1)$$

where Xi and Xj represent RPKM values for genes nifUSVWZM for one gene cluster.
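A minimal sketch of this metric follows. The function name and the dictionary input are hypothetical, and all RPKM values are assumed to be nonzero so that every ratio is defined.

```python
from itertools import combinations


def average_pairwise_ratio(rpkm_by_gene):
    """Average over all gene pairs of the fold change, expressed as a value >= 1.

    rpkm_by_gene : mapping of gene name to RPKM for one gene cluster
                   (six genes give fifteen pairs). Values must be nonzero.
    """
    values = list(rpkm_by_gene.values())
    # For each unordered pair, take whichever ratio is >= 1.
    ratios = [max(a / b, b / a) for a, b in combinations(values, 2)]
    return sum(ratios) / len(ratios)
```

A cluster with identical expression of all six genes gives D = 1, and larger values indicate more uneven expression across the genes.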

1.2.5 Relative quantitation of nif protein levels with proteomic analysis

Klebsiella oxytoca M5al strains were cultured under normal nitrogenase activity assay conditions. Cells were harvested from 100 ml of culture by centrifugation at 4 °C and 12,000 rcf for 10 minutes. Supernatant was removed after centrifugation and cell pellets were flash frozen in liquid nitrogen for storage at -80 °C. Cell pellets were resuspended in lysis buffer (50 mM Tris pH 8.0 and 150 mM NaCl) supplemented with lysozyme (Pierce) and DNaseI (Pierce) and were incubated for 30 min at 4 °C. Cell suspensions were then sonicated at 45% power for four 15-second cycles using a Sonic Dismembrator 500 (Fisher Scientific) equipped with a Branson 102C sonication probe, with 1.5 minutes of incubation on ice between cycles. Extracted proteins were denatured and reduced by DTT (Sigma Aldrich) and iodoacetamide (Sigma Aldrich) in ammonium acetate (pH 8.9). Reduced protein extracts were treated with trypsin (Promega) at a 1:50 ratio (1 µg of trypsin per 50 µg of extracted proteome) overnight. Following incubation, digested proteome samples were passed through a C-18 Sep-Pak column (Waters) and submitted for iTRAQ labeling and directed mass spectrometry measurement at the Swanson Biotechnology Center at the Koch Institute for Integrative Cancer Research. For most proteins, several peptides were identified and used in the quantification. For NifE and NifV, quantification was based on identification of a single peptide; further attempts to validate peptides have not been made.

1.3 Results

1.3.1 Combinatorial optimization of refactored nifUSVWZM cluster

When refactoring the nif cluster, two operons proved particularly difficult to optimize: nifHDKY (containing the nitrogenase subunits) and nifUSVWZM. Here, we focus on nifUSVWZM, whose genes are required for the biosynthesis of metal cofactors, including the FeMo-co and FeS 'P-cluster' of nitrogenase and the FeS cluster of NifH98. NifU and NifS form a complex that initiates metallocluster biosynthesis by producing the [Fe2-S2] and [Fe4-S4] clusters. NifV synthesizes homocitrate98, which coordinates the molybdenum in FeMo-co 98. NifM is required for the maturation of NifH. NifZ is required for the second P-cluster in NifH and NifW is required for fully functional nitrogenase.

In this manuscript, we start with a variant of the nifUSVWZM operon that yields 60% ± 1.2% of wild-type activity in the context of K. oxytoca NF10 (ΔnifUSVWZM). Based on this construct, a library was designed to vary the architecture and component parts. There was little guiding information in the literature about the importance of the genetic organization of the native operon. We noted that when the nifUSVWZM operon was compared across a range of species, the component genes were found to be arranged differently into subclusters with different orders and orientations100. So rather than constraining the genes into an operon structure, the architectures in the library were allowed to vary considerably. Constructs were designed that contain different operon structures, gene orders, and gene orientations (Figure 2a). The only architectural constraint that was imposed was to limit nifUSV to the first half of the cluster and nifWZM to the second half, in order to reduce the number of half-clusters that had to be assembled (note that this does not constrain these genes to be part of the same operon). Several non-standard design features were allowed in the library, including tandem promoters, interference from downstream reverse promoters, and genes after terminators that rely on read-through for transcription.

Genetic parts were selected to vary gene expression levels. Three T7 RNAP promoters were used to vary transcription levels (SBa_000920, 0.025 ± 0.009 REU, 'relative expression units'; SBa_000445, 0.08 ± 0.037 REU; and SBa_000446, 0.12 ± 0.041 REU). The same set of "codon randomized" open reading frames for the six genes was used as described previously23. A single terminator is reused throughout the design (SBa_000450, TS = 2.6 ± 0.6) 18. A total of 12 RBSs were designed using the RBS Calculator38 to provide strong and weak RBSs for each gene. This includes six from the original refactored cluster, four that were designed to be 5-fold stronger (for nifU,V,W,Z) and two that were designed to be 5-fold weaker (for nifS,M). Seven spacers composed of randomly generated 50 bp DNA sequences were included to increase the distance between RBSs and upstream elements and to separate cistron parts to reduce context effects that could occur between neighboring parts (the spacers were computationally scanned to eliminate functional sequences, Supplementary Note 1) 58.

A hierarchical DNA assembly strategy was developed to efficiently combine genetic parts to form intermediate composite parts, half-clusters, and whole clusters. Each level of the hierarchy uses a different DNA assembly strategy that is optimal for the size and types of parts that exist at that stage (Figure 1b). The first stage combines individual parts (spacers, promoters, RBSs, genes, terminators) to form 48 cistron-sized constructs. At this stage, scarless assembly is critical as the introduction of new sequences at the seams can impact part behavior101. We developed a simple "Scarless Stitching" method that can combine up to three parts in a one-pot reaction and uses an additional enzyme to remove the bridging scar. The cistron-level constructs were then combined to form half-clusters via the MoClo variation of Golden Gate assembly 89,102. This method introduces 4 bp scars when building libraries, which we place in the spacers separating cistrons. After building the cistron parts, one round of PCR was used to customize the flanking regions of each cistron, which contain MoClo cohesive ends that determine the eventual order and orientation in future assembly steps. Twenty-four half-clusters were built: 12 for nifUSV and 12 for nifWZM. These were then put together in 84 different combinations to build the full clusters using the same assembly process. We sequence verified a subset of the clusters and from that inferred that 22 were incorrect, mainly from point mutations in coding sequences occurring in intermediate plasmids. A total of 62 constructs were analyzed further for sequence-activity relationships.

1.3.2 Screening and analysis of the nifUSVWZM library

All 62 gene clusters were introduced into K. oxytoca NF10 (ΔnifUSVWZM) bearing a controller containing the IPTG-inducible PTac promoter driving the expression of T7* RNAP (plasmid N249). Each variant was characterized using an acetylene reduction assay (Methods) and compared to wild-type K. oxytoca M5al (Figure 2a). In measuring the activity, samples are diluted so that they have the same OD prior to induction. One variant (USVWZM#30) recovered full wild-type activity (96% ± 9%), but had a different genetic architecture than the wild-type, with five transcription units, a different gene order, and a change in orientation between nifUVS and nifZMW. Additionally, tandem promoters control nifZ and nifM. However, it is noteworthy that the second-best operon (USVWZM#1, 85% ± 5%) has the same single-operon architecture as the original refactored operon, with the only different parts being RBSs and spacers. The next three variants (USVWZM#61, #68, and #41) have high activity (77% ± 5%, 75% ± 5%, and 70% ± 3%) but also differ substantially in their architectures, with (2, 5, and 3) transcriptional units and (2, 8, and 4) promoters. The diversity of genetic architectures present in the top-five performing variants highlights the genetic plasticity of this operon.

Several of the permuted nifUSVWZM constructs produced a growth phenotype in the assay conditions. The OD after 22 hours of growth is reported as a measure of the growth rate (Methods). Upon ordering the nifUSVWZM variants according to growth first and then activity (Figure 2a), constructs #1, #68, and #61 stand out as maintaining both high activity and wild-type growth.

The library was analyzed to see if there were correlations between nitrogenase activity and features of the genetic architecture, including gene orientation and order, part combinations, and part activity. There is a negative correlation between activity and the number of transcription units (as well as the number of promoters and terminators, which are related), but there are many outliers and the most active variants contain multiple transcription units. There is no correlation with the number of orientation changes, but there is a preference for nifU and nifS to be in the reverse orientation. There is no enrichment for constructs that preserve any aspect of the gene order, including those orders most preserved in native clusters when compared across bacterial species. Likewise, there are few correlations between genetic architecture and growth rate, where there is enrichment for the preservation of the nifUSVWZM order of genes.

To further determine whether there is any correlation between operon structure and activity, we built a library of 80 constructs where the operon occupancy and order was varied over all 16 genes. These data indicate that the nifEN gene pair is particularly sensitive to disruption, but there was no enrichment for any other pairs of genes. In addition, from this library several active fully monocistronic designs were identified (e.g., constructs #14 and #9).

1.3.3 Robustness to changes in RNAP concentration

Operons have been proposed as a mechanism to maintain protein ratios despite changes in the promoter activity103-105. In comparing the top nifUSVWZM variants, it is possible that a variant with disrupted operons has high activity but is less robust over a range of RNAP concentrations. The refactored clusters are induced by a controller, which simplifies the measurement of robustness by varying the concentration of IPTG to change T7* RNAP, which is the sole input to all of the promoters (Figure 2b).

For 49 of the most active nifUSVWZM variants, the robustness was quantified by measuring the nitrogenase activity at five levels of IPTG induction across two orders of magnitude (Figure 2b). Clusters were placed into three groups based on their response to increasing IPTG. The majority (35) of these monotonically increase in activity as a function of RNAP concentration. Several clusters are very robust over a wide range of RNAP concentrations, showing almost no change in activity. Remarkably, the top cluster identified in the library (USVWZM#30) is in this category, demonstrating that it has both the highest activity as well as the highest robustness, despite having disrupted the original operon structure with seven promoters. Lastly, a number of clusters yield nitrogenase activities that decline monotonically as RNAP is increased.
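The text does not state the numerical criteria used to assign clusters to the three response groups. One plausible way to sketch such a grouping is shown below; the function name, the 20% flatness tolerance, and the extra "other" label for non-monotonic responses are all assumptions made for illustration.

```python
def classify_response(activities, flat_tolerance=0.2):
    """Label a dose-response series ordered by increasing IPTG concentration.

    activities     : nitrogenase activity at each induction level, lowest IPTG first
    flat_tolerance : hypothetical cutoff; a series whose spread is within this
                     fraction of its maximum is called "robust"
    """
    lowest, highest = min(activities), max(activities)
    # Nearly constant activity across the induction range.
    if highest - lowest <= flat_tolerance * highest:
        return "robust"
    steps = list(zip(activities, activities[1:]))
    if all(later >= earlier for earlier, later in steps):
        return "increasing"
    if all(later <= earlier for earlier, later in steps):
        return "decreasing"
    # Not one of the three groups named in the text; kept so every series gets a label.
    return "other"
```

Checking flatness before monotonicity means a series with small fluctuations around a plateau is labeled "robust" rather than being rejected as non-monotonic.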

1.3.4 Transcriptome diversity in the library of refactored nif clusters

Very few architectural rules were gleaned from the nifUSVWZM library. It may be that it is important to maintain the correct expression levels and, so long as this is satisfied, many architectures are equivalent. We used high-throughput transcriptomics to quantify the relationship between genetic architecture, expression, and activity. RNA-seq data was gathered on the wild-type cluster, refactored cluster, and all 84 members of the nifUSVWZM library (Methods). Analyzing this many samples in a cost-effective manner required the application of new techniques involving pooling samples early in the process and multiplexing reactions91. The method is also strand-specific, which is important for obtaining data for promoters oriented in opposite directions. This provides base-pair resolved transcript levels across the refactored clusters, as well as the entire genome, for the complete set of samples.

RNA-seq data was first gathered for the wild-type Klebsiella strain under inducing conditions. To our knowledge, this is the first transcriptomics investigation of nitrogen fixation in K. oxytoca, although the cluster has been investigated in other organisms106-108. The different operons within the nif cluster are transcribed to different levels (Figure 3a,b). The nifHDKTY operon is highly expressed, whereas nifUSVWZM, nifF and nifJ are transcribed approximately 10-fold lower. The least transcribed operons (nifENX and nifBQ) are 20-fold lower than nifHDKTY.

We then measured the transcript profile for the original refactored nif cluster. Unlike the wild-type cluster, this profile is very flat, with very little change across the entire cluster (Figure 3b). This is reflective of the design of this cluster; genes were maintained in the same orientation, super-operons were built by combining nifHDKY, nifEN, and nifJ, and T7 RNAP is known for terminator read-through109 and reduced attenuation110. While various promoters were selected to vary expression, the total range of those that were available at the time was small. To convert the profile into expression levels, the normalized RPKM value was calculated for each gene. Figure 3c shows the relative level of each gene with respect to wild-type. Proteomics was performed to determine if the protein expression levels were more comparable. Indeed, we found that they were closer to the wild-type levels (Figure 3d). This is consistent with the fact that most of the debugging of the refactored cluster was performed by tuning the RBS strengths.

RNA-seq was performed on each member of the nifUSVWZM library. From this data, we calculated the transcript levels (RPKM values) of the plasmid-borne nifUSVWZM variants as well