18 results for Recognition algorithms
in CaltechTHESIS
Abstract:
DNA recognition is an essential biological process responsible for the regulation of cellular functions including protein synthesis and cell division and is implicated in the mechanism of action of some anticancer drugs. Studies directed towards defining the elements responsible for sequence specific DNA recognition through the study of the interactions of synthetic organic ligands with DNA are described.
DNA recognition by poly-N-methylpyrrolecarboxamides was studied by the synthesis and characterization of a series of molecules where the number of contiguous N-methylpyrrolecarboxamide units was increased from 2 to 9. The effect of this incremental change in structure on DNA recognition has been investigated at base pair resolution using affinity cleaving and MPE•Fe(II) footprinting techniques. These studies led to a quantitative relationship between the number of amides in the molecule and the DNA binding site size. This relationship is called the n + 1 rule and it states that a poly-N-methylpyrrolecarboxamide molecule with n amides will bind n + 1 base pairs of DNA. This rule is consistent with a model where the carboxamides of these compounds form three-center bridging hydrogen bonds between adjacent base pairs on opposite strands of the helix. The poly-N-methylpyrrolecarboxamide recognition element was found to preferentially bind poly dA•poly dT stretches; however, both binding site selection and orientation were found to be affected by flanking sequences. Cleavage of large DNA is also described.
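The n + 1 rule stated above is simple arithmetic. As a minimal sketch (the function name `binding_site_size` is hypothetical and does not come from the thesis), it could be written as:

```python
def binding_site_size(n_amides: int) -> int:
    """Predicted binding-site size in base pairs under the n + 1 rule."""
    if n_amides < 1:
        raise ValueError("n_amides must be a positive integer")
    return n_amides + 1


# Illustrative amide counts only; they are not measurements from the thesis.
for n in (3, 5, 8):
    print(f"{n} amides -> {binding_site_size(n)} base pairs")
```

Under this rule an octaamide, for example, would be expected to cover nine base pairs.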
One approach towards the design of molecules that bind large sequences of double helical DNA sequence specifically is to couple DNA binding subunits of similar or diverse base pair specificity. Bis-EDTA-distamycin-fumaramide (BEDF) is an octaamide dimer of two tri-N-methylpyrrolecarboxamide subunits linked by fumaramide. DNA recognition by BEDF was compared to P7E, an octaamide molecule containing seven consecutive pyrroles. These two compounds were found to recognize the same sites on pBR322 with approximately the same affinities, demonstrating that fumaramide is an effective linking element for N-methylpyrrolecarboxamide recognition subunits. Further studies involved the synthesis and characterization of a trimer of tetra-N-methylpyrrolecarboxamide subunits linked by β-alanine ((P4)_(3)E). This trimerization produced a molecule which is capable of recognizing 16 base pairs of A•T DNA, more than a turn and a half of the DNA helix.
DNA footprinting is a powerful direct method for determining the binding sites of proteins and small molecules on heterogeneous DNA. It was found that attachment of EDTA•Fe(II) to spermine creates a molecule, SE•Fe(II), which binds and cleaves DNA sequence neutrally. This lack of specificity provides evidence that at the nucleotide level polyamines recognize heterogeneous DNA independent of sequence and allows SE•Fe(II) to be used as a footprinting reagent. SE•Fe(II) was compared with two other small molecule footprinting reagents, EDTA•Fe(II) and MPE•Fe(II).
Abstract:
This thesis discusses various methods for learning and optimization in adaptive systems. Overall, it emphasizes the relationship between optimization, learning, and adaptive systems; and it illustrates the influence of underlying hardware upon the construction of efficient algorithms for learning and optimization. Chapter 1 provides a summary and an overview.
Chapter 2 discusses a method for using feed-forward neural networks to filter the noise out of noise-corrupted signals. The networks use back-propagation learning, but they use it in a way that qualifies as unsupervised learning. The networks adapt based only on the raw input data; there are no external teachers providing information on correct operation during training. The chapter contains an analysis of the learning and develops a simple expression that, based only on the geometry of the network, predicts performance.
Chapter 3 explains a simple model of the piriform cortex, an area in the brain involved in the processing of olfactory information. The model was used to explore the possible effect of acetylcholine on learning and on odor classification. According to the model, the piriform cortex can classify odors better when acetylcholine is present during learning but not present during recall. This is interesting since it suggests that learning and recall might be separate neurochemical modes (corresponding to whether or not acetylcholine is present). When acetylcholine is turned off at all times, even during learning, the model exhibits behavior somewhat similar to Alzheimer's disease, a disease associated with the degeneration of cells that distribute acetylcholine.
Chapters 4, 5, and 6 discuss algorithms appropriate for adaptive systems implemented entirely in analog hardware. The algorithms inject noise into the systems and correlate the noise with the outputs of the systems. This allows them to estimate gradients and to implement noisy versions of gradient descent, without having to calculate gradients explicitly. The methods require only noise generators, adders, multipliers, integrators, and differentiators; and the number of devices needed scales linearly with the number of adjustable parameters in the adaptive systems. With the exception of one global signal, the algorithms require only local information exchange.
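The idea of estimating a gradient by correlating injected noise with the system output can be illustrated in software. The following is a minimal discrete-time sketch, not the analog circuits or the exact algorithms of the thesis. The function name `noisy_gradient_descent`, the quadratic test loss, and all parameter values are assumptions chosen for illustration, and the scalar change in loss is assumed to play the role of the single global signal.

```python
import numpy as np


def noisy_gradient_descent(loss, w0, steps=2000, sigma=1e-2, lr=0.05, seed=0):
    """Minimize `loss` using only noise injection and correlation (no explicit gradients)."""
    rng = np.random.default_rng(seed)
    w = np.asarray(w0, dtype=float).copy()
    for _ in range(steps):
        xi = rng.standard_normal(w.shape)        # independent noise for each parameter
        delta = loss(w + sigma * xi) - loss(w)   # change in the one global (scalar) signal
        grad_estimate = (delta / sigma) * xi     # correlate output change with the noise
        w -= lr * grad_estimate                  # noisy gradient-descent step
    return w


# Hypothetical test problem: a quadratic bowl with a known minimum.
target = np.array([1.0, -2.0, 0.5])


def quadratic(w):
    return float(np.sum((w - target) ** 2))


w_final = noisy_gradient_descent(quadratic, np.zeros(3))
print(w_final)
```

Each parameter update uses only that parameter's own noise sample and one shared scalar, so the work per step grows linearly with the number of parameters, which mirrors the scaling described above.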
Abstract:
Oligonucleotide-directed triple helix formation is one of the most versatile methods for the sequence specific recognition of double helical DNA. Chapter 2 describes affinity cleaving experiments carried out to assess the recognition potential for purine-rich oligonucleotides via the formation of triple helices. Purine-rich oligodeoxyribonucleotides were shown to bind specifically to purine tracts of double helical DNA in the major groove antiparallel to the purine strand of the duplex. Specificity was derived from the formation of reverse Hoogsteen G•GC, A•AT and T•AT triplets and binding was limited to mostly purine tracts. This triple helical structure was stabilized by multivalent cations, destabilized by high concentrations of monovalent cations and was insensitive to pH. A single mismatched base triplet was shown to destabilize a 15 mer triple helix by 1.0 kcal/mole at 25°C. In addition, stability appeared to be correlated to the number of G•GC triplets formed in the triple helix. This structure provides an additional framework as a basis for the design of new sequence specific DNA binding molecules.
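To give a sense of scale for the 1.0 kcal/mole destabilization quoted above, the sketch below converts a free-energy difference into a fold change in equilibrium binding constant. It assumes the standard relation ΔG = -RT ln K and that the reported value can be treated as such a free-energy difference. The function name is hypothetical.

```python
import math

R_KCAL_PER_MOL_K = 1.987204e-3  # gas constant in kcal/(mol*K)


def fold_change_in_binding_constant(ddg_kcal_per_mol, temp_celsius):
    """Ratio of equilibrium constants for a free-energy difference ddG at a given temperature."""
    temp_kelvin = temp_celsius + 273.15
    return math.exp(ddg_kcal_per_mol / (R_KCAL_PER_MOL_K * temp_kelvin))


ratio = fold_change_in_binding_constant(1.0, 25.0)
print(f"{ratio:.1f}-fold")
```

Under these assumptions, a single mismatch would lower the binding constant by a factor of roughly five.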
In work described in Chapter 3, the triplet specificities and required strand orientations of two classes of DNA triple helices were combined to target double helical sequences containing all four base pairs by alternate strand triple helix formation. This allowed for the use of oligonucleotides containing only natural 3'-5' phosphodiester linkages to simultaneously bind both strands of double helical DNA in the major groove. The stabilities and structures of these alternate strand triple helices depended on whether the binding site sequence was 5'-(purine)_m (pyrimidine)_n-3' or 5'-(pyrimidine)_m (purine)_n-3'.
In Chapter 4, the ability of oligonucleotide-cerium(III) chelates to direct the transesterification of RNA was investigated. Procedures were developed for the modification of DNA and RNA oligonucleotides with a hexadentate Schiff-base macrocyclic cerium(III) complex. In addition, oligoribonucleotides modified by covalent attachment of the metal complex through two different linker structures were prepared. The ability of these structures to direct transesterification to specific RNA phosphodiesters was assessed by gel electrophoresis. No reproducible cleavage of the RNA strand consistent with transesterification could be detected in any of these experiments.
Abstract:
The behaviors of six new cyclophane receptors for organic guest molecules in aqueous media are reported. These new hosts are modifications of more basic parent structures, and the main goal of their examination has been to determine how the modifications affect host selectivity for cationic guests. In particular, we have been interested in determining how additional non-covalent binding interactions can complement the cation-π interactions active in the parent systems. Three types of modifications were made to these systems. Firstly, neutral methoxy and bromine substituents were added to produce four of the six new macrocycles. Secondly, two additional aromatic rings (relative to the parent host) capable of making cation-π interactions with charged guest species were appended. Thirdly, a negatively charged carboxyl group was attached to produce a cavity in which electrostatic interactions should enhance cationic guest binding. ^1H-NMR and circular dichroic techniques were employed to determine the binding affinities of a wide variety of organic guests for the parent and modified structures in aqueous media.
Bromination of the parent host greatly enhances its binding in a general fashion, primarily as the result of hydrophobic interactions. The addition of methoxy groups does not enhance binding, apparently as a result of a collapse of the hosts into a conformation that is not suitable for binding. The appendage of extra aromatic rings enhances the binding of positively charged guests, most likely in response to more complete encapsulation of guest species. The addition of a negatively charged carboxylate enhances binding for only select groups of cationic guests. AM1 calculations of the electrostatic potentials of several guest molecules suggest that the enhancements seen with the modified receptor compared to the parent are most likely the result of close contact between regions of highest potential on the guest and the appended carboxylate.
Abstract:
Computer science and electrical engineering have been the great success story of the twentieth century. The neat modularity and mapping of a language onto circuits has led to robots on Mars, desktop computers and smartphones. But these devices are not yet able to do some of the things that life takes for granted: repair a scratch, reproduce, regenerate, or grow exponentially fast, all while remaining functional.
This thesis explores and develops algorithms, molecular implementations, and theoretical proofs in the context of “active self-assembly” of molecular systems. The long-term vision of active self-assembly is the theoretical and physical implementation of materials that are composed of reconfigurable units with the programmability and adaptability of biology’s numerous molecular machines. En route to this goal, we must first find a way to overcome the memory limitations of molecular systems, and to discover the limits of complexity that can be achieved with individual molecules.
One of the main thrusts in molecular programming is to use computer science as a tool for figuring out what can be achieved. While molecular systems that are Turing-complete have been demonstrated [Winfree, 1996], these systems still cannot achieve some of the feats biology has achieved.
One might think that because a system is Turing-complete, capable of computing “anything,” that it can do any arbitrary task. But while it can simulate any digital computational problem, there are many behaviors that are not “computations” in a classical sense, and cannot be directly implemented. Examples include exponential growth and molecular motion relative to a surface.
Passive self-assembly systems cannot implement these behaviors because (a) molecular motion relative to a surface requires a source of fuel that is external to the system, and (b) passive systems are too slow to assemble exponentially-fast-growing structures. We call these behaviors “energetically incomplete” programmable behaviors. This class of behaviors includes any behavior where a passive physical system simply does not have enough physical energy to perform the specified tasks in the requisite amount of time.
As we will demonstrate and prove, a sufficiently expressive implementation of an “active” molecular self-assembly approach can achieve these behaviors. Using an external source of fuel solves part of the problem, so the system is not “energetically incomplete.” But the programmable system also needs to have sufficient expressive power to achieve the specified behaviors. Perhaps surprisingly, some of these systems do not even require Turing completeness to be sufficiently expressive.
Building on a large variety of work by other scientists in the fields of DNA nanotechnology, chemistry and reconfigurable robotics, this thesis introduces several research contributions in the context of active self-assembly.
We show that simple primitives such as insertion and deletion are able to generate complex and interesting results such as the growth of a linear polymer in logarithmic time and the ability of a linear polymer to treadmill. To this end we developed a formal model for active self-assembly that is directly implementable with DNA molecules. We show that this model is computationally equivalent to a machine capable of producing strings that are stronger than regular languages and, at most, as strong as context-free grammars. This is a great advance in the theory of active self-assembly as prior models were either entirely theoretical or only implementable in the context of macro-scale robotics.
We developed a chain reaction method for the autonomous exponential growth of a linear DNA polymer. Our method is based on the insertion of molecules into the assembly, which generates two new insertion sites for every initial one employed. The building of a line in logarithmic time is a first step toward building a shape in logarithmic time. We demonstrate the first construction of a synthetic linear polymer that grows exponentially fast via insertion. We show that monomer molecules are converted into the polymer in logarithmic time via spectrofluorimetry and gel electrophoresis experiments. We also demonstrate the division of these polymers via the addition of a single DNA complex that competes with the insertion mechanism. This shows the growth of a population of polymers in logarithmic time. We characterize the DNA insertion mechanism that we utilize in Chapter 4. We experimentally demonstrate that we can control the kinetics of this reaction over at least seven orders of magnitude, by programming the sequences of DNA that initiate the reaction.
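The link between "two new insertion sites for every one employed" and logarithmic-time growth can be seen in a toy counting model. The sketch below is an idealization and not the kinetic model of the thesis: it assumes synchronous rounds, an unlimited monomer supply, and that every open site is filled in every round. The function name `rounds_to_reach` is hypothetical.

```python
def rounds_to_reach(target_monomers: int) -> int:
    """Count idealized synchronous rounds needed to insert at least `target_monomers` monomers."""
    sites = 1       # start from a single insertion site
    inserted = 0    # monomers incorporated so far
    rounds = 0
    while inserted < target_monomers:
        inserted += sites   # every open site accepts one monomer this round
        sites *= 2          # each insertion leaves two new insertion sites
        rounds += 1
    return rounds


for n in (1_000, 1_000_000):
    print(n, "monomers in", rounds_to_reach(n), "rounds")
```

In this model the number of inserted monomers after r rounds is 2^r - 1, so the number of rounds grows with the logarithm of the polymer length.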
In addition, we review co-authored work on programming molecular robots using prescriptive landscapes of DNA origami; this was the first microscopic demonstration of programming a molecular robot to walk on a 2-dimensional surface. We developed a snapshot method for imaging these random walking molecular robots and a CAPTCHA-like analysis method for difficult-to-interpret imaging data.
Abstract:
The neonatal Fc receptor (FcRn) binds the Fc portion of immunoglobulin G (IgG) at the acidic pH of endosomes or the gut and releases IgG at the alkaline pH of blood. FcRn is responsible for the maternofetal transfer of IgG and for rescuing endocytosed IgG from a default degradative pathway. We investigated how FcRn interacts with IgG by constructing a heterodimeric form of the Fc (hdFc) that contains one FcRn binding site. This molecule was used to characterize the interaction between one FcRn molecule and one Fc and to determine under what conditions FcRn forms a dimer. The hdFc binds one FcRn molecule at pH 6.0 with a K_d of 80 nM. In solution and with FcRn anchored to solid supports, the heterodimeric Fc does not induce a dimer of FcRn molecules. FcRn-hdFc complex crystals were obtained and the complex structure was solved to 2.8 Å resolution. Analysis of this structure refined the understanding of the mechanism of the pH-dependent binding, shed light on the role played by carbohydrates in the Fc binding, and provided insights on how to design therapeutic IgG antibodies with longer serum half-lives. The FcRn-hdFc complex in the crystal did not contain the FcRn dimer. To characterize the tendency of FcRn to form a dimer in a membrane we analyzed the tendency of the hdFc to induce cross-phosphorylation of FcRn-tyrosine kinase chimeras. We also constructed FcRn-cyan and FcRn-yellow fluorescent proteins and have analyzed the tendency of these molecules to exhibit fluorescence resonance energy transfer. As of now, neither of these analyses has led to conclusive results. In the process of acquiring the context to appreciate the structure of the FcRn-hdFc interface, we developed a study of 171 other nonobligate protein-protein interfaces that includes an original principal component analysis of the quantifiable aspects of these interfaces.
Abstract:
Mannose receptor (MR) is widely expressed on macrophages, immature dendritic cells, and a variety of epithelial and endothelial cells. It is a 180 kD type I transmembrane receptor whose extracellular region consists of three parts: the amino-terminal cysteine-rich domain (Cys-MR); a fibronectin type II-like domain; and a series of eight tandem C-type lectin carbohydrate recognition domains (CRDs). Two portions of MR have distinct carbohydrate recognition properties: Cys-MR recognizes sulfated carbohydrates and the tandem CRD region binds terminal mannose, fucose, and N-acetyl-glucosamine (GlcNAc). The dual carbohydrate binding specificity allows MR to interact with sulfated and nonsulfated polysaccharide chains, thereby facilitating the involvement of MR in immunological and physiological processes. The immunological functions of MR include antigen capturing (through binding non-sulfated carbohydrates) and antigen targeting (through binding sulfated carbohydrates), and the physiological roles include rapid clearance of circulatory luteinizing hormone (LH), which bears polysaccharide chains terminating with sulfated and non-sulfated carbohydrates.
We have crystallized and determined the X-ray structures of unliganded Cys-MR (2.0 Å) and Cys-MR complexed with different ligands, including Hepes (1.7 Å), 4SO_4-N-Acetylgalactosamine (4SO_4-GalNAc; 2.2 Å), 3SO_4-Lewis^x (2.2 Å), 3SO_4-Lewis^a (1.9 Å), and 6SO_4-GalNAc (2.5 Å). The overall structure of Cys-MR consists of 12 anti-parallel β-strands arranged in three lobes with approximate threefold internal symmetry. The structure contains three disulfide bonds, formed by the six cysteines in the Cys-MR sequence. The ligand-binding site is located in a neutral pocket within the third lobe, in which the sulfate group of the ligand is buried. Our results show that optimal binding is achieved by a carbohydrate ligand with a sulfate group that anchors the ligand by forming numerous hydrogen bonds and a sugar ring that makes ring-stacking interactions with Trp117 of Cys-MR. Using a fluorescence-based assay, we characterized the binding affinities between Cys-MR and its ligands, and rationalized the derived affinities based upon the crystal structures. These studies reveal the mechanism of sulfated carbohydrate recognition by Cys-MR and facilitate our understanding of the role of Cys-MR in MR recognition of its ligands.
Abstract:
The discovery that the three ring polyamide Im-Py-Py-Dp containing imidazole (Im) and pyrrole (Py) carboxamides binds the DNA sequence 5'-(A,T)G(A,T)C(A,T)-3' as an antiparallel dimer offers a new model for the design of ligands for specific recognition of sequences in the minor groove containing both G,C and A,T base pairs. In Chapter 2, experiments are described in which the sequential addition of five N-methylpyrrolecarboxamides to the imidazole-pyrrole polyamide Im-Py-Py-Dp affords a series of six homologous polyamides, Im-(Py)_(2-7)-Dp, that differ in the size of their binding site, apparent first order binding affinity, and sequence specificity. These results demonstrate that DNA sequences up to nine base pairs in length can be specifically recognized by imidazole-pyrrole polyamides containing three to seven rings by 2:1 polyamide-DNA complex formation in the minor groove. Recognition of a nine base pair site defines the new lower limit of the binding site size that can be recognized by polyamides containing exclusively imidazole and pyrrolecarboxamides. The results of this study should provide useful guidelines for the design of new polyamides that bind longer DNA sites with enhanced affinity and specificity.
In Chapter 3 the design and synthesis of the hairpin polyamide Im-Py-Im-Py-γ-Im-Py-Im-Py-Dp is described. Quantitative DNase I footprint titration experiments reveal that Im-Py-Im-Py-γ-Im-Py-Im-Py-Dp binds six base pair 5'-(A,T)GCGC(A,T)-3' sequences with 30-fold higher affinity than the unlinked polyamide Im-Py-Im-Py-Dp. The hairpin polyamide does not discriminate between A•T and T•A at the first and sixth positions of the binding site as three sites 5'-TGCGCT-3', 5'-TGCGCA-3', and 5'-AGCGCT-3' are bound with similar affinity. However, Im-Py-Im-Py-γ-Im-Py-Im-Py-Dp is specific for and discriminates between G•C and C•G base pairs in the 5'-GCGC-3' core as evidenced by lower affinities for the mismatched sites 5'-AACGCA-3', 5'-TGCGTT-3', 5'-TGCGGT-3', and 5'-ACCGCT-3'.
In Chapter 4, experiments are described in which a kinetically stable hexa-aza Schiff base La3+ complex is covalently attached to a Tat(49-72) peptide which has been shown to bind the HIV-1 TAR RNA sequence. Although these metallo-peptides cleave TAR site-specifically in the hexanucleotide loop to afford products consistent with hydrolysis, a series of control experiments suggests that the observed cleavage is not caused by a sequence-specifically bound Tat(49-72)-La(L)3+ peptide.
Abstract:
The design of synthetic molecules that recognize specific sequences of DNA is an ongoing challenge in molecular medicine. Cell-permeable small molecules targeting predetermined DNA sequences offer a potential approach for offsetting the abnormal effects of misregulated gene expression. Over the past twenty years, Professor Peter B. Dervan has developed a set of pairing rules for the rational design of minor groove binding polyamides containing pyrrole (Py), imidazole (Im), and hydroxypyrrole (Hp). Polyamides have illustrated the capability to permeate cells and inhibit transcription of specific genes in vivo. This provides impetus to identify structural elements that expand the repertoire of polyamide motifs with recognition properties comparable to naturally occurring DNA binding proteins. Through the introduction of chiral amino acids, we have developed chiral polyamides with stereochemically regulated binding characteristics. In addition, chiral substituents have facilitated the development of new polyamide motifs that broaden binding site sizes targetable by this class of ligands.
Abstract:
A general framework for multi-criteria optimal design is presented which is well-suited for automated design of structural systems. A systematic computer-aided optimal design decision process is developed which allows the designer to rapidly evaluate and improve a proposed design by taking into account the major factors of interest related to different aspects such as design, construction, and operation.
The proposed optimal design process requires the selection of the most promising choice of design parameters taken from a large design space, based on an evaluation using specified criteria. The design parameters specify a particular design, and so they relate to member sizes, structural configuration, etc. The evaluation of the design uses performance parameters which may include structural response parameters, risks due to uncertain loads and modeling errors, construction and operating costs, etc. Preference functions are used to implement the design criteria in a "soft" form. These preference functions give a measure of the degree of satisfaction of each design criterion. The overall evaluation measure for a design is built up from the individual measures for each criterion through a preference combination rule. The goal of the optimal design process is to obtain a design that has the highest overall evaluation measure, which is an optimization problem.
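A small sketch may help make "soft" preference functions and a combination rule concrete. The abstract does not specify either form, so everything below is an assumption for illustration: a piecewise-linear preference for smaller-is-better quantities, a weighted geometric mean as the combination rule, and hypothetical names and numbers throughout.

```python
import math


def soft_preference(value, fully_ok, unacceptable):
    """Degree of satisfaction in [0, 1] for a smaller-is-better performance parameter.

    Returns 1.0 at or below `fully_ok`, 0.0 at or above `unacceptable`,
    and falls linearly in between (an assumed shape, not the thesis's).
    """
    if value <= fully_ok:
        return 1.0
    if value >= unacceptable:
        return 0.0
    return (unacceptable - value) / (unacceptable - fully_ok)


def overall_measure(preferences, weights):
    """Combine per-criterion preferences with a weighted geometric mean (assumed rule)."""
    if any(p == 0.0 for p in preferences):
        return 0.0  # any fully violated criterion rejects the design
    total_weight = sum(weights)
    log_sum = sum(w * math.log(p) for p, w in zip(preferences, weights))
    return math.exp(log_sum / total_weight)


# Hypothetical candidate design: peak stress of 150 units and cost of 120 units.
stress_pref = soft_preference(150.0, fully_ok=100.0, unacceptable=200.0)
cost_pref = soft_preference(120.0, fully_ok=100.0, unacceptable=200.0)
overall = overall_measure([stress_pref, cost_pref], weights=[1.0, 1.0])
print(stress_pref, cost_pref, overall)
```

An optimizer would then search the design space for the parameters that maximize the overall measure.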
Genetic algorithms are stochastic optimization methods that are based on evolutionary theory. They provide the exploration power necessary to explore high-dimensional search spaces to seek these optimal solutions. Two special genetic algorithms, hGA and vGA, are presented here for continuous and discrete optimization problems, respectively.
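For readers unfamiliar with genetic algorithms, the sketch below shows a generic real-coded variant with tournament selection, blend crossover, Gaussian mutation, and elitism. It is not hGA or vGA, whose details are not given in this abstract. The function names, parameter values, and the one-dimensional test function are all hypothetical.

```python
import random


def tournament(population, fitness, rng):
    """Pick two distinct individuals at random and return the fitter one."""
    a, b = rng.sample(population, 2)
    return a if fitness(a) >= fitness(b) else b


def genetic_maximize(fitness, bounds, pop_size=40, generations=60,
                     mutation_scale=0.05, seed=0):
    """Maximize a scalar function of one real variable with a simple genetic algorithm."""
    rng = random.Random(seed)
    lo, hi = bounds
    population = [rng.uniform(lo, hi) for _ in range(pop_size)]
    for _ in range(generations):
        children = [max(population, key=fitness)]  # elitism: keep the current best
        while len(children) < pop_size:
            p1 = tournament(population, fitness, rng)
            p2 = tournament(population, fitness, rng)
            alpha = rng.random()
            child = alpha * p1 + (1.0 - alpha) * p2              # blend crossover
            child += rng.gauss(0.0, mutation_scale * (hi - lo))  # Gaussian mutation
            children.append(min(hi, max(lo, child)))             # clip to the bounds
        population = children
    return max(population, key=fitness)


# Hypothetical objective with a single maximum at x = 3.
best = genetic_maximize(lambda x: -(x - 3.0) ** 2, bounds=(0.0, 10.0))
print(best)
```

In a design setting, the fitness function would be the overall evaluation measure of a candidate design rather than this toy objective.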
The methodology is demonstrated with several examples involving the design of truss and frame systems. These examples are solved by using the proposed hGA and vGA.
Abstract:
Metal complexes that utilize the 9,10-phenanthrene quinone diimine (phi) moiety bind to DNA through the major groove. These metallointercalators can recognize DNA sites and perform reactions on DNA as a substrate. The site-specific metallointercalator Λ-1-Rh(MGP)_2phi^(5+) competitively disrupts the major groove binding of a transcription factor, yAP-1, from an oligonucleotide that contains a common binding site. The demonstration that metal complexes can prevent transcription factor binding to DNA site-specifically is an important step in using metallointercalators as therapeutics.
The distinctive photochemistry of metallointercalators can also be applied to promote long range charge transport in DNA. Experiments using duplexes with regions 4 to 10 nucleotides long containing strictly adenine and thymine sequences of varying order showed that radical migration is more dependent on the sequence of bases, and less dependent on the distance between the guanine doublets. This result suggests that mechanistic proposals of long range charge transport must involve all the bases.
RNA/DNA hybrids show charge migration to guanines from a remote site, thus demonstrating that nucleic acid stacking other than B-form can serve as a radical bridge. Double crossover DNA assemblies also provide a medium for charge transport at distances up to 100 Å from the site of radical introduction by a tethered metal complex. This radical migration was found to be robust to mismatches, and limited to individual, electronically distinct base stacks. In single DNA crossover assemblies, which have considerably greater flexibility, charge migration proceeds to both base stacks due to conformational isomers not present in the rigid and tightly annealed double crossovers.
Finally, a rapid, efficient, gel-based technique was developed to investigate thymine dimer repair. Two oligonucleotides, one radioactively labeled, are photoligated via the bases of a thymine-thymine interface; reversal of this ligation is easily visualized by gel electrophoresis. This assay was used to show that the repair of thymine dimers from a distance through DNA charge transport can be accomplished with different photooxidants.
Thus, nucleic acids that support long range charge transport have been shown to include A-track DNA, RNA/DNA hybrids, and single and double crossovers, and a method for thymine dimer repair detection using charge transport was developed. These observations underscore and extend the remarkable finding that DNA can serve as a medium for charge transport via the heteroaromatic base stack.
Abstract:
The visual system is a remarkable platform that evolved to solve difficult computational problems such as detection, recognition, and classification of objects. Of great interest is the face-processing network, a sub-system buried deep in the temporal lobe, dedicated to analyzing a specific type of object (faces). In this thesis, I focus on the problem of face detection by the face-processing network. Insights obtained from years of developing computer-vision algorithms to solve this task have suggested that it may be efficiently and effectively solved by detection and integration of local contrast features. Does the brain use a similar strategy? To answer this question, I embark on a journey that takes me through the development and optimization of dedicated tools for targeting and perturbing deep brain structures. Data collected using MR-guided electrophysiology in early face-processing regions was found to have strong selectivity for contrast features, similar to ones used by artificial systems. While individual cells were tuned for only a small subset of features, the population as a whole encoded the full spectrum of features that are predictive of the presence of a face in an image. Together with additional evidence, my results suggest a possible computational mechanism for face detection in early face processing regions. To move from correlation to causation, I focus on adopting an emergent technology for perturbing brain activity using light: optogenetics. While this technique has the potential to overcome problems associated with the de facto method of brain stimulation (electrical microstimulation), many open questions remain about its applicability and effectiveness for perturbing the non-human primate (NHP) brain.
In a set of experiments, I use viral vectors to deliver genetically encoded optogenetic constructs to the frontal eye field and face-selective regions in NHP and examine their effects side-by-side with electrical microstimulation to assess their effectiveness in perturbing neural activity as well as behavior. Results suggest that cells are robustly and strongly modulated upon light delivery and that such perturbation can modulate and even initiate motor behavior, thus paving the way for future explorations that may apply these tools to study connectivity and information flow in the face processing network.
Abstract:
The signal recognition particle (SRP) targets membrane and secretory proteins to their correct cellular destination with remarkably high fidelity. Previous studies have shown that multiple checkpoints exist within this targeting pathway that allow ‘correct cargo’ to be quickly and efficiently targeted and ‘incorrect cargo’ to be promptly rejected. In this work, we delved further into understanding the mechanisms of how substrates are selected or discarded by the SRP. First, we discovered the role of the SRP fingerloop and how it activates the SRP and SRP receptor (SR) GTPases to target and unload cargo in response to signal sequence binding. Second, we learned how an ‘avoidance signal’ found in the bacterial autotransporter, EspP, allows this protein to escape the SRP pathway by causing the SRP and SR to form a ‘distorted’ complex that is inefficient in delivering the cargo to the membrane. Lastly, we determined how Trigger Factor, a co-translational chaperone, helps SRP discriminate against ‘incorrect cargo’ at three distinct stages: SRP binding to RNC; targeting of RNC to the membrane via SRP-FtsY assembly; and stronger antagonism of SRP targeting of ribosomes bearing nascent polypeptides that exceed a critical length. Overall, our results delineate the rich underlying mechanisms by which SRP recognizes its substrates, which in turn activates the targeting pathway and provides a conceptual foundation to understand how timely and accurate selection of substrates is achieved by this protein targeting machinery.
Abstract:
Protein structure prediction has remained a major challenge in structural biology for more than half a century. Accelerated and cost-efficient sequencing technologies have allowed researchers to sequence new organisms and discover new protein sequences. Novel protein structure prediction technologies will allow researchers to study the structure of proteins, determine their roles in the underlying biological processes, and develop novel therapeutics.
The difficulty of the problem is twofold: (a) describing the energy landscape that corresponds to the protein structure, commonly referred to as the force field problem; and (b) sampling the energy landscape to find the lowest energy configuration, which is hypothesized to be the native state of the structure in solution. The two problems are interwoven and have to be solved simultaneously. This thesis is composed of three major contributions. In the first chapter we describe a novel high-resolution protein structure refinement algorithm called GRID. In the second chapter we present REMCGRID, an algorithm for generation of low energy decoy sets. In the third chapter, we present a machine learning approach to ranking decoys by incorporating coarse-grain features of protein structures.
Abstract:
In the first part of the thesis we explore three fundamental questions that arise naturally when we conceive a machine learning scenario where the training and test distributions can differ. Contrary to conventional wisdom, we show that in fact mismatched training and test distributions can yield better out-of-sample performance. This optimal performance can be obtained by training with the dual distribution. This optimal training distribution depends on the test distribution set by the problem, but not on the target function that we want to learn. We show how to obtain this distribution in both discrete and continuous input spaces, as well as how to approximate it in a practical scenario. Benefits of using this distribution are exemplified in both synthetic and real data sets.
In order to apply the dual distribution in the supervised learning scenario where the training data set is fixed, it is necessary to use weights to make the sample appear as if it came from the dual distribution. We explore the negative effect that weighting a sample can have. The theoretical decomposition of the use of weights regarding its effect on the out-of-sample error is easy to understand but not actionable in practice, as the quantities involved cannot be computed. Hence, we propose the Targeted Weighting algorithm that determines if, for a given set of weights, the out-of-sample performance will improve or not in a practical setting. This is necessary as the setting assumes there are no labeled points distributed according to the test distribution, only unlabeled samples.
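The idea of weighting a fixed sample so that it appears to come from another distribution can be sketched with standard importance weights. This is not the Targeted Weighting algorithm, and it assumes both densities are known exactly, which the practical setting described above does not. The two Gaussians, the sample size, and the use of effective sample size as a rough indicator of the cost of weighting are illustrative assumptions.

```python
import numpy as np


def gaussian_pdf(x, mean, std):
    """Density of a normal distribution evaluated at x."""
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))


rng = np.random.default_rng(0)

# Fixed training sample drawn from N(0, 1); the desired distribution is N(1, 1).
x_train = rng.normal(loc=0.0, scale=1.0, size=50_000)

# Importance weights: desired density divided by training density, then normalized.
weights = gaussian_pdf(x_train, 1.0, 1.0) / gaussian_pdf(x_train, 0.0, 1.0)
weights /= weights.sum()

# The weighted sample mean approximates the mean of the desired distribution.
weighted_mean = float(np.sum(weights * x_train))

# Effective sample size: uneven weights make the sample behave like a smaller one.
effective_n = 1.0 / float(np.sum(weights ** 2))

print(weighted_mean, effective_n)
```

The drop in effective sample size is one simple way to see why weighting can hurt: the estimate targets the right distribution but rests on fewer effective points.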
Finally, we propose a new class of matching algorithms that can be used to match the training set to a desired distribution, such as the dual distribution (or the test distribution). These algorithms can be applied to very large datasets, and we show how they lead to improved performance in a large real dataset such as the Netflix dataset. Their computational complexity is the main reason for their advantage over previous algorithms proposed in the covariate shift literature.
In the second part of the thesis we apply Machine Learning to the problem of behavior recognition. We develop a specific behavior classifier to study fly aggression, and we develop a system that allows analyzing behavior in videos of animals, with minimal supervision. The system, which we call CUBA (Caltech Unsupervised Behavior Analysis), allows detecting movemes, actions, and stories from time series describing the position of animals in videos. The method summarizes the data and provides biologists with a mathematical tool to test new hypotheses. Other benefits of CUBA include finding classifiers for specific behaviors without the need for annotation, as well as providing means to discriminate groups of animals, for example, according to their genetic line.