990 results for Discovery Tools


Relevance:

30.00% 30.00%

Publisher:

Abstract:

BACKGROUND: HIV-1 clade C (HIV-C) predominates worldwide, and anti-HIV-C vaccines are urgently needed. Neutralizing antibody (nAb) responses are considered important but have proved difficult to elicit. Although some current immunogens elicit antibodies that neutralize highly neutralization-sensitive (tier 1) HIV strains, most circulating HIVs exhibiting a less sensitive (tier 2) phenotype are not neutralized. Thus, both tier 1 and 2 viruses are needed for vaccine discovery in nonhuman primate models. METHODOLOGY/PRINCIPAL FINDINGS: We constructed a tier 1 simian-human immunodeficiency virus, SHIV-1157ipEL, by inserting an "early," recently transmitted HIV-C env into the SHIV-1157ipd3N4 backbone [1] encoding a "late" form of the same env, which had evolved in a SHIV-infected rhesus monkey (RM) with AIDS. SHIV-1157ipEL was rapidly passaged to yield SHIV-1157ipEL-p, which remained exclusively R5-tropic and had a tier 1 phenotype, in contrast to "late" SHIV-1157ipd3N4 (tier 2). After 5 weekly low-dose intrarectal exposures, SHIV-1157ipEL-p systemically infected 16 out of 17 RM with high peak viral RNA loads and depleted gut CD4+ T cells. SHIV-1157ipEL-p and SHIV-1157ipd3N4 env genes diverge mostly in V1/V2. Molecular modeling revealed a possible mechanism for the increased neutralization resistance of SHIV-1157ipd3N4 Env: V2 loops hindering access to the CD4 binding site, shown experimentally with nAb b12. Similar mutations have been linked to decreased neutralization sensitivity in HIV-C strains isolated from humans over time, indicating parallel HIV-C Env evolution in humans and RM. CONCLUSIONS/SIGNIFICANCE: SHIV-1157ipEL-p, the first tier 1 R5 clade C SHIV, and SHIV-1157ipd3N4, its tier 2 counterpart, represent biologically relevant tools for anti-HIV-C vaccine development in primates.

Relevance:

30.00% 30.00%

Publisher:

Abstract:

An enterprise information system (EIS) is an integrated data-applications platform characterized by diverse, heterogeneous, and distributed data sources. For many enterprises, a number of business processes still depend heavily on static rule-based methods and extensive human expertise. Enterprises are faced with the need for optimizing operation scheduling, improving resource utilization, discovering useful knowledge, and making data-driven decisions.

This thesis research is focused on real-time optimization and knowledge discovery that addresses workflow optimization, resource allocation, as well as data-driven predictions of process-execution times, order fulfillment, and enterprise service-level performance. In contrast to prior work on data analytics techniques for enterprise performance optimization, the emphasis here is on realizing scalable and real-time enterprise intelligence based on a combination of heterogeneous system simulation, combinatorial optimization, machine-learning algorithms, and statistical methods.

On-demand digital-print service is a representative enterprise requiring a powerful EIS. We use real-life data from Reischling Press, Inc. (RPI), a digital-print-service provider (PSP), to evaluate our optimization algorithms.

In order to handle the increase in volume and diversity of demands, we first present a high-performance, scalable, and real-time production scheduling algorithm for production automation based on an incremental genetic algorithm (IGA). The objective of this algorithm is to optimize the order dispatching sequence and balance resource utilization. Compared to prior work, this solution is scalable for a high volume of orders and it provides fast scheduling solutions for orders that require complex fulfillment procedures. Experimental results highlight its potential benefit in reducing production inefficiencies and enhancing the productivity of an enterprise.

We next discuss analysis and prediction of different attributes involved in hierarchical components of an enterprise. We start from a study of the fundamental processes related to real-time prediction. Our process-execution time and process status prediction models integrate statistical methods with machine-learning algorithms. In addition to improving prediction accuracy over stand-alone machine-learning algorithms, these models also provide a probabilistic estimation of the predicted status. An order generally consists of multiple series and parallel processes. We next introduce an order-fulfillment prediction model that combines advantages of multiple classification models by incorporating flexible decision-integration mechanisms. Experimental results show that adopting due dates recommended by the model can significantly reduce the enterprise late-delivery ratio. Finally, we investigate service-level attributes that reflect the overall performance of an enterprise. We analyze and decompose time-series data into different components according to their hierarchical periodic nature, perform correlation analysis, and develop univariate prediction models for each component as well as multivariate models for correlated components. Predictions for the original time series are aggregated from the predictions of its components. In addition to a significant increase in mid-term prediction accuracy, this distributed modeling strategy also improves short-term time-series prediction accuracy.
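The decompose, predict, and aggregate strategy described above can be illustrated with a minimal sketch. It assumes a single fixed period and two deliberately naive component models (a repeating per-phase profile and a constant residual level); the function names `decompose` and `forecast` are hypothetical, and the thesis's actual univariate and multivariate models are not specified in this abstract.

```python
from statistics import mean


def decompose(series, period):
    """Split a series into a periodic profile and a residual component.

    The profile holds the mean value observed at each phase of the period;
    the residual is whatever the profile does not explain.
    """
    profile = [mean(series[phase::period]) for phase in range(period)]
    residual = [x - profile[i % period] for i, x in enumerate(series)]
    return profile, residual


def forecast(series, period, horizon):
    """Predict each component separately, then aggregate by summation."""
    profile, residual = decompose(series, period)
    # Component model 1 (assumption): the periodic profile simply repeats.
    # Component model 2 (assumption): the residual stays at its recent mean.
    level = mean(residual[-period:])
    n = len(series)
    return [profile[(n + h) % period] + level for h in range(horizon)]


if __name__ == "__main__":
    history = [10, 20, 30] * 4  # a perfectly periodic toy series
    print(forecast(history, period=3, horizon=3))
```

Replacing either component model with a stronger predictor leaves the aggregation step unchanged, which is the practical appeal of modelling components separately.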

In summary, this thesis research has led to a set of characterization, optimization, and prediction tools for an EIS to derive insightful knowledge from data and use them as guidance for production management. It is expected to provide solutions for enterprises to increase reconfigurability, accomplish more automated procedures, and obtain data-driven recommendations or effective decisions.

Relevance:

30.00% 30.00%

Publisher:

Abstract:

Constitutive biosynthesis of lipid A via the Raetz pathway is essential for the viability and fitness of Gram-negative bacteria, including Chlamydia trachomatis. Although nearly all of the enzymes in the lipid A biosynthetic pathway are highly conserved across Gram-negative bacteria, the cleavage of the pyrophosphate group of UDP-2,3-diacyl-GlcN (UDP-DAGn) to form lipid X is carried out by two unrelated enzymes: LpxH in beta- and gammaproteobacteria and LpxI in alphaproteobacteria. The intracellular pathogen C. trachomatis lacks an ortholog for either of these two enzymes, and yet, it synthesizes lipid A and exhibits conservation of genes encoding other lipid A enzymes. Employing a complementation screen against a C. trachomatis genomic library using a conditional-lethal lpxH mutant Escherichia coli strain, we have identified an open reading frame (Ct461, renamed lpxG) encoding a previously uncharacterized enzyme that complements the UDP-DAGn hydrolase function in E. coli and catalyzes the conversion of UDP-DAGn to lipid X in vitro. LpxG shows little sequence similarity to either LpxH or LpxI, highlighting LpxG as the founding member of a third class of UDP-DAGn hydrolases. Overexpression of LpxG results in toxic accumulation of lipid X and profoundly reduces the infectivity of C. trachomatis, validating LpxG as the long-sought-after UDP-DAGn pyrophosphatase in this prominent human pathogen. The complementation approach presented here overcomes the lack of suitable genetic tools for C. trachomatis and should be broadly applicable for the functional characterization of other essential C. trachomatis genes.

IMPORTANCE: Chlamydia trachomatis is a leading cause of infectious blindness and sexually transmitted disease. Due to the lack of robust genetic tools, the functions of many Chlamydia genes remain uncharacterized, including the essential gene encoding the UDP-DAGn pyrophosphatase activity for the biosynthesis of lipid A, the membrane anchor of lipooligosaccharide and the predominant lipid species of the outer leaflet of the bacterial outer membrane. We designed a complementation screen against the C. trachomatis genomic library using a conditional-lethal mutant of E. coli and identified the missing essential gene in the lipid A biosynthetic pathway, which we designated lpxG. We show that LpxG is a member of the calcineurin-like phosphatases and displays robust UDP-DAGn pyrophosphatase activity in vitro. Overexpression of LpxG in C. trachomatis leads to the accumulation of the predicted lipid intermediate and reduces bacterial infectivity, validating the in vivo function of LpxG and highlighting the importance of regulated lipid A biosynthesis in C. trachomatis.

Relevance:

30.00% 30.00%

Publisher:

Abstract:

In drug discovery, different methods exist to create new inhibitors possessing satisfactory biological activity. The multisubstrate adduct inhibitor (MAI) approach is one of these methods, which consists of a covalent combination between analogs of the substrate and the cofactor or of the multiple substrates used by the target enzyme. Adopted as the first line of investigation for many enzymes, this method has brought insights into the enzymatic mechanism, structure, and inhibitory requirements. In this review, the MAI approach, applied to different classes of enzyme, is reported from the point of view of biological activity.

Relevance:

30.00% 30.00%

Publisher:

Abstract:

Endothelial progenitor cells (EPCs) have great clinical value because they can be used as diagnostic biomarkers and as a cellular therapy for promoting vascular repair of ischaemic tissues. However, EPCs also have an additional research value in vascular disease modelling to interrogate human disease mechanisms. The term EPC is used to describe a diverse variety of cells, and we have identified a specific EPC subtype called outgrowth endothelial cell (OEC) as the best candidate for vascular disease modelling because of its high-proliferative potential and unambiguous endothelial commitment. OECs are isolated from human blood and can be exposed to pathologic conditions (forward approach) or be isolated from patients (reverse approach) in order to study vascular human disease. The use of OECs for modelling vascular disease will contribute greatly to improving our understanding of endothelial pathogenesis, which will potentially lead to the discovery of novel therapeutic strategies for vascular diseases.

Relevance:

30.00% 30.00%

Publisher:

Abstract:

Modulators of metabotropic glutamate receptor subtype 5 (mGluR5) may provide novel treatments for multiple central nervous system (CNS) disorders, including anxiety and schizophrenia. Although compounds have been developed to better understand the physiological roles of mGluR5 and potential usefulness for the treatment of these disorders, there are limitations in the tools available, including poor selectivity, low potency, and limited solubility. To address these issues, we developed an innovative assay that allows simultaneous screening for mGluR5 agonists, antagonists, and potentiators. We identified multiple scaffolds that possess diverse modes of activity at mGluR5, including both positive and negative allosteric modulators (PAMs and NAMs, respectively). 3-Fluoro-5-(3-(pyridine-2-yl)-1,2,4-oxadiazol-5-yl)benzonitrile (VU0285683) was developed as a novel selective mGluR5 NAM with high affinity for the 2-methyl-6-(phenylethynyl)-pyridine (MPEP) binding site. VU0285683 had anxiolytic-like activity in two rodent models for anxiety but did not potentiate phencyclidine-induced hyperlocomotor activity. (4-Hydroxypiperidin-1-yl)(4-phenylethynyl)phenyl)methanone (VU0092273) was identified as a novel mGluR5 PAM that also binds to the MPEP site. VU0092273 was chemically optimized to an orally active analog, N-cyclobutyl-6-((3-fluorophenyl)ethynyl)nicotinamide hydrochloride (VU0360172), which is selective for mGluR5. This novel mGluR5 PAM produced a dose-dependent reversal of amphetamine-induced hyperlocomotion, a rodent model predictive of antipsychotic activity. Discovery of structurally and functionally diverse allosteric modulators of mGluR5 that demonstrate in vivo efficacy in rodent models of anxiety and antipsychotic activity provides further support for the tremendous diversity of chemical scaffolds and modes of efficacy of mGluR5 ligands. In addition, these studies provide strong support for the hypothesis that multiple structurally distinct mGluR5 modulators have robust activity in animal models that predict efficacy in the treatment of CNS disorders.

Relevance:

30.00% 30.00%

Publisher:

Abstract:

Doctoral thesis, Pharmacy (Pharmaceutical and Therapeutic Chemistry), Universidade de Lisboa, Faculdade de Farmácia, 2014

Relevance:

30.00% 30.00%

Publisher:

Abstract:

Doctoral thesis, Pharmacy (Pharmaceutical and Therapeutic Chemistry), Universidade de Lisboa, Faculdade de Farmácia, 2016

Relevance:

30.00% 30.00%

Publisher:

Abstract:

This work aimed to contribute to drug discovery and development (DDD) for tauopathies while expanding our knowledge of this group of neurodegenerative disorders, which includes Alzheimer’s disease (AD). Using yeast, a recognized model for neurodegeneration studies, useful models were produced for studying the interaction of tau with beta-amyloid (Aβ), both hallmark proteins of AD. The characterization of these models suggests that these proteins co-localize and that Aβ1-42, which is toxic to yeast, is involved in tau40 phosphorylation (Ser396/404) via the yeast orthologue of GSK-3β, whereas tau seems to facilitate Aβ1-42 oligomerization. The mapping of tau’s interactome in yeast, achieved with a tau toxicity enhancer screen using the yeast deletion collection, provided a novel framework of 31 genes for identifying new mechanisms associated with tau pathology, as well as new drug targets or biomarkers. This genomic screen also led to the selection of the yeast strain mir1Δ-tau40 for the development of a new GPSD2™ drug discovery screening system. A unique library of 138 marine bacterial extracts, obtained from the Mid-Atlantic Ridge hydrothermal vents, was screened with mir1Δ-tau40. Three extracts were identified as suppressors of tau toxicity and constitute good starting points for DDD programs. The mir1Δ strain was sensitive to tau toxicity, linking tau pathology to mitochondrial function. SLC25A3, the human homologue of MIR1, codes for the mitochondrial phosphate carrier protein (PiC). Using iRNA, SLC25A3 expression was silenced in human neuroglioma cells as a first step towards engineering a neural model for replicating the results obtained in yeast. This model is essential for understanding the mechanisms of tau toxicity at the mitochondrial level and for validating PiC as a relevant drug target. The set of DDD tools presented here will foster the development of innovative and efficacious therapies, urgently needed to cope with tau-related disorders of high human and socioeconomic impact.

Relevance:

30.00% 30.00%

Publisher:

Abstract:

Hepatocellular Carcinoma (HCC) is a major healthcare problem, representing the third most common cause of cancer-related mortality worldwide. Chronic infections with Hepatitis B virus (HBV) and/or Hepatitis C virus (HCV) are the major risk factors for the development of HCC. The incidence of HBV-associated HCC is in decline as a result of an effective HBV vaccine; however, since an equally effective HCV vaccine has not yet been developed, there are 130 million HCV infected patients worldwide who are at a high risk of developing HCC. Because reliable parameters and/or tools for the early detection of HCC among high-risk individuals are severely lacking, HCC patients are always diagnosed at a late stage, when surgical solutions or effective treatment are not possible. Using urine as a non-invasive sample source, two different approaches (proteomic-based and genomic-based approaches) were pursued with the common goal of discovering potential biomarker candidates for the early detection of HCC among high-risk chronic HCV infected patients. Urine was collected from 106 HCV infected Egyptian patients, 32 of whom had already developed HCC and 74 of whom were diagnosed as HCC-free at the time of initial sample collection. In addition to these patients, urine samples were also collected from 12 healthy control individuals. Total urinary proteins, Trans-renal nucleic acid (Tr-NA) and microRNA (miRNA) were isolated from urine using novel methodologies and silicon carbide-loaded spin columns. In the first, "proteomic-based", approach, liquid chromatography coupled with tandem mass spectrometry (LC-MS/MS) was used to identify potential candidates from pooled urine samples. This was followed by validating relative expression levels of proteins present in urine among all the patients using quantitative real-time PCR (qRT-PCR).
This approach revealed that significant over-expression of three proteins, DJ-1, Chromatin Assembly Factor-1 (CAF-1) and Heat Shock Protein 60 (HSP60), was a characteristic event among HCC-post HCV infected patients. As a single-based HCC biomarker, CAF-1 over-expression identified HCC among HCV infected patients with a specificity of 90%, sensitivity of 66% and with an overall diagnostic accuracy of 78%. Moreover, the CAF-1/HSP60 tandem identified HCC among HCV infected patients with a specificity of 92%, sensitivity of 61% and with an overall diagnostic accuracy of 77%. In the second, genomic-based approach, two different approaches were pursued. The first approach was the miRNA-based approach. The expression levels of miRNAs isolated from urine were studied using the Illumina MicroRNA Expression Profiling Assay. This was followed by qRT-PCR-based validation of deregulated expression of identified miRNA candidates among all the patients. This approach shed light on the deregulated expression of a number of miRNAs, which may have a role in either the development of HCC among HCV infected patients (i.e. miR-640, miR-765, miR-200a, miR-521 and miR-520) or may allow for a better understanding of the viral-host interaction (miR-152, miR-486, miR-219, miR-452, miR-425, miR-154 and miR-31). Moreover, the deregulated expression of both miR-618 and miR-650 appeared to be a common event among HCC-post HCV infected patients. The results of the search for putative targets of these two miRNAs suggested that miR-618 may be a potent oncogene, as it targets the tumor-suppressor gene Low density lipoprotein-related protein 12 (LRP12), while miR-650 may be a potent tumor-suppressor gene, as it is supposed to downregulate the TNF receptor-associated factor-4 (TRAF4) oncogene.
The specificity of miR-618 and miR-650 deregulated expression patterns for the early detection of HCC among HCV infected patients was 68% and 58%, respectively, whereas the sensitivity was 64% and 72%, respectively. When the deregulated expression of both miRNAs was combined as a tandem biomarker, the specificity and the sensitivity were 75% and 58%, respectively. In the second, "Trans-renal nucleic acid-based", approach, urinary apoptotic nucleic acid (uaNA) levels of 70 ng/mL or more were found to be a good predictor of HCC among chronic HCV infected patients. The specificity and the sensitivity of this diagnostic approach were 76% and 86%, respectively, with an overall diagnostic value of 81%. The uaNA levels positively correlated with HCC disease progression as monitored by epigenetic changes of a panel of eight tumor-suppressor genes (TSGs) using methylation-sensitive PCR. Moreover, the pairing of high uaNA levels (≥ 70 ng/mL) and CAF-1 over-expression produced a highly specific (100%) multiple-based HCC biomarker with an acceptable sensitivity of 64%, and with a diagnostic accuracy of 82%. In comparison to the previous pairing, the uaNA levels (≥ 70 ng/mL) in tandem with HSP60 over-expression was less specific (89%) but highly sensitive (72%), resulting in a diagnostic accuracy of 64%. The specificities of miR-650 deregulated expression in combination with either high uaNA content or HSP60 over-expression were 82% and 79%, respectively, whereas the sensitivities of these combinations were 64% and 58%, respectively. The potential biomarkers identified in this study compare favorably with the diagnostic accuracy of the α-fetoprotein levels test, which has a specificity of 75%, sensitivity of 68% and an overall diagnostic accuracy of 70%. Here we present an intriguing study which shows the significance of using urine as a noninvasive sample source for the identification of promising HCC biomarkers.
We have also introduced new techniques for the isolation of different urinary macromolecules, especially miRNA. Furthermore, we strongly recommend the potential biomarkers identified in this study as focal points of any future research on HCC diagnosis. A larger testing pool will determine if their use is practical for mass population screening. This explorative study identified potential targets that merit further investigation for the development of diagnostically accurate biomarkers isolated from 1-2 mL urine samples that were acquired in a non-invasive manner.
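For readers comparing the biomarker figures above, the following minimal sketch shows how sensitivity and specificity are conventionally derived from confusion counts. The counts in the example are hypothetical and are not taken from the study. Several of the reported overall accuracy values (for instance 78% for CAF-1 at 90% specificity and 66% sensitivity) coincide with the mean of the two measures; that the author computed accuracy this way is an assumption, and the function `balanced_accuracy` is included only to make that comparison easy to check.

```python
def sensitivity(true_pos, false_neg):
    """Fraction of diseased patients that the marker flags as positive."""
    return true_pos / (true_pos + false_neg)


def specificity(true_neg, false_pos):
    """Fraction of disease-free patients that the marker leaves unflagged."""
    return true_neg / (true_neg + false_pos)


def balanced_accuracy(sens, spec):
    """Mean of sensitivity and specificity (assumed reading of 'accuracy')."""
    return (sens + spec) / 2


if __name__ == "__main__":
    # Hypothetical confusion counts, for illustration only.
    sens = sensitivity(true_pos=8, false_neg=2)
    spec = specificity(true_neg=9, false_pos=1)
    print(f"sensitivity={sens:.2f} specificity={spec:.2f}")

    # Reported CAF-1 figures from the abstract: 66% sensitivity, 90% specificity.
    print(f"mean of reported CAF-1 figures: {balanced_accuracy(0.66, 0.90):.2f}")
```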

Relevance:

30.00% 30.00%

Publisher:

Abstract:

Knowledge discovery support environments include data mining tools alongside classical data analysis tools. Supporting both kinds of tools requires a unified knowledge representation. We show that concept lattices, which are used as the knowledge representation in Conceptual Information Systems, can also be used for structuring the results of mining association rules. Conversely, we use ideas from association rules to reduce the complexity of the visualization of Conceptual Information Systems.
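As a minimal illustration of the two notions connected in this abstract, the sketch below computes the support and confidence of an association rule, together with the closure of an itemset (the set of items shared by every transaction that contains it). Closed itemsets correspond to the intents of formal concepts, which is what lets a concept lattice organize rule-mining results. The toy transactions and function names are hypothetical and are not taken from the paper.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item of the itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Support of the whole rule divided by support of its antecedent."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)


def closure(transactions, itemset):
    """Items common to all transactions containing the itemset."""
    itemset = set(itemset)
    matching = [t for t in transactions if itemset <= t]
    if not matching:
        return itemset
    return set.intersection(*matching)


if __name__ == "__main__":
    # Hypothetical toy data: each transaction is a set of items.
    data = [{"a", "b"}, {"a", "b", "c"}, {"a"}, {"b", "c"}]
    print(support(data, {"a", "b"}))
    print(confidence(data, {"a"}, {"b"}))
    print(sorted(closure(data, {"c"})))
```

In this toy data every transaction containing "c" also contains "b", so the closure of {"c"} is {"b", "c"} and the rule c → b holds with confidence 1.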

Relevance:

30.00% 30.00%

Publisher:

Abstract:

By considering left-right (L-R) asymmetries, we study the capabilities of lepton colliders in searching for new exotic vector bosons. Specifically, we study the effect of a doubly charged bilepton boson and an extra neutral vector boson appearing in a 3-3-1 model on the L-R asymmetries for the processes e⁻e⁻ → e⁻e⁻, μ⁻μ⁻ → μ⁻μ⁻ and e⁻μ⁻ → e⁻μ⁻, and show that these asymmetries are very sensitive to these new contributions and that they are in fact powerful tools for the discovery of this kind of vector boson.

Relevance:

30.00% 30.00%

Publisher:

Abstract:

The discovery and development of a new drug are time-consuming, difficult and expensive. This complex process has evolved from classical methods into an integration of modern technologies and innovative strategies addressed to the design of new chemical entities to treat a variety of diseases. The development of new drug candidates is often limited by initial compounds lacking reasonable chemical and biological properties for further lead optimization. Huge libraries of compounds are frequently selected for biological screening using a variety of techniques and standard models to assess potency, affinity and selectivity. In this context, it is very important to study the pharmacokinetic profile of the compounds under investigation. Recent advances have been made in the collection of data and the development of models to assess and predict pharmacokinetic properties (ADME - absorption, distribution, metabolism and excretion) of bioactive compounds in the early stages of drug discovery projects. This paper provides a brief perspective on the evolution of in silico ADME tools, addressing challenges, limitations, and opportunities in medicinal chemistry.

Relevance:

30.00% 30.00%

Publisher:

Abstract:

The study of protein expression profiles for biomarker discovery in serum and in mammalian cell populations needs the continuous improvement and combination of protein/peptide separation techniques, mass spectrometry, and statistical and bioinformatic approaches. In this thesis work, two different mass spectrometry-based protein profiling strategies have been developed and applied to liver and inflammatory bowel diseases (IBDs) for the discovery of new biomarkers. The first of them, based on bulk solid-phase extraction combined with matrix-assisted laser desorption/ionization - Time of Flight mass spectrometry (MALDI-TOF MS) and chemometric analysis of serum samples, was applied to the study of serum protein expression profiles both in IBDs (Crohn’s disease and ulcerative colitis) and in liver diseases (cirrhosis, hepatocellular carcinoma, viral hepatitis). The approach allowed the enrichment of serum proteins/peptides due to the high interaction surface between analytes and solid phase, and the high recovery due to the elution step performed directly on the MALDI-target plate. Furthermore, the use of a chemometric algorithm to select the variables with higher discriminant power made it possible to evaluate patterns of 20-30 proteins involved in the differentiation and classification of serum samples from healthy donors and diseased patients. These protein profiles discriminate among the pathologies with optimal classification and prediction abilities. In particular, in the study of inflammatory bowel diseases, after the analysis using C18 of 129 serum samples from healthy donors and from Crohn’s disease, ulcerative colitis and inflammatory control patients, a classification ability of 90.7% and a prediction ability of 72.9% were obtained. In the study of liver diseases (hepatocellular carcinoma, viral hepatitis and cirrhosis), a prediction ability of 80.6% was achieved using IDA-Cu(II) as the extraction procedure.
The identification of the selected proteins by MALDI-TOF/TOF MS analysis or by their selective enrichment followed by enzymatic digestion and MS/MS analysis may give useful information for identifying new biomarkers involved in the diseases. The second mass spectrometry-based protein profiling strategy developed was based on a label-free liquid chromatography electrospray ionization quadrupole - time of flight differential analysis approach (LC ESI-QTOF MS), combined with targeted MS/MS analysis of only the identified differences. The strategy was used for biomarker discovery in IBDs, and in particular in Crohn’s disease. The enriched serum peptidome and the subcellular fractions of intestinal epithelial cells (IECs) from healthy donors and Crohn’s disease patients were analysed. Combining the enrichment step for low molecular weight serum proteins with the LC-MS approach made it possible to evaluate a pattern of peptides derived from specific exoprotease activity in the coagulation and complement activation pathways. Among these peptides, particularly interesting was the discovery of clusters of peptides from fibrinopeptide A, Apolipoprotein E and A4, and complement C3 and C4. Further studies need to be performed to evaluate the specificity of these clusters and validate the results, in order to develop a rapid serum diagnostic test. The label-free LC ESI-QTOF MS differential analysis of the subcellular fractions of IECs from Crohn’s disease patients and healthy donors revealed many proteins that could be involved in the inflammation process. Among them, heat shock protein 70, tryptase alpha-1 precursor, and proteins whose upregulation can be explained by the increased activity of IECs in Crohn’s disease were identified. Follow-up studies for the validation of the results and the in-depth investigation of the inflammation pathways involved in the disease will be performed.
Both of the mass spectrometry-based protein profiling strategies developed here proved to be useful tools for the discovery of disease biomarkers, which need to be validated in further studies.

Relevance:

30.00% 30.00%

Publisher:

Abstract:

Innovation in sequencing technologies in recent years has made it possible to catalogue genetic variants in human samples, bringing new discoveries and insights to medical and pharmaceutical research, evolutionary research, and population studies. The quantity of sequences produced is very large, and identifying variants requires several stages of processing of the genetic information, each of which generates further information. Along with this immense accumulation of data came the scientific community's need to organize data in repositories, at first only to share research results and later to enable statistical studies directly on the genetic data. Large-scale studies involve data volumes on the order of petabytes, whose maintenance continues to pose a challenge for infrastructures. Given the variety and quantity of data produced, databases play a role of primary importance in this challenge. Data models and data organization in this field can make the difference not only for scalability but also, and above all, for readiness for data mining. Indeed, storing these data in files with quasi-standard formats, the size of these files, and the computational requirements involved make it difficult to write efficient analysis software and discourage large-scale studies and studies on heterogeneous data. Before designing the database, we therefore studied the evolution, over the last twenty years, of the quasi-standard formats for biological flat files, which contain heterogeneous metadata and the actual nucleotide sequences, with records lacking structural relationships. Recently this evolution has culminated in the use of the XML standard, but delimited flat files continue to be the standards most widely supported by tools and online platforms.
This was followed by an analysis of the internal organization of data in public biological databases. These databases contain genes, genetic variants, protein structures, phenotype ontologies, disease-gene relationships, and drug-gene relationships. The public databases studied include OMIM, Entrez, KEGG, UniProt, and GO. The main objective in studying and modelling the genetic database was to structure the data so as to integrate the heterogeneous data produced and make data mining processes computationally feasible. The choice of Hadoop/MapReduce technology proves particularly effective in this case, owing to the guaranteed scalability and the efficiency in the most complex and parallel statistical analyses, such as those concerning multi-locus allelic variants.
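To make the Hadoop/MapReduce reference concrete, the following is a minimal pure-Python sketch of the map, shuffle, and reduce pattern applied to counting how many samples carry each variant. It does not use the Hadoop API. The record layout (a sample identifier paired with a list of variant identifiers) and all names are hypothetical, since the abstract does not describe the actual data model.

```python
from collections import defaultdict


def map_phase(record):
    """Emit one (variant, 1) pair per variant observed in a sample."""
    _sample_id, variants = record
    for variant in variants:
        yield variant, 1


def shuffle(pairs):
    """Group emitted values by key, as the framework would between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Sum the counts collected for a single variant."""
    return key, sum(values)


def count_variants(records):
    """Run the three phases in sequence on an in-memory list of records."""
    pairs = (pair for record in records for pair in map_phase(record))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    # Hypothetical records: (sample id, variant ids observed in that sample).
    samples = [("s1", ["rs1", "rs2"]), ("s2", ["rs1"]), ("s3", ["rs2", "rs3"])]
    print(count_variants(samples))
```

Because each record is mapped independently and each key is reduced independently, both phases can be spread across machines, which is the scalability property the abstract attributes to this choice.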