942 results for Data Driven Modeling
Abstract:
We study semiparametric two-step estimators which have the same structure as parametric doubly robust estimators in their second step. The key difference is that we do not impose any parametric restriction on the nuisance functions that are estimated in a first stage, but retain a fully nonparametric model instead. We call these estimators semiparametric doubly robust estimators (SDREs), and show that they possess superior theoretical and practical properties compared to generic semiparametric two-step estimators. In particular, our estimators have substantially smaller first-order bias, allow for a wider range of nonparametric first-stage estimates, rate-optimal choices of smoothing parameters and data-driven estimates thereof, and their stochastic behavior can be well-approximated by classical first-order asymptotics. SDREs exist for a wide range of parameters of interest, particularly in semiparametric missing data and causal inference models. We illustrate our method with a simulation exercise.
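The two-step structure described above can be illustrated with the classical augmented inverse-probability-weighting (AIPW) form of a doubly robust estimator for a mean with missing outcomes. This is a generic textbook sketch, not code from the paper; the arrays `pihat` and `muhat` stand in for whatever (parametric or fully nonparametric) first-stage nuisance estimates are used.

```python
import numpy as np

def aipw_mean(y, d, pihat, muhat):
    """Doubly robust (AIPW) estimate of E[Y] when Y is observed only if D = 1.

    y      : outcomes (values where d == 0 are never used)
    d      : 1 if the outcome is observed, 0 if missing
    pihat  : first-stage estimates of the propensity P(D = 1 | X)
    muhat  : first-stage estimates of the regression E[Y | X, D = 1]
    """
    y, d, pihat, muhat = map(np.asarray, (y, d, pihat, muhat))
    # IPW term plus augmentation: consistent if EITHER nuisance is correct,
    # which is the "doubly robust" property the abstract builds on.
    return np.mean(d * y / pihat - (d - pihat) / pihat * muhat)
```

When both nuisances are estimated nonparametrically, this second-step formula is exactly the structure the abstract calls an SDRE; the first-order bias reduction comes from the product form of the remainder, not from anything in this snippet.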
Abstract:
This research attempts to analyze the effects of open government data on the administration and practice of the educational process by comparing the contexts of Brazil and England. The findings illustrate two principal dynamics: control and collaboration. In the case of control, or what is called the "data-driven" paradigm, data help advance the cause of political accountability through the disclosure of school performance. In collaboration, or what is referred to as the "data-informed" paradigm, data are intended to support the decision-making process of administrators through dialogical processes with other social actors.
Abstract:
The synthetic control (SC) method has recently been proposed as an alternative method to estimate treatment effects in comparative case studies. Abadie et al. [2010] and Abadie et al. [2015] argue that one of the advantages of the SC method is that it imposes a data-driven process to select the comparison units, providing more transparency and less discretionary power to the researcher. However, an important limitation of the SC method is that it does not provide clear guidance on the choice of predictor variables used to estimate the SC weights. We show that this lack of specific guidance provides significant opportunities for the researcher to search for specifications with statistically significant results, undermining one of the main advantages of the method. Considering six alternative specifications commonly used in SC applications, we calculate in Monte Carlo simulations the probability of finding a statistically significant result at 5% in at least one specification. We find that this probability can be as high as 13% (23% for a 10% significance test) when there are 12 pre-intervention periods and decays slowly with the number of pre-intervention periods. With 230 pre-intervention periods, this probability is still around 10% (18% for a 10% significance test). We show that the specification that uses the average pre-treatment outcome values to estimate the weights performed particularly badly in our simulations. However, the specification-searching problem remains relevant even when we do not consider this specification. We also show that this specification-searching problem is relevant in simulations with real datasets looking at placebo interventions in the Current Population Survey (CPS). In order to mitigate this problem, we propose a criterion to select among different SC specifications based on the prediction error of each specification in placebo estimations.
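The data-driven weight-selection step the abstract refers to can be sketched as a constrained least-squares problem: given a vector of predictor values for the treated unit and a matrix with one column per comparison unit, the SC weights lie on the simplex. This is a minimal illustration under standard assumptions; the choice of which predictors enter `X1` and `X0` is precisely the specification choice the paper studies.

```python
import numpy as np
from scipy.optimize import minimize

def sc_weights(X1, X0):
    """Synthetic control weights: minimize ||X1 - X0 @ w||^2
    subject to w >= 0 and sum(w) = 1 (simplex constraint)."""
    X1, X0 = np.asarray(X1, float), np.asarray(X0, float)
    J = X0.shape[1]                        # number of comparison (donor) units
    obj = lambda w: np.sum((X1 - X0 @ w) ** 2)
    cons = ({"type": "eq", "fun": lambda w: np.sum(w) - 1.0},)
    bounds = [(0.0, 1.0)] * J
    w0 = np.full(J, 1.0 / J)               # start from uniform weights
    res = minimize(obj, w0, bounds=bounds, constraints=cons, method="SLSQP")
    return res.x
```

Running this with different predictor sets (pre-treatment outcome averages, covariates, odd/even pre-period outcomes, etc.) and keeping the one that yields significance is exactly the specification search the paper warns against.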
Abstract:
Waterflooding is a technique widely applied in the oil industry. The injected water displaces oil toward the producer wells and prevents reservoir pressure decline. However, suspended particles in the injected water may plug pore throats, causing formation damage (permeability reduction) and injectivity decline during waterflooding. When injectivity decline occurs, it is necessary to increase the injection pressure in order to maintain the water injection rate. Therefore, a reliable prediction of injectivity decline is essential in waterflooding projects. In this dissertation, a simulator based on the traditional porous-medium filtration model (including deep bed filtration and external filter cake formation) was developed and applied to predict injectivity decline in perforated wells (this prediction was made from history data). Experimental modeling of injectivity decline in open-hole wells is also discussed. The injectivity modeling showed good agreement with field data and can be used to support the planning of injection-well stimulation.
Abstract:
We present a measurement of the semileptonic mixing asymmetry for B0 mesons, a_sl^d, using two independent decay channels: B0 → μ+ D− X, with D− → K+ π− π−; and B0 → μ+ D*− X, with D*− → D̄0 π−, D̄0 → K+ π− (and charge-conjugate processes). We use a data sample corresponding to 10.4 fb−1 of pp̄ collisions at √s = 1.96 TeV, collected with the D0 experiment at the Fermilab Tevatron collider. We extract the charge asymmetries in these two channels as a function of the visible proper decay length of the B0 meson, correct for detector-related asymmetries using data-driven methods, and account for dilution from charge-symmetric processes using Monte Carlo simulation. The final measurement combines four signal visible proper decay length regions for each channel, yielding a_sl^d = [0.68 ± 0.45 (stat) ± 0.14 (syst)]%. This is the single most precise measurement of this parameter, with uncertainties smaller than the current world average of B factory measurements. © 2012 American Physical Society.
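The charge asymmetries mentioned above are, at their core, simple count asymmetries; a minimal sketch of the raw (uncorrected) quantity is below. The data-driven detector corrections and the dilution treatment described in the abstract are not reproduced here.

```python
def raw_asymmetry(n_plus, n_minus):
    """Raw charge asymmetry A = (N+ - N-) / (N+ + N-).

    In the analysis described above, detector-related asymmetries and the
    dilution from charge-symmetric processes are corrected on top of this
    raw quantity; those steps are omitted in this sketch.
    """
    total = n_plus + n_minus
    if total == 0:
        raise ValueError("no events in this bin")
    return (n_plus - n_minus) / total
```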
Abstract:
Coordenação de Aperfeiçoamento de Pessoal de Nível Superior (CAPES)
Abstract:
Speech processing has become a technology increasingly based on the automatic modeling of vast amounts of data. As a result, the success of research in this area is directly tied to the existence of public-domain corpora and other specific resources, such as a phonetic dictionary. In Brazil, unlike what happens for the English language, for example, there is currently no public-domain Automatic Speech Recognition (ASR) system for Brazilian Portuguese with support for large vocabularies. Against this backdrop, the main goal of this work is to discuss efforts within the FalaBrasil initiative [1], created by the Signal Processing Laboratory (LaPS) at UFPA, presenting research and software in the area of ASR for Brazilian Portuguese. More specifically, the present work discusses the implementation of a large-vocabulary speech recognition system for Brazilian Portuguese using the HTK toolkit, which is based on hidden Markov models (HMMs), and the creation of a grapheme-to-phoneme conversion module using machine learning techniques.
Abstract:
The Common Reflection Surface (CRS) stacking method was originally introduced as a data-driven method to simulate zero-offset sections from pre-stack 2-D seismic reflection data acquired along a straight acquisition line. The method is based on a second-order hyperbolic traveltime approximation parameterized by three kinematic wavefield attributes. In land data, topographic effects play an important role in seismic data processing and imaging, and this feature has recently been taken into account by the CRS method. In this work we present a review of the CRS traveltime approximations that account for smooth and rugged topography. In addition, we also review the Multifocus traveltime approximation for the case of rugged topography. By means of a simple synthetic example, we finally provide the first comparisons between the different traveltime expressions.
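For reference, the second-order hyperbolic traveltime approximation mentioned above is usually written, in its standard 2-D form for a planar measurement surface (the notation follows the common CRS literature and is an assumption, not taken from this text):

```latex
t^2(x_m, h) = \left[ t_0 + \frac{2\sin\beta}{v_0}\, x_m \right]^2
            + \frac{2\, t_0 \cos^2\beta}{v_0} \left( K_N\, x_m^2 + K_{NIP}\, h^2 \right)
```

where x_m is the midpoint displacement, h the half-offset, v_0 the near-surface velocity, and the three kinematic wavefield attributes are the emergence angle β and the wavefront curvatures K_N and K_NIP of the normal and normal-incidence-point waves. The topography-aware variants reviewed in the work modify this expression.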
Abstract:
A measurement of differential cross sections for the production of a pair of isolated photons in proton-proton collisions at √s = 7 TeV is presented. The data sample corresponds to an integrated luminosity of 5.0 fb−1 collected with the CMS detector. A data-driven isolation template method is used to extract the prompt diphoton yield. The measured cross section for two isolated photons, with transverse energy above 40 and 25 GeV respectively, in the pseudorapidity range |η| < 2.5, |η| ∉ [1.44, 1.57] and with an angular separation ΔR > 0.45, is 17.2 ± 0.2 (stat) ± 1.9 (syst) ± 0.4 (lumi) pb. Differential cross sections are measured as a function of the diphoton invariant mass, the diphoton transverse momentum, the azimuthal angle difference between the two photons, and the cosine of the polar angle in the Collins-Soper reference frame of the diphoton system. The results are compared to theoretical predictions at leading, next-to-leading, and next-to-next-to-leading order in quantum chromodynamics.
Abstract:
1. Distance sampling is a widely used technique for estimating the size or density of biological populations. Many distance sampling designs and most analyses use the software Distance. 2. We briefly review distance sampling and its assumptions, outline the history, structure and capabilities of Distance, and provide hints on its use. 3. Good survey design is a crucial prerequisite for obtaining reliable results. Distance has a survey design engine, with a built-in geographic information system, that allows properties of different proposed designs to be examined via simulation, and survey plans to be generated. 4. A first step in analysis of distance sampling data is modeling the probability of detection. Distance contains three increasingly sophisticated analysis engines for this: conventional distance sampling, which models detection probability as a function of distance from the transect and assumes all objects at zero distance are detected; multiple-covariate distance sampling, which allows covariates in addition to distance; and mark–recapture distance sampling, which relaxes the assumption of certain detection at zero distance. 5. All three engines allow estimation of density or abundance, stratified if required, with associated measures of precision calculated either analytically or via the bootstrap. 6. Advanced analysis topics covered include the use of multipliers to allow analysis of indirect surveys (such as dung or nest surveys), the density surface modeling analysis engine for spatial and habitat-modeling, and information about accessing the analysis engines directly from other software. 7. Synthesis and applications. Distance sampling is a key method for producing abundance and density estimates in challenging field conditions. The theory underlying the methods continues to expand to cope with realistic estimation situations. 
In step with theoretical developments, state-of-the-art software that implements these methods is described, making the methods accessible to practicing ecologists.
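The first analysis step described in point 4 (the conventional distance sampling engine) can be illustrated with the simplest possible case: a half-normal detection function g(x) = exp(−x²/2σ²) fitted to untruncated perpendicular distances, for which the line-transect maximum-likelihood estimate has the closed form σ̂² = Σx²/n. This is a bare-bones sketch, not the Distance software's implementation (which supports truncation, adjustment terms, and covariates).

```python
import math

def halfnormal_density(distances, L):
    """Line-transect density estimate with a half-normal detection function.

    distances : perpendicular detection distances (untruncated)
    L         : total transect length, in the same length unit as distances
    Returns objects per unit area.
    """
    n = len(distances)
    sigma2 = sum(x * x for x in distances) / n        # closed-form MLE of sigma^2
    # Effective strip half-width: integral of g(x) from 0 to infinity
    mu = math.sqrt(math.pi * sigma2 / 2.0)
    return n / (2.0 * mu * L)
```

The estimator divides the number of detections by the effectively surveyed area 2·μ̂·L, which is the core identity behind all three analysis engines described above.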
Abstract:
Double-observer line transect methods are becoming increasingly widespread, especially for the estimation of marine mammal abundance from aerial and shipboard surveys when detection of animals on the line is uncertain. The resulting data supplement conventional distance sampling data with two-sample mark–recapture data. Like conventional mark–recapture data, these have inherent problems for estimating abundance in the presence of heterogeneity. Unlike conventional mark–recapture methods, line transect methods use knowledge of the distribution of a covariate, which affects detection probability (namely, distance from the transect line) in inference. This knowledge can be used to diagnose unmodeled heterogeneity in the mark–recapture component of the data. By modeling the covariance in detection probabilities with distance, we show how the estimation problem can be formulated in terms of different levels of independence. At one extreme, full independence is assumed, as in the Petersen estimator (which does not use distance data); at the other extreme, independence only occurs in the limit as detection probability tends to one. Between the two extremes, there is a range of models, including those currently in common use, which have intermediate levels of independence. We show how this framework can be used to provide more reliable analysis of double-observer line transect data. We test the methods by simulation, and by analysis of a dataset for which true abundance is known. We illustrate the approach through analysis of minke whale sightings data from the North Sea and adjacent waters.
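The full-independence extreme mentioned above corresponds to the Lincoln–Petersen estimator, which ignores the distance data entirely. A minimal sketch under that assumption:

```python
def petersen_abundance(n1, n2, m):
    """Lincoln-Petersen abundance estimate for double-observer data under
    full independence: N = n1 * n2 / m, where n1 and n2 are the numbers of
    animals detected by each observer and m is the number of duplicates
    (animals detected by both)."""
    if m == 0:
        raise ValueError("no duplicate detections; estimator is undefined")
    return n1 * n2 / m
```

As the abstract notes, unmodeled heterogeneity in detection probability makes this extreme unreliable, which is why the intermediate models that exploit the known distribution of distances are preferred.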
Abstract:
Objective. To define inactive disease (ID) and clinical remission (CR) and to delineate variables that can be used to measure ID/CR in childhood-onset systemic lupus erythematosus (cSLE). Methods. Delphi questionnaires were sent to an international group of pediatric rheumatologists. Respondents provided information about variables to be used in future algorithms to measure ID/CR. The usefulness of these variables was assessed in 35 children with ID and 31 children with minimally active lupus (MAL). Results. While ID reflects cSLE status at a specific point in time, CR requires the presence of ID for >6 months and considers treatment. There was consensus that patients in ID/CR can have <2 mild nonlimiting symptoms (i.e., fatigue, arthralgia, headaches, or myalgia) but not Raynaud's phenomenon, chest pain, or objective physical signs of cSLE; antinuclear antibody positivity and erythrocyte sedimentation rate elevation can be present. Complete blood count, renal function testing, and complement C3 all must be within the normal range. Based on consensus, only damage-related laboratory or clinical findings of cSLE are permissible with ID. The above parameters were suitable to differentiate children with ID/CR from those with MAL (area under the receiver operating characteristic curve >0.85). Disease activity scores with or without the physician global assessment of disease activity and patient symptoms were well suited to differentiate children with ID from those with MAL. Conclusion. Consensus has been reached on common definitions of ID/CR with cSLE and relevant patient characteristics with ID/CR. Further studies must assess the usefulness of the data-driven candidate criteria for ID in cSLE.
Abstract:
The construction and use of multimedia corpora has been advocated for a while in the literature as one of the expected future application fields of Corpus Linguistics. This research project represents a pioneering experience aimed at applying a data-driven methodology to the study of the field of AVT, similarly to what has been done in the last few decades in the macro-field of Translation Studies. This research was based on the experience of Forlixt 1, the Forlì Corpus of Screen Translation, developed at the University of Bologna’s Department of Interdisciplinary Studies in Translation, Languages and Culture. As a matter of fact, in order to quantify strategies of linguistic transfer of an AV product, we need to take into consideration not only the linguistic aspect of such a product but all the meaning-making resources deployed in the filmic text. Provided that one major benefit of Forlixt 1 is the combination of audiovisual and textual data, this corpus allows the user to access primary data for scientific investigation, and thus no longer rely on pre-processed material such as traditional annotated transcriptions. Based on this rationale, the first chapter of the thesis sets out to illustrate the state of the art of research in the disciplinary fields involved. The primary objective was to underline the main repercussions on multimedia texts resulting from the interaction of a double support, audio and video, and, accordingly, on procedures, means, and methods adopted in their translation. By drawing on previous research in semiotics and film studies, the relevant codes at work in visual and acoustic channels were outlined. Subsequently, we concentrated on the analysis of the verbal component and on the peculiar characteristics of filmic orality as opposed to spontaneous dialogic production. In the second part, an overview of the main AVT modalities was presented (dubbing, voice-over, interlinguistic and intra-linguistic subtitling, audio-description, etc.) 
in order to define the different technologies, processes and professional qualifications that this umbrella term presently includes. The second chapter focuses diachronically on the contribution of various theories to the application of Corpus Linguistics' methods and tools to the field of Translation Studies (i.e. Descriptive Translation Studies, Polysystem Theory). In particular, we discussed how the use of corpora can favourably help reduce the gap existing between qualitative and quantitative approaches. Subsequently, we reviewed the tools traditionally employed by Corpus Linguistics for the construction of traditional "written language" corpora, to assess whether and how they can be adapted to meet the needs of multimedia corpora. In particular, we reviewed existing speech and spoken corpora, as well as multimedia corpora specifically designed to investigate translation. The third chapter reviews Forlixt 1's main development steps, from a technical (IT design principles, data query functions) and methodological point of view, laying down extensive scientific foundations for the annotation methods adopted, which presently encompass categories of a pragmatic, sociolinguistic, linguacultural and semiotic nature. Finally, we described the main query tools (free search, guided search, advanced search and combined search) and the main intended uses of the database from a pedagogical perspective. The fourth chapter lists the specific compilation criteria retained, as well as statistics of the two sub-corpora, presenting data broken down by language pair (French-Italian and German-Italian) and genre (cinema comedies, television soap operas and crime series). Next, we concentrated on the discussion of the results obtained from the analysis of summary tables reporting the frequency of categories applied to the French-Italian sub-corpus.
The detailed observation of the distribution of categories identified in the original and dubbed corpus allowed us to empirically confirm some of the theories put forward in the literature and notably concerning the nature of the filmic text, the dubbing process and Italian dubbed language’s features. This was possible by looking into some of the most problematic aspects, like the rendering of socio-linguistic variation. The corpus equally allowed us to consider so far neglected aspects, such as pragmatic, prosodic, kinetic, facial, and semiotic elements, and their combination. At the end of this first exploration, some specific observations concerning possible macrotranslation trends were made for each type of sub-genre considered (cinematic and TV genre). On the grounds of this first quantitative investigation, the fifth chapter intended to further examine data, by applying ad hoc models of analysis. Given the virtually infinite number of combinations of categories adopted, and of the latter with searchable textual units, three possible qualitative and quantitative methods were designed, each of which was to concentrate on a particular translation dimension of the filmic text. The first one was the cultural dimension, which specifically focused on the rendering of selected cultural references and on the investigation of recurrent translation choices and strategies justified on the basis of the occurrence of specific clusters of categories. The second analysis was conducted on the linguistic dimension by exploring the occurrence of phrasal verbs in the Italian dubbed corpus and by ascertaining the influence on the adoption of related translation strategies of possible semiotic traits, such as gestures and facial expressions. Finally, the main aim of the third study was to verify whether, under which circumstances, and through which modality, graphic and iconic elements were translated into Italian from an original corpus of both German and French films. 
After having reviewed the main translation techniques at work, an exhaustive account of possible causes for their non-translation was equally provided. By way of conclusion, the discussion of the results obtained from the distribution of annotation categories on the French-Italian corpus, as well as the application of specific models of analysis, allowed us to underline possible advantages and drawbacks related to the adoption of a corpus-based approach to AVT studies. Even though possible updates and improvements were proposed in order to help solve some of the problems identified, it is argued that the added value of Forlixt 1 lies ultimately in having created a valuable instrument that makes it possible to carry out empirically sound contrastive studies which may be usefully replicated on different language pairs and several types of multimedia texts. Furthermore, multimedia corpora can also play a crucial role in L2 and translation teaching, two disciplines in which their use still lacks systematic investigation.