931 results for large data sets
Abstract:
International non-governmental organisations (NGOs) are powerful political players who aim to influence global society. To be effective on a global scale, they must communicate their goals and achievements in different languages, and translation and translation policy play an essential role here. Despite NGOs’ important position in politics and society, little is known about how these organisations, which often have limited funds available, organise their translation work. This study aims to contribute to Translation Studies, and more specifically to the investigation of institutional translation, by exploring translation policies at Amnesty International, one of the most successful and powerful human rights NGOs in the world. Translation policy is understood as comprising three components: translation management, translation practices, and translation beliefs, based on Spolsky’s (2004) study of language policy. The thesis investigates how translation is organised, what kind of policies different Amnesty offices have in place, and how this is reflected in their translation products. It thus also examines how translation and translation policy affect the organisation’s message and voice as they are spread around the world. An ethnographic approach is used to analyse various data sets collected during fieldwork, including policy documents, guidelines on writing and translation, recorded interviews, e-mail correspondence, and fieldnotes. The thesis first explores Amnesty’s global translation policy, and then presents the results of a comparative analysis of local translation policies at two concrete offices: the Amnesty International Language Resource Centre in Paris (AILRC-FR) and Amnesty International Vlaanderen (AIVL). A corpus of English source texts and Dutch (AIVL) and French (AILRC-FR) target texts is analysed. The findings on translation policies and on the translation products are then combined to illustrate how translation affects Amnesty’s message and voice. The results show that there are large differences in how translation is organised depending on the local office and the language(s), and that this also influences the way in which Amnesty’s message and voice are represented. For Dutch and French specifically, translation policies and translation products differ considerably. The thesis describes how these differences are often the result of different beliefs and assumptions about translation, and how staff members within Amnesty are not aware of the different conceptions of translation that exist within Amnesty International as a formal institution. Organising opportunities where translation can be discussed (meetings, workshops, online platforms) can help to reduce such differences. The thesis concludes by suggesting that an increased awareness of these issues will enable Amnesty to make more effective use of translation in its fight against human rights violations.
Abstract:
Resource discovery is one of the key services in digitised cultural heritage collections. It requires intelligent mining of heterogeneous digital content as well as large-scale performance, which explains the recent advances in classification methods. Associative classifiers are convenient data mining tools in the field of cultural heritage because of their ability to take specific combinations of attribute values into account. Usually, associative classifiers prioritize support over confidence. The proposed classifier, PGN, questions this common approach and focuses on confidence first, retaining only 100%-confidence rules. Classification tasks in the field of cultural heritage usually deal with data sets that have many class labels; this variety is caused by the richness of the culture accumulated over the centuries. Comparisons of PGN with other classifiers, such as OneR, JRip and J48, show its competitiveness in recognizing multi-class datasets on collections of masterpieces from different West and East European fine art authors and movements.
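A minimal sketch of the confidence-first idea described above (not the published PGN algorithm): short attribute-value patterns are enumerated on a toy categorical data set, only rules with 100% confidence are retained, and new records are classified by voting among the matching pure rules. The data, rule-length limit, and voting scheme are illustrative assumptions.

```python
from itertools import combinations
from collections import Counter, defaultdict

# Toy categorical records: (attributes dict, class label) -- illustrative only.
train = [
    ({"movement": "baroque", "region": "west"}, "Rubens"),
    ({"movement": "baroque", "region": "west"}, "Rubens"),
    ({"movement": "icon",    "region": "east"}, "Rublev"),
    ({"movement": "icon",    "region": "east"}, "Rublev"),
    ({"movement": "baroque", "region": "east"}, "Rublev"),
]

def mine_pure_rules(records, max_len=2):
    """Keep only 100%-confidence rules: antecedent -> single class label."""
    stats = defaultdict(Counter)          # antecedent -> class counts
    for attrs, label in records:
        items = sorted(attrs.items())
        for k in range(1, max_len + 1):
            for antecedent in combinations(items, k):
                stats[antecedent][label] += 1
    rules = {}
    for antecedent, counts in stats.items():
        if len(counts) == 1:              # confidence == 1.0
            label, support = counts.most_common(1)[0]
            rules[antecedent] = (label, support)
    return rules

def classify(rules, attrs):
    """Vote among matching pure rules, weighting by support."""
    votes = Counter()
    items = set(attrs.items())
    for antecedent, (label, support) in rules.items():
        if set(antecedent) <= items:
            votes[label] += support
    return votes.most_common(1)[0][0] if votes else None

rules = mine_pure_rules(train)
print(classify(rules, {"movement": "icon", "region": "east"}))   # Rublev
```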
Abstract:
Parkinson's disease (PD) is a complex heterogeneous disorder with an urgent need for disease-modifying therapies. Progress towards successful therapeutic approaches for PD will require an unprecedented level of collaboration. At a workshop hosted by Parkinson's UK and co-organized by the Critical Path Institute's (C-Path) Coalition Against Major Diseases (CAMD) Consortium, investigators from industry, academia, government and regulatory agencies agreed on the need for data sharing to enable future success. Government agencies included the EMA, FDA, NINDS/NIH and IMI (Innovative Medicines Initiative). Emerging discoveries in new biomarkers and genetic endophenotypes are contributing to our understanding of the underlying pathophysiology of PD. In parallel, there is growing recognition that early intervention will be key for successful treatments aimed at disease modification. At present, there is no comprehensive understanding of disease progression or of the many factors that contribute to its heterogeneity. Novel therapeutic targets and trial designs that incorporate existing and new biomarkers to evaluate drug effects, independently and in combination, are required. The integration of robust clinical data sets is viewed as a powerful approach to hasten medical discovery and therapies, as is being realized across diverse disease conditions that employ big data analytics for healthcare. Applying lessons learned from these parallel efforts is critical to identify barriers and enable a viable path forward. A roadmap is presented for a regulatory, academic, industry and advocacy driven integrated initiative that aims to facilitate and streamline new drug trials and registrations in Parkinson's disease.
Abstract:
The real purpose of collecting big data is to identify causality, in the hope that this will facilitate credible predictivity. But the search for causality can trap one in infinite regress, and so one takes refuge in seeking associations between variables in data sets. Regrettably, the mere knowledge of associations does not enable predictivity. Associations need to be embedded within the framework of probability calculus to make coherent predictions. This is so because associations are a feature of probability models, and hence they do not exist outside the framework of a model. Measures of association, like correlation, regression, and mutual information, merely refute a preconceived model. Estimated measures of association do not lead to a probability model; a model is the product of pure thought. This paper discusses these and other fundamentals that are germane to seeking associations in particular, and to machine learning in general. ACM Computing Classification System (1998): H.1.2, H.2.4, G.3.
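As a small illustration of the point that measures of association only refute or support a preconceived model, the hedged sketch below uses synthetic data in which Pearson correlation (tied to a linear model) is near zero while mutual information detects the dependence; neither quantity is itself a predictive probability model. scikit-learn's mutual_info_regression is used only for convenience.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=2000)
y = x**2 + rng.normal(scale=0.05, size=x.size)   # strong, but nonlinear, dependence

# Pearson correlation is near zero: it only speaks to a *linear* model of association.
r = np.corrcoef(x, y)[0, 1]

# Mutual information is clearly positive: dependence exists under a broader model class.
mi = mutual_info_regression(x.reshape(-1, 1), y, random_state=0)[0]

print(f"Pearson r = {r:.3f}, mutual information = {mi:.3f}")
# Neither number is a prediction; turning dependence into predictions still
# requires committing to a probability model for (x, y).
```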
Abstract:
The Digital Observatory for Protected Areas (DOPA) has been developed to support the European Union’s efforts to strengthen our capacity to mobilize and use biodiversity data, information and forecasts so that they are readily accessible to policymakers, managers, experts and other users. Conceived as a set of web-based services, DOPA provides a broad set of free and open source tools to assess, monitor and even forecast the state of, and pressure on, protected areas at local, regional and global scales. DOPA Explorer 1.0 is a web-based interface available in four languages (EN, FR, ES, PT) that provides simple means to explore the nearly 16,000 protected areas that are at least 100 km2 in size. Distinguishing between terrestrial, marine and mixed protected areas, DOPA Explorer 1.0 can help end users identify those with the most unique ecosystems and species, and assess the pressures they are exposed to because of human development. Recognized by the UN Convention on Biological Diversity (CBD) as a reference information system, DOPA Explorer is based on the best available global data sets and provides means to rank protected areas at the country and ecoregion levels. Conversely, DOPA Explorer indirectly highlights the protected areas for which information is incomplete. Finally, we invite the end users of DOPA to engage with us through the proposed communication platforms to help improve our work in support of safeguarding biodiversity.
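A purely illustrative sketch of the sort of filtering and country-level ranking DOPA Explorer is described as offering (areas of at least 100 km2, ranked within country); the table, column names, and indicator are invented for the example and are not DOPA's actual data model or API.

```python
import pandas as pd

# Invented sample records -- not DOPA data.
areas = pd.DataFrame([
    {"name": "Park A", "country": "FR", "type": "terrestrial", "area_km2": 250.0, "species_richness": 120},
    {"name": "Park B", "country": "FR", "type": "marine",      "area_km2":  80.0, "species_richness":  60},
    {"name": "Park C", "country": "PT", "type": "terrestrial", "area_km2": 400.0, "species_richness": 200},
    {"name": "Park D", "country": "PT", "type": "mixed",       "area_km2": 150.0, "species_richness":  90},
])

# Keep only protected areas of at least 100 km2, as in DOPA Explorer 1.0.
large = areas[areas["area_km2"] >= 100.0].copy()

# Rank within each country by an example indicator (here: species richness).
large["country_rank"] = large.groupby("country")["species_richness"].rank(ascending=False)
print(large.sort_values(["country", "country_rank"]))
```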
Abstract:
Examining teachers' decisions to leave the profession, this study seeks answers to two questions. What role do earnings and alternative earning opportunities play in these decisions? How did the 2002 public-sector pay increase affect teacher attrition? For the analysis, the author estimated two types of models on the large linked OEP-ONYF-FH administrative database: (1) Cox proportional hazard models distinguishing two outcomes (leaving the teaching profession or not), and (2) competing risk models distinguishing, among the reasons for leaving, exits to another job from other forms of exit. The results show that earning opportunities affect attrition decisions. Higher income and higher relative earnings reduce the probability that a teacher leaves the profession, either to take up another occupation or to move into non-employment. The public-sector pay increase temporarily reduced the probability of attrition among young teachers, but the effect disappeared within one or two years. For teachers over 51, the pay increase tended to keep them in the profession, also reducing the probability that they would take up another occupation or move into non-employment. ______ The paper investigates teachers' decisions to leave the profession. It first examines the role that pay, compared with earnings in alternative occupations, plays in such decisions, and then discusses how the public-sector pay increase of 2002 affected exit decisions by teachers. Duration models were estimated using large merged administrative data sets: first binary-choice Cox proportional hazard models (leaving the teaching profession or not), then competing risk models that distinguish exits to another occupation from exits to a non-working state. The results show that earnings matter. Higher wages reduce the probability of leaving the teaching profession for another occupation or for non-employment. The public-sector pay increase temporarily decreased the probability of inexperienced teachers leaving the profession, but the effect disappeared after one or two years. For experienced teachers over 51 years old, the wage increase was found to reduce attrition.
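A minimal sketch of the two model types named above, fitted on synthetic data with the lifelines package (the actual analysis used the linked OEP-ONYF-FH administrative data, which are not reproduced here); the cause-specific competing-risk model is approximated by treating the competing exit type as censored, which is one standard simplification.

```python
import numpy as np
import pandas as pd
from lifelines import CoxPHFitter

rng = np.random.default_rng(1)
n = 500
df = pd.DataFrame({
    "years_observed": rng.exponential(8.0, n).round(1) + 0.1,
    "relative_wage": rng.normal(1.0, 0.2, n),           # pay relative to alternatives
    "age": rng.integers(25, 60, n),
    # 0 = still teaching (censored), 1 = exit to another job, 2 = exit to non-employment
    "exit_type": rng.choice([0, 1, 2], size=n, p=[0.6, 0.25, 0.15]),
})

# (1) Binary model: any exit vs. staying in teaching.
df["any_exit"] = (df["exit_type"] > 0).astype(int)
cph = CoxPHFitter()
cph.fit(df[["years_observed", "relative_wage", "age", "any_exit"]],
        duration_col="years_observed", event_col="any_exit")
cph.print_summary()

# (2) Cause-specific hazard for exits to another job: treat non-employment exits as censored.
df["exit_other_job"] = (df["exit_type"] == 1).astype(int)
cph_job = CoxPHFitter()
cph_job.fit(df[["years_observed", "relative_wage", "age", "exit_other_job"]],
            duration_col="years_observed", event_col="exit_other_job")
cph_job.print_summary()
```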
Abstract:
Methods for accessing data on the Web have been the focus of active research over the past few years. In this thesis we propose a method for representing Web sites as data sources. We designed Data Extractor, a data retrieval solution that allows us to define queries to Web sites and process the resulting data sets. Data Extractor is being integrated into the MSemODB heterogeneous database management system; with its help, database queries can be distributed over both local and Web data sources within the MSemODB framework. Data Extractor treats Web sites as data sources, controlling query execution and data retrieval, and works as an intermediary between applications and sites. It utilizes a twofold “custom wrapper” approach to information retrieval: wrappers for the majority of sites are easily built using a powerful and expressive scripting language, while complex cases are processed by Java-based wrappers that utilize a specially designed library of data retrieval, parsing and Web access routines. In addition to wrapper development, we thoroughly investigate issues associated with Web site selection, analysis and processing. Data Extractor is designed to act as a data retrieval server as well as an embedded data retrieval solution. We also use it to create mobile agents that are shipped over the Internet to the client's computer to perform data retrieval on behalf of the user. This approach allows Data Extractor to distribute and scale well. The study confirms the feasibility of building custom wrappers for Web sites: the approach provides accurate data retrieval, and power and flexibility in handling complex cases.
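A toy Python sketch of the “custom wrapper” idea: one small site-specific function turns a fetched page into uniform records for the query engine. The thesis's own wrappers use a dedicated scripting language and a Java library, neither of which is reproduced here; the URL, HTML structure, and field names below are hypothetical.

```python
import re
import urllib.request

def fetch(url: str) -> str:
    """Retrieve the raw HTML of a page (network access assumed)."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        return resp.read().decode("utf-8", errors="replace")

def wrap_example_catalog(html: str):
    """Site-specific wrapper: extract (title, price) records from a hypothetical listing page."""
    pattern = re.compile(
        r'<li class="item">\s*<span class="title">(.*?)</span>\s*'
        r'<span class="price">(.*?)</span>', re.S)
    return [{"title": t.strip(), "price": p.strip()} for t, p in pattern.findall(html)]

# Usage (hypothetical URL; each supported site would get its own wrapper function):
# records = wrap_example_catalog(fetch("https://example.org/catalog?page=1"))
# for r in records:
#     print(r["title"], r["price"])
```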
Abstract:
The primary goal of this dissertation is to develop point-based rigid and non-rigid image registration methods that are more accurate than existing methods. We first present PoIRe, which provides the framework for point-based global rigid registrations. It allows a choice of different search strategies, including (a) branch-and-bound, (b) probabilistic hill-climbing, and (c) a novel hybrid method that takes advantage of the best characteristics of the other two. We use a robust similarity measure that is insensitive to noise, which is often introduced during feature extraction. We demonstrate the robustness of PoIRe by using it to register images obtained with an electronic portal imaging device (EPID), which have large amounts of scatter and low contrast. To evaluate PoIRe we used (a) simulated images and (b) images with fiducial markers; PoIRe was extensively tested with 2D EPID images and with images generated by 3D Computed Tomography (CT) and Magnetic Resonance (MR) imaging. PoIRe was also evaluated using benchmark data sets from the blind retrospective evaluation project (RIRE). We show that PoIRe outperforms existing methods such as Iterative Closest Point (ICP) and methods based on mutual information. We also present a novel point-based local non-rigid shape registration algorithm. We extend the robust similarity measure used in PoIRe to non-rigid registrations, adapting it to a free-form deformation (FFD) model and making it robust to local minima, a drawback common to existing non-rigid point-based methods. For non-rigid registrations we show that it performs better than existing methods and that it is less sensitive to starting conditions. We test our non-rigid registration method using available benchmark data sets for shape registration. Finally, we explore the extraction of features invariant to changes in perspective and illumination, and how they can help improve the accuracy of multi-modal registration. For multimodal registration of EPID-DRR images we present a method based on a local descriptor defined by a vector of complex responses to a circular Gabor filter.
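A simplified sketch (not PoIRe itself) of point-based rigid 2D registration by probabilistic hill-climbing over rotation and translation, using a truncated nearest-neighbour distance as a stand-in for the robust similarity measure described above; points and parameters are synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)

def transform(points, theta, tx, ty):
    """Apply a 2D rotation by theta and a translation (tx, ty)."""
    c, s = np.cos(theta), np.sin(theta)
    R = np.array([[c, -s], [s, c]])
    return points @ R.T + np.array([tx, ty])

def robust_cost(moving, fixed, cap=5.0):
    """Sum of nearest-neighbour distances, truncated at `cap` to resist noisy points."""
    d = np.linalg.norm(moving[:, None, :] - fixed[None, :, :], axis=2).min(axis=1)
    return np.minimum(d, cap).sum()

def hill_climb_register(moving, fixed, iters=2000):
    """Probabilistic hill-climbing over (theta, tx, ty): keep random perturbations that lower the cost."""
    params = np.zeros(3)
    best = robust_cost(transform(moving, *params), fixed)
    step = np.array([0.05, 1.0, 1.0])
    for _ in range(iters):
        cand = params + rng.normal(scale=step)
        cost = robust_cost(transform(moving, *cand), fixed)
        if cost < best:
            params, best = cand, cost
    return params, best

# Synthetic example: align a transformed copy back onto the original point set.
fixed = rng.uniform(0, 100, size=(60, 2))
moving = transform(fixed, theta=-0.3, tx=-8.0, ty=4.0)       # misaligned copy
params, cost = hill_climb_register(moving, fixed)
print("estimated (theta, tx, ty):", np.round(params, 2), "cost:", round(cost, 2))
```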
Abstract:
Prior research has established that the idiosyncratic volatility of securities prices exhibits a positive trend. This trend, among other factors, has made the merits of investment diversification and portfolio construction more compelling. A new optimization technique, a greedy algorithm, is proposed to optimize the weights of assets in a portfolio. The main benefits of using this algorithm are to: (a) increase the efficiency of the portfolio optimization process, (b) implement large-scale optimizations, and (c) improve the resulting optimal weights. In addition, the technique utilizes a novel approach to the construction of a time-varying covariance matrix: a modified integrated dynamic conditional correlation GARCH (IDCC-GARCH) model is applied to account for the dynamics of the conditional covariance matrices that are employed. The stochastic aspects of the expected returns of the securities are integrated into the technique through Monte Carlo simulation. Instead of representing the expected returns as deterministic values, they are assigned simulated values based on their historical measures. The time series of the securities are fitted to a probability distribution that matches their characteristics using the Anderson-Darling goodness-of-fit criterion. Simulated and actual data sets are used to further generalize the results: employing the S&P 500 securities as the base, 2,000 simulated data sets are created using Monte Carlo simulation, and the Russell 1000 securities are used to generate 50 sample data sets. The results indicate an increase in risk-return performance. Using Value-at-Risk (VaR) as the criterion and the Crystal Ball portfolio optimizer, a commercial product currently available on the market, as the benchmark, the new greedy technique clearly outperforms the alternatives on samples of the S&P 500 and Russell 1000 securities. The resulting improvements in performance are consistent across five securities selection methods (maximum, minimum, random, absolute minimum, and absolute maximum) and three covariance structures (unconditional, orthogonal GARCH, and integrated dynamic conditional GARCH).
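An illustrative sketch of one greedy weight-allocation scheme on simulated returns: weight is added in small increments to whichever asset currently yields the lowest simulated Value-at-Risk. It is a toy stand-in for the dissertation's technique, omitting the IDCC-GARCH covariance model and distribution fitting, and it runs on synthetic data.

```python
import numpy as np

rng = np.random.default_rng(42)

def var_95(portfolio_returns):
    """Historical 95% Value-at-Risk (positive number = loss)."""
    return -np.percentile(portfolio_returns, 5)

# Simulated daily returns for 10 assets (a Monte Carlo stand-in for historical fitting).
n_days, n_assets = 1000, 10
mean = rng.normal(0.0005, 0.0002, n_assets)
vol = rng.uniform(0.005, 0.02, n_assets)
returns = rng.normal(mean, vol, size=(n_days, n_assets))

def greedy_min_var_weights(returns, step=0.02):
    """Greedily build weights summing to 1 by repeatedly adding `step` to the asset
    whose inclusion yields the lowest portfolio VaR."""
    n = returns.shape[1]
    w = np.zeros(n)
    for _ in range(int(round(1.0 / step))):       # 1/step increments sum to 1
        candidates = [var_95(returns @ ((w + step * np.eye(n)[j]) / (w.sum() + step)))
                      for j in range(n)]
        w[int(np.argmin(candidates))] += step
    return w

weights = greedy_min_var_weights(returns)
print("weights:", np.round(weights, 2), "portfolio VaR:", round(var_95(returns @ weights), 4))
```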
Abstract:
Due to rapid advances in computing and sensing technologies, enormous amounts of data are generated every day in a wide range of applications. The integration of data mining and data visualization has been widely used to analyze these massive and complex data sets and discover hidden patterns. For both data mining and visualization to be effective, it is important to include visualization techniques in the mining process and to present the discovered patterns in a more comprehensive visual form. In this dissertation, four related problems are studied to explore the integration of data mining and data visualization: dimensionality reduction for visualizing high-dimensional datasets, visualization-based clustering evaluation, interactive document mining, and the exploration of multiple clusterings. In particular, we (1) propose an efficient feature selection method (reliefF + mRMR) for preprocessing high-dimensional datasets; (2) present DClusterE, which integrates cluster validation with user interaction and provides rich visualization tools for users to examine document clustering results from multiple perspectives; (3) design two interactive document summarization systems that involve users' efforts and generate customized summaries from 2D sentence layouts; and (4) propose a new framework that organizes the different input clusterings into a hierarchical tree structure and allows interactive exploration of multiple clustering solutions.
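A small sketch in the spirit of the "relevance minus redundancy" feature selection mentioned above: relevance is estimated with mutual_info_classif and redundancy is approximated by absolute correlation. This is a simplification (the dissertation combines ReliefF scores with mRMR), shown here on synthetic data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif

X, y = make_classification(n_samples=400, n_features=20, n_informative=5,
                           n_redundant=5, random_state=0)

def mrmr_select(X, y, k=5):
    """Greedy max-relevance / min-redundancy selection.
    Relevance: mutual information with the class.
    Redundancy: mean absolute correlation with already-selected features (a proxy)."""
    relevance = mutual_info_classif(X, y, random_state=0)
    corr = np.abs(np.corrcoef(X, rowvar=False))
    selected = [int(np.argmax(relevance))]
    while len(selected) < k:
        remaining = [j for j in range(X.shape[1]) if j not in selected]
        scores = [relevance[j] - corr[j, selected].mean() for j in remaining]
        selected.append(remaining[int(np.argmax(scores))])
    return selected

print("selected feature indices:", mrmr_select(X, y, k=5))
```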
Abstract:
Modern geographical databases, which are at the core of geographic information systems (GIS), store a rich set of aspatial attributes in addition to geographic data. Typically, aspatial information comes in textual and numeric formats. Retrieving information constrained on both spatial and aspatial data from geodatabases gives GIS users the ability to perform more interesting spatial analyses and allows applications to support composite location-aware searches; for example, in a real estate database: “Find the homes for sale nearest to my current location that have a backyard and whose prices are between $50,000 and $80,000”. Efficient processing of such queries requires combined indexing strategies for multiple types of data. Existing spatial query engines commonly apply a two-filter approach (spatial filter followed by non-spatial filter, or vice versa), which can incur large performance overheads. At the same time, the amount of geolocation data in databases has grown rapidly in recent years, due in part to advances in geolocation technologies (e.g., GPS-enabled smartphones) that allow users to associate location data with objects or events. This poses potential data-ingestion challenges of large data volumes for practical GIS databases. In this dissertation, we first show how indexing spatial data with R-trees (a typical data pre-processing task) can be scaled in MapReduce, a widely adopted parallel programming model for data-intensive problems. The evaluation of our algorithms in a Hadoop cluster showed close to linear scalability in building R-tree indexes. Subsequently, we develop efficient algorithms for processing spatial queries with aspatial conditions; novel techniques for simultaneously indexing spatial data together with textual and numeric data are developed to that end. Experimental evaluations with real-world, large spatial datasets measured query response times within the sub-second range for most cases, and up to a few seconds for a small number of cases, which is reasonable for interactive applications. Overall, these results show that the MapReduce parallel model is suitable for indexing tasks in spatial databases, and that an adequate combination of spatial and aspatial attribute indexes can attain acceptable response times for interactive spatial queries with constraints on aspatial data.
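A toy, single-process imitation of the map/reduce flow described for R-tree construction: the map step assigns each point to a grid partition and the reduce step builds a per-partition bounding box as a stand-in for a local R-tree. A real deployment would run in Hadoop with an actual R-tree implementation; the grid size and points are illustrative.

```python
from collections import defaultdict
import random

random.seed(0)
points = [(random.uniform(0, 100), random.uniform(0, 100)) for _ in range(10_000)]

CELL = 10  # cell width, giving a 10 x 10 partition grid over [0, 100) x [0, 100)

def map_phase(point):
    """Map: emit (partition key, point)."""
    x, y = point
    return (int(x // CELL), int(y // CELL)), point

def reduce_phase(grouped):
    """Reduce: per partition, build a minimum bounding rectangle (local R-tree stand-in)."""
    index = {}
    for key, pts in grouped.items():
        xs, ys = zip(*pts)
        index[key] = {"count": len(pts), "mbr": (min(xs), min(ys), max(xs), max(ys))}
    return index

# Shuffle/group step between map and reduce.
grouped = defaultdict(list)
for key, pt in map(map_phase, points):
    grouped[key].append(pt)

index = reduce_phase(grouped)
print(len(index), "partitions;", index[(0, 0)]["count"], "points in partition (0, 0)")
```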
Abstract:
Extensive data sets on water quality and seagrass distributions in Florida Bay have been assembled under complementary, but independent, monitoring programs. This paper presents the landscape-scale results from these monitoring programs and outlines a method for exploring the relationships between two such data sets. Seagrass species occurrence and abundance data were used to define eight benthic habitat classes from 677 sampling locations in Florida Bay. Water quality data from 28 monitoring stations spread across the Bay were used to construct a discriminant function model that assigned a probability of a given benthic habitat class occurring for a given combination of water quality variables. Mean salinity, salinity variability, the amount of light reaching the benthos, sediment depth, and mean nutrient concentrations were important predictor variables in the discriminant function model. Using a cross-validated classification scheme, this discriminant function identified the most likely benthic habitat type as the actual habitat type in most cases. The model predicted that the distribution of benthic habitat types in Florida Bay would likely change if water quality and water delivery were changed by human engineering of freshwater discharge from the Everglades. Specifically, an increase in the seasonal delivery of freshwater to Florida Bay should cause an expansion of seagrass beds dominated by Ruppia maritima and Halodule wrightii at the expense of the Thalassia testudinum-dominated community that now occurs in northeast Florida Bay. These statistical techniques should prove useful for predicting landscape-scale changes in community composition in diverse systems where communities are in quasi-equilibrium with environmental drivers.
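A minimal sketch of the kind of discriminant function analysis described above, using scikit-learn's LinearDiscriminantAnalysis with cross-validation and class-membership probabilities; the predictors and habitat classes are synthetic stand-ins, not the Florida Bay data.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 300
# Synthetic stand-ins for the predictors named in the abstract.
X = np.column_stack([
    rng.normal(30, 5, n),     # mean salinity
    rng.normal(5, 2, n),      # salinity variability
    rng.uniform(0, 1, n),     # fraction of light reaching the benthos
    rng.normal(1.0, 0.4, n),  # sediment depth (m)
    rng.normal(0.5, 0.2, n),  # mean nutrient concentration
])
# Synthetic habitat classes loosely tied to salinity and light, for illustration only.
y = (X[:, 0] > 30).astype(int) + 2 * (X[:, 2] > 0.5).astype(int)

lda = LinearDiscriminantAnalysis()
print("cross-validated accuracy:", cross_val_score(lda, X, y, cv=5).mean().round(2))

# Class-membership probabilities for a new combination of water-quality values.
lda.fit(X, y)
print(lda.predict_proba([[32.0, 4.0, 0.7, 1.1, 0.4]]).round(2))
```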
Abstract:
The move from Standard Definition (SD) to High Definition (HD) represents a sixfold increase in the data that needs to be processed. With expanding resolutions and evolving compression, there is a need for high performance with flexible architectures that allow quick upgradability. Technology is advancing in image display resolutions, compression techniques, and video intelligence. Software implementations of these systems can attain accuracy, with trade-offs among processing performance (achieving specified frame rates while working on large image data sets), power, and cost constraints. New architectures are needed to keep pace with the fast innovations in video and imaging; this work therefore includes dedicated hardware implementations of the pixel- and frame-rate processes on a Field Programmable Gate Array (FPGA) to achieve real-time performance. The contributions of the dissertation are as follows. (1) We develop a target detection system by applying a novel running average mean threshold (RAMT) approach to globalize the threshold required for background subtraction. This approach adapts the threshold automatically to different environments (indoor and outdoor) and different targets (humans and vehicles). For low power consumption and better performance, we design the complete system on an FPGA. (2) We introduce a safe-distance factor and develop an algorithm for detecting the occurrence of occlusion during target tracking; a novel mean threshold is calculated by motion-position analysis. (3) A new strategy for gesture recognition is developed using Combinational Neural Networks (CNN) based on a tree structure, and the method is analyzed on American Sign Language (ASL) gestures. We introduce a novel points-of-interest approach to reduce the feature vector size and a gradient-threshold approach for accurate classification. (4) We design a gesture recognition system using a hardware/software co-simulated neural network to meet the high-speed and low-memory requirements of the FPGA. We develop an innovative maximum-distance algorithm which uses only 0.39% of the image as the feature vector to train and test the system design. The gestures in a database may vary across applications, so it is highly desirable to keep the feature vector as small as possible while maintaining the same accuracy and performance.
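A compact sketch of a running-average background model with a self-adapting global threshold, offered as one plausible reading of the running average mean threshold (RAMT) idea; the dissertation's actual method and its FPGA implementation are not reproduced, and the frames are synthetic numpy arrays.

```python
import numpy as np

rng = np.random.default_rng(0)

def detect_targets(frames, alpha=0.05):
    """Background subtraction with a running-average background and an adaptive
    global threshold (a running average of the mean absolute frame difference)."""
    background = frames[0].astype(float)
    threshold = 0.0
    masks = []
    for frame in frames[1:]:
        diff = np.abs(frame.astype(float) - background)
        # Globalized, self-adapting threshold: running average of the mean difference.
        threshold = (1 - alpha) * threshold + alpha * diff.mean() * 3.0
        masks.append(diff > threshold)
        # Update the background model toward the current frame.
        background = (1 - alpha) * background + alpha * frame
    return masks

# Synthetic frames: static noise plus a bright moving block (the "target").
frames = []
for t in range(30):
    f = rng.normal(100, 2, size=(64, 64))
    f[20:30, t:t + 10] += 80          # moving bright region
    frames.append(f)

masks = detect_targets(frames)
print("foreground pixels in last frame:", int(masks[-1].sum()))
```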
Abstract:
The primary goal of this dissertation is the study of patterns of viral evolution inferred from serially-sampled sequence data, i.e., sequence data obtained from strains isolated at consecutive time points from a single patient or host. RNA viral populations have extremely high genetic variability, largely due to their astronomical population sizes within host systems, their high replication rates, and their short generation times. It is this aspect of their evolution that demands special attention and a different approach when studying the evolutionary relationships of serially-sampled sequence data. New methods for analyzing serially-sampled data were developed shortly after a groundbreaking HIV-1 study of several patients from whom viruses were isolated at recurring intervals over a period of 10 or more years. These methods assume a tree-like evolutionary model, while many RNA viruses can exchange genetic material with one another through a process called recombination. A genealogy involving recombination is best described by a network structure. A more general approach was implemented in a new computational tool, Sliding MinPD, which is mindful of the sampling times of the input sequences and reconstructs the viral evolutionary relationships as a network structure with implicit representations of recombination events. The underlying network organization reveals unique patterns of viral evolution and could help explain the emergence of disease-associated mutants and drug-resistant strains, with implications for patient prognosis and treatment strategies. To comprehensively test the developed methods and to carry out comparison studies with other methods, synthetic data sets are critical. Therefore, appropriate sequence generators were also developed to simulate the evolution of serially-sampled recombinant viruses, new and more thorough evaluation criteria for recombination detection methods were established, and three major comparison studies were performed. The newly developed tools were also applied to "real" HIV-1 sequence data, and it was shown that the results, represented within an evolutionary network structure, can be interpreted in biologically meaningful ways.
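A stripped-down sketch of the core linking step behind a MinPD-style analysis: each sequence sampled at time t is connected to its minimum-distance sequence from earlier sampling times, yielding a network-like ancestry. The sliding-window recombination detection that Sliding MinPD adds is omitted, and the sequences are synthetic.

```python
import random

random.seed(0)
ALPHABET = "ACGT"

def mutate(seq, n_mut=3):
    """Return a copy of seq with n_mut random point mutations."""
    s = list(seq)
    for pos in random.sample(range(len(s)), n_mut):
        s[pos] = random.choice(ALPHABET)
    return "".join(s)

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

# Serially-sampled synthetic sequences: {time point: [sequences]}.
root = "".join(random.choice(ALPHABET) for _ in range(200))
samples = {0: [root]}
for t in range(1, 4):
    samples[t] = [mutate(random.choice(samples[t - 1])) for _ in range(3)]

# Link each sequence to its minimum-distance ancestor among all *earlier* time points.
links = []
for t in range(1, 4):
    earlier = [(t0, i, s) for t0 in range(t) for i, s in enumerate(samples[t0])]
    for i, seq in enumerate(samples[t]):
        t0, j, _ = min(earlier, key=lambda e: hamming(seq, e[2]))
        links.append(((t, i), (t0, j)))

for child, ancestor in links:
    print(f"sequence {child} <- closest earlier sequence {ancestor}")
```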
Abstract:
Mapping vegetation patterns over large extents using remote sensing methods requires field sample collection for two different purposes: (1) the establishment of plant association classification systems from samples of relative abundance estimates; and (2) training for supervised image classification and accuracy assessment of maps derived from satellite data. One challenge for both procedures is establishing confidence in the results and analysing them across multiple spatial scales. Continuous data sets that enable cross-scale studies are very time-consuming and expensive to acquire, and such extensive field sampling can be invasive. The use of high-resolution aerial photography (hrAP) offers an alternative to extensive, invasive field sampling and can provide large-volume, spatially continuous reference information that can meet the challenges of confidence building and multi-scale analysis.