989 results for Unstructured data
Abstract:
New information technology fields have emerged in recent years that explore the processing of the huge amount of existing digital data and its transformation into explicit knowledge. Natural Language Processing (NLP) techniques can extract information from digital texts presented in narrative form. Additionally, machine learning techniques can classify instances, according to their attributes, into different categories, learning from previously classified examples. Clinical texts are a large source of unstructured data and therefore of information that is not fully exploited. Some terms (tokens) in clinical texts appear in different situations: affirmed, negated, hypothetical, or historical. Detecting this situation is necessary for structuring the information, but it is far from simple. By extracting the linguistic features of tokens through NLP, transforming these tokens into instances and the features into attributes, machine learning techniques can classify each token as affirmed, negated, hypothetical, or historical. Selecting the attributes each instance must have and choosing the machine learning algorithm are crucial for the classification; together they constitute the classification model. Consequently, this work addresses feature extraction, attribute selection, and the selection of the machine learning algorithm for the detection of negation in Spanish clinical texts.
We present a classification model which, using the J48 algorithm and 35 attributes derived from linguistic features (morphological and syntactic) and negation triggers, detects whether a token is negated in 465 sentences from clinical texts, with an F-score of 73%, a recall of 66%, and a precision of 81% under 10-fold cross-validation.
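As an illustration of the classification step described above, the following sketch builds token instances from linguistic features and evaluates a decision tree with 10-fold cross-validation. The thesis uses Weka's J48 (a C4.5 implementation); scikit-learn's CART-based DecisionTreeClassifier stands in for it here, and the feature names and values are invented for illustration, not the actual 35 attributes.

```python
# Sketch of the token-classification step, assuming features have already been
# extracted by an NLP pipeline. J48 (C4.5) is replaced by scikit-learn's
# CART-based DecisionTreeClassifier; features and labels are illustrative.
from sklearn.feature_extraction import DictVectorizer
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeClassifier

# One dict per token: morphological/syntactic features plus a negation-trigger cue.
tokens = [
    {"pos": "NOUN", "is_trigger": False, "dist_to_trigger": 2, "dep": "obj"},
    {"pos": "ADV",  "is_trigger": True,  "dist_to_trigger": 0, "dep": "advmod"},
    {"pos": "VERB", "is_trigger": False, "dist_to_trigger": 1, "dep": "root"},
    {"pos": "NOUN", "is_trigger": False, "dist_to_trigger": 9, "dep": "nmod"},
] * 25  # repeated so 10-fold cross-validation has enough instances
labels = ["negated", "negated", "negated", "affirmed"] * 25

X = DictVectorizer(sparse=False).fit_transform(tokens)

# 10-fold cross-validation, reporting the same metrics as the abstract.
scores = cross_validate(
    DecisionTreeClassifier(), X, labels, cv=10,
    scoring=("f1_macro", "recall_macro", "precision_macro"),
)
print({k: v.mean().round(2) for k, v in scores.items() if k.startswith("test_")})
```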
Abstract:
The objective of this project arises from the need to rethink the classic philosophy of Human Language Technologies (TLH) so that it fits both the sources available today (unstructured data with multi-modality, multi-linguality, and different degrees of formality) and the real needs of end users. Achieving this objective requires integrating both the understanding and the generation of human language into a single model (the LEGOLANG model) based on language-deconstruction techniques, independent of its final application and of the human-language variant chosen to express the knowledge.
Abstract:
Thesis (Master's)--University of Washington, 2016-06
Abstract:
In the year 2001, the Commission on Dietetic Registration (CDR) will begin a new process of recertifying Registered Dietitians (RDs) using a self-directed lifelong-learning portfolio model. The model, entitled Professional Development 2001 (PD 2001), is designed to increase competency through targeted learning. The portfolio consists of five steps: reflection, learning-needs assessment, formulation of a learning plan, maintenance of a learning log, and evaluation of the learning plan. By targeting learning, PD 2001 is predicted to foster more up-to-date practitioners than the current method, which requires only a quantity of continuing-education hours. This is the first major change in the credentialing system since 1975. The success or failure of the new system will impact the future of approximately 60,000 practitioners. The purpose of this study was to determine the readiness of RDs to change to the new system. Since the model depends on setting goals and developing learning plans, this study examined the methods dietitians use to determine their five-year goals and direction in practice. It also determined RDs' attitudes towards PD 2001 and identified some of the factors that influenced their beliefs. A dual methodological design using focus groups and questionnaires was utilized. Sixteen focus groups were held during state dietetic association meetings. Demographic data were collected on the 132 registered dietitians who participated in the focus groups using a self-administered questionnaire. The audiotaped sessions were transcribed into 643 pages of text and analyzed using Non-numerical Unstructured Data Indexing, Searching and Theorizing (NUD*IST, version 4). Thirty-four of the 132 participants (26%) had formal five-year goals. Fifty-four participants (41%) performed annual self-assessments. In general, dietitians did not have professional goals, did not conduct self-assessments, and claimed they lacked the skills or confidence to perform these tasks. Major barriers to successful implementation of PD 2001 are uncertainty, misinterpretation, and misinformation about the process and purpose, which in turn contribute to negative impressions. Renewed vigor in providing a positive, accurate message, along with presenting goal-setting strategies, will be necessary for better acceptance of this professional development process.
Abstract:
In a highly connected society, avid for information and technological innovation and constantly changing its consumption patterns, brand management strategy occupies a growing place. With increased competition among companies, the brand that can differentiate itself in consumers' minds becomes strong. This aspect is even more important in the service industry, where the consumer experience and the definition and support of the brand's values are vital to the continued strength of both its identity and its image. These aspects are seen as a communication process in which the image developed in consumers' minds derives from how the identity is constructed and transmitted to them (DE CHERNATONY; DRURY; SEGAL-HORN, 2004). Considering this dynamic and complex scenario, this study aims to identify and analyze possible convergences and divergences between the identity built by the organization and the brand image perceived by consumers of a telecommunications services company. To achieve this objective, the model proposed by De Chernatony, Drury and Segal-Horn (2004), which addresses the transformation of identity into brand image, was used as the theoretical basis, specifically under the perspective of Pontes (2009). For him, customers are more motivated to buy and consume products that they believe complement the image they have of themselves, and he proposes the existence of multiple selves: the perceived self, which refers to the opinions of employees and the organization's management about the brand; the ideal self, the effective brand identity conceived by its leaders, the vision of what it should be; the social self, which shows how managers think consumers see the brand; the apparent self, formed by the image customers hold of the brand; and finally the real self, an integrated composite of all these visions. A case study was conducted in a telecommunications company with regional operations, using a qualitative and quantitative approach. The company's vision was identified through semi-structured interviews with marketing managers and the analysis of documents related to the brand strategy. The consumers' point of view was addressed through text-mining techniques applied to unstructured data collected from posts related to the brand on Facebook and Twitter and from customer interaction with the company through these social networks. The results showed the importance of the concepts of brand identity and image and how they are interrelated. Moreover, the qualitative analysis showed that the vision of the marketing executives is quite close to and in line with the Brand Book, indicating a cohesive and well-disseminated discourse within the organization. On the other hand, when evaluating the customers' point of view, there were no specific comments on the brand, and it was not possible to identify consumers' evaluation of the Algar Telecom image. Nevertheless, other aspects relevant to the consolidation of the brand identity could be identified, such as the occurrence of a number of complaints, especially regarding the internet, as well as customers' concern for the quality of service provision.
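The study's text-mining pipeline is not detailed in the abstract; as a minimal illustration of surfacing recurring themes from social-media posts, the following sketch counts term frequencies with scikit-learn. The post strings are fabricated examples.

```python
# Minimal term-frequency sketch of the post-mining step: the most frequent
# terms hint at dominant topics (e.g. internet complaints) in the posts.
# The posts below are invented, not data from the study.
from sklearn.feature_extraction.text import CountVectorizer

posts = [
    "internet down again, terrible service",
    "slow internet all week",
    "thanks for the quick support on my plan",
]

vec = CountVectorizer(stop_words="english")
counts = vec.fit_transform(posts).sum(axis=0).A1  # total count per term
terms = vec.get_feature_names_out()

for term, n in sorted(zip(terms, counts), key=lambda t: -t[1])[:5]:
    print(term, n)
```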
Abstract:
Thesis (Ph.D.)--University of Washington, 2016-08
Abstract:
Compared with structured data sources, which are usually stored and analyzed in spreadsheets, relational databases, and single data tables, unstructured construction data sources such as text documents, site images, web pages, and project schedules have been studied less intensively due to the additional challenges they pose in data preparation, representation, and analysis. In this paper, our vision for data management and mining that addresses these challenges is presented, together with related results from previous work and our recent developments in data mining on text-based, web-based, image-based, and network-based construction databases.
Abstract:
1. Ecological data sets often use clustered measurements or repeated sampling in a longitudinal design. Choosing the correct covariance structure is an important step in the analysis of such data, as the covariance describes the degree of similarity among the repeated observations. 2. Three methods for choosing the covariance are the Akaike information criterion (AIC), the quasi-information criterion (QIC), and the deviance information criterion (DIC). We compared the methods using a simulation study and a data set that explored the effects of forest fragmentation on avian species richness over 15 years. 3. The overall success rate was 80.6% for the AIC, 29.4% for the QIC and 81.6% for the DIC. For the forest fragmentation study, the AIC and DIC selected the unstructured covariance, whereas the QIC selected the simpler autoregressive covariance. Graphical diagnostics suggested that the unstructured covariance was probably correct. 4. We recommend using the DIC for selecting the correct covariance structure.
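As a minimal illustration of criterion-based covariance selection, the sketch below computes the AIC for a set of candidate covariance structures and keeps the one with the smallest value. The log-likelihoods and parameter counts are fabricated for illustration; the DIC recommended above would additionally require posterior samples from a Bayesian fit.

```python
# Criterion-based covariance selection: fit one model per candidate structure,
# then keep the candidate with the smallest criterion. AIC = 2k - 2*logL.
candidates = {
    # structure: (log-likelihood of fitted model, number of covariance params)
    "independence":   (-512.3, 1),
    "autoregressive": (-498.7, 2),
    "unstructured":   (-490.1, 15),  # 15 free params for 5 repeated measures
}

def aic(log_lik: float, n_params: int) -> float:
    return 2 * n_params - 2 * log_lik

scores = {name: aic(ll, k) for name, (ll, k) in candidates.items()}
best = min(scores, key=scores.get)
print(scores, "->", best)
```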
Abstract:
Objectives: Demonstrate the application of decision trees (classification and regression trees, CARTs, and their cousins, boosted regression trees, BRTs) to understand structure in missing data. Setting: Data taken from employees at three different industry sites in Australia. Participants: 7915 observations were included. Materials and Methods: The approach was evaluated using an occupational health dataset comprising results of questionnaires, medical tests, and environmental monitoring. Statistical methods included standard statistical tests and the 'rpart' and 'gbm' packages of the statistical software 'R' for the CART and BRT analyses, respectively. A simulation study was conducted to explore the capability of decision tree models to describe data with artificially introduced missingness. Results: CART and BRT models were effective in highlighting a missingness structure in the data related to the type of data (medical or environmental), the site at which it was collected, the number of visits, and the presence of extreme values. The simulation study revealed that CART models were able to identify the variables and values responsible for inducing missingness. There was greater variation in variable importance for unstructured compared with structured missingness. Discussion: Both CART and BRT models were effective in describing structural missingness in data. CART models may be preferred over BRT models for exploratory analysis of missing data and for selecting variables important for predicting missingness. BRT models can show how the values of other variables influence missingness, which may prove useful for researchers. Conclusion: Researchers are encouraged to use CART and BRT models to explore and understand missing data.
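As an illustration of the CART step, the following Python sketch (standing in for the paper's R 'rpart' workflow) turns missingness into a 0/1 target, fits a tree, and inspects variable importances. The data are synthetic, with missingness deliberately tied to one site.

```python
# Predicting a missingness indicator with a decision tree: the tree's variable
# importances reveal which variables drive the missingness. Synthetic data.
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
n = 1000
df = pd.DataFrame({
    "site": rng.integers(0, 3, n),          # 3 industry sites
    "visits": rng.integers(1, 6, n),
    "exposure": rng.normal(50, 10, n),      # environmental measurement
})
# Induce structured missingness: site 2 rarely has medical test results.
medical = rng.normal(120, 15, n)
medical[(df["site"] == 2).to_numpy() & (rng.random(n) < 0.8)] = np.nan

target = np.isnan(medical).astype(int)      # 1 = value is missing
tree = DecisionTreeClassifier(max_depth=3).fit(df, target)

# Importances should single out 'site' as the driver of missingness.
print(dict(zip(df.columns, tree.feature_importances_.round(2))))
```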
Abstract:
Huge amounts of data are generated from a variety of information sources in healthcare, originating from a wide range of clinical information systems and corporate data warehouses. The data derived from these sources are used for analysis and trending, thus playing an influential role in real-time decision making. The unstructured, narrative data provided by these sources qualify as healthcare big data, and researchers argue that the application of big data in healthcare might improve accountability and efficiency.
Abstract:
In this paper, an unstructured Chimera mesh method is used to compute incompressible flow around a rotating body. To implement the pressure correction algorithm on unstructured overlapping sub-grids, a novel interpolation scheme for pressure correction is proposed. This indirect interpolation scheme ensures a tight coupling of pressure between sub-domains. A moving-mesh finite volume approach is used to treat the rotating sub-domain, and the governing equations are formulated in an inertial reference frame. Since the mesh that surrounds the rotating body undergoes only solid-body rotation and the background mesh remains stationary, no mesh deformation is encountered in the computation. As a benefit of using an inertial frame, no tensorial transformation for velocity is needed. Three numerical simulations are successfully performed: flow over a fixed circular cylinder, flow over a rotating circular cylinder, and flow over a rotating elliptic cylinder. These numerical examples demonstrate the capability of the current scheme in handling moving boundaries, and the numerical results are in good agreement with experimental and computational data in the literature.
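The statement that the rotating sub-grid undergoes only solid-body rotation can be illustrated with a short sketch: node coordinates are rotated rigidly in the inertial frame each time step, so the mesh never deforms. The coordinates below are placeholders, not mesh data from the paper.

```python
# Rigid rotation of the sub-grid node coordinates in the inertial frame:
# no cell deforms, and no velocity transformation is required.
import numpy as np

def rotate_subgrid(nodes: np.ndarray, omega: float, dt: float,
                   center: np.ndarray) -> np.ndarray:
    """Rotate (N, 2) node coordinates by omega*dt about 'center'."""
    theta = omega * dt
    rot = np.array([[np.cos(theta), -np.sin(theta)],
                    [np.sin(theta),  np.cos(theta)]])
    return (nodes - center) @ rot.T + center

nodes = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])  # placeholder nodes
nodes = rotate_subgrid(nodes, omega=2.0, dt=0.01, center=np.zeros(2))
```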
Abstract:
Flow around moving boundaries is ubiquitous in engineering applications, and increasing the efficiency of algorithms that handle moving boundaries is still a major challenge in Computational Fluid Dynamics (CFD). The Chimera grid method is one type of method for handling moving boundaries. A domain-decomposition concept is adopted in this paper: sub-domains are meshed independently and the governing equations are solved separately on them. The Chimera grid method was originally used only on structured (curvilinear) meshes. However, in a problem that involves both moving boundaries and complex geometry, the number of sub-domains required by a traditional (structured) Chimera method becomes fairly large, and the time required for interior-boundary locating, link-building, and data exchange increases accordingly. The use of an unstructured Chimera grid can reduce this time consumption significantly by reducing the number of domains (blocks). Generally speaking, the unstructured Chimera grid method has not yet been well developed. In this paper, the well-known pressure-correction scheme SIMPLEC is modified and implemented on an unstructured Chimera mesh. A new interpolation scheme for the pressure correction is proposed to prevent possible decoupling of the pressure. A moving-mesh finite volume approach is implemented in an inertial reference frame and then used to compute incompressible flow around rotating circular and elliptic cylinders. These numerical examples demonstrate the capability of the proposed scheme in handling moving boundaries, and the numerical results are in good agreement with experimental and computational data in the literature. The method proposed in this paper can be efficiently applied to more challenging cases such as free-falling objects or heavy particles in fluids.
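The paper's indirect interpolation scheme for the pressure correction is not spelled out in the abstract; the sketch below only illustrates the general data exchange it performs, using plain inverse-distance weighting (not the paper's scheme) to transfer p' from donor cells of one sub-grid to fringe cells of the other.

```python
# Generic illustration of exchanging pressure correction across overlapping
# sub-grids: fringe cells receive p' interpolated from nearby donor cells.
import numpy as np

def interpolate_p_correction(fringe_xy: np.ndarray, donor_xy: np.ndarray,
                             donor_pc: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """Inverse-distance-weighted p' at each fringe cell from all donor cells."""
    d = np.linalg.norm(fringe_xy[:, None, :] - donor_xy[None, :, :], axis=2)
    w = 1.0 / (d + eps)
    return (w * donor_pc).sum(axis=1) / w.sum(axis=1)

donor_xy = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
donor_pc = np.array([0.10, -0.05, 0.02])          # p' on the donor grid
fringe_xy = np.array([[0.4, 0.3]])                # fringe cell of the other grid
print(interpolate_p_correction(fringe_xy, donor_xy, donor_pc))
```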
Abstract:
This study was undertaken by UKOLN on behalf of the Joint Information Systems Committee (JISC) in the period April to September 2008. Application profiles are metadata schemata which consist of data elements drawn from one or more namespaces, optimized for a particular local application. They offer a way for particular communities to base the interoperability specifications they create and use for their digital material on established open standards. This offers the potential for digital materials to be accessed, used and curated effectively both within and beyond the communities in which they were created. The JISC recognized the need to undertake a scoping study to investigate metadata application profile requirements for scientific data in relation to digital repositories, specifically concerning descriptive metadata to support resource discovery and other functions such as preservation. This followed on from the development of the Scholarly Works Application Profile (SWAP), undertaken within the JISC Digital Repositories Programme and led by Andy Powell (Eduserv Foundation) and Julie Allinson (RRT UKOLN) on behalf of the JISC. Aims and Objectives: 1. To assess whether a single metadata AP for research data, or a small number thereof, would improve resource discovery or discovery-to-delivery in any useful or significant way. 2. If so, then to: (a) assess whether the development of such AP(s) is practical and, if so, how much effort it would take; (b) scope a community uptake strategy that is likely to be successful, identifying the main barriers and key stakeholders. 3. Otherwise, to investigate how best to improve cross-discipline, cross-community discovery-to-delivery for research data, and make recommendations to the JISC and others as appropriate. Approach: The Study used a broad conception of what constitutes scientific data, namely data gathered, collated, structured and analysed using a recognizably scientific method, with a bias towards quantitative methods. The approach taken was to map out the landscape of existing data centres, repositories and associated projects, and to conduct a survey of the discovery-to-delivery metadata they use or have defined, alongside any insights they have gained from working with this metadata. This was followed up by a series of unstructured interviews discussing use cases for a Scientific Data Application Profile and how widely a single profile might be applied. On the latter point, matters of granularity, the experimental/measurement contrast, the quantitative/qualitative contrast, the raw/derived data contrast, and the homogeneous/heterogeneous data collection contrast were discussed. The Study report was loosely structured according to the Singapore Framework for Dublin Core Application Profiles and in turn considered: the possible use cases for a Scientific Data Application Profile; existing domain models that could be used, or adapted for use, within such a profile; and a comparison of existing metadata profiles and standards to identify candidate elements for inclusion in the description set profile for scientific data. The report also considered how the application profile might be implemented, its relationship to other application profiles, the alternatives to constructing a Scientific Data Application Profile, the development effort required, and what could be done to encourage uptake in the community. The conclusions of the Study were validated through a reference group of stakeholders.
Abstract:
Engineering companies face many challenges today, such as increased competition, higher expectations from consumers, and decreasing product lifecycle times. This means that product development times must be reduced to meet these challenges. Concurrent engineering, the reuse of engineering knowledge, and the use of advanced methods and tools are among the ways of reducing product development times. Concurrent engineering is crucial in making sure that products are designed with all issues considered simultaneously. The reuse of engineering knowledge allows existing solutions to be reused and can also help to avoid the mistakes made in previous designs. Computer-based tools are used to store information, automate tasks, distribute work, perform simulation, and so forth. This research concerns the evaluation of tools that can be used to support the design process, in terms of how they capture the information generated during that process; this information is vital for the reuse of knowledge. Present CAD systems store only information on the final definition of the product, such as geometry, materials, and manufacturing processes. Product Data Management (PDM) systems can manage all this CAD information along with other product-related information. The research includes the evaluation of two PDM systems, Windchill and Metaphase, using the design of a single-handed water tap as a case study. The two PDM systems were then compared with PROSUS/DDM using the same case study. PROSUS is the Process-Based Support System proposed by [Blessing 94]; the Design Data Model (DDM) is the product data model that includes PROSUS. The results look promising: PROSUS/DDM is able to capture most design information and to structure and present it logically, and the design process and product information are related and stored within the DDM structure. The PDM systems can capture most design information, but information from the early stages of design is stored only as unstructured documentation. Some problems were found with PROSUS/DDM; a proposal is made that may make it possible to resolve them, but this will require further research.
Abstract:
A parallel method for the dynamic partitioning of unstructured meshes is described. The method introduces a new iterative optimization technique, known as relative gain optimization, which both balances the workload and attempts to minimize the interprocessor communication overhead. Experiments on a series of adaptively refined meshes indicate that the algorithm provides partitions of equivalent or higher quality than static partitioners (which do not reuse the existing partition), and does so much more rapidly. Perhaps more importantly, the algorithm results in only a small fraction of the data migration required by the static partitioners.
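The details of relative gain optimization are not given in the abstract; the sketch below shows the classic Kernighan-Lin/Fiduccia-Mattheyses-style move gain that such refinement techniques build on: the edge-cut reduction obtained by moving one vertex to the other part. The graph and partition are toy placeholders.

```python
# FM/KL-style move gain on a toy graph: positive gain means moving the vertex
# to the other part reduces the edge cut (a proxy for communication overhead).
def move_gain(vertex, part, adjacency):
    """Edge-cut reduction if 'vertex' moves to the other part:
    (external edges it would internalize) - (internal edges it would cut)."""
    external = sum(1 for nb in adjacency[vertex] if part[nb] != part[vertex])
    internal = len(adjacency[vertex]) - external
    return external - internal

adjacency = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
part = {0: 0, 1: 0, 2: 1, 3: 0}
# Moving vertex 3 into part 1 internalizes its single cut edge (gain +1).
print({v: move_gain(v, part, adjacency) for v in adjacency})
```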