32 resultados para data pre-processing
Resumo:
BACKGROUND: Clinical Trials (CTs) are essential for bridging the gap between experimental research on new drugs and their clinical application. Just like CTs for traditional drugs and biologics have helped accelerate the translation of biomedical findings into medical practice, CTs for nanodrugs and nanodevices could advance novel nanomaterials as agents for diagnosis and therapy. Although there is publicly available information about nanomedicine-related CTs, the online archiving of this information is carried out without adhering to criteria that discriminate between studies involving nanomaterials or nanotechnology-based processes (nano), and CTs that do not involve nanotechnology (non-nano). Finding out whether nanodrugs and nanodevices were involved in a study from CT summaries alone is a challenging task. At the time of writing, CTs archived in the well-known online registry ClinicalTrials.gov are not easily told apart as to whether they are nano or non-nano CTs-even when performed by domain experts, due to the lack of both a common definition for nanotechnology and of standards for reporting nanomedical experiments and results. METHODS: We propose a supervised learning approach for classifying CT summaries from ClinicalTrials.gov according to whether they fall into the nano or the non-nano categories. Our method involves several stages: i) extraction and manual annotation of CTs as nano vs. non-nano, ii) pre-processing and automatic classification, and iii) performance evaluation using several state-of-the-art classifiers under different transformations of the original dataset. RESULTS AND CONCLUSIONS: The performance of the best automated classifier closely matches that of experts (AUC over 0.95), suggesting that it is feasible to automatically detect the presence of nanotechnology products in CT summaries with a high degree of accuracy. This can significantly speed up the process of finding whether reports on ClinicalTrials.gov might be relevant to a particular nanoparticle or nanodevice, which is essential to discover any precedents for nanotoxicity events or advantages for targeted drug therapy.
Resumo:
La principal aportación de esta tesis doctoral ha sido la propuesta y evaluación de un sistema de traducción automática que permite la comunicación entre personas oyentes y sordas. Este sistema está formado a su vez por dos sistemas: un traductor de habla en español a Lengua de Signos Española (LSE) escrita y que posteriormente se representa mediante un agente animado; y un generador de habla en español a partir de una secuencia de signos escritos mediante glosas. El primero de ellos consta de un reconocedor de habla, un módulo de traducción entre lenguas y un agente animado que representa los signos en LSE. El segundo sistema está formado por una interfaz gráfica donde se puede especificar una secuencia de signos mediante glosas (palabras en mayúscula que representan los signos), un módulo de traducción entre lenguas y un conversor texto-habla. Para el desarrollo del sistema de traducción, en primer lugar se ha generado un corpus paralelo de 7696 frases en español con sus correspondientes traducciones a LSE. Estas frases pertenecen a cuatro dominios de aplicación distintos: la renovación del Documento Nacional de Identidad, la renovación del permiso de conducir, un servicio de información de autobuses urbanos y la recepción de un hotel. Además, se ha generado una base de datos con más de 1000 signos almacenados en cuatro sistemas distintos de signo-escritura. En segundo lugar, se ha desarrollado un módulo de traducción automática que integra dos técnicas de traducción con una estructura jerárquica: la primera basada en memoria y la segunda estadística. Además, se ha implementado un módulo de pre-procesamiento de las frases en español que, mediante su incorporación al módulo de traducción estadística, permite mejorar significativamente la tasa de traducción. En esta tesis también se ha mejorado la versión de la interfaz de traducción de LSE a habla. Por un lado, se han incorporado nuevas características que mejoran su usabilidad y, por otro, se ha integrado un traductor de lenguaje SMS (Short Message Service – Servicio de Mensajes Cortos) a español, que permite especificar la secuencia a traducir en lenguaje SMS, además de mediante una secuencia de glosas. El sistema de traducción propuesto se ha evaluado con usuarios reales en dos dominios de aplicación: un servicio de información de autobuses de la Empresa Municipal de Transportes de Madrid y la recepción del Hotel Intur Palacio San Martín de Madrid. En la evaluación estuvieron implicadas personas sordas y empleados de los dos servicios. Se extrajeron medidas objetivas (obtenidas por el sistema automáticamente) y subjetivas (mediante cuestionarios a los usuarios). Los resultados fueron muy positivos gracias a la opinión de los usuarios de la evaluación, que validaron el funcionamiento del sistema de traducción y dieron información valiosa para futuras líneas de trabajo. Por otro lado, tras la integración de cada uno de los módulos de los dos sistemas de traducción (habla-LSE y LSE-habla), los resultados de la evaluación y la experiencia adquirida en todo el proceso, una aportación importante de esta tesis doctoral es la propuesta de metodología de desarrollo de sistemas de traducción de habla a lengua de signos en los dos sentidos de la comunicación. En esta metodología se detallan los pasos a seguir para desarrollar el sistema de traducción para un nuevo dominio de aplicación. Además, la metodología describe cómo diseñar cada uno de los módulos del sistema para mejorar su flexibilidad, de manera que resulte más sencillo adaptar el sistema desarrollado a un nuevo dominio de aplicación. Finalmente, en esta tesis se analizan algunas técnicas para seleccionar las frases de un corpus paralelo fuera de dominio para entrenar el modelo de traducción cuando se quieren traducir frases de un nuevo dominio de aplicación; así como técnicas para seleccionar qué frases del nuevo dominio resultan más interesantes que traduzcan los expertos en LSE para entrenar el modelo de traducción. El objetivo es conseguir una buena tasa de traducción con la menor cantidad posible de frases. ABSTRACT The main contribution of this thesis has been the proposal and evaluation of an automatic translation system for improving the communication between hearing and deaf people. This system is made up of two systems: a Spanish into Spanish Sign Language (LSE – Lengua de Signos Española) translator and a Spanish generator from LSE sign sequences. The first one consists of a speech recognizer, a language translation module and an avatar that represents the sign sequence. The second one is made up an interface for specifying the sign sequence, a language translation module and a text-to-speech conversor. For the translation system development, firstly, a parallel corpus has been generated with 7,696 Spanish sentences and their LSE translations. These sentences are related to four different application domains: the renewal of the Identity Document, the renewal of the driver license, a bus information service and a hotel reception. Moreover, a sign database has been generated with more than 1,000 signs described in four different signwriting systems. Secondly, it has been developed an automatic translation module that integrates two translation techniques in a hierarchical structure: the first one is a memory-based technique and the second one is statistical. Furthermore, a pre processing module for the Spanish sentences has been implemented. By incorporating this pre processing module into the statistical translation module, the accuracy of the translation module improves significantly. In this thesis, the LSE into speech translation interface has been improved. On the one hand, new characteristics that improve its usability have been incorporated and, on the other hand, a SMS language into Spanish translator has been integrated, that lets specifying in SMS language the sequence to translate, besides by specifying a sign sequence. The proposed translation system has been evaluated in two application domains: a bus information service of the Empresa Municipal de Transportes of Madrid and the Hotel Intur Palacio San Martín reception. This evaluation has involved both deaf people and services employees. Objective measurements (given automatically by the system) and subjective measurements (given by user questionnaires) were extracted during the evaluation. Results have been very positive, thanks to the user opinions during the evaluation that validated the system performance and gave important information for future work. Finally, after the integration of each module of the two translation systems (speech- LSE and LSE-speech), obtaining the evaluation results and considering the experience throughout the process, a methodology for developing speech into sign language (and vice versa) into a new domain has been proposed in this thesis. This methodology includes the steps to follow for developing the translation system in a new application domain. Moreover, this methodology proposes the way to improve the flexibility of each system module, so that the adaptation of the system to a new application domain can be easier. On the other hand, some techniques are analyzed for selecting the out-of-domain parallel corpus sentences in order to train the translation module in a new domain; as well as techniques for selecting which in-domain sentences are more interesting for translating them (by LSE experts) in order to train the translation model.
Resumo:
The present work describes a new methodology for the automatic detection of the glottal space from laryngeal images based on active contour models (snakes). In order to obtain an appropriate image for the use of snakes based techniques, the proposed algorithm combines a pre-processing stage including some traditional techniques (thresholding and median filter) with more sophisticated ones such as anisotropic filtering. The value selected for the thresholding was fixed to the 85% of the maximum peak of the image histogram, and the anisotropic filter permits to distinguish two intensity levels, one corresponding to the background and the other one to the foreground (glottis). The initialization carried out is based on the magnitude obtained using the Gradient Vector Flow field, ensuring an automatic process for the selection of the initial contour. The performance of the algorithm is tested using the Pratt coefficient and compared against a manual segmentation. The results obtained suggest that this method provided results comparable with other techniques such as the proposed in (Osma-Ruiz et al., 2008).
Resumo:
Multigroup diffusion codes for three dimensional LWR core analysis use as input data pre-generated homogenized few group cross sections and discontinuity factors for certain combinations of state variables, such as temperatures or densities. The simplest way of compiling those data are tabulated libraries, where a grid covering the domain of state variables is defined and the homogenized cross sections are computed at the grid points. Then, during the core calculation, an interpolation algorithm is used to compute the cross sections from the table values. Since interpolation errors depend on the distance between the grid points, a determined refinement of the mesh is required to reach a target accuracy, which could lead to large data storage volume and a large number of lattice transport calculations. In this paper, a simple and effective procedure to optimize the distribution of grid points for tabulated libraries is presented. Optimality is considered in the sense of building a non-uniform point distribution with the minimum number of grid points for each state variable satisfying a given target accuracy in k-effective. The procedure consists of determining the sensitivity coefficients of k-effective to cross sections using perturbation theory; and estimating the interpolation errors committed with different mesh steps for each state variable. These results allow evaluating the influence of interpolation errors of each cross section on k-effective for any combination of state variables, and estimating the optimal distance between grid points.
Resumo:
Recent advances in non-destructive imaging techniques, such as X-ray computed tomography (CT), make it possible to analyse pore space features from the direct visualisation from soil structures. A quantitative characterisation of the three-dimensional solid-pore architecture is important to understand soil mechanics, as they relate to the control of biological, chemical, and physical processes across scales. This analysis technique therefore offers an opportunity to better interpret soil strata, as new and relevant information can be obtained. In this work, we propose an approach to automatically identify the pore structure of a set of 200-2D images that represent slices of an original 3D CT image of a soil sample, which can be accomplished through non-linear enhancement of the pixel grey levels and an image segmentation based on a PFCM (Possibilistic Fuzzy C-Means) algorithm. Once the solids and pore spaces have been identified, the set of 200-2D images is then used to reconstruct an approximation of the soil sample by projecting only the pore spaces. This reconstruction shows the structure of the soil and its pores, which become more bounded, less bounded, or unbounded with changes in depth. If the soil sample image quality is sufficiently favourable in terms of contrast, noise and sharpness, the pore identification is less complicated, and the PFCM clustering algorithm can be used without additional processing; otherwise, images require pre-processing before using this algorithm. Promising results were obtained with four soil samples, the first of which was used to show the algorithm validity and the additional three were used to demonstrate the robustness of our proposal. The methodology we present here can better detect the solid soil and pore spaces on CT images, enabling the generation of better 2D?3D representations of pore structures from segmented 2D images.
Resumo:
LHE (logarithmical hopping encoding) is a computationally efficient image compression algorithm that exploits the Weber–Fechner law to encode the error between colour component predictions and the actual value of such components. More concretely, for each pixel, luminance and chrominance predictions are calculated as a function of the surrounding pixels and then the error between the predictions and the actual values are logarithmically quantised. The main advantage of LHE is that although it is capable of achieving a low-bit rate encoding with high quality results in terms of peak signal-to-noise ratio (PSNR) and image quality metrics with full-reference (FSIM) and non-reference (blind/referenceless image spatial quality evaluator), its time complexity is O( n) and its memory complexity is O(1). Furthermore, an enhanced version of the algorithm is proposed, where the output codes provided by the logarithmical quantiser are used in a pre-processing stage to estimate the perceptual relevance of the image blocks. This allows the algorithm to downsample the blocks with low perceptual relevance, thus improving the compression rate. The performance of LHE is especially remarkable when the bit per pixel rate is low, showing much better quality, in terms of PSNR and FSIM, than JPEG and slightly lower quality than JPEG-2000 but being more computationally efficient.
Resumo:
La presente Tesis analiza y desarrolla metodología específica que permite la caracterización de sistemas de transmisión acústicos basados en el fenómeno del array paramétrico. Este tipo de estructuras es considerado como uno de los sistemas más representativos de la acústica no lineal con amplias posibilidades tecnológicas. Los arrays paramétricos aprovechan la no linealidad del medio aéreo para obtener en recepción señales en el margen sónico a partir de señales ultrasónicas en emisión. Por desgracia, este procedimiento implica que la señal transmitida y la recibida guardan una relación compleja, que incluye una fuerte ecualización así como una distorsión apreciable por el oyente. Este hecho reduce claramente la posibilidad de obtener sistemas acústicos de gran fidelidad. Hasta ahora, los esfuerzos tecnológicos dirigidos al diseño de sistemas comerciales han tratado de paliar esta falta de fidelidad mediante técnicas de preprocesado fuertemente dependientes de los modelos físicos teóricos. Estos están basados en la ecuación de propagación de onda no lineal. En esta Tesis se propone un nuevo enfoque: la obtención de una representación completa del sistema mediante series de Volterra que permita inferir un sistema de compensación computacionalmente ligero y fiable. La dificultad que entraña la correcta extracción de esta representación obliga a desarrollar una metodología completa de identificación adaptada a este tipo de estructuras. Así, a la hora de aplicar métodos de identificación se hace indispensable la determinación de ciertas características iniciales que favorezcan la parametrización del sistema. En esta Tesis se propone una metodología propia que extrae estas condiciones iniciales. Con estos datos, nos encontramos en disposición de plantear un sistema completo de identificación no lineal basado en señales pseudoaleatorias, que aumenta la fiabilidad de la descripción del sistema, posibilitando tanto la inferencia de la estructura basada en bloques subyacente, como el diseño de mecanismos de compensación adecuados. A su vez, en este escenario concreto en el que intervienen procesos de modulación, factores como el punto de trabajo o las características físicas del transductor, hacen inviables los algoritmos de caracterización habituales. Incluyendo el método de identificación propuesto. Con el fin de eliminar esta problemática se propone una serie de nuevos algoritmos de corrección que permiten la aplicación de la caracterización. Las capacidades de estos nuevos algoritmos se pondrán a prueba sobre un prototipo físico, diseñado a tal efecto. Para ello, se propondrán la metodología y los mecanismos de instrumentación necesarios para llevar a cabo el diseño, la identificación del sistema y su posible corrección, todo ello mediante técnicas de procesado digital previas al sistema de transducción. Los algoritmos se evaluarán en términos de error de modelado a partir de la señal de salida del sistema real frente a la salida sintetizada a partir del modelo estimado. Esta estrategia asegura la posibilidad de aplicar técnicas de compensación ya que éstas son sensibles a errores de estima en módulo y fase. La calidad del sistema final se evaluará en términos de fase, coloración y distorsión no lineal mediante un test propuesto a lo largo de este discurso, como paso previo a una futura evaluación subjetiva. ABSTRACT This Thesis presents a specific methodology for the characterization of acoustic transmission systems based on the parametric array phenomenon. These structures are well-known representatives of the nonlinear acoustics field and display large technological opportunities. Parametric arrays exploit the nonlinear behavior of air to obtain sonic signals at the receptors’side, which were generated within the ultrasonic range. The underlying physical process redunds in a complex relationship between the transmitted and received signals. This includes both a strong equalization and an appreciable distortion for a human listener. High fidelity, acoustic equipment based on this phenomenon is therefore difficult to design. Until recently, efforts devoted to this enterprise have focused in fidelity enhancement based on physically-informed, pre-processing schemes. These derive directly from the nonlinear form of the wave equation. However, online limited enhancement has been achieved. In this Thesis we propose a novel approach: the evaluation of a complete representation of the system through its projection onto the Volterra series, which allows the posterior inference of a computationally light and reliable compensation scheme. The main difficulty in the derivation of such representation strives from the need of a complete identification methodology, suitable for this particular type of structures. As an example, whenever identification techniques are involved, we require preliminary estimates on certain parameters that contribute to the correct parameterization of the system. In this Thesis we propose a methodology to derive such initial values from simple measures. Once these information is made available, a complete identification scheme is required for nonlinear systems based on pseudorandom signals. These contribute to the robustness and fidelity of the resulting model, and facilitate both the inference of the underlying structure, which we subdivide into a simple block-oriented construction, and the design of the corresponding compensation structure. In a scenario such as this where frequency modulations occur, one must control exogenous factors such as devices’ operation point and the physical properties of the transducer. These may conflict with the principia behind the standard identification procedures, as it is the case. With this idea in mind, the Thesis includes a series of novel correction algorithms that facilitate the application of the characterization results onto the system compensation. The proposed algorithms are tested on a prototype that was designed and built for this purpose. The methodology and instrumentation required for its design, the identification of the overall acoustic system and its correction are all based on signal processing techniques, focusing on the system front-end, i.e. prior to transduction. Results are evaluated in terms of input-output modelling error, considering a synthetic construction of the system. This criterion ensures that compensation techniques may actually be introduced, since these are highly sensible to estimation errors both on the envelope and the phase of the signals involved. Finally, the quality of the overall system will be evaluated in terms of phase, spectral color and nonlinear distortion; by means of a test protocol specifically devised for this Thesis, as a prior step for a future, subjective quality evaluation.
Resumo:
In a series of attempts to research and document relevant sloshing type phenomena, a series of experiments have been conducted. The aim of this paper is to describe the setup and data processing of such experiments. A sloshing tank is subjected to angular motion. As a result pressure registers are obtained at several locations, together with the motion data, torque and a collection of image and video information. The experimental rig and the data acquisition systems are described. Useful information for experimental sloshing research practitioners is provided. This information is related to the liquids used in the experiments, the dying techniques, tank building processes, synchronization of acquisition systems, etc. A new procedure for reconstructing experimental data, that takes into account experimental uncertainties, is presented. This procedure is based on a least squares spline approximation of the data. Based on a deterministic approach to the first sloshing wave impact event in a sloshing experiment, an uncertainty analysis procedure of the associated first pressure peak value is described.
Resumo:
The manipulation and handling of an ever increasing volume of data by current data-intensive applications require novel techniques for e?cient data management. Despite recent advances in every aspect of data management (storage, access, querying, analysis, mining), future applications are expected to scale to even higher degrees, not only in terms of volumes of data handled but also in terms of users and resources, often making use of multiple, pre-existing autonomous, distributed or heterogeneous resources.
Resumo:
To date, big data applications have focused on the store-and-process paradigm. In this paper we describe an initiative to deal with big data applications for continuous streams of events. In many emerging applications, the volume of data being streamed is so large that the traditional ‘store-then-process’ paradigm is either not suitable or too inefficient. Moreover, soft-real time requirements might severely limit the engineering solutions. Many scenarios fit this description. In network security for cloud data centres, for instance, very high volumes of IP packets and events from sensors at firewalls, network switches and routers and servers need to be analyzed and should detect attacks in minimal time, in order to limit the effect of the malicious activity over the IT infrastructure. Similarly, in the fraud department of a credit card company, payment requests should be processed online and need to be processed as quickly as possible in order to provide meaningful results in real-time. An ideal system would detect fraud during the authorization process that lasts hundreds of milliseconds and deny the payment authorization, minimizing the damage to the user and the credit card company.
Resumo:
The Microarray technique is rather powerful, as it allows to test up thousands of genes at a time, but this produces an overwhelming set of data files containing huge amounts of data, which is quite difficult to pre-process, separate, classify and correlate for interesting conclusions to be extracted. Modern machine learning, data mining and clustering techniques based on information theory, are needed to read and interpret the information contents buried in those large data sets. Independent Component Analysis method can be used to correct the data affected by corruption processes or to filter the uncorrectable one and then clustering methods can group similar genes or classify samples. In this paper a hybrid approach is used to obtain a two way unsupervised clustering for a corrected microarray data.
Resumo:
This paper proposes the optimization relaxation approach based on the analogue Hopfield Neural Network (HNN) for cluster refinement of pre-classified Polarimetric Synthetic Aperture Radar (PolSAR) image data. We consider the initial classification provided by the maximum-likelihood classifier based on the complex Wishart distribution, which is then supplied to the HNN optimization approach. The goal is to improve the classification results obtained by the Wishart approach. The classification improvement is verified by computing a cluster separability coefficient and a measure of homogeneity within the clusters. During the HNN optimization process, for each iteration and for each pixel, two consistency coefficients are computed, taking into account two types of relations between the pixel under consideration and its corresponding neighbors. Based on these coefficients and on the information coming from the pixel itself, the pixel under study is re-classified. Different experiments are carried out to verify that the proposed approach outperforms other strategies, achieving the best results in terms of separability and a trade-off with the homogeneity preserving relevant structures in the image. The performance is also measured in terms of computational central processing unit (CPU) times.
Resumo:
Due to the advancement of both, information technology in general, and databases in particular; data storage devices are becoming cheaper and data processing speed is increasing. As result of this, organizations tend to store large volumes of data holding great potential information. Decision Support Systems, DSS try to use the stored data to obtain valuable information for organizations. In this paper, we use both data models and use cases to represent the functionality of data processing in DSS following Software Engineering processes. We propose a methodology to develop DSS in the Analysis phase, respective of data processing modeling. We have used, as a starting point, a data model adapted to the semantics involved in multidimensional databases or data warehouses, DW. Also, we have taken an algorithm that provides us with all the possible ways to automatically cross check multidimensional model data. Using the aforementioned, we propose diagrams and descriptions of use cases, which can be considered as patterns representing the DSS functionality, in regard to DW data processing, DW on which DSS are based. We highlight the reusability and automation benefits that this can be achieved, and we think this study can serve as a guide in the development of DSS.
Resumo:
Due to the relative transparency of its embryos and larvae, the zebrafish is an ideal model organism for bioimaging approaches in vertebrates. Novel microscope technologies allow the imaging of developmental processes in unprecedented detail, and they enable the use of complex image-based read-outs for high-throughput/high-content screening. Such applications can easily generate Terabytes of image data, the handling and analysis of which becomes a major bottleneck in extracting the targeted information. Here, we describe the current state of the art in computational image analysis in the zebrafish system. We discuss the challenges encountered when handling high-content image data, especially with regard to data quality, annotation, and storage. We survey methods for preprocessing image data for further analysis, and describe selected examples of automated image analysis, including the tracking of cells during embryogenesis, heartbeat detection, identification of dead embryos, recognition of tissues and anatomical landmarks, and quantification of behavioral patterns of adult fish. We review recent examples for applications using such methods, such as the comprehensive analysis of cell lineages during early development, the generation of a three-dimensional brain atlas of zebrafish larvae, and high-throughput drug screens based on movement patterns. Finally, we identify future challenges for the zebrafish image analysis community, notably those concerning the compatibility of algorithms and data formats for the assembly of modular analysis pipelines.
Resumo:
In the last years, there has been an increase in the amount of real-time data generated. Sensors attached to things are transforming how we interact with our environment. Extracting meaningful information from these streams of data is essential for some application areas and requires processing systems that scale to varying conditions in data sources, complex queries, and system failures. This paper describes ongoing research on the development of a scalable RDF streaming engine.