337 results for outliers


Relevance:

10.00%

Publisher:

Abstract:

The use of intensity-modulated radiotherapy (IMRT) treatments necessitates a significant amount of patient-specific quality assurance (QA). This research investigated the precision and accuracy of Kodak EDR2 film measurements for IMRT verifications, the use of comparisons between 2D dose calculations and measurements to improve treatment plan beam models, and the dosimetric impact of delivery errors. New measurement techniques and software were developed and used clinically at M. D. Anderson Cancer Center. The software implemented two new dose comparison parameters, the 2D normalized agreement test (NAT) and the scalar NAT index. A single-film calibration technique using multileaf collimator (MLC) delivery was developed. EDR2 film's optical density response was found to be sensitive to several factors: radiation time, length of time between exposure and processing, and phantom material. The precision of EDR2 film measurements was found to be better than 1%. For IMRT verification, EDR2 film measurements agreed with ion chamber results to 2%/2 mm accuracy for single-beam fluence map verifications and to 5%/2 mm for transverse plane measurements of complete plan dose distributions. The same system was used to quantitatively optimize the radiation field offset and MLC transmission beam modeling parameters for Varian MLCs. While scalar dose comparison metrics can work well for optimization purposes, the influence of external parameters on the dose discrepancies must be minimized. The ability of 2D verifications to detect delivery errors was tested with simulated data. The dosimetric characteristics of delivery errors were compared to patient-specific clinical IMRT verifications. For the clinical verifications, the NAT index and the percentage of pixels failing the gamma index were exponentially distributed and depended on the measurement phantom but not on the treatment site. Delivery errors affecting all beams in the treatment plan were flagged by the NAT index, although delivery errors affecting only one beam could not be differentiated from routine clinical verification discrepancies. Clinical use of this system will flag outliers, allow physicists to examine their causes, and perhaps improve the level of agreement between radiation dose distribution measurements and calculations. The principles used to design and evaluate this system are extensible to future multidimensional dose measurements and comparisons.
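The gamma index mentioned above is a standard way of comparing measured and calculated dose distributions, combining a dose-difference criterion with a distance-to-agreement (DTA) criterion. As an illustration only (this is not the NAT software developed in this work, and the grid, spacing, and tolerance values are assumptions), a brute-force 2D gamma calculation might look like this:

    import numpy as np

    def gamma_2d(ref, test, spacing_mm, dose_tol=0.02, dta_mm=2.0):
        """Brute-force 2D gamma index with global dose normalization.

        ref, test  : 2D dose arrays on the same grid (e.g. calculated vs. measured)
        spacing_mm : pixel size in mm
        dose_tol   : dose-difference criterion as a fraction of the maximum reference dose
        dta_mm     : distance-to-agreement criterion in mm
        """
        dd = dose_tol * ref.max()                      # absolute dose tolerance
        ny, nx = ref.shape
        yy, xx = np.mgrid[0:ny, 0:nx] * spacing_mm     # physical pixel coordinates
        gamma = np.empty(ref.shape)
        win = int(np.ceil(3 * dta_mm / spacing_mm))    # local search window in pixels
        for i in range(ny):
            for j in range(nx):
                i0, i1 = max(0, i - win), min(ny, i + win + 1)
                j0, j1 = max(0, j - win), min(nx, j + win + 1)
                r2 = ((yy[i0:i1, j0:j1] - yy[i, j]) ** 2 +
                      (xx[i0:i1, j0:j1] - xx[i, j]) ** 2) / dta_mm ** 2
                d2 = (test[i0:i1, j0:j1] - ref[i, j]) ** 2 / dd ** 2
                gamma[i, j] = np.sqrt((r2 + d2).min())
        return gamma

    # percentage of pixels failing a 2%/2 mm criterion:
    # fail_pct = 100.0 * np.mean(gamma_2d(calculated, measured, 1.0) > 1.0)

A pixel passes when its gamma value is at most 1, so the percentage of failing pixels follows directly from the returned array.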

Relevance:

10.00%

Publisher:

Abstract:

The Work Limitations Questionnaire (WLQ) is used to determine the amount of work loss and lost productivity that stem from certain health conditions, including rheumatoid arthritis and cancer. The questionnaire is currently scored using methodology from Classical Test Theory. Item Response Theory (IRT), on the other hand, is a theory based on analyzing item responses. This study sought to determine the validity of using IRT to analyze data from the WLQ. Item responses from 572 employed adults with dysthymia, major depressive disorder (MDD), double depressive disorder (both dysthymia and MDD), rheumatoid arthritis, and healthy individuals were used to determine the validity of IRT (Adler et al., 2006). PARSCALE, IRT software from Scientific Software International, Inc., was used to calculate estimates of work limitations based on the item responses from the WLQ. These estimates, also known as ability estimates, were then correlated with the raw score estimates calculated from the sum of all the item responses. Concurrent validity, which holds that a measurement is valid if its correlation with an established valid measurement is greater than or equal to .90, was used to determine the validity of IRT methodology for the WLQ. Ability estimates from IRT were found to be fairly highly correlated with the raw scores from the WLQ (above .80). However, the only subscale with a high enough correlation for IRT to be considered valid was the time management subscale (r = .90). All other subscales (mental/interpersonal, physical, and output) did not produce valid IRT ability estimates. These lower-than-expected correlations can be explained in part by the outliers found in the sample. In addition, acquiescent responding (AR) bias, caused by the tendency of people to respond the same way to every question on a questionnaire, and the multidimensionality of the questionnaire (the WLQ is composed of four dimensions and thus four different latent variables) probably had a major impact on the IRT estimates. Furthermore, it is possible that the mental/interpersonal dimension violated the monotonicity assumption of IRT, causing PARSCALE to fail to run for these estimates; the monotonicity assumption needs to be checked for this dimension. The use of multidimensional IRT methods would most likely remove the AR bias and increase the validity of using IRT to analyze data from the WLQ.
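As a minimal sketch of the concurrent-validity check described above (the arrays below are made-up stand-ins for the PARSCALE ability estimates and the WLQ raw subscale scores, not the study data):

    import numpy as np
    from scipy.stats import pearsonr

    def concurrent_validity(theta, raw_scores, threshold=0.90):
        """Correlate IRT ability estimates with raw sum scores and test the
        correlation against the concurrent-validity criterion (r >= 0.90)."""
        r, _ = pearsonr(theta, raw_scores)
        return r, r >= threshold

    rng = np.random.default_rng(0)
    raw = rng.integers(0, 21, size=572).astype(float)        # hypothetical raw scores
    theta = 0.15 * raw + rng.normal(scale=0.6, size=572)     # hypothetical ability estimates
    r, is_valid = concurrent_validity(theta, raw)
    print(f"r = {r:.2f}, meets the .90 criterion: {is_valid}")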

Relevance:

10.00%

Publisher:

Abstract:

Purpose: Traditional patient-specific IMRT QA measurements are labor intensive and consume machine time. Calculation-based IMRT QA methods typically are not comprehensive. We have developed a comprehensive calculation-based IMRT QA method to detect uncertainties introduced by the initial dose calculation, the data transfer through the Record-and-Verify (R&V) system, and various aspects of the physical delivery. Methods: We recomputed the treatment plans in the patient geometry for 48 cases using data from the R&V system and from the delivery unit to calculate the “as-transferred” and “as-delivered” doses, respectively. These data were sent to the original TPS to verify transfer and delivery, or to a second TPS to verify the original calculation. For each dataset we examined the dose computed from the R&V record (RV) and from the delivery records (Tx), and the dose computed with a second verification TPS (vTPS). Each verification dose was compared to the clinical dose distribution using 3D gamma analysis and by comparison of mean dose and ROI-specific dose levels to target volumes. Plans were also compared to IMRT QA absolute and relative dose measurements. Results: The average 3D gamma passing percentages using 3%-3mm, 2%-2mm, and 1%-1mm criteria for the RV plans were 100.0 (σ=0.0), 100.0 (σ=0.0), and 100.0 (σ=0.1); for the Tx plans they were 100.0 (σ=0.0), 100.0 (σ=0.0), and 99.0 (σ=1.4); and for the vTPS plans they were 99.3 (σ=0.6), 97.2 (σ=1.5), and 79.0 (σ=8.6). When comparing target volume doses in the RV, Tx, and vTPS plans to the clinical plans, the average ratios of ROI mean doses were 0.999 (σ=0.001), 1.001 (σ=0.002), and 0.990 (σ=0.009), and of ROI-specific dose levels were 0.999 (σ=0.001), 1.001 (σ=0.002), and 0.980 (σ=0.043), respectively. Comparing the clinical, RV, Tx, and vTPS calculated doses to the IMRT QA measurements for all 48 patients, the average ratios for absolute doses were 0.999 (σ=0.013), 0.998 (σ=0.013), 0.999 (σ=0.015), and 0.990 (σ=0.012), respectively, and the average 2D gamma (5%-3mm) passing percentages for relative doses for 9 patients were 99.36 (σ=0.68), 99.50 (σ=0.49), 99.13 (σ=0.84), and 98.76 (σ=1.66), respectively. Conclusions: Together with mechanical and dosimetric QA, our calculation-based IMRT QA method promises to minimize the need for patient-specific QA measurements by identifying outliers in need of further review.
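As a sketch of the kind of automated outlier flagging suggested in the conclusion (the tolerance values and inputs here are illustrative assumptions, not the clinical criteria):

    import numpy as np

    def needs_review(roi_mean_ratios, gamma_pass_pct,
                     ratio_tol=0.02, gamma_min=90.0):
        """Flag a verification calculation for physicist review.

        roi_mean_ratios : verification/clinical ratios of mean dose for the target ROIs
        gamma_pass_pct  : 3D gamma passing percentage for the chosen criterion
        Returns True when the plan falls outside tolerance and needs further review.
        """
        ratios_ok = np.all(np.abs(np.asarray(roi_mean_ratios) - 1.0) <= ratio_tol)
        gamma_ok = gamma_pass_pct >= gamma_min
        return not (ratios_ok and gamma_ok)

    # e.g. a vTPS recalculation with ~1% mean-dose deviations and a 97% passing rate
    print(needs_review([0.990, 1.005], 97.2))    # -> False, within tolerance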

Relevance:

10.00%

Publisher:

Abstract:

An interim analysis is usually applied in later phase II or phase III trials to look for convincing evidence of a significant treatment difference, which may allow the trial to be terminated earlier than originally planned. Stopping early can save patient resources and shorten drug development and approval time; ethical and economic considerations are additional reasons to stop a trial early. In clinical trials involving eyes, ears, knees, arms, kidneys, lungs, and other clustered treatments, data may include distribution-free random variables with matched and unmatched subjects in one study. It is important to properly include both kinds of subjects in the interim and final analyses so that maximum efficiency of statistical and clinical inference can be obtained at the different stages of the trial. So far, no publication has applied a statistical method for distribution-free data with matched and unmatched subjects in the interim analysis of clinical trials. In this simulation study, the hybrid statistic was used to estimate the empirical powers and empirical type I errors among simulated datasets with different sample sizes, effect sizes, correlation coefficients for the matched pairs, and data distributions in the interim and final analyses with four different group sequential methods. Empirical powers and empirical type I errors were also compared to those estimated with the meta-analysis t-test on the same simulated datasets. Results from this simulation study show that, compared to the meta-analysis t-test commonly used for normally distributed observations, the hybrid statistic has greater power for data observed from normally, log-normally, and multinomially distributed random variables with matched and unmatched subjects and with outliers. Power rose with increasing sample size, effect size, and correlation coefficient for the matched pairs. In addition, lower type I errors were observed with the hybrid statistic, which indicates that this test is also conservative for data with outliers in the interim analysis of clinical trials.
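The hybrid statistic itself is specific to this study and is not reproduced here. As a rough sketch of the general simulation scheme (generate matched and unmatched data, test at an interim and a final look, and count rejections), with a simple Stouffer-combined paired/two-sample t-test standing in as a placeholder and the stopping boundaries chosen arbitrarily:

    import numpy as np
    from scipy import stats

    def empirical_power(n_pairs=30, n_unmatched=30, effect=0.5, rho=0.5,
                        n_sim=2000, alpha_interim=0.005, alpha_final=0.045, seed=1):
        """Fraction of simulated trials rejected at the interim or final look."""
        rng = np.random.default_rng(seed)

        def combined_p(diff, treated, control):
            # placeholder statistic: paired t on matched pairs, two-sample t on
            # unmatched subjects, combined with Stouffer's method
            p1 = stats.ttest_1samp(diff, 0.0).pvalue
            p2 = stats.ttest_ind(treated, control).pvalue
            z = (stats.norm.isf(p1 / 2) + stats.norm.isf(p2 / 2)) / np.sqrt(2)
            return 2 * stats.norm.sf(z)

        rejections = 0
        for _ in range(n_sim):
            z1 = rng.normal(size=n_pairs)                      # matched controls
            z2 = rho * z1 + np.sqrt(1 - rho ** 2) * rng.normal(size=n_pairs) + effect
            diff = z2 - z1                                     # within-pair differences
            ctrl = rng.normal(size=n_unmatched)                # unmatched controls
            trt = rng.normal(loc=effect, size=n_unmatched)     # unmatched treated
            k, m = n_pairs // 2, n_unmatched // 2              # data at the interim look
            interim_p = combined_p(diff[:k], trt[:m], ctrl[:m])
            final_p = combined_p(diff, trt, ctrl)
            rejections += (interim_p < alpha_interim) or (final_p < alpha_final)
        return rejections / n_sim

    print(f"empirical power ~ {empirical_power():.2f}")   # set effect=0 for the type I error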

Relevance:

10.00%

Publisher:

Abstract:

BACKGROUND: Weight has been implicated as a risk factor for symptomatic community-acquired methicillin-resistant Staphylococcus aureus (CA-MRSA) infection. Information from Texas Children's Hospital (TCH) in Houston, TX was used to implement a case-control study to assess weight-for-age percentile (WFA), race, and seasonal exposure as risk factors. METHODS: A retrospective chart review to collect data from TCH was conducted covering the period January 1st, 2008 to May 31st, 2011. Cases were confirmed and identified by the infectious disease department and were matched on a 1:1 ratio to controls seen by the emergency department for non-infected fractures from June 1st, 2008 to May 31st, 2011. Data abstraction was performed using TCH's electronic medical records (EMR) system (EPIC®). RESULTS: Of 702 identified CA-MRSA cases, ages 9 to 16.99, 564 (80.3%) had the variable 'weight' present in their EMR, were not duplicates, and were not determined to be outliers. Cases were randomly matched to a pool of available controls (n=1864) according to age and gender, yielding 539 1:1 matched pairs (95.5% case matching success) with a total study sample size of N=1078. Case median age was 13.38 years, with the majority being White (66.05%) and male (59.4%). Adjusted conditional logistic regression analysis of the matched pairs identified the following risk factors for presenting with CA-MRSA infection among pediatric patients, ages 9 to 16.99 years: a) individual weight in the highest (75th-99.9th) WFA quartile (OR=1.36; 95% confidence interval [CI]=1.06-1.74; P=0.016), b) infection during summer months (OR=1.69; 95% CI=1.2-2.38; P=0.003), c) African American race/ethnicity (OR=1.48; 95% CI=1.13-1.95; P=0.004). CONCLUSIONS: Pediatric patients, 9 to 16.99 years of age, in the highest WFA quartile (75th-99.9th) or of African-American race had an associated increased risk of presenting with CA-MRSA infection. Furthermore, children in this population were at higher risk of contracting CA-MRSA infection during the summer season.
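The adjusted estimates above come from conditional logistic regression on the matched pairs. As a simpler, unadjusted illustration of the matched-pair design with a single binary exposure (e.g., highest WFA quartile), the discordant-pair odds ratio can be computed directly; the counts below are hypothetical, not the study data:

    import math

    def matched_pair_or(case_only_exposed, control_only_exposed, z=1.96):
        """Unadjusted matched-pair odds ratio with a Wald 95% CI.

        case_only_exposed    : discordant pairs where only the case is exposed
        control_only_exposed : discordant pairs where only the control is exposed
        Concordant pairs do not contribute to the estimate.
        """
        or_hat = case_only_exposed / control_only_exposed
        se_log_or = math.sqrt(1 / case_only_exposed + 1 / control_only_exposed)
        ci = (math.exp(math.log(or_hat) - z * se_log_or),
              math.exp(math.log(or_hat) + z * se_log_or))
        return or_hat, ci

    print(matched_pair_or(150, 110))    # hypothetical discordant-pair counts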

Relevance:

10.00%

Publisher:

Abstract:

The province of Mendoza has the largest irrigated area in Argentina and an extensive irrigation and drainage infrastructure on its five exploited rivers. The soils are of alluvial origin, with profiles alternating layers of different textures; very fine, almost impermeable strata are present that prevent free drainage of the irrigation water. This dynamic situation becomes more evident as the river's gradient decreases, coinciding with the lower sectors of the basin. The accumulation of water raises the water table toward the soil surface, increasing soil salinization. The irrigated area of the Mendoza River, where mean water salinity at the diversion into the irrigation network is below 1 dS m-1, is one of the most intensively exploited in the country and has two sectors with a shallow water table. These correspond to a central zone called the Área de Surgencia (AS) and another called Área Lavalle (AL). In AS there is a network of 98 observation wells (piezometers) used to determine depths, flow directions, and groundwater quality. AL has a network of 100 piezometers distributed over three subareas corresponding to three drainage collectors: Tres de Mayo-Jocolí (TMJ), Villa Lavalle (VL), and Costa de Araujo-Gustavo André (CG). This paper presents the results of the evaluation of groundwater salinity, expressed as total salinity at 25 °C (EC), for the two study areas. Samples were taken in 2002 and 2004. The results indicate that at both sampling times the median is lower than the corresponding mean, which indicates positive skewness in the distributions. The medians obtained were 6180 μS cm-1 (2002) and 6195 μS cm-1 (2004). Changes are also observed in the distributions between sampling times and between areas: in 2004 much larger extreme upper values appear than in 2002, and the VL area shows more uniform relative frequencies and the largest increases in EC. At both sampling times the AS area has the lowest EC position values, although it is also the zone with the largest number of outliers; the TMJ, CG, and AS areas did not undergo important changes in EC values over the two years, but a noticeable increase in EC is observed in VL. With the cleaned database, isolines were drawn for different intervals of the analyzed variable (EC), showing spatially the sectors affected by the different intervals of groundwater salinity.
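A minimal sketch of the summary used above (median vs. mean as a check for positive skewness and the usual 1.5×IQR rule for flagging high outliers in the EC values); the simulated values below are only placeholders:

    import numpy as np

    def summarize_ec(ec_uS_cm):
        """Median, mean, skewness check, and IQR-rule high outliers for EC values (µS/cm)."""
        ec = np.asarray(ec_uS_cm, dtype=float)
        q1, median, q3 = np.percentile(ec, [25, 50, 75])
        upper_fence = q3 + 1.5 * (q3 - q1)
        return {"median": median,
                "mean": ec.mean(),
                "positively_skewed": median < ec.mean(),
                "upper_fence": upper_fence,
                "n_outliers": int(np.sum(ec > upper_fence))}

    rng = np.random.default_rng(2)
    ec = rng.lognormal(mean=8.73, sigma=0.4, size=100)   # skewed sample, median near 6200 µS/cm
    print(summarize_ec(ec))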

Relevance:

10.00%

Publisher:

Abstract:

This archive consists of the hydrographic data collected on Cruise 82-002 of C.S.S. Hudson, April 11 to May 2, 1982. Seventy-eight stations were occupied on a line running near 48°N from the mouth of the English Channel to the Grand Banks of Newfoundland. Pressure, temperature, and salinity were measured by a Guildline digital CTP system. Salinity, dissolved oxygen, silicate, nitrate, and phosphate were measured from water samples collected on the CTP upcasts. CTP and discrete bottle data and associated derived parameters are tabulated at standard levels. This is the digital version of the printed report (1989; see further details), published in 2006 in the PANGAEA information system.

Relevance:

10.00%

Publisher:

Abstract:

We present a reconstruction of El Niño Southern Oscillation (ENSO) variability spanning the Medieval Climate Anomaly (MCA, A.D. 800-1300) and the Little Ice Age (LIA, A.D. 1500-1850). Changes in ENSO are estimated by comparing the spread and symmetry of d18O values of individual specimens of the thermocline-dwelling planktonic foraminifer Pulleniatina obliquiloculata extracted from discrete time horizons of a sediment core collected in the Sulawesi Sea, at the edge of the western tropical Pacific warm pool. The spread of individual d18O values is interpreted to be a measure of the strength of both phases of ENSO while the symmetry of the d18O distributions is used to evaluate the relative strength/frequency of El Niño and La Niña events. In contrast to previous studies, we use robust and resistant statistics to quantify the spread and symmetry of the d18O distributions; an approach motivated by the relatively small sample size and the presence of outliers. Furthermore, we use a pseudo-proxy approach to investigate the effects of the different paleo-environmental factors on the statistics of the d18O distributions, which could bias the paleo-ENSO reconstruction. We find no systematic difference in the magnitude/strength of ENSO during the Northern Hemisphere MCA or LIA. However, our results suggest that ENSO during the MCA was skewed toward stronger/more frequent La Niña than El Niño, an observation consistent with the medieval megadroughts documented from sites in western North America.
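The robust and resistant statistics are not spelled out in this summary; a common pairing, sketched here purely for illustration, is the interquartile range for spread and Bowley's quartile skewness for the symmetry of the individual d18O values:

    import numpy as np

    def robust_spread_and_symmetry(d18o_values):
        """Interquartile range (resistant spread) and Bowley quartile skewness
        (resistant asymmetry) for individual foraminiferal d18O values."""
        q1, q2, q3 = np.percentile(np.asarray(d18o_values, dtype=float), [25, 50, 75])
        iqr = q3 - q1
        bowley_skew = (q3 + q1 - 2.0 * q2) / iqr if iqr > 0 else 0.0
        return iqr, bowley_skew

    # hypothetical horizon with one anomalously light individual
    horizon = [-2.1, -2.3, -1.9, -2.0, -2.2, -2.4, -1.8, -3.1, -2.0, -2.1]
    print(robust_spread_and_symmetry(horizon))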

Relevance:

10.00%

Publisher:

Abstract:

IBAMar (http://www.ba.ieo.es/ibamar) is a regional database that brings together all physical and biochemical data obtained by multiparametric probes (CTDs equipped with different sensors) during the cruises managed by the Balearic Center of the Spanish Institute of Oceanography (COB-IEO). It has recently been extended to include data obtained with classical hydro casts using oceanographic Niskin or Nansen bottles. The result is a database that includes a main core of hydrographic data: temperature (T), salinity (S), dissolved oxygen (DO), fluorescence, and turbidity; complemented by biochemical data: dissolved inorganic nutrients (phosphate, nitrate, nitrite, and silicate) and chlorophyll-a. Different technologies and methodologies were used by different teams over the four decades of data sampling at the COB-IEO. Despite this, the data have been reprocessed using the same protocols, and a standard QC has been applied to each variable; the database therefore provides homogeneous, good quality regional data. Data acquisition and quality control (QC): 94% of the data come from Sea-Bird SBE911 and SBE25 CTDs. S and DO were calibrated on board using water samples whenever a rosette was available (70% of the cases). All data from Sea-Bird CTDs were reviewed and post-processed with the software provided by Sea-Bird Electronics, and averaged to 1 dbar vertical resolution. The general sampling methodology and pre-processing are described at https://ibamardatabase.wordpress.com/home/. Manual QC includes visual checks of metadata, duplicate data, and outliers. Automatic QC includes range checks of variables by area (north of the Balearic Islands, south of the BI, and the Alboran Sea) and depth (27 standard levels), checks for spikes, and checks for density inversions. Nutrient QC includes a preliminary control and a range check on the observed level of the data to detect outliers around objectively analyzed data fields. A quality flag is assigned as an integer number, depending on the result of the QC check.
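A minimal sketch of automatic checks of this kind (range check, spike check, and density-inversion check, with integer quality flags); the ranges and flag codes are placeholders, not the IBAMar conventions:

    import numpy as np

    GOOD, SUSPECT, BAD = 1, 3, 4                     # placeholder flag codes

    def qc_flags(values, valid_range, spike_tol, density=None):
        """Assign one integer QC flag per observation in a vertical profile."""
        v = np.asarray(values, dtype=float)
        flags = np.full(v.size, GOOD, dtype=int)

        # range check (in practice done per variable, area, and standard level)
        low, high = valid_range
        flags[(v < low) | (v > high)] = BAD

        # spike check: point far from the mean of its two neighbours
        spikes = np.abs(v[1:-1] - 0.5 * (v[:-2] + v[2:])) > spike_tol
        inner = flags[1:-1]
        inner[spikes] = np.maximum(inner[spikes], SUSPECT)

        # density-inversion check: density should not decrease with depth
        if density is not None:
            inversions = np.diff(np.asarray(density, dtype=float)) < 0
            lower = flags[1:]
            lower[inversions] = np.maximum(lower[inversions], SUSPECT)

        return flags

    # e.g. a salinity profile with one out-of-range spike
    print(qc_flags([38.2, 38.3, 39.9, 38.3, 38.4, 38.4],
                   valid_range=(36.0, 39.0), spike_tol=0.5))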

Relevance:

10.00%

Publisher:

Abstract:

In the southeast of Bolshoi Lyakhovsky Island there are outcrops of tectonic outliers composed of low-K, medium-Ti tholeiitic basic rocks, represented by weakly altered pillow basalts as well as by their metamorphosed analogs, amphibolites and blueschists. The rocks are depleted in light rare-earth elements and were melted out of a depleted mantle source; a component enriched in Th, Nb, and Zr also contributed to the rock formation. The magma sources were not affected by subduction-related fluids or melts. The rocks were part of the crust of the Jurassic South Anyui ocean basin. The blueschists represent the crust of the same basin submerged beneath the more southerly Anyui-Svyatoi Nos arc to depths of 30-40 km. Pressures and temperatures of metamorphism suggest a setting of "warm" subduction. Mineral assemblages of the blueschists record the timing of the collision of the Anyui-Svyatoi Nos island arc with the New Siberian continental block, expressed as a counter-clockwise P-T trend. The pressure jump during the collision corresponds to the stacking of tectonic covers, 12 km in total thickness, above the zone of convergence. Oceanic rocks were thrust over the margin of the New Siberian continental block in the late Late Jurassic to early Early Cretaceous and mark the NW continuation of the South Anyui suture, one of the main tectonic sutures of Northeastern Asia.

Relevance:

10.00%

Publisher:

Abstract:

The long-term rate of racemization for amino acids preserved in planktonic foraminifera was determined by using independently dated sediment cores from the Arctic Ocean. The racemization rates for aspartic acid (Asp) and glutamic acid (Glu) in the common taxon, Neogloboquadrina pachyderma, were calibrated for the last 150 ka using 14C ages and the emerging Quaternary chronostratigraphy of Arctic Ocean sediments. An analysis of errors indicates realistic age uncertainties of about ±12% for Asp and ±17% for Glu. Fifty individual tests are sufficient to analyze multiple subsamples, identify outliers, and derive robust sample mean values. The new age equation can be applied to verify and refine age models for sediment cores elsewhere in the Arctic Ocean, a critical region for understanding the dynamics of global climate change.

Relevance:

10.00%

Publisher:

Abstract:

The analysis of time-dependent data is an important problem in many application domains, and interactive visualization of time-series data can help in understanding patterns in large time series. Many effective approaches already exist for visual analysis of univariate time series, supporting tasks such as assessment of data quality, detection of outliers, or identification of periodically or frequently occurring patterns. However, far fewer approaches support multivariate time series. The existence of multiple values per time stamp makes the analysis task per se harder, and existing visualization techniques often do not scale well. We introduce an approach for visual analysis of large multivariate time-dependent data based on the idea of projecting multivariate measurements to a 2D display and visualizing the time dimension by trajectories. We use visual data aggregation metaphors based on grouping of similar data elements to scale with multivariate time series. Aggregation procedures can be based either on statistical properties of the data or on data clustering routines. Appropriately defined user controls allow users to navigate and explore the data and to interactively steer the parameters of the data aggregation to enhance data analysis. We present an implementation of our approach and apply it to a comprehensive data set from the field of Earth observation, demonstrating the applicability and usefulness of our approach.
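A minimal sketch of the core idea, projecting multivariate measurements to a 2D display and drawing the time dimension as a trajectory; PCA is used here as the projection purely for illustration (the paper's projection and aggregation machinery is not reproduced):

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.decomposition import PCA

    # hypothetical multivariate time series: 500 time stamps, 8 variables
    rng = np.random.default_rng(0)
    t = np.linspace(0.0, 4.0 * np.pi, 500)
    data = np.column_stack([np.sin(t + k) + 0.1 * rng.normal(size=t.size)
                            for k in range(8)])

    # project every time stamp to 2D and draw the points in time order as a trajectory
    xy = PCA(n_components=2).fit_transform(data)
    plt.plot(xy[:, 0], xy[:, 1], lw=0.8, color="grey")
    plt.scatter(xy[::25, 0], xy[::25, 1], c=t[::25], cmap="viridis", zorder=3)
    plt.xlabel("PC 1"); plt.ylabel("PC 2")
    plt.title("Multivariate time series shown as a 2D trajectory")
    plt.show()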

Relevance:

10.00%

Publisher:

Abstract:

The Sonne transit cruise SO226-3 DipFIP took place from March 4th (Wellington, New Zealand) to March 28th (Kaohsiung, Taiwan) in 2013. CTD data for 16 stations along the cruise track were recorded using the onboard SEABIRD SBE 9 plus CTD down to a depth of 800 m. The hydrographic data obtained were binned to 1 m intervals with the available SBE software. Obvious outliers in the readings of the oxygen sensor close to the sea surface were removed manually. Fluorospectrometer (bbe Moldaenke) pigment data measured for 24 depth intervals are available for 10 stations. Measurements were conducted in the shipboard laboratory on water samples from the CTD rosette. Data are averages of at least 30 readings per sample.
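A sketch of the 1 m binning step described above (column names and values are assumptions, not the cruise data):

    import numpy as np
    import pandas as pd

    def bin_profile(cast, step=1.0):
        """Average a CTD downcast into fixed depth intervals (e.g. 1 m bins)."""
        edges = np.arange(0.0, cast["depth_m"].max() + step, step)
        return (cast.assign(depth_bin=pd.cut(cast["depth_m"], edges))
                    .groupby("depth_bin", observed=True)
                    .mean(numeric_only=True))

    # hypothetical raw readings from one station
    raw = pd.DataFrame({"depth_m": [0.3, 0.7, 1.2, 1.6, 2.1],
                        "temp_C":  [28.1, 28.0, 27.6, 27.5, 26.9],
                        "oxygen":  [4.6, 4.5, 4.4, 4.4, 4.3]})
    print(bin_profile(raw))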

Relevance:

10.00%

Publisher:

Abstract:

Multibeam data were collected during R/V Polarstern cruise ARK-XXII/2 to the central Arctic Ocean. The multibeam sonar system was an ATLAS HYDROSWEEP DS2. The data are unprocessed and may contain outliers and blunders. Because of an error in the installation of the transducers, the data are affected by large systematic errors and must not be used for grid calculations or charting projects.

Relevance:

10.00%

Publisher:

Abstract:

Machine learning techniques are used for extracting valuable knowledge from data. Nowadays, these techniques are becoming even more important due to the evolution in data acquisition and storage, which is leading to data with different characteristics that must be exploited. Therefore, advances in data collection must be accompanied by advances in machine learning techniques to solve the new challenges that arise, in both academic and real applications. There are several machine learning techniques depending on both data characteristics and purpose. Unsupervised classification, or clustering, is one of the best-known techniques when data lack supervision (unlabeled data) and the aim is to discover data groups (clusters) according to their similarity. On the other hand, supervised classification needs data with supervision (labeled data), and its aim is to make predictions about the labels of new data. The presence of data labels is a very important characteristic that guides not only the learning task but also other related tasks such as validation. When only some of the available data are labeled while the others remain unlabeled (partially labeled data), neither clustering nor supervised classification can be used. This scenario, which is becoming common nowadays because of the cost of the labeling process or ignorance of the labels, is tackled with semi-supervised learning techniques. This thesis focuses on the branch of semi-supervised learning closest to clustering, i.e., discovering clusters using the available labels as support to guide and improve the clustering process. Another important data characteristic, different from the presence of data labels, is the relevance or not of data features. Data are characterized by features, but it is possible that not all of them are relevant, or equally relevant, for the learning process. A recent clustering tendency, related to data relevance and called subspace clustering, claims that different clusters might be described by different feature subsets. This differs from traditional solutions to the data relevance problem, where a single feature subset (usually the complete set of original features) is found and used to perform the clustering process. The proximity of this work to clustering leads to the first goal of this thesis. As commented above, clustering validation is a difficult task due to the absence of data labels. Although there are many indices that can be used to assess the quality of clustering solutions, these validations depend on the clustering algorithms and data characteristics. Hence, in the first goal, three well-known clustering algorithms are used to cluster data with outliers and noise, to critically study how some of the best-known validation indices behave. The main goal of this work, however, is to combine semi-supervised clustering with subspace clustering to obtain clustering solutions that can be correctly validated by using either known indices or expert opinions. Two different algorithms are proposed, from different points of view, to discover clusters characterized by different subspaces. In the first algorithm, the available data labels are used to search for subspaces first, before searching for clusters. This algorithm assigns each instance to only one cluster (hard clustering) and is based on mapping the known labels to subspaces using supervised classification techniques. The subspaces are then used to find clusters using traditional clustering techniques.
The second algorithm uses the available data labels to search for subspaces and clusters at the same time in an iterative process. This algorithm assigns each instance to each cluster with a membership probability (soft clustering) and is based on integrating the known labels and the search for subspaces into a model-based clustering approach. The different proposals are tested using different real and synthetic databases, and comparisons to other methods are included where appropriate. Finally, as an example of a real and current application, different machine learning techniques, including one of the proposals of this work (the most sophisticated one), are applied to one of the most challenging biological problems nowadays, human brain modeling. Specifically, expert neuroscientists do not agree on a neuron classification for the cerebral cortex, which makes impossible not only any modeling attempt but also day-to-day work without a common way to name neurons. Therefore, machine learning techniques may help to reach an accepted solution to this problem, which could be an important milestone for future research in neuroscience.
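As a minimal illustration of the semi-supervised idea discussed here, i.e., using a few available labels to guide an otherwise unsupervised clustering (this is not one of the thesis algorithms; it simply seeds k-means centroids with the class means of the labeled instances):

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs

    # partially labeled data: only 10% of the instances keep their label
    X, y = make_blobs(n_samples=600, centers=3, cluster_std=1.5, random_state=0)
    rng = np.random.default_rng(0)
    labeled = rng.random(y.size) < 0.10
    y_partial = np.where(labeled, y, -1)            # -1 marks unlabeled instances

    # seed each centroid with the mean of the labeled instances of that class
    seeds = np.vstack([X[y_partial == k].mean(axis=0) for k in range(3)])
    km = KMeans(n_clusters=3, init=seeds, n_init=1, random_state=0).fit(X)

    # when some labels exist, validation can also use agreement on the labeled subset
    agreement = np.mean(km.labels_[labeled] == y[labeled])
    print(f"agreement on the labeled instances: {agreement:.2f}")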