962 results for Clustering a large document collection
Abstract:
The ultramafic-hosted Logatchev hydrothermal field (LHF) is characterized by vent fluids that are enriched in dissolved hydrogen and methane compared with fluids from basalt-hosted systems. Thick sediment layers in the LHF are partly covered by characteristic white mats. In this study, these sediments were investigated in order to determine the biogeochemical processes and key organisms relevant for primary production. Temperature profiling at two mat-covered sites showed conductive heating of the sediments. Elemental sulfur was detected in the overlying mat and metal sulfides in the upper sediment layer. Microprofiles revealed an intensive hydrogen sulfide flux from deeper sediment layers. Fluorescence in situ hybridization showed that filamentous and vibrioid, Arcobacter-related Epsilonproteobacteria dominated the overlying mats. This is in contrast to sulfidic sediments in basalt-hosted fields, where mats of similar appearance are composed of large sulfur-oxidizing Gammaproteobacteria. Epsilonproteobacteria (7-21%) and Deltaproteobacteria (20-21%) were highly abundant in the surface sediment layer. The physiology of the closest cultivated relatives, revealed by comparative 16S rRNA sequence analysis, was characterized by the capability to metabolize sulfur components. High sulfate reduction rates as well as sulfide depleted in 34S further confirmed the importance of the biogeochemical sulfur cycle. In contrast, methane was found to be of minor relevance for microbial life in mat-covered surface sediments. Our data indicate that in conductively heated surface sediments microbial sulfur cycling is the driving force for bacterial biomass production, although ultramafic-hosted systems are characterized by fluids with high levels of dissolved methane and hydrogen.
Abstract:
This data set contains four time series of particulate and dissolved soil nitrogen measurements from the main experiment plots of a large grassland biodiversity experiment (the Jena Experiment; see further details below). In the main experiment, 82 grassland plots of 20 x 20 m were established from a pool of 60 species belonging to four functional groups (grasses, legumes, tall and small herbs). In May 2002, varying numbers of plant species from this species pool were sown into the plots to create a gradient of plant species richness (1, 2, 4, 8, 16 and 60 species) and functional richness (1, 2, 3, 4 functional groups). Plots were maintained by bi-annual weeding and mowing. 1. Total nitrogen from solid phase: Stratified soil sampling was performed before sowing in April 2002 and repeated every two years, in April 2004, 2006 and 2008, to a depth of 30 cm, segmented to a depth resolution of 5 cm, giving six depth subsamples per core. In 2002, five samples per plot were taken and analyzed independently. Averaged values per depth layer are reported. In later years, three samples per plot were taken, pooled in the field, and measured as a combined sample. Sampling locations were less than 30 cm from sampling locations in other years. All soil samples were passed through a sieve with a mesh size of 2 mm in 2002. In later years, samples were further sieved to 1 mm. No additional mineral particles were removed by this procedure. Total nitrogen concentration was analyzed on ball-milled subsamples (time 4 min, frequency 30 s-1) by an elemental analyzer at 1150°C (Elementaranalysator vario Max CN; Elementar Analysensysteme GmbH, Hanau, Germany). 2. Total nitrogen from solid phase (high intensity sampling): In block 2 of the Jena Experiment, soil samples were taken to a depth of 1 m (segmented to a depth resolution of 5 cm, giving 20 depth subsamples per core) with three replicates per block every 5 years, starting before sowing in April 2002.
Samples were processed as for the more frequent sampling but were always analyzed independently and never pooled. 3. Mineral nitrogen from KCl extractions: Five soil cores (diameter 0.01 m) were taken at a depth of 0 to 0.15 m (and between 2002 and 2004 also at a depth of 0.15 to 0.3 m) of the mineral soil from each of the experimental plots at various times over the years. In addition, plots of the management experiment, which altered mowing frequency and fertilized subplots (see further details below), were also sampled in some later years. Samples of the soil cores per plot (subplots in the case of the management experiment) were pooled during each sampling campaign. NO3-N and NH4-N concentrations were determined by extraction of soil samples with 1 M KCl solution and were measured in the soil extract with a Continuous Flow Analyzer (CFA, 2003-2005: Skalar, Breda, Netherlands; 2006-2007: AutoAnalyzer, Seal, Burgess Hill, United Kingdom). 4. Dissolved nitrogen in soil solution: Glass suction plates with a diameter of 12 cm, 1 cm thickness and a pore size of 1-1.6 µm (UMS GmbH, Munich, Germany) were installed in April 2002 at depths of 10, 20, 30 and 60 cm to collect soil solution. The sampling bottles were continuously evacuated to a negative pressure between 50 and 350 mbar, such that the suction pressure was about 50 mbar above the actual soil water tension. Thus, only the soil leachate was collected. Cumulative soil solution was sampled biweekly and analyzed for nitrate (NO3-), ammonium (NH4+) and total dissolved nitrogen concentrations with a continuous flow analyzer (CFA, Skalar, Breda, The Netherlands). Nitrate was analyzed photometrically after reduction to NO2- and reaction with sulfanilamide and naphthylethylenediamine-dihydrochloride to an azo-dye. Our NO3- concentrations contained an unknown contribution of NO2- that is expected to be small.
Simultaneously with the NO3- analysis, NH4+ was determined photometrically as 5-aminosalicylate after a modified Berthelot reaction. The detection limits of NO3- and NH4+ were 0.02 and 0.03 mg N L-1, respectively. Total dissolved N (TDN) in soil solution was analyzed by oxidation with K2S2O8 followed by reduction to NO2- as described above for NO3-. Dissolved organic N (DON) concentrations in soil solution were calculated as the difference between TDN and the sum of mineral N (NO3- + NH4+).
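The DON calculation described above can be sketched in a few lines of Python. This is only an illustration of the stated difference (TDN minus the sum of NO3- and NH4+); the function name and the sample values are hypothetical and are not taken from the data set.

```python
def dissolved_organic_n(tdn, no3_n, nh4_n):
    """Return DON as TDN minus mineral N (NO3-N + NH4-N), all in mg N per litre."""
    return tdn - (no3_n + nh4_n)


# Hypothetical sample values in mg N L-1 (illustrative, not measured data).
don = dissolved_organic_n(tdn=1.50, no3_n=0.90, nh4_n=0.10)
print(round(don, 2))  # 0.5
```

Near the detection limits the difference can come out slightly negative; how such values are handled is not stated in the abstract and is left open here.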
Abstract:
Remains of large Pleistocene mammals always attract attention. Scientists and local people who work and live in the Laptev Sea Region find and collect various bones and fragments of large mammals. Some of them are brought to the Lena Delta Reserve. Mammal remains of the "Mammoth fauna" are the most common artifacts in the paleontological collection of the Lena Delta Reserve museum. The collection includes single bones, fragments of skeletons, and bones with soft tissues and hair of Late Pleistocene and Holocene specimens. It consists of nearly 300 samples. The museum was created thanks to the enthusiasm of Dr. A. Gukov, the present director of the reserve. Employees of the reserve, school teachers, pupils and other interested people also contribute. The first specimens were collected in 1985. They were bison bones collected by Yarlykov Yu. A. on Makar Island (Yana Delta Region) near the Makar polar station; Efimov S. N. found horse and reindeer bones on the Myostakh Cape, Bykovsky Peninsula (Lena Delta Region). Mammoth and reindeer bones were collected by Gukov A. Yu. during the same year on Kurungnakh-Sise Island. Over more than 20 years many people have presented their finds to the reserve. These are samples from different islands of the Lena Delta Region, from the New Siberian Islands, from the Yana Delta Region, and from the southern coasts of the Laptev and East Siberian Seas. Most of the collection consists of bones from the Bykovsky Peninsula (about 100 samples) as well as from the islands of the Lena Delta Region. Unfortunately, not all samples have exact information about their origins, nor is geological information available for all finds. It is typical for this exhibition that the finds were collected by amateurs (not during geological or paleontological expeditions). A considerable portion of the collection consists of finds of Dr. A. Gukov from different locations within the Lena Delta Reserve. In 2001 Dr. A. 
Sher delivered about 40 samples from the Bykovsky Peninsula (Mamontovy Khayata) to the museum.
Abstract:
Postestimation processing and formatting of regression estimates for input into document tables are tasks that many of us have to do. However, processing results by hand can be laborious and is vulnerable to error. There are therefore many benefits to automating these tasks while at the same time retaining user flexibility in terms of output format. The estout package meets these needs. estout assembles a table of coefficients, "significance stars", summary statistics, standard errors, t/z statistics, p-values, confidence intervals, and other statistics calculated for up to twenty models previously fitted and stored by estimates store. It then writes the table to the Stata log and/or to a text file. The estimates are optionally formatted in several styles: html, LaTeX, or tab-delimited (for input into MS Excel or Word). There are a large number of options regarding which output is formatted and how. This talk will take users through a range of examples, from relatively simple applications to complex ones.
Abstract:
Clustered small manufacturers are believed to attain various types of collective efficiency. A woodworking and furniture SME district in Uganda has created a learning environment for artisans to start up their own workshops. In the district, workers can access various managerial resources, including business skills and input materials, more easily than outside. Hence it attracted new entrants, and the growth of the district continued. In contrast, large firms are located separately and dispersed from the SME district and have a negative image of SMEs. This dichotomy has been created partly through the spatial division of the two sectors and partly through policy favouritism toward large firms.
Abstract:
This work studied the combined use of gliadins and SSRs to analyse inter- and intra-accession variability of the Spanish collection of cultivated einkorn (Triticum monococcum L. ssp. monococcum) maintained at the CRF-INIA. In general, gliadin loci presented higher discrimination power than SSRs, reflecting the high variability of the gliadins. The loci on chromosome 6A were the most polymorphic, with similar PIC values for both marker systems, showing that these markers are very useful for genetic variability studies in wheat. The gliadin results indicated that the Spanish einkorn collection possessed high genetic diversity, with large differentiation between varieties and small differentiation within them. Some associations between gliadin alleles and geographical and agro-morphological data were found. Agro-morphological relations were also observed in the clusters of the SSR dendrogram. A high concordance was found between gliadins and SSRs for genotype identification. In addition, both systems provide complementary information to resolve the different cases of intra-accession variability not detected at the agro-morphological level, and to identify separately all the genotypes analysed. The combined use of both genetic markers is an excellent tool for genetic resource evaluation in addition to agro-morphological evaluation.
Abstract:
In this work gliadin proteins were used to analyse the genetic variability in a sample of the Spanish durum wheat collection conserved at the CRF-INIA. In total, 38 different alleles were identified at the loci Gli-A1, Gli-A3, Gli-B5, Gli-B1, Gli-A2 and Gli-B2. All the gliadin loci were polymorphic and possessed large genetic diversity, with small differentiation within varieties and large differentiation between them. The Gli-A2 and Gli-B2 loci were the most polymorphic, the most fixed within varieties and the most useful to distinguish among varieties. In contrast, the Gli-B1 locus presented the least genetic variability of the four main loci Gli-A1, Gli-B1, Gli-A2 and Gli-B2. The Gli-B1 alleles coding for the gliadin γ-45, associated with good quality, had an accumulated frequency of 69.7%, showing that the Spanish germplasm could be a good source for quality breeding. The Spanish landraces studied showed new gliadin alleles not catalogued so far. These new alleles might be associated with specific Spanish environmental factors. The large number of new alleles identified also indicates that Spanish durum wheat germplasm is rather unique.
Abstract:
Large-scale structure formation can be modeled as a nonlinear process that transfers energy from the largest scales to successively smaller scales until it is dissipated, in analogy with Kolmogorov’s cascade model of incompressible turbulence. However, cosmic turbulence is very compressible, and vorticity plays a secondary role in it. The simplest model of cosmic turbulence is the adhesion model, which can be studied perturbatively or by adapting to it Kolmogorov’s non-perturbative approach to incompressible turbulence. This approach leads to observationally testable predictions, e.g., for the power-law exponent of the matter density two-point correlation function.
Abstract:
The microarray technique is rather powerful, as it allows up to thousands of genes to be tested at a time, but it produces an overwhelming set of data files containing huge amounts of data, which are quite difficult to pre-process, separate, classify and correlate so that interesting conclusions can be extracted. Modern machine learning, data mining and clustering techniques based on information theory are needed to read and interpret the information content buried in those large data sets. The Independent Component Analysis method can be used to correct the data affected by corruption processes or to filter out the uncorrectable data, and then clustering methods can group similar genes or classify samples. In this paper a hybrid approach is used to obtain a two-way unsupervised clustering of corrected microarray data.
Abstract:
Machine learning techniques are used for extracting valuable knowledge from data. Nowadays, these techniques are becoming even more important due to the evolution in data acquisition and storage, which is leading to data with different characteristics that must be exploited. Therefore, advances in data collection must be accompanied by advances in machine learning techniques to solve new challenges that might arise, in both academic and real applications. There are several machine learning techniques, depending on both data characteristics and purpose. Unsupervised classification, or clustering, is one of the best-known techniques when data lack supervision (unlabeled data) and the aim is to discover data groups (clusters) according to their similarity. On the other hand, supervised classification needs data with supervision (labeled data), and its aim is to make predictions about the labels of new data. The presence of data labels is a very important characteristic that guides not only the learning task but also other related tasks such as validation. When only some of the available data are labeled whereas the others remain unlabeled (partially labeled data), neither clustering nor supervised classification can be used. This scenario, which is becoming common nowadays because of ignorance about, or the cost of, the labeling process, is tackled with semi-supervised learning techniques. This thesis focuses on the branch of semi-supervised learning closest to clustering, i.e., discovering clusters using available labels as support to guide and improve the clustering process. Another important data characteristic, different from the presence of data labels, is the relevance or not of data features. Data are characterized by features, but it is possible that not all of them are relevant, or equally relevant, for the learning process. 
A recent clustering tendency, related to data relevance and called subspace clustering, claims that different clusters might be described by different feature subsets. This differs from traditional solutions to the data relevance problem, where a single feature subset (usually the complete set of original features) is found and used to perform the clustering process. The proximity of this work to clustering leads to the first goal of this thesis. As commented above, clustering validation is a difficult task due to the absence of data labels. Although there are many indices that can be used to assess the quality of clustering solutions, these validations depend on clustering algorithms and data characteristics. Hence, in the first goal three known clustering algorithms are used to cluster data with outliers and noise, to critically study how some of the best-known validation indices behave. The main goal of this work, however, is to combine semi-supervised clustering with subspace clustering to obtain clustering solutions that can be correctly validated by using either known indices or expert opinions. Two different algorithms are proposed from different points of view to discover clusters characterized by different subspaces. For the first algorithm, available data labels are first used to search for subspaces, before searching for clusters. This algorithm assigns each instance to only one cluster (hard clustering) and is based on mapping known labels to subspaces using supervised classification techniques. Subspaces are then used to find clusters using traditional clustering techniques. The second algorithm uses available data labels to search for subspaces and clusters at the same time in an iterative process. This algorithm assigns each instance to each cluster based on a membership probability (soft clustering) and is based on integrating known labels and the search for subspaces into a model-based clustering approach. 
The different proposals are tested using different real and synthetic databases, and comparisons to other methods are also included when appropriate. Finally, as an example of a real and current application, different machine learning techniques, including one of the proposals of this work (the most sophisticated one), are applied to a task within one of the most challenging biological problems nowadays, human brain modeling. Specifically, expert neuroscientists do not agree on a neuron classification for the brain cortex, which makes impossible not only any modeling attempt but also day-to-day work, since there is no common way to name neurons. Therefore, machine learning techniques may help to reach an accepted solution to this problem, which can be an important milestone for future research in neuroscience.
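To make the idea of using available labels to guide clustering concrete, here is a minimal Python sketch of a seeded, constrained variant of k-means. It is not either of the thesis's two algorithms (it does no subspace search and no model-based soft assignment); it is a simpler, well-known technique swapped in for illustration. The function name, parameters and toy data are all assumptions.

```python
import math


def seeded_kmeans(points, labels, k, iterations=20):
    """Cluster points into k groups; labels[i] is a cluster index or None if unlabeled.

    Labeled points initialise the centroids and keep their cluster throughout.
    """
    def mean(group):
        return [sum(col) / len(group) for col in zip(*group)]

    # Initial centroids come from the labeled points of each cluster.
    centroids = []
    for c in range(k):
        seeds = [p for p, lab in zip(points, labels) if lab == c]
        if not seeds:
            raise ValueError(f"cluster {c} has no labeled seed point")
        centroids.append(mean(seeds))

    assignment = list(labels)
    for _ in range(iterations):
        # Assignment step: only unlabeled points move to the nearest centroid.
        for i, p in enumerate(points):
            if labels[i] is None:
                assignment[i] = min(range(k), key=lambda c: math.dist(p, centroids[c]))
        # Update step: recompute each centroid from its current members.
        centroids = [mean([p for p, a in zip(points, assignment) if a == c])
                     for c in range(k)]
    return assignment, centroids


# Toy data (hypothetical): one labeled point per cluster, the rest unlabeled.
points = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
labels = [0, None, None, 1, None, None]
assignment, centroids = seeded_kmeans(points, labels, k=2)
print(assignment)  # [0, 0, 0, 1, 1, 1]
```

Keeping the labeled points fixed means every cluster always retains at least its seeds, so no centroid is ever computed from an empty group.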
Abstract:
The objectives of this study were to assess the diversity and genetic structure of a collection of Spanish durum wheat (Triticum turgidum L.) landraces, using SSRs, DArTs and gliadin markers, and to correlate the distribution of diversity with geographic and climatic features, as well as agro-morphological traits. A high level of diversity was detected in the genotypes analyzed, which were separated into nine populations with moderate to great genetic divergence among them. The three subspecies taxa present in the collection, dicoccon, turgidum and durum, largely determined the clustering of the populations. Genotype variation was lower in dicoccon (one major population) and turgidum (two major populations) than in durum (five major populations). Genetic differentiation by the agro-ecological zone of origin was greater in dicoccon and turgidum than in durum. DArT markers revealed two geographic substructures, east-west for dicoccon and northeast-southwest for turgidum. The ssp. durum had a more complex structure, consisting of seven populations with high intra-population variation. DArT markers allowed the detection of subgroups within some populations, with agro-morphological and gliadin differences, and distinct agro-ecological zones of origin. Two different phylogenetic groups were detected, revealing that some durum populations were more related to ssp. turgidum from northern Spain, while others seem to be more related to durum wheats from North Africa.
Abstract:
BETs is a three-year project financed by the Space Program of the European Commission, aimed at developing an efficient deorbit system that could be carried on board any future satellite launched into Low Earth Orbit (LEO). The operational system involves a conductive tape-tether left bare to establish anodic contact with the ambient plasma as a giant Langmuir probe. As part of this project, we are carrying out both numerical and experimental approaches to estimate the current collected by the positive part of the tether. This paper deals with experimental measurements performed in the IONospheric Atmosphere Simulator (JONAS) plasma chamber of the Onera-Space Environment Department. The JONAS facility is a 9 m3 vacuum chamber equipped with a plasma source providing drifting plasma simulating LEO conditions in terms of density and temperature. A thin metallic cylinder, simulating the tether, is set inside the chamber and polarized up to 1000 V. The Earth's magnetic field is neutralized inside the chamber. First, the tether collected current versus tether polarization is measured for different plasma source energies and densities. In addition, several types of Langmuir probes are used at the same location to allow the extraction of both ion densities and electron parameters by computer modeling (classical Langmuir probe characteristics are not accurate enough in the present situation). These two measurements permit estimation of the discrepancies between the theoretical collection laws, the orbital motion limited law in particular, and the experimental data in LEO-like conditions without magnetic fields. Second, the spatial variations and the time evolution of the plasma properties around the tether are investigated. Spherical and emissive Langmuir probes are also used for a more extensive characterization of the plasma in space- and time-dependent analysis. 
Results show ion depletion due to the wake effect and the accumulation of ions upstream of the tether. In some regimes (at large positive potential), oscillations are observed in the tether collected current and in the Langmuir probe collected current at specific locations.
Abstract:
We review previously published results, and present new results, on the way current to a cylindrical probe drops below the orbital-motion-limited (OML) value for probe cross-sections that are too large or concave. Results on size and shape effects arise from unrelated behavior in the near and far potential field, and apply to a general cross-section, which can be characterised by the radius Req and perimeter peq of equivalent circles. These results are used to discuss collection interference among two or more parallel bare tethers when brought from far away into contact.
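For orientation, the OML value that the collected current is compared against can be sketched numerically. The abstract does not state the formula; the Python sketch below assumes the standard high-bias limit for electron collection (bias energy much larger than the electron thermal energy), I = (p/π) L e N sqrt(2 e ΔV / m_e), where p is the cross-section perimeter. The function name, parameter names and sample values are hypothetical.

```python
import math

E_CHARGE = 1.602176634e-19     # elementary charge, C
M_ELECTRON = 9.1093837015e-31  # electron mass, kg


def oml_current(perimeter_m, length_m, density_m3, bias_v):
    """High-bias OML electron current (A) to a bare cylinder of given perimeter.

    Assumes e * bias_v is much larger than the electron thermal energy.
    """
    if bias_v < 0:
        raise ValueError("bias_v must be non-negative for electron collection")
    speed = math.sqrt(2.0 * E_CHARGE * bias_v / M_ELECTRON)
    return (perimeter_m / math.pi) * length_m * E_CHARGE * density_m3 * speed


# Hypothetical example: 1 mm radius wire, 1 m long, 1e11 electrons per m^3, 100 V bias.
current = oml_current(2.0 * math.pi * 1e-3, 1.0, 1e11, 100.0)
print(f"{current:.3e} A")
```

In this limit the current scales with the perimeter and with the square root of the bias, which is why a cross-section can be summarised by an equivalent perimeter; the drop below this value for large or concave cross-sections is what the abstract discusses and is not modelled here.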
Abstract:
Objectives: A recently introduced pragmatic scheme promises to be a useful catalog of interneuron names. We sought to automatically classify digitally reconstructed interneuronal morphologies according to this scheme. Simultaneously, we sought to discover possible subtypes of these types that might emerge during automatic classification (clustering). We also investigated which morphometric properties were most relevant for this classification. Materials and methods: A set of 118 digitally reconstructed interneuronal morphologies was classified into the common basket (CB), horse-tail (HT), large basket (LB), and Martinotti (MA) interneuron types by 42 of the world's leading neuroscientists, and quantified by five simple morphometric properties of the axon and four of the dendrites. We labeled each neuron with the type most commonly assigned to it by the experts. We then removed this class information for each type separately, and applied semi-supervised clustering to those cells (keeping the others' cluster membership fixed), to assess separation from other types and look for the formation of new groups (subtypes). We performed this same experiment unlabeling the cells of two types at a time, and of half the cells of a single type at a time. The clustering model is a finite mixture of Gaussians which we adapted for the estimation of local (per-cluster) feature relevance. We performed the described experiments on three different subsets of the data, formed according to how many experts agreed on type membership: at least 18 experts (the full data set), at least 21 (73 neurons), and at least 26 (47 neurons). Results: Interneurons with more reliable type labels were classified more accurately. We classified HT cells with 100% accuracy, MA cells with 73% accuracy, and CB and LB cells with 56% and 58% accuracy, respectively. We identified three subtypes of the MA type, one subtype of CB and LB types each, and no subtypes of HT (it was a single, homogeneous type). 
We got maximum (adapted) Silhouette width and ARI values of 1, 0.83, 0.79, and 0.42, when unlabeling the HT, CB, LB, and MA types, respectively, confirming the quality of the formed cluster solutions. The subtypes identified when unlabeling a single type also emerged when unlabeling two types at a time, confirming their validity. Axonal morphometric properties were more relevant than dendritic ones, with the axonal polar histogram length in the [pi, 2pi) angle interval being particularly useful. Conclusions: The applied semi-supervised clustering method can accurately discriminate among CB, HT, LB, and MA interneuron types while discovering potential subtypes, and is therefore useful for neuronal classification. The discovery of potential subtypes suggests that some of these types are more heterogeneous than previously thought. Finally, axonal variables seem to be more relevant than dendritic ones for distinguishing among the CB, HT, LB, and MA interneuron types.
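The labeling step described in the methods (assigning each neuron the type most experts chose, then forming subsets by level of agreement) can be sketched in Python as follows. The function name, the vote counts and the tie-breaking behaviour are assumptions for illustration; only the thresholds 18 and 26 come from the abstract.

```python
from collections import Counter


def majority_label(votes, min_agreement):
    """Return the most common label, or None if fewer than min_agreement experts chose it.

    Ties are broken by first occurrence in votes, which is an arbitrary choice here.
    """
    label, count = Counter(votes).most_common(1)[0]
    return label if count >= min_agreement else None


# Hypothetical votes from 42 experts for one neuron (not data from the study).
votes = ["MA"] * 23 + ["CB"] * 12 + ["LB"] * 7
print(majority_label(votes, 18))  # MA
print(majority_label(votes, 26))  # None
```

A neuron like this one would be kept in the full data set (at least 18 experts agreeing) but excluded from the strictest subset (at least 26).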