3 results for Minimum Description Length
at Université de Lausanne, Switzerland
Abstract:
SUMMARY This thesis concerns the development of algorithmic methods for automatically discovering the morphological structure of the words in a corpus. It focuses in particular on languages approaching the introflectional type, such as Arabic or Hebrew. Linguistic tradition describes the morphology of these languages in terms of discontinuous units: consonantal roots and vocalic patterns. This kind of structure is a challenge for current machine learning systems, which generally operate on continuous units. The strategy adopted here treats the problem as a sequence of two subproblems. The first is phonological: the symbols (phonemes, letters) of the corpus are divided into two groups corresponding as closely as possible to phonetic consonants and vowels. The second is morphological and builds on the results of the first: it consists in establishing the inventory of roots and patterns in the corpus and determining the rules by which they combine. We examine the scope and limits of an approach based on two hypotheses: (i) the distinction between consonants and vowels can be inferred from their tendency to alternate in the speech stream; (ii) roots and patterns can be identified with the sequences of consonants and vowels, respectively, discovered in the first step. The proposed algorithm uses a purely distributional method to partition the symbols of the corpus. It then applies analogical principles to identify a set of strong candidates for root or pattern status, and to enlarge this set progressively. This extension is subject to an evaluation procedure based on the minimum description length principle, in the spirit of LINGUISTICA (Goldsmith, 2001). The algorithm is implemented as a computer program named ARABICA and evaluated on a corpus of Arabic nouns with respect to its ability to describe the plural system. This study shows that complex linguistic structures can be discovered while making only a minimum of a priori hypotheses about the phenomena under consideration. It illustrates the possible synergy between learning mechanisms bearing on distinct levels of linguistic description, and seeks to determine when and why this cooperation fails. It concludes that the tension between the universality of the consonant-vowel distinction and the specificity of root-and-pattern structuring is crucial for explaining the strengths and weaknesses of such an approach.
ABSTRACT This dissertation is concerned with the development of algorithmic methods for the unsupervised learning of natural language morphology, using a symbolically transcribed wordlist. It focuses on the case of languages approaching the introflectional type, such as Arabic or Hebrew. The morphology of such languages is traditionally described in terms of discontinuous units: consonantal roots and vocalic patterns. Inferring this kind of structure is a challenging task for current unsupervised learning systems, which generally operate with continuous units. In this study, the problem of learning root-and-pattern morphology is divided into a phonological and a morphological subproblem. The phonological component of the analysis seeks to partition the symbols of a corpus (phonemes, letters) into two subsets that correspond well with the phonetic definition of consonants and vowels; building on this result, the morphological component attempts to establish the list of roots and patterns in the corpus, and to infer the rules that govern their combinations.
We assess the extent to which this can be done on the basis of two hypotheses: (i) the distinction between consonants and vowels can be learned by observing their tendency to alternate in speech; (ii) roots and patterns can be identified as sequences of the previously discovered consonants and vowels respectively. The proposed algorithm uses a purely distributional method for partitioning symbols. It then applies analogical principles to identify a preliminary set of reliable roots and patterns, and to gradually enlarge it. This extension process is guided by an evaluation procedure based on the minimum description length principle, in line with the approach to morphological learning embodied in LINGUISTICA (Goldsmith, 2001). The algorithm is implemented as a computer program named ARABICA; it is evaluated with regard to its ability to account for the system of plural formation in a corpus of Arabic nouns. This thesis shows that complex linguistic structures can be discovered without recourse to a rich set of a priori hypotheses about the phenomena under consideration. It illustrates the possible synergy between learning mechanisms operating at distinct levels of linguistic description, and attempts to determine where and why such cooperation fails. It concludes that the tension between the universality of the consonant-vowel distinction and the specificity of root-and-pattern structure is crucial for understanding the advantages and weaknesses of this approach.
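The two hypotheses above can be illustrated with a minimal sketch. The abstract does not name the distributional method it uses, so the code below swaps in Sukhotin's algorithm, a classic technique that separates vowels from consonants purely from their tendency to alternate; this is an assumption, not a description of ARABICA. The four-word list is a hypothetical toy transliteration, and `sukhotin_vowels` and `split_root_pattern` are illustrative names.

```python
from collections import Counter


def sukhotin_vowels(words):
    """Return the symbols that Sukhotin's algorithm classifies as vowels."""
    symbols = sorted({ch for w in words for ch in w})
    # Symmetric counts of adjacent symbol pairs; identical neighbours are ignored.
    adjacency = Counter()
    for w in words:
        for left, right in zip(w, w[1:]):
            if left != right:
                adjacency[left, right] += 1
                adjacency[right, left] += 1
    # Every symbol starts as a consonant; its score is its row sum.
    score = {s: sum(adjacency[s, t] for t in symbols) for s in symbols}
    vowels = set()
    while True:
        consonants = [s for s in symbols if s not in vowels]
        if not consonants:
            break
        best = max(consonants, key=lambda s: score[s])
        if score[best] <= 0:
            break  # no remaining symbol behaves like a vowel
        vowels.add(best)
        # Neighbours of the new vowel become less vowel-like.
        for s in consonants:
            if s != best:
                score[s] -= 2 * adjacency[s, best]
    return vowels


def split_root_pattern(word, vowels):
    """Hypothesis (ii): root = consonant sequence, pattern = vowel sequence."""
    root = "".join(ch for ch in word if ch not in vowels)
    pattern = "".join(ch for ch in word if ch in vowels)
    return root, pattern


# Hypothetical toy wordlist (simplified transliterations), for illustration only.
words = ["kataba", "kitab", "kutub", "katib"]
vowels = sukhotin_vowels(words)
analyses = {w: split_root_pattern(w, vowels) for w in words}
for w, (root, pattern) in analyses.items():
    print(w, root, pattern)
```

On this toy list the partition isolates a, i and u, and all four words reduce to the same consonantal root with different vowel sequences. The analogical extension and the minimum description length evaluation described in the abstract are not sketched here.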
Abstract:
Pizgrischite, (Cu,Fe)Cu14PbBi17S35, is a new mineral species named after the type locality, Piz Grisch Mountain, Val Ferrera, Graubünden, Switzerland. This sulfosalt occurs as thin, striated, metallic lead-grey blades measuring up to 1 cm in length, embedded in quartz and associated with tetrahedrite, chalcopyrite, pyrite, sphalerite, emplectite and derivatives of the aikinite-bismuthinite series. In plane-polarized light, the new species is brownish grey with no perceptible pleochroism; under crossed nicols in oil immersion, it presents a weak anisotropy with dark brown tints. Minimum and maximum reflectance values (in %) in air are: 40.7-42.15 (470 nm), 41.2-43.1 (546 nm), 41.2-43.35 (589 nm) and 40.7-43.3 (650 nm). Cleavage is perfect along {001} and well developed on {010}. Abundant polysynthetic twinning is observed on (010). The mean micro-indentation hardness is 190 kg/mm² (Mohs hardness 3.3), and the calculated density is 6.58 g/cm³. Electron-microprobe analyses yield (wt%; mean result of seven analyses): Cu 16.48, Pb 2.10, Fe 0.77, Bi 60.70, Sb 0.35, S 19.16, Se 0.04, total 99.60. The resulting empirical chemical formula is (Cu15.24Fe0.80Pb0.60)Σ16.64(Bi17.07Sb0.17)Σ17.24(S35.09Se0.03)Σ35.12, in accordance with the formula derived from the single-crystal refinement of the structure, (Cu,Fe)Cu14PbBi17S35. Pizgrischite is monoclinic, space group C2/m, with the following unit-cell parameters: a 35.054(2), b 3.91123(1), c 43.192(2) Å, β 96.713(4)°, V 5881.24 Å³, Z = 4. The strongest seven X-ray powder-diffraction lines [d in Å (I)(hkl)] are: 5.364(40)(6̄04), 4.080(50)(8̄05), 3.120(40)(118), 3.104(68)(3̄18), 2.759(53)(9̄11), 2.752(44)(910) and 1.956(100)(020). The crystal structure is an expanded monoclinic derivative of kupcikite. Pizgrischite belongs to the cuprobismutite series of bismuth sulfosalts but, sensu stricto, it is not a homologue of cuprobismutite.
At the type locality, pizgrischite is the result of Alpine metamorphism, under greenschist-facies conditions, of pre-Tertiary hydrothermal Cu-Bi mineralization.
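The figures reported in this abstract can be cross-checked with a short calculation. The sketch below rests on several assumptions that the abstract does not state: the empirical formula is normalised to 69 atoms (the total implied by the three Σ subtotals), standard atomic masses are used, and the density estimate treats the mixed (Cu,Fe) site as pure Cu. All function names are illustrative.

```python
import math

# Standard atomic masses in g/mol.
ATOMIC_MASS = {"Cu": 63.546, "Fe": 55.845, "Pb": 207.2, "Bi": 208.980,
               "Sb": 121.760, "S": 32.06, "Se": 78.971}

# Mean electron-microprobe analysis (wt%) as reported in the abstract.
WT_PERCENT = {"Cu": 16.48, "Pb": 2.10, "Fe": 0.77, "Bi": 60.70,
              "Sb": 0.35, "S": 19.16, "Se": 0.04}

AVOGADRO = 6.02214076e23  # 1/mol


def atoms_per_formula_unit(wt_percent, total_atoms):
    """Convert wt% to atomic proportions scaled to a fixed atom total."""
    moles = {el: wt / ATOMIC_MASS[el] for el, wt in wt_percent.items()}
    scale = total_atoms / sum(moles.values())
    return {el: n * scale for el, n in moles.items()}


def monoclinic_volume(a, b, c, beta_deg):
    """Unit-cell volume V = a * b * c * sin(beta), in cubic angstroms."""
    return a * b * c * math.sin(math.radians(beta_deg))


def calculated_density(formula_mass, z, volume_a3):
    """Density in g/cm^3 from formula mass, Z and cell volume (1 A^3 = 1e-24 cm^3)."""
    return z * formula_mass / (AVOGADRO * volume_a3 * 1e-24)


# Assumption: normalisation to 69 atoms, inferred from 16.64 + 17.24 + 35.12.
apfu = atoms_per_formula_unit(WT_PERCENT, 69)
volume = monoclinic_volume(35.054, 3.91123, 43.192, 96.713)

# Assumption: ideal formula with the (Cu,Fe) site taken as pure Cu.
ideal_formula = {"Cu": 15, "Pb": 1, "Bi": 17, "S": 35}
formula_mass = sum(n * ATOMIC_MASS[el] for el, n in ideal_formula.items())
rho = calculated_density(formula_mass, 4, volume)

for el, n in apfu.items():
    print(f"{el}: {n:.2f}")
print(f"V = {volume:.1f} A^3, density = {rho:.2f} g/cm^3")
```

Under these assumptions the atomic proportions agree with the reported empirical formula to within rounding, the cell volume matches the reported V, and the density comes out close to the reported 6.58 g/cm³. The small difference in density is expected, since some Fe substitutes for Cu in the real mineral.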
Abstract:
Recently, we examined the length of the spermatogenesis cycle in two shrew species: Sorex araneus, which has a very high metabolic rate and a polyandrous mating system (sperm competition) and shows a short cycle, and Crocidura russula, which has a much lower metabolic rate and a monogamous mating system and shows a longer cycle. In this study, we investigated the spermatogenesis cycle in Neomys fodiens, which has an intermediate metabolic rate. We described the stages of the seminiferous epithelium according to the spermatid morphology method and calculated the cycle length of spermatogenesis from the incorporation of 5-bromodeoxyuridine into the DNA of the germ cells. Twelve males were injected intraperitoneally with 5-bromodeoxyuridine, and the testes were collected. For cycle length determination, we applied a recently developed statistical method. The calculated cycle length is 8.69 days and the total duration of spermatogenesis, based on 4.5 cycles, is approximately 39.1 days. This is intermediate between the duration of spermatogenesis in S. araneus (37.6 days) and C. russula (54.5 days) and is therefore congruent with both the metabolic rate hypothesis and the sperm competition hypothesis. A relative testes size of 1.4% of body mass indicates a promiscuous mating system.
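The duration figures follow from simple arithmetic, sketched below. The cycle lengths back-calculated for S. araneus and C. russula assume the same 4.5-cycle convention; the abstract reports only their total durations, so those two values are illustrative rather than reported results.

```python
CYCLES_PER_SPERMATOGENESIS = 4.5  # convention stated in the abstract

# Reported for Neomys fodiens: cycle length in days.
neomys_cycle = 8.69
neomys_total = CYCLES_PER_SPERMATOGENESIS * neomys_cycle

# Reported total durations (days) for the two comparison species.
reported_totals = {"Sorex araneus": 37.6, "Crocidura russula": 54.5}

# Assumption: the same 4.5-cycle convention applies to both species.
implied_cycles = {species: total / CYCLES_PER_SPERMATOGENESIS
                  for species, total in reported_totals.items()}

print(f"Neomys fodiens: {neomys_total:.1f} days in total")
for species, cycle in implied_cycles.items():
    print(f"{species}: implied cycle of {cycle:.2f} days")
```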