5 results for Imbalanced datasets
in the JISC Information Environment Repository
Abstract:
Supporting presentation slides as part of the Janet network end-to-end performance initiative.
Abstract:
For sign languages used by deaf communities, linguistic corpora have until recently been unavailable, owing to the lack of a writing system and a written culture in these communities and to the only very recent advent of digital video. Improvements in video and computer technology have now made larger sign language datasets possible; however, large sign language datasets that are fully machine-readable remain elusive. This is due to two challenges: (1) inconsistencies that arise when signs are annotated by means of a spoken/written language, and (2) the fact that many parts of signed interaction are not fully composed of lexical signs (the equivalent of words), instead consisting of constructions that are less conventionalised. As sign language corpus building progresses, the potential for some standards in annotation is beginning to emerge, but before this project there had been no attempt to standardise these practices across corpora, which is required for comparing data cross-linguistically. The project therefore had the following aims: (1) to develop annotation standards for glosses (the lexical/word level); (2) to test their reliability and validity; and (3) to improve current software tools that support a reliable annotation workflow. Overall, the project aimed not only to set a standard for the whole field of sign language studies throughout the world but also to make significant advances toward two of the world’s largest machine-readable datasets for sign languages: the BSL Corpus (British Sign Language, http://bslcorpusproject.org) and the Corpus NGT (Sign Language of the Netherlands, http://www.ru.nl/corpusngt).
Abstract:
Scientific research revolves around the production, analysis, storage, management, and re-use of data. Data sharing offers important benefits for scientific progress and the advancement of knowledge. However, several limitations and barriers to the general adoption of data sharing remain in place. Probably the most important challenge is that data sharing is not yet common among scholars and is not yet seen as a regular scientific activity, although considerable effort is being invested in promoting it. In addition, scholars show relatively little commitment to citing data. The most important problems and challenges regarding data metrics are closely tied to these more general problems of data sharing. The development of data metrics depends on the growth of data sharing practices; after all, data metrics are nothing more than the registration of researchers’ behaviour. At the same time, the availability of proper metrics can help researchers make their data work more visible, which may in turn act as an incentive for more data sharing, setting a virtuous circle in motion. This report explores the possibilities of metrics for datasets (i.e. the creation of reliable data metrics) and of an effective reward system that aligns the interests of the main stakeholders involved. The report reviews the current literature on data sharing and data metrics, presents interviews with the main stakeholders, and analyses existing repositories and tools in the field of data sharing that are especially relevant to the promotion and development of data metrics. On the basis of these three pillars, the report presents a number of solutions and necessary developments, as well as a set of recommendations regarding data metrics. The most important recommendations include the general adoption of data sharing and data publication among scholars; the development of a reward system for scientists that includes data metrics; reducing the costs of data publication; reducing researchers’ negative cultural perceptions of data publication; developing standards for the preservation, publication, identification and citation of datasets; more coordination among data repository initiatives; and further development of interoperability protocols across the different actors.
Abstract:
Authority files serve to uniquely identify real-world ‘things’ or entities, such as documents, persons and organisations, as well as their properties, such as relations and features. Already important in the classical library world, authority files are indispensable for adequate information retrieval and analysis in the computer age, because computers are even poorer than humans at handling ambiguity. Through authority files, people tell computers which terms, names or numbers refer to the same thing or have the same meaning, by giving equivalent notions the same identifier. Authority files thus signpost the internet, with these identifiers interlinked on the basis of relevance. When executing a query, computers can navigate from identifier to identifier by following these links and collect the queried information along these so-called ‘crosswalks’. In this context, identifiers are also known as controlled access points. Identifiers become even more crucial now that massive data collections such as library catalogues and research datasets are releasing their hitherto contained data directly onto the internet, a development known as Linked Open Data. The corresponding name for the internet is then the Web of Data rather than the classical Web of Documents.
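To make the ‘crosswalk’ idea concrete, below is a minimal, illustrative sketch (not part of the abstract above): it looks an entity up by one authority identifier and follows the published links to its identifiers in other authority files. It assumes the public Wikidata SPARQL endpoint; the VIAF number and the property IDs used (P214 for VIAF, P227 for GND) are illustrative choices, not data from the report.

```python
# Illustrative sketch: resolve one controlled access point (a VIAF identifier)
# and follow the crosswalk to an equivalent identifier in another authority file.
import requests

ENDPOINT = "https://query.wikidata.org/sparql"

QUERY = """
SELECT ?item ?itemLabel ?gnd WHERE {
  ?item wdt:P214 "75121530" .          # look the entity up by its VIAF identifier (example value)
  OPTIONAL { ?item wdt:P227 ?gnd . }   # follow the link to the GND identifier, if present
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}
"""

response = requests.get(
    ENDPOINT,
    params={"query": QUERY, "format": "json"},
    headers={"User-Agent": "authority-crosswalk-demo/0.1"},
)
for row in response.json()["results"]["bindings"]:
    # Print the entity's label and, where available, its identifier in the other file.
    print(row["itemLabel"]["value"], row.get("gnd", {}).get("value"))
```

Because equivalent notions share identifiers, the same pattern generalises: any query can hop from one authority file to another by following these links rather than matching ambiguous name strings.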
Abstract:
The possibilities of digital research have altered the production, publication and use of research results. Academic research practice and culture are changing or have already been transformed, but to a large degree the system of academic recognition has not yet adapted to the practices and possibilities of digital research. This applies especially to research data, which are increasingly produced, managed, published and archived, but hardly play a role yet in research assessment practices. The aim of the workshop was to bring together experts and stakeholders from research institutions, universities, scholarly societies and funding agencies in order to review, discuss and build on possibilities to implement a culture of sharing and to integrate data publication into research assessment procedures. The report 'The Value of Research Data - Metrics for datasets from a cultural and technical point of view' was presented and discussed. Some of the key findings were that data sharing should be considered normal research practice; in fact, not sharing should be considered malpractice. Research funders and universities should support and encourage data sharing. There are a number of important aspects to consider when making data count in research evaluation procedures. Metrics are a necessary tool for monitoring the sharing of datasets; however, data metrics are at present not well developed, and there is not yet enough experience of what these metrics actually mean. It is important to implement the culture of sharing through codes of conduct in the scientific communities. For further key findings, please read the report.