About This Project
Researchers today are overwhelmed by thousands of new findings and devote substantial effort to keeping up with advances in their fields. Understanding scientific topics and domains is a complex endeavor that current systems do not support well. To address the need to make new discoveries from large datasets, this project aims to show how machine learning can be harnessed, by combining the strengths of humans and computational agents, to solve crowdsourcing tasks for scientific discovery.
What is the context of this research?
Big Data has become a major research avenue in today’s computing landscape and refers to datasets that are so large and diverse that current automatic tools can't manage or analyze them efficiently (Borkar et al., 2012). On the other hand, human computation alone is unable to examine variances, correlate evidence, and compile descriptive statistics at large scale. As academic literature production grows exponentially (Priem & Hemminger, 2010), scientific data repositories require “more meaningful indexing, classification, and descriptive metadata in order to facilitate data discovery, reuse and understanding” (Borne, 2013). This project relies on a mixed-initiative approach in which humans and machines cooperate naturally and effectively to achieve new scientific breakthroughs.
What is the significance of this project?
Automatic mechanisms able to read tens of thousands of research papers and then predict new discoveries about diseases could herald a faster and cheaper approach to developing new treatments. Researchers also acknowledge that harnessing the cognitive abilities of a human crowd can be a reliable way to perform tasks or provide data for solving difficult problems that no known efficient computer algorithm can yet solve (Quinn & Bederson, 2011). Since the symbiosis between collective and computational intelligence is not adequately supported by an overarching framework for scientific discovery, this project aims to enhance computational reasoning capabilities while keeping machines accurate when "timely" scientific information suddenly needs updating.
What are the goals of the project?
Considering the limitations identified by Correia et al. (2013) regarding the task of finding and manually analyzing all kinds of digital artifacts and other intellectual assets produced by researchers, this project proposal aims to reduce the bias, time and cognitive effort involved in seeking, cataloging, analyzing, and classifying scientific data. In addition, we want to extend the limited scope (e.g., sample size) of literature reviews using machine learning and crowd-based human computation techniques. Last but not least, the funds will be used to pay citizen scientists to collect and analyze data in order to address the lack of human-centered results and better inform the creation of a hybrid intelligent system for supporting research pursuits on a large scale.
Developing an intelligent system for scientific discovery using AI and crowdsourcing is an exciting challenge. We need funds to perform experiments in order to build the system "from the ground up". The more funds we raise the better, since we can use excess funds to deliver a more robust approach to bridging the gap between science and the general public.
1. We need datasets produced by crowds to learn from their behavior. MTurk will be used to recruit and pay people for participating in the surveys and experiments we will design.
2. The budget will be used to hire a research assistant (either undergrad or Master’s) with programming skills as a member of our team to help in the system development.
3. The proposed budget is intended to cover the costs necessary to the development of the system infrastructure, including web hosting, rental of server space, etc. Any additional funds will be used to incorporate more features.
This project has the following milestones: (1) identifying limitations and opportunities for crowd-based human computation in science, (2) surveying the problem under study by characterizing conceptual dimensions, (3) feature analysis and literature review comparing systems, (4) case studies and experiments, (5) a conceptual framework for large-scale scientific data search, analysis and classification, and (6) informing the design of a mixed-initiative system called SciCrowd.
Aug 15, 2017
Sep 01, 2017
Qualitative study of conceptual dimensions underlying crowd-based human computation, massively collaborative science, automated reasoning, and mixed-initiative systems.
Oct 01, 2017
Systematic literature review and feature analysis comparing the characteristics and features of existing tools for scientific data analysis and classification.
Dec 01, 2017
Survey with perceptions from researchers in crowdsourcing and citizen science.
Jan 01, 2018
Experiments and case studies using crowd-based human computation in science (N=100 volunteers).
Meet the Team
Benjamim Fonseca is an Assistant Professor at the University of Trás-os-Montes e Alto Douro (UTAD), in Portugal, where he lectures on collaboration and inclusive systems, and a researcher at the INESC TEC. He received a Ph.D. from UTAD, with a thesis subject of “A model for creating cooperative services”. His research interests are Computer Supported Cooperative Work (CSCW) and mobile accessibility, having participated in several research projects funded by companies and international funding agencies. He authored or coauthored over 100 scientific publications in conferences, journals and books in these research fields, and actively participates in the organization and scientific committees of several reputed conferences and journals.
Hugo Paredes received B.Eng. and Ph.D. degrees in Computer Science from the University of Minho, Braga, Portugal, in 2000 and 2008, and the Habilitation title from the University of Trás-os-Montes e Alto Douro (UTAD), Vila Real, Portugal, in 2016. He was a software engineer at SiBS, S.A. and a software consultant at Novabase Outsourcing, S.A. Since 2003, he has been at UTAD, where he is currently Assistant Professor with Habilitation, lecturing on systems integration and distributed systems. Currently he is vice-director of the Masters in Computer Science and in Accessibility and Rehabilitation and Engineering. He is a Senior Researcher at the Institute for Systems and Computer Engineering, Technology and Science – INESC TEC – and leader of the “Information Technologies, Virtual Environments and Accessibility” research group at the INESC TEC UTAD Pole. His main research interests are in the domain of Human Computer Interaction, including Collaboration and Accessibility topics. He is a member of the J.UCS board of editors, was guest editor of three Special Issues in journals indexed by the Journal Citation Reports and collaborates with the steering committee of the DSAI International Conference. He has authored or co-authored more than 100 publications, including refereed journals, book chapters and conference papers. He is one of the inventors of a granted patent and of a pending patent application. He participated in thirteen national projects and three international projects, eight of them with public funding and six with private funding.
This proposal explores research prospects and hypotheses for future collaborative intelligent systems that combine automated reasoning with human annotation to harness machine analysis, by testing a crowd-enabled human computation and machine reasoning model for semantic analytics. This model is expected to allow the extraction of relevant facts about the relationships between disciplines, scholars and publications, addressing the limitations of current tools for understanding research attributes and trends effectively at different levels of granularity and for relating them “semantically” through an integrated solution (Osborne et al., 2013). Taking into account Woolley et al.'s (2010) analysis of crowd behavior and collective intelligence, this work takes a step forward by studying convergence indicators and input requirements with the use of automatic tools and crowd-based human computation (e.g., MTurk and Crowdcrafting) on a vast set of scientific publications. The design of a community self-organizing bibliographic information system (Correia et al., 2013) will be informed “from the ground up” to support this crowd-enabled scientific data analytics process. Currently, this system only allows users to authenticate and to edit and classify data using different annotations, comments and categories.
A crowdsourcing scenario provides a reliable setting for investigating human collective intelligence, generated through networks of interactions among individuals, environments, and contents. A scenario will be presented in which scholars can pick up on machine annotations to focus on key parts of publications and then provide annotations and create classification categories. They can then use those annotations to reflect on their own interpretation, or read other people’s annotations and discover new aspects, interpretations and knowledge. Problems are formalized in the crowdsourced setting, providing efficient algorithms that are guaranteed to achieve good results with high probability. The question of crowdsourcing scientific data analysis and clustering will be answered in two steps: 1) reduce the problem to a number of independent Human Intelligence Tasks of reasonable size and assign them to a large pool of participants, and 2) develop a model of the annotation process to aggregate the human data, automatically yielding a partition of the dataset into categories. The outputs of this human-machine analytical approach will be tested on a number of real data sets and compared against existing methods.
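The two steps described above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the function names `make_hits` and `aggregate`, the worker and paper identifiers, and the category labels are all hypothetical, and a plain majority vote stands in for the annotation model, which the proposal does not specify.

```python
from collections import Counter, defaultdict
from itertools import islice


def make_hits(paper_ids, hit_size):
    """Step 1: split the papers into independent tasks of at most hit_size items."""
    iterator = iter(paper_ids)
    hits = []
    while True:
        batch = list(islice(iterator, hit_size))
        if not batch:
            return hits
        hits.append(batch)


def aggregate(annotations):
    """Step 2 (simplified): majority vote per paper, returned as a partition.

    annotations is an iterable of (worker_id, paper_id, category) tuples.
    The result maps each winning category to the sorted papers assigned to it.
    """
    votes = defaultdict(Counter)
    for _worker, paper, category in annotations:
        votes[paper][category] += 1

    partition = defaultdict(list)
    for paper in sorted(votes):
        # Ties are broken in favour of the category that was seen first.
        winner = votes[paper].most_common(1)[0][0]
        partition[winner].append(paper)
    return dict(partition)


# Hypothetical data, for illustration only.
papers = ["p1", "p2", "p3", "p4", "p5"]
hits = make_hits(papers, 2)

annotations = [
    ("w1", "p1", "medical informatics"),
    ("w2", "p1", "medical informatics"),
    ("w3", "p1", "citizen science"),
    ("w1", "p2", "citizen science"),
    ("w2", "p2", "citizen science"),
    ("w3", "p2", "citizen science"),
]
partition = aggregate(annotations)
print(hits)
print(partition)
```

Majority voting is the simplest possible aggregation rule; a fuller model of the annotation process would presumably also weigh worker reliability, which this sketch ignores.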
For example, consider the following evaluation scenario. A paper classified as "medical informatics" could be characterized by subarea (e.g., cognitive aging), aims and purpose, setting and context, key concepts and definitions, participant characteristics, research boundaries and limitations, method, results and findings, social-technical aspects concerning a certain technology (e.g., a Wiki to support knowledge exchange in public health), related work, scientometric data (e.g., authors’ affiliation and its country), and annotations as a meta-cognitive activity. In addition, all these data could be correlated and filtered to present the final results for specific research purposes (e.g., identifying what kinds of features were introduced in health care technologies by Canadian researchers between 2009 and 2016).
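The filtering step in this scenario can be illustrated with a small sketch. The record layout (`PaperRecord`), the function name `features_by`, and the sample entries are all hypothetical and invented for illustration; they are not SciCrowd's data model and do not describe real publications.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PaperRecord:
    """Hypothetical, heavily reduced view of a classified paper."""
    title: str
    area: str
    country: str      # country of the authors' affiliation
    year: int
    features: tuple   # technology features reported in the paper


def features_by(records, area, country, first_year, last_year):
    """Collect the distinct features from papers matching all filter criteria."""
    found = set()
    for record in records:
        if (record.area == area
                and record.country == country
                and first_year <= record.year <= last_year):
            found.update(record.features)
    return sorted(found)


# Invented sample entries, for illustration only.
records = [
    PaperRecord("Paper A", "medical informatics", "Canada", 2011, ("wiki", "tagging")),
    PaperRecord("Paper B", "medical informatics", "Canada", 2015, ("wiki", "alerts")),
    PaperRecord("Paper C", "medical informatics", "Portugal", 2012, ("chat",)),
    PaperRecord("Paper D", "medical informatics", "Canada", 2008, ("forum",)),
]
result = features_by(records, "medical informatics", "Canada", 2009, 2016)
print(result)
```

In practice the other dimensions listed above (method, setting, participant characteristics, and so on) would be further fields on the record, and the same filter pattern would apply to them.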
A controlled experiment applying data mining (to discover previously unknown properties) and machine learning classifiers (which will be trained to recognize certain patterns in data based on a thesaurus) on a large number of publications represents the research setting for testing computational intelligence. In this sense, metadata structures (e.g., Dublin Core), hybrid tools for data alignment and generation from text (e.g., Apolo), automatic and crowd-based taxonomy creation systems (e.g., Cascade), and open classification models will support the identification of complex system dynamics and emergent vocabularies resulting in a knowledge base of scientific facts extracted from literature.
A hybrid methodology will rely mainly on evidence-based research for the systematic analysis of data, so that common concepts and ideas are extracted and then axially referenced to produce higher-level themes and concepts that frame the theoretical understanding of the researched phenomenon. Case studies based on crowdsourcing and human computation will also support the global workflow by measuring the relevance of personalized search and analysis, gathering training data for machine learning classifiers, and designing an intelligent system that incorporates crowdsourcing in high-quality research.
Semantics will be framed to crowd workers by means of sentences, scenarios, and descriptions discussing scientific facts and performance measures concerning the crowdsourcing process, in order to analyze the semantic correctness, naturalness, and bias of the collected data sets. Pattern recognition, word similarity, recognizing textual entailment, event temporal ordering, and word sense disambiguation will be some of the evaluation methods. Experiments will be conducted with a large crowd of MTurk volunteers, and the workflow consists of selecting papers and classification dimensions for analysis, splitting a large task into batches, asking crowds and using automated mechanisms to classify a query, applying quality control, aggregating contributions, and collecting results and metrics, assembling methods for crowd wisdom consensus.
Since it remains unclear whether manual data gathering and evaluation can scale to a large set of publications and scholars, it is assumed that the prerequisites for crowdsourcing and machine learning are present in academic settings and that scientists perceive it as a useful approach for supporting research.
- $280 Total Donations
- $28.00 Average Donation