AI with and for Open Science
Background: Researchers can leverage AI algorithms to analyze vast amounts of data quickly and efficiently. Moreover, AI tools are being used to generate content, write code, resolve accessibility issues, reconfigure writing processes and detect plagiarism. All of this is reshaping research practice and culture: how researchers communicate, how they share, and how they view infrastructure.
What's at stake: Open Science and Open Scholarly Communication cannot and should not progress unless they seek to understand current AI trends, find ways to embrace them in a trusted way, and identify the critical elements related to Open Science (after all, data, code, and knowledge are at the heart of both). In our upcoming actions, it is critical to address the similarities and synergies between AI and Open Science practices, and to investigate the prospects for wider adoption and inclusion of AI in our workflows, as early as possible.
AI with and for Open Science: In order to initiate the discussion and influence the path of Scholarly Communication (a shift that is imperative), we organised a panel at OSFAIR2023 in Madrid centred around this topic. It turned out to be one of the highlights of the conference, as it successfully united three seemingly disparate perspectives that ultimately converged on four key themes. The audience engagement was highly dynamic, indicating the growing enthusiasm for AI within the Open Science communities.
The panel featured three speakers who discussed the various ways in which Artificial Intelligence (AI) and Large Language Models (LLMs) are integrated into the research ecosystem, as well as the importance of Open Science in ensuring their proper implementation. Saikiran Chandha from SciSpace discussed how the relatively new service, SciSpace, utilises LLMs to aid researchers in their daily tasks of reading, summarising, and writing papers. Haris Papageorgiou from Athena Research & Innovation Centre and Opix presented various use cases demonstrating how LLMs are employed to extract information and enhance the OpenAIRE Graph, which can then be utilised in policy making to connect all aspects of Science-Technology-Innovation. Papageorgiou also advocated for the inclusion of LLMs in the Open Science infrastructure, noting that they are expensive to maintain and that their maintenance calls for collaboration. Kaylin Bugbee from NASA highlighted a public-private partnership with IBM that supports the development of a foundational LLM. Bugbee emphasised the importance of Open Science and the meticulous curation it entails.
The discussion that followed between the panelists and the audience was a lively exchange, with key questions around:
- Trust and quality: As anticipated, the majority of questions related to this topic. Considering the widely recognised principle in machine learning that 'the quality of the output is determined by the quality of the input', along with the existence of questionable research practices and non-reproducible research, the conversation emphasised the significance of careful selection and organisation in the initial stages of the process. This is where FAIR principles play a crucial role. Additionally, since the credibility of an AI assistant relies heavily on algorithmic transparency, should we consider making the algorithms open source?
- Open Science role: A recurring motif was the emphasis on the accessibility of the input, the algorithms (open source code), as well as the open infrastructure as a space where LLMs are integrated into the central framework of Open Science. What strategies can we employ to expand the scope of human curation and implement it on a larger scale as part of this remit?
- Commercial implications of AI: Companies are leading the way in offering AI-powered agents, and this includes commercial publishers, who are defending their involvement in scholarly communications. Public organisations must devise strategies to counter this trend and avoid repeating past errors. In particular, how can we implement a public-private partnership framework that leverages commercial providers to facilitate the use of AI, while ensuring content remains accessible and unrestricted?
- Human element and skills: How do we ensure support for the human element as we venture into AI? This calls for better training in the proper use of AI agent tools, such as informing users about the limitations of AI and the potential for biased outcomes.
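As a rough illustration of the "careful selection and organisation" point raised under trust and quality, the sketch below filters records by metadata completeness before they would be used as input to a model. The required fields and the record layout are hypothetical assumptions chosen for illustration; they are not defined by the FAIR principles or by any tool presented on the panel.

```python
# Hypothetical minimal set of metadata fields; an assumption for this sketch,
# not a list taken from the FAIR principles.
REQUIRED_FIELDS = ("identifier", "license", "description")


def missing_fields(record):
    """Return the required fields that are absent or empty in a record."""
    gaps = []
    for name in REQUIRED_FIELDS:
        value = record.get(name)
        if value is None or not str(value).strip():
            gaps.append(name)
    return gaps


def curate(records):
    """Split records into accepted ones and (record, gaps) pairs to review."""
    accepted, needs_review = [], []
    for record in records:
        gaps = missing_fields(record)
        if gaps:
            needs_review.append((record, gaps))
        else:
            accepted.append(record)
    return accepted, needs_review
```

A check like this only covers completeness; judging whether the content itself is reliable would still need human curation, which is the scaling question raised in the discussion.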
A summary of the three presentations
The impact of AI on research
Saikiran Chandha, SciSpace Founder and CEO
The challenge of information overload lies in effectively navigating, researching, and extracting the most valuable insights from a staggering 175 zettabytes of information.
Researchers devote significant time to reviewing literature during the initial stages of their research (ideation) as well as in the lead-up to publication, and they struggle to manage the overwhelming weekly influx of data.
Artificial intelligence has the potential to significantly decrease the amount of time that researchers spend on mundane tasks, allowing them to allocate more time to critical thinking, data analysis, thought synthesis, and drawing conclusions.
Knowledge discovery (getting through papers faster), communication enhancement (articulating yourself better), and publishing efficiency (expediting the workflow) are only some of the key areas in which AI is used.
SciSpace (https://typeset.io/) is a tool that addresses exactly these needs. A demo showcased the different workflows, showing how thousands of researchers already use it to complete their scholarly tasks.
Finally, how do we face the hallucination problem? Fortunately, the early signs of progress are encouraging, and continued developments are anticipated. We can expect better generation capability and factual consistency from Large Language Models in the near future.
Large Language Models as Infrastructure for Open Science
Haris Papageorgiou, Athena Research & Innovation Center, Research Director
Large Language Models (LLMs) are becoming a crucial part of the infrastructure for Open Science, significantly impacting the field of AI research and its applications.
Specific applications of LLMs in Open Science include research analysis (such as the Science No Borders toolkit), which involves tasks like field of science classification, sustainable development goals (SDG) analysis, citation and artifact analysis, and claim verification as already presented in the OpenAIRE Graph. These applications demonstrate the potential of LLMs to enrich data with factual knowledge and provide a more comprehensive understanding of scientific information. LLMs also play a role in reproducibility through artifact detection and citance analysis, enabling better verification and understanding of scientific claims. They aid in the analysis of news claims and scientific claims, verifying the authenticity and accuracy of such information.
In the realm of policy intelligence, LLMs can combine data from various sources to address complex policy and business questions, integrating analysis from areas like science, technology, industry, patents, and trademarks. This broad application scope shows the versatility of LLMs in handling diverse datasets and questions.
However, the use of LLMs comes with certain limitations and ethical considerations. These include potential misalignment with human needs, lack of interpretability, tendency to generate plausible but nonfactual predictions, challenges in keeping the models up-to-date, and ensuring proper attribution and reasoning capabilities. Legal and ethical issues such as data collection, output liability, and the risk of amplifying biases also need to be addressed. Addressing these challenges involves adopting good practices in data and model management, ensuring transparency, respecting community values, and considering user feedback and diverse user groups in model design.
Open Science, facilitated by LLMs, accelerates AI diffusion across various sectors, enhancing the ability of countries and regions to capitalize on scientific and technological progress. This process, termed 'diffusion efficiency', is critical for integrating technological advances into the economy and society at large.
Architecting the Future: NASA's Use of Large Language Models to Enable Open Science
Kaylin Bugbee, NASA Marshall Space Flight Center
Project Lead, Science Discovery Engine
Project Scientist, Transform to Open Science (TOPS) Project Office
NASA's approach to integrating Large Language Models (LLMs) into its Open Science framework is centered around several key aspects:
Open Science and AI Principles: Open science at NASA involves a collaborative culture powered by technology that enables the open sharing of data and knowledge. AI and LLMs are seen as transformative tools in increasing accessibility, making research more efficient, and enhancing the understanding of scientific impact.
Challenges with LLMs: NASA acknowledges the challenges in using LLMs, such as their tendency to generate inaccurate information, biases in source content, and lack of transparency in development. These issues conflict with open science values like reproducibility and transparency.
Collaboration for LLM Development: In partnership with IBM Research, NASA is developing a domain-specific model for science applications. This involves curating resources, utilizing a common training pipeline, and establishing an evaluation suite for performance assessment.
Goals of the Science Discovery Engine (SDE): The SDE aims to facilitate rapid discovery of NASA's scientific data and documentation, support open science infrastructure, and promote interdisciplinary research. It also prototypes emerging technologies and search techniques, including LLMs.
Challenges in Implementing SDE: Implementing the SDE involves overcoming challenges like integrating diverse scientific topics, curating scattered content, and enabling effective interdisciplinary search.
Results and Future Plans for SDE: The beta SDE, already housing a substantial collection of metadata and documents, is expected to be further refined and improved with the implementation of the SMD LLM.
Emphasis on Transparency and Collaboration: Transparency in providing scientific knowledge and collaborating across various scientific disciplines is essential for the successful implementation of LLMs.
Innovation and Bias Acknowledgement: NASA stresses the need for innovation in discovery methods and acknowledges existing biases in scientific systems. It advocates for open science principles throughout the AI lifecycle and the development of open models.
Open Science Education: NASA emphasizes the importance of education in open science principles and techniques, highlighted in their Open Science 101 initiative, which teaches core open science skills and best practices.