Raven: A Local AI System for Navigating Scientific Literature, Applied to Hydrogen Production Research

  • Jeronen, Juha (JAMK University of Applied Sciences)


The volume of scientific publications has grown rapidly in recent years, making it increasingly difficult for researchers to maintain an overview of the state of the art. For example, a Web of Science search for "hydrogen production" alone returns over 100,000 results. When building a computational model of a hydrogen value chain, identifying which mathematical models and methods from the literature are best suited for the task is a substantial effort. Similar challenges arise with patent databases and in-house, proprietary document collections, and in such cases privacy may preclude cloud services. To address this, we introduce Raven, an open-source AI system for navigating large scientific literature collections. Raven is a 100% local, privacy-first tool that runs entirely on a single workstation with an NVIDIA GPU. The system uses open-weight, pretrained AI and ML models throughout. Raven consists of two core applications. First, Visualizer performs topic discovery on a BibTeX citation database. It computes semantic embeddings of titles and abstracts, clusters similar entries with HDBSCAN, and projects the result into a 2D map with t-SNE. This enables rapid, interactive exploration of the topical structure of a large literature collection. Second, Librarian is a conversational search and analysis tool using a local large language model (LLM). It employs hybrid keyword-semantic retrieval-augmented generation (RAG) to ground the LLM's answers in the dataset. We demonstrate Raven on a dataset containing the metadata of approximately 12,000 hydrogen production studies from Web of Science. Topic analysis with Visualizer reveals major research clusters including photocatalysis, biological fermentation, and steam reforming. From a computational mechanics perspective, numerical modeling studies in this dataset are few, and they appear scattered across all topic areas rather than forming a distinct cluster.
We then use Librarian to explore individual areas in more detail, with an eye toward subprocesses in a hydrogen value chain. As an example, we consider photovoltaic hydrogen production. Given variable solar irradiance, what modeling approaches are relevant, for example for cell efficiency, electrolyzer dynamics, or storage? Raven is open-source software, available at. As locally hostable open-weight LLMs continue to improve, we expect the utility of local AI tools for scientific work to increase correspondingly.
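To illustrate what hybrid keyword-semantic retrieval can look like, the following sketch ranks documents by a keyword score and by embedding similarity, then merges the two rankings. It is an assumption-laden illustration, not Librarian's implementation: the keyword side uses Okapi BM25, the semantic side uses cosine similarity over hand-made toy vectors in place of model embeddings, and the merge uses reciprocal rank fusion, which is one common choice. All function names and data are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Okapi BM25 score of each document for the query (keyword side)."""
    tokenized = [tokenize(d) for d in docs]
    avg_len = (sum(len(t) for t in tokenized) / len(tokenized)) or 1.0
    df = Counter(term for t in tokenized for term in set(t))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in set(tokenize(query)):
            if term not in tf:
                continue
            idf = math.log(1 + (len(docs) - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def ranks(scores):
    """Map document index -> rank, where rank 1 is the highest score."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    return {doc: rank for rank, doc in enumerate(order, start=1)}


def hybrid_search(query, query_vec, docs, doc_vecs, top_k=2, rrf_k=60):
    """Fuse keyword and semantic rankings with reciprocal rank fusion."""
    kw = ranks(bm25_scores(query, docs))
    sem = ranks([cosine(query_vec, v) for v in doc_vecs])
    fused = {i: 1 / (rrf_k + kw[i]) + 1 / (rrf_k + sem[i])
             for i in range(len(docs))}
    return sorted(fused, key=fused.get, reverse=True)[:top_k]


# Made-up toy corpus and toy 3D "embeddings", for illustration only.
docs = [
    "Dynamic model of a PEM electrolyzer under variable solar irradiance",
    "Dark fermentation of food waste for biohydrogen",
    "Solar thermochemical water splitting reactor simulation",
]
doc_vecs = [[0.9, 0.1, 0.0], [0.0, 1.0, 0.1], [0.1, 0.0, 0.9]]
top = hybrid_search("electrolyzer dynamics solar", [1.0, 0.0, 0.1], docs, doc_vecs)
```

In a RAG setting, the texts of the top-ranked entries would then be placed in the LLM's prompt so that its answer is grounded in the retrieved records.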