Modern domain-specific scientific literature is often distributed as extremely large and complex datasets. For scientists searching these repositories for a specific term and the articles that mention it, standard methods can be impractical because of the data's complexity and the computation time required. The goal of this project is to give researchers a quick and easy way to search data gathered from the scientific literature, so that it better supports their work. This work presents the design and development of a solution to that problem: a federated full-text search architecture that supports scientists working with biomedical literature such as PubMed, which contains close to 40 million individual records. The core of the solution is a set of containerized OpenSearch engine instances, created and maintained within a federated system, chosen for its flexibility and its ability to adapt quickly to different datasets and infrastructure architectures. With this design, users can define their own computing infrastructure according to their needs and capabilities, which could greatly reduce the time and resources spent on research. The project is continuously evolving to improve its features and broaden its use cases. Future work will test the proposed solution on different computing infrastructures and software settings to identify a well-optimized configuration for a drug repurposing knowledge graph use case.
In this seminar, I will introduce the basic concepts and techniques of federated search, explain how such systems can be beneficial and how to design them, and present the results of experiments with a federated search engine on sample datasets.
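To make the basic idea of federated search concrete, the following minimal Python sketch sends one query to several independent search nodes in parallel and merges their top hits into a single ranked list. It is an illustration only, not the project's implementation: the class InMemoryNode, the function federated_search, the node names, and the sample documents are all hypothetical, and a simple term count stands in for the relevance score that a real engine such as OpenSearch would compute.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
import heapq


@dataclass(frozen=True)
class Hit:
    doc_id: str
    score: float
    node: str


class InMemoryNode:
    """Hypothetical stand-in for one search engine instance in the federation."""

    def __init__(self, name, docs):
        self.name = name
        self.docs = docs  # mapping: doc_id -> document text

    def search(self, term, size):
        # Toy relevance score: number of whitespace-separated tokens equal to the term.
        term = term.lower()
        hits = []
        for doc_id, text in self.docs.items():
            count = text.lower().split().count(term)
            if count:
                hits.append(Hit(doc_id, float(count), self.name))
        hits.sort(key=lambda h: (-h.score, h.doc_id))
        return hits[:size]


def federated_search(nodes, term, size=10):
    """Query every node in parallel, then merge the per-node results."""
    with ThreadPoolExecutor(max_workers=max(1, len(nodes))) as pool:
        per_node = list(pool.map(lambda n: n.search(term, size), nodes))
    merged = [hit for hits in per_node for hit in hits]
    # Keep the globally best hits; ties are broken by document id.
    return heapq.nsmallest(size, merged, key=lambda h: (-h.score, h.doc_id))


# Assumed sample data, for illustration only.
nodes = [
    InMemoryNode("node-a", {
        "a1": "aspirin reduces fever",
        "a2": "aspirin and aspirin analogues",
    }),
    InMemoryNode("node-b", {
        "b1": "metformin in diabetes",
        "b2": "low-dose aspirin trial",
    }),
]
results = federated_search(nodes, "aspirin", size=3)
for hit in results:
    print(hit.node, hit.doc_id, hit.score)
```

In this toy setting the scores from different nodes are directly comparable because every node counts terms the same way. With independently maintained indices, relevance scores generally depend on per-index statistics, so a real federated system may need to normalize or re-rank results before merging them.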