An Accurate and Efficient Approach to Knowledge Extraction from Scientific Publications Using Structured Ontology Models, Graph Neural Networks, and Large Language Models

Standard

An Accurate and Efficient Approach to Knowledge Extraction from Scientific Publications Using Structured Ontology Models, Graph Neural Networks, and Large Language Models. / Ivanisenko, Timofey V.; Demenkov, Pavel S.; Ivanisenko, Vladimir A.

In: International Journal of Molecular Sciences, Vol. 25, No. 21, 11811, 03.11.2024.

Research output: Scientific publications in periodicals › article › peer-review

BibTeX

@article{8ff6b6462cbf4e179288f99f199239bf,

title = "An Accurate and Efficient Approach to Knowledge Extraction from Scientific Publications Using Structured Ontology Models, Graph Neural Networks, and Large Language Models",

abstract = "The rapid growth of biomedical literature makes it challenging for researchers to stay current. Integrating knowledge from various sources is crucial for studying complex biological systems. Traditional text-mining methods often have limited accuracy because they don{\textquoteright}t capture semantic and contextual nuances. Deep-learning models can be computationally expensive and typically have low interpretability, though efforts in explainable AI aim to mitigate this. Furthermore, transformer-based models have a tendency to produce false or made-up information, a problem known as hallucination that is especially prevalent in large language models (LLMs). This study proposes a hybrid approach combining text-mining techniques with graph neural networks (GNNs) and fine-tuned large language models (LLMs) to extend biomedical knowledge graphs and interpret predicted edges based on published literature. An LLM is used to validate predictions and provide explanations. Evaluated on a corpus of experimentally confirmed protein interactions, the approach achieved a Matthews correlation coefficient (MCC) of 0.772. Applied to insomnia, the approach identified 25 interactions between 32 human proteins absent in known knowledge bases, including regulatory interactions between MAOA and 5-HT2C, binding between ADAM22 and 14-3-3 proteins, which is implicated in neurological diseases, and a circadian regulatory loop involving RORB and NR1D1. The hybrid GNN-LLM method analyzes biomedical literature efficiently to uncover potential molecular interactions for complex disorders. It can accelerate therapeutic target discovery by focusing expert verification on the most relevant automatically extracted information.",

keywords = "ANDSystem, GNN, LLM, deep learning, knowledge graph, text-mining, Data Mining/methods, Humans, Neural Networks, Computer, Deep Learning, Publications, Knowledge Bases",

author = "Ivanisenko, {Timofey V.} and Demenkov, {Pavel S.} and Ivanisenko, {Vladimir A.}",

note = "This work was supported by a grant for research centers, provided by the Analytical Center for the Government of the Russian Federation in accordance with the subsidy agreement (agreement identifier 000000D730324P540002) and the agreement with the Novosibirsk State University dated December 27, 2023 No. 70-2023-001318. Ivanisenko, T. V. An Accurate and Efficient Approach to Knowledge Extraction from Scientific Publications Using Structured Ontology Models, Graph Neural Networks, and Large Language Models / T. V. Ivanisenko, P. S. Demenkov, V. A. Ivanisenko // International Journal of Molecular Sciences. – 2024. – Vol. 25, No. 21. – P. 11811. – DOI 10.3390/ijms252111811.",

year = "2024",

month = nov,

day = "3",

doi = "10.3390/ijms252111811",

language = "English",

volume = "25",

journal = "International Journal of Molecular Sciences",

issn = "1661-6596",

publisher = "Multidisciplinary Digital Publishing Institute (MDPI)",

number = "21",

}

RIS

TY - JOUR

T1 - An Accurate and Efficient Approach to Knowledge Extraction from Scientific Publications Using Structured Ontology Models, Graph Neural Networks, and Large Language Models

AU - Ivanisenko, Timofey V.

AU - Demenkov, Pavel S.

AU - Ivanisenko, Vladimir A.

N1 - This work was supported by a grant for research centers, provided by the Analytical Center for the Government of the Russian Federation in accordance with the subsidy agreement (agreement identifier 000000D730324P540002) and the agreement with the Novosibirsk State University dated December 27, 2023 No. 70-2023-001318. Ivanisenko, T. V. An Accurate and Efficient Approach to Knowledge Extraction from Scientific Publications Using Structured Ontology Models, Graph Neural Networks, and Large Language Models / T. V. Ivanisenko, P. S. Demenkov, V. A. Ivanisenko // International Journal of Molecular Sciences. – 2024. – Vol. 25, No. 21. – P. 11811. – DOI 10.3390/ijms252111811.

PY - 2024/11/3

Y1 - 2024/11/3

N2 - The rapid growth of biomedical literature makes it challenging for researchers to stay current. Integrating knowledge from various sources is crucial for studying complex biological systems. Traditional text-mining methods often have limited accuracy because they don’t capture semantic and contextual nuances. Deep-learning models can be computationally expensive and typically have low interpretability, though efforts in explainable AI aim to mitigate this. Furthermore, transformer-based models have a tendency to produce false or made-up information, a problem known as hallucination that is especially prevalent in large language models (LLMs). This study proposes a hybrid approach combining text-mining techniques with graph neural networks (GNNs) and fine-tuned large language models (LLMs) to extend biomedical knowledge graphs and interpret predicted edges based on published literature. An LLM is used to validate predictions and provide explanations. Evaluated on a corpus of experimentally confirmed protein interactions, the approach achieved a Matthews correlation coefficient (MCC) of 0.772. Applied to insomnia, the approach identified 25 interactions between 32 human proteins absent in known knowledge bases, including regulatory interactions between MAOA and 5-HT2C, binding between ADAM22 and 14-3-3 proteins, which is implicated in neurological diseases, and a circadian regulatory loop involving RORB and NR1D1. The hybrid GNN-LLM method analyzes biomedical literature efficiently to uncover potential molecular interactions for complex disorders. It can accelerate therapeutic target discovery by focusing expert verification on the most relevant automatically extracted information.

AB - The rapid growth of biomedical literature makes it challenging for researchers to stay current. Integrating knowledge from various sources is crucial for studying complex biological systems. Traditional text-mining methods often have limited accuracy because they don’t capture semantic and contextual nuances. Deep-learning models can be computationally expensive and typically have low interpretability, though efforts in explainable AI aim to mitigate this. Furthermore, transformer-based models have a tendency to produce false or made-up information, a problem known as hallucination that is especially prevalent in large language models (LLMs). This study proposes a hybrid approach combining text-mining techniques with graph neural networks (GNNs) and fine-tuned large language models (LLMs) to extend biomedical knowledge graphs and interpret predicted edges based on published literature. An LLM is used to validate predictions and provide explanations. Evaluated on a corpus of experimentally confirmed protein interactions, the approach achieved a Matthews correlation coefficient (MCC) of 0.772. Applied to insomnia, the approach identified 25 interactions between 32 human proteins absent in known knowledge bases, including regulatory interactions between MAOA and 5-HT2C, binding between ADAM22 and 14-3-3 proteins, which is implicated in neurological diseases, and a circadian regulatory loop involving RORB and NR1D1. The hybrid GNN-LLM method analyzes biomedical literature efficiently to uncover potential molecular interactions for complex disorders. It can accelerate therapeutic target discovery by focusing expert verification on the most relevant automatically extracted information.

KW - ANDSystem

KW - GNN

KW - LLM

KW - deep learning

KW - knowledge graph

KW - text-mining

KW - Data Mining/methods

KW - Humans

KW - Neural Networks, Computer

KW - Deep Learning

KW - Publications

KW - Knowledge Bases

UR - https://www.mendeley.com/catalogue/cb47cdba-715c-3f12-aa26-6d6cc37f85e7/

UR - https://www.elibrary.ru/item.asp?id=79443708

UR - https://pubmed.ncbi.nlm.nih.gov/39519363/

U2 - 10.3390/ijms252111811

DO - 10.3390/ijms252111811

M3 - Article

C2 - 39519363

VL - 25

JO - International Journal of Molecular Sciences

JF - International Journal of Molecular Sciences

SN - 1661-6596

IS - 21

M1 - 11811

ER -

ID: 61105653