Multimodal neural network processing of video lectures using multi-agent systems

Milan E. Ismagulov; Исмагулов Милан Ерикович

doi:10.18822/byusu20250340-47

Authors: Ismagulov M.E.¹
Affiliations:
1. Yugra State University
Issue: Vol 21, No 3 (2025)
Pages: 40-47
Section: Mathematical modeling and information technology
Published: 17.09.2025
URL: https://vestnikugrasu.org/byusu/article/view/685568
DOI: https://doi.org/10.18822/byusu20250340-47
ID: 685568

Abstract

Subject of research: multimodal processing of video lectures using multi-agent systems. The article presents intermediate results of the research: an overview of the concepts of multimodality, multi-agent systems, and multi-model systems, and the development of approaches to processing video data from lectures.

Purpose of research: to convert all relevant information from a video lecture into a text document that serves as an accompanying lecture summary. The goal is to develop an effective data processing cycle that accounts for differences in video lecture formats.

Research methods: the Orchestrator-Worker pattern (called «Orchestrator-Performer» in this article) was selected, with a large language model (LLM) acting as the orchestrator. Alternative approaches, namely the decentralized peer-to-peer pattern and the hybrid pattern, are reviewed, and the choice of the orchestrator approach is justified by its consistent processing and fault tolerance. Pipeline processing of the video stream is integrated into the multi-agent system (hybrid approach).
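To make the pattern concrete, the following is a minimal sketch of an orchestrator that plans tasks and dispatches them to workers. It is an illustration under stated assumptions, not the article's implementation: the `plan` function is a fixed, rule-based stand-in for the LLM orchestrator, and the names `Task`, `WORKERS`, and `orchestrate` are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Task:
    worker: str   # name of the worker that should handle the task
    payload: str  # input handed to that worker


def plan(lecture_id: str) -> List[Task]:
    # Hypothetical stand-in for the LLM orchestrator: a fixed plan
    # instead of a model-generated one.
    return [
        Task("audio", lecture_id),
        Task("video", lecture_id),
    ]


# Hypothetical workers; real ones would wrap speech recognition and OCR models.
WORKERS: Dict[str, Callable[[str], str]] = {
    "audio": lambda lecture_id: f"transcript of {lecture_id}",
    "video": lambda lecture_id: f"slide text of {lecture_id}",
}


def orchestrate(lecture_id: str) -> Dict[str, str]:
    results: Dict[str, str] = {}
    for task in plan(lecture_id):
        try:
            results[task.worker] = WORKERS[task.worker](task.payload)
        except Exception as exc:
            # A failed worker is recorded rather than aborting the run,
            # so the orchestrator keeps control and the other tasks proceed.
            results[task.worker] = f"failed: {exc}"
    return results
```

Because every task passes through one coordinating function, the order of processing stays consistent and a failure in one worker is contained, which is the kind of fault tolerance the orchestrator approach is chosen for.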

The objects of research are video lectures of three main types, which serve as sources of multimodal data for analysis and processing. The first type, «Lecturer and Presentation», covers recordings in which the lecturer stands to the left or right of the accompanying presentation, so that the frame combines the human figure and the slides. The second type, «Presentation and Voiceover», covers recordings in which the theoretical material is shown on presentation slides and explained off-screen in the audio track. The third type, «Lecturer and Blackboard», covers recordings in which the lecturer writes the material on a traditional chalkboard or marker board, so that the information is handwritten.
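One way to represent the three lecture types in code is an enumeration mapped to a processing pipeline per type. This is a sketch only: the step names in `PIPELINES` are hypothetical placeholders, since the abstract does not list the actual steps used for each type.

```python
from enum import Enum
from typing import Dict, List


class LectureType(Enum):
    LECTURER_AND_PRESENTATION = "lecturer_and_presentation"
    PRESENTATION_AND_VOICEOVER = "presentation_and_voiceover"
    LECTURER_AND_BLACKBOARD = "lecturer_and_blackboard"


# Hypothetical step names, chosen for illustration only.
PIPELINES: Dict[LectureType, List[str]] = {
    LectureType.LECTURER_AND_PRESENTATION: [
        "transcribe_audio", "crop_slide_region", "printed_text_ocr",
    ],
    LectureType.PRESENTATION_AND_VOICEOVER: [
        "transcribe_audio", "printed_text_ocr",
    ],
    LectureType.LECTURER_AND_BLACKBOARD: [
        "transcribe_audio", "handwriting_ocr",
    ],
}


def pipeline_for(lecture_type: LectureType) -> List[str]:
    # Return a copy so callers cannot mutate the shared configuration.
    return list(PIPELINES[lecture_type])
```

For example, `pipeline_for(LectureType.LECTURER_AND_BLACKBOARD)` selects the handwriting-oriented steps, while the two slide-based types share printed-text recognition.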

Research findings: an architecture for a multi-agent system has been developed and justified. It is based on the «Orchestrator-Performer» pattern with a hybrid approach that integrates pipeline video processing into the multi-agent environment for effective task distribution and load management. Models and tools have been selected and described, namely orchestrators, audio processing models, and OCR, with lecture types taken into account to build adaptive pipelines. The operation of the agents is described, including initialization, interaction with the orchestrator, parallel audio and video processing, and aggregation of the results into a text document that can be downloaded or printed.
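The parallel audio and video processing followed by aggregation can be sketched as below. This is an assumption-laden illustration rather than the described system: `process_audio` and `process_video` are hypothetical placeholders for the audio and OCR agents, and a thread pool is just one possible way to run them concurrently.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List


def process_audio(lecture_id: str) -> List[str]:
    # Hypothetical placeholder for a speech recognition agent.
    return [f"[speech] transcript segment for {lecture_id}"]


def process_video(lecture_id: str) -> List[str]:
    # Hypothetical placeholder for an OCR agent reading slides or the board.
    return [f"[visual] recognized text for {lecture_id}"]


def build_summary(lecture_id: str) -> str:
    # Run both agents concurrently and wait for both results.
    with ThreadPoolExecutor(max_workers=2) as pool:
        audio_future = pool.submit(process_audio, lecture_id)
        video_future = pool.submit(process_video, lecture_id)
        audio_lines = audio_future.result()
        video_lines = video_future.result()
    # Aggregate in a fixed order so the document is deterministic.
    return "\n".join(
        [f"Lecture summary: {lecture_id}", *audio_lines, *video_lines]
    )
```

Calling `build_summary("lec1")` returns a single text document whose first line is a title and whose remaining lines come from the audio and video agents, in that order.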

Keywords

multimodality, multi-agent systems, Orchestrator-Performer pattern, large language model, agent interaction, data processing pipeline, multi-model approaches, machine learning, peer-to-peer decentralized architecture, video lecture processing

About the authors

Milan E. Ismagulov

Yugra State University

Author for correspondence.
Email: m_ismagulov@ugrasu.ru

Postgraduate student, Engineering School of Digital Technologies

Russian Federation, Khanty-Mansiysk
