Yugra State University Bulletin

Вестник Югорского государственного университета

ISSN 1816-9228; ISSN 2078-9114

Yugra State University

685568

10.18822/byusu20250340-47

Mathematical modeling and information technology

Математическое моделирование и информационные технологии


Multimodal neural network processing of video lectures using multi-agent systems

Мультимодальная нейросетевая обработка видеолекции посредством мультиагентных систем

Ismagulov

Milan E.

Исмагулов

Милан Ерикович

Russian Federation

Postgraduate student, Engineering School of Digital Technologies

3rd-year postgraduate student in the program «System Analysis, Control and Information Processing, Statistics», Engineering School of Digital Technologies

m_ismagulov@ugrasu.ru

Yugra State University

Югорский государственный университет

17.09.2025

Vol. 21, No. 3

P. 40–47; 23.06.2025; 26.08.2025

2025

Yugra State University

Югорский государственный университет

https://creativecommons.org/licenses/by-sa/4.0

https://vestnikugrasu.org/byusu/article/view/685568

Subject of research: multimodal processing of video lectures using multi-agent systems. The article focuses on intermediate results of the research, including an overview of the concepts of multimodality, multi-agent systems, and multi-model systems, as well as the development of approaches to processing video data from lectures.

Purpose of research: transformation of all relevant information from a video lecture into a text document to form an accompanying lecture summary. The goal is to develop an effective data processing cycle, taking into account differences in video lecture formats.

Research methods: the «Orchestrator-Performer» pattern (Orchestrator-Worker Pattern) is selected, with a large language model (LLM) acting as the orchestrator. Alternative approaches, namely the decentralized peer-to-peer pattern and the hybrid pattern, are reviewed, and the orchestrator approach is justified as the one that ensures consistent processing and fault tolerance. Pipeline processing of the video stream is integrated into the multi-agent system (hybrid approach).

The objects of research in this article are video lectures of three main types, which serve as sources of multimodal data for analysis and processing. The first type – «Lecturer and Presentation» – covers recordings in which the lecturer stands to the left or right of the accompanying presentation, so that the frame combines the human figure with the slides. The second type – «Presentation and Voiceover» – covers recordings in which the theoretical material is shown on presentation slides and explained off-screen on the audio track. The third type – «Lecturer and Blackboard» – covers recordings in which the lecturer writes the material on a conventional chalkboard or whiteboard, so that the information is handwritten.
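The three lecture types can be illustrated as a small lookup that selects a processing pipeline per type. This is a minimal sketch, not the article's implementation: the stage names (`locate_slide_region`, `printed_text_ocr`, `handwriting_ocr`, `speech_to_text`) and the function `select_pipeline` are hypothetical and chosen only to show how the type of a lecture could drive the choice of stages.

```python
from enum import Enum


class LectureType(Enum):
    """The three video lecture types named in the abstract."""
    LECTURER_AND_PRESENTATION = "Lecturer and Presentation"
    PRESENTATION_AND_VOICEOVER = "Presentation and Voiceover"
    LECTURER_AND_BLACKBOARD = "Lecturer and Blackboard"


# Hypothetical stage names (assumptions, not taken from the article).
# Every type ends with speech transcription; the visual stages differ.
PIPELINES = {
    LectureType.LECTURER_AND_PRESENTATION: [
        "locate_slide_region",  # slides share the frame with the lecturer
        "printed_text_ocr",
        "speech_to_text",
    ],
    LectureType.PRESENTATION_AND_VOICEOVER: [
        "printed_text_ocr",     # slides fill the whole frame
        "speech_to_text",
    ],
    LectureType.LECTURER_AND_BLACKBOARD: [
        "handwriting_ocr",      # board content is handwritten
        "speech_to_text",
    ],
}


def select_pipeline(lecture_type: LectureType) -> list[str]:
    """Return a copy of the stage list assumed for the given lecture type."""
    return list(PIPELINES[lecture_type])
```

For example, `select_pipeline(LectureType.LECTURER_AND_BLACKBOARD)` returns the two-stage list that starts with the assumed handwriting recognition stage.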

Research findings: an architecture for a multi-agent system has been developed and justified. It is based on the «Orchestrator-Performer» pattern with a hybrid approach that integrates pipeline video processing into a multi-agent environment for effective task distribution and load management. Models and tools have been selected and described, namely orchestrators, audio processing models, and OCR, taking the lecture types into account to build adaptive pipelines. The functioning of the agents is described, including initialization, interaction with the orchestrator, parallel audio/video processing, and aggregation of the results into a text document that can be downloaded or printed.
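The orchestrator-worker flow summarized above (parallel audio and video processing, then aggregation into one text document) can be sketched as follows. This is an illustration under stated assumptions, not the system described in the article: the orchestrator here is a plain Python function rather than an LLM, the workers `audio_worker` and `video_worker` are stubs that return placeholder strings, and the retry helper `run_with_retry` is a hypothetical stand-in for the fault tolerance the abstract mentions.

```python
from concurrent.futures import ThreadPoolExecutor


def audio_worker(lecture_id: str) -> str:
    """Stub worker: a real one would run a speech recognition model."""
    return f"[transcript of {lecture_id}]"


def video_worker(lecture_id: str) -> str:
    """Stub worker: a real one would run the OCR pipeline on frames."""
    return f"[slide text of {lecture_id}]"


def run_with_retry(worker, lecture_id: str, attempts: int = 2) -> str:
    """Call a worker, retrying on failure; report the error if all attempts fail."""
    last_error = None
    for _ in range(attempts):
        try:
            return worker(lecture_id)
        except Exception as error:  # a failing worker must not stop the others
            last_error = error
    return f"[{worker.__name__} failed: {last_error}]"


def orchestrate(lecture_id: str, workers=(audio_worker, video_worker)) -> str:
    """Dispatch all workers in parallel and aggregate their output in order."""
    with ThreadPoolExecutor(max_workers=len(workers)) as pool:
        futures = [pool.submit(run_with_retry, w, lecture_id) for w in workers]
        parts = [future.result() for future in futures]
    return "\n".join([f"Lecture summary: {lecture_id}", *parts])
```

Calling `orchestrate("demo")` produces a three-line text document: a header line followed by the two worker outputs in submission order. A worker that raises on every attempt contributes an error line instead of aborting the whole run.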

multimodality; multi-agent systems; Orchestrator-Performer pattern; large language model; agent interaction; data processing pipeline; multi-model approaches; machine learning; peer-to-peer decentralized architecture; video lecture processing

