| |
Last updated on October 12, 2025. This conference program is tentative and subject to change.
Technical Program for Monday October 6, 2025
|
Mo-KN1 |
Hall F |
Keynote 1 |
Keynote |
Chair: Eigner, György | Obuda University |
|
09:30-10:30, Paper Mo-KN1.1 | |
Keynote Talk: From Model-Based to AI-Empowered Cyber-Physical Multi-Agent Systems |
|
Shi, Peng | University of Adelaide, Adelaide |
Keywords: Agent-Based Modeling, AI and Applications, Application of Artificial Intelligence
Abstract: Cyber-physical multi-agent systems (CPMAS) integrate various autonomous systems, and even human participants, through both cyber-layer communication and physical-layer collaboration, enabling them to operate collectively in complex, open environments. Traditionally, these systems have relied on model-based control methods to ensure robust coordination and reliable performance. However, recent advances in artificial intelligence (AI) offer promising opportunities to enhance autonomy, adaptability, and scalability when faced with various constraints at both the cyber and physical layers—such as malicious cyberattacks and internal system faults. This talk illustrates the technological evolution from model-based approaches to AI-empowered frameworks for CPMAS, highlighting how data-driven techniques—such as reinforcement learning, deep neural networks, and bio-inspired AI—can address limitations in traditional design and control frameworks. We will explore the challenges of seamlessly integrating model-driven and AI-driven strategies, including stability, safety, and security. By combining advanced control methodologies with emerging AI capabilities, we present a series of experimental case studies that verify our proposed techniques with potential real-world applications in intelligent manufacturing, autonomous vehicles, and smart infrastructure. Ultimately, this talk provides insights into the future of CPMAS, setting the stage for innovative research and practical deployments.
|
|
Mo-S1-T1 |
Hall F |
Deep Learning 1 |
Regular Papers - Cybernetics |
Chair: Wen, Di | Karlsruhe Institute of Technology |
Co-Chair: Liu, Shilong | Information Research Center of Military Science, PLA Academy of Military Science, Beijing, China |
|
11:00-11:15, Paper Mo-S1-T1.1 | |
Towards Evaluation of Gradient-Based Reconstruction Attacks in Privacy-Preserving Federated Learning |
|
Phan, Quoc-Thang | National Central University |
Lu, Chun-Shien | Academia Sinica |
Wang, Jia-Ching | National Central University |
Keywords: Deep Learning, AI and Applications
Abstract: Learned Perceptual Image Patch Similarity (LPIPS) is an image similarity metric that is typically used to evaluate the effectiveness of reconstruction attacks in privacy leakage of federated learning (FL). Our study shows that such an image similarity metric is insufficient to assess the degree of privacy leakage. To tackle this issue, a new evaluation method capable of revealing preserved privacy in privacy-preserving FL (PPFL) is explored in this paper. Through extensive empirical experiments, which employ a differential privacy (DP)-based PPFL (DP-PPFL) and a state-of-the-art image reconstruction adversary (at the time of writing this paper), high and stable perceptual metric values were observed across all ranges of a privacy budget in DP. These LPIPS values suggest the privacy of images can be preserved well in terms of perceptual quality preservation, regardless of how the DP-PPFL is configured. However, there are multiple discrepancies between the classification accuracy of original and reconstructed images. This finding suggests that there is indeed a privacy leakage that the LPIPS failed to capture; thus, LPIPS should not be used as a standard for evaluating privacy leakage. Alongside this discovery, the paper also proposes a so-called "quadrant chart" to quantify the assessment of privacy leakage in PPFL.
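For readers unfamiliar with the metric under discussion, the following is a minimal sketch of how LPIPS is typically computed with the open-source lpips package; the tensors are dummy placeholders and this is not the authors' evaluation pipeline.

```python
# Illustrative only: computing LPIPS between an original image and a reconstruction.
import torch
import lpips  # pip install lpips

loss_fn = lpips.LPIPS(net="alex")  # AlexNet backbone is a common default

# Dummy stand-ins for an original image and its gradient-based reconstruction,
# shaped (N, 3, H, W) and scaled to [-1, 1] as LPIPS expects.
original = torch.rand(1, 3, 64, 64) * 2 - 1
reconstructed = torch.rand(1, 3, 64, 64) * 2 - 1

distance = loss_fn(original, reconstructed)  # lower = perceptually more similar
print(f"LPIPS distance: {distance.item():.4f}")
```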
|
|
11:15-11:30, Paper Mo-S1-T1.2 | |
PRAD-Net: Periodic Reorganization Anomaly Detection Network Using Anomaly Injection Strategy |
|
Lu, Han | Tongji University |
Li, Xiaojun | Tongji University |
Ning, Zhiheng | Tongji University |
Keywords: Deep Learning, AI and Applications
Abstract: Time-series anomaly detection is essential for ensuring the reliability of large-scale data stream applications. However, existing deep learning approaches primarily rely on reconstruction-based methods, making them highly sensitive to the quality of training data. To mitigate this limitation, we introduce an anomaly injection strategy to enhance model robustness under unsupervised settings. Additionally, most deep learning models are inherently constrained by their structural limitations, leading to weaknesses in capturing specific temporal characteristics. To address this, we propose PRAD-Net (Periodic Reorganization Anomaly Detection Network), which incorporates a Periodic Reorganization (PR) module to decompose time series into long-period and short-period features. These features are then processed separately by two specialized modules: the Long-Period Temporal Attention (LPTA) Module, which enhances long-term dependency modeling, and the Multi-Scale Focal (MSF) Module, which focuses on local temporal variations. By strategically assigning periodic features to appropriate modules, PRAD-Net effectively exploits their complementary strengths while mitigating their respective weaknesses. Extensive experiments on five real-world datasets demonstrate that PRAD-Net significantly outperforms state-of-the-art baselines, highlighting its strong capability for unsupervised anomaly detection.
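As a rough illustration of what an anomaly injection strategy can look like (the abstract does not specify PRAD-Net's exact strategy), the sketch below corrupts a clean series with spikes and short level shifts and records the injected labels for unsupervised training.

```python
# Hedged sketch of anomaly injection for unsupervised anomaly detection training.
import numpy as np

def inject_anomalies(series: np.ndarray, n_anomalies: int = 5, seed: int = 0):
    """Return a corrupted copy of `series` and a binary mask of injected points."""
    rng = np.random.default_rng(seed)
    corrupted = series.copy()
    labels = np.zeros_like(series, dtype=int)
    idx = rng.choice(len(series), size=n_anomalies, replace=False)
    scale = series.std()
    for i in idx:
        if rng.random() < 0.5:                       # point spike
            corrupted[i] += rng.choice([-1, 1]) * rng.uniform(3, 6) * scale
        else:                                        # short level shift
            end = min(i + 10, len(series))
            corrupted[i:end] += rng.uniform(2, 4) * scale
            labels[i:end] = 1
        labels[i] = 1
    return corrupted, labels

t = np.linspace(0, 8 * np.pi, 1000)
clean = np.sin(t) + 0.05 * np.random.default_rng(1).standard_normal(t.shape)
noisy, mask = inject_anomalies(clean)
```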
|
|
11:30-11:45, Paper Mo-S1-T1.3 | |
CLUG: Contrastive Learning Unified Retrieval with Graph-Ranked Demonstrations for Enhanced In-Context Learning |
|
Xu, Xiantao | Information Research Center of Military Science, PLA Academy Of |
Hu, Minghao | Information Research Center of Military Science, PLA Academy Of |
Liu, Shilong | Information Research Center of Military Science, PLA Academy Of |
Luo, Wei | Information Research Center of Military Science, PLA Academy Of |
Keywords: Deep Learning, AI and Applications, Application of Artificial Intelligence
Abstract: Large language models (LLMs) have demonstrated the ability to perform in-context learning (ICL) with only a few demonstrations, achieving remarkable performance across various downstream tasks. The selection of demonstrations plays a critical role in shaping the performance of ICL due to its high sensitivity. However, previous researchers have primarily focused on either the tricks for selecting demonstrations or the sequencing of demonstrations, thereby neglecting the critical role of retrieval models in ICL. The rigid structures and parameters employed in these studies often fail to align with the specific requirements of downstream tasks. To address this problem, we propose CLUG: contrastive learning unified retrieval with graph-ranked demonstrations for enhanced in-context learning, integrating demonstration retrieval selection and demonstration ordering to establish semantically coherent sequences of demonstrations, thereby ensuring enhanced semantic alignment and consistency. Experimental results across multiple datasets demonstrate consistent improvements with our method. Additional analyses further validate and explain the effectiveness and generalizability of our approach.
|
|
11:45-12:00, Paper Mo-S1-T1.4 | |
ACET: An Adaptive Component Extraction and Tokenization Framework for Time Series Forecasting |
|
Cai, Jiexuan | Tongji University |
Yin, Tianyi | Tongji University |
Wang, Jingwei | Tongji University |
Wang, Chenze | Tongji University |
Zhao, Yukai | Tongji University |
Ma, Yunlong | Tongji University |
Liu, Min | Tongji University |
Keywords: Deep Learning, AI and Applications, Expert and Knowledge-Based Systems
Abstract: Language models have been proven to handle time series data after tokenization and show generalization performance on unseen forecasting tasks. However, existing techniques for tokenizing time series data struggle to eliminate redundant information and noise, which can lead to signal aliasing and cumulative quantization errors, making it difficult to further improve prediction performance. In this paper, we propose an Adaptive Component Extraction and Tokenization (ACET) framework, which includes two key novelties to address these challenges: the Dynamic Component Extraction Module (DCEM) and the Time Series Tokenization Module (TSTM). The DCEM dynamically isolates the principal components from the interference in the original signal, eliminating the need for manual parameter tuning. This not only enhances the accuracy of signal tokenization but also mitigates the adverse effects of high-frequency noise. Then, the TSTM tokenizes continuous time series data while preserving long-term trend features, ensuring that critical information is retained for subsequent forecasting. Extensive cross-domain experiments on various real-world datasets demonstrate that, in zero-shot forecasting scenarios, ACET achieves improvements of 19.27% in WQL and 6.63% in MASE, compared to baseline methods.
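The sketch below illustrates one common way to tokenize a continuous series into a discrete vocabulary via quantile binning; it is a simplified stand-in for ACET's DCEM/TSTM modules, not the authors' code, and the vocabulary size is illustrative.

```python
# Quantile-binning tokenization of a time series (simplified illustration).
import numpy as np

def tokenize(series: np.ndarray, vocab_size: int = 256):
    """Map real values to integer tokens using quantile bin edges."""
    edges = np.quantile(series, np.linspace(0, 1, vocab_size + 1)[1:-1])
    tokens = np.digitize(series, edges)          # integers in [0, vocab_size - 1]
    return tokens, edges

def detokenize(tokens: np.ndarray, edges: np.ndarray):
    """Approximate inverse: map each token back to a representative bin value."""
    centres = np.concatenate([[edges[0]], (edges[:-1] + edges[1:]) / 2, [edges[-1]]])
    return centres[tokens]

signal = np.cumsum(np.random.default_rng(0).standard_normal(500))
toks, edges = tokenize(signal)
approx = detokenize(toks, edges)
print("max quantization error:", np.max(np.abs(signal - approx)))
```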
|
|
12:00-12:15, Paper Mo-S1-T1.5 | |
Exploring Video-Based Driver Activity Recognition under Noisy Labels |
|
Fan, Linjuan | Karlsruhe Institute of Technology |
Wen, Di | Karlsruhe Institute of Technology |
Peng, Kunyu | Karlsruhe Institute of Technology, IAR |
Yang, Kailun | Hunan University |
Zhang, Jiaming | Hunan University |
Liu, Ruiping | Karlsruhe Institute of Technology |
Chen, Yufan | Karlsruhe Institute of Technology |
Zheng, Junwei | Karlsruhe Institute of Technology |
Wu, Jiamin | The Chinese University of Hong Kong |
Han, Xudong | University of Sussex |
Stiefelhagen, Rainer | Karlsruher Institut für Technologie |
Keywords: Deep Learning, AI and Applications, Machine Vision
Abstract: As an open research topic in the field of deep learning, learning with noisy labels has attracted much attention and grown rapidly over the past ten years. Learning with label noise is crucial for driver distraction behavior recognition as real-world video data often contains mislabeled samples, impacting model reliability and performance. However, label noise learning is barely explored in the driver activity recognition field. In this paper, we propose the first label noise learning approach for the driver activity recognition task. Based on the cluster assumption, we initially enable the model to learn clustering-friendly low-dimensional representations from given videos and assign the resultant embeddings into clusters. We subsequently perform co-refinement within each cluster to smooth the classifier outputs. Furthermore, we propose a flexible sample selection strategy that combines two selection criteria without relying on any hyper-parameters to filter clean samples from the training dataset. We also incorporate a self-adaptive parameter into the sample selection process to enforce balancing across classes. A comprehensive variety of experiments on the public Drive&Act dataset for all granularity levels demonstrates the superior performance of our method in comparison with other label-denoising methods derived from the image classification field. The source code is available at https://github.com/ilonafan/DAR-noisy-labels.
|
|
12:15-12:30, Paper Mo-S1-T1.6 | |
LieGNN: A Geometry-Aware Framework for Skeleton-Based Action Recognition Via Lie Group Trajectories |
|
Jiang, Nan | Soochow University |
Liu, Li | Soochow University |
Keywords: Image Processing and Pattern Recognition, Machine Learning, Deep Learning
Abstract: Human Action Recognition is a fundamental task in computer vision, yet remains challenging due to factors such as background clutter, occlusion, illumination changes, and viewpoint variation. Existing skeleton-based methods often rely on joint coordinates or depth data, while neglecting the geometric relationships between body parts. In this paper, we propose LieGNN, a geometry-aware framework that integrates Lie group representations with spatio-temporal graph-based learning for robust action recognition. The framework comprises three main modules: a high-precision 3D pose estimation module that combines multi-resolution feature fusion with camera-aware depth modeling; a Lie group trajectory modeling module, which encodes pose sequences as curves on the Lie manifold and computes alignment errors via class-specific standard curves; and a graph-based classification module that incorporates both pose features and alignment errors to enhance discriminative learning. Extensive experiments on three public datasets demonstrate that LieGNN outperforms several methods and exhibits strong generalization across diverse action categories. These results highlight the effectiveness of incorporating geometric priors into graph-based motion modeling.
|
|
12:30-12:45, Paper Mo-S1-T1.7 | |
MM-MQA: Multi-Modal Learning for No-Reference 3D Colored Mesh Models Quality Assessment |
|
Hao, Jie | Beijing University of Chemical Technology |
Zheng, Guoquan | Beijing University of Chemical Technology |
Zhang, Jianbo | Shanghai Jiao Tong University |
Zhang, Dong | Beijing University of Chemical Technology |
Yuan, Liang | Shanghai Jiao Tong University |
Zhai, Guangtao | Shanghai Jiao Tong University |
Keywords: Machine Vision, Deep Learning
Abstract: The proliferation of 3D colored mesh models has led to growing scholarly interest in assessing their visual quality. Existing 3D mesh quality assessment methods primarily rely on single-modal features derived from either 2D projections or the three-dimensional structure of the model. 2D projections contain rich semantic and texture information, but they cannot adequately capture the quality loss caused by structural distortion. Conversely, the 3D structure of the model can accurately reflect geometric distortions but does not fully utilize color information, making it difficult to comprehensively represent how the human visual system perceives complex distortions. Therefore, we propose a multi-modal learning method for no-reference quality assessment of 3D colored mesh models (MM-MQA). First, we obtain projection images of the 3D colored mesh model from different viewpoints and split the model into equally sized patches. We then employ MeshNet and ResNet to encode the structural features of the model patches and the texture features of the projection images. Finally, double cross-attention is employed to achieve multi-modal fusion and perceive the overall quality of the mesh model. Experiments demonstrate that our method outperforms existing approaches on both the CMDM and TMQ datasets, validating the effectiveness of multi-modal learning in mesh quality assessment tasks.
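A hedged sketch of double cross-attention fusion between structural and texture features is given below; the layer dimensions, pooling, and regression head are illustrative assumptions, not the MM-MQA architecture.

```python
# Sketch: cross-attention in both directions, then a pooled quality regression head.
import torch
import torch.nn as nn

class DoubleCrossAttentionFusion(nn.Module):
    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.tex_to_struct = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.struct_to_tex = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.head = nn.Linear(2 * dim, 1)  # regress a single quality score

    def forward(self, struct_feats, tex_feats):
        # struct_feats: (B, Ns, dim) from mesh patches; tex_feats: (B, Nt, dim) from projections
        s, _ = self.tex_to_struct(struct_feats, tex_feats, tex_feats)
        t, _ = self.struct_to_tex(tex_feats, struct_feats, struct_feats)
        fused = torch.cat([s.mean(dim=1), t.mean(dim=1)], dim=-1)
        return self.head(fused)

model = DoubleCrossAttentionFusion()
score = model(torch.randn(2, 32, 256), torch.randn(2, 49, 256))
```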
|
|
12:45-13:00, Paper Mo-S1-T1.8 | |
KiGRU for Long-Term Orbital Prediction with Kolmogorov-Arnold Networks |
|
Chu, Qinghao | Dalian University of Technology |
Wang, Zhelong | Dalian University of Technology |
Jiang, Yu | State Key Laboratory for Space-System Operation and Control |
Hou, Pengrong | Dalian University of Technology |
Nie, Ruicheng | Dalian University of Technology |
Shi, Xin | Dalian University of Technology |
Lin, Fang | Dalian University of Technology |
Kang, Yuntong | Dalian University of Technology |
Guo, Luchang | Dalian University of Technology |
Keywords: Application of Artificial Intelligence, Deep Learning, AI and Applications
Abstract: In the context of an increasingly complex and congested space debris environment, the development of high-precision long-term orbit prediction models has become a core technology for space situational awareness and space traffic management. Traditional orbit prediction methods often struggle to achieve an ideal balance between accuracy and computational efficiency. While deep learning-based orbit prediction approaches have demonstrated promising potential, they commonly face challenges related to high model complexity. This paper analyzes the long-term orbital prediction problem from the perspective of time series forecasting. Specifically, a dataset framework is first constructed, based on real high-precision orbital data, which incorporates temporal periodic features. Then, a novel model combining the Kolmogorov-Arnold Network (KAN) structure with a Gated Recurrent Unit (GRU) is proposed, named KiGRU, with the goal of enhancing the performance of orbital prediction. Experimental results show that the proposed KiGRU model outperforms existing methods in terms of prediction accuracy and achieves a good balance between model complexity and performance, making it more suitable for practical applications. Furthermore, this study demonstrates the effectiveness of integrating the KAN structure into traditional deep learning models to enhance their performance.
|
|
Mo-S1-T2 |
Hall N |
Application of Artificial Intelligence 1 |
Regular Papers - Cybernetics |
Chair: Yu, Lei | Inner Mongolia University |
Co-Chair: Ellinas, Georgios | University of Cyprus |
|
11:00-11:15, Paper Mo-S1-T2.1 | |
Model-MRFaG: A Test Code Generation Framework Based on Fine-Tuned LLMs |
|
Zhou, Xiang | Inner Mongolia University |
Yu, Lei | Inner Mongolia University |
Liu, Junhua | Inner Mongolia University |
Yang, Conghui | Inner Mongolia University |
Wlx, Wlx | Inner Mongolia University |
Keywords: Application of Artificial Intelligence
Abstract: Software testing is a critical component in software development that is closely related to software quality. Traditional test generation methods face challenges such as producing test cases that are difficult to read and maintain synchronously. Meanwhile, with the advancement of large language models (LLMs) in code generation, the quality of LLM-generated code is increasingly comparable to human-written code. Therefore, this paper proposes a test code generation framework using Model-driven Multiple Results Filtering and Multi-Round Generation strategy (Model-MRFaG). To better adapt to test generation tasks for mainstream programming languages, we built an Alpaca-format Test-Code DataSet for Finetuning Baseline Models (TCDSF) containing six programming languages: Python, Java, JavaScript, C++, C#, and Go, and used this dataset to fine-tune a Baseline Model to obtain the TestCoder model. Subsequently, we developed the Model-MRFaG framework based on the TestCoder model to further improve the accuracy of test code generation. Through comparative experiments evaluating both general test sets and test code generation capabilities, TestCoder outperforms the Baseline Model in both general test sets and test code generation accuracy. Furthermore, the Model-MRFaG framework can further improve the accuracy of test code generation, providing a new solution approach for the intelligent development of software testing.
|
|
11:15-11:30, Paper Mo-S1-T2.2 | |
The Impact of AI Identity on Consumer Behavior in the Use of AI-Enabled Recommendation Systems |
|
Peng, Bo | School of Management, Northwestern Polytechnical University |
Xu, Yan | Northwestern Polytechnical University |
Deng, Hepu | RMIT University |
Keywords: Application of Artificial Intelligence, AI and Applications
Abstract: Artificial intelligence (AI)-enabled recommendation systems are increasingly used in e-commerce to promote specific products and services. How such systems influence consumer behavior from the perspective of consumer self-identity, however, remains unclear. This study explores the impact of AI identity on consumer adoption behavior of AI-enabled recommendation systems, drawing on AI identity theory and behavioral reasoning theory. The relevant literature is reviewed, leading to a conceptual model for understanding adoption in this context. Such a model can then be tested and validated using empirical data. This study contributes to consumer behavior research by providing a conceptual model for better exploring the influence of AI identity on consumer behavior when using AI-enabled recommendation systems in e-commerce.
|
|
11:30-11:45, Paper Mo-S1-T2.3 | |
KGSS: Knowledge-Guided Sample Selection for Prompt Generation in Large Language Model |
|
Chen, Yan | College of Information Science and Technology, Beijing Universit |
Yang, Guang | College of Information Science and Technology, Beijing Universit |
Zhu, Yutao | Gaoling School of Artificial Intelligence, Renmin University Of |
Dou, Zhicheng | Gaoling School of Artificial Intelligence, Renmin University Of |
Wu, Lifang | College of Information Science and Technology, Beijing Universit |
Keywords: Application of Artificial Intelligence, AI and Applications
Abstract: In this paper, we propose a Knowledge-Guided Sample Selection for prompt generation (KGSS) method to reduce model hallucination by focusing on the logical relevance of knowledge key points. We argue that task-specific context should be derived from knowledge key points rather than relying solely on semantic similarity, particularly in complex scenarios such as the Gaokao (Chinese college entrance examination). The KGSS system first constructs a knowledge graph of the knowledge key points within a subject. It then adopts a two-stage sample selection strategy that integrates knowledge key point filtering and knowledge-level feature extraction. Initially, the system filters out irrelevant knowledge key points according to their relevance to the input, ensuring that candidate samples are contextually appropriate. Then, a multi-level sample selection process is conducted to balance the similarity and diversity of knowledge key points. Experimental results on the Gaokao-Bench dataset, derived from Chinese college entrance examination questions, demonstrate that KGSS significantly outperforms existing methods in improving the accuracy of various models.
|
|
11:45-12:00, Paper Mo-S1-T2.4 | |
HiIntent: A Collaborative Hierarchical Framework for Zero-Shot Intent Detection |
|
Wang, Shuo | University of Chinese Academy of Sciences; Computer Network Info |
Wei, Zeyu | University of Chinese Academy of Sciences; Computer Network Info |
Chang, Wenjing | Computer Network Information Center, CAS, Beijing, China; Univer |
Shi, Guangjun | Computer Network Information Center (CNIC) of the Chinese Academ |
Yu, Jianjun | Computer Network Information Center, Chinese Academy of Sciences, |
Keywords: Application of Artificial Intelligence, Machine Learning, Representation Learning
Abstract: In recent years, single-label intent recognition has faced significant challenges in handling short and semantically sparse user inputs. Traditional methods often treat intent labels as flat categories, neglecting their inherent hierarchical relationships and limiting model performance. To address these issues, we propose HiIntent, a novel zero-shot intent detection framework that integrates hierarchical semantic modeling with a collaborative generation-discriminative mechanism. HiIntent first constructs a structured label hierarchy through a two-stage process: a large language model (LLM) generates semantic abstractions of intent labels, which are then evaluated and refined by a discriminative module to ensure coherence and correctness. This is followed by a similarity-driven convergence strategy that enhances intra-class consistency and inter-class separability using multi-metric similarity calculations. Finally, a contrastive prompt construction method leverages the learned label hierarchy to generate enriched semantic descriptions for each intent, improving representation learning and facilitating accurate classification even in zero-shot scenarios. Extensive experiments on both general-purpose (CLINC-150) and domain-specific (RFMR) datasets demonstrate that HiIntent consistently outperforms existing approaches across multiple evaluation metrics. Ablation studies and hyperparameter analyses further validate the effectiveness of each component in the proposed framework.
|
|
12:00-12:15, Paper Mo-S1-T2.5 | |
UAV State Estimation and Trajectory Prediction Using Transformer-Based Neural Networks and Feature-Based Visual Odometry |
|
Grigoriou, Yiannis | University of Cyprus, KIOS Research and Innovation Center of Exc |
Souli, Nicolas | University of Cyprus, KIOS Research and Innovation Center of Exc |
Chrysanthou, Panagiotis | University of Cyprus, KIOS Research and Innovation Center of Exc |
Kolios, Panayiotis | University of Cyprus |
Ellinas, Georgios | University of Cyprus |
Keywords: Application of Artificial Intelligence, AI and Applications, Deep Learning
Abstract: Unmanned aerial vehicles (UAVs) are increasingly relied upon in a variety of critical applications, including infrastructure inspection, search-and-rescue, and traffic monitoring. While modern UAVs are typically equipped with global positioning system (GPS) and inertial measurement unit (IMU) modules, and often include safeguards against adverse environmental conditions, they remain susceptible to sensor malfunctions and signal disruptions. These challenges have led to the need for robust, GPS-free solutions capable of maintaining accurate trajectory prediction and state identification. This work proposes a real-time multi-task learning framework for UAVs that employs Transformer-based neural networks to perform simultaneous trajectory prediction and state identification. The proposed system enables GPS-free UAV operations by employing a feature-based visual odometry algorithm whose estimates are fused with telemetry data to achieve accurate localization. The proposed system is implemented (hardware and software modules) in a functional prototype and validated through extensive outdoor experiments using a custom-built dataset, demonstrating strong performance and improved prediction accuracy in GPS-denied environments. The results demonstrate the framework’s ability to enhance the autonomy and reliability of UAV systems in challenging operational scenarios.
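For context, a minimal feature-based visual odometry step with OpenCV (ORB matching, essential matrix, relative pose) is sketched below; the camera matrix and parameters are assumptions and this is not the authors' implementation.

```python
# Illustrative two-frame visual odometry step (rotation and up-to-scale translation).
import cv2
import numpy as np

def relative_pose(frame_prev, frame_curr, K):
    """Estimate the relative rotation/translation between two grayscale frames."""
    orb = cv2.ORB_create(2000)
    kp1, des1 = orb.detectAndCompute(frame_prev, None)
    kp2, des2 = orb.detectAndCompute(frame_curr, None)
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = sorted(matcher.match(des1, des2), key=lambda m: m.distance)[:500]
    pts1 = np.float32([kp1[m.queryIdx].pt for m in matches])
    pts2 = np.float32([kp2[m.trainIdx].pt for m in matches])
    E, mask = cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC, threshold=1.0)
    _, R, t, _ = cv2.recoverPose(E, pts1, pts2, K, mask=mask)
    return R, t  # translation is recovered only up to scale

# Example intrinsics for a 640x480 camera (illustrative focal length):
# K = np.array([[500.0, 0, 320.0], [0, 500.0, 240.0], [0, 0, 1.0]])
```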
|
|
12:15-12:30, Paper Mo-S1-T2.6 | |
Selective-SAM: Memory Optimization for Segment Anything Model 2 with Application in Self-Checkout Product Counting |
|
Zhongling, Liu | Fujitsu |
Liu, Liu | FRDC |
Shi, Ziqiang | Fujitsu R&D Center, Co. Ltd |
Liu, Rujie | Fujitsu Research & Development Center |
Takahashi, Jun | Fujitsu Limited |
Jiang, Shan | Fujitsu Research, FUJITSU LIMITED |
Keywords: Application of Artificial Intelligence, AI and Applications, Deep Learning
Abstract: The Segment Anything Model 2 (SAM 2) demonstrates strong capabilities for video object segmentation (VOS). We present a SAM 2-based framework for self-checkout product counting, where box prompts are generated by a detector selecting optimal tracking initiation frames. Our key contribution is Selective-SAM, a training-free enhancement that improves robustness in complex scenes via a selective memory bank. This mechanism selectively preserves high-quality and diverse features while filtering poor segmentation priors. For final counting, we introduce a novel Mask Overlap Degree metric to analyze object trajectories. By segmenting trajectories based on the mask overlap degrees, we accurately determine the product count. Experiments across various retail scenarios show improvements of 1.95% in IDF1 and 3.86% in MOTA.
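One plausible reading of a mask overlap degree between consecutive-frame masks is sketched below (intersection over the smaller mask, with trajectory splitting at low-overlap frames); the paper's exact definition may differ.

```python
# Sketch: overlap between binary masks and a simple trajectory-splitting counter.
import numpy as np

def mask_overlap_degree(mask_a: np.ndarray, mask_b: np.ndarray) -> float:
    a, b = mask_a.astype(bool), mask_b.astype(bool)
    inter = np.logical_and(a, b).sum()
    denom = min(a.sum(), b.sum())
    return float(inter / denom) if denom > 0 else 0.0

def count_products(trajectory_masks, threshold=0.5):
    """Split a trajectory into segments whenever overlap drops below the threshold."""
    segments = 1
    for prev, curr in zip(trajectory_masks, trajectory_masks[1:]):
        if mask_overlap_degree(prev, curr) < threshold:
            segments += 1
    return segments
```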
|
|
12:30-12:45, Paper Mo-S1-T2.7 | |
SMoERec: Split-Band Mixture of Experts with Implicit Distillation for Sequential Recommendation |
|
Zhou, Zihao | East China Normal University |
Zhang, Junqi | East China Normal University |
Ruan, Qionglu | East China Normal University |
Chen, Wenjie | East China Normal University |
Keywords: Big Data Computing,, Knowledge Acquisition, Deep Learning
Abstract: Transformer-based sequential attention models have achieved excellent results in capturing dynamic changes in user interest. However, many studies have shown that these models cause the semantic representations of item embeddings to converge and become nearly identical, which loses differentiated information and ultimately prevents further improvement of model accuracy. To tackle this, we propose a novel method called Split-Band Mixture of Experts with Implicit Distillation for Sequential Recommendation (SMoERec). The model fuses features from both the time and frequency domains for sequential recommendation. Specifically, we use the self-attention mechanism to extract time-domain information and a mixture of experts (MoE) to refine features in different frequency bands, with each expert focusing on a specific band to effectively improve the quality of frequency-domain features. Finally, the features from the two domains are adaptively fused using implicit distillation. We validate the effectiveness of our approach against baseline methods on five real-world datasets.
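The band-splitting idea can be illustrated with a short FFT sketch in which each expert would receive one band-limited reconstruction; the band edges and tensor shapes are illustrative, not SMoERec's configuration.

```python
# Sketch: split a sequence into frequency bands so different experts can handle each band.
import torch

def split_bands(x: torch.Tensor, n_bands: int = 4):
    """x: (batch, seq_len, dim). Returns band-limited reconstructions, one per expert."""
    spec = torch.fft.rfft(x, dim=1)                       # (batch, n_freqs, dim)
    n_freqs = spec.shape[1]
    edges = [round(i * n_freqs / n_bands) for i in range(n_bands + 1)]
    bands = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        masked = torch.zeros_like(spec)
        masked[:, lo:hi, :] = spec[:, lo:hi, :]
        bands.append(torch.fft.irfft(masked, n=x.shape[1], dim=1))
    return bands

bands = split_bands(torch.randn(8, 64, 32))   # each of the 4 tensors is (8, 64, 32)
```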
|
|
12:45-13:00, Paper Mo-S1-T2.8 | |
Drift-Aware Machine Learning for Operational State Classification in Biogas Dry Reforming |
|
Schreiner, Marcos A. | Federal University of Paraná |
Escribano, Renan A. N. G. | Federal University of Paraná |
Gomes, Heitor | Victoria University of Wellington |
Lisboa de Almeida, Paulo Ricardo | Universidade Federal Do Paraná |
Oliveira, Luiz S. | UFPR |
Keywords: Application of Artificial Intelligence, AI and Applications, Machine Learning
Abstract: This paper presents a machine learning-based approach for classifying the operational states of a biogas Dry Reforming (DR) reactor, focusing on catalyst activation, reaction, and irregularity detection. A key challenge in DR processes is the formation of coke, which can lead to reactor clogging. To address this, we propose incorporating the number of virtual drifts, i.e., changes in the input data distribution, as an additional feature to enhance model performance. Different drift detection algorithms and classifiers were evaluated on a dataset comprising nine distinct DR reactions. Experimental results demonstrate that integrating virtual drift counts improves the average accuracy from 84.74% to 88.01% and the average F1 score from 81.25% to 85.39% across all models, with RF achieving the highest performance (accuracy from 88.40% to 92.35% and F1 score from 85.95% to 91.59%). Our results highlight the potential of drift-aware features for real-time monitoring and fault detection in DR systems, offering a scalable solution to optimize reactor operations.
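A minimal sketch of counting virtual drifts with an ADWIN detector and appending the running count as an extra feature is shown below, assuming the river library (whose API may vary across versions); the monitored column and detector choice are illustrative, not the paper's configuration.

```python
# Sketch: append a cumulative drift count, detected on one input column, as a feature.
import numpy as np
from river import drift

def add_drift_count_feature(X: np.ndarray, monitored_col: int = 0):
    """Return X with an extra column holding the running number of detected drifts."""
    detector = drift.ADWIN()
    counts, n_drifts = [], 0
    for value in X[:, monitored_col]:
        detector.update(float(value))
        if detector.drift_detected:
            n_drifts += 1
        counts.append(n_drifts)
    return np.column_stack([X, counts])
```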
|
|
Mo-S1-T3 |
Room 0.11 |
Distributed Intelligent Systems |
Regular Papers - SSE |
Chair: An, Yisheng | Chang'an University |
Co-Chair: Pröstl Andrén, Filip | AIT Austrian Institute of Technology |
|
11:00-11:15, Paper Mo-S1-T3.1 | |
Secure Distributed Matrix Multiplication Outsourcing Computation Scheme in Unbalanced Edge Computing |
|
Sun, Xinrong | Shandong University |
Kong, Fanyu | Shandong University |
Tao, Yunting | Binzhou Polytechnic |
Keywords: Distributed Intelligent Systems
Abstract: In the Internet of Things (IoT) scenarios, edge computing assists in completing machine learning on resource-constrained terminal devices. As one of the most significant operations, large-scale matrix multiplication remains a huge efficiency bottleneck. Existing distributed computation approaches typically decompose the matrix computation into subtasks of the same scale, overlooking edge computing environments with unbalanced computing resources. In this paper, we propose a secure distributed matrix multiplication outsourcing scheme in edge computing with unbalanced resources. Specifically, the large-scale matrix multiplication is decomposed into several subtasks of varying scales according to unbalanced edge computing resources, which achieves better distributed computational performance. A novel matrix blinding method is presented by employing perturbation matrices and permutation matrices to guarantee input and output privacy. Experimental results demonstrate that our scheme improves the efficiency by 51.13% to 98.79% compared to traditional matrix multiplication without outsourcing. Additionally, our scheme outperforms state-of-the-art outsourcing schemes, with an average improvement of 9.36% in edge computing with balanced resources and 14.83% in unbalanced environments.
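The unbalanced decomposition can be illustrated by splitting the rows of A proportionally to worker capacities, as sketched below; the privacy-preserving blinding with perturbation and permutation matrices described in the paper is omitted here.

```python
# Sketch: capacity-proportional row partitioning of A for distributed A @ B.
import numpy as np

def split_by_capacity(A: np.ndarray, capacities):
    """Partition A's rows so each worker's share is proportional to its capacity."""
    total = sum(capacities)
    counts = [round(A.shape[0] * c / total) for c in capacities]
    counts[-1] = A.shape[0] - sum(counts[:-1])        # absorb rounding error
    return np.split(A, np.cumsum(counts)[:-1], axis=0)

A = np.random.rand(600, 400)
B = np.random.rand(400, 300)
chunks = split_by_capacity(A, capacities=[1.0, 2.5, 4.0])   # unbalanced workers
partials = [chunk @ B for chunk in chunks]                  # done remotely in practice
assert np.allclose(np.vstack(partials), A @ B)
```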
|
|
11:15-11:30, Paper Mo-S1-T3.2 | |
FedMP: Federated Learning with Manifold Mixup of Prototype |
|
Chen, Ze | Beijing University of Posts and Telecommunications |
Tian, Hui | Beijing University of Posts and Telecommunications |
Sun, Haofeng | Beijing University of Posts and Telecommunications |
Li, Lihua | Beijing University of Posts and Telecommunications |
Keywords: Distributed Intelligent Systems
Abstract: The non-independent and identically distributed (non-IID) data distribution in federated learning (FL) poses significant challenges to the training performance of FL models. To tackle this problem, we propose a novel framework called Federated Learning with Manifold Mixup of Prototype (FedMP), which takes advantage of both prototype learning and manifold mixup. Prototype learning not only ensures the effectiveness of the training results but also, in conjunction with manifold mixup, further enhances the generalization ability of the FL model. We also design a mixup matrix to regulate the training process and propose a strategy to redistribute the matrix weights. Additionally, to enhance the reliability of prototypes in the latent space, we introduce a conditional invertible generative network (cINN) to reconstruct the obtained prototypes. These reconstructed prototypes, in turn, participate in the training process to improve model performance. We conduct experiments on two real-world datasets to validate the proposed FedMP framework. Experimental results show that FedMP outperforms other baselines in both training performance and communication costs.
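A minimal sketch of mixup applied to prototype embeddings, following the generic manifold-mixup recipe rather than FedMP's exact formulation (its mixup matrix and cINN are not shown), is given below.

```python
# Sketch: Beta-sampled interpolation of class prototypes in a latent space.
import torch

def mixup_prototypes(protos: torch.Tensor, labels: torch.Tensor, alpha: float = 0.4):
    """protos: (N, d) prototype embeddings; labels: (N,) integer class ids."""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(protos.size(0))
    mixed = lam * protos + (1 - lam) * protos[perm]
    return mixed, labels, labels[perm], lam   # train with lam-weighted losses

mixed, y_a, y_b, lam = mixup_prototypes(torch.randn(10, 128), torch.arange(10))
```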
|
|
11:30-11:45, Paper Mo-S1-T3.3 | |
Priority-Aware DNN Offloading Via Queuing Latency Estimation in Multi-User Edge-Device System |
|
Tao, Guifeng | Nanjing University of Aeronautics and Astronautics |
Li, Xin | Nanjing University of Aeronautics and Astronautics |
Qin, Xiaolin | Nanjing University of Aeronautics and Astronautics |
Keywords: Distributed Intelligent Systems
Abstract: In multi-user edge intelligence systems, achieving efficient and stable DNN inference under heterogeneous task priorities is a critical challenge. This paper presents a priority-aware edge-device collaborative inference scheme that models the entire inference workflow while explicitly incorporating user-level priority constraints. The optimization objective is twofold: to maximize, in priority order, the number of users with stable local queues under the strict constraint of server queue stability, and to minimize the total end-to-end latency across all users. A key component is the Server Queuing Latency Estimation (SQLE) algorithm, which decomposes the latency contributions of user priority interactions and iteratively estimates task queuing times. Compared with classical models such as M/D/1, SQLE achieves significantly higher accuracy under dynamic workloads. Based on the predicted latency, we further develop a two-stage offloading decision algorithm: Maximum User Prioritized Selection for Local Queue Stability (MUPS) determines the maximal subset of users whose local queues can be stabilized, and Fine-grained Offloading Point for Latency Optimization (FOPL) refines offloading points to minimize global latency. Experiments on a heterogeneous edge-device system show that SQLE consistently achieves over 90% estimation accuracy with low error variance across varying system scales, significantly outperforming classical queuing models. Under different loads, MUPS and FOPL support more stable users and reduce average end-to-end latency, demonstrating their robustness and superiority over state-of-the-art methods.
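For reference, the classical M/D/1 mean waiting time that SQLE is compared against follows the Pollaczek-Khinchine formula with deterministic service, W_q = rho / (2 * mu * (1 - rho)); a short implementation with illustrative rates is sketched below.

```python
# Classical M/D/1 baseline: mean queuing delay for Poisson arrivals, deterministic service.
def md1_mean_wait(arrival_rate: float, service_rate: float) -> float:
    rho = arrival_rate / service_rate          # server utilization
    if rho >= 1.0:
        raise ValueError("Queue is unstable when rho >= 1")
    return rho / (2.0 * service_rate * (1.0 - rho))

print(md1_mean_wait(arrival_rate=8.0, service_rate=10.0))  # 0.2 time units at rho = 0.8
```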
|
|
11:45-12:00, Paper Mo-S1-T3.4 | |
AGS-MADDPG: Multi-Agent Path Finding with Partially Observations Based on Attention Mechanism and Reinforcement Learning |
|
Gong, Jianguo | Chang'an University |
An, Yisheng | Chang'an University |
Li, Peng | Chang'an University |
Zhang, Tao | Chang'an University |
Keywords: Distributed Intelligent Systems, Adaptive Systems
Abstract: We propose Attention Gumbel-Softmax Multi-Agent Deep Deterministic Policy Gradient (AGS-MADDPG), a multi-agent reinforcement learning algorithm for collaborative pathfinding under partial observability. Traditional methods face challenges including non-differentiable policy gradients and inefficient exploration in discrete action spaces. We address these by integrating the MADDPG framework with Gumbel-Softmax for differentiable policy optimization and adaptive exploration via temperature annealing. A multi-head attention mechanism in the critic network explicitly models inter-agent dependencies by dynamically weighting teammates’ action features, enhancing Q-value estimation precision and training stability. Evaluations in Pogema environments with random obstacles and complex layouts demonstrate improved convergence speed, pathfinding success rates, and average path length compared to baseline methods. The results validate the effectiveness of combining differentiable discrete policies with explicit attention-based coordination for multi-agent collaboration in partially observable settings.
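The core mechanism, differentiable discrete action sampling with Gumbel-Softmax under an annealed temperature, can be sketched as follows; the objective, schedule, and update step are illustrative stand-ins rather than the AGS-MADDPG training loop.

```python
# Sketch: Gumbel-Softmax action sampling with temperature annealing.
import math
import torch
import torch.nn.functional as F

logits = torch.randn(4, 5, requires_grad=True)   # 4 agents, 5 discrete actions

def temperature(step, tau_start=1.0, tau_min=0.1, decay=1e-3):
    return max(tau_min, tau_start * math.exp(-decay * step))

for step in range(3):
    tau = temperature(step)
    actions = F.gumbel_softmax(logits, tau=tau, hard=True)  # one-hot, gradients still flow
    loss = -(actions * logits).sum()                        # stand-in for the critic loss
    loss.backward()
    with torch.no_grad():
        logits -= 0.01 * logits.grad                        # toy gradient step
        logits.grad.zero_()
```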
|
|
12:00-12:15, Paper Mo-S1-T3.5 | |
Dynamic QoS-Aware Scheduling Framework for Microservice in Edge Computing |
|
Li, Xinyu | Nanjing University of Aeronautics and Astronautics |
Li, Xin | Nanjing University of Aeronautics and Astronautics |
Qin, Xiaolin | Nanjing University of Aeronautics and Astronautics |
Keywords: Distributed Intelligent Systems, Service Systems and Organizations
Abstract: Cross-machine traffic caused by distributed microservice deployment in edge computing significantly affects service performance. Moreover, the dynamics of edge computing, such as fluctuating user request patterns and varying network delays, make static scheduling strategies challenging. To address these two issues, we first propose a Cross-Machine Traffic-Aware Scheduling Algorithm (CTSA), which models the microservice deployment process as a Markov Decision Process and utilizes a Dueling DQN-based approach to minimize cross-machine traffic while balancing node resource usage. Furthermore, we propose a Dynamic QoS-Aware Scheduling Framework (DQSF) that adapts deployment decisions in real time based on system monitoring to address the challenge of dynamics. Experimental evaluations using a real-world microservice application show that our approach significantly reduces request response time by up to 30.2%, improves throughput by up to 36.7%, and ensures Quality of Service (QoS) in a dynamic edge computing continuum.
|
|
12:15-12:30, Paper Mo-S1-T3.6 | |
Distributed Scheduling Method Based on Data-Augmented QMIX for Smart Shop Floor |
|
Ma, Yumin | Tongji University |
Gui, Cunjuan | Tongji University |
Shi, Jiaxuan | Tongji University |
Liu, Juan | Tongji University |
Xing, Jianmin | Tongji |
Liao, Yipeng | Tongji University |
Keywords: Distributed Intelligent Systems, System Modeling and Control
Abstract: Driven by the rapid progress of digital technology, manufacturing cells of the smart shop floor have achieved the ability to collect data, compute, reason, and make autonomous decisions. By utilizing such cell intelligence, production schedules can be flexibly adjusted and production performance can be improved. Against this backdrop, this study proposes a novel distributed scheduling method based on data-augmented QMIX for smart shop floors to fully exploit the potential of cell intelligence. In the proposed method, QMIX is adopted to construct the scheduling agents for manufacturing cells and collaboratively optimize the decisions of agents. Furthermore, a data augmentation mechanism based on denoising diffusion probabilistic model is developed to improve the training efficiency of the QMIX algorithm. Finally, experiments conducted on the semiconductor production shop floor MiniFab verified the performance of the proposed method.
|
|
12:30-12:45, Paper Mo-S1-T3.7 | |
Data Consistency Verification and Detection Based on Formal Methods in Distributed Environment |
|
Li, Xuejian | Anhui University |
Lin, Kai | Anhui University |
Wang, Changyu | Anhui University |
Xia, Hantao | Anhui University |
|
|
12:45-13:00, Paper Mo-S1-T3.8 | |
BST-ID: A Variable-Length Binary Spatiotemporal ID Scheme for Hierarchical and Bandwidth-Constrained Systems |
|
Nagasawa, Yuki | University of Aizu |
Yuichi, Yaguchi | University of Aizu |
Keywords: Cyber-physical systems, Distributed Intelligent Systems, System Modeling and Control
Abstract: We propose BST-ID, a variable-length binary spatiotemporal identifier tailored for hierarchical and bandwidth-constrained systems such as LoRa-based drone networks. Unlike traditional spatial indices that treat time as an auxiliary attribute, BST-ID encodes four-dimensional coordinates (x, y, h, t) with independently adjustable resolution, enabling compact and semantically adaptive representation. The identifier includes a binary header and bit-encoded coordinates, enabling prefix filtering and low-overhead communication with anisotropic control. The format is suitable for edge devices operating in multi-hop LPWAN environments. We implemented BST-ID and evaluated it against GeoHash and Google S2, demonstrating compactness (13 bytes average), encoding speed, and flexible prefix queries. Furthermore, we discuss its potential for decentralized data exchange, including blockchain integration and Lightning Network micropayments, where each ID acts as a hash-stable, time-scoped transaction reference. BST-ID thus offers a scalable, query-efficient, and token-aware foundation for distributed sensing, routing, and autonomous information sharing.
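In the spirit of a variable-resolution binary spatiotemporal identifier, the toy encoder below packs (x, y, h, t) with independently chosen bit budgets; the axis ranges and bit layout are invented for illustration and do not match the actual BST-ID header format.

```python
# Toy fixed-range encoder with per-axis bit budgets (illustrative layout only).
def quantize(value, lo, hi, bits):
    """Map value in [lo, hi] onto an unsigned integer of `bits` bits."""
    levels = (1 << bits) - 1
    clamped = min(max(value, lo), hi)
    return round((clamped - lo) / (hi - lo) * levels)

def encode_bst_like_id(x, y, h, t, bits=(14, 14, 8, 16)):
    """Pack (x, y, h, t) into one integer; the ranges below are illustrative."""
    ranges = [(0.0, 100_000.0), (0.0, 100_000.0), (0.0, 500.0), (0.0, 86_400.0)]
    code = 0
    for value, (lo, hi), b in zip((x, y, h, t), ranges, bits):
        code = (code << b) | quantize(value, lo, hi, b)
    return code

ident = encode_bst_like_id(x=12_345.0, y=67_890.0, h=120.0, t=3_600.0)
print(f"{ident:013x}")   # 52 bits total -> 13 hex digits
```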
|
|
Mo-S1-T4 |
Room 0.12 |
Decision Support Systems 1 |
Regular Papers - SSE |
Chair: Hosen, Mohammad Anwar | Deakin University |
Co-Chair: Zhang, Xu | University of East Anglia |
|
11:00-11:15, Paper Mo-S1-T4.1 | |
Exploring Backbones for DeepLabv3+ in Semantic Segmentation for Driving Scene Understanding |
|
Faruk, Mohammad Omar | Deakin University, Waurn Ponds |
Sakib, Nazmus | Deakin University |
Hosen, Anwar | Deakin University |
Michael Johnstone, Michael | Deakin University, Geelong, Australia |
Keywords: Autonomous Vehicle, Trust in Autonomous Systems, Decision Support Systems
Abstract: Semantic segmentation is a crucial task in autonomous driving, enabling pixel-level understanding of complex road environments. DeepLabv3+ has emerged as a robust architecture for semantic segmentation, owing to its powerful Atrous Spatial Pyramid Pooling (ASPP) and decoder modules. However, the performance of DeepLabv3+ heavily depends on the choice of backbone network, which influences feature extraction, segmentation accuracy, computational efficiency, and inference speed. This paper investigates the impact of various backbones, including ResNet, Xception, Inception, and MobileNet, on the performance of DeepLabv3+ for semantic segmentation of driving scenes using the CamVid dataset. The study systematically compares these backbones, addressing their trade-offs between real-time applicability and segmentation precision. Results reveal that InceptionResNetV2 achieves the highest performance with an accuracy of 90.83% and mIoU of 83.58% but at the cost of higher computational demands, while MobileNetV2 offers a lightweight alternative with competitive performance (mIoU 81.33%). The analysis highlights the strengths and limitations of each backbone, offering valuable insights into optimizing segmentation models for autonomous driving scenarios. Future research should explore hybrid backbones to balance accuracy, efficiency, and explainability for diverse driving environments.
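A hedged sketch of swapping DeepLabv3+ backbones, assuming the segmentation_models_pytorch package rather than the authors' training code, is shown below; the input size and class count are illustrative choices.

```python
# Sketch: building DeepLabv3+ with interchangeable encoders and comparing their sizes.
import torch
import segmentation_models_pytorch as smp

def build_model(backbone: str, n_classes: int = 32):
    return smp.DeepLabV3Plus(
        encoder_name=backbone,          # e.g. "resnet50", "mobilenet_v2", "inceptionresnetv2"
        encoder_weights="imagenet",
        in_channels=3,
        classes=n_classes,              # CamVid is commonly used with 32 classes
    )

for backbone in ["resnet50", "mobilenet_v2", "inceptionresnetv2"]:
    model = build_model(backbone)
    params = sum(p.numel() for p in model.parameters()) / 1e6
    with torch.no_grad():
        out = model(torch.randn(1, 3, 352, 480))
    print(f"{backbone}: {params:.1f}M params, output {tuple(out.shape)}")
```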
|
|
11:15-11:30, Paper Mo-S1-T4.2 | |
Implementation and Evaluation of a Vision-Based Detect-And-Avoid System for Small UAS |
|
Konno, Daichi | University of Aizu |
Yuichi, Yaguchi | University of Aizu |
Keywords: Robotic Systems, Autonomous Vehicle, Conflict Resolution
Abstract: This paper presents the design, implementation, and evaluation of a vision-based Detect and Avoid (DAA) system for small uncrewed aircraft systems (UAS) using monocular vision and deep learning. The proposed system integrates YOLOv8 for object detection, BoT-SORT for tracking, and a rule-based avoidance strategy based on the right-hand traffic convention. The system is deployed on RyzeTech Tello EDU drones with offboard image processing and tested in indoor flight scenarios across multiple approach speeds. Experimental results demonstrate a detection accuracy of 91.7% and an 80% success rate for avoidance maneuvers in non-cooperative environments. Detection and avoidance timing varied by speed, with greater delays at lower speeds and more consistent performance at higher speeds. The system successfully tracked intruders between 6.55 and 9.25 meters, confirming its effectiveness in short-range encounters. Limitations include a narrow field of view, latency due to ground-based processing, and reduced performance in lateral approaches. Future improvements will target fisheye-compatible detection, onboard inference acceleration, and adaptive avoidance logic. These findings support the feasibility of lightweight, vision-only DAA systems for infrastructure-less, low-altitude airspace.
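A hedged sketch of the detect-and-track front end with Ultralytics YOLOv8 and its built-in BoT-SORT tracker is shown below; the video path, size threshold, and the simplified right-hand avoidance stand-in are assumptions, not the paper's controller.

```python
# Sketch: YOLOv8 detection + BoT-SORT tracking, with a crude avoidance stand-in.
from ultralytics import YOLO

model = YOLO("yolov8n.pt")  # a pretrained nano model; the paper trains its own weights

for result in model.track(source="intruder_flight.mp4",   # illustrative video path
                          tracker="botsort.yaml", stream=True):
    if result.boxes is None or len(result.boxes) == 0:
        continue
    for box in result.boxes.xywhn:            # normalized (cx, cy, w, h)
        cx, _, w, _ = box.tolist()
        if w > 0.15:                          # crude proximity proxy: apparent size
            # Right-hand-convention stand-in: steer right if the intruder is centred/left.
            command = "yaw_right" if cx <= 0.5 else "hold_course"
            print(command)
```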
|
|
11:30-11:45, Paper Mo-S1-T4.3 | |
Priority-Driven Instant Delivery System with Drone Resupply |
|
Zhang, Xu | University of East Anglia |
Zhou, Xiaokang | Kansai University |
Ren, Yi | University of East Anglia |
Huang, Tao | James Cook University |
Keywords: Cooperative Systems and Control, Decision Support Systems, Service Systems and Organizations
Abstract: The unmanned aerial vehicle (UAV) resupply mode offers a viable alternative to parcel delivery by replenishing ground vehicles at intermediate open-space supply points. Usually, orders are treated as equally important with no guarantees of on-time delivery. In this paper, we propose a priority based delivery framework to efficiently manage real-time delivery demands with varying priorities. Specifically, this framework integrates a fleet of UAVs for resupply and trucks for final delivery. By considering the inherent priority of orders and their elapsed waiting time, a dynamic priority discipline is introduced and customized to different order contexts. This approach ensures that urgent orders are prioritized to ensure timely fulfillment within their deadlines, while low-prioritized orders can still be completed without excessive waiting. Simulations using a real-world city map of Helsinki, along with mobility features of both trucks and UAVs, show significant performance gains over baseline schemes in terms of on-time delivery rate, average delivery time, and waiting fairness among all orders.
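A minimal sketch of a dynamic priority discipline that combines an order's inherent priority with its elapsed waiting time (aging) is given below; the weight and order fields are illustrative, not the paper's formulation.

```python
# Sketch: priority aging so low-priority orders are not starved.
import time

def dynamic_priority(order, now=None, wait_weight=0.01):
    """order: dict with 'priority' (higher = more urgent) and 'created_at' (epoch seconds)."""
    now = time.time() if now is None else now
    waited = now - order["created_at"]
    return order["priority"] + wait_weight * waited   # aging prevents starvation

def pick_next(pending_orders, now=None):
    return max(pending_orders, key=lambda o: dynamic_priority(o, now))

orders = [
    {"id": "A", "priority": 3, "created_at": 0.0},     # urgent, just arrived
    {"id": "B", "priority": 1, "created_at": -600.0},  # low priority, waited 10 minutes
]
print(pick_next(orders, now=0.0)["id"])   # "B": its long wait outweighs A's priority class
```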
|
|
11:45-12:00, Paper Mo-S1-T4.4 | |
Reinforcement Learning-Based Penalty Function Approach for Constrained Optimization |
|
Yoshida, Tsubasa | Graduate School of Science and Engineering, Kansai University |
Yozawa, Takaya | Kansai University |
Yagi, Atsunari | KANSAI University |
Yun, Yeboon | Kansai University |
Yoon, Min | Pukyong National University |
Keywords: System Modeling and Control, Decision Support Systems, Modeling of Autonomous Systems
Abstract: Many practical problems, such as engineering design, can be reduced to constrained optimization problems. Penalty function methods are widely applied to handle constraints in such problems, but the difficulty of adjusting the penalty parameter remains a significant challenge. This study proposes a reinforcement learning-based penalty function approach. Based on it, we suggest a self-adaptive fitness function for metaheuristics such as genetic algorithms and particle swarm optimization, aiming to improve their performance as global optimization tools for constrained problems. The method helps effectively search the region of the design space that satisfies the constraints, which may yield much better optimal solutions. Through several benchmark test problems, this study demonstrates that the proposed method can provide relatively good performance in comparison with conventional constraint handling techniques.
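For concreteness, a standard penalized fitness for constrained minimization is sketched below; in the proposed approach the penalty coefficient would be adapted by a reinforcement-learning agent rather than fixed as here.

```python
# Sketch: objective plus weighted constraint violation, for use inside a metaheuristic.
def penalized_fitness(objective, constraints, x, penalty_coef):
    """Minimization fitness: objective value plus penalty_coef times total violation."""
    violation = sum(max(0.0, g(x)) for g in constraints)   # g(x) <= 0 means feasible
    return objective(x) + penalty_coef * violation

objective = lambda x: (x[0] - 2) ** 2 + (x[1] + 1) ** 2
constraints = [lambda x: x[0] + x[1] - 1.0]                # encodes x0 + x1 <= 1
# Infeasible point: objective 2.25, violation 1.5 -> fitness 2.25 + 10 * 1.5 = 17.25
print(penalized_fitness(objective, constraints, [2.0, 0.5], penalty_coef=10.0))
```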
|
|
12:00-12:15, Paper Mo-S1-T4.5 | |
Region-Based Optimization of Emergency Multimodal Vehicle-Drone Collaborative Delivery |
|
Pan, Yang | National University of Defense Technology |
Lei, Hongtao | National University of Defense Technology |
Tang, Luohao | National University of Defense Technology |
Zhu, Cheng | National University of Defense Technology |
Keywords: Cooperative Systems and Control, Intelligent Transportation Systems, System Modeling and Control
Abstract: With the rapid development of low-altitude logistics technology, the collaboration between vehicles and drones has emerged as a significant research focus in the field of emergency logistics. Traditional single-mode transportation methods often suffer from low efficiency due to road damage and traffic congestion during natural disasters and other emergencies. To address this challenge, this paper focuses on a region-based multimodal vehicle-drone collaborative delivery problem in emergency scenarios. A mixed-integer linear programming (MILP) model is proposed to maximize customer demand satisfaction and delivery balance under dynamic environmental conditions. Different disaster regions are classified (such as vehicle-drone handover zones, drone-only or truck-only zones), and an integrated scheduling and routing optimization approach is designed by considering transportation capacity, endurance limitations, and weather conditions. Several datasets are constructed and solved using the Gurobi optimizer. Experimental results show that the proposed approach can significantly enhance the completion rate and balance of emergency deliveries in complex environments, providing theoretical support and practical guidance for emergency logistics planning.
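Since the abstract mentions solving the MILP with the Gurobi optimizer, a toy gurobipy model in the same spirit (assigning demand points to delivery modes under capacity limits) is sketched below; the sets, demands, and capacities are invented, and the paper's model is far richer.

```python
# Toy MILP: serve demand points by truck or drone, respecting per-mode capacity.
import gurobipy as gp
from gurobipy import GRB

demands = {"p1": 40, "p2": 25, "p3": 30}        # demand points and quantities (invented)
capacity = {"truck": 60, "drone": 20}           # per-mode delivery capacity (invented)
points, modes = list(demands), list(capacity)

m = gp.Model("vehicle_drone_toy")
x = m.addVars(points, modes, vtype=GRB.BINARY, name="serve")     # x[point, mode]

# Each demand point is served by at most one delivery mode.
m.addConstrs((x.sum(p, "*") <= 1 for p in points), name="one_mode")
# Each mode's capacity must not be exceeded.
m.addConstrs(
    (gp.quicksum(demands[p] * x[p, k] for p in points) <= capacity[k] for k in modes),
    name="capacity",
)
# Maximize total satisfied demand.
m.setObjective(gp.quicksum(demands[p] * x[p, k] for p in points for k in modes), GRB.MAXIMIZE)
m.optimize()

for p in points:
    served = [k for k in modes if x[p, k].X > 0.5]
    print(p, "->", served[0] if served else "unserved")
```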
|
|
12:15-12:30, Paper Mo-S1-T4.6 | |
GECAT: A Graph-Enhanced Causality-Aware Transformer for Industrial Control System Intrusion Detection |
|
Zhang, Yuzhe | Beijing University of Technology |
Zhang, Shanshan | Beijing University of Technology |
Wei, Wan | Beijing University of Technology |
Lai, Yingxu | Beijing University of Technology |
Keywords: Trust in Autonomous Systems, Infrastructure Systems and Services, Smart Sensor Networks
Abstract: Modern Industrial Control Systems (ICS) have become prime targets for cyberattacks due to the increasing connectivity and complexity of operational infrastructures. These attacks can have severe consequences, making intrusion detection crucial. Recent deep learning approaches have steadily improved intrusion detection accuracy. However, they often fail to model the relationships between multiple sensors and actuators, lack the integration of physical priors, and demonstrate limited adaptability to varying conditions. In this paper, we propose GECAT (Graph-Enhanced Causality-Aware Transformer), a novel intrusion detection framework tailored for ICS data streams. Our approach combines graph-based sensor relationship modeling, causal inference principles, and a specially designed temporal transformer backbone to robustly detect and classify complex attacks. We integrate physical domain knowledge with data-driven graph learning modules to create an adaptive representation that adapts to diverse scenarios and enforces realistic process constraints. Experimental results on SWaT and WADI demonstrate that GECAT significantly outperforms a comprehensive set of state-of-the-art baselines across multiple metrics, indicating its effectiveness in defending ICS against sophisticated adversaries. Additionally, extensive ablation studies confirm the distinct contributions of each core component in our design.
|
|
12:30-12:45, Paper Mo-S1-T4.7 | |
Online Active Learning for Dynamic Surrogate Model Updates in Manufacturing Flow Simulation |
|
Saadi, Maryam | IMT, Institut Mines Télécom |
Bernier, Vincent | Airbus |
Zacharewicz, Gregory | IMT, Institut Mines Télécom |
Daclin, Nicolas | IMT, Institut Mines Télécom, ALES, France |
Keywords: Decision Support Systems, Discrete Event Systems, Manufacturing Automation and Systems
Abstract: Discrete-event simulation (DES) is widely used in industrial decision-making to evaluate system performance before implementing changes. In complex environments like helicopter assembly lines, DES provides accurate key performance indicators (KPIs) such as Work-in-Progress (WIP), investment, and the rate at which customer due dates are met (customer satisfaction). However, each simulation run typically takes several hours, and users must test multiple configurations to find an acceptable solution. This makes DES impractical for daily operations or real-time planning. We developed four surrogate models (Random Forest, XGBoost, Transformers, and CNN-MLP) that approximate simulation outputs in milliseconds. Despite their speed, these models lose accuracy when production conditions change. We propose an online active learning approach that updates surrogate models with selected new simulations to maintain accuracy over time.
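One common way to realize such an update loop is uncertainty-driven sample selection, sketched below with a Random Forest surrogate whose tree disagreement flags configurations worth re-simulating; run_des is a hypothetical stand-in for the slow DES call, not the authors' interface.

```python
# Sketch: pick the candidate configurations the surrogate is least sure about, simulate
# only those, and refit the surrogate with the fresh labelled points.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def select_for_simulation(model, candidate_configs, k=5):
    """Return indices of the k candidates where the forest's trees disagree most."""
    per_tree = np.stack([tree.predict(candidate_configs) for tree in model.estimators_])
    uncertainty = per_tree.std(axis=0)
    return np.argsort(uncertainty)[-k:]

rng = np.random.default_rng(0)
X, y = rng.random((200, 6)), rng.random(200)          # past (configuration, KPI) pairs
surrogate = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

candidates = rng.random((1000, 6))
to_simulate = select_for_simulation(surrogate, candidates)
# new_kpis = run_des(candidates[to_simulate])          # hypothetical call to the slow simulator
# surrogate.fit(np.vstack([X, candidates[to_simulate]]),
#               np.concatenate([y, new_kpis]))         # refit with the new labelled points
```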
|
|
12:45-13:00, Paper Mo-S1-T4.8 | |
ADONiS Framework for Automated Decision of Neonatal Oxygen Support |
|
Kettle, Troy | University of Nottingham |
Pekaslan, Direnc | University of Nottingham |
Wagner, Christian | University of Nottingham |
Keywords: Adaptive Systems, Control of Uncertain Systems, Decision Support Systems
Abstract: This paper explores the application of the Adaptive Online Non-Singleton (ADONiS) framework for automated decision making in neonatal oxygen support. Maintaining optimal oxygen saturation (SpO2) levels in preterm infants is critical for preventing severe complications such as chronic lung disease and retinopathy of prematurity. Current clinical practice relies on manual adjustment of oxygen support by bedside caregivers, a process complicated by sensor uncertainty caused by contextual factors, such as challenging placement of sensors on, and high mobility of, babies. Studies indicate this manual approach results in infants spending only 30–40% of time within target SpO2 ranges, highlighting the potential for automated systems. Crucially, however, such systems must combine the handling of sensor uncertainty with clinical interpretability. The ADONiS framework was designed specifically for systems where input noise is a challenge while established, and ideally immutable, rule sets with associated well-defined linguistic terms and fuzzy sets are available to describe the desired and verified behaviour of a given system. In this paper we explore the applicability of ADONiS to the setting of neonatal oxygen support. We collaborated with a neonatology expert at Queen’s Medical Centre (QMC), one of the largest hospitals in Europe, to define the system’s membership functions and rule base, ensuring clinical interpretability remained central to the design. Additionally, we collected a real-world dataset from six neonates at QMC to validate our approach. Using these expert-derived structures, we apply the ADONiS framework to model oxygen support, systematically evaluating its performance on both synthetic scenarios and the collected real patient data. The resulting oxygen support suggestions were then reviewed by a neonatologist, who assessed ADONiS's usage in comparison with traditional singleton approaches. Quantitative metrics were not feasible, as optimal oxygen adjustment in neonatal care inherently lacks a definitive ground truth, leaving clinical judgment as the most appropriate evaluation metric. This work represents an initial, exploratory application of the ADONiS framework to neonatal oxygen support.
|
|
Mo-S1-T5 |
Room 0.14 |
Image Processing and Pattern Recognition 1 |
Regular Papers - Cybernetics |
Chair: Chen, Xinran | Shandong University |
Co-Chair: Lin, Chaochao | Beijing Institute of Technology |
|
11:00-11:15, Paper Mo-S1-T5.1 | |
POGS: Position-Optimized Gaussian Splatting for Reconstruction in Unevenly Illuminated Scenarios |
|
Gao, Chaoyu | National Elite Institute of Engineering, Chongqing University |
Wang, Chengliang | Chongqing University |
Liu, Ji | Chongqing University |
Luo, Yonggang | Chongqing Changan Automobile |
Jiang, Tian | Changan Automobile |
Zheng, Bo | Chongqing Changan Automobile Co. Ltd., China |
Keywords: Image Processing and Pattern Recognition
Abstract: The advent of 3D Gaussian Splatting (3DGS) has achieved a major breakthrough in the field of 3D reconstruction and realized real-time high-quality novel view synthesis. However, 3DGS relies on the color backpropagation gradients, which are greatly affected by the illumination intensity, to optimize the position parameters of gaussian spheres such as the average center coordinates and opacity, resulting in issues like blurring and artifacts when handling scenes with uneven illumination. This is particularly important in 3D reconstruction of autonomous driving scenarios, especially in night scenes with streetlights and vehicle lights. To address this issue, we propose the POGS framework, which integrates depth and edge constraints to optimize the position parameters of gaussian spheres and reduce the interference of illumination intensity. Our approach first incorporates a pre-trained monocular depth estimation network to generate depth maps, which are used to constrain the positions of gaussian spheres. In addition, we introduce an Edge Optimization Loss and incorporate the Segment Anything Model (SAM) to generate contour maps for constraining the weights of gaussian spheres, enabling the model to pay more attention to the gaussian spheres of the objects composing the scene. Our method significantly enhances the reconstruction performance of 3DGS on datasets with low texture, high depth of field, and low illumination, improving the SSIM by 3.1%, and outperforms the state-of-the-art method RawNeRF (2% SSIM) in NeRF while maintaining real-time rendering speed.
|
|
11:15-11:30, Paper Mo-S1-T5.2 | |
Improving Real-Time End-To-End Object Detection with Cross-Feature Interaction |
|
Ge, Junjie | Nanjing University of Science and Technology |
Huang, Bo | Nanjing University of Science and Technology |
Lv, JianYong | Nanjing University of Science and Technology |
Wang, Mingxin | Pengcheng Intelligent Equipment CO., LTD |
Keywords: Image Processing and Pattern Recognition, AI and Applications, Deep Learning
Abstract: Real-Time DEtection TRansformer (RT-DETR) has attracted considerable attention for its innovative real-time and end-to-end object detection. However, it suffers from insufficient spatial and channel information in shallow features as well as information loss during the network process. To alleviate these problems, this paper proposes two novel modules: an Atrous Attention Module and a Cross-Feature Interaction Module. The Atrous Attention Module captures contextual information by operating on feature maps at different scales, enhancing them in both spatial and channel dimensions and improving their information richness. The Cross-Feature Interaction Module, inspired by the Cross-Attention mechanism of Transformer, facilitates better interactions between high-level and low-level features. It mitigates information loss by capturing channel similarities and differences across feature maps at different scales. To demonstrate our model's effectiveness, we conducted extensive experiments on MS COCO 2017 and Pascal VOC 2012 datasets. The results show that our model achieves significant performance improvements without notable reduction in efficiency.
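As background for the cross-feature idea, the sketch below shows one generic way to let shallow (low-level) features attend to deep (high-level) features with standard cross-attention in PyTorch. The module name, channel width, and head count are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of cross-attention between feature levels; names are assumed.
import torch
import torch.nn as nn

class CrossFeatureInteraction(nn.Module):
    def __init__(self, channels: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(channels)

    def forward(self, low_feat: torch.Tensor, high_feat: torch.Tensor) -> torch.Tensor:
        # low_feat: (B, C, H, W) shallow features; high_feat: (B, C, h, w) deep features.
        B, C, H, W = low_feat.shape
        q = low_feat.flatten(2).transpose(1, 2)    # (B, H*W, C)
        kv = high_feat.flatten(2).transpose(1, 2)  # (B, h*w, C)
        fused, _ = self.attn(self.norm(q), kv, kv) # low-level queries attend to high-level keys
        fused = fused + q                          # residual connection
        return fused.transpose(1, 2).reshape(B, C, H, W)

# Example: fuse a stride-8 map with a stride-32 map of equal channel width.
low = torch.randn(2, 256, 64, 64)
high = torch.randn(2, 256, 16, 16)
out = CrossFeatureInteraction(256)(low, high)      # (2, 256, 64, 64)
```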
|
|
11:30-11:45, Paper Mo-S1-T5.3 | |
A Brain-Inspired Dual-Stream Neural Network for Tumor Classification in Ultrasound Images |
|
Lin, Chaochao | Beijing Institute of Technology |
Boumaraf, Said | Space Telecommunications Exploitation Center, Algerian Space Agency |
Liu, Xiabi | Beijing Institute of Technology |
Liu, Qianglin | National Cancer Center/National Clinical Research Center for Cancer |
Niu, Lijuan | National Cancer Center/National Clinical Research Center for Cancer |
Werghi, Naoufel | Khalifa University |
Keywords: Image Processing and Pattern Recognition, Application of Artificial Intelligence, Deep Learning
Abstract: Early and accurate tumor classification in ultrasound images plays a pivotal role in improving cancer diagnosis and patient outcomes. Existing computer-aided diagnostic (CAD) algorithms often rely on cropping-based single feedforward pathways, which can result in the loss of crucial contextual information around the tumor. The surrounding ultrasound data, including relative intensity, plays a significant role in tumor diagnosis, and incorrect cropping or positioning may lead to unreliable results. To overcome these limitations, we propose a novel Brain-inspired Dual-stream Network (BidsNet), aiming to emulate the functional mechanisms of the dorsal and ventral streams in human visual processing. BidsNet processes the entire ultrasound image as input, preventing errors or loss of contextual details from cropping. The dorsal stream in BidsNet specializes in extracting spatial features, such as shape and texture, while the ventral stream focuses on object recognition and classification. A cross-stream communication mechanism is introduced to facilitate dynamic information sharing between the streams: spatial attention generated in the dorsal stream informs the ventral stream to improve feature localization, while channel attention derived from the ventral stream refines spatial feature representation in the dorsal stream. This collaborative interplay boosts both the interpretability and performance of the network. Extensive experiments on multiple ultrasound datasets demonstrate that BidsNet delivers superior accuracy and interpretability, validating the effectiveness of its dual-stream design and cross-stream communication mechanism.
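The cross-stream exchange can be pictured with a short, hedged sketch: spatial attention derived from a "dorsal" branch gates the "ventral" branch, while channel attention from the ventral branch gates the dorsal branch. Module and variable names are placeholders, not taken from BidsNet.

```python
# Hedged sketch of a cross-stream attention exchange (illustrative names only).
import torch
import torch.nn as nn

class CrossStreamExchange(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.spatial_gate = nn.Conv2d(channels, 1, kernel_size=1)   # dorsal -> spatial map
        self.channel_gate = nn.Sequential(                          # ventral -> channel weights
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, dorsal, ventral):
        # dorsal, ventral: (B, C, H, W) features from the two streams.
        spatial = torch.sigmoid(self.spatial_gate(dorsal))  # (B, 1, H, W)
        channel = self.channel_gate(ventral)                # (B, C, 1, 1)
        ventral_out = ventral * spatial                     # localize objects with dorsal cues
        dorsal_out = dorsal * channel                       # refine dorsal features with ventral cues
        return dorsal_out, ventral_out
```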
|
|
11:45-12:00, Paper Mo-S1-T5.4 | |
Diffusion-Based Data Augmentation Via Noise Concatenation for Chest X-Ray Classification |
|
Chen, Xinran | Shandong University |
Xin, Yujing | Department of Minimally Invasive Comprehensive Treatment of Cancer |
Liu, Ning | Shandong University |
Xu, Yanyu | School of Software, Shandong University |
Xu, Yonghui | School of Software, Shandong University |
Cui, Lizhen | School of Software, Shandong University |
Keywords: Image Processing and Pattern Recognition, Deep Learning, Neural Networks and their Applications
Abstract: The automated analysis of chest X-rays using deep learning technologies has emerged as a critical trend in healthcare, offering the potential to enhance diagnostic efficiency and reduce clinician workload. However, the prohibitive cost of annotation and privacy concerns restrict the expansion of chest X-ray datasets, hindering the development of chest X-ray analysis models. One solution is to introduce generative models that synthesize new samples to expand datasets. Recent advances in diffusion models have demonstrated impressive capabilities, yet prevalent diffusion-based data augmentation techniques typically depend on alterations in color and shape, which are not always appropriate for medical imaging tasks. Consequently, we propose a diffusion-based data augmentation method that generates noise incorporating only the essential information, derived from the region of interest and guided by textual prompts. This noise is then combined with noise that encapsulates global information through concatenation, resulting in a synthesized version of chest X-rays that satisfy the fine-grained characteristics of medical images. We fine-tune the diffusion model using a publicly available dataset and evaluate our method on typical chest X-ray classification tasks. The results show that our method outperforms common data augmentation methods on multiple tasks.
|
|
12:00-12:15, Paper Mo-S1-T5.5 | |
WaveConvX: Multi-Level Wavelet Enhancement for Histopathology Image Classification |
|
Zhang, Yuzhe | Beijing University of Technology |
Fang, Yi | Beijing University of Technology |
Liu, Feiran | Beijing University of Technology |
Liu, Zixuan | Beijing University of Technology |
Jia, Xibin | Beijing University of Technology |
Keywords: Image Processing and Pattern Recognition, Deep Learning, Neural Networks and their Applications
Abstract: Automated histopathological image classification is emerging as a promising approach for assisting early cancer diagnosis and informing treatment planning. Although convolutional neural network (CNN) based methods have achieved significant success, they still struggle to capture the rich multi-scale and high-frequency details inherent in pathological tissues. To address the above issue, we propose WaveConvX, a novel histopathological image classification model that integrates multi-level wavelet decomposition with the ConvNeXt backbone. Our approach applies a two-stage discrete wavelet transform (DWT) to intermediate feature maps, enabling the decomposition of features into multiple frequency sub-bands. High-frequency components are enhanced using Adaptive Power Gabor Convolution (APGConv), while mid-frequency ranges are refined through tailored attention mechanisms. These processed sub-bands are then fused via inverse wavelet transforms, producing feature maps enriched with both global context and local morphological cues essential for accurate diagnosis. Comprehensive experiments are conducted on three public datasets (BreakHis for breast cancer, KBSMC for gastric cancer, and LC25000 for lung and colon cancer). The results demonstrate that WaveConvX consistently outperforms ten state-of-the-art benchmarks, achieving superior accuracy, F1 scores, and robustness across multiple types and magnifications of cancer. Our work demonstrates the significant potential of wavelet-enhanced CNNs in histopathological image analysis.
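For orientation, a single-level 2D Haar decomposition of a feature map (a simplified stand-in for the paper's two-stage DWT) can be written directly in PyTorch; the LL/LH/HL/HH sub-band naming follows common wavelet convention and the code is not the WaveConvX implementation.

```python
# Illustrative single-level 2D Haar decomposition of a feature map.
import torch

def haar_dwt2d(x: torch.Tensor):
    """x: (B, C, H, W) with even H and W. Returns (LL, LH, HL, HH) sub-bands."""
    a = x[:, :, 0::2, 0::2]
    b = x[:, :, 0::2, 1::2]
    c = x[:, :, 1::2, 0::2]
    d = x[:, :, 1::2, 1::2]
    ll = (a + b + c + d) / 2.0   # low-frequency approximation
    lh = (a + b - c - d) / 2.0   # horizontal detail
    hl = (a - b + c - d) / 2.0   # vertical detail
    hh = (a - b - c + d) / 2.0   # diagonal detail
    return ll, lh, hl, hh

feat = torch.randn(1, 64, 56, 56)
ll, lh, hl, hh = haar_dwt2d(feat)   # each: (1, 64, 28, 28)
```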
|
|
12:15-12:30, Paper Mo-S1-T5.6 | |
ELSPU: Expanding Labeled Samples Via Semantic Features for Positive Unlabeled Learning |
|
Zhang, Qingren | Beijing University of Technology |
Li, Jinghua | Beijing University of Technology |
Kong, Dehui | Beijing University of Technology |
Sun, Yanfeng | Beijing University of Technology |
Yin, Baocai | Beijing University of Technology |
Keywords: Image Processing and Pattern Recognition, Machine Vision, Deep Learning
Abstract: Positive Unlabeled (PU) learning, as a form of weakly supervised learning, is characterized by only having a fraction of positive samples labeled, while all the others remain unlabeled. The objective of PU learning is to train a binary classifier that can effectively distinguish positive and negative samples despite the incomplete labeling. To achieve a more realistic distribution of samples, we propose a novel approach that leverages the semantic information of the latent classes to expand positive samples via CLIP. In particular, we design an adapter training method that enables the proposed framework to adapt well to medical images. We validate the effectiveness of the proposed ELSPU method. Extensive experiments demonstrate that ELSPU significantly outperforms baseline methods, with an average accuracy improvement of 1.96% on common benchmarks, highlighting its competitive performance.
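For context on the PU-learning setting, the snippet below implements the widely used non-negative PU (nnPU) risk estimator of Kiryo et al., a common baseline objective; it is shown only to clarify the learning problem and is not the ELSPU method itself. The class prior is assumed known.

```python
# Standard non-negative PU (nnPU) risk estimator with a sigmoid surrogate loss.
import torch

def nnpu_risk(scores_p: torch.Tensor, scores_u: torch.Tensor, prior: float):
    """scores_*: raw classifier outputs on positive / unlabeled batches.
    prior: assumed class prior pi_p = P(y = +1)."""
    loss = lambda z: torch.sigmoid(-z)        # sigmoid surrogate loss l(z)
    r_p_plus = loss(scores_p).mean()          # positives treated as label +1
    r_p_minus = loss(-scores_p).mean()        # positives treated as label -1
    r_u_minus = loss(-scores_u).mean()        # unlabeled treated as label -1
    r_negative = r_u_minus - prior * r_p_minus
    # Non-negative correction keeps the estimated negative-class risk at or above zero.
    return prior * r_p_plus + torch.clamp(r_negative, min=0.0)
```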
|
|
12:30-12:45, Paper Mo-S1-T5.7 | |
S-Track: An SAM-Based Model for Real-Time Surgical Video Tracking under High Dynamic Scenarios |
|
Lu, Sicheng | Qingdao University |
Cheng, Zesheng | College of Computer Science and Technology, Qingdao University, |
Wang, Yuxin | Qingdao University |
Keywords: Machine Vision, Image Processing and Pattern Recognition
Abstract: Real-time surgical video tracking faces critical challenges in highly dynamic endoscopic environments, including rapid instrument-tissue interactions, computational inefficiency, and annotation scarcity. This paper proposes S-Track, a novel SAM-based framework integrating a memory-augmented spatiotemporal fusion module, a dual-path motion prediction mechanism, and a dynamic attention-guided computation strategy. These innovations synergistically address the accuracy-efficiency trade-off, achieving state-of-the-art performance while reducing GPU memory usage by 27% and accelerating inference speed to 27 FPS. Furthermore, we introduce URS-Lithotripsy, the first comprehensive dataset for multi-target surgical tracking, comprising 20 hours of high-resolution videos with 1.2 million expert-annotated frames. Cross-domain validation on EndoVis 2018 demonstrates robust generalization, and ablation studies confirm the necessity of each module. The efficiency, precision, and adaptability to sparse annotations highlight the potential of S-Track for real-time clinical deployment.
|
|
12:45-13:00, Paper Mo-S1-T5.8 | |
Uncertainty-Specialized Tracking with Weighted Entropy Based Probabilistic Graph |
|
Gao, Haoyu | Beihang University |
Sheng, Hao | Beihang University |
Wang, Shuai | Beihang University |
Yang, Da | Hangzhou International Innovation Institute, Beihang University |
Su, Guanqun | Shandong Qingniao HoT Co. Ltd |
Keywords: Machine Vision, Image Processing and Pattern Recognition
Abstract: Multiple-Object Tracking (MOT) has been an attractive research area in recent years, with substantial progress. However, complicated uncertainties remain in pedestrian movement caused by environmental disturbances and mental state. To handle this, we characterize pedestrian moving patterns through a dual-phase paradigm, and a Probabilistic Graphical Model (PGM) is introduced into our work to build a Hidden Markov Model (HMM) that better captures these uncertainties. Moreover, a weighted entropy mechanism is utilized in feature fusion to balance the importance of appearance and motion information. Finally, our method achieves state-of-the-art results on the MOT17 and MOT20 datasets, especially in the category of graphical methods.
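One plausible reading of entropy-weighted fusion, offered only as an illustration and not the authors' exact scheme, is to weight the appearance and motion similarity matrices by how concentrated (low-entropy) each cue is:

```python
# Illustrative entropy-weighted fusion of appearance and motion similarities.
import numpy as np

def mean_row_entropy(sim, eps=1e-12):
    p = sim / (sim.sum(axis=1, keepdims=True) + eps)
    return float(-(p * np.log(p + eps)).sum(axis=1).mean())

def fuse_similarities(app_sim, mot_sim):
    """app_sim, mot_sim: (num_tracks, num_detections) non-negative similarity matrices."""
    h_app = mean_row_entropy(app_sim)
    h_mot = mean_row_entropy(mot_sim)
    # Lower entropy => sharper, more discriminative cue => larger fusion weight.
    w_app = 1.0 / (h_app + 1e-6)
    w_mot = 1.0 / (h_mot + 1e-6)
    total = w_app + w_mot
    return (w_app / total) * app_sim + (w_mot / total) * mot_sim
```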
|
|
Mo-S1-T6 |
Room 0.31 |
Medical Informatics |
Regular Papers - HMS |
Chair: Chen, Yuyu | Nankai University |
Co-Chair: Guo, Qing | Beijing University of Chemical Technology |
|
11:00-11:15, Paper Mo-S1-T6.1 | |
SAM-BrainFocus: A Multi-Scale Boundary-Refined Framework for Precise Brain Tumor Segmentation |
|
Chen, Yuyu | Nankai University |
Zhang, Ziyi | Nankai University |
Dong, Yufang | Nankai University |
Keywords: Medical Informatics
Abstract: Medical image segmentation plays a vital role in clinical diagnosis. However, segmentation accuracy is still challenged by factors such as blurred target boundaries, complex shapes, and the difficulty of data annotation. Although deep learning-based methods have made significant progress, they still face challenges in dealing with blurred tumor boundaries, multi-scale distributions, and the difficulty of capturing small lesions. To address these challenges, we propose SAM-BrainFocus, a SAM-based brain tumor segmentation model. Built on the original SAM method, the model introduces a brain tumor feature extractor, a tumor prompt generator, and a tumor boundary refiner, significantly improving segmentation accuracy and robustness and achieving breakthroughs in multi-scale feature fusion and adaptability to complex targets. Experimental results show that SAM-BrainFocus outperforms existing mainstream models in brain tumor image segmentation, achieving higher accuracy and reliability.
|
|
11:15-11:30, Paper Mo-S1-T6.2 | |
ASW-YOLO: Hierarchical Global-Local Feature Learning with Dynamic Focus Loss for Accurate Hyperparathyroidism Detection in Ultrasound Imaging |
|
Xie, Qihong | Beijing University of Chemical Technology |
Yang, Jie | Beijing University of Chemical Technology |
Guo, Qing | Beijing University of Chemical Technology |
Wang, Huaqing | Beijing University of Chemical Technology |
Yu, Mingan | China-Japan Friendship Hospital |
Wei, Ying | China-Japan Friendship Hospital |
Zhao, Zhenlong | China-Japan Friendship Hospital |
Keywords: Medical Informatics
Abstract: Precise automated detection of parathyroid nodules is critical for enhancing the diagnosis and treatment of hyperparathyroidism (HPT). However, the ectopic nature, varied shapes, fuzzy boundaries, and significant patient variability of these nodules challenge traditional object detection methods. To tackle these issues, we present ASW-YOLO, an advanced detection framework built on YOLOv10, designed to address the limitations of medical imaging object detection. First, we introduce the Feature Fusion Assembly Block (FFAB), which integrates convolutional and Transformer-based modules to better capture the global context of parathyroid nodules. Second, we propose the Squeeze-Enhanced Axial Feature Pyramid Network (SEA-FPN), which uses adaptive weighting of high-level features to refine low-level feature selection and incorporates axial squeeze attention to balance detection accuracy and processing speed. Additionally, we adopt the WIoU v3 loss function to enhance bounding box accuracy by dynamically adjusting gradient gains, reducing the impact of unclear boundaries. Experiments on a specialized parathyroid ultrasound dataset show that ASW-YOLO significantly outperforms the baseline model, achieving a 95.3% mAP@50, with a 3.2% increase in recall and a 10.2% reduction in parameters. These results highlight the potential of ASW-YOLO for real-time, high-precision computer-aided diagnosis in clinical practice.
|
|
11:30-11:45, Paper Mo-S1-T6.3 | |
EEGGraphNet: A Multi-Class Seizure Classification Model with Adaptive Channel Aggregation Via Graph Attention Neural Network |
|
Luo, Jiajie | Newcastle University |
Ji, Wenxing | Newcastle University |
Degenaar, Patrick | Newcastle University |
Yu, Dahui | Newcastle University |
Yang, Xiao | Newcastle University |
Li, Jichun | Newcastle University |
Keywords: Medical Informatics, Brain-Computer Interfaces, Brain-based Information Communications
Abstract: A complete electroencephalogram (EEG) signal comprises multiple signals from various channels. Effectively aggregating these channel signals to extract spatial features is crucial for brain-computer interface applications. Traditional approaches have largely relied on methods such as channel selection and classic dimensionality reduction algorithms. However, with advancements in artificial intelligence, researchers have begun to explore deep learning-based feature extraction techniques. In recent years, numerous algorithms have been proposed for binary seizure detection, but comparatively less attention has been given to multi-class seizure classification. To address this gap, we present a novel EEG graph network, EEGGraphNet, specifically designed for classifying different seizure types. EEGGraphNet leverages a combination of a graph attention network and a gated recurrent unit to effectively capture both the spatial and temporal features of EEG data. Moreover, unlike many previous studies that focus on seizure-wise evaluation—an approach that may not adequately represent real-world scenarios—we conduct a comprehensive evaluation using both seizure-wise and patient-wise methodologies. Experimental results demonstrate that EEGGraphNet achieves a weighted F1 score of 98.15% for seizure-wise evaluation and 86.11% for patient-wise evaluation. Additionally, comparisons with state-of-the-art models reveal that EEGGraphNet matches or surpasses existing performance benchmarks.
|
|
11:45-12:00, Paper Mo-S1-T6.4 | |
A Quantitative Evaluation of China's Integrated Medical and Elderly Care Policies Using a PMC-LDA Index Model Based on Central and Five Provincial Policy Documents |
|
Zhao, Jing | Northwestern Polytechnical University |
Zhao, Ting | Northwestern Polytechnical University |
Liu, Chuanxu | Northwestern Polytechnical University |
Keywords: Medical Informatics, Cooperative Work in Design, Networking and Decision-Making
Abstract: To support the reform of integrated medical and elderly care and promote the development of the eldercare industry, this study integrates the Latent Dirichlet Allocation (LDA) into the Policy Modeling Consistency (PMC) framework, forming a three-dimensional evaluation system covering policy content, tool applicability, and network structure, thereby assessing central–local policy coordination. Based on 132 policy documents (2013–2024) from the central government and five provinces, the study extracts six key policy themes to serve as one of the variables within the PMC model and introduces Euclidean and cosine distances to measure implementation intensity and directional deviation. Results show that Shanghai demonstrates the highest level of coordination, with strong alignment and standardized execution. Guangxi and Inner Mongolia exhibit insufficient policy input and directional deviation, reflected in fiscal gaps, weak collaboration networks, and localized policy adjustments. Beijing and Qinghai require improvements in timeliness and policy network density. The study proposes a differentiated governance strategy, including targeted central oversight for high-deviation provinces, a policy diffusion network anchored in Shanghai, and enhanced policy adaptation mechanisms in ethnic regions.
|
|
12:00-12:15, Paper Mo-S1-T6.5 | |
Multitask LSTM for Arboviral Outbreak Prediction Using Public Health Data |
|
Rodolfo Celestino, Farias, Lucas | Universidade Federal De Pernambuco |
Pacheco, Silva, Talita | Universidade Católica De Pernambuco |
Henrique Meira, de Araújo, Pedro | Universidade Católica De Pernambuco |
Keywords: Medical Informatics, Human-Machine Interaction, Information Systems for Design and Marketing
Abstract: This paper presents a multitask learning approach based on long short-term memory (LSTM) networks for the joint prediction of arboviral outbreaks and case counts of dengue, chikungunya, and Zika in Recife, Brazil. Leveraging historical public health data from DataSUS (2017–2023), the proposed model concurrently performs binary classification (outbreak detection) and regression (case forecasting) tasks. A sliding window strategy was adopted to construct temporal features using varying input lengths (60, 90, and 120 days), with hyperparameter optimization carried out using Keras Tuner. Model evaluation used time series cross-validation for robustness and a held-out test set from 2023 for generalization assessment. The results show that longer windows improve dengue regression accuracy, while classification performance peaked at intermediate windows, suggesting an optimal trade-off between sequence length and generalization. The multitask architecture delivers competitive performance across diseases and tasks, demonstrating the feasibility and advantages of unified modeling strategies for scalable epidemic forecasting in data-limited public health scenarios.
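A shared-encoder multitask LSTM of the kind described can be sketched in a few lines; the layer sizes, window length, and feature count below are illustrative assumptions rather than the paper's configuration.

```python
# Hedged sketch of a shared-encoder multitask LSTM: one head for outbreak
# classification, one head for case-count regression.
import torch
import torch.nn as nn

class MultitaskLSTM(nn.Module):
    def __init__(self, n_features: int, hidden: int = 64):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden, batch_first=True)
        self.clf_head = nn.Linear(hidden, 1)   # outbreak probability (logit)
        self.reg_head = nn.Linear(hidden, 1)   # predicted case count

    def forward(self, x):
        # x: (batch, window_length, n_features), e.g. 90-day sliding windows.
        _, (h_n, _) = self.lstm(x)
        h = h_n[-1]                            # final hidden state of the last layer
        return self.clf_head(h), self.reg_head(h)

model = MultitaskLSTM(n_features=5)
windows = torch.randn(8, 90, 5)                # dummy sliding-window batch
logit, cases = model(windows)
loss = nn.BCEWithLogitsLoss()(logit, torch.zeros(8, 1)) + \
       nn.MSELoss()(cases, torch.zeros(8, 1))  # joint multitask objective
```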
|
|
12:15-12:30, Paper Mo-S1-T6.6 | |
A Comparative Study of BERT Variants for Detecting Mental Health Disorders in Online Discourse |
|
Sakib, Nazmus | Deakin University |
Hosen, Mohammad Anwar | Deakin University |
Mumu, Sabrina Mostafij | Ahsanullah University of Science and Technology |
Jamal, Hasan Bin | Swinburne University of Technology |
Keywords: Medical Informatics, Intelligence Interaction, Human-Machine Cooperation and Systems
Abstract: Mental health disorders, including conditions like depression and suicidal ideation, have emerged as critical public health challenges. Leveraging the synergy between human expertise and intelligent systems, this study explores the potential of Bidirectional Encoder Representations from Transformers (BERT) variants for analysing mental health discourse. Using approximately 11,000 text samples collected from social media platforms, a meticulously pre-processed and annotated dataset was developed. Comparative evaluations were conducted using baseline models with balanced data across disorder categories. Results revealed that general BERT variants outperformed Clinical BERT, a medical-specific model. This study introduces a novel dataset and presents key findings that not only enhance machine learning model capabilities but also provide insights into behavioral patterns associated with mental health disorders. These contributions pave the way for developing targeted machine learning solutions tailored to mental health applications.
|
|
12:30-12:45, Paper Mo-S1-T6.7 | |
MambaXray-CTL: Multi-Stage Contrastive Training for Medical Report Generation with a Mamba-Based Multi-Modal Large Model |
|
Feng, Wenbin | Shenzhen Technology University |
Lu, Yu | Shenzhen Technology University |
Li, Xiaoqing | National University of Singapore |
Shi, Shijie | Shenzhen Technology University |
Qi, Yingjian | Shenzhen Technology University |
Keywords: Medical Informatics, Visual Analytics/Communication, Human-Machine Interaction
Abstract: The rapid advancement of artificial intelligence (AI) is revolutionizing radiology, particularly in the automation of medical report generation. AI-driven systems offer the potential to alleviate the increasing workload of radiologists while improving diagnostic accuracy, consistency, and overall workflow efficiency. Despite significant progress in multi-modal medical image captioning, existing approaches often suffer from high computational costs and limitations in modeling long-range dependencies between visual and textual features. To address these challenges, we propose MambaXray-CTL, a novel framework for medical report generation that integrates a Mamba-based vision backbone with a large language model (LLM) text decoder. By leveraging a multi-stage training strategy—including autoregressive pretraining, image-text contrastive learning, and supervised fine-tuning with contrastive regularization—our model achieves precise visual-textual alignment while maintaining computational efficiency. Experimental results on the IU X-ray and CheXpertPlus datasets demonstrate that MambaXray-CTL achieves performance comparable to or surpassing state-of-the-art methods in key metrics, particularly BLEU-4 and CIDEr, while significantly reducing inference cost compared to Vision Transformer (ViT)-based architectures. These findings highlight the promise of state space models and contrastive learning in building scalable and effective vision-language systems for real-world clinical deployment.
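The image-text contrastive stage can be illustrated with a generic CLIP-style symmetric InfoNCE loss; the temperature value is an assumption, and the code sketches the objective family rather than the MambaXray-CTL implementation.

```python
# Generic CLIP-style image-text contrastive loss for paired X-rays and reports.
import torch
import torch.nn.functional as F

def image_text_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """img_emb, txt_emb: (B, D) embeddings of paired images and reports."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature   # (B, B) similarity matrix
    targets = torch.arange(img_emb.size(0), device=img_emb.device)
    # Symmetric cross-entropy: each image should match its own report and vice versa.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```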
|
|
12:45-13:00, Paper Mo-S1-T6.8 | |
Do Non-Newtonian Midsoles Influence Strain in Knee Collateral Ligaments? a Finite Element Study |
|
Enze, Shao | Obuda University |
Goda, Tibor J. | Doctoral School on Safety and Security Sciences, Obuda University |
Yaodong, Gu | Faculty of Sport Science, Ningbo University |
Keywords: Medical Informatics
Abstract: Despite the potential implications for injury prevention, research examining the specific mechanisms by which non-Newtonian fluid (NF) midsoles affect collateral ligament strain remains limited. This knowledge gap warrants further investigation, as understanding these interactions could inform both footwear design and injury prevention strategies for runners. This study quantitatively evaluates the mechanical interactions within the non-Newtonian midsole-knee joint coupling system through finite element analysis, providing a theoretical foundation for sports equipment optimization. Eighteen young college students voluntarily participated in this study. This study represents the first systematic evaluation of how NF midsole materials influence knee collateral ligament strain through an integrated approach combining finite element analysis and biomechanical testing. Our findings demonstrate that the NF midsole indirectly modulates collateral ligament strain distribution by significantly altering knee joint kinetic parameters (moments and angles). The findings suggest that strategic implementation of NF materials in footwear can positively influence knee joint loading patterns, offering promising directions for injury prevention technologies in athletic footwear.
|
|
Mo-S1-T7 |
Room 0.32 |
Agent-Based Modeling |
Regular Papers - Cybernetics |
Chair: Strasser, Thomas | AIT Austrian Institute of Technology GmbH |
Co-Chair: Dong, Daoyi | Australian National University |
|
11:00-11:15, Paper Mo-S1-T7.1 | |
A Graph Attention-Based DRL Framework for Joint Task Offloading and Resource Allocation in Mobile Edge Computing |
|
Zhang, Yuchen | Beijing Information Science & Technology University |
Chen, Xin | Beijing Information Science and Technology University |
Jiao, Libo | Beijing Information Science and Technology University |
Zhang, Ning | Beijing Information Science & Technology University |
Cao, Aobo | Beijing Information Science and Technology University |
Zhang, Zhekun | Beijing Information Science and Technology University |
Keywords: Application of Artificial Intelligence, Agent-Based Modeling, Neural Networks and their Applications
Abstract: With the proliferation of sixth-generation (6G) communication networks, mobile edge computing (MEC) has become essential for addressing diverse user demands. The increasing heterogeneity of mobile devices, along with the rise of delay-sensitive and environmentally friendly tasks, exacerbates latency and energy consumption challenges. This paper proposes GERL-OA, a task offloading and resource allocation algorithm based on graph attention auto-encoder (GATE)-assisted deep reinforcement learning (DRL). The MEC environment is modeled as an undirected graph, with spatial features extracted via a graph neural network (GNN). GATE is utilized for unsupervised feature extraction, and a DRL framework is applied to optimize offloading decisions and resource allocation, minimizing the weighted sum of delay and energy consumption. Simulation results validate the superior convergence speed and global optimization capability of GERL-OA. Furthermore, extensive experimental results demonstrate the algorithm’s strong adaptability to different task types, confirming its effectiveness in balancing the requirements of delay-sensitive and environmentally friendly applications.
|
|
11:15-11:30, Paper Mo-S1-T7.2 | |
Non-Reciprocal Interactions Based Emergent Navigation for 3D Autonomous Drones Swarm |
|
Hu, Linqiang | Fudan University |
Zhou, Ziqing | Fudan University |
Chen, Yuning | Fudan University |
Zhang, Hongda | Fudan University |
Meng, Chunlei | Fudan University |
Li, Xinlei | Shanghai University of International Business and Economics |
Liu, Yi | Fudan University |
Dong, Zhiyan | Fudan University |
Ouyang, Chun | Fudan University |
Gan, Zhongxue | Fudan University |
Wu, Dunzhao | Jiangling Motors Corporation, Ltd |
Nie, Zhihua | Fudan University |
Keywords: Swarm Intelligence, Agent-Based Modeling, Optimization and Self-Organization Approaches
Abstract: We address a fundamental challenge in coordinating large-scale 3D drone swarms: how to achieve rapid collective response to environmental stimuli while ensuring group stability and safety. Existing swarm navigation models often rely on sophisticated individual perception and communication capabilities, which can be computationally expensive and impractical for large swarms. In this paper, we propose the Non-reciprocal Collective Emergent Navigation model (NRCE), a decentralized approach designed for real-world drone flocking in complex environments. Unlike traditional models, our approach leverages localized non-reciprocal interactions, where boundary drones detect environmental stimuli and propagate this information throughout the swarm without directly controlling individual trajectories. Through extensive numerical simulations and physical experiments with up to 28 drones, we demonstrate how this model achieves coordinated collective motion while effectively balancing stability with responsiveness. Our findings reveal two notable insights: (1) intermediate cohesion levels (ωc) optimize collective response—a “Goldilocks zone” where individuals are neither too tightly coupled nor too independent, challenging the conventional wisdom that stronger cohesion always improves coordination; and (2) swarm queue configuration significantly affects optimal interaction parameters, with divergent trends observed between attraction- and repulsion-based coordination mechanisms as layer count increases. These discoveries provide critical design principles for cost-effective, high-density swarm systems while advancing the theoretical understanding of collective dynamics in both artificial and biological systems.
|
|
11:30-11:45, Paper Mo-S1-T7.3 | |
Conflict-Free Multi-Agent Path Generation Using Ant Colony Optimization with Load Balancing |
|
Kazama, Ryusuke | Waseda University |
Sugawara, Toshiharu | Waseda University |
Keywords: Swarm Intelligence, Metaheuristic Algorithms, Agent-Based Modeling
Abstract: We propose a method using ant colony optimization (ACO) with load-balancing techniques for multi-agent path finding problems. In automated warehouses and manufacturing facilities, multiple automated guided vehicles (AGVs) must efficiently perform transportation tasks without colliding with obstacles or other agents. Although several metaheuristic approaches, including ACO-based methods, address this challenge, conventional methods do not account for scenarios with many agents, and/or the planning efficiency and quality significantly decrease as the number of agents increases. The proposed method is inspired by conventional ACO-based approaches but differs in three aspects. First, we carefully define the collisions by considering the agent size and travel time. Second, we propose a tailored collision avoidance method for numerous agents by determining whether they should wait for other agents or generate detours. Finally, a directional load-balancing method that considers the movement direction is proposed for more effective ACO path generation. Our experiments demonstrate that the proposed method generates superior coordinated paths without increasing CPU time, even in high-density environments.
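To make the flavour of the approach concrete, the sketch below shows a generic ACO next-step choice augmented with a directional congestion penalty; the exponents alpha, beta, gamma and the penalty form are illustrative assumptions, not the paper's exact rule.

```python
# Illustrative ACO transition rule with a directional load-balancing penalty.
import random

def choose_next(current, candidates, pheromone, heuristic, direction_load,
                alpha=1.0, beta=2.0, gamma=1.0):
    """candidates: list of (next_node, direction); pheromone/heuristic: dicts keyed by
    edge (current, next); direction_load: how many planned paths already use a direction."""
    weights = []
    for nxt, direction in candidates:
        tau = pheromone[(current, nxt)] ** alpha          # pheromone attractiveness
        eta = heuristic[(current, nxt)] ** beta           # e.g. inverse distance to goal
        balance = 1.0 / (1.0 + direction_load.get(direction, 0)) ** gamma
        weights.append(tau * eta * balance)               # penalize congested directions
    total = sum(weights)
    probs = [w / total for w in weights]
    return random.choices(candidates, weights=probs, k=1)[0]
```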
|
|
11:45-12:00, Paper Mo-S1-T7.4 | |
Daily Activity Schedule Generation Based on Survey on Time Use and Leisure Activities in Japan (I) |
|
Matsumura, Yudai | The University of Osaka |
Murata, Tadahiko | The University of Osaka |
Keywords: Agent-Based Modeling, Metaheuristic Algorithms, Heuristic Algorithms
Abstract: In this paper, a daily activity schedule for each resident is generated according to the Survey on Time Use and Leisure Activities (STULA) in Japan. To construct an agent-based model as a social simulation, it is important to develop an activity schedule for each agent. Each agent should take their actions according to their roles in their community, such as a mother in a family, a worker in a company, or a player in a football team. Their activity schedule should therefore be generated according to these roles. To generate such daily schedules, we employ the STULA survey in Japan, which is conducted every 5 years by the Statistics Bureau, Ministry of Internal Affairs and Communications, Japan. Since this survey is conducted by the government, all residents selected as respondents are required to reply. We employed the results of this survey to generate a daily activity schedule for each resident. We propose a daily activity schedule generation method that considers behavior continuity in each schedule. We compare the results of daily activity schedules generated by a method that does not consider the continuity of activities with the results of the proposed method. The comparison shows that the proposed method can generate more schedules with continuous activities.
|
|
Mo-S1-T8 |
Room 0.51 |
Intelligent Transportation Systems |
Regular Papers - SSE |
Chair: Petrovan, Adrian | Technical University of Cluj-Napoca |
Co-Chair: Sato, Fumiaki | Toho University |
|
11:00-11:15, Paper Mo-S1-T8.1 | |
LGM4TFP: A Large Graph Model for Traffic Flow Prediction |
|
Zhao, Jinhui | Qingdao University |
Cheng, Zesheng | College of Computer Science and Technology, Qingdao University, |
Keywords: Intelligent Transportation Systems
Abstract: Urban traffic flow prediction is a critical task in Intelligent Transportation Systems, yet existing methods often struggle with capturing long-range spatial dependencies and preserving high-frequency signals. To address these challenges, this paper proposes LGM4TFP, a novel framework that incorporates a Graph Attention with High-frequency Enhancement (GAHE) module. GAHE integrates dynamic graph attention for spatial feature extraction and a spectral-domain high-frequency enhancement mechanism to alleviate the oversmoothing problem. Extensive experiments on four real-world datasets demonstrate that LGM4TFP consistently outperforms state-of-the-art models by 5–10% in RMSE, MAE, and MAPE. Ablation studies confirm the effectiveness of the proposed modules, and results show that the model maintains strong robustness across diverse prediction scenarios, highlighting its practical value for dynamic traffic management.
|
|
11:15-11:30, Paper Mo-S1-T8.2 | |
A Prufer-Based Genetic Algorithm for Solving a Non-Linear Multi-Objective Fixed-Charge Transportation Problem |
|
Pop, Cristian | UT Cluj |
Pop Sitar, Petrica | Technical University of Cluj-Napoca |
Petrovan, Adrian | Technical University of Cluj-Napoca |
Keywords: Intelligent Transportation Systems
Abstract: The scope of this paper is to tackle a non-linear multi-objective optimization fixed-cost transportation problem (MO-FCTP) with multiple modes of transportation with given capacities, having two contradictory minimization objectives: the total transportation cost and, respectively, the total distribution time. The problem studied is solved using a modified NSGA-II, obtained by using a Prufer-based representation, designing an efficient decoding procedure, and integrating into its framework newly developed genetic operators tailored to accommodate multiple modes of transportation. We perform computational experiments for existing instances from the literature. The results obtained show that our Prufer-based NSGA-II outperforms the current state-of-the-art algorithm for solving the non-linear MO-FCTP with multiple modes of transportation with given capacities, providing improved Pareto fronts for all instances considered.
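As background on the encoding, the textbook Prüfer-sequence decoding algorithm (which maps a length n−2 sequence to a labelled tree on n nodes) is sketched below; it illustrates the representation the modified NSGA-II builds on and is not the paper's own decoding procedure.

```python
# Standard Prufer-sequence decoding (textbook algorithm).
def prufer_to_tree(seq):
    """Decode a Prufer sequence over nodes 0..n-1 (len(seq) = n - 2) into tree edges."""
    n = len(seq) + 2
    degree = [1] * n
    for node in seq:
        degree[node] += 1
    edges = []
    for node in seq:
        # Attach the smallest-labelled remaining leaf to the current sequence element.
        leaf = min(i for i in range(n) if degree[i] == 1)
        edges.append((leaf, node))
        degree[leaf] -= 1
        degree[node] -= 1
    # Exactly two nodes of degree 1 remain; connect them to finish the tree.
    u, v = [i for i in range(n) if degree[i] == 1]
    edges.append((u, v))
    return edges

print(prufer_to_tree([3, 3, 3, 4]))   # a tree on 6 nodes
```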
|
|
11:30-11:45, Paper Mo-S1-T8.3 | |
Leveraging RAG-LLMs for Urban Mobility Simulation and Analysis |
|
Ding, Yue | Dublin City University |
McCarthy, Conor | Dublin City University |
O'Shea, Kevin | Dublin City University |
Liu, Mingming | Dublin City University |
Keywords: Intelligent Transportation Systems, System Architecture, Decision Support Systems
Abstract: With the rise of smart mobility and shared e-mobility services, numerous advanced technologies have been applied to this field. Cloud-based traffic simulation solutions have flourished, offering increasingly realistic representations of the evolving mobility landscape. LLMs have emerged as pioneering tools, providing robust support for various applications, including intelligent decision-making, user interaction, and real-time traffic analysis. As user demand for e-mobility continues to grow, delivering comprehensive end-to-end solutions has become crucial. In this paper, we present a cloud-based, LLM-powered shared e-mobility platform, integrated with a mobile application for personalized route recommendations. The optimization module is evaluated based on travel time and cost across different traffic scenarios. Additionally, the LLM-powered RAG framework is evaluated at the schema level for different users, using various evaluation methods. Schema-level RAG with XiYanSQL achieves an average execution accuracy of 0.81 on system operator queries and 0.98 on user queries.
|
|
11:45-12:00, Paper Mo-S1-T8.4 | |
A Lightweight Hybrid Network for Vehicle Driving Behavior Recognition |
|
Gao, Jun | Jianghan University |
Liu, Hanyu | Jianghan University |
Chen, Meng | Qingdao Vocational College of Aeronautical Science and Technology |
Mei, Yu | Jianghan University |
Keywords: Intelligent Transportation Systems
Abstract: Driving behavior recognition is significant to autonomous driving. However, the vast model size and computational expense make deploying state-of-the-art driving behavior recognition models on vehicle-edge devices difficult. In this paper, a novel lightweight hybrid model, LVBRM, is proposed for vehicle driving behavior recognition, which consists of a Transformer-based vehicle detection network and an improved SlowFast behavior recognition network. Firstly, a Global-Local Knowledge Distillation (GLKD) method is proposed to efficiently compress the Transformer-based vehicle detection network and effectively improve detection accuracy. Secondly, a Frame-Difference and Attention-enhanced (FDA) strategy is designed for the behavior recognition network to improve driving behavior recognition accuracy with only a slight increase in model parameters. Finally, extensive experiments on the public large-scale BDD dataset and our self-built VBR-D dataset demonstrate the superiority of the proposed LVBRM, which achieves recognition accuracy gains over state-of-the-art models while using fewer parameters.
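For readers new to distillation, the snippet below gives the standard Hinton-style logit-distillation loss as background for the Global-Local Knowledge Distillation idea; the temperature and mixing weight are assumptions, and the code is not the GLKD method itself.

```python
# Standard logit-distillation loss: soft teacher targets blended with hard labels.
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, temperature=4.0, alpha=0.5):
    """Blend a temperature-scaled KL term (teacher -> student) with cross-entropy."""
    t = temperature
    soft = F.kl_div(
        F.log_softmax(student_logits / t, dim=-1),
        F.softmax(teacher_logits / t, dim=-1),
        reduction="batchmean",
    ) * (t * t)                                  # rescale gradients by T^2 as usual
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard
```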
|
|
12:00-12:15, Paper Mo-S1-T8.5 | |
Adaptive Radar Clustering and Tracking Based on Point Criticality Assessment |
|
Brühl, Tim | Karlsruhe Institute of Technology |
Vico, Antonio | Technical University of Munich |
Goldscheider, Jan | Esslingen University of Applied Sciences |
Schwager, Robin | Dr. Ing. H.c. F. Porsche AG |
Sohn, Tin Stribor | Dr. Ing. H.c. F. Porsche AG |
Eberhardt, Tim Dieter | PhD Candidate KIT Porsche AG |
Hohmann, Sören | KIT |
Keywords: Intelligent Transportation Systems, Autonomous Vehicle
Abstract: Radar sensors are superior at measuring distances and velocities, even in adverse weather conditions. However, the noisiness of radar point clouds requires a filtering cascade to make these sensors usable for applications in automated driving. This filtering may cause hazardous situations if points on existing objects are removed by a filter. To serve as a perception component in fully automated driving systems, radar's error rate needs to be reduced significantly. To achieve this goal, we advance the view that points should be filtered adaptively according to their presumable relevance for the current driving task. In this work, we present a real-world study of our radar point criticality estimation algorithm. In addition, we introduce methods for acting on critical points. While the Posterior method recalls points in critical regions, we demonstrate how the creation of clusters and tracks can be facilitated for critical radar points. Our methods are evaluated on a novel dataset including 92 critical scenes with pedestrians in a parking garage. We demonstrate that all methods significantly increase detection rates at both the cluster and track levels and enable earlier detection. However, utilizing every radar point inevitably leads to several false positives. Future work should investigate how to invalidate these points by additional perception sensors.
|
|
12:15-12:30, Paper Mo-S1-T8.6 | |
Adversarial and Reactive Traffic Entities for Behavior-Realistic Driving Simulation: A Review |
|
Ransiek, Joshua | FZI Research Center for Information Technology |
Reis, Philipp | FZI Research Center for Information Technology |
Schürmann, Tobias | FZI Research Center for Information Technology |
Sax, Eric | Institute for Information Processing Technologies (ITIV), Karlsruhe Institute of Technology |
Keywords: Intelligent Transportation Systems, Autonomous Vehicle, Trust in Autonomous Systems
Abstract: Despite advancements in perception and planning for autonomous vehicles (AVs), validating their performance remains a significant challenge. The deployment of planning algorithms in real-world environments is often ineffective due to discrepancies between simulations and real traffic conditions. Evaluating AV planning algorithms in simulation typically involves replaying driving logs from recorded real-world traffic. However, entities replayed from offline data are not reactive, lack the ability to respond to arbitrary AV behavior, and cannot behave in an adversarial manner to test certain properties of the driving policy. Therefore, simulation with realistic and potentially adversarial entities represents a critical task for AV planning software validation. In this work, we aim to review current research efforts in the field of traffic simulation, focusing on the application of advanced techniques for modeling realistic and adversarial behaviors of traffic entities. The objective of this work is to categorize existing approaches based on the proposed classes of traffic entity behavior and scenario behavior control. Moreover, we collect traffic datasets and examine existing traffic simulations with respect to their employed default traffic entities. Finally, we identify challenges and open questions that hold potential for future research.
|
|
12:30-12:45, Paper Mo-S1-T8.7 | |
Detection of Encroaching Vehicles Based on Combination of Deep-Learning-Based Object Detection and Heuristics |
|
Sato, Fumiaki | Toho University |
Koshizen, Takamasa | Honda R&D Co. Ltd |
Yamakawa, Kazuhiko | PT. Mitrapacific Consulindo International |
Yasui, Yuji | Honda R&D Co. Ltd |
Keywords: Intelligent Transportation Systems, Consumer and Industrial Applications, Autonomous Vehicle
Abstract: Traffic warning systems have attracted attention for reducing traffic accidents caused by motorcycles. The authors previously developed a highly portable system that uses an in-vehicle web camera and a smartphone to detect motorcycles approaching at high speed based on rear-camera input images and notifies the driver of danger. In this study, we propose a system that combines deep-learning-based object detection and heuristics to detect encroaching vehicles from images taken by a front-facing in-vehicle camera. First, we consider the shape of the bounding box of the detected vehicle to identify the encroaching vehicle. Since the encroaching vehicle is turned sideways when entering the main road, we identify side-facing vehicles by calculating the aspect ratio of the bounding box. For motorcycles, the criterion for distinguishing between parked motorcycles and encroaching motorcycles is the presence or absence of a rider. To distinguish between parked vehicles and moving vehicles, the values of the background pixels around the detected vehicle are compared with the values of pixels around the corresponding vehicle in the previous frame and the difference is used to determine whether the vehicle is moving. The results show that the above heuristics improve the accuracy of detecting encroaching vehicles.
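The two heuristics described above can be pictured with a toy sketch; the aspect-ratio and frame-difference thresholds are assumptions chosen for illustration, not values from the paper.

```python
# Toy rendering of the aspect-ratio and frame-difference heuristics (thresholds assumed).
import numpy as np

def is_side_facing(bbox, ratio_threshold=1.6):
    """bbox = (x1, y1, x2, y2) in integer pixels. Wide boxes suggest a side-on vehicle."""
    x1, y1, x2, y2 = bbox
    return (x2 - x1) / max(y2 - y1, 1) > ratio_threshold

def is_moving(prev_frame, curr_frame, bbox, margin=10, diff_threshold=12.0):
    """Compare pixels in an expanded window around the box between consecutive frames;
    a large mean absolute difference suggests the vehicle (or its surroundings) moved."""
    x1, y1, x2, y2 = bbox
    h, w = curr_frame.shape[:2]
    x1, y1 = max(x1 - margin, 0), max(y1 - margin, 0)
    x2, y2 = min(x2 + margin, w), min(y2 + margin, h)
    prev_patch = prev_frame[y1:y2, x1:x2].astype(np.float32)
    curr_patch = curr_frame[y1:y2, x1:x2].astype(np.float32)
    return float(np.abs(curr_patch - prev_patch).mean()) > diff_threshold
```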
|
|
12:45-13:00, Paper Mo-S1-T8.8 | |
DSTFAGCN: A Dynamic Scene Trajectory Fusion Adaptive Graph Convolution Network for Pedestrian Trajectory Prediction |
|
Zhu, Xiaoyan | Qingdao University |
Li, Jianbo | Qingdao University |
Ye, Rongkun | Qingdao University |
|
|
Mo-S1-T9 |
Room 0.90 |
Cyber Modern Technology on Medicine, Health Care and Human Assist |
Special Sessions: Cyber |
Chair: Yagi, Naomi | University of Hyogo |
Co-Chair: Takano, Hironobu | Toyama Prefectural University |
Organizer: Yagi, Naomi | University of Hyogo |
Organizer: Takano, Hironobu | Toyama Prefectural University |
Organizer: Yuda, Emi | Mie University |
Organizer: Kiguchi, Kazuo | Kyushu University |
|
11:00-11:15, Paper Mo-S1-T9.1 | |
Facilitator Training System for Interactive Art Appreciation Using Large Language Models and Mixed Reality (I) |
|
Fukuda, Ryota | Kyushu University |
Kurazume, Ryo | Kyushu University |
Keywords: AI and Applications, Application of Artificial Intelligence
Abstract: Interactive art appreciation is a method in which multiple participants engage in repeated discussions to interpret artworks. In this approach, a facilitator plays a crucial role by asking questions to the viewers. However, there are currently limited opportunities for facilitator training. To address this, this paper proposes a Mixed Reality (MR) AI system that utilizes MR headsets and a Large Language Model to train facilitators. The system features five virtual viewers, each with distinct personalities and facial expressions that change through dialogue, allowing users to practice facilitation through interactive conversation. The virtual viewers’ statements are generated by the Large Language Model GPT-4o. Moreover, the user’s voice is transcribed using the speech recognition model Whisper, and the virtual viewers’ statements generated by GPT-4o are converted into speech using the Azure Text-to-Speech synthesis system. In addition, impression data were collected from approximately 500 people, who were asked about their thoughts and impressions after viewing the artwork. When generating the virtual viewers’ statements with GPT-4o, this impression data is provided as input to ensure that the virtual viewers’ responses are as close as possible to what real people actually think and feel when viewing the artwork. Evaluation experiments with facilitator trainees were conducted, and many participants expressed interest in the system.
|
|
11:15-11:30, Paper Mo-S1-T9.2 | |
Emotion Recognition in Robotic Healthcare: A New Approach to Mitigating Professional Burnout Syndrome (I) |
|
Ge, Mouzhi | Deggendorf Institute of Technology |
Bangui, Hind | Masaryk University |
Rossi, Bruno | Masaryk University |
Blanco, Jose Miguel | Escuela Tecnica Superior De Ingenieros De Telecomunicacion, Univ |
Keywords: Cloud, IoT, and Robotics Integration, Transfer Learning, Image Processing and Pattern Recognition
Abstract: Professional Burnout Syndrome (PBS) among healthcare professionals has been considered a threat to both staff well-being and patient safety, especially in a high-stress medical environment. While robotics and AI have been increasingly integrated into healthcare, their impact on PBS has not been explored in detail yet. Therefore, this paper proposes a real-time PBS detection framework for healthcare professionals using emotion recognition. This framework includes a deep learning-based emotion detection system for humanoid companion robots, which then correlates emotional trends with the Circumplex model to identify burnout risk. Our experimental evaluation results across five deep learning architectures, MobileNet, RegNetY, Swin Transformer, ConvNeXt V2, and EVA-02, show a highest accuracy of 74.05% on a public emotion dataset. These results demonstrate the feasibility of integrating such systems into healthcare workflows for early PBS warnings. Also, this work suggests a human-in-the-loop diagnostic model, where robotic emotion detection complements clinical expertise by providing proactive support, strengthening workforce resilience, and maintaining the quality of patient care.
|
|
11:30-11:45, Paper Mo-S1-T9.3 | |
Human-Centric Autonomous Cornering Control Based on Sense of Circular Vision: Evaluation with VR Simulation and Real Driving Video (I) |
|
Masuta, Hiroyuki | Toyama Prefectural University |
Degoshi, Ren | Toyama Prefectural University |
Fuse, Yotaro | Toyama Prefectural University |
Sawai, Kei | Toyama Prefectural University |
Koyanagi, Ken'ichi | Toyama Prefectural University |
Almassri, Ahmed | Toyama Prefectural University |
Li, Fengyu | Toyama Prefectural University |
Keywords: Computational Intelligence, Hybrid Models of Computational Intelligence, Machine Learning
Abstract: Research and development of autonomous driving technology are progressing rapidly. However, many people still have concerns regarding the comfort and safety of autonomous vehicles. One key issue is the discrepancy between human driving behavior and the driving style adopted by current autonomous systems. To address this problem, we have proposed an autonomous cornering control system based on the Sense of Circular Vision (SoCV), which takes human visual perception during cornering into account. In our previous work, the proposed system demonstrated that its generated trajectories in simulation closely resembled those of expert drivers. However, the impact on subjective discomfort and ride quality had not yet been clarified. This paper evaluates these aspects using both subjective and objective measures through experiments conducted in a VR simulation environment and with real-world driving videos. Experimental results show that the proposed system improves subjective comfort and perceived safety, and promotes anticipatory gaze behavior compared to conventional control methods.
|
|
11:45-12:00, Paper Mo-S1-T9.4 | |
Data-Driven Aspiration Risk Assessment Based on Swallowing Posture with Future Smartphone Applicability (I) |
|
Yagi, Naomi | University of Hyogo |
Nakamura, Katsuya | Kawasaki University of Medical Welfare |
Nagami, Shinsuke | Health Sciences University of Hokkaido |
Kobashi, Syoji | University of Hyogo |
Keywords: Deep Learning, Application of Artificial Intelligence, AI and Applications
Abstract: Swallowing disorders have become a major problem among the aging population. A chin-up posture, along with curvature of the back, is considered unfavorable for swallowing and may lead to aspiration of saliva or food. To prevent this, a chin-down posture has been widely adopted; however, it has not been consistently defined. Therefore, we propose an innovative aspiration risk assessment system based on swallowing posture, using computer-based image analysis. The main contribution of this study is that the proposed system can assess aspiration risk with a high Area Under the Curve (AUC) of 0.799. This result was achieved solely by training a Light Gradient Boosting Machine (LightGBM) using a total of 113 features extracted from lateral-view images of swallowing posture. In addition, we identified three novel predictive indexes: shoulder angle, the angle formed by the tragus, acromion, and posterior and superior iliac spine (PSIS), and the angle formed by the acromion, 4th lumbar vertebra (L4), and PSIS. The proposed system was evaluated using a dataset of 68 participants over 65 years of age, using leave-one-out cross-validation (LOOCV). Its performance showed improvement compared to previous manual evaluations based on head angle, cervical angle, shoulder angle, pelvic angle, kyphosis index, and cervical range of motion. The experimental results indicated that cervical range of motion was not a particularly important factor. Moreover, the assessment system is expected to be applicable in the future, requiring only lateral-view photographs of swallowing posture taken with a smartphone.
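A minimal sketch of the evaluation setup described above, assuming placeholder features and labels: a LightGBM classifier scored with leave-one-out cross-validation and AUC.

```python
# Sketch only: dummy data stands in for the 68 participants x 113 posture features.
import numpy as np
from lightgbm import LGBMClassifier
from sklearn.model_selection import LeaveOneOut
from sklearn.metrics import roc_auc_score

X = np.random.rand(68, 113)          # placeholder posture features
y = np.random.randint(0, 2, 68)      # placeholder aspiration-risk labels

scores = np.zeros(len(y))
for train_idx, test_idx in LeaveOneOut().split(X):
    model = LGBMClassifier(n_estimators=200)
    model.fit(X[train_idx], y[train_idx])
    scores[test_idx] = model.predict_proba(X[test_idx])[:, 1]

print("LOOCV AUC:", roc_auc_score(y, scores))
```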
|
|
12:00-12:15, Paper Mo-S1-T9.5 | |
Integration of BVG-LS into a Deep Neural Network Architecture Designed for EEG Signal Classification (I) |
|
Fukushima, Takuto | Meiji University |
Miyamoto, Ryusuke | Meiji University |
Keywords: Cybernetics for Informatics, Neural Networks and their Applications, Transfer Learning
Abstract: The present work enhanced the classification accuracy of motor imagery (MI) based on EEG signals using a novel deep neural network (DNN) architecture that integrated InternImage and ST-pooling to capture spatiotemporal features. While this architecture outperformed existing methods in classification accuracy, the validation loss indicated overfitting during training. We introduce a bias-variance guided layer selection (BVG-LS) strategy into the fine-tuning process to address this issue. This approach adaptively adjusts the number and selection of layers updated during fine-tuning, replacing the conventional single-layer update with a more effective multi-layer configuration guided by BVG-LS. Experimental evaluation on the cross-individual validation task using the PhysioNet EEG Motor Movement/Imagery dataset showed that the accuracies of two-, three-, and four-class classification were improved to 89.06%, 81.13%, and 70.90%, respectively, surpassing the performance of existing fine-tuning techniques, including full fine-tuning.
|
|
12:15-12:30, Paper Mo-S1-T9.6 | |
EEG-Driven Detection of Motion Mismatch in Upper-Limb Movements (I) |
|
Koukash, Ibrahim | Kyushu University |
Nishikawa, Satoshi | Kyushu University |
Kiguchi, Kazuo | Kyushu University |
Keywords: Biometric Systems and Bioinformatics, Computational Life Science, Cybernetics for Informatics
Abstract: In recent years, research and development of human-assist robots equipped with brain-computer interfaces (BCIs) have accelerated due to the increasing number of stroke cases worldwide, which are a major contributor to permanent disabilities and loss of independence. BCIs can be an effective neurorehabilitation tool, as they assist the impaired sensorimotor loop by providing compensatory somatosensory feedback during motor attempts. The human-assist robot must support the user’s movements based on their intended actions. However, determining whether the robot’s assist motion is as intended by the user in real time remains challenging. This paper proposes a multi-stage method that utilizes time-frequency and statistical analyses of electroencephalography (EEG) signals to detect whether arm movements are as intended. Furthermore, it employs a binary classifier with threshold optimization—guided by the highest classification accuracy—to determine the most effective threshold for distinguishing between intentional and unintentional movements. The effectiveness of the proposed method was evaluated through experiments involving multiple arm movements that combine both simple and complex tasks. A disturbance mechanism device was built and employed during these experiments to generate the sense of motion mismatch by applying disturbance forces. The experimental results demonstrate the effectiveness of the proposed method in detecting motion mismatch and distinguishing between intentional and unintentional movements. The proposed method is applicable to human-assist robots to assess whether the robot’s assist movement matches the user’s intended motion.
|
|
12:30-12:45, Paper Mo-S1-T9.7 | |
Impatience Estimation from Gaze and Pupil Diameter under Time Pressure Conditions (I) |
|
Takano, Hironobu | Toyama Prefectural University |
Kinoshita, Akane | Toyama Prefectural University |
Keywords: Biometric Systems and Bioinformatics
Abstract: Impatience is a common psychological state experienced in daily life, particularly when performing tasks under time pressure. In such states, individuals may experience impaired judgment, which can lead to human errors. Therefore, detecting impatience in advance could prevent decision-making mistakes and reduce the occurrence of human errors. This study aims to develop a method for detecting impatient states using non-contact measurements of eye gaze and pupil diameter. We investigated the feasibility of classifying a state of impatience using logistic regression analysis, based on features extracted from gaze and pupil data. Three levels of impatience were defined, and classification performance was evaluated using two of these levels. As a result, a classification accuracy exceeding 70% was achieved when distinguishing between the highest and lowest perceived impatient states. Furthermore, the feature selection process revealed that variables related to gaze velocity were frequently selected, suggesting their effectiveness in impatience detection.
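A minimal sketch of the classification step, assuming hypothetical gaze and pupil features: logistic regression separating high versus low perceived impatience, scored with standard cross-validation.

```python
# Sketch only: dummy features stand in for gaze-velocity, fixation, and pupil statistics.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X = np.random.rand(60, 6)            # placeholder gaze/pupil features
y = np.random.randint(0, 2, 60)      # 1 = high impatience, 0 = low impatience (dummy labels)

clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
acc = cross_val_score(clf, X, y, cv=5, scoring="accuracy").mean()
print(f"Cross-validated accuracy: {acc:.2f}")
```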
|
|
12:45-13:00, Paper Mo-S1-T9.8 | |
Highlighting System Using ArUco Markers for Surgical Instruments in Knee Replacement Surgery (I) |
|
Nagamune, Kouki | University of Hyogo |
Keywords: Machine Vision, Image Processing and Pattern Recognition, Biometric Systems and Bioinformatics
Abstract: In recent years, the shortage of nurses has become a global problem. Furthermore, in large-scale surgeries such as total knee arthroplasty, a large number of surgical instruments with similar shapes are prepared, which places a heavy burden on operating room nurses who manage them. To solve these problems, we have developed an automatic surgical instrument recognition system to support operating room nurses. We have also developed a system that applies projection mapping technology to highlight automatically recognized surgical instruments in the real world in real time, making it easy for nurses in the operating room to understand. In this study, we propose a method to correct the highlighting using ArUco markers. In experiments, we confirmed object detection using surgical instruments and demonstrated the effectiveness of this system.
|
|
Mo-S1-T10 |
Room 0.94 |
Assistive Technology 1 |
Regular Papers - HMS |
Chair: Ahmed, Samer | Faculty of Engineering and Design, University of Bath |
Co-Chair: Thakur, Dipanwita | University of Calabria |
|
11:00-11:15, Paper Mo-S1-T10.1 | |
Toward a Smart Lower-Limb Wearable Interface for Recognition of Gait Phases and Assistance in Level Walking |
|
Ahmed, Samer | Faculty of Engineering and Design, University of Bath |
Martinez-Hernandez, Uriel | University of Bath |
Keywords: Assistive Technology
Abstract: Peripheral nerve injury often impairs dorsiflexion, leading to asymmetrical gait and increased metabolic cost. While passive ankle-foot orthoses (AFOs) remain common in clinical use, their static, non-adaptive nature limits rehabilitation. Existing active orthoses face challenges such as unreliable gait transition detection, suboptimal actuation, poor generalization, and lack of open-sourcing. This work presents a smart, wearable interface featuring a reproducible design and Bowden cable transmission. A lightweight Bayesian framework enables robust real-time gait transition detection within robot operating system (ROS). The system employs phase-dependent control via a finite state machine, modulating ankle impedance based on gait events. Validation on three unseen users walking naturally demonstrated up to 25% dorsiflexion assistance and phase recognition accuracies of 99.21% (seen interfaces) and 98.75% (unseen), highlighting the system’s adaptability.
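The phase-dependent control idea, a finite state machine that switches ankle impedance on detected gait events, can be sketched as follows; the phase names, transition order, and impedance values are illustrative assumptions rather than the paper's parameters.

# Phase-dependent impedance parameters (stiffness Nm/rad, damping Nms/rad).
IMPEDANCE = {
    "heel_strike": (5.0, 0.20),
    "flat_foot":   (2.0, 0.10),
    "heel_off":    (8.0, 0.30),
    "swing":       (3.0, 0.15),
}
TRANSITIONS = {
    "heel_strike": "flat_foot",
    "flat_foot":   "heel_off",
    "heel_off":    "swing",
    "swing":       "heel_strike",
}

def step(phase, gait_event_detected):
    # Advance the finite state machine when the gait-transition detector fires.
    if gait_event_detected:
        phase = TRANSITIONS[phase]
    stiffness, damping = IMPEDANCE[phase]
    return phase, stiffness, damping

phase = "swing"
for event in [False, True, False, True]:
    phase, k, b = step(phase, event)
    print(phase, k, b)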
|
|
11:15-11:30, Paper Mo-S1-T10.2 | |
TactileNet: Bridging the Accessibility Gap with AI-Generated Tactile Graphics for Individuals with Vision Impairment |
|
Khan, Adnan | Carleton University |
Choubineh, Alireza | Carleton University |
Shaaban, Mai A. | Mohamed Bin Zayed University of Artificial Intelligence |
Akkasi, Abbas | Computers Science Department, Carleton University, Canada |
Komeili, Majid | Carleton University |
Keywords: Assistive Technology
Abstract: Tactile graphics are essential for providing access to visual information for the 43 million people globally living with vision loss. Traditional methods for creating these graphics are labor-intensive and cannot meet growing demand. We introduce TactileNet, the first comprehensive dataset and AI-driven framework for generating embossing-ready 2D tactile templates using text-to-image Stable Diffusion models. We fine-tune Stable Diffusion models using Low-Rank Adaptation and DreamBooth to generate high-fidelity, guideline-compliant graphics with reduced computational cost. Quantitative evaluations with tactile experts show 92.86% adherence to accessibility standards. Our structural fidelity analysis revealed near-human design similarity, with a Structural Similarity Index (SSIM) of 0.538 between generated and expert-designed tactile images. Notably, our method better preserves object silhouettes than human designs (binary mask SSIM: 0.259 vs. 0.215), addressing a key limitation of manual abstraction. The framework scales to 32,000 images (7,050 high-quality) across 66 classes, with prompt editing enabling customizable outputs (e.g., adding or removing details). By automating the 2D template generation step—compatible with standard embossing workflows—TactileNet accelerates production while preserving design flexibility. This work demonstrates how AI can augment (not replace) human expertise to bridge the accessibility gap in education and beyond. Code, data, and models can be found at our project page https://tactilenet.github.io/.
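The SSIM-based fidelity comparison mentioned in the abstract can be reproduced in outline with scikit-image; the images below are random placeholders and the global threshold is only a stand-in for a real silhouette segmentation.

import numpy as np
from skimage.metrics import structural_similarity as ssim

# generated and expert would be grayscale tactile templates in [0, 1] of equal size.
rng = np.random.default_rng(2)
generated = rng.random((256, 256))
expert = np.clip(generated + rng.normal(0, 0.1, (256, 256)), 0, 1)

score = ssim(generated, expert, data_range=1.0)

# Binary-mask SSIM on object silhouettes (thresholding as a placeholder segmentation).
mask_score = ssim((generated > 0.5).astype(float),
                  (expert > 0.5).astype(float), data_range=1.0)
print(score, mask_score)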
|
|
11:30-11:45, Paper Mo-S1-T10.3 | |
Upper-Limb Rehabilitation in Chronic Stroke Using Brain-Computer Interface Based on Motor Imagery Plus Non-Invasive Brain and Muscular Stimulation |
|
Filho, Teodiano Freire | UFES |
Villa-Parra, Ana Cecilia | Universidad Politécnica Salesiana |
Gonzalez-Cely, Aura Ximena | Federal University of Espirito Santo |
Keywords: Assistive Technology, Brain-Computer Interfaces, Human-Machine Interaction
Abstract: This work presents the application of an upper-limb rehabilitation protocol in chronic stroke using a Brain-Computer Interface (BCI) based on Motor Imagery (MI), Non-Invasive Brain Stimulation (NIBS) like transcranial Alternating Current Stimulation (tACS), and Functional Electrical Stimulation (FES). This protocol uses the concept of Alternating Treatment Design (ATD), in which a chronic post-stroke subject is submitted to these techniques for recovery of his upper-limb movements affected by the stroke. The rehabilitation progress was verified through metrics, such as the Fugl-Meyer Assessment Scale (FMS), Functional Independence Measure (FIM), Modified Ashworth Scale (MAS), surface Electromyography (sEMG) and Electroencephalography (EEG). Results from these metrics include a 2.4% increase in FMS and an 11% increase in the muscle contraction of his finger extensors, evaluated by sEMG. For the EEG analysis, there was an energy increase in mu and beta rhythms at the end of the protocol.
|
|
11:45-12:00, Paper Mo-S1-T10.4 | |
Recognition of Meal-Taking Activity Based on a Correlation Matrix to Identify Social Isolation in the Elderly |
|
Bouaziz, Ghazi | IRIT, University of Toulouse, Intégr' It, Esiee-It |
Brulin, Damien | LGP University of Tarbes, UTTOP |
Campo, Eric | LAAS-CNRS University of Toulouse, UT2J |
Keywords: Assistive Technology, Cognitive Computing, Medical Informatics
Abstract: Recognition of Activities of Daily Living (ADLs) has been the subject of research for several years now, with the aim of providing effective solutions, particularly in terms of personal assistance and home care. This paper aims to present an original approach to detect signs of social isolation by monitoring key digital indicators such as mobility and nutritional activity. In order to collect data in a real-life environment, we have deployed a sensor-based monitoring system for three months in the homes of five elderly people (aged 60 to 87) living alone. We compared the results of four algorithms to identify six activities of daily living (preparing meals, eating, washing up, hygiene, sleep/relaxation and other activities). The logical approach, based on a 20-minute time window, achieved the best results with a detection rate of 88.14% for the activity "eating a meal" in five elderly participants, and with a micro-average F1-score of 0.863 using the public ARUBA-1 database. These results represent an intermediate step in the assessment of social isolation in the elderly through the analysis of their ADLs.
|
|
12:00-12:15, Paper Mo-S1-T10.5 | |
Analysis of the Asynchrony Detection Ability between Auditory and Vibrotactile Stimuli on a Single Device for Supporting Singing Timing Acquisition in the Deaf and Hard of Hearing |
|
Yamamoto, Hayato | Tsukuba University of Technology |
Yasu, Keiichi | Tsukuba University of Technology |
Hiraga, Rumi | Tsukuba University of Technology |
Keywords: Assistive Technology, Human Perception in Multimedia, Haptic Systems
Abstract: This study examines the asynchrony detection ability between auditory and vibrotactile stimuli to improve the ``VIBES'' singing timing support system for Deaf and Hard-of-Hearing (DHH) individuals. We conducted synchrony judgment tasks with 20 DHH participants using a single device to present auditory and vibrotactile stimuli. We tested two conditions: a regular rhythm (RR) condition with contextual beats (120 beats per minute) surrounding the target stimulus and a no-rhythm (NR) condition where we presented only the target stimulus. Our results showed a significant negative shift in the point of subjective simultaneity (PSS: the timing at which two different sensory stimuli are perceived as co-occurring) under the NR condition (-15 [ms]), indicating that presenting vibrotactile stimuli before auditory stimuli facilitates simultaneity perception. We observed more expansive temporal binding windows (TBW: the period during which stimuli from different senses are integrated and perceived as a single event) when auditory stimuli preceded vibrotactile stimuli. This suggests that unintended vibrotactile interference from auditory stimuli occurs when using a single device. We also noted individual differences in asymmetry between the TBW of RR and the NR conditions. These findings suggest DHH participants may detect asynchrony based on comparison with surrounding stimuli rather than direct comparison of stimulus pairs, highlighting the importance of accurate timing estimation for adequate singing support.
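The PSS and TBW quantities reported above are typically obtained by fitting a psychometric curve to the proportion of ``simultaneous'' responses across stimulus onset asynchronies (SOAs). A hedged sketch with a Gaussian fit is shown below; the SOA grid, response proportions, and the TBW convention are assumptions, not the study's data or definitions.

import numpy as np
from scipy.optimize import curve_fit

def gauss(soa, amp, pss, sigma):
    return amp * np.exp(-((soa - pss) ** 2) / (2 * sigma ** 2))

# SOA in ms (negative = vibrotactile leads) and hypothetical proportions of
# "simultaneous" judgments at each SOA.
soa = np.array([-200, -100, -50, 0, 50, 100, 200], dtype=float)
p_sync = np.array([0.20, 0.55, 0.80, 0.90, 0.75, 0.50, 0.15])

(amp, pss, sigma), _ = curve_fit(gauss, soa, p_sync, p0=[1.0, 0.0, 100.0])
tbw = 2 * sigma   # one common convention; definitions vary across studies
print(f"PSS = {pss:.1f} ms, TBW ~ {tbw:.1f} ms")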
|
|
12:15-12:30, Paper Mo-S1-T10.6 | |
Voice Assistant and Smart Speaker Usage among Individuals with Visual Impairments in Japan: The 2023 Survey |
|
Tsurumi, Masayo | Tsukuba University of Technology |
Miyagi, Manabi | Tsukuba University of Technology |
Keywords: Assistive Technology, Human-Computer Interaction, Human-Machine Interface
Abstract: Our research project found that individuals with visual impairments (VI) have an advantage in developing applications for smart speakers, which are devices equipped with voice assistants. This may increase their job opportunities. This study examines the usage frequency of voice assistants, smart speakers, and smart home devices in Japan. It explores the relationship between usage frequency and reluctance to speak to a voice assistant, compares survey results from 2019 and 2023, and contrasts usage patterns between the general public and VI. Additionally, the significance of voice assistants for VI will be discussed.
|
|
12:30-12:45, Paper Mo-S1-T10.7 | |
An Exploratory Study on Spatial Personalization in AR Caption Design for Deaf and Hard-Of-Hearing |
|
Funayama, Kosuke | Tsukuba University of Technology |
Shitara, Akihisa | University of Tsukuba |
Yoneyama, Fumio | Tsukuba University of Technology |
Kato, Nobuko | National University Corporation of Tsukuba University of Technology |
Shiraishi, Yuhki | Tsukuba University of Technology |
Keywords: Assistive Technology, Human-Computer Interaction, Virtual/Augmented/Mixed Reality
Abstract: Real-time captions presented through AR glasses hold great promise for enhancing accessibility for Deaf and Hard-of-Hearing (DHH) individuals. However, conventional approaches often overlook the diversity of user preferences and perceptual characteristics. In this study, we explore how spatial personalization—adjustments in display method, position, and depth—can improve the user experience of AR captions. We conducted two experiments. Experiment 1 focused on 2D position personalization, revealing that user-controlled positioning enhanced visibility and comfort, though comprehension gains were modest. Experiment 2 introduced a novel concept, Simulated Depth Personalization, which emulates different viewing distances despite a fixed AR focal plane (4 m). This revealed clear individual differences: participants varied in whether their preferences aligned more with the presentation method (e.g., monocular vs. binocular) or viewing distance. Our findings emphasize that effective AR captioning is not one-size-fits-all but must flexibly adapt to individual spatial preferences. We argue for incorporating spatial personalization as a core design principle for accessible AR systems.
|
|
12:45-13:00, Paper Mo-S1-T10.8 | |
EAPD-CS: Energy Aware Performance Driven Client Selection in Federated Learning Based Human Activity Recognition |
|
Thakur, Dipanwita | University of Calabria |
Guzzo, Antonella | University of Calabria |
Fortino, Giancarlo | University of Calabria |
Keywords: Human-centered Learning, Human Factors, Human-Machine Interaction
Abstract: Human Activity Recognition (HAR) represents a significant domain within pervasive computing, facilitating a diverse array of applications ranging from healthcare to smart environments. Traditional HAR models suffer from several challenges, including data privacy and the distributed participation of heterogeneous resource-constrained devices. To mitigate these challenges, the research community popularly uses federated learning (FL). However, selecting clients in FL is a critical issue, mainly when there is a combination of resource-constrained heterogeneous devices. This paper proposes a resource- and performance-aware client selection algorithm for HAR that amalgamates the benefits of FL with energy efficiency. The framework allows for the training of machine learning models across multiple devices without the necessity of sharing raw data, thereby preserving user privacy. Additionally, it employs energy-aware strategies to diminish the carbon footprint and reduce the computational costs typically linked to traditional cloud-based HAR systems. Experimental results indicate that the proposed framework achieves more than 90% accuracy, comparable to centralized models, while significantly lowering energy consumption and improving the robustness of the model. This work contributes to the evolving field of green AI, delivering an effective, privacy-preserving, and environmentally sustainable approach for HAR applications.
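One way to read the resource- and performance-aware selection idea is as a scoring rule that trades off each client's validation accuracy against its normalized energy cost and keeps the top-k clients per round. The sketch below is a simplified interpretation with made-up weights and client records, not the paper's algorithm.

def select_clients(clients, k=5, alpha=0.7):
    # Higher accuracy raises the score, higher energy cost lowers it.
    def score(c):
        return alpha * c["val_accuracy"] - (1 - alpha) * c["energy_joules_norm"]
    return sorted(clients, key=score, reverse=True)[:k]

clients = [
    {"id": 0, "val_accuracy": 0.91, "energy_joules_norm": 0.40},
    {"id": 1, "val_accuracy": 0.88, "energy_joules_norm": 0.10},
    {"id": 2, "val_accuracy": 0.93, "energy_joules_norm": 0.85},
]
print([c["id"] for c in select_clients(clients, k=2)])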
|
|
Mo-S1-T13 |
Room 0.97 |
Multimedia Computation |
Regular Papers - Cybernetics |
Chair: Tian, Yibin | Shenzhen University |
Co-Chair: Kwong, Sam Tak Wu | Lingnan University |
|
11:00-11:15, Paper Mo-S1-T13.1 | |
MCMG: Multi-Level Controllable Music Generation Model Based on Fine-Grained Control |
|
Zhao, Jingge | Zhengzhou University |
Li, Xiaobing | Central Conservatory of Music |
Zhang, Xinran | Central Conservatory of Music |
Wang, Xiaoqing | Central Conservatory of Music |
Zhou, Qingwen | Central Conservatory of Music |
Tie, Yun | ZhengZhou University |
Keywords: Media Computing, Deep Learning, Application of Artificial Intelligence
Abstract: The task of controlled music generation has been well developed, but the lack of modeling of music control attributes and neglect of music structure affect the quality of music creation. In order to solve the above problems, we first classify music into three levels according to its essentially different nature and extract the corresponding interpretable control attributes. Then by adding the control attributes to the music representation, a multi-level connection between the music generation process and human composition is established. Finally, we propose a multi-level controllable music generation model with fine-grained control (MCMG). This model considers the structural relationships between music levels, making the music generation process highly controllable. The experiments show that our generative model not only improves on the basic music metric compared to the baseline, but also performs well on the controllability metric.
|
|
11:15-11:30, Paper Mo-S1-T13.2 | |
Natural Language Rationales with Sub-QAE Prompting: World Knowledge Discovery through Self-Questioning Architecture |
|
Wang, Hongfei | Shanghai University |
Wei, Xiao | Shanghai University |
Hu, Jinshuai | Shanghai University |
Keywords: Multimedia Computation, Deep Learning, Application of Artificial Intelligence
Abstract: Visual Question Answering with Natural Language Explanations (VQA-NLE) requires models to provide logically grounded explanations while generating accurate answers. Existing methods struggle with complex questions that demand comprehensive logical reasoning and world knowledge integration. We propose that decomposing the target image and question into question-answer-explanation triples containing visual cues, world knowledge, and object relationships can enhance image understanding, question comprehension, and knowledge retrieval for final answer reasoning. We call the proposed method Sub-QAE Prompting, which includes three key steps. First, we train a visual question generation (VQG) model using instruction-aligned question contexts, which is subsequently utilized to generate sub-questions from the image. Then, we generate image-based and text-based sub-questions via the VQG model and a frozen large language model, respectively. Last, we construct question-answer-explanation triples by deriving corresponding answers and explanations for the generated sub-questions, and encode these triples into a visual-aware prompting module for the MLLM to generate the final answer and explanation. Experimental results on the two challenging benchmarks, VQA-X and A-OKVQA, demonstrate that our method achieves state-of-the-art performance compared to existing VQA-NLE approaches.
|
|
11:30-11:45, Paper Mo-S1-T13.3 | |
Small Scale Data Adaptive Enhancement for Cross-Modal Retrieval |
|
Li, Yang | Shandong Normal University |
Zhang, Huaxiang | Shandong Normal University |
Liu, Li | Shandong Normal University |
Leng, Yibin | Shandong Normal University |
Dong, Xinfeng | Shandong Normal University |
Keywords: Multimedia Computation, Deep Learning, Artificial Social Intelligence
Abstract: Cross-modal retrieval aims to retrieve semantically similar heterogeneous modal data based on queries from any modality. However, existing methods struggle to capture complex semantic relationships, particularly in small scale datasets, where data scarcity impairs alignment accuracy. To address these issues, we propose Small Scale Data Adaptive Enhancement for Cross-modal Retrieval (SSDAE). The framework combines a Variational Autoencoder to generate high-quality enhanced samples, which alleviates data scarcity effectively. To further improve the effectiveness of data enhancement, we introduce a dynamic enhancement probability control mechanism, which selects the optimal enhancement probability according to dataset characteristics. The SSDAE incorporates Graph Attention Networks to capture high-order feature relationships of different modalities, dynamically learning the importance of different features. Besides, a modality consistency constraint is utilized to further enhance the accuracy of semantic alignment. Experimental results on three small scale benchmark datasets demonstrate that SSDAE significantly improves retrieval performance and accuracy.
|
|
11:45-12:00, Paper Mo-S1-T13.4 | |
Bidirectional Multimodal Knowledge Augmentation with Sparse Representation for Image-Text Retrieval |
|
Wang, Hengchang | Shandong Normal University |
Liu, Li | Shandong Normal University |
Zhang, Huaxiang | Shandong Normal University |
Dong, Xinfeng | Shandong Normal University |
Du, Hao | Shandong Normal University |
Keywords: Multimedia Computation, Image Processing and Pattern Recognition, Deep Learning
Abstract: In cross-modal retrieval, the core challenge of image-text retrieval lies in the mismatch of heterogeneous modality representation spaces. Traditional methods overly rely on inter-modal relationships or optimize a single modality, neglecting cross-modal feature understanding and the preservation of original data details. In this paper, we propose a multimodal collaborative knowledge augmentation framework. Unlike single-path enhancement, we design a novel generation-reconstruction dual-path architecture. The image-to-text generation driven by BLIP-2 and the text-to-visual reconstruction driven by Stable Diffusion form a closed-loop knowledge cycle, achieving multi-granular feature enhancement through bidirectional cross-modal semantic transformation and deeply constructing the relationship within the visual-language joint embedding space. Then, we design a sparse-gated dynamic aggregation mechanism, which precisely focuses on key cross-modal semantics through feature decoupling, spatially sparse masking, and dynamic weight allocation. Additionally, the temporal-progressive feature fusion strategy controls the feature integration ratio via dynamic fusion coefficients, realizing a smooth transition from single-modal representation learning to cross-modal association reinforcement, overcoming the modal interference issue in traditional fixed-weight fusion. Experimental results demonstrate that our method outperforms existing mainstream methods on the Flickr30K and MS-COCO datasets.
|
|
12:00-12:15, Paper Mo-S1-T13.5 | |
Multi-Granularity Semantic Association Learning for Cross-Modal Hashing Retrieval |
|
Du, Hao | Shandong Normal University |
Liu, Li | Shandong Normal University |
Zhang, Huaxiang | Shandong Normal University |
Lu, Xu | Shandong Agricultural University |
Wang, Hengchang | Shandong Normal University |
Keywords: Multimedia Computation, Image Processing and Pattern Recognition, Deep Learning
Abstract: Existing cross-modal hashing retrieval methods exhibit limitations in capturing semantic correlations and co-existing information across modalities. Most approaches primarily rely on single-level feature representations and fail to fully exploit both coarse-grained and fine-grained semantics. To address these issues, this paper proposes a novel framework, Multi-Granularity Semantic Association Learning for Cross-modal Hashing Retrieval (MSAH), which integrates semantic correlation mining at both coarse-grained and fine-grained levels. The framework processes redundant information in fine-grained features while leveraging coarse-grained features to capture global-level sample relationships, subsequently propagating semantics into the hash function via knowledge distillation. In addition, we designed an Indirect Association Learning strategy to capture potential relationships between instances through the descriptive consistency of image-text pairs. This approach not only reduces the distance between samples of the same category but also mitigates the limitations of contrastive learning in handling category-overlapping samples during hash function training. Experimental results demonstrate that the proposed method significantly outperforms existing deep cross-modal hashing approaches in retrieval efficiency across multiple datasets.
|
|
12:30-12:45, Paper Mo-S1-T13.7 | |
MCFA: Multi-Choice Full-Body Anonymization |
|
Wang, Xuan | Nanjing University of Science and Technology |
Lian, Zhichao | Nanjing University of Science and Technology |
Keywords: Multimedia Computation, Machine Vision, Machine Learning
Abstract: In the age of data explosion, data collection is everywhere, but data may contain a lot of private information. Therefore, anonymization techniques are used to protect privacy information while preserving important data information. For human images, personal privacy information is present not only in the face; parts of the body also contain a lot of personal identification information. Thus, human anonymization should be extended from the face to the full-body. However, current human anonymization mainly focuses on face anonymization. In this paper, we propose a multi-choice full-body anonymization framework that provides different anonymization choices at the semantic level while ensuring the image diversity and face diversity of full-body anonymized data. For full-body anonymized data, we propose a comprehensive evaluation system. We conduct a detailed evaluation of the anonymized data from three aspects: data quality and diversity, anonymization performance, and data utility. We have demonstrated through extensive experiments that our method provides more anonymous choices and better diversity performance compared to state-of-the-art methods while ensuring data quality and anonymization performance.
|
|
12:45-13:00, Paper Mo-S1-T13.8 | |
MVTS: Multimodal Visual-Tactile Sensor Using a Single Camera* |
|
Huang, Wenhao | Shenzhen University |
Lu, Dajiang | Shenzhen University |
Zhong, Xiaopin | Shenzhen University |
Tian, Yibin | Shenzhen University |
Wu, Zongze | Guangdong University of Technology |
Keywords: Multimedia Computation, Neural Networks and their Applications, Image Processing and Pattern Recognition
Abstract: Vision and touch are the most widely used perception modalities in robotic interaction. We propose a Multimodal Visual-Tactile Sensor (MVTS) using a single camera to synchronously capture visual and tactile information. Unlike the traditional design with a fixed opaque gel layer, the MVTS consists of a color camera, a transparent elastomer layer embedded with color markers, and multiple LEDs. It can acquire images for vision like a normal camera, even during object contact. It can also use the markers to acquire tactile information while interacting with objects. In order to simultaneously acquire and separate tactile and visual information in a single shot, we designed a multimodal information separation framework based on a multi-task learning deep neural network. It adopts a shared feature encoder and two separate and parallel decoders to restore visual images and extract tactile maps. The MVTS was evaluated on multiple tasks such as force estimation, contact surface reconstruction, texture reconstruction, and volume estimation of grasped simple-shaped objects, and the results show that it has good performance in multimodal information acquisition, which can help robots better interact with the environment.
|
|
Mo-S1-WS6 |
Room 0.16 |
Collaborative Human-AI Symbiosis 1 |
Workshop |
Chair: Hou, Ming | Department of National Defence, Canada |
Organizer: Hou, Ming | Department of National Defence, Canada |
|
11:00-11:15, Paper Mo-S1-WS6.1 | |
Collaborative Human-AI Symbiosis |
|
Hou, Ming | Department of National Defence, Canada |
Cummings, Missy | George Mason University |
Richards, Dale | Thales UK |
Cain, Brad | C3HF Human Factors Consulting Inc |
Keywords: Human-Machine Interaction
Abstract: As society delves into the realm of Human-AI collaboration, we encounter a complex web of reliability, accountability, trust, and safety considerations. Crafting workflows and processes that uphold stringent AI system requirements is not just a technical challenge but a design imperative, given the challenges posed by increasingly capable AI decision-making. To address this challenge, the SMC Society (SMCS) started a new initiative in 2025 to formulate a Collaborative Human-AI Symbiosis (CHAIS) program and to prepare, enable, and demonstrate SMCS as a world leader in CHAIS. The initiative benefits from the internationally recognized SMCS interdisciplinary knowledge, expertise, and technology in Systems, Man, and Cybernetics, and particularly Human-AI Teaming. The goal is to design, develop, and demonstrate the intricacies of Human-AI teaming and unlock the potential of this transformative alliance. The proposed workshop is the first event of the initiative, engaging world leaders in related expertise areas as well as SMCS and IEEE regular and student members working on concepts and tools for human and AI decision-making processes. Specifically, the objective of this workshop is to brainstorm, identify challenges and priorities, and develop a plan for an SMCS roadmap to address the expertise and capability gaps towards a new IEEE program.
|
|
Mo-MPO |
Foyer F |
Cybernetics WiP Poster Session |
Work in Progress |
Chair: Pirani, Massimiliano | Pegaso University |
|
11:00-12:30, Paper Mo-MPO.1 | |
A Framework for Designing Explainable Micro-Level Customer Behavior Models Using Causal Discovery for Retail Management |
|
Chang, Shuang | Fujitsu Ltd |
Yamane, Shohei | Fujitsu Ltd |
Maruhashi, Koji | Fujitsu Ltd |
Keywords: Agent-Based Modeling
Abstract: Store layout design is of paramount importance in the realm of retail management. To enhance the layout design from a micro-level perspective, building accurate and explainable agent-based models of customers' behaviors that account for layout characteristics is critical yet challenging. Various methods have been developed to build data-driven agent-based models, yet they are incapable of elucidating causal relations between customers' behaviors and layout characteristics to explicitly explain the constructed model. To fill this research gap, we develop a framework specialized for retail management to refine the causal understanding of how layout characteristics influence customers' behaviors, to support explainable design and modeling of such behaviors, and based on which to enable retailers' informed decisions on layout design. By applying the proposed framework to a small retail store case with only limited real data, we demonstrated that the method outperforms conventional optimization methods in saving simulation costs and in constructing an explainable model that more accurately replicates the customers' in-store traffic.
|
|
11:00-12:30, Paper Mo-MPO.2 | |
Semantic Topologies in the Recursive Application of genAI Models |
|
Swift, Ben | ANU School of Cybernetics |
Hong, Sungyeon | The Australian National University |
Keywords: AI and Applications
Abstract: Text-to-image and image-to-text models allow automated (but imperfect) semantic translation across modalities. This paper presents results and preliminary analysis of an empirical study of recursive information processing in popular open-weight generative artificial intelligence (genAI) models such as FluxSchnell and BLIP-2. Through clustering and topological data analysis we show some of the ways that different genAI models and initial prompts give rise to different semantic embedding trajectories, and suggest some ways forward for understanding how semantic information is transmitted through these types of complex information-processing systems.
|
|
11:00-12:30, Paper Mo-MPO.3 | |
Neurocognitive-Inspired Memory Architectures for Agricultural Knowledge Systems: Performance Analysis of Hybrid Approaches |
|
Akbar, Nur Arifin | Universita Degli Studi Di Palermo |
Rahool, Dembani | SingularLogic |
Sullivan, Clare | Agricultural University of Athens |
Lenzitti, Biagio | Universita Degli Studi Di Palermo |
Tegolo, Domenico | Universita Degli Studi Di Palermo |
Keywords: AI and Applications, Application of Artificial Intelligence, Knowledge Acquisition
Abstract: This paper presents an experimental analysis of neuroscience-inspired memory architectures for agricultural information systems. Drawing from biological memory principles, we implement and evaluate four distinct approaches—Vector Database, Knowledge Graph, Finite State Machine, and a novel Hybrid Memory architecture that integrates these components. Our controlled experiments across diverse agricultural queries demonstrate that the biomimetic hybrid architecture achieves superior relevance scores (0.753) compared to single-component approaches while maintaining acceptable performance trade-offs. Matrix-based performance analysis reveals how different architectures excel in specific query types: Vector DB for factual retrieval, Knowledge Graph for relational queries, FSM for procedural information, and Hybrid Memory maintaining strong performance across all categories. This research provides quantitative evidence supporting specialized memory designs for agricultural knowledge systems, though current implementations lack multimodal capabilities that would further enhance agricultural decision support.
|
|
11:00-12:30, Paper Mo-MPO.4 | |
Can LLMs Write Fast System-Aware Numerical Computation Code? |
|
Yang, Xin | Zhejiang University |
Tang, Bintao | Tongji University |
Wang, Yuhao | Tongji University |
Ji, Zimo | Tongji University |
Jiang, Wenyuan | ETH Zurich |
Keywords: AI and Applications, Deep Learning, Machine Learning
Abstract: While Large Language Models (LLMs) have demonstrated impressive capabilities in code generation and mathematical reasoning, their ability to produce correct and highly optimized numerical computation code remains largely unexplored. This paper presents a systematic evaluation framework for assessing LLMs' performance in generating computationally efficient numerical code. We introduce a benchmark comprising carefully curated classical numerical computation problems where human experts typically achieve 3-10× speedups compared to straightforward implementations. These problems require deep understanding of numerical computation and computer architecture, identification of performance bottlenecks, and application of multi-level optimization techniques. To address potential dataset contamination, we also develop a benchmark generator that creates novel variants of these optimization challenges. Our evaluation of mainstream LLMs reveals that even state-of-the-art models consistently fall short of human expert optimization levels. The benchmark is open-sourced to facilitate further research in this direction.
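To illustrate the flavor of the optimization gap such a benchmark targets (not one of its actual problems), the following sketch compares a straightforward Python loop against a vectorized NumPy implementation of the same computation and reports the speedup; sizes and repetition counts are arbitrary.

import timeit
import numpy as np

x = np.random.rand(1_000_000)

def naive_sum_of_squares(arr):
    total = 0.0
    for v in arr:          # element-by-element Python loop
        total += v * v
    return total

def vectorized_sum_of_squares(arr):
    return float(np.dot(arr, arr))   # single BLAS call

t_naive = timeit.timeit(lambda: naive_sum_of_squares(x), number=3)
t_fast = timeit.timeit(lambda: vectorized_sum_of_squares(x), number=3)
print(f"speedup: {t_naive / t_fast:.1f}x")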
|
|
11:00-12:30, Paper Mo-MPO.5 | |
Physics-Informed Neural Networks for Thermal Anomaly Prediction in Battery Energy Storage Systems (withdrawn from program) |
|
Vairo, Tomaso | University of Genova |
Benvenuto, Alessandro | Polytechnic School of the University of Genoa |
Bruzzone, Agostino | Polytechnic School of the University of Genova |
|
11:00-12:30, Paper Mo-MPO.6 | |
Novel Convolutional Neural Networks with Multi-Layer Feature Aggregation and Pooling Permutations for Sound Classification |
|
Choi, Hyosun | Royal Holloway, University of London |
Zhang, Li | Royal Holloway, University of London |
Watkins, Chris | Royal Holloway, University of London |
Panesar, Arjun | DDM Health Ltd |
Keywords: Big Data Computing,, AI and Applications, Transfer Learning
Abstract: Existing Convolutional Neural Networks (CNNs) suffer from limitations such as purely using features from the final layers for decision making, without leveraging important information obtained from middle layers. To tackle such limitations, we propose a novel CNN variant with multi-layer feature aggregation and pooling permutations for sound classification. We also introduce a method that reshapes intermediate features from different layers and summarizes them along the time axis without losing important information. The proposed model significantly boosts performance over the original networks for diverse sound classification problems.
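The multi-layer aggregation idea (pooling features from every stage rather than only the final one) can be sketched in PyTorch as follows; the toy architecture, channel sizes, and spectrogram shapes are assumptions, not the paper's network.

import torch
import torch.nn as nn

class MultiLayerAggCNN(nn.Module):
    # Toy CNN that pools features from every convolutional stage and
    # concatenates them for classification.
    def __init__(self, n_classes=10):
        super().__init__()
        self.stages = nn.ModuleList([
            nn.Sequential(nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2)),
            nn.Sequential(nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2)),
            nn.Sequential(nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2)),
        ])
        self.head = nn.Linear(16 + 32 + 64, n_classes)

    def forward(self, x):               # x: (batch, 1, freq, time) spectrogram
        pooled = []
        for stage in self.stages:
            x = stage(x)
            pooled.append(x.mean(dim=(2, 3)))   # global average over freq and time
        return self.head(torch.cat(pooled, dim=1))

logits = MultiLayerAggCNN()(torch.randn(4, 1, 64, 128))
print(logits.shape)   # torch.Size([4, 10])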
|
|
11:00-12:30, Paper Mo-MPO.7 | |
Holonic Oracle Constructivism in Cyber-Physical Systems |
|
Pirani, Massimiliano | Pegaso University |
Bonifazi, Gianluca | Università Politecnica Delle Marche |
Cucchiarelli, Alessandro | Università Politecnica Delle Marche |
Naeem, Tariq | Marche Polytechnic University |
Spalazzi, Luca | Università Politecnica Delle Marche |
Keywords: Cybernetics for Informatics, Soft Computing, Socio-Economic Cybernetics, Agent-Based Modeling
Abstract: The blockchain framework has increasingly moved beyond pure finance and the digital world to acquire a greater role as an indispensable diaphragm between the physical and information-processing parts of Cyber-Physical Systems. A significant need is the requirement for trustworthy contact between the two domains, which today can be handled with the Oracle concept and its inherent limitations. This proposal calls on the holonic paradigm to mitigate the Oracle problem, employing a constructivistic and second-order cybernetic stance that allows the trust model of the oracle to evolve and follow the dynamics of complex reality.
|
|
11:00-12:30, Paper Mo-MPO.8 | |
Encoding Symmetries of Humanoid Robots Using Equivariant Neural Networks in Reinforcement Learning for Locomotion |
|
Pratham, Salvi | TCS Research |
Nimmala, Sai Abhinay | Tata Consultancy Services |
Lima, Rolif | Tata Consultancy Services |
Kumar, Kishor | TCS Research and Innovation Lab Bangaluru |
Vatsal, Vighnesh | TCS Research, Tata Consultancy Services Ltd |
Das, Kaushik | TCS Research |
Keywords: Cyborgs,, Neural Networks and their Applications, Deep Learning
Abstract: In this work, we incorporate group-equivariant neural networks into reinforcement learning through Proximal Policy Optimization (PPO) to enhance locomotion learning in high-dimensional humanoid agents. By leveraging the symmetry properties of the robot’s morphology through Equivariant Multi-Layer Perceptrons (EMLP), we aim to improve sample efficiency and training stability. Our experiments with the 21-DoF Unitree H1 humanoid suggest that while EMLP combined with PPO improves sample efficiency, vanilla PPO achieves marginally higher performance in terms of gait quality and biomechanical realism. For assessment, we propose a suite of human-inspired biomechanical metrics, such as joint trajectory deviation, gait symmetry, phase consistency, energy efficiency, and motion smoothness—comparing learned policies against human motion capture (MoCap) data. We also provide these metrics as a framework for quantitatively evaluating the similarity between human gait and reinforcement learning-based humanoid locomotion. Surprisingly, despite the theoretical benefits of equivariance, our findings suggest that excessive symmetry constraints may limit expressivity and impede the emergence of human-like locomotion in complex agents. This study provides key insights into the trade-offs of incorporating symmetry priors in deep reinforcement learning for humanoid control.
|
|
11:00-12:30, Paper Mo-MPO.9 | |
Modelling Advanced Persistent Threats to Support Cyber Incident Response |
|
Silva, Jonathas | Universidade Federal Rural De Pernambuco |
Cordeiro, Ailton | Federal University of Pernambuco |
Lima, Milton | Center for Advanced Studies and Systems of Recife, CESAR |
Lins, Fernando Antonio Aires | Federal Rural University of Pernambuco |
Santos, Wellison R. M. | Center for Advanced Studies and Systems of Recife (CESAR) |
Lima, Ricardo | UFPE |
Keywords: Expert and Knowledge-Based Systems, Computational Intelligence, Information Assurance and Intelligence
Abstract: The lateral movement phase in Advanced Persistent Threat (APT) attacks represents a critical stage where attackers navigate internal systems to expand their foothold and escalate privileges. To improve the detection of these threats, this paper presents a process mining-based approach for detecting APTs, focusing specifically on the lateral movement phase. A controlled experimental environment was developed to simulate attack scenarios, enabling the collection and analysis of system and process logs. The results demonstrate the feasibility of using process mining to model attack progression, enhance incident response, and support the early detection of stealthy intrusions in Windows-based environments.
|
|
11:00-12:30, Paper Mo-MPO.10 | |
A Method to Visualize Evaluation Factors Affect Satisfaction in Surveys of the Service Industry Using XAI: Evaluation for Cram School Service As an Example |
|
Saitoh, Fumiaki | Aoyama Gakuin University |
Keywords: Knowledge Acquisition, Big Data Computing,, Machine Learning
Abstract: In this study, we focus on explainable artificial intelligence (XAI), which improves the interpretability of learning models that tend to become black boxes, and propose a method to visualize the factors that affect customer satisfaction. We have developed a method to visualize trends across features based on t-SNE using the contribution of LIME, one of the most representative XAIs, as a variable. By expanding the output results of t-SNE into a bubble chart based on the contribution of variables, it becomes possible to grasp overall trends regarding satisfaction at cram schools, in contrast to LIME, which provides an interpretation for each sample, and to support an understanding of overall needs. We confirmed its effectiveness by applying it to survey data from the Japanese cram school industry, where it is desirable to operate cram schools based on an understanding of the factors that affect the needs of students and parents.
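A minimal sketch of the visualization idea: embed per-respondent LIME contribution vectors with t-SNE and size each bubble by the contribution of one factor of interest. The contribution matrix, the chosen factor, and the satisfaction scores below are random placeholders, not the survey data or the paper's exact procedure.

import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# contributions[i, j] = LIME contribution of feature j for respondent i,
# assumed to be precomputed with the lime package.
rng = np.random.default_rng(3)
contributions = rng.normal(size=(300, 12))
satisfaction = rng.integers(1, 6, size=300)   # 5-point satisfaction score

emb = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(contributions)

# Bubble size: magnitude of the contribution of one factor of interest.
size = 200 * np.abs(contributions[:, 0])
plt.scatter(emb[:, 0], emb[:, 1], s=size, c=satisfaction, cmap="viridis", alpha=0.6)
plt.colorbar(label="satisfaction")
plt.title("t-SNE of LIME contributions (bubble = |contribution| of one factor)")
plt.show()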
|
|
11:00-12:30, Paper Mo-MPO.11 | |
ALIME: Local Interpretable Explanations Based on Generalized Additive Models |
|
Wang, Yixin | Sichuan University |
Dong, Yucheng | Sichuan University |
Liang, Haiming | Sichuan University |
Li, Yao | Sichuan University |
Wu, Yuzhu | Southwestern University of Finance and Economics |
Zha, Quanbo | Chongqing University |
Keywords: Machine Learning, Application of Artificial Intelligence
Abstract: Local Interpretable Model-agnostic Explanations (LIME) is an interpretable method used to explain the predictions of machine learning models. It generates perturbed samples around an instance and fits a simple surrogate model, such as linear regression, to approximate the local behavior of the black-box model. This paper introduces Additive Local Interpretable Model (ALIME), a LIME variant based on generalized additive models (GAMs), where feature effects are modeled by penalized splines (P-splines), providing flexibility for capturing nonlinear relationships. Experimental results show that ALIME outperforms the original LIME method in terms of local fidelity. In addition, the shape functions generated by ALIME can clearly capture local feature contributions, providing insight into the relationship between features and model outputs, and enhancing overall interpretability.
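A hedged sketch of the surrogate-fitting step: a generalized additive model with penalized splines fitted to black-box predictions over a perturbed neighborhood of the instance. pygam is used here purely as one available P-spline implementation; the data, smoothing defaults, and perturbation scheme are assumptions, not the paper's.

import numpy as np
from pygam import LinearGAM, s

# Perturbed neighborhood around one instance and the black-box predictions for
# it (both would normally come from the model being explained).
rng = np.random.default_rng(4)
X_perturbed = rng.normal(size=(500, 3))
y_blackbox = (np.sin(X_perturbed[:, 0]) + 0.5 * X_perturbed[:, 1] ** 2
              + rng.normal(0, 0.05, 500))

# Generalized additive surrogate with one penalized spline term per feature.
gam = LinearGAM(s(0) + s(1) + s(2)).fit(X_perturbed, y_blackbox)

# The per-feature shape functions play the role of LIME's linear coefficients.
for j in range(3):
    XX = gam.generate_X_grid(term=j)
    pd = gam.partial_dependence(term=j, X=XX)
    print(f"feature {j}: shape-function range [{pd.min():.2f}, {pd.max():.2f}]")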
|
|
11:00-12:30, Paper Mo-MPO.12 | |
Dynamic Clustering in Asynchronous Online Hierarchical Federated Learning |
|
Singh, Nidhi | Indian Institute of Technology Bhilai |
Chouhan, Aaastha | University of Massachusetts, Amherst |
Arora, Vaibhav | INESC-ID, IST, U. Lisboa |
Sidhanta, Subhajit | IIT KGP |
Keywords: Machine Learning, Optimization and Self-Organization Approaches, AIoT
Abstract: In federated learning, model aggregation is performed at a common server, leading to a single point of failure, which worsens with an increasing number of clients, causing large network delays that may slow down the entire system. In such cases, hierarchical aggregation is performed with edge devices deployed at different geographical locations. Typically, edge devices employ online learning algorithms for processing the incoming data stream and so, one-way aggregation strategies might end up overwriting the local model updates after the clients receive the aggregated model. Hence, we propose a two-way aggregation strategy wherein models are aggregated both on the client and server. Also, since the incoming distribution may be temporal in nature, we need to dynamically update the aggregation tree. We quantify the non-IIDness among different geo-spatially distributed data-spouts without compromising data privacy. We propose a novel dynamic re-clustering algorithm which is triggered by the intra-cluster and inter-cluster divergence. We implement the above algorithm in an end-to-end FL system that performs two-step aggregation of parent-side and child-side models in the aggregation tree to preserve the model updates in asynchronous execution of the system. Using benchmark datasets, we demonstrate that our proposed framework achieves 95% accuracy, which we plan to further improve.
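One possible reading of the divergence-triggered re-clustering is sketched below: clients share only normalized label histograms, pairwise Jensen-Shannon distances quantify non-IIDness, and a cluster is flagged for re-clustering when its mean intra-cluster distance exceeds a threshold. The distance measure, clustering method, and threshold are assumptions, not the paper's algorithm.

import numpy as np
from scipy.spatial.distance import jensenshannon
from sklearn.cluster import KMeans

# label_hist[i] = normalized class-label histogram reported by client i
# (a privacy-friendly summary; no raw data leaves the client).
rng = np.random.default_rng(5)
label_hist = rng.dirichlet(np.ones(10), size=20)

def mean_pairwise_js(hists):
    d = [jensenshannon(hists[i], hists[j])
         for i in range(len(hists)) for j in range(i + 1, len(hists))]
    return float(np.mean(d))

clusters = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(label_hist)

THRESHOLD = 0.35   # illustrative value
for c in range(4):
    members = label_hist[clusters == c]
    if len(members) > 1 and mean_pairwise_js(members) > THRESHOLD:
        print(f"cluster {c}: intra-cluster divergence too high, re-cluster its clients")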
|
|
Mo-PL1 |
Hall F |
Plenary 1 |
Plenary |
Chair: Strasser, Thomas | AIT Austrian Institute of Technology GmbH |
|
14:00-14:45, Paper Mo-PL1.1 | |
Plenary Talk: Towards Safe and Ergonomic Human-Robot Collaboration: Control and Optimization Perspectives |
|
Dotoli, Mariagrazia | Politecnico Di Bari |
Keywords: Cloud, IoT, and Robotics Integration
Abstract: The advancement of Industry 4.0 and the rise of Industry 5.0 are transforming human-robot collaboration (HRC) within industrial environments. Addressing the challenges of operator safety, ergonomic health, and production efficiency remains essential for modern manufacturing and logistics systems. This talk offers an overview of cutting-edge control and optimization strategies aimed at improving safety, ergonomics, and the operational performance in collaborative robotics and drone-supported automation. An optimization-based approach for planning time-efficient and safety-compliant trajectories for robotic manipulators is presented, with a particular emphasis on ergonomic factors and compliance with safety standards. Moreover, innovative control frameworks for human-drone collaboration in industrial warehouse settings are explored, where drones support pick-and-delivery operations by adapting their paths in real time to human movement. These methods are validated through realistic use cases, paving the way for safer and more efficient HRC in the industrial landscapes of the future.
|
|
14:45-15:30, Paper Mo-PL1.2 | |
Plenary Talk: Scaling Industrial AI |
|
Heiss, Michael | Siemens AG |
Keywords: Manufacturing Automation and Systems, Cyber-physical systems, Distributed Intelligent Systems
Abstract: Scaling is key to economic success—but why does it pose greater challenges in Industrial AI compared to Consumer AI? This presentation examines the key obstacles and explores proven concepts such as assistance systems, foundation models, human-in-the-loop versus closed-loop approaches, and the multi-stage adaptation of models to specific use cases. The question is raised: Is a foundation model for industrial manufacturing applications realistic? Real-world examples of implemented Industrial AI solutions in Digital Manufacturing are presented to demonstrate how these approaches are applied in practice.
|
|
Mo-S2-T1 |
Hall F |
Deep Learning 2 |
Regular Papers - Cybernetics |
Chair: Zhang, Jianjun | South China Agricultural University |
Co-Chair: Pitakwatchara, Phongsaen | Chulalongkorn University |
|
16:00-16:15, Paper Mo-S2-T1.1 | |
LLM-DETR: An Enhanced DETR with a Large Language Model-Inspired Attention Mechanism for Object Detection |
|
Liu, Kuan-Hsien | National Taichung University of Science and Technology |
Mingru, Wang | National Taichung University of Science and Technology |
Liu, Tsung-Jung | National Chung Hsing University |
Keywords: Machine Vision, Deep Learning, Application of Artificial Intelligence
Abstract: We propose LLM-DETR, an enhanced version of the DETR (DEtection TRansformer) framework that integrates advanced attention mechanisms inspired by large language models (LLMs) for object detection. Applied to the MS COCO 2017 dataset, LLM-DETR demonstrates a 5% increase in average precision (AP) over the original DETR, while exhibiting efficient GPU training. This improvement underscores the potential of LLM-inspired attention mechanisms for advancing object detection accuracy in various domains. The code for our LLM-DETR is publicly available on GitHub: https://github.com/mingruWang/LLM-DETR.
|
|
16:15-16:30, Paper Mo-S2-T1.2 | |
Multi-Scale Fusion with Explicit Word-Level Alignment for Multimodal Sentiment Analysis |
|
Li, Xiaoge | Xi'an University of Posts and Telecommunications |
Xing, Jinshuo | Xi'an University of Posts and Telecommunications |
Ma, Yanan | Xi'an University of Posts AndTelecommunications |
An, Xiaochun | Xi'an University of Posts and Telecommunications |
Ren, Yunsheng | Xi'an University of Posts and Telecommunications |
Keywords: Deep Learning, Application of Artificial Intelligence
Abstract: Multimodal sentiment analysis (MSA) integrates and processes data from multiple sources, like audio and text, to better understand human emotions through cross-modal interactions. Previous research generally obtains a feature embedding from each modality and fuses them directly at the utterance level, making it difficult to capture fine-grained multimodal representations and susceptible to irrelevant information from heterogeneous modalities. Moreover, these methods often overlook explicit word-level alignment between modalities, which limits effective representation learning and multimodal fusion. Therefore, we propose a novel method, the Fine-grained Multimodal Fusion Network (MMTA), for MSA. Specifically, a fine-grained representation alignment module is introduced to extract representations at the word level using the Montreal Forced Aligner (MFA). A local-global multi-scale fusion method is then designed to further capture fine-grained interactions from local features, i.e., word-level features. Finally, we also investigate the effectiveness of incorporating word-level contextual information into the fine-grained fusion process. Extensive experiments on the public MSA datasets, i.e., MOSI and MOSEI, show that our approach surpasses previous baselines.
|
|
16:30-16:45, Paper Mo-S2-T1.3 | |
Difference-Guided Modality Fusion Network for Multimodal Object Detection |
|
Li, Linxuan | Xi'an Jiaotong University |
Liu, Meiqin | Xi'an Jiaotong University |
Lan, Jian | Xi'an Jiaotong University |
Dong, Shanling | Zhejiang University |
Liu, Zhunga | Northwestern Polytechnical University |
Keywords: Deep Learning, Application of Artificial Intelligence, Image Processing and Pattern Recognition
Abstract: In recent years, visible-infrared object detection has achieved significant progress. However, most existing methods primarily emphasize the shared features between the two modalities while overlooking their feature differences. To address this limitation, we propose the Difference-Guided Modality Fusion Network, which can effectively improve the fusion and detection performance of modalities. Specifically, we propose a cross-modal data augmentation strategy to overcome the limitations of single-modality reliance by exchanging the partial modal information. To further capture and analyze feature differences between modalities, we introduce a differential attention fusion approach that models a difference matrix across modal channels, thereby quantifying and strengthening the salient features of the two modalities. Additionally, we develop a modality-aware dynamic learning mechanism that employs a loss function that can simultaneously focus on the differences and common parts of the modalities, guiding the model to adaptively learn features between the modalities. Experimental results on FLIR, LLVIP and M3FD datasets demonstrate the effectiveness of the proposed method, with mAP reaching 42.3%, 67.5% and 59.0% respectively.
|
|
16:45-17:00, Paper Mo-S2-T1.4 | |
DSK-YOLO: Feature-Level Super Resolution Boosted Industrial Defect Detection |
|
Mu, Meichen | National Key Laboratory of Human-Machine Hybrid Augmented Intelligence |
Liu, Meiqin | Xi'an Jiaotong University |
Zhang, Senlin | Zhejiang University |
Du, Shaoyi | Xi'an Jiaotong University |
Keywords: Deep Learning, Application of Artificial Intelligence, Image Processing and Pattern Recognition
Abstract: Despite the significant advancements made in industrial defect detection, accurate and timely identification of complex and small-sized defects remains a challenge. Most current lightweight defect detectors are unable to fully extract both global and local contextual information due to their simplified network architectures. To address the above issues, this paper introduces a novel real-time detector, DSK-YOLO, which efficiently enhances global and local contextual information with a lower number of parameters. Specifically, DSK-YOLO comprises two key components: DSKblock and DSKSR. The DSKblock employs dilated separable kernels to expand the effective receptive fields (ERFs) without deep layer stacking, thereby identifying complex defects. For small-sized defect detection, we develop a feature-level super resolution (SR) auxiliary branch to enhance local contextual information in the training phase. Moreover, the train-only SR branch brings no extra computational overhead for inference, making it an impressive choice for real-time tasks. Experimental results demonstrate that, on the industrial datasets NEU-DET and ESD, DSK-YOLO achieves mAP of 45.7% and 64.8%, which is 1.4% and 1.0% higher than the baseline model YOLOv8n. Our proposed DSK-YOLO offers a favorable trade-off between precision and parameters compared to state-of-the-art models.
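A generic building block in the spirit of the dilated separable kernels described above (not the exact DSKblock) is shown below in PyTorch: a dilated depthwise convolution enlarges the effective receptive field without extra depth, followed by a pointwise convolution and a residual connection; channel counts, kernel size, and dilation are assumptions.

import torch
import torch.nn as nn

class DilatedSeparableBlock(nn.Module):
    def __init__(self, channels, kernel_size=5, dilation=3):
        super().__init__()
        pad = dilation * (kernel_size - 1) // 2   # keep spatial size unchanged
        self.depthwise = nn.Conv2d(channels, channels, kernel_size,
                                   padding=pad, dilation=dilation, groups=channels)
        self.pointwise = nn.Conv2d(channels, channels, 1)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.pointwise(self.depthwise(x))) + x   # residual connection

y = DilatedSeparableBlock(64)(torch.randn(2, 64, 40, 40))
print(y.shape)   # torch.Size([2, 64, 40, 40])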
|
|
17:00-17:15, Paper Mo-S2-T1.5 | |
ConMH-Based Multi-Modal Video Retrieval with Contrastive Hashing and Fusion |
|
Ling, Rongye | South China University of Technology |
Li, Jingrou | South China University of Technology |
Ng, Wing Yin | South China University of Technology |
Li, Qihua | South China University of Technology |
Tian, Xing | South China Normal University |
Yan, Xingfu | South China Normal University |
Keywords: Neural Networks and their Applications, Deep Learning, Machine Learning
Abstract: With the rapid progress of urbanization, city governance faces growing challenges such as traffic violations and environmental pollution. Traditional manual monitoring methods are inefficient and costly. To enhance the efficiency of monitoring and managing uncivil behaviors in urban environments, we propose a self-supervised video hashing retrieval framework for uncivil behavior recognition. Leveraging deep learning techniques, our method generates compact binary hash codes for both video and text modalities via a contrastive masked autoencoder (ConMH), enabling efficient large-scale retrieval. We further improve ConMH by introducing cross-attention mechanisms in the text hashing branch to better handle context dependencies. To optimize retrieval results, we integrate five multimodal fusion and ranking strategies, including a novel Hybrid Distance-Rank Fusion method that balances similarity scores and rank information. Experiments conducted on MSRVTT and MSVD datasets demonstrate that our approach achieves superior performance in mAP@K and NDCG metrics. The framework significantly enhances cross-modal semantic coverage, ensures high retrieval precision, and maintains low computational and storage overhead through binary encoding.
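A simplified interpretation of the distance-rank fusion idea is sketched below: each result list contributes a min-max-normalized similarity term plus a reciprocal-rank term, and the weighted sums are re-ranked. The weighting and the example scores are assumptions; the paper's exact formulation may differ.

import numpy as np

def hybrid_distance_rank_fusion(score_lists, w=0.5):
    # score_lists: list of dicts {doc_id: similarity} from different strategies/modalities.
    fused = {}
    for scores in score_lists:
        ids = sorted(scores, key=scores.get, reverse=True)
        vals = np.array([scores[i] for i in ids], dtype=float)
        norm = (vals - vals.min()) / (vals.max() - vals.min() + 1e-12)
        for rank, (doc, s_norm) in enumerate(zip(ids, norm), start=1):
            fused[doc] = fused.get(doc, 0.0) + w * s_norm + (1 - w) / rank
    return sorted(fused, key=fused.get, reverse=True)

video_scores = {"v1": 0.82, "v2": 0.75, "v3": 0.40}
text_scores = {"v2": 0.90, "v1": 0.55, "v4": 0.50}
print(hybrid_distance_rank_fusion([video_scores, text_scores]))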
|
|
17:15-17:30, Paper Mo-S2-T1.6 | |
On Using Wave Variables for Robot Imitation Learning |
|
Pitakwatchara, Phongsaen | Chulalongkorn University |
Keywords: Deep Learning, Application of Artificial Intelligence, Neural Networks and their Applications
Abstract: There are several options in specifying the action for robot control. Typically, robot motion is the natural choice since it is a readily observable quantity. However, applied force may be more suitable for tasks that involve contact with the environment. Moreover, many tasks that exhaustively interact with the environment typically require a specification of both motion and force simultaneously. The wave variable combines them into one quantity related to the power supplied to the robot. Therefore, the wave variable may be used to specify the action. This helps the robot perform interactive tasks better than using motion or force alone. It also helps in creating a high-quality dataset of task demonstrations, since the human perceives the wave reaction as an additional modality for generating the wave action properly.
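For reference, the standard wave-variable transformation from bilateral teleoperation (the paper's exact convention may differ) combines velocity \dot{x} and force F into a power-based pair using a wave impedance b > 0:

u = \frac{b\,\dot{x} + F}{\sqrt{2b}}, \qquad v = \frac{b\,\dot{x} - F}{\sqrt{2b}},

so that the power flowing into the robot is

P = F\,\dot{x} = \tfrac{1}{2}\left(u^{2} - v^{2}\right).

Specifying the action as u therefore encodes motion and force jointly in a single quantity, which is the property the abstract exploits for imitation learning.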
|
|
17:30-17:45, Paper Mo-S2-T1.7 | |
Learning Implicit Map Representations from Trajectories: An Enhanced Map-Free Framework for Motion Forecasting |
|
Gao, Zhen | Tongji University |
Wang, Liyou | Tongji University |
Xu, Jingning | City University of Hong Kong |
Hang, Peng | Tongji University |
Yu, Rongjie | Tongji University |
Fan, Hongfei | Tongji University |
Keywords: Deep Learning, Application of Artificial Intelligence, Representation Learning
Abstract: With the advancement of autonomous driving technology, trajectory prediction has become a critical task for ensuring traffic safety and intelligent decision-making. Existing motion forecasting models suffer from HD (High-Definition) map dependency, leading to high costs and poor adaptability. Furthermore, their accuracy sharply declines when maps are unavailable, motivating research into map-free alternatives. However, map-free models typically exhibit lower accuracy. To address this issue, we propose a universal enhancement framework that employs trajectory-map contrastive learning, utilizing a trajectory-to-map encoder to extract implicit map representations from raw trajectories, thereby improving performance. Extensive experiments on the Argoverse dataset demonstrate that, after incorporating our trajectory-to-map encoder into map-free models, the average minADE and minFDE are improved by 2.7% and 3.5%, respectively. These results underscore our method's robustness and generalizability in enhancing map-free models, confirming the efficacy of implicit map representation learning and offering a promising solution for HD-map-free autonomous driving in dynamic open-road environments.
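The trajectory-map contrastive objective can be sketched with a standard symmetric InfoNCE loss that pairs each trajectory embedding with the embedding of its own map region; the batch size, embedding dimension, and temperature below are assumptions, and this is a generic formulation rather than the paper's exact loss.

import torch
import torch.nn.functional as F

def info_nce(traj_emb, map_emb, temperature=0.07):
    # Each trajectory embedding should match the map embedding at the same index.
    traj = F.normalize(traj_emb, dim=-1)
    maps = F.normalize(map_emb, dim=-1)
    logits = traj @ maps.t() / temperature           # (B, B) similarity matrix
    targets = torch.arange(traj.size(0), device=traj.device)
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))

loss = info_nce(torch.randn(32, 128), torch.randn(32, 128))
print(loss.item())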
|
|
17:45-18:00, Paper Mo-S2-T1.8 | |
An Explainable Model for Legal Case Matching with Optimal Transport and KAN |
|
Xingxing, Wang | Inner Mongolia Normal University |
Li, Yanling | Inner Mongolia Normal University |
Keywords: Deep Learning, Artificial Social Intelligence
Abstract: This paper proposes a method based on the Optimal Transport algorithm and Kolmogorov-Arnold Networks to address the lack of explainability in current legal case matching. The model first employs the Optimal Transport algorithm to construct semantic and legal feature matrices for case pairs, capturing deep semantic relations and legal feature disparities between the cases. Based on the information in these matrices, the model extracts rationales that support the decision-making process. To further enhance the model's ability to capture complex relationships in legal data, we integrate Kolmogorov-Arnold Networks. By decomposing high-dimensional functions into simple univariate mappings, KAN effectively captures the intricate semantic and legal dependencies between case pairs. Experimental results demonstrate that our approach outperforms four baseline methods on the CAIL and ELAM datasets, improving the F1 score by 5% and 12%, respectively, thus validating the effectiveness of our model.
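The optimal-transport alignment between two cases can be illustrated with the POT library; the embeddings, cost metric, and regularization strength below are assumptions, and the plan shown is only a generic entropic OT alignment, not the paper's feature matrices.

import numpy as np
import ot   # POT: Python Optimal Transport (pip install pot)

# Sentence/term embeddings of the two legal cases (random stand-ins here);
# rows are text units, columns are embedding dimensions.
rng = np.random.default_rng(6)
case_a = rng.normal(size=(6, 32))
case_b = rng.normal(size=(8, 32))

a = np.full(6, 1 / 6)            # uniform mass over units of case A
b = np.full(8, 1 / 8)
M = ot.dist(case_a, case_b, metric="cosine")   # cost matrix

plan = ot.sinkhorn(a, b, M, reg=0.1)           # entropic OT transport plan
# Large entries of `plan` indicate aligned text units, which can serve as
# rationales supporting the matching decision.
print(plan.shape, plan.sum())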
|
|
18:00-18:15, Paper Mo-S2-T1.9 | |
Lightweight Deepfake Detection Application Using SFTNet |
|
Ip, Chun Shing | The Chinese University of Hong Kong |
Wong, Hei Lam | The Chinese University of Hong Kong |
Sum, K. W. | The Chinese University of Hong Kong |
Tsang, Colin S. C. | The Chinese University of Hong Kong |
Keywords: Machine Vision, Deep Learning, Image Processing and Pattern Recognition
Abstract: We present SFTNet, a lightweight (7.1–8.5M parameters) deepfake video detection model that fuses spatial and frequency-domain features to combat advanced face-swapping manipulations. SFTNet employs EfficientNet and MobileNet backbones to extract rich RGB and spectral representations, which are aggregated via average pooling and modeled temporally with bidirectional LSTMs. This architecture captures subtle inconsistencies across frames, enabling effective discrimination between genuine and synthesized videos. Designed for on-device, real-time operation, SFTNet eliminates reliance on cloud processing, making it suitable for social media, video conferencing, messaging, and financial authentication platforms. Trained and evaluated on FF++ and CelebDF, it achieves 93.44% accuracy and 96.34% AUC under diverse test conditions. By delivering low-latency, on-device inference with minimal resource overhead, SFTNet supports ubiquitous deployment, providing a first line of defense against misinformation, scamming, and digital extortion. Its compact, privacy-preserving design bolsters digital security and media authenticity in resource-constrained environments.
|
|
18:15-18:30, Paper Mo-S2-T1.10 | |
A Multimodal Gait-Based Depression Recognition Model with Consistency Training |
|
Huang, Zhuoyong | South China University of Technology |
Ng, Wing Yin | South China University of Technology |
Guo, Zimin | South China University of Technology |
Li, Huakang | Computer Science and Engineering |
Keywords: Deep Learning, AI and Applications, Neural Networks and their Applications
Abstract: Depression is one of the most common psychological disorders. In recent years, gait data-based depression recognition methods have drawn a lot of attention. However, in existing gait-based depression recognition models, the extensive use of dropout leads to non-negligible inconsistencies between the training and inference stages, which may limit further improvements of model performance. In this paper, we propose a multimodal gait-based depression recognition model, which integrates the skeleton and silhouette modalities of gait and incorporates a consistency training method to address the inconsistencies caused by dropout. Experiments conducted on the D-Gait depression dataset demonstrate that the proposed model with consistency training yields better and more stable performance. Additionally, through extended experiments, we provide substantial insights into the trade-offs between the advantages and disadvantages of the consistency training method.
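One common way to realize such dropout consistency training is an R-Drop-style regularizer: two stochastic forward passes of the same batch are pulled together by a symmetric KL term. The sketch below is a generic formulation under that assumption, not necessarily the authors' exact method:

import torch
import torch.nn.functional as F

def consistency_loss(model, x, y, alpha=1.0):
    # Two forward passes share the same input but draw different dropout masks.
    logits1, logits2 = model(x), model(x)
    ce = 0.5 * (F.cross_entropy(logits1, y) + F.cross_entropy(logits2, y))
    p1, p2 = F.log_softmax(logits1, dim=-1), F.log_softmax(logits2, dim=-1)
    # Symmetric KL between the two predictive distributions.
    kl = 0.5 * (F.kl_div(p1, p2.exp(), reduction="batchmean") +
                F.kl_div(p2, p1.exp(), reduction="batchmean"))
    return ce + alpha * kl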
|
|
Mo-S2-T2 |
Hall N |
Application of Artificial Intelligence 2 |
Regular Papers - Cybernetics |
Chair: Han, Yixuan | Nanjing University of Aeronautics and Astronautics |
Co-Chair: Iwashita, Yumi | Jet Propulsion Laboratory, California Institute of Technology |
|
16:00-16:15, Paper Mo-S2-T2.1 | |
Enhancing the Quality of 3D Lunar Maps Using JAXA’s Kaguya Imagery |
|
Iwashita, Yumi | Jet Propulsion Laboratory, California Institute of Technology |
Moe, Haakon | University of Norway |
Cheng, Yang | JPL |
Ansar, Adnan | NASA Jet Propulsion Laboratory |
Georgakis, Georgios | Jet Propulsion Lab |
Stoica, Adrian | NASA Jet Propulsion Laboratory |
Nakashima, Kazuto | Kyushu University |
Kurazume, Ryo | Kyushu University |
Torresen, Jim | University of Oslo |
Keywords: Application of Artificial Intelligence, Neural Networks and their Applications
Abstract: As global efforts to explore the Moon intensify, the need for high-quality 3D lunar maps becomes increasingly critical—particularly for long-distance missions such as NASA's proposed Endurance rover, which aims to traverse 2,000 km across the South Pole–Aitken basin. Kaguya TC images, though globally available at 10 m/pixel, suffer from altitude inaccuracies caused by stereo matching errors and JPEG-based compression artifacts. This paper presents a method to improve the quality of 3D maps generated from Kaguya TC images, focusing on mitigating the effects of compression-induced noise in disparity maps. We analyze the compression behavior of Kaguya TC imagery, and identify systematic disparity noise patterns, especially in darker regions. In this paper, we propose a method to enhance 3D map quality by reducing residual noise in disparity images derived from compressed images. Our experimental results show that the proposed method effectively reduces elevation noise, enhancing the safety and reliability of terrain data for future lunar missions.
|
|
16:15-16:30, Paper Mo-S2-T2.2 | |
Semantic SZZ: Mitigating the Impact of Misclassified Corrective Changes in Just-In-Time Software Defect Prediction |
|
Veras, Ronaldo | Federal University of Pernambuco |
Cabral, George Gomes | Federal Rural University of Pernambuco |
Oliveira, Adriano, Adriano L.I.Oliveira | Universidade Federal De Pernambuco |
Keywords: Application of Artificial Intelligence, AI and Applications, Machine Learning
Abstract: In the evolving landscape of software engineering, accurate identification of defect-inducing commits is critical to improving software quality and reducing development costs. This paper revisits the widely adopted SZZ algorithm, which is utilized for labeling commits as clean or defect-inducing, to address one of its main limitations, i.e., its reliance on outdated corrective commit identification strategies. We propose an innovative approach that integrates the semantic understanding capability of OpenAI's GPT model into the SZZ flow to better interpret commit messages. Our experiments reveal, for some projects, a large number of commits incorrectly interpreted as defect-fixing, consequently leading to the misclassification of commits as defect-inducing. As an example, for the PostgreSQL dataset, the number of defect-inducing commits was reduced by 21% compared to the original SZZ. Furthermore, the results of our experiments strongly suggest that, as a result of the proposed SZZ labeling process, the JIT-SDP problem is more challenging than originally reported by previous works.
|
|
16:30-16:45, Paper Mo-S2-T2.3 | |
Hybrid Scheduling of Periodic and Burst Inference Tasks in Real-Time Edge Systems |
|
Han, Yixuan | Nanjing University of Aeronautics and Astronautics |
Zhang, Tong | Nanjing University of Aeronautics and Astronautics |
Zhu, Kun | Nanjing University of Aeronautics and Astronautics |
Keywords: Application of Artificial Intelligence, AIoT, Deep Learning
Abstract: Deep neural networks (DNNs) have revolutionized multiple generations by harnessing the power of advanced GPUs and extensive datasets. In fields such as audio and video processing, DNN models are deployed on edge servers to achieve millisecond-level latency to meet stringent service-level objectives (SLOs). However, the dynamic and unpredictable nature of real-world applications, characterized by erratic surges in inference requests, poses significant challenges to existing edge systems. To address these challenges, this study presents a novel scheduling algorithm that utilizes deep reinforcement learning to maximize the minimum margins of all GPUs. This strategic approach significantly enhances the system’s capacity to manage unexpected high-demand tasks, thereby improving resilience and overall response capabilities under diverse fluctuating workloads. The experimental results confirm that the proposed algorithm consistently outperforms traditional algorithms, demonstrating superior performance in task completion rate, resource utilization efficiency, robustness, and responsiveness across different load conditions.
|
|
16:45-17:00, Paper Mo-S2-T2.4 | |
MoViE: A Mixture-Of-Experts Multimodal Fusion Model for Cardiovascular Disease Detection |
|
Zhang, Guodao | Hangzhou Dianzi University |
Wei, Cunnan | Hangzhou Dianzi University |
Lu, Yanjie | Hangzhou Dianzi University |
Sun, Hong | Jiaxing University |
Keywords: Application of Artificial Intelligence, Biometric Systems and Bioinformatics, Deep Learning
Abstract: Cardiovascular disease (CVD) remains one of the leading global health challenges, calling for accurate and scalable detection methods. This work presents MoViE, a multimodal fusion model based on sparsely activated experts, designed to effectively fuse electrocardiogram (ECG) and electronic health record (EHR) data for CVD detection. MoViE adopts a dual-branch architecture. In the ECG branch, the feed-forward layers of the Vision Transformer (ViT) are replaced with a Mixture-of-Experts (MoE) module that incorporates top-k routing and shared experts, enabling the model to capture both fine-grained waveform features and long-range temporal dependencies. The EHR branch encodes structured clinical records using TF-IDF, followed by a multilayer perceptron (MLP) to extract semantic representations. To support modality interaction, a MoE-based fusion module is introduced, where a gating network adaptively selects experts to combine complementary features. Extensive experiments on the MIMIC-IV-ECG dataset for myocardial infarction detection demonstrate that MoViE outperforms existing unimodal and multimodal mainstream baselines, achieving 87.98% Accuracy, 90.70% AUC, 66.15% F1-score, and 59.12% MCC, highlighting the potential of MoViE as a general framework for intelligent disease detection.
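The top-k routing with shared experts described above can be sketched as follows; the dense per-expert computation, layer sizes, and single shared expert are simplifying assumptions for illustration rather than the MoViE implementation:

import torch
import torch.nn as nn

class TopKMoE(nn.Module):
    # Minimal sparse MoE sketch: a gate picks the top-k experts per token and a
    # shared expert is always applied. A real implementation would dispatch only
    # the selected tokens to each expert for efficiency.
    def __init__(self, dim=256, n_experts=8, k=2):
        super().__init__()
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(n_experts)])
        self.shared = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.gate = nn.Linear(dim, n_experts)
        self.k = k

    def forward(self, x):                          # x: (tokens, dim)
        gate_probs = self.gate(x).softmax(dim=-1)  # (tokens, n_experts)
        topk_val, topk_idx = gate_probs.topk(self.k, dim=-1)
        # Sparse routing weights: zero for experts outside each token's top-k.
        weights = torch.zeros_like(gate_probs).scatter(-1, topk_idx, topk_val)
        expert_out = torch.stack([e(x) for e in self.experts], dim=1)  # (tokens, n_experts, dim)
        routed = (weights.unsqueeze(-1) * expert_out).sum(dim=1)
        return self.shared(x) + routed             # shared expert is always active

y = TopKMoE()(torch.randn(10, 256))                # (10, 256)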
|
|
17:00-17:15, Paper Mo-S2-T2.5 | |
Dynamic Priority-Aware Joint Optimization for Multi-UAV Path Planning and Task Offloading in Mobile Edge Computing |
|
Zhu, Yaolin | Nanjing University of Aeronautics and Astronautics |
Li, Xin | Nanjing University of Aeronautics and Astronautics |
Qin, Xiaolin | Nanjing University of Aeronautics and Astronautics |
Keywords: Application of Artificial Intelligence, Cloud, IoT, and Robotics Integration, AIoT
Abstract: We investigate the problem of path planning and task offloading for UAV clusters in a UAV-assisted edge computing scenario. UAVs autonomously make decisions regarding path planning, continuous service provision, and task offloading based on collected information. In this setting, terminal equipment (TE) cannot directly connect to servers; thus, UAVs act as both edge servers and communication relays, proactively providing services to TEs. We construct a fine-grained temporal scale model that decomposes UAV actions into atomic time units, transforming decision-making on specific behaviors into state transition decisions. This approach better accommodates the needs of time-varying environments. Regarding path planning, given the time-sensitive nature of TE requests, we focus on how to provide stable and timely computational services to TEs. We propose a multi-agent reinforcement learning algorithm capable of dynamically sensing task priorities to enhance Quality of Service (QoS), with optimizations made in terms of task completion rate, UAV energy consumption, and processing delay. For the task offloading decision, we introduce a dual-keyword-based offloading algorithm to optimize the binary offloading process. Finally, we conduct simulation experiments to demonstrate the effectiveness of the proposed algorithms, and comparative experiments confirm their superiority.
|
|
17:15-17:30, Paper Mo-S2-T2.6 | |
CLMN: Concept Based Language Models Via Neural Symbolic Reasoning |
|
Yang, Yibo | HKUST |
Keywords: Application of Artificial Intelligence, Deep Learning
Abstract: Deep learning’s remarkable performance in natural language processing (NLP) faces critical interpretability challenges, particularly in high-stakes domains like healthcare and finance where model transparency is essential. While concept bottleneck models (CBMs) have enhanced interpretability in computer vision by linking predictions to human-understandable concepts, their adaptation to NLP remains understudied with persistent limitations. Existing approaches either enforce rigid binary concept activations that degrade textual representation quality or obscure semantic interpretability through latent concept embeddings, while failing to capture dynamic concept interactions crucial for understanding linguistic nuances like negation or contextual modification. This paper proposes the Concept Language Model Network (CLMN), a novel neural-symbolic framework that reconciles performance and interpretability through continuous concept embeddings enhanced by fuzzy logic-based reasoning. CLMN addresses the information loss in traditional CBMs by projecting concepts into an interpretable embedding space while preserving human-readable semantics, and introduces adaptive concept interaction modeling through learnable neural-symbolic rules that explicitly represent how concepts influence each other and final predictions. By supplementing original text features with concept-aware representations and enabling automatic derivation of interpretable logic rules, our framework achieves superior performance on multiple NLP benchmarks while providing transparent explanations. Extensive experiments across various pre-trained language models and datasets demonstrate that CLMN outperforms existing concept-based methods in both accuracy and explanation quality, establishing a new paradigm for developing high-performance yet interpretable NLP systems through synergistic integration of neural representations and symbolic reasoning in a unified concept space.
|
|
17:30-17:45, Paper Mo-S2-T2.7 | |
A Loss Weighting Algorithm Based on In-Batch Positive Passage Rankings for Dense Retrievers |
|
Wang, Zihan | Zhengzhou University |
Li, Muyang | Zhengzhou University |
Chen, Yijun | Zhengzhou University |
Yang, Zhihao | Zhengzhou University |
Qiao, Yiming | Zhengzhou University |
Li, Xinyi | Zhengzhou University |
Chen, Sijia | Zhengzhou University |
Ji, Bo | Zhengzhou University |
Keywords: Application of Artificial Intelligence, Deep Learning
Abstract: In the domain of dense retrieval, training with hard negatives is widely used. However, the existence of a large number of false negatives has given rise to some hard-to-train queries, which often have a significant impact on the loss. The training process overemphasizes these queries, which account for a disproportionately high share of the training loss, while neglecting the positive examples with moderate rankings, thus resulting in suboptimal model outcomes. We theoretically analyze the excessive influence of hard-to-train queries and introduce an in-batch loss weighting algorithm that adaptively assigns each query a weight based on the in-batch ranking position of its positive document. Specifically, our method calculates the in-batch ranking positions of the similarity between positive examples and queries, constructs query weights based on the Gaussian distribution, and balances the attention in the training process. This strategy mitigates the effects of false negatives while retaining the benefits of hard negatives. Experiments on the MS-MARCO dataset show statistically significant improvements: a 1.3% increase in MRR@10, higher Recall@1000, and consistent NDCG@10 gains on the TREC test set across diverse hard negative sources. The relevant code has been open-sourced on GitHub: https://github.com/TRcreeper/IBRWLoss.
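A rough sketch of the weighting idea follows; centering the Gaussian on the mean in-batch rank and the bandwidth choice are illustrative assumptions, and the paper's exact parameterization may differ:

import torch

def in_batch_rank_weights(scores, pos_idx, mu=None, sigma=None):
    # scores: (n_queries, n_passages) similarity matrix for one batch.
    # pos_idx: (n_queries,) LongTensor, column of each query's positive passage.
    pos_scores = scores.gather(1, pos_idx[:, None])           # (n_queries, 1)
    ranks = (scores > pos_scores).sum(dim=1).float()          # 0 = positive ranked first
    mu = ranks.mean() if mu is None else mu
    sigma = ranks.std().clamp_min(1.0) if sigma is None else sigma
    weights = torch.exp(-0.5 * ((ranks - mu) / sigma) ** 2)   # emphasize moderate ranks
    return weights / weights.sum() * len(weights)             # normalize to mean weight 1

def weighted_contrastive_loss(scores, pos_idx):
    per_query = torch.nn.functional.cross_entropy(scores, pos_idx, reduction="none")
    return (in_batch_rank_weights(scores, pos_idx).detach() * per_query).mean()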
|
|
17:45-18:00, Paper Mo-S2-T2.8 | |
PETformer: Prototype-Enhanced Transformer with Multi-Scale Convolution Attention for Multivariate Time Series Anomaly Detection |
|
Liu, Han | XiaMen University |
Ren, Qianchen | Xiamen University |
Tang, Yuliang | XiaMen University |
Li, Shao zi | XiaMen University |
Keywords: Application of Artificial Intelligence, Deep Learning, Artificial Social Intelligence
Abstract: Unsupervised anomaly detection in multivariate time series is crucial for industrial monitoring and IoT device management. However, existing methods still face significant challenges in modeling complex temporal patterns. Although Transformer models excel at point-wise representation learning, their attention mechanisms mainly focus on direct dependencies between time steps, making it difficult to capture anomalies at the segment level. To address these issues, we propose PETformer, a prototype-enhanced multi-scale unsupervised Transformer architecture. PETformer introduces a Multi-Head Scale Convolution Attention (MHSCA) module, which employs parallel convolution kernels to extract multi-scale features. This design enhances the model's ability to capture both local and global dependencies. In addition, it incorporates Multi-Scale Cosine Prototypes (MSCP) as an inductive bias that interacts with the input sequence, strengthening the prior modeling of normal patterns and improving the model's sensitivity to potential pattern deviations. Experimental results show that PETformer consistently outperforms existing state-of-the-art unsupervised methods.
|
|
18:00-18:15, Paper Mo-S2-T2.9 | |
AlignGenRec: Aligning Collaborative and Textual Knowledge for Generative Recommendation with LLMs |
|
Xing, Jinshuo | Xi'an University of Posts and Telecommunications |
Li, Xiaoge | Xi'an University of Posts and Telecommunications |
Ma, Yanan | Xi'an University of Posts AndTelecommunications |
Ren, Yunsheng | Xi'an University of Posts and Telecommunications |
Keywords: Application of Artificial Intelligence, Deep Learning, Expert and Knowledge-Based Systems
Abstract: Recommender systems based on collaborative filtering (CF) effectively model user-item interactions but struggle with data sparsity and cold-start issues. On the other hand, large language models (LLMs) offer strong semantic understanding but fail to capture structured user-item relationships. To bridge this gap, we propose AlignGenRec, a framework that integrates collaborative and textual knowledge for generative recommendation. Specifically, we introduce an embedding alignment mechanism that aligns item embeddings from a pre-trained CF model with text embeddings from item descriptions. These aligned embeddings are then transferred to the LLM without requiring fine-tuning, enabling structured knowledge integration while preserving generative capabilities. Additionally, constrained sequence decoding ensures that generated recommendations correspond to valid items, improving recommendation accuracy. Experimental results demonstrate that AlignGenRec outperforms both CF-based and LLM-based baselines, particularly in cold-start scenarios, with performance improvements of 7-11% across three benchmark datasets. Beyond recommendation tasks, AlignGenRec also supports preference prediction and user profiling, highlighting its versatility in real-world applications.
|
|
18:15-18:30, Paper Mo-S2-T2.10 | |
Dual Weighting Attention Feature Fusion Network for Lane Detection |
|
Yao, Xin-Wei | Zhejiang University of Technology |
Liu, Liwei | College of Computer Science and Technology College of Software, |
Zhang, Yuchen | Zhejiang University of Technology |
Li, Qiang | Zhejiang University of Technology |
Pan, Yue | Zhejiang University of Technology |
Keywords: Deep Learning, Image Processing and Pattern Recognition, Application of Artificial Intelligence
Abstract: Lane detection plays a crucial role in autonomous driving. Though modern anchor-based deep lane detection methods have demonstrated remarkable performance on standard benchmarks, they continue to struggle with complex topological variations. In this work, we propose the Dual Weighting Attention Feature Fusion Network (DWAFFNet), which enhances detection performance through two key innovations: (1) hierarchical feature fusion for improved localization and (2) consistency optimisation between each anchor’s IoU and classification scores. In computer vision, shallow low-level features provide precise spatial localization while deep high-level features capture essential global information. Therefore, complementary global-local representations for enhanced lane detection accuracy are established via the Iterative Coordinate Attention Feature Fusion (ICAFF) module, which systematically combines these hierarchical features through coordinate-sensitive attention mechanisms. Furthermore, we introduce the Dual Weighting Label Assignment Scheme (DW Scheme) to align IoU and classification scores through importance-aware dynamic weighting, significantly improving sample discrimination capability. We evaluate our method on two lane detection benchmarks and the results demonstrate its effectiveness. On CULane, our method surpasses the baseline, obtaining 56.45 mF1 with 64.21/55.76/22.16 F1@75/80/90 scores and outperforming CLRNet by 1.52%/2.27%/3.06%/7.4%, respectively. Significant improvements are observed across most scenarios in complex road conditions.
|
|
Mo-S2-T3 |
Room 0.11 |
Machine Vision |
Regular Papers - Cybernetics |
Chair: Lei, Jixiang | Graz University of Technology |
Co-Chair: Thornton, Callum John | National Institute of Advanced Industrial Science and Technology (AIST) |
|
16:00-16:15, Paper Mo-S2-T3.1 | |
AugCount: Test-Time Semantic Augmentation Via Diffusion for General Open-World Object Counting |
|
Shi, Ziqiang | Fujitsu R&D Center, Co. Ltd |
Liu, Rujie | Fujitsu Research & Development Center |
Takahashi, Jun | Fujitsu Limited |
Jiang, Shan | Fujitsu Research, FUJITSU LIMITED |
Keywords: Machine Vision, Image Processing and Pattern Recognition, Multimedia Computation
Abstract: Open-world general object counting is a critical task in computer vision, with applications in image understanding, environmental monitoring, and surveillance. Traditional methods rely on large-scale annotated data, which is expensive and time-consuming to obtain, and often fail to manage diverse object categories and complex scenes. To address this, we propose AugCount, a novel framework that enhances open-world object counting by generating high-quality, diverse synthetic data during testing. For the first time, we employ a diffusion model to produce conditional images based on density maps specifying object locations for general object counting. Our framework is highly versatile and adaptable to various counting tasks. Experiments demonstrate that AugCount significantly improves performance on benchmark datasets like FSC-147, reducing the average counting error by 1 per image and achieving a state-of-the-art MAE of 4.70. AugCount effectively addresses data scarcity and model generalization challenges, offering enhanced robustness and adaptability for practical counting systems.
|
|
16:15-16:30, Paper Mo-S2-T3.2 | |
Generating a Synthetic Dataset for Hand Pose Estimation Training |
|
Thornton, Callum John | National Institute of Advanced Industrial Science and Technology |
Tada, Mitsunori | National Institute of Advanced Industrial Science and Technology |
Maruyama, Tsubasa | National Institute of Advanced Industrial Science and Technology |
Keywords: Machine Vision, Machine Learning, AI and Applications
Abstract: Recent years have seen the rise of many competent pose estimation models. Unfortunately, many of these still rely on manually labelled training data. With many arguing the importance of larger and more diverse pools of training data, this study proposes the use of virtually created hand images for pose estimation training. Synthetically generating virtual pose data provides a faster and more accurate method for creating labelled training data, given the tedious and taxing nature of manual labelling. This study employs 3D modelling software to simulate and record randomised, realistic human hand postures. For this, virtual hands were recorded with ever-changing joint angles and randomised backgrounds. The created data were used to retrain an existing YOLOv11 pose estimation network. The model trained with solely synthetic data was able to detect hands in real-world images, though it scored a low accuracy for these images. Augmenting an existing dataset with these synthetic data significantly reduced the estimation error, showcasing a 54.0% increase in accuracy compared to a model trained without the addition of these data. The results achieved show promise for the utilisation of synthetic data in the training of pose estimation models.
|
|
16:30-16:45, Paper Mo-S2-T3.3 | |
Ultra-Lightweight Eye State Detection for Resource-Constrained Devices |
|
Otto, Mike | Hochschule Hannover |
Will, Jens Christian | University of Applied Science Hannover |
Homann, Hanno | Hannover University of Applied Sciences |
Keywords: Machine Vision, Machine Learning, Application of Artificial Intelligence
Abstract: Eye state detection plays a vital role in safety-critical applications such as driver monitoring systems and medical diagnostic or therapeutic devices. This study explores ultra-lightweight convolutional neural network architectures tailored for real-time deployment on resource-constrained embedded platforms, where power and processing capabilities are severely limited. Unlike traditional approaches that rely on complex models or post-processing techniques, our solution emphasizes low-latency inference while maintaining high classification accuracy. To achieve this, we combine efficient eye region extraction based on facial landmarks with a custom-designed CNN featuring just over 260,000 parameters. The model is evaluated on the MRL Eye State dataset and demonstrates competitive results, surpassing existing architectures in terms of inference speed and model size. Our implementation is validated on a Raspberry Pi 5 using the Camera Module 3, highlighting the system’s practical applicability in embedded environments. This research is part of the DeepLightAI project, where continuous eye monitoring is required to ensure patient safety under high-intensity medical lighting conditions. The results show that lightweight models can effectively bridge the gap between performance and efficiency, offering a viable path for real-world, real-time eye state detection in embedded applications.
|
|
16:45-17:00, Paper Mo-S2-T3.4 | |
Specific Proposal Feature R-CNN with Hybrid-Residual Feature Pyramid Network |
|
Yao, Xin-Wei | Zhejiang University of Technology |
Pan, Yue | Zhejiang University of Technology |
Li, Qiang | Zhejiang University of Technology |
Liu, Liwei | College of Computer Science and Technology College of Software, |
Keywords: Machine Vision, Machine Learning, Application of Artificial Intelligence
Abstract: In the field of object detection, the Feature Pyramid Network (FPN) is widely adopted in detection algorithms because of its simple, efficient, and robust feature generation capability. Despite its advantages, the Feature Pyramid Network exhibits shortcomings in its architectural design. In this paper, we aim to dissect the Feature Pyramid Network and introduce a new architecture, called the Hybrid-Residual Feature Pyramid Network (HR-FPN), designed to effectively mitigate these identified problems. HR-FPN mainly consists of two key modules: a Hybrid-Operation Module and a Residual Feature Enhancement Module. The Hybrid-Operation Module effectively integrates information from high-level features into low-level features, while the Residual Feature Enhancement Module mitigates the information loss in the feature maps of the top pyramid level by extracting scale-invariant contextual information. We then design an enhanced version of Sparse R-CNN, called Specific Proposal Feature R-CNN (SPF R-CNN), which integrates learnable proposal classification features and learnable proposal regression features to effectively alleviate the misalignment between classification and localization.
|
|
17:00-17:15, Paper Mo-S2-T3.5 | |
TTTFormer: Token Pruning-Based Spatio-Temporal Topology Transformer for 3D Human Pose Estimation |
|
Li, Yifu | Central South University |
Yu, Jiaxuan | Central South University |
Chen, Zhigang | Central South University |
Keywords: Machine Vision, Machine Learning, Deep Learning
Abstract: In recent years, Transformer-based methods have remained the dominant approach for 3D human pose estimation (3D HPE). While these methods have achieved continuous improvements in accuracy, the inherently high computational complexity of Transformer modules makes current 3D HPE models excessively resource-intensive, rendering them impractical for deployment on resource-constrained devices. To address this challenge, we propose the Token Pruning-based Spatio-Temporal Topology Transformer (TTTFormer), which consists of a Spatio-Temporal Topology Encoder (STTE) and a Token Pruning-based Spatio-Temporal Decoder. The STTE incorporates Spatial Position Embedding (SPE) to model spatial correlations between joints and a Temporal Topology Block (TTB) to effectively capture the temporal relationships of the same joint across frames. These components enable the STTE to extract richer spatio-temporal features. Furthermore, to mitigate the high computational cost of transformers when handling long sequences, we introduce a lightweight design in the Token Pruning-based Spatio-Temporal Decoder, which preserves representative tokens while discarding redundant ones that contribute disproportionately to computational overhead. This approach achieves a balance between efficiency and accuracy in complex 3D HPE tasks. Extensive experiments on the Human3.6M dataset demonstrate that our model significantly enhances both efficiency and precision. The results show that our model achieves an MPJPE of 40.8 mm while requiring only 364M FLOPs per frame.
|
|
17:15-17:30, Paper Mo-S2-T3.6 | |
Accurate Monitoring of the Slagging During Converter Tapping |
|
Pang, Shuyang | CISDI Information Technology CO., LTD |
Liu, Jingsheng | CISDI Information Technology CO., LTD |
Yan, Mingtao | Qingdao University |
Zhang, Xiaohui | CISDI Information Technology CO., LTD |
Li, Qiang | CISDI Information Technology CO., LTD |
Luo, Qianhao | CISDI Information Technology CO., LTD |
Li, Weiling | Dongguan University of Technology |
Keywords: Machine Vision, Neural Networks and their Applications, AI and Applications
Abstract: Monitoring the slagging status during the tapping process of a converter is an important task in the metallurgical process. Currently, the monitoring of the slagging process during converter tapping mainly relies on manual visual inspection methods, which suffer from problems such as low efficiency, strong subjectivity, and low accuracy. Fortunately, monitoring of the slagging during converter tapping is essentially an object detection task. Considering the harsh scenarios and tiny targets, the object detection model for this task should be carefully designed. To address this issue, this paper proposes an efficient object detection model named ESD-YOLO based on an improved YOLO algorithm. ESD-YOLO aims to enhance detection accuracy and real-time performance, especially for small targets such as minor slag streams in complex scenarios. It integrates feature guidance from DINOv2, to strengthen fundamental feature learning capabilities, while introducing an efficient channel attention module to optimize multi-scale feature information fusion. Additionally, the NWD loss function is employed to further elevate focus on small target features. Experimental results demonstrate that ESD-YOLO achieves 96.8% mAP and 92.6 FPS on an actual converter slagging dataset, outperforming mainstream object detection models and reaching state-of-the-art performance.
|
|
17:30-17:45, Paper Mo-S2-T3.7 | |
PromptCAL: Entropy-Calibrated and Prompt-Tuned Test-Time Adaptation for Semantic Segmentation |
|
Lei, Jixiang | Graz University of Technology |
Pernkopf, Franz | Graz University of Technology |
Keywords: Transfer Learning, Machine Vision, Deep Learning
Abstract: Test-time adaptation (TTA) aims to improve the robustness of segmentation models to an unlabeled target domain without requiring access to the source data. While existing TTA methods have achieved promising results on image classification, they often fail to translate effectively to semantic segmentation due to the spatial complexity and fine-grained nature of dense predictions. We propose PromptCAL, a lightweight and effective TTA framework tailored for semantic segmentation, built upon the SegFormer architecture. Our method addresses two central challenges: (1) Which model component to adapt remains underexplored. Using Grad-CAM visualization and sensitivity analysis, we identify Stage 2 of the transformer backbone as the most domain-sensitive and restrict adaptation to this stage. (2) How to identify reliable supervision during adaptation is critical. We introduce a confidence-aware self-training mechanism based on per-pixel entropy filtering to guide pixel selection for model adaptation, ensuring label quality and model transferability. In addition, we incorporate lightweight prompt injection to enhance the adaptability of mid-level features. Our method achieves competitive improvements over the state-of-the-art while maintaining high adaptation efficiency and significantly reducing runtime overhead. Extensive experiments on corrupted semantic segmentation benchmarks, including ACDC (A-fog, A-night, A-rain, and A-snow), Cityscapes-foggy (CS-fog) and Cityscapes-rainy (CS-rain) demonstrate that PromptCAL achieves comparable or superior accuracy to state-of-the-art TTA baselines, while reducing adaptation time by over 50% per domain. This makes it a practical solution for efficient TTA in smart cities and edge-deployed vision systems. The source code is available at https://github.com/ml4papers/PromptCAL.
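The confidence-aware self-training step can be illustrated with per-pixel entropy filtering, where only low-entropy pixels contribute pseudo-labels; expressing the threshold as a fraction of the maximum entropy log C is an illustrative assumption:

import math
import torch

def entropy_filtered_pseudo_labels(logits, max_entropy_ratio=0.5):
    # logits: (B, C, H, W) segmentation outputs on target-domain images.
    # Returns pseudo-labels with unreliable pixels set to the ignore index.
    probs = logits.softmax(dim=1)
    entropy = -(probs * probs.clamp_min(1e-8).log()).sum(dim=1)   # (B, H, W)
    max_entropy = math.log(logits.size(1))                        # upper bound log C
    reliable = entropy < max_entropy_ratio * max_entropy
    pseudo = probs.argmax(dim=1)
    pseudo[~reliable] = 255           # ignore index for the segmentation loss
    return pseudo, reliable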
|
|
17:45-18:00, Paper Mo-S2-T3.8 | |
An Easily Deployable Image Dehazing Model for Industrial Sites |
|
Xiao, Xuewen | CISDI Information Technology CO., LTD |
Luo, Qianhao | CISDI Information Technology CO., LTD |
Cao, Xin | CISDI Information Technology CO., LTD |
Pang, Shuyang | CISDI Information Technology CO., LTD |
Zhang, Xiaohui | CISDI Information Technology CO., LTD |
Li, Qiang | CISDI Information Technology CO., LTD |
Li, Weiling | Dongguan University of Technology |
Keywords: Neural Networks and their Applications, Machine Vision, Application of Artificial Intelligence
Abstract: Complex deep networks, such as transformers, have shown better dehazing performance. However, their practicability is restricted by low adaptability to industrial scenarios. It is necessary to realize an effective and lightweight model that can be applied at industrial sites. For this purpose, an effective image dehazing model named MFDehaz-Net is proposed. MFDehaz-Net has two specially designed components, i.e., a Multi-scale Fusion Block and a Point-Depthwise Block, which help it achieve a deep fusion of image features at different scales to obtain better global understanding. Besides, to reduce the damage to the original colors of the image caused by dehazing, MFDehaz-Net integrates a supervision signal in the frequency domain through a specially designed loss function. Experimental results demonstrate that MFDehaz-Net outperforms SOTA models in terms of dehazing ability with much shorter inference time.
|
|
18:00-18:15, Paper Mo-S2-T3.9 | |
Grounded Multi-Modal Conversation for Zero-Shot Visual Question Answering |
|
Zarei, Mohammad Reza | Carleton University |
Akkasi, Abbas | Carleton University |
Komeili, Majid | Carleton University |
Keywords: Machine Vision, Application of Artificial Intelligence, Computational Intelligence
Abstract: Zero-shot visual question answering (VQA) poses a formidable challenge at the intersection of computer vision and natural language processing. Traditionally, this problem has been tackled using end-to-end pre-trained vision-language models (VLMs). However, recent advancements in large language models (LLMs) demonstrate their exceptional reasoning and comprehension abilities, making them valuable assets in multi-modal tasks, including zero-shot VQA. LLMs have been previously integrated with VLMs to solve zero-shot VQA in a conversation-based approach. However, while the focus in VQA tasks is often on specific regions rather than the entire image, this aspect has been overlooked in previous approaches. Consequently, the overall performance of the framework relies on the ability of the pre-trained VLM to locate the region of interest that is relevant to the requested visual information within the entire image. To address this challenge, this paper proposes Grounded Multi-modal Conversation for Zero-shot Visual Question Answering (GMC-VQA), a region-based framework that leverages the complementary strengths of LLMs and VLMs in a conversation-based approach. We employ a grounding mechanism to refine visual focus according to the semantics of the question and foster collaborative interaction between VLM and LLM, effectively bridging the gap between visual and textual modalities and enhancing comprehension and response generation for visual queries. We evaluate GMC-VQA across three diverse VQA datasets, achieving substantial average improvements of 10.04% over end-to-end VLMs and 2.52% over the state-of-the-art VLM-LLM communication-based framework, respectively. Our code is publicly available at https://github.com/mrzarei5/GMC-VQA.
|
|
18:15-18:30, Paper Mo-S2-T3.10 | |
WEMT: Wavelet-Enhanced Multiscale Transformer for Robust Dynamic Gesture Recognition |
|
Teng, Kunjiang | Qing Dao University |
An, Xinyu | Qingdao Menaul School |
Cheng, Zesheng | College of Computer Science and Technology, Qingdao University, |
Keywords: Machine Vision
Abstract: Dynamic gesture recognition faces critical challenges due to complex backgrounds, multi-scale feature variations, and inefficient multi-modal fusion. To address these challenges, this study proposes the Wavelet-Enhanced Multiscale Transformer (WEMT), a unified framework that integrates three key innovations: wavelet-based feature decomposition for noise-robust detail enhancement, hierarchical attention mechanisms for global-local spatiotemporal modeling, and confidence-driven adaptive fusion for heterogeneous modality integration. The wavelet domain module selectively amplifies discriminative gesture features through frequency-domain analysis and multi-scale pooling, while the pyramid-structured attention heads capture both coarse-grained postures and fine-grained motions. The adaptive fusion framework dynamically prioritizes complementary modalities through entropy-based gating, ensuring robustness in diverse environments. Evaluations on NVGesture and Briareo datasets demonstrate state-of-the-art performance across single-modal and multi-modal configurations, with notable advantages in low-light scenarios. The model also maintains computational efficiency, making it suitable for real-world human-computer interaction applications. This work provides a systematic and scalable solution for accurate dynamic gesture recognition, bridging the gap between theoretical advances and practical deployment.
|
|
Mo-S2-T4 |
Room 0.12 |
Decision Support Systems 2 |
Regular Papers - SSE |
Chair: González-Quesada, Juan Carlos | University of Granada |
Co-Chair: dos Santos Nascimento, Francimaria Rayanne | Universidade Federal De Pernambuco |
|
16:00-16:15, Paper Mo-S2-T4.1 | |
Estimating Missing Values in Fuzzy Preference Relations through Different Information Granularity Allocation Protocols: An Analysis |
|
González-Quesada, Juan Carlos | University of Granada |
Trillo, José Ramón | University of Granada |
López-Herrera, Antonio Gabriel | University of Granada |
Herrera Viedma, Enrique | University of Granada (Spain) |
Cabrerizo, Francisco Javier | University of Granada |
Keywords: Decision Support Systems
Abstract: In Granular Computing, a key approach has been the handling of incomplete fuzzy preference relations through an allocation of information granularity and its corresponding optimization procedure. This methodology enables the transformation of numerical models into granular versions, offering a more accurate representation of reality by recovering missing information. In decision-making contexts involving fuzzy preference relations, it has played a crucial role in advancing procedures for estimating incomplete information. However, although several granularity allocation protocols have been proposed, only one has been employed so far: the one based on a uniform and symmetric allocation of information granularity. To address this gap, we aim to determine the efficacy of existing protocols for allocating information granularity in estimating missing values of incomplete fuzzy preference relations. Numerical tests are presented to demonstrate the effectiveness of each protocol.
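As a rough illustration of the mechanism under study (not one of the specific protocols compared in the paper), a uniform and symmetric allocation turns each known preference value into an interval of width g, and a missing value can then be estimated through the additive-consistency relation p_ik = p_ij + p_jk - 0.5:

import numpy as np

def granular_interval(p, g):
    # Uniform symmetric allocation: value p in [0, 1] becomes an interval of width g.
    return max(0.0, p - g / 2.0), min(1.0, p + g / 2.0)

def estimate_missing(p_ij, p_jk, g=0.2, samples=11):
    # Estimate a missing p_ik via additive consistency, scanning the granular
    # intervals of the two known values (illustrative, not an optimised protocol).
    lo_ij, hi_ij = granular_interval(p_ij, g)
    lo_jk, hi_jk = granular_interval(p_jk, g)
    candidates = [a + b - 0.5
                  for a in np.linspace(lo_ij, hi_ij, samples)
                  for b in np.linspace(lo_jk, hi_jk, samples)]
    return float(np.clip(np.mean(candidates), 0.0, 1.0))

print(estimate_missing(0.7, 0.6))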
|
|
16:15-16:30, Paper Mo-S2-T4.2 | |
HSFN: Hierarchical Selection for Fake News Detection Building Heterogeneous Ensemble |
|
Bandeira Coutinho, Sara | Universidade Federal De Pernambuco |
Menelau Oliveira e Cruz, Rafael | École De Technologie Supérieure |
dos Santos Nascimento, Francimaria Rayanne | Universidade Federal De Pernambuco |
Cavalcanti, George | Universidade Federal De Pernambuco |
Keywords: Decision Support Systems
Abstract: Psychological biases, such as confirmation bias, make individuals particularly vulnerable to believing and spreading fake news on social media, leading to significant consequences in domains such as public health and politics. Machine learning–based fact-checking systems have been widely studied to mitigate this problem. Among them, ensemble methods are particularly effective in combining multiple classifiers to improve robustness. However, their performance heavily depends on the diversity of the constituent classifiers: selecting genuinely diverse models remains a key challenge, especially when models tend to learn redundant patterns. In this work, we propose a novel automatic classifier selection approach that prioritizes diversity and additionally accounts for performance. The method first computes pairwise diversity between classifiers and applies hierarchical clustering to organize them into groups at different levels of granularity. A HierarchySelect step then explores these hierarchical levels to select one pool of classifiers per level, each representing a distinct intra-pool diversity. From these, the most diverse pool is identified and selected for ensemble construction. The selection process incorporates an evaluation metric reflecting each classifier’s performance to ensure the ensemble also generalises well. We conduct experiments with 40 heterogeneous classifiers across six datasets from different application domains and with varying numbers of classes. Our method is compared against the Elbow heuristic and state-of-the-art baselines. Results show that our approach achieves the highest accuracy on three of six datasets. The implementation details are available in the project’s repository: https://github.com/SaraBCoutinho/HSFN.
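The diversity-driven selection can be illustrated with a pairwise disagreement matrix and SciPy's hierarchical clustering; the disagreement measure, average linkage, and first-member pick below are illustrative choices rather than the HSFN algorithm itself:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def diverse_pool(predictions, n_clusters):
    # predictions: (n_classifiers, n_samples) label predictions on a validation set.
    # Returns indices of one classifier per cluster.
    n = predictions.shape[0]
    disagreement = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            d = np.mean(predictions[i] != predictions[j])   # pairwise diversity
            disagreement[i, j] = disagreement[j, i] = d
    Z = linkage(squareform(disagreement), method="average")
    labels = fcluster(Z, t=n_clusters, criterion="maxclust")
    # Take the first member of each cluster; a performance-aware choice could be used instead.
    return [int(np.where(labels == c)[0][0]) for c in np.unique(labels)]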
|
|
16:30-16:45, Paper Mo-S2-T4.3 | |
MEVE-FN: An Adaptive Learning Framework for Multimodal Fake News Detection Using Mixture-Of-Experts |
|
Kang, Xiaoyu | Institute of Information Engineering,Chinese Academy of S |
Shi, Zhixin | Institute of Information Engineering, Chinese Academy of Science |
Keywords: Decision Support Systems, Discrete Event Systems, Enterprise Information Systems
Abstract: The rapid dissemination of multimodal fake news on social media platforms poses a significant challenge, as deceptive narratives often combine text and images to mislead users. To address this, we propose MEVE-FN, a novel fake news detection framework built upon a Mixture-of-Experts (MoE) architecture. Our model leverages modality-specific vision and language experts to extract specialized features. These features are then dynamically integrated using a Laplace Gating mechanism for adaptive fusion and a MEVE-Adapter module that enhances deep cross-modal interaction. Comprehensive experiments on three public benchmarks (Weibo, Weibo21, and Twitter) demonstrate that MEVE-FN consistently outperforms strong baseline models, achieving notable improvements in detection accuracy and robustness. Furthermore, ablation studies validate the effectiveness of our proposed components, confirming the critical contribution of both the Laplace gating and the MEVE-Adapter to the model's superior performance.
|
|
16:45-17:00, Paper Mo-S2-T4.4 | |
Knowledge Management and Technology-Driven Corporates' Sustainability: A Rapid Review |
|
Belchior, Paloma | Universidade Federal Do Rio De Janeiro |
Nóbrega, Lucas | Universidade Federal Do Rio De Janeiro |
Martinez, Luiz Felipe | Universidade Federal Do Rio De Janeiro |
Argôlo, Matheus | Universidade Federal Do Rio De Janeiro |
Barbosa, Carlos Eduardo | Universidade Federal Do Rio De Janeiro |
Souza, Jano | Federal University of Rio De Janeiro |
Keywords: Decision Support Systems, Enterprise Information Systems, Service Systems and Organizations
Abstract: In light of the increasing recognition of environmental and social challenges, technology-driven corporations often face significant barriers in aligning their innovation-driven goals with sustainable practices. Despite their capacity for rapid technological advancements, these companies struggle to integrate sustainability objectives into their core strategies due to competing priorities and a lack of cohesive frameworks. This study explores Knowledge Management's (KM) capacity to facilitate and implement sustainable practices. We use the Rapid Review methodology to systematically review the literature and achieve this objective. A total of 35 articles were analyzed, examining factors related to organizational knowledge, the adoption of KM within technology-driven companies, and effective sustainable practices. The results highlight the importance of KM in facilitating and implementing sustainable practices, showing that technology-driven companies that invest in organizing their knowledge can improve their effectiveness in implementing sustainable initiatives, increase their resilience and competitiveness, and strengthen trust with stakeholders.
|
|
17:00-17:15, Paper Mo-S2-T4.5 | |
MALE: A Multi-Objective Evaluation Method for AI Mobility Services across the Cloud-Edge-Device Continuum |
|
Lee, Junhee | Electronics and Telecommunications Research Institute (ETRI) |
Shin, Yong-Jun | Electronics and Telecommunications Research Institute |
Kang, Sungjoo | Electronics and Telecommunications Research Institute |
Keywords: Decision Support Systems, Infrastructure Systems and Services, Distributed Intelligent Systems
Abstract: Deploying AI services on battery-powered mobility platforms such as autonomous vehicles, mobile robots, and large scale IoT sensor networks requires determining the most suitable execution environment for each workload across the cloud, edge, and device computing options. Because every placement option imposes different trade‑offs among Accuracy, Latency, and Energy efficiency (ALE), stakeholders face a difficult, mission‑critical decision that existing studies seldom address in a holistic, mission‑aware manner. To fill this gap, we introduce the Mission‑driven ALE (MALE) evaluation method. MALE couples ALE metrics with explicit mission objectives by allowing analysts to apply customizable weights to each criterion. The evaluation results are aggregated and visualized as heatmaps, helping transform a previously heuristic and opaque placement decision (black-box) into a more transparent and interpretable process (white-box). We examine the applicability of MALE through three representative case studies: Autonomous Vehicles, Real-Time Robotics, and IoT Sensor Networks, each reflecting distinct ALE priorities. By supplying a structured, mission‑aware decision‑support method, MALE strengthens stakeholder confidence and accelerates the optimization of AI service placement across the cloud–edge–device continuum, offering a practical foundation for future validation in real‑world deployments.
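The mission-weighted aggregation of ALE metrics can be illustrated with a normalized weighted sum over placement options; the metric directions, weights, and numbers below are assumptions for illustration only:

import numpy as np

def male_scores(metrics, weights, higher_is_better=(True, False, False)):
    # metrics: dict placement -> (accuracy, latency_ms, energy); weights sum to 1.
    # Returns a mission-weighted score per placement (higher = better fit).
    names = list(metrics)
    values = np.array([metrics[n] for n in names], dtype=float)
    norm = (values - values.min(axis=0)) / (np.ptp(values, axis=0) + 1e-9)
    norm = np.where(higher_is_better, norm, 1.0 - norm)   # flip cost-type metrics
    scores = norm @ np.asarray(weights)
    return dict(zip(names, scores.round(3)))

print(male_scores({"cloud": (0.95, 180.0, 2.0),
                   "edge": (0.90, 40.0, 1.2),
                   "device": (0.82, 10.0, 0.4)},
                  weights=(0.5, 0.3, 0.2)))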
|
|
17:15-17:30, Paper Mo-S2-T4.6 | |
Stochastic Modeling for Design Guidance of Static Vehicular Cloud Systems in Car Rental Scenarios |
|
Almeida, Vinícius | Centro De Informática, Universidade Federal De Pernambuco |
Silva, Jonatas | Universidade Federal De Pernambuco |
Santana, Marcelo | Universidade Federal De Pernambuco, Centro De Informática |
Dantas, Renata | Federal Institute of Pernambuco - IFPE |
Maciel, Paulo | UFPE |
Keywords: Decision Support Systems, Modeling of Autonomous Systems, Infrastructure Systems and Services
Abstract: In Vehicular Cloud Computing (VCC), the volatility of mobile resources presents a fundamental challenge: will a task be successfully delivered or ultimately discarded? This paper addresses this “discard or deliver” dilemma by proposing a Stochastic Petri-Net (SPN) model for a Static Vehicular Cloud (SVC) in the context of a car rental business. The model captures key dynamics, including task queuing, resource allocation, vehicle unavailability due to rentals or failures, and broker availability. System performance and availability are evaluated through five core metrics: Queue Discard Probability (QDP), Utilization of Available Resources (UAR), Unavailability (UA), Global Capacity-Oriented Availability (GCOA), and Internal Capacity-Oriented Availability (ICOA). To establish model correctness, its queueing behavior is validated against a classical M/M/m/k model, showing close agreement through experiments. Sensitivity analysis reveals that rental times, task arrival and task service times, and broker times have the greatest impact on the metrics. Case studies reveal key trade-offs between task loss, resource utilization, and availability, offering practical guidance for network planners seeking to deploy effective SVC infrastructures.
|
|
17:30-17:45, Paper Mo-S2-T4.7 | |
A Large Language Model-Enhanced Expert System for Patient Triage in Emergency Department and a Machine Learning Classifier for Hospital Admissions Forecasting |
|
Ben Othman, Sarah | Ecole Centrale De Lille |
Keywords: Decision Support Systems, Service Systems and Organizations
Abstract: In situations of hospital overcrowding, a fast and effective evaluation of the severity of patient illnesses is necessary to anticipate the allocation of limited resources. Patient triage provides a first assessment of patient acuity before admission to the Emergency Department (ED) but requires a complex management of medical data performed by specially trained staff. This need to anticipate resource allocation is further amplified when a patient must be admitted for hospitalization after a stay in the ED. This study presents a system for the automation of patient triage in the ED and the prediction of post-ED hospitalizations. A Large Language Model (LLM)-enhanced expert system architecture is designed for the assessment of patient acuity through the processing of rules expressed in natural language and converted into structured rules. The triage evaluation is reused by Machine Learning classifiers, based on eXtreme Gradient Boosting, Random Forest, Logistic Regression, and Decision Tree, predicting, based on triage/ED data and medical history, whether patients require observation, hospitalization, or can be safely discharged after their stay in the ED. Experiments show high accuracy (between 72% and 82%) for the best LLMs evaluated, while the eXtreme Gradient Boosting classifier obtains the best performance in predicting post-ED orientation, with an accuracy between 59.1% and 73.5% depending on the scenario. The proposed architecture stands out for its adaptability to other problems.
|
|
17:45-18:00, Paper Mo-S2-T4.8 | |
LODA Revisited: Enhancements for Robust Online Anomaly Detection with Concept Drift Handling |
|
Szalai, Márk Dániel | Budapest University of Technology and Economics |
Baranyi, Máté | Budapest University of Technology and Economics |
Horvath, Gabor | Budapest University of Technology and Economics |
Molontay, Roland | Budapest University of Technology and Economics |
Keywords: Decision Support Systems, Smart Sensor Networks, Manufacturing Automation and Systems
Abstract: The Lightweight On-line Detector of Anomalies (LODA) is one of the best performing generic unsupervised anomaly detection algorithms. This method returns an anomaly score using an ensemble of one-dimensional histograms in a random projected space. Deploying LODA in a streaming, dynamically evolving environment remains a challenging task as the method proposed by Pevný (2016) tends to forget the learned behaviors too quickly, meaning the model will not recognize reemerging normal behavior and thus often produces high anomaly scores for normal samples. In this paper, we propose a new algorithm, demonstrating that the proposed modifications lead to improved performance on benchmark datasets, facilitate smoother transitions during concept drifts, and provide more robust and significantly better anomaly scoring on sensor data and in streaming, dynamic environments.
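For context, the core of the baseline LODA can be written in a few lines: sparse random projections feed one-dimensional histograms, and the anomaly score is the negative mean log-density across projections. The sketch below follows that baseline formulation, not the modified streaming variant proposed in the paper:

import numpy as np

class SimpleLODA:
    # Baseline LODA sketch: k sparse random projections plus 1D histograms.
    def __init__(self, n_projections=50, n_bins=20, seed=0):
        self.k, self.bins, self.rng = n_projections, n_bins, np.random.default_rng(seed)

    def fit(self, X):
        n, d = X.shape
        nonzero = max(1, int(np.sqrt(d)))
        self.W = np.zeros((self.k, d))
        for i in range(self.k):
            idx = self.rng.choice(d, nonzero, replace=False)
            self.W[i, idx] = self.rng.standard_normal(nonzero)   # sparse projection
        Z = X @ self.W.T                                         # (n, k) projected data
        self.hists, self.edges = [], []
        for i in range(self.k):
            h, e = np.histogram(Z[:, i], bins=self.bins, density=True)
            self.hists.append(h + 1e-12)
            self.edges.append(e)
        return self

    def score(self, X):
        Z = X @ self.W.T
        logp = np.zeros((X.shape[0], self.k))
        for i in range(self.k):
            b = np.clip(np.searchsorted(self.edges[i], Z[:, i]) - 1, 0, self.bins - 1)
            logp[:, i] = np.log(self.hists[i][b])
        return -logp.mean(axis=1)        # higher score = more anomalous

rng = np.random.default_rng(1)
det = SimpleLODA().fit(rng.normal(size=(500, 8)))
print(det.score(np.vstack([rng.normal(size=(3, 8)), rng.normal(loc=6.0, size=(3, 8))])))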
|
|
18:00-18:15, Paper Mo-S2-T4.9 | |
TE-CNN-AAE: Learning Robust Financial Time Series Representations with Trend-Enhanced Adversarial Autoencoders |
|
Araujo, jefferson Oliveira Alves de | Federal University of Pernambuco |
Oliveira, Adriano, Adriano L.I.Oliveira | Universidade Federal De Pernambuco |
Cleber Zanchettin, Cleber | Federal University of Pernambuco |
Keywords: Decision Support Systems, System Modeling and Control, Distributed Intelligent Systems
Abstract: This paper introduces the Trend-Enhanced CNN Adversarial Autoencoder (TE-CNN-AAE), a novel approach for generating low-dimensional representations of intraday stock market activity from 5-minute interval quotes. The proposed model extends traditional autoencoders by incorporating an adversarial component that enhances the quality of embeddings by reconstructing input sequences while simultaneously estimating market trends. Using five years of intraday data from Dow Jones Industrial Average (DJIA) assets, we demonstrate that TE-CNN-AAE outperforms baseline methods in capturing intraday patterns and predicting daily price movements. Qualitative evaluations via UMAP visualizations show improved class separability in the latent space, further confirmed by the Silhouette Score and Davies-Bouldin Index, while quantitative analysis using an LSTM classifier validates the superior predictive utility of the embeddings. Ablation tests confirm the importance of the adversarial discriminator in generating robust representations. Our results suggest that TE-CNN-AAE effectively captures the complex dynamics of financial time series and holds the potential for improving decision-support systems in trading.
|
|
18:15-18:30, Paper Mo-S2-T4.10 | |
Multi-Perspective Log Generator for Declarative Models |
|
Alves Wanderley de Siqueira, Bruna | Universidade Federal De Pernambuco |
Lima, Ricardo | UFPE |
Oliveira Alpes, Silva, Katiane | Universidade Federal De Pernambuco |
Keywords: Enterprise Information Systems, Discrete Event Systems, Decision Support Systems
Abstract: Background: The generation of synthetic event logs is fundamental to the advancement and evaluation of process mining techniques. However, synthetic log generation in the context of multi-perspective declarative modeling faces challenges due to its flexible nature, where any behavior that is not constrained by rules is allowed. In addition, the synthetic event log needs to simultaneously meet process, data access, and privacy constraints. Objective: This study presents a general-purpose, configurable, multi-perspective log generator capable of simultaneously ensuring compliance with the rules of the flow, data access, and privacy perspectives in a flexible context such as that found in declarative processes. Method: A multi-perspective synthetic log generator was developed, capable of integrating declarative process models, data access, and organizational structure. The tool was structured in three stages: processing the input models, generating the event log with Declare4Py, and generating the access log respecting the restrictions. Results: Two experiments were carried out with 1,000 and 2,000 cases, respectively. The logs generated respected all the conformance rules between activities, accesses, and permissions. There was a linear increase in file size and a moderate increase in execution time, compatible with the complexity of the models. Conclusions: The proposed tool proved effective in generating coherent synthetic logs with multiple perspectives, supporting research into process mining and the evaluation of compliance algorithms. It is a relevant solution for contexts where real data is scarce or sensitive.
|
|
Mo-S2-T5 |
Room 0.14 |
Image Processing and Pattern Recognition 2 |
Regular Papers - Cybernetics |
Chair: Chen, Junlin | Shanghai Jiao Tong University |
Co-Chair: Lee, Min-Jeong | Korea University |
|
16:00-16:15, Paper Mo-S2-T5.1 | |
Learning from Model Rankings Improves Blind Super-Resolution Image Quality Assessment |
|
Chen, Junlin | Shanghai Jiao Tong University |
Cao, Peibei | Nanjing University of Information Science and Technology |
Zhai, Guangtao | Shanghai Jiao Tong University |
Yang, Xiaokang | Shanghai Jiao Tong University |
Zhang, Weixia | Shanghai Jiao Tong University |
Keywords: Image Processing and Pattern Recognition
Abstract: Image super-resolution (SR) aims to generate a high-resolution (HR) image from a low-resolution (LR) input. Traditionally, full-reference image quality assessment (FR-IQA) models have been widely used to evaluate the perceptual quality of super-resolved images, relying on pristine reference images as the gold standard. However, in real-world SR applications, such reference images are often unavailable, posing challenges for the use of FR-IQA. While blind image quality assessment (BIQA) models can assess the perceptual quality of super-resolved images without requiring a reference, there remains a lack of comprehensive studies evaluating the effectiveness of existing BIQA models for real-world SR tasks. This dilemma can largely be attributed to the high cost of subjective testing required to collect sufficient human quality annotations, which in turn hinders the development of effective SR-IQA models. In this study, we tackle this challenge with a data-efficient approach. We first generate super-resolved images from LR inputs using state-of-the-art real-world SR methods. Then, we use the maximum differentiation competition (MAD) to select a diverse set of images for subjective testing, allowing us to efficiently gather human preferences and assess the alignment between BIQA model predictions and human judgments. The resulting global ranking of SR methods not only indicates the relative performance of recent real-world SR models, but also gives us an opportunity to develop a new BIQA model tailored for real-world SR-IQA. By utilizing the global rankings of SR algorithms as prior knowledge, we can refine pretrained BIQA models using vast amounts of super-resolved images without any supervisory signal. Experimental results show that our approach substantially enhances IQA performance for real-world SR while preserving robust predictive accuracy across various distortion scenarios. The dataset and the code are available at https://github.com/cschenjunlin/SR-IQA-SMC25.
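The ranking-based refinement can be illustrated with a pairwise hinge loss: when the global ranking says SR method A beats method B, the BIQA model is pushed to score A's output above B's for the same input. The margin and the assumption that the model returns one scalar score per image are illustrative, not the paper's exact objective:

import torch

def pairwise_ranking_loss(biqa_model, img_better, img_worse, margin=0.5):
    # img_better / img_worse: batches of super-resolved images from two SR methods
    # whose global ranking says the first should be perceived as higher quality.
    q_better = biqa_model(img_better)
    q_worse = biqa_model(img_worse)
    # Hinge on the score difference: no human quality labels are needed.
    return torch.clamp(margin - (q_better - q_worse), min=0).mean()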
|
|
16:15-16:30, Paper Mo-S2-T5.2 | |
Visual Prompt Tuning Mamba for MRI Brain Tumor Classification |
|
Wang, Wei | Dalian Minzu University |
Li, Fen | Dalian Minzu University |
Guo, Lixin | Dalian Minzu University |
Jianxin, Zhang | Dalian Minzu University |
Keywords: Image Processing and Pattern Recognition, Deep Learning
Abstract: Automatic MRI brain tumor classification plays an important role in the clinical diagnosis of brain tumors. Meanwhile, deep learning methods have recently attracted wide attention in this medical application. Different from existing methods that are mainly based on CNN and Transformer architectures, this paper attempts to explore Mamba-related models for MRI brain tumor classification, motivated by the recent success of state space models in capturing global features of medical images. More specifically, this paper introduces a novel vision prompt Mamba model, i.e., VPT-Mamba, for MRI brain tumor classification, which implants a visual prompt into the improved Vision Mamba to capture globally and locally important information of brain images. The VPT-Mamba utilizes Mamba-R as the backbone to mitigate feature artifacts of brain tumors by introducing register tokens into Vision Mamba, and it further incorporates visual prompt tuning in the input features to enhance feature representation. We evaluate the VPT-Mamba on two brain tumor classification datasets, one public and one private. Our VPT-Mamba achieves the best ACC and AUC results of 99.510% and 99.966% on the public dataset, and the corresponding values are 96.582% and 99.125% on the private brain tumor dataset, respectively. Moreover, comparative experimental results demonstrate the effectiveness of VPT-Mamba for this medical image classification application.
|
|
16:30-16:45, Paper Mo-S2-T5.3 | |
Local Representative Token Guided Merging for Text-To-Image Generation |
|
Lee, Min-Jeong | Korea University |
Kim, Hee-Dong | Korea University |
Lee, Seong-Whan | Korea University |
Keywords: Image Processing and Pattern Recognition, Deep Learning
Abstract: Stable diffusion is an outstanding text-to-image generation model, but its time-consuming generation process remains a challenge due to the quadratic complexity of attention operations. Recent token merging methods improve efficiency by reducing the number of tokens during attention operations, but often overlook the characteristics of attention-based image generation models, limiting their effectiveness. In this paper, we propose local representative token guided merging (ReToM), a novel token merging strategy applicable to any attention mechanism in image generation. To merge tokens based on various contextual information, ReToM defines local boundaries as windows within attention inputs and adjusts window sizes. Furthermore, we introduce a representative token for each window, selected by computing token similarity at a specific timestep and choosing the token with the highest average similarity. This approach preserves the most salient local features while minimizing computational overhead. Experimental results show that ReToM achieves a 6.2% improvement in FID and higher CLIP scores compared to the baseline, while maintaining comparable inference time. We empirically demonstrate that ReToM is effective in balancing visual quality and computational efficiency.
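To make the window-level selection concrete, here is a minimal sketch (an assumed reading of the abstract, not the ReToM implementation) that picks one representative token per local window by highest average cosine similarity to the other tokens in that window:

import torch
import torch.nn.functional as F

def representative_tokens(x, window):
    """x: (num_tokens, dim) attention input; window: number of tokens per local window."""
    reps = []
    for start in range(0, x.size(0), window):
        w = x[start:start + window]                                        # (w, dim)
        sim = F.cosine_similarity(w.unsqueeze(1), w.unsqueeze(0), dim=-1)  # (w, w) similarities
        avg_sim = sim.mean(dim=1)                                          # per-token mean similarity
        reps.append(w[avg_sim.argmax()])                                   # most representative token
    return torch.stack(reps)                                               # (num_windows, dim)

The non-representative tokens in each window would then be merged into (and later unmerged from) their window's representative before the attention operation.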
|
|
16:45-17:00, Paper Mo-S2-T5.4 | |
FIQ: Fundamental Question Generation with the Integration of Question Embeddings for Video Question Answering |
|
Oh, Juyoung | Korea University |
Kim, Ho-Joong | Korea University |
Lee, Seong-Whan | Korea University |
Keywords: Image Processing and Pattern Recognition, Deep Learning
Abstract: Video question answering (VQA) is a multimodal task that requires the interpretation of a video to answer a given question. Existing VQA methods primarily utilize question and answer (Q&A) pairs to learn the spatio-temporal characteristics of video content. However, these annotations are typically event-centric, which is not enough to capture the broader context of each video. The absence of essential details such as object types, spatial layouts, and descriptive attributes restricts the model to learning only a fragmented scene representation. This issue limits the model's capacity for generalization and higher-level reasoning. In this paper, we propose a fundamental question generation with the integration of question embeddings for video question answering (FIQ), a novel approach designed to strengthen the reasoning ability of the model by enhancing the fundamental understanding of videos. FIQ generates Q&A pairs based on descriptions extracted from videos, enriching the training data with fundamental scene information. Generated Q&A pairs enable the model to understand the primary context, leading to enhanced generalizability and reasoning ability. Furthermore, we incorporate a VQ-CAlign module that assists task-specific question embeddings with visual features, ensuring that essential domain-specific details are preserved to increase the adaptability of downstream tasks. Experiments on SUTD-TrafficQA demonstrate that our FIQ achieves state-of-the-art performance compared to existing baseline methods. Code is available at https://github.com/juyoungohjulie/FIQ
|
|
17:00-17:15, Paper Mo-S2-T5.5 | |
ID-EA: Identity-Driven Text Enhancement and Adaptation with Textual Inversion for Personalized Text-To-Image Generation |
|
Jin, Hyun-Jun | Korea University |
Kim, Young-Eun | Korea University |
Lee, Seong-Whan | Korea University |
Keywords: Image Processing and Pattern Recognition, Deep Learning
Abstract: Recently, personalized portrait generation with a text-to-image diffusion model has significantly advanced with Textual Inversion, emerging as a promising approach for creating high-fidelity personalized images. Despite its potential, current Textual Inversion methods struggle to maintain consistent facial identity due to semantic misalignments between textual and visual embedding spaces regarding identity. We introduce ID-EA, a novel framework that guides text embeddings to align with visual identity embeddings, thereby improving identity preservation in a personalized generation. ID-EA comprises two key components: the ID-driven Enhancer (ID-Enhancer) and the ID-conditioned Adapter (ID-Adapter). First, the ID-Enhancer integrates identity embeddings with a textual ID anchor, refining visual identity embeddings derived from a face recognition model using representative text embeddings. Then, the ID-Adapter leverages the identity-enhanced embedding to adapt the text condition, ensuring identity preservation by adjusting the cross-attention module in the pre-trained UNet model. This process encourages the text features to find the most related visual clues across the foreground snippets. Extensive quantitative and qualitative evaluations demonstrate that ID-EA substantially outperforms state-of-the-art methods in identity preservation metrics while achieving remarkable computational efficiency, generating personalized portraits approximately 15 times faster than existing approaches.
|
|
17:15-17:30, Paper Mo-S2-T5.6 | |
MCoT-RE: Multi-Faceted Chain-Of-Thought and Re-Ranking for Training-Free Zero-Shot Composed Image Retrieval |
|
Park, Jeong-Woo | Korea University |
Lee, Seong-Whan | Korea University |
Keywords: Image Processing and Pattern Recognition, Deep Learning
Abstract: Composed Image Retrieval (CIR) is the task of retrieving a target image from a gallery using a composed query consisting of a reference image and a modification text. Among various CIR approaches, training-free zero-shot methods based on pre-trained models are cost-effective but still face notable limitations. For example, sequential VLM-LLM pipelines process each modality independently, which often results in information loss and limits cross-modal interaction. In contrast, methods based on multimodal large language models (MLLMs) often focus exclusively on applying changes indicated by the text, without fully utilizing the contextual visual information from the reference image. To address these issues, we propose multi-faceted Chain-of-Thought with re-ranking (MCoT-RE), a training-free zero-shot CIR framework. MCoT-RE utilizes multi-faceted Chain-of-Thought to guide the MLLM to balance explicit modifications and contextual visual cues, generating two distinct captions: one focused on modification and the other integrating comprehensive visual-textual context. The first caption is used to filter candidate images. Subsequently, we combine these two captions and the reference image to perform multi-grained re-ranking. This two-stage approach facilitates precise retrieval by aligning with the textual modification instructions while preserving the visual context of the reference image. Through extensive experiments, MCoT-RE achieves state-of-the-art results among training-free methods, yielding improvements of up to 6.24% in Recall@10 on FashionIQ and 8.58% in Recall@1 on CIRR.
|
|
17:30-17:45, Paper Mo-S2-T5.7 | |
FAR-Net: Multi-Stage Fusion Network with Enhanced Semantic Alignment and Adaptive Reconciliation for Composed Image Retrieval |
|
Park, Jeong-Woo | Korea University |
Kim, Young-Eun | Korea University |
Lee, Seong-Whan | Korea University |
Keywords: Image Processing and Pattern Recognition, Deep Learning
Abstract: Composed image retrieval (CIR) is a vision-language task that retrieves a target image using a reference image and modification text, enabling intuitive specification of desired changes. While effectively fusing visual and textual modalities is crucial, existing methods typically adopt either early or late fusion. Early fusion tends to excessively focus on explicitly mentioned textual details and neglect visual context, whereas late fusion struggles to capture fine-grained semantic alignments between image regions and textual tokens. To address these issues, we propose FAR-Net, a multi-stage fusion framework designed with enhanced semantic alignment and adaptive reconciliation, integrating two complementary modules. The enhanced semantic alignment module (ESAM) employs late fusion with cross-attention to capture fine-grained semantic relationships, while the adaptive reconciliation module (ARM) applies early fusion with uncertainty embeddings to enhance robustness and adaptability. Experiments on CIRR and FashionIQ show consistent performance gains, improving Recall@1 by up to 2.4% and Recall@50 by 1.04% over existing state-of-the-art methods, empirically demonstrating that FAR-Net provides a robust and scalable solution to CIR tasks.
|
|
17:45-18:00, Paper Mo-S2-T5.8 | |
Single Image Reflection Separation by Using Reflection and Refraction Estimations |
|
Liu, Tsung-Jung | National Chung Hsing University |
Chan, U-In | National Chung Hsing University |
Liu, Kuan-Hsien | National Taichung University of Science and Technology |
Keywords: Image Processing and Pattern Recognition, Machine Vision, Neural Networks and their Applications
Abstract: Considering the influence of reflected and transmitted light on physical imaging in real-world scenarios, we propose a method that leverages reflection and refraction coefficients. By calculating these coefficients between the captured image and the transmission image, our approach effectively guides the separation of reflection layers within the captured image. The proposed three-branch framework integrates the Feature Enhancement Module (FEM), which is specifically designed to recover additional details and produce high-quality transmission images. These enhancements significantly improve the performance of the reflection removal process. Comprehensive experiments conducted on various datasets and in comparison with state-of-the-art reflection removal methods demonstrate the effectiveness of our approach. The results show that our method excels in eliminating reflection artifacts and correcting intensity distortions, yielding superior image quality. In particular, the PSNR and SSIM scores of our model surpass those of existing methods. Furthermore, the simplicity of the input and architecture underscores the practicality of the proposed method. The source code of the proposed method is available at https://reurl.cc/LnAmqx.
|
|
18:00-18:15, Paper Mo-S2-T5.9 | |
Mammo-Mamba: A Hybrid State-Space and Transformer Architecture with Sequential Mixture of Experts for Multi-View Mammography |
|
Bayatmakou, Farnoush | Concordia University |
Taleei, Reza | Thomas Jefferson University |
Simone, Nicole | Thomas Jefferson University |
Mohammadi, Arash | Concordia University |
Keywords: Image Processing and Pattern Recognition, Deep Learning, Machine Learning
Abstract: Breast cancer (BC) remains one of the leading causes of cancer-related mortality among women, despite recent advances in Computer-Aided Diagnosis (CAD) systems. Accurate and efficient interpretation of multi-view mammograms is essential for early detection, driving a surge of interest in Artificial Intelligence (AI)-powered CAD models. While state-of-the-art multi-view mammogram classification models are largely based on Transformer architectures, their computational complexity scales quadratically with the number of image patches, highlighting the need for more efficient alternatives. To address this challenge, we propose Mammo-Mamba, a novel framework that integrates Selective State-Space Models (SSMs), transformer-based attention, and expert-driven feature refinement into a unified architecture. Mammo-Mamba extends the MambaVision backbone by introducing the Sequential Mixture of Experts (SeqMoE) mechanism through its customized SecMamba block. The SecMamba is a modified MambaVision block that enhances representation learning in high-resolution mammographic images by enabling content-adaptive feature refinement. These blocks are integrated into the deeper stages of MambaVision, allowing the model to progressively adjust feature emphasis through dynamic expert gating, effectively mitigating the limitations of traditional Transformer models. Evaluated on the CBIS-DDSM benchmark dataset, Mammo-Mamba achieves superior classification performance across all key metrics while maintaining computational efficiency.
|
|
18:15-18:30, Paper Mo-S2-T5.10 | |
Lightweight Dual Attention Multi-Scale Inverted Residual Neural Network for Image Inpainting |
|
Liu, Kuan-Hsien | National Taichung University of Science and Technology |
Chang, Chun-Chieh | National Taichung University of Science and Technology |
Liu, Tsung-Jung | National Chung Hsing University |
Keywords: Image Processing and Pattern Recognition, Deep Learning, Machine Vision
Abstract: We propose DA-MSIRNet, a lightweight yet innovative architecture for high-quality image inpainting that significantly enhances the standard U-Net through four key innovations: (1) Context Anchor Attention (CAA) for efficient global context modeling via adaptive region selection, (2) Sparse Self-Attention (SpA), inspired by Spa-former, for dynamic and precise local detail refinement by focusing on salient relationships, (3) Multi-Scale Inverted Residual (MSIR) modules for enhanced multi-scale feature fusion through optimized skip connections, and (4) Structural Similarity (SSIM) Loss for improved perceptual quality and fidelity. DA-MSIRNet effectively addresses critical limitations of existing methods, including GAN instability, U-Net's restricted receptive field, and Transformer computational complexity. Comprehensive evaluations on Places2 and CelebA-HQ datasets demonstrate that DA-MSIRNet achieves state-of-the-art performance in both quantitative metrics (PSNR/SSIM/FID) and visual quality, while maintaining superior computational efficiency. The code for our DA-MSIRNet is publicly available on GitHub: https://github.com/nutcliu2507/DA-MSIRNet.
|
|
Mo-S2-T6 |
Room 0.31 |
Virtual/Augmented/Mixed Reality & Human Factors |
Regular Papers - HMS |
Chair: Kashihara, Koji | Ritsumeikan University |
Co-Chair: Li, Jamy | Edinburgh Napier University |
|
16:00-16:15, Paper Mo-S2-T6.1 | |
Effects of VR Experiences of Disorientation with Visual Field Narrowing on Understanding of Dementia Symptoms among Third Parties |
|
Nakagawa, Yukiyoshi | Ritsumeikan University |
Tokuno, Koji | Ritsumeikan University |
Kashihara, Koji | Ritsumeikan University |
Keywords: Virtual and Augmented Reality Systems, User Interface Design, Human-Machine Interface
Abstract: We have created a virtual reality (VR) system designed to enhance the understanding of dementia symptoms among third parties, including young people who may be unfamiliar with dementia, as well as caregivers and family members of patients. The VR system can effectively simulate the disorientation and visual field constriction associated with Alzheimer's disease, enabling users to navigate a realistic cityscape. By altering the position of the numerical landmarks on a round trip, the VR system increased the completion time for maze tasks, thereby simulating the experience of disorientation and the sensation of being lost. Furthermore, the VR system replicated the symptoms of visual field constriction and suppressed parasympathetic nervous system activity, along with subjective symptoms such as decreased sleepiness and increased discomfort. Such VR experiences will be instrumental in improving the understanding of patients with dementia and could be useful in patient care.
|
|
16:15-16:30, Paper Mo-S2-T6.2 | |
Visual Cues in Exergame-Like Feedback for Fitting Passive Upper Limbs Exoskeleton: Systematic Review, Usability and Users' Preferences |
|
Middendorf, Max | Department of Engineering, University of Cambridge |
Saeedi-Givi, Christine | Department of Engineering, University of Cambridge |
Daling, Lea Marleen | RWTH Aachen University |
Abdelrazeq, Anas | RWTH Aachen University |
Schmitt, Robert H. | Laboratory for Machine Tools and Production Engineering, RWTH Aa |
Bohne, Thomas | Department of Engineering, University of Cambridge |
Tadeja, Slawomir Konrad | University of Cambridge |
Keywords: Assistive Technology, Human Enhancements, Virtual/Augmented/Mixed Reality
Abstract: A key problem in the adoption of exoskeletons in industry is that workers are incorrectly fitting the device, leading to discomfort and suboptimal functioning of the exoskeleton. Although biomechanical modeling and design optimization strategies have tried to resolve this issue, we propose a user-centric, real-time fitting aid to guide and control the correct fitting process, which has not been achieved in practice. Inspired by previous work that uses exergame-like feedback to instruct a user, we compared augmented reality (AR)-based visual cues to guide the accurate fitting of a passive upper limb exoskeleton. We selected visual cues through a systematic literature review and evaluated their efficacy and usability for different aspects of exoskeleton fitting in a study with sixteen participants. The study outcome suggests a statistically significant preference for a semi-transparent overlay instead of a more abstract arrow-based method. Moreover, the results indicate high usability and satisfaction with our approach, improved user acceptance, and potentially enhanced fitting accuracy. These findings advance understanding of the viability of exergame-like real-time guidance as a means to increase exoskeleton acceptance and adoption in industrial settings.
|
|
16:30-16:45, Paper Mo-S2-T6.3 | |
Adaptive VR-Based Stroop Task Driven by Cognitive Engagement to Enhance Sustained Attention in Young Adults |
|
Loaeza-Martinez, Alfredo | Mirai Innovation Research Institute |
Gonzalez-Diaz de Leon, Juliana | Mirai Innovation Research Institute |
Hernández Ríos, Edgar Rafael | Mirai Innovation Research Institute |
Valencia, Victor | Advanced Telecommunications Research Institute International |
Penaloza, Christian | Mirai Innovation Research Institute |
Keywords: Virtual/Augmented/Mixed Reality, Brain-Computer Interfaces, Augmented Cognition
Abstract: Attention is a critical cognitive process required to perform everyday tasks, yet it has become increasingly difficult to maintain due to constant distractions. While several systems aim to support attentional control, many cannot adapt to the user's cognitive state in real time. This study proposes an adaptive virtual reality platform based on the Stroop task that measures cognitive engagement (CE) using real-time electroencephalography (EEG) data. A subject study was conducted, comparing a control group (non-adaptive version) with an experimental group (adaptive version). Results showed that the adaptive environment was significantly more effective at sustaining attention than the static condition. Additionally, cognitive engagement increased as task difficulty advanced, but stabilized after reaching higher difficulty levels, suggesting the presence of a cognitive ceiling or overload threshold. These results suggest that an adaptive VR Stroop task can enhance attentional regulation, provided that the difficulty remains within the cognitive capacity of the user. This system has promising implications for real-time, personalized cognitive training.
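The adaptation loop can be pictured with a minimal sketch (assumed logic, not the authors' implementation) that raises or lowers Stroop difficulty from an EEG-derived cognitive-engagement index:

def update_difficulty(level, ce_index, low=0.4, high=0.7, min_level=1, max_level=10):
    """Adjust the Stroop difficulty level from a normalized CE index in [0, 1]."""
    if ce_index > high and level < max_level:
        level += 1          # engagement is high: make the task harder
    elif ce_index < low and level > min_level:
        level -= 1          # engagement is dropping: ease the task
    return level

The thresholds and level range here are purely illustrative; in the study, the ceiling effect reported at higher difficulty levels would show up as CE stabilizing even while the level stops increasing.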
|
|
16:45-17:00, Paper Mo-S2-T6.4 | |
SPICE: Smart Projection Interface for Cooking Enhancement |
|
Prohaska, Vera | CyPhy Life, Robotics & AI Lab, School of Science & Technology, I |
Castelló Ferrer, Eduardo | CyPhy Life, Robotics & AI Lab, School of Science & Technology, I |
Keywords: Virtual/Augmented/Mixed Reality, Human-Computer Interaction, Interactive and Digital Media
Abstract: Tangible User Interfaces (TUI) for human-computer interaction (HCI) provide the user with physical representations of digital information with the aim to overcome the limitations of screen-based interfaces. Although many compelling demonstrations of TUIs exist in the literature, there is a lack of research on TUIs intended for daily two-handed tasks and processes, such as cooking. In response to this gap, we propose SPICE (Smart Projection Interface for Cooking Enhancement). SPICE investigates TUIs in a kitchen setting, aiming to transform the recipe-following experience from simply text-based to tangibly interactive. SPICE uses a tracking system, agent-based simulation software, and vision large language models to create and interpret a kitchen environment where recipe information is projected directly onto the cooking surface. We conducted comparative usability and validation studies of SPICE with 30 participants. The results show that participants using SPICE completed the recipe with far fewer stops and in a substantially shorter time. Despite this, participants self-reported negligible change in feelings of difficulty, which is a direction for future research. Overall, the SPICE project demonstrates the potential of using TUIs to improve everyday activities, paving the way for future research in HCI and new computing interfaces.
|
|
17:00-17:15, Paper Mo-S2-T6.5 | |
Three-Touch: A Manual Overlaying Technique for Augmented Content |
|
Mundt, Martin | Fraunhofer Institute for Communication, Information Processing A |
Klöckner, Johannes | Fraunhofer Institute for Communication, Information Processing A |
Seynsche, Leonie Katharina | University of Applied Sciences Bonn-Rhein-Sieg |
Mathew, Tintu | Fraunhofer Institute for Communication, Information Processing A |
Keywords: Virtual/Augmented/Mixed Reality, User Interface Design, Human-Computer Interaction
Abstract: Overlaying real-world objects with virtual information is a fundamental aspect of augmented reality (AR). Traditional techniques typically rely on marker-based or feature-based computer vision methods, each with inherent limitations regarding robustness and environmental adaptability. In this paper, we introduce the Three-Touch (TT) technique, a novel approach that enables users to manually align virtual content with physical objects by selecting corresponding feature pairs. We conduct a comparative study between the TT technique and a conventional grab gesture (G) method, evaluating usability, user experience, task load, positioning accuracy, and task completion time. Our findings indicate that while the TT technique reduces physical and mental effort, it may lead to greater positional offsets and reduced task performance compared to the G technique.
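For intuition, aligning virtual content from user-selected corresponding point pairs reduces to estimating a rigid transform; below is a minimal sketch (generic Kabsch alignment, not the authors' implementation) for three touched correspondences:

import numpy as np

def rigid_transform(physical_pts, virtual_pts):
    """Both inputs: (3, 3) arrays of corresponding 3D points, one row per touch."""
    p_c, v_c = physical_pts.mean(axis=0), virtual_pts.mean(axis=0)
    H = (physical_pts - p_c).T @ (virtual_pts - v_c)   # cross-covariance of centered points
    U, _, Vt = np.linalg.svd(H)
    R = Vt.T @ U.T                                     # Kabsch rotation
    if np.linalg.det(R) < 0:                           # guard against reflections
        Vt[-1] *= -1
        R = Vt.T @ U.T
    t = v_c - R @ p_c                                  # translation
    return R, t                                        # virtual ~ R @ physical + t

Whether the TT technique computes the transform exactly this way is not stated in the abstract; the sketch only shows why three non-collinear correspondences suffice to place the content.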
|
|
17:15-17:30, Paper Mo-S2-T6.6 | |
Attention-Based Deep Learning for Quantifying Simulator Sickness Using Eye and Head Motion Data in the Genesis Simulator |
|
Hag, Ala | Deakin University |
Chalak Qazani, Mohamad Reza | College of Science and Engineering |
Wei, Lei | University of Central Florida |
Nahavandi, Saeid | Swinburne University of Technology |
Asadi, Houshyar | Deakin University |
Keywords: Affective Computing, Virtual and Augmented Reality Systems, Human-Computer Interaction
Abstract: Simulator sickness remains a major challenge in immersive simulation systems, particularly in high-fidelity driving environments. While previous research has utilized machine learning with multimodal physiological data, it often depends on restricted feature sets and overlooks comprehensive eye movement and head motion data. In this study, we propose a deep learning framework using a hybrid 1D CNN-BiLSTM-Attention model for real-time quantification of simulator sickness severity. Data were collected using Deakin University's Genesis Simulator, an immersive 360° environment with a six degrees-of-freedom motion platform. Eye-tracking and head movement features were extracted, and the Fast Motion Sickness (FMS) scale was used for severity labeling. The proposed model achieved 87.6% accuracy and an F1-score of 0.91 for high-severity detection. A 5-fold cross-validation demonstrated a significant benefit of attention over baseline models (p = 0.0019). This study offers a scalable solution for adaptive simulation and intelligent vehicle systems through integrated eye and head movement data.
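A minimal sketch of the kind of hybrid model named in the abstract (assumed architecture and layer sizes, not the authors' code): a 1D convolution over the eye/head-motion feature window, a bidirectional LSTM, and soft attention pooling over time before classification.

import torch
import torch.nn as nn

class CnnBiLstmAttention(nn.Module):
    def __init__(self, n_features, n_classes, hidden=64):
        super().__init__()
        self.conv = nn.Conv1d(n_features, 32, kernel_size=3, padding=1)
        self.lstm = nn.LSTM(32, hidden, batch_first=True, bidirectional=True)
        self.attn = nn.Linear(2 * hidden, 1)           # scores each time step
        self.head = nn.Linear(2 * hidden, n_classes)

    def forward(self, x):                              # x: (batch, time, n_features)
        h = torch.relu(self.conv(x.transpose(1, 2))).transpose(1, 2)
        h, _ = self.lstm(h)                            # (batch, time, 2*hidden)
        w = torch.softmax(self.attn(h), dim=1)         # attention weights over time
        ctx = (w * h).sum(dim=1)                       # weighted temporal context
        return self.head(ctx)                          # severity-class logits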
|
|
17:30-17:45, Paper Mo-S2-T6.7 | |
Too Close for Comfort? Investigating Virtual Professor Distance and Student Learning in VR |
|
Lafci, Mustafa Tevfik | Fraunhofer Heinrich-Hertz Institut |
Nierula, Birgit | Fraunhofer-Institut Für Nachrichtentechnik, Heinrich-Hertz-Insti |
Damar, Dilara | HHI |
Bosse, Sebastian | Fraunhofer HHI |
Keywords: Human-centered Learning, Virtual and Augmented Reality Systems, Virtual/Augmented/Mixed Reality
Abstract: This study investigates how virtual professor proximity influences student comprehension and attention in an immersive learning environment. 27 participants experienced three lectures in virtual reality (VR) under varying spatial conditions: close distance (personal space intrusion), optimal distance (user defined preferred proximity), and far distance (outside social interaction proximity). Attention was tracked via eye movement, and comprehension was evaluated through multiple-choice tests. Results show that close distance reduced comprehension despite increased gaze toward the professor, likely due to discomfort and attentional overload. Far distance increased visual distraction, while the optimal distance supported comfort, engagement, and performance. Our findings offer actionable design guidelines for avatar placements in virtual learning systems to optimize user experience and cognitive effectiveness.
|
|
17:45-18:00, Paper Mo-S2-T6.8 | |
GesturePTS: Using Predetermined Time Systems (PTS) from Human Factors Engineering for Coding Gesture Proposals from Elicitation Studies |
|
Li, Jamy | Edinburgh Napier University |
Penaranda Valdivia, Karen | Toronto Metropolitan University |
Renaud, Kade | Toronto Metropolitan University |
Manshaei, Roozbeh | Toronto Metropolitan University |
Mazalek, Ali | Toronto Metropolitan University |
Keywords: Human Factors, Design Methods, Human-Computer Interaction
Abstract: Coding the gestures people use when interacting with tangible devices can aid interface design by making interfaces more intuitive and consistent. Past work devotes very little space to developing coding schemes for user-defined gesture sets. A generalizable coding scheme for gestures made by the human hand is developed for tangible devices using inspiration from predetermined time systems (PTS) from human factors engineering. The coding scheme is generic, adaptable to any hand motion proposal and has features such as duration invariance. Case study examples demonstrate its ability to provide consistent labels for hand interactions with tangible devices. Coding scheme labels are also balanced, having neither too many nor too few distinct gestures. This paper can help teach and create coding schemes for gesture elicitation studies performed by human-computer interaction researchers. The larger societal impact of this work is to help advance scientific methods for beneficial interface design within HCI/HRI by combination with human factors theory.
|
|
18:00-18:15, Paper Mo-S2-T6.9 | |
Experimental Evaluation of Personalized Intervention Based on the PLS-SEM Model for Physical Activity |
|
Shiraishi, Masahiro | Fujitsu Limited |
Masuda, Yuta | Fujitsu Limited |
Yamamoto, Tatsuya | Fujitsu Limited |
Hayakawa, Shoji | Fujitsu Limited |
Kamimura, Takuya | Fujitsu Limited |
Iizuka, Hiroyuki | Center for Human Nature, Artificial Intelligence, and Neuroscien |
Suzuki, Keisuke | Center for Human Nature, Artificial Intelligence, and Neuroscien |
Keywords: Human Factors, Human-Computer Interaction, Human-centered Learning
Abstract: This study aims to conduct an experimental evaluation of personalized interventions to promote physical activity. Promoting health behavior such as physical activity to improve well-being is an important social issue. Maintaining health behavior based solely on an individual's will without any support is difficult due to lifestyle and environmental/social factors. Therefore, it is important to increase the intrinsic motivation for health behavior through external interventions. In recent years, services that promote physical activity by intervening with users through mobile health applications have become increasingly popular. In such services, users are generally encouraged to increase their health behavior through positive interventions such as supportive messages. However, such one-size-fits-all interventions have limited impact because of a mismatch with individual needs, resulting in inadequate or sometimes counterproductive effects. Therefore, interventions tailored to individual needs (i.e., personalized interventions) are needed. We used PLS-SEM to model the relationship between individually different personalities (i.e., personality traits) and the motivational ability of intervention for physical activity (i.e., persuasiveness), with the goal of enabling personalized interventions. We conducted an experiment to examine the effect of physical activity promotion through the personalized intervention based on the PLS-SEM model. As a result of the experiment, the personalized intervention group showed a significantly higher implementation rate and continuation rate of physical activity compared to the non-personalized intervention group and the control group. These results suggest that it is possible to promote physical activity more effectively through an intervention tailored to personality traits. This may be a solution to the problem of insufficient intervention effect with a one-size-fits-all approach.
|
|
18:15-18:30, Paper Mo-S2-T6.10 | |
Evaluating Parental Readiness to Manage Children's Privacy across Social Media Platforms |
|
Abreu, Mykaele | PUCPR |
Kugler Viegas, Eduardo | Pontifícia Universidade Católica Do Paraná |
Santin, Altair | Pontifícia Universidade Católica Do Paraná |
Geremias, Jhonatan | Pontifícia Universidade Católica Do Paraná |
Keywords: Human Factors, Systems Safety and Security
Abstract: Children's widespread use of digital platforms has intensified concerns about the adequacy of privacy protections. Current legislation places the responsibility for managing children's privacy on parents and guardians, assuming they possess the necessary knowledge to make informed decisions. In light of this, this work assesses parental maturity in managing children's privacy on social networks. First, we identify the main privacy attributes relevant to children’s online data protection by analyzing existing laws and regulations, including the GDPR, COPPA, and LGPD. This phase establishes a regulatory baseline for evaluating parental responsibilities and expectations. In the second phase, we surveyed 77 parents and guardians to assess their level of maturity in managing privacy-related measures and to evaluate how effectively they can protect their children's data in digital environments. Our results reveal a significant discrepancy between perceived and actual knowledge, suggesting that many parents may not be adequately prepared to fulfill the role expected by current regulations. These findings support the need for clearer policies and a shared responsibility model between platforms and guardians to ensure child privacy.
|
|
Mo-S2-T7 |
Room 0.32 |
Artificial Social Intelligence |
Regular Papers - Cybernetics |
Chair: Givigi, Sidney | Queen's University |
Co-Chair: Zhou, Mengchu | New Jersey Institute of Technology |
|
16:00-16:15, Paper Mo-S2-T7.1 | |
BSAlarm: A Time Series Dataset for Network Alarm Forecasting and Base Station Classification |
|
Li, Zhouyuan | Beijing University of Posts and Telecommunications |
Liu, Yaqiong | Beijing University of Posts and Telecommunications |
Wang, Xidian | China Mobile Group Design Institute Co., Ltd |
Jia, Zihan | China Mobile Group Design Institute Co., Ltd |
Shi, Duo | China Mobile Group Design Institute Co., Ltd |
Lv, Zhe | Beijing University of Posts and Telecommunications |
Jiang, Yuanzhen | Beijing University of Posts and Telecommunications |
Keywords: Artificial Social Intelligence, AI and Applications, Application of Artificial Intelligence
Abstract: Network Operation and Maintenance (NOM) is an important part of the IT and telecommunications infrastructure. To improve the efficiency of NOM personnel, predict base station alarms, and classify base stations more precisely, we build a new labeled time series dataset, called BSAlarm, which collects 1000 different time series from processed real-world network alarm log data. Furthermore, we develop a NOM pipeline that employs B-spline interpolation alongside the moving average method to create and utilize this dataset for both long- and short-time series forecasting (LTSF), as well as base station classification tasks. Unlike prior NOM methods, our pipeline is centered on the generation of time series data, thereby enhancing alarm prediction performance and the visualization of changes under base station alarm conditions. Extensive experiments show that the proposed BSAlarm dataset and pipeline achieve remarkable results on time-series-data-based NOM.
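The B-spline plus moving-average preprocessing mentioned in the abstract can be sketched as follows (standard SciPy/NumPy usage; the grid and window sizes are illustrative, not the released pipeline's settings):

import numpy as np
from scipy.interpolate import make_interp_spline

def resample_and_smooth(t, counts, n_points=200, window=5):
    """t: increasing, possibly irregular timestamps; counts: alarm counts at those times."""
    spline = make_interp_spline(t, counts, k=3)        # cubic B-spline through the samples
    t_new = np.linspace(t.min(), t.max(), n_points)    # regular time grid
    dense = spline(t_new)                              # interpolated series
    kernel = np.ones(window) / window
    smooth = np.convolve(dense, kernel, mode="same")   # moving-average smoothing
    return t_new, smooth

The resulting regular, smoothed series is the form typically fed to the forecasting and classification models.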
|
|
16:15-16:30, Paper Mo-S2-T7.2 | |
MASD: A Multi-Agent Sarcasm Detection Framework with Chain-Of-Thought |
|
Zheng, Yongjie | Southwest University of Science and Technology |
Ma, Liping | University |
Keywords: Artificial Social Intelligence, Application of Artificial Intelligence
Abstract: Ironic language is prevalent on social media. Existing sarcasm detection methods face three main limitations: (1) the implicit and context-dependent nature of irony, which requires advanced reasoning; (2) substantial computational overhead in locally fine-tuning large language models (LLMs); and (3) insufficient detection performance due to a narrow analysis of sarcastic expressions. To address these issues, we propose the Multi-Agent Sarcasm Detection framework with Chain-of-Thought (MASD). MASD introduces a Chain-of-Thought prompting paradigm that combines syntactic cues with deep semantic reasoning. A KNN-based sampling technique identifies semantically similar examples to produce a robust LLM emulation template. Leveraging LLM inferential capabilities, MASD enables effective sarcasm detection with minimal reliance on extensive pre-trained datasets. Moreover, its multi-agent architecture allows for the collaborative extraction of contextual, semantic, and rhetorical insights. By incorporating a self-reflection mechanism, LLMs continuously refine their reasoning logic. Experiments on benchmark datasets across real-world scenarios demonstrate Macro-F1 score improvements of 5.04%, 1.57%, and 8.82%, underscoring the framework's effectiveness and scalability, with substantial implications for advancing sarcasm detection methodologies.
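The KNN-based sampling step can be illustrated with a minimal sketch (assumed components, not the MASD code; the embed function and the example store are hypothetical placeholders) that retrieves the k most similar labelled examples to seed a chain-of-thought prompt:

import numpy as np
from sklearn.neighbors import NearestNeighbors

def build_prompt(query_text, example_texts, example_labels, embed, k=3):
    X = np.vstack([embed(t) for t in example_texts])              # example embeddings
    index = NearestNeighbors(n_neighbors=k, metric="cosine").fit(X)
    _, idx = index.kneighbors(embed(query_text).reshape(1, -1))
    shots = "\n".join(f"Text: {example_texts[i]}\nSarcastic: {example_labels[i]}"
                      for i in idx[0])
    return (f"{shots}\n\nText: {query_text}\n"
            "Think step by step about tone, context and rhetoric, "
            "then answer Sarcastic: yes or no.")

In the full framework this template would be passed to the detection agents, whose outputs are then combined and refined through the self-reflection mechanism.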
|
|
16:30-16:45, Paper Mo-S2-T7.3 | |
Research on the Door-Opening Strategy of Humanoid Robots Assisted by YOLOv5 Navigation: Combining Proximal Policy Optimization (PPO) with Attention Mechanism |
|
Wang, Long | Harbin Engineering University |
Zhang, Haoyu | Harbin Engineering University |
Zhao, Guodong | Harbin Engineering University |
Liu, Chengzhuo | Harbin Engineering University |
Keywords: Artificial Social Intelligence, Application of Artificial Intelligence, Cyborgs,
Abstract: This paper proposes a comprehensive framework that integrates the YOLOv5 object detection model with the Proximal Policy Optimization (PPO) algorithm enhanced by a multi-head attention mechanism, aiming to enable a humanoid robot to autonomously locate itself in front of a door and complete the task of opening the door. Firstly, the real-time and precise positioning and navigation of the door are achieved through the object detection model. Secondly, the improved PPO algorithm dynamically extracts key features and establishes deep associations through the multi-head attention mechanism to optimize the arm control strategy. Experiments are carried out on the Webots simulation platform. The results show that this method not only improves the accuracy and efficiency of the robot in reaching the front of the door but also significantly enhances the success rate of the door-opening action and the adaptability to environmental changes. It verifies the effectiveness of the integration of the attention mechanism and deep reinforcement learning technology in improving the autonomous operation skills of humanoid robots, providing new ideas and technical support for the future development of humanoid robots.
|
|
16:45-17:00, Paper Mo-S2-T7.4 | |
YOLOv8-CTCD: An Improved YOLOv8 for Cherry Tomato Cluster Detection in Robotic Harvesting |
|
Wang, Gang | Chongqing Normal University |
Chai, Shanglei | Shenzhen University |
Zhang, Zhiyuan | Singapore Management University |
Zeng, Zhi | Chongqing Normal University |
Tian, Yibin | Shenzhen University |
Keywords: Artificial Social Intelligence, Deep Learning, Cyborgs,
Abstract: Cherry tomato harvesting is generally performed manually. Robotic harvesting is gaining increasing interest from both academia and industry. This paper proposes a cherry tomato cluster detection algorithm based on YOLOv8, named YOLOv8-CTCD. First, the YOLOv8 input channels are adjusted to enable 4-channel RGB-D images as input. Subsequently, a CARAFE-M module is designed to replace the upsampling method in YOLOv8n. It maintains a lightweight architecture while achieving a larger receptive field, allowing effective aggregation of contextual information. In addition, it assigns greater weight to more important features. Moreover, a C2f-MLCA module is introduced into YOLOv8, which integrates information from feature maps at different levels and enhances the network's capability of feature extraction. It also integrates the SPPELAN module to strengthen its feature fusion capability. YOLOv8-CTCD has been evaluated using a private cherry tomato dataset obtained from a greenhouse farm. The experimental results show that it achieves an mAP@50 of 93.8% and an mAP@50:90 of 68%, which represents improvements of 2.1% and 3% over YOLOv8n, respectively.
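The 4-channel input adjustment can be sketched as follows (an assumed adaptation in PyTorch, not the authors' code): the pretrained 3-channel stem convolution is widened to accept RGB-D input by reusing the RGB kernels and initialising the depth channel from their mean.

import torch
import torch.nn as nn

def widen_first_conv(conv3: nn.Conv2d) -> nn.Conv2d:
    conv4 = nn.Conv2d(4, conv3.out_channels, conv3.kernel_size,
                      stride=conv3.stride, padding=conv3.padding,
                      bias=conv3.bias is not None)
    with torch.no_grad():
        conv4.weight[:, :3] = conv3.weight                            # copy RGB kernels
        conv4.weight[:, 3:] = conv3.weight.mean(dim=1, keepdim=True)  # seed the depth channel
        if conv3.bias is not None:
            conv4.bias.copy_(conv3.bias)
    return conv4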
|
|
17:00-17:15, Paper Mo-S2-T7.5 | |
Dual Feature Enhancement and Adaptive Attention Fusion for Cross-Modal Scene Classification of Mining Land |
|
Zhou, Yue | School of Computer Science China University of Geosciences Wuhan |
Wang, Jiangyuan | School of Computer Science China University of Geosciences Wuhan |
Li, Xianju | School of Computer Science China University of Geosciences Wuhan |
Keywords: Artificial Social Intelligence, Deep Learning, Neural Networks and their Applications
Abstract: Remote sensing scene classification in mining areas is crucial for evaluating deposit occurrence states and supporting environmental monitoring and sustainable development. However, existing methods are limited by the homogeneous and heterogeneous spectral, spatial, and topographic features of mining areas, large intra-class variations, and small target sizes. To overcome these limitations, this study integrates RGB and SAR data to construct a multi-modal dataset and proposes an RGB-SAR mining scene classification model with dual feature enhancement and adaptive cross-modal attention interaction. The model includes: (1) a dual feature enhancement module that suppresses irrelevant features and enhances discriminative multi-scale representations of mining targets; (2) a BifocalNet-based feature extraction module using a CNN-Transformer hybrid architecture to capture local textures and model global context; (3) an attention-based adaptive cross-modal interaction module that achieves deep spectral-geometric feature complementarity through the fusion of RGB and SAR modalities. Experiments show the model achieves an OA of 84.58%, outperforming other models and ranking first or second in most evaluation metrics. The proposed dataset and model thus advance mining scene classification.
|
|
17:15-17:30, Paper Mo-S2-T7.6 | |
Spacecraft Pose Estimation Based on High-Resolution Feature Network |
|
Fu, Zhiyong | Dalian University of Technology |
Ru, Bo | Dalian University of Technology |
Wang, Zhelong | Dalian University of Technology |
Wang, Luyao | Dalian University of Technology |
Yue, Dongyang | Dalian University of Technology |
Zhou, Jiangheng | Dalian University of Technology |
Li, Xvqing | Dalian University of Technology |
Tang, Lingxiang | Dalian University of Technology |
Keywords: Artificial Social Intelligence, Machine Vision, Neural Networks and their Applications
Abstract: Spacecraft pose estimation from monocular images presents significant challenges due to complex environmental conditions such as occlusions, illumination variations, and background interference, as well as estimation inaccuracies caused by multi-scale variations in object distance and viewpoint. To address these issues, this paper proposes a novel monocular pose estimation algorithm. In concrete terms, a high-resolution feature extraction framework is constructed using Higher-HRNet, a variant of the High-Resolution Network (HRNet), to generate multi-scale feature maps and enhance spatial feature representation. To balance estimation accuracy with real-time performance under limited computational resources, lightweight Ghost-BasicBlock and Ghost-Bottleneck modules are designed to reduce model complexity. Moreover, to mitigate the loss of feature representation capacity induced by model compression, a Biformer-Receptive Attention (BRA) mechanism is incorporated in the decoding stage to strengthen spatial-context modeling and improve keypoint localization accuracy. Finally, the six-degree-of-freedom pose of the spacecraft is estimated by integrating the Efficient Perspective-n-Point (EPnP) algorithm with Random Sample Consensus (RANSAC). Experimental results on the SPEED dataset demonstrate that the proposed method achieves a mean translation error (meanET) of 0.0077 m and a mean rotation error (meanER) of 0.0232°, with median values (medianET and medianER) of 0.0038 m and 0.0112°, outperforming current state-of-the-art methods.
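The final EPnP-plus-RANSAC stage corresponds to standard OpenCV usage; a minimal sketch (not the authors' code) recovering a 6-DoF pose from detected 2D keypoints and their known 3D spacecraft model points:

import cv2
import numpy as np

def estimate_pose(model_points, image_points, camera_matrix):
    """model_points: (N, 3) float32; image_points: (N, 2) float32; N >= 4."""
    dist_coeffs = np.zeros(5)                          # assumes an undistorted camera
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        model_points, image_points, camera_matrix, dist_coeffs,
        flags=cv2.SOLVEPNP_EPNP, reprojectionError=2.0, iterationsCount=100)
    if not ok:
        raise RuntimeError("PnP failed")
    rotation, _ = cv2.Rodrigues(rvec)                  # 3x3 rotation matrix
    return rotation, tvec, inliers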
|
|
17:30-17:45, Paper Mo-S2-T7.7 | |
Can AlphaZero Master Human Concepts in Tic-Tac-Toe? |
|
Marasco, Anthony Joseph | Queen's University |
Givigi, Sidney | Queen's University |
Keywords: Computational Intelligence, AI and Applications, Artificial Social Intelligence
Abstract: AlphaZero’s success in games like chess, Go, and shogi has raised questions about its learning behaviour and limitations. Its play often mirrors human reasoning, applying familiar and novel concepts alike. In this work, we apply AlphaZero to Tic-Tac-Toe to examine its ability to learn a simple rules-based hierarchy that models human understanding. Despite the game’s simplicity, AlphaZero exhibits counter-intuitive, sub-optimal behaviour not typical of human players. However, we show that modifying the reward structure can enforce rule hierarchies in pure MCTS and supervised settings, and improves AlphaZero’s rule learning while reducing sub-optimal play.
|
|
17:45-18:00, Paper Mo-S2-T7.8 | |
A Feature Reduction and Interaction Optimization Method for Defect Prediction in Industrial Software (I) |
|
Gao, Shenghan | Beijing University of Technology |
Wang, Weidong | Beijing University of Technology |
Keywords: Machine Learning, Computational Intelligence, Computational Intelligence in Information
Abstract: In software defect prediction, achieving a balance between computational efficiency and accuracy is critical yet challenging. Traditional approaches often depend on high-dimensional feature sets, introducing redundant parameters that escalate complexity without enhancing predictive capability. This limitation becomes particularly acute in large-scale industrial software systems, where both feature selection and interaction analysis are vital for reliable predictions due to the complexity and scale of real-time operations. In such environments, defects can disrupt critical processes, making efficient and accurate prediction essential for maintaining system reliability and safety. To overcome these issues, we propose a two-phase method combining feature reduction and interaction optimization tailored for industrial software applications. The first phase employs mutual information and Lasso regularization to eliminate 43% of redundant features, significantly lowering model complexity. The second phase utilizes polynomial feature expansion to construct interaction terms, with Partial Dependence Plots (PDPs) decoding their nonlinear impacts. Experimental results demonstrate a 5%-25% acceleration in training speed across diverse models while preserving accuracy, proving the method's efficacy for scalable defect prediction tasks in industrial software environments, where rapid deployment and robustness are paramount.
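A minimal sketch of the two-phase workflow (assumed scikit-learn tooling, not the paper's exact setup; the boosted classifier stands in for whichever downstream model is used) for a feature matrix X with binary defect labels y:

import numpy as np
from sklearn.feature_selection import mutual_info_classif
from sklearn.linear_model import Lasso
from sklearn.preprocessing import PolynomialFeatures
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import partial_dependence

def reduce_and_expand(X, y, mi_keep=0.5, lasso_alpha=0.01):
    mi = mutual_info_classif(X, y)                            # phase 1a: mutual-information screening
    keep = np.argsort(mi)[-max(1, int(len(mi) * mi_keep)):]
    X_mi = X[:, keep]
    lasso = Lasso(alpha=lasso_alpha).fit(X_mi, y)             # phase 1b: Lasso pruning (0/1 labels as targets)
    X_red = X_mi[:, np.abs(lasso.coef_) > 1e-6]
    X_int = PolynomialFeatures(degree=2, interaction_only=True,
                               include_bias=False).fit_transform(X_red)
    clf = GradientBoostingClassifier().fit(X_int, y)          # phase 2: interaction terms
    pdp = partial_dependence(clf, X_int, features=[0])        # PDP for one term's nonlinear effect
    return clf, pdp

The retention ratio and regularization strength above are illustrative; the paper reports eliminating 43% of features, which would correspond to a different, data-driven cut-off.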
|
|
18:00-18:15, Paper Mo-S2-T7.9 | |
A Novel Social-Aware Clustering and Spatiotemporal Interest Modeling Method to Content Caching in MEC (I) |
|
Li, Bojia | Tongji University |
Zhao, Shengjie | Tongji University |
Chen, Weichao | Tongji University |
Keywords: AIoT, AI and Applications
Abstract: The Mobile Edge Computing (MEC) paradigm is widely believed to be highly potent in boosting computation-intensive AI applications, where massive amounts of data are processed at the Internet-of-Things end. Nevertheless, in reality, it remains a great challenge to fill the gap between resource capacity and user need, especially when MEC users are socially connected and content requests can be highly dynamic in terms of their spatial-temporal patterns. Traditional approaches in this direction tend to leverage mobile trace prediction algorithms for yielding mobility-aware content acquisition schedules. In this research, however, we consider that the interests of socially connected clusters of MEC users can be highly useful in aiding the design of effective MEC service allocation mechanisms. For this purpose, we develop a novel interest-aware method for content placement. It consists of a clustering model for dividing servers based on their physical connections, a clustering model for grouping users based on their interest similarity, and a predictive content placement model using a transformer-based method. Simulation results on real-world datasets show our method outperforms traditional ones across multiple performance metrics.
|
|
Mo-S2-T8 |
Room 0.51 |
Intelligent Transportation Systems & Communications |
Regular Papers - SSE |
Chair: Wisniewski, Remigiusz | University of Zielona Gora |
Co-Chair: Wang, Wufan | Beijing University of Posts and Telecommunications |
|
16:00-16:15, Paper Mo-S2-T8.1 | |
Two-Stage Column Generation Optimization Algorithm for Flight Strings Based on Deep Reinforcement Learning |
|
Ding, Jianli | Civil Aviation University of China |
Chen, Jiaen | Civil Aviation University of China |
Li, Jing | Civil Aviation University of China |
Wang, Jing | Civil Aviation University of China |
Keywords: Intelligent Transportation Systems, Decision Support Systems
Abstract: Existing algorithms for the flight string scheduling problem often fall short due to their NP-hard nature and difficulties with local optima. To enhance efficiency, we propose a two-stage deep reinforcement learning column generation optimization algorithm (2-Stage DRLCG). This method leverages "offline training and online decision-making" to improve scheduling. In the first stage, we create a flight connection diagram based on airport connections and minimum transfer times. The second stage utilizes a deep Q network (DQN) to optimize column generation through a reduced cost (RC) approach. Experiments using data from a major Chinese airline demonstrate significant improvements in cost reduction and computational efficiency, highlighting the potential of deep reinforcement learning in flight scheduling.
|
|
16:15-16:30, Paper Mo-S2-T8.2 | |
A Two-Stage Dynamic Prediction Model for Flight Transit Time Based on Ensemble Learning and Transformer |
|
Ding, Jianli | Civil Aviation University of China |
Xu, Yuxin | Civil Aviation University of China |
Li, Jing | Civil Aviation University of China |
Wang, Jing | Civil Aviation University of China |
Keywords: Intelligent Transportation Systems, Decision Support Systems
Abstract: To accurately predict flight transit times at airports of different sizes, this paper proposes a two-stage dynamic prediction model for flight transit time based on ensemble learning and a Transformer, which jointly and dynamically predicts flight transit times across multiple airports. First, considering the continuity of flight operations, flight records are linked and key features are extracted from each flight's historical information to construct a multi-airport flight transit dataset. Second, parameter transfer and linkage of the flight transit time are performed. Third, a Bayesian optimization algorithm is introduced to tune the hyperparameters of the LightGBM and XGBoost models, and the optimal hyperparameters are used to produce a preliminary first-stage prediction of the flight transit time. Finally, combining real-time delay data with the actual operation of the preceding flight, a Transformer model dynamically adjusts the initial prediction in the second stage to obtain the final result. The experimental results show that the two-stage predictions reduce the MAE by 2.42 and 2.49 minutes compared with the one-stage predictions, which demonstrates the effectiveness of the method; it also outperforms random forest, support vector machine, and neural network models in terms of prediction error.
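The first-stage ensemble with Bayesian hyperparameter search can be sketched as follows (assumed tooling with LightGBM and Optuna, not the paper's implementation; X_train, y_train, X_val, y_val are assumed prepared feature sets):

import lightgbm as lgb
import optuna
from sklearn.metrics import mean_absolute_error

def tune_stage_one(X_train, y_train, X_val, y_val, n_trials=50):
    def objective(trial):
        params = {
            "num_leaves": trial.suggest_int("num_leaves", 16, 256),
            "learning_rate": trial.suggest_float("learning_rate", 0.01, 0.3, log=True),
            "n_estimators": trial.suggest_int("n_estimators", 100, 1000),
        }
        model = lgb.LGBMRegressor(**params).fit(X_train, y_train)
        return mean_absolute_error(y_val, model.predict(X_val))    # minimise transit-time MAE

    study = optuna.create_study(direction="minimize")
    study.optimize(objective, n_trials=n_trials)
    return lgb.LGBMRegressor(**study.best_params).fit(X_train, y_train)

An analogous search would be run for XGBoost, after which the second-stage Transformer corrects the first-stage predictions using real-time delay information.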
|
|
16:30-16:45, Paper Mo-S2-T8.3 | |
LSTGCN: A Layer-By-Layer Spatio-Temporal Graph Convolutional Model for Flight Delay Prediction |
|
Ding, Jianli | Civil Aviation University of China |
Song, Peiyao | Civil Aviation University of China |
Wang, Jing | Civil Aviation University of China |
Keywords: Intelligent Transportation Systems, Decision Support Systems
Abstract: Frequent flight delays have become a serious problem in global air traffic, severely affecting the passenger travel experience and the operational efficiency of aviation systems. To effectively capture the spatio-temporal dynamic characteristics of flight delays, this paper proposes a novel spatio-temporal flight delay prediction model, called the Layer-by-layer Spatio-Temporal Graph Convolutional Network (LSTGCN), in which spatio-temporal dependencies are captured incrementally through layered graph convolutions. The model integrates temporal modeling with layer-by-layer graph convolution techniques: a temporal convolutional network (TCN) combined with a Transformer mechanism extracts the temporal features of flight delays, while a graph convolutional network (GCN) and a graph attention mechanism (GAT) are introduced in a layer-by-layer alternating manner to explore in depth the topological relationships and dynamic influences between airports, constructing multiple graph adjacency structures. Combined with a node-specific aggregation method, this approach accurately captures the topological structure characteristics and the dynamic interaction relationships within the airport network. Finally, the extracted spatio-temporal features are fused into a unified
|
|
16:45-17:00, Paper Mo-S2-T8.4 | |
YOLO-EMR: Efficient Multi-Scale and Rotated Object Detection in UAV Aerial Imagery |
|
Yan, Haimin | Ritsumeikan University |
Kong, Xiangbo | Toyama Prefectural University |
Shimada, Tomoyasu | Ritsumeikan University |
Wang, Juncheng | China United Network Communications Corporation |
Tomiyama, Hiroyuki | Ritsumeikan University |
Keywords: Intelligent Transportation Systems, Robotic Systems, Smart Buildings, Smart Cities and Infrastructures
Abstract: With the rapid development of UAVs, object detection in aerial imagery has become an important technique. However, deploying real-time detection models on UAV platforms remains highly challenging due to limited computational resources, as well as the multi-scale objects and dense distributions caused by high-altitude imaging. Numerous studies have contributed valuable insights to address these challenges, yet opportunities remain for improving the balance between model efficiency and detection accuracy. Moreover, current research mainly focuses on small object detection, without fully considering the detection requirements for multi-scale and rotated objects in high-altitude imagery. To address these issues, this paper proposes a lightweight object detection model specifically designed for multi-scale and rotated object detection in high-altitude imagery. Furthermore, to better evaluate the model's performance in UAV-based object detection in high-altitude imagery, this work uses the VisDrone 2019 dataset to assess the model's real-world performance. As a result, compared to existing object detection approaches, the proposed model achieves a strong balance between detection accuracy, model efficiency, and inference speed, reducing parameters by approximately 31% and improving inference speed by 25%, while maintaining the same detection accuracy as the best baseline methods.
|
|
17:00-17:15, Paper Mo-S2-T8.5 | |
Joint Trajectory Optimization and Passive Beamforming in IRS-Assisted UAV-WRSN: A DRL-Based Method |
|
Liu, Pei | Changchun University of Technology |
Wen, Boge | Changchun University of Technology |
Keywords: Communications, Cooperative Systems and Control
Abstract: Wireless rechargeable sensor networks (WRSNs) play a pivotal role in ground-based monitoring within sixth-generation (6G) networks. However, these networks frequently face challenges in ground-to-air communication, stemming from urban obstructions, suboptimal antenna configurations, and other factors. To address these issues, this paper proposes a novel framework that integrates WRSNs with unmanned aerial vehicle (UAV) control. Specifically, we consider a system utilizing a mobile charging vehicle (MCV) equipped with an intelligent reflecting surface (IRS) to concurrently meet sensor charging needs and improve communication between the base station (BS) and the UAV. We formulate a dynamic multi-objective optimization problem to maximize sensor charging utility and BS-to-UAV communication rate while minimizing the MCV's energy consumption. This is accomplished through the coordinated optimization of the MCV's trajectory and the IRS's phase configurations. To tackle the computational complexity of continuous two-dimensional trajectory optimization, we present a Pareto-optimal waypoint generation strategy. Furthermore, we propose an improved proximal policy optimization (IPPO) algorithm, incorporating shared feature extraction and LSTM-enhanced policy networks, to effectively manage high-dimensional state processing and temporal dependencies. Simulation results demonstrate that our approach outperforms benchmark algorithms in terms of charging utility, communication rate, and energy consumption optimization.
|
|
17:15-17:30, Paper Mo-S2-T8.6 | |
FS-IoT: Fast Few Shot IoT Devices Identification |
|
Dai, Kunling | Beijing University of Posts and Telecommunications |
Que, Xirong | BUPT |
Wang, Wufan | Beijing University of Posts and Telecommunications |
Keywords: Communications, Infrastructure Systems and Services, Smart Sensor Networks
Abstract: The widespread deployment of Internet of Things (IoT) devices, coupled with their often limited security capabilities, has significantly increased the network attack surface. Consequently, network asset managers need to continuously monitor and assess vulnerable IoT devices to mitigate potential threats. However, existing passive IoT device identification approaches, which rely on network traffic analysis, are hindered by substantial labeling requirements and computational overhead, severely limiting their practicality in real-world environments. To address these challenges, we propose FS-IoT, a fast few-shot IoT device identification framework. FS-IoT introduces the concept of packet bursts as the fundamental unit of recognition and systematically explores their extraction, representation, classification, and aggregation for IoT device identification. Experimental evaluations on two public datasets demonstrate that FS-IoT achieves superior accuracy (99.94% and 98.15%) while requiring only 2% of the labeled training data needed by state-of-the-art methods. Furthermore, FS-IoT improves recognition speed by an order of magnitude, making it highly suitable for practical, large-scale deployments.
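The packet-burst unit of recognition can be illustrated with a minimal sketch (an assumed definition consistent with the abstract, not the FS-IoT code) that splits a device's traffic into bursts whenever the inter-arrival gap exceeds a threshold:

def split_into_bursts(packets, gap_threshold=1.0):
    """packets: list of (timestamp_seconds, size_bytes) tuples sorted by time."""
    if not packets:
        return []
    bursts, current = [], [packets[0]]
    for prev, pkt in zip(packets, packets[1:]):
        if pkt[0] - prev[0] > gap_threshold:      # long gap: close the burst
            bursts.append(current)
            current = [pkt]
        else:
            current.append(pkt)
    bursts.append(current)
    return bursts

Each burst can then be represented by simple features (packet count, total bytes, duration) before per-burst classification and per-device aggregation, which is where the reduction in labeling effort comes from.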
|
|
17:30-17:45, Paper Mo-S2-T8.7 | |
Multi-Dimensional Extremist Classification Using Sentiment, Social Topology, and Temporal Data |
|
Kienast, Dan | UNSW |
Jiang, Jiaojiao | UNSW |
Keywords: Homeland Security, Decision Support Systems, Communications
Abstract: Social media’s rise has enabled rapid content dissemination and fostered ideological echo chambers. These platforms, by design, often promote insular communities and targeted content exposure, making them effective tools for extremist recruitment and propaganda. Existing research has primarily focused on detecting extremist-related content, particularly on Twitter (now X), ignoring the increasing adoption of other platforms such as Bluesky and Mastodon. To capture this evolving landscape, we propose SOCINT (Social Observations with Chronic Interactions and Network Tracking), a novel framework for user-level sentiment classification at specific timestamps. Unlike existing approaches that analyze individual posts in isolation, SOCINT leverages both temporal activity patterns and social network interactions to assess user sentiment and track its evolution over time. We also introduce a new multi-platform dataset enriched with social and temporal metadata. Experimental results demonstrate that SOCINT outperforms existing baselines, offering a dynamic and context-aware solution for detecting and monitoring extremist behavior across social media ecosystems.
|
|
17:45-18:00, Paper Mo-S2-T8.8 | |
Decomposition of Petri Net Toward Enhancement of Decision-Making in Manufacturing Systems (I) |
|
Wisniewska, Monika | Zielona Gora |
Patalas-Maliszewska, Justyna | University of Zielona Góra |
Wisniewski, Remigiusz | University of Zielona Gora |
Topczak, Marcin | University of Zielona Góra |
Zhou, Mengchu | New Jersey Institute of Technology |
Li, Zhiwu | Xidian University |
Konarczak, Dawid | Uniwersytet Zielonogórski |
Keywords: Decision Support Systems, Manufacturing Automation and Systems, System Modeling and Control
Abstract: The paper proposes a novel decomposition method aimed at supporting decision-making in manufacturing systems. The main goal of the presented technique is to divide the production process flow in the manufacturing system by delegating selected activities to an outsourcing mode. In order to obtain such separated activities within a production process, the modelled system is decomposed into a set of components (sets of places) that are executed externally. The presented idea involves Petri net and hypergraph theories. In particular, the system is modelled by a Petri net and further decomposed with the use of hypergraph transversals. The key advantage of the proposed algorithm is the large number of decomposition possibilities, since the activities to be performed in an outsourcing mode may be specified precisely, exactly conforming to the production strategy. It can be treated as a tool supporting decision-making by managers in the context of delegating tasks from the production process in an outsourcing mode. Such a study complements the results of the analysis of cost and efficiency of the production process realized in a manufacturing enterprise.
|
|
18:00-18:15, Paper Mo-S2-T8.9 | |
Efficiency and Effectiveness Analysis of Invariant Coverage Verification Methods for Petri Net-Based Concurrent Systems (I) |
|
Wojnakowski, Marcin | University of Zielona Gora |
Wisniewski, Remigiusz | University of Zielona Gora |
Zhou, Mengchu | New Jersey Institute of Technology |
Li, Zhiwu | Xidian University |
Maliński, Maxim Zbigniew | Doctoral School of Exact and Technical Sciences, University of Z |
Obuchowicz, Andrzej | University of Zielona Góra |
Konarczak, Dawid | Uniwersytet Zielonogórski |
Keywords: System Modeling and Control, Discrete Event Systems
Abstract: This paper aims to analyse the efficiency and effectiveness of two invariant coverage verification methods for Petri net-based concurrent systems. The first algorithm is based on the classical approach, where the complete analysis of the Petri net’s incidence matrix is performed. This technique assures the correctness of results, but it may be inefficient due to the exponential computational complexity. The second algorithm applies an innovative idea of Petri net incidence matrix transformation and examination. The main advantage of such an approach is its polynomial computational complexity. On the other hand, the method may not always be accurate and may not detect all errors in the examined system. This paper exhaustively verifies the efficiency (run-time) and effectiveness (correctness of the achieved results) of both methods. The theoretical analyses are supported by the results of experiments performed on 386 benchmarks (Petri net models). In addition, besides the examination of run-time, the number of iterations performed by each algorithm is computed and analysed.
|
|
18:15-18:30, Paper Mo-S2-T8.10 | |
Semantic Understanding-Based Open-Scene Re-Identification |
|
Zhang, Zhengrui | Data Science and Intelligent Computing Laboratory, Hangzhou Inte |
Wang, Shuai | Beihang University |
Liu, Xi | Chinese Institute of Coal Science |
Sheng, Hao | Beihang University |
Yang, Da | Hangzhou International Innovation Institute, Beihang University |
Su, Guanqun | Shandong Qingniao HoT Co. Ltd |
Keywords: Adaptive Systems, Smart Metering, Intelligent Transportation Systems
Abstract: Although current ReID (Re-Identification) methods have become relatively mature, they still require manual extraction of pedestrian images and annotation of features. They lack semantic understanding capabilities in open scenes. While some ReID models integrated with LLMs (Large Language Models) offer more comprehensive functions and better performance, they still fall short in terms of semantic understanding and cross-modal retrieval. To solve these problems, we introduce SUO-ReID, a semantic understanding-based approach for ReID in open scenes. SUO-ReID combines LVLM (Large Vision-Language Model) with ResNet (Residual Network) to extract high-level semantic features of targets in open scenes. It can also perform more flexible and complex functions, such as searching for or comparing targets with specified features, through instruction inputs. Experimental results show that SUO-ReID achieves an accuracy rate of 95.73% on datasets such as Market-1501 and DukeMTMC and exhibits excellent semantic understanding capabilities in open scenes, supporting cross-modal retrieval. It can also provide a detailed description of the features of the identified object and its surrounding scene. This study provides new insights into the application of large vision-language models in the field of ReID.
|
|
Mo-S2-T9 |
Room 0.90 |
Intelligent Multi-Agent Collaboration and Training |
Special Sessions: Cyber |
Chair: Tan, Yaoyao | Chongqing University |
Co-Chair: Zhu, QingHua | Guangdong University of Technology |
Organizer: Shi, Peng | University of Adelaide, Adelaide |
Organizer: Su, Xiaojie | Chongqing University |
Organizer: Kamath, Archit Krishna | Nanyang Technological University Singapore |
Organizer: Yan, Bing | The University of Adelaide |
|
16:00-16:15, Paper Mo-S2-T9.1 | |
Co-Design of Partly Transition Rates and Output Feedback Control of Markovian Jump Systems (I) |
|
Ruiqing, Fu | Chongqing University |
Tian, Yufeng | Chongqing University |
Micheal, Shi | Victoria University |
Jiang, Tao | Chongqing University |
Tan, Yaoyao | Chongqing University |
Shen, Chao | Xi'an Jiaotong University |
Keywords: Cybernetics for Informatics, Hybrid Models of Computational Intelligence, Hybrid Models of Neural Networks, Fuzzy Systems, and Evolutionary Computing
Abstract: This paper addresses co-design control strategies for continuous-time Markov jump systems where subsets of transition rate matrices are fixed a priori, challenging conventional co-design methodologies. A synchronously mode-dependent parametric framework is developed to address partial transition rate optimization alongside output feedback controller synthesis. A novel criterion is derived to guarantee mean-square stability by reconstructing adjustable switching parameters while preserving fixed system transitions. Crucially, the formulation decouples decision variables associated with predefined and designable transition rates, reducing computational complexity and eliminating the need for high-dimensional parameter optimization. Stability analysis and controller design are unified through hybrid control principles. Numerical case studies validate the proposed approach, demonstrating enhanced feasibility compared to existing methods.
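For readers outside this area, a standard continuous-time Markov jump linear system can be written in the generic textbook form below; the notation is assumed here and is not necessarily the paper's.

```latex
\begin{aligned}
\dot{x}(t) &= A(r_t)\,x(t) + B(r_t)\,u(t), \qquad y(t) = C(r_t)\,x(t),\\
\Pr\{r_{t+\Delta}=j \mid r_t=i\} &=
\begin{cases}
\pi_{ij}\,\Delta + o(\Delta), & i \neq j,\\
1+\pi_{ii}\,\Delta + o(\Delta), & i = j,
\end{cases}
\qquad \pi_{ii}=-\sum_{j\neq i}\pi_{ij}.
\end{aligned}
```

In the co-design setting described in the abstract, a subset of the transition rates pi_ij is fixed a priori while the remaining rates are treated as decision variables alongside the output-feedback gains.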
|
|
16:15-16:30, Paper Mo-S2-T9.2 | |
Active Adaptation Control for Reconfigurable Vehicles Based on Collaborative Fault-Tolerant Mechanism (I) |
|
Wang, Jianxiang | Chongqing University |
Biao, Liu | School of Automation, Chongqing University |
Jiang, Tao | Chongqing University |
Yang, Yue | Xi'an University of Architecture and Technolog |
Tan, Yaoyao | Chongqing University |
Su, Xiaojie | Chongqing University |
Shi, Peng | University of Adelaide |
Keywords: Cyborgs, Swarm Intelligence
Abstract: This paper presents a universal and adaptive control framework for reconfigurable vehicles subject to composite motion disturbances, incorporating a collaborative fault-tolerant mechanism. A model-based cascaded control architecture is developed based on the vehicle’s kinematic and dynamic models. To ensure safety, an improved adaptive geofencing strategy is proposed which integrates capability constraints with barrier functions in the kinematic loop. For dynamic feedback, an adaptive gain-filtered extended state observer enables accurate disturbance estimation from noisy outputs and improves robustness via feedforward compensation. Meanwhile, a collaborative fault-tolerant mechanism further allocates control inputs to mitigate actuator faults. Finally, experimental results validate the proposed method’s effectiveness under complex interference scenarios.
|
|
16:30-16:45, Paper Mo-S2-T9.3 | |
Multi-Agent Reinforcement Learning Algorithm Using Dynamic OW-QMIX in Complex Supply Chain Scenarios (I) |
|
Liu, ZhiQi | GuangDong University of Technology |
Zhu, QingHua | Guangdong University of Technology |
Zeng, An | GuangDong University of Technology |
Ji, YuZhu | GuangDong University of Technology |
Yang, BaoYao | GuangDong University of Technology |
Keywords: Machine Learning
Abstract: Effectively optimizing operations in a complex supply chain environment has long been high on the agenda. Although existing deep reinforcement learning methods have achieved success in certain applications, they still face limitations in complex supply chain environments, including difficulties in data sharing and a lack of digital collaboration, especially in multi-agent systems. To meet this challenge, we propose a novel multi-agent reinforcement learning algorithm based on dynamic optimistic weights (DO-QMIX), aiming to address a shortcoming of the traditional weighted QMIX algorithm (WQMIX), namely the simplicity of its weighting function. WQMIX employs two weighting schemes to handle multi-agent issues. However, its fixed weighting function restricts algorithm performance and hinders adaptability to dynamic, complex supply chain challenges. Therefore, we propose a dynamic weighting mechanism, which adjusts the weighting function in real time based on changes in the environment, thus improving overall efficiency. We construct a complex multi-stage supply chain environment based on a real-world supply chain scenario and conduct extensive experiments using both real-world and simulated datasets. The experimental results demonstrate that DO-QMIX is significantly superior to traditional multi-agent reinforcement learning algorithms in complex supply chain scenarios, especially when dealing with dynamic changes and complex decisions.
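To make the weighting idea concrete, the sketch below shows the standard optimistically weighted projection loss from OW-QMIX together with one hypothetical way to adapt the down-weighting factor from recent TD-error statistics; the schedule, constants, and function names are assumptions for illustration and are not the DO-QMIX rule.

```python
import numpy as np

def ow_weights(q_tot, targets, alpha):
    """Optimistically-weighted QMIX weighting: full weight where the mixed
    value underestimates the target, down-weight (alpha) otherwise."""
    return np.where(q_tot < targets, 1.0, alpha)

def dynamic_alpha(td_errors, alpha_min=0.1, alpha_max=0.75):
    """A hypothetical dynamic schedule: when recent TD errors are volatile
    (environment changing), raise alpha so fewer samples are discounted."""
    volatility = np.std(td_errors) / (np.abs(np.mean(td_errors)) + 1e-8)
    score = np.tanh(volatility)              # squash to (0, 1)
    return alpha_min + (alpha_max - alpha_min) * score

def weighted_loss(q_tot, targets, alpha):
    """Weighted projection loss used to train the monotonic mixing network."""
    w = ow_weights(q_tot, targets, alpha)
    return np.mean(w * (q_tot - targets) ** 2)

q_tot   = np.array([1.0, 2.5, 0.3])
targets = np.array([1.5, 2.0, 0.9])
alpha = dynamic_alpha(q_tot - targets)
print(alpha, weighted_loss(q_tot, targets, alpha))
```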
|
|
16:45-17:00, Paper Mo-S2-T9.4 | |
Finite-Time RCBF-Based Cooperative Control of Heterogeneous Multi-Agent Systems for Forest Monitoring (I) |
|
Yan, Bing | The University of Adelaide |
Ni, Junkang | Northwestern Polytechnical University |
Shen, Wenqi | Harbin Engineering University |
Shi, Peng | University of Adelaide, Adelaide |
Keywords: Swarm Intelligence, Cybernetics for Informatics, Agent-Based Modeling
Abstract: In this paper, a finite-time robust safe cooperative control strategy is proposed for heterogeneous multi-agent systems (HMAS) in cluttered obstacle environments under input saturation, external disturbances, and Denial-of-Service (DoS) attacks. An adaptive event-triggered observer is designed at the cyber layer to achieve distributed resilient tracking under DoS-induced communication networks. At the physical layer, a distributed control scheme based on finite-time robust control barrier function (FT-RCBF) is first developed to ensure fast obstacle avoidance for HMAS. The proposed method is applied to a cooperative forest traversal and monitoring task for ground-air autonomous systems, and its effectiveness and robustness are verified through simulations.
|
|
17:00-17:15, Paper Mo-S2-T9.5 | |
Prescribed Performance Finite-Time Observer-Based Super-Twisting Controller for Cooperative Aerial Suspended Transport Systems (I) |
|
Lu, Xiaoqiang | Nanyang Technological University |
Kamath, Archit Krishna | Nanyang Technological University Singapore |
T, Thanaraj | Nanyang Technological University, Singapore |
Feroskhan, Mir | Nanyang Technological University Singapore |
Keywords: Optimization and Self-Organization Approaches, Computational Intelligence, Application of Artificial Intelligence
Abstract: This paper presents a prescribed performance finite-time observer-based super-twisting controller (PPFTOST) for cooperative aerial suspended transport systems. The proposed controller integrates a prescribed performance framework, an appointed-time disturbance observer (ATDO) for disturbance estimation, and a fast terminal sliding mode super-twisting controller (FTSMSTC) for rapid convergence with reduced chattering. Through this combination, the closed-loop system achieves finite-time convergence of both the disturbance estimation error and the sliding variables, while ensuring that the tracking errors remain within prescribed performance bounds. In simulations against an ATDO-only baseline, the PPFTOST reduced payload-tracking root mean square error by 8.2%, 6.7%, and 9.3% on the x, y, and z axes, respectively, with all actuator commands remaining within limits. These results demonstrate that the proposed method enables accurate, robust, and smooth multi-UAV load transport under realistic disturbance conditions.
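For orientation, the generic (unmodified) super-twisting algorithm on a sliding variable s is sketched below on a toy first-order error system; the gains, step size, and disturbance are arbitrary assumptions, and the paper's FTSMSTC adds terminal-sliding-mode and prescribed-performance terms not shown here.

```python
import numpy as np

def super_twisting_step(s, v, k1, k2, dt):
    """One step of the standard super-twisting algorithm (generic form, not
    the paper's exact law): continuous control on sliding variable s with an
    internal integral state v."""
    u = -k1 * np.sqrt(abs(s)) * np.sign(s) + v
    v = v - k2 * np.sign(s) * dt
    return u, v

# Toy closed loop: first-order error dynamics e_dot = u + d with a bounded
# sinusoidal disturbance d; the sliding variable is simply s = e here.
dt, e, v = 1e-3, 1.0, 0.0
for k in range(5000):
    d = 0.3 * np.sin(2 * np.pi * 0.5 * k * dt)   # assumed disturbance
    u, v = super_twisting_step(e, v, k1=1.5, k2=1.1, dt=dt)
    e += (u + d) * dt
print(f"|e| after 5 s: {abs(e):.4f}")
```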
|
|
17:15-17:30, Paper Mo-S2-T9.6 | |
Cloud-Based Higher-Order Sliding Mode Predictive Control for Time-Varying Formation of Multi-Agent Systems (I) |
|
Nandanwar, Anuj | Indian Institute of Technology, Kanpur |
Rybak, Larisa | Belgorod State Technological University Named after V.G. Shukhov |
Malyshev, Dmitry | Belgorod State Technological University Named after V.G. Shukhov |
Dyakonov, Dmitry | Belgorod State Technological University Named after V.G. Shukhov |
Keywords: Cloud, IoT, and Robotics Integration, Swarm Intelligence, Agent-Based Modeling
Abstract: This paper investigates a discrete higher-order sliding mode control (DHOSMC) framework integrated with cloud-based predictive control (CBPC) to address time-varying formation control in discrete multi-agent systems (DMAS). The approach ensures robustness against model uncertainties and external disturbances while compensating for communication delays. The DHOSMC guarantees finite convergence of tracking errors, and the CBPC leverages cloud resources for multi-step prediction and coordination among agents. Both delay-free and delayed scenarios are considered. Lyapunov-based analysis is used to prove system stability. Simulation results demonstrate the effectiveness of the proposed method in achieving accurate and robust time-varying formation of DMAS.
|
|
17:30-17:45, Paper Mo-S2-T9.7 | |
Distributed Bipartite Tracking Control for Heterogeneous Multi-Agent Systems with Hierarchical Framework (I) |
|
Yang, Yize | The University of Adelaide |
Shi, Peng | University of Adelaide, Adelaide |
Yan, Bing | The University of Adelaide |
Ni, Junkang | Northwestern Polytechnical University |
Keywords: Swarm Intelligence, Cybernetics for Informatics, Agent-Based Modeling
Abstract: This paper investigates the distributed bipartite tracking problem for heterogeneous multi-agent systems (MASs) on signed directed networks. Leaders with both cooperative and competitive interactions and two different types of followers form the heterogeneous MASs. The dynamics of each agent are described in strict feedback form by parameters of different structures. The so-called reference signal tracking approach is employed in the hierarchical design to generate reference signal generators and bipartite tracking controllers. The control gains of leaders and followers are integrated into the design of reference signal generators, which utilize locally estimated states and remain independent of the communication topology. The convergence of the proposed heterogeneous MASs is proved. The simulation result demonstrates the effectiveness of the proposed hierarchical design.
|
|
17:45-18:00, Paper Mo-S2-T9.8 | |
Task Allocation for Autonomous Machines Using Computational Intelligence and Deep Reinforcement Learning (I) |
|
Nguyen, Thanh Thi | Monash University |
Nguyen, Quoc Viet Hung | Griffith University |
Kua, Jonathan | Deakin University |
Razzak, Imran | Mohamed Bin Zayed University of Artificial Intelligence |
Nguyen, Dung | The University of Queensland |
Nahavandi, Saeid | Swinburne University of Technology |
Keywords: Agent-Based Modeling, Computational Intelligence, Machine Learning
Abstract: Enabling multiple autonomous machines to perform reliably requires the development of efficient cooperative control algorithms. This paper presents a survey of algorithms that have been developed for controlling and coordinating autonomous machines in complex environments. We especially focus on task allocation methods using computational intelligence (CI) and deep reinforcement learning (RL). The advantages and disadvantages of the surveyed methods are analysed thoroughly. We also propose and discuss in detail various future research directions that shed light on how to improve existing algorithms or create new methods to enhance the employability and performance of autonomous machines in real-world applications. The findings indicate that CI and deep RL methods provide viable approaches to addressing complex task allocation problems in dynamic and uncertain environments. The recent development of deep RL has greatly contributed to the literature on controlling and coordinating autonomous machines, and it has become a growing trend in this area. It is envisaged that this paper will provide researchers and engineers with a comprehensive overview of progress in machine learning research related to autonomous machines. It also highlights underexplored areas, identifies emerging methodologies, and suggests new avenues for exploration in future research within this domain.
|
|
18:00-18:15, Paper Mo-S2-T9.9 | |
Event-Triggered Control for Autonomous Detection and Treatment of Membrane Lesions Using Microrobot Swarms (I) |
|
Gao, Yun | Hong Kong University of Science and Technology (Guangzhou) |
Gao, Hao | Hong Kong University of Science and Technology (Guangzhou) |
Wang, Ziming | The Hong Kong University of Science and Technology (Guangzhou) |
Shi, Yang | University of Victoria |
Ji, Yiding | Hong Kong University of Science and Technology (Guangzhou) |
Keywords: Swarm Intelligence, Optimization and Self-Organization Approaches, Agent-Based Modeling
Abstract: Recent advances in robotics have expanded the potential of microrobot swarms (MRSs) in medicine, yet clinical deployment remains limited due to reliance on non-autonomous systems. This study proposes an event-triggered distributed coverage control framework that enables MRSs to autonomously detect and treat membrane lesions. To model lesion dynamics accurately, we introduce a coupled reaction-diffusion equation and a Hawkes process that capture spatial spread and temporal emergence. This model informs a modified Lloyd algorithm to guide MRSs toward the centroids of Voronoi cells, optimizing drug release over pre-existing lesion areas. Furthermore, we design an event-triggered mechanism prioritizing treatment of newly emerging lesions, redirecting microrobots to lesion centers for prioritized response. This adaptive framework effectively addresses lesion proliferation and promotes membrane healing. Simulations demonstrate improved coverage efficiency and lesion containment compared to conventional strategies.
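The coverage step can be pictured with a plain discretized Lloyd iteration, sketched below in NumPy: grid points are assigned to the nearest robot (a Voronoi partition) and each robot moves to the density-weighted centroid of its cell. The Gaussian "lesion" density, grid size, and robot count are toy assumptions; the reaction-diffusion/Hawkes lesion model and the event-triggering logic are not reproduced.

```python
import numpy as np

def lloyd_step(robots, grid, density):
    """One discretized Lloyd iteration: assign grid points to the nearest
    robot (Voronoi cells) and move each robot to the density-weighted
    centroid of its cell. `density` stands in for a lesion map here."""
    d2 = ((grid[:, None, :] - robots[None, :, :]) ** 2).sum(-1)  # (P, N)
    owner = d2.argmin(axis=1)
    new_robots = robots.copy()
    for i in range(len(robots)):
        cell = owner == i
        w = density[cell]
        if w.sum() > 0:
            new_robots[i] = (grid[cell] * w[:, None]).sum(0) / w.sum()
    return new_robots

# Toy setup: 4 microrobots on the unit square, one Gaussian "lesion" hotspot.
xs, ys = np.meshgrid(np.linspace(0, 1, 50), np.linspace(0, 1, 50))
grid = np.stack([xs.ravel(), ys.ravel()], axis=1)
density = np.exp(-((grid - [0.7, 0.3]) ** 2).sum(1) / 0.02)
robots = np.random.default_rng(0).random((4, 2))
for _ in range(30):
    robots = lloyd_step(robots, grid, density)
print(robots.round(2))
```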
|
|
Mo-S2-T10 |
Room 0.94 |
Assistive Technology 2 |
Regular Papers - HMS |
Chair: Zhang, Jianjun | South China Agricultural University |
Co-Chair: Xue, Shan | University of Tsukuba |
|
16:00-16:15, Paper Mo-S2-T10.1 | |
Feedback Method for Motion Instruction to Electric Prosthetic Hands Users Via Self-Organizing Maps |
|
Miyazawa, Sho | Tokyo Denki University |
Asanuma, Yuhi | Tokyo Denki University |
Hamada, Yuki | Tokyo Denki University |
Inoue, Jun | Tokyo Denki University |
Keywords: Assistive Technology, Human-Machine Interface, Human Factors
Abstract: The adoption rate of myoelectric prosthetic hands is low owing to the lack of training facilities and challenges associated with prolonged use. A key issue in training is the difficulty users face in adapting to these devices, which we attribute to their inability to accurately recognize errors and areas for improvement. To address this, we developed a feedback method that visually simplifies error identification. Specifically, we used a self-organizing map (SOM) to map acquired muscle vibration data from a high to a low-dimensional space, visualizing the characteristics of each movement to identify the variations in movement that cause misrecognition. Furthermore, to promote accurate movement learning, we provided participants with feedback on movement biases and deficiencies. We used an ultrafine-diameter piezoelectric wire sensor to acquire muscle vibration data for seven different movements, including grasping and palm flexion/extension. The SOM was then used to identify the movements and provide feedback. Consequently, the movement identification rate after feedback improved from 91.8% to 96.1%, indicating that participants could more accurately replicate movements. This approach facilitates better movement awareness, contributing to shorter training times and improved proficiency for prosthetic hand users. Furthermore, the proposed feedback method could be applied to rehabilitation programs and training systems for other assistive devices, enhancing their effectiveness by promoting more accurate motor learning.
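To illustrate the mapping step, the following self-contained NumPy sketch trains a small self-organizing map and projects samples onto its 2-D grid; the grid size, learning schedule, and synthetic "movement" data are assumptions and do not reflect the authors' sensor pipeline.

```python
import numpy as np

def train_som(data, grid=(10, 10), iters=2000, lr0=0.5, sigma0=3.0, seed=0):
    """Minimal self-organizing map: high-dimensional samples are mapped onto
    a 2-D grid so similar movements land on nearby nodes (toy sketch)."""
    rng = np.random.default_rng(seed)
    h, w = grid
    weights = rng.random((h * w, data.shape[1]))
    coords = np.array([(i, j) for i in range(h) for j in range(w)], float)
    for t in range(iters):
        x = data[rng.integers(len(data))]
        bmu = np.argmin(((weights - x) ** 2).sum(1))          # best-matching unit
        lr = lr0 * np.exp(-t / iters)
        sigma = sigma0 * np.exp(-t / iters)
        dist2 = ((coords - coords[bmu]) ** 2).sum(1)
        nh = np.exp(-dist2 / (2 * sigma ** 2))                 # neighbourhood
        weights += lr * nh[:, None] * (x - weights)
    return weights, coords

def project(weights, x):
    """Return the index of a sample's best-matching unit."""
    return np.argmin(((weights - x) ** 2).sum(1))

# Toy "muscle vibration features": two movement classes in 16-D feature space.
rng = np.random.default_rng(1)
movements = np.vstack([rng.normal(0, 1, (50, 16)), rng.normal(3, 1, (50, 16))])
weights, coords = train_som(movements)
print(coords[project(weights, movements[0])], coords[project(weights, movements[-1])])
```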
|
|
16:15-16:30, Paper Mo-S2-T10.2 | |
Controllability Assessment of Belt-Type Wheelchair Interface with Individualized Asymmetry-Aware Body-Axis Calibration |
|
Suzuki, Yuma | School of Science for Open and Environmental Systems, Keio Unive |
Kawasaki, Yosuke | Keio University |
Takahashi, Masaki | Keio University |
Keywords: Assistive Technology, Human-Machine Interface, User Interface Design
Abstract: The conventional joystick, the most common input device for powered wheelchairs, occupies one hand during operation and thus interferes with activities of daily living (ADL). Existing hands-free trunk-motion interfaces require extensive trunk movements, making them unsuitable for users with limited stability and increasing fall risk and psychological load; their calibration schemes also do not explicitly account for left–right asymmetry in trunk kinematics. This study introduces a hands-free belt-type interface that senses subtle belt tensions via a six-axis force/torque sensor mounted on the backrest and maps them to translational and angular velocity commands. The interface frees the user’s hands and avoids substantial postural changes during wheelchair operation. To handle asymmetrical motor function, which is frequently observed in users with physical disabilities, we also introduce a body-axis calibration method. An empirical study involving path-following tasks was conducted with twelve able-bodied participants and two participants with physical disabilities. As a result, the path-tracking accuracy in the able-bodied group was statistically non-inferior to the joystick. In the group with physical disabilities, body-axis calibration substantially improved the belt-type interface performance, reducing its gap relative to joystick control.
|
|
16:30-16:45, Paper Mo-S2-T10.3 | |
Snap, Segment, Deploy: A Visual Data and Detection Pipeline for Wearable Industrial Assistants |
|
Wen, Di | Karlsruhe Institute of Technology |
Zheng, Junwei | Karlsruhe Institute of Technology |
Liu, Ruiping | Karlsruhe Institute of Technology |
Xu, Yi | JD.com, Inc |
Peng, Kunyu | Karlsruhe Institute of Technology, IAR |
Stiefelhagen, Rainer | Karlsruher Institut Für Technologie |
Keywords: Assistive Technology, Human-Machine Interface, Wearable Computing
Abstract: Industrial assembly tasks increasingly demand rapid adaptation to complex procedures and varied components, yet are often conducted in environments with limited compute, connectivity, and strict privacy requirements. These constraints make conventional cloud-based or fully autonomous solutions impractical for factory deployment. This paper introduces a mobile-device-based assistant system for industrial training and operational support, enabling real-time, semi-hands-free interaction through on-device perception and voice interfaces. The system integrates lightweight object detection, speech recognition, and retrieval-augmented generation (RAG) into a modular pipeline that operates entirely on-device, enabling intuitive support for part handling and procedure understanding without relying on manual supervision or cloud services. To enable scalable training, we adopt an automated data construction pipeline and introduce a two-stage refinement strategy to improve visual robustness under domain shift. Experiments on the Gear8 dataset demonstrate improved robustness to domain shift and common visual corruptions. A structured user study further confirms its practical viability, with positive user feedback on guidance clarity and interaction quality. These results indicate that our framework offers a deployable solution for real-time, privacy-preserving smart assistance in industrial environments. We will release the Gear8 dataset and source code upon acceptance.
|
|
16:45-17:00, Paper Mo-S2-T10.4 | |
Intelligent Guidance for the Visually Impaired: A Wearable Smart Helmet System |
|
Zhang, Junqi | East China Normal University |
Duan, Zhuowen | East China Normal University |
Yu, Mingrui | East China Normal University |
Chen, Wenjie | East China Normal University |
Keywords: Assistive Technology, Visual Analytics/Communication, Environmental Sensing
Abstract: Visual impairment has become an increasingly serious issue, while existing assistive solutions often fail to effectively guide the visually impaired. To address this challenge, we designed a smart helmet to support safe and intelligent navigation for visually impaired users. First, we constructed a comprehensive dataset covering various real-world travel scenarios. Then, we designed a Multi-Scale Group Fusion (MSGF) module that efficiently fuses two feature maps with relatively low computational cost. Building upon the YOLO framework, we integrated the MSGF module and Multidimensional Collaborative Attention (MCA) to design the MSMD-YOLO network, which achieves a detection accuracy of 71.8% mAP. To ensure real-time performance, we accelerated the model using Neural Network Processing Units (NPUs), achieving 24 FPS, which is 14.17 times faster than the CPU. Finally, we performed tactile paving segmentation and designed a score-voting algorithm to provide directional guidance to users. We also conducted obstacle detection and calculated depth information using a binocular camera. Based on this, we proposed a weighted algorithm to deliver obstacle alerts to users. Our smart helmet is easy to wear, offers high detection accuracy and real-time performance, and features intelligent voice commands, making it well suited to meet the mobility needs of visually impaired people.
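As a rough illustration of how a score-voting step can turn a tactile-paving segmentation mask into a direction cue, the sketch below splits the mask into left/centre/right sectors and votes for the highest-scoring one; the sector count, row weighting, and labels are hypothetical and not the paper's algorithm.

```python
import numpy as np

def direction_vote(mask, n_sectors=3):
    """A hypothetical score-voting scheme: split the tactile-paving mask into
    vertical sectors, score each by its paving-pixel count weighted toward
    the lower (nearer) rows, and vote for the best sector (labels assume
    exactly three sectors)."""
    h, w = mask.shape
    row_weight = np.linspace(0.5, 1.5, h)[:, None]      # nearer rows count more
    scores = []
    for s in range(n_sectors):
        cols = slice(s * w // n_sectors, (s + 1) * w // n_sectors)
        scores.append(float((mask[:, cols] * row_weight).sum()))
    return ["left", "straight", "right"][int(np.argmax(scores))], scores

# Toy mask: paving pixels concentrated in the centre of the frame.
mask = np.zeros((60, 90), dtype=np.uint8)
mask[20:, 40:70] = 1
print(direction_vote(mask))
```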
|
|
17:00-17:15, Paper Mo-S2-T10.5 | |
A Pressure Sensor Based Wearable Foot Interaction System and Its Applications |
|
Zeng, Limin | Zhejiang University |
Yu, Yinuo | Zhejiang University |
Qiu, Robin | The Pennsylvania State University |
Bu, Jiajun | Zhejiang Key Laboratory of Accessible Perception and Intelligent |
Keywords: Assistive Technology, Wearable Computing
Abstract: Individuals with upper-body motor impairments, but having full lower-body functionality, can use their feet to interact with mobile devices, such as smartphones and tablets. However, they face significant challenges controlling mobile devices and entering text via touchscreen with their feet. Moreover, such interactions are typically confined to limited scenarios, such as sitting in a chair. To address this, we developed a wearable, pressure-sensor-based foot interaction system featuring a novel two-key text entry method. Two use case studies, with a total of 18 participants, demonstrated that our system enables these individuals to control devices and achieve acceptable text entry speeds in diverse scenarios.
|
|
17:15-17:30, Paper Mo-S2-T10.6 | |
Development of an Epidermal Shear Force Estimation Model Based on Porcine Skin |
|
Chiba, Tomoki | Tokyo Denki University |
Inoue, Jun | Tokyo Denki University |
Ogikubo, Kota | Tokyo Denki University |
Keywords: Assistive Technology, Wearable Computing, Haptic Systems
Abstract: As diabetic neuropathy leads to toe deformities and sensory neuropathy impairs pain perception, using commercially available shoes poses health risks for diabetics. Prescribing appropriate footwear is a challenge; medical professionals need an objective system for evaluating shoe compatibility. This can be achieved by measuring the vibrations (that are generated by the in-shoe vertical load and shear forces) to estimate shear forces. To develop a system that can be used in evaluating shoe compatibility, porcine skin, which is biologically similar to human skin, was used to simulate the human foot in a replication experiment. An ultra-fine piezoelectric wire sensor was used to measure the vibrations that were reproduced. Frequency features were extracted from the vibration data; machine learning was applied to evaluate classification accuracy, and a regression model was constructed to assess estimation accuracy using the root mean square error (RMSE). In this study, the maximum classification accuracy was 93.3% when vertical load was included as a feature after porcine skin was excluded. The RMSE was 0.464 N. Future research will focus on enhancing the accuracy of shear force estimation under real walking conditions by exploring innovative methods for attaching sensors to the bottom and sides of the human foot. This approach will help achieve a more accurate and practical solution for real-time evaluations. Additionally, efforts will be made to improve classification accuracy using only vibration waveforms, leading to a more robust and generalized method for estimating shear forces. By addressing individual variability, future experiments on human feet will incorporate personalized approaches that account for differences in foot anatomy, ensuring broader applicability and reliability. These advancements will contribute to the development of a reliable system for real-time shoe fit evaluation, with the potential to improve comfort and health, especially for individuals with sensory impairments.
|
|
17:30-17:45, Paper Mo-S2-T10.7 | |
Wearable Cyborg HAL Trunk Unit Controlled by Voluntary Control Method for Patients with Parkinson’s Disease: A Pilot Study |
|
Ikeda, Kaosu | University of Tsukuba |
Uehara, Akira | University of Tsukuba |
Sankai, Yoshiyuki | University of Tsukuba |
Kawamoto, Hiroaki | University of Tsukuba |
Keywords: Assistive Technology, Wearable Computing, Human-Machine Interaction
Abstract: Parkinson’s disease causes various gait disturbances due to dopamine deficiency in the basal ganglia, significantly reducing patients’ ability to perform activities of daily living and diminishing their quality of life. Previous studies have demonstrated that the wearable cyborg Hybrid Assistive Limb (HAL) trunk unit assisted lateral movement during walking by providing lateral sway, thereby improving gait disturbances. We have developed a hybrid control method for HAL that assists appropriately based on the wearer’s stride time stability and intentional stride time changes, using their biometric data, thus expanding on the existing control method. However, HAL controlled by the hybrid control method has not yet been applied to patients, and it is necessary to verify the feasibility of voluntary control that assists in synchronization with the patient’s lateral movement based on motion intention estimated from biometric information. Additionally, the current power transmission link of the HAL trunk unit lacks a rotation mechanism, preventing accommodation of natural trunk rotation during walking and thereby restricting this motion. To address this limitation, we developed a HAL that adapts to the wearer’s trunk rotation during walking. We confirmed through gait experiments on an able-bodied participant that this new mechanism did not inhibit trunk rotation during walking. Furthermore, we confirmed through a gait experiment that the developed HAL with voluntary control could assist lateral movement in synchronization with a patient with Parkinson’s disease.
|
|
17:45-18:00, Paper Mo-S2-T10.8 | |
Evaluation of Hydraulic Excavator Operator Development through Shared Control Based on Closed-Loop Characteristics |
|
Hiraoka, Kei | Hiroshima University |
Yamamoto, Toru | Hiroshima University |
Kozui, Masatoshi | KOBELCO Construction Machinery Co., Ltd |
Yumoto, Natsuki | Kobelco Construction Machinery Co., Ltd |
Koiwai, Kazushige | Kobelco Construction Machinery Co., LTD |
Keywords: Shared Control, Human Enhancements, Assistive Technology
Abstract: The Sustainable Development Goals (SDGs) have recently garnered increased attention. Additionally, Japan has been advocating for the achievement of "Society 5.0". In particular, the construction industry is promoting "i-Construction", which involves the automation and semi-automation of hydraulic excavators. However, some situations require human judgment, particularly during unforeseen circumstances, underscoring the need for cooperative systems between humans and machines at construction sites. To realize such systems, evaluating human operability is essential. Therefore, the proposed method conducts system identification using closed-loop operating data from the combined human-machine system. Changes in human operability are then captured by evaluating the poles calculated during this process. The effectiveness of the method was experimentally verified using a hydraulic excavator. The findings demonstrated that closed-loop system identification can be performed using operational data. Additionally, the results confirmed the potential for improving the development of hydraulic excavator operators through shared control experience.
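The closed-loop identification and pole evaluation can be illustrated with a generic least-squares ARX fit, as in the NumPy sketch below; the model orders, toy plant, and noise level are assumptions, and the sketch omits the excavator-specific operator model.

```python
import numpy as np

def fit_arx(u, y, na=2, nb=2):
    """Least-squares ARX fit  y[k] = -a1*y[k-1] - ... + b1*u[k-1] + ... ;
    returns the a- and b-coefficients (a generic sketch, not the paper's
    exact identification procedure)."""
    n = max(na, nb)
    rows, targets = [], []
    for k in range(n, len(y)):
        rows.append(np.r_[-y[k - na:k][::-1], u[k - nb:k][::-1]])
        targets.append(y[k])
    theta, *_ = np.linalg.lstsq(np.array(rows), np.array(targets), rcond=None)
    return theta[:na], theta[na:]

# Closed-loop-style operating data from a toy second-order plant.
rng = np.random.default_rng(0)
u = rng.standard_normal(500)
y = np.zeros(500)
for k in range(2, 500):
    y[k] = 1.4 * y[k - 1] - 0.45 * y[k - 2] + 0.5 * u[k - 1] + 0.01 * rng.standard_normal()

a, b = fit_arx(u, y)
poles = np.roots(np.r_[1.0, a])          # poles of the identified model
print("poles:", np.round(poles, 3), "magnitudes:", np.round(np.abs(poles), 3))
```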
|
|
18:00-18:15, Paper Mo-S2-T10.9 | |
DCED: Deformable Convolutional Encoder-Decoder Network for Inflamed Appendix Segmentation and Classification from CT Images |
|
Ng, Wing Yin | South China University of Technology |
Zheng, Peixin | South China University of Technology |
Liang, Yinhao | South China University of Technology |
Wang, Ting | South China Agricultural University |
Zhang, Jianjun | South China Agricultural University |
Dan, Liang | Guangzhou First People’s Hospital/The Second Affiliated Hospital |
Hui, Zhou | The Sixth Affiliated Hospital of Guangzhou Medical University, Q |
Li, Guangming | Department of Radiology, the Sixth Affiliated Hospital of Guangz |
Wei, Xinhua | Department of Radiology, Guangzhou First People's Hospital, Sout |
Keywords: Biometrics and Applications, Medical Informatics, Assistive Technology
Abstract: Acute appendicitis (AA) is one of the most prevalent acute abdominal conditions requiring surgery. The recognition and segmentation of the inflamed appendix are important for AA diagnosis. However, finding and segmenting the inflamed appendix from computed tomography (CT) images is a challenging task due to the varying sizes and shapes of different appendices and their blurred borders with nearby tissues. To the best of our knowledge, general-purpose expert segmentation models struggle with these characteristics of the inflamed appendix. Thus, we propose a deformable convolutional encoder-decoder network (DCED) for better recognition and segmentation of the inflamed appendix. The network consists of an encoder, a bottleneck, and a decoder. The encoder is composed of several convolutional neural network (CNN) layers to capture local structural information. The bottleneck, based on a vision transformer (ViT), focuses on the region of interest (ROI) using a global attention mechanism. The encoder and bottleneck modules effectively combine the local and global information of the input data to locate the inflamed appendix. The decoder, based on a deformable convolutional network (DCN), learns the varied boundary information, which helps to improve the accuracy of boundary segmentation. Extensive experimental results on a real-world AA dataset show that the proposed method yields the best average Dice similarity coefficient (DSC) of 71.29% and average 95th-percentile Hausdorff distance (HD95) of 12.38 mm in comparison to state-of-the-art segmentation methods.
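For reference, the two reported metrics can be computed as in the following SciPy/NumPy sketch (pixel units; multiply by the voxel spacing to obtain mm). The dense boundary-distance implementation and the toy masks are illustrative only.

```python
import numpy as np
from scipy.ndimage import binary_erosion
from scipy.spatial.distance import cdist

def dice(pred, gt):
    """Dice similarity coefficient between two binary masks."""
    inter = np.logical_and(pred, gt).sum()
    return 2.0 * inter / (pred.sum() + gt.sum() + 1e-8)

def hd95(pred, gt):
    """95th-percentile symmetric Hausdorff distance between mask boundaries
    (in pixels). Simple dense implementation for small masks."""
    def boundary(m):
        return np.argwhere(m & ~binary_erosion(m))
    bp, bg = boundary(pred.astype(bool)), boundary(gt.astype(bool))
    d = cdist(bp, bg)
    return max(np.percentile(d.min(axis=1), 95), np.percentile(d.min(axis=0), 95))

# Toy example: ground-truth disc vs. a slightly shifted prediction.
yy, xx = np.mgrid[:64, :64]
gt = (xx - 32) ** 2 + (yy - 32) ** 2 < 10 ** 2
pred = (xx - 34) ** 2 + (yy - 32) ** 2 < 10 ** 2
print(f"DSC={dice(pred, gt):.3f}  HD95={hd95(pred, gt):.1f} px")
```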
|
|
18:15-18:30, Paper Mo-S2-T10.10 | |
Empowering Audiobook Creation: An LLM-Powered Interactive System for Soundscape Design |
|
Xue, Shan | University of Tsukuba |
Nobuhara, Hajime | University of Tsukuba |
Keywords: Human-Computer Interaction, Assistive Technology, Interactive and Digital Media
Abstract: Large Language Models (LLMs) offer efficient and accessible support for audio engineers. However, text-only assistance lacks intuitive interaction and often fails to meet the practical demands of digital design. This study explores how an LLM can provide more effective support for audiobook producers, especially novices. We present a GPT-4o-powered analysis tool that extracts soundscape elements from text and aligns sound effects with semantic cues. By fusing LLM-guided interpretation with interactive audio control, the system introduces a new paradigm for supporting creative design through both semantic and acoustic dimensions. We invited three experts to evaluate the system and conducted a design experiment with 26 participants. Compared to the traditional method, our approach significantly improved design accuracy and efficiency. This work highlights the implicit tension and synergy between LLM-assisted creation and conventional design thinking, offering practical insights into the development of more adaptive and intelligent support tools for future audiobook production.
|
|
Mo-S2-T12 |
Room 0.96 |
Risk, Security, and Resilience in Cyber-Physical Systems |
Special Sessions: Cyber |
Chair: Guarino, Simone | Università Campus Bio-Medico Di Roma |
Co-Chair: Vitale, Francesco | University of Naples Federico II |
Organizer: Guarino, Simone | Università Campus Bio-Medico Di Roma |
Organizer: Fioravanti, Camilla | Università Campus Bio-Medico Di Roma |
Organizer: Vitale, Francesco | University of Naples Federico II |
Organizer: Ge, Hangli | The University of Tokyo |
Organizer: Flammini, Francesco | Mälardalen University |
|
16:00-16:15, Paper Mo-S2-T12.1 | |
A Multilayer Approach for Statistical-Based Anomaly Detection in Cyber-Physical Systems (I) |
|
Iannaccone, Antonio | University of Naples "Parthenope" |
Nardone, Roberto | University of Naples Parthenope |
Petruolo, Alfredo | University of Naples "Parthenope" |
Keywords: Cloud, IoT, and Robotics Integration, Big Data Computing,, Heuristic Algorithms
Abstract: Cyber-Physical Systems heavily rely on accurate and timely anomaly detection to ensure safety, security, and resilience, while maintaining low operational costs. However, traditional anomaly detection methods often depend on extensive datasets and heavy computational resources, limiting their practical implementation. This paper introduces a multilayer statistically based architecture designed specifically for real-time anomaly detection in CPS environments, without requiring large training datasets. Leveraging an edge-cloud paradigm, the approach combines lightweight statistical analysis performed locally at the edge with advanced centralised correlation analysis in the cloud. Our methodology dynamically adapts anomaly detection thresholds using Free Probability Theory (FPT), integrating real-time external data sources such as traffic information to significantly reduce false positives. A practical validation through a real-world structural health monitoring case study on a bridge in Caserta, Italy, demonstrates the effectiveness and robustness of our system in detecting anomalies, offering a scalable, adaptive, and efficient solution aligned with European data-sharing directives and standards.
|
|
16:15-16:30, Paper Mo-S2-T12.2 | |
Enhancing Continuity: Risk Assessment Analysis of Mission-Critical Network Services in a Disaster-Hit Area (I) |
|
Galassi, Alessandra | University of L'Aquila |
Franchi, Fabio | University of L'Aquila |
Keywords: Computational Intelligence, Cloud, IoT, and Robotics Integration
Abstract: Ever-increasing connectivity and telecommunication network performance optimization are powering innovative applications that connect people, organizations, and smart objects, changing the way we interact and redefining a wide range of industries. As systems become more complex, threats affect both safety and security attributes, and failures can lead to serious consequences. This fuels policies to improve business continuity and protection of mission-critical network services. Through the analysis of the literature and a use case, in this paper we propose a continuous and dynamic workflow hypothesis for risk assessment (before and after the occurrence of an outage) that consists of several steps, namely risk identification, risk assessment, and mitigation strategy selection, to facilitate the continuity of a cyber-physical system operating in a critical environment. The effectiveness of the methodology will be validated through the application to a seismic area in central Italy and the newly developed early warning system tailored to simulate a densely populated sensor network environment. The real-world scenario will demonstrate that the proposed workflow diagram supports resilience decision-making.
|
|
16:30-16:45, Paper Mo-S2-T12.3 | |
Simulation of Emergency Evacuation in Large Scale Metropolitan Railway Systems for Urban Resilience (I) |
|
Ge, Hangli | The University of Tokyo |
Fan, Zipei | Jilin University, School of Artificial Intelligence |
Yang, Xiaojie | The University of Tokyo |
Flammini, Francesco | Mälardalen University |
Koshizuka, Noboru | The University of Tokyo |
Keywords: Big Data Computing,, Cybernetics for Informatics, Computational Intelligence in Information
Abstract: This paper presents a simulation for traffic evacuation during railway disruptions to enhance urban resilience. The research focuses on large-scale railway networks and provides flexible simulation settings to accommodate multiple node or line failures. The evacuation optimization model is mathematically formulated using matrix computation and nonlinear programming. The simulation integrates railway lines operated by various companies, along with external geographical features of the network. Furthermore, to address computational complexity in large-scale graph networks, a subgraph partitioning solution is employed for computation acceleration. The model is evaluated using the extensive railway network of Greater Tokyo. Data collection included both railway network structure and real-world GPS footfall data to estimate the number of station-area visitors for simulation input and evaluation purposes. Several evacuation scenarios were simulated for major stations including Tokyo, Shinjuku, Shibuya and so on. The results demonstrate that both evacuation passenger flow (EPF) and average travel time (ATT) during emergencies were successfully optimized, while remaining within the capacity constraints of neighboring stations and the targeted disruption recovery times.
|
|
16:45-17:00, Paper Mo-S2-T12.4 | |
Security-By-Design with Cost-Constrained Opacity Enforcement for Modbus TCP Based Industrial Control Systems (I) |
|
Bonagura, Valeria | Roma Tre |
Cavone, Graziana | University Roma Tre |
Pascucci, Federica | Università Roma Tre |
Keywords: Heuristic Algorithms, Information Assurance and Intelligence, Cybernetics for Informatics
Abstract: In the era of Industry 5.0, securing Industrial Control Systems (ICS) is increasingly vital, especially when relying on legacy communication protocols like Modbus TCP that may lack built-in protection mechanisms. This paper addresses the challenge of preserving the confidentiality of internal system states from potential cyber adversaries through a security-by-design framework. We propose a novel approach that leverages Discrete Event Systems (DES) theory to model communication flows and applies probabilistic opacity to quantify the risk of state disclosure. Central to our method is the concept of selective encryption: instead of encrypting all messages, we strategically encrypt only those events that could reveal sensitive information. This gives rise to a budget-constrained optimization problem, where the goal is to enforce opacity under resource limitations. To solve this efficiently, we develop a greedy algorithm that maximizes security by allocating encryption effort to the most critical events. The proposed method is validated using a representative example featuring two distinct query types, demonstrating its capability to limit information leakage while keeping low the computational overhead.
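The budget-constrained selection can be pictured with a simple greedy ratio rule, sketched below; the event names, leakage-reduction scores, and costs are invented placeholders, and the paper scores events via probabilistic opacity on a DES model rather than fixed numbers.

```python
def greedy_selective_encryption(events, budget):
    """Greedy budget-constrained selection: encrypt the events with the best
    opacity-gain-per-cost ratio until the budget is exhausted. `events` is a
    list of (name, leakage_reduction, cost) tuples; the scoring model is an
    assumption, not the paper's exact opacity measure."""
    ranked = sorted(events, key=lambda e: e[1] / e[2], reverse=True)
    chosen, spent = [], 0.0
    for name, gain, cost in ranked:
        if spent + cost <= budget:
            chosen.append(name)
            spent += cost
    return chosen, spent

# Example: Modbus-style event types with assumed leakage reductions and costs.
events = [
    ("read_holding_registers", 0.40, 3.0),
    ("write_single_coil",      0.35, 1.5),
    ("read_input_registers",   0.10, 2.0),
    ("diagnostic_query",       0.05, 0.5),
]
print(greedy_selective_encryption(events, budget=4.0))
```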
|
|
17:00-17:15, Paper Mo-S2-T12.5 | |
Analyzing the 2015 Ukraine Power Grid Cyber-Attack: A Quantitative Assessment of Adversary Behavior and Impact (I) |
|
Kordi, Marzieh | IMT School for Advanced Studies Lucca |
Ali, Syed Muhammad Fasih | University of Florence |
Lollini, Paolo | University of Florence |
Bondavalli, Andrea | University of Florence, Department of Mathematics and Computer S |
Keywords: Cybernetics for Informatics, Complex Network
Abstract: The security of critical infrastructures, such as power grids, water treatment facilities, transportation networks, financial systems, and communication networks, is essential for social stability. These systems deliver vital services, but are increasingly reliant on digital control mechanisms, making them vulnerable to cyber threats. A successful cyber-attack on any of these infrastructures could lead to widespread disruptions, significant financial losses, and in severe cases, risks to public safety. An effective cyber-security risk assessment process requires structured methodologies that identify vulnerabilities and anticipate adversarial behavior. Traditional risk assessment approaches rely on static and qualitative analyses that focus on known vulnerabilities and configurations, but lack dynamic attack simulation. In contrast, formal modeling and simulation-based techniques provide a quantitative framework to analyze possible attack paths and their likelihood of success. Among these formal methods, the ADVISE (ADversary VIew Security Evaluation) formalism offers a structured approach to assess cyber threats from the perspective of an adversary. This paper explores the application of the formal security evaluation framework, ADVISE, to model and analyze the 2015 Ukraine Power Grid cyber-attack. It specifically highlights the impact and the importance of the execution timing of the attacks, the adversary capabilities, and the effects of countermeasures throughout the progression of cyber-attacks. This framework simulates attack dynamics and quantifies the security risks associated with the Ukrainian Power Grid, thereby complementing the qualitative analyses conducted in previous studies.
|
|
17:15-17:30, Paper Mo-S2-T12.6 | |
Methodologies and Tools for Quantitative Risk Assessment Including the Human Factor Analysis: A Railway Case Study (I) |
|
Carusone, Pasquale | Hitachi STS |
De Benedictis, Alessandra | University of Naples Federico II |
Gerbasio, Diego | Hitachi STS |
Keywords: Information Assurance and Intelligence, Fuzzy Systems and their applications, Computational Intelligence in Information
Abstract: The railway sector is vital for transporting people and goods, necessitating high safety and reliability standards. The increasing complexity of railway systems, with new technologies and automation, requires effective risk management strategies. Risk Assessment systematically identifies, analyzes, and mitigates hazards from technical failures, human errors, and environmental factors, aiming to maintain safety and resilience. Digitalization and automation have heightened the importance of Risk Assessment, introducing vulnerabilities like programming errors and cyber-attacks. The human factor remains critical, influencing safety through decisions, risk perception, and adherence to procedures. Technological advancements have reduced mechanical failures, but human errors, such as incorrect decisions and fatigue, are now major causes of accidents. The study, in collaboration with Hitachi Rail STS, aims to quantitatively assess risks in metro systems, including human factors, using specialized tools for risk evaluation and fault tree generation.
|
|
17:30-17:45, Paper Mo-S2-T12.7 | |
Chaos Engineering Strategies for Enhancing Resilience in Complex Cyber-Physical (CPS) and Industrial Control Systems (ICS) : A Literature Review (I) |
|
Ismailov, Ali | University of Agder |
Noori, Nadia Saad | University of Agder |
Keywords: Cloud, IoT, and Robotics Integration, Cybernetics for Informatics, Expert and Knowledge-Based Systems
Abstract: Chaos engineering has emerged as a proactive discipline for assessing and enhancing the resilience of complex cyber-physical and industrial control systems. In this literature review, we systematically synthesize recent advances in applying chaos engineering principles to critical infrastructure domains such as power grids, water treatment facilities, and industrial processes. Guided by two central research questions, our review employs a systematic snowball literature review methodology complemented by Boolean keyword searches to identify key experimental studies and frameworks. The literature reveals that deliberate fault injection from sensor disruptions to simulated cyberattacks can effectively uncover hidden vulnerabilities and validate system safeguards. In addition, we highlight the benefits of a modular approach to chaos engineering, whereby complex industrial control systems are decomposed into discrete modules for targeted testing, including the integration of digital twin technologies and cloud-based platforms. Our analysis addresses significant challenges such as safety management, system interdependencies, and the need for domain-specific frameworks, ultimately charting a course for future research and practical applications aimed at designing more robust and reliable critical infrastructures.
|
|
17:45-18:00, Paper Mo-S2-T12.8 | |
Cyber Situation Awareness Using Network Activity Classification Based on Granular Computing (I) |
|
Bellini, Emanuele | University of Roma Tre |
D'Aniello, Giuseppe | University of Salerno |
Flammini, Francesco | Mälardalen University |
Gaeta, Matteo | University of Salerno |
Iovaro, Damiana | University of Salerno |
Keywords: Computational Intelligence, Application of Artificial Intelligence, Hybrid Models of Computational Intelligence
Abstract: Cyber Situation Awareness requires effective methods to interpret complex, dynamic network data, and Granular Computing offers a powerful framework for managing such complexity through abstraction. In this work, we propose a granular computing-based approach for network activity classification that supports Cyber Situation Awareness by combining the Clustering-by-Time method with the principle of justifiable granularity. The system selects the most informative subsets of traffic within time windows, summarizes them into optimized frames, and trains a Random Forest classifier for anomaly detection. Evaluated on the LUFlow dataset, the approach achieves significant data reduction — up to 98% — while maintaining good detection accuracy. This enables scalable and efficient intrusion detection in complex network environments.
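A stripped-down version of the pipeline, time-window granulation followed by a Random Forest, is sketched below with scikit-learn; the window length, summary statistics, and synthetic flow records are assumptions, and the justifiable-granularity optimization is replaced by plain mean/max/count summaries.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def granulate(flows, labels, window=60.0):
    """Group flow records into time windows and summarize each window into a
    single information granule (simple statistics stand in for the
    justifiable-granularity step)."""
    t = flows[:, 0]
    X, y = [], []
    for w in np.unique((t // window).astype(int)):
        idx = (t // window).astype(int) == w
        feats = flows[idx, 1:]
        X.append(np.r_[feats.mean(0), feats.max(0), [idx.sum()]])
        y.append(int(labels[idx].max()))       # window is anomalous if any flow is
    return np.array(X), np.array(y)

# Toy flow records: [timestamp, bytes, packets, duration]; later windows attacked.
rng = np.random.default_rng(0)
n = 600
flows = np.c_[np.sort(rng.uniform(0, 600, n)),
              rng.normal([500, 8, 1.0], [50, 2, 0.2], (n, 3))]
labels = (flows[:, 0] > 420).astype(int)
flows[labels == 1, 1] *= 4                      # attack traffic inflates byte counts

X, y = granulate(flows, labels)
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
print("window-level accuracy:", clf.score(X, y))
```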
|
|
18:00-18:15, Paper Mo-S2-T12.9 | |
Conceptual Framework for Testbed Design: Specific to Cybersecurity and Operational Safety Anomalies in DER-Rich Smart Grid (I) |
|
Sechi, Fabien | University of Agder |
Noori, Nadia Saad | University of Agder |
Keywords: Cybernetics for Informatics, AI and Applications, Hybrid Models of Computational Intelligence
Abstract: Digitalisation and the rapid uptake of distributed energy resources (DERs) are making Europe’s smart-grid (SG) ecosystem more complex—and more exposed to cyber-threats. Although AI-based anomaly-detection tools are increasingly deployed, they rarely distinguish malicious intrusions from safety-critical equipment faults. Real progress therefore depends on datasets and testbeds that reproduce ordinary operation, failures and attacks before new analytics are rolled out. We present a conceptual cyber-physical-system (CPS) testbed designed for DER-rich European SGs. The platform, hosted at the Institute for Energy Technology, links (i) a power-flow simulation of the CIGRÉ benchmark distribution grid, (ii) a Human-Machine-Interface/SCADA control room and (iii) a Security Operations Centre. Across these layers, human operators, AI agents, safety interlocks and security controls interact to mimic real socio-technical behaviour under critical scenarios. The study follows three stages. First, we reviewed machine-learning approaches and public datasets, finding none that jointly capture security- and safety-driven anomalies with transparent explanations. Second, a systematic survey of 55 SG/ICS testbeds revealed unmet requirements: realistic operating envelopes, fault diversity, human–AI interaction and structured root-cause analysis. Third, we translated those needs into a modular architecture and validated it through expert consultation with power-system and cybersecurity specialists. The resulting blueprint can emulate cascading equipment faults, coordinated cyber-attacks and operator responses while logging high-resolution electrical, network and process data. We outline performance metrics—realism, scalability, flexibility and security robustness—and an evaluation workflow that researchers can reuse to benchmark anomaly-detection models and incident-response procedures. By bridging the gap between security testbeds and power-system simulators, the proposed framework supports AI tools that correctly classify cyber intrusions versus operational faults, thereby strengthening the safety and resilience of critical energy infrastructure worldwide.
|
|
Mo-S2-T13 |
Room 0.97 |
Optimization and Metaheuristic Algorithms |
Regular Papers - Cybernetics |
Chair: Widl, Edmund | Austrian Institute of Technology |
Co-Chair: Mousavirad, Seyed Jalaleddin | Mid Sweden University |
|
16:00-16:15, Paper Mo-S2-T13.1 | |
PruneClust-DE: A Novel Dual-Strategy Clustering-Based Differential Evolution Algorithm for Neural Network Training |
|
Mousavirad, Seyed Jalaleddin | Mid Sweden University |
O'Nils, Mattias | Mittuniversitetet |
Schaefer, Gerald | Loughborough University |
Oliva, Diego | Universidad De Guadalajara |
Keywords: Metaheuristic Algorithms, Computational Intelligence, Evolutionary Computation
Abstract: Training artificial neural networks is a fundamental step in developing machine learning models, as it determines their ability to learn and generalise from data. While gradient-based methods such as stochastic gradient descent and its variants dominate training approaches, they are susceptible to issues like sensitivity to initialisation and convergence to local optima. To address these challenges, gradient-free metaheuristic algorithms, such as differential evolution (DE), are promising alternatives due to their ability to effectively explore complex optimisation landscapes. In this paper, we propose a novel DE-based algorithm, PruneClust-DE, for training multilayer neural networks. Our approach introduces two key strategies: (1) clustering-based interpolation, which partitions the population into clusters, identifies centroids, and generates new candidate solutions by interpolating between cluster centroids to balance exploration and exploitation, and (2) fitness-based pruning, a mechanism that retains only the fittest individuals after introducing new candidates, ensuring a constant yet high-quality population. We validate our proposed algorithm across diverse datasets and compare its performance with other state-of-the-art methods, demonstrating its superiority in achieving robust results.
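One possible reading of the two strategies is sketched below with NumPy and scikit-learn: k-means centroids are interpolated to create candidates, and the combined pool is pruned back to the original population size by fitness. The cluster count, sphere objective, and population sizes are placeholders, not the authors' configuration.

```python
import numpy as np
from sklearn.cluster import KMeans

def sphere(x):
    """Stand-in objective (e.g. a network training loss)."""
    return (x ** 2).sum(axis=1)

def cluster_interpolation(pop, fitness, k=3, n_new=5, rng=None):
    """PruneClust-DE-style step (sketch, not the authors' code): cluster the
    population, interpolate between randomly chosen pairs of cluster
    centroids to create candidates, then prune back to the original size by
    keeping the fittest individuals."""
    rng = rng or np.random.default_rng()
    centroids = KMeans(n_clusters=k, n_init=10, random_state=0).fit(pop).cluster_centers_
    candidates = []
    for _ in range(n_new):
        i, j = rng.choice(k, size=2, replace=False)
        lam = rng.random()
        candidates.append(lam * centroids[i] + (1 - lam) * centroids[j])
    candidates = np.array(candidates)
    all_pop = np.vstack([pop, candidates])
    all_fit = np.r_[fitness, sphere(candidates)]
    keep = np.argsort(all_fit)[: len(pop)]        # fitness-based pruning
    return all_pop[keep], all_fit[keep]

rng = np.random.default_rng(0)
pop = rng.uniform(-5, 5, (20, 10))               # e.g. flattened network weights
pop, fit = cluster_interpolation(pop, sphere(pop), rng=rng)
print("best fitness after one step:", fit.min())
```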
|
|
16:15-16:30, Paper Mo-S2-T13.2 | |
A Model-Based Approach to Quantifying Real-Time Performance and Resource Utilization in Heterogeneous Embedded Systems |
|
Li, He | Southeast University |
Chen, Long | Southeast University |
Xiaoping, Li | Southeast University |
Keywords: Metaheuristic Algorithms, Heuristic Algorithms, Optimization and Self-Organization Approaches
Abstract: The inherent complexity of heterogeneous embedded systems, characterized by dynamic resource interactions and diverse architectural components, presents significant challenges in quantifying real-time performance and optimizing resource allocation. To address these challenges, we propose a novel model-based framework that integrates three critical dimensions: application software behavior modeling, heterogeneous hardware architecture characterization, and software-hardware mapping configuration. We establish a unified modeling framework with standardized representations of software runtime resources, performance metrics, hardware capabilities, and mapping relationships. Using semi-physical co-simulation techniques, we quantitatively evaluate system-critical metrics, including task timing guarantees, computational resource efficiency, and interconnect bandwidth utilization. Furthermore, we formulate and implement optimization strategies to enhance real-time performance and operational efficiency. The efficacy of the framework is validated through the implementation of a bioinformatics workflow (the Epigenomics pipeline) on a heterogeneous embedded platform, comparing six representative scheduling paradigms: MIN-MIN, MAX-MIN, First-Come-First-Served (FCFS), Round Robin (RR), Genetic Algorithm (GA), and Particle Swarm Optimization (PSO). Experimental results demonstrate significant improvements in real-time performance and resource utilization efficiency over traditional methods, highlighting the practical value and effectiveness of the proposed methodology.
|
|
16:30-16:45, Paper Mo-S2-T13.3 | |
Layer-Wise Adaptive Compression Method under Non-IID Settings for Federated Learning |
|
Feng, Ziyuan | Nanjing University of Science and Technology |
Wang, Zijun | Nanjing University of Science and Technology |
Gao, Peng | Nanjing University of Science and Technology |
Qu, Zhihao | Hohai University |
Keywords: Optimization and Self-Organization Approaches
Abstract: Federated learning (FL) enables collaborative model training while preserving data privacy through decentralized data storage. However, the frequent transmission of high-dimensional model updates between FL clients and the central server incurs substantial communication overhead. Although prior studies compress model updates to reduce transmission overhead, fixed‑rate schemes retain two major limitations: insensitivity to client‑level data heterogeneity and uniform layer‑wise compression, resulting in undercompression or overcompression of different layers. To overcome these issues, we propose a Layer-wise Adaptive Compression in Non-IID Situation (LWACN) algorithm, which applies global-local gradient similarity and client label entropy to measure the degree of client non-IID in compression. Moreover, we introduce window loss fluctuation and layer importance to mitigate the mismatching problem caused by the constant compression rate. Extensive experiments demonstrate that LWACN exhibits a better convergence rate and generalization ability than fixed compression. Specifically, compared to the state-of-the-art method, LWACN reduces transmission cost by up to 19.4%, and improves the final model accuracy by 5.5%. Our code can be found at https://github.com/AzumaSeren1209/LWACN-FL.
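The two non-IID signals mentioned in the abstract, global-local gradient similarity and client label entropy, can be sketched roughly as below, where they are combined into a per-client keep-rate for compression. The combination rule, constants, and function names are assumptions; the paper's layer-wise weighting via window loss fluctuation and layer importance is not reproduced here.

```python
# Hypothetical illustration of the non-IID signals used to pick a per-client
# compression rate (not the authors' implementation).
import numpy as np

def label_entropy(labels, n_classes):
    """Shannon entropy of a client's label histogram, normalised to [0, 1]."""
    counts = np.bincount(labels, minlength=n_classes).astype(float)
    p = counts / counts.sum()
    p = p[p > 0]
    return float(-(p * np.log(p)).sum() / np.log(n_classes))

def grad_similarity(local_grad, global_grad):
    """Cosine similarity between flattened local and global gradients."""
    a, b = local_grad.ravel(), global_grad.ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def compression_rate(labels, local_grad, global_grad,
                     n_classes=10, r_min=0.05, r_max=0.5):
    """Keep more coordinates (higher rate) for clients that look more non-IID,
    i.e. low label entropy and low similarity to the global update."""
    non_iid = 1.0 - 0.5 * (label_entropy(labels, n_classes)
                           + max(grad_similarity(local_grad, global_grad), 0.0))
    return r_min + (r_max - r_min) * non_iid

rng = np.random.default_rng(1)
labels = rng.choice(10, size=500, p=[0.55, 0.25, 0.1, 0.1, 0, 0, 0, 0, 0, 0])
local_g, global_g = rng.normal(size=1000), rng.normal(size=1000)
print("suggested keep-rate:", round(compression_rate(labels, local_g, global_g), 3))
```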
|
|
16:45-17:00, Paper Mo-S2-T13.4 | |
A Minimal-Cost Framework for Joint Reputation and Influence Management in Public Engagement |
|
Zhang, Hangjing | Tsinghua University |
Zhao, Hong Vicky | Tsinghua University |
Dai, Yixin | Tsinghua University |
Keywords: Optimization and Self-Organization Approaches, Agent-Based Modeling, Complex Network
Abstract: In complex social environments, influential entities-such as governments, corporations, and public figures-face growing challenges in effectively managing both their reputation and influence during public engagement activities such as governance, service delivery, and crisis communication. In these interactions, the public forms perceptions and responds behaviorally, which can be reflected in two dynamic and interdependent attributes: reputation and influence. They are jointly shaped by the entity's actual performance over time, such as the provision of products or services. Prior research has extensively examined the management of reputation and influence as separate issues, often neglecting their interactions. This study analyzes strategies based on the influential entity's actual performance, accounting for the mutual interdependence between reputation and influence. Based on the reputation-influence co-evolution model, we propose a framework for determining the minimum time-invariant performance required to meet predefined thresholds for reputation and influence. We obtain the optimal strategy to satisfy both requirements. Simulation results on both real-world data and synthetic networks validate the proposed model and our theoretical analysis.
|
|
17:00-17:15, Paper Mo-S2-T13.5 | |
Path Optimization Approach for Post-Disaster UAV Search Based on a Novel Evolutionary Neural Network |
|
Zhang, Lijie | Nanjing University of Aeronautics and Astronautics |
Li, Xin | Nanjing University of Aeronautics and Astronautics |
Qin, Xiaolin | Nanjing University of Aeronautics and Astronautics |
Keywords: Optimization and Self-Organization Approaches, Application of Artificial Intelligence, Evolutionary Computation
Abstract: Unmanned aerial vehicles (UAVs) have attracted widespread attention in post-disaster search and rescue (SAR) due to high flexibility and low-cost advantages. However, traditional centralized control approaches face problems such as poor adaptability and low robustness in complex and dynamic post-disaster environments. In order to improve the autonomy and execution efficiency of UAV cooperative search tasks, decentralized control methods have gradually become a research focus. However, how to efficiently realize autonomous path planning for UAVs under the condition of limited computational resources is still a key challenge to be solved. In this paper, we propose a dynamic adaptive path optimization method based on evolutionary neural network (DAPO-ENN), which combines the global search capability of evolutionary algorithms with the adaptive characteristics of neural networks to realize decentralized autonomous path planning and search coverage optimization of UAVs in post-disaster environments. DAPO-ENN can optimize the performance of the model under the limitation of computational resources, and adapt to the dynamic changes of the environment by online path optimization adjustment, so as to effectively improve the coverage efficiency while ensuring a high search coverage rate. The experimental results show that the DAPO-ENN proposed in this paper has stronger environmental adaptability and lower resource consumption than the existing comparison algorithms. The results suggest that the method provides an efficient and flexible solution for the cooperative search of UAVs after disasters.
|
|
17:15-17:30, Paper Mo-S2-T13.6 | |
Nonlinear Mapping Meets Multi-Task Bayesian Optimization: A Knowledge Transfer Perspective |
|
Rui, Qingyun | SCUT |
Liu, Wei-Li | Guangdong Polytechnic Normal University |
Wu, Yusheng | GD Midea Heating & Ventilating Equipment Co., Ltd |
Zhong, Jinghui | South China University of Technology |
Keywords: Optimization and Self-Organization Approaches, Transfer Learning, Computational Intelligence
Abstract: Bayesian optimization (BO), a data-efficient method for expensive black-box optimization, has traditionally focused on single-task scenarios, ignoring potential correlations among related tasks and leading to resource inefficiency due to repeated explorations. While existing multi-task BO methods mainly enhance surrogate models and sampling strategies, they rely on implicit knowledge transfer mechanisms that risk performance degradation from interference tasks, leveraging existing knowledge to optimize similar tasks instead of jointly optimizing multiple tasks from scratch. To address these issues, we propose a novel algorithm with adaptive knowledge transfer via kernelized autoencoding for multi-task Bayesian optimization (AKT-MTBO), which mainly has two core innovations. One is a kernel-induced task similarity measurement, where a kernelized autoencoding mechanism is employed to capture the nonlinear relationships between datasets. The other is an adaptive explicit knowledge transfer mechanism, where a heuristic rule is introduced to dynamically adjust the priority of selection of auxiliary tasks, ensuring selective collaboration while mitigating interference. Experiments on benchmark problems demonstrate that our proposed AKT-MTBO performs reliably in terms of both optimization efficiency and optimal solution success rates.
|
|
17:30-17:45, Paper Mo-S2-T13.7 | |
Pheromone-Focused Ant Colony Optimization Algorithm for Path Planning |
|
Liu, Yi | Fudan University |
Zhang, Hongda | Fudan University |
Gan, Zhongxue | Fudan University |
Chen, Yuning | Fudan University |
Zhou, Ziqing | Fudan University |
Meng, Chunlei | Fudan University |
Ouyang, Chun | Fudan University |
Keywords: Swarm Intelligence, Heuristic Algorithms, Metaheuristic Algorithms
Abstract: Ant Colony Optimization (ACO) is a prominent swarm intelligence algorithm extensively applied to path planning. However, traditional ACO methods often exhibit shortcomings, such as blind search behavior and slow convergence within complex environments. To address these challenges, this paper proposes the Pheromone-Focused Ant Colony Optimization (PFACO) algorithm, which introduces three key strategies to enhance the problem-solving ability of the ant colony. First, the initial pheromone distribution is concentrated in more promising regions based on the Euclidean distances of nodes to the start and end points, balancing the trade-off between exploration and exploitation. Second, promising solutions are reinforced during colony iterations to intensify pheromone deposition along high-quality paths, accelerating convergence while maintaining solution diversity. Third, a forward-looking mechanism is implemented to penalize redundant path turns, promoting smoother and more efficient solutions. These strategies collectively produce the focused pheromones to guide the ant colony's search, which enhances the global optimization capabilities of the PFACO algorithm, significantly improving convergence speed and solution quality across diverse optimization problems. The experimental results demonstrate that PFACO consistently outperforms comparative ACO algorithms in terms of convergence speed and solution quality.
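A small sketch of the first strategy only: concentrating the initial pheromone on grid cells near the straight corridor between the start and goal, based on Euclidean distances. The exponential decay rule, its constant, and the grid setup are assumptions used purely for illustration.

```python
# Hypothetical sketch of distance-focused pheromone initialisation on a grid,
# in the spirit of PFACO's first strategy (not the authors' code).
import numpy as np

def focused_pheromone(shape, start, goal, tau0=1.0, beta=0.15):
    """Give cells near the start-goal corridor a higher initial pheromone."""
    h, w = shape
    ys, xs = np.mgrid[0:h, 0:w]
    d_start = np.hypot(ys - start[0], xs - start[1])
    d_goal = np.hypot(ys - goal[0], xs - goal[1])
    d_line = np.hypot(goal[0] - start[0], goal[1] - start[1])
    detour = d_start + d_goal - d_line          # 0 on the straight line
    return tau0 * np.exp(-beta * detour)

tau = focused_pheromone((20, 20), start=(0, 0), goal=(19, 19))
print(tau[0, 0], tau[10, 10], tau[0, 19])       # high on the corridor, low off it
```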
|
|
17:45-18:00, Paper Mo-S2-T13.8 | |
Controllable Multimodal Landscapes: An Interpretable Surrogate Model for Combinatorial Spaces and Its Application to the K-Order Traveling Salesman Problem |
|
Luan, Feng | Xi'an Jiaotong University |
Shi, Jialong | Xi'an Jiaotong University |
Sun, Jianyong | Xi'an Jiaotong University |
Keywords: Swarm Intelligence, Hybrid Models of Neural Networks, Fuzzy Systems, and Evolutionary Computing, Optimization and Self-Organization Approaches
Abstract: Surrogate models are widely employed to address optimization problems with high computational complexity. However, interpretable surrogate models for Combinatorial Optimization Problems (COPs) remain underexplored. In this paper, a novel surrogate model, termed the Controllable Multimodal Landscape (CML), is proposed for the k-order Traveling Salesman Problem (k-order TSP). The proposed method is inspired by the observation that a Traveling Salesman Problem (TSP) with cities arranged on a convex hull exhibits a unimodal landscape. For the k-order TSP, multiple local optima (or high-quality solutions) are collected, and convex hull TSPs are constructed based on them to generate multiple unimodal landscapes sharing the same search space. These unimodal landscapes are then combined to form a multimodal landscape that approximates the original landscape of the k-order TSP. Particle Swarm Optimization (PSO) is used to optimize the parameters of the CML. Experimental results demonstrate that the proposed CML surrogate model achieves higher accuracy than Random Forest (RF) in most test cases involving k-order TSP instances.
|
|
18:00-18:15, Paper Mo-S2-T13.9 | |
Domain-Adaptation Network for Knowledge Transfer in SOFC Operational Mode Identification under Extreme Distribution Shift |
|
Wang, Jingjing | Huazhong University of Science and Technology |
Fan, Lixin | Nanyang Technological University |
Zhang, Shuyu | Huazhong University of Science and Technology |
Lai, Jingang | Huazhong University of Science and Technology |
Deng, Zhonghua | Huazhong University of Science and Technology |
Xu, Yuanwu | Wuhan University of Science and Technology |
Li, Xi | Huazhong University of Science and Technology |
Keywords: Transfer Learning, Deep Learning, Neural Networks and their Applications
Abstract: Solid Oxide Fuel Cell (SOFC) systems are increasingly deployed for distributed energy applications, ranging from kilowatt-scale residential systems to megawatt-scale industrial parks, due to their efficient and clean energy conversion characteristics. Accurate operational mode identification is essential for preventing stack damage, optimizing performance, and ensuring safety. However, existing methods struggle with cross-system transfer (varying power ratings and device configurations) due to disparities in feature distribution. To address this issue, this study analyzes the operational features of 1kW and 35kW systems, revealing the fundamental causes of the failures of traditional transfer methods, and proposes an innovative Domain-Adaptation Network designed specifically for large cross-system discrepancies. Our approach employs dedicated encoders for different systems, which share a common classifier, enabling effective knowledge transfer. Experiments demonstrate 98.98% accuracy in operational mode identification between 1kW and 35kW systems, with an inference time of 0.64ms, requiring only 10% of target domain data. Comparative experiments show that our approach significantly outperforms traditional domain adaptation methods, which achieve less than 10% accuracy, thereby reducing data collection costs and accelerating deployment cycles.
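The core design stated in the abstract, a dedicated encoder per system feeding a classifier shared across systems, can be sketched in a few lines of PyTorch as below. The layer sizes, feature dimensions, and number of operational modes are placeholder assumptions.

```python
# Hypothetical PyTorch sketch of per-system encoders with a shared classifier,
# following the design described in the abstract (dimensions are placeholders).
import torch
import torch.nn as nn

def encoder(in_dim, latent_dim=64):
    return nn.Sequential(nn.Linear(in_dim, 128), nn.ReLU(),
                         nn.Linear(128, latent_dim), nn.ReLU())

class CrossSystemModeClassifier(nn.Module):
    def __init__(self, in_dim_1kw=12, in_dim_35kw=30, n_modes=5):
        super().__init__()
        self.enc = nn.ModuleDict({
            "sys_1kw": encoder(in_dim_1kw),    # dedicated encoder per system
            "sys_35kw": encoder(in_dim_35kw),
        })
        self.classifier = nn.Linear(64, n_modes)  # shared across systems

    def forward(self, x, system):
        return self.classifier(self.enc[system](x))

model = CrossSystemModeClassifier()
x_small = torch.randn(8, 12)                   # batch from the 1 kW system
x_large = torch.randn(8, 30)                   # batch from the 35 kW system
print(model(x_small, "sys_1kw").shape, model(x_large, "sys_35kw").shape)
```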
|
|
18:15-18:30, Paper Mo-S2-T13.10 | |
Reconstruction of Switching Networks with Unknown Switching Instants and Number of Subnetworks |
|
Zheng, Yaozhong | Huazhong University of Science and Technology |
Dai, Dongyi | Huazhong University of Science and Technology |
Xu, Bowen | Northwestern Polytechnical University |
Wu, Yue | Beijing Forestry University |
Ding, Jianing | Huazhong University of Science and Technology |
Xing, Ning | Huazhong University of Science and Technology |
Zhang, Hai-Tao | Huazhong University of Science and Technology |
Keywords: Complex Network, Swarm Intelligence
Abstract: Reconstructing dynamical networks based on time series of nodal states is of significant interest in many fields of science and engineering. Despite recent progress in network reconstruction, most research focuses on static structures, rather than on dynamic ones with unknown switching instants and number of subnetworks. Therefore, this paper develops a method for reconstructing switching networks, where a new sparse Bayesian learning algorithm is proposed to estimate switching instants. The proposed method is theoretically proved to be convergent. Experimental results are elaborated to demonstrate the effectiveness and superiority of the proposed method.
|
|
Mo-S2-T14 |
Room 1.85 |
Human Performance Modeling |
Regular Papers - HMS |
Chair: Sarkar, Subharag | University of Texas at Arlington |
Co-Chair: An, Qi | The University of Tokyo |
|
16:00-16:15, Paper Mo-S2-T14.1 | |
Estimation of Lower Limb Joint Torque Using Handrail Force and Floor Reaction Force During Sit-To-Stand Motion in the Elderly |
|
Wakamatsu, Yuta | The University of Tokyo |
Kikuchi, Ken | University of Tokyo |
Hamada, Hiroyuki | The University of Tokyo |
Nakayama, Kazuhiro | ASANOHI Orthopedic Clinic |
Miyoshi, Kanta | ASANOHI Orthopedic Clinic |
Yamashita, Atsushi | The University of Tokyo |
An, Qi | The University of Tokyo |
Keywords: Human Performance Modeling, Assistive Technology, Medical Informatics
Abstract: Many elderly individuals experience a decline in motor function. To provide appropriate rehabilitation programs, a sufficient and convenient evaluation method is necessary. In this study, we focused on the sit-to-stand motion, a crucial activity in daily life, and measured the forces applied to handrails to obtain force data safely and easily. Previous studies have proposed methods for estimating scores such as the Timed Up and Go test from forces applied to the hand, hip, and foot, classifying elderly individuals into several motor function categories. However, these indicators are insufficient for evaluating the function of specific muscles or joints in the lower extremities individually. In this study, we focused on joint torque, which more directly represents the function of specific muscles and joints. We measured the time-series data of forces acting on the body during sit-to-stand movements and developed a model using Long Short-Term Memory to estimate lower limb joint torques. As a result, we have developed a method to accurately estimate knee and hip joint torques from force applied to hand, hip and foot during the sit-to-stand motion. Furthermore, this method demonstrated the potential for early detection of joint disorders. This approach allows for a detailed assessment of knee and hip joint conditions simply by having the individual stand up while holding onto a handrail.
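A bare-bones PyTorch sketch of the kind of model described here, an LSTM mapping time series of hand, hip, and foot forces to knee and hip joint torques. The channel count, window length, and layer sizes are illustrative assumptions rather than the authors' settings.

```python
# Hypothetical sketch of an LSTM that maps measured force time series to lower
# limb joint torques, as outlined in the abstract (shapes are placeholders).
import torch
import torch.nn as nn

class TorqueLSTM(nn.Module):
    def __init__(self, n_force_channels=9, hidden=64, n_torques=2):
        super().__init__()
        self.lstm = nn.LSTM(n_force_channels, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_torques)   # knee and hip torque

    def forward(self, forces):                      # (batch, time, channels)
        out, _ = self.lstm(forces)
        return self.head(out)                       # torque at every time step

model = TorqueLSTM()
forces = torch.randn(4, 200, 9)   # e.g. 3-axis forces at hand, hip, and foot
print(model(forces).shape)        # -> torch.Size([4, 200, 2])
```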
|
|
16:15-16:30, Paper Mo-S2-T14.2 | |
Research on Selective Auditory Processing in Blind Football Players -Effects of Noise on Sound Localization |
|
Ochiai, Yuta | Waseda University |
Tsuji, Ayumu | Waseda University |
Aihara, Shimpei | Japan Institute of Sports Sciences |
Iwata, Hiroyasu | Waseda University |
Keywords: Human Performance Modeling, Human Enhancements, Virtual and Augmented Reality Systems
Abstract: In blind soccer, players rely solely on auditory cues for navigation and gameplay execution. This study investigates the sound source localization abilities of blind soccer athletes under conditions of background noise. Utilizing the Visual Space Acoustic System developed in our laboratory, we assessed participants’ capacity to track moving sound sources while simultaneously recording head movement data. The results indicate that visually impaired participants with blind soccer experience exhibited significantly enhanced sound source tracking capabilities compared to sighted participants. Moreover, they demonstrated greater adaptability to noisy environments. Post-experimental interviews revealed that visually impaired participants employed a distinct adaptation strategy in noisy conditions, initially localizing both the target sound and background noise before selectively focusing on the target source. Future research will expand the participant pool to include visually impaired individuals without blind soccer experience, enabling an assessment of auditory localization independent of sport-specific training. These findings contribute to the development of optimized training programs for blind soccer athletes and may inform rehabilitation strategies for individuals with visual impairments.
|
|
16:30-16:45, Paper Mo-S2-T14.3 | |
Neuronal Spectral Connectivity Networks to Anticipate Attention Lapses in Challenging Respiratory Environments |
|
Beres, Szilard Laszlo | University of Florida |
Ribeiro Rodrigues, Victoria | University of Florida |
Napoli, Nicholas Joseph | University of Florida |
Keywords: Human Performance Modeling, Human Factors, Brain-Computer Interfaces
Abstract: Whole-brain EEG connectivity offers insights into how attention is modulated under respiratory load, aiming to improve our understanding of inter-regional brain communication in extreme environments such as aviation or deep-sea diving. These conditions impose significant respiratory and interoceptive challenges that strain cognitive control systems and elevate the risk of attention lapses. Particularly during voluntary breathing control, traditional human performance studies attempt to understand cognitive states such as attention by isolating activity in specific brain regions and frequency bands, overlooking the distributed dynamics of brain-body integration. Building on recent theories of global neural coordination, we propose a spectral connectivity framework that captures neuronal oscillatory power, temporal stability, and interregional synchrony across frequency bands and compare it to traditional network methods. This network analysis was validated through a predictive modeling framework, which identified the critical brain regions and connectivity patterns that sustain attention during respiratory stress. Such a predictive network analysis is the first of its kind to address how brain networks are modulated and impact attention lapses, which has allowed us to extend the understanding of neural control of breathing by framing it as a dynamic network system in the brain, in which inhibitory (weakening) and excitatory (strengthening) connections between brain regions modulate in response to respiratory load and cause adaptations in breathing patterns. These attention lapses involve widespread breakdowns in dynamic spectral coordination across the brain, not just in the frontal cortex, providing a more holistic neural marker for tracking cognitive state in demanding environments. Furthermore, our novel signal processing techniques have demonstrated an enhanced ability to characterize the neuronal connectivity patterns within the brain, providing a 7.11% increase in predictive power over traditional coherence metrics.
|
|
16:45-17:00, Paper Mo-S2-T14.4 | |
Predicting Human Detection of Changes in Controlled Element Dynamics in Manual Control |
|
Eppenga, Thomas | TU Delft, Aerospace Engineering, Control & Simulation |
Pool, Daan Marinus | TU Delft |
van Paassen, Marinus M | Delft University of Technology |
Mulder, Max | Delft University of Technology |
Keywords: Human Performance Modeling, Human-Machine Cooperation and Systems, Human Factors
Abstract: A pursuit-tracking manual control model is introduced that includes an observer-like internal model to predict human detection of a change in controlled element dynamics. The internal model's innovation signal, the difference between the observed and expected system response, is studied for its capacity to drive the detection of a change. The model's performance is tested for different crossover frequencies, remnant power ratios, observer gains, and detection threshold settings, through Monte Carlo analysis of simulated pursuit-tracking tasks where the controlled element transitions from single to double integrator dynamics. The model shows highly accurate detection performance for a wide range in the observer gain, with a true positive rate of approximately 1 and a false positive rate of approximately 0.02. The high true and low false positive rates, combined with average detection times that match experimental human-in-the-loop data, show the observer model's potential for accurately predicting human detection of a change in controlled element dynamics.
|
|
17:00-17:15, Paper Mo-S2-T14.5 | |
Flow Rate Estimation Based on Digital Twin Environment of Pouring Work by Manual Operation in Iron Foundry |
|
Fuse, Reo | University of Yamanashi |
Noda, Yoshiyuki | University of Yamanashi |
Tsuyama, Seishi | Isobe Iron Works Co., Ltd |
Asano, Kazuya | Isobe Iron Works Co., Ltd |
Keywords: Human Performance Modeling, Human-Machine Interaction, Information Visualization
Abstract: This study contributes a sophisticated evaluation system for manually operated pouring work in the foundry industry. Manual pouring is widely used in foundries with high-mix, low-volume production, where skilled workers are needed to produce high-quality casting products. However, it is difficult to pass on these skills because the pouring work has not been evaluated explicitly. Therefore, in this study, we propose a pouring flow rate estimation system based on a digital twin environment of the manual pouring work. In the proposed approach, the model parameters of the mathematical pouring process model can be extracted systematically by scanning the practical ladle with a 3D scanner and extracting the parameters through the API functions of a 3D CAD system. The tilting angle and angular velocity of the ladle are measured by the crane scale. The pouring flow rate can then be estimated by simulating the pouring process model with the measurement data. Furthermore, the estimated flow rate is visualized clearly by applying a Hampel filter and a finite-impulse-response-type zero-phase filter. The efficacy of the proposed approach is verified by experiments on practical pouring work with manual operation.
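The visualization step names a Hampel filter followed by a finite-impulse-response-type zero-phase filter; a minimal NumPy/SciPy sketch of such a post-processing chain is given below. The window length, outlier threshold, and cutoff frequency are assumptions, not values from the paper.

```python
# Hypothetical sketch of the described post-processing: a Hampel filter to
# remove outliers, then zero-phase FIR smoothing (parameters are assumptions).
import numpy as np
from scipy.signal import firwin, filtfilt

def hampel(x, window=11, n_sigmas=3.0):
    """Replace samples that deviate from the local median by > n_sigmas MADs."""
    x = x.copy()
    half = window // 2
    for i in range(half, len(x) - half):
        win = x[i - half:i + half + 1]
        med = np.median(win)
        mad = 1.4826 * np.median(np.abs(win - med))
        if mad > 0 and abs(x[i] - med) > n_sigmas * mad:
            x[i] = med
    return x

rng = np.random.default_rng(0)
t = np.linspace(0, 10, 500)
flow = np.clip(np.sin(0.5 * t), 0, None) + 0.05 * rng.normal(size=t.size)
flow[::60] += 1.5                                  # inject spike outliers

cleaned = hampel(flow)
taps = firwin(numtaps=31, cutoff=0.1)              # low-pass FIR coefficients
smooth = filtfilt(taps, [1.0], cleaned)            # zero-phase filtering
print(smooth.shape)
```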
|
|
17:15-17:30, Paper Mo-S2-T14.6 | |
Human-Like Trajectories Generation Via Receding Horizon Tracking Applied to the TickTacking Interface |
|
Masti, Daniele | IMT School for Advanced Studies Lucca |
Menchetti, Stefano | IMT School for Advanced Studies Lucca |
Gnecco, Giorgio | IMT School for Advanced Studies Lucca |
Erdem, Çağri | University of Milan |
Rocchesso, Davide | University of Milan |
Keywords: Human Performance Modeling, Human-Machine Interface, Human-Computer Interaction
Abstract: TickTacking is a rhythm-based interface that allows users to control a pointer in a two-dimensional space through dual-button tapping. This paper investigates the generation of human-like trajectories using a receding horizon approach applied to the TickTacking interface in a target-tracking task. By analyzing user-generated trajectories, we identify key human behavioral features and incorporate them in a controller that mimics these behaviors. The performance of this human-inspired controller is evaluated against a baseline optimal-control-based agent, demonstrating the importance of specific control features for achieving human-like interaction. These findings contribute to the broader goal of developing rhythm-based human-machine interfaces by offering design insights that enhance user performance, improve intuitiveness, and reduce interaction frustration.
|
|
17:30-17:45, Paper Mo-S2-T14.7 | |
Learning Multi-Scale Spatial Features Representation in Frequency Domain for Gait Recognition |
|
Ning, Tong | University of Chinese Academy of Sciences |
Lu, Ke | University of Chinese Academy of Sciences |
Xue, Jian | University of Chinese Academy of Sciences |
Keywords: Biometrics and Applications,, Human Performance Modeling, Human-centered Learning
Abstract: Gait recognition is a highly promising biometric technology because of its robust performance in long‑distance scenarios. However, existing methods typically rely on geometric approaches to extract local and global features, which may lack the accuracy required for truly discriminative representations. To overcome this problem, we propose a frequency‑based channel attention mechanism that captures both global and local information in the frequency domain, where the signal preserves a rich set of latent details that can boost the recognition accuracy. Furthermore, assembling large-scale gait datasets remains prohibitively expensive, constraining research progress. To alleviate this issue, we introduce a simple yet effective data augmentation strategy, where each raw gait image is horizontally split into two parts, which are then randomly recombined to create new synthetic identities. Extensive experiments on the FVG and CASIA‑B datasets demonstrate that our method achieves competitive performance.
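The augmentation strategy described above, splitting each gait image into two parts and recombining parts from different subjects into new synthetic identities, is easy to sketch. Whether the split is top/bottom or left/right is not specified in the abstract; the sketch below assumes a top/bottom cut on silhouette sequences.

```python
# Hypothetical sketch of the split-and-recombine augmentation described in the
# abstract; the split axis and pairing rule are assumptions.
import numpy as np

def recombine(silhouettes_a, silhouettes_b, cut_ratio=0.5):
    """Create synthetic identities by stitching the top of subject A's
    silhouettes to the bottom of subject B's (and vice versa)."""
    cut = int(silhouettes_a.shape[-2] * cut_ratio)     # row index of the cut
    new_ab = np.concatenate([silhouettes_a[..., :cut, :],
                             silhouettes_b[..., cut:, :]], axis=-2)
    new_ba = np.concatenate([silhouettes_b[..., :cut, :],
                             silhouettes_a[..., cut:, :]], axis=-2)
    return new_ab, new_ba

rng = np.random.default_rng(0)
subj_a = (rng.random((30, 64, 44)) > 0.5).astype(np.float32)  # 30-frame sequence
subj_b = (rng.random((30, 64, 44)) > 0.5).astype(np.float32)
ab, ba = recombine(subj_a, subj_b)
print(ab.shape, ba.shape)   # two new synthetic-identity sequences
```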
|
|
17:45-18:00, Paper Mo-S2-T14.8 | |
A Dual-Agent Learning Framework for Emotion-Aware Personalized Game Level Generation |
|
Sarkar, Subharag | University of Texas at Arlington |
Huber, Manfred | The University of Texas at Arlington |
Keywords: Human-centered Learning, Human Performance Modeling, Human-Machine Interaction
Abstract: A dual-agent learning framework is proposed for emotion-aware, personalized game level generation, minimizing real user interaction while maximizing engagement. The Inner Agent models player behavior using a Siamese Network to generate user embeddings, a Gaussian Mixture Model (GMM) to capture user-type distributions, and a Conditional GAN (CGAN) to simulate performance data conditioned on game state, emotion, and user embeddings. The Outer Agent, a Q-learning-based Reinforcement Learning (RL) agent, selects from 10 predefined game states based on player performance and facial emotion data. In experiments with 100 real-user interactions and 100 simulated interactions per iteration, the framework reduced the number of real interactions required to reach peak performance compared to a baseline without the Inner Agent. These results demonstrate the framework’s effectiveness in scalable, emotion-sensitive game personalization with reduced user burden.
|
|
18:00-18:15, Paper Mo-S2-T14.9 | |
Evaluation of Tractor Operators’ Aptitude for Remote Operation on Driving Courses Designed to Induce Fluctuations in Operator Tension |
|
Nanakubo, Moe | The University of Tokyo |
Honda, Koki | The University of Tokyo |
Fujii, Takafumi | KUBOTA Corporation |
Hasebe, Daisuke | KUBOTA Corporation |
Matsuzaki, Yushi | KUBOTA Corporation |
Fukui, Rui | The University of Tokyo |
Keywords: Human-Machine Interaction, Human Performance Modeling, Kansei (sense/emotion) Engineering
Abstract: A remote operation system for inexperienced tractor drivers is expected to enhance labor efficiency in agricultural work. This study aims to develop a method for evaluating tractor operators' aptitude for remote operation by analyzing physiological signals and operation logs related to tension fluctuation. Experimental driving courses were designed to induce and assess tension fluctuations, and participants' physiological responses were measured during these tasks. The experimental results revealed that the driving tasks designed in this study can induce fluctuations in operators' levels of tension. Moreover, the effectiveness of the proposed indices based on task results, as well as of evaluating operators' tension through physiological measurement, was demonstrated.
|
|
18:15-18:30, Paper Mo-S2-T14.10 | |
ISC-Swin: Inter Sample Contrastive Enhancement for Swin-Transformer in Ultrasound Spine Feature Segmentation (I) |
|
Zhang, Chen | University of Technology Sydney |
Jia, Wenjing | University of Technology Sydney |
Zheng, Yongping | The Hong Kong Polytechnic University |
Ling, Steve | University of Technology Sydney |
Keywords: Biometrics and Applications,, Medical Informatics, Cognitive Computing
Abstract: Scoliosis, a three-dimensional spinal deformity, requires early detection for effective treatment. While Cobb's angle measurement via radiographs remains the gold standard, radiation risks necessitate safer alternatives. Ultrasound imaging offers a non-invasive option but presents challenges including low contrast, high noise, and irregular structures, complicating precise Ultrasound Curve Angle (UCA) estimation. Poor segmentation often impairs automated UCA calculation by missing critical features, and existing models lack clinical reliability and robustness. We propose ISC-Swin, a Swin Transformer-based model enhanced with inter-sample contrastive learning for improved ultrasound spine feature segmentation. The architecture captures both local and global contextual features through an innovative Inter-Sample Contrastive Bank (ISCB) that dynamically extracts multi-level features. By addressing inter-class and intra-class differences across samples, ISC-Swin enhances detection of subtle spinal features in noisy, low-contrast regions. ISC-Swin demonstrates a 1-5% improvement in Dice Similarity Coefficient and Intersection over Union metrics, outperforming state-of-the-art models for bone feature detection and diagnostic precision in ultrasound segmentation.
|
|
Mo-Online |
Online Room |
Online Session Cybernetics |
Regular Papers - Cybernetics |
|
09:00-18:30, Paper Mo-Online.1 | |
Graph Neural Network-Enhanced Feature Learning for Unsupervised Anomalous Sound Detection |
|
Lu, Jiyu | Institute of Acoustics, CAS |
Guan, Wenbo | Institute of Acoustics, CAS |
Zhang, Ming | Hubei University |
Li, Ta | Institute of Acoustics, CAS |
Keywords: Neural Networks and their Applications, Application of Artificial Intelligence, Artificial Social Intelligence
Abstract: Anomalous sound detection (ASD) is crucial in industrial applications due to its non-invasive and real-time capabilities. However, existing ASD methods often rely on autoencoders, which require machine-specific tuning, or large pre-trained models with high computational costs. Additionally, many self-supervised approaches depend on extensive meta-information, increasing deployment complexity. To address these limitations, we propose a lightweight, metadata-free ASD framework that generalizes across different machine types without requiring complex hyperparameter tuning. Our approach extracts high-dimensional features from Log-Mel spectrograms using MobileNetV2, then refines feature representations through relational learning with a Graph Neural Network-based SAGE-GAT model. Unlike conventional methods that treat machine types independently, our approach leverages cross-category feature propagation through local neighbor relationships, capturing discriminative information from nearby samples. Furthermore, an MLP optimized with ArcFace loss enhances feature structuring, while anomaly detection is performed using K-means clustering. Experiments on the DCASE Task 2 dataset validate the effectiveness of our approach, demonstrating its robustness, efficiency, and suitability for real-world industrial deployment.
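A minimal sketch of the final scoring stage only: fitting K-means on embeddings of normal machine sounds and scoring test clips by their distance to the nearest centroid. The upstream MobileNetV2 and SAGE-GAT embedding pipeline is replaced by random vectors here, and the cluster count is an assumption.

```python
# Hypothetical sketch of the K-means-based anomaly scoring stage (the upstream
# embedding pipeline is replaced by random features here).
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
normal_embed = rng.normal(size=(500, 128))              # embeddings of normal sounds
test_embed = np.vstack([rng.normal(size=(20, 128)),     # unseen normal clips
                        rng.normal(loc=3.0, size=(5, 128))])  # anomalous clips

km = KMeans(n_clusters=16, n_init=10, random_state=0).fit(normal_embed)

def anomaly_score(x, km):
    """Distance to the nearest centroid of the 'normal' clusters."""
    d = np.linalg.norm(x[:, None, :] - km.cluster_centers_[None, :, :], axis=-1)
    return d.min(axis=1)

scores = anomaly_score(test_embed, km)
print("normal mean:", scores[:20].mean(), " anomaly mean:", scores[20:].mean())
```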
|
|
09:00-18:30, Paper Mo-Online.2 | |
WHSP: Windmill-Inspired Hierarchical Stripping Pooling for Channel Attention in Deep Neural Networks |
|
Zhu, Shengxiang | Sichuan University |
Ma, Yue | Sichuan University |
Wu, Yixuan | Sichuan University |
Xie, Zhenghe | Sichuan University |
Xing, Congsen | Sichuan University |
Keywords: Artificial Social Intelligence, Neural Networks and their Applications, Deep Learning
Abstract: Channel attention mechanisms are widely used to model channel dependencies and assign feature weights. However, mainstream methods that rely on Global Average Pooling (GAP) often lose critical spatial information, especially for detail-rich features, leading to suboptimal feature representation. To address this, we propose Windmill Hierarchical Stripping Pooling (WHSP), a novel feature fusion method inspired by the rotational mechanism of a windmill. WHSP extracts significant features layer by layer from the outer to inner regions of the feature map, preserving spatial structure and salient information. We further design a progressively compressed neural network to process WHSP features and generate channel attention weights, named WHSP-CA. Additionally, WHSP is extended into a general pooling framework, enhancing its flexibility and practicality. Experiments on ResNet18, ResNet34, and ResNet50, using two fine-grained classification datasets, demonstrate that WHSP-CA significantly improves model performance, validating its effectiveness in feature extraction and channel attention modeling.
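One way to picture the outer-to-inner, layer-by-layer extraction described above is pooling over concentric rings of the feature map, starting at the border and moving inwards; the sketch below averages each ring into one value per channel. This is an illustration of the general idea under that assumption, not the paper's exact WHSP operator.

```python
# Hypothetical sketch of layer-by-layer outer-to-inner pooling over concentric
# rings of a feature map (an illustration of the idea, not the WHSP operator).
import torch

def ring_pool(feat):
    """feat: (B, C, H, W) -> (B, C, R) with one pooled value per ring."""
    b, c, h, w = feat.shape
    ys = torch.arange(h).view(h, 1).expand(h, w)
    xs = torch.arange(w).view(1, w).expand(h, w)
    # distance (in rings) from the border: 0 = outermost ring
    ring = torch.minimum(torch.minimum(ys, xs),
                         torch.minimum(h - 1 - ys, w - 1 - xs))
    n_rings = int(ring.max()) + 1
    pooled = [feat[:, :, ring == r].mean(dim=-1) for r in range(n_rings)]
    return torch.stack(pooled, dim=-1)              # (B, C, n_rings)

feat = torch.randn(2, 64, 7, 7)
print(ring_pool(feat).shape)                        # -> torch.Size([2, 64, 4])
```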
|
|
09:00-18:30, Paper Mo-Online.3 | |
Enhancing Convolutional Neural Network Performance through Self-Distillation with Feature Extractor and MLP Head |
|
Liu, Lin | Shenyang University of Technology |
Li, Zhe | Shenyang University of Technology |
Keywords: Image Processing and Pattern Recognition, Deep Learning, Artificial Social Intelligence
Abstract: Convolutional neural networks have demonstrated notable advancements in recent years. In order to meet the demands of high-precision scenarios, researchers have improved the performance by using deeper or wider network architectures. However, the improvement in neural network performance is often accompanied by increased computational demands and a larger number of parameters, which in turn lead to exponential growth in costs related to computation and data storage, as well as longer response times. To address these problems, techniques like self-distillation have been proposed to compress models. However, models trained with self-distillation lack the guidance of a teacher model, causing the student model to struggle with capturing global information and focusing on key features. In light of this, we propose a novel self-distillation method that combines a feature extractor based on depthwise separable convolutions with an MLP Head to address this limitation. The feature extractor captures multi-level information from local to global, while the MLP Head highlights key features for the final classification. The combination of these two components enhances the model's ability to extract global information and focus on key features. Experiments demonstrate that the model's accuracy increases by 3.62% on CIFAR-100 and 4.37% on Tiny ImageNet on average compared to baseline models. Moreover, the improvement is achieved with only a slight increase in the number of parameters.
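A compact PyTorch sketch of the two auxiliary components named in the abstract, a depthwise-separable-convolution feature extractor and an MLP head, attached to an intermediate feature map. Channel sizes, depth, and the class count are placeholder assumptions; the distillation loss itself is omitted.

```python
# Hypothetical sketch of a depthwise-separable-convolution feature extractor
# followed by an MLP head, as named in the abstract (sizes are placeholders).
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, 3, padding=1, groups=in_ch)
        self.pointwise = nn.Conv2d(in_ch, out_ch, 1)
        self.act = nn.ReLU()

    def forward(self, x):
        return self.act(self.pointwise(self.depthwise(x)))

class AuxBranch(nn.Module):
    """Feature extractor + MLP head used to guide the student via distillation."""
    def __init__(self, in_ch=64, num_classes=100):
        super().__init__()
        self.extract = nn.Sequential(DepthwiseSeparableConv(in_ch, 128),
                                     DepthwiseSeparableConv(128, 128),
                                     nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.mlp_head = nn.Sequential(nn.Linear(128, 256), nn.ReLU(),
                                      nn.Linear(256, num_classes))

    def forward(self, feature_map):
        return self.mlp_head(self.extract(feature_map))

logits = AuxBranch()(torch.randn(8, 64, 32, 32))
print(logits.shape)   # -> torch.Size([8, 100])
```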
|
|
09:00-18:30, Paper Mo-Online.4 | |
Imputing Missing Temperature Data of Meteorological Stations Based on Global Spatiotemporal Attention Neural Network |
|
Hou, Tianrui | Qinghai University |
Guo, Xinshuai | Qinghai University |
Wu, Li | Qinghai University |
Wang, Xiaoying | Qinghai Institute of Technology |
Zhang, Guojing | Qinghai University |
Huang, Jianqiang | Qinghai University |
Keywords: Deep Learning, Machine Learning, Neural Networks and their Applications
Abstract: Imputing missing meteorological site temperature data is necessary and valuable for researchers to analyze climate change and predict related natural disasters. Prior research often used interpolation-based methods, which basically ignored the temporal correlation existing in the site itself. Recently, researchers have attempted to leverage deep learning techniques. However, these models cannot fully utilize the spatiotemporal correlation in meteorological stations data. Therefore, this paper proposes a global spatiotemporal attention neural network (GSTA-Net), which consists of two sub networks, including the global spatial attention network and the global temporal attention network, respectively. The global spatial attention network primarily addresses the global spatial correlations among meteorological stations. The global temporal attention network predominantly captures the global temporal correlations inherent in meteorological stations. To further fully exploit and utilize spatiotemporal information from meteorological station data, adaptive weighting is applied to the outputs of the two sub-networks, thereby enhancing the imputation performance. Additionally, a progressive gated loss function has been designed to guide and accelerate GSTA-Net's convergence. Finally, GSTA-Net has been validated through a large number of experiments on public dataset TND and QND with missing rates of 25%, 50%, and 75%, respectively. The experimental results indicate that GSTA-Net outperforms the latest models, including Linear, NLinear, DLinear, PatchTST, and STA-Net, across both the mean absolute error (MAE) and the root mean square error (RMSE) metrics.
|
|
09:00-18:30, Paper Mo-Online.5 | |
Multi-Scale Based Cross-Modal Semantic Alignment Network for Radiology Report Generation |
|
Zhang, Zhihao | Qilu University of Technology (Shandong Academy of Sciences) |
Zhao, Long | Qilu University of Technology |
Lan, Dun | Qilu University of Technology (Shandong Academy of Sciences) |
Wang, Yuyao | Boston University |
Jiang, Linfeng | Qilu University of Technology (Shandong Academy of Sciences) |
Dong, Xiangjun | Qilu University of Technology |
Keywords: Application of Artificial Intelligence, Deep Learning, Image Processing and Pattern Recognition
Abstract: The automatic generation of radiology reports has drawn attention for its potential to ease radiologists' workload and aid diagnosis. Cross-modal alignment between images and text is critical for high-quality reporting, but it has not yet been fully explored due to a lack of annotation. Meanwhile, existing alignment methods utilize single-scale image region features for alignment and cannot accommodate the different sizes of anatomical structures in radiology images. To address these problems, we propose a Multi-scale based Cross-modal Semantic Alignment Network (MCSANet). It includes three modules: a multi-scale visual feature extraction module, capturing key image information in windows of different sizes; a cross-modal semantic alignment module, achieving semantic alignment between the two modalities without relying on additional auxiliary information; and a transformer report generator, which generates radiology reports using the final features. Experimental results show that MCSANet surpasses other leading approaches on the IU-Xray and MIMIC-CXR datasets.
|
|
09:00-18:30, Paper Mo-Online.6 | |
Align Modalities: Advancing Medical Report Generation with Unified Encoder and Inter-Case Contrastive Learning |
|
Chen, Haoquan | Beijing Institute of Technology |
Pei, Mingtao | Beijing Institute of Technology |
Nie, Zhengang | Beijing Institute of Technology |
Keywords: Application of Artificial Intelligence, AI and Applications, Deep Learning
Abstract: Medical reports play a pivotal role in achieving accurate diagnoses. This technology not only reduces the burden on radiologists but also fosters consistency in treatment approaches. The crux of generating high-caliber medical reports lies in the model's ability to interpret and integrate both visual and textual data. However, the inherent distributional disparities between modalities pose a significant challenge to this process. Hence, we propose UEMA framework, which extracts features from both modalities through a Unified Encoder and consists of an ICCL (Inter-Case Contrastive Learning) module to facilitate multimodal alignment. The ICCL module leverages multi-label contrastive learning across different cases to align visual and textual features. Extensive experiments have been conducted on the publicly available IU X-Ray and MIMIC-CXR datasets with additional case studies and visual analysis, demonstrating the effectiveness of our designed module and that our model outperforms state-of-the-art methods across a wide range of metrics.
|
|
09:00-18:30, Paper Mo-Online.7 | |
Few-Shot Fine-Grained Image Classification Via Vision Transformer |
|
Liu, Yongqi | Chongqing University |
Xiao, Tong | Tsinghua University |
Chen, Zeao | Chongqing University |
Zhou, Chen | Chongqing University |
Wang, Zhi-Jie | Chongqing University |
Keywords: Image Processing and Pattern Recognition, Deep Learning, AI and Applications
Abstract: Few-shot fine-grained image classification (FSFGIC) is to classify images of the same class into fine-grained subclasses, where only a very limited number of labeled samples in each subclass are available (e.g., 5 or even 1 labeled sample). For most of existing methods, the feature representation capabilities are insufficient, which may harm the performance. Vision Transformers (ViTs) have shown strong feature representation capabilities in many research fields. In this paper, we attempt to solve FSFGIC problem via ViT or its variants. Generally, we use an enhanced ViT as the backbone and adopt a three-stage training strategy. More specifically, (i) we utilize an image matting module to enhance the focus of ViT on the main subjects of images; (ii) we introduce a part selection module to better focus on local image details; and (iii) we incorporate an AdaptMLP module into each Transformer Encoder to reduce the number of parameters that require fine-tuning. Extensive experiments based on three benchmark datasets show us that the proposed model is highly competitive, compared against state-of-the-art models.
|
|
09:00-18:30, Paper Mo-Online.8 | |
MAC-Lookup: Multi-Axis Conditional Lookup Model for Underwater Image Enhancement |
|
Yi, Fanghai | Guangdong University of Technology |
Zheng, Zehong | Guangdong University of Technology |
Liang, Zexiao | Huizhou University |
Dong, Yihang | University of Chinese Academy of Sciences |
Fang, Xiyang | Huizhou University |
Wu, Wangyu | Xi’an Jiaotong-Liverpool University |
Chen, Xuhang | Huizhou University |
Keywords: Image Processing and Pattern Recognition, Machine Vision, Media Computing
Abstract: Enhancing underwater images is crucial for exploration. These images face visibility and color issues due to light changes, water turbidity, and bubbles. Traditional prior-based methods and pixel-based methods often fail, while deep learning lacks sufficient high-quality datasets. We introduce the Multi-Axis Conditional Lookup (MAC-Lookup) model, which enhances visual quality by improving color accuracy, sharpness, and contrast. It includes Conditional 3D Lookup Table Color Correction (CLTCC) for preliminary color and quality correction and Multi-Axis Adaptive Enhancement (MAAE) for detail refinement. This model prevents over-enhancement and saturation while handling underwater challenges. Extensive experiments show that MAC-Lookup excels in enhancing underwater images by restoring details and colors better than existing methods. The code is available at https://github.com/onlycatdoraemon/MAC-Lookup.
|
|
09:00-18:30, Paper Mo-Online.9 | |
A Multi-Modal Feature Interaction Enhancement Network for Complex Table Structure Recognition |
|
Jingdong, Wang | Northeast Electric Power University |
Wang, Peng | Northeast Electric Power University |
Fanqi, Meng | Northeast Electric Power University |
Keywords: Neural Networks and their Applications, Image Processing and Pattern Recognition
Abstract: Current table structure recognition methods often suffer from cell misrecognition and misalignment when faced with complex tables that have intricate row-column structures and rich cell semantic content. To address this, this study proposes a multi-modal complex table structure recognition network, MTSRN, which integrates visual, textual, and positional features. The innovations of this approach are as follows. First, a CANN image visual feature extraction branch is added on top of the GNN to extract the table's global and local visual features more comprehensively, combined with the relative positions of the cells. Second, we construct an integrated network named CLAT for obtaining semantic features from the textual content, so that the semantic connections between cell text contents can be extracted accurately. Finally, we construct a node prediction module named NAPM, which interactively enhances the visual, semantic, and positional features, uses a GNN to fuse these three features into graph nodes, and predicts and pairs the row-column relationships between nodes.
|
|
09:00-18:30, Paper Mo-Online.10 | |
Cross-Modal World Models for Offline Visual Reinforcement Learning |
|
Wang, Qi | Shanghai Jiao Tong University |
Jin, Xin | Eastern Institute of Technology, Ningbo |
Xie, Baao | Eastern Institute of Technology, Ningbo |
Yang, Xiaokang | Shanghai Jiao Tong University |
Zeng, Wenjun | Eastern Institute of Technology, Ningbo |
Keywords: Machine Learning
Abstract: Offline reinforcement learning (RL) with visual pixels encounters two primary challenges: overfitting in representation learning induced by limited data, and value overestimation of out-of-distribution states. Recent work has adopted accessible simulators to mitigate these issues, but rendering images introduces additional computational costs and training challenges. In this paper, we study the problem of anti-exploration with accessible state-based simulators for effective learning of offline visual control. To address these challenges, we introduce a model-based RL framework, dubbed Cross-Modal World Models (X-MWM). Concretely, we build two independent agents: a source model trained on low-dimensional states and a target model that learns from high-dimensional images. Initially, in light of the reward function discrepancy between the domains, we pretrain the source agent with latent disagreement-based intrinsic rewards. Subsequently, to prevent overfitting in offline representation learning, cross-modal latent alignment is employed to close the distance of latent state distributions. In this way, during the target agent training phase, the value of the source critic serves as an anti-exploration constraint to adjust the learning target of the offline RL agent, which encourages more conservative behavior, effectively alleviating value overestimation induced by out-of-distribution states. Experimental results of various robotic manipulation tasks on MetaWorld validate the superiority of our approach.
|
|
09:00-18:30, Paper Mo-Online.11 | |
GAIT: An Attention-Augmented Dynamic Time Series Model for Detecting Depression Levels on Social Media |
|
Jingdong, Wang | Northeast Electric Power University |
Zhao, Wenyan | School of Computer Science Northeast Electric Power University |
Fanqi, Meng | Northeast Electric Power University |
Guang qiang, Qu | 2723758322@qq.com |
Keywords: AI and Applications, Application of Artificial Intelligence, Computational Life Science
Abstract: By analyzing sentiment changes on social media, time series methods are able to capture the sentiment fluctuations of depressed patients at different time points, thus providing strong support for early diagnosis and risk prediction of depression. However, the existing models still have two major bottlenecks: insufficient global dependency modeling and lack of interpretability. To address the above problems, this paper proposes an attention-augmented dynamic time series model (GAIT), which achieves a breakthrough through a multi-stage architecture. First, multi-source feature fusion is performed to generate multivariate time series features by fusing a sentiment lexicon and a pre-trained model, which can effectively extract explicit and implicit features; second, dual interpretability is achieved by symptom-level similarity analysis and dynamic weight ablation; then, hierarchical feature extraction is performed to capture local patterns by multi-scale convolution of the Inception module, with residual connections mitigating gradient vanishing; and finally, dynamic global modeling embeds a global attention mechanism to achieve dynamic weighting of critical time steps, complemented by global average pooling to enhance robustness. The experimental results show that the model significantly outperforms the baseline model in the depression degree classification task, with F1-Scores of 90.0, 86.8, 84.0, and 87.7 on normal, mildly depressed, moderately depressed, and severely depressed users, respectively, which fully validates the effectiveness of the proposed method.
|
|
09:00-18:30, Paper Mo-Online.12 | |
Mitigating Feature Homogenization in GNNs Via Structure-Oriented Feature Augmentation for Fake News Detection |
|
Wang, Weiyi | Sichuan University |
Zhang, Xinyu | Sichuan University |
Liu, QuanHui | Sichuan University |
Zhou, Tao | University of Electronic Science and Technology |
Lv, Jiancheng | Sichuan University |
Keywords: Artificial Social Intelligence, AI and Applications, Deep Learning
Abstract: In recent years, GNN-based fake news detection models integrating news content, user characteristics, and propagation structure have gained substantial attention, yet they often face the potential homogenization issues in GNNs, limiting performance in detection. Despite numerous studies focusing on sophisticated models to tackle this issue, many have overlooked the unique structural characteristics of propagation trees. Here, we propose a structure-oriented model named DaFAN, which leverages a dual-attention mechanism to not only address the homogenization issue in message passing but also be able to boost the distinction between true and fake news. In specific, we design a novel Dual Attention Module with the multi-head graph attention mechanism to fuse the multi-modal features by utilizing the inherent characteristics of news propagation trees, and introduce a lightweight feature augmentation module compatible with various GNNs to retain the initial features and optimize the feature selection. Experiments on real datasets demonstrate that our DaFAN model outperforms the state-of-the-art models. Furthermore, the feature augmentation module has notably bolstered our model’s transferability across languages and datasets, fine-tuning on 10% of the target data can significantly surpass the supervised training from scratch.
|
|
09:00-18:30, Paper Mo-Online.13 | |
TransConvE: A Dual-Perspective Framework for Scalable and Efficient Knowledge Graph Embedding |
|
Wang, Chen | University of Electronic Science and Technology of China |
Cao, Changjie | Geomathematics Key Laboratory of Sichuan Province, Chengdu Unive |
Hu, Wang | University of Electronic Science and Technology of China |
Zhang, Yu | University of Electronic Science and Technology of China |
Keywords: Representation Learning, Knowledge Acquisition, Deep Learning
Abstract: Knowledge graph embedding models for link prediction often require a large number of model parameters, resulting in significant memory and computational costs that limit the scalability for large-scale knowledge graphs. To address the issue, this paper proposes a novel framework that balances expressiveness and computational efficiency through a dual-perspective modeling approach, named TransConvE, which combines a Transformer-based global dependency modeling with CNN-based local feature extraction for capturing both long-range dependencies and fine-grained local interactions. Experiments on standard link prediction benchmarks, FB15k-237 and WN18RR, demonstrate that the proposed TransConvE achieves competitive performance, reducing model parameters by an average of 66.7% compared to RotatE, TuckER, and CoKE, while maintaining comparable or even better effectiveness. This demonstrates the effectiveness of TransConvE in capturing complex patterns within knowledge graphs, balancing model complexity and performance, and ensuring scalability for large-scale knowledge graph embedding tasks.
|
|
09:00-18:30, Paper Mo-Online.14 | |
Enhancing the Transferability of Adversarial Attacks with Majority-Vote |
|
Xie, Zhaorong | Nanjing University of Science and Technology |
Huang, Chanying | Nanjing University of Science and Technology |
Li, Qianmu | Nanjing University of Science and Technology |
Wang, Rui | Fiberhome Communication Technology Co., LTD |
Zhang, Jing | Southeast University |
Zhang, Xuyun | Macquarie University |
Keywords: Machine Vision, Machine Learning, Deep Learning
Abstract: Methods based on the Fast Gradient Sign Method (FGSM) play an important role in adversarial attack tasks, especially transfer-based attacks. Some recent methods attempt to further enhance adversarial transferability by modifying the sign function in FGSM with precise gradient information. However, owing to unstable gradient update directions, precise gradient information also tends to fall into local minima, which reduces adversarial transferability. To address this problem, we propose a method called Majority-Vote Gradient Scaling (MVGS) to stabilize the direction of gradient updates. Our MVGS rescales the momentum by evaluating the consistency of previous gradients at each pixel. Since MVGS rescales the momentum only through a specific parameter, it can be easily integrated with adversarial attack methods that exploit precise gradient information. Comprehensive comparison and ablation experiments show that, compared with the recent APAA method, MVGS can improve the attack success rate (ASR).
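Going only by the abstract, the core idea, rescaling the momentum at each pixel according to how consistently recent gradients agreed in sign there, might look roughly like the sketch below inside a momentum-based iterative FGSM-style loop. The consistency window, the scaling rule, and the attack-loop details are assumptions, not the paper's exact formulation.

```python
# Hypothetical sketch of majority-vote gradient scaling inside a momentum
# iterative FGSM-style attack (update rule details are assumptions).
import torch

def sign_consistency(grad_history):
    """Per-pixel fraction of recent gradients that agree with the majority sign."""
    signs = torch.stack([g.sign() for g in grad_history])     # (T, ...)
    vote = signs.sum(dim=0)                                    # majority vote
    agree = (signs == vote.sign()).float().mean(dim=0)
    return agree                                               # in [0, 1]

def mvgs_attack(model, x, y, eps=8 / 255, alpha=2 / 255, steps=10, window=3):
    x_adv, momentum, history = x.clone(), torch.zeros_like(x), []
    loss_fn = torch.nn.CrossEntropyLoss()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = loss_fn(model(x_adv), y)
        grad, = torch.autograd.grad(loss, x_adv)
        history = (history + [grad.detach()])[-window:]
        scale = sign_consistency(history)                      # rescale momentum
        momentum = scale * momentum + grad / grad.abs().mean().clamp_min(1e-12)
        x_adv = (x_adv + alpha * momentum.sign()).detach()
        x_adv = x.clone() + (x_adv - x).clamp(-eps, eps)
        x_adv = x_adv.clamp(0, 1)
    return x_adv

model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 10))
x, y = torch.rand(4, 3, 32, 32), torch.randint(0, 10, (4,))
print(mvgs_attack(model, x, y).shape)
```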
|
|
09:00-18:30, Paper Mo-Online.15 | |
Boundary Box-Guided Targeted Adversarial Attacks with Semantic Perturbation |
|
Zhao, Hongtian | Xinjiang University |
Shi, Wenzhuo | Xinjiang University |
Liu, Chang | Tsinghua University |
Yiquan, Wang | Xinjiang University |
Keywords: Deep Learning, Machine Vision, Representation Learning
Abstract: Targeted adversarial attacks in black-box settings are pivotal for uncovering vulnerabilities in neural networks and guiding the development of robust defenses. However, conventional attack methods typically perturb the primary content, leading to a degradation in image quality and highlighting the need for more reasonable optimization strategies. In contrast, we propose a novel algorithm that restricts perturbations to the image boundary regions, thereby preserving content fidelity while enhancing attack effectiveness. Our approach employs an encoder–decoder generative network to craft targeted adversarial examples guided by optimized semantic perturbations derived from the boundaries. Moreover, the boundary signal is jointly optimized with the model parameters, enabling efficient, amortized optimization for multi-class targeted attacks. Extensive experiments demonstrate that the proposed boundary-guided method significantly improves the success rates of targeted black-box attacks and can be seamlessly integrated into existing noise-injection techniques to enhance overall performance.
|
|
09:00-18:30, Paper Mo-Online.16 | |
Epidemic Informed Co-Evolution of Nodes and Edges with Graph Neural Networks for Source Detection |
|
Wang, Ruixiao | Sichuan University |
Yu, Lanlan | Sichuan University |
Yang, Xinfu | Sichuan University |
Liu, QuanHui | Sichuan University |
Zhou, Tao | University of Electronic Science and Technology |
Lv, Jiancheng | Sichuan University |
Keywords: Application of Artificial Intelligence, Deep Learning, Complex Network
Abstract: Accurate identification of the epidemic source is crucial for controlling the spread of infectious disease. During an epidemic, transmission occurs through the edges that represent contact relationships between individuals. Source detection, as the reverse problem, should therefore benefit significantly from leveraging the epidemic state of edges, which encodes both the direction of transmission and the likelihood of the disease spreading through each edge. Although Graph Neural Network (GNN)-based methods have been developed for source detection, their performance is limited because they overlook the epidemic state of edges. Here, we propose a novel GNN-based model, the Adaptive Node and Edge Co-evolution Source Detection graph neural network (ANEC-SD), which utilizes the epidemic state of edges and optimizes node representations individually to enhance prediction performance. Specifically, ANEC-SD consists of two innovative components: the Co-evolution of Node and Edge Representations Layer refines the message aggregation mechanism to effectively characterize the heterogeneous impact of a node on its neighbors by leveraging the epidemic state of edges, while the Propagation Controller determines the optimal number of propagation layers for each node according to node centrality within the infected subgraph, thereby mitigating over-smoothing. Experiments on six real-world networks demonstrate that our model significantly outperforms existing methods, achieving a 30% improvement on four out of six networks. Further analysis highlights the pivotal role of the epidemic state of edges in message passing.
|
|
09:00-18:30, Paper Mo-Online.17 | |
Enhanced Hemorrhagic Transformation Prediction Leveraging CT Imaging and Lesion Segmentation Guidance |
|
Haodong, Xu | Chongqing Normal University |
Huanhuan, Ren | Chongqing University Cancer Hospital |
Feng, Jinwang | College of Computer Science and Technology, Zhejiang Sci-Tech Un |
Jiang, Jingfeng | Michigan Technological University |
Yongmei, Li | The First Affiliated Hospital of Chongqing Medical University |
Cui, Shaoguo | Chongqing Normal University |
Keywords: Neural Networks and their Applications, Biometric Systems and Bioinformatics, Image Processing and Pattern Recognition
Abstract: Hemorrhagic transformation (HT) is a time-sensitive, severe complication of endovascular thrombectomy for patients with ischemic stroke, and there is an urgent need for deep learning models that assist doctors in making rapid preliminary diagnoses. Currently popular Transformer architectures, while superior to traditional CNNs in modeling global relationships, still face quadratic complexity when handling long sequences of medical images because of their attention mechanism. In contrast, the computational complexity of Mamba-based methods grows linearly. Based on these findings, we develop a novel Mamba model that effectively captures long-range dependencies and the sequential relationships among slices in high-dimensional medical image sequences. The Spatial Cross Reconstruction Module and Channel Cross Reconstruction Module reduce redundant information within images, and the Multi-Layer Perceptrons are replaced with a Kolmogorov-Arnold Network, achieving good results while adding very few parameters. We evaluated the proposed model on a multi-center dataset. Experimental results show that our method outperforms other classical architectures and current advanced methods, validating the effectiveness and generalizability of the model composed of the aforementioned modules.
|
|
09:00-18:30, Paper Mo-Online.18 | |
Enhancing CNN-Based Network Robustness Predictors through Matrix Completion against Information Noise |
|
Chen, Liang | Sichuan Normal University |
Huang, Wenli | Sichuan Normal University |
Li, Hui | Sichuan Normal University |
Li, Junli | Sichuan Normal University |
Keywords: Complex Network, Deep Learning, Machine Learning
Abstract: Complex networks exist widely in social, biological, and physical systems, and their robustness directly reflects system stability. Traditional robustness evaluation methods have limitations, while CNN-based predictors efficiently learn network structural features for fast and accurate predictions. However, due to malicious attacks, random failures, and other factors, network data is often incomplete, which reduces the performance of CNN-based robustness predictors. Therefore, this study focuses on information recovery to handle network information loss. The main contributions can be summarized as follows: 1) Explores the impact of different types of information loss on the prediction performance of CNN-based robustness predictors in complex networks. 2) Proposes a recovery method based on Singular Value Decomposition (SVD-R) to handle information loss in networks. Theoretical analysis and extensive experiments show that SVD-R has good recovery ability under various types of information loss. This method effectively reduces the prediction error of CNN-based predictors when handling incomplete data.
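A minimal sketch of what an SVD-based recovery step (in the spirit of the proposed SVD-R) could look like for an adjacency matrix with missing entries, before feeding it to a CNN robustness predictor; the target rank `k`, the iterative re-imputation loop, and the clipping to [0, 1] are assumptions.

```python
import numpy as np

def svd_recover(adj, observed_mask, k=20, n_iters=30):
    """adj: (N, N) matrix with unknown entries set to 0; observed_mask: bool (N, N)."""
    x = adj.copy().astype(float)
    for _ in range(n_iters):
        u, s, vt = np.linalg.svd(x, full_matrices=False)
        low_rank = (u[:, :k] * s[:k]) @ vt[:k, :]        # rank-k approximation
        # keep observed entries, fill missing ones from the low-rank estimate
        x = np.where(observed_mask, adj, low_rank)
    return np.clip(x, 0.0, 1.0)

# usage: recovered = svd_recover(noisy_adj, mask); pass `recovered` to the CNN predictor
```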
|
|
09:00-18:30, Paper Mo-Online.19 | |
CA-Gen: Cluster-Aware Anomaly Generation for Video Anomaly Detection |
|
Ye, HaiFu | Chongqing Institute of Green and Intelligent Technology |
Chen, Lin | Chinese Academy of Scie |
Shang, Mingsheng | Chongqing Institute of Green and Intelligent Technology, Chinese |
Keywords: Machine Vision, Machine Learning, Neural Networks and their Applications
Abstract: Video anomaly detection (VAD) is a critical technology for intelligent surveillance systems, finding widespread application in public security, traffic monitoring, and industrial automation. Existing VAD methods rely on diverse techniques, including reconstruction, auxiliary tasks, and density estimation. However, they frequently neglect the inherent clustering of anomalies, particularly their prevalence at cluster boundaries. This study introduces a novel paradigm, Cluster-aware Anomaly Generation (CA-Gen), designed for weakly-supervised VAD, that explicitly models boundary anomaly patterns, and consists of two primary modules: (a) the margin-based anomaly generator utilizes cluster priors by introducing noise perturbations at cluster boundaries, leveraging boundary priors to produce initial anomalous samples; (b) the anomaly representation optimizer employs a trainable mapping function to learn the relational structure between normal and anomalous patterns within each cluster, synthesizing more generalizable anomalous samples. Finally, CA-Gen enhances VAD performance by training a simple discriminator to learn the generated anomaly patterns. Extensive experiments on two widely recognized benchmark datasets validate the efficacy of our approach, demonstrating superior performance compared to state-of-the-art methods in VAD. The code is available at https://github.com/Haifu-Ye/CA-Gen.
|
|
09:00-18:30, Paper Mo-Online.20 | |
FGHFN: High-Resolution Fusion Network with Frequency-Domain Guidance for Remote Sensing Semantic Segmentation |
|
Fu, Jiahao | Xinjiang University, Urumqi, China |
Yu, Yinfeng | Xinjiang University |
Wang, Liejun | Xinjiang University |
Keywords: Image Processing and Pattern Recognition, Machine Vision, Deep Learning
Abstract: When performing semantic segmentation on high-resolution remote sensing images, existing methods face a trade-off between capturing spatial details and modeling global context efficiently. To address this, we propose FGHFN, a High-Resolution Fusion Network with Frequency-Domain Guidance. The encoder extracts multi-scale local features, while the HRFusion module dynamically merges adjacent-resolution features via channel gating, preserving critical edges and textures. At the decoder, the Frequency-domain Global Filter (FFGF) models global context with convolutional cost, and the Strip Fusion Block (SFB) aligns cross-layer receptive fields through strip convolution. Experiments on LoveDA, Vaihingen, and Potsdam show mIoU scores of 55.65%, 84.65%, and 87.61%, demonstrating FGHFN's efficacy.
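The Frequency-domain Global Filter is described only at a high level, so the sketch below shows a generic frequency-domain global filtering layer (FFT, element-wise learnable filter, inverse FFT) of the kind the abstract suggests; the filter parameterization and the fixed spatial size are assumptions.

```python
import torch
import torch.nn as nn

class FrequencyGlobalFilter(nn.Module):
    def __init__(self, channels, height, width):
        super().__init__()
        # one learnable complex weight per channel and rFFT frequency bin
        self.weight = nn.Parameter(torch.randn(channels, height, width // 2 + 1, 2) * 0.02)

    def forward(self, x):                                   # x: (B, C, H, W)
        freq = torch.fft.rfft2(x, norm="ortho")             # (B, C, H, W//2+1), complex
        freq = freq * torch.view_as_complex(self.weight)    # element-wise global filtering
        return torch.fft.irfft2(freq, s=x.shape[-2:], norm="ortho")
```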
|
|
09:00-18:30, Paper Mo-Online.21 | |
MolCAF: A Cross-Attention Fusion Approach Combining Molecular Descriptors and Fragment-Enhanced Graph Representation for Molecular Property Prediction |
|
Qin, Haoyang | Southwest University of Science and Technology |
Xie, Jia | Automation Research Institute of China South Industries Group Co |
Peng, Lijuan | Southwest University of Science and Technology |
Xie, ShiDi | Southwest University of Science and Technology |
Zhou, Hongcheng | Southwest University of Science and Technology |
Keywords: Representation Learning, Deep Learning, Computational Life Science
Abstract: Accurately identifying and predicting molecular properties is crucial for AI-driven drug design. However, previous studies on molecular representation learning mainly relied on a single representation method and failed to effectively integrate multi-scale features, making it difficult to capture both local and global structural information of molecules, which in turn limited prediction accuracy and generalization ability. In this study, we propose a novel molecular property prediction framework called MolCAF, which integrates features from multiple molecular representations and effectively captures both local and global information from different perspectives to enhance molecular property prediction. Specifically, MolCAF consists of three core components: (1) a Hybrid Fingerprint Feature (HFF) Encoder that captures complex nonlinear relationships in the mixed molecular fingerprint representation through a Multi-Layer Perceptron network. In this study, we use molecular fingerprints as molecular descriptors; (2) a CMPNN-FGP Encoder, which combines the Communicative Message Passing Neural Network (CMPNN) with a Functional Group Prompt (FGP) Encoder to extract global representations from both the original molecular graph and an augmented molecular graph based on functional group fragments; and (3) a Multi-Dimensional Feature Interaction Module, which deeply integrates molecular fingerprint and molecular graph feature information for more accurate molecular property prediction. We evaluated MolCAF’s performance on 12 benchmark datasets, and the experimental results demonstrate that MolCAF outperforms state-of-the-art methods in molecular property prediction, validating its effectiveness and robustness.
|
|
09:00-18:30, Paper Mo-Online.22 | |
Multi-Feature Guided Generalization: Tackling Domain Shifts in Semi-Supervised Medical Image Segmentation |
|
Xu, Yuheng | Chongqing University |
Wang, Tianyang | Xi'an Jiaotong-Liverpool University |
Zhang, Taiping | Chongqing University |
Keywords: Deep Learning, Image Processing and Pattern Recognition, Transfer Learning
Abstract: Medical image segmentation faces the dual challenges of limited annotated data and domain shifts when generalizing to unseen domains. Existing methods often tackle one challenge at the expense of the other. To tackle both challenges, we propose a segmentation framework that addresses these issues simultaneously. Inspired by class-level representations, we hypothesize that data from unseen target domains can be expressed as linear combinations of source domain data, which can be approximated via data augmentation. Based on this, we introduce a Multi-Feature Balance Fusion (MFBF) data augmentation mechanism. MFBF combines global and class-specific local augmentations, exploring diverse and even extreme appearances in unknown domains under a balanced multi-feature framework. This approach significantly enriches sample diversity and ensures that augmented samples capture potential target domain distributions, effectively mitigating domain shift. Additionally, to address the limitations of traditional sharpening functions, which may lead to overconfident predictions and overfitting, we introduce a Dynamic Uncertainty Estimation (DUE) mechanism. DUE encourages the model to explore more regions during training while avoiding overfitting to noise or outliers. Experimental results validate the effectiveness of our approach, demonstrating its superior segmentation performance in addressing both data scarcity and domain shift challenges. This provides a practical and efficient solution for semi-supervised learning and domain generalization in medical image segmentation.
|
|
09:00-18:30, Paper Mo-Online.23 | |
A Dynamic Gradient Enhanced Particle Swarm Optimization-Incorporated Adaptive Latent Factor Analysis Model |
|
Ma, Ziwen | Southwest University |
Cheng, Jingna | Southwest University |
Lyu, Chao | Southwest University |
Keywords: Evolutionary Computation, Swarm Intelligence, Computational Intelligence
Abstract: In latent factor analysis (LFA), stochastic gradient descent (SGD) is a widely used algorithm for decomposing high-dimensional sparse matrices. However, traditional SGD relies on manually tuning the learning rate, a vital hyperparameter that influences its optimization. To improve the performance of LFA, particle swarm optimization (PSO) has been employed to automatically adjust the learning rate of SGD. However, the classical PSO algorithm cannot track the movement of the optimal hyper-parameter setting during its convergence process and suffers from hysteresis due to its individual update scheme. To solve this problem, this paper proposes an enhanced dynamic gradient particle swarm optimizer (EDG-PSO) that achieves dynamic global optimization of the learning rate during LFA model training. In EDG-PSO, the directions of particles are updated with a certain probability by evaluating changes in their gradients, and the incorporation of the Adam algorithm further improves the search efficiency of PSO. By embedding the proposed hyper-parameter tuning algorithm into SGD-based LFA, this paper builds the EDG-PSO-incorporated LFA (EPLFA) model. Experimental results show that the proposed EPLFA model achieves higher prediction accuracy and lower computational cost than existing adaptive LFA models.
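To make the learning-rate tuning idea concrete, here is a plain PSO loop over candidate SGD learning rates evaluated by a user-supplied validation loss; it omits the paper's gradient-informed direction updates and Adam integration, and the swarm size, inertia, and acceleration constants are assumptions.

```python
import numpy as np

def pso_tune_lr(eval_loss, lr_bounds=(1e-4, 1e-1), n_particles=8, n_rounds=10,
                w=0.7, c1=1.5, c2=1.5, rng=np.random.default_rng(0)):
    """eval_loss(lr) -> validation loss after a short SGD run with that learning rate."""
    lo, hi = lr_bounds
    pos = rng.uniform(lo, hi, n_particles)                # particle positions = learning rates
    vel = np.zeros(n_particles)
    pbest, pbest_val = pos.copy(), np.array([eval_loss(p) for p in pos])
    gbest = pbest[pbest_val.argmin()]
    for _ in range(n_rounds):
        r1, r2 = rng.random(n_particles), rng.random(n_particles)
        vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)
        pos = np.clip(pos + vel, lo, hi)
        vals = np.array([eval_loss(p) for p in pos])
        improved = vals < pbest_val
        pbest[improved], pbest_val[improved] = pos[improved], vals[improved]
        gbest = pbest[pbest_val.argmin()]
    return gbest
```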
|
|
09:00-18:30, Paper Mo-Online.24 | |
Relative-Absolute Fusion: Rethinking Feature Extraction in Image-Based Iterative Method Selection for Solving Sparse Linear Systems |
|
Zhang, Kaiqi | Dalian University of Technology |
Yang, Mingguan | Greater Bay Area National Center of Technology Innovation |
Chang, Dali | Dalian University of Technology |
Chen, Chun | Greater Bay Area National Center of Technology Innovation |
Zhang, Yuxiang | Greater Bay Area National Center of Technology Innovation |
He, Kexun | CATARC Automotive Test Center (Tianjin) Co., Ltd |
Zhao, Jing | Dalian University of Technology |
Keywords: Deep Learning, Neural Networks and their Applications
Abstract: Iterative method selection is crucial for solving sparse linear systems because these methods inherently lack robustness. Though image-based selection approaches have shown promise, their feature extraction techniques might encode distinct matrices into identical image representations, leading to the same selection and suboptimal method. In this paper, we introduce RAF (Relative-Absolute Fusion), an efficient feature extraction technique to enhance image-based selection approaches. By simultaneously extracting and fusing image representations as relative features with corresponding numerical values as absolute features, RAF achieves comprehensive matrix representations that prevent feature ambiguity across distinct matrices, thus improving selection accuracy and unlocking the potential of image-based selection approaches. We conducted comprehensive evaluations of RAF on SuiteSparse and our developed BMCMat (Balanced Multi-Classification Matrix dataset), demonstrating solution time reductions of 0.08s-0.29s for sparse linear systems, which is 5.86%-11.50% faster than conventional image-based selection approaches and achieves state-of-the-art (SOTA) performance. BMCMat is available at https://github.com/zkqq/BMCMat.
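As a rough illustration of pairing "relative" image features with "absolute" numerical features, the sketch below renders a sparse matrix's nonzero pattern as a density image and attaches a few raw statistics; the resolution and the chosen statistics are assumptions, not the RAF definition.

```python
import numpy as np
import scipy.sparse as sp

def raf_features(A: sp.csr_matrix, res=64):
    rows, cols = A.nonzero()
    # 2D histogram of nonzero positions -> density image of the sparsity pattern
    img, _, _ = np.histogram2d(rows, cols, bins=res,
                               range=[[0, A.shape[0]], [0, A.shape[1]]])
    img = img / img.max() if img.max() > 0 else img
    vals = A.data
    # a few raw numerical statistics serving as "absolute" features
    absolute = np.array([A.shape[0], A.nnz, vals.mean(), vals.std(),
                         np.abs(vals).max(), np.abs(vals).min()])
    return img.astype(np.float32), absolute.astype(np.float32)
```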
|
|
09:00-18:30, Paper Mo-Online.25 | |
A Hybrid Force-Position Strategy for Shape Control of Deformable Linear Objects with Graph Attention Networks |
|
Yu, Yanzhao | Tsinghua University |
Yang, Haotian | Tsinghua University |
Tan, Junbo | Tsinghua University |
Wang, Xueqian | Tsinghua University |
Keywords: Deep Learning, Neural Networks and their Applications, AI and Applications
Abstract: Manipulating deformable linear objects (DLOs) such as wires and cables is crucial in various applications like electronics assembly and medical surgeries. However, it faces challenges due to DLOs' infinite degrees of freedom, complex nonlinear dynamics, and the underactuated nature of the system. To address these issues, this paper proposes a hybrid force-position strategy for DLO shape control. The framework, combining both force and position representations of the DLO, integrates state trajectory planning in the force space and Model Predictive Control (MPC) in the position space. We present a dynamics model with an explicit action encoder, a property extractor, and a graph processor based on Graph Attention Networks. The model is used in the MPC to enhance prediction accuracy. Results from both simulations and real-world experiments demonstrate the effectiveness of our approach in achieving efficient and stable shape control of DLOs. Videos are available at https://sites.google.com/view/dlom.
|
|
09:00-18:30, Paper Mo-Online.26 | |
MFBPNet: A Multi-Scale Fusion and Boundary Perception Network for Real-Time Semantic Segmentation in Autonomous Driving |
|
Guo, Jiajia | Shanghai Dianji University |
Fan, Guangyu | Shanghai Dianji University |
Lei, Rao | Shanghai Dianji University |
Cheng, Songlin | Shanghai Dianji University |
Niansheng, Chen | School of Electronic Information Engineering, Shanghai Dianji Un |
Song, Xiaoyong | Shanghai Dianji University |
Yang, Dingyu | Zhejiang University |
Keywords: Deep Learning, Image Processing and Pattern Recognition, Application of Artificial Intelligence
Abstract: Semantic segmentation plays a pivotal role in real-world applications, particularly in autonomous driving. Despite significant advancements in existing semantic segmentation methods, the performance of real-time segmentation approaches remains suboptimal. To address the trade-off between computational efficiency and accuracy in current methods, we propose a novel lightweight real-time semantic segmentation network named MFBPNet. Specifically, this paper introduces three core modules: (1) the Depthwise Separable Convolutional Pyramid Module (DSCPM), which expands the global receptive field and enhances deep feature representation; (2) the Local Attention Refinement Module (LARM), employing channel-wise attention to refine local feature discriminability, particularly for fine-grained objects; and (3) the Boundary Perception Feature Fusion Module (BPFM), which strengthens the feature representation of boundary regions through a multi-level feature fusion mechanism, effectively enhancing the clarity of object boundaries and mitigating boundary blurring issues. Extensive experiments on Cityscapes and CamVid datasets demonstrate that MFBPNet achieves state-of-the-art performance, attaining 75.3% mIoU at 67.3 FPS and 74.3% mIoU at 68.2 FPS, respectively. Compared to existing methods, MFBPNet achieves a superior balance between segmentation accuracy and real-time performance, making it highly suitable for autonomous driving systems requiring both real-time processing and high segmentation quality.
|
|
09:00-18:30, Paper Mo-Online.27 | |
High Order Collaboration-Oriented Federated Graph Neural Network for Accurate QoS Prediction |
|
Chen, Zehuan | Southwest University |
Yuan, Ye | Southwest University |
Xiangwei, Lai | Southwest University |
Keywords: Neural Networks and their Applications, Deep Learning, Machine Learning
Abstract: Predicting Quality of Service (QoS) data is crucial for cloud service selection, where user privacy is a critical concern. Federated Graph Neural Networks (FGNNs) can perform QoS prediction while maintaining user privacy. However, existing FGNN-based QoS predictors commonly implement on-device training on scattered explicit user-service graphs, thereby failing to utilize implicit user-user interactions. To address this issue, this study proposes a high order collaboration-oriented federated graph neural network (HC-FGNN) to obtain accurate QoS predictions with privacy preservation. Concretely, it magnifies the explicit user-service graphs following the principle of the attention mechanism to obtain high order collaboration, which reflects the implicit user-user interactions. Moreover, it adopts a lightweight message aggregation scheme to improve computational efficiency. Extensive experiments on two QoS datasets from real applications indicate that the proposed HC-FGNN offers the advantages of high prediction accuracy and privacy protection.
|
|
09:00-18:30, Paper Mo-Online.28 | |
Federated Deep Latent Factor Model for Privacy-Preserving Recommendation |
|
Gao, Junxiang | Southwest University |
Wu, Di | Southwest University |
Chen, Jia | Beihang University |
Zhou, Min | Southwest University |
Luo, Xin | Chinese Academy of Sciences |
Keywords: Big Data Computing, AI and Applications, Deep Learning
Abstract: Recommender systems (RSs) are extensively applied in various domains, such as e-commerce and online media services, to enhance user experience. Traditional RSs rely on centralized data collection and model training, which, while effective, raise serious privacy concerns as user interaction data may contain sensitive personal information. Federated Learning (FL) has emerged as a promising solution by allowing users to collaboratively train a global model without sharing their raw data. However, federated recommender systems face challenges, such as data sparsity, personalization limitations, and difficulties in capturing complex relationships between users and items. To address these challenges, we propose a novel federated deep latent factor framework (FedDeepLF) to improve recommendation accuracy while preserving user privacy. FedDeepLF introduces three key innovations: (1) utilizing deep neural networks to model complex user-item interactions; (2) incorporating user ratings into the model to better capture individual preferences; and (3) generating synthetic ratings to protect user interactions while improving recommendation quality. Extensive experiments on five datasets demonstrate that FedDeepLF significantly outperforms state-of-the-art federated recommender models in terms of prediction accuracy.
|
|
09:00-18:30, Paper Mo-Online.29 | |
Combining Side-Channel Features with Deep Learning for Network Traffic Anomaly Detection |
|
Gong, Yu | Sichuan Normal University |
Li, Min | Sichuan Normal University |
Zhu, Keyu | Sichuan Normal University |
Huang, Chen | Sichuan Normal University |
Lu, Yuanfang | Sichuan Normal University |
Zhou, Xiuwei | Sichuan Normal University |
Keywords: Deep Learning, Machine Learning
Abstract: As cybersecurity threats evolve, detecting anomalous network traffic is crucial for protecting information systems. However, this field still faces challenges such as data imbalance, feature extraction difficulties, and inadequate model performance. To address these challenges, this paper designs an anomaly detection model based on side-channel feature selection and random forest dimensionality reduction. Specifically, side-channel feature extraction explores potential features in network traffic data, capturing implicit information often missed by traditional methods. Subsequently, random forest importance scoring is used for feature selection and optimization. Additionally, the model integrates an attention mechanism and a bidirectional long short-term memory network. In experiments, the proposed method demonstrates outstanding performance on the CIC-IDS2017 and CIC UNSW-NB15 datasets, achieving accuracies of 98.90% and 98.67%, respectively, providing an effective strategy to strengthen network security defenses.
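A minimal sketch of the random-forest importance-based feature selection stage described above, using scikit-learn; the number of trees and the `top_k` cutoff are assumptions, and the downstream attention/BiLSTM detector is omitted.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def select_features(X, y, feature_names, top_k=30):
    """Rank flow and side-channel features by random-forest importance, keep the top-k."""
    rf = RandomForestClassifier(n_estimators=200, n_jobs=-1, random_state=42)
    rf.fit(X, y)
    order = np.argsort(rf.feature_importances_)[::-1][:top_k]
    return X[:, order], [feature_names[i] for i in order]
```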
|
|
09:00-18:30, Paper Mo-Online.30 | |
Structure-Aware Hypergraph Transformer for Diagnosis Prediction in Electronic Health Records |
|
Wang, Haiyan | Southwest University |
Yuan, Ye | Southwest University |
Keywords: Biometric Systems and Bioinformatics, Neural Networks and their Applications, Deep Learning
Abstract: Electronic Health Records (EHR) systematically organize patient health data through standardized medical codes, serving as a comprehensive and invaluable resource for diagnosis prediction. Graph neural networks (GNNs) have demonstrated effectiveness in modeling interactions between medical codes within EHR. However, existing GNN-based methods face two key limitations when processing EHR: a) their reliance on pairwise relations fails to capture the inherent higher-order dependencies in clinical data, and b) the localized message-passing mechanisms extract global code interactions inadequately. To address these issues, this paper proposes a novel Structure-aware HyperGraph Transformer (SHGT) framework following three-fold ideas: a) modeling EHR data as a hypergraph and employing a hypergraph structural encoder to capture higher-order interactions among medical codes, b) integrating the Transformer architecture to effectively capture global dependencies across the entire hypergraph, and c) designing a tailored loss function incorporating hypergraph reconstruction to preserve the hypergraph’s original structure. Extensive experiments on two real-world EHR datasets demonstrate that the proposed SHGT outperforms existing state-of-the-art models on diagnosis prediction.
|
|
09:00-18:30, Paper Mo-Online.31 | |
StyU-STD: Style-Diverse Sample Generation from Unlabeled Data for Query-By-Example Spoken Term Detection |
|
Ding, Hanyu | Jiangsu University |
Gao, Lijian | Jiangsu University |
Dong, Wenlong | Jiangsu University |
Li, Xiangrui | Jiangsu University |
Mao, Qirong | Jiangsu University |
Keywords: Deep Learning, Neural Networks and their Applications
Abstract: In recent years, query-by-example spoken term detection (QbE-STD) techniques have made significant progress in detection accuracy and speed. However, this task also encounters situations where labeled data is scarce or even nonexistent, with only unlabeled data available. Although some solutions exist, they still struggle to effectively handle highly variable speech, especially when it comes to differing styles. To address this issue, we propose a self-supervised learning method named Style-diverse sample generation from Unlabeled data for query-by-example Spoken Term Detection (StyU-STD). The core idea is to generate samples with the same content but different styles for learning. Specifically, we randomly extract segments from the speech to be tested as positive samples, while segments randomly extracted from other speech data are labeled as negative samples of the speech to be tested. In addition, various transformations are applied to alter the style of both positive and negative samples while preserving their original content. Then, the generated sample pairs are used to train the Style Suppressed Convolutional Network, which focuses more on content-related information in speech and effectively reduces the interference caused by style differences. The experimental results show that, across multiple datasets, our method outperforms existing methods, achieving higher accuracy and robustness.
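To illustrate the sample-generation step, the NumPy sketch below cuts random segments from the query utterance (positives) and from another utterance (negatives) and applies simple content-preserving style perturbations; the specific perturbation set (gain, additive noise, naive speed change) is an assumption, not the paper's exact recipe.

```python
import numpy as np

def perturb_style(seg, rng):
    seg = seg * rng.uniform(0.5, 1.5)                       # random gain
    seg = seg + rng.normal(0, 0.005, size=seg.shape)        # light additive noise
    speed = rng.uniform(0.9, 1.1)                           # naive resampling speed change
    idx = np.arange(0, len(seg) - 1, speed)
    return np.interp(idx, np.arange(len(seg)), seg)

def make_pair(query_wave, other_wave, seg_len, rng=np.random.default_rng(0)):
    """Return one style-perturbed positive (from the query) and one negative segment."""
    i = rng.integers(0, len(query_wave) - seg_len)
    j = rng.integers(0, len(other_wave) - seg_len)
    pos = perturb_style(query_wave[i:i + seg_len], rng)
    neg = perturb_style(other_wave[j:j + seg_len], rng)
    return pos, neg
```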
|
|
09:00-18:30, Paper Mo-Online.32 | |
Aligning Shifting Preference: A PbRL-Driven Approach for Interactive Evolutionary Multi-Objective Optimization |
|
Dai, Yulong | National University of Defense Technology |
Chen, Ziyi | National University of Defense Technology |
Dou, Yajie | National University of Defense Technology |
Deng, Jinke | National University of Defense Technology, College of Systems En |
Jiang, Jiang | College of Systems Engineering National University of Defense Te |
Tan, Yuejin | College of Systems Engineering, National University of Defense T |
Keywords: Evolutionary Computation, Deep Learning, Expert and Knowledge-Based Systems
Abstract: Evolutionary Multi-Objective Optimization Algorithms (EMOAs) are extensively utilised to address issues involving conflicting objectives. Recent studies suggest that decision-makers (DMs) are typically only concerned with a portion of the Pareto frontier. The Interactive Evolutionary Multi-Objective Optimization Algorithm (iEMOA) has been developed to address this issue by integrating DM preferences into the optimization process through multiple interactions. However, extant approaches often treat these preferences as static, neglecting their dynamic nature. This paper proposes a preference-based reinforcement learning (PbRL) method that utilises list-wise preferences to dynamically align with the shifting preferences of the DM in a multi-objective optimization problem (MOP). First, a Markov decision process (MDP) model for list-wise preference is developed. Second, modifications are made to the reward and resampling mechanisms in the basic deep deterministic policy gradient (DDPG) algorithm. Third, the machine decision-maker (MDM) model is enhanced to account for preference shift. The efficacy of the proposed approach is demonstrated through comparative experiments with state-of-the-art methods, highlighting its superior performance in handling the complex preference behaviours exhibited by real-world DMs.
|
|
09:00-18:30, Paper Mo-Online.33 | |
Knowledge Graph-Guided Diffusion Model for Personalized Traditional Chinese Medicine Prescription Recommendation |
|
Zhang, Chaobo | Heilongjiang University |
Tan, Long | Heilongjiang University |
Keywords: AI and Applications, Expert and Knowledge-Based Systems, Machine Learning
Abstract: Artificial intelligence (AI) has received much attention in the field of traditional Chinese medicine (TCM) prescription recommendation. Existing models pay less attention to patient-personalized attribute profiles and the problem of prescription data sparsity (imbalanced herb labels). To remedy this shortcoming, we propose a knowledge graph-guided diffusion model (DM) for personalized herbal prescription recommendation (HPR), namely TCM-KGDPR. We first learn patients' personalized representations by the prompt fine-tuning technique and enhance the knowledge via the pre-trained model. Subsequently, we leverage an improved DM to model the multivariate symptom-herb relationship and adopt a mixed node- and edge-level strategy to alleviate the problem of imbalanced herb labels. In addition, we utilize a manually constructed TCM knowledge graph as complementary knowledge to capture the symptom-herb relationship on a global scale. Extensive experiments against several baselines on a real-world dataset and two public datasets demonstrate the effectiveness of TCM-KGDPR. Not only does it provide new information for clinical decision-making in TCM, but it also promotes the modernization and innovative development of TCM diagnosis and treatment.
|
|
09:00-18:30, Paper Mo-Online.34 | |
Multi-Object Tracking Method of Pigs in Pig Farms Integrating VSS and Trajectory Interpolation |
|
Zhu, Deli | Chongqing Normal University |
He, Liang | Chongqing Normal University |
Li, Xinjie | Chongqing Normal University |
Li, Yi | ChongQing Academy of Animal Sciences |
Keywords: Image Processing and Pattern Recognition, Machine Vision, Neural Networks and their Applications
Abstract: Pig farming plays a crucial role in animal husbandry, and with the industry's growing scale, modern farming technologies are in high demand. This paper proposes an improved multi-object tracking method for pigs, leveraging AI to address key challenges in intelligent farming. The RT-DETR detection model is enhanced by adding the VSS (Visual State Space) module, which integrates state space models and convolution operations to improve long-distance dependency and global context modeling. The CAFM (Convolution and Attention Fusion Module) replaces RT-DETR’s original CCFF module, improving the fusion of multi-scale features. Additionally, the ByteTrack algorithm is modified with an interpolation-based trajectory completion mechanism to address occlusions and reduce object ID switches. Experimental results show that the optimized RT-DETR improves accuracy by 3.5%, recall by 5.1%, mAP50 by 1.2%, and mAP50:95 by 2.1%. The enhanced ByteTrack algorithm achieves significant improvements in multi-object tracking metrics, with IDF1 increasing by 4.8%, MOTA increasing by 0.8%, and HOTA increasing by 3.0%, while the number of ID switches is reduced from 13 to 3. This method significantly enhances tracking performance, ensuring stable ID continuity and providing technical support for the automated monitoring of pig growth in complex farming environments.
|
|
09:00-18:30, Paper Mo-Online.35 | |
Multi-Scale Dynamic Dilated Transformer for UAV Real-Time Ship Inspection |
|
Miao, Huijie | Academy of Military Sciences |
Zhang, Jie | Wu |
Yan, Jiang | Wu |
Zhou, Jian | Wu |
Keywords: Machine Vision, Image Processing and Pattern Recognition, Deep Learning
Abstract: To address the problems encountered in UAV-based object detection of vessels in China's offshore waters, namely the multi-scale targets and complex backgrounds present in UAV imagery, this paper builds on the YOLOv11 model and proposes a new object detection architecture, DCM-Net, that significantly improves UAV monitoring performance through several modifications. First, to uncover the multidimensional semantic information encapsulated within the image, we introduce the Multi-scale Dilated Attention Transformer (MSDA) module, which captures target details across a spectrum of scales and refines the model's sensitivity to image subtleties; the MSDA module improves the comprehension and expression of multi-scale features while maintaining computational efficiency. Second, we dynamically adjust the model architecture so that DCM-Net is more sensitive to changes in the scale of ship targets, further improving monitoring performance. Finally, to reduce model complexity and satisfy the detection task when edge-device resources are limited, we introduce a dynamic convolution module that lowers computational consumption while improving the model's expressive ability. Extensive experiments on the public HRSC2016, Seaships, and Shipsdataset benchmarks show improvements of 3.3%, 16%, and 50% in mAP50, computational consumption, and FPS, respectively. The experimental outcomes demonstrate that our algorithm significantly enhances the model's detection capabilities.
|
|
09:00-18:30, Paper Mo-Online.36 | |
YOLO-CCOF: An Algorithm for Aluminum Defect Detection Based on Improved YOLOv8 |
|
Han, Jiashu | Hohai University |
Ding, Yitong | Hohai University |
Chen, Hua | College of Mechanical and Electrical Engineering, Hohai Universi |
Zhou, Chengyu | Hohai University |
Li, Yuyang | Hohai University |
Keywords: Image Processing and Pattern Recognition, Machine Vision, Deep Learning
Abstract: In industrial settings, the low-resolution images captured by cameras on aluminum production lines often yield unclear defect features, giving rise to product quality problems. This study introduces the YOLO-CCOF algorithm to address these challenges through several novel enhancements. First, to improve the detection of small targets, we propose C2f-SU, which modifies the backbone network by replacing the first two C2f layers. Second, to enhance multi-scale context aggregation and better detect strip defects, we introduce the CPMS attention mechanism, integrated at the end of the backbone network. Third, to effectively capture both high-level semantic information and low-level spatial details, we replace the original Neck network with ODEff-RepGFPN. Finally, to address the class imbalance between easy and hard samples in the dataset, we propose the Focaler-WIoU loss function. Extensive ablation studies and comparative evaluations on the Alibaba Cloud Tianchi competition dataset (APDDD) show that YOLO-CCOF achieves a precision, recall, and mAP@0.5 of 86.5%, 77.8%, and 81.5%, respectively,
|
|
09:00-18:30, Paper Mo-Online.37 | |
TCN-BiGRU Hybrid Model with Periodic Huber Loss for Enhanced Multi-Energy Load Forecasting |
|
Xu, Danyang | Huaqiao University |
Fan, Zongwen | Huaqiao University |
Gou, Jin | Huaqiao University |
Keywords: AI and Applications, Deep Learning
Abstract: Accurate multi-energy load forecasting is crucial for the optimal operation of Integrated Energy Systems (IES). This study innovatively proposes a hybrid prediction model with two core innovations: (1) We design a periodic Huber loss function that dynamically adjusts penalty weights, effectively balancing load periodicity characteristics with outlier robustness; (2) We propose a stacked prediction architecture that combines the dilated convolution properties of Temporal Convolutional Network (TCN) with the bidirectional temporal modeling capabilities of Bidirectional Gated Recurrent Unit (BiGRU). By connecting multiple fundamental modules in series through residual connections, the proposed model achieves progressive extraction and prediction of multi-scale temporal features in multi-energy load sequences through gradual residual output optimization. Experiments conducted on the actual operational data from Arizona State University's campus energy system demonstrate that the proposed model exhibits better performance than the advanced methods (Informer and FEDformer). The experimental results indicate the proposed model is effective for real-time operational scheduling of integrated energy systems.
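The exact form of the periodic Huber loss is not given in the abstract, so the following PyTorch sketch is speculative: a standard Huber loss reweighted by a cosine of the time-of-day phase; the weighting form, period, and constants are assumptions.

```python
import torch
import torch.nn.functional as F

def periodic_huber(pred, target, t_index, period=24, delta=1.0, a=0.5):
    """pred, target: (B, T); t_index: (T,) integer time-of-day indices."""
    phase = 2 * torch.pi * t_index.float() / period
    weight = 1.0 + a * torch.cos(phase)                        # emphasize the assumed daily cycle
    loss = F.huber_loss(pred, target, delta=delta, reduction="none")
    return (loss * weight).mean()
```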
|
|
09:00-18:30, Paper Mo-Online.38 | |
HRCRec: A Hybrid Residual Connect Attention Network for Sequential Recommendation
|
Ji, Peichen | Xinjiang University |
Qin, Jiwei | Xinjiang University |
Ma, Jie | Xinjiang University |
Keywords: Representation Learning, Media Computing, Expert and Knowledge-Based Systems
Abstract: In sequential recommendation, the expansion of Transformer layers often results in over-smoothing phenomena, where the hidden representations of users become similar. The problem occurs because self-attention algorithms essentially act as a low-pass filter, insufficient for capturing high-frequency information. As a result, self-attention algorithms struggle to detect the rapidly changing interests of users in the short term. Residual connections promote information flow by passing the rich low-frequency information within the model, compelling the model to focus on high-frequency details. Therefore, we propose a sequential recommendation model named Hybrid Residual Connect attention network for sequential Recommendation (HRCRec). Hybrid Residual Connection utilizes a frequency separation module to separate low-frequency and high-frequency components, then employs feature rescaling to adaptively amplify the high-frequency components. Finally, adaptive aggregation is applied to balance the weights of high- and low-frequency information during training. Our experimental evaluation on four benchmark datasets indicates that our model surpasses other baseline methods in recommendation accuracy.
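A minimal sketch of the frequency-separation idea for sequence representations: a moving average acts as the low-pass component, the residual as the high-frequency component, which is rescaled and then adaptively aggregated; the kernel size and gating form are assumptions, not the HRCRec definition.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FreqSeparation(nn.Module):
    def __init__(self, dim, kernel=5):
        super().__init__()
        self.kernel = kernel
        self.scale = nn.Parameter(torch.ones(dim))      # rescales high-frequency detail
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, x):                               # x: (B, L, D) item-sequence features
        low = F.avg_pool1d(x.transpose(1, 2), self.kernel, stride=1,
                           padding=self.kernel // 2).transpose(1, 2)   # low-pass (moving average)
        high = (x - low) * self.scale                    # amplified high-frequency residual
        g = torch.sigmoid(self.gate(torch.cat([low, high], dim=-1)))
        return g * high + (1 - g) * low                  # adaptive aggregation
```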
|
|
09:00-18:30, Paper Mo-Online.39 | |
ViT-DGA: ViT-Driven Dual-Granularity Architecture for Image Manipulation Localization |
|
Wu, Qiang | Shanghai University of Electric Power |
Wei, Weimin | Shanghai University of Electric Power |
Sun, Xueyang | Shanghai University of Electric Power |
Shi, Wuyao | Shanghai University of Electric Power |
Keywords: Information Assurance and Intelligence, Media Computing, Application of Artificial Intelligence
Abstract: With the development of deep learning, image manipulation techniques have become increasingly advanced and difficult to detect, leading to serious trust crises in fields such as news reporting, judicial forensics, and social media. In recent years, deep learning-based image manipulation localization methods have achieved significant progress. Most current methods focus on extracting non-semantic information, which is reasonable, but since most manipulation operations modify the semantic content of images, both semantic and non-semantic information are indispensable. Since the introduction of Vision Transformer (ViT), it has dominated the field of computer vision, with its Transformer architecture demonstrating powerful capabilities in extracting image features. However, few ViT-based methods have been applied to image manipulation localization, and their performance remains unsatisfactory. Therefore, we combine traditional vision models with advanced ViT, integrating both semantic and non-semantic information to construct a new image manipulation localization framework called ViT-DGA. The traditional vision model is used to extract features from different frequency domains of the image, while a ViT-based intermediate layer supplements the original image features, effectively guiding the model to learn multi-perspective features. Additionally, we conducted extensive experiments on various datasets. The results demonstrate that ViT-DGA not only achieves superior localization accuracy compared to state-of-the-art image manipulation localization methods but also exhibits strong robustness in common post-processing scenarios.
|
|
09:00-18:30, Paper Mo-Online.40 | |
SEER: Semantic Enhancement and Emotional Reasoning Network for Multimodal Fake News Detection |
|
Zhu, Peican | Northwestern Polytechnical University |
Jing, Yubo | Northwestern Polytechnical University |
Cheng, Le | Northwestern Polytechnical University |
Chen, Bin | Unit 93212 of People's Liberation Army of China |
Cui, Xiaodong | Northwestern Polytechnical University |
Wu, Lianwei | Northwestern Polytechnical University |
Tang, Keke | Guangzhou University |
Keywords: Multimedia Computation, Computational Intelligence, Deep Learning
Abstract: Previous studies on multimodal fake news detection mainly focus on the alignment and integration of cross-modal features, as well as the application of text-image consistency. However, they overlook the semantic enhancement effects of large multimodal models and pay little attention to the emotional features of news. In addition, people find that fake news is more inclined to contain negative emotions than real ones. Therefore, we propose a novel Semantic Enhancement and Emotional Reasoning (SEER) Network for multimodal fake news detection. We generate summarized captions for image semantic understanding and utilize the products of large multimodal models for semantic enhancement. Inspired by the perceived relationship between news authenticity and emotional tendencies, we propose an expert emotional reasoning module that simulates real-life scenarios to optimize emotional features and infer the authenticity of news. Extensive experiments on two real-world datasets demonstrate the superiority of our SEER over state-of-the-art baselines.
|
|
09:00-18:30, Paper Mo-Online.41 | |
MHG-DVA: A Multi-Scale Dual-View Attention Model for Heterogeneous Graph Neural Networks |
|
Li, Ping | South China University of Technology |
Ye, Qi | South China University of Technology |
Huang, Haitian | Tsinghua Shenzhen International Graduate School |
Wang, Zhenyu | South China University of Technology |
Keywords: Neural Networks and their Applications, Deep Learning, Artificial Social Intelligence
Abstract: Heterogeneous graphs (HGs), characterized by diverse node and edge types as well as complex semantics, pose significant challenges for graph neural networks. Existing methods suffer from three major limitations: (1) homogeneous aggregation leads to semantic confusion and weakens local structural information; (2) the importance of different semantic paths is difficult to distinguish due to a lack of dynamic semantic fusion mechanisms; (3) single-view modeling fails to capture cross-view interactions between structural and semantic information. To address these issues, we propose MHG-DVA, a multi-scale dual-view attention model framework that integrates both the structural view and the meta-path view. The structural view employs gated attention and structural-level fusion to finely model heterogeneous neighbors, while the meta-path view combines intra-path semantic propagation with inter-path context-aware attention to capture multi-hop semantic dependencies. A node-adaptive fusion module is further introduced to dynamically integrate multi-view information through gated residual connections and cross-view interactions. Experiments on DBLP, ACM, Yelp, and Amazon show that MHG-DVA outperforms state-of-the-art methods in node classification (+0.95%, +2.59%, +0.48% Micro-F1) and fraud detection, where it surpasses IDGL by +0.89% and +1.03% in AUC under 20% and 40% training settings, respectively. Ablation studies confirm the efficacy of the multi-scale mechanism, establishing a novel fusion paradigm for heterogeneous graph analysis with both theoretical and practical value.
|
|
09:00-18:30, Paper Mo-Online.42 | |
Mamba: A Contrastive Learning and Data Augmentation-Based Model for Improving Deepfake Generalization |
|
Li, Kai | XinJiang University |
Jiang, Shaochen | Xinjiang University |
Wang, Liejun | Xinjiang University |
Liu, Chao | XinJiang University |
He, Sijia | XinJiang University |
Lu, Hongmeng | Xinjiang University |
Keywords: Machine Vision, Artificial Social Intelligence, Image Processing and Pattern Recognition
Abstract: In Deepfake face detection, the variety of forgery techniques and limited datasets pose challenges for learning discriminative features for authenticity judgment. Addressing this issue is crucial for improving generalization ability, especially as models are likely to encounter unknown forgery techniques in real-world scenarios. To tackle this, we propose CDA-Net, a novel contrastive data augmentation network that enhances both detection accuracy and generalization. CDA-Net integrates three key modules: Contrast-Driven Feature Aggregation (CDFA), Dual-Perspective Normalization (DPN), and Multi-Scale Mamba (MS-Mamba). CDFA improves feature contrasts through sliding window scanning, enabling detection of subtle forgery boundaries without explicit supervision. DPN uses CrossNorm for data augmentation, avoiding overfitting, while SelfNorm adaptively reduces style shifts between training and testing sets, focusing on authenticity-critical features. MS-Mamba applies a multi-directional scanning mechanism for long-range relationship comparisons, boosting contrastive learning ability. Extensive evaluations across five benchmark datasets demonstrate that CDA-Net achieves superior average performance compared to existing state-of-the-art methods, highlighting its strong generalization ability to detect unknown forgery techniques and adapt to real-world scenarios.
|
|
09:00-18:30, Paper Mo-Online.43 | |
ViTProbe: Defending Vision Transformers against Adversarial Patch Attacks through Single-Layer Inspection |
|
Xu, Fanjin | National University of Defense Technology |
Li, Qiuran | National University of Defense Technology |
Wen, Kaiyan | National University of Defense Technology |
Wang, Yaohua | National University of Defense Technology |
Keywords: AI and Applications, Application of Artificial Intelligence, Deep Learning
Abstract: Vision Transformer (ViT) has achieved remarkable performance in image classification but remains susceptible to natural corruptions and adversarial attacks. These attacks fabricate either global image-wide or localized patch-based perturbations. Although ViT demonstrates inherent resilience to natural corruptions and global image-wide adversarial attacks, it is still vulnerable to patch-based attacks. Current defense methods for ViT against such attacks typically involve re-inference or layer-by-layer analysis, incurring substantial computational costs. To address this issue, we introduce ViTProbe, a defense mechanism that identifies and removes adversarial patches within a single layer of the model. It is observed that the relative length of the attention score associated with an adversarial patch is extremely low. Leveraging this insight, ViTProbe detects adversarial patches by assessing the ratio between components of the attention score matrix and the corresponding components of the input matrix in the model's specific layer. Potential adversarial patches are discarded before forwarding the chosen layer’s output, thereby eliminating the need for re-inference. Comprehensive evaluations against three patch-based attacks demonstrate that ViTProbe can recover the classification accuracy under adversarial patch attacks to 94.17% of the clean accuracy on average, achieving performance comparable to state-of-the-art defense while significantly reducing computational costs by up to 95.67%.
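The single-layer inspection can be pictured as follows: at one chosen layer, score each patch by the attention it receives relative to the magnitude of its input token and flag outliers with unusually small ratios. The sketch below is illustrative only; the z-score threshold and the choice of layer are assumptions, not the ViTProbe rule.

```python
import torch

def flag_suspicious_patches(attn, tokens, z_thresh=-2.5):
    """attn: (B, heads, N, N) attention at the inspected layer (CLS at index 0);
    tokens: (B, N, D) inputs to that layer. Returns a bool mask of flagged patches."""
    received = attn.mean(dim=1).sum(dim=1)[:, 1:]        # attention mass each patch receives
    norms = tokens[:, 1:].norm(dim=-1).clamp_min(1e-6)   # per-patch input magnitude
    ratio = received / norms
    z = (ratio - ratio.mean(dim=1, keepdim=True)) / ratio.std(dim=1, keepdim=True).clamp_min(1e-6)
    return z < z_thresh                                   # True = candidate adversarial patch
```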
|
|
09:00-18:30, Paper Mo-Online.44 | |
ACFormer: A Multimodal Attention and Contrastive Learning Framework for Chest Disease Risk Prediction |
|
Lin, Tao | Shanghai Institute of Technology |
Zhang, Yiheng | Shanghai Institute of Technology |
Keywords: Artificial Social Intelligence, Application of Artificial Intelligence, Deep Learning
Abstract: With the development of medical artificial intelligence, the application of multimodal data fusion in disease prediction has received increasing attention. However, most existing disease prediction methods rely on single-modality data, such as medical imaging or clinical text, which makes it difficult to fully exploit cross-modal associations, thereby limiting prediction accuracy. To address this limitation, we construct a chest disease risk prediction model that integrates medical imaging and clinical text. The model adopts a dual-tower architecture to independently encode image and text features and employs contrastive learning to optimize cross-modal semantic alignment. The overall framework consists of a medical image encoder, a clinical data encoder, a multimodal contrastive learning module, and a fusion-based prediction module. By combining intra-modal self-attention and cross-modal attention through bidirectional attention interaction, the model enhances semantic consistency between image and text, thereby improving classification accuracy. The proposed method demonstrates outstanding performance in chest disease prediction, particularly in detecting subtle lesions and assessing multi-label disease risks. This study offers new insights into the application of medical AI for clinical decision support.
|
|
09:00-18:30, Paper Mo-Online.45 | |
Boosting Image Dehazing with ℓ0-Regularized Extreme Channels Prior
|
Zhao, Hongtian | Xinjiang University |
Deng, Siyin | Xinjiang University |
Abdurahman, Abdujelil | Xinjiang University |
Xu, Hui | Shanghai Jiao Tong University |
Keywords: Machine Vision, Heuristic Algorithms, Image Processing and Pattern Recognition
Abstract: Current challenges in image dehazing primarily arise from significant scene variability, including underexposed images, distortions resulting from non-uniform fog density, atmospheric fluctuations, and variations in scene depth. This paper proposes a novel approach to address these issues by integrating Extreme Channels Prior (ECP) into restoration models. Specifically, an ℓ0-norm constraint is applied to the extreme channel values, which effectively guides the iterative restoration process towards a haze-free condition. An ECP-based dehazing algorithm is developed to optimize the proposed model, and this paper further explores its incorporation into established frameworks for transmission and atmospheric light estimation, including dark channel prior, combined radiance-reflectance, and saturation line, as well as strategies such as the haze-lines model and the light-absorption-enhanced model. Experiments on the SOTS and I-HAZE datasets are conducted to evaluate the proposed method, and test results verify that the integration of ECP consistently enhances the performance of the dehazing algorithm across various models, and that our approach surpasses several state-of-the-art dehazing methods with a computationally acceptable overhead.
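For context, the extreme channels the prior operates on are the dark channel (local minimum over colors and a window) and the bright channel (local maximum); the sketch below computes both with SciPy, together with a simple count-based surrogate for the ℓ0 term. The patch size and the surrogate are assumptions, not the paper's optimization scheme.

```python
import numpy as np
from scipy.ndimage import minimum_filter, maximum_filter

def extreme_channels(img, patch=15):
    """img: (H, W, 3) float array in [0, 1]; returns the dark and bright channels."""
    dark = minimum_filter(img.min(axis=2), size=patch)
    bright = maximum_filter(img.max(axis=2), size=patch)
    return dark, bright

def l0_count(channel, eps=1e-3):
    # surrogate for the ℓ0 penalty: number of clearly non-zero entries
    return int((np.abs(channel) > eps).sum())
```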
|
|
09:00-18:30, Paper Mo-Online.46 | |
Frequency-Domain Decoupled Guided Feature Space Augmentation for Multi-Task Network in Few-Shot Diagnosis of Demyelinating Diseases |
|
Lin, Wenlong | Chongqing Normal University |
Liu, Jinhui | Chongqing Normal University |
Han, Yongliang | The First Affiliated Hospital of Chongqing Medical University |
Cui, Shaoguo | Chongqing Normal University |
Yongmei, Li | The First Affiliated Hospital of Chongqing Medical University |
Keywords: Biometric Systems and Bioinformatics, Deep Learning, Application of Artificial Intelligence
Abstract: Multiple sclerosis (MS) and neuromyelitis optica spectrum disorder (NMOSD) are rare demyelinating diseases of the central nervous system. Limited sample sizes pose significant challenges for traditional convolutional neural networks (CNNs) in learning lesion features, particularly in identifying critical regions relevant to disease prediction. Due to the distinct lesion patterns between MS and NMOSD, the anterior visual pathway (AVP) plays a crucial role in early diagnosis. However, these lesions often exhibit low contrast, making them difficult to detect using conventional methods, despite their clearer representation in the frequency domain. Few studies have incorporated AVP as prior knowledge into deep learning frameworks or addressed the differences in lesion characteristics across the frequency domain. To tackle these challenges, we propose a multi-task network, VAE-FreqNet, which jointly performs AVP segmentation and disease classification. First, a dynamic decoupling strategy based on discrete cosine transform (DCT) is introduced, where the HFDownsample and HFTUpsample modules preserve high-frequency and low-frequency details lost during downsampling and upsampling, thereby enhancing subtle frequency-domain features. Additionally, we design a frequency-domain hierarchical variational autoencoder (FHVAE) module that employs HVAE blocks for variational inference, fusing decoupled and original features to generate synthetic representations, thus alleviating data scarcity. Extensive experiments demonstrate that VAE-FreqNet significantly improves both classification and segmentation performance.
|
|
09:00-18:30, Paper Mo-Online.47 | |
Exploring Classical Piano Performance Generation with Expressive Music Variational AutoEncoder |
|
Luo, Jing | Xi'an Jiaotong University |
Yang, Xinyu | Xi'an Jiaotong University |
Wei, Jie | China Mobile Group Shaanxi Co., Ltd |
Keywords: Application of Artificial Intelligence, Neural Networks and their Applications, Media Computing
Abstract: The creativity of classical music arises not only from composers who craft the musical sheets but also from performers who interpret the static notations with expressive nuances. This paper addresses the challenge of generating classical piano performances from scratch, aiming to emulate the dual roles of composer and pianist in the creative process. We introduce the Expressive Compound Word (ECP) representation, which effectively captures both the metrical structure and expressive nuances of classical performances. Building on this, we propose the Expressive Music Variational AutoEncoder (XMVAE), a model featuring two branches: a Vector Quantized Variational AutoEncoder (VQ-VAE) branch that generates score-related content, representing the Composer, and a vanilla VAE branch that produces expressive details, fulfilling the role of Pianist. These branches are jointly trained with similar Seq2Seq architectures, leveraging a multiscale encoder to capture beat-level contextual information and an orthogonal Transformer decoder for efficient compound token decoding. Both objective and subjective evaluations demonstrate that XMVAE generates classical performances with superior musical quality compared to state-of-the-art models. Furthermore, pretraining the Composer branch on extra musical score datasets contributes to a significant performance gain.
|
|
09:00-18:30, Paper Mo-Online.48 | |
Light-UWDet: A Lightweight Network for Underwater Small Object Detection |
|
Yang, Zheqi | Qilu University of Technology |
Li, Aimin | Qilu University of Technology |
Li, Mian | Qilu University of Technology |
Su, Zhike | Qilu University of Technology (Shandong Academy of Sciences) |
Keywords: Image Processing and Pattern Recognition, Deep Learning, Application of Artificial Intelligence
Abstract: Underwater small object detection is vital for marine monitoring, archaeology, and resource exploration. However, complex backgrounds and small target sizes present significant challenges. Traditional methods often lack the feature extraction accuracy and efficiency required for practical, lightweight applications. Although YOLOv8 offers high accuracy, its backbone and feature extraction modules are computationally intensive, limiting deployment on resource-constrained devices. To address this, we propose a lightweight detection framework based on YOLOv8 with three key improvements: (1) replacing the original backbone with StarNet to enhance efficiency and reduce computation; (2) introducing Wavelet Transform Convolution (WTConv) in the C2f module to expand the receptive field and reduce redundancy; (3) incorporating a Convolutional Additive Self-Attention (CAS) mechanism in the Neck to improve feature fusion. Experiments on the URPC dataset show that our model achieves 85.8% mAP, surpassing mainstream methods (84.7%) while reducing parameters (2.2M vs. 2.4M) and computation (5.8G vs. 7.5G), demonstrating superior accuracy and lightweight design.
|
|
09:00-18:30, Paper Mo-Online.49 | |
SCText: A Privacy-Preserving Framework with Contextual Semantic Consistency for Secure Text Processing |
|
Cai, Tianxin | Dongguan University of Technology |
Pan, Yiteng | Dongguan University of Technology |
Yan, Xiaohu | Shenzhen Polytechnic University |
Keywords: Information Assurance and Intelligence, Machine Learning, Deep Learning
Abstract: With growing cloud server usage and frequent privacy breaches, text privacy protection research has gained significant attention. However, applying differential privacy noise to individual tokens causes semantic loss in sentence-level tasks and remains vulnerable to reconstruction attacks. Previous studies show privacy attacks can infer sensitive information using inter-part-of-speech context, leading to leakage. To address poor semantic consistency and weak privacy protection in single-token strategies, we propose SCText, a hierarchical part-of-speech contextual semantics framework. This solution employs hierarchical positioning, context influence matrices, and exponential noise desensitization to preserve semantic consistency while preventing leakage. Experiments demonstrate significantly enhanced semantic consistency between original and desensitized texts, along with stronger resistance to privacy attacks.
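For readers unfamiliar with exponential-noise desensitization, the sketch below shows a generic exponential-mechanism token replacement step; the candidate list, similarity scores, and sensitivity value are hypothetical placeholders and do not reproduce SCText's hierarchical scheme.

import numpy as np

def exponential_mechanism_replace(candidates, scores, epsilon, sensitivity=1.0, rng=None):
    """Sample a replacement token with probability proportional to exp(eps * score / (2 * sensitivity))."""
    rng = rng or np.random.default_rng()
    scores = np.asarray(scores, dtype=np.float64)
    logits = epsilon * scores / (2.0 * sensitivity)
    logits -= logits.max()                              # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return candidates[rng.choice(len(candidates), p=probs)]

# Hypothetical candidates for the token "London" with context-similarity scores.
cands = ["London", "Paris", "Berlin", "Madrid"]
sims = [1.00, 0.82, 0.80, 0.78]
print(exponential_mechanism_replace(cands, sims, epsilon=2.0))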
|
|
09:00-18:30, Paper Mo-Online.50 | |
DPHNet: A Dynamic Parallel Hybrid Network for Sitting Posture Recognition |
|
Zhong, Shanshan | Ningbo University |
Shi, Shoudong | Ningbo University |
Tai, Yongheng | Ningbo University |
Lan, Ting | Ningbo University |
Zhao, Tianxiang | Ningbo University |
Qiu, Kedi | Ningbo University |
Keywords: Deep Learning, Application of Artificial Intelligence
Abstract: Poor sitting posture can lead to a variety of diseases. Therefore, using visual technology to recognize and correct poor sitting posture is of great significance in promoting a healthier lifestyle. In recent years, hybrid architectures that combine the advantages of Convolutional Neural Networks and Vision Transformers have achieved success in a variety of vision tasks. However, challenges remain in effectively coordinating information fusion between the two models and controlling computational cost without compromising performance. To address these issues, we propose a dynamic parallel hybrid model, DPHNet. DPHNet introduces a Dual-Path Control (DPC) module that dynamically enables an optional attention branch based on feature information and adjusts the weights between the attention and fixed convolutional branches. In order to strike a balance between computational cost and feature representation accuracy, DPHNet introduces the Patch Dynamic Selection (PDS) module, which dynamically selects the patch size based on feature information. With this design, DPHNet demonstrates excellent performance in sitting posture recognition: on our constructed dataset containing 31,020 images covering 8 sitting postures, DPHNet attains a recognition accuracy of 96.1% while maintaining a minimal computational cost of just 3.1 GFLOPs, outperforming existing state-of-the-art models.
|
|
09:00-18:30, Paper Mo-Online.51 | |
Multivariate Water Quality Time Series Prediction Based on DL-SegRNN |
|
Xu, Fuqing | Qilu University of Technology (Shandong Academy of Sciences) |
Zhang, Zhanshuo | Jinan Institute of Supercomputing Technology |
Liu, Xin | Qilu University of Technology (Shandong Academy of Sciences) |
Keywords: Deep Learning, Neural Networks and their Applications, Machine Learning
Abstract: Water quality time series forecasting is crucial for environmental management and water resource protection. However, due to the nonlinear nature, periodic patterns and complex interdependencies among variables in water quality data, existing methods often suffer from accuracy degradation in long-term forecasting. To address this challenge, this study proposes DL-SegRNN, a novel model that integrates both frequency-domain and time-domain information while optimizing the SegRNN unit structure to enhance multivariate water quality time series prediction accuracy. The model innovatively incorporates frequency-domain features to improve the capture of periodic and trend patterns and refines the unit structure to strengthen temporal dependency modeling. Additionally, Hyperopt is employed for hyperparameter tuning to ensure optimal model configuration. Experimental results demonstrate that DL-SegRNN achieves superior performance in long-term forecasting, outperforming baseline models with an MSE (Mean Squared Error) of 0.1698, MAE (Mean Absolute Error) of 0.1872, and RMSE (Root Mean Squared Error) of 0.3934, showcasing enhanced stability and robustness. These findings highlight the model's potential to improve water quality forecasting accuracy, providing technical support for intelligent water resource management and contributing to more effective environmental decision-making.
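A minimal sketch of Hyperopt-driven hyperparameter tuning of the kind mentioned above; the search space, parameter names, and the train_and_validate stub are assumptions for illustration, not the paper's actual configuration.

from hyperopt import fmin, tpe, hp, Trials

def train_and_validate(params):
    """Hypothetical stand-in: train DL-SegRNN with `params` and return validation MSE."""
    # Replace with the real training/validation loop.
    return (params["hidden_size"] - 128) ** 2 * 1e-6 + params["lr"]

space = {
    "segment_len": hp.choice("segment_len", [12, 24, 48]),
    "hidden_size": hp.quniform("hidden_size", 64, 512, 64),
    "lr": hp.loguniform("lr", -9, -4),   # roughly 1e-4 .. 2e-2
}

trials = Trials()
best = fmin(fn=train_and_validate, space=space, algo=tpe.suggest,
            max_evals=30, trials=trials)
print("best configuration:", best)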
|
|
09:00-18:30, Paper Mo-Online.52 | |
Group Point Symmetry and Multi-Step Dynamics Prediction for Physical Systems with Graph Neural Networks |
|
Huang, Bingwei | Institute of Software Chinese Academy of Sciences |
Zhang, Chenxi | Institute of Software Chinese Academy of Sciences |
Yang, Sheng | Chinese Academy of Sciences |
Wu, Fengge | Institute of Software, Chinese Academy of Sciences |
Zhao, Junsuo | Institute of Software, Chinese Academy of Sciences |
Keywords: AI and Applications, Application of Artificial Intelligence, Deep Learning
Abstract: Modeling the 3D dynamics of relational systems is a key challenge in the field of natural sciences, especially in molecular simulations and particle mechanics. However, existing graph neural network methods overlook two critical practical issues: (1) Error accumulation in multi-step reasoning: They are designed for single-step predictions, which leads to severe error accumulation during multi-step inference. (2) Overly strict equivariance constraints: In scientific and engineering applications, dynamics often exhibits symmetry breaking or relaxation due to boundary conditions. In this work, we propose Group Point Multi-Step Dynamics - Graph Neural Networks (GPMS-GNN), which consists of two main modules: (1) the neural operator, which approximates dynamics as a time-evolving function and enables multi-step prediction in a single inference; (2) the relaxed equivariant graph neural network, which relaxes continuous equivariant inductive biases into discrete forms to accommodate scenarios involving symmetry breaking. Extensive experiments on particle simulations, human motion capture, and molecular dynamics demonstrate that GPMS-GNN outperforms state-of-the-art methods in both accuracy and data efficiency.
|
|
09:00-18:30, Paper Mo-Online.53 | |
PerturbGen: A Population Based Perturbation Method for Processor Test Generation |
|
Wang, Jingkai | National University of Defense Technology |
Chen, Renzhi | Defense Innovation Institute |
Wang, Lei | Defense Innovation Institute |
Keywords: Evolutionary Computation
Abstract: The increasing complexity of processor design places higher requirements on simulation-based verification, particularly in test case generation. Existing random test generators often struggle to thoroughly validate the deep processor states or fail to provide sufficient coverage diversity. In this paper, we propose PerturbGen, an evolutionary algorithm-inspired population-based perturbation test generation method, designed to enhance traditional random test generators. PerturbGen introduces crossover and mutation operators from evolutionary algorithms, applying perturbations at both the population and member levels to generate high-quality tests. Our approach also includes a coverage-guided feedback loop for iteratively filtering members to guide the exploration of uncovered areas in the processor. We evaluated PerturbGen on an open-source RISC-V processor and compared it with two widely-used random test generators. Our method achieves relative improvements of 3.56%, 5.70%, and 14.01% in three key coverage metrics, respectively, proving the superiority of PerturbGen. Our code is open-sourced at the anonymous link: https://anonymous.4open.science/r/PerturbGen-2B07.
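The sketch below illustrates population- and member-level perturbation with crossover and mutation over toy instruction sequences, plus a coverage-ranked refill step; the instruction pool, operators, and coverage function are simplified assumptions rather than PerturbGen's implementation.

import random

INSTRUCTIONS = ["add", "sub", "mul", "lw", "sw", "beq", "jal", "csrrw"]  # toy RISC-V-like pool

def crossover(parent_a, parent_b):
    """Single-point crossover between two instruction sequences."""
    point = random.randint(1, min(len(parent_a), len(parent_b)) - 1)
    return parent_a[:point] + parent_b[point:]

def mutate(member, rate=0.1):
    """Replace each instruction with a random one at the given rate."""
    return [random.choice(INSTRUCTIONS) if random.random() < rate else ins for ins in member]

def evolve(population, coverage_fn, keep=4):
    """One perturbation round: rank members by coverage, then refill via crossover + mutation."""
    ranked = sorted(population, key=coverage_fn, reverse=True)[:keep]
    children = [mutate(crossover(random.choice(ranked), random.choice(ranked)))
                for _ in range(len(population) - keep)]
    return ranked + children

pop = [[random.choice(INSTRUCTIONS) for _ in range(16)] for _ in range(10)]
pop = evolve(pop, coverage_fn=lambda m: len(set(m)))  # toy coverage signal: distinct opcodes
print(pop[0])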
|
|
09:00-18:30, Paper Mo-Online.54 | |
SAM-Teacher: SAM Can Be a Good Teacher for Enhancing Medical Image Segmentation |
|
Li, Lan | Chongqing University |
Wu, Xing | Chongqing University |
Fang, Dejian | Fujian Polytechnic Normal University |
He, Zhongshi | Chongqing University |
Keywords: Biometric Systems and Bioinformatics, Image Processing and Pattern Recognition, Transfer Learning
Abstract: Medical image segmentation has improved with advances in model architectures and feature extraction, yet current SOTA models still face challenges in generalization and accuracy. The vision foundation model SAM demonstrates strong segmentation precision and zero-shot generalization on natural images. However, due to domain gaps, SAM, even when fine-tuned or used as a backbone, underperforms compared to medical SOTA models. To address this, we propose a knowledge distillation framework that treats SAM as a teacher to transfer its powerful feature extraction and generalization abilities to a medical segmentation model, thereby improving both performance and domain robustness. Nevertheless, in this setting, existing distillation methods face two key challenges: (1) Existing feature-level knowledge distillation methods either rely on subjective assumptions for feature alignment or use attention-based mechanisms to align teacher-student features, which are computationally expensive. Moreover, since recent SOTA models in medical image segmentation are primarily Transformer-based, research on distilling SAM's feature extraction capabilities into Transformer-based medical models remains scarce. (2) SAM has strong edge perception, but existing distillation methods focus on global alignment, leaving its edge-aware capabilities largely underutilized. To tackle these issues, we propose the SAM-Teacher framework, which consists of an Adaptive Feature Alignment Module (AFAM) and an Edge Perception Distillation Module (EPDM). AFAM leverages gradients from the segmentation loss to adaptively determine the alignment direction between teacher and student features, and incorporates an attention-based distillation loss tailored to Transformer models. EPDM explicitly strengthens the distillation of edge information, addressing the limitation of existing methods that treat edge and non-edge regions equally. Extensive experiments on three medical image datasets of different modalities show that SAM-Teacher significantly improves the segmentation accuracy and generalization of the student model, outperforming existing distillation methods. Ablation studies further validate the effectiveness of each module.
|
|
09:00-18:30, Paper Mo-Online.55 | |
IEnhancer-CADS: A Cross-Modal Attention Framework for Enhancer Identification by Integrating DNABERT and DNA Shape Features |
|
Ren, Liqun | Shenyang Aerospace University |
Zheng, Xuedong | Shenyang Aerospace University |
Keywords: Biometric Systems and Bioinformatics, Artificial Social Intelligence, Deep Learning
Abstract: Enhancers are critical cis-regulatory elements in gene expression regulation, and their accurate identification is essential for studying transcriptional regulation and disease pathogenesis. In bioinformatics, enhancer identification remains an active research area. Despite various methods dedicated to identifying enhancers and their strengths, current approaches still exhibit limitations. Existing methodologies fail to effectively utilize three-dimensional spatial shape, a crucial factor influencing DNA sequence function, resulting in suboptimal prediction accuracy. To address these challenges, we propose a new deep learning framework named iEnhancer-CADS. The framework leverages pretrained DNABERT to achieve the representation of enhancer sequence features. Subsequently, a multi-head cross-attention mechanism is employed to fuse sequence features with shape features, followed by the use of convolutional neural networks to extract higher-order features. This approach not only improves the efficiency of feature extraction but also enhances the model's predictive ability for enhancers and their strengths. Experimental results on benchmark datasets demonstrate superior performance, with our model achieving 80.5% accuracy in the first-layer enhancer identification and 82.25% accuracy in the second-layer strength prediction.
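A compact sketch of fusing sequence embeddings with DNA shape features through multi-head cross-attention, in the spirit described above; the embedding widths, number of heads, and residual/normalization choices are illustrative assumptions, not the paper's exact architecture.

import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    """Fuse sequence embeddings (queries) with DNA shape features (keys/values)
    via multi-head cross-attention; dimensions are illustrative."""
    def __init__(self, seq_dim=768, shape_dim=64, heads=8):
        super().__init__()
        self.shape_proj = nn.Linear(shape_dim, seq_dim)        # bring shape features to the same width
        self.attn = nn.MultiheadAttention(seq_dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(seq_dim)

    def forward(self, seq_feats, shape_feats):
        shape_feats = self.shape_proj(shape_feats)
        fused, _ = self.attn(query=seq_feats, key=shape_feats, value=shape_feats)
        return self.norm(seq_feats + fused)                    # residual connection

# Toy batch: 200-bp sequence token embeddings (e.g., from DNABERT) and 64 shape descriptors per position.
seq = torch.randn(2, 200, 768)
shape = torch.randn(2, 200, 64)
print(CrossAttentionFusion()(seq, shape).shape)  # torch.Size([2, 200, 768])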
|
|
09:00-18:30, Paper Mo-Online.56 | |
Towards More Accurate and Complete Iris Segmentation Using Dual-Branch Structural Network |
|
Huo, Guang | Northeast Electric Power University |
Yu, Xiaolu | Northeast Electric Power University |
Lou, Jianlou | Northeast Electric Power University |
Wang, Jiajun | Northeast Electric Power University |
Keywords: Image Processing and Pattern Recognition, Deep Learning
Abstract: The objective of iris segmentation is to accurately delineate the iris from the human eye image. The segmentation accuracy directly influences the effectiveness of subsequent feature extraction and matching, and hence the performance of the iris recognition system. Existing iris segmentation methods are mostly based on a single-branch CNN-Transformer hybrid architecture, in which it is difficult for a single branch to optimize both local and global features simultaneously, and the balance between iris edge details and global structural relationships is disrupted. To address these problems, this study proposes a novel dual-branch parallel network with a Pyramid Vision Transformer and a Convolutional Neural Network (DBPC-Net). DBPC-Net effectively captures both the local spatial details and global dependencies of the iris, thereby achieving more complete and accurate iris segmentation. Additionally, this paper introduces two new modules, the Local Emphasis Module (LEM) and the Dual-Branch Feature Fusion Module (DFFM). To address the weakness of the Pyramid Vision Transformer (PVT) branch in representing local features, the LEM is designed to prevent global features from overshadowing local feature information. Furthermore, to achieve the fusion of global and local features at the same level, the DFFM is designed to enhance the complementary features between different branches. Extensive experiments on the IITD, CASIA-V4-Interval, and ND-Iris-0405 datasets demonstrate that DBPC-Net improves segmentation accuracy and achieves superior segmentation results.
|
|
09:00-18:30, Paper Mo-Online.57 | |
VulSCS: A Source Code Vulnerability Detection System Using Secondary Code Slicing |
|
Zhong, Yong | Foshan University |
Liu, Bin | Foshan University |
Yang, WenYin | Foshan University |
Ye, JunXian | Foshan University |
Li, JiHui | Foshan University |
Liu, Fen | Foshan University |
Keywords: Application of Artificial Intelligence, Deep Learning, Neural Networks and their Applications
Abstract: In the context of the information age, the frequent occurrence of software vulnerabilities has emerged as a critical issue demanding immediate resolution. Traditional vulnerability detection methods have struggled to keep pace with the escalating security demands, while the advent of deep learning technology has introduced novel solutions to software vulnerability detection. Deep learning not only facilitates the automatic extraction of features, thereby reducing the cost of manual intervention, but also demonstrates remarkable advantages across various domains. In recent years, research on vulnerability detection based on deep learning has achieved notable progress, yet it still faces several limitations. This study focuses on C/C++ program vulnerability detection and proposes an enhanced approach, VulSCS, based on secondary slicing. By performing secondary slicing on source code exhibiting vulnerable behaviors, this method extracts code segments with higher representational value, thereby capturing richer vulnerability-related features. Experimental results indicate that, compared to state-of-the-art vulnerability detection tools, VulSCS improves detection accuracy by 3.2% and enhances detection efficiency by approximately threefold. This research offers new perspectives and methodologies for deep learning-based software vulnerability detection.
|
|
09:00-18:30, Paper Mo-Online.58 | |
Emotional Speech Synthesis Via Diffusion Transformer with Reference Audio Pool* |
|
Shen, Qian | Yunnan University |
Yang, Jian | Yunnan University |
|
|
09:00-18:30, Paper Mo-Online.59 | |
A Dual-Branch Unsupervised Network with Wavelet Transform and Style Consistency for CT to MRI Image Generation |
|
Zhang, Shuyue | University of Shanghai for Science and Technology |
Wang, Chaoli | University of Shanghai for Science and Technology |
Sun, Zhanquan | University of Shanghai for Science and Technology |
Feng, Xiaochen | The First Affiliated Hospital of Naval Medical University |
Zhang, Yaying | Shanghai Fourth People's Hospital Affiliated to Tongji Universit |
Keywords: Image Processing and Pattern Recognition, Deep Learning, Neural Networks and their Applications
Abstract: Medical image generation plays an important role in clinical applications, particularly in modality translation tasks such as CT-to-MRI synthesis. Compared with supervised approaches that rely on paired datasets, unsupervised image generation is more practical because of its lower data requirements. However, existing unsupervised methods often suffer from insufficient preservation of high-frequency details and poor style consistency. Although the wavelet transform offers promising multi-scale representation capability, effectively integrating it into deep generative models remains challenging. To address these issues, we propose WDS-Net, a dual-branch unsupervised network incorporating the wavelet transform. In the content encoder, wavelet decomposition is used to enhance structural modeling and high-frequency detail extraction. A multi-scale feature extraction mechanism is introduced in the style encoder, and layer-wise style injection is adopted during decoding to improve style consistency. Experimental results demonstrate that WDS-Net can generate high-quality MRI images in the unpaired setting.
|
|
09:00-18:30, Paper Mo-Online.60 | |
A Particle Swarm Optimization Algorithm with Similarity-Based Crossover and Elite List for UAV Path Planning |
|
Zhu, Gengshuo | Nanjing University of Aeronautics and Astronautics |
Qian, Hongyan | Nanjing University of Aeronautics and Astronautics |
Keywords: Swarm Intelligence, Metaheuristic Algorithms
Abstract: The application of unmanned aerial vehicles (UAVs) in disaster response missions is rapidly expanding. Reliable path planning is critical for flight safety and mission execution efficiency. To address this issue, this paper proposes an improved particle swarm optimization (PSO) algorithm, termed LCPSO, which incorporates a similarity-based crossover and an elite list strategy. In LCPSO, a local search mechanism is introduced into the personal best update process using the similarity-based crossover strategy, which enhances search capability while maintaining the stability of personal best solutions through a similarity threshold. Furthermore, an elite list strategy is adopted during the global best update, allowing more particles to guide the swarm evolution. This helps the algorithm escape local optima and improves the stability of the solution process. Simulation results demonstrate that LCPSO is capable of generating high-quality flight paths in mountainous terrain and outperforms SDPSO, GPSO, CPSO, and PSO in terms of both optimization effectiveness and solution robustness.
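The sketch below illustrates the two ingredients named in the abstract, a similarity-gated crossover on the personal best and an elite list guiding the velocity update; the waypoint encoding, cost function, thresholds, and coefficients are placeholders, not the paper's formulation.

import numpy as np

rng = np.random.default_rng(0)
DIM, SWARM, ELITE, SIM_THRESH = 12, 20, 3, 0.9   # 12 = flattened 3-D waypoints (illustrative)

def cost(x):                       # placeholder path cost (e.g., length + terrain penalty)
    return np.sum(x ** 2)

def similarity(a, b):              # cosine similarity between two candidate paths
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

pos = rng.uniform(-5, 5, (SWARM, DIM))
vel = np.zeros_like(pos)
pbest = pos.copy()

for _ in range(100):
    # Elite list: the best few personal bests all guide the swarm, not a single gbest.
    elites = pbest[np.argsort([cost(p) for p in pbest])[:ELITE]]
    guide = elites[rng.integers(ELITE, size=SWARM)]
    r1, r2 = rng.random((2, SWARM, DIM))
    vel = 0.7 * vel + 1.5 * r1 * (pbest - pos) + 1.5 * r2 * (guide - pos)
    pos = pos + vel
    for i in range(SWARM):
        # Similarity-gated crossover: only mix a trial into pbest when it stays close enough.
        trial = np.where(rng.random(DIM) < 0.5, pos[i], pbest[i])
        if similarity(trial, pbest[i]) > SIM_THRESH and cost(trial) < cost(pbest[i]):
            pbest[i] = trial
        elif cost(pos[i]) < cost(pbest[i]):
            pbest[i] = pos[i]

print("best path cost:", min(cost(p) for p in pbest))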
|
|
09:00-18:30, Paper Mo-Online.61 | |
Intermittent Discrete Dynamic Event-Triggered Anti-Synchronization Control for Semi-Markovian Delayed MNNs |
|
Liu, Xiaoman | Yunnan University |
Zhang, Haiyang | Yunnan Minzu University |
Xiong, Lianglin | Yunnan Open University |
Zhou, Xiaobing | Yunnan Univerisity |
Keywords: Neural Networks and their Applications, Complex Network
Abstract: This paper investigates the anti-synchronization control problem for a class of Memristor-based Neural Networks with time-delay and semi-Markov jump parameters. Firstly, to further effectively utilize the network resources, a novel Intermittent Discrete Dynamic Event-triggered (IDDET) scheme is introduced, where the dynamical update law of the IDDET scheme is designed to be related to the current sampling state. Secondly, by fully considering the information of jump parameters, time-delay, sampling period, and interaction of the current and past states, a general common Lyapunov functional is constructed. Then, by virtue of an inequality analysis technique and a quadratic polynomial negative-definiteness lemma, a new less conservative criterion guaranteeing anti-synchronization for the underlying master-slave systems is derived in the form of Linear Matrix Inequalities (LMIs). Finally, the validity of our results is illustrated through a numerical example.
|
|
09:00-18:30, Paper Mo-Online.62 | |
SHL-NAS: Neural Architecture Search for Spiking Neural Networks with SNN Hardware Latency Model |
|
Li, Xinyu | National University of Defense Technology |
Chen, Renzhi | Defense Innovation Institute |
Wang, Lei | Defense Innovation Institute |
Keywords: Neural Networks and their Applications, Optimization and Self-Organization Approaches
Abstract: Spiking Neural Networks (SNNs) are increasingly becoming a research hotspot in next-generation intelligent computing architectures due to their potential in energy efficiency, which also imposes higher demands on their structural design. However, manually designing SNN architectures becomes increasingly complex, making it difficult to balance design efficiency and optimal performance. Neural Architecture Search (NAS) provides a new approach for automatically constructing high-performance SNNs. Yet, existing SNN-NAS methods primarily focus on accuracy as the main objective, neglecting the optimization of critical hardware efficiency metrics such as latency, which limits their practical deployment value. This paper proposes a hardware-aware NAS-based method for SNNs, called SHL-NAS, which combines Constrained Bayesian Optimization (CBO) with a deployed SNN Hardware Latency (SHL) model to identify the optimal architecture under user-specified latency constraints. The SHL model can directly evaluate latency based on hardware parameters and architectural parameters, thereby reducing reliance on resource-intensive training and hardware-specific measurements, significantly improving efficiency. Experimental results on image classification datasets such as CIFAR-10 and CIFAR-100 demonstrate that, compared to existing methods, SHL-NAS can effectively search for competitive optimal architectures that meet latency constraints with fewer iterations.
|
|
09:00-18:30, Paper Mo-Online.63 | |
Deep Reinforcement Learning with Positional Embedding for Influence Maximization |
|
Wang, Tao | Qilu University of Technology (Shandong Academy of Sciences) |
Di, Chong | Qilu University of Technology (Shandong Academy of Sciences) |
Huang, Jinchao | China Unicom Research Institute |
Chen, Chao | Shandong Artificial Intelligence Institute, Qilu University of Te |
Xu, Pengyao | Shandong Artificial Intelligence Institute, Qilu University of Te |
Yang, Bin | China Unicom Research Institute |
Keywords: Complex Network, Representation Learning, Neural Networks and their Applications
Abstract: The essence of Influence Maximization (IM) lies in determining seed nodes that maximize influence under specific diffusion models. As social networks become increasingly vast and complex, the rapid and efficient identification of seed nodes for information diffusion has become a research priority. The reward mechanisms in Deep Reinforcement Learning (DRL) naturally align with the methods of selecting seed nodes based on the increment of the influence spread in the IM problem. Consequently, this alignment has attracted significant research attention and inspired various methodological innovations. Despite advancements in optimization techniques, significant challenges remain, particularly in terms of inadequate node feature embedding and value function overestimation. To address these limitations, we propose DKIM, a novel model integrating node embedding with DRL. This model incorporates K-shell positional encoding for effective node representation and employs a Multi-DQN algorithm designed with multi-target networks and an adversarial loss (ALoss) function to enhance training efficiency. These innovations effectively mitigate boundary node effects and Q-value overestimation issues. We conduct comprehensive experiments comparing DKIM with state-of-the-art algorithms across four public datasets: Wiki-vote, LastFM-Asia, P2P-Gnutella08, and Facebook. The results demonstrate that DKIM achieves optimal performance on most IM tasks, providing a viable approach for RL-based models in addressing IM problems.
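A small sketch of deriving a K-shell-based positional feature per node, using NetworkX's core-number routine as the K-shell index and a simple sinusoidal mapping; how DKIM actually encodes and injects this signal may differ.

import networkx as nx
import numpy as np

def kshell_positional_encoding(graph, dim=8):
    """Map each node's k-shell index to a small sinusoidal positional vector."""
    core = nx.core_number(graph)                      # k-shell index per node
    k_max = max(core.values()) or 1
    enc = {}
    for node, k in core.items():
        pos = k / k_max                               # normalize to [0, 1]
        freqs = np.arange(1, dim // 2 + 1)
        enc[node] = np.concatenate([np.sin(np.pi * pos * freqs),
                                    np.cos(np.pi * pos * freqs)])
    return enc

G = nx.karate_club_graph()
pe = kshell_positional_encoding(G)
print(pe[0].shape, pe[0][:4])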
|
|
09:00-18:30, Paper Mo-Online.64 | |
Hierarchical Conditional Guidance Diffusion Model for Perceptual Image Compression |
|
Ji, Zekai | North University of China |
Qin, Jia | North University of China |
Qin, Pinle | North University of China |
Chai, Rui | North University of China |
Zeng, Jianchao | North University of China |
Lin, Zhipeng | North University of China |
Keywords: Image Processing and Pattern Recognition, Deep Learning, Machine Vision
Abstract: Recently, diffusion-based image compression has achieved significant progress in terms of the rate-distortion-perception trade-off; these approaches replace decoders with conditional diffusion models to enhance the visual quality of reconstructed images. However, diffusion models introduce noise into the input image during the initial stages of the diffusion process, which may degrade crucial image information. To address these limitations, we propose a Hierarchical Conditional guidance Diffusion model for perceptual Image Compression (HCD-IC) to ensure the fidelity of reconstruction, in which hierarchical features with selected typical context provide informative guidance during the denoising process of the diffusion model to preserve both structural integrity and fine details. Specifically, we design a Gated Scale-Cross module (GSC) to integrate and select representative semantics and details, which leverages a hierarchical feature interaction architecture and a dynamic gating strategy to ensure more robust and expressive representations. Furthermore, we present a Conditional Control Diffusion decode module (CCD) to integrate time-step information and latent features augmented by GSC into the diffusion model, which can dynamically acquire the required time-aware conditional features during different denoising stages. Extensive experiments conducted on multiple public datasets demonstrate that our method outperforms state-of-the-art approaches in various quantitative realism metrics.
|
|
09:00-18:30, Paper Mo-Online.65 | |
C2BA: Cross-Domain Consistency and Bidirectional Alignment for Cross-Modal Domain-Incremental Learning |
|
Huang, Weiyi | East China Normal University |
Xi, Xidong | East China Normal University |
Wang, Hailing | East China Normal University |
Cao, Guitao | East China Normal University |
Keywords: Machine Learning, Machine Vision, Representation Learning
Abstract: In Cross-Modal Domain-Incremental learning, the primary challenge lies in learning from varying data distributions and maintaining its performance on prior domains. However, existing methods often overlook the importance of shared knowledge across domains and the interaction between modalities is still insufficient. To address these issues, we propose Cross-Domain Consistency and Bidirectional Alignment (C2BA), a novel framework that enhances the model's generalization ability and improves the cross-modal integration in VLMs through two key components. We design a Cross-domain Global Consistency Constraint (CGCC) to stabilize domain-invariant representations during incremental training, preventing excessive shifts of shared distributions toward new domains. In addition, we design a Bidirectional Cross-Modal Attention (BCMA) module, which enables effective interaction between visual and textual features through a bidirectional attention mechanism, thereby reducing cross-modal discrepancies. Experiments on three benchmark datasets demonstrate that our method outperforms state-of-the-art exemplar-free and even exemplar-based approaches, achieving superior generalization and cross-modal interaction.
|
|
09:00-18:30, Paper Mo-Online.66 | |
Object-Centric Transformer Framework for Fine-Grained Image-Text Retrieval with Global Consistency |
|
Shen, Sitong | Xinjiang University |
Ibrayim, Mayire | Xinjiang University |
Jiang, Peichao | Xinjiang University |
Keywords: Multimedia Computation, Image Processing and Pattern Recognition, Machine Vision
Abstract: Cross-modal image-text retrieval enables efficient interaction between heterogeneous visual and language modalities through semantic alignment, advancing multimodal intelligent applications. However, traditional cross-modal retrieval methods usually rely on pre-trained feature extractors, whose performance limitations hinder image-text retrieval. In this paper, we propose OCGC, an end-to-end Transformer-based cross-modal image-text retrieval framework that extracts both visual and textual features. The framework employs a Global-Attentive Slot Fusion module to aggregate object-centric visual features and address feature redundancy, combined with a Global Semantic Consistency alignment module to enhance cross-modal feature alignment between the aggregated image features and text embeddings. Experimental results show that the proposed model performs strongly on two benchmark datasets, outperforming state-of-the-art methods by more than 6% in Rsum on the Flickr30K dataset.
|
|
09:00-18:30, Paper Mo-Online.67 | |
ExplainDrive: A Multimodal Chain-Of-Thought Reasoning Approach for Explainable Automated Driving Systems |
|
Yu, Xing | East China Normal University |
Peng, Jinghan | East China Normal University |
Li, Hang | East China Normal University |
Ermuyun, Li | East China Normal University |
Du, Dehui | East China Normal University |
Keywords: Application of Artificial Intelligence, Deep Learning, Machine Vision
Abstract: End-to-end decision models based on deep learning have become increasingly prominent in automated driving systems. However, their black-box nature poses significant challenges to interpreting decision processes, especially in dynamic and complex scenarios. Existing approaches largely focus on post-hoc analyses or isolated single-step explanations, lacking comprehensive explanations from scenario understanding to decision-making and failing to address the complexity of real-world scenarios with coherent reasoning. To address these limitations, we propose ExplainDrive, a multimodal Chain-of-Thought reasoning framework that integrates causally optimized temporal representations with explainable decision-making. ExplainDrive follows a three-stage pipeline: (i) extracting spatio-temporal features via a Causal Temporal Former, (ii) constructing hierarchical scenario understanding, and (iii) progressively deriving driving decisions with interpretable rationales. This design enhances transparency at each intermediate step and mitigates spurious correlations through causal feature selection. Extensive experiments on the BDD-X and nuScenes datasets demonstrate that ExplainDrive consistently improves the quality of decision explanations and outperforms compared models across multiple key evaluation metrics.
|
|
09:00-18:30, Paper Mo-Online.68 | |
WTFN: Wavelet Convolution and Transformer Fusion Network with Spatial-Spectral Features for Hyperspectral Image Classification |
|
Lu, Hongmeng | Xinjiang University |
Jiang, Shaochen | Xinjiang University |
Wang, Liejun | Xinjiang University |
Sun, Mengyuan | XinJiang University |
Li, Kai | XinJiang University |
Keywords: Image Processing and Pattern Recognition, Deep Learning, Machine Learning
Abstract: In hyperspectral image (HSI) classification tasks, effectively extracting the spatial-spectral features of the image is crucial. However, existing convolutional neural network (CNN)-based methods are limited by the fixed convolution kernel size, which means CNNs can only extract local spatial features, neglecting the global features of the HSI. On the other hand, Transformer-based methods perform excellently in extracting global spectral features but are weaker at capturing texture and edge features. Therefore, to fully exploit the complementary advantages of CNN and Transformer methods, this paper proposes a dual-branch wavelet convolution and Transformer fusion network (WTFN). The two branches of WTFN are designed to capture local texture features and global contextual relationships, respectively. The attention cross fusion (ACF) module deeply enhances the information interaction of the spatial-spectral features of HSI. Extensive experimental results on three public datasets show that the WTFN method outperforms several state-of-the-art methods in classification performance.
|
|
09:00-18:30, Paper Mo-Online.69 | |
Relation-Aware Retrieval Augmented Generation for Relation Extraction |
|
Tian, Lei | Hefei University of Technology |
Bu, Chenyang | Hefei University of Technology |
Huang, Manzong | Hefei University of Technology |
Wu, Xindong | Hefei University of Technology |
Keywords: AI and Applications, Deep Learning, Application of Artificial Intelligence
Abstract: Relation Extraction (RE) is a core task in Information Extraction that identifies relationships between entities in the text. Existing prompt construction methods using Large Language Models (LLMs) achieve success by providing demonstrations to enhance relation extraction abilities. However, these methods provide retrieved demonstrations that have weak relational correlations with the test input and overlook the importance of associated entities and label semantics in RE tasks. This results in incomplete prompt construction, leading to suboptimal alignment between LLMs and RE tasks. To address these issues, we propose a Relation-Aware Retrieval Augmented Generation model (RelAwareRAG) for relation extraction. Firstly, we design a relation-aware retrieval that can identify demonstrations more effectively aligned with the target entity relation by fine-tuning a pre-trained language model for the target RE task and generating task-specific representations. Secondly, we refine the candidate relation space through entity constraint pruning and perform label semantic projection for LLMs to optimize the prompt construction, thereby improving the adaptability of LLMs in relation extraction tasks. Extensive experiments on TACRED and TACREV show that RelAwareRAG achieves superior performance, with a 3.65% increase in F1 score on TACRED and a 3.76% improvement on TACREV.
|
|
09:00-18:30, Paper Mo-Online.70 | |
MGDNet: Lightweight Human Pose Estimation Based on Multi-Dimensional Adaptive Frequency-Aware Attention |
|
Song, Yu | Qilu University of Technology(Shandong Academy of Sciences) |
Dong, Yunfeng | Qilu University of Technology (Shandong Academy of Sciences), Ji |
Wu, Xiaoming | Qilu University of Technology, Shandong Computer Science Center |
Man, Jiazheng | Shandong Shanke Intelligent Technolo |
Xu, Zan | Shandong Shanke Intelligent Technology Co., Ltd |
Qiao, Youwei | Shandong Shanke Intelligent Technology Co., Ltd |
Qi, Bei | Qilu University of Technology (Shandong Academy of Sciences) |
Liu, Xiangzhi | Shandong Computer Science Center (National Supercomputer Center |
Keywords: Machine Vision, Image Processing and Pattern Recognition
Abstract: Lightweight human pose estimation (HPE) has garnered significant attention due to its widespread applications in edge and mobile devices. However, lightweight human pose estimation methods demonstrate limited accuracy when detecting complex movements, thus restricting their practical utility. To address this issue, we propose MGDNet, a novel lightweight network featuring three innovative modules: the GA-Bottleneck module, the MAC-Block module, and the Dual-View Feature Enhancement Module (DV-FEM). The GA-Bottleneck module integrates ghost convolution with multi-dimensional adaptive frequency-aware attention to capture multi-scale frequency characteristics, enhancing robustness for complex actions. The MAC-Block combines multi-receptive field depth convolution with frequency domain analysis to achieve precise joint localization. The DV-FEM leverages complementary local-global information to enhance feature representation for complex actions. Extensive experiments conducted on the COCO and MPII datasets demonstrate that MGDNet achieves state-of-the-art performance among lightweight models, attaining an average precision (AP) of 73.0% on COCO val2017, which represents a 1.6% improvement over Greit-HRNet. On the MPII dataset, MGDNet achieves the highest PCKh score of 87.6%. The proposed MGDNet outperforms existing lightweight approaches by addressing the challenge of low detection accuracy in complex motion scenarios while maintaining comparable computational efficiency.
|
|
09:00-18:30, Paper Mo-Online.71 | |
AIGL: Adaptive Imbalance-To-Generalization Learning for Robust Multi-Label Retinal Disease Classification |
|
Yang, Jiacheng | Fudan University |
Gu, Yuanjie | Fudan University |
Zekuan, Yu | Fudan University |
Keywords: Transfer Learning, Image Processing and Pattern Recognition, Biometric Systems and Bioinformatics
Abstract: Retinal image processing is critical for the early diagnosis and management of ocular diseases. While foundation models pretrained on large-scale datasets show strong generalization, they struggle with rare diseases absent during pre-training. Fine-tuning on datasets containing these rare diseases introduces challenges like class imbalance and domain shift. To address these, we propose adaptive imbalance-to-generalization learning (AIGL), a novel pipeline that integrates imbalance-aware training and test-time feature reconstruction to enhance model performance. During training, we introduce the confidence-weighted adaptive loss, which dynamically adjusts the contribution of each sample based on the model’s confidence, addressing severe class imbalance and improving rare disease recognition. At test time, we design context-aware feature masked autoencoders, which operate on high-level features since structures like the macula remain consistent across raw retinal images. This module improves generalization by leveraging dynamic token filtering and consistency constraints to adaptively mask and reconstruct features through self-supervised learning. Our AIGL outperforms state-of-the-art methods across multiple metrics on the custom RetinaX dataset, demonstrating superior classification ability.
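A minimal sketch of a confidence-weighted multi-label loss in the spirit of the abstract, where each term is down-weighted by the model's confidence on the true label (a focal-style weighting); the exact weighting formula and exponent are assumptions, not the paper's definition.

import torch
import torch.nn.functional as F

def confidence_weighted_bce(logits, targets, gamma=2.0):
    """Multi-label BCE where each term is scaled by (1 - confidence)^gamma,
    so confidently-handled labels contribute less and rare, hard labels more."""
    probs = torch.sigmoid(logits)
    confidence = torch.where(targets > 0.5, probs, 1.0 - probs)  # probability assigned to the true label
    weights = (1.0 - confidence).pow(gamma)
    bce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    return (weights * bce).mean()

logits = torch.randn(4, 10)                 # 4 fundus images, 10 disease labels
targets = torch.randint(0, 2, (4, 10)).float()
print(confidence_weighted_bce(logits, targets))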
|
|
09:00-18:30, Paper Mo-Online.72 | |
G2Co: Gaze-Guided Semantic Contrastive Learning for Self-Supervised Medical Image Segmentation |
|
Zhang, Jianshan | North University of China |
Wang, Qi | North University of China |
Qin, Pinle | North University of China |
Zeng, Jianchao | North University of China |
Keywords: Artificial Social Intelligence, Deep Learning, Representation Learning
Abstract: Conventional self-supervised learning (SSL) exhibits notable limitations in fine-grained feature modeling: problems prevalent in medical imaging, such as blurred organ boundaries, complex anatomical structures, and feature ambiguity caused by similar pathological patches, often lead to false-positive sample interference. To address these challenges, this paper proposes Gaze-Guided Semantic Contrastive Learning (G2Co), an innovative SSL algorithm inspired by the visual diagnostic patterns of radiologists. At the semantic augmentation level, G2Co leverages a key-information guidance mechanism to distinguish anatomical structures from background noise, thereby enabling fine-grained feature extraction. At the feature interaction level, G2Co introduces a cross-sample feature fusion strategy to extract discriminative features from potential positive samples, addressing the feature confusion caused by visually similar patches. In addition, G2Co achieves fine-grained modeling of tissue morphology and establishes inter-region boundary features through a mutual-information maximization constraint.
|
|
09:00-18:30, Paper Mo-Online.73 | |
A Dynamic Feature-Aware Method for Obstacle Avoidance under DUOE Via Deep Reinforcement Learning |
|
Huang, Kaichen | South China University of Technology |
Hong, Shuo | South China University of Technology |
Bi, Sheng | South China University of Technology |
Qiu, Junbin | South China University of Technology |
Long, Tao | South China University of Technology |
Keywords: Agent-Based Modeling, Application of Artificial Intelligence, Neural Networks and their Applications
Abstract: To achieve obstacle avoidance for mobile robots in environments with unknown static and dynamic obstacles, and considering the experience of pedestrians, the robot needs to acquire the features of both static and dynamic obstacles. Current advanced methods focus on obstacle avoidance effects but lack spatio-temporal modeling of dynamic information, thus failing to perform efficient obstacle avoidance that distinguishes between static and dynamic obstacles. This study proposes a method within the framework of deep reinforcement learning, where a dynamic feature-aware module is combined with temporal reasoning to effectively extract the spatio-temporal dynamic information around the robot, thereby achieving precise obstacle avoidance for both static and dynamic obstacles. Additionally, the robot can determine the non-intrusive areas for pedestrians based on spatio-temporal information, realizing safe and socially acceptable robot behavior. We designed a large number of simulation experiments to compare our method with currently advanced methods, fully verifying its obstacle avoidance performance and its improvement of pedestrian comfort. Moreover, through robustness experiments and qualitative analysis of trajectories, the stability and dynamic information discrimination ability of the proposed method were further verified.
|
|
09:00-18:30, Paper Mo-Online.74 | |
Enhancing Multi-Hop Reading Comprehension through Multi-Hop Attention Diffusion with Hierarchical Graph Extensions |
|
Zhang, Yingying | Beijing University of Posts and Telecommunications |
Cheng, Bo | Beijing University of Posts and Telecommunications |
Keywords: Neural Networks and their Applications, Application of Artificial Intelligence
Abstract: Multi-hop Reading Comprehension tasks require models to aggregate scattered information across multiple paragraphs and perform multi-step reasoning to answer complex questions. Although Graph Neural Network (GNN)-based approaches have demonstrated strong performance in this domain, they still face challenges in fully leveraging higher-order multi-hop information and capturing relationships between indirectly connected nodes. To tackle these challenges, we propose an innovative model, the Multi-hop Attention Diffusion graph network with Hierarchical Graph Extensions (MADHGE). This model introduces an effective multi-hop attention diffusion mechanism that aggregates information from higher-order neighboring nodes, generating richer and more semantically dense node representations, thereby enhancing the model's reasoning capabilities. Additionally, we extend the traditional hierarchical graph structure by incorporating three new edge types to strengthen connections between nodes of different granularities, enabling a more comprehensive capture of the interrelationships among information. We conducted extensive evaluations of our proposed method on the HotpotQA benchmark dataset, and experimental results demonstrate that the MADHGE model achieves state-of-the-art performance, outperforming all existing GNN-based graph models, thereby validating the effectiveness of the proposed extensions.
|
|
09:00-18:30, Paper Mo-Online.75 | |
Implicit Feature Fusion Function in Scene Text Detection |
|
Zhang, Ruizhe | Inner Mongolia Normal University,College of Computer Scie |
Yin, Yanjun | Inner Mongolia Normal University,College of Computer Scie |
Lian, Zhe | Inner Mongolia Normal University,College of Computer Scie |
Keywords: Image Processing and Pattern Recognition, Deep Learning
Abstract: In scene text detection, effectively fusing detailed information and semantic information is crucial for accurate results. However, in existing methods the fusion process blurs the precise information carried by feature maps at each scale, and the inference speed of the convolutions involved is not optimal. To address this problem, this paper proposes the Refinement Correction Implicit Fusion Network (RCIFNet). In the feature fusion process, a purpose-designed implicit feature fusion function gathers the neighboring feature vectors and their coordinates around a given query coordinate in the multi-scale features and feeds them into a lightweight MLP, achieving implicit fusion of multi-scale features. Experiments were conducted on four benchmark datasets, and the results validate the competitiveness of the proposed model against current state-of-the-art techniques.
|
|
09:00-18:30, Paper Mo-Online.76 | |
Cross-Mask Consistency Masked Autoencoder for Point Cloud Self-Supervised Learning |
|
He, Yuan | Institute of Automation, Chinese Academy of Sciences |
Yu, Shan | Institute of Automation, Chinese Academy of Sciences |
Keywords: Multimedia Computation, Representation Learning, Machine Vision
Abstract: Masked Autoencoder has emerged as a powerful framework for self-supervised representation learning in 3D point clouds, leveraging masked reconstruction to learn meaningful features. However, existing methods face two key limitations due to their reliance on a high masking ratio: 1) Reconstruction inconsistency, where the same patch yields different reconstructions under varying masking schemes, degrading representation robustness; 2) Decoder over-reliance, where excessive contextual cues from masked regions enable the decoder to exploit positional and structural hints through self-attention, thereby reducing the reliance on the encoder and weakening its feature learning capacity. To address these issues, we propose a novel self-supervised learning approach, Cross-Mask Consistency Masked Autoencoder (C2MAE). Specifically, we propose a Cross-Mask Consistency Learning (CMCL) strategy to enforce consistent predictions of overlapping masked patches across different masking patterns. By aligning feature representations under diverse masked views, CMCL strengthens the network's ability to capture point cloud structures, leading to more robust and generalizable representations. Additionally, we introduce a Cross-attention Decoder (CD), which restricts reconstruction to the attention between visible and masked patches to enhance the encoder’s feature extraction capability. Our approach achieves outstanding performance on various downstream tasks, demonstrating that C2MAE effectively mitigates these issues and enhances the model's robustness and generalization capability.
|
|
09:00-18:30, Paper Mo-Online.77 | |
Differentiated Adversarial Training Method Based on Samples |
|
Lu, Zhang | Shenyang Aerospace University |
Wang, Wenhao | Shenyang Aerospace University |
Fan, Chunlong | Shenyang Aerospace University |
|
|
09:00-18:30, Paper Mo-Online.78 | |
Adaptive Balancing Unimodal and Cross-Modal Relationships for Rumor Detection |
|
Wan, Rong | Xinjiang University |
Li, Boyuan | Xinjiang University |
Li, Xiuhong | Xinjiang University |
Keywords: Multimedia Computation, Deep Learning
Abstract: The rapid spread of rumors on social media platforms poses a significant threat to society, making effective rumor detection a pressing challenge. Existing methods often exhibit an over-reliance on cross-modal information when capturing complex interactions between different modalities. However, when semantic information remains consistent across modalities, cross-modal information can become redundant or even detrimental. The research shows that when people read inconsistent multimodal information, their attention tends to focus on the cross-modal similarity characteristics. However, when dealing with similar multimodal information, attention is more focused on the details of unimodal information. To address this issue, we propose the Adaptive Balancing Unimodal and Cross-Modal Relationships (ABUCR) rumor detection framework that takes full advantage of unimodal information to enhance the accuracy and robustness of multimodal rumor detection. Firstly, we propose a distance metric learning module that dynamically harmonizes unimodal and cross-modal contributions, ensuring a fine-grained balance between the distinct yet complementary strengths of different modalities. We further propose a feature-enhanced fusion module to ensure that both fine-grained details of unimodal data and the global dependencies across modalities are preserved. Extensive experiments on publicly available datasets demonstrate that our method outperforms other baselines.
|
|
09:00-18:30, Paper Mo-Online.79 | |
Adaptive Dynamic Capacity Allocation for Reversible Data Hiding Based on Interpolation and Integer Wavelet Transform |
|
Jin, Youpeng | Southwest University |
Zhang, Jiayu | Southwest University |
Xiaobo, Hu | College of Computer and Information Science |
Li, Jiashuo | Southwest University |
Zhang, Tao | Southwest University |
Bohan, Kong | College of Computer and Information Science, Southwest Universit |
Zhang, Yu | Southwest University |
Keywords: Image Processing and Pattern Recognition, Information Assurance and Intelligence
Abstract: To address the limitations of existing reversible data hiding (RDH) methods, including low redundancy space utilization, static embedding strategies, and insufficient image adaptability, this paper presents an adaptive dynamic capacity allocation reversible data hiding scheme based on interpolation and integer wavelet transform (ADWC-RDH). By implementing a spatial-frequency domain collaborative optimization mechanism, the proposed method generates multi-dimensional redundancy spaces through combining nearest neighbor interpolation with integer wavelet transform. We design dynamic capacity allocation and low-bit priority embedding strategies to achieve adaptive adjustment of embedded bitrates. Experimental results demonstrate that the proposed method achieves significant improvements in average embedding rate across test images, along with excellent Peak Signal-to-Noise Ratio (PSNR) and Structural Similarity Index Measure (SSIM) performance. ADWC-RDH achieves substantial enhancements in embedding capacity and visual quality for complex texture images through dual-domain collaborative optimization and dynamic capacity allocation, thereby demonstrating superior universality and robustness.
|
|
09:00-18:30, Paper Mo-Online.80 | |
Social Influence and Preference-Guided Denoising for Social Recommendation |
|
Li, Shishen | Xinjiang University |
Qin, Jiwei | Xinjiang Uinversity |
Ma, Jie | Xinjiang University |
Zheng, Jiong | Xinjiang University |
Chen, Yanping | Guizhou University |
Feng, Qiangsheng | Xinjiang University |
Keywords: Artificial Social Intelligence, Deep Learning, Expert and Knowledge-Based Systems
Abstract: Social recommendation typically enhances user preference representation by integrating social connections among users. However, users' intricate social behaviors may introduce noisy social connections for user preference representation modeling. Due to the absence of ground truth labels, existing social denoising models typically rely simply on user similarity as the denoising criterion. These models fail to accurately identify noisy social connections with low preference similarity and neglect the crucial role of opinion leaders in the propagation of social influence. In this paper, we propose a novel Social Influence and Preference-guided denoising enhancement framework (SIP) for social recommendation. The framework includes the Preference-guided Social Relationship Augmentation (PSRA) module and the Social Influence Augmentation (SIA) module. In the PSRA module, we reduce the impact of social connections with low preference similarity by strengthening the associations between users and their close friends. At the same time, considering the importance of opinion leaders in user preference modeling, we use the SIA module to narrow the gap in preference representation between users and opinion leaders. Extensive experiments on three real-world datasets demonstrate that our proposed framework effectively removes noise from social data and significantly enhances the performance of existing state-of-the-art social recommendation models.
|
|
09:00-18:30, Paper Mo-Online.81 | |
Improved High-Fidelity Reversible Data Hiding Uses Pixel Replacement |
|
Xiaobo, Hu | College of Computer and Information Science |
Zhang, Jiayu | Southwest University |
Jin, Youpeng | Southwest University |
Bohan, Kong | College of Computer and Information Science, Southwest Universit |
Zhang, Yu | Southwest University |
Keywords: Image Processing and Pattern Recognition, Information Assurance and Intelligence
Abstract: Image steganography is a technique to hide secret information in cover images. Due to the low embedding efficiency of reversible data hiding (RDH) for large payloads, this paper proposes an improved steganography method based on RDH. The n-leftmost bit replacement (n-LBR) and n-rightmost bit replacement (n-RBR) techniques are employed, where n-LBR replaces the leftmost n bits of a pixel, and n-RBR replaces the rightmost n bits. In the n-LBR and n-RBR stages, 2n bits of secret data are evenly embedded into adjacent pixel pairs of the first two identical images, where 1 ≤ n ≤ 2. The experimental results evaluate the peak signal-to-noise ratio (PSNR) and structural similarity index (SSIM), demonstrating the superiority of the proposed method over state-of-the-art approaches.
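A toy illustration of the n-rightmost-bit replacement (n-RBR) step on a single pixel pair with n = 2; the image pairing scheme, capacity bookkeeping, and the n-LBR counterpart are omitted.

import numpy as np

def embed_n_rbr(pixel_pair, bits, n=2):
    """Replace the n rightmost bits of each pixel in the pair with secret bits (2n bits total)."""
    assert len(bits) == 2 * n
    out = []
    for px, chunk in zip(pixel_pair, (bits[:n], bits[n:])):
        value = int("".join(map(str, chunk)), 2)
        out.append((int(px) & ~((1 << n) - 1)) | value)   # clear the low n bits, then insert the secret bits
    return np.array(out, dtype=np.uint8)

def extract_n_rbr(stego_pair, n=2):
    """Read back the 2n embedded bits from the pixel pair."""
    bits = []
    for px in stego_pair:
        bits.extend(int(b) for b in format(int(px) & ((1 << n) - 1), f"0{n}b"))
    return bits

pair = np.array([154, 201], dtype=np.uint8)
secret = [1, 0, 1, 1]
stego = embed_n_rbr(pair, secret)
print(stego, extract_n_rbr(stego) == secret)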
|
|
09:00-18:30, Paper Mo-Online.82 | |
A Multi-Agent Fuzzing Framework for Deep Learning Library |
|
Liao, RongTao | National University of Defense Technology |
Ou, Shiwen | National University of Defense Technology |
Yan, XueHu | National University of Defense Technology |
Zhu, KaiLong | National University of Defense Technology |
Keywords: Agent-Based Modeling, AI and Applications, Application of Artificial Intelligence
Abstract: The security of deep learning (DL) libraries is crucial due to their central role in developing AI applications. Fuzzing has become a key technique for discovering bugs in DL libraries, but generating high-quality seeds still poses a significant challenge. Although large language models (LLMs) offer promising opportunities, current methods that utilize single-agent frameworks for seed generation often produce invalid seeds. This issue arises from the complexities of DL API structures and the inherent randomness of LLMs, which ultimately reduce fuzzing efficiency. To overcome these challenges, we propose a collaborative multi-agent framework that utilizes specialized LLM-driven agents for iterative seed refinement. This framework consists of three components: (1) a coding agent that incorporates historical bug knowledge to produce the initial bug-prone seeds, (2) a repair agent that performs static analysis on the initial seeds and repairs the invalid ones, and (3) a mutation agent that leverages four mutation operators to explore the test space thoroughly. Our framework improves seed validity and functional diversity through ongoing collaboration among agents and established feedback loops. Experimental evaluations show that our approach achieves a 20% increase in seed validity compared to state-of-the-art methods and uncovers 12 previously unknown bugs in popular DL libraries.
|
|
09:00-18:30, Paper Mo-Online.83 | |
TD4ITG: A Test Data Generation Method for Issue Title Generation Models |
|
Chen, Jingjing | Beijing Information Science and Technology University |
Yang, Jun | Beijing Information Science and Technology University |
He, Qifan | Beijing Information Science and Technology University |
Cui, Zhanqi | Beijing Information Science and Technology University |
Zeng, Zheng | Beijing Information Science and Technology University |
Keywords: Application of Artificial Intelligence, AI and Applications, Deep Learning
Abstract: In open-source software platforms, users utilize issues to report software bugs or request new features. To improve the quality of issues, researchers have proposed several methods for issue title generation. It is widely recognized that deep learning models often suffer from robustness limitations, as minor input perturbations can lead to incorrect or significantly altered outputs. In this paper, we investigate the robustness of issue title generation models and propose a corresponding test data generation method, TD4ITG. This method leverages large language models in combination with chain-of-thought prompting to automatically generate test data to evaluate robustness. Experimental results demonstrate that both iTAPE and iTiger, two issue title generation models, exhibit robustness problems. Specifically, the test data generated by TD4ITG leads to a performance degradation of 21.73% for iTAPE, reducing its score to 75.00%, and a degradation of 17.34% for iTiger, reducing its score to 45.98%. Compared to MATS, a recently proposed testing method for text summarization models, TD4ITG is more effective in revealing the robustness limitations of the models.
|
|
09:00-18:30, Paper Mo-Online.84 | |
KEREM: Enhancing Reliability and Transparency in Medical QA through LLM and Knowledge Graph Fusion |
|
Dong, Shaojie | Qilu University of Technology (Shandong Academy of Sciences) |
Zhu, Zhe | Shandong Artificial Intelligence Institute, Qilu University of T |
Xu, Pengyao | Shandong Artificial Intelligence Institute, Qilu University of Te |
Shan, Ke | Shandong Artificial Intelligence Institute, Qilu University of T |
Zhou, Shuwang | Shandong Artificial Intelligence Institute, Qilu University of T |
Keywords: Application of Artificial Intelligence, AI and Applications, Deep Learning
Abstract: Open-domain medical question answering (QA) systems face significant challenges in achieving accurate reasoning and transparent explanations. In this study, we propose KEREM (Knowledge graph Enhanced Reasoning with Explainable Modeling), a novel framework that integrates large language models (LLMs) with structured medical knowledge graphs (KGs) to enable deep multimodal joint reasoning. KEREM supports multi-hop inference and causal chain explanations through four key modules: input processing and entity alignment, knowledge subgraph construction, joint reasoning and path control, and answer generation with natural language explanations. We evaluate KEREM on two real-world medical QA benchmarks—CMCQA and ChatDoctor-5k—where it consistently outperforms existing baselines in terms of answer accuracy, reasoning transparency, and structural consistency. Furthermore, lightweight supervised fine-tuning demonstrates KEREM’s strong contextual transferability and significantly improves its generation quality in specialized clinical settings. These findings highlight KEREM's effectiveness in generating accurate answers and causal explanations, establishing a solid foundation for trustworthy medical QA in high-stakes clinical domains.
|
|
09:00-18:30, Paper Mo-Online.85 | |
RandLA-Net++: A Large-Scale Point Cloud Segmentation Method for Autonomous Driving |
|
Zhang, Jun | Wuhan Institute of Technology |
Yan, Jie | Wuhan Institute of Technology |
Keywords: AI and Applications, Deep Learning, Neural Networks and their Applications
Abstract: The rapid development of autonomous driving technology demands higher efficiency and real-time performance in point cloud semantic segmentation. However, existing methods face limitations in handling complex geometric structures and extracting local features, while also exhibiting high network complexity. To address these issues, this paper proposes a network architecture specifically designed for large-scale point cloud semantic segmentation in autonomous driving scenarios. The proposed method introduces the Local Feature Aggregation (LFA) module and Global Feature Aggregation (GFA) module. The LFA dynamically adjusts neighborhood point sets using an adaptive K-nearest neighbors approach and constructs local surfaces to extract fine-grained features. The GFA combines channel attention and global self-attention mechanisms to enhance feature fusion across different regions. Experimental results demonstrate significant improvements in semantic segmentation performance at low network complexity: the method achieves a mean Intersection-over-Union (mIoU) of 59.3% on the SemanticKITTI dataset with only 1.8M parameters and 10.6G FLOPs, and shows outstanding performance on the SemanticPOSS dataset, particularly in segmenting small objects and complex scenes.
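For readers unfamiliar with the neighborhood grouping that LFA-style modules build on, the following NumPy sketch gathers k-nearest-neighbour indices for every point in a toy cloud. The adaptive selection of K and the local surface construction described in the abstract are not reproduced; the point count and K value are arbitrary.

```python
# Sketch of k-nearest-neighbour grouping for local feature aggregation
# on a point cloud, using plain NumPy on a toy input.
import numpy as np

def knn_group(points, k=8):
    # Pairwise squared distances between all points (N x N).
    d2 = ((points[:, None, :] - points[None, :, :]) ** 2).sum(-1)
    # For each point, indices of its k nearest neighbours (excluding itself).
    return np.argsort(d2, axis=1)[:, 1:k + 1]

pts = np.random.default_rng(0).random((128, 3))   # toy point cloud
neighbours = knn_group(pts, k=8)
print(neighbours.shape)   # (128, 8)
```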
|
|
09:00-18:30, Paper Mo-Online.86 | |
MR-OT: A Metamorphic Testing Method for Object Tracking Models |
|
Cui, Zhanqi | Beijing Information Science and Technology University |
He, Qifan | Beijing Information Science and Technology University |
Yang, Jun | Beijing Information Science and Technology University |
Wang, Zhiwei | Data and Technology Support Center of the Cyberspace Administrat |
Keywords: Assurance, Neural Networks and their Applications, Information Assurance and Intelligence
Abstract: A variety of testing methods have been proposed for deep learning models. However, most existing methods focus on static tasks such as image recognition, overlooking the challenges posed by temporally continuous input data and environmental changes in dynamic scenarios such as object tracking and behavior detection. In this paper, we propose MR-OT, a metamorphic testing method for object tracking models in dynamic scenarios. We design five metamorphic relations to evaluate the models, covering natural weather changes, speed variations, and environmental interference. Furthermore, leveraging generative adversarial networks (GANs) and other techniques, we generate large-scale, realistic, and temporally consistent test data based on the original test data. Experimental results demonstrate that MR-OT achieves higher metamorphic relation violation rates, exceeding DeepTest by 0.06 to 0.49 and MT4MOT by 0.04 to 0.57, validating its effectiveness in revealing robustness problems in dynamic tracking tasks.
|
|
09:00-18:30, Paper Mo-Online.87 | |
MFDS-GFNet: A Multi-Feature Dual-Stream Gate Fusion Network for Distributed Optical Fiber Temperature Event Recognition |
|
Jiang, Wanchang | Northeast Electric Power University |
Tang, Rihao | Northeast Electric Power University |
Keywords: Deep Learning, Image Processing and Pattern Recognition, Machine Learning
Abstract: Distributed temperature data collected by BOTDR systems exhibit significant spatiotemporal patterns, where temperature events often share similar magnitudes but differ in structural characteristics, making accurate classification challenging. To address this, we propose MFDS-GFNet, a Multi-Feature Dual-Stream Gate Fusion Network for multi-class temperature event recognition. The method begins with a multi-dimensional feature extraction framework that derives temporal gradients, spatial gradients, and local statistical features from raw data to capture event-specific patterns. A dual-stream architecture is then employed, where raw data and physical features are processed separately using residual networks to enhance deep feature learning and training stability. Finally, a gate fusion module adaptively integrates the two feature streams, automatically weighting their contributions for optimal classification. Experimental results show that MFDS-GFNet achieves 99.74% accuracy across five representative temperature events. Compared to five baseline models, it outperforms the Transformer by 1.07% and GoogLeNet by 1.53%. The model also supports real-time deployment with an inference latency of 0.40 ms per sample.
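A minimal sketch of the gated fusion idea is given below, assuming both streams are already encoded to the same dimension. The layer sizes are hypothetical and the residual backbones and feature extraction of MFDS-GFNet are not reproduced.

```python
# Minimal sketch of a gated fusion of two feature streams.
import torch
import torch.nn as nn

class GateFusion(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, raw_feat, phys_feat):
        # g in (0, 1) weights the raw-data stream; (1 - g) weights the
        # hand-crafted physical-feature stream.
        g = self.gate(torch.cat([raw_feat, phys_feat], dim=-1))
        return g * raw_feat + (1 - g) * phys_feat

fusion = GateFusion(dim=64)
fused = fusion(torch.randn(8, 64), torch.randn(8, 64))
print(fused.shape)  # torch.Size([8, 64])
```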
|
|
09:00-18:30, Paper Mo-Online.88 | |
Intelligent Extraction Technology of Security Attributes for Chip Manual |
|
Huang, Yudong | Information Engineering University, Zhengzhou |
Wang, Yisen | Information Engineering University |
Liang, Siyuan | Information Engineering University |
Yang, Tianchan | Information Engineering University |
Feng, Yaxuan | Information Engineering University |
Keywords: Computational Intelligence, Deep Learning, Information Assurance and Intelligence
Abstract: Amid the increasingly severe security threat landscape of embedded systems, the security configuration information in chip technical documentation critically affects the effectiveness of hardware security mechanisms. However, traditional methods suffer from limitations including the low efficiency of manual review, the poor generalization of rule-based approaches, and insufficient domain knowledge in general-purpose large language models. These shortcomings can lead to systematic risks such as false positives in vulnerability identification and oversight of critical threats. To address these issues, we propose ChipGuard-BERT, a BERT-based approach that establishes a three-stage optimization mechanism comprising domain-specific pre-training, adversarial training enhancement, and knowledge-guided optimization, thereby enabling precise parsing of security configuration information. Experimental results show that ChipGuard-BERT performs strongly on independent test sets for chip manual terminology recognition. Comparative analysis with ChatGPT and a RAGflow local knowledge base reveals that the proposed method achieves a 92.5% exact match rate in security configuration conflict detection tasks.
|
|
09:00-18:30, Paper Mo-Online.89 | |
DSDGT: Dual-Stage Dependency Enhanced Graph Transformer for Aspect-Based Sentiment Analysis |
|
Liu, Shaokun | XinJiang University |
Li, Yanbing | XinJiang University |
Wushouer, SIlamu | XinJiang University |
Keywords: Neural Networks and their Applications, Deep Learning, AI and Applications
Abstract: Aspect-Based Sentiment Analysis (ABSA) seeks to determine the sentiment polarity of specific aspects within a text. Despite the strong performance of Graph Neural Networks (GNNs) based on dependency syntax trees in ABSA, existing methods often fail to differentiate the importance of dependency relations and inadequately capture semantic interactions, limiting their ability to detect implicit sentiments. To address these limitations, we propose the Dual-stage Dependency Enhanced Graph Transformer (DSDGT), a serial architecture that combines Transformer and Graph Convolutional Network (GCN) modules with an enhanced dependency mechanism. The Transformer module captures global semantic information, while the GCN models local dependencies. A dual-stage feature processing strategy is employed to integrate both original and enhanced dependency features effectively. Experiments on multiple benchmark datasets demonstrate that DSDGT achieves superior performance in terms of accuracy and F1 scores compared to state-of-the-art methods.
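The serial "global Transformer then dependency GCN" design can be pictured with the short PyTorch sketch below, assuming token features and a dependency adjacency matrix are given. Dimensions are toy values and the dual-stage dependency enhancement of DSDGT is omitted.

```python
# Sketch of a serial Transformer-then-GCN block over token features.
import torch
import torch.nn as nn

class TransformerThenGCN(nn.Module):
    def __init__(self, dim, heads=4):
        super().__init__()
        self.encoder = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.gcn_weight = nn.Linear(dim, dim)

    def forward(self, x, adj):
        h = self.encoder(x)                              # global semantics
        deg = adj.sum(-1, keepdim=True).clamp(min=1)
        h = torch.relu(self.gcn_weight(adj @ h) / deg)   # local dependency pass
        return h

x = torch.randn(2, 10, 64)             # batch of 2 sentences, 10 tokens each
adj = torch.eye(10).expand(2, 10, 10)  # toy dependency adjacency (self-loops)
print(TransformerThenGCN(64)(x, adj).shape)
```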
|
|
09:00-18:30, Paper Mo-Online.90 | |
Iterative Residual Cross-Attention Mechanism: An Integrated Approach for Audio-Visual Navigation Tasks |
|
Zhang, Hailong | Xinjiang University |
Yu, Yinfeng | Xinjiang University |
Wang, Liejun | Xinjiang University |
Sun, Fuchun | Tsinghua University |
Zheng, Wendong | Tianjin University of Technology |
Keywords: Machine Learning, Deep Learning, Machine Vision
Abstract: Audio-visual navigation represents a significant area of research in which intelligent agents utilize egocentric visual and auditory perceptions to identify audio targets. Conventional navigation methodologies typically adopt a staged modular design, which involves first executing feature fusion, then utilizing Gated Recurrent Unit (GRU) modules for sequence modeling, and finally making decisions through reinforcement learning. While this modular approach has demonstrated effectiveness, it may also lead to redundant information processing and inconsistencies in information transmission between the various modules during the feature fusion and GRU sequence modeling phases. This paper presents IRCAM-AVN (Iterative Residual Cross-Attention Mechanism for Audiovisual Navigation), an end-to-end framework that integrates multimodal information fusion and sequence modeling within a unified IRCAM module, thereby replacing the traditional separate components for fusion and GRU. This innovative mechanism employs a multi-level residual design that concatenates initial multimodal sequences with processed information sequences. This methodological shift progressively optimizes the feature extraction process while reducing model bias and enhancing the model's stability and generalization capabilities. Empirical results indicate that intelligent agents employing the iterative residual cross-attention mechanism exhibit superior navigation performance.
|
|
09:00-18:30, Paper Mo-Online.91 | |
ConsMatch: Semi-Supervised Medical Image Segmentation Via Multi-View Contrast and Feature Consistency |
|
Chen, Yuhan | Guilin University of Electronic Technology |
Li, You | Guilin University of Electronic Technology |
Zhao, Bin | Guilin University of Electronic Technology |
Keywords: AI and Applications, Image Processing and Pattern Recognition, Biometric Systems and Bioinformatics
Abstract: The central problem in semi-supervised medical image segmentation is how to fully utilize unlabeled data. However, most current methods struggle with the difficulty of segmenting medical image contours and with insufficient feature exploration of unlabeled data. To overcome these problems, this paper proposes a semi-supervised medical image segmentation framework based on contrastive learning, ConsMatch. We design two key modules: the MCR (Multi-View Contrastive Representation) module, which uses contrastive learning to maximize the feature similarity of the same image under different perturbations while minimizing the redundant structural similarity between different images; and the FCR (Feature-level Consistency Regularization) module, which guides the model to learn features with stable structural expressions by minimizing the difference in the distribution of structural similarity matrices under different augmented views. The method introduces structural constraints in the feature space and strengthens the model's ability to learn from unlabeled data in multiple dimensions through multi-view contrastive representation learning and structural consistency regularization. We conduct extensive experiments on the ACDC and LA datasets, achieving strong results on both. Our code can be found at https://github.com/magic-fortune/ConsMatch.
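The multi-view similarity objective behind modules of this kind can be sketched as an InfoNCE-style loss that pulls together two perturbed views of the same image and pushes apart different images. The temperature and feature sizes below are placeholders, not the ConsMatch architecture.

```python
# Minimal sketch of a multi-view contrastive consistency loss.
import torch
import torch.nn.functional as F

def multiview_contrastive_loss(z1, z2, temperature=0.1):
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / temperature   # pairwise view similarities
    targets = torch.arange(z1.size(0))   # positives lie on the diagonal
    return F.cross_entropy(logits, targets)

feats_view_a = torch.randn(16, 128)   # features of 16 images, perturbation A
feats_view_b = torch.randn(16, 128)   # same 16 images, perturbation B
print(multiview_contrastive_loss(feats_view_a, feats_view_b).item())
```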
|
|
09:00-18:30, Paper Mo-Online.92 | |
Knowledge-Based and Data-Driven Fusion for Unsupervised Video Anomaly Detection |
|
Kong, Qinghao | Beijing Jiaotong University |
Xu, Wanru | Beijing Jiaotong University |
Miao, Zhenjiang | Beijing Jiaotong University |
Zhai, Ruizhao | Beijing Jiaotong University |
Yu, Wenhao | Beijing Jiaotong University |
Keywords: Image Processing and Pattern Recognition, Deep Learning, Machine Vision
Abstract: Video Anomaly Detection (VAD) has extensive applications in fields such as intelligent surveillance and autonomous driving. In the field of unsupervised VAD, data-driven methods based on pseudo-label generation have their advantages. They can utilize the characteristics of the data itself to generate pseudo-labels for model learning. However, the unsupervised setting leads to a lack of supervision information, resulting in a low confidence level of the supervision signal. On the other hand, prior knowledge can effectively supplement some supervision information. Nevertheless, prior knowledge is usually general and does not take into account the specific information of each sample. To address these limitations, this paper proposes an unsupervised video anomaly detection method that combines prior knowledge and data. This method takes unlabeled videos as input and learns to predict frame-level anomaly scores. The algorithm consists of a prior knowledge module and a data-driven module. The prior knowledge module calculates the degree of anomaly through normal propagation based on prior knowledge independent of the data. The data-driven module estimates the degree of anomaly through two branches, appearance and motion, and jointly generates pseudo-labels. Experiments conducted on the UCF-Crime and ShanghaiTech datasets, using frame-level AUC as the evaluation metric, show that the proposed method achieves state-of-the-art performance in the unsupervised category, outperforming existing one-class classification methods.
|
|
09:00-18:30, Paper Mo-Online.93 | |
SensiYOLO: YOLO-Based Detection Method for Solid Waste Stockpiles in Remote Sensing Images |
|
Li, Meishu | University of Jinan |
Liu, Kun | University of Jinan |
Zou, Benli | University of Jinan |
Peng, Qinghao | University of Jinan |
Keywords: Machine Vision, Application of Artificial Intelligence, Neural Networks and their Applications
Abstract: The intelligent detection of solid waste stockpiles plays a critical role in environmental management, particularly against the backdrop of rapid urbanization and industrialization, which have led to a continuous increase in solid waste generation. Efficient and accurate monitoring of these stockpiles has become an urgent necessity. Although remote sensing technology has been widely adopted for such tasks due to its wide coverage and cost-effectiveness, existing detection methods still face significant limitations in terms of feature representation and adaptability to complex environments. To address these challenges, this study proposes a novel object detection algorithm, SensiYOLO, built upon the YOLOv11 framework and enhanced by two key modules: the Dynamic Spatial-Channel Attention (DSCA) module and the Adaptive Context Perception (ACP) module. The DSCA module leverages multi-head attention to dynamically enhance feature responses across spatial and channel dimensions, thereby improving feature discrimination. Meanwhile, the ACP module combines depthwise separable convolution with strip pooling to achieve lightweight, multi-scale context fusion, effectively capturing semantic information with minimal parameter overhead. Without significantly increasing model complexity, the proposed method demonstrates improved capability in recognizing target information under complex conditions. Experimental results on the Solid Waste Stockpiles (SWS) dataset show that SensiYOLO outperforms the baseline YOLOv11 model, achieving increases of 3.2% in precision, 3.7% in recall, and 3.1% in mAP. The synergistic design of the DSCA and ACP modules enhances feature representation while maintaining computational efficiency, significantly improving the model's applicability in large-scale and high-precision detection scenarios. This work offers a practical deep learning-based solution to the technical challenges of solid waste stockpile monitoring in complex environments, with strong potential for scalability and real-world deployment in environmental governance.
|
|
09:00-18:30, Paper Mo-Online.94 | |
YOLO-SSL: A Lightweight and High-Precision Model for UAV-Based Power Transmission Line Defect Detection |
|
Wang, Lei | Chongqing University of Posts and Telecommunications |
Zhou, YingHua | Chongqing University of Posts and Telecommunications |
Keywords: Image Processing and Pattern Recognition, AI and Applications, Deep Learning
Abstract: Accurate detection of power transmission line defects and prompt feedback to inspection personnel are crucial for the normal operation of power transmission lines. Addressing the problems of complex backgrounds, small defect sizes, and high false detection and missed detection rates in existing UAV inspection for power transmission line defect detection, we propose a YOLO-SSL model to improve the accuracy of power transmission line defect detection without significantly increasing model complexity. First, we incorporate the SPD-Conv module to replace stride convolution and pooling layers, enhancing detection performance for low-resolution images and small objects while effectively extracting features and reducing redundant computations. Second, we design a dedicated small target detection layer to make the network focus more on minor defects and improve detection effectiveness. Finally, we propose a new scalable lightweight receptive field module (LwRFB) that better captures target features of different sizes without adding a noticeable number of parameters. The experimental results show that the mAP@0.5 of the proposed YOLO-SSL reaches 89.7%, which is 4.8% higher than that of YOLO11n. Moreover, YOLO-SSL is a lightweight model, meeting edge device deployment needs and demonstrating high application value.
|
|
09:00-18:30, Paper Mo-Online.95 | |
Dynamic Feedback-Based Cost-Sensitive Learning for Imbalanced Intrusion Detection |
|
Ma, Xiaohang | Harbin Engineering University |
Yang, Lin | Academy of Military Sciences of the People's Liberation Army Ins |
Wang, Huiqiang | Harbin Engineering University |
Keywords: Big Data Computing,, Neural Networks and their Applications, Cloud, IoT, and Robotics Integration
Abstract: In network intrusion detection systems, the persistent challenge of class imbalance critically undermines detection efficacy for minority attacks. To address this, we propose a Dynamic Feedback Based Cost Sensitive Learning (DFCSL) method. This method employs a two-stage detection framework to achieve hierarchical classification between normal traffic and anomalous attacks. Additionally, we introduce a dynamic weight update mechanism based on recall feedback, which adaptively enhances the model's focus on hard-to-recognize categories, mitigating the limitations of traditional static weight settings. Moreover, a feature selection strategy based on Shapley values is designed to identify highly discriminative features, reduce input dimensionality, and lower the computational cost of the model. Experiments conducted on three datasets (CICIDS2017, CSE-CIC-IDS2018, and CIDDS-001) demonstrate that the proposed method not only improves the detection performance of minority class attacks but also maintains strong recognition capabilities for majority class attacks.
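The recall-driven weight update can be illustrated with the small sketch below, assuming per-class recall is measured on a validation split after each epoch. The update rule and its hyperparameters are illustrative, not the paper's exact formulation.

```python
# Sketch of a recall-feedback dynamic class-weight update.
import numpy as np

def update_class_weights(weights, recalls, lr=0.5, floor=1e-3):
    """Increase the weight of classes with low recall, then renormalize."""
    recalls = np.clip(np.asarray(recalls, dtype=float), floor, 1.0)
    # Classes that are hard to recall receive a multiplicative boost.
    weights = np.asarray(weights, dtype=float) * (1.0 + lr * (1.0 - recalls))
    return weights / weights.sum() * len(weights)

w = np.ones(4)                                 # start from uniform class weights
per_class_recall = [0.99, 0.95, 0.40, 0.10]    # minority attacks recalled poorly
for epoch in range(3):
    w = update_class_weights(w, per_class_recall)
    print(epoch, np.round(w, 3))
```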
|
|
09:00-18:30, Paper Mo-Online.96 | |
Dynamic Flexible Job Shop Scheduling Method Based on an Improved Artificial Lemming Algorithm and Its Bi-Objective Optimization under Machine Failures |
|
Guanghe, Cheng | Qilu University of Technology(Shandong Academy of Sciences) |
Tang, YaZhong | Qilu University of Technology |
Sun, Ruirui | Qilu University of Technology |
Ding, Qingyan | Qilu University of Technology |
Zhang, Hu | Qilu University of Technology |
Keywords: Metaheuristic Algorithms
Abstract: For the Dynamic Flexible Job Shop Scheduling Problem (DFJSP), this study proposes an Improved Artificial Lemming Algorithm with integrated Non-Dominated Sorting Genetic Algorithm II (NSGA-II) to solve the bi-objective optimization problem of minimizing makespan and maximizing machine utilization, with a focus on addressing scheduling optimization challenges under dynamic disturbances such as sudden machine failures. This work pioneers the application of the Artificial Lemming Algorithm (ALA) to Flexible Job Shop Scheduling (FJSP), proposing an efficient improved version called the Improved Non-dominated Sorting Artificial Lemming Algorithm (INALA). Differential evolution (DE) strategies are embedded in both the global exploration and local exploitation phases of ALA, enhancing the algorithm's search capability while improving the convergence accuracy of solution sets. A dynamic probability-controlled variable neighborhood search (VNS) perturbation mechanism is designed to enhance solution diversity and local optimization efficiency. Additionally, this paper proposes a sectional crossover mechanism based on precedence preserving order-based crossover (POX) and multi-point replacement crossover, which can effectively handle dynamic events such as machine failures. Finally, the effectiveness and practicality of the algorithm are verified through case studies.
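As a reference point for the embedded differential evolution step, the sketch below applies the classic DE/rand/1 mutation to a population of continuous position vectors. The encoding of operation sequences and machine assignments into such vectors is not shown, and the control parameter F is an illustrative choice.

```python
# Sketch of the DE/rand/1 mutation over a toy population of position vectors.
import numpy as np

def de_rand_1(population, f=0.5, rng=None):
    rng = rng or np.random.default_rng()
    n = len(population)
    mutants = np.empty_like(population)
    for i in range(n):
        # Pick three distinct individuals different from i.
        a, b, c = rng.choice([j for j in range(n) if j != i], 3, replace=False)
        mutants[i] = population[a] + f * (population[b] - population[c])
    return mutants

pop = np.random.default_rng(0).random((6, 8))   # 6 candidates, 8 genes each
print(de_rand_1(pop, f=0.5, rng=np.random.default_rng(1)).shape)
```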
|
|
09:00-18:30, Paper Mo-Online.97 | |
A Triplet Optimization and Difference Detail Perception Network with Adaptive Feature Enhancement for Radiology Report Generation |
|
Zeng, Yijie | Qilu University of Technology(Shandong Academy of Sciences) |
Zhang, Zhen | Qilu University of Technology(Shandong Academy of Sciences) |
Jiang, Wenfeng | Qilu University of Technology (Shandong Academy of Sciences) |
Liu, Song | Qilu University of Technology (Shandong Academy of Sciences) |
Keywords: Image Processing and Pattern Recognition, Deep Learning, Neural Networks and their Applications
Abstract: Radiology report generation is an essential task in the field of medical artificial intelligence, which aims to automatically generate text descriptions of radiology images. However, several problems remain in this task: 1) existing methods lack global feature interaction when extracting image features, and their ability to represent images is limited; 2) previous models need to retrieve similar triplets from pre-constructed knowledge graphs as model inputs, and these triplets lack entity relationship refinement; 3) existing approaches lack a cross-sample correlation mechanism, which makes it difficult to effectively capture small abnormal regions; 4) the self-attention mechanism of the Transformer decoder is good at capturing global dependencies but ignores relationships between local contexts. To address these issues, we propose a triplet optimization and difference detail perception network with adaptive feature enhancement. In our model, we design an adaptive image feature enhancement module to dynamically capture global image features. Furthermore, we propose a multi-modal triplet optimization module that boosts the capability for detecting abnormal regions by incorporating context-aware entity relationship refinement into the initial triplet. Moreover, we design a difference comparison weighting module to obtain fine-grained features between different samples and improve cross-sample correlation so that the model pays more attention to small details and anomalies that are easy to ignore. Finally, we design a detail-aware enhancement decoder that makes the decoder pay more attention to the relationships between local contexts. We evaluate our model on the IUXray and MIMIC-CXR datasets and compare it with other baseline models.
|
|
09:00-18:30, Paper Mo-Online.98 | |
FCD-YOLO: An Accurate and Efficient Method for Underwater Object Detection |
|
Yang, Zheqi | Qilu University of Technology |
Zhang, Jichen | Zaozhuang Natural Resources Development Center, Shandong Provinc |
Li, Mian | Qilu University of Technology |
Su, Zhike | Qilu University of Technology (Shandong Academy of Sciences) |
Li, Aimin | Qilu University of Technology |
Keywords: Application of Artificial Intelligence, Image Processing and Pattern Recognition, Deep Learning
Abstract: Underwater target detection presents substantial technical challenges due to the complex and dynamic characteristics of marine environments. Detection performance is frequently impaired by factors such as the presence of small targets with limited visual features, frequent occlusions from marine organisms and suspended particles, as well as significant morphological variations caused by varying viewing angles, lighting conditions, and target deformations in aquatic environments. To tackle these challenges, this paper presents an improved YOLOv8-based model, referred to as FCD-YOLO. The proposed model incorporates an FCA mechanism to emphasize key features while suppressing background noise. Additionally, it integrates a CCFF module to effectively capture and merge multi-scale information for more comprehensive feature representation. To further improve adaptability, the original YOLOv8 detection head is replaced with a DyHead, enabling multi-scale feature detection and enhancing the model's ability to handle targets of varying sizes. Experimental results on the URPC2020 dataset demonstrate that the FCD-YOLO model achieves a mean Average Precision (mAP) of 86.9%, representing a 5.0% improvement over the baseline YOLOv8 model. Comparative studies with other state-of-the-art detectors further validate FCD-YOLO's superior accuracy. Ablation studies confirm the individual contributions of each component, with the FCA mechanism and DyHead module showing particularly significant impacts on small target detection performance. These results substantiate FCD-YOLO's superiority and practical utility in complex underwater environments.
|
|
09:00-18:30, Paper Mo-Online.99 | |
Local-Global Collaborative Relational Representation for Understanding Knowledge Graphs |
|
Wang, Tao | South China Normal University |
Liang, Rongjiao | South China Normal University |
Fei, Chaoqun | South China Normal University |
Wang, Fu Lee | Hong Kong Metropolitan University |
Hao, Tianyong | South China Normal University |
Keywords: Representation Learning, Neural Networks and their Applications, Deep Learning
Abstract: Knowledge graphs (KGs) typically have distinct entity and relation vocabularies, and there is usually no overlap between the vocabularies of different KGs. Consequently, most existing studies develop independent reasoning models for different KGs. However, such models generally lack generalization capability in reasoning. This paper proposes a novel model, Local-Global Collaborative Relational Representation for KG Reasoning (LGRR), to achieve universal reasoning by learning relational invariance in KGs. Specifically, we introduce a local-global relational graph embedding method, which employs a graph neural network with an attention mechanism to perform local message passing. Global attention is then applied to information propagation to overcome the limitations of traditional local message passing and capture more comprehensive graph structural information. Finally, in the local message passing phase, we propose a relation-aware dynamic attention mechanism and a relation aggregation strategy. By integrating relation-type semantics with local subgraph structural features, our method dynamically generates attention coefficients among nodes, thereby enhancing the model's reasoning capability. The results demonstrate that the zero-shot (0-shot) inference capability of a single pre-trained LGRR model is comparable to or better than models trained on specific KGs across most unseen KGs.
|
|
09:00-18:30, Paper Mo-Online.100 | |
A Retrieval Filtering and Thought Enhancement Framework for Function-Level Code Generation Based on Large Language Model |
|
Wu, Chao | Qilu University of Technology (Shandong Academy of Sciences) |
Zhang, Pusheng | Qilu University of Technology (Shandong Academy of Sciences) |
Jiang, Xuesong | Qilu University of Technology(shandong Academy of Sciences) |
Liu, Song | Qilu University of Technology (Shandong Academy of Sciences) |
Keywords: Deep Learning, Application of Artificial Intelligence, Neural Networks and their Applications
Abstract: Function-level code generation is an important task at the intersection of software engineering and artificial intelligence, which aims to improve the productivity of software development by automatically generating function-level code based on task descriptions. However, this task currently suffers from several problems: 1) in the fine-tuning phase, current large language models cannot sufficiently capture the detailed syntactic structure of the code dataset; 2) previous retrieval-augmented methods do not adequately consider the complex dependencies between code snippets in external code repositories; 3) existing large language models that introduce a self-repair mechanism tend to overfocus on previously generated erroneous code during the self-repair process. To solve these problems, we propose a retrieval filtering and thought enhancement framework for function-level code generation based on large language models. In our model, we design an abstract syntax tree mapping preprocessing module to preprocess the dataset and help large language models learn the detailed syntax information of the dataset. Furthermore, we design a retrieval filtering and thought enhancement module to retrieve the code snippets most relevant to the task and to enhance the chain-of-thought of our model. In addition, we design a self-repair mechanism to prevent large language models from overfocusing on generated erroneous code, helping them explore more solutions to repair the erroneous code. We evaluate our model on the HumanEval, MBPP, and MultiPL-E benchmarks and compare it with other baseline models.
|
|
09:00-18:30, Paper Mo-Online.101 | |
Efficient Weed Detection in Corn Fields Based on ESL-YOLOv8 |
|
Li, Mian | Qilu University of Technology |
Li, Aimin | Qilu University of Technology |
Yu, Jialin | Peking University Institute of Advanced Agricultural Sciences, S |
Jin, Xiaojun | College of Intelligent Manufacturing, Anhui Science and Technolo |
Yang, Zheqi | Qilu University of Technology |
Su, Zhike | Qilu University of Technology (Shandong Academy of Sciences) |
Zhu, Wenpeng | Peking University Institute of Advanced Agricultural Sciences, S |
Keywords: Application of Artificial Intelligence, Deep Learning, Image Processing and Pattern Recognition
Abstract: Effective weed management is essential for maintaining stable agricultural productivity and ensuring crop yield. Recent advancements in computer vision technologies have significantly enhanced weed detection efficiency. However, the diversity of weed species and their irregular spatial distribution present considerable challenges to traditional weed detection methods, which often exhibit low accuracy in practical applications. Additionally, the construction of large-scale, high-quality datasets encompassing a wide variety of weed species is constrained by high costs and time limitations. Indirect weed recognition methods offer a promising solution to this issue. This approach first employs an image segmentation network to precisely identify and remove crop regions from the original image, followed by an image processing algorithm that extracts green pixels for weed identification. Based on the indirect weed detection method, this study optimizes the YOLOv8n-seg architecture to improve segmentation accuracy while maintaining model efficiency, leading to the development of the enhanced ESL-YOLOv8 segmentation model. To validate the effectiveness of the proposed method, a maize-specific segmentation dataset was constructed, and extensive experiments were conducted. The results show that the improved ESL-YOLOv8 model achieves a mask mAP50 of 94.1%, which is superior to the original YOLOv8n-seg (92.0%). Furthermore, the model size is reduced to 5.7 MB, compared to 6.8 MB for the original model, with a decrease in computational complexity. These findings confirm that the proposed framework provides a lightweight, high-precision, and practical solution for weed detection in agricultural fields, exhibiting enhanced robustness and significant engineering applicability.
|
|
09:00-18:30, Paper Mo-Online.102 | |
FEdiffusion-VAD: A Feature-Enhanced Diffusion Model for Skeleton-Based Video Anomaly Detection |
|
Du, Fengxin | Qilu University of Technology |
Wang, Xingang | Qilu University of Technology(Shandong Academy of Sciences) |
Zhang, Yudong | Southeast University |
Xiao, Yuteng | Qilu University of Technology |
Keywords: Application of Artificial Intelligence
Abstract: Current video anomaly detection models often suffer from a high probability of false negatives, a problem that has become increasingly significant. Moreover, many advanced models are sensitive to noise from previous frames due to their reliance on the temporal dimension for anomaly detection. In this paper, we propose a novel approach to skeleton-based video anomaly detection, called FEdiffusion-VAD, which addresses these challenges through feature enhancement. The model effectively reduces false negatives and increases robustness to noise from past frames. FEdiffusion-VAD first generates predicted future skeleton data and reconstructed data using the Spatiotemporal Generation (STG) model. Subsequently, the outlier degree of anomalous data is strengthened using our proposed Screening-Strengthening (S-S) method. Experimental results demonstrate that FEdiffusion-VAD outperforms existing methods on three publicly available skeleton datasets, improving both accuracy and robustness in video anomaly detection. This study highlights the significant potential of feature-enhanced interaction networks for skeleton-based video anomaly detection.
|
|
09:00-18:30, Paper Mo-Online.103 | |
Defending Deepfake Via Texture Feature Perturbation |
|
Zhang, Xiao | Qilu University of Technology |
Chen, Changfang | Shandong Artificial Intelligence Institute, Qilu University of Te |
Wang, Tianyi | National University of Singapore |
Keywords: Application of Artificial Intelligence, Deep Learning, Image Processing and Pattern Recognition
Abstract: The rapid development of Deepfake technology poses severe challenges to social trust and information security. While most existing detection methods rely on passive analysis, which struggles with high-quality Deepfake content, proactive defense has recently emerged, inserting invisible signals in advance of image editing. In this paper, we introduce a proactive Deepfake detection approach based on facial texture features. Since human eyes are more sensitive to perturbations in smooth regions, we invisibly insert perturbations within texture regions that have low perceptual saliency, applying localized perturbations to key texture regions while minimizing unwanted noise in non-textured areas. Our texture-guided perturbation framework first extracts preliminary texture features via Local Binary Patterns (LBP), and then introduces a dual-model attention strategy to generate and optimize texture perturbations. Experiments on the CelebA-HQ and LFW datasets demonstrate the promising performance of our method in distorting Deepfake generation and producing obvious visual defects under multiple attack models, providing an efficient and scalable solution for proactive Deepfake detection.
|
|
09:00-18:30, Paper Mo-Online.104 | |
Text-To-SQL Correction Based on Correlation Degree and Multi-Round Error Evaluation |
|
Bing, Wang | Southwest Petroleum University |
Zhang, Youming | Southwest Petroleum University |
Zhang, Xingpeng | Southwest Petroleum University |
Zhao, Chunlan | Southwest Petroleum University |
Zhang, Chi | Southwest Petroleum University |
Cai, Chaoqi | Southwest Petroleum University |
Keywords: Application of Artificial Intelligence, AI and Applications
Abstract: Text-to-SQL is an innovative tool that automatically converts natural language queries into structured SQL statements, enabling users to extract data from complex databases efficiently. Automatic correction techniques are employed to enhance the performance of Text-to-SQL models. However, these corrections encounter challenges stemming from a lack of critical information in input sequences, substantial noise, and insufficient contextual knowledge, all hindering optimal parser performance. We propose a correction method based on correlation degree and multi-round error evaluation (CMC) to address these issues. Our approach uses a correlation evaluator to eliminate redundant table columns, emphasize key aspects of the input sequence, and minimize unnecessary noise. The multi-round error evaluator integrates database execution analysis of SQL queries, allowing erroneous information and background knowledge to be updated to better guide model correction. Experimental results indicate that our method achieves a 3.3% increase in execution accuracy (EX) and a 2.0% increase in exact match accuracy (EM) on the Spider benchmark, surpassing current mainstream automatic correction methods. Additionally, our correction technique significantly improves the performance of the Text-to-SQL baseline model.
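The multi-round, execution-feedback idea can be sketched as a loop that executes the candidate SQL, feeds any database error back to a correction model, and retries. In the sketch below, `ask_model` is a hypothetical stand-in for the corrector, and the schema and query are toy examples rather than the Spider benchmark.

```python
# Sketch of a multi-round correction loop driven by SQL execution feedback.
import sqlite3

def execute_feedback(db, sql):
    try:
        return True, db.execute(sql).fetchall()
    except sqlite3.Error as err:
        return False, str(err)

def multi_round_correction(db, sql, ask_model, max_rounds=3):
    for _ in range(max_rounds):
        ok, result = execute_feedback(db, sql)
        if ok:
            return sql, result
        sql = ask_model(sql, error=result)   # feed the error back as context
    return sql, None

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE singer (name TEXT, age INTEGER)")
db.execute("INSERT INTO singer VALUES ('Ann', 30)")

def ask_model(sql, error):
    # Hypothetical corrector: here it simply fixes a known wrong column name.
    return sql.replace("years", "age")

print(multi_round_correction(db, "SELECT name, years FROM singer", ask_model))
```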
|
|
09:00-18:30, Paper Mo-Online.105 | |
PiG-Adapter: Lightweight Knowledge Graph Adaptation for Few-Shot Vision-Language Tuning |
|
Cai, Zeqiang | Sun Yat-Sen University |
Wu, Jingze | Sun Yat-Sen University |
Zhu, Nannan | Sun Yat-Sen University |
Chen, Hongbo | Sun Yat-Sen University |
Keywords: Image Processing and Pattern Recognition, Neural Networks and their Applications
Abstract: Vision-language models (VLMs), such as CLIP and ALIGN, have demonstrated impressive zero-shot capabilities across various tasks. Recently, prompt-tuning and adapter-based methods have achieved strong performance in few-shot learning, enabling efficient adaptation of VLMs to new tasks with limited supervision. However, these methods still face two key challenges: 1) They fail to explicitly model the fine-grained structure of visual data, limiting their ability to handle localized or detailed recognition tasks; 2) Many methods rely on external memory banks or static prior graphs, introducing significant computational and memory overhead during inference. To address these issues, we propose PiG-Adapter, a novel framework that dynamically constructs a fine-grained visual knowledge graph from the internal image structure, allowing adaptation without the need for external memory. Additionally, PiG-Adapter integrates a compact textual cluster graph, distilling class-level knowledge to further reduce computational burden. We demonstrate the effectiveness of PiG-Adapter through extensive experiments on 11 benchmark datasets, showing that it outperforms existing methods in both accuracy and efficiency, making it an ideal solution for low-resource, scalable adaptation of VLMs.
|
|
09:00-18:30, Paper Mo-Online.106 | |
CM-Net: Local and Global Enhanced Network for Skin Lesion Segmentation |
|
Dong, Aimei | Qilu University of Technology (Shandong Academy of Science) |
Sun, YaoYao | Qilu University of Technology (Shandong Academy of Science) |
Keywords: Deep Learning, Artificial Social Intelligence, Image Processing and Pattern Recognition
Abstract: Automatic skin lesion segmentation is a critical tool in clinical diagnosis, which significantly enhances the accuracy of early diagnosis. However, due to the diverse shapes, blurred boundaries, and interference from hair in the samples, skin lesion segmentation remains a challenging task. To overcome these challenges, we propose a parallel-branch codec structure called CM-Net. Specifically, we design a feature extraction module that uses parallel Mamba and CNN branches to perform global modeling and local feature extraction. This module effectively captures the global lesion area and detailed boundary information. Additionally, a Multi-scale Attention Cross Fusion (MACF) module is designed to replace traditional skip connections, thereby enhancing the interaction between shallow and deep features. It helps mitigate the semantic gap while suppressing background interference. Extensive experiments conducted on two public datasets demonstrate that our method achieves superior segmentation performance compared to most state-of-the-art approaches.
|
|
09:00-18:30, Paper Mo-Online.107 | |
SPRoC: Semantics-Preserving Mutations for Robustness Evaluation of Code Generation Large Language Models |
|
Shi, Qiancheng | Beijing Information Science and Technology University |
Han, Qihong | Beijing Information Science and Technology University |
Cui, Zhanqi | Beijing Information Science and Technology University |
Zeng, Zheng | Beijing Information Science and Technology University |
Keywords: Machine Learning, Application of Artificial Intelligence, Assurance
Abstract: With the widespread use of large language models (LLMs) in code generation, their capabilities continue to improve. However, LLMs still exhibit instability when faced with minor input prompt variations, which presents challenges for practical deployment. Existing prompt mutation methods have limitations, such as random insertions, deletions, or replacements without understanding prompt semantics and structure. These methods fail to capture the diverse ways real users express the same problem, limiting their ability to assess LLM code generation robustness. To address this, we propose SPRoC (Semantics-Preserving Robustness of Code generation), a method for evaluating LLM robustness through prompt mutation. Using a BERT-based model, SPRoC generates mutated prompts that maintain semantic consistency but offer diverse expressions. These prompts create a new dataset to verify the functionality of LLM-generated code. SPRoC compares the functional correctness of code before and after mutation to assess LLM robustness to input variations. We conduct experiments on the HumanEval dataset with several mainstream LLMs, including ChatGPT, DeepSeek, Claude, ERNIE, and Qwen, to evaluate performance under SPRoC mutations. Results show that SPRoC reduces the models' Pass@k scores with minimal semantic changes, with Pass@1, Pass@5, and Pass@10 decreasing by 4.46%, 1.83%, and 1.22%, respectively, outperforming the baseline Radamsa method. SPRoC also achieves better performance on similarity metrics like BLEU and BERTScore, improving by 12.96% and 1.83%, respectively. This work not only verifies the robustness of mainstream LLMs under semantics-preserving mutations but also demonstrates the practicality and generality of SPRoC, offering a new direction for enhancing LLM stability in real-world scenarios.
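For context on the Pass@k comparison before and after mutation, the sketch below implements the standard unbiased Pass@k estimator; the sample counts in the example are toy numbers, not SPRoC's measurements.

```python
# Sketch of the standard unbiased pass@k estimator.
from math import comb

def pass_at_k(n, c, k):
    """Probability that at least one of k draws (without replacement)
    from n generated samples, of which c are correct, passes the tests."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# e.g. 20 generations per problem, 8 correct before mutation, 6 correct after
print(round(pass_at_k(20, 8, 1), 3), round(pass_at_k(20, 6, 1), 3))
```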
|
|
09:00-18:30, Paper Mo-Online.108 | |
EPPFL: Entropy-Based Personalized Privacy in Federated Learning |
|
Guo, Yang | Hebei University |
Keywords: Artificial Social Intelligence, Computational Intelligence in Information, AIoT
Abstract: Federated learning is a privacy-friendly framework that effectively protects the privacy of local data while accomplishing the training task. However, the transmission of model parameters or gradients can still be exploited by malicious inference attacks. To further enhance privacy protection, combining federated learning and differential privacy can reduce the risk of potential privacy leakage, but existing methods ignore the individualized privacy needs of different clients and the balanced allocation of the privacy budget. To address this problem, a federated learning framework combining information entropy computation and a dynamic differential privacy policy is designed. It uses information entropy to compute the local privacy demand and dynamically adjusts the privacy budget allocation policy, effectively reducing the total accumulated privacy budget and enhancing privacy protection without affecting model utility. The experimental results show that the method achieves a better balance between personalized privacy protection and model performance, proving its effectiveness and practicality in federated learning.
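One way to picture entropy-driven budget allocation is the sketch below, which assumes each client summarizes its local label distribution and assigns a larger share of a total budget to clients with more heterogeneous data. The allocation rule and the budget value are illustrative, not the paper's exact policy.

```python
# Sketch of entropy-based privacy budget allocation across clients.
import numpy as np

def label_entropy(label_counts):
    p = np.asarray(label_counts, dtype=float)
    p = p[p > 0] / p.sum()
    return float(-(p * np.log2(p)).sum())

def allocate_budget(client_label_counts, total_epsilon=8.0):
    # Higher-entropy (more heterogeneous) clients receive a larger share
    # of the total privacy budget in this toy policy.
    ent = np.array([label_entropy(c) for c in client_label_counts])
    share = ent / ent.sum()
    return share * total_epsilon

clients = [[90, 5, 5], [34, 33, 33], [60, 30, 10]]   # per-client label counts
print(np.round(allocate_budget(clients), 3))
```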
|
|
09:00-18:30, Paper Mo-Online.109 | |
Enhancing Chinese Short Text Entity Linking Based on Entity Co-Occurrence Features |
|
Yu, Qing | Xinjiang University |
Guo, Yanliang | Xinjiang University |
Yao, Jiasheng | Xinjiang University |
Li, Yehang | Xinjiang University |
Keywords: Deep Learning, AI and Applications, Complex Network
Abstract: Entity linking is a crucial downstream task in knowledge graph research, which involves mapping ambiguous named entities in text to their corresponding entities in a knowledge base. Compared to long-document-level entity linking, short Chinese texts present additional challenges due to their sparse contextual information. Existing Chinese short-text entity linking models predominantly focus on the degree of matching between the context of a mention and its candidate entities. However, they often overlook the co-occurrence relationships among candidate entities within the same query and fail to capture the fine-grained correlations between the mention context and the descriptive text of candidate entities. To address these issues, this paper leverages entity co-occurrence features to enrich the contextual semantic representation. Specifically, an entity co-occurrence graph is constructed based on the co-occurrence probabilities of entities in the training set. Furthermore, a bilinear attention mechanism is employed to capture the fine-grained correlations between the mention context and the candidate entity descriptions. Experimental results on two Chinese datasets, CCKS2019 and CCKS2020, demonstrate that the proposed model outperforms existing baseline models. These results validate the feasibility and effectiveness of incorporating co-occurrence features and the bilinear attention mechanism in improving entity linking performance in Chinese short texts.
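The co-occurrence graph construction can be sketched as counting entity pairs that appear in the same training query and normalizing to an approximate co-occurrence probability. The toy queries and the normalization below are illustrative choices, not the paper's exact construction.

```python
# Sketch of building an entity co-occurrence graph from training queries.
from collections import Counter, defaultdict
from itertools import combinations

def build_cooccurrence_graph(queries):
    pair_counts, entity_counts = Counter(), Counter()
    for entities in queries:                 # entities linked in one query
        entity_counts.update(set(entities))
        pair_counts.update(combinations(sorted(set(entities)), 2))
    graph = defaultdict(dict)
    for (a, b), n_ab in pair_counts.items():
        p = n_ab / min(entity_counts[a], entity_counts[b])
        graph[a][b] = graph[b][a] = p
    return graph

train_queries = [["Apple", "iPhone"], ["Apple", "Jobs"], ["Apple", "iPhone", "Jobs"]]
print(dict(build_cooccurrence_graph(train_queries)))
```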
|
|
09:00-18:30, Paper Mo-Online.110 | |
Joint Geometric Self-Attention and Boundary-Aware Search for High-Precision Intracranial Aneurysm Mesh Segmentation |
|
Li, Shuang | Sichuan Normal University |
Zhang, Fuhao | Sichuan Normal University |
Zhang, Jialin | Guizhou University |
Wang, Ling | Sichuan Normal University |
Zhou, Hui | West China Second University Hospital,Sichuan University |
Chen, Dapeng | West China Second University Hospital, Sichuan University |
Tang, Jinshan | George Mason University |
Jiang, Jingfeng | Michigan Technological University |
Mu, Nan | Sichuan Normal University |
Keywords: Image Processing and Pattern Recognition, Deep Learning, Neural Networks and their Applications
Abstract: Accurate segmentation of intracranial aneurysms (IAs) is critical due to their high rupture risks. Traditional voxel methods struggle to preserve surface details and maintain topological consistency. In contrast, mesh segmentation efficiently represents complex geometries using vertices, edges, and faces, but still faces challenges in modeling global context and refining ambiguous boundaries. We propose a novel mesh segmentation framework, GSBS-MSeg, composed of two main components: a geometric self-attention module and a boundary-aware search (BS) module. The geometric self-attention module captures long-range geometric dependencies, enhancing the global context and integrating both local and global features for improved segmentation accuracy. The BS module refines mesh boundaries and ensures topological consistency, significantly improving local boundary precision. GSBS-MSeg was evaluated using five-fold cross-validation on two publicly available datasets. Compared to existing methods, GSBS-MSeg improves aneurysm segmentation accuracy by 45.6%, total accuracy by 11.96%, and vessel segmentation accuracy by 7.73%. These results demonstrate its effectiveness and superior performance for mesh-based segmentation tasks.
|
|
09:00-18:30, Paper Mo-Online.111 | |
Adaptive Low-Light Image Enhancement Algorithm Based on Multi-Scale Retinex |
|
Guo, Jiali | Inner Mongolia University |
Pan, Jiaxu | Inner Mongolia University |
Yi, Ru | Inner Mongolia University |
Keywords: Image Processing and Pattern Recognition
Abstract: Images captured under low-light conditions often suffer from poor visibility, low contrast, noise, and color distortion, significantly degrading visual quality and the performance of subsequent image processing tasks. To address these challenges, this paper proposes a non-uniform low-light image enhancement algorithm based on Retinex theory. The algorithm effectively restores image details and enhances contrast while preserving natural texture characteristics. By optimizing the illumination estimation process with a Just Noticeable Difference (JND) model, precise illumination adjustment is achieved using an attenuation factor derived from the luminance JND. In addition, the method integrates adaptive local contrast enhancement with a dynamic weighting strategy to correct the illumination distribution. This approach not only significantly improves visibility in darker regions but also achieves a more balanced distribution of light between bright and dark areas. Experimental results demonstrate that the proposed method outperforms existing techniques under various illumination conditions, validating its effectiveness for image enhancement.
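For orientation, the sketch below runs a plain multi-scale Retinex on a synthetic grayscale patch, assuming input values in (0, 1]. The JND-guided attenuation and adaptive local contrast steps described in the abstract are not reproduced, and the scale choices are arbitrary.

```python
# Minimal multi-scale Retinex sketch on a synthetic low-light patch.
import numpy as np
from scipy.ndimage import gaussian_filter

def multi_scale_retinex(img, sigmas=(2, 8, 32), eps=1e-6):
    img = np.clip(img, eps, None)
    log_img = np.log(img)
    msr = np.zeros_like(img)
    for sigma in sigmas:
        illumination = gaussian_filter(img, sigma)   # smooth illumination estimate
        msr += log_img - np.log(np.clip(illumination, eps, None))
    msr /= len(sigmas)
    # Rescale the reflectance estimate back to [0, 1] for display.
    return (msr - msr.min()) / (msr.max() - msr.min() + eps)

rng = np.random.default_rng(1)
dark = 0.05 * rng.random((64, 64)) + 0.02    # synthetic under-exposed patch
enhanced = multi_scale_retinex(dark)
print(dark.mean().round(3), enhanced.mean().round(3))
```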
|
|
09:00-18:30, Paper Mo-Online.112 | |
Infrared Vehicle Adversarial Patch: Physical Attacking Infrared Vehicle Detectors |
|
Chen, Hanyang | Southwest University of Science and Technology |
Dong, Wanli | Southwest University of Science and Technology |
Fan, Jiachuan | Southwest University of Science and Technology |
Gao, Xiaoming | Southwest University of Science and Technology |
Peng, Anjie | Southwest University of Science and Technology |
Fan, Xingdi | Southwest University of Science and Technology |
Li, Dong | Southwest University of Science and Technology |
Keywords: Information Assurance and Intelligence, Assurance, Metaheuristic Algorithms
Abstract: The growing dependence of autonomous vehicles on infrared thermal imaging for adverse-condition perception has heightened the security importance of infrared vehicle detectors. This work proposes the Infrared Vehicle Adversarial Patch (InfVAP), a physically realizable attack framework targeting these detectors. Addressing the dual challenges of limited prior research and complex vehicle geometry, our solution introduces: (1) pixel-level attack localization using DeepLab-v3+ segmentation to ensure precise patch placement; (2) an improved Particle Swarm Optimization (PSO) algorithm with dynamic parameter adaptation and shape optimization to enhance attack effectiveness; (3) Expectation Over Transformation (EOT)-based robustness enhancement against real-world perturbations; and (4) a top-K confidence suppression loss (K=1000) for maximum attack potency. Extensive evaluations on the FLIR ADAS V2 dataset demonstrate that InfVAP achieves state-of-the-art performance: a 93.93% digital Attack Success Rate (ASR) and an 80.2% physical ASR at a distance of 12 meters. Comprehensive ablation studies show that all components are statistically significant contributors, with PSO optimization contributing most to attack potency.
|
|
09:00-18:30, Paper Mo-Online.113 | |
IMDRec: Intent-Aware Multi-Interest Modeling with Diffusion-Enhanced Embeddings for Sequential Recommendation |
|
Wu, Wenjun | Huazhong University of Science and Technology |
Jia, Jingyi | Huazhong University of Science and Technology |
Dou, Linkun | Huazhong University of Science and Technology |
Jingxuanmeng, Cusuanjun | Huazhong University of Science and Technology |
Keywords: Machine Learning, Deep Learning, Representation Learning
Abstract: Sequential recommendation (SR) aims to predict users' future behaviors based on their historical interactions. While user interest modeling often lies at the core of SR, user intent, which reflects the high-level motivations behind user behaviors, remains relatively underexplored. Despite recent efforts in modeling user interest and intent, existing methods still suffer from several critical limitations: (1) they regard the entire user sequence as reflecting a single interest, failing to capture diverse user interests; (2) they approach intent modeling as a task separate from interest extraction, which leads to optimization inconsistency and suboptimal recommendation performance; (3) they lack mechanisms to learn robust item representations under sparse interactions, which further degrades both interest extraction and intent modeling in downstream tasks. To address these challenges, we propose IMDRec, an Intent-Aware Multi-Interest Modeling framework with Diffusion-Enhanced Embeddings for Sequential Recommendation. IMDRec first integrates diffusion-based contrastive learning to enhance the robustness of item embeddings under data sparsity, providing a solid embedding foundation for downstream modeling. On top of this, a multi-expert architecture is adopted to capture diverse user interests. In the final stage, a novel end-to-end intent modeling mechanism is designed to learn discrete intent representations, which serve as high-level guidance to adjust user interest representations for recommendation. Extensive experiments on four real-world datasets demonstrate the superiority and effectiveness of IMDRec.
|
|
09:00-18:30, Paper Mo-Online.114 | |
ICasA: 3D Object Detection Model with Feature Enhancement and Uncertainty Modeling |
|
He, Huaiqing | The Civil Aviation University of China |
Zhai, Yujia | Civil Aviation University of China |
Liu, Haohan | The Civil Aviation University of China |
Hui, Kanghua | Civil Aviation University of China |
Keywords: Machine Vision, Machine Learning, Deep Learning
Abstract: ICasA addresses the issue of insufficient spatial feature extraction, which hinders detection accuracy, by making the following improvements to CasA. Focal sparse convolution is introduced to enhance its 3D backbone network, optimizing the information flow extraction between non-empty voxel feature locations. Additionally, a multi-scale feature attention fusion module is created and embedded into the 2D backbone network. This mechanism first uses convolutions with varying strides to deepen the network structure and enhance global feature extraction capabilities, and then improves the feature attention fusion means, dynamically activating features at different scales and positions while retaining the original feature maps. Next, a dual-threshold control focal sparse convolution is introduced in the fourth stage of the 3D backbone network of CasA in order to suppress small object noise and control the expansion of voxels, thereby effectively enhancing the feature representation of small objects. Meanwhile, to overcome the stronger annotation instability for small objects, the IoU-DE module is proposed. On the KITTI dataset, the ICasA model achieves a 1.68% improvement in mAP compared to CasA, with the 3D average precision (AP@R40) for pedestrians and cyclists improving by 3.14% and 2.17%, respectively. Furthermore, compared to existing advanced methods, ICasA demonstrates more stable detection performance across all object categories.
|
|
09:00-18:30, Paper Mo-Online.115 | |
HCETrack: Visual Tracking with Historical Context Information and Feature Enhancement |
|
Yang, Xinyu | Shanghai University of Electric Power |
Xu, Man | Shanghai University of Electric Power |
Fan, Zizhu | Shanghai University of Electric Power |
|
|
09:00-18:30, Paper Mo-Online.116 | |
KANDU-Net: Enhancing Global Context Capture in Medical Image Segmentation with Kolmogorov-Arnold Networks |
|
Fang, Chenglin | Chongqing University |
Wu, Kaigui | Chongqing University |
Keywords: Machine Vision, Neural Networks and their Applications, Machine Learning
Abstract: The U-Net model has consistently demonstrated exceptional performance in the field of medical image segmentation, with various improvements and enhancements made since its introduction. However, convolution operations have limitations in capturing long-range dependencies and global contextual information, a constraint inherited by U-Net as it relies on convolutions. To address this issue, this paper proposes an innovative architecture. By leveraging the powerful nonlinear representation capabilities of Kolmogorov-Arnold Networks, a new global feature extraction module is introduced. A KAN-convolution dual-channel structure is employed, and a feature fusion module is designed to effectively combine local and global features. Building upon the strengths of U-Net, the KANDU-Net architecture is constructed and validated on the Kvasir, GLAS, and BUSI datasets, achieving the best segmentation performance.
|
|
09:00-18:30, Paper Mo-Online.117 | |
Background-Aware Prior Mask Refinement and Feature Augmentation for Few-Shot Segmentation |
|
He, Hongrong | Sun Yat-Sen University |
Chen, Jiaxin | Sun Yat-Sen University |
Ma, Jinhua | Sun Yat-Sen University |
Keywords: Image Processing and Pattern Recognition, Deep Learning, Machine Learning
Abstract: Few-shot semantic segmentation enables the rapid deployment of segmentation models to new categories with limited annotated data. In recent works, generating prior masks based on pixel-wise similarity as guidance for segmentation has proven to be a promising approach. Since the prior masks are generated by a frozen pre-trained backbone, they may be biased toward wrongly activating non-target classes in the background regions. To address this issue, we propose a Background-aware Feature Augmentation (BFA) method for few-shot semantic segmentation. Our method first estimates foreground and background prior masks, and then fuses them with a learnable convolution. With the help of the background information, more accurate prior masks can be generated by deactivating the background regions. By utilizing the refined prior masks, foreground and background features are augmented to encode foreground and background information for segmentation. The proposed BFA is a plug-and-play method that can be easily integrated into existing works based on the prototype matching framework. Experiments on the PASCAL-5i and COCO-20i datasets demonstrate the superiority of our method compared to state-of-the-art approaches.
|
|
09:00-18:30, Paper Mo-Online.118 | |
TWT-LLM: A Universal and Robust Tagged Watermark for Large Language Models |
|
Zheng, Jibin | Foshan University |
Ma, Li | Foshan University |
Yang, WenYin | Foshan University |
Liu, Fen | Foshan University |
Li, Yongqiang | Chinese Academy of Sciences |
Liu, Zhengbin | National Key Laboratory of Security Communication |
Keywords: Information Assurance and Intelligence, Intelligent Internet Systems, Artificial Social Intelligence
Abstract: Large Language Models (LLMs) generate realistic and coherent text, boosting efficiency and decision-making in various fields. However, their generative capabilities pose risks of intellectual property abuse. Watermarking technology offers a solution for information hiding in LLMs, with large-model watermarking gaining attention for its unique methods. A critical challenge is embedding watermarks with minimal impact on text quality while ensuring rapid detection, making this an urgent issue to address. In this paper, to address these challenges, we propose a universal and robust tagged watermark technology for LLMs (TWT-LLM). First, TWT embeds a certain amount of watermark information in the sampling process while generating subsequent text based on the prompt. Then, to enhance the quality of the generated text, we propose a group-based local watermark embedding method, which significantly reduces the impact on text quality: tokens are tagged within each group, and only the tagged tokens embed watermark information. Moreover, to detect the watermark information in the generated text, we design a detection method specifically for this watermark embedding technique. Finally, we conducted experiments using the C4 and HC3 datasets, demonstrating that TWT-LLM achieves a lower False Negative Rate and is lighter compared to state-of-the-art methods. Additionally, experiments on randomly selected arXiv abstracts showed that the True Positive Rate reached 100%, while the False Negative Rate was 0%.
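A minimal sketch of a group-based, tagged-token check is given below: tokens are split into fixed-size groups and only one tagged position per group is expected to carry a keyed watermark bit. The hashing scheme, group size, and scoring are illustrative assumptions, not TWT-LLM's actual embedding or detection procedure.

```python
# Sketch of a group-based tagged-token watermark detection score.
import hashlib

def tagged_positions(num_tokens, group_size=4):
    # Tag the first position of every group.
    return range(0, num_tokens, group_size)

def watermark_bit(token, key="secret"):
    return hashlib.sha256((key + token).encode()).digest()[0] & 1

def detection_score(tokens, key="secret", group_size=4):
    tagged = list(tagged_positions(len(tokens), group_size))
    hits = sum(watermark_bit(tokens[i], key) for i in tagged)
    return hits / max(len(tagged), 1)

# A generator that followed the scheme would have preferred tokens whose bit
# is 1 at every tagged position, pushing this score toward 1.0; unwatermarked
# text stays near 0.5 on average.
print(detection_score("the model generates realistic coherent text today".split()))
```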
|
|
09:00-18:30, Paper Mo-Online.119 | |
DEMFND: Domain-Enhanced Multimodal Fake News Detection |
|
Feng, Lexin | Guizhou Minzu University |
Wei, Jiayin | Guizhou Minzu University |
Lu, Youjun | Guizhou Minzu University |
Feng, Fujian | South China University of Technology |
Keywords: Multimedia Computation, Media Computing, Knowledge Acquisition
Abstract: The proliferation of multimodal fake news presents a significant challenge, demanding robust detection techniques. However, existing methods face the following limitations: (1) inadequate capture of deep cross-modal semantic relationships within specific domains, (2) poor generalization across diverse news domains, and (3) difficulty integrating multi-granularity features while mitigating domain biases amplified by pre-trained models. To address these issues, in this paper we propose DEMFND, a novel Domain-Enhanced Multimodal Fake News Detection framework. Specifically, DEMFND first extracts multi-granularity features using BERT, MAE, and CLIP. Subsequently, we employ a domain-aware multi-expert feature enhancement layer to learn domain-specific characteristics, incorporate domain-aware multimodal contrastive learning for domain-conditioned semantic alignment, and utilize a domain-aware multimodal fusion mechanism with gating for adaptive integration in our DEMFND. Comparative experiments conducted on three real-world datasets (encompassing both Chinese and English) against multiple baseline models indicate that our DEMFND outperforms the best baseline (MMDFND) with improvements of 0.64% and 0.96% in accuracy and F1-score, respectively.
|
|
09:00-18:30, Paper Mo-Online.120 | |
Higher-Order Singular Value Decomposition-Based Adaptive Complex-Valued MRI Denoising with Effective Phase Preservation |
|
Wang, Xueyi | University of Science and Technology of China |
Zuo, Yonglai | University of Science and Technology of China |
Zhang, Lingtong | University of Science and Technology of China |
Qiu, Bensheng | University of Science and Technology of China |
Keywords: Image Processing and Pattern Recognition, Optimization and Self-Organization Approaches
Abstract: Magnetic Resonance Imaging (MRI) is a powerful non-invasive imaging modality widely used in clinical diagnosis. However, its image quality and diagnostic reliability are often degraded by thermal and physiological noise. Over the years, numerous MRI denoising methods have been proposed, most of which rely on magnitude images reconstructed from complex-valued data. This practice discards valuable phase information and inter-channel structural correlations, while also transforming the underlying noise distribution (e.g., from Gaussian to Rician or noncentral chi-squared), thereby complicating accurate noise modeling, limiting denoising performance, and potentially affecting downstream tasks. To address these issues, we propose Adaptive Complex-valued Higher-Order Singular Value Decomposition (AC-HOSVD), a denoising framework for multi-channel complex-valued MRI that accurately models noise, preserves phase, and exploits inter-channel correlations. Our method employs a multi-scale adaptive block strategy and automatic parameter optimization to handle spatial structural variability. A dual soft-thresholding strategy further improves noise suppression while preserving fine anatomical details. Extensive experiments on both MRXCAT-simulated and in-house multi-channel complex-valued MRI datasets demonstrate that AC-HOSVD consistently outperforms state-of-the-art methods in peak signal-to-noise ratio (PSNR), structural similarity index (SSIM), and visual quality, achieving average gains of +1.26 dB in PSNR and +0.014 in SSIM over the best competing method.
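The abstract's core operation, shrinking transform-domain coefficients of complex-valued patch tensors, can be sketched with a plain HOSVD plus soft-thresholding, as below. The single global threshold and fixed block size replace the paper's adaptive multi-scale blocks and dual thresholds, so this is only a minimal illustration.

```python
import numpy as np

def unfold(X, mode):
    """Mode-n unfolding of a tensor."""
    return np.moveaxis(X, mode, 0).reshape(X.shape[mode], -1)

def soft(x, tau):
    """Complex-valued soft-thresholding (shrinks magnitude, keeps phase)."""
    mag = np.abs(x)
    return np.where(mag > tau, (1 - tau / np.maximum(mag, 1e-12)) * x, 0)

def hosvd_denoise(X, tau):
    """Sketch of HOSVD denoising: project onto factor bases, shrink the core, reconstruct."""
    factors = []
    for mode in range(X.ndim):
        U, _, _ = np.linalg.svd(unfold(X, mode), full_matrices=False)
        factors.append(U)
    core = X
    for mode, U in enumerate(factors):                # core = X x_n U_n^H
        core = np.moveaxis(np.tensordot(U.conj().T, np.moveaxis(core, mode, 0), axes=1), 0, mode)
    core = soft(core, tau)                            # suppress noise-dominated coefficients
    rec = core
    for mode, U in enumerate(factors):                # fold back: rec = core x_n U_n
        rec = np.moveaxis(np.tensordot(U, np.moveaxis(rec, mode, 0), axes=1), 0, mode)
    return rec

noisy = np.random.randn(8, 8, 4) + 1j * np.random.randn(8, 8, 4)   # toy complex block (x, y, coil)
print(np.allclose(hosvd_denoise(noisy, 0.0), noisy))                # tau = 0 reproduces the input
```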
|
|
09:00-18:30, Paper Mo-Online.121 | |
LRS-GGCN: An Influence-Sensitive Subgraph-Based Approach for Detecting Android Malware |
|
Zhang, Shuhui | Shandong Computer Science Center (National Supercomputer Center |
Song, Xinru | Qilu University of Technology(Shandong Academy of Sciences) |
Wang, Lianhai | Shandong Computer Science Center (National Supercomputer Center |
Xu, Shujiang | Qilu University of Technology |
Shao, Wei | Qilu University of Technology (Shandong Academy of Sciences) |
Wang, Qizheng | Qilu University of Technology (Shandong Academy of Sciences) |
Keywords: Neural Networks and their Applications, Information Assurance and Intelligence, Deep Learning
Abstract: The open architecture and inherent flexibility of the Android platform make it a prime target for malicious code penetration, posing huge security risks to users. Current graph-based detection methods have two limitations: high computational overhead and reliance on single-dimensional feature representation, which leads to poor detection performance. To address these challenges, this paper proposes a malicious code detection framework based on impact-sensitive subgraphs (LRS-GGCN), which integrates multi-level feature fusion and graph pruning techniques to improve efficiency and accuracy. First, the impact-sensitive subgraph is constructed using the LeaderRank algorithm and sensitive APIs to improve detection efficiency. Second, node-level features are enriched by fusing opcode structure and API semantic embedding to capture the syntactic and contextual properties of malicious code. At the edge level, API call frequency is combined to model interprocedural interactions and enhance the representation capability of the graph. Finally, a gated graph convolutional network (GGCN) synthesizes these heterogeneous features to achieve efficient and accurate malicious code detection. Experiments show that our method achieves an accuracy of 99.28%. In addition, the training time is reduced by about 90% compared to the unpruned baseline.
|
|
09:00-18:30, Paper Mo-Online.122 | |
A Shiitake Mushroom Fruiting Body Detection Method Based on Multi-Modal Feature Fusion |
|
Wang, Fengyun | Shandong Academy of Agricultural Sciences |
Wang, Xuanyu | Qilu University of Technology (Shandong Academy of Sciences) |
Lv, Xiangyu | Qilu University of Technology (Shandong Academy of Sciences) |
Li, Xiaolong | Qilu University of Technology |
Keywords: AI and Applications, Image Processing and Pattern Recognition, Deep Learning
Abstract: The detection of shiitake mushroom fruiting bodies under the cultivation mode of mushroom logs is a critical step in factory-based production. However, existing methods primarily rely on RGB images for feature extraction, which limits the perception of spatial variations such as growth positions and height differences on the logs, leading to insufficient location information and reduced detection accuracy. To address these limitations, this paper proposes a lightweight multi-modal feature fusion detection network (LMFD) based on YOLOv11. First, a dual-backbone architecture is constructed to independently extract color and spatial features, effectively mitigating feature interference caused by discrepancies between modalities. To reduce the computational overhead introduced by the dual backbone, a lightweight feature extraction module named C3k2 Partial Convolution (C3k2P) is designed by replacing standard convolutions with partial convolutions, significantly decreasing the number of parameters and computational complexity. Furthermore, a Cross-domain Dual Attention Fusion (CDAF) module is introduced, which combines spatial and channel attention mechanisms to adaptively enhance multi-modal feature interaction, thereby mitigating feature imbalance and reducing interference caused by depth image noise and voids. Finally, two shiitake mushroom image datasets were constructed for experimentation. Results on the self-built multi-modal dataset (MD) demonstrate that LMFD achieves notable performance improvements, with the mean Average Precision (mAP50) increasing from 87.8% to 91.1%, while maintaining only 12.70M parameters and 27.0 GFLOPs. These results confirm the effectiveness of the proposed model for shiitake mushroom fruiting body detection.
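Since C3k2P is built on partial convolutions, a minimal partial-convolution block in the FasterNet style is sketched below: only a fraction of the channels are convolved and the rest are passed through, which is what keeps the parameter and FLOP counts low. The channel ratio and kernel size are assumptions.

```python
import torch
import torch.nn as nn

class PartialConv(nn.Module):
    """Sketch of a partial convolution: convolve a fraction of channels, pass the rest through."""
    def __init__(self, channels, ratio=0.25, kernel_size=3):
        super().__init__()
        self.conv_ch = int(channels * ratio)            # channels that actually get convolved
        self.conv = nn.Conv2d(self.conv_ch, self.conv_ch, kernel_size,
                              padding=kernel_size // 2, bias=False)

    def forward(self, x):
        x1, x2 = torch.split(x, [self.conv_ch, x.size(1) - self.conv_ch], dim=1)
        return torch.cat([self.conv(x1), x2], dim=1)    # identity path keeps the cost low

x = torch.randn(1, 64, 40, 40)
print(PartialConv(64)(x).shape)                          # torch.Size([1, 64, 40, 40])
```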
|
|
09:00-18:30, Paper Mo-Online.123 | |
MFGAF: Multi-Faceted Granular Analysis Framework for LLM-Generated Text Detection |
|
Chen, Jia | Sichuan University |
Wang, Haizhou | Sichuan University |
Keywords: Machine Learning, Deep Learning, Application of Artificial Intelligence
Abstract: Large Language Models (LLMs) have become increasingly proficient in generating human-like text, yet their widespread deployment raises significant concerns, including disseminating fake information, privacy violations, and academic dishonesty. Detecting LLM-generated text is vital for mitigating these risks and is often framed as a binary classification task. Although zero-shot textual analysis methods have gained popularity due to their generalizability, they face key challenges: (1) limited ability to capture holistic and granular consistency and (2) insufficiently comprehensive textual analysis, particularly regarding syntax and lexical patterns. To address these problems, we propose a novel Multi-faceted Granular Analysis Framework (MFGAF), which leverages Rewriting Concordance and Completive Concordance to detect LLM-generated text through multi-granular textual dissection. Specifically, MFGAF is designed with two perspectives, Rewriting and Completion, to comprehensively capture global and local LLM-generated features. Additionally, for each perspective, a Multi-Granular Textual Dissection mechanism is constructed to thoroughly analyze LLM-generated text. Finally, MFGAF leverages LLMs to achieve adaptive integration of text analysis results and reflect on the correctness of these results. Our method achieves an average F1 improvement of 3.06% compared to the best baselines, demonstrating its robustness and effectiveness through extensive evaluations.
|
|
09:00-18:30, Paper Mo-Online.124 | |
SRDNet: Style Representation Disentanglement Network for Few Shot Semantic Segmentation |
|
Xia, Yuxin | Inner Mongolia Normal University |
Yin, Yanjun | Inner Mongolia Normal University,College of Computer Scie |
Xu, Qiaozhi | Inner Mongolia Normal University |
Zhi, Min | Inner Mongolia Normal University |
Keywords: Image Processing and Pattern Recognition, Deep Learning, Machine Learning
Abstract: Few-Shot Semantic Segmentation (FSS) effectively segments new classes with limited data. However, the often-overlooked style differences between support and query sets can lead to feature shifts, disrupting accurate feature matching due to inadequate abstraction in mid-level features. To tackle this challenge, we introduce a novel network for disentangling style representations from a frequency perspective. Specifically, we introduce a parameter-free Adaptive Style Fourier Alignment module that performs regional frequency replacement to generate style-aligned pseudo-support images. To further refine style adaptation, we construct Style-Aware Prototypes and employ a Style Modulation Module that selectively adjusts the query features based on low-frequency modulation via wavelet transform to preserve edge details. Extensive experiments on benchmark datasets demonstrate the effectiveness of our approach, yielding mIoU improvements of 1.91% in the 1-shot setting and 1.54% in the 5-shot configuration.
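The "regional frequency replacement" step can be pictured with a simple Fourier amplitude swap: the low-frequency amplitude of the support image is replaced by the query's while the support phase is kept, producing a style-aligned pseudo-support image. The grayscale input and replacement radius below are assumptions, not the authors' settings.

```python
import numpy as np

def fourier_style_align(support, query, radius=0.1):
    """Replace the low-frequency amplitude of `support` with that of `query`,
    keeping the support phase, to produce a style-aligned pseudo-support image."""
    fs, fq = np.fft.fftshift(np.fft.fft2(support)), np.fft.fftshift(np.fft.fft2(query))
    amp_s, phase_s, amp_q = np.abs(fs), np.angle(fs), np.abs(fq)
    h, w = support.shape
    cy, cx = h // 2, w // 2
    ry, rx = int(radius * h), int(radius * w)
    amp_s[cy - ry:cy + ry + 1, cx - rx:cx + rx + 1] = amp_q[cy - ry:cy + ry + 1, cx - rx:cx + rx + 1]
    aligned = np.fft.ifft2(np.fft.ifftshift(amp_s * np.exp(1j * phase_s)))
    return np.real(aligned)

sup, qry = np.random.rand(64, 64), np.random.rand(64, 64)
print(fourier_style_align(sup, qry).shape)   # (64, 64)
```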
|
|
09:00-18:30, Paper Mo-Online.125 | |
DKF-RAG: Dynamic Knowledge Fusion-Enhanced Retrieval-Augmented Generation |
|
Wei, Yingying | University of Science and Technology of China |
Li, Weihai | University of Science and Technology of China |
Han, JingXuan | University of Science and Technology of China |
Keywords: AI and Applications, Artificial Social Intelligence, Neural Networks and their Applications
Abstract: Retrieval-Augmented Generation (RAG) mitigates hallucinations in Large Language Models (LLMs) by incorporating external knowledge into open-domain question answering. However, for complex multi-hop questions, mainstream iterative retrieval methods struggle with conflicts between internal and retrieved knowledge, as well as with noisy external documents. Achieving flexible and effective integration of internal and external knowledge remains a research challenge. In this paper, we propose a dynamic knowledge fusion-based RAG framework (DKF-RAG) that can adaptively adjust its knowledge fusion strategy during iterative reasoning. Specifically, DKF-RAG iteratively decomposes complex multi-hop questions into simpler sub-questions, forming a reasoning chain. At each iteration, it evaluates the relevance and consistency of external knowledge in real time and, based on this, adaptively triggers fusion operations to integrate internal and external information. We conduct experiments on four multi-hop QA datasets and compare DKF-RAG with multiple iterative retrieval-augmented baselines.
|
|
09:00-18:30, Paper Mo-Online.126 | |
A Disease Feature Fusion and Drug Constraint Decision GNN Model for Personalized Drug Recommendation |
|
Kong, Zan | Qilu University of Technology (Shandong Academy of Sciences) |
Wan, Wentong | Qilu University of Technology (Shandong Academy of Sciences) |
Zhao, Jing | Qilu University of Technology (Shandong Academy of Sciences) |
Liu, Song | Qilu University of Technology (Shandong Academy of Sciences) |
Keywords: Application of Artificial Intelligence, Deep Learning, Neural Networks and their Applications
Abstract: Personalized drug recommendation is a process that analyzes the individual characteristics of patients to predict their most suitable drug plan. However, existing drug recommendation methods use an information fusion approach based on horizontal vector concatenation, which is computationally complex and introduces unnecessary noise. Furthermore, previous models rely only on the relationship between drugs and symptoms to recommend drugs from symptoms, ignoring the semantic interaction information between diseases and symptoms. In addition, existing methods cannot clearly determine which drugs should be retained or removed when faced with drug interaction conflicts, making it difficult for them to reduce drug-drug interactions (DDI) in the recommendation. To solve these problems, we propose a personalized drug recommendation model named DFDC-GNN. In the model, we design a dimensionality reduction processing module to achieve lower-dimensional semantic information fusion and obtain a comprehensive embedding representation for drug recommendation. Furthermore, we introduce a disease representation module and use a graph attention network to capture high-order semantic relationships among symptom, disease, and drug nodes. Moreover, we design a drug constraint decision mechanism to effectively extract the interaction relationships between drugs and reasonably control DDI in the drug combination. Finally, we conducted experiments on the MIMIC-III dataset to compare with other baseline models.
|
|
09:00-18:30, Paper Mo-Online.127 | |
Semi-Supervised Domain Adaptation for Medical Image Segmentation Via Local-Global Hybrid Dual-Teacher Collaborative Distillation |
|
Yuan, Yuyang | Inner Mongolia Normal University |
Xu, Qiaozhi | Inner Mongolia Normal University |
Yu, Lei | Inner Mongolia Normal University |
Keywords: Transfer Learning, Image Processing and Pattern Recognition, Machine Learning
Abstract: Unsupervised domain adaptation (UDA) uses labeled data from a source domain to align with the target-domain distribution, but its performance is limited by the lack of supervision on the target domain, leaving a large performance gap compared with fully supervised methods. Semi-supervised learning (SSL) combines limited labeled data with abundant unlabeled data, but assumes identical distributions between the labeled and unlabeled data and therefore fails to address the cross-domain challenges of medical imaging. Targeting these limitations, this paper proposes a semi-supervised domain adaptation (SSDA) framework based on local-global hybridization and dual-teacher collaborative distillation, aiming to enhance the robustness and accuracy of medical image segmentation. The main contributions are summarized as follows: (1) FMix is introduced to exploit frequency-domain information and generate masked regions that mimic real organ structures, enabling the model to focus on local features; (2) a confusion-based mixing strategy is incorporated to enhance domain invariance by simulating inter-domain transformations, guiding the model to prioritize global features.
|
|
09:00-18:30, Paper Mo-Online.128 | |
Contrastive Cross-Modal Prototype Prediction and Fusion for Video Anomaly Detection |
|
Wang, Junqiao | Sun Yat-Sen University |
Peng, Jiawen | Sun Yat-Sen University |
Chen, Jiaxin | Sun Yat-Sen University |
Ma, Jinhua | Sun Yat-Sen University |
Keywords: Deep Learning, Image Processing and Pattern Recognition, Machine Vision
Abstract: Video Anomaly Detection (VAD) identifies unexpected events by learning normal behavior from surveillance footage, assuming only normal training data is available. Previous methods, which focus on frame reconstruction or prediction tasks, are constrained by insufficient semantic sensitivity due to a reliance on pixel-level errors and inadequate adaptability to diverse normal patterns. Although contrastive learning methods attempt to address these issues by learning normal subcategories through manually constructed positive-negative pairs, they still encounter semantic ambiguity arising from inappropriate contrastive strategies. To overcome these limitations, we propose a Contrastive Cross-modal Prototype Prediction and Fusion (C2P2F) framework. Specifically, our method comprises three stages: (1) First, the Contrastive Prototype Prediction (CPP) module separately learns normal patterns of appearance and motion on RGB frames and optical flow inputs. Without constructing positive-negative pairs, we perform a cross-view prototype prediction task to discern inherent normal patterns within the data. (2) Then, the Cross-Modal Prototype Fusion (CMPF) conducts alternative training with RGB and optical flow inputs to establish comprehensive normal representations from two complementary modalities, enforcing consistency between cross-modal prototypes and learning semantically rich normal patterns. (3) Finally, the Prototype Number Adjustment (PNA) module is employed to mitigate initialization bias. Integrating these components, our approach adaptively models diverse normalcy and enhances anomaly discrimination via joint optimization of prototype stability and cross-modal consistency, with experiments on three benchmarks showing its superiority.
|
|
09:00-18:30, Paper Mo-Online.129 | |
DB-SKDNet: Dual-Branch Self-Knowledge Distillation with Multi-Granularity Perturbations for Robust Semi-Supervised Change Detection |
|
Cai, Yulin | Guilin University of Electronic Technology |
Yang, Le | Guilin University of Electronic Technology |
Zhao, Bin | Guilin University of Electronic Technology |
Keywords: AI and Applications, Deep Learning, Image Processing and Pattern Recognition
Abstract: Remote sensing image change detection is critical for dynamic surface monitoring in applications such as disaster assessment and urban expansion analysis. However, existing methods face two key limitations: 1) traditional image perturbations are vulnerable to illumination and cloud noise, degrading pseudo-label reliability, and 2) single-granularity augmentation strategies fail to model complex change patterns, limiting their generalization. This paper proposes a semi-supervised change detection framework that combines dual-branch self-knowledge distillation and multi-granularity perturbation augmentation. By introducing temporal feature alignment, the teacher-student architecture generates consistent soft labels while combining image-level blur augmentation and channel-wise dropout strategies to enhance discriminative feature robustness. Extensive experiments demonstrate the framework's effectiveness under limited supervision: on the WHU-CD dataset, it achieves an F1-score of 85.3% using only 5% labeled data. Notably, with just 10% annotations on the LEVIR-CD dataset, the proposed method nearly matches fully supervised performance. The proposed solution provides valuable insights into semi-supervised learning paradigms for geospatial artificial intelligence. The code is available at https://github.com/cyl1238685387/DB-SKDNet/.
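Dual-branch self-knowledge distillation of this kind typically relies on exponential-moving-average (EMA) teachers updated from the student; a generic sketch is shown below, with the two perturbation pairings and momentum values as hypothetical placeholders rather than the paper's settings.

```python
import copy
import torch
import torch.nn as nn

@torch.no_grad()
def ema_update(teacher: nn.Module, student: nn.Module, momentum: float = 0.99):
    """Exponential-moving-average update of a teacher from the student (self-distillation)."""
    for t_param, s_param in zip(teacher.parameters(), student.parameters()):
        t_param.mul_(momentum).add_(s_param, alpha=1.0 - momentum)

student = nn.Conv2d(3, 16, 3)              # toy stand-in for the change-detection student
teacher_img = copy.deepcopy(student)       # e.g. paired with image-level blur perturbations
teacher_feat = copy.deepcopy(student)      # e.g. paired with channel-wise dropout perturbations
for step in range(3):                      # after each student optimisation step
    ema_update(teacher_img, student, momentum=0.99)
    ema_update(teacher_feat, student, momentum=0.999)
```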
|
|
09:00-18:30, Paper Mo-Online.130 | |
NAHID: Node-Level Host Intrusion Detection Based on Provenance Graph |
|
Li, Wentao | Institute of Information Engineering, Chinese Academy of Science |
Wei, Xingyuan | University of Chinese Academy of Sciences |
Lv, Qiujian | Institute of Information Engineering, Chinese Academy of Science |
Wang, Yan | Institute of Information Engineering, Chinese Academy of Science |
Li, Ning | Institute of Information Engineering, Chinese Academy of Science |
Yu, Ziyang | Institute of Information Engineering, Chinese Academy of Science |
Keywords: Deep Learning, Information Assurance and Intelligence, Neural Networks and their Applications
Abstract: Attacks, including program exploits, malware implantation, and targeted intrusions such as advanced persistent threats (APTs), are increasingly used by modern adversaries. Recently, provenance-based host intrusion detection systems (HIDS) have gained significant attention due to their superior performance in detecting complex attacks at the system level. Despite their potential, many existing approaches either do not fully utilize the rich information available in raw data or overlook the evolving topological structure of system behavior over time. Furthermore, some approaches lack the fine granularity required for intrusion detection. These deficiencies may result in high false positives or false negatives. To address these limitations, we designed an anomaly-based node-level host intrusion detection system, NAHID, which does not require prior knowledge of attack patterns. It begins by extracting important entities (e.g., processes, files) and their interactions (e.g., read and write operations) from raw host data to construct a provenance graph. To effectively model the dynamic behavior of nodes over time, NAHID leverages temporal graph neural networks for graph representation learning, capturing both complex interaction patterns and temporal behavior evolution. The detection task is then formulated as a node-level multi-class classification problem. To address the inherent class imbalance present in host provenance data, we incorporate a class-weighted loss function that enhances the model’s ability to recognize minority-class anomalies. We evaluated our model on three public datasets and demonstrated that it outperforms our baseline host intrusion detection system. In addition, we also evaluated our model's runtime and conducted ablation experiments.
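The class-weighted loss mentioned in the abstract can be realized, for instance, with inverse-frequency weights passed to a standard cross-entropy criterion; the class counts below are hypothetical and this weighting rule is only one common choice, not necessarily the authors' exact formula.

```python
import torch
import torch.nn as nn

# Hypothetical class counts for a node-level multi-class problem dominated by benign nodes.
class_counts = torch.tensor([50000., 120., 45., 8.])                 # benign vs. three attack types
weights = class_counts.sum() / (len(class_counts) * class_counts)    # inverse-frequency weighting
criterion = nn.CrossEntropyLoss(weight=weights)

logits = torch.randn(32, 4)                  # node embeddings mapped to class scores
labels = torch.randint(0, 4, (32,))
loss = criterion(logits, labels)             # rare attack classes contribute more to the loss
print(weights, loss.item())
```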
|
|
09:00-18:30, Paper Mo-Online.131 | |
MSADNet: Multi-Scale Adaptive Dual Attention Network for Multivariate Time Series Anomaly Detection |
|
Lyu, Xin | Hohai University |
Yang, Yuyan | Hohai University |
Zhang, Chao | Information Center, Ministry of Water Resources |
Huang, Yucong | Hohai University |
Keywords: Deep Learning, Application of Artificial Intelligence, Representation Learning
Abstract: Multivariate time series anomaly detection requires accounting for the characteristics of different variables. It involves extracting and aggregating temporal features and dependencies at multiple scales. To capture local temporal patterns and long-term dependencies in multivariate time series, most existing methods subdivide the time series into patches of different sizes for multi-scale modeling. However, these methods still face the following challenges: 1) they typically use fixed-size patches rather than adapting the patch size to different time-series patterns; 2) after segmenting the time series into multiple patches, they often struggle to model both local and global dependencies. To address these problems, we propose a Multi-Scale Adaptive Dual-Attention Network (MSADNet) for multivariate time series anomaly detection. Specifically, an adaptive weight allocation mechanism is proposed to dynamically adjust the patch size and further enable adaptive multi-scale feature extraction and aggregation.
|
|
09:00-18:30, Paper Mo-Online.132 | |
A Feedback-Driven Framework That Synergizes GRASP with LLMs for Discrete Prompt Optimization |
|
Zhou, Xin | Hebei University |
Liang, Xiaoyan | Hebei University |
Du, Ruizhong | Hebei University |
Wang, Ziyuan | Hebei University |
Keywords: Application of Artificial Intelligence, Machine Learning, Metaheuristic Algorithms
Abstract: While prompt engineering is crucial for leveraging large language models (LLMs), existing optimization methods struggle to balance exploration of the vast prompt space with exploitation of high-quality candidates. To address this imbalance, we propose GRASP-Feedback Prompt (GFPrompt), a novel framework that integrates a dynamic grouping strategy with a teacher-model feedback mechanism. GFPrompt partitions prompts based on their performance: low-scoring prompts are sent to a global exploration operator to ensure diversity, while high-scoring ones undergo local refinement to enhance quality. A teacher model further provides semantic feedback to accelerate convergence. Experiments on six NLP tasks demonstrate that GFPrompt significantly outperforms standard baselines, expert-designed prompts, and state-of-the-art automated methods by up to 10.6%, 6.6%, and 1.4%, respectively. Furthermore, it reduces token consumption by up to 10.3% compared to leading evolutionary approaches under the same iteration count. Our results validate GFPrompt’s effectiveness in achieving superior prompt quality and computational efficiency.
|
|
09:00-18:30, Paper Mo-Online.133 | |
YOLO-FHE: A Lightweight and Efficient Feature Aggregation Network for Object Detection in Remote Sensing |
|
Lei, Deyu | Henan University |
Zhao, Bingbing | Henan University |
Yanxu, Mao | Henan University |
Li, Wentao | Institute of Information Engineering, Chinese Academy of Science |
You, Datao | Henan University |
Keywords: Image Processing and Pattern Recognition, Machine Vision, Deep Learning
Abstract: Object detection is one of the research hotspots in the application of remote sensing images, and significant progress has been made in the past decade. However, it still faces challenges from dense small targets and multi-scale targets. Therefore, improving the detection capability and robustness of object detection models in scenarios involving multi-scale targets and dense small targets remains a key research focus. Recent approaches typically leverage improvements in feature fusion, attention mechanisms, and contextual information to enhance object detection performance, but they do not perform well in remote sensing object detection. It is worth noting that remote sensing object detection often requires real-time processing in practical applications, particularly when running models on resource-constrained edge devices. This poses a significant challenge in balancing performance improvement with model lightweighting and efficiency. To address this problem, this paper proposes a lightweight model, YOLO-FHE, which contains three innovative modules: the Feature Selective Aggregation with Downsampling module (FSAD), the C2f-Heterogeneous Receptive Field Feature Extraction Module (C2f-HRFEM), and the Efficient Multi-Pool Channel Attention (EMCA) mechanism. The first module is designed to preserve fine-grained details through feature aggregation. The second module employs multi-branch heterogeneous receptive fields to enhance multi-scale feature learning. The third module strengthens target responses by integrating dual pooling and one-dimensional convolution. Experimental results indicate that the mean average precision (mAP) of the proposed YOLO-FHE model increased by 2.5% on the SIMD dataset, 3.8% on the NWPU-10 dataset, and 2% on the RSOD dataset compared with the baseline models. Furthermore, the model demonstrates superior detection performance compared to other mainstream models.
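Going only by the description of EMCA as dual pooling plus one-dimensional convolution, an ECA-style channel-attention sketch is given below; the kernel size and the way the average- and max-pooled vectors are combined are assumptions.

```python
import torch
import torch.nn as nn

class DualPoolChannelAttention(nn.Module):
    """Sketch of an EMCA-like block: average + max pooling followed by a 1-D convolution
    across channels (no fully connected layers), then a sigmoid channel gate."""
    def __init__(self, kernel_size=3):
        super().__init__()
        self.conv = nn.Conv1d(1, 1, kernel_size, padding=kernel_size // 2, bias=False)

    def forward(self, x):                                    # x: (B, C, H, W)
        avg = x.mean(dim=(2, 3))                             # (B, C) average-pooled descriptor
        mx = x.amax(dim=(2, 3))                              # (B, C) max-pooled descriptor
        y = self.conv(avg.unsqueeze(1)) + self.conv(mx.unsqueeze(1))   # (B, 1, C)
        gate = torch.sigmoid(y).squeeze(1)[..., None, None]  # (B, C, 1, 1)
        return x * gate                                      # strengthen target responses

x = torch.randn(2, 64, 20, 20)
print(DualPoolChannelAttention()(x).shape)                   # torch.Size([2, 64, 20, 20])
```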
|
|
09:00-18:30, Paper Mo-Online.134 | |
Rehearsal-Free Federated Continual Learning: An Orthogonal Projection Approach |
|
Cui, Hualong | Inner Mongolia University |
Yongqiang, Gao | Inner Mongolia University |
Liu, Yongmei | Inner Mongolia University |
Pang, Mingyu | Inner Mongolia University |
Keywords: Machine Learning
Abstract: Federated Continual Learning (FCL) aims to address the issue of Global Catastrophic Forgetting (GCF) in Federated Learning (FL) under dynamic data scenarios. Although various approaches have been proposed in FCL research to mitigate forgetting, most of them are rehearsal-based, relying on replaying historical data while learning the new task. However, this contravenes the principle of data forgettability. To address this issue, this paper proposes a rehearsal-free federated continual learning framework, namely Federated Orthogonal Projection with Projection Compensation (FOP-PC). The framework extracts the Global Input Subspace (GIS, the important representation space of old tasks) and constrains the parameter updates of the new task to a space orthogonal to the GIS, thereby minimizing interference from the new task on old tasks and avoiding global catastrophic forgetting. Meanwhile, the framework constructs a Similar Input Subspace (SIS) for the new task, each layer of which is composed of the corresponding layer from the GIS of the most similar old task, and computes a projection compensation to facilitate the learning of the new task. Experiments show that, on three different datasets, FOP-PC outperforms the state-of-the-art FCL method, achieving accuracy improvements of 1.23%, 7.97%, and 5.44%, and reducing forgetting by 0.1%, 1.12%, and 0.22%.
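The orthogonality constraint at the heart of FOP-PC can be illustrated with a plain subspace projection: the new-task update is stripped of its component inside the old-task input subspace. The random basis below stands in for the paper's GIS extraction, which is not detailed in the abstract.

```python
import numpy as np

def project_orthogonal(update, basis):
    """Remove from `update` its component inside the old-task input subspace (columns of `basis`),
    so the new-task step interferes as little as possible with what was already learned."""
    # basis: (d, k) with orthonormal columns spanning the Global Input Subspace (GIS)
    return update - basis @ (basis.T @ update)

d, k = 128, 10
basis, _ = np.linalg.qr(np.random.randn(d, k))      # stand-in orthonormal basis of the GIS
grad = np.random.randn(d)                           # new-task gradient for one layer
grad_orth = project_orthogonal(grad, basis)
print(np.allclose(basis.T @ grad_orth, 0))          # True: no component left along the GIS
```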
|
|
09:00-18:30, Paper Mo-Online.135 | |
Non-Rigid 3D Model Classification Based on Laplace-Beltrami Eigenfunctions |
|
Nie, Huijia | University of Jinan |
Niu, Dongmei | University of Jinan |
Diao, Zhenyu | University of Jinan |
Han, Xiaofan | University of Jinan |
Zhang, Chengyu | University of Jinan |
Keywords: Image Processing and Pattern Recognition, Machine Vision, Multimedia Computation
Abstract: While view-based methods have achieved remarkable progress in rigid 3D model classification, they face significant challenges when applied to non-rigid 3D model classification. The inherent deformations of non-rigid 3D models, such as bending, stretching, and compression, pose challenges to conventional methods for generating 2D views, which typically capture only a single type of information, such as illumination or depth. Consequently, these methods struggle to effectively capture the intrinsic and key features of non-rigid 3D models. To address this limitation, we propose a novel multi-view deep learning framework for classifying non-rigid 3D models. Unlike existing methods that generate 2D views from limited information, our method generates 2D views from geometric spectral features. Specifically, we use multiple Laplace-Beltrami eigenfunctions to render a series of 2D views of a model, which are subsequently fused into a unified input space.
|
|
09:00-18:30, Paper Mo-Online.136 | |
A Dual-Level Consistency Framework with Prototype Contrastive Learning for Semi-Supervised Semantic Segmentation |
|
Zou, Zhiqiang | Hohai University |
Lyu, Xin | Hohai University |
Fang, Yiwei | Hohai University |
Li, Xin | Hohai University |
Keywords: Artificial Social Intelligence, Image Processing and Pattern Recognition, Deep Learning
Abstract: Semi-supervised semantic segmentation aims to improve segmentation performance by leveraging a limited amount of labeled data along with a large set of unlabeled data. Existing methods mainly focus on enforcing image-level consistency between weakly and strongly augmented images in high-confidence regions. However, such approaches often neglect class similarity in feature space and discard low-confidence regions, leading to suboptimal representation learning. To address these issues, we propose a Dual-Level Consistency Framework with Prototype Contrastive Learning (DLC-PC). First, the prototype contrastive learning (PCL) is designed to enhance intra-class consistency and inter-class discrepancy of feature distribution. Specifically, representations are aligned with identical-class global prototypes, while being pushed away from different-class local prototypes. Second, the dynamic contrast threshold (DCT) is proposed to facilitate feature learning by progressively incorporating more low-confidence pixels into the contrastive learning process. Extensive experiments on two benchmarks demonstrate the state-of-the-art performance of the proposed framework.
|
|
09:00-18:30, Paper Mo-Online.137 | |
BLaVe-CoT: Consistency-Aware Visual Question Answering for Blind and Low Vision Users |
|
Cheng, Wanyin | Qufu Normal University |
Ruan, Zanxi | University of Verona |
Keywords: Image Processing and Pattern Recognition, Deep Learning, Artificial Social Intelligence
Abstract: Visual Question Answering (VQA) holds great potential for assisting Blind and Low Vision (BLV) users, yet real-world usage remains challenging. Due to visual impairments, BLV users often take blurry or poorly framed photos and face difficulty in articulating specific questions about what they cannot fully see. As a result, their visual questions are frequently ambiguous, and different users may interpret them in diverse ways. This leads to multiple valid answers, each grounded in different image regions—posing a mismatch with conventional VQA systems that assume a single answer and region. To bridge this gap, we present BLaVe-CoT, a VQA framework designed to reason about answer consistency in the face of ambiguity. Our method proposes diverse candidate answers using a LoRA-tuned BLIP-2 model, then grounds each answer spatially using PolyFormer, and finally applies a chain-of-thought reasoning module to assess whether the answers refer to the same or different regions. Evaluated on the VQA-AnswerTherapy benchmark, BLaVe-CoT outperforms previous methods and proves more robust to the ambiguity and visual noise common in assistive settings. This work highlights the need for VQA systems that can adapt to real human uncertainty and provide inclusive support for BLV users. To foster further research and accessibility applications, we have made the code publicly available at https://github.com/Accecwan/BLaVe-CoT.
|
|
09:00-18:30, Paper Mo-Online.138 | |
Multi-Level Feature Masking Network for Fine-Grained Visual Classification |
|
Sheng, You | Beijing University of Posts and Telecommunications |
Yang, Yu | Beijing University of Posts and Telecommunications |
Wang, Gang | Police Integration Computing Key Laboratory of Sichuan Province |
Zhou, Linna | Beijing University of Posts and Telecommunications |
Meng, Xiangli | Beijing University of Posts and Telecommunications |
Keywords: Deep Learning, Image Processing and Pattern Recognition, Neural Networks and their Applications
Abstract: Fine-grained visual classification (FGVC) requires capturing subtle distinctions between sub-categories, emphasizing both global context and local details. Existing methods often rely on bounding boxes or attention-based patch selection, but these approaches either introduce background noise or depend on a manually predefined patch number. We propose a novel Multi-level Feature Masking (MFM) architecture to address these issues by predicting spatial masks directly on intermediate feature maps, highlighting salient features while suppressing irrelevant regions without discarding information. MFM comprises three key modules: (1) Mask Feature Alignment (MFA), predicting masks to align multi-level features; (2) Cross-layer Mask Semantic Alignment (CMSA), leveraging high-level semantic information for inter-layer alignment; and (3) Graph-based Cross-layer Feature Enhancement (CFE), enriching spatial and structural representations. Extensive experiments confirm MFM's competitive performance on FGVC benchmarks and state-of-the-art accuracy on a luxury goods dataset. Code is available at https://github.com/SylU0/MFM.
|
|
09:00-18:30, Paper Mo-Online.139 | |
DMSA-Net: Deformable Multi-Head Self-Attention Network for Unsupervised Medical Image Registration |
|
Zheng, Linxiao | Sichuan Normal University |
Mu, Nan | Sichuan Normal University |
Li, Xiao-Ning | Sichuan Normal University |
Keywords: Deep Learning, Image Processing and Pattern Recognition, Neural Networks and their Applications
Abstract: Medical image registration is fundamental to clinical analysis, enabling structural comparison and the detection of pathological changes across subjects or time. Despite advances in Transformer-based models, accurate 3D registration remains challenging due to difficulties in capturing multi-scale deformations and multi-directional spatial correlations. Moreover, standard Transformer architectures are computationally expensive when applied to 3D volumes. To address these issues, we propose DMSA-Net, an unsupervised medical image registration network featuring a novel 3D Deformable Multi-Head Self-Attention mechanism. Specifically, we design a Shifted Window Deformable Attention (SWDA) module that dynamically adjusts key-value pairs within localized 3D windows, enhancing modeling of multi-scale deformations while ensuring computational efficiency. We also introduce a Multi-directional Convolution (MC) module that uses large convolution kernels applied in parallel along the height, width and depth to capture spatially diverse deformation features. Experiments on two public brain MRI datasets (e.g., LPBA and IXI) demonstrate the superiority of DMSA-Net, showing a 4.1% improvement in Dice similarity coefficient over VoxelMorph and a 2.1% gain over TransMorph on LPBA. These results confirm the effectiveness of our approach for medical image registration.
|
|
09:00-18:30, Paper Mo-Online.140 | |
Privacy-Preserving Tabular Data Generation Based on Diffusion Models |
|
Wang, Siyu | Sichuan Normal University |
Wang, Rong | Sichuan Normal University |
Feng, Chaosheng | Sichuan Normal University |
Chang, Chin-Chen | Feng Chia University |
Keywords: Deep Learning, Neural Networks and their Applications, Information Assurance and Intelligence
Abstract: With the growing demand for public data openness, sharing data while preserving data privacy has become a critical challenge. Traditional techniques, such as data anonymization and differential privacy, provide baseline privacy guarantees but face some limitations, including limited generalizability and utility degradation due to inappropriate perturbations. To overcome these limitations, this paper proposes a hybrid diffusion model for generating privacy-preserving tabular data. Unlike single-structure data generation models, our proposed approach integrates differential privacy with two lightweight generative models to effectively balance data privacy and data utility. Specifically, our approach consists of three phases: data preprocessing, privacy protection, and data generation. In the data preprocessing phase, the adaptive techniques are used for data normalization. During the privacy protection phase, Gaussian noise and randomized response mechanisms are applied to enhance data privacy. Finally, in the data generation phase, Gaussian diffusion is used for numerical attributes and multinomial diffusion for categorical attributes, which effectively handles the original data of mixed types. This design enhances both the stability of the generative model and the diversity of the synthetic data. Experiments on six public datasets demonstrate that although our approach incurs only a slight reduction in machine learning utility, measured by classification accuracy, F1 score, and regression R² scores, it greatly improves privacy metrics compared to the state-of-the-art tabular diffusion models.
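For the privacy-protection phase (Gaussian noise on numerical attributes, randomized response on categorical ones), a minimal sketch of the two standard mechanisms follows; the sensitivity, epsilon, and delta values are placeholders, and the paper's exact calibration is not specified in the abstract.

```python
import numpy as np

rng = np.random.default_rng(0)

def gaussian_mechanism(values, sensitivity, epsilon, delta):
    """Add calibrated Gaussian noise to numerical attributes (classical analytic bound)."""
    sigma = sensitivity * np.sqrt(2.0 * np.log(1.25 / delta)) / epsilon
    return values + rng.normal(0.0, sigma, size=values.shape)

def randomized_response(categories, domain, epsilon):
    """Keep each categorical value with probability e^eps / (e^eps + k - 1), else flip it."""
    k = len(domain)
    p_keep = np.exp(epsilon) / (np.exp(epsilon) + k - 1)
    out = []
    for c in categories:
        if rng.random() < p_keep:
            out.append(c)
        else:
            out.append(rng.choice([d for d in domain if d != c]))
    return out

ages = np.array([23., 45., 31.])                      # hypothetical numerical column
print(gaussian_mechanism(ages, sensitivity=1.0, epsilon=1.0, delta=1e-5))
print(randomized_response(["A", "B", "A"], domain=["A", "B", "C"], epsilon=1.0))
```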
|
|
09:00-18:30, Paper Mo-Online.141 | |
DMut: Optimize Mutation Strategy in Directed Greybox Fuzzing by Multi-Population Genetic Algorithm |
|
Wen, Tingke | NUDT |
Li, Yuwei | National University of Defense Technology |
Zhang, Lu | National University of Defense Technology |
Ma, Huimin | National University of Defense Technology |
Li, Yang | National University of Defense Technology |
Hu, Miao | National University of Defense Technology |
Pan, Zulie | NUDT |
Keywords: Optimization and Self-Organization Approaches, Computational Intelligence, Application of Artificial Intelligence
Abstract: Directed greybox fuzzing has become a crucial technique for discovering vulnerabilities in software. However, existing seed mutation strategies often fall short in optimizing testcase quality and fuzzing efficiency. In this paper, we propose DMut, a novel seed mutation strategy based on a multi-population genetic algorithm, designed to address these limitations. DMut models the seed mutation process using a genetic algorithm, optimizing the seed mutation probability distribution and iteratively evolving it to generate higher-quality testcases. The approach incorporates a well-designed fitness function and selection strategy that aligns with directed fuzzing scenarios to guide the evolution of the mutation strategy. Through comprehensive experiments on real-world CVEs, we demonstrate that DMut significantly improves the effectiveness of directed fuzzing. Compared to the widely adopted directed greybox fuzzing tool, AFLGo, DMut reduces the time to expose the target vulnerability by 41% on average. Additionally, DMut improves path exploration efficiency, covering more unique execution paths and speeding up the exploration process. In summary, the experimental results show that DMut provides a robust, efficient method for improving directed fuzzing performance, offering a significant advancement over existing approaches.
|
|
09:00-18:30, Paper Mo-Online.142 | |
PCDO and Game Model for Analyzing Attack-Defense Confrontation of Complex Network with Cascading Effect |
|
Zhang, Liying | Shandong University of Science and Technology |
Pan, Jeng-Shyang | Fujian University of Technology |
Liu, Ning | Shandong University of Science and Technology |
Zheng, Weimin | Nanjing University of Information Science and Technology |
Keywords: Heuristic Algorithms, Complex Network, Swarm Intelligence
Abstract: The development of complex networks greatly facilitates the operation of modern society, but it also brings increasingly severe cybersecurity challenges. To effectively safeguard complex networks, this paper employs game theory to analyze the attack-defense confrontation from a macro perspective. A two-player attack-defense game model considering the cascading effect is established. To solve the Nash equilibrium of the game model, this study proposes a parallel compact Dandelion Optimizer algorithm (PCDO). The number of population groups of PCDO can be adjusted to support devices with different memory capacities. The comparative experiments demonstrate that PCDO outperforms other heuristic algorithms on CEC2013 test functions. Experimental results also show that PCDO is more effective than other algorithms in solving the mixed-strategy Nash equilibrium. Furthermore, this study analyzes the impact of different attack-defense resource allocations on the payoff of the attacker.
|
|
09:00-18:30, Paper Mo-Online.143 | |
FedDPSA: A Robust Semi-Asynchronous Differentially Private Federated Learning Framework in Heterogeneous Environments |
|
Shi, Kai | Tianjin University of Technology |
Zhou, Hao | Tianjin University of Technology |
Wang, Jinsong | Tianjin University of Technology |
Keywords: Neural Networks and their Applications, Computational Intelligence, Information Assurance and Intelligence
Abstract: Federated Learning (FL) is a distributed framework that shares local model parameters to reduce privacy leakage. However, performance disparities across devices and the non-IID data distributions pose significant challenges to privacy protection, training efficiency, and model accuracy. Although differential privacy (DP) mitigates information exposure by perturbing model updates, the added noise further slows convergence and degrades accuracy under data heterogeneity. To address these problems, we propose FedDPSA, a privacy-preserving FL framework that employs a semi-asynchronous communication mechanism with adaptive aggregation weights determined by update staleness and optimization direction. Locally, FedDPSA applies a selective update strategy under DP constraints to filter out low-quality updates, thereby reducing privacy budget consumption and mitigating heterogeneity effects. Experiments on MNIST, FMNIST, and CIFAR-10 demonstrate that FedDPSA consistently outperforms existing DP-FL methods in heterogeneous settings, achieving faster convergence and higher accuracy while ensuring rigorous privacy guarantees.
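The semi-asynchronous aggregation with weights determined by update staleness and optimization direction could look roughly like the sketch below, where stale or misaligned client updates are discounted before averaging. The exponential decay and cosine gating are assumptions, not FedDPSA's actual weighting rule.

```python
import numpy as np

def aggregation_weight(staleness, client_update, global_direction, decay=0.5):
    """Illustrative weighting: discount stale updates and those misaligned with the
    current global optimisation direction (both factors are assumptions)."""
    staleness_factor = np.exp(-decay * staleness)
    cos = np.dot(client_update, global_direction) / (
        np.linalg.norm(client_update) * np.linalg.norm(global_direction) + 1e-12)
    direction_factor = max(cos, 0.0)            # ignore updates pointing the wrong way
    return staleness_factor * direction_factor

updates = [np.random.randn(10) for _ in range(4)]   # flattened client model updates
staleness = [0, 1, 3, 0]                            # rounds since each client last synchronised
global_dir = np.mean(updates, axis=0)
w = np.array([aggregation_weight(s, u, global_dir) for s, u in zip(staleness, updates)])
w = w / (w.sum() + 1e-12)                           # normalised semi-asynchronous weights
aggregated = sum(wi * ui for wi, ui in zip(w, updates))
print(w, aggregated.shape)
```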
|
|
09:00-18:30, Paper Mo-Online.144 | |
FennelChain: A Blockchain Sharding Protocol Based on a GraphPartitioning Algorithm |
|
Wang, Yu | Qilu University of Technology (Shandong Academy of Sciences) |
Wu, Xiaoming | Qilu University of Technology, Shandong Computer Science Center |
Qiao, Youwei | Shandong Shanke Intelligent Technology Co., Ltd |
Ma, Junpeng | Shandong Inspur Intelligent Medical Technology Co., Ltd |
Ma, Liang | Shandong Inspur Intelligent Medical Technology Co., Ltd |
Liu, Jie | Shandong Inspur Intelligent Medical Technology Co., Ltd |
Xu, Zan | Shandong Shanke Intelligent Technology Co., Ltd |
Liu, Xiangzhi | Qilu University of Technology (Shandong Academy of Sciences) |
|
|
09:00-18:30, Paper Mo-Online.145 | |
MFCA-UNet: A Multi-Source Feature Fusion Cross-Attention Enhancement Network for 12 Lead Electrocardiogram Classification |
|
Pang, Wei | Inner Mongolia University |
Ma, Ming | Inner Mongolia University |
Yu, MeiJu | Inner Mongolia University |
Keywords: Deep Learning, Application of Artificial Intelligence
Abstract: The intelligent monitoring and classification of electrocardiogram (ECG) signals plays a crucial role in the early diagnosis of cardiovascular diseases. Despite significant progress in deep learning-based ECG analysis, existing models still struggle to effectively capture key information from critical signal regions while neglecting the impact of patient-specific attributes on ECG signals, thereby limiting their accuracy and generalization. To address these issues, we propose MFCA-UNet, a multi-source feature fusion cross-attention enhancement network. Specifically, we integrate a multi-scale feature attention module into the UNet backbone to improve the model’s sensitivity to crucial features. Meanwhile, the lead fusion module is optimized to facilitate effective interaction between local and global information across different leads. Additionally, we design a cross-fusion encoder to represent and extract patient-specific demographic features through feature mapping, which are then cross-fused with ECG signal data. Extensive experiments on the public datasets PTB-XL and Chapman show that the proposed method performs better than current advanced electrocardiogram classification models.
|
|
09:00-18:30, Paper Mo-Online.146 | |
GEE: Graphormer-Enhanced Encoder Model for Anomaly Detection in Weighted Signed Networks |
|
Du, Hongbo | New York Institute of Technology |
Li, Zhida | New York Institute of Technology |
Keywords: Machine Learning, Neural Networks and their Applications, Deep Learning
Abstract: Complex network structures with signed and weighted relationships are common in many real-world systems. We present Graphormer-Enhanced Encoder (GEE), a transformer-based model for anomaly detection in such graphs. GEE extends Graph-BERT's subgraph batching with Graphormer-style attention, integrating a signed-edge Weisfeiler-Lehman (WL) absolute positional embedding to capture global structure and edge signs, and an edge-weight encoding within the attention mechanism to incorporate rating magnitudes. We also prove two properties of the signed WL formulation. Experiments on the Bitcoin Alpha and Bitcoin OTC networks demonstrate that GEE can effectively detect anomalies in complex weighted signed networks.
|
|
09:00-18:30, Paper Mo-Online.147 | |
Decoupled Diffusion Model for Medical Image Translation (I) |
|
Qiu, Dechao | Wuhan University of Science and Technology |
Liu, Xiaoming | Wuhan University of Science and Technology |
Yuan, Zilong | Hubei Cancer Hospital |
Tang, Jinshan | George Mason University |
Lei, Tianxiang | Wuhan University of Science and Technology |
Keywords: Deep Learning
Abstract: Medical image translation enables the generation of contrast-enhanced CT (CECT) images from non-contrast CT (NCCT) scans, reducing dependence on iodinated contrast agents (ICAs) and minimizing associated health risks. Although diffusion models outperform generative adversarial networks (GANs) in medical image translation, challenges remain in improving sampling speed and restoring anatomical details. To address these issues, we propose a Decoupled Diffusion Model (DDM). Specifically, we first decouple the input image into high-frequency and low-frequency components via discrete wavelet transform (DWT) to enable parallel computing for accelerated sampling. Furthermore, we design a Wavelet UNet (WUNet) to enhance the recovery of anatomical details by leveraging the multi-scale representation capabilities of wavelet transform. Extensive experiments on two clinical datasets, TAP-CT and Coltea-Lung-CT-100W, demonstrate the superior performance of our method, indicating its potential for real-world clinical translation.
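To make the decoupling step concrete, here is a minimal single-level 2-D Haar DWT that splits an image into one low-frequency approximation and three high-frequency detail sub-bands, which can then be processed in parallel. The Haar basis and single decomposition level are assumptions, since the abstract does not state which wavelet is used.

```python
import numpy as np

def haar_dwt2(img):
    """One-level 2-D Haar DWT: split an even-sized image into a low-frequency approximation
    and three high-frequency detail sub-bands (horizontal, vertical, diagonal)."""
    a = (img[0::2, :] + img[1::2, :]) / np.sqrt(2)   # rows: low-pass
    d = (img[0::2, :] - img[1::2, :]) / np.sqrt(2)   # rows: high-pass
    ll = (a[:, 0::2] + a[:, 1::2]) / np.sqrt(2)      # low-low (approximation)
    lh = (a[:, 0::2] - a[:, 1::2]) / np.sqrt(2)
    hl = (d[:, 0::2] + d[:, 1::2]) / np.sqrt(2)
    hh = (d[:, 0::2] - d[:, 1::2]) / np.sqrt(2)
    return ll, (lh, hl, hh)

ct_slice = np.random.rand(256, 256)                  # toy stand-in for an NCCT slice
low, highs = haar_dwt2(ct_slice)
print(low.shape, [h.shape for h in highs])           # (128, 128) for every sub-band
```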
|
|
09:00-18:30, Paper Mo-Online.148 | |
Leveraging Long Method Decomposition to Improve Large Language Model-Based Test Case Generation |
|
Qi, Rongzhi | Hohai University |
Shen, Zhiyu | Hohai University |
Li, Yadi | Hohai University |
Keywords: Application of Artificial Intelligence, AI and Applications
Abstract: Recent studies have demonstrated the potential of large language models (LLMs) in test case generation. However, LLMs often struggle to achieve high levels of test coverage when generating test cases for long methods. Long methods are one of the typical manifestations of code smells, characterized by excessive lines of code, complex control flows, deep nesting levels, and numerous variables. These characteristics make it difficult for traditional testing tools to cover most of the lines and branches of focal methods. To address this issue, this paper proposes a novel approach, DecoTest, to generate unit test cases for long methods based on LLMs. The proposed method leverages LLM and automated validation based on static analysis to derive high-quality decomposition and refactoring plans. After refactoring the long methods, test cases are generated, iteratively verified and repaired to produce the final test case suite. This paper also presents an experimental analysis of DecoTest. The results indicate that the proposed method outperforms existing LLM-based test case generation methods in terms of line coverage, branch coverage, and test execution pass rate.
|
|
09:00-18:30, Paper Mo-Online.149 | |
3D Point Cloud Robust Sampling Method Based on Aggregation Rate |
|
Ge, Runqi | Southwest Jiaotong University |
Ge, Pingxu | Southwest Jiaotong University |
Dong, Sicong | Southwest Jiaotong University |
Li, Chongshou | Southwest Jiaotong University |
Keywords: Machine Vision, Image Processing and Pattern Recognition, Deep Learning
Abstract: Effective sampling plays a critical role in the preprocessing of 3D point cloud data, directly impacting the performance of downstream models. Traditional Farthest Point Sampling (FPS) ensures global spatial uniformity but is highly sensitive to noise and occlusion, often leading to reduced recognition accuracy. To address this issue, we propose a robust sampling method based on a novel density-aware metric called the aggregation rate. By computing K-nearest neighbor distances for each point, the method quantifies local compactness and suppresses outliers during sampling. We integrate our approach into the PointNet++ framework and evaluate it on two challenging corrupted datasets: ModelNet40-C and PointCloud-C. Experimental results show notable improvements in robustness and classification accuracy, with gains of up to 2.0% on PointCloud-C. Parameter studies further confirm the method’s stability and effectiveness. Our approach offers a lightweight, plug-and-play solution that enhances sampling robustness, making it well-suited for real-world noisy 3D environments.
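The aggregation rate, scoring each point by the compactness of its k-nearest-neighbour neighbourhood so that isolated outliers are suppressed, can be sketched with a brute-force version that simply keeps the most aggregated points; how the score is combined with farthest point sampling in the paper is not specified here, so treat this as illustrative only.

```python
import numpy as np

def aggregation_rate_sampling(points, n_sample, k=8):
    """Sketch: score each point by the inverse of its mean k-NN distance (local compactness)
    and keep the most 'aggregated' points, suppressing outliers before sampling."""
    diff = points[:, None, :] - points[None, :, :]            # (N, N, 3) pairwise differences
    dist = np.linalg.norm(diff, axis=-1)                      # (N, N) Euclidean distances
    knn = np.sort(dist, axis=1)[:, 1:k + 1]                   # skip the zero self-distance
    aggregation = 1.0 / (knn.mean(axis=1) + 1e-8)             # higher = denser neighbourhood
    keep = np.argsort(-aggregation)[:n_sample]
    return points[keep]

cloud = np.vstack([np.random.randn(500, 3) * 0.1,             # dense object surface
                   np.random.uniform(-3, 3, (20, 3))])        # scattered noise points
print(aggregation_rate_sampling(cloud, n_sample=128).shape)   # (128, 3)
```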
|
|
09:00-18:30, Paper Mo-Online.151 | |
BiADet: Dual-Path Adaptive Aerial Small-Object Detection Network |
|
Yan, Kaiyue | Tianjin University of Technology |
Wang, Fayu | Tianjin University of Technology |
Keywords: Image Processing and Pattern Recognition, Deep Learning, Artificial Social Intelligence
Abstract: With the increasing use of drones in surveillance and rescue, accurately detecting small objects under low-light conditions remains a major challenge due to poor contrast. To address this, we design a dual-path adaptive small-object detection network, BiADet. It is built on an innovative dual-backbone feature enhancement architecture: one backbone adaptively refines reflectance to extract fine-grained edge information and high-frequency details, while the other focuses on capturing the targets, and the two complement each other in this process. Their outputs are combined by an Adaptive Feature Aggregation Module (AFAM), which dynamically merges and enhances the features and emphasizes key regions. We also optimize the hybrid encoder and detection head for tiny-object scales and redesign a Multi-Scale Feature Selection (MSFS) module for cross-level feature fusion and noise suppression. Experiments on VisDrone2019 show that, compared with the RT-DETR-R18 baseline, BiADet improves mAP50 and improves mAP50-95 by 3.6%. Tests on ExDark and real-world data further demonstrate its performance across different scenarios.
|
|
09:00-18:30, Paper Mo-Online.152 | |
Spatiotemporal Self-Attention-Based Tensor Neural Network for Incomplete Traffic Data Imputation |
|
Xian, Yunchun | Southwest University |
Hao, Wu | Southwest University |
Keywords: Deep Learning, Neural Networks and their Applications, Hybrid Models of Computational Intelligence
Abstract: Intelligent Transportation Systems (ITS) play a critical role in alleviating traffic congestion, enhancing road safety, and supporting smart city development. These tasks fundamentally rely on high-quality and complete traffic data to ensure accurate spatiotemporal pattern analysis. However, traffic data collected by ITS often suffer from incompleteness due to sensor failures, communication errors, and environmental disturbances. Latent factorization of tensors (LFT) provides an efficient framework for traffic data imputation by leveraging low-rank structural priors. Nevertheless, the linear assumptions inherent in conventional LFT-based methods limit their ability to model nonlinear traffic patterns, leading to degraded imputation accuracy. To address these challenges, this paper proposes a Spatiotemporal Self-Attention-based Tensor Neural Network (SSA-TNN) with two key ideas: a) a tensor neural network framework integrating LFT with nonlinear neural architectures to capture intricate nonlinear traffic patterns, and b) an adaptive spatiotemporal self-attention mechanism that explicitly identifies implicit spatiotemporal dependencies through learnable transformation matrices. Extensive experiments conducted on six real-world traffic datasets demonstrate that the proposed SSA-TNN outperforms nine state-of-the-art baselines in traffic data imputation tasks, providing a systematic framework for intelligent traffic analytics.
|
|
09:00-18:30, Paper Mo-Online.153 | |
Temporal Knowledge Graph Reasoning with Long and Short-Term Dependencies and Relational Semantics |
|
Tao, Yiguo | Hohai University |
Xu, Guoyan | Hohai University |
Zhu, Yanqiu | Hohai University |
Keywords: Representation Learning, Machine Learning, Knowledge Acquisition
Abstract: Temporal Knowledge Graph (TKG) reasoning aims to predict missing facts using historical TKG data. Existing methods often struggle to effectively capture crucial semantics within long-term historical data and typically overlook dynamic semantic associations among relations, thereby limiting reasoning accuracy. To address these challenges, we propose the TKG Reasoning with Long- and Short-Term Dependencies and Relational Semantics Model (LSTR). LSTR employs an encoder-decoder architecture with a dual-layer entity encoding structure comprising short-term and long-term entity encodings. The short-term entity encoding adaptively highlights essential subgraphs through a subgraph-aware attention mechanism. The long-term entity encoding incorporates a query association layer and a long-term historical dependency layer, constructing multi-hop association graphs and long-term temporal-aware graphs, respectively, to capture entities' evolving trends effectively. Furthermore, we introduce a relation association graph that explicitly models dynamic semantic associations among relations via relational graph convolution networks, significantly enhancing the representational power of relation embeddings. Extensive experiments on four benchmark datasets demonstrate the superior performance and effectiveness of LSTR.
|
|
09:00-18:30, Paper Mo-Online.154 | |
On the Taxonomy, Tasks, and Open-Challenges for Multimodal Large Language Models |
|
Yan, Lecheng | Xinjiang University |
Ruizhe, Li | University of Aberdeen |
Jiahui, Geng | MBZUAI |
Qing, Li | MBZUAI |
Minghao, Wu | Monash University |
Zhanyu, Wang | University of Sydney |
Li, Wenxi | Shanghai Jiao Tong University |
Ji, Tianbo | Nantong University |
Jiang, Shaochen | Xinjiang University |
Lyu, Chenyang | Alibaba Group |
Keywords: Deep Learning, Machine Learning, AI and Applications
Abstract: In recent years, the field of Artificial Intelligence has witnessed the emergence of Multimodal Large Language Models (MLLMs) that have significantly advanced the state of the art in understanding and generating content across various data modalities. These models, capable of processing and integrating information from text, images, audio, and video, have opened new avenues for research and applications. Distinguished by their ability to understand and generate information across diverse modalities, such as text, image, and audio, MLLMs mark a significant step towards the ultimate aim of Artificial General Intelligence (AGI). This comprehensive survey provides an in-depth examination of MLLMs, highlighting their evolutionary trajectory, current state-of-the-art developments, and prospective future directions. Specifically, we present a taxonomy of MLLMs based on the modalities they process and the model architectures used to align multiple modalities. Besides, we also discuss the different types of tasks related to MLLMs. The paper further delves into the pressing challenges confronted in this domain, such as data scarcity, computational complexity, ethical dilemmas, and privacy considerations. We analyze these issues in the context of both the development and the deployment of MLLMs. The survey comprehensively summarises the recent advances and the transformative influence of MLLMs while acknowledging their potential limitations, thereby outlining a prospective roadmap for future research endeavors in this rapidly developing field.
|
|
09:00-18:30, Paper Mo-Online.155 | |
A Transformer-Based Dual-Branch Mesh Convolutional Neural Network for Aortic Dissection Segmentation |
|
Zhang, Yuan | Sichuan Normal University |
Zhang, Fuhao | Sichuan Normal University |
Zhang, Jialin | Guizhou University |
Wang, Ling | Sichuan Normal University |
Zhou, Hui | West China Second University Hospital,Sichuan University |
Chen, Dapeng | West China Second University Hospital, Sichuan University |
Tang, Jinshan | George Mason University |
Jiang, Jingfeng | Michigan Technological University |
Mu, Nan | Sichuan Normal University |
Keywords: Image Processing and Pattern Recognition, Deep Learning, Neural Networks and their Applications
Abstract: Aortic dissection (AD) is a life-threatening condition caused by a tear in the aortic intima, allowing blood to enter the vessel wall and form a false lumen. Due to its high mortality rate, timely diagnosis and precise treatment are critical. Clinical diagnosis and treatment of AD rely heavily on accurate 3D vascular image segmentation. To address existing methods' low segmentation accuracy and insufficient geometric detail preservation, this paper proposes a Transformer-based Dual-Branch Mesh Segmentation Network (TD-MSeg) for AD. This network employs a mesh-based self-attention mechanism to retain vascular geometric details while adopting a dual-branch decoder to effectively fuse features and model long-range dependencies. Specifically, TD-MSeg incorporates three key components: a Hierarchical Mesh Transformer (HMT) module that enhances feature modeling of critical anatomical structures (e.g., intimal tears), a dual-branch decoder that facilitates collaborative optimization of multi-scale local and global features, and a mesh label refinement module that uses a wide-path exploration algorithm to eliminate deformation artifacts and improve spatial label continuity. Moreover, experiments on two AD mesh segmentation datasets demonstrate that the proposed TD-MSeg achieves a 6% improvement in accuracy compared to traditional models and significantly enhances the recognition of complex vascular structures, thereby providing high-precision 3D reconstruction support for endovascular surgical planning.
|
|
09:00-18:30, Paper Mo-Online.156 | |
ProDisc-VAD: An Efficient System for Weakly-Supervised Anomaly Detection in Video Surveillance Applications |
|
Zhu, Tao | Jiangxi University of Finance and Economics |
Qi, Yu | Jiangxi Science and Technology Normal University |
Dong, Xinru | Jiangxi University of Finance and Economics |
Li, Shiyu | Jiangxi University of Finance and Economics |
Liu, Yue | Jiangxi University of Finance and Economics |
Jiang, Jinlong | Jiangxi University of Finance and Economics |
Shu, Lei | Jiangxi University of Finance and Economics |
Keywords: Image Processing and Pattern Recognition, AI and Applications, Machine Vision
Abstract: Weakly-supervised video anomaly detection (WS-VAD) using Multiple Instance Learning (MIL) suffers from label ambiguity, hindering discriminative feature learning. We propose ProDisc-VAD, an efficient framework tackling this via two synergistic components. The Prototype Interaction Layer (PIL) provides controlled normality modeling using a small set of learnable prototypes, establishing a robust baseline without being overwhelmed by dominant normal data. The Pseudo-Instance Discriminative Enhancement (PIDE) loss boosts separability by applying targeted contrastive learning exclusively to the most reliable extreme-scoring instances (highest/lowest scores). ProDisc-VAD achieves strong AUCs (97.98% ShanghaiTech, 87.12% UCF-Crime) using only 0.4M parameters, over 800x fewer than recent ViT-based methods like VadCLIP, demonstrating exceptional efficiency alongside state-of-the-art performance. Code is available at https://github.com/modadundun/ProDisc-VAD.
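The PIDE loss above applies contrastive learning only to each bag's most reliable extreme-scoring snippets. Below is a minimal numpy sketch of that selection-and-contrast step; the feature dimension, margin, and selection size k are illustrative assumptions, not the paper's exact formulation.
```python
import numpy as np

def pide_loss(features, scores, k=3, margin=1.0):
    """Contrast the k highest-scoring (pseudo-anomalous) instances of a bag
    against its k lowest-scoring (pseudo-normal) instances.

    features : (N, D) array of instance embeddings for one video (bag)
    scores   : (N,) anomaly scores in [0, 1] predicted for each instance
    """
    order = np.argsort(scores)
    low_idx, high_idx = order[:k], order[-k:]          # most reliable extremes
    normal, abnormal = features[low_idx], features[high_idx]

    # pull pseudo-normal instances toward their own centroid
    centroid = normal.mean(axis=0)
    pull = np.mean(np.linalg.norm(normal - centroid, axis=1) ** 2)

    # push pseudo-abnormal instances away from the normal centroid by a margin
    dists = np.linalg.norm(abnormal - centroid, axis=1)
    push = np.mean(np.maximum(0.0, margin - dists) ** 2)
    return pull + push

rng = np.random.default_rng(0)
feats = rng.normal(size=(32, 16))      # 32 snippets, 16-D features
scrs = rng.uniform(size=32)            # per-snippet anomaly scores
print(pide_loss(feats, scrs))
```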
|
|
09:00-18:30, Paper Mo-Online.157 | |
Multi-Agent Cooperative Perception: A Structural Evaluation Framework for Autonomous Systems |
|
Tingting, Wu | Soochow University |
Jin, Wang | Soochow University |
Cong, Yang | Soochow University |
Xu, Junqi | Soochow University |
Keywords: Agent-Based Modeling, Artificial Social Intelligence, AIoT
Abstract: With multi-agent systems developing rapidly in various fields, cooperative perception has attracted much attention as a critical technology for enhancing the intelligence of autonomous systems. In complex real-world scenes, however, the perception ability of a single agent is often constrained by occlusion and a limited sensing range. Multi-agent cooperative perception has therefore emerged to enhance overall perception performance by sharing perception information among agents. In this paper, we study multi-agent cooperative perception in a structured and in-depth manner and propose a novel, comprehensive evaluation framework that goes beyond existing literature, which mainly focuses on latency or communication factors when evaluating cooperative perception. The framework covers four key modules, feature extraction, feature compression, feature fusion, and target detection, and aims to address the multifaceted challenges in multi-agent perception. Through an in-depth evaluation of existing cooperative perception algorithms, we comprehensively map the performance of each algorithm under the guidance of the framework. In particular, we find that without correct pose alignment, detection performance degrades drastically as latency increases. Our comprehensive framework will drive the development of multi-agent cooperative perception by providing researchers with a transparent and standardised methodology for evaluating, comparing, and improving existing cooperative perception approaches.
|
|
09:00-18:30, Paper Mo-Online.158 | |
AMS-KGNet: F-Adapter Multi-Scale Kernel Gated Network for Low-Light Image Enhancement |
|
He, Zhaokun | Wuhan University of Science and Technology |
Zhang, Chengshuo | Wuhan University of Science and Technology |
Yuan, Xin | Wuhan University of Science and Technology, China |
Hao, Guozhu | Wuhan University of Technology |
Keywords: Image Processing and Pattern Recognition
Abstract: This paper proposes a novel F-Adapter Multi-Scale Kernel Gated Network (AMS-KGNet) framework for low-light image enhancement (LLIE). We specifically address two key challenges: (1) excessive enhancement in bright areas and insufficient enhancement in dark areas, and (2) inaccurate perception of the image noise distribution. In the first stage of model training, we embed AMS-KGNet into the uncertainty model to guide the model to focus on low-light and high-noise areas in the image. Subsequently, in the second stage of model training, we use the Fine-tuning Adapter (FA) to fine-tune the model parameters and freeze the parameters of the Multi-Scale Large Kernel Gated Attention (MS-LKGA) module, significantly accelerating model convergence without affecting the enhancement quality. Extensive experiments on three LLIE benchmark datasets (LOLv1, LOLv2-real, and LOLv2-synthetic), comparing against state-of-the-art methods in both qualitative and quantitative evaluations, demonstrate the superiority of our proposed AMS-KGNet method. In addition, the ablation study further validates the effectiveness of our AMS-KGNet method.
|
|
09:00-18:30, Paper Mo-Online.159 | |
CTF: Hybrid Multi-Scale Contrastive Transformer for Website Fingerprinting Attack |
|
Xiao, Juxin | University of Chinese Academy of Sciences |
Song, Yang | School of Computer Science, Hangzhou Dianzi University |
Chen, Yanhui | Huawei Technologies Co., Ltd |
Li, Yunpeng | Institute of Information Engineering, Chinese Academy of Science |
Jie, Yin | Institute of Information Engineering, Chinese Academy of S |
Liu, Yuling | Institute of Information Engineering, Chinese Academy of Science |
Liu, Qixu | Institute of Information Engineering, Chinese Academy of Science |
Keywords: Deep Learning, Application of Artificial Intelligence, AI and Applications
Abstract: Website Fingerprinting (WF) attacks, a methodology that allows observers to infer visited websites through encrypted traffic analysis, are usually employed for evaluating the security of anonymous networks like Tor. Although deep learning-based methods have demonstrated significant success in recent studies, existing approaches exhibit limitations. They are insufficient in modeling the intricate characteristics of traffic patterns, which leads to compromised accuracy in identifying challenging samples. To address these limitations, we propose CTF, a novel WF attack comprising three key components: (1) extraction of Hybrid Multi-scale Traffic Features (HMTF) integrating spatio-temporal information, (2) a novel deep learning framework that synergizes CNN for spatial pattern recognition and Transformer for temporal dependency modeling within HMTF, and (3) a supervised contrastive learning mechanism to enhance discriminative capability across website fingerprints. Experimental evaluations across multiple benchmark datasets reveal that CTF achieves 99.09% accuracy in Closed-World scenarios, showing enhanced robustness against four SOTA WF defense mechanisms.
|
|
09:00-18:30, Paper Mo-Online.160 | |
StyleGAN3-SIFT Fusion for High-Fidelity Digital Core 3D Reconstruction (I) |
|
Wang, Yifan | Southwest Petroleum University |
Sun, Guilin | Southwest Petroleum University |
Zhou, Wenjun | Southwest Petroleum University |
Zhang, Quan | Southwest Petroleum University |
Peng, Bo | Southwest Petroleum University |
Keywords: Deep Learning, Image Processing and Pattern Recognition, Neural Networks and their Applications
Abstract: Digital core reconstruction plays a vital role in subsurface analysis, but conventional techniques face challenges such as data scarcity, limited sample diversity, and the high cost of high-resolution CT imaging. Additionally, existing GAN-based models often suffer from geometric distortions and training instabilities, which compromise microstructural fidelity. To address these limitations, we propose a novel framework that integrates three key components: (1) StyleGAN3 for high-resolution image synthesis, (2) SIFT-based registration for precise structural alignment, and (3) a GAN-based inverse reconstruction pipeline for resolution enhancement. Evaluations on the Estaillades carbonate dataset demonstrate that our method achieves a 2.86-fold resolution improvement over conventional CT reconstruction, reducing reliance on high-end imaging equipment while preserving fine-scale microstructural details. This approach offers a scalable and cost-effective solution for digital rock analysis and reservoir simulation, advancing the state of the art in pore-scale modeling.
|
|
09:00-18:30, Paper Mo-Online.161 | |
Confidence Optimization for Probabilistic Encoding |
|
Xia, Pengjiu | Beijing Institute of Technology |
Huang, Yidian | Beijing Institute of Technology |
Wei, Wenchao | Beijing Institute of Technology |
Tan, Yuwen | Beijing Institute of Technology |
Keywords: Representation Learning, Optimization and Self-Organization Approaches, Computational Intelligence in Information
Abstract: Probabilistic encoding introduces Gaussian noise into neural networks, enabling a smooth transition from deterministic to uncertain states and enhancing generalization ability. However, the randomness of Gaussian noise distorts point-based distance measurements in classification tasks. To mitigate this issue, we propose a confidence optimization probabilistic encoding (CPE) method that improves distance reliability and enhances representation learning. Specifically, we refine probabilistic encoding with two key strategies: first, we introduce a confidence-aware mechanism to adjust distance calculations, ensuring consistency and reliability in probabilistic encoding classification tasks; second, we replace the conventional KL divergence-based variance regularization, which relies on unreliable prior assumptions, with a simpler L2 regularization term that directly constrains the variance. The proposed method is model-agnostic, and extensive experiments on natural language classification tasks demonstrate that it significantly improves performance and generalization on both the BERT and RoBERTa models.
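As a rough illustration of the two strategies described above, the sketch below draws a reparameterised Gaussian sample per input, down-weights distances where the encoding variance is high, and replaces a KL-based variance penalty with a plain L2 term on the predicted variance. Function names and the 0.1 weight are assumptions for illustration.
```python
import numpy as np

rng = np.random.default_rng(0)

def probabilistic_encode(mu, log_var):
    """Reparameterised Gaussian sample z = mu + sigma * eps."""
    sigma = np.exp(0.5 * log_var)
    return mu + sigma * rng.normal(size=mu.shape), sigma

def confidence_distance(z, centers, sigma):
    """Distance to class centers, down-weighted where the encoding is uncertain
    (high variance -> low confidence -> smaller effective contribution)."""
    conf = 1.0 / (1.0 + sigma.mean(axis=-1, keepdims=True))
    return conf * np.linalg.norm(z[:, None, :] - centers[None, :, :], axis=-1)

def l2_variance_regularizer(sigma, weight=0.1):
    """Directly constrain the variance instead of using a KL prior term."""
    return weight * np.mean(sigma ** 2)

mu = rng.normal(size=(8, 4))           # 8 sentences, 4-D means
log_var = rng.normal(size=(8, 4)) - 2  # predicted log-variances
centers = rng.normal(size=(3, 4))      # 3 class prototypes

z, sigma = probabilistic_encode(mu, log_var)
print(confidence_distance(z, centers, sigma).shape)   # (8, 3)
print(l2_variance_regularizer(sigma))
```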
|
|
09:00-18:30, Paper Mo-Online.162 | |
QLMR-PO: Intelligent Routing Algorithm for Underwater Acoustic Sensor Networks Combining Learning Automaton and Q-Learning |
|
Cheng, Haoran | Inner Mongolia University |
Bai, Xiangyu | Inner Mongolia University |
Keywords: Swarm Intelligence, Computational Intelligence, Machine Learning
Abstract: Routing algorithms in underwater acoustic sensor networks are the primary means of addressing issues such as low bandwidth, long transmission delays, high bit error rates, and limited energy in underwater environments. However, many routing algorithms face challenges such as suboptimal path selection, low energy efficiency, and poor overall performance. To address these issues, this paper proposes an intelligent routing algorithm for underwater acoustic sensor networks (QLMR-PO) that combines a learning automaton with Q-learning for efficient routing. First, we optimize the transmission power of each node using a learning automaton. Then, we introduce multiple influencing factors into the reward function design and select the best relay node as the one with the maximum Q-value. Finally, we ensure the forward propagation of data by introducing depth information and a memory mechanism. Simulation results show that the proposed algorithm outperforms several Q-learning-based algorithms in terms of energy efficiency, end-to-end delay, network lifetime, and other performance metrics.
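A toy sketch of the relay-selection idea: a multi-factor reward, a standard Q-learning update, and a depth constraint that forces forward propagation toward the surface. The reward weights and learning parameters are illustrative, not the values used in QLMR-PO.
```python
import random

ALPHA, GAMMA = 0.5, 0.8            # learning rate / discount (assumed values)
Q = {}                             # Q[(node, neighbor)] -> value

def reward(energy, delay, depth_gain):
    """Toy multi-factor reward: favour residual energy and depth progress,
    penalise delay (weights are illustrative, not from the paper)."""
    return 0.5 * energy + 0.4 * depth_gain - 0.1 * delay

def choose_relay(node, neighbors, depth):
    """Pick the shallower neighbor with the largest Q-value (forward routing)."""
    candidates = [n for n in neighbors if depth[n] < depth[node]]
    if not candidates:
        return None
    return max(candidates, key=lambda n: Q.get((node, n), 0.0))

def update_q(node, relay, r, relay_neighbors):
    best_next = max((Q.get((relay, n), 0.0) for n in relay_neighbors), default=0.0)
    old = Q.get((node, relay), 0.0)
    Q[(node, relay)] = old + ALPHA * (r + GAMMA * best_next - old)

# one illustrative hop: node 0 at depth 3 forwards toward the surface (depth 0)
depth = {0: 3, 1: 2, 2: 2, 3: 4}
relay = choose_relay(0, [1, 2, 3], depth)
if relay is not None:
    update_q(0, relay, reward(energy=0.8, delay=0.2, depth_gain=1.0), relay_neighbors=[0, 3])
print(relay, Q)
```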
|
|
09:00-18:30, Paper Mo-Online.163 | |
EDRSNet: A Dual-Branch Real-Time Semantic Segmentation Network for UAV Autonomous Flight |
|
Zhu, Yuanxu | Xinjiang University |
Zhang, Tianze | The University of Melbourne |
Wu, Aiying | Xinjiang University |
Deng, Yang | Xinjiang University |
Shi, Gang | Xinjiang University |
Keywords: Deep Learning, Application of Artificial Intelligence, Machine Vision
Abstract: Autonomous UAV navigation in road scenes is one of the key research focuses in the field. However, vision-based UAV control remains a significant challenge. To address this issue, this paper proposes a real-time semantic segmentation network, EDRSNet, Edge Information Assisted Dual-Branch Encoder Real-Time Semantic Segmentation Net, as the visual perception module for UAVs. On this basis, we introduce the RENA, Ray-based Eight Neighborhood Algorithm, to enhance road extraction and accurately determine the UAV's target point. Combined with a PID control algorithm, this approach enables precise flight control of the UAV. Experimental results show that EDRSNet achieves 51.61 FPS and 17.39 GFLOPs on the DeepGlobe Road dataset and the DRS Road dataset. Furthermore, the integration of EDRSNet with RENA and PID achieves promising results in a simulation environment, with the UAV flight trajectory closely matching the expected path, demonstrating the feasibility of the proposed method for autonomous UAV flight tasks.
|
|
09:00-18:30, Paper Mo-Online.164 | |
Progressive Multi-Scale Vision Transformer for Hierarchical Myocardial Segmentation in Cardiac MRI |
|
Pu, Lei | Sichuan Normal University |
Li, Yangjie | West China Hospital of Sichuan University |
Xu, Yuanwei | West China Hospital, Sichuan University |
Jiang, Jingfeng | Michigan Technological University |
Tang, Jinshan | George Mason University |
Li, Xiao-Ning | Sichuan Normal University |
Chen, Yucheng | West China Hospital of Sichuan University |
Mu, Nan | Sichuan Normal University |
Keywords: Image Processing and Pattern Recognition, Deep Learning, Neural Networks and their Applications
Abstract: Myocardial infarction remains a global health challenge. Accurate myocardial segmentation in late gadolinium enhancement cardiac magnetic resonance imaging (LGE-CMRI) is critical for diagnosis and treatment planning. Although deep learning architectures have demonstrated excellent segmentation performance in conventional CMRI, their accuracy significantly declines in LGE-CMRI due to low tissue contrast and complex background interference. To address these challenges, we propose a Progressive Multi-scale Vision Transformer (PMVT) for myocardial segmentation in LGE-CMRI, which enhances spatial representation capabilities and improves adaptability in complex scenarios through multi-scale feature fusion and interaction. Specifically, PMVT includes a Multi-scale Progressive Attention Decoder (MPSD) for modeling both long and short-term dependencies, and a Multi-layer Hybrid Context Purification (MHCP) that combines different combinations of four prediction heads for prediction and loss calculation, effectively suppressing background interference. Experiments demonstrate that the proposed PMVT outperforms state-of-the-art (SOTA) models, achieving a Dice score of 89.61% for myocardial segmentation (a 1.56% improvement over the current SOTA). This result highlights its considerable potential for clinical applications in automated LGE-CMRI analysis.
|
|
09:00-18:30, Paper Mo-Online.165 | |
Adaptive PPO-LSTM-Based Particle Swarm Optimization for UAV Path Planning |
|
Huang, Lingjie | East China Normal University |
Cao, Huimin | East China Normal University |
Xiao, Bo | East China Normal University, Software Engineering Institute |
Keywords: Swarm Intelligence, Metaheuristic Algorithms, Optimization and Self-Organization Approaches
Abstract: Path planning for unmanned aerial vehicles (UAVs) in complex three-dimensional environments, characterized by dynamic constraints and dense obstacle distributions, remains a significant challenge. Conventional Particle Swarm Optimization (PSO) methods often struggle under such conditions due to their uniform exploration behavior. To overcome this limitation, this paper proposes an Adaptive Proximal Policy Optimization and Long Short-Term Memory-based Particle Swarm Optimization (APLPSO) algorithm. APLPSO integrates Proximal Policy Optimization (PPO) and a Long Short-Term Memory (LSTM) network into the PSO framework, thereby allowing each particle to autonomously adjust its search strategy based on historical trajectories and environmental feedback. Extensive simulation studies across scenarios with varying obstacle densities demonstrate that APLPSO consistently outperforms conventional approaches in terms of path optimality, convergence rate, and robustness.
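The core mechanism, each particle adjusting its own PSO coefficients from feedback, can be illustrated with a standard velocity/position update in which the inertia and acceleration coefficients are supplied per particle by a policy. In the sketch below the policy is a random stub; in APLPSO these values would come from the PPO-trained LSTM.
```python
import numpy as np

rng = np.random.default_rng(1)

def policy_stub(n_particles):
    """Stand-in for the PPO-LSTM policy: returns per-particle (w, c1, c2)."""
    w  = rng.uniform(0.4, 0.9, n_particles)
    c1 = rng.uniform(1.0, 2.0, n_particles)
    c2 = rng.uniform(1.0, 2.0, n_particles)
    return w, c1, c2

def pso_step(x, v, pbest, gbest):
    n, d = x.shape
    w, c1, c2 = policy_stub(n)
    r1, r2 = rng.random((n, d)), rng.random((n, d))
    v = (w[:, None] * v
         + c1[:, None] * r1 * (pbest - x)
         + c2[:, None] * r2 * (gbest - x))
    return x + v, v

# toy 3-D path-planning surrogate: minimise distance to a goal waypoint
goal = np.array([10.0, 5.0, 3.0])
f = lambda p: np.linalg.norm(p - goal, axis=-1)

x = rng.uniform(0, 10, (20, 3)); v = np.zeros_like(x)
pbest = x.copy(); gbest = x[np.argmin(f(x))]
for _ in range(50):
    x, v = pso_step(x, v, pbest, gbest)
    better = f(x) < f(pbest)
    pbest[better] = x[better]
    gbest = pbest[np.argmin(f(pbest))]
print(gbest, f(gbest))
```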
|
|
09:00-18:30, Paper Mo-Online.166 | |
TemVLT: Vision-Language Tracking Via Mamba-Based Temporal Information Learning |
|
Nie, Qishuai | Huazhong University of Science and Technology |
Weng, Zhimin | Huazhong University of Science and Technology |
Wang, Yuehuan | Huazhong University of Science and Technology, Wuhan, P.R. China |
Keywords: Image Processing and Pattern Recognition, Deep Learning, Neural Networks and their Applications
Abstract: Vision-language tracking (VLT) aims to improve target localization by leveraging natural language (NL) descriptions. However, most trackers fail to account for the discrepancy between the dynamically evolving target states and the initial NL description during tracking, which limits model performance. Developing effective temporal modeling methods to resolve this vision-language ambiguity therefore represents a critical research challenge in VLT. In this paper, we propose a video-level VLT framework called TemVLT, which aggregates language information under the guidance of a dynamic template and stores and transmits temporal information across frames via hidden states in Mamba, resulting in more robust tracking performance. Specifically, we design the Language Adaptive Aggregation (LAA) module, which aggregates the Top-k most important language tokens based on the appearance information in the dynamic template to improve the stability of language information. Additionally, we introduce the Mamba-based Temporal Information Capture and Storage (TCS) module, which captures temporal information in the visual features of the current frame via Mamba layers and stores it in hidden states to enhance the context-awareness of the next frame. Extensive results on TNL2K, LaSOT, and OTB99-Lang confirm the superiority of TemVLT over existing trackers.
|
|
09:00-18:30, Paper Mo-Online.167 | |
A Multi-Structural Graph Fusion Approach for Code Representation in Code Search |
|
Ao, Longhao | Hohai University |
Qi, Rongzhi | Hohai University |
Li, Haoxuan | Hohai University |
Keywords: Deep Learning, Application of Artificial Intelligence
Abstract: Code search aims to retrieve semantically relevant code snippets on the basis of natural language queries. With the rapid expansion of public code repositories such as GitHub and Gitee, the efficient understanding and matching of relevant code have become critical challenges. Most existing deep learning-based code search approaches rely on feature extraction from code but often fail to account for its structural integrity. They neglect the complex hierarchical structure of the code, resulting in insufficient representational capacity for semantic matching. To address these challenges, we propose an innovative code search method, AcbertGraphC, which constructs a multi-relational graph representation by combining abstract syntax tree (AST), data dependency graph (DDG), and control flow graph (CFG) through a functional program graph and early fusion strategy. Additionally, we utilize meta-path aggregated graph neural networks (MAGNN) to extract complex relationships from the multi-relational graph, and leverage a graph attention mechanism to dynamically adjust meta-path selection, thereby enhancing the model's search capability. The experimental results demonstrate that AcbertGraphC can accurately retrieve target code snippets and outperforms existing baseline methods in terms of matching precision.
|
|
09:00-18:30, Paper Mo-Online.168 | |
DBS-FND: A Framework for Improving Multimodal Fake News Detection Via Decision Boundary Alignment |
|
Wang, Jiacheng | Hebei University of Technology |
Ye, Zihui | Xiamen University Malaysia |
Keywords: AI and Applications, Artificial Social Intelligence
Abstract: With the advancement of multimodal large language models (MLLMs), their chain-of-thought (CoT) capabilities for fake news detection have become increasingly comprehensive. Current prompting methods mainly focus on guiding the model to analyze multimodal information so that it adapts to the fake news detection task. However, this mechanism of explicitly guided reasoning ignores the difference between the decision boundaries of the model and of humans when judging veracity, which leads to misjudgments even after thorough analysis. In addition, static samples used as demonstrations fail to fully exploit the in-context learning ability of multimodal large models across diverse news information. To address this, we propose the DBS-FND framework, which aims to align the model's decision boundary with that of humans. The framework comprises two modules, uncertain sample alignment and similar decision guidance, enabling the model to improve prediction accuracy without fine-tuning while maintaining only a small set of boundary samples. Experimental results ...
|
|
09:00-18:30, Paper Mo-Online.169 | |
A Riemannian Multi-Scale RestNet with Manifold Attention Block for EEG Decoding |
|
Zou, Xuhui | Chongqing Normal University |
Chen, Zeming | Chongqing Normal University |
Song, Yipan | Chongqing Normal University |
Zhang, Tingting | Southwest University |
Dai, Weixiao | Chongqing Normal University |
Ran, Ruisheng | Chongqing Normal University |
Keywords: Neural Networks and their Applications, Deep Learning, Application of Artificial Intelligence
Abstract: In the field of brain–computer interfaces (BCIs), the recognition of electroencephalogram (EEG) signals lies at the heart of decoding neural activity patterns and facilitating efficient human–machine interaction. Geometric learning (GL) methods have garnered increasing attention due to their enhanced robustness in EEG signal decoding. However, existing GL methods lack the ability to extract multi-scale features and spatial and channel attention from manifold data. To address these two issues, this paper proposes a model called Riemannian multi-scale residual network (RMS-RestNet), which is built upon MAtt. RMS-RestNet introduces a Riemannian multi-scale residual module (RMSRM), which extends depthwise and pointwise convolutions to the space of symmetric positive definite (SPD) manifolds, referred to as SPD depthwise (SPD DW) and SPD pointwise (SPD PW) convolutions, respectively. By employing SPD convolution kernels of varying scales, the model facilitates more comprehensive and discriminative extraction of geometric features. On this basis, residual connections are employed to fuse the geometric information of the data before and after processing, thereby enhancing the feature representation ability of the module. In addition, by combining a tangent space pooling strategy with SPD convolutions, we construct a manifold convolutional block attention module (MCBA) to capture both channel-wise and spatial attention across multi-channel SPD manifold data. Extensive experiments on both temporally synchronous and asynchronous EEG datasets demonstrate the superiority of our approach over state-of-the-art methods.
|
|
09:00-18:30, Paper Mo-Online.170 | |
Leveraging Clinical Notes with Medical NER for Treatment Recommendation |
|
Saleem, Muhammad Adil | IBA Karachi |
Raza, Syed Ali | Institute of Business Administration Karachi |
Haider, Sajjad | IBA Karachi |
Keywords: Expert and Knowledge-Based Systems, AI and Applications, Machine Learning
Abstract: Treatment recommendation systems often depend on structured, multi-visit data and standardized coding (e.g., ICD-10), which are scarce in low-resource settings. This study explores the use of raw clinical notes paired with medical named entity recognition (NER) as a practical alternative. We evaluated three transformer-based NER models and a combined model (AMC) to extract problem and treatment entities. Two interpretable recommendation techniques—Apriori association rule mining and collaborative filtering—were then applied to assess treatment recommendations based on these entities. Results showed that SciBERT and AMC identified more unique entities, and Apriori generally outperformed collaborative filtering, especially under data-sparse conditions. Despite suboptimal performance due to dataset limitations, the approach demonstrates the feasibility of building accessible, explainable treatment recommender systems using unstructured clinical text in low-resource environments.
|
|
09:00-18:30, Paper Mo-Online.171 | |
MIP-Explainer: An Explainability Method for Graph Neural Networks Based on Mutual Information |
|
Wang, Huijiang | Guangxi Normal University |
Li, Qiyu | Guangxi Normal University |
Tan, Jing | Guangxi Normal University |
Su, Linlin | Guangxi Normal University |
Wang, Jinyan | Guangxi Normal University |
Keywords: Deep Learning, AI and Applications, Machine Learning
Abstract: Graph Neural Networks (GNNs) have garnered significant success in modeling graphs due to their powerful capabilities in representation learning and reasoning. However, the lack of explainability in GNNs largely limits their application in security-sensitive and other high-stakes scenarios. Owing to their interpretability and high fidelity, surrogate methods among explainability techniques have sparked some exploration, but many challenges remain unaddressed: 1) most surrogate-based explainers generate local neighborhoods by randomly perturbing node features, ignoring perturbations of the graph topology; 2) as the graph grows, the number of perturbation combinations increases exponentially, whereas only a few combinations are close to the prediction; 3) the multiple relaxation steps and regularization terms applied in previous works may diminish the intrinsic interpretability of surrogate models. To this end, we propose a novel framework named MIP-Explainer, which consists of Selector and Explainer modules. In the Selector module, we leverage mutual information to perturb the graph structure, efficiently generating local neighborhoods of the data. In the Explainer module, we fit an intuitive and interpretable linear model to these local neighborhoods, without incorporating any relaxation or regularization terms. Extensive experiments on real-world and synthetic datasets show the effectiveness of MIP-Explainer, which outperforms state-of-the-art baselines.
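The Explainer module fits a plain linear model to the GNN's predictions on the perturbed local neighborhood, so the learned coefficients directly score feature or edge importance. A hedged numpy sketch of that surrogate fit, with the GNN and the mutual-information-guided perturbation replaced by stubs:
```python
import numpy as np

rng = np.random.default_rng(0)

def gnn_stub(masks):
    """Stand-in for the black-box GNN: returns a class probability per sample."""
    secret_w = np.array([1.5, 0.0, -2.0, 0.0, 0.7])   # unknown to the explainer
    return 1.0 / (1.0 + np.exp(-(masks @ secret_w)))

# local neighborhood: binary masks over 5 candidate edges/features of one node
# (in MIP-Explainer these perturbations would be guided by mutual information)
masks = rng.integers(0, 2, size=(200, 5)).astype(float)
preds = gnn_stub(masks)

# interpretable linear surrogate fit by least squares (no relaxation terms)
X = np.hstack([masks, np.ones((len(masks), 1))])       # add intercept column
coef, *_ = np.linalg.lstsq(X, preds, rcond=None)
importance = coef[:-1]
print("importance scores:", np.round(importance, 3))   # edge/feature attributions
```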
|
|
09:00-18:30, Paper Mo-Online.172 | |
IPv6 Target Generation Driven by Fine-Tuned Large Language Model |
|
Cheng, Xingqi | University of Jinan |
Jing, Shan | University of Jinan |
Jiao, Liang | Coordination Center of China Shandong Branch |
Zhao, Chuan | Quan Cheng Laboratory |
Yang, Hongjuan | University of Jinan |
Keywords: Application of Artificial Intelligence
Abstract: The rapid development of the Internet has brought the exploration and management of the IPv6 address space to the forefront of network research. With its 128-bit addressing scheme, IPv6 provides approximately 3.4 x 10^38 addresses, effectively alleviating the address exhaustion problem inherent to IPv4. The vast address space of IPv6 renders traditional brute-force scanning, once effective for IPv4, no longer feasible. This challenge calls for the development of efficient and precise IPv6 address generation methods to ensure the stability and security of the Internet. To address this challenge, this study introduces an innovative IPv6 address generation method that leverages a fine-tuned large language model (LLM), enhanced with the vLLM inference framework, to improve generation efficiency. This approach exploits the strong learning and generalization capabilities of LLMs to effectively handle the complexity of IPv6 address generation. We begin with a preliminary classification of IPv6 addresses to reduce the learning complexity of the model. Next, we curate a fine-tuning dataset and apply the LoRA technique to fine-tune the Qwen-7B-chat model. The results show that the fine-tuned LLM outperforms existing methods ...
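LoRA fine-tuning, as applied above to Qwen-7B-chat, learns only a low-rank update B*A on top of each frozen weight matrix. A minimal numpy illustration of that update rule follows; the dimensions and scaling are toy values, and this is neither the authors' code nor the PEFT library API.
```python
import numpy as np

rng = np.random.default_rng(0)

d_out, d_in, r, alpha = 64, 64, 8, 16       # toy dimensions and LoRA rank
W = rng.normal(size=(d_out, d_in))          # frozen pretrained weight
A = rng.normal(scale=0.01, size=(r, d_in))  # trainable low-rank factor A
B = np.zeros((d_out, r))                    # trainable factor B (zero init)

def lora_forward(x):
    """y = W x + (alpha / r) * B (A x); in LoRA training W stays frozen
    and only A and B would be updated."""
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.normal(size=d_in)
print(np.allclose(lora_forward(x), W @ x))  # True at init: update starts at zero
```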
|
|
09:00-18:30, Paper Mo-Online.173 | |
Channel Two-Stream Convolutional Networks with Adaptive Sparse Feature Location Activation for Object Detection on Drone Images |
|
Li, Yixuan | UCAS |
Xu, Yulong | National Key Laboratory of Intelligent Collaborative Perception |
Wu, Pengnian | Northwestern Polytechnical University |
Zhang, Meng | UCAS |
Keywords: Deep Learning, Image Processing and Pattern Recognition, Machine Vision
Abstract: Sparse detectors accelerate inference on high-resolution drone images because they perform convolution only on the regions sampled by a sparse mask. However, this local sampling discards global context information, which is detrimental to detection accuracy. To address this issue, we rethink the existing context encoding method and propose to mitigate the loss of global context by introducing convolution selection along the channel dimension. Additionally, existing methods sample object locations imprecisely, which further degrades detection accuracy. Therefore, we introduce a refined mask location activation strategy that dynamically adjusts the learning objectives of sparse masks according to the current mask ratio, ensuring precise object activation. We extensively evaluate our methods using various base detectors on two major benchmarks, i.e., VisDrone and UAVDT, demonstrating the significant superiority of our methods.
|
|
09:00-18:30, Paper Mo-Online.174 | |
VLD-LP: Vulnerability Detection and Root Cause Localization with Large Language Model and Parameter-Efficient Language Model Tuning |
|
Wu, Huanyu | China, Huazhong University of Science and Technology, School of Ar |
Tu, Yunlu | HuaZhong University of Science and Technology |
Huang, Fan | Wuhan Institute of Digital Engineering&Shanghai Jiao Tong Univer |
Wu, Dongrui | Huazhong University of Science and Technology |
Keywords: Deep Learning, Application of Artificial Intelligence
Abstract: As software complexity rises, research on vulnerability detection becomes increasingly important. Deep learning-based vulnerability detection, an emerging approach, can segment code and identify hidden vulnerability patterns. However, challenges remain: (1) accurately correlating code slices with scripts to minimize false positives; (2) enhancing the precision of root cause localization for vulnerable scripts. To alleviate these issues, this paper introduces a vulnerability detection and root cause localization approach leveraging large language models (LLMs). The approach preprocesses C/C++ source code, extracts graph structures, and combines them with the script to form prompts. A novel Hierarchical Regulation for Parameter-Efficient Language Model Tuning (HR-PELT) approach fine-tunes the LLM for vulnerability detection by keeping the parameters of shallow layers substantially preserved while enhancing the adaptability of deep layers. For root cause localization, we similarly create prompts and fine-tune another LLM. Experimental results on three datasets demonstrated improvements: in vulnerability detection, our approach boosted average Accuracy (ACC) by 4.82% and Macro-F1 (M-F1) by 5.29% over the state-of-the-art (SOTA); in root cause localization, it enhanced ACC_{10%} by 4.80% and ACC_{20%} by 3.93%.
|
|
09:00-18:30, Paper Mo-Online.175 | |
FreTime: Dual-Branch Frequency-Time Representation Learning for Time Series |
|
Zhu, Junyu | Xinjiang University |
Zuo, Enguang | Xinjiang University |
Wang, Wenyan | Xinjiang University |
Sun, Ruishuang | Xinjiang University |
Yan, Ziwei | Xinjiang University |
Chen, Chen | Xinjiang University |
Chen, Cheng | Xinjiang University |
Lv, Xiaoyi | Xinjiang University |
Keywords: Representation Learning
Abstract: Time series analysis plays a fundamental role in revealing data evolution, trends, and cyclical patterns. However, existing studies often fail to effectively address the dynamic dependencies between variables in multidimensional time series and the temporal evolution patterns within variables, thereby limiting the effectiveness of complex time series feature analysis. In this paper, we propose a dual-branch frequency-time interactive representation learning model (FreTime) that captures the correlations between variables and the temporal dependencies within variables through a collaborative architecture spanning the time domain and frequency domain. The time-domain branch uses an inverse Transformer architecture to model cross-variable interactions, while the frequency-domain branch utilizes multiscale gated convolutions to capture features and map them back to the time domain. Finally, global representations are obtained by interactively fusing the representations learned from the two branches in the time domain. Experiments demonstrate that FreTime achieves state-of-the-art performance on long sequence prediction, classification, and anomaly detection tasks, and exhibits strong robustness in noisy environments.
|
|
09:00-18:30, Paper Mo-Online.176 | |
A Novel Multi-Objective Suboptimal Tracking Control Approach Based on Robust Fuzzy Model Predictive Control and Policy Iteration |
|
Guo, Xuyang | National University of Defense Technology |
Sun, ZhenPing | National University of Defense Technology (NUDT) |
Keywords: Fuzzy Systems and their applications, Neural Networks and their Applications, Optimization and Self-Organization Approaches
Abstract: Tracking control, a traditional subject in the industrial domain, has garnered extensive research over the past several decades and has been broadly applied in many different contexts, encompassing vehicle path tracking tasks and robotic control. In this study, a novel multi-objective suboptimal control strategy for discrete-time nonlinear tracking control systems is presented, integrating the robust fuzzy model predictive control (RFMPC) approach with an adaptive dynamic programming (ADP) method. First, the original system model is reformulated into a T-S fuzzy system, and an infinite-horizon RFMPC algorithm is developed to guarantee input-to-state stability (ISS) of the system. Subsequently, the policy iteration technique is utilized to perform multi-objective optimization for the RFMPC in the Pareto sense, resulting in the determination of a Pareto optimal tracking controller. The efficacy and benefits of the proposed algorithm are validated through a series of comparative simulations.
|
|
09:00-18:30, Paper Mo-Online.177 | |
Quantized Model Predictive Control for Nonlinear Systems with Delays and Parametric Uncertainties |
|
Guo, Xuyang | National University of Defense Technology |
Sun, ZhenPing | National University of Defense Technology (NUDT) |
Keywords: Fuzzy Systems and their applications, Optimization and Self-Organization Approaches, Cybernetics for Informatics
Abstract: This study develops a quantized control framework for nonlinear systems with state delays, external disturbances, and bounded parametric uncertainties. First, a static logarithmic quantizer is integrated into the controller architecture, where its participation form within the actual computations is determined through theoretical analysis based on the sector bound approach. The inherent nonlinear dynamics are subsequently decomposed into a group of interconnected linear subsystems via the Takagi-Sugeno (T-S) fuzzy modeling paradigm. Then by constructing a Lyapunov-Krasovskii functional (LKF), the input-to-state stability (ISS) is established, culminating in the derivation of a robust fuzzy model predictive control (RFMPC) algorithm underpinned by rigorous proofs. Numerical simulations validate the claims, demonstrating the effectiveness of the proposed method.
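For reference, a static logarithmic quantizer of the kind analysed above maps each input to a level of a geometric grid {+-rho^i * u0} and satisfies the sector bound |q(u) - u| <= delta * |u| with delta = (1 - rho)/(1 + rho). A small sketch with illustrative rho and u0:
```python
import math

def log_quantizer(u, rho=0.5, u0=1.0):
    """Static logarithmic quantizer with levels {+-rho**i * u0} and q(0) = 0.
    Satisfies the sector bound |q(u) - u| <= delta*|u|, delta = (1-rho)/(1+rho)."""
    if u == 0.0:
        return 0.0
    # pick the level whose quantization interval contains |u|
    i = math.ceil(math.log(2.0 * abs(u) / (u0 * (1.0 + rho)), rho))
    return math.copysign(rho ** i * u0, u)

rho = 0.5
delta = (1.0 - rho) / (1.0 + rho)
worst = max(abs(log_quantizer(u, rho) - u) / abs(u)
            for u in [x / 100.0 for x in range(1, 500)])
print(f"delta = {delta:.3f}, worst observed relative error = {worst:.3f}")
```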
|
|
09:00-18:30, Paper Mo-Online.178 | |
Temporal-Aware Neural Tucker Factorization for Traffic Data Imputation |
|
Hou, Yikai | Southwest University |
Tang, Peng | Southwest University |
Keywords: Big Data Computing,, Neural Networks and their Applications, Machine Learning
Abstract: Modern transportation systems are integral to urban sustainability, economic vitality, and public safety. The effective analysis of spatiotemporal traffic patterns is critical for urban planning and smart city operations. However, real-world traffic monitoring systems often produce high-dimensional and incomplete data, making accurate traffic modeling and control extremely challenging. This paper proposes a Gated-Recurrent Neural Tucker Factorization (GARNT) model for robust traffic data imputation. It integrates neural Tucker factorization with gated recurrent units (GRUs) to jointly model spatial correlations and temporal dynamics. A GRU-based temporal encoder captures temporal dependencies across time steps, while a novel latent spatiotemporal feature interactive learning module models spatiotemporal interaction features based on the principle of Tucker decomposition. Extensive experiments on six real-world traffic datasets demonstrate that our model outperforms existing state-of-the-art methods in both accuracy and robustness. By combining the interpretability of tensor decomposition with the expressive power of deep learning, GARNT offers a principled and scalable solution for data imputation tasks in intelligent transportation systems.
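The spatiotemporal interaction module above builds on Tucker decomposition, which approximates the traffic tensor from a small core tensor and per-mode factor matrices; in GARNT a GRU would supply the temporal factors. A hedged numpy sketch of the reconstruction/imputation step only, with random factors standing in for learned ones:
```python
import numpy as np

rng = np.random.default_rng(0)

# toy traffic tensor: 30 sensors x 24 time steps x 7 days, with ranks (5, 4, 3)
I, J, K = 30, 24, 7
R1, R2, R3 = 5, 4, 3

G = rng.normal(size=(R1, R2, R3))     # core tensor (latent interactions)
U = rng.normal(size=(I, R1))          # sensor (spatial) factors
V = rng.normal(size=(J, R2))          # time-of-day factors (GRU output in GARNT)
W = rng.normal(size=(K, R3))          # day factors

# full Tucker reconstruction: X[i,j,k] = sum_{a,b,c} G[a,b,c] U[i,a] V[j,b] W[k,c]
X_hat = np.einsum('abc,ia,jb,kc->ijk', G, U, V, W)

# imputing a single missing entry only needs the corresponding factor rows
i, j, k = 3, 10, 2
x_ijk = np.einsum('abc,a,b,c->', G, U[i], V[j], W[k])
print(np.isclose(x_ijk, X_hat[i, j, k]))   # True
```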
|
|
09:00-18:30, Paper Mo-Online.179 | |
Dynamically Reconfigurable NPU Acceleration for Knowledge Loading in LLM Retrieval-Augmented Generation |
|
Lin, Peidong | University of Electronic Science and Technology of China |
Li, Jintao | University of Electronic Science and Technology of China |
Deng, Hui | University of Electronic Science and Technology of China |
Li, Shihong | University of Electronic Science and Technology of China |
Yu, Shui | Shen Zhen Institute for Advanced Study, UESTC |
Li, Yun | Shenzhen Institute for Advanced Study, University of Electronic |
Keywords: AI and Applications, Deep Learning, Computational Intelligence
Abstract: Retrieval-Augmented Generation (RAG) provides large language models (LLMs) with a means of retrieving relevant external knowledge, but its document parsing leads to increased latency and energy consumption. To address this issue, we propose a dynamically reconfigurable Neural Processing Unit (NPU) that accelerates both RAG document parsing and inference. By leveraging compute-in-memory fusion, dynamic convolution, and multi-level parallelism, our approach reduces memory transfer overhead and optimizes hardware resource allocation. Experimental results show that our design achieves a 1.8x speedup in document parsing and a 2.83x improvement in energy efficiency. Additionally, it achieves an 11.71% reduction in inference time and a 35.59% boost in energy efficiency over traditional CPU/GPU methods, offering a scalable solution for large-scale RAG tasks.
|
|
09:00-18:30, Paper Mo-Online.180 | |
HTF: Hierarchical Transformer Fusion for Fine-Grained Sketch-Based Retrieval |
|
Zhang, Ning | Zhejiang Normal University |
Yu, Mu Dan | College of Science & Technology Ningbo University |
Chao, Wang | Wuhan University |
Mithun, Mukherjee | Nanjing University of Information Science |
Keywords: Deep Learning, Machine Vision, Machine Learning
Abstract: Fine-grained sketch-based image retrieval (FG-SBIR) aims to identify target photographs through minimal sketch input, yet current approaches face critical limitations in balancing local feature precision with global contextual awareness. Although existing methods predominantly focus on stroke-level variations through convolutional architectures, they often neglect the crucial interplay between detailed sketch patterns and holistic structural semantics. This oversight leads to suboptimal performance during the early retrieval phases when the sketch input remains incomplete and abstract. To bridge this critical gap, we present a novel Hierarchical Transformer Fusion (HTF) framework that synergizes dual-branch feature extraction with reinforcement learning optimization. Our architecture introduces three key innovations: First, a dual-branch encoder combines Swin Transformer for local stroke-level feature extraction with ResNet-50 and Transformer encoder for global semantic understanding, providing rich multi-level feature representations that facilitate effective cross-modal matching. Second, an adaptive fusion mechanism employs learnable weights to dynamically integrate multi-scale features from both branches, prioritizing semantically salient regions through content-aware attention. Third, we devise a reinforcement learning-driven cross-modal alignment strategy that optimizes retrieval policy through iterative feedback, allowing the model to accommodate diverse sketch patterns and user preferences. Experimental results confirm our method's superior early retrieval accuracy compared to existing FG-SBIR systems and baseline models.
|
|
09:00-18:30, Paper Mo-Online.181 | |
Semantic Token Enhancement and Clustering-Guided Activation for Weakly Supervised Semantic Segmentation |
|
Hou, Jingjing | Chongqing University |
Xu, Yuheng | Chongqing University |
Zhang, Taiping | Chongqing University |
Keywords: Machine Vision, Deep Learning, Machine Learning
Abstract: Weakly-supervised Semantic Segmentation (WSSS) methods based on image-level annotations often employ Class Activation Maps (CAM) to construct pseudo labels for dense prediction. Recently, Transformer-based WSSS approaches have attracted increasing attention due to the strong capability of Transformers in capturing global context. These methods typically generate object localization maps by modeling attention interactions between class tokens and patch tokens. However, the self-attention mechanism in Transformers tends to focus on only a limited number of tokens, resulting in sparse attention maps and consequently overlooking semantically relevant regions in the generation of pseudo labels. In this paper, we propose two novel attention mechanisms: SCAR (selective cross-attention refinement) and self-squared attention, both aimed at uniformly highlighting entire object regions. The SCAR module extracts localized features by restricting cross-attention operations to the predicted attention regions. Self-squared attention includes a clustering-aware module that groups similar patch tokens from the same object to guide activation. The proposed methods ultimately generate a clustering-guided class activation map, which captures more comprehensive region coverage. Experimental results on standard benchmark datasets demonstrate that our approach consistently surpasses previous WSSS approaches.
|
|
09:00-18:30, Paper Mo-Online.182 | |
MASF-YOLO: An Improved YOLOv11 Network for Small Object Detection on Drone View |
|
Lu, LiuGang | Sichuan Agricultural University |
He, DaBin | Sichuan Agricultural University |
Deng, ZhiXiang | Sichuan Agricultural University |
Liu, CongXiang | Sichuan Agricultural University |
Shen, Zhaoli | Sichuan Agricultural University |
Keywords: Deep Learning, Application of Artificial Intelligence
Abstract: With the rapid advancement of Unmanned Aerial Vehicle (UAV) and computer vision technologies, object detection from UAV perspectives has emerged as a prominent research area. However, the extremely small proportion of target pixels, significant scale variations of objects, and complex background information in UAV images greatly limit the practical applications of UAVs. To address these challenges, we propose a novel object detection network, Multi-scale Context Aggregation and Scale-adaptive Fusion YOLO (MASF-YOLO), which is developed based on YOLOv11. First, to tackle the difficulty of detecting small objects in UAV images, we design a Multi-scale Feature Aggregation Module (MFAM), which significantly improves the detection accuracy of small objects through parallel multi-scale convolutions and feature fusion. Second, to mitigate the interference of background noise, we propose an Improved Efficient Multi-scale Attention Module (IEMA), which enhances the focus on target regions through feature grouping, parallel sub-networks, and cross-spatial learning. Third, we introduce a Dimension-Aware Selective Integration Module (DASI), which further enhances multi-scale feature fusion by adaptively weighting and fusing low- and high-dimensional features. Finally, we conducted extensive performance evaluations of the proposed method on the VisDrone2019 dataset. Compared with YOLOv11-s, MASF-YOLO-s achieved improvements of 4.6% in mAP@0.5 and 3.5% in mAP@0.5:0.95 on the VisDrone2019 validation set. Remarkably, MASF-YOLO-s outperformed YOLOv11-m while requiring only approximately 60% of its parameters and 65% of its computational cost. Furthermore, comparative experiments with state-of-the-art detectors confirmed that MASF-YOLO-s maintains a clear competitive advantage in both detection accuracy and model efficiency.
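The MFAM described above runs parallel convolutions with different receptive fields and fuses their outputs. A minimal PyTorch sketch of that pattern; the channel count, kernel sizes, and residual fusion are assumptions rather than the paper's exact module:
```python
import torch
import torch.nn as nn

class MultiScaleAggregation(nn.Module):
    """Parallel 3x3 / 5x5 / 7x7 branches fused by a 1x1 convolution."""
    def __init__(self, channels: int):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv2d(channels, channels, k, padding=k // 2) for k in (3, 5, 7)
        ])
        self.fuse = nn.Conv2d(3 * channels, channels, kernel_size=1)
        self.act = nn.SiLU()

    def forward(self, x):
        feats = [branch(x) for branch in self.branches]
        return self.act(self.fuse(torch.cat(feats, dim=1))) + x  # residual fusion

x = torch.randn(1, 64, 80, 80)              # e.g. a P3-level feature map
print(MultiScaleAggregation(64)(x).shape)   # torch.Size([1, 64, 80, 80])
```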
|
|
09:00-18:30, Paper Mo-Online.183 | |
BT-D2: A Hierarchical Combat Task Decomposition Framework with Dynamic Constraint Injection in Behavior Tree |
|
Li, Qinglin | National University of Defense Technology |
Huang, Xueqin | National University of Defense Technology |
Zhu, Xianqiang | National University of Defense Technology, National Key Laborator |
Zhang, Sheng | National University of Defense Technology |
Ma, Dianqiu | National University of Defense Technology |
Liu, Qiting | National University of Defense Technology |
Keywords: Agent-Based Modeling, AI and Applications, Expert and Knowledge-Based Systems
Abstract: This paper addresses challenges such as semantic gaps, logical inconsistencies, and neglect of domain knowledge in task decomposition for complex systems, especially in military command scenarios. We propose the BT-D2 framework, a behavior tree-guided method that integrates large language models (LLMs) with structured domain knowledge to enable dynamic and executable task decomposition. By embedding military doctrines and equipment parameters into hierarchical behavior tree nodes, the framework creates a constraint-based template for transforming high-level instructions into tactical action sequences. Key components include an extended behavior tree architecture for dynamic parameter binding, a multi-agent collaborative workflow for strategic-to-tactical planning, and an adaptive mechanism for semantic alignment and conflict resolution. Experimental results demonstrate that BT-D2 outperforms baseline methods (CoT, ReAct, GTN) in semantic integrity, decomposition diversity, and structural rationality across 200 combat scenarios. Ablation studies validate the critical role of behavior trees in enhancing task executability and of the hierarchical architecture in improving planning efficiency. This work contributes a novel framework for bridging generative AI with domain-specific constraints, offering promising applications in military planning, logistics, and intelligent systems.
|
|
09:00-18:30, Paper Mo-Online.184 | |
QCSH: Quantization Controlled Semantic Hashing for Effective Similar Text Search |
|
Ye, Zihan | Sun Yat-Sen University |
Hou, Zhitian | Sun Yat-Sen University |
Lin, Ge | Sun Yat-Sen University |
Zeng, Kun | Sun Yat-Sen University |
Keywords: Deep Learning, Neural Networks and their Applications, Machine Learning
Abstract: With the rapid growth of digital information, efficient similarity search has become a crucial challenge in large-scale information retrieval. Semantic hashing provides an effective solution by encoding high-dimensional data into compact binary representations, thereby significantly improving retrieval efficiency. While deep learning-based semantic hashing has shown promising results, existing methods struggle to preserve fine-grained features and suffer from excessive quantization errors due to fixed-threshold hard binarization. To address these issues, we propose a novel unsupervised semantic text hashing framework, Quantization Controlled Semantic Hashing (QCSH), which enhances feature representation while refining the binarization process. QCSH integrates several specialized modules: it strengthens the extraction of effective features and, by jointly optimizing hash codes and quantization factors, captures their interdependence. Additionally, QCSH applies an optimized binarization strategy that preserves information fidelity and maximizes mutual information, effectively reducing quantization errors and ensuring a balanced bit distribution. Extensive experiments on three public datasets show that QCSH outperforms state-of-the-art baselines across various hash code lengths.
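A rough illustration of controlled binarization versus fixed-threshold hard binarization: scale the continuous codes by a quantization factor, relax the sign with tanh during training, and track the quantization error and bit balance that QCSH seeks to control. The names and the sharpening schedule are assumptions; the paper's exact losses may differ.
```python
import numpy as np

rng = np.random.default_rng(0)

def soft_binarize(h, q_factor, beta):
    """Relaxed binarization: tanh(beta * q * h) approaches sign(h) as beta grows."""
    return np.tanh(beta * q_factor * h)

def quantization_error(h, q_factor, beta):
    """Gap between the relaxed codes and their final {-1, +1} targets."""
    soft = soft_binarize(h, q_factor, beta)
    hard = np.sign(soft)
    return np.mean((soft - hard) ** 2)

def bit_balance(codes):
    """Mean bit value per dimension; 0 means a perfectly balanced bit."""
    return np.abs(np.sign(codes).mean(axis=0))

h = rng.normal(size=(1000, 64))        # 64-bit continuous codes for 1000 docs
q = 1.0                                # quantization factor (learnable in QCSH)
for beta in (1.0, 4.0, 16.0):          # sharpening schedule during training
    err = quantization_error(h, q, beta)
    bal = bit_balance(soft_binarize(h, q, beta)).mean()
    print(beta, err, bal)
```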
|
|
09:00-18:30, Paper Mo-Online.185 | |
Efficient GraphRAG Via Community Embedding for Global and Reasoning-Intensive Question Answering |
|
Yang, Qianjing | Central South University |
Chen, Kangkun | Central South University |
Liu, Xiyao | Central South University |
Zhang, Tiandu | Hunan University |
Huang, Da | The State Key Laboratory of High Performance Computing (HPCL) & |
Keywords: Application of Artificial Intelligence, Expert and Knowledge-Based Systems, Knowledge Acquisition
Abstract: Retrieval-augmented generation (RAG) enables large language models (LLMs) to answer questions without hallucinations by retrieving external knowledge. However, it still struggles with addressing global and reasoning-intensive questions. The recently proposed GraphRAG tackles these issues by organizing the knowledge base into hierarchical graph communities, but its retrieval mechanism results in high computational costs and overly generalized answers. To overcome these limitations, we propose a novel community embedding-based GraphRAG framework comprising three key components: a graph community encoder that embeds community descriptions into dense vectors for efficient knowledge graph retrieval, a hierarchical retriever that leverages the communities’ hierarchy to balance abstract and specific information, and an iterative question processor that recursively decomposes a multi-hop question to enable multi-step reasoning. Experiments demonstrate that our method reduces token consumption by 96.7% compared to GraphRAG while achieving comparable performance, and it outperforms other state-of-the-art RAG approaches in terms of answer quality.
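The first component, embedding community descriptions and retrieving them by vector similarity instead of passing every summary to the LLM, amounts to a standard dense-retrieval step. A minimal sketch with a stubbed encoder (a real text encoder would produce the vectors):
```python
import numpy as np

rng = np.random.default_rng(0)

def encode(texts, dim=128):
    """Stand-in text encoder: returns one L2-normalised vector per text."""
    vecs = rng.normal(size=(len(texts), dim))
    return vecs / np.linalg.norm(vecs, axis=1, keepdims=True)

communities = [
    "Community 0: papers on federated learning privacy",
    "Community 1: underwater sensor network routing",
    "Community 2: multimodal large language models",
]
community_vecs = encode(communities)             # built once, offline

def retrieve(query, k=2):
    q = encode([query])[0]
    sims = community_vecs @ q                    # cosine similarity (unit vectors)
    top = np.argsort(-sims)[:k]
    return [(communities[i], float(sims[i])) for i in top]

print(retrieve("How do GraphRAG communities help multi-hop questions?"))
```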
|
|
09:00-18:30, Paper Mo-Online.186 | |
FedUNL: Utilizing Noisy Labels for Improved Robustness in Federated Learning |
|
Chen, Cheng | Fuzhou University |
Tan, Zhou | Fuzhou University |
Lin, Ziyu | Fuzhou University |
Liu, Ximeng | Fuzhou University |
Keywords: Artificial Social Intelligence, Machine Vision, Deep Learning
Abstract: With the growing significance of federated learning (FL) in preserving data privacy, addressing the challenge of noisy labels has become critical. Conventional methods typically focus on filtering or reweighting noisy samples, often overlooking their inherent informational value. To exploit this value, we propose FedUNL, a novel framework for robust FL under label noise. FedUNL uses a Gaussian Mixture Model (GMM) to identify noisy clients and samples based on sample and class losses. Noisy models are then constructed with these samples, and knowledge distillation is employed to extract useful knowledge, supporting main model training and enabling effective noise correction. Distinct strategies are applied to train clean and noisy clients. Experimental results demonstrate that FedUNL improves accuracy by 0.31% to 3.41% over state-of-the-art (SOTA) methods, particularly excelling in challenging scenarios with high noise levels and non-IID data distributions.
|
|
09:00-18:30, Paper Mo-Online.187 | |
TADA: Enhancing Transferable Adversarial Attacks for Distribution Alteration Via Diffusion Model |
|
Zhu, Qiyu | Fuzhou University |
Lin, Xuanwei | Fuzhou University |
Chen, Hao | Fuzhou University |
Liu, Ximeng | Fuzhou University |
Keywords: Artificial Social Intelligence, Machine Vision, Deep Learning
Abstract: In black-box attack scenarios, enhancing the transferability of adversarial samples for targeted attacks presents significant challenges and is crucial for real-world applications. Most current research focuses on attack strategies that rely on data augmentation or generator training, which tend to perform well in untargeted attacks but exhibit limited effectiveness in targeted attack scenarios. To address this, we propose a novel framework for adversarial sample generation called TADA, designed to modify the sample distribution and improve transferability in both attack scenarios. TADA consists of two key components. First, we change the image distribution in the diffusion model by using the gradient of the deviation between images to guide the denoising process. Second, we introduce a fine-tuning strategy that uses the gradient norm of the label loss as a constraint to guide the loss toward a flat minimum and regularize the penalty for large weight values. With these two strategies, TADA can more effectively modify the original distribution of images to generate highly transferable adversarial examples. Through extensive experiments with various general-purpose training and defense models, TADA consistently outperforms state-of-the-art attack methods in both scenarios. Notably, the reconstructed images enhance the transferability of existing attack methods; for example, the CFM attack achieves a 12.3% increase in success rate for targeted attacks when transferring from Res-50 to Inc-Res-v2.
|
|
09:00-18:30, Paper Mo-Online.188 | |
Strong Predefined-Time Convergence Adaptive Zeroing Neural Network for Solving Time-Varying Complex Sylvester Equation and Application of Mobile Manipulator |
|
Wu, Yang | Hunan Normal University |
Qi, Zhaohui | Hunan Normal University |
Li, Shupeng | Hunan Normal University |
Keywords: Neural Networks and their Applications, Application of Artificial Intelligence, Cyborgs,
Abstract: This paper proposes a novel strong predefined-time convergence adaptive Zeroing Neural Network (SPC-AZNN) model for efficiently solving the time-varying complex Sylvester equation (TV-CSE). A key feature of strong predefined-time convergence is the ability to predefine an independent, non-conservative convergence time based on practical constraints and modeling inaccuracies, thereby achieving precise control effects. The error-based adaptive parameter is designed to align perfectly with the convergence process of the model, enabling optimal allocation of computational resources and significantly improving convergence efficiency. By integrating a novel activation function with the adaptive parameter, the model attains a streamlined parameter set. Simulation and comparative experiments demonstrate that the proposed SPC-AZNN model exhibits superior convergence speed and overall performance. Finally, its practicality is demonstrated through a trajectory tracking application for a 6-degree-of-freedom mobile manipulator.
|
|
09:00-18:30, Paper Mo-Online.189 | |
LMW: LLM-Driven Multi-Agent Workflow for Unmanned Platforms |
|
Huang, Xueqin | National University of Defense Technology |
Li, Qinglin | National University of Defense Technology |
Zhu, Cheng | National University of Defense Technology |
Zhu, Xianqiang | National University of Defense Technology, National Key Laborator |
Zang, Yuechao | Mind Live Technology Co., Ltd,National University of Defe |
Chen, Tinghao | Mind Live Technology Co., Ltd |
Zhang, Sheng | National University of Defense Technology |
Keywords: Agent-Based Modeling, AI and Applications, Application of Artificial Intelligence
Abstract: The agentic workflow based on large language models (LLMs) provides a new avenue for constructing autonomous task systems on unmanned platforms. However, the agentic workflow for unmanned platforms is a multi-step autonomous task process and therefore suffers from a significant cumulative transmission effect of hallucination (CTEH). Alleviating CTEH requires structured prompt engineering and precise knowledge enhancement. We find it challenging to directly and effectively handle the complex tasks associated with unmanned platforms using existing prompting methods. Therefore, we present the LLM-driven multi-agent workflow (LMW), a novel framework that enables a fully automated, closed-loop process from task discovery to execution without manual intervention. Specifically, LMW deploys agents at different process nodes to share the overall task pressure and complexity in the workflow. In addition, we introduce an efficient knowledge flow (EKF) in LMW, which reduces the token consumption of agents and further enhances the flexibility and performance of LMW. Experiments conducted on various benchmarks demonstrate that LMW achieves better accuracy than other LLM-based agents. Furthermore, we synthesize a simulation dataset, Unmanned Warehouse (UW), to verify the effectiveness of LMW in real-world scenarios. The results of the ablation experiments prove the importance of each LMW module.
|
|
09:00-18:30, Paper Mo-Online.190 | |
WG-DETR: A Wavelet Domain Enhancement-Based Graph Guided Fusion Model for Defect Detection
|
Zhang, Chao | Inner Mongolia University of Technology |
Wu, Wenhong | Inner Mongolia University of Technology |
Niu, Hengmao | Inner Mongolia Technical College of Construction |
Wu, Nier | Inner Mongolia University of Technology |
Shi, Bao | Inner Mongolia University of Technology |
Keywords: Image Processing and Pattern Recognition
Abstract: With the advancement of deep learning, surface defect detection has achieved remarkable progress. However, traditional detection methods employ single feature extraction strategies, leading to feature aliasing, information redundancy, and interference, and they fail to model complex non-linear feature dependencies. This reduces detection accuracy and increases error rates. We propose a Wavelet Domain Enhancement-Based Graph Guided Fusion Model for Defect Detection (WG-DETR). Our Anisotropic Direction Enhancement Unit (ADEU) enhances spatial features through optimized multi-directional filters, improving sensitivity to linear defects and robustness against non-homogeneous textures. The Wavelet Bidirectional Frequency Enhancement Unit (WBFU) uses Haar wavelets for frequency domain decomposition, creating bidirectional feature interaction channels that effectively isolate and enhance different frequency domain representations. Our Feature-similarity Graph-guided Unit (FGU) constructs graph topological structures for receptive field decoupling and joint modeling of local and global feature dependencies. Experimental results show WG-DETR achieves 82.4% and 35.7% on the mAP50 and mAP50-95 metrics, improvements of 4.8% and 2.8% over the baseline.
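For readers unfamiliar with the frequency split mentioned above, a single-level 2D Haar decomposition can be written in a few lines; this is generic background code, not the WBFU module itself.

```python
import numpy as np

def haar_dwt2(x):
    """Single-level 2D Haar decomposition of a feature map x (even H and W).

    Returns the low-frequency band and three high-frequency bands, illustrating
    the kind of frequency split used by wavelet-based enhancement units.
    """
    a = x[0::2, 0::2]  # even rows, even columns
    b = x[0::2, 1::2]  # even rows, odd columns
    c = x[1::2, 0::2]  # odd rows, even columns
    d = x[1::2, 1::2]  # odd rows, odd columns
    ll = (a + b + c + d) / 2.0
    lh = (a - b + c - d) / 2.0
    hl = (a + b - c - d) / 2.0
    hh = (a - b - c + d) / 2.0
    return ll, lh, hl, hh

ll, lh, hl, hh = haar_dwt2(np.random.rand(64, 64))
```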
|
|
09:00-18:30, Paper Mo-Online.191 | |
Multi-View Stereo with Multi-Depth Value Classification and Dynamic Sampling for Dense Scene Perception |
|
Dai, Jiawen | Hong Kong Metropolitan University |
Xiong, Yue | Ludwig Maximilian University of Munich |
Keywords: Deep Learning, Image Processing and Pattern Recognition, Machine Vision
Abstract: In recent years, learning-based approaches to multi-view stereo reconstruction have shown significant promise in acquiring detailed 3D representations of complex scenes. However, existing depth regression methods often overfit when processing depth values and are prone to inaccuracies at object boundaries and in areas affected by lighting occlusions. To address these issues, this paper proposes an innovative dense depth inference model with several critical contributions to the field: 1) we transform the depth value regression task into a multi-depth value classification task, which significantly improves inference accuracy while conserving depth sampling rates and minimizing GPU consumption, thereby addressing the overfitting problem in existing methods; 2) we introduce a dynamic sampling mechanism that refines the aggregation process of cross-view cost volumes, enhancing performance in complex scenes and improving depth estimation in challenging conditions; 3) we apply the Sinkhorn loss function in a multi-stage cascade structure to enhance the model’s accuracy and stability; 4) combining classification with dynamic sampling, our method improves depth estimation accuracy and computational efficiency, leading to more precise 3D reconstructions with reduced resource usage. Experimental results on the DTU and Tanks & Temples datasets validate the effectiveness of our method.
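A minimal sketch of the regression-to-classification idea in contribution 1): per-pixel probabilities over D depth hypotheses are converted into a depth map by a soft argmax. The tensor shapes and the use of softmax are assumptions; the paper's loss design and dynamic sampling are not shown.

```python
import torch
import torch.nn.functional as F

def depth_from_classification(cost_volume, depth_hypotheses):
    """Turn a per-pixel score volume over D depth hypotheses into a depth map.

    cost_volume:       [B, D, H, W] matching scores (higher = better here)
    depth_hypotheses:  [D] candidate depth values
    """
    prob = F.softmax(cost_volume, dim=1)                             # per-pixel class probabilities
    depth = (prob * depth_hypotheses.view(1, -1, 1, 1)).sum(dim=1)   # soft-argmax depth
    return prob, depth
```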
|
|
09:00-18:30, Paper Mo-Online.192 | |
Unifying Foundation Model and Segment Anything Model for Remote Sensing Weakly Supervised Semantic Segmentation |
|
Cui, Jinfeng | Hohai University |
Wu, Ming | Hohai University |
Li, Zhi | Nanjing Research Institute of Electronic Engineering |
Yao, Liang | Hohai University |
Wang, Yijun | Hohai University |
Xu, Guoyan | Hohai University |
Liu, Fan | Hohai University |
Keywords: Image Processing and Pattern Recognition, Application of Artificial Intelligence, Machine Vision
Abstract: Due to their reliance on fewer precise annotations, weakly supervised semantic segmentation (WSSS) techniques are in high demand in the field of remote sensing (RS) image processing. Although mainstream WSSS approaches achieve remarkable dense prediction accuracies, they still face challenges such as insufficient pre-trained knowledge and ambiguous segment predictions. To this end, we propose to improve the accuracy of weakly supervised semantic segmentation by unifying the vision language foundation model and the Segment Anything Model (SAM). Specifically, we leverage a remote sensing vision language foundational model, RemoteCLIP, to provide sufficient pre-trained knowledge. Subsequently, we employ a decoder to transform the high-level feature representations extracted by RemoteCLIP into the segmentation predictions. Then, we introduce a multi-prompt fusion (MPF) approach via the Segment Anything Model (SAM) to obtain high-quality segment results with well-defined boundaries. To the best of our knowledge, this is the first study to apply a unified framework of foundation model and Segment Anything Model for RS WSSS. Experimental results demonstrate that our method achieves remarkable performance across three remote sensing datasets. Our code will be publicly available once the paper is accepted.
|
|
09:00-18:30, Paper Mo-Online.193 | |
RiceViM: An Efficient Multi-Scale Attention Enhanced Framework for Rice Leaf Disease Detection Leveraging Vision Mamba |
|
Chen, Jingjia | Guangzhou University |
Hu, Yingbiao | Zhongkai University of Agriculture and Engineering |
Chen, Baoyu | Dongguan University of Technology |
Long, Yingjie | South China Normal University |
Li, Shuting | Macau University of Science and Technology |
He, Feiyong | Guangdong Polytechnic of Science and Technology |
Keywords: Deep Learning, Image Processing and Pattern Recognition
Abstract: Rice leaf disease severely affects crop yield and food security. Traditional deep learning methods often require large-scale labeled data and lack generalization in complex environments. To address these limitations, we propose RiceViM, an efficient framework for rice leaf disease classification that integrates an Efficient Multi-Scale Attention (EMA) module with the Vision Mamba state space model. EMA enhances the extraction of local and global features, while Vision Mamba captures long-range dependencies with low computational cost. Experimental results on a benchmark dataset show that RiceViM achieves a classification accuracy of 95.87%, outperforming baseline models in both accuracy and efficiency. Notably, the model requires only 9.98M parameters and 8.96 ms of single-image inference time, achieving a balance between high precision and low power consumption. The proposed method demonstrates strong potential for practical applications in intelligent agricultural systems.
|
|
09:00-18:30, Paper Mo-Online.194 | |
Blinding the Infrared Tracker: An Adversarial Attack Based on Heating-Expanding Perturbations |
|
Fan, Xingdi | Southwest University of Science and Technology |
Dong, Wanli | Southwest University of Science and Technology |
Pu, JiaHao | Southwest University of Science and Technology |
Chen, Hanyang | Southwest University of Science and Technology |
Li, Dong | Southwest University of Science and Technology |
Keywords: Information Assurance and Intelligence, Assurance, Machine Vision
Abstract: Adversarial attacks manipulate deep learning models into generating erroneous predictions by introducing subtle yet carefully crafted perturbations to input data. Such attacks serve dual roles: revealing model vulnerabilities while enhancing robustness. Although adversarial attacks are well-explored in visual tracking, their efficacy in infrared single-object tracking remains limited. To bridge this gap, we propose a novel heating-expanding attack to deceive SiameseRPN-based infrared trackers. By exploiting the inherent thermal-texture discrepancy between visible and infrared imaging modalities, we introduce a novel heating-expanding loss function. This loss function drives the training of an adversarial perturbation generator that synthesizes imperceptible thermal perturbations. When strategically applied at the target-background boundaries, these perturbations trigger progressive bounding box expansion during tracking. The accumulated interference induces complete tracker failure via boundary drift. Benchmark evaluations on LSOTB-TIR and PTB-TIR demonstrate our method's consistent precision reduction of 45.0 and 45.4 (surpassing SOTA by 7.8/10.2 pp). The approach sustains attack performance against other trackers, reducing norm precision by 12.6 for DaSiamRPN and 4.5 for TaMOs respectively without architecture-specific tuning, proving plug-and-play readiness for deployment.
|
|
09:00-18:30, Paper Mo-Online.195 | |
OLMP: Operator-Level Computational Graph Partition Mapping for Deep Learning |
|
Liu, Zhengyu | Information Engineering University |
Zhang, Fan | Fudan University |
Zhang, Fengzhe | Fudan University |
Song, Yijing | Fudan University |
Qi, Xiaofeng | Information Engineering University |
Gao, Yanzhao | Information Engineering University |
Zhang, Xinyi | Information Engineering University |
Keywords: Deep Learning
Abstract: Expanding the scale of deep neural networks (DNNs) is a fundamental approach to improving the performance and accuracy of the model. However, as DNN models continue to grow in complexity, a single computing accelerator is often inadequate to accommodate the entire model, leading to computational and memory bottlenecks. Distributing the computation graph of a DNN across multiple accelerators offers an effective solution to this challenge, enabling better resource utilization and improved performance scalability. Existing methods perform coarse-grained partitioning and mapping of the computational graph separately, often resulting in suboptimal solutions. This separation can lead to imbalanced workloads across accelerators, reducing pipeline efficiency and overall system performance. In this work, we propose OLMP to address these limitations, which jointly optimizes the partitioning and mapping processes. We formulate this problem as a combinatorial optimization task and employ Mixed Integer Programming to find an optimal solution that balances execution time, workload distribution, and memory constraints. To validate our approach, we conducted extensive experiments on a diverse set of DNN models with varying scales and architectures. Compared to existing approaches, OLMP achieves up to 2.64× speedup in training time. Experimental results demonstrate that OLMP outperforms existing methods, significantly reducing training time and enhancing load balancing across accelerators.
|
|
09:00-18:30, Paper Mo-Online.196 | |
Efficient Extended Neighborhoods Dynamic Selection Re-Ranking for Person Re-Identification |
|
Chao, Wang | Wuhan University |
Zhongyuan, Wang | Wuhan University |
Xiaochen, Wang | Center for Multimedia Software School of Computer Science Wuhan |
Hu, Ruimin | Wuhan University |
Mithun, Mukherjee | Nanjing University of Information Science |
Keywords: Machine Vision, Media Computing, Big Data Computing
Abstract: Person re-identification (re-ID) is a difficult retrieval task that requires matching a person's captured images across non-overlapping camera views, and re-ranking is a critical step in improving its accuracy. Previous techniques used the k-nearest-neighbors relationship to determine rank results, selecting only a fixed number k of pedestrian images for distance calculations; this operation, however, introduces additional distance errors due to changes in pedestrian appearance. This paper addresses the above issue by proposing a simple but effective Extended Neighborhood Dynamic Selection (ENDS) distance to optimize the performance of re-ID re-ranking. The number of selected images is distributed over an interval, with upper and lower limits placed on it to ensure that it is neither too low nor too high, so that adjacent images are selected automatically. The final distance is obtained by combining the ENDS distance with the Jaccard distance. The core principle of the method is that, instead of using fixed values, the number of images included in each neighborhood is chosen automatically, which also allows dissimilar images to be removed in favour of more representative ones. Experimental results on the Market-1501 and DukeMTMC-reID datasets demonstrate the effectiveness of the proposed re-ranking method, which improves mAP/Rank-1 on Market-1501 by 29.4%/12.9% and on DukeMTMC-reID by 35%/20.7%.
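The following sketch illustrates one plausible way to pick a neighborhood size dynamically within upper and lower bounds and to fuse the resulting distance with a Jaccard distance; the selection rule, parameter names, and weighting are assumptions made for illustration and may differ from ENDS.

```python
import numpy as np

def dynamic_neighbors(dist_row, k_min=5, k_max=30, ratio=1.3):
    """Choose a neighborhood for one probe from its distance row (sketch).

    Gallery images within `ratio` times the k_min-th nearest distance are kept,
    and the count is clipped to the interval [k_min, k_max].
    """
    order = np.argsort(dist_row)
    threshold = ratio * dist_row[order[k_min - 1]]          # scale of the k_min-th neighbor
    k = int(np.clip((dist_row <= threshold).sum(), k_min, k_max))
    return order[:k]

def combined_distance(d_neighborhood, d_jaccard, lam=0.3):
    """Fuse the neighborhood-based distance with the Jaccard distance."""
    return lam * d_neighborhood + (1.0 - lam) * d_jaccard
```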
|
|
09:00-18:30, Paper Mo-Online.197 | |
Design RNA with Specified Secondary Structures Using CDMs (I) |
|
Xiao, Zhenran | Shanghai Jiao Tong University |
Chen, Letian | Shanghai Jiaotong University |
Li, Yichong | Shanghai Jiao Tong University |
Yang, Yang | Shanghai Jiao Tong University |
Keywords: Biometric Systems and Bioinformatics, Deep Learning, Neural Networks and their Applications
Abstract: As RNA function is strongly tied to its secondary structure, designing RNA molecules with specified structures stands as a key challenge in computational biology. Existing approaches often either yield limited results or incur high computational costs. Building on the success of Denoising Diffusion Probabilistic Models (DDPMs) and their variants including Conditional Diffusion Models (CDMs) in diverse fields such as image generation, we apply CDMs to generate RNA sequences matching specified secondary structures, using One-Hot or static vocabulary encoding, classifier-free guidance, and a Transformer denoiser to capture long-range dependencies. Experiments on the Rfam dataset showed that our best model achieved a task-solving rate of 0.846, outperforming the current next-best method (≤0.648). Our models present better performance with a simpler architecture, which demonstrates the effectiveness of CDMs in sequence design problems. Future work may focus on optimizing the conditional encoding method and the architecture of the denoiser, as well as developing biological relevance metrics for generated sequences.
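Classifier-free guidance, one of the ingredients listed above, combines conditional and unconditional noise predictions as in the short sketch below; the denoiser interface and guidance weight are illustrative assumptions.

```python
import torch

def cfg_denoise(eps_model, x_t, t, structure_cond, guidance_weight=2.0):
    """Classifier-free guidance for a conditional diffusion denoiser (sketch).

    `eps_model(x, t, cond)` is a hypothetical Transformer denoiser that accepts
    cond=None for the unconditional branch.
    """
    eps_cond = eps_model(x_t, t, structure_cond)   # conditioned on the target secondary structure
    eps_uncond = eps_model(x_t, t, None)           # unconditional prediction
    return eps_uncond + guidance_weight * (eps_cond - eps_uncond)
```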
|
|
09:00-18:30, Paper Mo-Online.198 | |
FedPAC: A Federated Semi-Supervised Learning Approach for Non-IID Data with Feature Shift |
|
Guo, Xudong | Yunnan University |
Wang, Shuo | The University of Birmingham |
Keywords: Deep Learning, Representation Learning, Neural Networks and their Applications
Abstract: Federated Semi-Supervised Learning (FSSL) enables collaborative model training across distributed clients with limited labeled data while preserving data privacy. However, a critical challenge in FSSL is feature shift, where clients exhibit diverse feature distributions despite sharing the same task. To address this issue, we propose FedPAC, a novel FSSL framework that integrates Contrastive Mean-Teacher Regularization and Perturbation-Aware Gradient Descent. Our framework enhances feature representation learning by aligning feature distributions between teacher and student models and mitigates optimization challenges caused by feature heterogeneity through controlled gradient perturbations. Extensive experiments on benchmark datasets demonstrate that FedPAC outperforms existing FSSL methods in feature shift scenarios, making it a practical solution for real-world applications such as medical imaging and industrial fault diagnosis.
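As background for the mean-teacher component, the usual exponential-moving-average teacher update is shown below; FedPAC's contrastive regularization and perturbation-aware gradient step are not reproduced here.

```python
import torch

@torch.no_grad()
def update_teacher(teacher, student, momentum=0.99):
    """Exponential-moving-average teacher update used by mean-teacher schemes (generic sketch)."""
    for p_t, p_s in zip(teacher.parameters(), student.parameters()):
        p_t.mul_(momentum).add_(p_s, alpha=1.0 - momentum)
```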
|
|
09:00-18:30, Paper Mo-Online.199 | |
Ontology-Based Semantic Integration of Multi-Source Heterogeneous Industrial Devices (I) |
|
Wu, Rina | Beijing University of Technology |
Bi, Jing | Beijing University of Technology |
Wang, Ziqi | Zhejiang University |
Yuan, Haitao | Beihang University |
Zhao, Hailiang | Zhejiang University |
Liu, Yanan | Beijing University of Technology |
Zhang, Jia | Southern Methodist University |
Keywords: AI and Applications, Agent-Based Modeling
Abstract: The 3C manufacturing industry increasingly relies on industrial internet platforms to address challenges such as data fragmentation, low collaboration efficiency, and the lack of semantic adaptation in information models, thereby advancing digital transformation. Although the Object Linking and Embedding for Process Control Unified Architecture (OPC UA) standard supports industrial automation, it faces several limitations, including poor compatibility, ambiguous semantics, and incomplete models, which hinder automated reasoning and system interoperability. To address these issues, this work proposes an integrated framework that combines offline classification of heterogeneous devices using a Character-level Text Convolutional Neural Network (CTCNN) with online automated reasoning over OPC UA information models. CTCNN leverages character-level embeddings and convolutional layers to achieve fine-grained and accurate device type recognition. Furthermore, a deterministic Markov decision process is formulated, and an intelligent agent is trained to perform automatic reasoning based on the OPC UA model structure. This hybrid framework enables effective OPC UA device integration and interconnection. Experimental results demonstrate that CTCNN achieves up to an 18% improvement in device type identification precision, while the combined offline recognition and online reasoning approach enhances the accuracy of automatic OPC UA information model inference by approximately 12% compared to state-of-the-art methods.
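A minimal character-level text CNN of the kind referred to as CTCNN might look like the sketch below; the vocabulary size, kernel widths, and channel counts are assumptions, not the configuration used in the paper.

```python
import torch
import torch.nn as nn

class CharTextCNN(nn.Module):
    """Minimal character-level text CNN for device-type classification (sketch)."""
    def __init__(self, vocab_size=128, embed_dim=32, num_classes=10):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.convs = nn.ModuleList(
            nn.Conv1d(embed_dim, 64, kernel_size=k) for k in (3, 5, 7)
        )
        self.fc = nn.Linear(64 * 3, num_classes)

    def forward(self, char_ids):                  # char_ids: [B, L] integer character codes
        x = self.embed(char_ids).transpose(1, 2)  # [B, embed_dim, L]
        pooled = [conv(x).relu().max(dim=2).values for conv in self.convs]  # global max pooling
        return self.fc(torch.cat(pooled, dim=1))  # class logits
```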
|
|
09:00-18:30, Paper Mo-Online.200 | |
Latency-Minimized Computation Offloading in 3C Manufacturing Workshops (I) |
|
Liu, Yanan | Beijing University of Technology |
Bi, Jing | Beijing University of Technology |
Wang, Ziqi | Zhejiang University |
Zhang, Junqi | Beijing University of Technology |
Yuan, Haitao | Beihang University |
Zhang, Jia | Southern Methodist University |
Keywords: Computational Intelligence, Evolutionary Computation
Abstract: With the rapid advancement and integration of Internet of Things technology into manufacturing, industrial workshops in computer, communication, and consumer electronics (3C) manufacturing are increasingly confronted with complex computational tasks during production. However, the limited hardware resources and computational capabilities of local devices often hinder efficient task execution. Computational offloading offers a viable solution by allowing complex computational tasks to be processed on either edge or cloud servers, enhancing the efficiency of computational task handling in production environments. A critical challenge lies in optimizing task offloading among local devices, edge servers, and cloud servers to maximize production efficiency while ensuring reasonable task scheduling. To address this challenge, this work proposes a flexible computational offloading strategy based on an edge-cloud architecture in a smartphone manufacturing workshop. First, a framework for edge-cloud workshop manufacturing is constructed, integrating various smartphone production devices. Based on the edge-cloud framework, a constrained optimization problem for computation offloading is formulated, using latency as the objective in industrial production settings. A time consumption model is employed to optimize computational time, and a novel scheduling strategy named Ivy-Genetic Evolution Algorithm (IGEA) is designed to solve the scheduling problem. The IGEA integrates genetic operators into the Ivy Algorithm to introduce a randomness strategy. Experimental results demonstrate that IGEA significantly outperforms state-of-the-art approaches in optimizing production efficiency.
|
|
09:00-18:30, Paper Mo-Online.201 | |
Shapelet Temporal Evolution Graph Network for Water Quality Anomaly Detection (I) |
|
Wu, Xiangxi | Beijing University of Technology |
Bi, Jing | Beijing University of Technology |
Wang, Gongming | Beijing University of Technology |
Wang, Ziqi | Zhejiang University |
Li, Yibo | Beijing University of Technology |
Zhang, Junqi | Beijing University of Technology |
Yuan, Haitao | Beihang University |
Zhang, Jia | Southern Methodist University |
Chang, Xingyang | Beijing University of Technology |
Keywords: AI and Applications, Hybrid Models of Computational Intelligence
Abstract: Water quality anomaly detection refers to the identification of abnormal changes in water parameters, which is crucial for ensuring environmental safety and preventing contamination events. With the growing volume of water environment sensing data and increasing demand for intelligent, transparent water quality management systems, achieving accurate, rapid, and interpretable anomaly detection has become a critical challenge in early warning systems. To tackle this challenge, this work proposes an anomaly detection model named Shapelet Temporal Evolution Graph Network (STEG), which constructs time-aware Shapelets and adopts graph attention networks to build Shapelets evolution graphs, learning multidimensional dynamic relationships within and between time segments. By incorporating both local and global temporal evolution factors, the approach ensures the interpretability of both the detection process and its resulting outputs. Experiments on two real-world datasets show that STEG outperforms state-of-the-art methods in terms of anomaly detection accuracy and generalization. Moreover, it provides clear and transparent reasoning for water quality anomaly detection.
|
|
09:00-18:30, Paper Mo-Online.202 | |
Enhancing Visual Aesthetics in Stable Diffusion: A Reinforcement Learning Approach |
|
You, Junyong | NORCE Norwegian Research Centre |
Lin, Yuan | Kristiania University of Applied Science |
Hu, Bin | Kean University |
Keywords: Machine Vision, Deep Learning, Neural Networks and their Applications
Abstract: Generative models such as stable diffusion have recently achieved significant success in producing high-quality images conditioned on textual inputs. It is difficult to control the quality of generated images, such as aesthetic quality, after an image generative model has been trained. In this paper, we propose a novel method that leverages reinforcement learning (RL) to enhance the aesthetic appeal of images produced by stable diffusion models. An aesthetic assessment model is first developed by modelling the influencing factors of image aesthetics in a deep network and is then trained on publicly available datasets. The assessment model produces an aesthetic score for an image, which serves as the reward function in the RL framework. By reframing the denoising process of the stable diffusion model as a sequential decision-making problem, the intermediate denoising steps can be formulated as actions in a Markov decision process (MDP). The proximal policy optimization (PPO) algorithm can be applied in the RL framework to optimize the network parameters (i.e., U-Net) in the stable diffusion model, aiming to maximize the expected aesthetic reward. Through extensive experiments, we demonstrate that the proposed method can significantly improve the aesthetic quality of the generated images while maintaining their diversity and adherence to input prompts.
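The PPO ingredient can be summarized by the standard clipped surrogate loss, sketched below with the denoising steps treated as actions; the computation of log-probabilities and advantages for the U-Net policy is omitted, and the names are assumptions.

```python
import torch

def ppo_clip_loss(logp_new, logp_old, advantage, clip_eps=0.2):
    """Clipped PPO surrogate for denoising-step 'actions' (illustrative sketch).

    logp_new / logp_old: log-probabilities of the same actions under the current
    and the data-collecting policies; `advantage` is derived from the aesthetic reward.
    """
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * advantage
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantage
    return -torch.min(unclipped, clipped).mean()
```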
|
|
09:00-18:30, Paper Mo-Online.203 | |
MAR-Net: Multi-Scale Attention Refinement Network for Enhanced Medical Image Segmentation (I) |
|
Ma, Hongyao | Beijing University of Technology |
Bi, Jing | Beijing University of Technology |
Wang, Ziqi | Zhejiang University |
Li, Ning | Beijing University of Technology |
Yuan, Haitao | Beihang University |
Li, Yibo | Beijing University of Technology |
Zhang, Jia | Southern Methodist University |
Keywords: Deep Learning, Image Processing and Pattern Recognition
Abstract: Medical image segmentation remains challenging due to the complexity of lesion morphologies and the need for real-time clinical applicability. Here, we present MAR-Net, a novel framework that integrates adaptive attention mechanisms, hierarchical contextual modeling, and efficient training strategies to address these challenges. The architecture employs a CBAM-based dual-attention module to dynamically enhance discriminative features while suppressing redundant information, improving lesion boundary localization. Cascaded dilated convolutions expand the receptive field for global context capture, complemented by a multi-scale decoder that integrates deep semantic and shallow spatial features. A multi-scale training strategy with hierarchical loss supervision optimizes model adaptability without compromising inference efficiency. Experimental validation on ISIC and CholecSeg8K datasets demonstrates MAR-Net’s superiority: it outperforms mainstream methods across segmentation accuracy, recall rate, and other metrics, achieving notable improvements for complex lesions of varying sizes. Notably, MAR-Net maintains high performance on both dermoscopic and laparoscopic images, showcasing its broad applicability. These results establish MAR-Net as a robust solution for medical image segmentation, balancing precision and efficiency to enable practical clinical deployment.
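For reference, a standard CBAM-style dual-attention block (channel attention followed by spatial attention) is sketched below; MAR-Net's exact variant may differ in details such as pooling and reduction ratio.

```python
import torch
import torch.nn as nn

class CBAM(nn.Module):
    """Standard CBAM-style dual attention: channel attention, then spatial attention (generic sketch)."""
    def __init__(self, channels, reduction=16, spatial_kernel=7):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels),
        )
        self.spatial = nn.Conv2d(2, 1, spatial_kernel, padding=spatial_kernel // 2)

    def forward(self, x):                                    # x: [B, C, H, W]
        avg = self.mlp(x.mean(dim=(2, 3)))                   # channel attention from average pooling
        mx = self.mlp(x.amax(dim=(2, 3)))                    # and from max pooling
        x = x * torch.sigmoid(avg + mx)[:, :, None, None]
        s = torch.cat([x.mean(dim=1, keepdim=True),
                       x.amax(dim=1, keepdim=True)], dim=1)  # 2-channel spatial descriptor
        return x * torch.sigmoid(self.spatial(s))
```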
|
|
09:00-18:30, Paper Mo-Online.204 | |
Dual-Model Semi-Supervised Anterior Segment Structure Segmentation Using Mamba (I) |
|
Ouyang, Dong | Wuhan University of Science and Technology |
Liu, Xiaoming | Wuhan University of Science and Technology |
Zhang, Ying | Wuhan Aier Hospital |
Wu, Guohuan | Aier Hospital |
Tang, Jinshan | George Mason University |
Keywords: Deep Learning
Abstract: Accurate segmentation of key anatomical structures in anterior segment OCT (AS-OCT) images is critical for diagnosing serious ophthalmic conditions such as keratitis and cataract. However, due to the scarcity of labeled data in this domain, most existing methods struggle to precisely segment both the lens and the anterior chamber angle simultaneously. To address these limitations, we propose a semi-supervised segmentation framework based on collaborative training between U-Net and Mamba-UNet. A Scale Fusion Module (SFM) is introduced to integrate the outputs of both models, generating multi-scale predictions and fused pseudo-labels. A multi-scale supervision strategy is then employed to guide learning at different levels. Additionally, we design a novel anatomical structure consistency loss that leverages anatomical properties from the fused pseudo-labels to preserve anatomical correctness. Experimental results on two AS-OCT datasets demonstrate the effectiveness and superiority of our proposed approach.
|
|
09:00-18:30, Paper Mo-Online.205 | |
Intelligent Offloading of Dependent Tasks for UAV-Relaying-Assisted Mobile Edge Computing (I) |
|
Li, Jingyao | Beihang University |
Yuan, Haitao | Beihang University |
Guo, Dengyu | Beihang University |
Bi, Jing | Beijing University of Technology |
Zhang, Jia | Southern Methodist University |
Keywords: Evolutionary Computation, Metaheuristic Algorithms, Intelligent Internet Systems
Abstract: Mobile edge computing (MEC) has emerged as a promising paradigm for effectively addressing network overload and communication latency caused by rapid user growth. Unmanned aerial vehicles (UAVs) have recently gained attention as flexible edge servers due to their exceptional mobility and operational flexibility. However, the optimization of UAV relays remains a persistent challenge in UAV-assisted MEC systems. Previous research has inadequately investigated the critical challenges of the data transmission path in UAV relay systems. This work proposes a graph-theoretic framework for modeling data transmission paths in UAV relay networks, and it innovatively classifies UAVs into three roles on their relay paths. The UAV-assisted MEC architecture considers task dependencies, offloading locations, and CPU frequency allocation to achieve joint reductions in both system latency and energy consumption. To solve this complex optimization problem, this work develops a novel hybrid multi-objective optimization algorithm improving the mutation in Non-dominated Sorting Genetic Algorithm II (NSGA-II) with Simulated Annealing (SA), called NSGA-II with SA-based Mutation (NSGSAM). Pareto front analysis reveals significant improvements in solution quality and convergence behavior. Experimental results demonstrate the superior performance of NSGSAM compared with other meta-heuristic approaches through comprehensive evaluations, showing a 39.08% reduction in time consumption and 23.96% improvement in energy consumption compared with NSGA-III.
|
|
09:00-18:30, Paper Mo-Online.206 | |
Automated Nuclear Cataract Grading in AS-OCT Using Mamba Architecture (I) |
|
Lei, Tianxiang | Wuhan University of Science and Technology |
Liu, Xiaoming | Wuhan University of Science and Technology |
Zhang, Ying | Wuhan Aier Hospital |
Wu, Guohuan | Aier Hospital |
Tang, Jinshan | George Mason University |
Keywords: Deep Learning
Abstract: Cataract remains one of the leading causes of blindness and visual impairment worldwide, representing a significant public health concern. Anterior segment optical coherence tomography (AS-OCT) provides high-resolution visualization of ocular structures and has become a key imaging modality for nuclear cataract (NC) grading. However, existing convolutional neural network (CNN)-based methods often struggle to differentiate subtle variations between adjacent severity levels due to limited capacity for capturing long-range dependencies, thereby affecting classification accuracy. To address this challenge, we propose an automatic nuclear cataract grading network based on the Mamba architecture. This framework combines the local feature extraction capabilities of traditional CNNs with the long-range dependency modeling power of Mamba modules. Furthermore, we introduce a Hybrid Wavelet Feature Refinement Module (HWFRM), which employs wavelet transforms to extract multi-frequency representations. Integrated with a detail-guided enhancement mechanism, the module adaptively strengthens discriminative features. Channel and spatial attention mechanisms are applied to each wavelet sub-band, enabling the network to selectively emphasize important frequency components and remain sensitive to both structural and fine-detail cues. Finally, an ordinal regression loss is incorporated to explicitly model the progressive nature of cataract severity, improving the network’s ability to reduce misclassifications between adjacent categories. Extensive experiments on both a local AS-OCT dataset and a public benchmark demonstrate that our approach achieves state-of-the-art performance.
|
|
09:00-18:30, Paper Mo-Online.207 | |
HMRNet: A Heterogeneous Multi-Relational Graph Neural Network for Financial Fraud Detection |
|
Huang, Wei | Beijing University of Posts and Telecommunications |
Zhiyi, Song | Beijing University of Posts and Telecommunications |
Xie, Weisheng | Bestpay AI Lab |
Xiangxiang, Gao | Bestpay AI Lab |
Fu, Xiangling | Beijing University of Posts and Telecommunications |
Keywords: Application of Artificial Intelligence, Deep Learning, Representation Learning
Abstract: Financial fraud causes global economic losses amounting to tens of billions of dollars each year. Existing methods usually employ graph neural networks (GNNs) to utilize the valuable information in data for fraud detection. However, most traditional methods rely on the homogeneity assumption, which overlooks the heterogeneous characteristics introduced by the increasing disguise of fraudster behavior patterns and the complex multi-relational nature of financial transactions, leading to the poor performance of fraud detection models. To address these issues, we propose HMRNet, a novel approach that mitigates the impact of heterogeneity and multi-relationality on detection performance through multi-frequency information fusion and multi-relation fusion. Specifically, three key steps, node representation encoding, intra-relation aggregation, and inter-relation aggregation, are designed to learn information from different nodes and relations. Extensive experiments demonstrate that our proposed method not only outperforms all baseline methods but also exhibits good robustness across different datasets.
|
|
09:00-18:30, Paper Mo-Online.208 | |
Bipartite Graph Black-Box Adversarial Attacks Based on Implicit Relations |
|
Deng, Bowen | Sichuan Normal University |
Feng, Lin | Sichuan Normal University |
Qin, Shuo | Sichuan Normal University |
Yang, Fancheng | Sichuan Normal University |
Li, Siwen | Sichuan Normal University |
Keywords: Neural Networks and their Applications
Abstract: Bipartite graph representation learning has been successfully applied in downstream applications such as link prediction and recommendation tasks. However, the performance of existing bipartite graph representation learning models is extremely vulnerable to adversarial perturbations, resulting in incorrect predictions in downstream tasks. To address this issue, we propose Bipartite graph Black-box adversarial Attack based on Implicit relations (BBAI). First, we extract the explicit and implicit relations in the bipartite graph to generate the entity relationship matrix. Then, we apply the Feature Perturbation Theory to calculate the perturbation score of each candidate edge and flip the top few candidate edges with the highest scores. Finally, we train on the perturbed bipartite graph to generate bipartite graph node embeddings. We use the embeddings for downstream tasks to evaluate the quality of the embeddings. Experimental results show that BBAI can significantly disrupt the performance of bipartite graph embeddings by perturbing the structure of bipartite graphs. Supplemental materials including code and data are available at https://github.com/DengBW-1998/BBAI.
|
|
09:00-18:30, Paper Mo-Online.209 | |
FedIFPG: Personalized Federated Learning Based on Information Fusion and Prototype Guidance |
|
Wang, Peng | Beihang University |
Liu, Xiaoyi | Beihang University |
Mi, ZhiLong | Beihang University |
Guo, Binghui | Beihang University |
Yin, Ziqiao | Beihang University |
Shen, Zihang | Beihang University |
Keywords: Deep Learning, Artificial Social Intelligence
Abstract: Personalized Federated Learning (PFL) aims to tailor models for individual clients while leveraging shared knowledge across a decentralized system. However, PFL faces challenges such as global model drift and global-local knowledge discrepancy, particularly under heterogeneous data distributions and client dropouts. To address these challenges, we propose FedIFPG, which integrates an information fusion module, a dynamic temporal weighting mechanism, and a global class prototype guidance strategy. The information fusion module, incorporating a dynamic temporal weighting mechanism, effectively balances global and local information, mitigating the effects of client drift and global-local bias. Simultaneously, the global class prototype guidance strategy enhances semantic consistency for the same class across different client models. Experiments on four benchmark datasets demonstrate that FedIFPG consistently outperforms SOTA methods in both accuracy and robustness.
|
|
09:00-18:30, Paper Mo-Online.210 | |
CTVD: Collaborative Training of Deep Learning and Large Model for C/C++ Source Code Vulnerability Detection |
|
Zheng, Yaning | Institute of System Engineering, Academy of Military Sciences |
Wang, Dongxia | Institute of System Engineering, Academy of Military Sciences |
Cao, Huayang | Institute of System Engineering, Academy of Military Sciences |
Qian, Cheng | Institute of System Engineering, Academy of Military Sciences |
Zhuang, Honglin | Institute of System Engineering, Academy of Military Sciences |
Keywords: Application of Artificial Intelligence, Deep Learning, Neural Networks and their Applications
Abstract: As software systems grow in complexity, source code vulnerability detection becomes crucial for software security. Existing methods, whether sequence-based or graph-based, face limitations in accurately detecting vulnerabilities. Sequence-based models often struggle with capturing code structure, while graph-based models have difficulty handling long-distance contextual relationships. To overcome these challenges, we propose a collaborative training framework that unifies a graph-based deep learning module and a semantic-rich large model module. The deep learning module, based on graph neural networks (GNNs), captures code structural information, and the large model module, leveraging pre-trained large language models (LLMs), understands code semantics. Through an iterative collaborative training mechanism, the two modules exchange information and learn from each other. Experimental results on three public datasets (Big-Vul, Reveal, and Devign) demonstrate the superiority of our approach. Compared with baseline models, our collaborative training model (CTVD) achieves significant improvements in accuracy, recall, precision, and F1-score. For example, on the Big-Vul dataset, our model's accuracy reaches 86.5%, outperforming the deep learning module alone by 8.3% and the large model module alone by 6.4%. Compared with the latest co-training method, Vul-LMGNN, CTVD also achieves better performance on the DiverseVul dataset. We applied CTVD in real projects and found seven undisclosed vulnerabilities, all of which were reported and included in the CNNVD. In conclusion, our proposed collaborative training framework effectively combines the strengths of deep learning and large model modules, providing a more accurate and reliable solution for source code vulnerability detection.
|
|
09:00-18:30, Paper Mo-Online.212 | |
Efficient Service Function Chaining in LEO Satellite Networks with Genetic Simulated Annealing-Based Particle Swarm Optimization |
|
Guo, Dengyu | Beihang University |
Yuan, Haitao | Beihang University |
Bi, Jing | Beijing University of Technology |
Zhang, Jia | Southern Methodist University |
Keywords: Intelligent Internet Systems, Hybrid Models of Computational Intelligence, Computational Intelligence in Information
Abstract: This work tackles the challenging problem of Virtual Network Function (VNF) deployment in Low Earth Orbit (LEO) satellite networks, which is a critical task to enable efficient service provisioning in space-based communication systems. Optimizing VNF placement across various satellite nodes is complicated by limited computational resources, stringent communication constraints, and complex service requirements. This work addresses this challenge by formulating the problem as a mixed-integer nonlinear program to maximize the service acceptance rate while ensuring balanced resource utilization. To solve it, this work proposes a hybrid metaheuristic algorithm called Genetic Simulated annealing-based Particle Swarm Optimization (GSPSO). GSPSO effectively integrates the exploration capabilities of genetic algorithms, the local-optima avoidance of simulated annealing, and the convergence efficiency of particle swarm optimization. GSPSO dynamically adjusts key parameters, including inertia weight and mutation probability, to balance exploration and exploitation throughout the optimization process. Experimental results demonstrate that GSPSO exhibits superior convergence behavior and solution stability across various network scales and significantly outperforms state-of-the-art methods.
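A generic sketch of how PSO velocity updates, GA-style mutation, and an SA-style acceptance test can be combined is given below; the update rule, mutation model, and parameter handling are illustrative assumptions rather than the GSPSO algorithm itself.

```python
import math
import random

def hybrid_pso_step(position, velocity, pbest, gbest, objective,
                    w=0.7, c1=1.5, c2=1.5, temp=1.0, mut_prob=0.1):
    """One particle update mixing PSO velocity, GA-style mutation, and an SA-style acceptance test."""
    new_v, new_x = [], []
    for x, v, pb, gb in zip(position, velocity, pbest, gbest):
        v = w * v + c1 * random.random() * (pb - x) + c2 * random.random() * (gb - x)
        x = x + v
        if random.random() < mut_prob:        # GA-style Gaussian mutation
            x += random.gauss(0.0, 0.1)
        new_v.append(v)
        new_x.append(x)
    delta = objective(new_x) - objective(position)
    # SA-style acceptance: always keep improvements, occasionally keep worse moves.
    if delta <= 0 or random.random() < math.exp(-delta / temp):
        return new_x, new_v
    return position, velocity
```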
|
|
09:00-18:30, Paper Mo-Online.213 | |
An Approach for Attack Chain Context Inference and Completion Based on Large Language Models |
|
Du, Dan | Institute of Information Engineering, Chinese Academy of Science |
Zhao, Changzhi | Institute of Information Engineering, CAS |
Li, Yunpeng | Institute of Information Engineering, Chinese Academy of Science |
Han, Dongxu | Institute of Information Engineering,Chinese Academy of S |
Li, Ning | Institute of Information Engineering,Chinese Academy of S |
Liu, Yuling | Institute of Information Engineering, Chinese Academy of Science |
Jiang, Bo | Institute of Information Engineering, Chinese Academy of Science |
Lu, Zhigang | Institute of Information Engineering, Chinese Academy of Science |
Keywords: AI and Applications, Application of Artificial Intelligence, Knowledge Acquisition
Abstract: Alert underreporting presents a significant challenge to the reconstruction of attack chains, as it often leads to the absence of critical information necessary for fully presenting the entire attack. To address this issue, this paper proposes an approach for attack chain context inference and completion based on Large Language Models (LLMs). By integrating an attack knowledge base, this approach leverages LLM-driven inference to identify missing attack stages and uncover potential attack behaviors. Experimental results demonstrate that this approach can effectively detect omitted alerts and complete the attack chain, thereby enhancing the integrity of attack detection.
|
|
09:00-18:30, Paper Mo-Online.214 | |
XOR-Fuse: Logical Operation-Driven Complementary Feature Fusion for Infrared-Visible Images under Variable Illumination |
|
Feng, Chenglin | University of Electronic Science and Technology of China |
Liu, Haoyu | University of Electronic Science and Technology of China |
Zhang, JunYao | University of Electronic Science and Technology of China |
Zhang, Yichen | University of Electronic Science and Technology of China |
Wu, Shaozhi | Yangtze Delta Region Institute (Quzhou), University of Electroni |
Liu, Xingang | University of Electronic Science and Technology of China |
Imran, Muhammad Ali | University of Glasgow |
Zhang, Lei | University of Glasgow |
Keywords: Machine Vision, Deep Learning, Machine Learning
Abstract: Multi-modal image fusion, a pivotal technique in computer vision, integrates data from various imaging sources to enhance the overall performance in applications such as clinical diagnosis, remote sensing, and autonomous driving. This study addresses the challenge of fusing visible light (VIS) and infrared (IR) images, which are characterized by distinct features and limitations. Traditional fusion methods struggle with varying illumination conditions and the lack of a unified strategy to extract useful information from both modalities. We propose a novel fusion method, XOR-Fuse, which leverages logical XOR operations to drive complementary feature fusion. By designing a logic operation-driven XOR loss function, our method explicitly captures complementary pixel-level discrepancies between modalities, particularly targeting scenarios where overexposed visible (VIS) regions dominate fusion outputs. This mechanism prioritizes IR thermal signatures that are conventionally masked by maximum-intensity fusion rules in high illumination conditions. To reinforce IR feature preservation, we integrate multi-scale Gabor wavelet filtering and wavelet decomposition for illumination-invariant texture extraction and VGG-based semantic constraints, ensuring structural congruence between IR and VIS details. Our approach is validated on three datasets: MSRS, RoadScene, and TNO. We enhance four models—DeepFuse, SDNet, U2Fusion, and DATFuse—with our XOR-Fuse method. Results show significant improvements in spatial frequency, mutual information, and visual information fidelity, while SSIM scores remain stable. Our enhanced DeepFuse model achieves a spatial frequency of 48.27 and a visual information fidelity of 0.62 on the MSRS dataset. The qualitative evaluation confirms accurate representation of IR thermal radiation, especially in bright conditions.
|
|
09:00-18:30, Paper Mo-Online.215 | |
Best Initialization Vectors: Image Dimensionality Reduction and Linear Feature Analysis |
|
Yang, Yichen | Inner Mongolia University |
Hou, HongXu | Inner Mongolia University |
Chen, Wei | Inner Mongolia University |
Keywords: Image Processing and Pattern Recognition, Machine Vision, Machine Learning
Abstract: In high-dimensional feature extraction tasks, probabilistic methods integrated with machine learning processes have become mainstream. However, while these methods help alleviate model complexity, they often introduce additional training burdens. To simultaneously achieve effective feature extraction and reduced computational cost, we propose a novel linear dimensionality reduction method called Best Initialization Vector (BIV), which leverages the principle of basis transformation to reduce the dimensionality of images. Specifically, we exploit the properties of matrix space by initializing a vector within the space as a parameter, and applying basis transformation to perform computations. This enables extreme dimensionality reduction of high-dimensional image data into vector representations, allowing the use of NLP models for image feature extraction. Our approach significantly reduces the number of parameters while maintaining compatibility with various NLP modules. To evaluate its effectiveness, we conducted experiments on multiple datasets. The results demonstrate that our method outperforms existing mainstream approaches under extreme dimensionality reduction scenarios.
|
|
09:00-18:30, Paper Mo-Online.216 | |
Long-Term Time Series Forecasting with Variational Mode Decomposition and Former-Style Models |
|
Cao, Yue | Beihang University |
Yuan, Haitao | Beihang University |
Kuang, Zhenwei | Beihang University |
Bi, Jing | Beijing University of Technology |
Li, Xingzi | Xingtang Telecommunications Technology Company Limited |
Zhang, Jia | Southern Methodist University |
Keywords: Cloud, IoT, and Robotics Integration, Deep Learning, AI and Applications
Abstract: Time series forecasting techniques have significant value in industrial production, financial markets, and energy management domains. Accurate prediction of future time series is vital for decision-making and operational optimization. However, existing methods often face two significant difficulties, i.e., insufficient available training data and the challenging requirement for long-term predictions, where errors accumulate over time. These limitations highlight the need for more reliable long-term forecasting methods. This work proposes a novel hybrid approach that combines VMD with attention-based transformer-style models to address these challenges. The VMD module decomposes raw sequences into simpler, more stable elements, reducing noise and irregular patterns. The transformer-style model captures long-range dependencies through its attention mechanism to identify meaningful long-term relationships in data. The proposed method is evaluated on multiple benchmark datasets representing different real-world scenarios, including ETT, ECL, and Traffic. The experimental result shows that the proposed method reaches higher prediction accuracy than existing methods across all datasets and prediction horizons, especially in long-term forecasting scenarios.
|
|
09:00-18:30, Paper Mo-Online.217 | |
SVBTformer: A Decomposition‑Enhanced Hybrid Transformer for Long‑term Time Series Forecasting |
|
Kuang, Zhenwei | Beihang University |
Yuan, Haitao | Beihang University |
Yang, Jinhong | CSSC Systems Engineering Research Institute |
Wang, Yi | Hainan Daily Press Group |
Bi, Jing | Beijing University of Technology |
Zhang, Jia | Southern Methodist University |
Keywords: AI and Applications, Application of Artificial Intelligence, Deep Learning
Abstract: Time series forecasting is a fundamental task in many domains, such as finance, energy, and intelligent systems. It is increasingly important in modern computing environments, including cloud computing and distributed resource management. However, real-world time series often exhibit complex temporal dependencies, high volatility, and multi-scale nonlinear patterns, making accurate forecasting challenging. To address these issues, this work proposes SVBTformer, a novel and effective forecasting model that enhances the Transformer-based Informer architecture with structured temporal learning modules. Specifically, SVBTformer integrates Savitzky–Golay (SG) filtering for noise reduction and signal smoothing, followed by Variational Mode Decomposition (VMD) to extract multi-resolution temporal components. Then, an improved Informer network called BTformer is employed to enhance the modeling capability for time series and strengthen the extraction of temporal dependencies. This work extensively experiments on publicly available benchmarks spanning multiple domains, including the ETT dataset for electric power demand, foreign exchange rates, and meteorological measurements. The results demonstrate that SVBTformer consistently outperforms state-of-the-art models, such as Informer and Autoformer, across most evaluation metrics, delivering superior accuracy and robustness. These gains underscore SVBTformer’s strong generalization capability and suitability for deployment in various real-world time series applications.
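The Savitzky-Golay smoothing stage can be reproduced directly with SciPy, as in the sketch below; the window length and polynomial order are illustrative, and the subsequent VMD decomposition and BTformer forecaster are not shown.

```python
import numpy as np
from scipy.signal import savgol_filter

def smooth_series(x, window=25, polyorder=3):
    """Savitzky-Golay smoothing as the first preprocessing stage (sketch).

    Returns the smoothed signal and the residual left for further decomposition.
    """
    smoothed = savgol_filter(x, window_length=window, polyorder=polyorder)
    residual = x - smoothed
    return smoothed, residual

series = np.sin(np.linspace(0, 20, 500)) + 0.1 * np.random.randn(500)
smoothed, residual = smooth_series(series)
```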
|
|
09:00-18:30, Paper Mo-Online.218 | |
ThyroSAM: Lightweight Mixed-Prompt Framework for Stable Thyroid Nodule Segmentation |
|
Yang, Zongjie | Wuhan University of Science and Technology |
Liu, Jun | Wuhan University of Science and Technology |
Tang, Jinshan | George Mason University |
Keywords: Deep Learning, Image Processing and Pattern Recognition, Neural Networks and their Applications
Abstract: Thyroid nodules, as one of the most common space-occupying lesions in the endocrine system, play a crucial role in early diagnosis and treatment decisions through the differentiation between benign and malignant cases. Ultrasound-based imaging, known for its cost-effectiveness, operational convenience, and strong device compatibility, has become the preferred modality for thyroid nodule screening. In recent years, the Segment Anything Model (SAM) has revolutionized medical image segmentation by introducing the visual foundation model paradigm. Through large-scale dataset training or the integration of medical knowledge adapters (such as MedSAM and Medical-SAM-Adapter) for domain adaptation, SAM has significantly enhanced its cross-modal segmentation generalization. However, existing methods still face three major limitations: (1) SAM’s multi-level encoder design demands high computational resources, making deployment in primary healthcare institutions challenging; (2) current interactive models rely on precise click-based localization, which differs from the complex annotation habits of clinical practitioners; (3) segmentation quality exhibits significant variance across multiple interactions, affecting diagnostic reliability. This study proposes a novel, efficient, and robust interactive segmentation framework for medical images, ThyroSAM, with the following technical innovations: replacing the original image encoder with a knowledge-distilled EfficientViT to enhance generalization via a low-rank adaptive convolution module (LoRAConv2d); integrating a hybrid prompting mechanism that combines points, boxes, scribbles, and masks to establish an interactive paradigm aligned with clinical workflows; and introducing a confidence-guided iterative correction algorithm that dynamically enhances target region embeddings, effectively mitigating segmentation instability. Experimental results demonstrate that with significantly reduced parameters and computational costs, ThyroSAM achieves an average Dice coefficient of 88.78% on the TN3K, TG3K, and DTTI thyroid nodule ultrasound datasets using only ten iterative prompts. This surpasses SAMUS by 8.51 percentage points, offering optimal accuracy and efficiency for resource
|
|
09:00-18:30, Paper Mo-Online.219 | |
A Privacy-Preserving Federated Learning System for Estimated Time of Arrival Prediction of Multi-Region Vehicular Trips (I) |
|
Zhai, Jiahui | Beijing University of Technology |
Wang, Shen | University College Dublin |
Keywords: Intelligent Internet Systems, Cloud, IoT, and Robotics Integration, Application of Artificial Intelligence
Abstract: Federated learning (FL), propelled by advancements in artificial intelligence and edge computing, is increasingly employed in privacy-sensitive intelligent transportation systems (ITS). However, accurately predicting the estimated time of arrival (ETA) for long vehicular trips spanning multiple regions remains challenging due to heterogeneous traffic patterns and insufficient local data. Although single-region or centralized solutions offer higher accuracy, the former struggles to address complex intra-regional dynamics, while the latter raises significant privacy concerns by aggregating large volumes of mobility data from various regional authorities by a single giant entity (e.g., Google or Alibaba). To address these challenges, we introduce a novel Multi-Region Federated Learning (MRFL) framework to collect traffic data at each region-specific base station (BS) to predict the ETA of vehicles without sharing the collected data among traffic BSs, ensuring privacy and alleviating local data scarcity. Experimental evaluations on SUMO datasets demonstrate that MRFL significantly outperforms single-region learning in prediction accuracy and convergence speed, highlighting the efficacy of MRFL in enhancing ETA prediction in diverse traffic scenarios and offering a promising avenue for future advancements in ITS.
|
|
09:00-18:30, Paper Mo-Online.220 | |
Generative Diffusion-Augmented Learning for Lesion Detection in Digital Breast Tomosynthesis: A Proof-Of-Concept Study |
|
Jilani, Golam | Southern Illinois University Carbondale |
Hossain, Mdbelayat | Southern Illinois University Carbondale |
Keywords: Deep Learning, Transfer Learning, Image Processing and Pattern Recognition
Abstract: Lesion detection in Digital Breast Tomosynthesis (DBT) remains challenging due to the complex imaging modality, large volume, and scarcity of annotated lesion cases, particularly biopsy-proven lesions. We propose a Generative Diffusion-Augmented Learning (GDAL) framework to enhance lesion detection in a classifier by leveraging a denoising diffusion probabilistic model to synthesize high-confidence lesion images only, as normal cases are abundant in a screening dataset. The diffusion-generated synthetic images vary in quality; not all are realistic or suitable for training purposes. Therefore, the synthetic lesions are selectively added to the training set based on our classifier-guided filtering method, enabling data-driven augmented learning. We evaluated the approach using the DBTex dataset and DenseNet121 as the reference classifier. Compared to random sampling, our method demonstrates consistent improvements in the classification performance of the lesion patches in various augmented sampling ratios. Experimental results demonstrated that our classifier-guided diffusion augmentation learning reduces variability and uncertainty in model development to improve lesion detection compared to random sampling.
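A minimal form of classifier-guided filtering is sketched below: only synthetic lesion patches that the reference classifier scores above a confidence threshold are added to the training set. The class index and threshold are assumptions, not the paper's settings.

```python
import torch

@torch.no_grad()
def filter_synthetic(classifier, synthetic_images, lesion_index=1, threshold=0.9):
    """Keep only diffusion-generated lesion patches the reference classifier is confident about (sketch)."""
    probs = torch.softmax(classifier(synthetic_images), dim=1)[:, lesion_index]
    keep = probs >= threshold
    return synthetic_images[keep], probs[keep]
```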
|
|
09:00-18:30, Paper Mo-Online.221 | |
A Multi-Hierarchical Hidden Markov Model for Iceberg Orders in Finance |
|
Yu, Tianyi | Zhejiang University |
Lu, Andi | Zhejiang University of Technology |
Lu, Jiangang | Zhejiang University |
Keywords: AI and Applications, Machine Learning, Application of Artificial Intelligence
Abstract: Iceberg orders, which involve splitting large financial orders into smaller batches to conceal trading intent, pose significant challenges to market stability. To address this challenge, we propose a Multi-Hierarchical Hidden Markov Model (MH-HMM) that captures multi-scale stochastic states in trading sequences. The model employs a layered stacking mechanism to enhance its ability to detect hidden order patterns and incorporates an attention mechanism for adaptive temporal input selection. Comprehensive experiments on three realistic financial datasets and twenty multivariate time series classification datasets demonstrate that the proposed MH-HMM outperforms the state-of-the-art benchmarks on most datasets. Effectiveness experiments demonstrate that our approach effectively identifies iceberg orders and improves market trend prediction, providing a valuable tool for financial market analysis and algorithmic trading strategies. Code is available at https://github.com/tenyee-space/FinRL_MHHMM/tree/main.
|
|
09:00-18:30, Paper Mo-Online.222 | |
RATTCN-SVM: Route Attention and Temporal Convolutional Networks with SVM for Fall Detection Pose Prediction |
|
Wang, Weili | South China Normal University |
Hu, Yingbiao | Zhongkai University of Agriculture and Engineering |
Tang, Hua | South China Normal University |
Keywords: Machine Learning, Deep Learning, Image Processing and Pattern Recognition
Abstract: As global population aging accelerates, fall-related injuries among elderly individuals have become a critical public health challenge, with falls being the leading cause of injury-related deaths in adults aged 65 and older. Current fall detection methods face a fundamental trade-off between accuracy and computational efficiency, limiting their deployment in real-time eldercare monitoring systems. This paper presents RATTCN-SVM, a novel hybrid architecture that integrates Support Vector Machines with Temporal Convolutional Networks enhanced by a Route Attention mechanism specifically designed for fall detection. The Route Attention mechanism dynamically prioritizes critical temporal features by computing route-wise attention weights, enabling the model to focus on subtle motion patterns that precede fall events while maintaining computational efficiency. Our architecture leverages TCN's ability to capture long-range temporal dependencies through dilated convolutions, combined with SVM's robust classification capabilities in high-dimensional feature spaces. Comprehensive experiments on the UP-Fall and HARTH datasets demonstrate that RATTCN-SVM achieves superior performance with MAE of 0.1452, RMSE of 0.2305, and MAPE of 0.8938, outperforming the state-of-the-art Reformer model by 4.1% in MAE and 7.3% in MAPE. Critically, our model maintains detection latency under 300ms with only 7.8MB memory footprint, making it suitable for deployment on resource-constrained edge devices in eldercare environments. The proposed method advances fall detection technology by providing both improved accuracy and practical deployment capabilities for real-world healthcare monitoring applications.
|
|
09:00-18:30, Paper Mo-Online.223 | |
Spatial Interpolation Based on Causal Spatiotemporal Modeling |
|
Xie, Han | Sichuan University |
Lan, Shiyong | Sichuan University |
Ren, Yao | Sichuan University |
Yuan, Weihong | Sichuan University |
Zhou, Xinyuan | Sichuan University |
Hou, Zhiang | Sichuan University |
Keywords: Deep Learning
Abstract: The problem of spatial interpolation is a common challenge in fields such as traffic flow analysis. However, most existing methods directly utilize information from neighboring nodes to infer the status of unobserved locations, without considering whether this information contains confounding factors or reflects a true causal relationship, even though unknown confounding factors are inevitably introduced during data collection. To address this, this paper proposes a Causal Attention Spatial Temporal Interpolation (CASI) model, which leverages causal relationships between nodes for spatial interpolation. The proposed CASI employs dilated convolutions and gating mechanisms to capture temporal dependencies, and introduces a Causal Spatiotemporal Attention (CSTA) mechanism to uncover spatial causal dependencies between nodes. Subsequently, a novel Graph-Guided Feature Enhance Module (GFEM) is designed, which uses the causal probabilities from CSTA's Gumbel-Softmax to weight the adjacency matrix of a traditional GCN, forming GS-GCN, and then applies self-attention along the temporal and spatial dimensions respectively to further enhance the features from the improved GCN. We evaluate CASI on three real-world datasets, where it outperforms the best baselines in MAE, RMSE and MAPE by up to 3.7%, 1.7%, and 5.4% on PEMS04, and 8.5%, 6.0%, and 5.9% on PEMS08, respectively. Ablation studies further demonstrate the effectiveness of the proposed modules in causal spatiotemporal modeling.
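The GS-GCN idea admits a compact illustration: a per-edge scorer produces Gumbel-Softmax "keep" probabilities that re-weight the adjacency matrix before an otherwise standard GCN propagation. The scorer, layer sizes, and normalization below are assumptions for illustration only, not the paper's implementation.
```python
# Assumption-laden sketch of a Gumbel-Softmax-gated GCN layer (GS-GCN style).
import torch
import torch.nn as nn
import torch.nn.functional as F


class GSGCNLayer(nn.Module):
    def __init__(self, in_dim, out_dim, tau=0.5):
        super().__init__()
        self.lin = nn.Linear(in_dim, out_dim)
        self.edge_scorer = nn.Linear(2 * in_dim, 2)   # logits for (keep, drop) per edge
        self.tau = tau

    def forward(self, x, adj):
        # x: (N, in_dim) node features, adj: (N, N) binary/weighted adjacency
        n = x.size(0)
        pair = torch.cat([x.unsqueeze(1).expand(n, n, -1),
                          x.unsqueeze(0).expand(n, n, -1)], dim=-1)
        logits = self.edge_scorer(pair)                                  # (N, N, 2)
        keep_prob = F.gumbel_softmax(logits, tau=self.tau, hard=False)[..., 0]
        weighted_adj = adj * keep_prob                                   # causal re-weighting
        deg = weighted_adj.sum(-1).clamp(min=1e-6)
        norm_adj = weighted_adj / deg.unsqueeze(-1)                      # row-normalised adjacency
        return F.relu(norm_adj @ self.lin(x))
```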
|
|
09:00-18:30, Paper Mo-Online.224 | |
A Local Perspective-Based Model for Overlapping Community Detection |
|
Zhou, Gaofeng | Qingdao University |
Wang, Rui-Feng | China Agricultural University |
Cui, Kangning | City University of Hong Kong |
Keywords: Machine Learning, Neural Networks and their Applications, Complex Network
Abstract: Community detection, which identifies densely connected node clusters with sparse between-group links, is vital for analyzing network structure in real-world systems. Most existing community detection methods based on GCNs primarily focus on node-level information while ignoring community-level features, leading to performance limitations on large-scale networks. To address this issue, we propose LQ-GCN, an overlapping community detection model from a local perspective. LQ-GCN employs a Bernoulli-Poisson model to construct a community affiliation matrix and form an end-to-end detection framework. By adopting local modularity as the objective function, the model incorporates local community information to enhance the clustering quality. Additionally, the conventional GCNs' architecture is optimized to improve the model’s capability in identifying overlapping communities in large-scale networks. Experimental results demonstrate that LQ-GCN achieves up to a 33% improvement in NMI and a 26.3% improvement in Recall compared to baseline models across multiple real-world benchmark datasets.
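For readers unfamiliar with the Bernoulli-Poisson decoder used for community affiliation matrices, the standard formulation is P(A_uv = 1) = 1 - exp(-F_u · F_v), where F is a non-negative affiliation matrix. The sketch below shows its negative log-likelihood; it follows the common formulation and is only an assumption about LQ-GCN's exact loss.
```python
# Sketch of the standard Bernoulli-Poisson reconstruction loss for
# overlapping community affiliations (not necessarily LQ-GCN's exact code).
import torch


def bernoulli_poisson_nll(affil, adj, eps=1e-8):
    """affil: (N, C) non-negative community affiliations, adj: (N, N) binary adjacency."""
    scores = affil @ affil.t()                               # F_u . F_v
    edge_ll = torch.log(1.0 - torch.exp(-scores) + eps)      # log P(A_uv = 1)
    nonedge_ll = -scores                                     # log P(A_uv = 0)
    ll = adj * edge_ll + (1 - adj) * nonedge_ll
    mask = 1 - torch.eye(adj.size(0), device=adj.device)     # ignore self-loops
    return -(ll * mask).sum() / mask.sum()
```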
|
|
09:00-18:30, Paper Mo-Online.225 | |
Semi-Supervised Learning for Anterior Chamber Assessment: Fusing SAM with Adaptive Adapters (I) |
|
Wang, Ting | Wuhan University of Science and Technology |
Liu, Xiaoming | Wuhan University of Science and Technology |
Zhang, Ying | Wuhan Aier Hospital |
Wu, Guohuan | Aier Hospital |
Tang, Jinshan | George Mason University |
Keywords: Deep Learning
Abstract: Accurate structural segmentation and landmark detection in anterior segment optical coherence tomography (AS-OCT) images are crucial for extracting clinical parameters that guide the diagnosis and treatment of diseases such as glaucoma. However, current mainstream algorithmic paradigms suffer from an inherent limitation: their performance improvements heavily rely on large amounts of high-quality annotations. To overcome this bottleneck, we propose a novel semi-supervised multi-task learning framework. Our framework first incorporates the powerful Segment Anything Model (SAM) image encoder to enhance the model’s general feature extraction capability. To address SAM’s adaptability issues in the medical imaging domain, we design an adaptive feature fusion adapter (AFFA) for targeted fine-tuning, thereby improving its performance on AS-OCT images. Simultaneously, our proposed synergistic feature exchange module (SFEM) enables mutual promotion between the segmentation and detection tasks. Experimental results on a local dataset demonstrate that our proposed method achieves superior performance.
|
|
09:00-18:30, Paper Mo-Online.226 | |
Entropy-Driven Approach for Annotation-Free Image Regularity Assessment |
|
Zhang, Zhipeng | China Mobile Research Institute |
Ma, Wenting | China Mobile Research Institute |
Lin, Jinman | Netthink Technology Co., Ltd |
Hongyi, Tang | China Mobile GBA Innovation Institute |
Han, Shuangfeng | CMRI |
Guo, Meng | China Mobile Research Institute |
Yang, Lei | China Mobile Research Institute |
Yao, Zhenjie | Institute of Microelectronics, Chinese Academy of Sciences |
Keywords: Machine Vision, Image Processing and Pattern Recognition, AI and Applications
Abstract: Despite the rapid advancements in artificial intelligence, the preparation of large-scale labeled training data continues to pose a significant challenge, particularly within the manufacturing sector where annotations are frequently inadequate. This study introduces an entropy-based method for annotation-free image regularity assessment, with a specific focus on wire and cable distribution. Our approach encompasses three key components: (i) an analysis of spatial entropy pertaining to linear objects, (ii) a quantification of positional confusion present in images, and (iii) the provision of a standardized metric applicable to wiring operations. The proposed method provides a clear physical interpretation of entropy for annotation-free assessments while generating quantifiable and ordered entropy values that enable automated labeling and fully automated regularity analysis. Experimental results demonstrate that our approach surpasses human annotation in accuracy, thereby offering an objective and scalable solution.
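A minimal sketch of an entropy-style regularity score for wire/cable images follows: estimate line-segment orientations and measure the Shannon entropy of their distribution, so that more orderly wiring yields lower entropy. The edge/line-detection pipeline and bin count are illustrative assumptions, not the paper's exact metric.
```python
# Illustrative orientation-entropy score for linear objects in an image.
import cv2
import numpy as np


def orientation_entropy(image_gray, n_bins=18):
    """image_gray: 8-bit grayscale image. Returns Shannon entropy (bits) of line orientations."""
    edges = cv2.Canny(image_gray, 50, 150)
    segments = cv2.HoughLinesP(edges, 1, np.pi / 180, threshold=40,
                               minLineLength=30, maxLineGap=5)
    if segments is None:
        return float("nan")
    x1, y1, x2, y2 = segments[:, 0, 0], segments[:, 0, 1], segments[:, 0, 2], segments[:, 0, 3]
    angles = np.arctan2(y2 - y1, x2 - x1) % np.pi            # undirected orientations in [0, pi)
    hist, _ = np.histogram(angles, bins=n_bins, range=(0, np.pi))
    p = hist / max(hist.sum(), 1)
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())
```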
|
|
09:00-18:30, Paper Mo-Online.227 | |
MFD: Multidimensional Feature Fusion and Masked Autoencoder for Encrypted Malicious Traffic Detection |
|
Liu, Chenhao | Wuhan University |
Ding, Xinwang | Wuhan University |
Wang, Lina | Wuhan University |
Pang, Zhi | Wuhan University |
Yang, Chenye | Wuhan University |
Jia, Bofei | WHU |
Yu, Rongwei | Wuhan University |
Keywords: Deep Learning, Representation Learning, Information Assurance and Intelligence
Abstract: The classification of encrypted network traffic (ENTC) is vital for ensuring network security, effective administration, and maintaining service quality. To accurately detect malicious encrypted traffic in communications and overcome the challenges posed by traditional detection methods, including the lack of labeled training data, difficulty in feature identification, and reliance on single-method approaches, we propose a novel framework based on Masked Autoencoders (MAE) and multidimensional feature fusion. Using a formatted traffic representation matrix that incorporates hierarchical flow information, we extract raw traffic features, plaintext packet features, and traditional statistical features in image format. Multidimensional feature fusion is achieved through RGB multi-channel integration. Our approach utilizes the MAE paradigm, which pre-trains a classifier on extensive unlabeled data and fine-tunes it with minimal labeled data for traffic classification. Experimental results demonstrate that our method achieves detection accuracy exceeding 98% on three public traffic datasets: USTC-TFC2016, ISCX-VPN2016, and CICIoT2022, significantly outperforming other deep learning methods.
|
|
09:00-18:30, Paper Mo-Online.228 | |
Fake Model Free-Rider Attacks in Federated Model Distillation |
|
Das, Kaushik Amar | IIIT Guwahati |
Ahmed Barbhuiya, Ferdous | IIIT Guwahati |
Dey, Kuntal | IIIT Guwahati |
Keywords: Deep Learning, Machine Learning, Application of Artificial Intelligence
Abstract: Federated Learning (FL) makes it possible to learn from data that would otherwise be inaccessible due to privacy and security restrictions. FL relies on each client's honest participation and contribution. The existence of free-rider attackers may undermine the FL process, allowing free-riders to enjoy the contributions of the honest clients without contributing anything of their own. In FL algorithms such as Federated Averaging (FedAvg), the goal of the free-rider is to obtain the global model. However, algorithms such as Federated Model Distillation (FedMD) do not have a global model, which enables clients to train models with unique architectures; collaborative learning is instead done by sharing prediction logits on a common public dataset, aggregated by a central server. In this article, we propose a free-riding attack specific to this scenario, in which the goal of the free-rider is to steal the aggregated logits from the server. Our proposed free-rider attack exploits the model heterogeneity property of FedMD and utilizes fake models to create the prediction logits. Furthermore, we improve upon FedMD and develop a KL-divergence-based detection mechanism to defend against such fake model attacks. Our experiments show that our mechanism can remove such free-riders at the very start of the FL process. Additionally, we provide theoretical justification for the covertness of the fake model attack and the effectiveness of our detection mechanism.
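A hedged sketch of the KL-divergence screening idea in a FedMD-style setup: compare each client's softened predictions on the public set with the ensemble mean and flag statistical outliers. The median-plus-MAD thresholding is an illustrative choice, not necessarily the paper's rule.
```python
# Illustrative server-side KL screen for suspicious (fake-model) clients.
import numpy as np
from scipy.special import softmax


def flag_free_riders(client_logits, k=3.0):
    """client_logits: (n_clients, n_public_samples, n_classes) raw logits on the public set."""
    probs = softmax(client_logits, axis=-1)
    mean_probs = probs.mean(axis=0, keepdims=True)                      # consensus distribution
    # per-client mean KL( client || ensemble ) over the public dataset
    kl = (probs * (np.log(probs + 1e-12) - np.log(mean_probs + 1e-12))).sum(-1).mean(-1)
    med = np.median(kl)
    mad = np.median(np.abs(kl - med)) + 1e-12
    return np.where(np.abs(kl - med) > k * mad)[0], kl                  # suspect indices, per-client KL
```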
|
|
09:00-18:30, Paper Mo-Online.229 | |
Detection of Strawberry Growth Status Based on BotNet-YOLOv7 |
|
Wang, Zhaofei | Hebei Agricultural University |
Yao, Jingfa | Social Service Center, Hebei Software Institute |
Chen, Hui | Department of Software Engineering, Hebei Software Institute |
Zhang, Jiawei | Department of Computer Application Engineering,Hebei Soft |
Yuan, Yingchun | College of Information Science and Technology, Hebei Agricultura |
Keywords: Image Processing and Pattern Recognition, Application of Artificial Intelligence, Deep Learning
Abstract: To address the difficulty that the traditional YOLOv7 network structure has in detecting the growth status of strawberries accurately and efficiently, an improved YOLOv7 network model is proposed to identify strawberry growth status more efficiently and accurately. First, a new model is obtained by introducing an improved Transformer module with the Bottleneck Transformers (BotNet) self-attention mechanism and CBAM to achieve better feature extraction, which gives the model a more efficient and accurate recognition of strawberry growth status than the traditional model. Second, by introducing the Normalized Wasserstein Distance (NWD) loss function, the original IoU algorithm of YOLOv7 is improved, addressing the model's classification and regression problems; at the same time, the model's ability to detect tiny objects is strengthened, so that its detection accuracy for each category is further improved. Combining the above improvements, an improved YOLOv7 model (BotNet-YOLOv7) that integrates BotNet and YOLOv7 is proposed. The results show that the mean average precision (mAP0.5), precision, and recall
|
|
09:00-18:30, Paper Mo-Online.230 | |
LFBA: Latent-Space Frame-Level Backdoor Attacks on Keyword Spotting Systems |
|
Li, Zexin | Xiangtan University |
Yao, Wenhan | Xiangtan University |
Xiao, Ye | Xiangtan University |
Yang, Jinsu | Xiangtan University |
Xing, Zedong | Xiangtan University |
Chen, Xiarun | Peking University |
Xiao, Fen | Xiangtan University |
Wen, Weiping | Peking University |
Keywords: Deep Learning, Application of Artificial Intelligence
Abstract: Modern deep learning models increasingly rely on third-party data processing, exposing vulnerabilities to backdoor attacks. Existing audio backdoor methods often compromise stealthiness by introducing perceptible modifications. This paper proposes Latent-space Frame-level Backdoor Attacks (LFBA), a novel framework that manipulates frame-level features in latent space to achieve imperceptible and effective backdoor injection. Our approach extracts and transforms frame-level features to subtly alter rhythmic patterns, such as compressing or expanding temporal segments, without modifying semantic content or speaker characteristics. Evaluations demonstrate excellent attack effectiveness while maintaining near-original audio quality. Our attack evades human perception and automated detection, maintaining robustness even after defensive fine-tuning. This work reveals critical risks in outsourced speech model training and establishes a new paradigm for stealthy, latent-space poisoning in speech-controlled systems.
|
|
09:00-18:30, Paper Mo-Online.231 | |
A Dual-Stream LSTM and Transformer Framework with Gated Feature Fusion for Bus Travel Time Prediction |
|
Tang, Zicheng | Tongji University |
Cheng, Nan | Tongji University |
Zhang, Yaying | Tongji University |
Keywords: Neural Networks and their Applications, Deep Learning
Abstract: Accurate bus travel time prediction is essential for intelligent transportation systems, yet it remains challenging due to the nonlinear and dynamic nature of urban traffic. This paper proposes a novel hybrid model that combines a dual-stream LSTM encoder and a Transformer-based decoder within an encoder–decoder framework to capture both short- and long-term temporal dependencies. To effectively incorporate heterogeneous information, we design a Gated Feature Fusion Layer that adaptively integrates static and dynamic features. The decoder includes a Static-conditioned Variable Selection Block, Gated Residual Network (GRN), and Temporal Attention Layer to enhance feature extraction and temporal interpretability. Extensive experiments on a real-world dataset collected from Shanghai demonstrate that our model significantly outperforms baseline methods. Further ablation studies validate the effectiveness of each component, confirming the robustness and superiority of the proposed architecture in complex urban traffic scenarios.
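The Gated Feature Fusion Layer described in this abstract can be sketched as a learned sigmoid gate that blends projected static and dynamic features per dimension. Layer names and dimensions are assumptions for illustration, not the authors' implementation.
```python
# Minimal sketch of a gated fusion of static and dynamic feature streams.
import torch
import torch.nn as nn


class GatedFeatureFusion(nn.Module):
    def __init__(self, static_dim, dynamic_dim, hidden_dim):
        super().__init__()
        self.static_proj = nn.Linear(static_dim, hidden_dim)
        self.dynamic_proj = nn.Linear(dynamic_dim, hidden_dim)
        self.gate = nn.Linear(2 * hidden_dim, hidden_dim)

    def forward(self, static_feat, dynamic_feat):
        s = self.static_proj(static_feat)        # (B, T, H) or (B, H)
        d = self.dynamic_proj(dynamic_feat)
        g = torch.sigmoid(self.gate(torch.cat([s, d], dim=-1)))
        return g * s + (1.0 - g) * d             # feature-wise convex blend
```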
|
|
09:00-18:30, Paper Mo-Online.232 | |
MO-SAE: Multi-Objective Stacked Autoencoders Optimization for Edge Anomaly Detection |
|
Zhang, Lizhao | Harbin Institute of Technology |
Kong, Shengsong | Harbin Institute of Technology |
Guo, Tao | Harbin Institute of Technology |
Li, Shaobo | Harbin Institute of Technology |
Ji, Zhenzhou | Harbin Institute of Technology |
Keywords: Hybrid Models of Neural Networks, Fuzzy Systems, and Evolutionary Computing, Deep Learning, Neural Networks and their Applications
Abstract: Stacked AutoEncoders (SAE) have been widely adopted in edge anomaly detection scenarios. However, the resource-intensive nature of SAE can pose significant challenges for edge devices, which are typically resource-constrained and must adapt rapidly to dynamic and changing conditions. Optimizing SAE to meet the heterogeneous demands of real-world deployment scenarios, including high performance under constrained storage, low power consumption, fast inference, and efficient model updates, remains a substantial challenge. To address this, we propose an integrated optimization framework that jointly considers these critical factors to achieve balanced and adaptive system-level optimization. Specifically, we formulate SAE optimization for edge anomaly detection as a multi-objective optimization problem and propose MO-SAE (Multi-Objective Stacked AutoEncoders). The multiple objectives are addressed by integrating model clipping, multi-branch exit design, and a matrix approximation technique. In addition, a multi-objective heuristic algorithm is employed to effectively balance the competing objectives in SAE optimization. Our results demonstrate that the proposed MO-SAE delivers substantial improvements over the original approach. On the x86 architecture, it reduces storage space and power consumption by at least 50%, improves runtime efficiency by no less than 28%, and achieves an 11.8% compression rate, all while maintaining application performance. Furthermore, MO-SAE runs efficiently on edge devices with ARM architecture. Experimental results show a 15% improvement in inference speed, facilitating efficient deployment in cloud–edge collaborative anomaly detection systems.
|
|
09:00-18:30, Paper Mo-Online.233 | |
Ontology-Based Graph and Large Language Model Fusion Method for Relational Triple Extraction |
|
Yao, Yuxiang | Tongji University |
Wang, Junli | Tongji University |
Yan, Chungang | Tongji University |
Keywords: Knowledge Acquisition, Application of Artificial Intelligence, Machine Learning
Abstract: In recent years, Relational Triple Extraction (RTE) methods leveraging Large Language Models (LLMs) have garnered considerable attention. Several studies have attempted to harness ontology, the foundational template of a knowledge graph, to help LLMs comprehend the structure of the knowledge graph. However, the broadness of the ontology within the prompts leads to unsatisfactory outcomes for RTE tasks. To address this, we propose an Ontology-based Graph and LLM Fusion method for Relational Triple Extraction (OGLRTE). In this framework, we design a two-stage RTE method centered around ontology, comprising a relation filter and a triple generator. To fully utilize the information in the knowledge graph, we propose an innovative combination of the ontology and a co-occurrence graph within the relation filter to represent the co-occurrence of relations in the knowledge graph. Additionally, we employ established fine-tuning techniques to optimize the LLM within our triple generator, thereby enhancing its capability to extract triples. Our method surpasses traditional methods and LLM-based methods in extracting information, as evidenced by its superior performance on three public datasets: DocRED, NYT10 and CoNLL04.
|
|
09:00-18:30, Paper Mo-Online.234 | |
Memory and Inference Based Meta Reinforcement Learning for Dynamic Flexible Job Shop Scheduling Problem |
|
Wu, Lincong | College of Informatics, Huazhong Agricultural University |
Li, Xiaoxia | Huazhong Agricultural University |
Lu, Xin | Leeds Trinity University |
Pu, Hangyu | Huazhong Agricultural University |
Jing, Yanguo | University of Cumbria |
Keywords: Manufacturing Automation and Systems, Decision Support Systems, Discrete Event Systems
Abstract: The Dynamic Flexible Job Shop Scheduling Problem (DFJSP) poses a significant challenge in the field of intelligent manufacturing, due to inherent complexity and dynamic disruptions including machine breakdowns, new job arrivals, and other unforeseen events. Although many reinforcement learning approaches offer greater flexibility, they often suffer from low sample efficiency and struggle to adapt to unforeseen events. To overcome these limitations, we propose Vari-PPO, a meta-reinforcement learning method built upon implicit task inference. The approach leverages variational inference to model latent task characteristics and integrates Proximal Policy Optimization (PPO) for dynamic scheduling. Experimental findings demonstrate that the proposed approach exhibits stronger adaptability and generalization than traditional reinforcement learning approaches and pre-trained models across tasks of varying scales and benchmark datasets. It effectively manages fluctuations in makespan under dynamic events, highlighting its robustness and practical value in diverse scheduling environments.
|
|
09:00-18:30, Paper Mo-Online.235 | |
HC-RL-MPC: Epidemic Multi-Scale Hierarchical Control Framework Based on Reinforcement Learning and Model Predictive Control |
|
Luo, Xueting | Tongji University |
Li, Zeyuan | Tongji University |
Deng, Hao | Tongji University |
Zhao, Shengjie | Tongji University |
Keywords: Application of Artificial Intelligence, Agent-Based Modeling, AI and Applications
Abstract: Human mobility restrictions are considered effective measures for epidemic mitigation but require adaptability to complex transmission environments and consistency across spatial scales. Existing mainstream approaches, including model predictive control (MPC) and reinforcement learning (RL), still face limitations. MPC relies on precise mathematical models and struggles to handle large-scale dynamic scenarios. Meanwhile, RL suffers from the curse of dimensionality, making fine-grained control challenging. In this work, we propose a novel epidemic multi-scale Hierarchical Control framework based on Reinforcement Learning and Model Predictive Control (HC-RL-MPC). At the high level, RL generates regional mobility restriction quotas to balance medical and socioeconomic objectives. At the low level, MPC clusters optimize the allocation of inter-subregion mobility quotas under the high-level constraints. Furthermore, to ensure cross-scale policy consistency, we introduce a Policy Inverse Guidance (PIG) mechanism, in which low-level modules provide gradient feedback and real-time results to the high-level policy. Experimental results demonstrate that HC-RL-MPC generates macro-micro consistent mobility restriction strategies. Compared with existing models, it significantly reduces training complexity and improves cumulative rewards, providing an efficient solution for multi-scale dynamic decision-making.
|
|
09:00-18:30, Paper Mo-Online.236 | |
UFedMBA: Unforgotten Personalized Federated Learning with Memory Bank for Adaptively Aggregated Layers |
|
Zhai, Yijun | Chongqing University |
Zhou, Pengzhan | Chongqing University |
He, Yuepeng | Chongqing University |
Gao, Kaixin | Chongqing University |
Fang, Qu | Chongqing University |
Li, Ziyi | Chongqing University |
Gao, Li | Chongqing University |
Luo, Youyu | Chongqing University |
Keywords: Intelligent Internet Systems, Machine Learning, Computational Intelligence
Abstract: Personalized federated learning (PFL) addresses the challenge of data heterogeneity across clients. However, existing efforts often struggle to balance model personalization and generalization under Non-IID data scenarios. This paper proposes uFedMBA, a novel PFL framework that decouples neural network parameters into global base-layer parameters and client-specific personalized-layer parameters. On the client side, uFedMBA adds a penalty term with base-layer parameters into the local loss function to prevent overfitting to local data and integrates the historical model into personalized-layer parameters for accelerating convergence. The server employs layer-wise aggregation based on gradient alignment to adaptively aggregate personalized layers, enhancing compatibility across heterogeneous clients. Extensive experiments demonstrate that the uFedMBA achieves state-of-the-art results on four image classification datasets. Code is available at: https://github.com/yjzhai-cs/uFedMBA.
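The client-side objective described here, a task loss plus a penalty that ties local base-layer parameters to the global base layers, can be sketched as a proximal term. The coefficient `mu` and all names are illustrative assumptions; see the released uFedMBA code for the actual formulation.
```python
# Illustrative client-side loss: cross-entropy plus a base-layer proximal penalty.
import torch
import torch.nn.functional as F


def local_loss(model, batch, global_base_params, base_layer_names, mu=0.01):
    """global_base_params: dict {name: tensor} snapshot of the server's base-layer weights."""
    x, y = batch
    task_loss = F.cross_entropy(model(x), y)
    prox = 0.0
    for name, param in model.named_parameters():
        if name in base_layer_names:              # only shared (base) layers are penalised
            prox = prox + (param - global_base_params[name].detach()).pow(2).sum()
    return task_loss + 0.5 * mu * prox
```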
|
|
09:00-18:30, Paper Mo-Online.237 | |
Dual-View Evidence Learning and Cross-View Fusion for Enhanced Text-Table Fact Verification |
|
Wu, Zhouhui | University of New South Wales |
Jiang, Jiaojiao | UNSW |
Yang, Shuiqiao | UNSW |
Sun, Nan | UNSW Canberra |
Keywords: Application of Artificial Intelligence, AI and Applications
Abstract: Fact verification involves assessing the factuality of claims to detect false information. This work focuses on a specific fact verification subtask: verifying claims using retrieved textual and tabular evidence. Existing approaches often overlook the distinct features and interactions of table and text evidence, which are essential for accurate claim verification by providing a comprehensive understanding. Moreover, current evidence fusion strategies used by existing work fail to model complex distinctions, leading to ineffective integration. This work introduces a novel veracity prediction model that leverages dual-view evidence learning and graph-based evidence fusion to address these limitations. Our model incorporates a local view, capturing the unique information within each sentence and table, and a global view, modeling the interactions between these evidence pieces. We further employ graph networks to fuse information within each view and across views, generating richer evidence representations for improved claim verification. Extensive experiments demonstrate the effectiveness of our method. We will make our code publicly available.
|
|
09:00-18:30, Paper Mo-Online.238 | |
DynTSM: A Dynamic Graph Contrastive Representation Learning Method Based on Temporal and Structural Perturbations |
|
Wang, Ying | Jilin University |
Li, Chenyu | Jilin University |
Chen, Zihao | Jilin University |
Keywords: Deep Learning
Abstract: In the real world, dynamic graphs undergo continuous changes in both their structure and information over time. To accurately represent these graphs, models must be capable of capturing the structural, temporal, and contextual relationships simultaneously. The integration of these complex, multi-aspects is crucial for comprehending the evolution of dynamic graphs. However, this integration poses a substantial challenge within the realm of graph representation learning. To address this problem, we propose an innovative dynamic graph representation learning method called DynTSM. DynTSM aims to more accurately capture the structural evolution patterns of dynamic graphs by using contrastive learning based on temporal and structural perturbations. Specifically, DynTSM learns node representations at different time points through the contrastive learning of temporal perturbations, allowing it to capture subtle changes in nodes and edges within the graph. Meanwhile, contrastive learning of structural perturbations focuses on the overall structural evolution of the graph, enabling us to systematically identify substructures associated with temporal and structural changes. Such a dynamic learning approach aims to provide a more comprehensive perspective for understanding the temporal and spatial evolution of dynamic graphs. We conduct experiments on a wide range of benchmark datasets to validate the superiority of the DynTSM approach to dynamic graph representation learning.
|
|
09:00-18:30, Paper Mo-Online.239 | |
Self-Supervised MRI Reconstruction Using Weighted SSDU Via Dual-Branch Latent Diffusion |
|
Yang, Fan | University of Electronic Science and Technology of China |
Jianchao, Wang | University of Electronic Science and Technology of China |
Li, Jiakai | University of Electronic Science and Technology of China |
Pu, Xiaorong | University of Electronic Science and Technology of China |
Keywords: Artificial Social Intelligence, Image Processing and Pattern Recognition, Application of Artificial Intelligence
Abstract: Magnetic Resonance Imaging (MRI) plays a critical role in clinical diagnostics, but its broader application is constrained by prolonged acquisition times, especially under high acceleration factors. We therefore propose a self-supervised MRI reconstruction framework that integrates latent diffusion models with a parallel dual-network architecture, aiming to achieve high-quality image reconstruction under highly accelerated conditions. Within this framework, a dual-branch network is designed to process distinct subsets of under-sampled k-space data. This design enables the capture of complementary information across subsets while effectively mitigating overfitting. Both branches are guided by reconstruction and difference losses to enforce consistency across unobserved regions, enabling accurate recovery of fully sampled images. A solid mathematical foundation is formulated to offer theoretical guarantees regarding the reliable approximation of fully sampled data under specific constraints. Experimental results demonstrate that integrating parallel dual-network self-supervision with latent diffusion advances self-supervised MRI reconstruction, highlighting its potential for clinically feasible high-fidelity MRI under high acceleration.
|
|
09:00-18:30, Paper Mo-Online.240 | |
Adversarial Domain Adaptation for Accurately Predicting the Associations between Herbal Compounds and Target Proteins (I) |
|
Qiao, Yantong | Southwest University |
Hu, Lun | Chinese Academy of Sciences |
Zhang, Jun | Xinjiang Technical Institute of Physics and Chemistry, Chinese A |
Hu, Pengwei | IBM Research |
Luo, Xin | Chinese Academy of Sciences |
Keywords: Biometric Systems and Bioinformatics, Deep Learning, Transfer Learning
Abstract: Traditional Chinese medicine (TCM), as a treasure of traditional Chinese medical heritage, has developed a unique therapeutic system through millennia of practice. Its multi-compound, multi-target mechanism demonstrates distinct advantages in disease treatment. However, under the modern medical framework, the modernization of TCM research faces dual challenges: on one hand, the complex interaction network between herbal compounds and target proteins has not yet been systematically elucidated, and traditional methods based on in vitro experiments or statistical correlations suffer from long experimental cycles and high failure rates; on the other hand, compared to chemical drugs with well-established large-scale annotated databases, the association data between herbal compounds and targets exhibit significant sparsity, making it difficult for existing methods to effectively model cross-domain knowledge transfer due to mismatched data dimensions. To address this scientific challenge, this study proposes a cross-domain association prediction framework, DGCPI, which integrates graph neural networks and adversarial learning. To begin with, DGCPI constructs two heterogeneous graph networks for modelling chemical drug-target (source domain) and herbal compound-target (target domain) associations respectively. It then applies a graph convolutional network to capture source-domain-specific topological features in the chemical drug-target graph. After that, an adversarial discriminator is introduced to dynamically align the embedding distributions between the source and target domains, overcoming the dependence of traditional methods on the assumption of homomorphic data. Experimental results show that, with only a small number of TCM samples involved in training, the model significantly outperforms baseline models on two TCM datasets.
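The adversarial alignment step can be sketched with a gradient-reversal layer feeding pair embeddings from both domains into a domain discriminator, so the shared encoder is pushed toward domain-invariant representations. Layer sizes and names are illustrative assumptions, not the DGCPI code.
```python
# Sketch of a gradient-reversal-based domain discriminator for embedding alignment.
import torch
import torch.nn as nn


class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lamb):
        ctx.lamb = lamb
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lamb * grad_output, None      # reverse (and scale) the gradient


class DomainDiscriminator(nn.Module):
    def __init__(self, emb_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(emb_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 1))

    def forward(self, emb, lamb=1.0):
        return self.net(GradReverse.apply(emb, lamb)).squeeze(-1)   # domain logit


# Training outline: minimise BCEWithLogitsLoss on (source=1, target=0) labels;
# the reversed gradient trains the shared encoder to confuse the discriminator.
```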
|
|
09:00-18:30, Paper Mo-Online.240 | |
Lightweight Feature Enhancement-Based Detection of Drone Aerial Images |
|
Huang, Hong | University |
Su, Han | Sichuan Normal University |
Keywords: Deep Learning, Image Processing and Pattern Recognition, Artificial Social Intelligence
Abstract: This paper presents a lightweight feature enhancement network tailored for object detection in aerial drone images, addressing challenges such as small target detection, occlusions, and environmental complexities. We propose a Multi-Scale Feature Enhancement (MSFE) module, which leverages large kernel convolutions and advanced feature fusion techniques to enhance the detection performance for small and occluded targets. In addition, a novel lightweight module, C2G-Net, is introduced to optimize computational efficiency and reduce model complexity, making it suitable for deployment on resource-constrained devices such as drones. Experimental results in VisDrone-2019 and RSOD datasets demonstrate the superiority of the proposed method, achieving higher accuracy and reduced parameter counts compared to state-of-the-art approaches. This study contributes to the advancement of UAV-based object detection in fields such as agriculture, surveillance, and environmental monitoring.
|
|
09:00-18:30, Paper Mo-Online.241 | |
SlimPose: Lightweight Multi-Person Pose Estimation Via Multi-Scale Structural Feature Fusion and Selective Attention |
|
Liu, Yandong | Central China Normal University |
Liu, Jiangnan | Central China Normal University |
Peng, Xi | Central China Normal University |
Hu, Shengnan | Central China Normal University |
Keywords: Image Processing and Pattern Recognition, Deep Learning, Artificial Social Intelligence
Abstract: As a foundational technology for human-centered visual understanding, pose estimation has broad applications in human-computer interaction, behavior recognition, and video surveillance. However, existing methods often incur high computational costs, limiting their deployment on resource-constrained edge devices. In this paper, we propose SlimPose, a lightweight multi-person pose estimation framework that enhances the model’s ability to perceive human poses across scales by integrating multi-scale features with directional structural cues of the human body. We introduce a compact Selective Attentional Feature Gating module that adaptively emphasizes human regions while suppressing background features, thereby reducing unnecessary computational overhead. Additionally, we design a Deconvolutional Shift-Channel Mixer that enhances the model's ability to infer occluded keypoints with minimal increase in parameters or computational cost. Extensive experiments on the COCO and CrowdPose datasets demonstrate that our approach achieves state-of-the-art accuracy among lightweight bottom-up methods, particularly excelling in crowded scenes and under resource-limited conditions.
|
|
09:00-18:30, Paper Mo-Online.242 | |
Improving LDC-MIA: A New Exploration of Enhancing the Privacy Assessment Performance of Machine Learning Models |
|
Liu, Yaxuan | National University of Defense Technology |
Zhou, Yun | National University of Defense Technology |
Keywords: Machine Learning, Neural Networks and their Applications, Deep Learning
Abstract: Machine learning's widespread application across various fields has raised significant privacy concerns, making Membership Inference Attacks (MIAs) a key technique for assessing model privacy vulnerabilities. This paper focuses on improving the LDC-MIA method and proposes the LDC-MIA+ framework. By incorporating a model output stability metric and integrating multiple features such as loss values, calibrated losses, and neighbor cosine similarities, we construct a more effective MIA classifier. Experimental results on CIFAR-10, Adult, and Credit datasets demonstrate that, compared to other MIA techniques, our method significantly enhances the True Positive Rate (TPR) at extremely low False Positive Rates (FPR), with notable improvements in balanced accuracy and Area Under the Curve (AUC). It more accurately infers membership information and exhibits strong robustness across diverse datasets and models, providing a more reliable framework for assessing the privacy risks of machine learning models.
|
|
09:00-18:30, Paper Mo-Online.243 | |
Building Neural Networks' Latent Space to Extract Instance-Based Explanations for Sleep Staging |
|
Gagliardi, Guido | University of Pisa |
Alfeo, Antonio Luca | University of Pisa |
Cimino, Mario G. C. A. | University of Pisa |
Valenza, Gaetano | University of Pisa |
De Vos, Maarten | KU Leuven |
Keywords: Representation Learning, Machine Learning, AI and Applications
Abstract: Sleep disorders and their diagnosis are a significant public health concern. Automated sleep stage classification using deep learning models has shown promising results, but these models often lack transparency and interpretability. In this study, we propose an eXplainable Artificial Intelligence (XAI) approach to enhance the interpretability of cutting-edge deep learning sleep stage classification models. The proposed approach consists of a three-step framework: (i) employing contrastive learning to order a neural network's latent space based on input similarity; (ii) mining meaningful instances from that space; and (iii) explaining those instances via a customized XAI methodology. By doing this, we can extract human-comprehensible insights about the model's decision-making process, enhancing the applicability of the proposed approach in real-world clinical scenarios. The explanations provided point out highly and poorly representative sleep epochs of each sleep phase. These sleep epochs are analyzed considering both the single sleep epoch and the sequence of adjacent sleep epochs for sleep phase classification. Our approach maintains the original model's performance, improves model interpretability, and confirms that the network's decision-making process is valid even from the perspective of a physician.
|
|
09:00-18:30, Paper Mo-Online.244 | |
JAuthGuard: Automatic Detection for Broken Access Control in Java Web APIs |
|
Zhang, Mengjun | Institute of Information Engineering,Chinese Academy of S |
Li, Yunpeng | Institute of Information Engineering, Chinese Academy of Science |
Cheng, Jie | State Grid Information & Telecommunication Co., Ltd |
Xia, Ang | STATE GRID Corporation of China |
Zhang, Yue | Institute of Information Engineering,Chinese Academy of S |
Liu, Yuling | Institute of Information Engineering, Chinese Academy of Science |
Feng, RuiZhi | Institute of Information Engineering,Chinese Academy of S |
Keywords: Information Assurance and Intelligence
Abstract: Java Web applications are widely used across various industries; however, they are increasingly threatened by Broken Access Control (BAC) vulnerabilities, which may allow unauthorized users to access restricted resources. This paper proposes an automated detection method for BAC vulnerabilities at the API level in Java Web applications. We propose JAuthGuard, a novel detection framework that combines rule-based analysis and graph-based path analysis to identify potential vulnerabilities. Our approach leverages prior knowledge of common Java Web development patterns and access control mechanisms to define target function rules, precisely locating critical functions that require strict access control checks. Additionally, based on an analysis of historical vulnerabilities, we propose a control defect (i.e., flaws in access control mechanisms that allow unauthorized access) detection algorithm based on authentication paths. This algorithm constructs authentication paths through static analysis and incorporates LLM prompt techniques to identify control defects. We implemented JAuthGuard and conducted an empirical evaluation, demonstrating its effectiveness, with results showing superior performance compared to the commercial tool Fortify SCA. Furthermore, the system successfully detected multiple BAC vulnerabilities in high-profile projects on GitHub, earning six CVE identifiers. By providing an automated, efficient, and accurate BAC vulnerability detection tool, this research contributes to enhancing the security of Java Web applications.
|
|
09:00-18:30, Paper Mo-Online.245 | |
MGQA: Mixture Gaussian for Video Grounded Question Answering Via VLMs |
|
He, Zhixian | Sun Yat-Sen University |
Ma, Xiaofan | SUN YAT-SEN University |
Li, Qiushi | Sun Yat-Sen University |
Lin, Shujin | Sun Yat-Sen University |
Keywords: Multimedia Computation, Image Processing and Pattern Recognition, Machine Vision
Abstract: Video question answering has become a cornerstone task for evaluating vision language models. However, existing models often fail to ground their answers in relevant visual evidence or incorrectly model distributions during localization. To address this limitation, we propose MGQA, which models videos as a sequence of discrete events using mixture Gaussian distributions, with each Gaussian characterized by its center, range, and weight. MGQA leverages question-answering accuracy as a weak supervision signal and incorporates two additional Gaussian-related loss functions. The method can be easily integrated into existing models with negligible parameter overhead. Experiments conducted on the NExT-GQA and ReXTime datasets demonstrate the effectiveness of our proposed method.
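The temporal-grounding representation described here, a mixture of Gaussians over frame indices with per-component center, range, and weight, can be sketched as follows. The predictor that produces these parameters is omitted; all names and value ranges are illustrative assumptions.
```python
# Minimal sketch of mixture-Gaussian frame weights for video grounding.
import torch


def mixture_gaussian_weights(centers, ranges, weights, n_frames):
    """centers, ranges, weights: (B, K); centers in [0, 1], ranges > 0. Returns (B, n_frames)."""
    t = torch.linspace(0, 1, n_frames, device=centers.device).view(1, 1, -1)   # (1, 1, T)
    c = centers.unsqueeze(-1)                                                   # (B, K, 1)
    s = ranges.clamp(min=1e-3).unsqueeze(-1)
    w = torch.softmax(weights, dim=-1).unsqueeze(-1)
    comp = torch.exp(-0.5 * ((t - c) / s) ** 2)                                 # unnormalised Gaussians
    mix = (w * comp).sum(dim=1)                                                 # (B, T)
    return mix / mix.sum(dim=-1, keepdim=True).clamp(min=1e-8)                  # frame attention


# Usage: grounded = (mixture_gaussian_weights(c, r, w, T).unsqueeze(-1) * frame_feats).sum(1)
```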
|
|
09:00-18:30, Paper Mo-Online.247 | |
Neighborhood-Aware Graph Representation Learning Based on Large Language Models |
|
Huang, Wenxuan | Institute of Software Chinese Academy of Sciences |
Wu, Fengge | Institute of Software, Chinese Academy of Sciences |
Zhao, Junsuo | Institute of Software, Chinese Academy of Sciences |
He, Ruitao | Institute of Software Chinese Academy of Sciences |
Keywords: Deep Learning, Representation Learning, AI and Applications
Abstract: In recent years, representation learning of Text-Attributed Graphs (TAGs) has become a key research topic. TAGs have a wide range of applications in fields such as social media and recommendation systems. The advent of Large Language Models (LLMs), with their robust reasoning capabilities and zero-shot learning abilities, has opened up new avenues for TAGs modeling. However, how to effectively utilize the reasoning capabilities and general knowledge of LLMs while overcoming their high computational costs remains a significant challenge. In this work, we propose an innovative framework named Neighborhood-Aware Graph Representation Learning Based on Large Language Models (NAGL). NAGL employs an adaptive neighbor fusion module to extract meaningful features from neighboring nodes and utilizes an efficient fusion module to integrate global graph features extracted by Graph Neural Networks (GNNs) with neighborhood features into text encodings. The text encodings are generated through autoregressive training on next token prediction. After pretraining, the model can adapt graph representations to various downstream tasks using task-specific prompts. Extensive experiments on multiple real-world datasets demonstrate that NAGL outperforms baselines in node classification tasks. Furthermore, validation with various LLMs underscores the framework's efficacy, universality, and scalability.
|
|
09:00-18:30, Paper Mo-Online.248 | |
A Novel Semi-Automatic Approach for Security Risk Treatment for U-Space Solutions (I) |
|
Elia, Raffaele | University of Campania "Luigi Vanvitelli" |
Rak, Massimiliano | University of Naples Federico II |
Pascarella, Domenico | Italian Aerospace Research Centre (CIRA) |
Keywords: Information Assurance and Intelligence, Assurance
Abstract: The European U-space initiative is expected to drive the widespread adoption of drones, which in turn increases the risk of novel and evolving cyberattacks. Consequently, it is essential to prioritize the assessment and treatment of security risks within the design of U-space solutions. As part of the process, security controls must be selected and implemented to strengthen the system’s cybersecurity posture. However, such selection must take into account their costs, effectiveness, and efficiency. On the other hand, choosing the wrong security controls can leave the analyzed solution highly exposed to threat scenarios. Accordingly, manual security risk treatment could be impractical, especially for highly automated and interconnected systems like U-space. This paper introduces an innovative semi-automatic approach for the security risk treatment of U‑space solutions, introducing a bridging between some established frameworks. In detail, the work proposes a systematic integration between the NIST CyberSecurity Framework (CSF) and the Security Risk Assessment Methodology (SecRAM), with the latter representing the main point of reference within the Single European Sky ATM Research (SESAR) programme. The paper demonstrates the effectiveness and cost-efficiency of the approach in developing secure U-space systems through a case study on pharmaceutical delivery in a U‑space environment.
|
|
09:00-18:30, Paper Mo-Online.249 | |
Using Multiple Model Fusion and Attention Mechanism to Recognize Autism Based on Facial Images |
|
Chen, Ru | Qufu Normal University |
Chen, Weiyang | Qufu Normal University |
Pan, Yi | Shenzhen University of Advanced Technology |
Keywords: Application of Artificial Intelligence, AI and Applications, Image Processing and Pattern Recognition
Abstract: Autism Spectrum Disorder (ASD) is a neurodevelopmental condition typically emerging in early childhood, characterized by challenges in social interaction, communication deficits, restricted interests, and repetitive behavioral patterns. While no complete cure exists, early intervention remains critical for symptom management and skill development in affected children. Traditional diagnostic approaches depend on clinical assessments by mental health experts following established criteria, yet these methods are constrained by subjectivity, prolonged evaluation periods, and high costs. Aiming at the problem that most of the existing studies use a single deep learning model to classify autistic facial images with insufficient accuracy, this study presents an innovative deep learning framework MFAN that combines pre-trained VGG16 and MobileNetV2 architectures with a Convolutional Block Attention Module (CBAM). By exploiting the discriminative facial features distinguishing autistic and neurotypical individuals, the proposed model aims to classify ASD status using facial images. Evaluation metrics include accuracy, precision, and recall. After rigorous training and validation protocols, the MFAN model achieved 92.67% test accuracy and an AUC-ROC of 0.9635. These results outperform standalone VGG16 and MobileNetV2 models and their simple combinations, demonstrating enhanced classification efficacy. The findings highlight the potential of deep transfer learning for scalable ASD screening, offering a promising tool for early detection in population-level contexts.
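The fusion idea can be sketched as two ImageNet-pretrained torchvision backbones whose feature maps are concatenated, refined by a CBAM-style attention block, and classified by a small head. Backbone choices, the compact CBAM, and the head are assumptions for illustration; this is not the authors' MFAN code.
```python
# Assumption-heavy sketch of a VGG16 + MobileNetV2 + CBAM fusion classifier (torchvision >= 0.13).
import torch
import torch.nn as nn
from torchvision import models


class CBAM(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(channels, channels // reduction), nn.ReLU(),
                                 nn.Linear(channels // reduction, channels))
        self.spatial = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, x):
        b, c, _, _ = x.shape
        ca = torch.sigmoid(self.mlp(x.mean(dim=(2, 3))) + self.mlp(x.amax(dim=(2, 3))))
        x = x * ca.view(b, c, 1, 1)                                   # channel attention
        sa = torch.sigmoid(self.spatial(torch.cat([x.mean(1, keepdim=True),
                                                   x.amax(1, keepdim=True)], dim=1)))
        return x * sa                                                 # spatial attention


class MFANSketch(nn.Module):
    def __init__(self, num_classes=2):
        super().__init__()
        self.vgg = models.vgg16(weights="IMAGENET1K_V1").features        # (B, 512, 7, 7)
        self.mbn = models.mobilenet_v2(weights="IMAGENET1K_V1").features  # (B, 1280, 7, 7)
        self.cbam = CBAM(512 + 1280)
        self.head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                  nn.Linear(512 + 1280, num_classes))

    def forward(self, x):                                              # x: (B, 3, 224, 224)
        fused = torch.cat([self.vgg(x), self.mbn(x)], dim=1)
        return self.head(self.cbam(fused))
```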
|
|
09:00-18:30, Paper Mo-Online.250 | |
Edge-Cloud Collaborative Framework with Vision Transformer for Efficient Multi-Task Applications |
|
Zhang, Zhipeng | China Mobile Research Institute |
Ma, Wenting | China Mobile Research Institute |
Guo, Meng | China Mobile Research Institute |
Yang, Lei | China Mobile Research Institute |
Guo, Yongjie | China Mobile GBA (Greater Bay Area) Innovation Institute |
Keywords: Cloud, IoT, and Robotics Integration, Neural Networks and their Applications, Deep Learning
Abstract: The advent of Vision Transformers (ViTs) has significantly reshaped the landscape of computer vision, delivering competitive performance across a wide range of visual recognition tasks. However, the substantial size and computational demands of large-scale foundation models hinder their broader deployment on edge devices. In this paper, we address the challenge of deploying ViT-based models within an edge-cloud collaborative framework. Specifically, we propose a novel framework tailored for multi-task applications, comprising a universal block (UB) deployed at the edge and multiple task-specific branches (TSBs) executed in parallel in the cloud. Furthermore, we incorporate a lightweight and efficient patch slimming strategy that is well-suited for edge-device deployment, providing a simple yet effective means of reducing computational overhead. We empirically validate the effectiveness of our approach using several state-of-the-art ViT models across diverse benchmark datasets.
|
|
09:00-18:30, Paper Mo-Online.251 | |
FI-Transformer: A Fast Time Series Anomaly Detection Method |
|
Zhang, Anyi | State Key Laboratory of Networking and Switching Technology, Bei |
Xu, Peng | State Key Laboratory of Networking and Switching Technology, Bei |
Wang, Yusheng | State Key Laboratory of Networking and Switching Technology, Bei |
Keywords: Neural Networks and their Applications, Deep Learning, Application of Artificial Intelligence
Abstract: With the growing adoption of distributed systems in enterprise digital applications, real-time analysis of operational data is vital, and rapid anomaly detection and alerting are crucial for timely intervention. In such scenarios, the efficiency of anomaly detection primarily depends on the accuracy and inference time of the respective methods, while the update efficiency of anomaly detection capability mainly relies on the training time of the models employed. However, existing machine learning-based anomaly detection methods often suffer from long training and inference times, particularly when dealing with spatio-temporal dependencies. To address these challenges, we propose FI-Transformer, an improved Transformer-based method for fast anomaly detection in time series data. FI-Transformer introduces enhanced time series data embedding and a TS Feature Attention mechanism to capture complex spatio-temporal dependencies. Moreover, it features an optimized encoder-decoder structure, reducing training and inference time. Experimental validation against Transformer, LSTM Encoder-Decoder, and Anomaly Transformer demonstrates FI-Transformer's superior precision, recall, F1 score, training time, and inference time.
|
|
09:00-18:30, Paper Mo-Online.251 | |
A CBiLAM Based Spatio-Temporal Data Mining Model for Earth Observation Satellites Demand Prediction |
|
Chen, Zipeng | National University of Defence Technology |
Gu, Tianyang | National University of Defense Technology |
Wang, Jianjiang | National University of Defence Technology, College of Systems E |
Keywords: Application of Artificial Intelligence, AI and Applications, Deep Learning
Abstract: The number and variety of observation data have increased significantly with the intellectualization of Earth Observation Satellites (EOSs). In order to make efficient use of these observation data, this paper proposes a novel framework for predicting observation demand using a CNN + BiLSTM + Attention Network (CBiLAM), which is the first of its kind to analyze the problem of EOSs demand from a spatio-temporal perspective. In contrast to traditional deep learning methods, which rely on historical time series data at specific locations, our approach constructs spatio-temporal data sets to better capture the complex dependencies between different locations and times. Simulation experiments on observation demand prediction and observation area prediction show that our proposed method achieves satisfactory performance, illustrating its potential for improving future EOS observation scheduling.
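A rough sketch of a CNN + BiLSTM + attention pipeline of the kind this abstract describes: a 1D convolution extracts local features per time step, a BiLSTM models the sequence, and additive attention pools it before a regression head predicts demand. All layer sizes are assumptions, not the CBiLAM configuration.
```python
# Illustrative CNN + BiLSTM + attention regressor for sequence-to-scalar demand prediction.
import torch
import torch.nn as nn


class CBiLAMSketch(nn.Module):
    def __init__(self, in_features, cnn_dim=64, lstm_dim=128):
        super().__init__()
        self.cnn = nn.Sequential(nn.Conv1d(in_features, cnn_dim, kernel_size=3, padding=1),
                                 nn.ReLU())
        self.bilstm = nn.LSTM(cnn_dim, lstm_dim, batch_first=True, bidirectional=True)
        self.attn = nn.Linear(2 * lstm_dim, 1)
        self.head = nn.Linear(2 * lstm_dim, 1)

    def forward(self, x):                                    # x: (B, T, in_features)
        h = self.cnn(x.transpose(1, 2)).transpose(1, 2)      # (B, T, cnn_dim)
        h, _ = self.bilstm(h)                                # (B, T, 2*lstm_dim)
        a = torch.softmax(self.attn(h), dim=1)               # temporal attention weights
        context = (a * h).sum(dim=1)                         # (B, 2*lstm_dim)
        return self.head(context).squeeze(-1)                # predicted demand
```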
|
|
09:00-18:30, Paper Mo-Online.252 | |
TRUE DAO-Based Smart Journals for Sustainable Publishing |
|
Ma, Siji | Macau University of Science and Technology |
Li, Juanjuan | Institute of Automation, Chinese Academy of Sciences |
Liu, Yuhang | Institute of Automation, Chinese Academy of Science |
Huang, Jun | Macau University of Science and Technology |
Lin, Fei | Macau University of Science and Technology |
Zhang, Tengchao | Macau University of Science and Technology |
Ni, Qinghua | Macau University of Science and Technology |
Jiang, Tai | Macau University of Science and Technology |
Wang, Fei-Yue | Institute of Automation, Chinese Academy of Sciences |
Keywords: Agent-Based Modeling, AI and Applications, Deep Learning
Abstract: Academic journals serve as pivotal bridges for knowledge dissemination and technology innovation, playing a crucial role in promoting scientific research, industrial progress, and societal development. However, traditional models of journal management and operation, hindered by prolonged peer review processes, overarching publication inefficiencies, and surging page expenditures, are increasingly powerless to address challenges posed by ever-faster knowledge updates, broad information dissemination, and interdisciplinary scholarly work. This has led to an urgent need for smart organizations and intelligent operations of journals. In view of this, the paper identifies the primary issues in current journal management and proposes the concept of smart journals based on TRUE Autonomous Organizations and Operations (TRUE DAO or TAO). This paper introduces the foundational architecture of smart journals and proposes incentive mechanisms for a more dynamic publishing ecosystem. Moreover, a simulation experiment is designed to evaluate an adaptive incentive mechanism, demonstrating significant improvements in review quality and accuracy through personalized incentive allocation strategies. Smart journals not only enable digital transformation but also foster profound innovation and long-term sustainability in publishing.
|
|
09:00-18:30, Paper Mo-Online.253 | |
SET-Motif: A Lightweight Sparse Training Approach with Implications for Distributed and Multi-Agent Learning (I) |
|
Su, Yuan | Universiteit Van Amsterdam |
Liu, Hongyun | Eindhoven University of Technology |
Keywords: Neural Networks and their Applications, Optimization and Self-Organization Approaches, Representation Learning
Abstract: As learning systems grow more decentralized and dynamic, efficient and resilient training methods become increasingly important. Sparse Evolutionary Training (SET) is a scalable approach to neural network training that leverages sparse connectivity to reduce computational demands while maintaining performance. In this work, we introduce SET-motif, a topologically inspired extension of SET that replaces individual connection updates with a fixed motif-based structure during the pruning and reconnection phases. This design introduces structured variation across training epochs, increasing stochasticity and simulating conditions commonly found in real-world dynamic environments. Despite the added variability, SET-motif achieves up to 22% faster training while maintaining accuracy within 1.5% of standard SET, as demonstrated on FMNIST and lung X-ray classification tasks. These results demonstrate that SET-motif offers a lightweight and adaptable learning approach that aligns with the demands of scalable, robust training in multi-agent and decentralized systems.
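For context, a single SET-style prune-and-regrow step on a masked weight matrix looks like the sketch below: remove the smallest-magnitude active weights, then regrow the same number at random inactive positions. SET-motif replaces the per-connection regrowth with fixed motif-shaped groups; that grouping is abstracted away here, so treat this as the baseline mechanism only.
```python
# Baseline SET prune-and-regrow step (NumPy, mask-based); the motif grouping is not modelled.
import numpy as np


def set_prune_and_regrow(weights, mask, zeta=0.3, rng=None):
    """Remove the fraction `zeta` of smallest-magnitude active weights, regrow as many elsewhere."""
    if rng is None:
        rng = np.random.default_rng(0)
    w, m = weights.ravel().copy(), mask.ravel().copy()
    active = np.flatnonzero(m)
    n_prune = int(zeta * active.size)
    # prune: smallest |w| among active connections
    prune_idx = active[np.argsort(np.abs(w[active]))[:n_prune]]
    m[prune_idx] = 0
    w[prune_idx] = 0.0
    # regrow: same number of new connections at random inactive positions
    inactive = np.flatnonzero(m == 0)
    grow_idx = rng.choice(inactive, size=n_prune, replace=False)
    m[grow_idx] = 1
    w[grow_idx] = rng.normal(0.0, 0.01, size=n_prune)
    return w.reshape(weights.shape), m.reshape(mask.shape)
```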
|
|
09:00-18:30, Paper Mo-Online.254 | |
CGWSA: A Novel Strategy for Task-Dependent Load Balancing in Distributed Systems (I) |
|
Ma, Hanbo | BeiHang University |
Ma, Yaofei | Beihang University |
Yuan, Haitao | Beihang University |
Wang, Yihuan | Beihang University |
Keywords: Swarm Intelligence, Heuristic Algorithms, Cloud, IoT, and Robotics Integration
Abstract: Distributed systems form the critical infrastructure supporting high-performance computing and complex simulations, with effectiveness heavily dependent on load balancing strategies. Within these systems, distributed simulation tasks present unique challenges through strict temporal dependencies and sequential constraints that traditional methods struggle to address. This paper proposes CGWSA—a Color-Graph Grey Wolf-Simulated Annealing hybrid algorithm—that fundamentally advances load balancing for distributed simulation workloads. Our methodology introduces a comprehensive dual-layer framework that considers both node heterogeneity and intricate task dependencies: 1) Color-Graph preprocessing that categorizes tasks by resource dominance patterns, enabling efficient parallelization while preserving execution priorities; and 2) A bio-inspired optimization engine that combines Grey Wolf Optimizer's hierarchical search capabilities with Simulated Annealing's probabilistic acceptance mechanism to prevent local optima trapping. Experimental results on simulated computing environments demonstrate CGWSA's superiority with the lowest load balance degree of 1.678 and optimal makespan of 9.31 seconds—10.6% faster than the second-best approach. The algorithm's dependency-aware scheduling architecture establishes new performance standards for time-sensitive simulation computing while maintaining applicability across diverse distributed environments including cloud computing and smart manufacturing systems.
|
|
09:00-18:30, Paper Mo-Online.255 | |
A Transformer-Enhanced BiLSTM Model for Classifying Driver States from Physiological and Motion Signals under Auditory Stimuli |
|
Shajari, Arian | Deakin University |
Asadi, Houshyar | Deakin University |
Nazari, Farhad | Deakin University |
Najdovski, Zoran | Deakin University |
Nahavandi, Saeid | Swinburne University of Technology |
Keywords: Neural Networks and their Applications, Machine Learning, Deep Learning
Abstract: Traffic accidents are a major public safety challenge around the world and are often influenced by the cognitive and physiological states of drivers. Among the multiple in-vehicle factors, listening to music has shown complex effects on driver behavior, particularly in relation to music tempo. This study proposes TransBiNet, a novel deep learning architecture that integrates Transformer-based attention mechanisms with Bidirectional Long Short-Term Memory layers to classify driver states under different auditory conditions using internal biometric signals. Data were collected from 26 participants driving in a simulated environment, where each subject completed scenarios involving fast-tempo music, slow-tempo music, and no music. Physiological signals (heart rate, breathing rate, galvanic skin response, and skin temperature) and head motion data (gyroscope and accelerometer) were gathered via wearable sensors and used as input to the model. The architecture was optimized through Hyperband-based hyperparameter tuning and showed a test accuracy of 97.62% as well as strong precision and recall across all classes. The results showed that internal physiological and motion-based signals are sufficient for robust classification of music-induced driver states, supporting the potential for real-time, sensor-driven driver monitoring systems in intelligent transportation.
|
|
09:00-18:30, Paper Mo-Online.256 | |
ABAA-YOLO: Enhanced YOLOv8 with Abnormal Brain Area Aware Network for Tumor Detection |
|
Cao, Lei | Beihang University |
Wang, Hanyu | Beihang University |
Wu, Di | Beihang University |
Fang, Chen | Shanghai Jiao Tong University |
Keywords: Image Processing and Pattern Recognition, Biometric Systems and Bioinformatics, Deep Learning
Abstract: Current brain tumor detection methods face challenges in balancing computational efficiency with accurate lesion localization, particularly when handling heterogeneous tumor morphology across individual anatomical variations. In this paper, we propose the Abnormal Brain Area Aware (ABAA) Network, an enhanced YOLOv8 architecture for brain tumor detection with three key contributions: (1) a brain template-guided module that explicitly incorporates anatomical constraints during feature extraction; (2) a sparse self-attention mechanism introduced into YOLOv8 that effectively reduces computational redundancy while maintaining lesion focus; and (3) a context-aware contrastive loss that enforces tumor region consistency. On the Br35H dataset, ABAA-YOLO achieves state-of-the-art 96.4% precision (+3.5% vs. YOLOv8n) and 92.9% recall (+6.0%), with 96.3% mAP50, outperforming BGF-YOLO by 3.5%. The model maintains computational efficiency at 10.3 GFLOPs, 54% lower than BGF-YOLO's 22.3 GFLOPs. Ablation studies validate each component's contribution, showing a 4.8% precision improvement from sparse attention. Visualization confirms that the detected regions align with pathological characteristics.
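As a rough illustration of template-guided feature extraction, the sketch below gates backbone feature maps with a resized anatomical-prior mask; the gating rule, floor value, and shapes are assumptions for illustration, not the ABAA module itself:

```python
import torch
import torch.nn.functional as F

def template_gated_features(feat, template_prior, floor=0.2):
    """Modulate backbone features with a brain-template prior.

    feat:           (B, C, H, W) feature map from the detector backbone.
    template_prior: (B, 1, Ht, Wt) probability of belonging to brain tissue,
                    e.g. obtained by registering a population template.
    floor:          keeps a minimum response outside the template so that
                    atypical anatomy is not suppressed entirely.
    """
    prior = F.interpolate(template_prior, size=feat.shape[-2:],
                          mode="bilinear", align_corners=False)
    gate = floor + (1.0 - floor) * prior            # values in [floor, 1]
    return feat * gate

feat = torch.randn(2, 64, 40, 40)
prior = torch.rand(2, 1, 160, 160)
print(template_gated_features(feat, prior).shape)   # torch.Size([2, 64, 40, 40])
```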
|
|
09:00-18:30, Paper Mo-Online.257 | |
ALSGCN: An Attention-Based Long and Short-Term Graph Convolutional Network for Stock Recommendation |
|
Yu, Junpeng | Jinan University |
Yao, Wenjie | Jinan University |
Li, Zhihao | Jinan University |
Gao, Lele | Jinan University |
Xiao, Wenyun | Jinan University |
Wang, Hongnian | North Sichuan Medical College |
Keywords: Neural Networks and their Applications, Deep Learning, Application of Artificial Intelligence
Abstract: Stock recommendation plays a critical role in financial investment decision-making. In real-world markets, stocks exhibit complex interdependencies through correlated price movements. Existing approaches derive stock relationships either from fundamental information (e.g., industry categories) or price temporal patterns. However, these methods face limitations: domain expertise-based approaches fail to capture implicit correlations, while short-term relationship modeling suffers from inadequate temporal feature representation. To address these challenges, we propose ALSGCN, an Attention-based Long- and Short-term Graph Convolutional Network for stock recommendation. ALSGCN captures dynamic stock relationships through two key components: (1) a market-aware mechanism that models long-term dependencies by incorporating the influence of large-cap stocks, and (2) an attention-based temporal feature extraction module that adaptively weights information across time steps. These complementary relationship representations are integrated through a multi-channel graph attention module for effective feature learning. Extensive experiments on two real-world datasets demonstrate that ALSGCN consistently outperforms state-of-the-art baselines across most evaluation metrics. Code is available at https://github.com/hongnianwang/ALSGCN.
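A minimal sketch combining attention-weighted temporal pooling with one row-normalised graph propagation step, to illustrate the general pattern of attention over time steps plus graph convolution over stock relations; the adjacency construction, dimensions, and layer design are placeholders, not ALSGCN itself:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttnTemporalGCN(nn.Module):
    """Attention-pooled temporal features followed by one graph convolution."""
    def __init__(self, in_dim, hid_dim):
        super().__init__()
        self.score = nn.Linear(in_dim, 1)      # attention over time steps
        self.gcn_w = nn.Linear(in_dim, hid_dim)

    def forward(self, x, adj):
        # x: (N_stocks, T, in_dim) price features; adj: (N, N) stock relations
        alpha = F.softmax(self.score(x), dim=1)         # (N, T, 1)
        node_feat = (alpha * x).sum(dim=1)              # adaptively weighted summary
        deg = adj.sum(dim=1, keepdim=True).clamp(min=1e-6)
        prop = (adj @ node_feat) / deg                  # row-normalised propagation
        return F.relu(self.gcn_w(prop))                 # (N, hid_dim)

n, t, d = 30, 20, 8
adj = (torch.rand(n, n) > 0.7).float()
adj.fill_diagonal_(1.0)
out = AttnTemporalGCN(d, 16)(torch.randn(n, t, d), adj)
print(out.shape)   # torch.Size([30, 16])
```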
|
|
09:00-18:30, Paper Mo-Online.258 | |
Charging Scheduling Optimization of Electric Buses Considering Battery Degradation (I) |
|
Yang, Tianyi | Nanjing University |
Jingwen, Wei | Nanjing University |
Chen, Chunlin | Nanjing University |
Dong, Guangzhong | Harbin Institute of Technology, Shenzhen |
Keywords: Optimization and Self-Organization Approaches, Soft Computing, Socio-Economic Cybernetics, Intelligent Internet Systems
Abstract: With the depletion of fossil fuels and the rise of electric vehicles, electric buses have emerged as a new means of transportation, helping to reduce air pollution and energy consumption. However, the limited battery capacity of electric buses makes on-the-go charging a major concern, since installing chargers at every bus stop is impractical due to cost. Appropriate charging strategies for electric buses are therefore crucial. A critical issue that urgently needs resolution is determining the optimal timing and amount of charging for electric buses, given the current fleet size, routes, and charging station facilities. Battery aging, which affects battery capacity and thereby influences charging decisions and the operating costs of bus routes, must also be considered. This paper proposes a mixed-integer programming model to describe bus line operations and uses a semi-empirical method to estimate battery aging. The Gurobi solver is applied with an appropriate solution strategy to the resulting dual-objective model, jointly reducing the operating cost of bus lines and battery aging. The results show that the proposed model effectively reduces both the operating cost of the bus route and the aging of the battery.
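A minimal gurobipy sketch of a charging-scheduling MIP with a weighted-sum objective over electricity cost and a simple degradation proxy; all data, variable names, and constraints below are toy assumptions for illustration, not the paper's model:

```python
import gurobipy as gp
from gurobipy import GRB

# Toy instance: 3 buses, 6 charging slots (all values illustrative).
buses, slots = range(3), range(6)
price = [0.8, 0.6, 0.5, 0.7, 1.0, 1.2]   # electricity price per slot
energy_per_slot = 20.0                    # kWh delivered in one slot
demand = [45.0, 30.0, 60.0]               # kWh each bus must receive
degradation_cost = 0.15                   # proxy cost per charging event

m = gp.Model("bus_charging_sketch")
x = m.addVars(buses, slots, vtype=GRB.BINARY, name="charge")

# Each bus must receive at least its required energy.
m.addConstrs((energy_per_slot * x.sum(b, "*") >= demand[b] for b in buses),
             name="energy_demand")
# At most two chargers are available in any slot.
m.addConstrs((x.sum("*", s) <= 2 for s in slots), name="charger_capacity")

# Weighted sum of electricity cost and a simple aging proxy
# (more charging events -> more degradation).
operating_cost = gp.quicksum(price[s] * energy_per_slot * x[b, s]
                             for b in buses for s in slots)
aging_proxy = degradation_cost * x.sum()
m.setObjective(operating_cost + aging_proxy, GRB.MINIMIZE)

m.optimize()
if m.status == GRB.OPTIMAL:
    plan = [(b, s) for b in buses for s in slots if x[b, s].X > 0.5]
    print("charging plan:", plan)
```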
|
|
09:00-18:30, Paper Mo-Online.259 | |
Food-YOLO: An Improved Food Detection Algorithm Based on YOLOv9 |
|
Zhu, Yuxuan | Wenzhou Kean University |
Dib, Omar | Wenzhou Kean University |
Keywords: AI and Applications, Deep Learning, Image Processing and Pattern Recognition
Abstract: As the demand for food safety and quality assurance grows, efficient and accurate food detection becomes essential in both industrial and consumer applications. Current methods face challenges in feature selection, multi-scale contextual integration, and bounding box regression, leading to suboptimal performance. To address these issues, we introduce Food-YOLO, a framework incorporating three key innovations: 1) the Squeeze-and-Excitation (SE) attention mechanism for dynamic recalibration of channel-wise feature responses, 2) the Selective Kernel Fusion (SKFusion) mechanism for adaptive multi-scale feature fusion, and 3) a novel Powerful-IoU loss function that integrates a size-adaptive penalty and gradient-adjusting mechanism for precise bounding box regression. Extensive experiments demonstrate the effectiveness of the proposed method in food detection, which consistently outperforms baseline models across the UEC-FOOD-100 and UNIMIB2016 datasets. Ablation studies further highlight the contribution of integrating these key components, demonstrating significant advancements in food detection and offering valuable insights for developing high-performance vision systems in real-world scenarios.
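For reference, a standard Squeeze-and-Excitation block of the kind named above, written in PyTorch; the channel count and reduction ratio are illustrative defaults, not the Food-YOLO configuration:

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-Excitation: dynamically recalibrate channel responses."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):                      # x: (B, C, H, W)
        b, c, _, _ = x.shape
        w = x.mean(dim=(2, 3))                 # squeeze: global average pool
        w = self.fc(w).view(b, c, 1, 1)        # excitation: per-channel weights
        return x * w                           # rescale feature maps

x = torch.randn(2, 64, 32, 32)
print(SEBlock(64)(x).shape)   # torch.Size([2, 64, 32, 32])
```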
|
|
09:00-18:30, Paper Mo-Online.260 | |
Deep Q-Network for Optimising the Weights of Model Predictive Control-Based Motion Cueing Algorithm |
|
Al-serri, Sari | Deakin University |
Asadi, Houshyar | Deakin University |
Mohamed, Shady | Deakin University |
Chalak Qazani, Mohamad Reza | College of Science and Engineering |
Lim, Chee Peng | Deakin University |
Nahavandi, Saeid | Swinburne University of Technology |
Keywords: Machine Learning, Deep Learning, Agent-Based Modeling
Abstract: Motion cueing algorithms aim to replicate realistic motion sensations for drivers while adhering to the physical limitations of the simulation platform. Model predictive control has been widely used in motion cueing algorithms for vehicle and flight simulators due to its ability to handle system constraints and optimise motion fidelity. However, traditional model predictive control-based motion cueing algorithms rely on manually tuned cost function weights, which can be suboptimal and difficult to determine for different operating conditions. In this paper, we propose a reinforcement learning weight optimisation approach for the model predictive control-based motion cueing algorithm, leveraging Deep Q-Networks to determine an optimal set of cost function weights through training in a simulated environment. The optimised weights minimise the cost function, thereby maximising the reward function. Simulation results demonstrate that the proposed method outperforms the traditional approach by reducing motion sensation errors and improving platform utilisation. The reinforcement learning-based controller achieves better correlation between the reference and simulated signals for both sensed specific force and angular velocity, enhancing overall motion fidelity. It also reduces the root mean square error for sensed specific force, ensuring more accurate replication of target motion cues. Additionally, the method enables broader use of the simulator's linear displacement range, confirming the effectiveness of reinforcement learning in tuning control parameters for superior simulation performance.
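A compact sketch of using a Deep Q-Network to nudge discretised MPC cost-function weights up or down, with a stubbed reward standing in for the motion-cueing simulation; the action set, network sizes, and reward are assumptions for illustration, not the paper's setup:

```python
import random
import torch
import torch.nn as nn

# Discrete actions: scale one of three MPC cost-function weights up or down.
ACTIONS = [(i, f) for i in range(3) for f in (0.8, 1.25)]

q_net = nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, len(ACTIONS)))
target_net = nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, len(ACTIONS)))
target_net.load_state_dict(q_net.state_dict())
opt = torch.optim.Adam(q_net.parameters(), lr=1e-3)

def simulate_reward(weights):
    """Stand-in for running the MPC motion-cueing simulation; in the paper's
    setting this would be e.g. the negative sensed-motion error."""
    target = torch.tensor([1.0, 0.5, 2.0])
    return -float(((weights - target) ** 2).sum())

def step(weights, eps=0.1, gamma=0.9):
    if random.random() < eps:
        a = random.randrange(len(ACTIONS))          # explore
    else:
        a = int(q_net(weights).argmax())            # exploit current Q-values
    idx, factor = ACTIONS[a]
    next_w = weights.clone()
    next_w[idx] *= factor
    r = simulate_reward(next_w)
    with torch.no_grad():
        target = r + gamma * target_net(next_w).max()   # Bellman target
    loss = (q_net(weights)[a] - target) ** 2
    opt.zero_grad(); loss.backward(); opt.step()
    return next_w

w = torch.tensor([1.0, 1.0, 1.0])
for t in range(200):
    w = step(w)
    if t % 50 == 0:
        target_net.load_state_dict(q_net.state_dict())
print("tuned weights:", w)
```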
|
|
09:00-18:30, Paper Mo-Online.261 | |
Impact of Domain Adaptation in Deep Learning for Medical Image Classifications |
|
Yihang, Wu | Guilin University of Electronic Technology |
Chaddad, Ahmad | Guilin University of Electronic Technology |
Keywords: Application of Artificial Intelligence, Deep Learning, Machine Vision
Abstract: Domain adaptation (DA) is a rapidly expanding area of machine learning that involves adjusting a model trained in one domain to perform well in another. Despite notable progress, the fundamental concept of many DA methodologies has persisted: aligning data from different domains into a shared feature space, in which knowledge acquired from labeled source data can improve model training on target data that lacks sufficient labels. In this study, we use 10 deep learning models to simulate common DA techniques and explore their application to four medical image datasets. We consider various situations such as multi-modality, noisy data, federated learning (FL), interpretability analysis, and classifier calibration. The experimental results indicate that using DA with ResNet34 on a brain tumor (BT) dataset yields a 4.7% improvement in model performance. Similarly, DA can reduce the impact of Gaussian noise, providing an approximately 3% accuracy increase with ResNet34 on a BT dataset. Furthermore, simply introducing DA into an FL framework shows limited potential (e.g., an approximately 0.3% performance increase) for skin cancer classification. In addition, DA can improve the interpretability of the models via the GradCAM++ technique, which offers clinical value. Calibration analysis also demonstrates that DA provides an expected calibration error (ECE) approximately 2% lower than CNN alone on a multi-modality dataset. The codes for our experiments are available at https://github.com/AIPMLab/Domain_Adaptation.
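The abstract does not name a specific alignment loss; as one common example of aligning source and target features in a shared space, here is a minimal Deep CORAL loss sketch (the feature names and dimensions are illustrative, not necessarily the technique used in the paper):

```python
import torch

def coral_loss(source_feat, target_feat):
    """Deep CORAL: align second-order statistics of source/target features.

    source_feat, target_feat: (N, d) feature batches from a shared backbone
    (e.g. the penultimate layer of a ResNet34).
    """
    d = source_feat.size(1)

    def covariance(f):
        f = f - f.mean(dim=0, keepdim=True)
        return (f.t() @ f) / (f.size(0) - 1)

    diff = covariance(source_feat) - covariance(target_feat)
    return (diff ** 2).sum() / (4.0 * d * d)

# Training would typically minimise: classification_loss + lambda * coral_loss
src = torch.randn(32, 512)
tgt = torch.randn(32, 512)
print(float(coral_loss(src, tgt)))
```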
|
|
09:00-18:30, Paper Mo-Online.262 | |
Deep Modeling and Interpretation for Bladder Cancer Classification |
|
Chaddad, Ahmad | Guilin University of Electronic Technology |
Chen, Xianrui | Guilin University of Electronic Technology |
Yihang, Wu | Guilin University of Electronic Technology |
Keywords: Application of Artificial Intelligence, Deep Learning, Machine Learning
Abstract: Deep models based on vision transformers (ViT) and convolutional neural networks (CNN) have demonstrated remarkable performance on natural image datasets. However, these models may not perform similarly in medical imaging, where abnormal regions cover only a small portion of the image. This challenge motivates this study to investigate the latest deep models for bladder cancer classification tasks. We propose the following to evaluate these deep models: 1) standard classification using 13 models (four CNNs and eight transformer-based models); 2) calibration analysis to examine whether these models are well calibrated for bladder cancer classification; and 3) GradCAM++ to evaluate the interpretability of these models for clinical diagnosis. We run approximately 300 experiments on a public multicenter bladder cancer dataset, and the results demonstrate that the ConvNeXt series shows limited generalization ability for classifying bladder cancer images (e.g., approximately 60% accuracy). In addition, ViTs show better calibration than the ConvNeXt and Swin Transformer series. We also apply test-time augmentation to improve the models' interpretability. Finally, no model provides a one-size-fits-all solution for a feasible interpretable model: the ConvNeXt series is suitable for in-distribution samples, while ViT and its variants are suitable for interpreting out-of-distribution samples. The codes are available at https://github.com/AIPMLab/SkinCancerSimulation.
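For reference, the expected calibration error used in calibration analyses of this kind is the confidence-binned gap between accuracy and confidence; a minimal NumPy sketch follows (the binning scheme and random data are placeholders, not the paper's exact protocol):

```python
import numpy as np

def expected_calibration_error(probs, labels, n_bins=10):
    """ECE: confidence-weighted gap between accuracy and confidence per bin.

    probs:  (N, C) predicted class probabilities
    labels: (N,) ground-truth class indices
    """
    conf = probs.max(axis=1)
    pred = probs.argmax(axis=1)
    correct = (pred == labels).astype(float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (conf > lo) & (conf <= hi)
        if mask.any():
            ece += mask.mean() * abs(correct[mask].mean() - conf[mask].mean())
    return ece

probs = np.random.dirichlet(np.ones(2), size=1000)
labels = np.random.randint(0, 2, size=1000)
print(expected_calibration_error(probs, labels))
```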
|
|
09:00-18:30, Paper Mo-Online.263 | |
Analyzing the Calibration of CLIP Models under Noisy Data Conditions |
|
Gao, Yuxin | Guilin University of Electronic Technology |
Yihang, Wu | Guilin University of Electronic Technology |
Chaddad, Ahmad | Guilin University of Electronic Technology |
Keywords: Application of Artificial Intelligence, Deep Learning, Machine Learning
Abstract: Contrastive Language-Image Pretraining (CLIP) has emerged as a powerful paradigm for cross-modal learning, using image-text pairs to achieve remarkable zero-shot classification performance. However, its calibration on noisy data has been less explored, especially in out-of-distribution settings where overconfidence can lead to misclassification. In this work, we evaluate CLIP's calibration under different noise conditions using 10 noise types from the ImageNet-C dataset, spanning natural noise, digital noise, weather noise, and blur. In addition, we propose test-time augmentation (TTA) to improve calibration by increasing prediction diversity and reducing overconfidence. We conduct approximately 600 simulations, and the experimental results show that ViT-B/32 achieves higher accuracy (ACC) and lower expected calibration error (ECE) than ResNet50 in in-distribution settings (e.g., 94.66% vs. 62.37% ACC and 0.59% vs. 1.03% ECE on brightness). However, ResNet50 outperforms ViT-B/32 in out-of-domain (OOD) situations (e.g., 3.14% vs. 20.93% ECE on defocus when fine-tuned on glass). By applying TTA, we reduce ViT-B/32's ECE to 13.07% on defocus (fine-tuned on glass), demonstrating its effectiveness. Our results highlight the importance of calibration in cross-modal learning and provide a simple yet effective solution for noisy and OOD calibration. Our code is available at: https://github.com/AIPMLab/CLIPSimulationsGao.
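A minimal sketch of test-time augmentation as a calibration aid: average the softmax outputs over several augmented views of the same image. The transforms and the dummy classifier below are assumptions for illustration, not the paper's exact setup:

```python
import torch
import torchvision.transforms as T

def tta_probabilities(model, image, n_views=8):
    """Average softmax outputs over random augmented views of one image.

    model: any classifier mapping (B, 3, H, W) -> logits
    image: (3, H, W) tensor in [0, 1]
    """
    augment = T.Compose([
        T.RandomResizedCrop(image.shape[-1], scale=(0.8, 1.0)),
        T.RandomHorizontalFlip(),
        T.ColorJitter(brightness=0.2, contrast=0.2),
    ])
    views = torch.stack([augment(image) for _ in range(n_views)])
    with torch.no_grad():
        probs = torch.softmax(model(views), dim=1)
    # Averaging over views increases prediction diversity and tends to
    # reduce overconfidence on noisy or shifted inputs.
    return probs.mean(dim=0)

model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 224 * 224, 10))
image = torch.rand(3, 224, 224)
print(tta_probabilities(model, image).sum())   # ~1.0
```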
|
|
09:00-18:30, Paper Mo-Online.264 | |
Performance Evaluation of the CLIP Model in Classification Tasks |
|
Gao, Yuxin | Guilin University of Electronic Technology |
Qin, Baosheng | Guilin University of Electronic Technology |
Chen, Fenglian | Guilin University of Electronic Technology |
Huang, Xiaohan | Guilin University of Electronic Technology |
Wu, Jiajun | Guilin University of Electronic Technology |
Yihang, Wu | Guilin University of Electronic Technology |
Chaddad, Ahmad | Guilin University of Electronic Technology |
Keywords: Application of Artificial Intelligence, Deep Learning, Image Processing and Pattern Recognition
Abstract: CLIP (Contrastive Language-Image Pretraining), a deep learning model developed by OpenAI, combines image and text encoders to model the relationship between images and text for cross-modal classification tasks. Although CLIP shows remarkable performance in classification tasks, how its performance depends on optimization strategies and parameter configurations has not yet been explored. In this study, we explore the performance of the CLIP model on a natural dataset (NoisyDA) and medical datasets (HAM10000 and Eye Diseases). We simulate various settings of the CLIP model, including network backbones, weight decay, learning rate, and optimizer. The experimental results indicate that ViT-B/32 achieves the highest accuracy of 83.83% on the HAM10000 dataset using the AdamW optimizer. Furthermore, on the Eye Diseases dataset, the accuracy of ViT-B/32 with the SGD optimizer is 79.16%, which is higher than ResNet-50's 78.12% with the Adam and AdamW optimizers. On the NoisyDA dataset, ViT-B/32 consistently outperforms the ResNet50 model regardless of the optimizer. Our results indicate that the ViT-B/32 model consistently outperforms the ResNet50 model on the HAM10000, Eye Diseases, and NoisyDA datasets across various optimization settings. Our code is available at https://github.com/AIPMLab/CLIPSimulationsGao.
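A generic sketch of the kind of optimizer comparison described above, applied to a linear probe on frozen image embeddings; the feature dimension, class count, and hyperparameters are placeholders rather than the paper's configuration:

```python
import torch
import torch.nn as nn

def train_linear_probe(features, labels, optimizer_name, epochs=20, lr=1e-3, wd=1e-4):
    """Fit a linear classifier on frozen image features with a chosen optimizer."""
    head = nn.Linear(features.size(1), int(labels.max()) + 1)
    optimizers = {
        "sgd":   torch.optim.SGD(head.parameters(), lr=lr, momentum=0.9, weight_decay=wd),
        "adam":  torch.optim.Adam(head.parameters(), lr=lr, weight_decay=wd),
        "adamw": torch.optim.AdamW(head.parameters(), lr=lr, weight_decay=wd),
    }
    opt = optimizers[optimizer_name]
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(head(features), labels)
        loss.backward()
        opt.step()
    acc = (head(features).argmax(dim=1) == labels).float().mean()
    return float(acc)

# Stand-in for frozen CLIP image embeddings (e.g. 512-dim for ViT-B/32).
feats = torch.randn(500, 512)
labels = torch.randint(0, 7, (500,))
for name in ("sgd", "adam", "adamw"):
    print(name, train_linear_probe(feats, labels, name))
```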
|
|
09:00-18:30, Paper Mo-Online.265 | |
SQFE: A Scalable Quantum Feature Extraction Model for Deep Neural Networks |
|
Montajab, Sara | University of Calgary |
Leung, Henry | University of Calgary |
Balaji, Bhashyam | Defence Research and Development Canada |
Keywords: Quantum Machine Learning, Quantum Cybernetics, Machine Learning
Abstract: Convolutional neural networks (CNNs) have demonstrated strong performance in image processing tasks. However, as the complexity of the image scene increases, their ability to efficiently extract meaningful features diminishes. Quantum computing is increasingly recognized for its potential to address this limitation. We propose Scalable Quantum Feature Extraction (SQFE), a novel quantum feature extraction model designed as a variational quantum circuit (VQC) that mimics convolutional behavior while leveraging quantum principles. Unlike prior quantum feature extraction models that discard spatial features, use fixed quantum filters, or prevent joint optimization, SQFE supports full end-to-end training through backpropagation and is compatible with deep neural networks. SQFE is integrated into a ResNet-18 network to form the hybrid model QuRes. We evaluate multiple QuRes variants on a real-world dataset and demonstrate that QuRes models achieved test accuracies between 86.15% and 87.82%, outperforming the classical ResNet-18 with 84.84% accuracy. In terms of model complexity, QuRes reduced total parameter counts by 65%. These results highlight the potential of hybrid quantum-classical models to improve efficiency, scalability, and learning capacity in machine learning applications.
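A minimal hybrid sketch in the spirit of a variational quantum feature extractor feeding a classical head, written with PennyLane's Torch interface so the whole model trains via backpropagation; the embedding, ansatz, qubit count, and layer sizes are assumptions, not the SQFE circuit itself:

```python
import torch
import torch.nn as nn
import pennylane as qml

n_qubits, n_layers = 4, 2
dev = qml.device("default.qubit", wires=n_qubits)

@qml.qnode(dev, interface="torch")
def circuit(inputs, weights):
    # Encode a small patch of classical features as rotation angles,
    # then apply a trainable entangling ansatz (the "quantum filter").
    qml.AngleEmbedding(inputs, wires=range(n_qubits))
    qml.StronglyEntanglingLayers(weights, wires=range(n_qubits))
    return [qml.expval(qml.PauliZ(w)) for w in range(n_qubits)]

weight_shapes = {"weights": (n_layers, n_qubits, 3)}
quantum_filter = qml.qnn.TorchLayer(circuit, weight_shapes)

# Hybrid head: quantum feature extractor followed by a classical classifier,
# trainable end-to-end through backpropagation.
model = nn.Sequential(quantum_filter, nn.Linear(n_qubits, 2))
x = torch.rand(8, n_qubits)          # 8 samples, 4 features each
print(model(x).shape)                 # torch.Size([8, 2])
```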
|
|
09:00-18:30, Paper Mo-Online.265 | |
Medical Image Segmentation Using Deep Learning and Transformers |
|
Lu, Yunyao | Guilin University of Electronic Technology |
Chaddad, Ahmad | Guilin University of Electronic Technology |
Keywords: Application of Artificial Intelligence, Transfer Learning, Machine Learning
Abstract: Deep learning models, including convolutional neural networks (CNNs) and transformer-based architectures, have achieved remarkable results in medical image segmentation tasks. However, the impact of optimization strategies and training settings on their performance has not been sufficiently explored. This study conducts extensive experiments with three CNN models (U-Net, UNet++, Attention-UNet) and four transformer-based models (TransUNet, HiFormer, Swin-UNet, TransDeepLab) on four public medical imaging datasets (DRIVE, Skin Lesion, Lung, Synapse). Three optimizers (SGD, Adam, and AdaGrad) are evaluated to assess their influence on model stability and performance. For example, in skin lesion segmentation, Dice scores range from 89.34% (Swin-UNet) to 93.40% (TransUNet(Pre)) with SGD, from 71.70% (TransDeepLab) to 89.22% (Attention-UNet) with Adam, and from 84.45% (Swin-UNet(Pre)) to 91.58% (TransUNet(Pre)) with AdaGrad. For the Synapse multi-organ segmentation task, Dice scores vary from 61.39% (Swin-UNet) to 77.58% (UNet++), indicating that CNNs tend to outperform transformer models in multi-organ segmentation. The findings highlight the critical impact of optimizer selection on segmentation performance and provide valuable information for advancing medical image segmentation methods. Our codes are available at https://github.com/AIPMLab/SegmentationSimulationLu.
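For reference, the Dice score reported throughout the abstract measures the overlap between a predicted and a ground-truth mask; a minimal NumPy sketch with random masks as placeholder data:

```python
import numpy as np

def dice_score(pred_mask, true_mask, eps=1e-7):
    """Dice coefficient between two binary segmentation masks."""
    pred = pred_mask.astype(bool)
    true = true_mask.astype(bool)
    intersection = np.logical_and(pred, true).sum()
    return (2.0 * intersection + eps) / (pred.sum() + true.sum() + eps)

pred = np.random.rand(256, 256) > 0.5
true = np.random.rand(256, 256) > 0.5
print(f"Dice: {dice_score(pred, true):.4f}")   # ~0.5 for random masks
```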
|
|
09:00-18:30, Paper Mo-Online.266 | |
Rankformer: Rank-Aware Feature Aggregation for Efficient Low-Light Image Enhancement with Retinex Theory |
|
Wen, Yu | Fujian Normal University |
Yang, Xingxing | Hong Kong Baptist University |
Zhai, Huiyu | Hunan University of Science and Technology |
Xie, Shengyu | Guangxi Vocational and Technical College |
Keywords: Image Processing and Pattern Recognition, Deep Learning, Neural Networks and their Applications
Abstract: Low-light image enhancement remains a challenging task due to severe information loss and inherent noise disturbance, especially in extremely dark regions. Existing methods often map low-light images directly to the normal-light domain, but in doing so they inadvertently amplify noise distributions, which leads to color distortion and brightness shift. In this paper, we present a new approach for low-light image enhancement based on Retinex theory, which decomposes low-light images into reflectance and illumination maps. Our key contributions are twofold. First, we introduce a rank-aware feature aggregation mechanism within the self-attention framework. This Top-K selection process preserves the most salient responses, effectively mitigating noise and highlighting informative features without increasing computational complexity, and prevents target-signal dilution, a common issue in existing methods. Second, we propose a coarse-to-fine reconstruction framework that generates high-quality results; this multi-stage process allows progressive image refinement, ensuring robust enhancement across varying low-light conditions. Extensive experiments on three diverse low-light image datasets demonstrate the superiority of our method in terms of both enhancement quality and computational efficiency. Our approach consistently outperforms state-of-the-art techniques, particularly in challenging low-light scenarios with complex noise patterns. Code is publicly available at https://github.com/Wenyuzhy/RankFormer-master.
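A minimal sketch of top-k (rank-aware) attention of the kind described above: scores below each query's k-th largest value are masked out before the softmax, so weak, often noisy responses are discarded. The shapes and k are illustrative, not the RankFormer implementation:

```python
import torch
import torch.nn.functional as F

def topk_attention(q, k, v, top_k=8):
    """Self-attention where each query attends only to its top-k keys.

    q, k, v: (B, heads, N, d). Non-selected scores are set to -inf before
    the softmax, so they contribute zero weight.
    """
    scores = (q @ k.transpose(-2, -1)) / (q.size(-1) ** 0.5)   # (B, h, N, N)
    top_vals, _ = scores.topk(top_k, dim=-1)
    threshold = top_vals[..., -1:]                             # k-th largest per query
    scores = scores.masked_fill(scores < threshold, float("-inf"))
    attn = F.softmax(scores, dim=-1)
    return attn @ v

q = k = v = torch.randn(1, 4, 64, 32)
print(topk_attention(q, k, v).shape)   # torch.Size([1, 4, 64, 32])
```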
|
| |