Comparative Performance Analysis of Transformer and Convolutional Networks for Machine Vision-Oriented Mobile Robots
Abstract
This study compares the performance of three deep neural network architectures for robotic navigation based on semantic recognition of the robot's surroundings. The first is a pre-trained vision transformer network, the second a pre-trained convolutional neural network, and the third a custom-designed convolutional network. These architectures are oriented toward machine vision for mobile robots, enabling the recognition of global environments. The novelty of this work lies in identifying a place from its surroundings, as a human being does, so that a robot can be directed to a place described by name rather than by spatial coordinates, as is usual. Comparison metrics include the network's recognition accuracy, its size in kilobytes, and its identification time. In addition to operating in real time, each network must reach at least 90% accuracy as an initial design parameter. The custom CNN proved to be the most suitable for use on a mobile robot, with a size of 22.5 KB, a response time of 0.07 seconds, and an accuracy of 95.8%.
Keywords: Convolutional networks, transformer networks, deep learning, pre-trained network architecture, mobile robotics, transfer learning.
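As a rough illustration of the comparison protocol summarized in the abstract, the following minimal PyTorch sketch measures the three reported metrics for a candidate network: classification accuracy, on-disk size in kilobytes, and per-image inference time. This is not the authors' implementation; the benchmark helper, the stand-in toy CNN, and the random data are hypothetical placeholders.

# Minimal benchmarking sketch (PyTorch); all names below are illustrative,
# not the authors' code.
import os
import tempfile
import time

import torch
import torch.nn as nn

def benchmark(model: nn.Module, loader, device: str = "cpu"):
    """Return (accuracy, size_kb, seconds_per_image) for one candidate network."""
    model.to(device).eval()

    # Network size: serialize the weights and measure the file, as a proxy
    # for the kilobyte footprint reported in the paper.
    path = os.path.join(tempfile.gettempdir(), "candidate_weights.pt")
    torch.save(model.state_dict(), path)
    size_kb = os.path.getsize(path) / 1024
    os.remove(path)

    correct, total, elapsed = 0, 0, 0.0
    with torch.no_grad():
        for images, labels in loader:
            images, labels = images.to(device), labels.to(device)
            start = time.perf_counter()
            outputs = model(images)          # time the forward pass only
            elapsed += time.perf_counter() - start
            correct += (outputs.argmax(dim=1) == labels).sum().item()
            total += labels.numel()

    return correct / total, size_kb, elapsed / total

# Toy usage with a stand-in CNN and random data (not the paper's model or dataset):
toy_cnn = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(8, 5),                         # 5 hypothetical place classes
)
fake_loader = [(torch.randn(1, 3, 64, 64), torch.randint(0, 5, (1,)))
               for _ in range(10)]
acc, kb, sec = benchmark(toy_cnn, fake_loader)
print(f"accuracy={acc:.3f}  size={kb:.1f} KB  latency={sec:.3f} s/image")

Under this protocol, the paper's design criteria would map to checking that the returned accuracy is at least 0.90 and that the measured seconds-per-image figure is low enough for real-time operation on the robot's hardware.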