Comparative Performance Analysis of Transformer and Convolutional Networks for Machine Vision-Oriented Mobile Robots

Robinson Jiménez-Moreno, Anny Astrid Espitia-Cubillos

Abstract

This study compares the performance of three deep neural network architectures for robotic navigation based on semantic recognition of the robot's surroundings. The first is a pre-trained vision transformer network, the second a pre-trained convolutional neural network, and the third a custom-designed convolutional network. These architectures are oriented to machine vision for mobile robots, enabling recognition of the overall environment. The novelty of this development lies in identifying a place from its overall appearance, as a human being does, so that a robot can be directed to a place described by name rather than by spatial coordinates, as is usual. The comparison metrics are the recognition accuracy of each network, its size in kilobytes, and its identification time. As initial design parameters, each network must operate in real time and achieve at least 90% accuracy. The custom CNN proved to be the most suitable for use on a mobile robot, with a size of 22.5 KB, a response time of 0.07 seconds, and an accuracy of 95.8%.
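No implementation accompanies this abstract; the following minimal PyTorch sketch only illustrates how the three comparison metrics (model size in kilobytes, per-image identification time, and accuracy) could be measured for a compact scene-recognition CNN. The SmallSceneCNN class, its layer sizes, the input resolution, and the class count are hypothetical assumptions for illustration, not the authors' network.

    # Minimal sketch (not the authors' code): measuring the paper's three
    # comparison metrics for a hypothetical compact scene-recognition CNN.
    import time
    import torch
    import torch.nn as nn

    class SmallSceneCNN(nn.Module):
        """Hypothetical compact CNN for recognizing a place by its overall appearance."""
        def __init__(self, num_classes: int = 4):
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv2d(3, 8, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
                nn.Conv2d(8, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
                nn.AdaptiveAvgPool2d(1),
            )
            self.classifier = nn.Linear(16, num_classes)

        def forward(self, x):
            return self.classifier(self.features(x).flatten(1))

    model = SmallSceneCNN().eval()

    # Metric 1: network size in kilobytes (parameters stored as float32).
    size_kb = sum(p.numel() for p in model.parameters()) * 4 / 1024
    print(f"model size: {size_kb:.1f} KB")

    # Metric 2: mean identification time per image (assumed 224x224 RGB input).
    x = torch.randn(1, 3, 224, 224)
    with torch.no_grad():
        start = time.perf_counter()
        for _ in range(100):
            model(x)
        print(f"identification time: {(time.perf_counter() - start) / 100:.3f} s")

    # Metric 3 (accuracy) would be computed on a labeled test set of
    # environment images, e.g. correct predictions / total images.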


Keywords: Convolutional networks, transformer networks, deep learning, pre-trained network architecture, mobile robotics, transfer learning.


https://doi.org/10.55463/issn.1674-2974.52.2.11



