Event Extraction Using Word Clustering and Word Embedding for Roman Urdu

Asia Samreen, Hina Shakir, Muhammad Hussain, Muhammad Zuhair Arfeen, Syed Asif Ali

Abstract

Urdu is the national language of Pakistan, and Roman Urdu is a writing system that represents the Urdu language using the Latin alphabet. Extracting events from conversations conducted in Roman Urdu is a challenging task. This study aims to recognize multi-word events in text, specifically in conversations among users of social networks. The presented work consists of two phases: data preparation and event detection. In the first phase, four different approaches were investigated for removing useless words and stop words in Roman Urdu, and the techniques were compared. In the second phase, two different approaches were used to detect multi-word events in the cleaned text. The first is based on word clustering, and the second is based on word embedding with the help of a bidirectional language model (BiLM). The clustering model efficiently finds events in conversations without complex feature engineering or model training. The second approach uses a BiLSTM to obtain word vectors, combined with NER-style (BIO) tagging of the dataset for sequence labeling. Datasets for both approaches were prepared separately from tweets. A comparison shows that word clustering can accommodate new data more efficiently, whereas the deep learning-based approach requires a large amount of training data. The proposed approach detects the best sequences of two- or three-word events as well as single-word events in Roman Urdu. Because text-based events can play a significant role in determining the polarity or intensity of a conversation, both well-known techniques were tested and evaluated.
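To illustrate what BIO tagging of multi-word events looks like, the following is a minimal sketch and not the authors' implementation. It assumes a hypothetical `EVENT` label, a hypothetical phrase list, and an invented Roman Urdu sentence; the study itself trains a BiLSTM-based tagger rather than using dictionary lookup.

```python
def bio_tag(tokens, event_phrases):
    """Label tokens B-EVENT / I-EVENT / O by longest-match phrase lookup.

    Assumption: events are given as a plain phrase list. This only shows
    the tag layout that a sequence labeler would be trained to predict.
    """
    # Longest phrases first, so a three-word event wins over a shorter one.
    phrases = sorted(
        (tuple(p.lower().split()) for p in event_phrases if p.split()),
        key=len,
        reverse=True,
    )
    lowered = [t.lower() for t in tokens]
    tags = ["O"] * len(tokens)
    i = 0
    while i < len(tokens):
        for phrase in phrases:
            n = len(phrase)
            if tuple(lowered[i:i + n]) == phrase:
                tags[i] = "B-EVENT"            # first word of the event
                for j in range(i + 1, i + n):
                    tags[j] = "I-EVENT"        # remaining words of the event
                i += n
                break
        else:
            i += 1                             # no phrase starts at this token
    return tags


def decode_events(tokens, tags):
    """Recover event strings (single- or multi-word) from BIO tags."""
    events, current = [], []
    for token, tag in zip(tokens, tags):
        if tag == "B-EVENT":
            if current:
                events.append(" ".join(current))
            current = [token]
        elif tag == "I-EVENT" and current:
            current.append(token)
        else:
            if current:
                events.append(" ".join(current))
            current = []
    if current:
        events.append(" ".join(current))
    return events


# Hypothetical example sentence and event phrase (not from the paper's data).
tokens = "kal shaadi ki taqreeb mein bohat maza aaya".split()
tags = bio_tag(tokens, ["shaadi ki taqreeb"])
events = decode_events(tokens, tags)
```

Here the three-word event receives one `B-EVENT` tag followed by two `I-EVENT` tags, and every other token is tagged `O`; decoding the tags returns the event as a single string.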

 

Keywords: natural language processing, Roman Urdu, useless words, multi-word event extraction, bidirectional language models, event cluster.

 

https://doi.org/10.55463/issn.1674-2974.51.1.13




