Research Work – Data Science Lab

The lab is currently focusing on the following domains:

Question Classification to facilitate e-learning: We are working to improve the e-learning system in Pakistan. The system will help students and teachers improve their performance by giving them access to the different types of questions found in the books available at elearn.punjab. So far, we have collected a dataset of 812 questions from elearn.punjab. Each question is labeled along two dimensions: its subject (physics, maths, biology, etc.) and its difficulty level (hard, medium, or easy). This will let teachers prepare quizzes and papers by selecting questions from the subjects and difficulty levels they need. Students will also be able to access the classified questions and evaluate themselves by solving multi-disciplinary questions of varying complexity. The result will be a healthy, interactive, and evaluative virtual environment. We also aim to extend this question classification work into a question answering system, which would further enhance e-learning by giving users exact answers instead of a collection of related documents and links.
Urdu Text Summarization (UTS): We developed a system for extractive summarization of Urdu text documents using state-of-the-art text summarization techniques. To evaluate the system, we created a gold-standard dataset of 50 Urdu documents gathered from various domains such as health, sports, politics, and tourism. So far, the system has achieved an accuracy of around 55%. We now plan to adapt the underlying system to the multi-document summarization task. In the near future, we will conduct a comparative analysis of extractive and abstractive summarization using TensorFlow’s text summarization algorithm.
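To make the idea of extractive summarization concrete, the sketch below scores each sentence by the average frequency of its words and keeps the highest-scoring sentences in their original order. It is only an illustration under stated assumptions, not the lab's system: the frequency-based scoring, the function names `split_sentences` and `summarize`, and the optional `stopwords` parameter are all hypothetical choices made for this example. It assumes sentences end with the Urdu full stop (U+06D4), the Arabic question mark (U+061F), or Latin punctuation.

```python
import re
from collections import Counter

# Assumed sentence terminators: Urdu full stop, Arabic question mark, Latin . ! ?
SENTENCE_END = re.compile(r"[\u06D4\u061F.!?]+")
# \w is Unicode-aware for str patterns, so it also matches Urdu letters
TOKEN = re.compile(r"\w+")


def split_sentences(text):
    """Split text into non-empty, stripped sentences."""
    return [s.strip() for s in SENTENCE_END.split(text) if s.strip()]


def summarize(text, num_sentences=2, stopwords=frozenset()):
    """Return the top-scoring sentences, kept in their original order."""
    sentences = split_sentences(text)
    tokenized = [
        [w for w in TOKEN.findall(s.lower()) if w not in stopwords]
        for s in sentences
    ]
    # Word frequencies over the whole document
    freq = Counter(w for words in tokenized for w in words)
    # Sentence score: average document frequency of its words
    scores = [
        sum(freq[w] for w in words) / len(words) if words else 0.0
        for words in tokenized
    ]
    # Pick the best sentences (ties broken by position), then restore order
    top = sorted(range(len(sentences)), key=lambda i: (-scores[i], i))[:num_sentences]
    return [sentences[i] for i in sorted(top)]
```

Averaging rather than summing the word frequencies keeps long sentences from being favored purely for their length. A real system would likely add Urdu-specific stop-word removal and normalization on top of this.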
Urdu Text Classification: We developed an Urdu text classification system to serve as a basic tool for Urdu text analytics tasks such as sentiment analysis, product review classification, and spam or ham email identification. The system includes a number of pre-processing techniques for Urdu text along with various feature selection algorithms. To evaluate the system, we built a corpus of around 700 articles by scraping various Urdu news websites such as UrduPoint, HmariWeb, Jang, and BBC News Urdu. So far, the accuracy of the system has reached 87%, and we aim to improve its performance further in the near future.
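As a rough illustration of how a bag-of-words text classifier of this kind can work, the sketch below implements a small multinomial naive Bayes model with add-one smoothing. The choice of naive Bayes is an assumption made for this example; the description above does not say which classifier or feature selection algorithms the lab's system uses. The names `tokenize` and `NaiveBayesClassifier` are hypothetical.

```python
import math
import re
from collections import Counter, defaultdict

# \w is Unicode-aware for str patterns, so it also matches Urdu letters
TOKEN = re.compile(r"\w+")


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return TOKEN.findall(text.lower())


class NaiveBayesClassifier:
    """Multinomial naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.doc_counts = Counter()              # label -> number of documents
        self.vocab = set()

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            words = tokenize(text)
            self.word_counts[label].update(words)
            self.doc_counts[label] += 1
            self.vocab.update(words)
        return self

    def predict(self, text):
        # Ignore words never seen during training
        words = [w for w in tokenize(text) if w in self.vocab]
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label in self.doc_counts:
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            # Log prior plus smoothed log likelihood of each word
            score = math.log(self.doc_counts[label] / total_docs)
            for w in words:
                score += math.log((counts[w] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label
```

The same structure applies to any label set, for example news categories or spam versus ham: call `fit` with a list of documents and a parallel list of labels, then call `predict` on new text.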
Named Entity Recognition (NER): We are working on Urdu named entity recognition, which could improve the performance of machine translation, text classification, question answering, and speech recognition systems. We are currently developing a named entity recognition corpus of Urdu text.