Priya is a PhD student in Computer Science and Engineering at IIIT Hyderabad. Her research area is Semantic Search. She is associated with the Search and Information Extraction Lab under Prof. Vasudeva Varma. She was a telecom engineer for 10 years at Alcatel Lucent. During her tenure there, she worked as a technical specialist in roles involving software development, testing, development support and coordination.
ML-India: Can you give us a quick brief on your background?
Priya: I did my bachelor’s at the University of Kerala and then worked with Alcatel Lucent for about 10 years. I joined IIIT-H in 2010 as a part-time student, enrolled as a full-time MS student in 2012 and became a PhD student in 2014, working under Dr. Vasudeva Varma. Currently, I am an intern at IISc Bangalore and am working with Dr. Partha Talukdar on a journal publication. My area of interest is Named Entity Extraction for Knowledge Base Enhancement (population). My work has basically been in named entity disambiguation, which can be described as matching a string to an entity node in knowledge graph space. My past work was on finding named entities mentioned across tweets, web logs and search queries. My current work is an extension of this. It operates on a knowledge base, like Wikipedia, where every entity is a candidate label, which makes it a multi-class classification problem.
ML-India: How did you initiate your research and what motivated you to pick ML in your master’s?
Priya: I was motivated by my advisor, Dr. Vasudeva Varma. While I was working at Alcatel, I was looking for ways to do some part-time study and was attending some talks that were happening at IIIT-H. Those talks were on cloud computing, where we were introduced to the concept of search engines and the basics of distributed computing. He talked about search indices and the MapReduce framework and how this gets done automatically. It was mind-boggling to see the scale he was discussing. He also explained a small example of cosine distance and how, over a huge corpus, statistics prove to be as good as asking a human librarian. That talk was a huge motivation for me because, prior to that, I was working in the domain of wireless networking, which has very little overlap with what I am doing now. This prompted me to join a cloud computing course there in 2010. I found online courses to be less effective, and I couldn’t complete them on time due to other priorities and probably some degree of procrastination too. So I decided to commit myself and join a course on campus. These courses offer a certificate at the end. The enrolment criterion involves an interview with the teacher who is offering the course, who will also assess your interest and motivation. I subsequently enrolled in NLP and IR courses as well, and I gradually got exposed to many big problems related to sentiment analysis, named entity extraction and knowledge base construction. I liked named entity extraction best, so I participated in the Text Analysis Conference, which is organised by the U.S. National Institute of Standards and Technology (NIST). I participated in the TAC Knowledge Base Population (KBP) track. We did well, learned a lot and enjoyed the experience. This helped me move forward and motivated me to work further in this area.
Another part of my motivation was Alcatel. The atmosphere there is very supportive of the employees who want to learn and study more as they work. My supervisors wanted to ensure that there is valuable knowledge addition happening among the employees from quarter to quarter. They even incentivized the research component to promote this practice, which makes you more focused and driven.
ML-India: What were your key learnings at IIIT-H technically at a student level and at a personal level?
Priya: My key takeaways from IIIT-H were the intellectual growth and the strengthened belief that the methods we’ve studied actually do work in the real environment and that they make a difference in how things run. The real differentiator between a master’s and a PhD is that during a master’s you get exposed to quite a lot of advanced and recent concepts, whereas staying back in college and doing a PhD is about seeing that the concepts, which were introduced to us in theory, do indeed work. It is also about identifying ways to improve their applications and figuring out methods to advance them further. So, for instance, during our master’s we used packages like scikit or classifiers to solve problems. We knew that these problems could be easily solved by using such methods. During our PhD, however, we went one step further. We realized that these methods are not foolproof and there are areas where they don’t work, so we tried debugging them. We learned how to analyze the smallest aspects of an algorithm to identify the change that makes it approximate better. In the process, I also learnt that we could work on the tools that we used during our master’s and make them more efficient. I see people focusing on just learning to work with the tools during their master’s and trying to ace them, while disregarding their qualitative aspects. In a PhD, by contrast, people try to work on improving the tools.
On the personal side, the entire process helped me mature as a thinker. It was also a rewarding experience for me to come back to college. Coming back after working for 10 years in the industry, I looked at educational institutions with completely different eyes. When I left Alcatel, my aim was not to get a degree and go back to working. I wanted to take some time off and enjoy learning, so my enrollment in college was completely open-ended. I found there is a huge difference between learning to pass exams and learning to gain knowledge. While I was studying, I didn’t care about dropping grades as long as I was learning something constructive. So the aim with which one joins college makes a big difference to the kind of experience one gets during one’s time on campus. It proved to be a very rewarding experience for me.
ML-India: What are some of the projects that are going on in your lab?
Priya: Our group is called the Language Technology Research Centre. I am part of the Search and Information Extraction lab. The atmosphere in our lab is always dynamic and lots of interesting problems are being tackled by the students. We mostly work on information extraction and information retrieval problems on unstructured data. These involve social media text, social networks, web crawling, sentiment analysis and search engine categorization. We also work on data analysis and text summarization problems. Some of the newer research is on deep learning methods to improve named entity recognition in social media text. Web crawling and text summarization for large documents are also very active areas. We also have work being done on sentiment analysis, deep learning and NLP.
All in all, we have a big lab that works on a variety of topics. Currently, there are some 20 students who are actively working in the lab.
ML-India: Could you tell us more about your PhD project? Also, a little about the different lenses for looking at multi-class classification?
Priya: My PhD project was based on Named Entity Disambiguation. The aim was to find strings in the text that represent an entity in the knowledge base. Take Wikipedia as an example of a knowledge base: the string “Barack Obama” needs to be linked to the entity that describes the US President. This was the broader aim. Currently, hyperlinks are used for this, which link to a particular knowledge base entry, web page, etc. A hyperlink acts as evidence that its anchor text points to something. This evidence is tracked and its statistics are used to identify strings.
However, in disambiguation, when the word “Obama” is encountered, the system checks which entity it may refer to. It could be Barack Obama, Michelle Obama, President Obama, etc. Hyperlink statistics, such as how frequently the string “Obama” appears as a hyperlink and how frequently it refers to a given entity, are used to disambiguate the string. This is the conventional and state-of-the-art method of Named Entity Disambiguation. While it works for entities like ‘Barack Obama’, which are popular and have many hyperlinks pointing to them, it typically fails for less popular entities (also known as tail entities), for which there is little text with hyperlinks pointing to the entity. In our current work, we are exploring statistical measures that use evidence from a web corpus to overcome this lack of information on less popular entities. In terms of numbers, popular entities are very few while less popular entities are far more numerous. Hence, finding ways to disambiguate them is very significant.
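As an illustration of the hyperlink statistics described above, here is a minimal Python sketch. It is not the lab's implementation: the toy counts, the variable names (`anchor_links`, `mention_counts`) and the function names are all hypothetical, chosen only to show how "how often a string is a hyperlink" and "how often it points to a given entity" can be turned into a simple disambiguation rule.

```python
from collections import Counter, defaultdict

# Hypothetical toy data: (anchor text, linked entity) pairs, as they might be
# harvested from hyperlinks in a corpus. The counts are invented for illustration.
anchor_links = (
    [("Obama", "Barack Obama")] * 8
    + [("Obama", "Michelle Obama")] * 2
)

# Hypothetical count of how often each string occurs in the corpus at all,
# whether or not it is hyperlinked.
mention_counts = {"Obama": 40}


def build_anchor_stats(pairs):
    """Count, for each anchor string, how often it links to each entity."""
    stats = defaultdict(Counter)
    for anchor, entity in pairs:
        stats[anchor][entity] += 1
    return stats


def link_probability(anchor, stats, mentions):
    """Fraction of the string's occurrences that appear as a hyperlink."""
    linked = sum(stats.get(anchor, Counter()).values())
    total = mentions.get(anchor, 0)
    return linked / total if total else 0.0


def entity_distribution(anchor, stats):
    """For a given anchor string, the fraction of its links going to each entity."""
    counts = stats.get(anchor)
    if not counts:
        return {}
    total = sum(counts.values())
    return {entity: count / total for entity, count in counts.items()}


def disambiguate(anchor, stats):
    """Pick the most frequently linked entity, or None if no link evidence exists."""
    scores = entity_distribution(anchor, stats)
    if not scores:
        return None
    return max(scores, key=scores.get)


stats = build_anchor_stats(anchor_links)
print(link_probability("Obama", stats, mention_counts))
print(entity_distribution("Obama", stats))
print(disambiguate("Obama", stats))
print(disambiguate("Some Tail Entity", stats))
```

With this toy data, “Obama” resolves to the entity it is linked to most often, while a string with no hyperlink evidence gets no answer at all. That second case is the tail-entity failure described above: the rule has nothing to count.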
This research is very impactful and has a lot of interesting use cases, such as handling search queries and enhancing a knowledge base with new related entities. These use cases are important because the bigger the knowledge base, the more queries you can handle and the richer the answers you can generate for a question.
ML-India: What is the state of industry collaboration with your research group? How easy/hard is it to sustain a healthy collaboration with the industry to produce successful research? What are the pros and cons in your opinion?
Priya: In general, there is a fair amount of collaboration happening in almost all major institutes in India today. Industries are coming forward and quite a significant amount of funding is available. Having said that, there are still some open challenges in bringing academia and industry together. However, these are being tackled on a case-by-case basis and the quality of collaboration is improving day by day. In our lab in particular, we have many projects in collaboration with major technology players like Amazon, Nokia, Intel and AOL, to name a few. We also receive funding from many government and semi-government organizations. This type of collaboration between industry and academic institutions is very useful, as it gives us a view of the real-world problems the industry is facing and connects them with the cutting-edge research going on in academia. I think we are doing fairly well on this.
ML-India: What are your thoughts on the machine learning space in India given your experience in ML research? What is your take on the popularity of ML among students in India? How can we improve it?
Priya: To get an idea, we can take the example of the undergraduates at IIIT-H. Here the students are very passionate about machine learning and its applications. Also, the institute provides courses that expose them to machine learning much earlier in their careers. There are a couple of courses involving mathematics and statistics, and I think this is the best way to introduce students to machine learning. Then there are other relevant courses on linear algebra and probability with assignments in MATLAB, which help create a very sound background for studying neural networks. This way, these undergraduates are fairly well prepared by the time they are introduced to machine learning.
Beyond the courses, the overall ecosystem at IIIT-H is very helpful in learning the subject well. I had very little exposure to ML before coming here. The interactions with visiting professors and the talks about their areas of research have been very insightful. This really motivates you. The good thing about these talks is that they are given by people from across the spectrum, so students get to interact with professors from IIT-B, ISB and many foreign universities as well. At least three to four really big names from the community come and give a talk every month.
Today, there is a lot of good-quality content available on the web, so it’s not very difficult to find material. What is more important is motivating people, and this can be done through interesting talk sessions by practitioners and researchers, and through interactions with the faculty. I also think open data challenges like the KDD Cup are very helpful in motivating students and getting them on board. I think we should organize more such challenges to engage more students in machine learning.
ML-India: Thanks a lot for taking time out for this, Priya! We look forward to subsequent reports from the AI100 study group!