Eshwar

Current work

I am interested in the language used in social media: linguistic differences and similarities, and unique properties that extend across communities or are specific to particular communities. I am currently working on Combatting Abusive Behavior in Online Communities with Dr. Eric Gilbert.

Reddit Norm Violations at Micro, Meso, and Macro Scales

Research Project (Sep, 2017 - Jul, 2018)

Norms are central to how online communities are governed. Yet, norms are also emergent, arise from interaction, and can vary significantly between communities—making them challenging to study at scale. In this paper, we study community norms on Reddit in a large-scale, empirical manner. Via 2.8M comments removed by moderators of 100 top subreddits over 10 months, we use mixed methods to identify three classes of norms: macro norms that are universal to most of Reddit; meso norms shared across groups of subreddits; and micro norms that are specific to unique subreddits.
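The three scales can be illustrated with a toy sketch: a norm counts as macro if most communities enforce it, meso if a group of them does, and micro if only one does. The subreddit names, enforcement data, and thresholds below are hypothetical, not the paper's actual data or pipeline.

```python
# Toy sketch: classify a norm as macro, meso, or micro by how widely
# subreddits enforce it. All data and thresholds here are invented for
# illustration; the paper combines qualitative coding with classifiers.

# Hypothetical map: norm -> set of subreddits whose moderators enforce it
norm_enforcement = {
    "no-personal-attacks": {"science", "askreddit", "politics", "movies", "news"},
    "cite-your-sources":   {"science", "askhistorians"},
    "no-meme-responses":   {"askhistorians"},
}

def classify_norm(subreddits, total=5, macro_frac=0.8, meso_min=2):
    """Macro: enforced almost everywhere; meso: a group; micro: one community."""
    if len(subreddits) / total >= macro_frac:
        return "macro"
    if len(subreddits) >= meso_min:
        return "meso"
    return "micro"

scales = {norm: classify_norm(subs) for norm, subs in norm_enforcement.items()}
```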

IMPACT OF BANNING COMMUNITIES AS A MODERATION STRATEGY

Research Project (Aug, 2016 - Jul, 2017)

In the wake of Reddit's new anti-harassment policy, Reddit banned subreddits that were engaging in hateful behavior. We analyzed two of the most prominent communities that were banned, to understand the effect of the bans on Reddit's landscape. Did Reddit's ban work? Does banning help reduce deviant behaviors like hate speech, or does it just spread such behaviors to other parts of the site?
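One simple way to operationalize the question is to compare the rate of hate-lexicon usage before and after the ban for users who remained on the site. The counts below are made up purely for illustration; they are not the study's data or its full methodology.

```python
# Toy sketch: compare hate-lexicon usage rates pre- and post-ban.
# All counts are invented for illustration.

def usage_rate(hate_term_count, total_word_count):
    return hate_term_count / total_word_count

# Hypothetical aggregate counts for users who stayed on the site
pre_ban  = usage_rate(hate_term_count=1200, total_word_count=1_000_000)
post_ban = usage_rate(hate_term_count=400,  total_word_count=950_000)

relative_change = (post_ban - pre_ban) / pre_ban  # negative => decrease
```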

CROSS-COMMUNITY APPROACH TO DETECT ONLINE ABUSE

Research Project (Nov, 2015 - Jul, 2016)

We introduced a novel computational approach that leverages large-scale, preexisting data from other Internet communities. We then applied our technique toward identifying abusive behavior within a major Internet community. Specifically, we computed a post's similarity to 9 other communities from 4chan, Reddit, Voat and MetaFilter. Using this conceptual and empirical work, we argue that our approach may allow communities to deal with a range of common problems, like abusive behavior, faster and with fewer engineering resources.
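A minimal sketch of the cross-community idea: represent a new post and each reference community as bag-of-words vectors and score the post by cosine similarity to each community's text. The community names and snippets are invented stand-ins for the real corpora.

```python
# Toy sketch: score a post by cosine similarity to reference corpora
# drawn from other communities. The corpora below are tiny stand-ins.
from collections import Counter
from math import sqrt

reference_corpora = {
    "community_a": "you are an idiot shut up idiot",
    "community_b": "great discussion thanks for the thoughtful reply",
}

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def community_similarities(post: str) -> dict:
    post_vec = Counter(post.lower().split())
    return {name: cosine(post_vec, Counter(text.lower().split()))
            for name, text in reference_corpora.items()}

sims = community_similarities("shut up you idiot")
```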

Factors Behind Yik Yak's Success and Popularity

Research Project (Jan-May, 2016)

I worked on an interview-based study of the factors that were integral to the success and popularity of the location-based social media app Yik Yak during its initial deployment. We conducted interviews with 18 Yik Yak users on an urban university campus, and considered how anonymity, ephemerality, and hyper-locality come together to create an effect in which spatial and temporal grounding encourages users to collaboratively develop an investment in their hyper-local community.

Geographical language variation over time

Mini Project (Oct-Nov, 2015)
I worked on quantifying geographical language variation over time in social media with Dr. Jacob Eisenstein. Specifically, we studied whether the language in tweets from New York and Los Angeles is more different now than it was a year ago, using tweets collected over a year (2014-2015). We observed a 15% increase in the L1 distance between the two samples and are currently examining other distance metrics.
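The L1 distance here is the sum of absolute differences between the two cities' normalized word frequencies. A minimal sketch, using toy token counts rather than real tweets:

```python
# Sketch of the L1 distance between two unigram distributions,
# using invented token counts in place of real tweet corpora.
from collections import Counter

def l1_distance(counts_a: Counter, counts_b: Counter) -> float:
    """Sum of absolute differences between normalized word frequencies."""
    total_a, total_b = sum(counts_a.values()), sum(counts_b.values())
    vocab = set(counts_a) | set(counts_b)
    return sum(abs(counts_a[w] / total_a - counts_b[w] / total_b)
               for w in vocab)

ny = Counter("bodega train brick oh train".split())
la = Counter("freeway taco brick oh taco".split())
dist = l1_distance(ny, la)
```

Because both arguments are probability distributions, the distance is bounded between 0 (identical vocabulamong usage) and 2 (fully disjoint vocabularies).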

Community Support on Reddit

Mini Project (Nov-Dec, 2015)

I worked with Dr. Munmun De Choudhury on analyzing the linguistic differences between replies to Reddit posts on the 'Mental Health' (MH) and 'Suicide Watch' (SW) subreddits over a period of a few months. In particular, we performed LIWC analysis on the replies to observe differences in support from the community. We observed that replies on SW contain more First Person Plural, Second Person, and Social terms, while MH replies contain more First Person Singular and Achievement terms. This can be interpreted as more support coming from the community, in the form of replies per post, in SW compared to MH. Interestingly, SW replies also tend to have many Future and Present Tense terms, while MH replies have more Past Tense terms.
Our finding: replies to a post in SW tend to provide support to the user (OP), while MH replies tend to focus more on the repliers themselves.
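LIWC-style analysis boils down to computing, per text, the fraction of words that fall into each lexical category. The real LIWC lexicon is proprietary; the tiny hand-made word lists and example replies below are illustrative stand-ins only.

```python
# Toy LIWC-style analysis: fraction of words in hand-made categories.
# These word lists are illustrative stand-ins for the proprietary
# LIWC lexicon, and the replies are invented examples.

CATEGORIES = {
    "first_person_plural":   {"we", "us", "our"},
    "second_person":         {"you", "your"},
    "first_person_singular": {"i", "me", "my"},
}

def category_rates(text: str) -> dict:
    words = text.lower().split()
    return {cat: sum(w in lexicon for w in words) / len(words)
            for cat, lexicon in CATEGORIES.items()}

sw_rates = category_rates("we hear you and we are here for you")
mh_rates = category_rates("i felt the same when my plans failed")
```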


MASTER'S THESIS

DUAL DEGREE PROJECT

I worked on my Dual Degree Project (DDP) with Dr. Sutanu Chakraborti at IIT Madras in the area of Natural Language Processing from June 2014 to July 2015. The project aimed to generate “Stories from a Life”, an exemplar of the “Memories for Life” Grand Challenge proposed at a workshop conducted by the UK Computing Research Committee in November 2002, which concerned managing, analyzing, and using the enormous amount of digital data that people will soon have about themselves. We built a system that identifies emails containing autobiographical content, to aid autobiographical summarization of a user's mail inbox over the years. This data can be used to generate a story about the user's life, an autobiography of sorts.
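One crude signal for autobiographical content is the density of first-person pronouns and past-tense cues. The heuristic below is only an illustrative sketch with invented example emails; the actual thesis system was considerably more involved.

```python
# Toy heuristic for flagging autobiographical content: density of
# first-person pronouns plus a crude past-tense cue ("-ed" endings).
# Invented examples; not the thesis's actual classifier.

FIRST_PERSON = {"i", "me", "my", "we", "our"}

def autobiographical_score(email_body: str) -> float:
    words = email_body.lower().replace(".", " ").split()
    if not words:
        return 0.0
    pronouns = sum(w in FIRST_PERSON for w in words)
    past_cues = sum(w.endswith("ed") for w in words)
    return (pronouns + past_cues) / len(words)

p_score = autobiographical_score(
    "I visited my parents and we celebrated my graduation")
m_score = autobiographical_score(
    "The quarterly report is attached for review by the committee")
```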


Summer Internship - 2013


AMAZON INDIA

I worked on NLP during my summer internship at Amazon India Pvt. Ltd., Bangalore, from May to July 2013. My work was on summarizing product opinions from Amazon's user reviews of a product, thereby generating original content for the product's webpage description to improve search-engine optimization. The idea was to mine the bulk of user reviews for a given product and generate opinions in an extractive manner, creating 'original content' for the webpage. This yields more informative and relevant content than the previously used method, in which webpage templates were filled with particular feature values for each product. It would also help the page score better when Google indexes it, as pages with highly relevant, rich original content are believed to intrinsically receive higher rankings. We also tried to extend the idea to other social media sources (mainly Twitter), obtaining public opinion and reviews about products of interest to aid the content generation system.
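The extractive step can be sketched as scoring each review sentence by how many opinion-bearing words it contains and keeping the top-scoring sentences. The opinion lexicon and reviews below are invented for illustration; the internship system was more elaborate.

```python
# Toy extractive opinion summarizer: score each review sentence by its
# count of opinion words, keep the top-k. Lexicon and reviews invented.

OPINION_WORDS = {"great", "terrible", "love", "sturdy", "cheap", "broke"}

def summarize(reviews, k=2):
    sentences = [s.strip() for r in reviews for s in r.split(".") if s.strip()]
    # Stable sort keeps original order among equally-scored sentences
    scored = sorted(sentences,
                    key=lambda s: sum(w in OPINION_WORDS
                                      for w in s.lower().split()),
                    reverse=True)
    return scored[:k]

reviews = [
    "The tripod is sturdy and I love the build. It arrived on Monday.",
    "The legs broke after a week. Terrible quality for the price.",
]
summary = summarize(reviews)
```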


UNDERGRADUATE PROJECTS


Formal Concept Analysis (FCA)-based Approach for Twitter Analytics: User Suggestions for Followers and Hashtag Recommendations

Course: Memory Based Reasoning in AI
Semester: Jan-May, 2014

As my final project for the Memory Based Reasoning in Artificial Intelligence course, I experimented with an FCA-based approach to Twitter analytics: follower suggestions and hashtag recommendations. Formal Concept Analysis (FCA) is a framework for data analysis and knowledge representation that has proven useful in a wide range of application areas, such as medicine and psychology, sociology and linguistics, and information technology and computer science. The FCA approach represents the structure of the domain under consideration as a concept lattice. We used FCA to create a visual overview of a Twitter network built from recently collected tweets, aiming to construct lattice-based taxonomies that represent the structure of Twitter users' interests. By doing so, we could identify very similar users to present as follow suggestions, track trends and changes in tweet discussions over time along with the most commonly used hashtags, and build hashtag recommendation lists for new tweets. The core idea is to represent the Twitter sub-network visually using concept lattices in which the objects are Twitter users and the attributes are the keywords and hashtags present in each user's tweets.
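A formal concept is a pair (extent, intent) where the extent is exactly the set of users sharing every attribute in the intent, and vice versa. The sketch below enumerates the concepts of a tiny user x hashtag context by brute force; the users, hashtags, and context are invented, and real lattice construction uses far more efficient algorithms.

```python
# Toy Formal Concept Analysis over a small user x hashtag context.
# Users and hashtags are invented; brute-force enumeration is used
# only for illustration (real FCA tools scale much better).
from itertools import combinations

context = {
    "alice": {"#nlp", "#python"},
    "bob":   {"#nlp", "#python", "#ml"},
    "carol": {"#ml"},
}

def extent(intent_set):
    """All users having every attribute in `intent_set`."""
    return {u for u, attrs in context.items() if intent_set <= attrs}

def intent(users):
    """All attributes shared by every user in `users`."""
    attr_sets = [context[u] for u in users]
    return set.intersection(*attr_sets) if attr_sets else set()

# Closing each user subset (intent, then extent) yields the concepts
concepts = set()
for r in range(len(context) + 1):
    for combo in combinations(context, r):
        b = intent(set(combo))
        a = extent(b)
        concepts.add((frozenset(a), frozenset(b)))
```

Users appearing together in a concept extent (here, alice and bob under {#nlp, #python}) are natural candidates for mutual follow suggestions, and a concept's intent suggests hashtags for its members' new tweets.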


Conversion of Natural Language Statements to Their Corresponding First Order Logic (FOL) Representations

Course: Knowledge Representation and Reasoning
Semester: Jan-May, 2014

As my final project for the Knowledge Representation and Reasoning (KRR) course, I worked on converting sentences in natural language (English) to their corresponding First Order Logic (FOL) representations. We coded the translator from scratch, using the Stanford Dependency Parser to obtain the dependency list for a given English sentence. From the dependency list, we can infer the relations between words in the sentence and identify the subjects, objects, and verbs in the input statement. Predicates are identified from the verbs or descriptive terms in the sentence; then, based on the type of each dependency, we identify the “arguments” associated with each predicate. The Natural Language to FOL converter was implemented in Java using the Stanford NLP Parser, and it operates under certain assumptions: constraints on sentence structure and grammar, restrictions on the types of accepted predicates, and the compulsory presence of a verb in the sentence.
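The core mapping can be sketched in a few lines: the verb that heads nsubj/dobj dependencies becomes the predicate, and its subject and object become the arguments. The dependency triples below are hard-coded stand-ins for parser output (the actual project used the Stanford parser and Java).

```python
# Toy rendering of dependency triples into an FOL-style string.
# The triples for "John eats an apple" are hard-coded here; the real
# project obtained them from the Stanford Dependency Parser.

# (relation, head, dependent) triples, as a parser might emit
dependencies = [
    ("nsubj", "eats", "John"),   # subject of the verb
    ("dobj",  "eats", "apple"),  # direct object of the verb
]

def to_fol(deps):
    """Build predicate(subject, object) from nsubj/dobj dependencies."""
    subj = obj = verb = None
    for rel, head, dep in deps:
        if rel == "nsubj":
            verb, subj = head, dep
        elif rel == "dobj":
            verb, obj = head, dep
    return f"{verb}({subj}, {obj})"

fol = to_fol(dependencies)
```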


Generating Suggestions for Improving Restaurant Ratings Using Latent Topic Extraction (LDA) and Sentiment Analysis on Yelp Reviews (Yelp Data Challenge)
Course: Data Mining
Semester: Jan-May, 2014

As my Data Mining final project, I worked on the Yelp Data Challenge (Jan-May, 2014). The goal was to generate suggestions for improving restaurant ratings based on user-provided Yelp reviews. We performed latent topic extraction (LDA) to identify the areas (topics) addressed in each review, and sentiment analysis on the reviews to obtain scores for those areas. Based on these scores, we suggest areas for the restaurant to improve. By identifying the sentiment of the opinion a user expresses about a restaurant in a review, we can distinguish areas that require improvement from those where the restaurant already excels.
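The pipeline's shape can be sketched with fixed aspect keyword lists standing in for LDA-learned topics, and a tiny lexicon standing in for the sentiment analyzer. Everything below (aspects, lexicon, reviews) is invented for illustration and is not the project's actual model.

```python
# Toy stand-in for the LDA + sentiment pipeline: fixed aspect keywords
# play the role of learned topics, and a tiny lexicon scores sentiment.
# All lexicons and reviews are illustrative.

ASPECTS = {
    "food":    {"pizza", "pasta", "taste", "menu", "food"},
    "service": {"waiter", "staff", "service", "wait"},
}
SENTIMENT = {"delicious": 1, "friendly": 1, "slow": -1, "rude": -1, "bland": -1}

def aspect_scores(reviews):
    scores = {aspect: 0 for aspect in ASPECTS}
    for review in reviews:
        words = review.lower().split()
        polarity = sum(SENTIMENT.get(w, 0) for w in words)
        # Credit the review's polarity to every aspect it mentions
        for aspect, keywords in ASPECTS.items():
            if any(w in keywords for w in words):
                scores[aspect] += polarity
    return scores

reviews = [
    "the pizza was delicious and the menu is huge",
    "rude staff and a slow wait",
]
scores = aspect_scores(reviews)
suggestions = [a for a, s in scores.items() if s < 0]
```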


Link prediction in the Wikipedia Network 
Course: Social Network Analysis
Semester: Jan-May, 2013


As part of my project work for the Social Network Analysis course, I worked on link prediction in the Wikipedia network. We treated the hyperlink network between Wikipedia pages as a social network graph and predicted new links that could form between pages in the future, using methods such as triadic closure, Jaccard similarity, preferential attachment, and common neighbors, and compared the results against a random predictor.
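The three neighborhood-based scores can be sketched on a tiny toy graph (the page names below are invented; the project used the actual Wikipedia hyperlink network):

```python
# Sketch of common-neighbors, Jaccard, and preferential-attachment
# scores on a toy undirected graph. Page names are invented.

graph = {
    "Physics":  {"Energy", "Newton"},
    "Energy":   {"Physics", "Newton"},
    "Newton":   {"Physics", "Energy", "Calculus"},
    "Calculus": {"Newton"},
}

def common_neighbors(u, v):
    return len(graph[u] & graph[v])

def jaccard(u, v):
    union = graph[u] | graph[v]
    return len(graph[u] & graph[v]) / len(union) if union else 0.0

def preferential_attachment(u, v):
    return len(graph[u]) * len(graph[v])

# Score the currently-absent edge (Physics, Calculus)
cn = common_neighbors("Physics", "Calculus")
jc = jaccard("Physics", "Calculus")
pa = preferential_attachment("Physics", "Calculus")
```

Node pairs are then ranked by each score, and the top-ranked non-edges are predicted as future links.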


Spell Checker 
Course: Natural Language Processing
Semester: Jul-Nov, 2013

Implemented a spell checker that works in three modes: word-based, phrase-based, and sentence-based error correction. Used the noisy channel model for word-level correction and a Bayesian hybrid method for context-sensitive spell checking, with context words and collocations as features for the other modes.
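The word-level noisy channel idea can be sketched in a few lines: generate candidates within one edit of the misspelling and pick the one maximizing corpus frequency (a uniform channel model, for simplicity). The word counts below are invented, and the full project's phrase and sentence modes are omitted.

```python
# Minimal word-level noisy channel sketch: candidates within one edit,
# ranked by corpus frequency (uniform error model). Counts are invented.
from collections import Counter

CORPUS = Counter({"the": 500, "they": 120, "then": 80, "hello": 30})
ALPHABET = "abcdefghijklmnopqrstuvwxyz"

def edits1(word):
    """All strings one deletion, insertion, replacement, or transposition away."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes    = {L + R[1:] for L, R in splits if R}
    inserts    = {L + c + R for L, R in splits for c in ALPHABET}
    replaces   = {L + c + R[1:] for L, R in splits if R for c in ALPHABET}
    transposes = {L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1}
    return deletes | inserts | replaces | transposes

def correct(word):
    if word in CORPUS:
        return word
    candidates = [w for w in edits1(word) if w in CORPUS] or [word]
    # argmax over P(word); a real system would also weight P(error|word)
    return max(candidates, key=lambda w: CORPUS[w])
```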

Probabilistic CYK parser
Course: Natural Language Processing
Semester: Jul-Nov, 2013

Implemented a probabilistic CYK parser and demonstrated how it resolves the ambiguities that the plain CYK parser fails to address.
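The probabilistic extension fills each chart cell with the best probability per nonterminal instead of a boolean, so among ambiguous parses the most probable one wins. A minimal sketch over a tiny invented PCFG in Chomsky Normal Form:

```python
# Minimal probabilistic CYK over a tiny PCFG in Chomsky Normal Form.
# The grammar is invented for illustration.

# lhs -> list of (rhs tuple, probability)
GRAMMAR = {
    "S":  [(("NP", "VP"), 1.0)],
    "NP": [(("she",), 0.6), (("fish",), 0.4)],
    "VP": [(("V", "NP"), 0.7), (("eats",), 0.3)],
    "V":  [(("eats",), 1.0)],
}

def cyk_prob(words, start="S"):
    n = len(words)
    # table[i][j]: best probability per nonterminal spanning words[i:j+1]
    table = [[{} for _ in range(n)] for _ in range(n)]
    for i, w in enumerate(words):                      # terminal rules
        for lhs, rules in GRAMMAR.items():
            for rhs, p in rules:
                if rhs == (w,):
                    table[i][i][lhs] = max(table[i][i].get(lhs, 0.0), p)
    for span in range(2, n + 1):                       # binary rules
        for i in range(n - span + 1):
            j = i + span - 1
            for k in range(i, j):
                for lhs, rules in GRAMMAR.items():
                    for rhs, p in rules:
                        if len(rhs) == 2:
                            b, c = rhs
                            if b in table[i][k] and c in table[k + 1][j]:
                                prob = p * table[i][k][b] * table[k + 1][j][c]
                                if prob > table[i][j].get(lhs, 0.0):
                                    table[i][j][lhs] = prob
    return table[0][n - 1].get(start, 0.0)

p = cyk_prob(["she", "eats", "fish"])
```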


KDD Cup 2004 - Optimizing accuracy and ROC on an unseen physics data set
Course: Machine Learning
Semester: Jul-Nov, 2012