Data Collection


Social media text analysis has several complementary aspects, and the quality of the collected input data can influence the results of any of them. To use empirical natural language processing methods or statistical machine learning algorithms, we need to build or acquire data for training, development, and testing. At least the test data must be annotated so that we can evaluate the algorithms. The training data must also be annotated if the algorithms are supervised; unsupervised learning algorithms can use the data as it is, without additional annotations (though they could benefit from a small annotated development set). Throughout this process, spam should be kept out of the dataset.
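The division into training, development, and test data described above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed procedure: the function name `split_dataset`, the 80/10/10 proportions, and the toy `posts` list are all assumptions made for the example.

```python
import random

def split_dataset(examples, dev_frac=0.1, test_frac=0.1, seed=0):
    """Shuffle annotated examples and split them into train, dev, and test lists."""
    if dev_frac < 0 or test_frac < 0 or dev_frac + test_frac >= 1:
        raise ValueError("dev_frac and test_frac must be non-negative and sum to less than 1")
    shuffled = list(examples)
    # A fixed seed keeps the split reproducible across runs.
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_frac)
    n_dev = round(len(shuffled) * dev_frac)
    test = shuffled[:n_test]
    dev = shuffled[n_test:n_test + n_dev]
    train = shuffled[n_test + n_dev:]
    return train, dev, test

# Hypothetical annotated posts: (text, label) pairs.
posts = [(f"post {i}", "pos" if i % 2 else "neg") for i in range(100)]
train, dev, test = split_dataset(posts)
```

The three parts are disjoint, so the test data stays unseen while a model is trained and tuned.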

Social data collection depends on the intended task and application. Textual data from social media can be collected in various forms such as microblog messages, image descriptions, commented posts, video narrations, and metadata. We may also be interested in interconnecting data, for example connections between social media platforms (e.g., Twitter and Instagram) or linking Tweets to news.

A social media service’s application programming interface (API) allows other applications to integrate with its platform. However, collecting data from social media is subject to restrictions. For example, microblogging services such as Twitter impose an API rate limit per user or per application, which allows only a limited number of requests per rate limit window. For larger volumes of Twitter data, paid access supports higher message volumes, in the thousands or more per hour.
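A collector usually has to track the rate limit window on its own side so that it does not exceed the quota. The sketch below shows one way to do that with a fixed window. The class name `RateLimitWindow` is hypothetical, and the figures of 15 requests per 900 seconds are illustrative assumptions, not the limits of any particular platform.

```python
import time

class RateLimitWindow:
    """Client-side guard for a fixed-window request quota (illustrative sketch)."""

    def __init__(self, max_requests, window_seconds, clock=time.monotonic):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.clock = clock  # injectable so the class can be tested without sleeping
        self.window_start = clock()
        self.used = 0

    def _roll_window(self):
        """Start a fresh window if the current one has expired; return the current time."""
        now = self.clock()
        if now - self.window_start >= self.window_seconds:
            self.window_start = now
            self.used = 0
        return now

    def try_acquire(self):
        """Count one request against the quota; return False if the quota is spent."""
        self._roll_window()
        if self.used < self.max_requests:
            self.used += 1
            return True
        return False

    def wait_time(self):
        """Seconds to wait before the next request is allowed (0.0 if allowed now)."""
        now = self._roll_window()
        if self.used < self.max_requests:
            return 0.0
        return self.window_seconds - (now - self.window_start)

# Assumed quota: 15 requests per 15-minute window.
limiter = RateLimitWindow(max_requests=15, window_seconds=900)
if limiter.try_acquire():
    pass  # send the API request here
else:
    time.sleep(limiter.wait_time())
```

In practice the service's own response headers or error codes remain the authority on remaining quota; a local counter like this one only helps avoid hitting the limit in the first place.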

The annotation of social media content is a challenging task. Annotation tasks can be performed semi-automatically by using intelligent interfaces between the annotations and the users. For example, GATE (General Architecture for Text Engineering), together with TwitIE, its social media component, is an interesting tool for annotation. Some researchers have attempted to automatically generate annotation tags that label Twitter users’ interests by applying text ranking methods such as TF-IDF and TextRank (a graph-based text ranking model) to extract keywords from Tweets and tag each user. Other researchers applied supervised machine learning to annotate Twitter datasets with argumentation classes. There are also cases in which the users themselves annotated their posts, and these labels were used as tags for supervised machine learning. An example is the mood labels on the LiveJournal platform.
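To make the TF-IDF tagging idea concrete, the sketch below treats all posts by one user as a single document and ranks that user's terms by term frequency times inverse document frequency. It is a simplified illustration rather than the method of any cited study: the function names, the tokenizer, the unsmoothed `log(N / df)` weighting, and the two example users are assumptions.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and keep alphabetic tokens only (a deliberately crude tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())

def top_keywords(user_posts, k=3):
    """Map each user to their k highest-scoring TF-IDF terms.

    user_posts: dict mapping a user name to a list of post strings.
    """
    # One "document" per user: term counts over all of that user's posts.
    docs = {
        user: Counter(tok for post in posts for tok in tokenize(post))
        for user, posts in user_posts.items()
    }
    n_docs = len(docs)
    # Document frequency: number of users whose posts contain the term.
    df = Counter()
    for counts in docs.values():
        df.update(counts.keys())
    tags = {}
    for user, counts in docs.items():
        total = sum(counts.values())
        scores = {
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        }
        # Highest score first; ties are broken alphabetically for stable output.
        ranked = sorted(scores, key=lambda term: (-scores[term], term))
        tags[user] = ranked[:k]
    return tags

# Hypothetical users and posts.
user_posts = {
    "alice": ["Training a new model today", "model evaluation is hard"],
    "bob": ["Great football match today", "football season starts"],
}
tags = top_keywords(user_posts)
```

Terms that every user mentions, such as "today" here, receive an inverse document frequency of zero and fall to the bottom of the ranking, which is why the more distinctive terms surface as tags.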

When it comes to choosing resources for social media data collection and analysis, simply listening to or crawling the millions of daily social conversations is not sufficient. With billions of active social media users worldwide, the volume of user-generated content has grown astronomically.

Selecting the strategy that best supports your objectives and metrics is key to applying the appropriate NLP-based methods and analytic approaches. The significant amount of spam and noise in social media has given rise to a debate about the validity and value of social media data, which also depends closely on time and location.

To combat these challenges, active research has been undertaken by the scientific community. For example, Jindal and Liu [2008] studied the deceptive opinion problem. They trained models using supervised learning with manually labeled training examples, based on the review text, reviewer, and product, in order to identify deceptive opinions. 

In addition to all these challenges, there have also been serious privacy concerns in the data collection process. Privacy concerns in social media involve user misunderstandings, bugs in social media platforms that allow unauthorized access, and a lack of ethics in marketing. Some privacy research has focused on data protection by establishing metrics, such as privacy scales, for evaluating these concerns.
