Technologies used:
- Python
- MongoDB
- PyQt5
- Matplotlib
- NumPy, Scikit-learn and TextBlob
- Google Translate API
Description
This project was developed as part of a Mitacs Globalink Research Internship at Dalhousie University in Halifax, Nova Scotia, Canada. During the FIFA World Cup 2018, 436 GB of data was collected from Twitter between June 14th and July 17th, 2018 (25 days of collection in total). The goal of the work was to analyze the data and later to detect interactions between bots and humans. CentOS 7 was used as the server operating system, with Anaconda3 as the Python 3 distribution. Packages used include TextBlob, PyMongo, and machine learning libraries such as NumPy and Scikit-learn. TextBlob provides common natural language processing (NLP) functions for simplified text processing tasks such as spelling correction, translation, language detection, and sentiment analysis. It is based mainly on NLTK and Pattern, and it uses the Google Translate API for translation.
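For illustration, a minimal TextBlob snippet covering these functions (the language detection and translation calls go through the Google Translate API, require network access, and may no longer be available in recent TextBlob releases):

```python
from textblob import TextBlob

blob = TextBlob("What a great goal by France!")

# Sentiment analysis: polarity in [-1, 1], subjectivity in [0, 1]
print(blob.sentiment)   # e.g. Sentiment(polarity=0.8, subjectivity=0.75)

# Spelling correction (offline; based on Pattern's spelling corrector)
print(TextBlob("Wht a grat goal").correct())

# Language detection uses the Google Translate API
# (network access required; removed in newer TextBlob versions)
print(blob.detect_language())   # 'en'
```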
Concept: Bot Detection (Basic Idea)
First, it is necessary to detect bots. The algorithm flags users as bots by looking for specific patterns. Separating the Tweets of one user by publishing day makes it easier to consider each day separately. The Tweets of each day are examined, and based on that the user receives a value between 0 and 1 for that day (0 = human, 1 = bot). In the end, all daily values are combined into one value. If the user was flagged as a bot on more than 50% of their tweeting days, they are probably a bot; otherwise, they are a human or an unknown bot.
Based on the following criteria, a user is flagged as a bot or not (a sketch of how the per-day checks could be combined follows the list):
- Tweets of one day contain similar content.
- The average number of Tweets per day is above the known average for human users.
- Tweets are posted during hours in which humans are typically inactive (e.g., between 0 a.m. and 6 a.m.).
- Account age (accounts created in the last one or two years before an event are probably bots).
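A minimal sketch of how the per-day scores could be combined into the final decision (the function names and the uniform weighting of the criteria are assumptions; account age, being a user-level property, would be checked once rather than per day):

```python
def classify_user(tweets_by_day, day_checks):
    """tweets_by_day: dict mapping each date to the user's Tweets of
    that date; day_checks: functions that each take one day's Tweets
    and return True if a bot criterion is fulfilled for that day."""
    bot_days = 0
    for tweets in tweets_by_day.values():
        # Day score between 0 and 1: fraction of fulfilled criteria
        score = sum(check(tweets) for check in day_checks) / len(day_checks)
        if score > 0.5:
            bot_days += 1
    # Flagged on more than 50% of tweeting days -> probably a bot
    return "bot" if bot_days / len(tweets_by_day) > 0.5 else "human or unknown"
```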
To examine the similarity between two Tweets, the Python library FuzzyWuzzy was used. It provides the function "ratio(string1, string2)", which computes the standard Levenshtein distance similarity ratio between two sequences. The Levenshtein distance is a string metric for measuring the difference between two sequences. Informally, the Levenshtein distance between two words is the minimum number of single-character edits (insertions, deletions, or substitutions) required to change one word into the other.
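For example:

```python
from fuzzywuzzy import fuzz

# ratio() returns an integer similarity score between 0 and 100
print(fuzz.ratio("FIFA World Cup 2018", "FIFA Worldcup 2018"))  # ~92
```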
Every Tweet of a day is compared with all other Tweets of that day that were published by the same user. The number of similar Tweets of a day is then divided by the number of all Tweets of that day, giving a percentage value of Tweet similarity for the day. All daily similarity values are added up and divided by the number of days on which the user tweeted. If the resulting value is bigger than 0.5 (50%), the user fulfils one criterion of being a bot.
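A sketch of this per-day similarity check (the cut-off of 90 for counting two Tweets as "similar" is an illustrative assumption, not the project's actual threshold):

```python
from itertools import combinations
from fuzzywuzzy import fuzz

def daily_similarity(tweets, similar_threshold=90):
    """Fraction of a user's Tweets on one day that are similar
    to at least one other Tweet of the same day."""
    if len(tweets) < 2:
        return 0.0
    similar = set()
    for i, j in combinations(range(len(tweets)), 2):
        if fuzz.ratio(tweets[i], tweets[j]) >= similar_threshold:
            similar.update((i, j))
    return len(similar) / len(tweets)

def similarity_criterion(tweets_by_day):
    """Average of the daily values over all tweeting days; a result
    above 0.5 fulfils the similarity criterion described above."""
    values = [daily_similarity(t) for t in tweets_by_day.values()]
    return sum(values) / len(values) > 0.5
```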
For each of the mentioned criteria, a threshold value is necessary to decide whether a Tweet fulfils the criterion. The following threshold values are needed (see the configuration sketch after this list):
- Similarity ratio between two Tweets (texts)
- Average number of Tweets per day for a human
- Number of Tweets posted in inactive time
- Account age
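For illustration, these thresholds could be collected in a single configuration dictionary; the concrete numbers below are placeholders, not the values used in the project:

```python
# Placeholder values only -- the actual thresholds would have to be
# tuned against labelled data or taken from the literature.
THRESHOLDS = {
    "similarity_ratio": 90,       # fuzz.ratio above which two Tweets count as similar
    "avg_tweets_per_day": 50,     # daily Tweet count above which a user is suspicious
    "inactive_time_tweets": 10,   # Tweets posted between 0 a.m. and 6 a.m.
    "min_account_age_days": 730,  # accounts younger than ~2 years are suspicious
}
```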
Concept: Bot-Human Interaction (Basic Idea)
For analyzing the interaction between bots and humans, retweets were considered. Every Tweet that was retweeted more than two times is a potential Tweet for the interaction analysis. We consider the time difference between the original Tweet and all of its retweets. Since there are many different times, the time differences are aggregated into 5-, 10-, or 15-minute intervals. The bot detection strategy developed in the previous step is used to decide whether the retweeter was a bot or not.
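A sketch of the binning step, assuming the timestamps are Python datetime objects (the interval length is a parameter):

```python
import numpy as np

def bin_retweet_delays(original_time, retweet_times, interval_minutes=5):
    """Count retweets per time interval after the original Tweet.
    Both arguments hold datetime objects; returns the counts per bin
    and the bin edges in minutes."""
    delays = np.array([(rt - original_time).total_seconds() / 60.0
                       for rt in retweet_times])
    edges = np.arange(0.0, delays.max() + interval_minutes, interval_minutes)
    counts, edges = np.histogram(delays, bins=edges)
    return counts, edges
```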
There are four kinds of interactions (retweeting something) that can be analyzed (see the sketch after this list):
- Human-Bot-interactions
- Human-Human-interactions
- Bot-Human-interactions
- Bot-Bot-interactions
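A minimal sketch of the classification, assuming the first label refers to the author of the original Tweet and the second to the retweeter (this ordering is an assumption; the list above leaves it open):

```python
def interaction_type(author_is_bot, retweeter_is_bot):
    """Map the bot flags of the original Tweet's author and the
    retweeter to one of the four interaction classes."""
    author = "Bot" if author_is_bot else "Human"
    retweeter = "Bot" if retweeter_is_bot else "Human"
    return f"{author}-{retweeter}-interaction"

# e.g. interaction_type(True, False) -> 'Bot-Human-interaction'
```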
More details are available on request.