MACS 37000 (Spring 2021) Thinking with Deep Learning for Complex Social & Cultural Data Analysis

Please check out this version with a much nicer format. Thanks!


Public Sentiment on Microblog during COVID-19

Group Member: Rui Chen

Introduction

Why I’m interested in this topic (socially)

At the end of 2019, just as people were reveling in welcoming the new year, a sudden epidemic spread quietly in Wuhan, central China. It swept through more than 200 countries worldwide in the months that followed, becoming the world’s worst public health crisis in decades. One year on, I hope to look back on this unexpected event using computational techniques.

Why I’m interested in this topic (academically)

The most relevant paper I have read so far is by Jennifer Pan, her PhD student, and Yiqing Xu. One of the research questions they try to answer is: “How do the shares of critical and supportive posts pertaining to COVID-19 vary in this initial period?” To answer it, they train two classifiers to identify 1) COVID-19-related posts containing criticism and 2) COVID-19-related posts containing support. However, their definition of criticism always entails a target, which means that general negative sentiments such as fear or undirected anger are out of scope. The same goes for support: they define a post as supportive if it contains a positive evaluation of a target, projects positive emotions or praise toward a target, or associates the target with positive characteristics, attributes, and outcomes. In contrast, I investigate general negative and positive sentiment on Weibo during COVID-19. Interestingly, my results based on general sentiment turn out to be quite different from their findings based on narrowly defined criticism.

Research Questions

The data for this research consists of 12,758,869 original tweets from Weibo-COV V2. After testing RNN, LSTM, and BERT models, I decided to analyze public sentiment on Chinese social media during the COVID-19 public health crisis with a BERT-based classifier.

I try to answer the following questions:

  • What are the characteristics of the geographical distribution of the posts?
  • When did discussions of COVID-19 begin on Weibo?
  • How did public sentiments change as the epidemic evolved?

For data cleaning and visualization, I make heavy use of Dask and Vaex on Amazon EMR and SageMaker. Vaex is very similar to Dask but runs faster. In addition, I frequently use the Hugging Face Transformers library for model training.

Data

Weibo Active User Pool

The Weibo-COV dataset contains posts collected retrospectively in November 2020. Because content targeted for censorship is usually removed within 24 hours, the Weibo-COV V2 dataset should be considered post-censorship.

The Weibo-COV dataset is based on a pool of 20 million active users. Starting from a set of seed users and continuously expanding through social relationships, the authors of the dataset first built a Weibo user pool of more than 250 million users. The active Weibo user pool was then constructed from this pool according to four rules:

  • Follows number > 50
  • Tweets number > 50
  • Fans number > 50
  • Most recent post within the last 30 days

 

Finally, they built a Weibo active user pool of 20 million users, accounting for about 8% of all Weibo users.
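The four filtering rules above can be sketched as a simple predicate. This is a minimal illustration, and the field names (`follows_num`, `tweets_num`, `fans_num`, `last_post`) are hypothetical, not the dataset's actual schema:

```python
from datetime import datetime, timedelta

def is_active_user(user, reference_time=datetime(2020, 11, 1)):
    """Check the four activity rules; field names are hypothetical."""
    return (user["follows_num"] > 50
            and user["tweets_num"] > 50
            and user["fans_num"] > 50
            and reference_time - user["last_post"] < timedelta(days=30))

active = {"follows_num": 120, "tweets_num": 800, "fans_num": 64,
          "last_post": datetime(2020, 10, 20)}     # posted 12 days ago
inactive = dict(active, last_post=datetime(2020, 6, 1))  # stale account
print(is_active_user(active), is_active_user(inactive))  # True False
```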

Here are some samples from the 20 million Weibo active user pool.

user_id,gender,province,city,birthday,fans_num,vip_level,crawl_time
6e8c581b932a9d4e,女,北京,,0001-00-00,199182410,6级,1576187979


Weibo Public Opinion Datasets (V2)

Compared with Weibo-COV V1, Weibo-COV V2 covers a longer time span, contains more data, and uses a more refined keyword filtering method.

  • Time Period: 2019-12-01 00:00 – 2020-12-30 23:59 (GMT+8)
  • Keywords: a set of common keywords plus month-specific keywords. For each month, the common keywords and that month’s specific keywords are used to filter all of that month’s original tweets. For the details of the keywords, please click here (translated from Chinese into English).
  • Amount: 65,175,112 tweets filtered from 2,615,185,101 original tweets by keywords.

Here is a sample of Weibo-COV V2.

_id,user_id,crawl_time,created_at,like_num,repost_num,comment_num,content,origin_weibo,geo_info
Jwm2cyhQQ,e0470a66f95fe66d,1607931932,2020-12-01 00:00,0,0,0,【抗疫路上,#幕后的科研专家走了#】疫苗攻关争分夺秒,他总想再快点!因连续工作、过度劳累,中国医学科学院病原生物学研究所研究员赵振东教授倒在了出差途中,最终抢救无效,于9月17日在北京不幸逝世,享年53周岁。赵振东教授是我国从事病原生物学和感染免疫学研究的知名专家,疫情伊始,他说:”这…全文转发理由:[泪],Jwl894jgH,

The Weibo-COV V2 data I obtained contains the text of the post, the date and time of posting and crawling, the hashed user ID, the hashed post ID, the number of likes, the numbers of comments and reshares of each post, the hashed original post ID if the post is a repost, and each post’s geolocation information. If you would like to know more about the dataset, please see here.

Data cleaning

My analysis is based on original posts. The dataset contains reposts as well, but the content of a repost is often insufficient to determine what is being expressed or with what sentiment. In total, I analyze 12,758,869 original posts from the Weibo-COV V2 dataset. My code for data cleaning can be seen here.
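The repost filter can be sketched as follows, assuming (as in the sample row above) that the `origin_weibo` field is non-empty exactly when a post is a repost:

```python
def keep_original_posts(rows):
    """Keep only original posts: in Weibo-COV, a repost carries the hashed
    id of its source post in origin_weibo, which is empty otherwise."""
    return [row for row in rows if not row.get("origin_weibo")]

rows = [
    {"_id": "a1", "content": "...", "origin_weibo": ""},           # original post
    {"_id": "a2", "content": "...", "origin_weibo": "Jwl894jgH"},  # repost
]
print([row["_id"] for row in keep_original_posts(rows)])  # ['a1']
```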

Tokenize

A Chinese corpus is a collection of short or long texts, such as sentences, excerpts, paragraphs, or articles. The words in Chinese text are written continuously, without spaces, yet in text mining we expect the smallest unit of processing to be a word or a phrase. Therefore, we need to tokenize the text. Common tokenization approaches include string-matching-based, comprehension-based, statistics-based, and rule-based tokenization, each corresponding to a number of specific algorithms. The main difficulties in current Chinese tokenization are ambiguity recognition and new-word recognition. For example, “羽毛球拍卖完了” can be tokenized in two ways, “羽毛/球拍/卖/完/了” or “羽毛球/拍卖/完/了”. The former means “the badminton rackets are sold out,” and the latter means “the badminton auction is over.” Without relying on context, it is difficult to tokenize such text correctly.

There are many kinds of tokenization methods for Chinese, including jieba, NLPIR, LTP, THULAC, Stanford Tokenizer, Hanlp Tokenizer, IKAnalyzer, etc. I use the most popular one, which is jieba.

If you want to know the details of jieba, you can go to its GitHub page. In short, jieba.cut and jieba.cut_for_search return an iterable generator that lets users retrieve every token with a for loop. jieba.lcut wraps the result of cut, with the l standing for list. In other words, jieba.cut and jieba.lcut produce the same tokens; the only difference is that jieba.lcut returns a list of strings rather than a generator. During data cleaning you may sometimes need to reassemble the elements of a list, depending on your purpose, and the following pattern is common.

s = ['I', 'want', 4, 'apples', 'and', 18, 'bananas']
list_to_str = ' '.join(map(str, s))
print(list_to_str)  # I want 4 apples and 18 bananas

This package also supports parallel computing. Parallel tokenization works by splitting the text into lines, assigning them to multiple Python processes to tokenize in parallel, and finally merging the results. So if you have a multi-core processor, I highly recommend parallel tokenization to save time: jieba.enable_parallel(4) is a single line of code that will speed you up significantly, where the argument is the number of parallel processes. Note that this feature is based on Python’s multiprocessing module, which jieba does not currently support on Windows.

Remove Stopwords

Stop words generally refer to words that contribute little to the meaning of a text, such as punctuation marks, particles, and pronouns. In standard text processing, the step after tokenization is therefore to drop the stop words. For Chinese, however, there is no standard stop word list: the dictionary must be chosen for the specific context. For example, in sentiment analysis, modal particles and exclamation points should be retained, because they contribute to the tone and emotion of a sentence. In topic modeling, by contrast, I find that function words like “的”, “地”, and “得” do not help express a topic; they are so frequent that they dominate the fitted topics and make it difficult to summarize what a topic means. It is thus necessary to remove such words, which add little value and can even be harmful.

The stop words I use come from the most popular Chinese stop word lists; see Table 1 for details. They can be downloaded from here. My final stop word set combines the four lists, plus several additional words I added (e.g., ‘##’, ‘http’, ‘cn’, ‘显示’, ‘地图’, ‘打卡’, ‘A6vBv3yL’, ‘A6v1xgC0’, ‘微博’, ‘视频’). Note that in the tokenization stage I keep all the Weibo emojis, so that I can later analyze changes in emoji frequency.
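Combining the lists and filtering tokens amounts to a set lookup. A minimal sketch, using a tiny stand-in stop word set rather than the four actual files:

```python
def remove_stopwords(tokens, stopwords):
    """Drop every token that appears in the merged stop word set."""
    return [t for t in tokens if t not in stopwords]

# A tiny stand-in for the four merged lists plus my custom additions;
# in practice each .txt file is read line by line into this set.
stopwords = {"的", "地", "得", "了"} | {"##", "http", "cn", "显示", "地图", "打卡", "微博", "视频"}
tokens = ["武汉", "的", "疫情", "结束", "了", "http"]
print(remove_stopwords(tokens, stopwords))  # ['武汉', '疫情', '结束']
```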

Please note that when training my BERT model, I use the raw, uncleaned text; when conducting topic modeling, I use the cleaned text.

Table 1: The Most Popular Chinese Stop Word Lists

Stop Word List Name File Name
Chinese stop word list cn_stopwords.txt
Harbin Institute of Technology stop word list hit_stopwords.txt
Baidu stop word list baidu_stopwords.txt
Sichuan University Machine Intelligence Laboratory stop word list scu_stopwords.txt

Method

Model Training Data

My training data comes from here. The dataset was used as the evaluation challenge of the 26th China Conference on Information Retrieval. It was collected using 230 keywords related to COVID-19; a total of 1 million tweets were captured between January 1, 2020 and February 20, 2020, and 100,000 of them were manually labeled into three categories: positive, neutral, and negative. After dropping the tweets labeled neutral, I train my BERT model on the remaining texts, yielding a binary (negative vs. positive) sentiment classifier.
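The label preprocessing can be sketched as follows. The -1/0/1 coding is an assumption for illustration, not necessarily the contest dataset's actual scheme:

```python
def to_binary(examples):
    """Drop neutral tweets and map the rest to binary labels.
    Assumes -1/0/1 codes negative/neutral/positive (illustrative)."""
    mapping = {-1: 0, 1: 1}  # negative -> 0, positive -> 1
    return [(text, mapping[label]) for text, label in examples if label != 0]

data = [("太难了", -1), ("一般般", 0), ("武汉加油", 1)]
print(to_binary(data))  # [('太难了', 0), ('武汉加油', 1)]
```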

RNN and LSTM

I tested RNN, LSTM, and Bidirectional LSTM models; please see the code here for how I trained the RNN and LSTM models. As binary sentiment classifiers, however, none reached an accuracy above 85.1% (see Figure 1 for the test performance), well below the accuracy of the BERT model (0.92317). I followed these tutorials to test RNN, LSTM, and Bidirectional LSTM; please click the links for the code details. Personally, I think the first tutorial is the best.

Figure 1: Bidirectional LSTM Result

BERT Model

I used the training data to fine-tune both the pretrained Chinese BERT-BASE model and Chinese BERT-wwm-ext (Chinese BERT with Whole Word Masking). BERT is a deep learning model built on pretrained deep bidirectional representations that has been shown to outperform other state-of-the-art language models. Chinese BERT-wwm-ext applies upgraded whole word masking to Chinese text and is trained on more data sources. However, the two models’ performance on the test dataset is very close. Specifically, on the uncleaned text and after hyperparameter search, the Chinese BERT-BASE model reaches an accuracy of 0.92317. Please see Table 2 for details.

Hyperparameters search

With cutting-edge research implementations, thousands of trained models are easily accessible, but to achieve good performance a model usually still needs some form of hyperparameter tuning. Tuning is done by testing a range of values for different hyperparameters (steps, learning rate, class weight, dropout rate) and selecting the combination that maximizes the F1 score, which balances precision and recall.
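As a minimal illustration of the idea (not the Transformers implementation I actually use), a random search over a toy objective looks like this; the search space and the lambda objective below are purely hypothetical stand-ins for a real train-and-validate run:

```python
import random

def random_search(evaluate, space, n_trials=20, seed=35):
    """Sample hyperparameter combinations from `space` and keep the one
    with the best validation F1. `evaluate` stands in for a full
    train-and-validate run, which is the expensive part in practice."""
    rng = random.Random(seed)
    best_params, best_f1 = None, -1.0
    for _ in range(n_trials):
        params = {name: rng.choice(values) for name, values in space.items()}
        f1 = evaluate(params)
        if f1 > best_f1:
            best_params, best_f1 = params, f1
    return best_params, best_f1

space = {"learning_rate": [1e-5, 5e-6, 3e-6],
         "epochs": [2, 3],
         "weight_decay": [0.0, 0.01]}

# Toy objective standing in for real validation F1.
best_params, best_f1 = random_search(lambda p: 0.9 - 1000 * p["learning_rate"], space)
print(best_params, best_f1)
```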

Simple experiments can show the benefit of using an advanced tuning technique. If you would like to see a detailed example of how a good hyperparameter search method can improve the model performance, please see here.

Now that we have seen how important hyperparameter search is, I can move on to implementing it. Please see the appendix for a hyperparameter search example. Before going through that code chunk, you should take a look at how to get your data ready here. For my own code, I followed the tutorial examples after preparing my data; please see my code here.

Hyperparameter search can be done with many tools, but I find the Hugging Face Transformers library offers the most direct and efficient solution. What is Hugging Face Transformers? Transformers (formerly known as pytorch-transformers and pytorch-pretrained-bert) provides general-purpose architectures (BERT, GPT-2, RoBERTa, XLM, DistilBERT, XLNet…) for Natural Language Understanding (NLU) and Natural Language Generation (NLG), with over 32 pretrained model architectures in 100+ languages and deep interoperability between JAX, PyTorch, and TensorFlow. To me, it is a more efficient way to implement deep learning models than raw PyTorch, so I use Transformers heavily for hyperparameter search, fine-tuning, and BERT training in this project.

Table 2: Best BERT model training parameters 

Parameter Name Setting
Sentiment Classes 2
Model BERT
Train epochs 2
Per device train batch size 32
Learning rate 5.35E-06
Weight decay 0.01
Seed 35
Warmup steps 500
Accuracy 0.9231678487

 

Fine-Tune BERT

Fine-tuning can be done with many tools, but I find the Hugging Face Transformers library offers the most direct and efficient solution. They provide two examples on their official web page: Fine-tuning with custom datasets and IMDb Classification with Trainer.ipynb. I strongly recommend the latter, as it is much more straightforward. However, if your data is organized into two folders of text files, one per class (for example, pos and neg folders with one text file per example), choose the first tutorial. For how I fine-tuned my BERT model, please see the code here.

Topic Modeling

Topic modeling is a method for unsupervised classification of documents, similar to clustering on numeric data, which finds some natural groups of items (topics) even when you’re not sure what you’re looking for. Topic modeling provides methods for automatically organizing, understanding, searching, and summarizing large electronic archives. It can help with discovering the hidden themes in the collection. 

LDA (Latent Dirichlet allocation) is one of the most popular topic modeling methods. Each document is made up of various words, and each topic also has various words belonging to it. The aim of LDA is to find topics a document belongs to, based on the words in it.

For the most basic way to implement LDA topic modeling, refer to Gensim’s models.ldamodel. If you happen to have multiple cores on your device, see also gensim.models.ldamulticore. And if you prefer scikit-learn, use sklearn.decomposition.LatentDirichletAllocation.

Please see my code for LDA implementation here. Click here to jump to the corresponding results.

Findings

Before I move on, I’d like to underscore several points mentioned above:

  • My analysis is based on original posts filtered from Weibo-COV V2 dataset.
  • The total number of tweets in my final analysis is 12,758,869.

Tweet Length

As shown in Figures 2 and 3, most texts are between 0 and 500 characters long. If we restrict the range to 0–500, we can see that most texts are between 0 and 150 characters. This is intuitive: if the text a user enters exceeds 140 characters, the content after the 140th character is automatically folded, and readers see the full content only by clicking on the tweet. Therefore, the actual length of most tweets does not exceed 140 characters.
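The length distribution can be reproduced with a simple binning pass; a minimal sketch, with bin edges chosen to match the 140-character folding threshold and the 500-character cutoff:

```python
def length_histogram(texts, bins=(0, 140, 500)):
    """Count posts whose character length falls into [0,140), [140,500), [500,inf)."""
    edges = list(bins) + [float("inf")]
    counts = [0] * (len(edges) - 1)
    for text in texts:
        n = len(text)
        for i in range(len(counts)):
            if edges[i] <= n < edges[i + 1]:
                counts[i] += 1
                break
    return counts

texts = ["短文本", "x" * 139, "y" * 300, "z" * 600]
print(length_histogram(texts))  # [2, 1, 1]
```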

Figure 2: Text Length (limit 99.7%)

Figure 3: Text Length (limit 0-500)

Weibo Emoji

Emojis are frequently used to express moods, emotions, and feelings in social media. There has been much research on emojis and sentiments. Therefore, I hope to look into the sentimental changes of Weibo users by analyzing the changes in the frequency of emoji usage in different months.

Weibo supports all standard emojis. In addition, it has hundreds of independently designed Weibo emoticons, most of which have corresponding emojis. These can be entered from the emoji keyboard (menu) in Weibo, or by typing the emoji name in square brackets “[]”: for example, typing [loveyou] and sending converts the text into a Weibo emoji similar to 😘. Many Weibo users are used to adding Weibo-designed emojis to their tweets to complement their expressions.

Here I count the 20 most frequently used Weibo emojis in each month. Since a post may contain multiple copies of the same emoji, I count each emoji at most once per post. For example, if a tweet contains “[Tears][Tears][Sad]”, it contributes “[Tears][Sad]” to the count. See Table 3 for the most popular emojis and their frequencies in each month, and Table 4 for the visualization results.
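The once-per-post counting rule can be sketched in a few lines, assuming (as in the examples above) that Weibo emojis appear in the text as bracketed names:

```python
import re
from collections import Counter

EMOJI_RE = re.compile(r"\[[^\[\]]+\]")  # Weibo emojis are written as [name]

def count_emojis(posts):
    """Count each distinct emoji at most once per post, then sum over posts."""
    counts = Counter()
    for post in posts:
        counts.update(set(EMOJI_RE.findall(post)))
    return counts

posts = ["加油[Tears][Tears][Sad]", "感动[Tears][Heart]"]
print(count_emojis(posts).most_common(1))  # [('[Tears]', 2)]
```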

Table 3: Twenty Most Popular Weibo Emojis And Their Frequencies 

2019-12 2020-01 2020-02 2020-03 2020-04 2020-05 2020-06 2020-07 2020-08 2020-09 2020-10 2020-11 2020-12
[(‘[powerless]’, 224), [(‘[Tears]’, 37142), [(‘[Heart]’, 777444), [(‘[Heart]’, 65403), [(‘[Heart]’, 43519), [(‘[powerless]’, 21520), [(‘[powerless]’, 17602), [(‘[powerless]’, 15626), [(‘[powerless]’, 12922), [(‘[Heart]’, 24959), [(‘[powerless]’, 12953), [(‘[powerless]’, 9811), [(‘[Heart]’, 21829),
(‘[Tears]’, 141), (‘[Heart]’, 31973), (‘[Tears]’, 66487), (‘[powerless]’, 50406), (‘[powerless]’, 40006), (‘[Heart]’, 19709), (‘[Tears]’, 14229), (‘[Tears]’, 13358), (‘[Heart]’, 11718), (‘[Tears]’, 11746), (‘[doge]’, 12226), (‘[Tears]’, 9008), (‘[Tears]’, 18786),
(‘[Heart]’, 127), (‘[powerless]’, 22045), (‘[powerless]’, 59996), (‘[Tears]’, 37500), (‘[Tears]’, 31800), (‘[Tears]’, 15199), (‘[Heart]’, 13348), (‘[Heart]’, 11371), (‘[Tears]’, 11568), (‘[powerless]’, 10558), (‘[Heart]’, 9722), (‘[Heart]’, 7145), (‘[powerless]’, 14184),
(‘[Husky]’, 111), (‘[Come on]’, 17002), (‘[Come on]’, 36181), (‘[Husky]’, 28140), (‘[doge]’, 21986), (‘[Husky]’, 11187), (‘[Husky]’, 8825), (‘[Husky]’, 7874), (‘[doge]’, 6988), (‘[doge]’, 5468), (‘[Tears]’, 9178), (‘[doge]’, 4893), (‘[Microphone]’, 10401),
(‘[cry with laughter]’, 72), (‘[Smile]’, 14563), (‘[Husky]’, 31622), (‘[doge]’, 27914), (‘[Husky]’, 20157), (‘[doge]’, 11071), (‘[doge]’, 7811), (‘[doge]’, 7784), (‘[Husky]’, 6712), (‘[Husky]’, 5374), (‘[Husky]’, 6405), (‘[Husky]’, 4877), (‘[Husky]’, 7633),
(‘[Kneel]’, 72), (‘[Cold and flu]’, 10588), (‘[Smile]’, 27695), (‘[Smile]’, 21579), (‘[cry with laughter]’, 15406), (‘[cry with laughter]’, 8443), (‘[cry with laughter]’, 6444), (‘[angry]’, 6241), (‘[Kneel]’, 5234), (‘[give you my heart]’, 4623), (‘[cry with laughter]’, 5709), (‘[cry with laughter]’, 3745), (‘[doge]’, 7207),
(‘[Smile]’, 69), (‘[Sad]’, 10197), (‘[doge]’, 25548), (‘[cry with laughter]’, 18792), (‘[Smile]’, 15390), (‘[Smile]’, 7554), (‘[Kneel]’, 6325), (‘[cry with laughter]’, 5835), (‘[cry with laughter]’, 5049), (‘[Smile]’, 4513), (‘[Kneel]’, 3870), (‘[Kneel]’, 3726), (‘[Kneel]’, 6839),
(‘[doge]’, 62), (‘[Kneel]’, 9929), (‘[cry with laughter]’, 20687), (‘[Come on]’, 17564), (‘[Kneel]’, 13616), (‘[Kneel]’, 7169), (‘[Smile]’, 6070), (‘[Kneel]’, 5756), (‘[Smile]’, 4917), (‘[Come on]’, 4402), (‘[Smile]’, 3722), (‘[Microphone]’, 3314), (‘[crack]’, 6404),
(‘[Cold and flu]’, 62), (‘[Husky]’, 8422), (‘[Sad]’, 19069), (‘[Kneel]’, 16806), (‘[Candle]’, 13433), (‘[giggle]’, 6379), (‘[Come on]’, 4998), (‘[Smile]’, 5323), (‘[giggle]’, 3823), (‘[cry with laughter]’, 3985), (‘[打call]’, 3390), (‘[Smile]’, 3069), (‘[Sad]’, 6099),
(‘[giggle]’, 50), (‘[doge]’, 7833), (‘[Kneel]’, 17705), (‘[giggle]’, 15048), (‘[Come on]’, 12496), (‘[Come on]’, 5892), (‘[giggle]’, 4710), (‘[Come on]’, 5098), (‘[Sad]’, 3411), (‘[Kneel]’, 3906), (‘[giggle]’, 3204), (‘[Sad]’, 2744), (‘[Come on]’, 5639),
(‘[Show hands]’, 50), (‘[Disappointed]’, 6802), (‘[Show hands]’, 15886), (‘[Show hands]’, 14334), (‘[giggle]’, 11252), (‘[Haha]’, 5299), (‘[Show hands]’, 4310), (‘[giggle]’, 3993), (‘[Come on]’, 3222), (‘[打call]’, 3786), (‘[cat]’, 3089), (‘[打call]’, 2542), (‘[Smile]’, 5523),
(‘[Not easy]’, 47), (‘[Show hands]’, 6343), (‘[giggle]’, 15114), (‘[Haha]’, 11853), (‘[Show hands]’, 10778), (‘[Show hands]’, 4970), (‘[Sad]’, 3981), (‘[Sad]’, 3968), (‘[Show hands]’, 3081), (‘[like]’, 3590), (‘[like]’, 3013), (‘[Disappointed]’, 2345), (‘[cry with laughter]’, 5465),
(‘[Come on]’, 42), (‘[cry with laughter]’, 6127), (‘[Cold and flu]’, 14479), (‘[So happy]’, 11137), (‘[Love you]’, 8670), (‘[Love you]’, 4744), (‘[Disappointed]’, 3715), (‘[Show hands]’, 3619), (‘[Disappointed]’, 3011), (‘[Sad]’, 3460), (‘[Come on]’, 2961), (‘[giggle]’, 2217), (‘[打call]’, 5110),
(‘[Sad]’, 40), (‘[Fist]’, 5774), (‘[Love you]’, 14043), (‘[cat]’, 11096), (‘[Haha]’, 8526), (‘[So happy]’, 4534), (‘[Haha]’, 3529), (‘[Haha]’, 3269), (‘[Haha]’, 2999), (‘[giggle]’, 2993), (‘[Haha]’, 2897), (‘[Come on]’, 2192), (‘[Disappointed]’, 4507),
(‘[Applause]’, 39), (‘[angry]’, 4939), (‘[Fist]’, 13987), (‘[Love you]’, 11054), (‘[Sad]’, 8506), (‘[Sad]’, 4160), (‘[Love you]’, 3071), (‘[Disappointed]’, 3105), (‘[So happy]’, 2604), (‘[good]’, 2941), (‘[Sad]’, 2801), (‘[Show hands]’, 2130), (‘[giggle]’, 3715),
(‘[So happy]’, 39), (‘[bow with hands held in front of one’s face]’, 4652), (‘[Disappointed]’, 13070), (‘[Sad]’, 9753), (‘[cat]’, 8179), (‘[cat]’, 4106), (‘[So happy]’, 2996), (‘[spit]’, 2988), (‘[Love you]’, 2598), (‘[angry]’, 2596), (‘[Show hands]’, 2597), (‘[Fist]’, 2089), (‘[Cold and flu]’, 3647),
(‘[Haha]’, 36), (‘[giggle]’, 4641), (‘[Sun]’, 12794), (‘[eat melon]’, 8644), (‘[So happy]’, 7899), (‘[Disappointed]’, 3696), (‘[cat]’, 2875), (‘[Love you]’, 2845), (‘[cat]’, 2406), (‘[Show hands]’, 2506), (‘[Disappointed]’, 2235), (‘[Haha]’, 1806), (‘[Love you]’, 3557),
(‘[Flowers]’, 33), (‘[Heartbroken]’, 4226), (‘[Haha]’, 12095), (‘[breeze]’, 8558), (‘[incomprehensible]’, 7135), (‘[Flowers]’, 3625), (‘[Byebye]’, 2528), (‘[cat]’, 2831), (‘[like]’, 2241), (‘[Flowers]’, 2438), (‘[Love you]’, 2105), (‘[like]’, 1795), (‘[Show hands]’, 3543),
(‘[Disappointed]’, 33), (‘[good]’, 4208), (‘[WuhanCome on]’, 11345), (‘[Applause]’, 8335), (‘[Microphone]’, 7133), (‘[Applause]’, 3483), (‘[Applause]’, 2475), (‘[So happy]’, 2742), (‘[Byebye]’, 2146), (‘[China like]’, 2429), (‘[good]’, 2067), (‘[Cold and flu]’, 1679), (‘[like]’, 3535),
(‘[Byebye]’, 31)] (‘[Love you]’, 4128)] (‘[cat]’, 11292)] (‘[incomprehensible]’, 8205)] (‘[breeze]’, 6988)] (‘[breeze]’, 3454)] (‘[Cold and flu]’, 2461)] (‘[Applause]’, 2396)] (‘[good]’, 2101)] (‘[Love you]’, 2366)] (‘[Microphone]’, 2030)] (‘[Love you]’, 1605)] (‘[give you my heart]’, 3391)]

Table 4: Twenty Most Popular Weibo Emojis 

Negative emojis are marked by red borders, and positive emojis are marked by green borders.

For how I classify emojis as positive or negative, please click here for details. But my general rules are listed in the table 5.

Table 5: General Weibo Emoji Sentiment Classification Rules

Emoji data source: Github and Github

Table 6 shows that in every month except January 2020, the number of positive emojis among the top twenty exceeds the number of negative ones. In other words, the public almost always expressed more positive than negative sentiment between December 2019 and December 2020; even in February, when the epidemic was at its worst in mainland China, positive sentiment still exceeded negative sentiment.

Table 6: Twenty Most Popular Weibo Emojis in Each Month

Emoji Sentiment Type 2019-12 2020-01 2020-02 2020-03
Positive 6 5 6 7
Negative 3 5 3 2
Not Classified/Neutral 11 10 11 11
Emoji Sentiment Type 2020-04 2020-05 2020-06 2020-07
Positive 7 7 7 8
Negative 1 3 3 4
Not Classified/Neutral 12 10 10 8
Emoji Sentiment Type 2020-08 2020-09 2020-10 2020-11
Positive 8 9 8 7
Negative 3 3 3 3
Not Classified/Neutral 9 8 9 10
Emoji Sentiment Type 2020-12
Positive 7
Negative 3
Not Classified/Neutral 10

 

Weibo Location

I use Vaex (very similar to Dask) for this part. Generating such interactive GPS images is easy with the plot_widget method provided by Vaex; for a detailed example on NYC taxi data, please see here, or simply use that code snippet directly to generate the interactive maps. For my own code, please see here. This functionality cannot be run remotely on AWS SageMaker, so I generated the following GPS maps by running Vaex locally on my computer. The code is attached in the Jupyter Notebook.

Among the 12,758,869 original posts related to the epidemic, 12.619% (1,610,086) contain GPS information, and I generated a dynamic map based on them; the following are a few screenshots. Intuitively, far more posts were sent from mainland China than from other countries (see Figure 4). The number of original posts from mainland China was 1,447,838, accounting for 89.923% of all Weibo posts with GPS positioning.

Figure 4: Geographical distribution of Weibo posts around the world

Now I focus on mainland China. Figure 5 shows that, in general, the brightest areas are concentrated in the relatively more developed eastern and central regions of China, while far fewer posts come from the relatively underdeveloped western regions. In the east, the brightest areas are Beijing, Shanghai, Guangzhou, Wuhan, and their surroundings. Figure 6 shows the cumulative number of confirmed cases in different provinces of mainland China (the greater the number, the darker the color). Comparing the cumulative confirmed-case map with the satellite map, there is no direct positive correlation between the severity of the epidemic in a region and the number of posts from local users. For example, although the northeastern province of Heilongjiang (in the upper right corner of the map) had more than 1,500 confirmed cases, it produced fewer posts than most provinces.

Figure 5: Geographical distribution of Weibo posts in mainland China

Figure 6: Cumulative number of confirmed cases in different provinces of mainland China

Source:Sina News

After calculation, the number of posts from the Yangtze River Delta is 201,958, accounting for 13.949% of the posts from mainland China. Posts from Shanghai account for 5.253% of the total, posts from Beijing for 7.65%, and posts from Wuhan for 6.91%. China’s north-south boundary lies approximately between 32 and 34 degrees north latitude. If the boundary is set at 32, 33, or 34 degrees, posts sent by northern users account for 49.267%, 45.428%, or 43.327% of the total, respectively. Interestingly, of all the confirmed cases in China, only 10,102 were in the north, while 80,997 were in the south. The population of southern China is also about 150 million larger than that of northern China. So the north, with a smaller population and fewer confirmed cases, was nevertheless more active in discussing epidemic-related topics than the south.
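The north-south split above amounts to a latitude threshold; a minimal sketch, assuming latitudes have already been parsed from each post's geo_info field (the example coordinates are illustrative city locations):

```python
def north_share(latitudes, boundary=33.0):
    """Fraction of geotagged posts sent from north of the boundary latitude."""
    north = sum(1 for lat in latitudes if lat >= boundary)
    return north / len(latitudes)

# Toy latitudes standing in for values parsed from geo_info:
# roughly Beijing, Shanghai, Wuhan, Guangzhou, Harbin.
lats = [39.9, 31.2, 30.6, 23.1, 45.8]
print(north_share(lats))  # 0.4
```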

Now I restrict the scope to Wuhan. Combined with a map of Wuhan, it can be seen that Wuhan’s posts are mainly sent from the city’s main urban area; the number of posts from urban areas far exceeds the number from the suburbs.

Topic Changes

The interactive HTML results of topic modeling can be found here. The topics change constantly as the epidemic evolves. Below I select only the meaningful words from the two most popular topics of each month.

In general, concerns about the outbreak have continued throughout the year, and other countries, especially the United States, have been among the hot topics of discussion since April. The names of cities and locations accurately reflect the resurgence of the pandemic in certain areas at a certain time.

Dec 2019: Virus found, but we knew very little about it.

Keywords: Wuhan, pneumonia, unknown, cause, discover.

An internal notice from the Wuhan Municipal Health Commission circulated on the Internet on December 30, 2019, stating that pneumonia of “unknown cause” had emerged in Wuhan. On December 31, the Wuhan Municipal Health Commission made its first public notification of the outbreak, stating that 27 cases of “viral pneumonia” had been identified in the city, but that no evidence of obvious “human-to-human” transmission or infection among medical personnel had been found. The discussion in December 2019 focused on the uncertainty surrounding the newly discovered virus. The most liked and reposted Weibo post of the day also partly reflects the public’s unfamiliarity with and curiosity about the virus (the article accompanying this tweet discussed how Chinese scientist Zhengli Shi and her team tracked down the source of SARS).

#武汉发现不明原因肺炎#武汉这次是海鲜市场,看完文章你就明白了…

#Wuhan found pneumonia of unknown cause# This time in Wuhan it is a seafood market; read the article and you will understand…

Jan 2020: Out of control, lockdown, helping each other.

Keywords: epidemic, case, confirmation, hope, pneumonia, going out, mask, really, lockdown, stay-at-home, safety, heart, donation, strength.

On January 3, Wuhan reported a total of 44 cases of pneumonia of unknown origin, with no obvious evidence of human-to-human transmission. On January 20, 136 new confirmed cases were suddenly reported in Wuhan, and other cities, such as Beijing, reported cases for the first time. That night, Zhong Nanshan, head of the expert team of the National Health Commission of China, confirmed for the first time that the virus can be transmitted from person to person. On January 23, Wuhan was locked down. The keywords “case,” “confirmation,” “lockdown,” “stay-at-home,” “going out,” and “masks” accurately reflect that the outbreak was becoming uncontrollable and that masks were in short supply. During this period Wuhan faced a shortage of resources, and all sectors of society supported Wuhan in various ways. Keywords such as “cheer,” “love,” “donation,” and “hope” reflect how people helped each other in January.

Feb 2020: Combat the COVID, ambulance, hope, cheer

Keywords: combat the COVID, cheer, promise, ambulance, donation, meaning, face

February 2020 was when the pandemic was at its worst in China. Keywords such as “combat the COVID” and “ambulance” reflect the severity of the epidemic. The number of confirmed cases in Wuhan surged to 5,000 on February 2, and the official figures for Wuhan peaked on February 12, with 13,436 new confirmed cases in a single day. Keywords such as “hope,” “meaning,” “end,” “[tears],” “[heart],” and “soon” show that people hoped the epidemic would end as soon as possible. In this month, people’s attention was focused on the epidemic and its impact.

March 2020: From the pandemic to the resumption of work and production

Keywords: really, hope, life, cheer, stay-at-home, work resumption, prevention, work, industry, resumption of production. 

Similarly, keywords such as “hope,” “life,” “stay-at-home,” and “masks” indicate that people were still concerned about the impact of the pandemic in March. More noteworthy is that public attention began to shift from the pandemic to the resumption of work and production. On February 23, 2020, Xi Jinping addressed a teleconference attended by 170,000 officials, focused on the resumption of work and production. In late March, the lockdown in Hubei province began to be gradually relaxed and medical teams from other parts of China began to be withdrawn from Wuhan. People from low-risk communities or communities with no confirmed cases could travel within Hubei with the “green code.” Keywords such as “work resumption,” “prevention,” “work,” “industry,” and “resumption of production” began to appear, indicating that in addition to following the pandemic, the public began to discuss how to restore normal work and production.

April 2020: Resumption of production; attention to the pandemic in the US and the economic impact of COVID

Keywords: case, United States, confirmation, new cases, market, impact, economy, company, industry.

In April, the keywords of the topic people were most concerned about (the first category) included “prevention,” “work,” “resumption of work,” “school opening,” “industry,” and “resumption of production.” This shows that restoring normal teaching and production had become the most important concern. Keywords such as “market,” “impact,” “economy,” and “consumption” reflect Weibo users’ concern about the longer-term impact of the epidemic. In addition, the term “United States” entered the top of the first category of topics, reflecting to some extent the public’s concern about the epidemic in the U.S., where the first major outbreak was underway in April 2020.

May 2020: The pandemic in the US and the economic impact of the pandemic

Keywords: China, enterprise, market, economy, the United States, prevention, work

In May, the keywords of the topics the public was most concerned with (the first topic category) included “enterprise,” “market,” “economy,” “the United States,” and “world.” In May 2020, the pandemic situation in the United States showed no improvement, and the public’s attention began to shift from China to other countries, especially the United States.

June 2020: The US outbreak; the second wave of the Beijing outbreak

Keywords: the United States, China, Beijing, case, confirmation, new cases, work

In June, keywords such as “U.S.” and “China” indicate that the U.S. outbreak, and the comparison between China and the U.S., became one of the most important public concerns. Meanwhile, keywords such as “case,” “confirmation,” “new cases,” “Beijing,” and “prevention” indicate that people were still concerned about the domestic outbreak, especially in Beijing. 23 new indigenous cases were reported on June 19, including 22 in Beijing. In the midst of this serious outbreak, flights and trains to and from Beijing were cancelled, and authorities introduced several measures to control and “self-quarantine” Beijing.

July 2020: The college entrance exam

Keywords: work, prevention, Beijing, enterprise, case, the college entrance exam, child, really

In July, the keywords that people were talking most about included “the college entrance exam,” “child,” “candidate,” “exam” and some other words related to the college entrance exam in China.

August 2020: Back to the cinema and school

Keywords: really, film, back to school, hope, time, US, economy, 2020, China, market.

In August, people’s lives began to return to normal. Starting July 20, cinemas in low-risk areas reopened for business after nearly 180 days of closure. In August, provinces and cities set the opening dates of the new school year, and schools staggered the times for students to return to campus. Keywords such as “film” and “back to school” indicate that the public was allowed into cinemas again and students gradually went back to school. Meanwhile, keywords such as “time,” “really,” and “life” seem to reflect the public’s emotional state during this period.

September 2020: Praise for medical workers

Keywords: heroes in harm’s way, the most beautiful, really, COVID-themed TV series, work, company, enterprise, economy, market

In September, the COVID-themed TV series “The Most Beautiful Heroes in Harm’s Way,” produced by China Central Television (CCTV), was released. It was adapted from touching stories that occurred during the 2020 pandemic. Keywords such as “the most beautiful heroes in harm’s way,” “COVID-themed TV series,” “salute,” and “hero” indicate that people began to talk about and praise the medical staff who worked hard during the pandemic.

October 2020: Trump, Qingdao and Tiantongyuan

Keywords: really, life, like, work, China, emoji[tears]

Keywords such as “Trump” and “the United States” indicate that Trump’s handling of the pandemic, and the pandemic’s impact on the 2020 US election, were among people’s concerns. In October 2020, the epidemic broke out again in Qingdao, a city in eastern China, so keywords such as “Qingdao” and “Tiantongyuan” also attracted attention.

November 2020: No special concern

Keywords: work, the United States, development, China, prevention

No special keywords appeared in November. The main concerns this month were prevention, the United States, the economy, and development.

December 2020: Vaccine and cold-chain food contamination

Keywords: 2020, one year, hope, vaccine, 2021, the United States, work, prevention, vaccination

In early December 2020, China launched the Sinovac COVID-19 vaccine, and keywords such as “vaccine” and “vaccination” appeared this month. At the same time, positive nucleic acid tests in cold-chain environments were reported in Hubei, Zhejiang, Shandong, Liaoning, and several other provinces. Keywords such as “Dalian,” “cold chain,” “food,” and “import” reflect these events.

Number of Posts

As noted in the introduction, the epidemic that spread quietly in Wuhan at the end of 2019 swept through more than 200 countries in the months that followed. What happened to Wuhan, a city with a population of 11 million? How did it gradually return to normal?

Although the official Chinese statistics on this issue remain questionable, they still help me understand the questions I am interested in.

General Trend

As seen in figure 7 and figure 8, the change in the number of posts over time reflects the changing level of concern about the epidemic. Although the virus started to spread well before January 2020, the Chinese National Health Commission only confirmed a “novel coronavirus” as the etiology of the outbreak on January 8, 2020. We can see that there was minimal discussion of related topics in China in December 2019. The change in the number of tweets also reflects the fact that the outbreak was concentrated in January, February, and March. After COVID-19 was gradually brought under control in China, discussion of the topic gradually decreased.

Figure 7: Changes in the number of tweets

Figure 8: Daily new confirmed COVID-19 cases

Data source: Our World in Data

 

Monthly Trend

Next, I look at the change in the number of tweets within each month from December 2019 to December 2020. In the figures, 0 represents negative sentiment and 1 represents positive sentiment.

2019-12

In December 2019, there were 4,286 positive-sentiment tweets and 3,362 negative ones; positive tweets exceeded negative posts by about 27.48%. As seen in figure 9, there was little discussion of the virus from December 1 to 30, 2019. It was not until December 31 that the number of posts jumped to more than two thousand.
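The percentage comparisons used throughout this section reduce to a one-line helper; a minimal sketch:

```python
def pct_more(positive: int, negative: int) -> float:
    """Percentage by which positive posts exceed negative posts."""
    return (positive - negative) / negative * 100

# December 2019: 4,286 positive vs 3,362 negative posts
print(round(pct_more(4286, 3362), 2))      # → 27.48
# January 2020: 597,207 positive vs 385,955 negative posts
print(round(pct_more(597207, 385955), 2))  # → 54.73
```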

Figure 9 shows not only that there was a discussion boom on December 31, but also that posts expressing negative sentiment outnumbered those expressing positive sentiment. The most-commented-on tweet of the day is quoted below. On the last day of 2019, the Wuhan Health Commission made its first public notification of the outbreak, informing the public that 27 cases of “viral pneumonia” had been detected in the city, though no evidence of human-to-human transmission or infection among medical personnel had been identified. The Chinese National Health Commission sent experts to Wuhan.

> #Wuhan_found_unexplained_pneumonia As a respiratory-medicine resident at a tertiary hospital in Jiangan District, comparing what the public is hearing with what’s coming from the inside, let’s just say the secrecy is working well. I won’t say anything else; the mask is too thick to speak through [cold] [cold] [cold]

> #武汉发现不明原因肺炎#身为在江岸区某三甲医院的呼吸科规培生,对比大众的消息和内部的消息,只能说保密工作做得不错,别的就不说了,口罩太厚不方便说 [感冒] [感冒] [感冒]

On December 30, Dr. Li Wenliang, who is seen as a whistleblower, warned his peers on WeChat that a SARS-like virus had emerged. But WeChat is a private tool for contacting friends and acquaintances, so this Weibo user was likely one of the first whistleblowers to appear on a public platform.

Figure 9: Daily Tweets Number and Sentiment in 2019-12

2020-01

Before officially reporting my results, I will first compare them with Yingdan Lu’s. Note that their definition of criticism always entails a target, which means that general negative sentiments such as fear or undirected anger are out of scope, whereas I investigate general negative and positive sentiment on Weibo during COVID-19. In her analysis (see figure 10), on January 23, the day Wuhan was locked down, there was more criticism than support in public sentiment. Conversely, my sentiment analysis shows that positive sentiment still exceeded negative sentiment on this day. Similarly, her analysis indicates that criticism overwhelmed support on January 31, but my results suggest that positive posts overwhelmed negative posts on that day.

Figure 10: Yingdan Lu’s sentiment analysis result

Now I turn to the results for January 2020. In January 2020, there were 597,207 positive posts and 385,955 negative posts; positive tweets outnumbered negative ones by 54.73%. Except for January 20, there were always more positive posts than negative ones.

On January 1, the Weibo account of the Wuhan police department posted that the police had summoned eight people accused of spreading “rumors” online about atypical pneumonia. On January 11, the first death was reported in the official Chinese media. Unfortunately, public discussion of the virus remained largely inactive for the first 20 days of January, suggesting that the virus was not taken seriously enough until January 20, although the public did not show complete disregard: a very small number of posts expressed concern about the “unexplained pneumonia.” For example, among the 10 posts that received the most likes on January 11, two expressed concern about the epidemic (they received 24 and 16 likes respectively, ranking sixth and tenth).

> #武汉发现不明原因肺炎##武汉不明原因肺炎病原体为新型冠状病毒# 我就想问问大家,今年过年还敢去武汉玩吗,票都买好了[泪]

> #武汉不明原因肺炎导致1人死亡#大家还是多关注这个 我因为去推特看了香港那边的新闻 默默担心好几天了  @楠先生转生 基友 你有最新消息 要告诉我呀

> #Unexplained pneumonia found in Wuhan# #The pathogen of Wuhan’s unexplained pneumonia is a novel coronavirus# I just want to ask everyone, do you still dare to travel to Wuhan for Chinese New Year this year? I already bought the tickets [tears].

> #Wuhan unexplained pneumonia led to 1 death# We should pay more attention to this. I saw the Hong Kong news coverage on Twitter, and I have been quietly worried for days. @nanxianshengzhuansheng hey buddy, please tell me if you have the latest news.

After December 31, 2019, public opinion exploded for the second time on January 20. 136 new confirmed cases were suddenly reported in Wuhan that day, and other cities, including Beijing, reported cases for the first time since the outbreak was disclosed on December 31, 2019. That evening, Zhong Nanshan, the head of China’s National Health Commission’s expert group, confirmed for the first time that the virus could spread “from person to person” and said that 14 health care workers had been infected. In fact, nine of the ten posts that received the most likes that day had something to do with the outbreak. Positive tweets outnumbered negative tweets on this day. The most-liked negative tweet criticized people outdoors for not wearing masks.

> 武汉站下高铁后看到,出站人群里戴口罩的不超过20%,接站的本地人戴口罩的很少。 工作人员部分佩戴外科口罩,应该是统一发放的,但显然没有强制要求,也没有向他们强调重要性。 武汉对这场疫情的认识严重不足,过于松懈! #境内确诊217例新型冠状病毒肺炎病例# 武汉 显示地图 

> Upon getting off the high-speed train at Wuhan Station, I observed that less than 20% of the crowd exiting the station was wearing masks, and few of the locals meeting arrivals wore masks either. Some of the staff wore surgical masks, which appeared to have been uniformly distributed, but wearing them was clearly not mandatory, and their importance had not been emphasized. Wuhan has a gross lack of awareness of this outbreak and is far too lax! #217 cases of novel coronavirus pneumonia confirmed within China# Wuhan Show map

In addition, the negative tweets revolved around several themes. First, people compared the case of Dr. Tao Yong, who nearly died on January 20, 2020 after being violently attacked at his hospital, with the importance people were now placing on doctors. Second, many condemned those who consume wild animals illegally, believing the virus originated from locals eating wild animals. A third complaint concerned the short supply of masks and the high prices merchants were charging during this period.

Negative sentiment reached its highest point for the month on January 23, when the number of posts in a single day reached an unprecedented level, with original posts related to the outbreak exceeding 100,000 for the first time. On that day, Wuhan, home to more than 11 million people, was locked down. Chinese officials urged people not to travel to or from Wuhan after imposing a strict travel ban on the region, closed Wuhan’s bus, subway, and ferry operations, and canceled flights.

Among the 20 most popular (by likes) negative posts on January 23, there were several major topics: first, condemnation of some Wuhan residents for carrying the virus to other areas; second, criticism of the collective beating of doctors by patients’ families; and third, complaints about contracting the virus without timely treatment or being unable to get a mask. Notably, posts expressing blatant hatred and discrimination were among the most popular of the day.

> #请求上海封城#跑来看女儿有病不治公共交通跑了四个区的是武汉人吧 环球港被封是武汉人吧 一家三口毒人跑去迪士尼的是武汉人吧 上海作了什么孽 平时黑上海有事情了往上海去钻 别说什么地域歧视 武汉一生黑[拜拜] 

> #Requesting that Shanghai lock down# The one who came to see her daughter, refused treatment while sick, and rode public transit across four districts was from Wuhan, right? Global Harbor was closed because of Wuhan people, right? The infected family of three who ran off to Disney were from Wuhan, right? What did Shanghai do to deserve this? People trash-talk Shanghai all the time, then flock here when trouble comes. Don’t talk to me about regional discrimination. I’ll hate Wuhan for life [bye]

Interestingly, the ninth most-liked tweet of the day supported Wuhan’s lockdown and opposed regional discrimination, and the GPS location information for this post happened to be the city of Wuhan:

> #武汉公交地铁暂停运营#身为武汉本地人,我们支持封城,但!封城≠地域黑!设身处地的想想,多少人不能和家人团聚!这个时候我们不需要网络喷子!需要的是理解和鼓励!!舍小家,为大家!武汉在行动!!!! 武汉 显示地图 

> #Wuhan bus and metro operations suspended# As Wuhan locals, we support the lockdown, but a lockdown ≠ regional discrimination! Put yourself in our shoes: how many people cannot reunite with their families! At this time we do not need online haters! What we need is understanding and encouragement!! Sacrificing our own families for everyone’s sake! Wuhan is taking action!!!! Wuhan Show map

January 24 was Chinese Lunar New Year’s Eve, a day of special significance. On this New Year’s Eve, however, frontline healthcare workers asked for help on social media, claiming there were not enough protective supplies, and there were too few hospital beds to accommodate the increasing number of patients.

On January 25, positive sentiment reached its peak, and the number of tweets on this day was the highest of January. Chinese New Year was celebrated on this day. Looking at the most popular posts, people expressed their New Year wishes with a special emphasis on wishing others peace and health.

> 鼠年大吉,新的一年祝大家身体健康,万事顺意。特殊时期,出门记得戴口罩,勤洗手,健康最重要。 

> Great luck for the Year of the Rat. I wish you all good health and all the best in the new year. In this special period, remember to wear a mask when you go out and wash your hands often; health matters most.

Figure 11: Daily Tweets Number and Sentiment in 2020-01

To check whether the nationwide results overshadowed the sentiment of Wuhan residents specifically, I selected the posts from Wuhan and analyzed their sentiment changes in January 2020. Because some users add location information when posting on Weibo, I can filter out the posts sent from Wuhan. After filtering and classifying sentiment, I obtained a total of 8,024 positive posts and 6,317 negative posts from Wuhan. As Table 7 shows, posts from Wuhan and posts from across the country exhibited a similar sentiment pattern. However, the day with the highest number of posts from Wuhan in January was January 23, the day the city was locked down, whereas the day with the highest overall number of posts by Weibo users was January 25, the first day of the Chinese New Year. Positive sentiment still outweighed negative sentiment on the day of Wuhan’s lockdown, but negative sentiment reached its peak for the month.

Table 7: Sentiment classification results for tweets sent from Wuhan (January 2020)
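The Wuhan subset can be extracted by matching the `geo_info` field of the Weibo-COV records; a minimal sketch, assuming each post is a dict keyed by the dataset’s column names (the sample rows below are made up for illustration):

```python
def from_wuhan(post: dict) -> bool:
    """Keep posts whose user-supplied location mentions Wuhan (武汉)."""
    return "武汉" in (post.get("geo_info") or "")

posts = [
    {"_id": "a1", "geo_info": "武汉 显示地图"},  # geotagged Wuhan
    {"_id": "a2", "geo_info": ""},               # no location attached
    {"_id": "a3", "geo_info": "北京"},           # geotagged Beijing
]
wuhan_posts = [p for p in posts if from_wuhan(p)]
print([p["_id"] for p in wuhan_posts])  # → ['a1']
```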

2020-02

For detailed daily statistics on positive and negative posts, please see here. Overall, there were 2,443,631 positive posts and 837,991 negative posts; the number of positive posts was 291.60% of the number of negative posts.

Between January 31 and February 2, 2020, there was another spike, centered on two controversies related to medical supplies. In one, the Red Cross Society of Hubei was criticized on Weibo for colluding with private hospitals and causing supply shortages at hospitals designated for COVID-19 treatment. The second controversy erupted over the Chinese herbal remedy Shuanghuanglian, which Chinese state media reported as a possible treatment for the virus.

Public negative sentiment reached its highest point between the night of February 6 and the early morning of February 7, when Dr. Li Wenliang, considered the “whistleblower” of the epidemic, died after contracting COVID-19. Li Wenliang’s death stirred up anger and disappointment toward the Chinese government.

Positive sentiment reached its climax around February 13, the day both the Hubei provincial Party secretary and the Wuhan municipal Party secretary were removed from office. Public anger at the Hubei and Wuhan governments seems to have been released on this day.

On February 14 Wuhan issued a notice specifying the measures for the lockdown of residential communities. The most popular posts (ranked by number of likes) on this day were basically expressing hope that the epidemic would end soon and wishing others health and happiness in the name of Valentine’s Day.

Besides tweets from around the country, I also wanted to see how the sentiment of those tweeting from Wuhan changed in February 2020. Since many users included location information in their tweets, I could identify which tweets originated from Wuhan. After sentiment classification, I obtained a total of 15,737 positive tweets and 10,333 negative tweets from the city. In general, tweets from Wuhan and tweets from across the country exhibit similar sentiment. On February 6, however, negative sentiment slightly outweighed positive. Li Wenliang, considered the whistleblower of the outbreak, died after contracting COVID-19 between the night of February 6 and the early morning of February 7. It was reported in January that Li had been silenced by the authorities: he was interviewed by health officials and police, and eventually signed a Letter of Admonishment that dismissed his warnings as unfounded and illegal rumors. His death stirred anger and frustration toward the Chinese government, which had attempted to silence the whistleblower rather than inform the public about the outbreak. Many saw his early warning that the epidemic could get out of control as heroic, and there was widespread mourning for his passing on social media.

2020-03

In March, there were 115,850 positive tweets and 69,599 negative tweets. The last day of the month saw the most posts: the Ministry of Education announced a one-month postponement of the 2020 college entrance examination. Since the college entrance examination was reinstated, it had never been postponed by a month, not even during the epidemic of 2003 or the natural disasters of 2008.

In the graph below, we can see the change in user sentiment in Wuhan in March 2020. During this month, 13,583 tweets were positive, and 7,372 were negative.

 

2020-04

In April 2020, there were 1,107,708 positive tweets and 660,954 negative tweets. Positive sentiment reached its peak on April 4, when China held a national day of mourning: national institutions and diplomatic missions abroad lowered their flags, and public entertainment activities were suspended nationwide. On the same day, the number of confirmed cases worldwide reached one million. The second peak of positive sentiment was on April 8, when Wuhan lifted its departure controls and resumed traffic to other cities; Wuhan Tianhe Airport began to resume flights, and Wuhan Railway Station returned to normal operation.

May 2020 to December 2020

Month Positive Negative
2020-05 528,198 307,471
2020-06 388,307 281,285
2020-07 333,538 249,101
2020-08 297,228 207,722
2020-09 313,131 170,584
2020-10 294,011 211,868
2020-11 253,806 200,929
2020-12 462,322 369,825

Number of Neutral, Negative and Positive Posts

There are three sentiment categories in my training dataset: negative, neutral, and positive. In the previous analysis, I removed the neutral posts and trained a binary BERT model. In this section, I instead retrain the BERT model on all three sentiment classes. This three-class model yields conclusions similar to the binary one, so this section mainly serves as a robustness check. Because some texts are too long to feed into the model even after truncation, a small “not_classified” category also appears.

LABEL_0, LABEL_1, and LABEL_2 represent negative, neutral, and positive sentiment, respectively.
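The raw classifier outputs map to sentiment names as just described; a minimal sketch, where treating any unscorable post (e.g. an over-length text) as `not_classified` is my reading of the counts reported below:

```python
# LABEL_0 / LABEL_1 / LABEL_2 → sentiment name
LABELS = {"LABEL_0": "negative", "LABEL_1": "neutral", "LABEL_2": "positive"}

def to_sentiment(raw_label):
    """Map a model output label to a readable name; anything the model
    could not score falls back to 'not_classified'."""
    return LABELS.get(raw_label, "not_classified")

print(to_sentiment("LABEL_2"))  # → positive
print(to_sentiment(None))       # → not_classified
```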

My fine-tuned Chinese BERT-wwm-ext model for classifying the three sentiment types has an accuracy of about 0.7624. For hyperparameter details, please see table 8.

Table 8: Three-classes Sentiment Classifier 

Parameter Name Setting
Sentiment Classes 3
Model BERT
Train epochs 3
Per device train batch size 32
Learning rate 2.00E-05
Weight decay 0
Seed 1
Warmup steps 0
Accuracy 0.7623980383
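The settings in Table 8 map directly onto Hugging Face `TrainingArguments`; a configuration sketch (the output directory name is illustrative, and the model/dataset setup from earlier sections is assumed):

```python
from transformers import TrainingArguments

# Hyperparameters from Table 8 for the three-class fine-tune
training_args = TrainingArguments(
    output_dir="bert_sentiment_3class",  # illustrative path
    num_train_epochs=3,
    per_device_train_batch_size=32,
    learning_rate=2e-5,
    weight_decay=0.0,
    warmup_steps=0,
    seed=1,
)
```

These arguments would then be passed to a `Trainer` together with the fine-tuned Chinese BERT-wwm-ext model and the tokenized training set.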

 

Month Negative Neutral Positive Not_classified
2019-12 1,102 5,695 851 –
2020-01 213,909 435,914 333,337 2
2020-02 415,392 1,204,892 1,661,329 9
2020-03 300,435 1,008,124 539,890 –
2020-04 256,595 1,056,088 455,977 2
2020-05 115,635 512,041 207,992 1
2020-06 109,744 429,047 130,801 –
2020-07 102,254 360,139 120,242 4
2020-08 82,896 316,983 105,069 2
2020-09 74,420 264,101 145,192 2
2020-11 70,132 310,269 74,333 1
2020-12 127,567 559,747 144,832 1

Image Classification and Speech Recognition

In the text classification section, the approximately 13 million posts I use are from the Weibo-COV V2 dataset.

Here is a sample of Weibo-COV V2.

_id,user_id,crawl_time,created_at,like_num,repost_num,comment_num,content,origin_weibo,geo_info
Jwm2cyhQQ,e0470a66f95fe66d,1607931932,2020-12-01 00:00,0,0,0,【抗疫路上,#幕后的科研专家走了#】疫苗攻关争分夺秒,他总想再快点!因连续工作、过度劳累,中国医学科学院病原生物学研究所研究员赵振东教授倒在了出差途中,最终抢救无效,于9月17日在北京不幸逝世,享年53周岁。赵振东教授是我国从事病原生物学和感染免疫学研究的知名专家,疫情伊始,他说:”这…全文转发理由:[泪],Jwl894jgH,

 

It should be noted that the images attached to the posts are not provided in this dataset. Therefore, I am unable to combine the text and the pictures of the posts in this dataset for analysis. However, in order to practice the technique of image classification, I adopted another method. Specifically, my training dataset provides the download addresses of these images. I thus first filtered out all the samples with images in the training dataset and then downloaded the images attached to these tweets.

Here is a sample of the training dataset. The total sample size is 100,000, of which there are more than 60,000 posts with pictures.

weibo_id posting_time user_account content weibo_pic weibo_video sentiment
4473093169042720 02月17日 23:06 苏晴晴晴子 待家里过的恍恍惚惚差点忘了我还要去打疫苗这件大事那么问题来了什么时候才可以出门了呢? [https://ww2.sinaimg.cn/orj360/9b0180f7gy1gbzs… [] 0
4466221086475030 01月29日 23:59 李小盒Austin 初五迎财神!希望大家身体健康!疫情赶紧散去2新余? [https://ww1.sinaimg.cn/orj360/ce78f2bagy1gbdv… [] 2
4466916678241530 01月31日 22:03 Cathleenzhou #疫情仍处于扩散阶段##抗击新型肺炎第一线# [https://ww2.sinaimg.cn/orj360/67206916ly1gbg1… [] 1
4470932179164800 02月11日 23:59 为什么不能随机id 呜呜呜呜呜黄石真好!特别安心!衷心祝福黄石,江苏与你们同在dbq他浙我就爬一会墙……主要是…… [https://wx1.sinaimg.cn/orj480/8300ec6cgy1gbsp… [‘https://f.video.weibocdn.com/003Zmfdjgx07ASg… 2
4464693123441350 01月25日 18:47 兩后彩虹6071304534 1月21日,国家卫健委高级别专家组成员、香港大学微生物学系讲座教授袁国勇提出“超级传播者可能… [https://ww3.sinaimg.cn/orj360/006CSzEGgy1gb90… [] 1

 

Sina Weibo limits how often images can be downloaded, so to download more than 60,000 images we need a way around this rate limit. My preferred method is a tunnel proxy: a dynamic-IP proxy service that performs IP rotation in the cloud, so the client never has to switch IPs itself. The tunnel forwards each request to a different proxy IP, and the rotation period can be configured on demand. I was using this service. For my implementation code, please see here; a more detailed implementation is here.

Due to the large amount of data, I had to spend more than 10 hours downloading these images; Python’s parallel computing modules are an ideal way to reduce this download time. Finally, the images were organized into folders by sentiment label.
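The download step can be sketched as follows. The proxy endpoint and folder layout are illustrative assumptions (a real tunnel service supplies its own endpoint), and only the pure path helper is exercised here:

```python
import os
import urllib.request
from concurrent.futures import ThreadPoolExecutor

def target_path(root: str, sentiment, url: str) -> str:
    """Place each image in a folder named after its sentiment label."""
    return os.path.join(root, str(sentiment), url.rsplit("/", 1)[-1])

def download(url: str, path: str, proxies=None) -> None:
    """Fetch one image, optionally through a (rotating) tunnel proxy."""
    opener = urllib.request.build_opener(
        urllib.request.ProxyHandler(proxies or {})
    )
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with opener.open(url, timeout=30) as resp, open(path, "wb") as f:
        f.write(resp.read())

def download_all(rows, root, proxies, workers=16):
    """Download every (url, sentiment) pair in parallel threads."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for url, label in rows:
            pool.submit(download, url, target_path(root, label, url), proxies)

# Example call (not run here; endpoint is a placeholder):
# download_all([("http://example.com/a.jpg", 0)], "images",
#              {"http": "http://tunnel.example.com:8000"})
```

Because each request goes through the opener, the tunnel service can rotate the exit IP transparently between downloads.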

Training a model from scratch is rarely easy: it requires very large datasets and models, so it is often worth starting from one of the many powerful pre-trained models. I chose to use pre-trained models, most of which are pre-trained on the 1,000 ImageNet classes.

I tested several pre-trained torchvision models:

from torchvision import models

# Load each architecture with ImageNet-pre-trained weights
alexnet = models.alexnet(pretrained=True)
densenet = models.densenet161(pretrained=True)
resnet = models.resnet50(pretrained=True)
mobilenet = models.mobilenet_v2(pretrained=True)

I manually sampled many images and classified them with these models; however, none provided useful results. On the contrary, the results were deeply disappointing: almost none of the randomly sampled images were classified correctly. This is perhaps unsurprising, since these models predict ImageNet object categories rather than anything related to sentiment. Given this performance, it is difficult to use the image classification results in this project. Here are some classification samples:

[('photocopier', 16.811765670776367), ('printer', 10.825773239135742), ('medicine_chest', 7.183788776397705)]

[('swing', 18.825193405151367), ('swab', 6.0248799324035645), ('shopping_cart', 5.0650434494018555)]
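The top-3 (label, score) lists above come from ranking the model’s output scores; a minimal sketch with a framework-free helper (the toy scores and class names are made up):

```python
def top_k(scores, class_names, k=3):
    """Return the k highest-scoring (class name, score) pairs."""
    ranked = sorted(zip(class_names, scores), key=lambda p: p[1], reverse=True)
    return ranked[:k]

# Toy example with three made-up class scores
print(top_k([0.1, 9.5, 3.2], ["cat", "photocopier", "printer"]))
# → [('photocopier', 9.5), ('printer', 3.2), ('cat', 0.1)]
```

In practice the scores would be the logits a pre-trained torchvision model produces for one image, and the class names would be the 1,000 ImageNet labels.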

I have also used the transformers package to perform speech recognition on the videos that accompany the tweets. After converting the videos to audio, I had a dataset I could analyze. The model I used is Wav2Vec2-Large-XLSR-53-Chinese-zh-cn-gpt, a version of facebook/wav2vec2-large-xlsr-53 fine-tuned on Chinese (zh-CN) using the Common Voice Chinese (zh-TW) dataset, with label text converted to simplified Chinese. But the results were disappointing. For the code, click here; for the model usage and source, click here. Note that this model expects speech input sampled at 16 kHz. A sample speech recognition result:

Prediction: [‘宋朝末年年间定区分命为’, ‘渐渐行动无变’]

Reference: [‘宋朝末年年间定居粉岭围。’, ‘渐渐行动不便’]

Final performance: CER: 20.902244

On my audio dataset, the results indicate that off-the-shelf Chinese speech recognition was not accurate enough to be useful.
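The character error rate (CER) reported above is the character-level edit distance between prediction and reference, divided by the reference length; a minimal sketch:

```python
def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: Levenshtein distance / reference length."""
    m, n = len(reference), len(hypothesis)
    # prev[j] holds the edit distance between reference[:i-1] and hypothesis[:j]
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            cur[j] = min(prev[j] + 1,        # deletion
                         cur[j - 1] + 1,     # insertion
                         prev[j - 1] + cost) # substitution / match
        prev = cur
    return prev[n] / m

# Second reference/prediction pair above: two substituted characters out of six
print(round(cer("渐渐行动不便", "渐渐行动无变"), 4))  # → 0.3333
```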

Limitations

Maybe Some Posts Are Not Relevant to COVID

I’m not sure how many tweets are actually related to COVID-19. While I used the second version of the Weibo-COV dataset, Yingdan Lu and her colleagues used the first version. Both datasets are built from a list of keywords: tweets containing at least one keyword were collected. Keyword-based datasets inevitably contain irrelevant posts, because a post may contain a COVID-19-related keyword or hashtag without actually being about COVID-19. Lu and colleagues identify 3,142,178 posts (72.9%) in the Weibo-COV dataset as related to COVID-19. It is therefore possible that roughly 27% of the samples I used in this final project have little to do with COVID-19 itself.
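The keyword matching that built the dataset can be reproduced in one line; a minimal sketch (the two keywords and sample texts are illustrative picks from the appendix list):

```python
KEYWORDS = ["新冠", "疫情"]  # two entries from the appendix keyword list

def mentions_covid(text: str) -> bool:
    """True if the post contains at least one collection keyword."""
    return any(kw in text for kw in KEYWORDS)

posts = ["今天疫情防控新闻发布会", "今天天气真好"]
print([mentions_covid(p) for p in posts])  # → [True, False]
```

Note that since collection itself was keyword-based, this check cannot remove off-topic posts that merely mention a keyword, which is why Lu and colleagues trained a separate relevance classifier.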

Censorship

My results could be biased by Chinese censorship. To check whether their own results were biased by censorship, Yingdan Lu and colleagues repeated their analyses on the pre-censorship Weiboscope dataset and found similar proportions of critical and supportive commentary: in the post-censorship Weibo-COV dataset, 6.4% of COVID-19-related posts contain criticism and 8.1% contain support, while in the pre-censorship Weiboscope dataset, 5.1% contain criticism and 6.8% contain support. Their results therefore do not appear to be driven by censorship.

Appendix

Dataset Keywords

Language Keywords
Chinese common 冠状 Cov-19 新冠 感染人数 N95 #2019nCoV #nCoV PHEIC 疫情 Coronavirus 感染 例 武汉 封城 武汉 隔离 居家隔离 隔离14天 潜伏期 14天 国际公共卫生紧急事件 复工 中小企业 困境 武汉 死亡病例 武汉 感染病例 湖北 死亡病例 湖北 感染病例 中国 死亡病例 中国 感染病例 潜伏期 北京 病例 天津 病例 河北 病例 辽宁 病例 上海 病例 江苏 病例 浙江 病例 福建 病例 山东 病例 广东 病例 海南 病例 山西 病例 内蒙古 病例 吉林 病例 黑龙江 病例 安徽 病例 江西 病例 河南 病例 湖北 病例 湖南 病例 广西 病例 四川 病例 贵州 病例 云南 病例 西藏 病例 陕西 病例 甘肃 病例 青海 病例 宁夏 病例 新疆 病例 香港 病例 澳门 病例 台湾 病例 ECOM sars-cov-2 核酸检测 COVID-19 2019-nCoV 疑似 病例 无症状 累计病例 境外输入 累计治愈 健康码 返校 美国 例 西班牙 例 新加坡 例 加拿大 例 英国 例 印度 例 日本 例 韩国 例 德国 例 法国 例 意大利 例 新增 例 战疫 抗疫 发热患者 延迟开学 开学时间 不得早于 累计死亡数 疑似病例 武汉 肺炎 新型肺炎 出门 戴口罩 新肺炎 #2019nCoV 新型肺炎 AND 死亡 新型肺炎 AND 感染 企业复工 pandemic 无症状感染者 解除医学观察 本土新增归零 新增确诊病例 检测 阳性 美国 新增确诊 欧洲 新冠 俄罗斯 新增 德国 新增 美国 累计确诊 疫情防控 企业复工 复工复产 疫情防控 新冠疫情防控 疫情反弹 新冠 开学 本土 确诊 境外 输入 新冠疫苗 新冠状疫苗 疫情防控工作提案 无症状感染者 新冠隔离点 小区封闭管理 疫情反弹 核酸检测 中国新冠疫苗 疫苗 全球公共产品 全球合作抗疫 遏制疫情 新冠 流调 流调轨迹 密切接触者 新冠 密接者 新冠灭活疫苗 疫苗 临床试验 公共卫生应急 小微企业纾困 核酸检测 高校学生 返校 病例治愈率 新冠 疫情 应急响应 下调 重点区域 监测 重点人群 监测 应急响应 二级 应急响应 三级 三级防疫 二级防疫 抗击疫情 核酸 阴性 新增本土病例 核酸筛查 健康宝 健康码 北京健康宝 津心办天津健康码 山西健康码 随申码 苏康码 苏城码 安康码 防疫健康信息码 赣通码 山东电子健康通行卡 渝康码 西安一码通 甘肃复工卡 低风险地区 电影院 恢复营业 影院 复工 咽拭子 新冠疫苗 临床试验 境外输入 无症状感染者 隔离治疗 新冠疫苗实施计划 国产疫苗 境外输入病例 疫情 分级管控 本地新增病例 高风险地区 市场 绿码 境外输入病例 2019.12 不明肺炎病症 新冠肺炎可以人传人 口罩 抢购 首例 柳叶刀 首例 华南海鲜 首例 武汉卫健委 中西医结合医院 反常病例 上海公共卫生临床中心 武汉 不明发热 中西医结合医院 流行病学调查 国家卫健委 专家 武汉 肺炎 27例 病毒性肺炎 未发现明显的人传人 2020.01 指定感染症 肯定能人传人 柳叶刀 海鲜市场 新型冠状病毒感染的肺炎治疗方案 警告 2级 检测试剂盒 who 专家 武汉 肺炎疫情防控指挥部 一省包一市 万家宴 新冠病毒 人传人 泰国 首例 英国 旅行风险提示 美国 旅行风险提示 突发公共卫生事件二级响应 卫健委 王广发 核酸检测 李文亮 训诫书 病毒性肺炎 卫健委 马晓伟 8名 武汉 肺炎 谣言 武汉协和医院 张继先 上海公共卫生临床中心 N95 大众畜牧野味店 华南野生市场 华南海鲜市场 管轶 武汉病毒所 CDC 中国疾病预防控制中心 疾控中心 不明原因 发热 双黄连 抢购 双黄连 售磬 武汉卫健委 湖北卫健委 武汉 封城 火神山 雷神山 钟南山 协和医院 李文亮 医生 蒋超良 李文亮 千里投毒 武汉病毒研究 周佩仪 2020.02 刘智明 去世 驰援武汉 柳帆 去世 应勇 书记 王忠林 书记 蒋超良 马国强 书记 小区 封闭式管理 湖北红十字会 韩国 首例 钻石公主号 阿比朵儿 达芦那韦 疫情上报第一人 四类人员 英国 首例 李文亮 去世 NCP 李兰娟 方舱医院 轻症患者 双黄连 抢购 双黄连 售磬 方舱医院 N95 大众畜牧野味店 华南野生市场 华南海鲜市场 管轶 Guan Yi 武汉病毒所 CDC 中国疾病预防控制中心 疾控中心 不明原因 发热 双黄连 抢购 双黄连 售磬 武汉卫健委 湖北卫健委 武汉 封城 火神山 雷神山 钟南山 瑞德西韦 高福 王延轶 舒红兵 李文亮 医生 云监工 瑞德西韦 黄冈 感染者 孝感 感染者 晋江毒王 超级传播者 张晋 卫健委 张晋 卫生将康委员会 刘英姿 卫健委 
刘英姿 卫生健康委员会 王贺胜 卫健委 王贺胜 卫生健康委员会 延迟开学 开学时间 不得早于 管轶 火线提拔 干部 柳叶刀 非自然起源 杜显圣 2020.03 武汉 零增长 武汉 解除 措施 撤销 李文亮 首批 开学 群体免疫 谭德塞 新冠 至尊公主号 梅仲明 新天地教会 黄某英 离汉返京 钻石公主号 江学庆 湖北 监狱防控 延迟开学 开学时间 不得早于 方舱医院 N95 大众畜牧野味店 华南野生市场 华南海鲜市场 管轶 武汉病毒所 CDC 中国疾病预防控制中心 疾控中心 不明原因 发热 双黄连 抢购 双黄连 售磬 武汉卫健委 湖北卫健委 武汉 封城 高考 延期一个月 2020.04 绥芬河 确诊 绥芬河 新冠 绥芬河 方舱 雷神山 关闭 武汉 解封 郭某鹏 判处 首批 开学 群体免疫 谭德塞 新冠 方舱医院 N95 大众畜牧野味店 华南野生市场 华南海鲜市场 管轶 武汉病毒所 CDC 中国疾病预防控制中心 疾控中心 不明原因 发热 双黄连 AND 抢购 双黄连 AND 售磬 武汉卫健委 湖北卫健委 2020.05 黑龙江 新增 全球确诊 325万 俄总理 阳性 新疆生产建设兵团 确诊 应急响应 二级 全球 累计确诊 湖北一级应急响应 二级 不建议长途旅行 不建议较多人聚会 影剧院游泳场馆 暂不开放 确诊病例 连续下降 联防联控机制 确诊病例 500 舒兰 疫情 舒兰 新冠 舒兰 聚集性 吉林 疫情 吉林 封闭管理 高校中小学 开学 非毕业年级 返校 两会 疫情防控 常态化防疫 常态化防疫机制 2020.06 突发公共卫生应急响应 新发地 病毒 新发地 感染 新发地 新型冠状 新发地 新冠 病例活动轨迹 新冠 生鲜冷冻肉品市场 高考 戴口罩 2020.07 AKB48 确诊 31省 新增 22例 累计确诊 700万 熔断指令 乌鲁木齐市 新增确诊 大连市 战时状态 甘井子区 高风险 印度 确诊 印度尼西亚 确诊 解除风控管理 进口冻虾 新冠 低风险地区 核酸证明 应急响应 三级 2020.08 大连 新冠 大连 聚集性 大连 病例 大连 确诊 大连 感染 累计核酸检测 新发地 科技战疫成果 2020.09 战役英雄 新冠疫苗 临床试验 境外输入 无症状感染者 隔离治疗 战役英雄 2020.10 新冠疫苗实施计划 青岛 感染者 青岛 疫情 青岛 新冠 青岛 病毒 青岛 新型冠状 青岛 病例 青岛 核酸检测 喀什 感染 喀什 疫情 喀什 病例 喀什 新冠 2020.11 天津 新增 天津 病例 天津 确诊 进口冷链食品 消毒 进口冷链食品 可追溯 冬季 疫情 零星 冬季 新冠 感染 2020.12 成都 确诊 女孩 成都 感染病例 四川 新增 病例 黑龙江 新增 病例 成都 核酸检测 北京 新增 病例 绥芬河 封闭管理 冷链食品 核酸检测 北京 境外输入 聚集性场所 测温验码 聚集 人员密度 经营场所 戴口罩 高风险人群 接种疫苗 重点场所 戴口罩 新增 关联病例 新冠 HPV疫苗 无症状感染者 顺义 一级防控 新冠 医保名录 印尼 输入病例 金马工业区 感染 新疫苗上市 免费
English Common Coronavirus Cov-19 New Coronavirus Number of infections N95 #2019nCoV #nCoV PHEIC Outbreak Coronavirus Infection Cases Wuhan Closed City Wuhan Quarantine Home Quarantine Quarantine 14 days Incubation period 14 days International Public Health Emergency Resumption of work Small and Medium Enterprises Distress Wuhan Deaths Cases Wuhan Infected Cases Hubei Deaths Hubei Infected Cases China Deaths China Infected Cases Incubation Period Beijing Cases Tianjin Cases Hebei Cases Liaoning Cases Shanghai Cases Jiangsu Cases Zhejiang Cases Fujian Cases Shandong Cases Guangdong Cases Hainan Cases Shanxi Cases Inner Mongolia Cases Jilin  Case Heilongjiang Case Anhui Case Jiangxi Case Henan Case Hubei Case Hunan Case Guangxi Case Sichuan Case Guizhou Case Yunnan Case Tibet Case Shaanxi Case Gansu Case Qinghai Case Ningxia Case Xinjiang Case Hong Kong Case Macau Case Taiwan Case ECOM sars- cov-2 nucleic acid testing COVID-19 2019-nCoV Suspected cases Asymptomatic Cumulative cases Cumulative cures Cumulative cures Health code Return to school United States Case Spain Case Singapore Case Canada Case United Kingdom Case India Case Japan Case Korea Case Germany Case France Case Italy Case Added Case War epidemic Anti-epidemic Patients with fever Delayed start of school Start of school must not be earlier than Cumulative number of deaths Suspected cases Wuhan Pneumonia Novel pneumonia Out of home Wear mask New pneumonia #2019nCoV Novel pneumonia AND Deaths Novel pneumonia AND Infection Businesses return to work pandemic Asymptomatic infected Persons released from medical observation Local New to zero New confirmed Case Detection Positive US New confirmed Europe New crown Russia New Germany New US Cumulative confirmed Epidemic prevention and control Corporate return to work Resumption of work and production Epidemic prevention and control New crown Epidemic prevention and control Epidemic rebound New crown Start of school Local Confirmed Diagnosis Offshore Import New 
crown vaccine New crown vaccine Epidemic prevention and control work proposal Asymptomatic infected New crown quarantine site Small area closed management outbreak rebound nucleic acid testing china new crown vaccine vaccine global public goods global cooperation against the epidemic containment new crown influx influx trajectory close contacts new crown close contacts new crown inactivated vaccine vaccine clinical trials public health emergency small and microenterprise relief nucleic acid testing high school students return to school case cure rate new crown outbreak emergency response downgrade Key areas Surveillance Key populations Surveillance Emergency response Level 2 Emergency response Level 3 Level 3 Prevention Level 2 Prevention Fight against the epidemic Nucleic acid Negative New native cases Nucleic acid screening Health Bao Health code Beijing Health Bao Jinxin Office Tianjin Health code Shanxi Health code Su Shen code Su Kang code Su Cheng code An Kang code Epidemic prevention Health information code Gan Tong code Shandong Electronic health pass Card Yucang Code Xi’an One Code Pass Gansu Resumption of Work Card Low Risk Areas Cinema Resumption of Business Cinema Resumption of Work Pharyngeal Swabs New Crown Vaccine Clinical Trials Offshore Importation Asymptomatic Infected Persons Isolation Treatment New Crown Vaccine Implementation Plan Domestic Vaccine Offshore Importation Cases Epidemic Grading Control Local New Cases High Risk Areas Market Green Code Offshore Importation Cases  2019.12 Unspecified pneumonia disease New crown pneumonia can be human-to-human transmission Mouthpiece rush First case Lancet First case South China seafood First case Wuhan Health Care Commission Chinese and Western Medicine Hospital Anomalous cases Shanghai Public Health Clinical Center Wuhan Unspecified fever Chinese and Western Medicine Hospital Epidemiological investigation National Health Care Commission Experts Wuhan Pneumonia 27 cases Viral pneumonia Not found 
Apparent human-to-human transmission 2020.01 Designated infectious disease Definitely can be human-to-human transmission Lancet Seafood market Pneumonia treatment protocol for novel coronavirus infection Warning Level 2 Detection kit who Expert Wuhan Pneumonia outbreak prevention and control command One province package One city Wanjia banquet New coronavirus Human-to-human transmission Thailand First case United Kingdom Travel risk alert United States Travel risk alert Level 2 response to public health emergencies Health and Welfare Commission Wang Guangfa Nucleic acid testing Li Wenliang Admonition Book Viral pneumonia Health and Welfare Commission Ma Xiaowei 8 Wuhan Pneumonia Rumors Wuhan Union Hospital Zhang Jixian Shanghai Public Health Clinical Center N95 Mass Animal Husbandry Game Store South China Wild Market South China Seafood Market Guan Yi Wuhan Virus Institute CDC China Center for Disease Control and Prevention Control Center CDC unexplained fever diflucan snapping up diflucan selling chime Wuhan Health and Health Commission Hubei Health and Health Commission Wuhan Seal City Vulcan Mountain Thunder God Mountain Zhong Nanshan Concord Hospital Li Wenliang doctor Jiang Chaoliang Li Wenliang thousand miles poisoning Wuhan Virus Research Zhou Peiyi 2020.02 Liu Zhiming passed away Chikyu Wuhan Liu Fan passed away Ying Yong secretary Wang Zhonglin Clerks Jiang Chaoliang Ma Guoqiang Clerks Cell Closed management Hubei Red Cross Society South Korea First case Diamond Princess Abidor Darunavir First person to report the epidemic Four categories of people UK First case Li Wenliang Deceased NCP Li Lanjuan Fang Cabin Hospital Minor patients Shuanghuanglian Rush Shuanghuanglian Sell chime Fang Cabin Hospital N95 Mass livestock game store South China Wild Market South China Seafood Market Guan Yi Guan Yi Wuhan Institute of Virus CDC China Center for Disease Control and Prevention CDC unexplained fever Shuanghuanglian Grabbing Shuanghuanglian Selling Chime Wuhan 
Health and Health Commission Hubei Health and Health Commission Wuhan Fengcheng Vulcan Mountain Thunder God Mountain Zhong Nanshan Ridcicevir Gao Fu Wang Yangyi Shu Hongbing Li Wenliang Doctor Yun Supervisor Ridciclovir Huanggang Infected Xiaogan Infected Jinjiang Toxic King Super Spreader Zhang Jin Health and Health Commission Zhang Jin Health will Health Commission Liu Yingzi Health and Health Commission Liu Yingzi Health and Health Commission Wang Hesheng Health and Health Commission Wang Hesheng Health and Health Commission Delayed start of school Start of school shall not be earlier than Guangyi Fire line promotion Cadres Lancet Unnatural origin Du Xian Sheng  2020.03 Wuhan Zero Growth Wuhan Dismissal Measures Withdrawal Li Wenliang First Start of School Herd Immunity Tan Desai New Crown Supreme Princess Mei Zhongming Xintiandi Church Huang Moying Leaving Han to Return to Beijing Diamond Princess Jiang Xueqing Hubei Prison Prevention and Control Delayed Start of School Start of School No Earlier than Square Cabin Hospital N95 Mass Animal Husbandry Wild Game Store South China Wild Market South China Seafood Market Guan Yi Wuhan Virus Institute CDC China Center for Disease Control and Prevention CDC Unexplained fever Shuanghuanglian snatch Shuanghuanglian sell chime Wuhan Health and Health Commission Hubei Health and Health Commission Wuhan Seal of the city College entrance exams Delayed by one month 2020.04 Suifenhe Confirmed Suifenhe Xinguan Suifenhe Square Cabin Thunder God Hill Close Wuhan Unblocked Guo A Peng sentenced First batch Opening of school Herd immunization Tan Desai New Guan Fangcao Hospital N95 Popular animal husbandry wild game store South China wild market South China seafood market Guan Yi Wuhan Virus Institute CDC China Center for Disease Control and Prevention CDC Unidentified fever Shuanghuanglian AND Rush Shuanghuanglian AND Chime Sale Wuhan Health and Health Commission Hubei Health and Health Commission   2020.05 Heilongjiang New Global 
confirmed 3.25 million Russian Prime Minister Positive Xinjiang Production and Construction Corps Confirmed Emergency Response Level 2 Global Cumulative confirmed Hubei Level 1 Emergency Response Level 2 Not recommended for long distance travel Not recommended for large gatherings Theater and swimming pool not open for the time being Confirmed cases continuously declining Joint prevention and control mechanism Confirmed cases 500 Shulan Epidemic Shulan New crown Shulan Clustered Jilin Epidemic Jilin Closed management High schools Primary and secondary schools Opening of school Non-graduating grades Returning to school Two meetings Epidemic prevention and control Standing prevention and control Standing prevention and control mechanism 2020.06 Emergency response for sudden public health Emerging land Virus Emerging land Infection Emerging land New crown Emerging land New crown New crown Case activity trajectory New crown Fresh and frozen meat market High school entrance exams Wearing masks 2020.07 AKB48 Confirmed 31 provinces New 22 cases Cumulative confirmed 7 million Meltdown directive Urumqi New confirmed Dalian Wartime status Ganjingzi District High risk India Confirmed Indonesia Confirmed Dismantled wind control management Imported frozen shrimp New crown Low risk area Nuclear acid proof Emergency Response Level 3 2020.08 Dalian New crown Dalian Cluster Dalian Cases Dalian Confirmed Dalian Infected Cumulative Nucleic Acid Testing Newly Developed Scientific and Technological Warfare Epidemic Results 2020.09 Battle Hero New crown vaccine Clinical trials Offshore importation Asymptomatic infected persons Isolation treatment Battle Hero 2020.10 New crown vaccine implementation plan Qingdao Infected Qingdao Epidemic Qingdao New crown Qingdao Virus Qingdao New crown Qingdao Cases Qingdao Nucleic acid testing Kashgar Infection Kashgar Epidemic Kashgar Cases Kashgar New crown 2020.11 Tianjin New Tianjin Cases Tianjin Confirmed Imported cold chain food Disinfection 
Imported cold chain food Traceable Winter Epidemic Sporadic Winter New crown Infection 2020.12 Chengdu Confirmed Girls Chengdu Infected cases Sichuan New cases Heilongjiang New cases Chengdu Nucleic acid testing Beijing New cases Suifenhe Closed management Cold chain food Nucleic acid testing Beijing Imported from abroad Clustered places Temperature testing code Clustered people density Operating places Masking High risk groups Vaccination Key sites Wearing masks New Associated cases New crown HPV vaccine Asymptomatic infected Shunyi Primary prevention and control New crown Medical insurance list Indonesia Imported cases Jinma industrial zone Infection New vaccine launch Free

 

COVID-19 Cases by Region

Data source: Statista

Region Currently confirmed Cumulative confirmed Deaths Recovered
Total 7,303 111,147 4,955 98,889
Shanghai 66 2,095 7 2,022
Guangdong 64 2,455 8 2,383
Zhejiang 39 1,364 1 1,324
Fujian 32 621 1 588
Sichuan 30 1,025 3 992
Shaanxi 21 616 3 592
Yunnan 16 352 2 334
Jiangsu 10 726 0 716
Liaoning 9 425 2 414
Tianjin 9 393 3 381
Hunan 8 1,051 4 1,039
Chongqing 6 598 6 586
Guangxi 5 275 2 268
Shanxi 5 253 0 248
Beijing 4 1,059 9 1,046
Shandong 3 883 7 873
Henan 3 1,315 22 1,290
Inner Mongolia 3 387 1 383
Heilongjiang 2 1,612 13 1,597
Hubei 2 68,159 4,512 63,645
Anhui 2 1,004 6 996
Macao 2 51 0 49
Hainan 1 188 6 181
Ningxia 1 76 0 75
Gansu 0 194 2 192
Jilin 0 573 3 570
Hebei 0 1,317 7 1,310
Xinjiang 0 980 3 977
Jiangxi 0 937 1 936
Guizhou 0 147 2 145
Qinghai 0 18 0 18
Tibet 0 1 0 1
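A quick sanity check on the table: the case-fatality rate (deaths divided by cumulative confirmed) can be computed directly from its rows. A minimal sketch using three rows copied from the table above:

```python
# Cumulative confirmed cases and deaths, copied from the table above
cases = {
    "Total":   {"confirmed": 111147, "deaths": 4955},
    "Hubei":   {"confirmed": 68159,  "deaths": 4512},
    "Beijing": {"confirmed": 1059,   "deaths": 9},
}

def case_fatality_rate(region):
    """Deaths as a percentage of cumulative confirmed cases."""
    row = cases[region]
    return round(100 * row["deaths"] / row["confirmed"], 2)

for region in cases:
    print(region, case_fatality_rate(region), "%")
```

Hubei's rate is visibly higher than the national figure, consistent with the province bearing most of the early outbreak.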

 

Power of Advanced Hyperparameter Search

Simple experiments show the benefit of an advanced tuning technique. Below is a recent experiment run on a BERT model from Hugging Face Transformers on the RTE dataset. Evolutionary techniques such as Population Based Training (PBT) can yield large performance improvements over standard hyperparameter optimization methods.

Algorithm Best Val Acc. Best Test Acc. Total GPU min Total $ cost
Grid Search 74% 65.40% 45 min $2.30
Bayesian Optimization +Early Stop 77% 66.90% 104 min $5.30
Population-based Training 78% 70.50% 48 min $2.45
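The intuition behind PBT can be shown with a toy, self-contained loop; this is a conceptual sketch, not Ray Tune's implementation, and the objective function stands in for validation accuracy:

```python
import random

def pbt(objective, population=8, steps=30, seed=0):
    """Toy Population Based Training (PBT) over one hyperparameter.

    Each step every member is scored; then the bottom quartile copies
    the top quartile's hyperparameter (exploit) and perturbs it by a
    factor of 0.8 or 1.2 (explore).
    """
    rng = random.Random(seed)
    pop = [{"h": rng.uniform(0.01, 1.0)} for _ in range(population)]
    quartile = max(1, population // 4)
    for _ in range(steps):
        for member in pop:
            member["score"] = objective(member["h"])
        pop.sort(key=lambda m: m["score"], reverse=True)
        for loser, winner in zip(pop[-quartile:], pop[:quartile]):
            loser["h"] = winner["h"] * rng.choice([0.8, 1.2])  # exploit + explore
    for member in pop:
        member["score"] = objective(member["h"])
    return max(pop, key=lambda m: m["score"])

# Stand-in "validation accuracy", peaked at h = 0.3
best = pbt(lambda h: -(h - 0.3) ** 2)
print(round(best["h"], 3))
```

Because exploitation happens during training rather than between full runs, PBT can match or beat Bayesian optimization at roughly the cost of a single grid search, as the table above suggests.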

 

Hyperparameter Search Example

For how to do a hyperparameter search with Transformers, please see here or the Colab version of the same notebook.

# ! pip install optuna
# ! pip install ray[tune]

from transformers import AutoModelForSequenceClassification, Trainer

# model_checkpoint, num_labels, args, encoded_dataset, validation_key,
# tokenizer, and compute_metrics are defined earlier in the notebook.
def model_init():
    # Re-instantiate the model so every trial starts from the same pretrained weights
    return AutoModelForSequenceClassification.from_pretrained(model_checkpoint, num_labels=num_labels)

# Search on 1/10 of the training data to keep each trial cheap
train_dataset = encoded_dataset["train"].shard(index=1, num_shards=10)

trainer = Trainer(
    model_init=model_init,
    args=args,
    train_dataset=train_dataset,
    eval_dataset=encoded_dataset[validation_key],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

best_run = trainer.hyperparameter_search(n_trials=10, direction="maximize")
best_run

Weibo Emoji Classification Details

I used green to highlight the positive emoji and orange to highlight the negative ones.

2019-12 2020-01 2020-02 2020-03
[(‘[允悲]’, 224), [(‘[泪]’, 37142), [(‘[心]’, 777444), [(‘[心]’, 65403),
(‘[泪]’, 141), (‘[心]’, 31973), (‘[泪]’, 66487), (‘[允悲]’, 50406),
(‘[心]’, 127), (‘[允悲]’, 22045), (‘[允悲]’, 59996), (‘[泪]’, 37500),
(‘[二哈]’, 111), (‘[加油]’, 17002), (‘[加油]’, 36181), (‘[二哈]’, 28140),
(‘[笑cry]’, 72), (‘[微笑]’, 14563), (‘[二哈]’, 31622), (‘[doge]’, 27914),
(‘[跪了]’, 72), (‘[感冒]’, 10588), (‘[微笑]’, 27695), (‘[微笑]’, 21579),
(‘[微笑]’, 69), (‘[悲伤]’, 10197), (‘[doge]’, 25548), (‘[笑cry]’, 18792),
(‘[doge]’, 62), (‘[跪了]’, 9929), (‘[笑cry]’, 20687), (‘[加油]’, 17564),
(‘[感冒]’, 62), (‘[二哈]’, 8422), (‘[悲伤]’, 19069), (‘[跪了]’, 16806),
(‘[嘻嘻]’, 50), (‘[doge]’, 7833), (‘[跪了]’, 17705), (‘[嘻嘻]’, 15048),
(‘[摊手]’, 50), (‘[失望]’, 6802), (‘[摊手]’, 15886), (‘[摊手]’, 14334),
(‘[并不简单]’, 47), (‘[摊手]’, 6343), (‘[嘻嘻]’, 15114), (‘[哈哈]’, 11853),
(‘[加油]’, 42), (‘[笑cry]’, 6127), (‘[感冒]’, 14479), (‘[太开心]’, 11137),
(‘[悲伤]’, 40), (‘[拳头]’, 5774), (‘[爱你]’, 14043), (‘[喵喵]’, 11096),
(‘[鼓掌]’, 39), (‘[怒]’, 4939), (‘[拳头]’, 13987), (‘[爱你]’, 11054),
(‘[太开心]’, 39), (‘[作揖]’, 4652), (‘[失望]’, 13070), (‘[悲伤]’, 9753),
(‘[哈哈]’, 36), (‘[嘻嘻]’, 4641), (‘[太阳]’, 12794), (‘[吃瓜]’, 8644),
(‘[鲜花]’, 33), (‘[伤心]’, 4226), (‘[哈哈]’, 12095), (‘[微风]’, 8558),
(‘[失望]’, 33), (‘[good]’, 4208), (‘[武汉加油]’, 11345), (‘[鼓掌]’, 8335),
(‘[拜拜]’, 31)] (‘[爱你]’, 4128)] (‘[喵喵]’, 11292)] (‘[费解]’, 8205)]

 

2020-04 2020-05 2020-06 2020-07
[(‘[心]’, 43519), [(‘[允悲]’, 21520), [(‘[允悲]’, 17602), [(‘[允悲]’, 15626),
(‘[允悲]’, 40006), (‘[心]’, 19709), (‘[泪]’, 14229), (‘[泪]’, 13358),
(‘[泪]’, 31800), (‘[泪]’, 15199), (‘[心]’, 13348), (‘[心]’, 11371),
(‘[doge]’, 21986), (‘[二哈]’, 11187), (‘[二哈]’, 8825), (‘[二哈]’, 7874),
(‘[二哈]’, 20157), (‘[doge]’, 11071), (‘[doge]’, 7811), (‘[doge]’, 7784),
(‘[笑cry]’, 15406), (‘[笑cry]’, 8443), (‘[笑cry]’, 6444), (‘[怒]’, 6241),
(‘[微笑]’, 15390), (‘[微笑]’, 7554), (‘[跪了]’, 6325), (‘[笑cry]’, 5835),
(‘[跪了]’, 13616), (‘[跪了]’, 7169), (‘[微笑]’, 6070), (‘[跪了]’, 5756),
(‘[蜡烛]’, 13433), (‘[嘻嘻]’, 6379), (‘[加油]’, 4998), (‘[微笑]’, 5323),
(‘[加油]’, 12496), (‘[加油]’, 5892), (‘[嘻嘻]’, 4710), (‘[加油]’, 5098),
(‘[嘻嘻]’, 11252), (‘[哈哈]’, 5299), (‘[摊手]’, 4310), (‘[嘻嘻]’, 3993),
(‘[摊手]’, 10778), (‘[摊手]’, 4970), (‘[悲伤]’, 3981), (‘[悲伤]’, 3968),
(‘[爱你]’, 8670), (‘[爱你]’, 4744), (‘[失望]’, 3715), (‘[摊手]’, 3619),
(‘[哈哈]’, 8526), (‘[太开心]’, 4534), (‘[哈哈]’, 3529), (‘[哈哈]’, 3269),
(‘[悲伤]’, 8506), (‘[悲伤]’, 4160), (‘[爱你]’, 3071), (‘[失望]’, 3105),
(‘[喵喵]’, 8179), (‘[喵喵]’, 4106), (‘[太开心]’, 2996), (‘[吐]’, 2988),
(‘[太开心]’, 7899), (‘[失望]’, 3696), (‘[喵喵]’, 2875), (‘[爱你]’, 2845),
(‘[费解]’, 7135), (‘[鲜花]’, 3625), (‘[拜拜]’, 2528), (‘[喵喵]’, 2831),
(‘[话筒]’, 7133), (‘[鼓掌]’, 3483), (‘[鼓掌]’, 2475), (‘[太开心]’, 2742),
(‘[微风]’, 6988)] (‘[微风]’, 3454)] (‘[感冒]’, 2461)] (‘[鼓掌]’, 2396)]

 

2020-08 2020-09 2020-10 2020-11 2020-12
[(‘[允悲]’, 12922), [(‘[心]’, 24959), [(‘[允悲]’, 12953), [(‘[允悲]’, 9811), [(‘[心]’, 21829),
(‘[心]’, 11718), (‘[泪]’, 11746), (‘[doge]’, 12226), (‘[泪]’, 9008), (‘[泪]’, 18786),
(‘[泪]’, 11568), (‘[允悲]’, 10558), (‘[心]’, 9722), (‘[心]’, 7145), (‘[允悲]’, 14184),
(‘[doge]’, 6988), (‘[doge]’, 5468), (‘[泪]’, 9178), (‘[doge]’, 4893), (‘[话筒]’, 10401),
(‘[二哈]’, 6712), (‘[二哈]’, 5374), (‘[二哈]’, 6405), (‘[二哈]’, 4877), (‘[二哈]’, 7633),
(‘[跪了]’, 5234), (‘[给你小心心]’, 4623), (‘[笑cry]’, 5709), (‘[笑cry]’, 3745), (‘[doge]’, 7207),
(‘[笑cry]’, 5049), (‘[微笑]’, 4513), (‘[跪了]’, 3870), (‘[跪了]’, 3726), (‘[跪了]’, 6839),
(‘[微笑]’, 4917), (‘[加油]’, 4402), (‘[微笑]’, 3722), (‘[话筒]’, 3314), (‘[裂开]’, 6404),
(‘[嘻嘻]’, 3823), (‘[笑cry]’, 3985), (‘[打call]’, 3390), (‘[微笑]’, 3069), (‘[悲伤]’, 6099),
(‘[悲伤]’, 3411), (‘[跪了]’, 3906), (‘[嘻嘻]’, 3204), (‘[悲伤]’, 2744), (‘[加油]’, 5639),
(‘[加油]’, 3222), (‘[打call]’, 3786), (‘[喵喵]’, 3089), (‘[打call]’, 2542), (‘[微笑]’, 5523),
(‘[摊手]’, 3081), (‘[赞]’, 3590), (‘[赞]’, 3013), (‘[失望]’, 2345), (‘[笑cry]’, 5465),
(‘[失望]’, 3011), (‘[悲伤]’, 3460), (‘[加油]’, 2961), (‘[嘻嘻]’, 2217), (‘[打call]’, 5110),
(‘[哈哈]’, 2999), (‘[嘻嘻]’, 2993), (‘[哈哈]’, 2897), (‘[加油]’, 2192), (‘[失望]’, 4507),
(‘[太开心]’, 2604), (‘[good]’, 2941), (‘[悲伤]’, 2801), (‘[摊手]’, 2130), (‘[嘻嘻]’, 3715),
(‘[爱你]’, 2598), (‘[怒]’, 2596), (‘[摊手]’, 2597), (‘[拳头]’, 2089), (‘[感冒]’, 3647),
(‘[喵喵]’, 2406), (‘[摊手]’, 2506), (‘[失望]’, 2235), (‘[哈哈]’, 1806), (‘[爱你]’, 3557),
(‘[赞]’, 2241), (‘[鲜花]’, 2438), (‘[爱你]’, 2105), (‘[赞]’, 1795), (‘[摊手]’, 3543),
(‘[拜拜]’, 2146), (‘[中国赞]’, 2429), (‘[good]’, 2067), (‘[感冒]’, 1679), (‘[赞]’, 3535),
(‘[good]’, 2101)] (‘[爱你]’, 2366)] (‘[话筒]’, 2030)] (‘[爱你]’, 1605)] (‘[给你小心心]’, 3391)]
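These monthly counts can be collapsed into a single positive-to-negative ratio per month. A minimal sketch using the December 2019 counts; the polarity sets below are illustrative stand-ins for the green/orange classification above, not the exact labels used in the project:

```python
# Illustrative polarity sets (the real classification is the green/orange
# highlighting in the tables above)
POSITIVE = {"[心]", "[嘻嘻]", "[太开心]", "[哈哈]", "[鼓掌]", "[鲜花]", "[加油]"}
NEGATIVE = {"[允悲]", "[泪]", "[悲伤]", "[失望]", "[感冒]", "[跪了]"}

# Top-10 emoji counts for 2019-12, copied from the table above
dec_2019 = {"[允悲]": 224, "[泪]": 141, "[心]": 127, "[二哈]": 111,
            "[笑cry]": 72, "[跪了]": 72, "[微笑]": 69, "[doge]": 62,
            "[感冒]": 62, "[嘻嘻]": 50}

def pos_neg_ratio(counts):
    """Ratio of positive to negative emoji occurrences; unlabeled emoji are ignored."""
    pos = sum(n for e, n in counts.items() if e in POSITIVE)
    neg = sum(n for e, n in counts.items() if e in NEGATIVE)
    return round(pos / neg, 3)

print(pos_neg_ratio(dec_2019))
```

Applying the same function to each month's counts gives a compact sentiment trajectory across the whole period.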

 

Generating Interactive Maps With Vaex

# df is a vaex DataFrame with pickup_longitude / pickup_latitude columns
df.plot_widget(df.pickup_longitude,
               df.pickup_latitude,
               shape=512,
               limits='minmax',
               f='log1p',
               colormap='plasma')
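Under the hood, plot_widget renders a binned density heatmap: points are counted on a shape x shape grid and the counts are log1p-scaled (the `f='log1p'` argument). The same binning can be sketched with NumPy on hypothetical pickup coordinates:

```python
import numpy as np

# Hypothetical pickup coordinates; vaex bins real points the same way
rng = np.random.default_rng(0)
lon = rng.normal(-73.97, 0.02, 10_000)
lat = rng.normal(40.75, 0.02, 10_000)

# 512x512 grid over the data's min/max extent ('limits=minmax'),
# with log1p-scaled counts ('f=log1p')
counts, xedges, yedges = np.histogram2d(lon, lat, bins=512)
density = np.log1p(counts)
print(density.shape)
```

vaex performs this aggregation out-of-core, which is what makes the widget interactive even on datasets too large for memory.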

Scraping Weibo Images With Tunnel Agent (Code Example)

import urllib.request
import ssl

# Skip certificate verification for the proxy tunnel
ssl._create_default_https_context = ssl._create_unverified_context
tunnel = "tps191.kdlapi.com:15818"
username = "t12104098323179"
password = "whf83boc"
proxies = {
    "http": "http://%(user)s:%(pwd)s@%(proxy)s/" % {"user": username, "pwd": password, "proxy": tunnel},
}

# target web
target_url = "https://dev.kdlapi.com/testproxy"

# send requests through the tunnel proxy
proxy_support = urllib.request.ProxyHandler(proxies)
opener = urllib.request.build_opener(proxy_support)
urllib.request.install_opener(opener)
response = urllib.request.urlopen(target_url)

if response.code == 200:
    print(response.read().decode("utf-8"))
import urllib.request
import uuid

for index, row in df_picture_sample.iterrows():
    print(len(row["weibo_pic"]), index)
    try:
        for image_url in row["weibo_pic"]:
            # random suffix avoids overwriting when one post has several images
            filename = "{}_{}.jpg".format(row["weibo_id"], uuid.uuid4().hex[:8])
            if row["sentiment"] == 0:
                urllib.request.urlretrieve(image_url, "./negative_weibo_pic/{}".format(filename))
            elif row["sentiment"] == 1:
                urllib.request.urlretrieve(image_url, "./neutral_weibo_pic/{}".format(filename))
            elif row["sentiment"] == 2:
                urllib.request.urlretrieve(image_url, "./positive_weibo_pic/{}".format(filename))
    except Exception as e:
        print(e)
        continue

 

Chinese Speech Recognition Sample Usage

import torch
import torchaudio
from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

test_dataset = load_dataset("common_voice", "zh-CN", split="test")

processor = Wav2Vec2Processor.from_pretrained("ydshieh/wav2vec2-large-xlsr-53-chinese-zh-cn-gpt")
model = Wav2Vec2ForCTC.from_pretrained("ydshieh/wav2vec2-large-xlsr-53-chinese-zh-cn-gpt")

# Common Voice audio is 48 kHz; the model expects 16 kHz
resampler = torchaudio.transforms.Resample(48_000, 16_000)

# Preprocessing the datasets.
# We need to read the audio files as arrays
def speech_file_to_array_fn(batch):
    speech_array, sampling_rate = torchaudio.load(batch["path"])
    batch["speech"] = resampler(speech_array).squeeze().numpy()
    return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)
inputs = processor(test_dataset[:2]["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)

with torch.no_grad():
    logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits

predicted_ids = torch.argmax(logits, dim=-1)

print("Prediction:", processor.batch_decode(predicted_ids))
print("Reference:", test_dataset[:2]["sentence"])
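Predictions like these are typically scored with character error rate (CER): the character-level edit distance divided by the reference length. A minimal sketch, not part of the model card, using a hypothetical reference/prediction pair:

```python
def cer(reference, hypothesis):
    """Character error rate: Levenshtein distance / reference length."""
    r, h = list(reference), list(hypothesis)
    # dp[j] holds the edit distance between the processed prefix of r and h[:j]
    dp = list(range(len(h) + 1))
    for i, rc in enumerate(r, 1):
        prev, dp[0] = dp[0], i
        for j, hc in enumerate(h, 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,          # deletion
                        dp[j - 1] + 1,      # insertion
                        prev + (rc != hc))  # substitution or match
            prev = cur
    return dp[-1] / max(len(r), 1)

# Hypothetical pair: three substituted characters out of eleven
print(cer("宋朝末年年间定居粉岭围", "宋朝末年年间定居分定为"))
```

Averaging this over the test split gives the standard headline metric reported for Chinese wav2vec2 checkpoints.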

 

Events Timeline References (Chinese)

武汉新冠疫情爆发一周年,关键节点回顾 (One year after the Wuhan COVID-19 outbreak: a review of the key turning points)

疫情与舆情:武汉新冠肺炎时间线 (Epidemic and public opinion: a timeline of COVID-19 in Wuhan)

新冠病毒疫情何以发展至今?这里是一份时间表 (How did the coronavirus epidemic reach this point? Here is a timeline)

时间轴:武汉“封城”的76天 (Timeline: the 76 days of Wuhan's lockdown)

 
