Generalists and Specialists: How Language Barriers Affect Activity Diversity on Reddit¶
We study how the language of a user or subreddit relates to its GS-score. This work extends Waller and Anderson's paper on using GS-scores to measure activity diversity on Reddit. Following the paper, we use "community" and "subreddit" interchangeably, and "user" to mean a Redditor.
Abstract¶
Online communities are increasingly multilingual, and Reddit is no exception. However, most users and communities on Reddit are English-speaking, which raises the question of how language barriers affect a user's activity diversity. We study user activity in the top 5 000 subreddits by activity, across 900 000 users and over 40 languages, from January 2019 to June 2021. We define a user's language as the language they use most often in their comments and submissions, and define a subreddit's language from its comments and submissions in the same way. Using community and user GS-scores (Waller and Anderson 2019), we examine the relationship between the language users write in and their activity diversity. For English-centric online platforms like Reddit, understanding how language affects activity diversity gives insight into the cultural diversity of non-English-speaking communities and can help foster the sense of belonging and community engagement that supports their growth.
Research Questions¶
Our work seeks to describe the effect of language usage on activity diversity. Since activity diversity can be measured for both users and communities, this leads naturally to two research questions:
RQ1: Do English-speaking communities tend to be more generalist than non-English-speaking communities?
RQ2: Do English-speaking users tend to be more generalist than non-English-speaking users?
The Data¶
For raw data, we use text_comments.csv, which contains Reddit comments from January 2019 to June 2021. The dataset includes around 40M comments across the top 5000 subreddits by activity. The schema is as follows:
- id: unique id of the comment
- link_id: id of the submission to which the comment belongs
- score: score of the comment, based on upvotes and downvotes
- author: username of the comment's author
- subreddit: name of the subreddit the comment was posted in
- created_utc: time when the comment was posted, stored as a Unix timestamp
- body: text of the comment
Let's look at the first 10 rows of the dataset.
index | id | score | link_id | author | subreddit | body | created_utc |
---|---|---|---|---|---|---|---|
0 | t1_ftjl56l | 4 | t3_gzv6so | mega_trex | BeautyGuruChatter | Does anyone have a good cruelty free one? The ... | 1591755558 |
1 | t1_ftjpxmc | 6 | t3_gzv6so | [deleted] | BeautyGuruChatter | (stares at my soft glam i've had for like 3 ye... | 1591758382 |
2 | t1_gzzxfyt | 22 | t3_nodb9e | divadream | BeautyGuruChatter | When Jen’s initial reactions came out to the s... | 1622398357 |
3 | t1_gzzy7nc | 92 | t3_no6qaj | Ziegenkoennenfliegen | BeautyGuruChatter | I think you mean a \n>Highschool *fucking* bully | 1622398743 |
4 | t1_h00tpbp | 82 | t3_nolx7p | meowrottenralph | BeautyGuruChatter | Ugh. I was honestly hoping that this brand wou... | 1622414834 |
5 | t1_ftlamij | 1 | t3_h0an62 | somethingelse19 | BeautyGuruChatter | She's 35 in 2020\n\nhttps://jezebel.com/no-off... | 1591801024 |
6 | t1_h01dtz3 | 28 | t3_noo5e0 | sasukesbutt | BeautyGuruChatter | Is haus labs still around? I’m so out of the loop | 1622426299 |
7 | t1_h01fl3q | 2 | t3_nn2hz7 | Mika_Kyle | BeautyGuruChatter | But I thought it's all because she is oppresse... | 1622427319 |
8 | t1_ftll1qn | 6 | t3_h0dpxq | [deleted] | BeautyGuruChatter | This is such a mature and professional respons... | 1591807162 |
9 | t1_ftlsbtj | 2 | t3_h0an62 | angelicad6 | BeautyGuruChatter | Definitely 😊 | 1591810532 |
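As a minimal sketch of reading a file with this schema, the snippet below parses rows with Python's standard `csv` module and converts `created_utc` from a Unix timestamp to a UTC datetime. The in-memory `SAMPLE_CSV` is a made-up stand-in for text_comments.csv: the ids, scores, and timestamps are copied from the rows above, but the bodies are placeholders, and the helper name `read_comments` is an assumption for illustration.

```python
import csv
import io
from datetime import datetime, timezone

# Stand-in for text_comments.csv: same columns, placeholder comment bodies.
SAMPLE_CSV = """id,score,link_id,author,subreddit,body,created_utc
t1_ftjl56l,4,t3_gzv6so,mega_trex,BeautyGuruChatter,placeholder body,1591755558
t1_ftlsbtj,2,t3_h0an62,angelicad6,BeautyGuruChatter,placeholder body,1591810532
"""


def read_comments(handle):
    """Parse comment rows, casting score to int and created_utc to a UTC datetime."""
    rows = []
    for row in csv.DictReader(handle):
        row["score"] = int(row["score"])
        row["created_utc"] = datetime.fromtimestamp(
            int(row["created_utc"]), tz=timezone.utc
        )
        rows.append(row)
    return rows


comments = read_comments(io.StringIO(SAMPLE_CSV))
```

For the real file, the same function could be called on `open("text_comments.csv", newline="", encoding="utf-8")`; with 40M rows a streaming or chunked approach would likely be preferable to building one list.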
Note that in this work we only consider user comments, even though we also have text_submissions.csv, which contains the posts. This is because, to answer our research questions, we need to compute the GS-scores of our users. The original paper uses Reddit data from 2017 and defines GS-scores based on users commenting in different subreddits. Therefore, to reproduce the GS-scores on our dataset (Reddit 2019 to 2021), we follow the same methodology and use only the comments.
Methodology¶
To answer our research questions about activity diversity and language, we first need to calculate the GS-score (as a measure of activity diversity) and the language of each user and community.
For the computation of GS-scores, we use an additional dataset, the community embedding for Reddit from CSSLab. Since this dataset is separate from our text_comments.csv, our clean data only contains comments whose subreddits appear in the community embedding dataset.
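The cleaning step described here amounts to a membership filter. The following is a rough sketch under assumptions: comments are dictionaries with a `subreddit` key, the embedding's subreddit names are available as a set, and names match exactly. The helper name `keep_embedded` and the toy data are hypothetical.

```python
def keep_embedded(comments, embedded_subreddits):
    """Keep only comments whose subreddit has a vector in the community embedding."""
    known = set(embedded_subreddits)
    return [c for c in comments if c["subreddit"] in known]


# Toy data: one subreddit is covered by the embedding, the other is not.
embedded = {"BeautyGuruChatter", "france"}
raw = [
    {"id": "c1", "subreddit": "france"},
    {"id": "c2", "subreddit": "not_in_embedding"},
]
clean = keep_embedded(raw, embedded)
```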
User GS-Score¶
GS-score is a measure of activity diversity that is computed using a community embedding. The community embedding embeds each community into a point in a high-dimensional space where similar communities are close to each other. For any community $c_j$, we denote its embedding as $\vec{c_j}$ (normalised).
With the community embedding, we can define, for a user $u_i$ who has made $w_j$ contributions (here, comments) to community $c_j$, the user's center of mass as:
$$ \vec\mu_i=\sum_jw_j\vec{c_j} $$
Then, user $u_i$'s activity diversity (aka GS-score) is defined as the contribution-weighted average cosine similarity between the embeddings of the communities $u_i$ contributes to and $u_i$'s center of mass:
$$ GS(u_i) = \frac1J\sum_jw_j\frac{\vec{c_j}\cdot\vec\mu_i}{\|\vec\mu_i\|} $$
where $J=\sum_jw_j$ is the total number of contributions made by $u_i$. With this normalisation, a user who comments in only one community has a GS-score of exactly $1$.
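The two formulas above can be sketched in a few lines of plain Python. This is an illustrative implementation, not the notebook's own code: the function names, the dictionary-based inputs, and the toy two-dimensional embedding are assumptions, and $J$ is taken to be the user's total number of contributions $\sum_j w_j$, which is what makes a single-community user score exactly 1.

```python
import math


def normalise(vector):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def user_gs_score(contributions, embedding):
    """GS-score of one user.

    contributions: {community: number of comments w_j}
    embedding:     {community: embedding vector (any length, any scale)}
    """
    vectors = {c: normalise(embedding[c]) for c in contributions}
    dim = len(next(iter(vectors.values())))
    # Center of mass: mu_i = sum_j w_j * c_j
    mu = [sum(w * vectors[c][k] for c, w in contributions.items()) for k in range(dim)]
    mu_norm = math.sqrt(sum(x * x for x in mu))
    total = sum(contributions.values())  # J, assumed to be the total contribution count
    # Contribution-weighted average cosine similarity to the center of mass
    weighted = sum(
        w * sum(a * b for a, b in zip(vectors[c], mu)) / mu_norm
        for c, w in contributions.items()
    )
    return weighted / total


# Toy embedding with two orthogonal communities (made-up values).
toy_embedding = {"cats": [2.0, 0.0], "chess": [0.0, 3.0]}
specialist = user_gs_score({"cats": 5}, toy_embedding)
generalist = user_gs_score({"cats": 1, "chess": 1}, toy_embedding)
```

In this toy case the single-community user scores 1, while the user split evenly between two orthogonal communities scores $1/\sqrt2\approx0.71$. The sketch does not handle the degenerate case where the center of mass is the zero vector.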
Given our dataset of user comments, we can compute the GS-score for each user. Here is the distribution of GS-scores for our users.
99.87% of user scores are between 0.5 and 1.0
From the histogram, we can see that a very large number of users have a GS-score of 1. This occurs when a user comments in only one community.
The GS-score has a theoretical range of $[-1, 1]$, where $-1$ is an extreme generalist and $1$ is an extreme specialist. In our case, over 99% of users have a GS-score between $0.5$ and $1$, which matches the findings of the original paper.
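The spike at a GS-score of 1 follows directly from the definition. As a short worked case, suppose $u_i$ comments only in community $c_1$, a total of $w_1$ times, and take $J$ to be the user's total number of contributions (so $J=w_1$). Since $\vec{c_1}$ is normalised:

$$ \vec\mu_i = w_1\vec{c_1}, \qquad \|\vec\mu_i\| = w_1, \qquad GS(u_i) = \frac{1}{w_1}\,w_1\,\frac{\vec{c_1}\cdot w_1\vec{c_1}}{w_1} = \vec{c_1}\cdot\vec{c_1} = 1 $$

The result does not depend on $w_1$, so every single-community user lands on exactly 1 regardless of how much they comment.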
Community GS-Score¶
The measure of community activity diversity is computed from user GS-scores. The GS-score of a community $c_j$, $GS(c_j)$, is the average of its users' GS-scores, weighted by each user's contribution to $c_j$:
$$ GS(c_j) = \frac1{\sum_iw_{ij}}\sum_iw_{ij}GS(u_i) $$
where $w_{ij}$ is the number of contributions user $u_i$ has made to $c_j$.
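The community score is a plain weighted mean, which the following sketch makes concrete. The function name, the dictionary inputs, and the toy numbers are illustrative assumptions; the weights are each user's number of comments in that one community.

```python
def community_gs_score(user_weights, user_scores):
    """Weighted mean of user GS-scores for a single community.

    user_weights: {user: number of comments the user made in this community}
    user_scores:  {user: the user's GS-score}
    """
    total = sum(user_weights.values())
    return sum(w * user_scores[u] for u, w in user_weights.items()) / total


# Toy example: a heavy-commenting specialist pulls the community score up.
weights = {"alice": 3, "bob": 1}
scores = {"alice": 1.0, "bob": 0.6}
cscore = community_gs_score(weights, scores)  # (3 * 1.0 + 1 * 0.6) / 4
```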
The community GS-score distribution for our dataset is as follows:
The distribution of community scores is less skewed than that of user scores.
User Language¶
We now move to language detection. We use lingua, a language detection library suited to short texts. We exclude comments whose language cannot be detected. Such comments may be non-text comments (e.g. images, videos) or comments consisting only of emojis or URLs.
For a user $u_i$ and a language $L$, we define $u_i$'s $L$ frequency as the proportion of $u_i$'s comments that are in $L$. The higher a user's English frequency, the more likely the user is an English speaker.
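The frequency definition can be sketched as follows. The detector is passed in as a function so the example stays self-contained: `toy_detect` is a hypothetical keyword-based stand-in, not lingua. In practice one would presumably wrap lingua's detector here (its `detect_language_of` method, which to our understanding returns no language when detection fails). Comments with no detected language are skipped, mirroring the exclusion described above.

```python
from collections import Counter, defaultdict


def toy_detect(text):
    """Hypothetical stand-in for a real language detector; None means undetected."""
    words = set(text.lower().split())
    if words & {"the", "is", "and"}:
        return "ENGLISH"
    if words & {"le", "est", "et"}:
        return "FRENCH"
    return None


def language_frequencies(comments, detect):
    """Return {author: {language: share of the author's detectable comments}}."""
    counts = defaultdict(Counter)
    for author, body in comments:
        language = detect(body)
        if language is None:
            continue  # undetectable comments are excluded
        counts[author][language] += 1
    return {
        author: {lang: n / sum(c.values()) for lang, n in c.items()}
        for author, c in counts.items()
    }


toy_comments = [
    ("alice", "the cat is here"),
    ("alice", "le chat est ici"),
    ("alice", "😊"),
    ("bob", "the dog and the cat"),
]
freqs = language_frequencies(toy_comments, toy_detect)
```

Here `freqs["alice"]` assigns 0.5 to each of English and French, because the emoji-only comment is dropped before the proportions are taken. Grouping by subreddit instead of author would give the community-level frequency in the same way.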
Let's look at the distribution of English frequency for our users.
We see that a large number of users have an English frequency of $1$, that is, they only comment in English. The second largest bin contains users with an English frequency of $0$ (i.e. they never comment in English).
Community Language¶
Similar to user language, we can also define community language based on the comments made in the community: a community $c_i$'s frequency in language $L$ is the proportion of comments made in $c_i$ in language $L$.
Here is the distribution of English frequency for our communities.
As expected, the distribution is left skewed, meaning that most communities are English-speaking. We can also verify our calculation based on some common subreddits. For instance:
- r/france has 91.49% of comments in French
- r/italy has 88.85% of comments in Italian
- r/Denmark has 87.23% of comments in Danish
- r/germany has 5.79% of comments in German
(r/germany is about English-language discussions related to Germany.)
Results¶
RQ1¶
Do English-speaking communities tend to be more generalist than non-English-speaking communities?
To answer our RQ1, we will look at the communities with the lowest English frequencies. Looking at the histogram of community English frequencies, this means that we are interested in communities with English frequency of at most $0.1$. Let's call these communities non-English communities and highlight them in the following scatter plot.
We see that the non-English communities are spread across the right side of the plot, meaning that they tend to be more specialist than English-speaking communities. The five most specialist non-English communities are:
subreddit | cscore |
---|---|
newsokunomoral | 0.948291 |
podemos | 0.945803 |
ukraina | 0.905109 |
newsokur | 0.887232 |
Argaming | 0.880716 |
- r/newsokunomoral (Japanese) is a subreddit for news in Japan
- r/podemos (Spanish) is a subreddit for Spanish politics
- r/ukraina (Ukrainian) is a subreddit, well, about Ukraine
- r/newsokur (Japanese) is a subreddit for news in Japan (another Japanese news subreddit?)
- r/Argaming (Spanish) is a subreddit for gaming in Argentina
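The selection and ranking used for RQ1 can be sketched as a threshold filter followed by a sort. The function name, the input layout, and the placeholder communities below are assumptions for illustration; the values are made up and are not results from the dataset.

```python
def most_specialist_non_english(communities, threshold=0.1, k=5):
    """Rank non-English communities by GS-score, highest (most specialist) first.

    communities: {name: (english_frequency, gs_score)}
    A community counts as non-English when its English frequency is at most `threshold`.
    """
    non_english = [
        (name, gs)
        for name, (english_freq, gs) in communities.items()
        if english_freq <= threshold
    ]
    return sorted(non_english, key=lambda item: item[1], reverse=True)[:k]


# Placeholder communities with made-up frequencies and scores.
toy_communities = {
    "sub_a": (0.02, 0.95),
    "sub_b": (0.05, 0.80),
    "sub_c": (0.98, 0.70),
    "sub_d": (0.10, 0.88),
}
top_two = most_specialist_non_english(toy_communities, k=2)
```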
RQ2¶
RQ2: Do English-speaking users tend to be more generalist than non-English-speaking users?
To answer RQ2, we perform a similar analysis. We partition users into two groups:
- English-speaking users: users whose most frequent language is English
- non-English-speaking users: users whose most frequent language is not English
We can then plot the GS-score of these users against their English frequency. Since we have 7M users, we will sample 10K users for the following plot.
As the plot shows, a user's GS-score tends to decrease slightly (i.e. the user becomes more generalist) as their English frequency increases.
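The grouping and sampling steps behind this plot might look like the sketch below. The function names, the `"ENGLISH"` label, and the fixed seed are assumptions; ties between languages are broken by dictionary order here, which the text does not specify.

```python
import random


def split_by_dominant_language(user_freqs, english_label="ENGLISH"):
    """Partition users by whether their most frequent language is English.

    user_freqs: {user: {language: frequency}}
    """
    english, non_english = [], []
    for user, freqs in user_freqs.items():
        dominant = max(freqs, key=freqs.get)  # ties resolved by insertion order
        (english if dominant == english_label else non_english).append(user)
    return english, non_english


def sample_users(users, k=10_000, seed=0):
    """Draw up to k distinct users uniformly at random, reproducibly."""
    users = list(users)
    return random.Random(seed).sample(users, min(k, len(users)))


toy_freqs = {
    "u1": {"ENGLISH": 0.9, "FRENCH": 0.1},
    "u2": {"FRENCH": 1.0},
    "u3": {"ENGLISH": 1.0},
}
english_users, non_english_users = split_by_dominant_language(toy_freqs)
plotted = sample_users(range(100), k=10)
```

Seeding the sampler keeps the scatter plot reproducible across notebook runs, at the cost of always showing the same subset of users.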
Conclusion¶
We reproduced user and community GS-scores from Waller and Anderson's paper on a new Reddit dataset and studied the effect of language on activity diversity. We found a weak trend that non-English-speaking communities tend to be more specialist than English-speaking communities; the same finding applies to users as well.
We have very few data points for non-English-speaking communities and users. This is because the raw data comes from the most active subreddits, which are mostly English-speaking. Therefore, we cannot make strong conclusions about the effect of language on activity diversity. However, we can still use our findings to inform future work on language and activity diversity on Reddit.