
(Not) Your Average Twitter User

Discovering groups of similar Twitter users based on their profiles.


To find the user types, we applied unsupervised learning methods, specifically $K$-means and DBSCAN clustering. We assessed similarity based on the following properties:

First, all of the above features were scaled to fall into the range $[0,1]$. For DBSCAN, we tried to find suitable parameters (neighbourhood size and $\epsilon$) by grid search. Despite that, the algorithm often output a very low number of clusters and classified a large fraction of users as noise. As these results were not satisfactory, we considered $K$-means as an alternative. For $K$-means, we plotted the distortions (sums of squared errors) for a range of $K$:

elbow
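As a rough illustration of the steps described above (min-max scaling, a DBSCAN grid search, and the $K$-means distortion curve), here is a minimal sketch using scikit-learn. The feature matrix is synthetic, the parameter grids are arbitrary, and the helper names `dbscan_grid` and `kmeans_distortions` are hypothetical; none of these are taken from the original analysis.

```python
import numpy as np
from sklearn.cluster import DBSCAN, KMeans
from sklearn.preprocessing import MinMaxScaler


def dbscan_grid(scaled, eps_values, min_samples_values):
    """Run DBSCAN for every parameter pair; report clusters found and noise share."""
    rows = []
    for eps in eps_values:
        for min_samples in min_samples_values:
            labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(scaled)
            n_clusters = np.unique(labels[labels != -1]).size  # label -1 marks noise
            noise_share = float(np.mean(labels == -1))
            rows.append((eps, min_samples, n_clusters, noise_share))
    return rows


def kmeans_distortions(scaled, k_values, seed=0):
    """Return the K-means distortion (sum of squared errors) for each K."""
    return [
        KMeans(n_clusters=k, n_init=10, random_state=seed).fit(scaled).inertia_
        for k in k_values
    ]


# Synthetic stand-in for the user feature matrix; NOT the project's data.
rng = np.random.default_rng(0)
raw = rng.normal(size=(300, 5)) * np.array([1.0, 10.0, 100.0, 5.0, 50.0])

scaled = MinMaxScaler().fit_transform(raw)  # every column now lies in [0, 1]
grid = dbscan_grid(scaled, eps_values=[0.05, 0.1, 0.2], min_samples_values=[5, 10])
k_values = list(range(1, 11))
distortions = kmeans_distortions(scaled, k_values)

for k, sse in zip(k_values, distortions):
    print(f"K={k:2d}  distortion={sse:.2f}")
```

Plotting `distortions` against `k_values` gives an elbow curve; each row of `grid` shows how many clusters DBSCAN found and what fraction of points it labelled as noise for one parameter pair.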

From this plot, a good number of clusters seems to be 3; however, our estimated $K=6$ is also not a bad choice, so we stick with it. This clustering resulted in uneven clusters (some are much bigger than others), but each group is well represented.

users_per_cluster_pie
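For reference, the cluster sizes behind such a pie chart can be computed along these lines. This is only a sketch on synthetic data, assuming scikit-learn's `KMeans` with $K=6$; the seed and the number of restarts are illustrative choices, not the original settings.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for the scaled feature matrix; NOT the project's data.
rng = np.random.default_rng(1)
scaled = rng.random((500, 5))

model = KMeans(n_clusters=6, n_init=10, random_state=0).fit(scaled)
sizes = np.bincount(model.labels_, minlength=6)  # number of users per cluster
shares = sizes / sizes.sum()                     # fractions for a pie chart

for cluster, (size, share) in enumerate(zip(sizes, shares)):
    print(f"Cluster {cluster}: {size} users ({share:.1%})")
```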

To visualize the obtained clusters, we tried a linear and a nonlinear method, PCA and t-SNE respectively; however, the plots were not very useful, as one could not identify the clusters in them. The densities of the features were a bit more informative, as from these one may be able to infer which features played an important role during clustering. In particular, features whose distributions overlap heavily between clusters might not be useful for differentiating between users.

distrib
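The two-dimensional projections mentioned above could be obtained roughly as follows. This is a sketch on synthetic data, assuming scikit-learn's `PCA` and `TSNE`; the perplexity and seeds are arbitrary and not the values used in the analysis.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# Synthetic stand-in for the scaled feature matrix; NOT the project's data.
rng = np.random.default_rng(2)
scaled = rng.random((200, 5))
labels = KMeans(n_clusters=6, n_init=10, random_state=0).fit_predict(scaled)

# Linear projection onto the two directions of largest variance.
pca_xy = PCA(n_components=2).fit_transform(scaled)

# Nonlinear embedding that tries to preserve local neighbourhoods.
tsne_xy = TSNE(
    n_components=2, perplexity=30, init="pca", random_state=0
).fit_transform(scaled)

print(pca_xy.shape, tsne_xy.shape)
```

Scattering the two columns of `pca_xy` or `tsne_xy` and colouring each point by `labels` shows whether the clusters separate visually in either projection.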

Surprisingly, some of the features that were considered the most important, such as number of followers or statuses per month, have similar distributions for all clusters. We will now look more closely at the remaining features, first pairwise:

Clustering

And individually:

Clustes_barplot

From here, we start seeing what characteristics might be linked to each cluster. We try to summarize and interpret them as follows:

You can display various feature pairs to compare cluster centres here (Cluster 0, Cluster 1, Cluster 2, Cluster 3, Cluster 4, Cluster 5):

Once we had our clusters, we wanted to explore how users from each group use Twitter from the day they join. To do this, we decided to plot some metrics (number of tweets, URLs, as well as retweets to track their popularity) that can help us understand users' Twitter activity over time. These were averaged across the 10 points closest to each cluster centroid:

times
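The averaging over the points nearest to each centroid can be sketched with NumPy as below. The function name `mean_series_near_centroids`, the array layout, and the use of Euclidean distance are assumptions made for illustration. The sketch also picks the nearest points regardless of their assigned cluster label, which may differ from the original procedure.

```python
import numpy as np


def mean_series_near_centroids(features, centroids, series, n_closest=10):
    """Average per-user time series over the users nearest to each centroid.

    features:  (n_users, n_features) scaled feature matrix
    centroids: (n_clusters, n_features) cluster centres
    series:    (n_users, n_steps) one activity metric per user over time
    """
    averaged = []
    for centre in centroids:
        distances = np.linalg.norm(features - centre, axis=1)  # Euclidean
        nearest = np.argsort(distances)[:n_closest]            # closest users
        averaged.append(series[nearest].mean(axis=0))
    return np.vstack(averaged)


# Synthetic stand-in data; NOT the project's data.
rng = np.random.default_rng(3)
features = rng.random((100, 4))
centroids = rng.random((6, 4))
tweets_per_month = rng.poisson(lam=20, size=(100, 12)).astype(float)

curves = mean_series_near_centroids(features, centroids, tweets_per_month)
print(curves.shape)  # one averaged 12-step curve per cluster
```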