Online drug trafficking has flourished on darknet marketplaces over the past few years. The rising popularity of these cryptomarkets can be attributed partly to the ease of evading law enforcement, mainly because sellers and buyers never interact physically. Another reason is that these marketplaces rely on encryption to protect the anonymity of transactions. Many forums have been launched on the Tor network to facilitate communication between vendors and buyers, help buyers assess the quality of the drugs on offer, and share information that helps users preserve their anonymity and avoid being monitored by law enforcement agencies.

A recently published book titled “Drugs, Darknet, and Organized Crime” presents an investigative analysis of the content of dark web forums that are mainly used by cryptomarket vendors and buyers to facilitate the trading of illicit drugs. Latent Dirichlet Allocation (LDA) and a hidden Markov model (HMM) were used to analyze discussion topics on these forums. This article gives an overview of the results of that investigation.

Method of data collection:

The data used in this analysis was harvested by a purpose-built dark web crawler (spider) that uses anonymizing communication protocols, namely Tor and I2P, to access darknet websites and that handles authentication in order to crawl non-indexed Tor hidden services. The crawler focuses on dark web forums and discussion boards where users mostly discuss cybercrime and fraud, although other illegal activities, especially illicit drug trading, are discussed as well.

In total, the crawler scraped data from over 250 dark web forums. The most commonly used languages in these forums were English (37.8% of all threads), Russian (22.4%), and Chinese (15.4%). The analysis focused only on English posts, although the same crawler could be used to analyze content written in other languages. Excluding non-English content reduced the number of forums to 155. Posts were pre-processed with spaCy, NLTK, and scikit-learn to remove stop words and other very frequent words. This yielded a total of 1.33 million posts.
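The pre-processing step can be sketched roughly as follows. This is a minimal standard-library illustration and not the study's pipeline, which relied on spaCy, NLTK, and scikit-learn. The tiny stop-word list, the max_doc_fraction threshold, the function names, and the sample posts are all assumptions made for the example.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; real pipelines use the
# much larger lists shipped with NLTK or spaCy).
STOP_WORDS = {"the", "a", "an", "and", "or", "to", "of", "in", "is", "for", "on"}

def tokenize(text):
    """Lowercase a post and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def preprocess(posts, max_doc_fraction=0.5):
    """Drop stop words, then drop words that occur in more than
    max_doc_fraction of all posts (treated here as 'frequent words')."""
    tokenized = [[t for t in tokenize(p) if t not in STOP_WORDS] for p in posts]
    # Document frequency: in how many posts does each word appear?
    doc_freq = Counter(word for tokens in tokenized for word in set(tokens))
    limit = max_doc_fraction * len(posts)
    return [[t for t in tokens if doc_freq[t] <= limit] for tokens in tokenized]

# Made-up sample posts, for illustration only.
posts = [
    "Selling a fresh proxy list for the market",
    "Is the market down? The proxy works for me",
    "Review of the vendor on the market",
]
cleaned = preprocess(posts)
for tokens in cleaned:
    print(tokens)
```

In this toy input, "market" and "proxy" appear in more than half of the posts, so they are removed along with the stop words. The idea is that words shared by most posts say little about what distinguishes one topic from another.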

A statistical technique known as Latent Dirichlet Allocation (LDA) was used to analyze the topics of the English posts. LDA breaks content down into latent topics, where each topic is a probability distribution over words, and each document is treated as a “bag of words”, i.e. word order is ignored. The study used the Gensim implementation of LDA to fit a model with 100 topics, trained on all 1.33 million posts (documents). Models with 50, 100, and 200 topics were tested, and the 100-topic model yielded the most relevant and coherent topics. The study focused on the period between January 2016 and September 2017. Only forums with at least one month of activity and at least 100 posts were analyzed, which reduced the dataset to 80 forums and around 482,000 posts. Figure (1) illustrates the level of activity across these 80 forums.
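The forum selection rule described above can be sketched as follows. This is only an illustration under stated assumptions: "one month of activity" is read here as at least 30 days between a forum's first and last post inside the study window, the window is taken to end on 30 September 2017, and the forum names and post dates are made up.

```python
from datetime import date, timedelta

# Study window from the article; the exact end day is an assumption.
STUDY_START = date(2016, 1, 1)
STUDY_END = date(2017, 9, 30)

def eligible_forums(forums, min_posts=100, min_active_days=30):
    """Keep forums with at least min_posts posts inside the study window
    and at least min_active_days between their first and last post."""
    kept = {}
    for name, post_dates in forums.items():
        in_window = [d for d in post_dates if STUDY_START <= d <= STUDY_END]
        if len(in_window) < min_posts:
            continue
        if max(in_window) - min(in_window) < timedelta(days=min_active_days):
            continue
        kept[name] = in_window
    return kept

# Hypothetical forums: only the first one passes both thresholds.
forums = {
    "forum_a": [date(2016, 3, 1) + timedelta(days=i) for i in range(120)],
    "forum_b": [date(2017, 1, 1) + timedelta(days=i % 10) for i in range(150)],
    "forum_c": [date(2016, 1, 1) + timedelta(days=9 * i) for i in range(40)],
}
kept = eligible_forums(forums)
print(sorted(kept))
```

Here forum_b is dropped because all of its posts fall within ten days, and forum_c is dropped because it has only 40 posts.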

Figure (1): Histogram of activity on dark web forums

Results of the analysis (forum topics):

To analyze the dynamics of topics on dark web forums, each forum was represented by a time series of topic vectors inferred with the 100-topic LDA model. The unit of time was one week: a forum’s vector for a given week was created by averaging the topic vectors of all posts published on the forum during that week. These weekly time series were used to learn hidden Markov model (HMM) states and to calculate cross-entropy. After a Beta Process HMM (BP-HMM) was trained on the weekly distributions of forum topics, the model had learned 28 different states, i.e. 28 topic distributions.
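The weekly averaging step can be sketched as follows. This is a minimal illustration, assuming each post already has a topic vector and that weeks are ISO calendar weeks; the article does not say how the study delimited weeks. The example uses 3-topic vectors for readability, whereas the study used 100 topics, and the sample data is made up.

```python
from collections import defaultdict
from datetime import date

def weekly_topic_vectors(posts):
    """posts: iterable of (post_date, topic_vector) pairs for one forum.
    Returns {(iso_year, iso_week): element-wise mean of that week's vectors}."""
    buckets = defaultdict(list)
    for post_date, vector in posts:
        iso = post_date.isocalendar()
        buckets[(iso[0], iso[1])].append(vector)
    averaged = {}
    for week, vectors in sorted(buckets.items()):
        n = len(vectors)
        # zip(*vectors) groups the values of each topic across all posts.
        averaged[week] = [sum(column) / n for column in zip(*vectors)]
    return averaged

# Hypothetical posts from one forum.
posts = [
    (date(2016, 1, 4), [0.5, 0.25, 0.25]),
    (date(2016, 1, 6), [0.25, 0.5, 0.25]),
    (date(2016, 1, 12), [0.0, 0.5, 0.5]),
]
weekly = weekly_topic_vectors(posts)
for week, vector in weekly.items():
    print(week, vector)
```

Each forum thus becomes a sequence of weekly vectors, which is the kind of input a sequence model such as an HMM operates on.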

Dark web forums were clustered according to the similarity of their learned topics. Figure (2) illustrates the resulting dendrogram and the sequence of learned topics for each forum. Each line in the figure corresponds to a forum, and different topics are shown in different colors. Transitions between topics are visible where the colors change. The clustering results suggest that this approach can group dark web forums into meaningful categories.
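A clustering of this kind can be sketched as follows. The article does not state which similarity measure or linkage the study used, so everything here is an assumption: each forum is summarized by the fraction of weeks it spends in each state, forums are compared by L1 distance, and clusters are merged by average linkage. The forum names and state sequences are made up.

```python
from collections import Counter
from itertools import combinations

def state_histogram(sequence, n_states):
    """Fraction of weeks a forum spends in each HMM state."""
    counts = Counter(sequence)
    return [counts[s] / len(sequence) for s in range(n_states)]

def l1_distance(p, q):
    """Sum of absolute differences between two histograms."""
    return sum(abs(a - b) for a, b in zip(p, q))

def average_linkage(a, b, histograms):
    """Mean pairwise distance between the members of clusters a and b."""
    dists = [l1_distance(histograms[x], histograms[y]) for x in a for y in b]
    return sum(dists) / len(dists)

def agglomerate(histograms):
    """Repeatedly merge the two closest clusters. Returns the merges as
    (cluster_a, cluster_b, distance), the information a dendrogram shows."""
    clusters = [(name,) for name in histograms]
    merges = []
    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2),
                   key=lambda pair: average_linkage(pair[0], pair[1], histograms))
        merges.append((a, b, average_linkage(a, b, histograms)))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges

# Hypothetical weekly state sequences for four forums.
sequences = {
    "hacking_forum_1": [0, 0, 0, 1],
    "hacking_forum_2": [0, 0, 1, 0],
    "market_forum_1": [2, 2, 2, 1],
    "market_forum_2": [2, 2, 1, 2],
}
histograms = {name: state_histogram(seq, n_states=3) for name, seq in sequences.items()}
merges = agglomerate(histograms)
for a, b, dist in merges:
    print(a, "+", b, "at distance", dist)
```

In this toy data the two hacking forums merge first, then the two market forums, and the two groups join only at a much larger distance. That is the pattern a dendrogram makes visible.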

Figure (2): Topic sequences of dark web forums (each color represents a topic) and dendrogram illustrating the similarity of dark web forums based on their learned topics

The following represents the main clusters of forums:

Cluster 1

This cluster mostly includes darknet forums focused on hacking, including HackForum, ZeroDay, GroundZero, SafeSkyHacks, and DeepDotWeb. It contains two subgroups that differ mainly in their activity levels. Forums in the first subgroup have low activity (around 5 posts per week), which explains why their weekly topic vectors are more sensitive to individual posts and why their HMMs change state more frequently. The most common state for the second, more active subgroup (around 50 posts per week) is the yellow state, which reflects the activation of the following topics: law enforcement, hacking tutorials, and proxy servers.

Cluster 2

This cluster mainly includes darknet marketplaces such as AlphaBay, Abraxas Market, Dream Market, BlackWorld, and Hansa Market, in addition to forums focused on reviewing them. It is also divided into two subgroups. The first is dominated by the dark blue state, which reflects discussion topics related to locations (mainly in connection with sales of proxy servers), contacts of cryptomarket vendors, reviews of marketplace vendors, and banking. The second is dominated by light blue states, which correspond to the topics of darknet marketplace purchase details, darknet marketplace reviews, cryptocurrency, and the trading of narcotics. Based on this clustering, forums in the first subgroup focus mainly on selling proxies and sharing information about darknet marketplaces, while forums in the second subgroup focus on selling drugs.

Cluster 3

This cluster includes forums focused on hacking PlayStation game consoles; its dominant state is the cyan state. The most discussed topics in this cluster were console hacking and security updates.

Cluster 4

This cluster includes forums that are primarily focused on ethical hacking. The most popular forums in this cluster include Metasploit and Hak5. The 0daybank and FreeBuf forums are related, but most of their posts are in Chinese, so their more active topics include non-English tokens and are difficult to interpret.

Change of discussion topics in darknet forums

With the state sequences identified by the BP-HMM, discussions on dark web forums can be tracked over time. A state transition reflects a considerable change in discussion topics and could point to an event. Nevertheless, as shown in figure (2), some darknet forums change topics frequently, so their transitions carry little significance. To identify significant topic transitions, a volatility measure is defined that reflects a forum’s likelihood of considerably changing its distribution of discussion topics. Because each forum is characterized by its own transition matrix over the global topics, a forum’s volatility is calculated by summing the off-diagonal entries of its transition matrix, i.e. the probabilities of changing state.
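The volatility calculation can be sketched as follows. This is a minimal illustration under stated assumptions: the transition matrix is estimated here by simply counting transitions in a forum's weekly state sequence and normalizing each row, whereas in the study it is learned by the BP-HMM. The function names and the two example sequences are made up.

```python
def transition_matrix(states, n_states):
    """Row-normalized transition counts: entry [i][j] estimates the
    probability of moving from state i to state j in the next week."""
    counts = [[0] * n_states for _ in range(n_states)]
    for current, nxt in zip(states, states[1:]):
        counts[current][nxt] += 1
    matrix = []
    for row in counts:
        total = sum(row)
        matrix.append([c / total if total else 0.0 for c in row])
    return matrix

def volatility(matrix):
    """Sum of the off-diagonal entries, i.e. the total probability mass
    assigned to leaving the current state."""
    return sum(p for i, row in enumerate(matrix)
               for j, p in enumerate(row) if i != j)

# Hypothetical weekly state sequences for two forums.
stable_forum = [0, 0, 0, 0, 1, 1, 1, 1, 1]
volatile_forum = [0, 1, 0, 1, 0, 1, 0, 1, 0]

print(volatility(transition_matrix(stable_forum, n_states=2)))
print(volatility(transition_matrix(volatile_forum, n_states=2)))
```

The forum that switches state every week scores far higher than the one that switches once. This matches the intuition that a single transition in a normally stable forum is more noteworthy.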

Because the study focused on finding variations in discussion topics rather than quantifying forums’ activities, the probabilities of the state associated with 0 posts (i.e. no data) were not considered. Table (1) includes a list of dark web forums with high and low volatility calculated via HMM and cross-entropy.

Table (1): Forum volatility calculated via HMM and cross-entropy and ranked from the highest to the lowest volatility

When topic volatility is measured via the HMM, the most volatile forums with at least 10 posts per week were OpenSC Marketplace, Stronghold Paste, and Demon Forum, and the least volatile were BugsChromium, CSU, and the subreddit PS3Homebrew. A forum’s volatility can be estimated visually from its state sequence in figure (2). Stronghold Paste is a Tor hidden service similar to Pastebin; it hosts varied discussion topics and hence passes through a myriad of states, i.e. topic distributions. Nevertheless, it oscillates mainly between two of them: in one, hacking and cybersecurity topics (Web Vulnerabilities) are more prominent, and in the other, cryptocurrencies and cryptomarkets are highly activated. These results suggest that state transitions in highly volatile forums like Stronghold Paste are unlikely to be the result of an event, whereas topic transitions in low-volatility forums such as BugsChromium or CSU are more likely to indicate one.

Final thoughts:

This study shows that dark web forums cover a myriad of discussion topics, with hacking and cybersecurity being the most popular. Although some forums focus solely on illicit drug trading on darknet marketplaces, most darknet forums include sections with marketplace reviews and other discussions related to cryptomarkets.