
I am new to ML/AI/NLP and am interested in tackling the following problem. I have a database of chat logs from a Discord server. The database contains the following fields: Author, Text, Text Created Timestamp, and Channel ID.

I think what I want to do is fairly complex but should, in theory, be possible. I want to go through the logs and group related messages under a new label, Conversation ID. In an ideal world, the model will label each piece of text with a conversation ID, and I can then run queries against the data to show me all the text that pertains to a specific conversation among one or more individuals. (I say one because someone might ask a question and then reply to themselves with an answer; this will be rare.)

I can have someone go through the data manually and label what constitutes a 'conversation' for some test data. If I were able to build a successful model, I would use the conversations to look for logical groupings of questions asked with follow-up answers. The ultimate goal here is to find question/answer pairs; I think a logical first step is to isolate conversations.

The end goal would be to take the discussion from the beginning of time to the present and be able to isolate different chunks of text into logical conversation groups.

It would make sense that any conversation happens within the same Channel ID. I am not worried about cross-channel conversations. Those are rare/unlikely.

My last naive approach was to find every question in the corpus, take the N messages that followed it in the same channel, and ask an LLM whether any of them answered the question. I want to see if I can improve that process. The hard cut-off at N messages means an answer that arrives N+M messages later is missed entirely; ideally, a model would not be limited to a fixed number of subsequent messages.

Happy to take advice from seasoned hackers on different approaches to this problem that I can experiment with. Names of techniques, tools, or models that I can research, and that apply specifically to this problem, would be ideal.

3 Answers


"The end goal would be to take the discussion from the beginning of time to the start of time and be able to isolate different chunks of text into a logical conversation group."

I would start by clarifying this. A 'conversation' on a messaging app can be defined in a million ways: the day ended and people left the app after a sustained discussion, or the users changed the topic... but maybe came back to it later. One could argue that one conversation = one thread, but the way people use threads might depend on the server.

If you don't have a definition yet, I think a conversation could be delimited by some threshold being crossed, in terms of time elapsed since the last message and/or a change in topic. For the former, you have the data, and I would look at the distribution of time between messages. It's unlikely that your distribution will be bimodal, but if it is, it might be useful.
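
As a concrete starting point for the time-gap analysis, here is a minimal sketch with pandas. The file name and column names (created_at, channel_id) are placeholders for whatever your export actually uses:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Placeholder file and column names; adjust to your schema.
df = pd.read_csv("chat_logs.csv", parse_dates=["created_at"])
df = df.sort_values(["channel_id", "created_at"])

# Seconds elapsed since the previous message in the same channel.
gaps = df.groupby("channel_id")["created_at"].diff().dt.total_seconds()

# Inspect the distribution; a clear valley (or second mode) suggests a
# natural gap threshold for splitting conversations.
print(gaps.describe())
gaps.dropna().clip(upper=3600).hist(bins=60)  # cap at 1 hour for readability
plt.xlabel("seconds since previous message")
plt.show()
```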

For the latter, it will be harder, but you could use embeddings (one per message) from a pre-trained model and look at the distribution of cosine similarity between consecutive messages (or over some window), watching for points where that value drops, hopefully due to a topic change. You might have to filter out short, very generic messages and build some attribution logic to assign them back to a conversation later.
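
For the topic-change signal, a minimal sketch with sentence-transformers might look like this; channel_df and the 0.3 threshold are assumptions you would replace and tune on your own data:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Any pre-trained sentence encoder works; all-MiniLM-L6-v2 is a small, fast default.
model = SentenceTransformer("all-MiniLM-L6-v2")

# channel_df is assumed to hold one channel's messages in chronological order.
messages = channel_df["text"].tolist()
emb = model.encode(messages, normalize_embeddings=True)

# With unit-normalized embeddings, the dot product equals cosine similarity.
sims = np.sum(emb[:-1] * emb[1:], axis=1)

# Candidate conversation boundaries: similarity dips below a tuned threshold.
# 0.3 is an arbitrary starting point, not a recommendation.
boundaries = np.where(sims < 0.3)[0] + 1
```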

Mostly speculating for this one, but another way might be to use a Transformer model that can handle large chunks of text spanning multiple messages, and to look at how the attention shifts as you feed the text into the model with a rolling window. I wonder whether, when the window straddles two conversations, you would see some sort of clustering.


This is an interesting problem to tackle, and there are several approaches you can take to solve it. Here are some suggestions:

  1. Clustering Algorithms: You can use clustering algorithms like K-Means or DBSCAN to group messages into conversation threads based on their similarity. To do this, you'll need to extract relevant features from the text data and use them as input to the clustering algorithm. Common features to consider are the TF-IDF score of the message, the length of the message, or the time difference between messages (a minimal sketch follows this list).
  2. Sequence labeling: Another approach is to use sequence labeling techniques like Conditional Random Fields (CRF) or Hidden Markov Models (HMM) to label each message with a conversation ID. To do this, you'll need to train a model on your labeled data to learn the patterns that distinguish different conversation threads. You can then use this model to label new messages automatically.
  3. Transformer-based methods: With recent advancements in deep learning, you can use pre-trained transformer-based models like BERT, RoBERTa, or GPT-2 to encode the text data, then feed those encodings into a clustering or sequence labeling algorithm. These models are good at capturing the context and semantics of text and can lead to better performance.
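
As a hedged illustration of option 1, here is what TF-IDF features plus DBSCAN could look like. The channel_df columns, the eps value, and the time-feature weight are all assumptions to tune, not recommendations:

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.feature_extraction.text import TfidfVectorizer

# channel_df is assumed to hold one channel's messages in chronological order.
texts = channel_df["text"].tolist()
seconds = channel_df["created_at"].astype("int64").to_numpy() / 1e9

# TF-IDF features for message content.
tfidf = TfidfVectorizer(max_features=5000).fit_transform(texts).toarray()

# Add a scaled time feature so temporally close messages tend to cluster.
minutes = (seconds - seconds.min()) / 60.0
X = np.hstack([tfidf, 0.1 * minutes.reshape(-1, 1)])  # 0.1 is an arbitrary weight

# DBSCAN needs no cluster count up front; eps controls cluster tightness.
labels = DBSCAN(eps=1.0, min_samples=2).fit_predict(X)  # -1 marks noise
channel_df["conversation_id"] = labels
```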

In addition to these techniques, you can also consider pre-processing the text data to remove noise, stop words, and punctuation marks, and use stemming or lemmatization to normalize the text.
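
If it helps, a basic preprocessing pass along those lines, sketched with NLTK (one library among several that would work here), might be:

```python
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download("stopwords")
nltk.download("wordnet")

stop_words = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def preprocess(text: str) -> str:
    text = re.sub(r"[^a-z0-9\s]", " ", text.lower())  # strip punctuation/noise
    tokens = [t for t in text.split() if t not in stop_words]  # drop stop words
    return " ".join(lemmatizer.lemmatize(t) for t in tokens)  # normalize tokens

print(preprocess("What's the BEST way to parse these logs???"))
# -> best way parse log
```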

I hope these suggestions help you get started with your project :)


This is a fascinating and common problem known as Conversation Disentanglement or Thread Detection. The goal, as you described, is to group potentially interleaved messages from a chat log into distinct, coherent conversations, assigning something like a "Conversation ID".

We were just discussing a very similar task, and here are the main approaches we considered:

Graph-Based Approach (Our Primary Method)

Concept: Represent the chat log as a graph where each message is a node.

Edges (connections): Create weighted edges between messages based on:

  1. Direct replies: If your Discord data includes reply_to information, this is the strongest indicator of a connection (high edge weight).
  2. Temporal proximity: Messages sent close together in time (e.g., within a few minutes) are likely related. The weight can decrease as the time difference increases.
  3. Semantic similarity: Calculate vector embeddings for each message (using models like Sentence-BERT, BGE-M3, LaBSE, etc.). Messages with high cosine similarity between their embeddings likely belong to the same topic, even without direct replies. This helps link discussions that use synonyms or different phrasing.
  4. (Optional) Authorship: Consecutive messages from the same author might also indicate a continued thought.

Clustering: Apply a community detection algorithm (like Louvain or Leiden) to the resulting graph. These algorithms find densely connected groups of nodes (messages); each resulting cluster represents a distinct conversation (your "Conversation ID").

Advantages: This is flexible, considers various connection types, and importantly, doesn't rely on a fixed window of N messages after a question. It can naturally group long conversations where answers might appear much later (N+M messages). It doesn't strictly require large labeled datasets, although they are helpful for tuning edge weights.

Tools: pandas (data handling), sentence-transformers (embeddings), networkx (graphs), python-louvain or leidenalg (clustering), streamlit (useful for interactively tuning parameters).
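
A minimal end-to-end sketch with those tools follows. The edge weights, the 5-minute decay, the 0.5 similarity floor, and the 50-message pairing window are illustrative assumptions, and the channel_df columns (message_id, reply_to, created_at, text) are placeholders:

```python
import networkx as nx
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
msgs = channel_df.to_dict("records")  # one channel, chronological order
emb = model.encode([m["text"] for m in msgs], normalize_embeddings=True)

G = nx.Graph()
G.add_nodes_from(range(len(msgs)))

# Only pair each message with the next 50 to keep the graph tractable.
for i in range(len(msgs)):
    for j in range(i + 1, min(i + 51, len(msgs))):
        w = 0.0
        if msgs[j].get("reply_to") == msgs[i]["message_id"]:
            w += 3.0  # direct reply: the strongest signal
        dt = (msgs[j]["created_at"] - msgs[i]["created_at"]).total_seconds()
        if dt < 300:
            w += 1.0 - dt / 300  # temporal proximity, decaying over 5 minutes
        sim = float(np.dot(emb[i], emb[j]))
        if sim > 0.5:
            w += sim  # semantic similarity above a floor
        if w > 0:
            G.add_edge(i, j, weight=w)

# Each Louvain community becomes one Conversation ID.
communities = nx.community.louvain_communities(G, weight="weight", seed=0)
conv_id = {node: cid for cid, nodes in enumerate(communities) for node in nodes}
```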

LLM-Based Approach (Alternative)

Concept: Use a Large Language Model (LLM) to process the chat log with a sliding window.

Process: Feed the LLM the messages from the current window plus summary information about active conversations identified in the previous window. The LLM's task is to assign each new message to an existing conversation ID or start a new one.

Prompting: Requires careful prompt engineering: instruct the LLM on how to use reply info, semantics, time, and authors, and request a structured JSON output with "Conversation ID" assignments and updated conversation summaries.

Challenges: LLM context window limitations, API cost and speed, maintaining consistency across windows (the hardest part: ensuring the LLM doesn't "forget" active threads), and managing the state between calls. This is potentially more powerful due to the LLM's language understanding, but more complex and expensive to implement. It's an improvement over your simpler LLM approach because it tries to maintain conversation context across windows.
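
To make the state-passing idea concrete, here is a rough sketch of one window step. call_llm is a placeholder for whatever chat-completion client you use, and the prompt wording and JSON schema are illustrative, not tested:

```python
import json

def disentangle_window(messages, active_summaries, call_llm):
    """One sliding-window step: assign messages to conversations, then
    carry the updated summaries forward as state for the next window."""
    prompt = (
        "You are disentangling chat messages into conversations.\n"
        f"Active conversations (id -> summary): {json.dumps(active_summaries)}\n"
        f"New messages (id, author, timestamp, reply_to, text): {json.dumps(messages)}\n"
        "Assign each message to an existing conversation id, or create a new id.\n"
        "Consider reply_to links, semantics, timing, and authors.\n"
        'Reply with JSON only: {"assignments": [{"message_id": ..., '
        '"conversation_id": ...}], "summaries": {"<id>": "<updated summary>"}}'
    )
    result = json.loads(call_llm(prompt))  # assumes the model returns valid JSON
    return result["assignments"], result["summaries"]
```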

Relevant Research & Code:

During our discussion, we also came across an implementation of a modern approach to Conversation Disentanglement that might be useful: CluCDD: Contrastive Dialogue Disentanglement via Clustering

Link: https://github.com/gaojingsheng/CluCDD Description (from README): "An effective framework for dialogue disentanglement. The model captures the utterances and sequential representations in dialogues. The contrastive training paradigm is employed to amend feature space, and a cluster head predicts the session number to enhance the final clustering result. Extensive experiments demonstrate that our CluCDD is suitable for solving dialogue disentanglement and establishes the state-of-the-art on two datasets."

My Question:

I'm very interested in this topic! Did you manage to find a solution or make progress on this problem? I'm asking because I'm currently tackling a very similar task for analyzing Telegram chat logs using the graph-based approach we discussed. Any insights you gained would be appreciated!

P.S. I used a neural network to translate it into English