I'm working on an AI model to predict dependency links between tasks in industrial planning schedules, based on historical project data. I have two tables:

Task Table (15 sheets, one sheet = one schedule)
| ID activity | Name of activity | Equipment Type | Start Date | End Date |
|---|---|---|---|---|
| ZZ0001/001 | TRAVAUX A COORDONNER | COLONNE | 04/01/2011 08:00 | 04/01/2011 08:00 |
| ZZ0001/002 | POSE ECHAFAUDAGE EXTERNE | COLONNE | 04/06/2012 08:00 | 10/08/2012 17:00 |
| ZZ0001/003 | DECALORIFUGEAGE PARTIEL | COLONNE | 10/09/2012 08:00 | 10/09/2012 17:00 |
Dependency Table (15 sheets, one sheet = one schedule)
| ID task | ID successor | Link Type |
|---|---|---|
| ZZ0001/002 | ZZ0001/003 | FS |
| ZZ0001/002 | ZZ0001/006 | FS |
| ZZ0001/003 | ZZ0001/006 | SS |
Each sheet has 300 to 17k tasks. Activity IDs are unique, and the dataset is imbalanced (some equipment types appear 100× more often than others).
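For reference, this is roughly how I load everything and check the imbalance (file names here are placeholders):

```python
import pandas as pd

# Placeholder file names; each workbook has 15 sheets (one per schedule)
tasks = pd.read_excel("tasks.xlsx", sheet_name=None)        # {sheet name: DataFrame}
deps = pd.read_excel("dependencies.xlsx", sheet_name=None)

# Stack all schedules to see how skewed the equipment types are
all_tasks = pd.concat(tasks.values(), ignore_index=True)
print(all_tasks["Equipment Type"].value_counts())
```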
Goal: Given a new list of tasks (typically filtered by EquipmentType), I want the model to suggest likely dependencies between them (and, eventually, the LinkType), learned from historical patterns in the existing data.
What I've tried:
- Decision Trees
- Basic neural networks: an MLP on BERT embeddings, and a GNN (SEAL)
Schematic Code
Data preparation (sketch after this list):
- Load the Excel sheets
- Encode activity names with BERT
- Encode equipment type with a OneHotEncoder
- Combine: [BERT | OneHot] → torch.tensor feature vector
- Build a graph G: each node = one task with its feature vector, no edges at inference time
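A simplified sketch of the encoding steps, reusing `all_tasks` from the loading snippet above; the multilingual checkpoint and the mean pooling are illustrative choices on my side:

```python
import torch
from sklearn.preprocessing import OneHotEncoder
from transformers import AutoTokenizer, AutoModel

# Multilingual checkpoint because the activity names are in French (illustrative)
tok = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
bert = AutoModel.from_pretrained("bert-base-multilingual-cased").eval()

@torch.no_grad()
def embed_names(names, batch_size=64):
    """Mean-pooled BERT embeddings for a list of activity names."""
    out = []
    for i in range(0, len(names), batch_size):
        enc = tok(names[i:i + batch_size], padding=True,
                  truncation=True, return_tensors="pt")
        hid = bert(**enc).last_hidden_state            # (B, T, 768)
        mask = enc["attention_mask"].unsqueeze(-1)     # (B, T, 1)
        out.append((hid * mask).sum(1) / mask.sum(1))  # mean over real tokens
    return torch.cat(out)

# Fit the one-hot encoder on ALL sheets so the type vectors stay aligned
ohe = OneHotEncoder(handle_unknown="ignore", sparse_output=False)
ohe.fit(all_tasks[["Equipment Type"]])

def node_features(df):
    """One row per task: [BERT name embedding | one-hot equipment type]."""
    name_emb = embed_names(df["Name of activity"].tolist())
    type_oh = torch.tensor(ohe.transform(df[["Equipment Type"]]), dtype=torch.float)
    return torch.cat([name_emb, type_oh], dim=1)
```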
Training the SEAL model (model sketch after this list). For each schedule in the training set:
- Extract the real edges (u → v) as positive pairs
- Generate negative pairs (same equipment type, no link)
- Build the 2-hop enclosing subgraph around each pair
- Apply DRNL node labeling
- Store a PyG Data(x, edge_index, drnl, label) object
- Train the GNN: SEALGNN = GINConv on [features + DRNL] → GlobalPool → MLP → Sigmoid
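A condensed version of the model (hidden sizes and the DRNL cap are illustrative, not my exact values):

```python
import torch
import torch.nn as nn
from torch_geometric.nn import GINConv, global_mean_pool

class SEALGNN(nn.Module):
    """GIN over [node features | DRNL embedding] on each enclosing subgraph,
    global pooling, then an MLP head that outputs a link probability."""
    def __init__(self, feat_dim, max_drnl=10, hidden=64):
        super().__init__()
        self.drnl_emb = nn.Embedding(max_drnl + 1, hidden)
        self.conv1 = GINConv(nn.Sequential(
            nn.Linear(feat_dim + hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden)))
        self.conv2 = GINConv(nn.Sequential(
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden)))
        self.head = nn.Sequential(
            nn.Linear(hidden, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, data):
        # data.drnl: DRNL label of each node in the enclosing subgraph
        drnl = data.drnl.clamp(max=self.drnl_emb.num_embeddings - 1)
        x = torch.cat([data.x, self.drnl_emb(drnl)], dim=1)
        x = self.conv1(x, data.edge_index).relu()
        x = self.conv2(x, data.edge_index)
        x = global_mean_pool(x, data.batch)   # one vector per subgraph/pair
        return torch.sigmoid(self.head(x)).view(-1)
```

It is trained with binary cross-entropy against the stored label; at inference the idea is to score candidate pairs (same equipment type) from the new task list and keep the highest-scoring ones.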
Problems encountered:
- Random or irrelevant links
- Models predicting dependencies between nearly all task pairs
- No logical flow learned from the historical data

I'm pretty sure I am not pre-processing the data correctly, as I'm not sure how to treat the task names so that the model can recognize the recurring patterns.
My Question: Would it make sense to frame this as a graph problem and use Graph Neural Networks (GNNs)? Or is there a better ML or statistical approach for modeling and predicting dependencies between tasks in this kind of scenario?
I'm open to advice on model architecture or data pre-processing strategies that might improve performance. Note that I work on Google Colab Pro and have access to an A100 GPU as well as TPUs.