Let's say I have a dataset of inputs and expected outputs like this:
[
{
"input": "http://localhost/wordpress/wp-includes/blocks/navigation/view.min.js?ver=6.5.3",
"output": ["WordPress 6.5.3"]
},
{
"input": "<meta content=\"max-image-preview:large\" name=\"robots\"/>",
"output": []
},
{
"input": "https://cdnjs.cloudflare.com/ajax/libs/jquery/3.7.1/jquery.min.js?ver=3.7.1",
"output": ["jQuery 3.7.1"]
},
{
"input": "Server: Apache/2.4.56 (Win64) OpenSSL/1.1.1t PHP/8.2.4",
"output": ["Apache 2.4.56", "OpenSSL 1.1.1t", "PHP 8.2.4"]
},
{ "input": "X-Powered-By: PHP/7.4", "output": ["PHP 7.4"] },
...
]
I would like to write a program that extracts (or guesses) which technologies, ideally with their versions, appear in a given input.
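To make the goal concrete, here is a minimal hand-written baseline sketch in Python. The rule table, the regular expressions, and the `detect` function name are my own assumptions for illustration only. In particular, treating `ver=` on a `wp-includes` asset as the WordPress version is a guess that matches my sample but may not hold in general.

```python
import re

# Hypothetical rule table: (canonical name, pattern with one version group).
# These patterns are assumptions that only cover the sample inputs above.
VERSION = r"(\d+(?:\.\d+)*[a-z]?)"
RULES = [
    # Assumption: "ver=" on a wp-includes asset reflects the WordPress version.
    ("WordPress", re.compile(r"/wp-includes/[^\s\"']*[?&]ver=" + VERSION)),
    # Matches path segments such as "jquery/3.7.1" or "jquery-3.7.1".
    ("jQuery", re.compile(r"jquery[/-]" + VERSION, re.IGNORECASE)),
    # "Product/version" tokens as seen in Server / X-Powered-By headers.
    ("Apache", re.compile(r"\bApache/" + VERSION)),
    ("OpenSSL", re.compile(r"\bOpenSSL/" + VERSION)),
    ("PHP", re.compile(r"\bPHP/" + VERSION)),
]


def detect(text):
    """Return labels like "PHP 8.2.4" for every rule that matches text."""
    found = []
    for name, pattern in RULES:
        for match in pattern.finditer(text):
            label = f"{name} {match.group(1)}"
            if label not in found:  # drop duplicates, keep rule order
                found.append(label)
    return found


if __name__ == "__main__":
    print(detect("Server: Apache/2.4.56 (Win64) OpenSSL/1.1.1t PHP/8.2.4"))
    print(detect("X-Powered-By: PHP/7.4"))
```

Rules like these need a new pattern for every technology, which is why I am looking for an approach that generalizes from the dataset instead.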
I have read about multi-label classification, named entity recognition, and fine-tuning an LLM, but I'm still learning and am not sure which approach fits this problem best. Thanks for any advice!