4

First of all, I am a complete newbie in regard to data science, and I am not asking for a complete solution but for some guidance on what I should read up on to achieve my task (what algorithms, techniques etc. are used to tackle similar problems).

I have different lists of strings, each containing one or two useful pieces of information I would like to extract. In the example below, I need to extract the brand and model from each line. This is just an example though; eventually I will need a process I can apply to different lists with different contexts. Here's a small sample from a list of 500:

  • 50" Sony KDL 50W756CSAEP Smart LED Full HD
  • 55" Samsung UE55JU6400 Smart LED HD
  • LG 55LF652V 55" SMART 3D FULL HD
  • HITACHI 55HGW69 55'' LED ULTRA SMART WIFI
  • TV 65" SAMSUNG UE65KS7500 4K LED Smart

In my full list I have already manually extracted the brand and model. So what I need now is a way to automate the process for a new list containing more brands and models. I thought I could go about this heuristically but since I am not just doing this for this type of data it won't work well.

So can someone give me some suggestions on a good way to go about it?

Thanks!

kyriakos

4 Answers

1

Maybe you can use Python with dictionaries.

You can keep a group of known words in the dictionary and add to it each time you find new words.

To find a new word (if the pattern remains the same as in your example), you can look for the "brand" (Samsung); the next word will be the model (UE65KS7500).

This is a good resource
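A minimal sketch of this idea, assuming a hand-maintained set of known brands. The names `KNOWN_BRANDS` and `extract_brand_model` are hypothetical and only illustrate the lookup:

```python
# Hypothetical seed set of known brands; add to it as new ones turn up.
KNOWN_BRANDS = {"sony", "samsung", "lg", "hitachi"}


def extract_brand_model(description, brands=KNOWN_BRANDS):
    """Return (brand, next word) for the first known brand, or None."""
    words = description.split()
    for i, word in enumerate(words):
        # Case-insensitive lookup; the word after the brand is taken as the model.
        if word.lower() in brands and i + 1 < len(words):
            return word, words[i + 1]
    return None


print(extract_brand_model('TV 65" SAMSUNG UE65KS7500 4K LED Smart'))
# ('SAMSUNG', 'UE65KS7500')
```

Note that this only takes a single word after the brand, so for the Sony line it would return "KDL" rather than "KDL 50W756CSAEP". Multi-word models would need extra handling.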

xCloudx8
0

I would solve this the following way -

  1. Split the string into words.
  2. Use the regex ^[a-zA-Z0-9]*$ to find the word that consists of alphanumeric characters only. This is the model.
  3. The words before the regex match are part of the brand's name.

Explanation:

  • ^ asserts the position at the start of a line.
  • * is a quantifier: it matches between zero and unlimited times, as many times as possible, giving back as needed (greedy).
  • $ asserts the position at the end of a line.
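The steps above could be sketched roughly as follows. One caveat: on its own, the regex also matches plain words such as "Sony" or "Smart", so this sketch adds an assumption that is not part of the answer, namely that a model code mixes letters and digits. The function names are hypothetical:

```python
import re

# The regex from the answer: the whole word must be ASCII letters and digits.
ALNUM = re.compile(r"^[a-zA-Z0-9]*$")


def looks_like_model(word):
    # Assumption (not in the answer): a model code has both letters and digits.
    return (bool(ALNUM.match(word))
            and any(c.isdigit() for c in word)
            and any(c.isalpha() for c in word))


def split_brand_model(description):
    """Return (words before the model, model), or None if no word matches."""
    words = description.split()
    for i, word in enumerate(words):
        if looks_like_model(word):
            return " ".join(words[:i]), word
    return None


print(split_brand_model('LG 55LF652V 55" SMART 3D FULL HD'))
# ('LG', '55LF652V')
```

For lines that start with the screen size, such as '55" Samsung UE55JU6400 Smart LED HD', the "brand" part comes back as '55" Samsung', so the leading words still need filtering. Feature tokens such as "3D" or "4K" would also pass the check if they appeared before the model.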

Ic3fr0g
0

I like to do things simply first, and then add more complexity if it is needed. I'd start with simply removing the TV features that we don't care about and then returning what's left assuming that the brand always precedes the model. In Python for example:

def get_brand_model(description):
  """
  Returns the brand and model number from a TV description

  >>> get_brand_model('50" Sony KDL 50W756CSAEP Smart LED Full HD')
  ('SONY', 'KDL 50W756CSAEP')

  >>> get_brand_model('55" Samsung UE55JU6400 Smart LED HD')
  ('SAMSUNG', 'UE55JU6400')

  >>> get_brand_model('LG 55LF652V 55" SMART 3D FULL HD')
  ('LG', '55LF652V')

  >>> get_brand_model("HITACHI 55HGW69 55'' LED ULTRA SMART WIFI")
  ('HITACHI', '55HGW69')

  >>> get_brand_model('TV 65" SAMSUNG UE65KS7500 4K LED Smart')
  ('SAMSUNG', 'UE65KS7500')
  """

  def keep(word):
    # Basic filter to remove TV features from the input string
    skip_words = ['3d', '720p', '1080p', 'hd', '4k', 'smart', 'wifi',
                  'led', 'full', 'tv', 'ultra', 'inch']

    # Screen sizes carry an inch mark, e.g. 50" or 55''
    is_measurement = '"' in word or "'" in word

    return word.lower() not in skip_words and not is_measurement

  # split() with no argument also copes with repeated whitespace
  words = [w.upper() for w in description.split() if keep(w)]

  # Return a tuple of (brand, model number)
  return (words[0], ' '.join(words[1:]))


if __name__ == '__main__':
  # Run the examples in the docstring as tests
  import doctest
  doctest.testmod()

This will likely need some tweaking, but the 5 examples from the question all pass when running the included doctests.

jncraton
0

You might also use TF-IDF to check whether the elements you intend to extract, or the elements you could remove, show characteristic word frequencies across the list.
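As a rough illustration, here is a plain, unsmoothed TF-IDF (term frequency times log(N / document frequency)) computed over the five sample lines from the question. It is a sketch only: library implementations often use smoothed variants, and the function names here are hypothetical.

```python
import math
from collections import Counter

SAMPLES = [
    '50" Sony KDL 50W756CSAEP Smart LED Full HD',
    '55" Samsung UE55JU6400 Smart LED HD',
    'LG 55LF652V 55" SMART 3D FULL HD',
    "HITACHI 55HGW69 55'' LED ULTRA SMART WIFI",
    'TV 65" SAMSUNG UE65KS7500 4K LED Smart',
]


def tokenize(line):
    return line.lower().split()


def inverse_document_frequency(lines):
    """Map each token to log(N / number of lines containing it)."""
    doc_freq = Counter()
    for line in lines:
        doc_freq.update(set(tokenize(line)))  # count each token once per line
    return {tok: math.log(len(lines) / n) for tok, n in doc_freq.items()}


def tf_idf(line, idf):
    """Score the tokens of a line that belongs to the corpus used for idf."""
    counts = Counter(tokenize(line))
    total = sum(counts.values())
    return {tok: (n / total) * idf[tok] for tok, n in counts.items()}


idf = inverse_document_frequency(SAMPLES)
scores = tf_idf(SAMPLES[4], idf)
for token, score in sorted(scores.items(), key=lambda kv: kv[1]):
    print(f"{token:12s} {score:.3f}")
```

Feature words that appear in most lines ("smart", "led") score at or near zero, which makes them candidates for removal, while model numbers appear only once and score highest. Brands that occur only once in the list score just as high as model numbers, so this helps separate features from the rest more than it separates brand from model.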

dshefman