I have a dataframe with a column 'Links' that contains the URLs to a few thousand online articles. There is one URL for each observation.
import pandas as pd

urls_list = ['http://www.ajc.com/news/world/atlan...',
             'http://www.seattletimes.com/sports/...',
             'https://www.cjr.org/q_and_a/washing...',
             'https://www.washingtonpost.com/grap...',
             'https://www.nytimes.com/2017/09/01/...',
             'http://www.oregonlive.com/silicon-f...']
df = pd.DataFrame(urls_list, columns=['Links'])
I additionally have a dictionary that contains publication names as keys and domain names as values.
urls_dict = {'Atlanta Journal-Constitution': 'ajc.com',
             'The Washington Post': 'washingtonpost.com',
             'The New York Times': 'nytimes.com'}
I'd like to filter the dataframe to keep only those observations where the 'Links' column contains one of the domains in the dictionary values, while at the same time assigning the matching publication name from the dictionary keys to a new column, 'Publication'. What I envisioned is using the code below to create the 'Publication' column, then dropping the None values from that column to filter the dataframe after the fact.
pub_list = []
for row in df['Links']:
    for k, v in urls_dict.items():
        if row.find(v) > -1:
            publication = k
        else:
            publication = None
        pub_list.append(publication)
However, the resulting list pub_list, while appearing to do what I intended, is three times as long as my dataframe. Can someone suggest how to fix the above code? Or, alternatively, suggest a cleaner solution that can both (1) filter the 'Links' column of my dataframe using the dictionary values (domain names) and (2) create a new 'Publication' column from the dictionary keys (publication names)? (Please note that the df is created here with only one column for brevity; the actual file will have many columns, so I need to be able to specify which column to filter on.)
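For reference, the threefold length is consistent with the append running once per dictionary entry (three entries) rather than once per row. Below is a minimal sketch of one possible fix, assuming the first matching domain should win. The helper name `match_publication` is hypothetical, and the URLs are shortened placeholders, not the real data.

```python
import pandas as pd

# Shortened placeholder URLs, for illustration only (not the real data).
df = pd.DataFrame({'Links': [
    'http://www.ajc.com/news/',
    'http://www.seattletimes.com/sports/',
    'https://live.washingtonpost.com/chat/',
    'https://www.nytimes.com/2017/09/01/',
]})

urls_dict = {'Atlanta Journal-Constitution': 'ajc.com',
             'The Washington Post': 'washingtonpost.com',
             'The New York Times': 'nytimes.com'}


def match_publication(url):
    """Hypothetical helper: return the first publication whose domain
    occurs as a substring of url, or None if no domain matches."""
    for name, domain in urls_dict.items():
        if domain in url:
            return name
    return None


# Exactly one entry is produced per row, so the lengths agree.
pub_list = [match_publication(url) for url in df['Links']]

df['Publication'] = pub_list
# Rows with no matching domain hold None and are dropped here.
filtered = df.dropna(subset=['Publication'])
```

Because the function returns as soon as a domain matches, each URL contributes a single value, and only the named 'Links' column is inspected regardless of how many other columns the dataframe has.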
EDIT: I wanted to give some clarification in light of RagingRoosevelt's answer. I'd like to avoid merging, as some of the domains may not be exact matches. For example, with ajc.com I'd also like to be able to capture myajc.com, and with washingtonpost.com I'd want to get sub-domains like live.washingtonpost.com as well. Hence, I was hoping for a "find substring in string" type of solution with str.contains(), find(), or the in operator.
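To make the substring requirement concrete, here is a rough sketch of what such a solution might look like using pandas string methods. It assumes each URL matches at most one of the domains. The names `domain_to_pub` and `pattern` and the shortened URLs are illustrative only.

```python
import re

import pandas as pd

# Shortened placeholder URLs, for illustration only (not the real data).
df = pd.DataFrame({'Links': [
    'http://www.myajc.com/news/',
    'http://www.seattletimes.com/sports/',
    'https://live.washingtonpost.com/chat/',
    'https://www.nytimes.com/2017/09/01/',
]})

urls_dict = {'Atlanta Journal-Constitution': 'ajc.com',
             'The Washington Post': 'washingtonpost.com',
             'The New York Times': 'nytimes.com'}

# Invert the mapping so an extracted domain can be looked up directly.
domain_to_pub = {domain: name for name, domain in urls_dict.items()}

# One capture group matching any of the domains; re.escape keeps the
# dots literal instead of treating them as regex wildcards.
pattern = '(' + '|'.join(re.escape(d) for d in domain_to_pub) + ')'

# Pull the matched domain out of each URL (NaN when nothing matches),
# then translate the domain into its publication name.
df['Publication'] = df['Links'].str.extract(pattern, expand=False).map(domain_to_pub)

# Keep only the rows where a publication was found.
filtered = df.dropna(subset=['Publication'])
```

Since this is plain substring matching, myajc.com and live.washingtonpost.com are both captured, but so would any URL that merely mentions one of the domains elsewhere, for example in its path, so a stricter pattern may be needed for messy data.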