I'm trying to clean up data in a PostgreSQL table, where some records have a large number of profanities in the email_address column (the records in question have been entered by agitated users as a result of frustration due to a bug that has since been fixed):
┌───────────────────┐
│ email_address │
├───────────────────┤
│ foo@go.bar.me.net │
│ foo@foo.com │
│ foo@example.com │
│ baz@example.com │
│ barred@qux.com │
└───────────────────┘
Desired query output
I'd like to build a query that annotates each row from the data table with a profanity score, and orders the records by the score, so that a human can go through the annotated data (presented in a web app) and take necessary action:
┌───────────────────┬───────┐
│ email_address │ score │
├───────────────────┼───────┤
│ foo@foo.com │ 18 │
│ foo@go.bar.me.net │ 14 │
│ foo@example.com │ 9 │
│ baz@example.com │ 3 │
│ barred@qux.com │ 0 │
└───────────────────┴───────┘
Attempt #1
The approach I'm taking is to build a list of regular expressions (now I have 2 problems...) and scores, whereby very profane words will contribute a large profanity score if that word is found in the email_address column. My profanities table looks something like this:
┌──────────────────┬───────┐
│ profanity_regexp │ score │
├──────────────────┼───────┤
│ foo │ 9 │
│ bar(?!red) │ 5 │
│ baz │ 3 │
└──────────────────┴───────┘
LATERAL JOIN
I've found that I can use a LATERAL join over the regexp_matches function to extract all profanities from each email_address (but records with no profanities are discarded):
SELECT
data.email_address,
array_agg(matches)
FROM
data,
profanities p,
LATERAL regexp_matches(data.email_address, p.posix_regexp, 'gi') matches
GROUP BY
data.email_address;
This produces the following result:
┌───────────────────┬───────────────────┐
│ email_address │ profanities_found │
├───────────────────┼───────────────────┤
│ foo@foo.com │ {{foo},{foo}} │
│ foo@example.com │ {{foo}} │
│ foo@go.bar.me.net │ {{foo},{bar}} │
│ baz@example.com │ {{baz}} │
└───────────────────┴───────────────────┘
SUB-SELECT
I also figured out how to get an array of profanity score subtotals for each record with this SQL:
SELECT
data.email_address,
array(
SELECT score * (
SELECT COUNT(*)
FROM (SELECT
regexp_matches(data.email_address, p.posix_regexp, 'gi')
) matches
)
FROM profanities p
) prof
from data;
Which correctly yields all rows (including rows without profanities) as such:
┌───────────────────┬──────────┐
│ email_address │ prof │
├───────────────────┼──────────┤
│ foo@go.bar.me.net │ {9,5,0} │
│ foo@foo.com │ {18,0,0} │
│ foo@example.com │ {9,0,0} │
│ baz@example.com │ {0,0,3} │
│ barred@qux.com │ {0,0,0} │
└───────────────────┴──────────┘
Problem
How do I sum the result of a lateral join to get the desired output?
Is there another strategy I can use to get the desired result?
I've posted a live code fiddle for this question at http://sqlfiddle.com/#!17/6685c/4