I've noticed that data pulled from Google Patents via BigQuery may have duplicate entries for the same values of {publication_number, cpc_code, cpc_inventive, cpc_first}. For example:
The key parts of the BigQuery SQL I'm using to retrieve this data are:
SELECT DISTINCT
pdb.publication_number,
( SELECT STRING_AGG(code, '|') FROM UNNEST(cpc) ) AS cpc_code,
( SELECT STRING_AGG(CAST(inventive AS STRING), '|') FROM UNNEST(cpc) ) AS cpc_inventive,
( SELECT STRING_AGG(CAST(first AS STRING), '|') FROM UNNEST(cpc) ) AS cpc_first
FROM `patents-public-data.patents.publications` AS pdb
Filtering out the duplicates post-query results in about a 50% reduction in the number of rows; is there any reason not to do this?
