0

I've noticed that data pulled from Google Patents via BigQuery may have duplicate entries for the same values of {publication_number, cpc_code, cpc_inventive, cpc_first}. For example:

Retrieved CPC data showing dupes for {CPC, inventive, primary} for the same patent

The key parts of the BigQuery SQL I'm using to retrieve this data are:

SELECT DISTINCT
    pdb.publication_number,
    ( SELECT STRING_AGG(code, '|') FROM UNNEST(cpc) ) AS cpc_code,
    ( SELECT STRING_AGG(CAST(inventive AS STRING), '|') FROM UNNEST(cpc) ) AS cpc_inventive,
    ( SELECT STRING_AGG(CAST(first AS STRING), '|') FROM UNNEST(cpc) ) AS cpc_first
FROM `patents-public-data.patents.publications` AS pdb

Filtering out the duplicates post-query results in about a 50% reduction in the number of rows; is there any reason not to do this?

0 Answers0