Why does Google BigQuery repeat CPC codes and Inventive/Primary attributes in patent data?

Asked Feb 18 '25 at 12:23

Active Feb 18 '25 at 12:23

Viewed 15 times

I've noticed that data pulled from Google Patents via BigQuery may have duplicate entries for the same values of {publication_number, cpc_code, cpc_inventive, cpc_first}. For example:

The key parts of the BigQuery SQL I'm using to retrieve this data are:

SELECT DISTINCT
    pdb.publication_number,
    ( SELECT STRING_AGG(code, '|') FROM UNNEST(cpc) ) AS cpc_code,
    ( SELECT STRING_AGG(CAST(inventive AS STRING), '|') FROM UNNEST(cpc) ) AS cpc_inventive,
    ( SELECT STRING_AGG(CAST(first AS STRING), '|') FROM UNNEST(cpc) ) AS cpc_first
FROM `patents-public-data.patents.publications` AS pdb

Filtering out the duplicates post-query results in about a 50% reduction in the number of rows; is there any reason not to do this?

asked Feb 18 '25 at 12:23

Bob Chatham

Why does Google BigQuery repeat CPC codes and Inventive/Primary attributes in patent data?

0 Answers0