How do I get the total GitHub stars for a given repo in BigQuery?
My goal is to track the popularity of my repo over time in BigQuery.
I want to use public BigQuery datasets such as GitHub Archive or the GitHub dataset.
The sample_repos table in the GitHub dataset does not contain snapshots over time, only a single value:
SELECT
  watch_count
FROM
  [bigquery-public-data:github_repos.sample_repos]
WHERE
  repo_name == 'angular/angular'
returns 5318.
GitHub Archive is a timeline of events. I can try to sum them up, but the result does not match the number shown in the GitHub UI, probably because it does not account for unstar actions. Here is the query I used:
SELECT
  COUNT(*)
FROM
  [githubarchive:year.2011],
  [githubarchive:year.2012],
  [githubarchive:year.2013],
  [githubarchive:year.2014],
  [githubarchive:year.2015],
  [githubarchive:year.2016],
  TABLE_DATE_RANGE([githubarchive:day.], TIMESTAMP('2017-01-01'), TIMESTAMP('2017-03-30') )
WHERE
  repo.name == 'angular/angular'
  AND type = "WatchEvent"
returns 24144.
The real value is 21,921.
#standardSQL
SELECT
  COUNT(*) naive_count,
  COUNT(DISTINCT actor.id) unique_by_actor_id,
  COUNT(DISTINCT actor.login) unique_by_actor_login
FROM `githubarchive.month.*`
WHERE repo.name = 'angular/angular'
  AND type = "WatchEvent"
Naive count: some people star, unstar, and then star again. This creates duplicate WatchEvents.
Unique by actor ID count: each person can only star once, so we can count distinct actors (but we don't know whether they later unstarred, so the real total will be lower than this).
Unique by actor login count: some historical months are missing the "actor.id" field. We can look at the "actor.login" field instead (but some people change their logins).
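To make the difference between the three counts concrete, here is a minimal Python sketch that applies the same logic to a handful of made-up events. The `Event` type, the sample rows, and the `count_stars` helper are all hypothetical and only illustrate the idea; they are not part of GitHub Archive or BigQuery.

```python
from collections import namedtuple

# Hypothetical, simplified WatchEvent rows. actor_id is None for
# older months where the actor.id field is missing.
Event = namedtuple("Event", ["actor_id", "actor_login"])

events = [
    Event(1, "alice"),
    Event(1, "alice"),     # alice unstarred and starred again
    Event(2, "bob"),
    Event(2, "robert"),    # same person after a login change
    Event(None, "carol"),  # older event without an actor id
]


def count_stars(rows):
    """Return the three counts computed by the query above."""
    return {
        # COUNT(*): every event, duplicates included
        "naive_count": len(rows),
        # COUNT(DISTINCT actor.id): NULL ids are ignored, as in SQL
        "unique_by_actor_id": len(
            {r.actor_id for r in rows if r.actor_id is not None}
        ),
        # COUNT(DISTINCT actor.login): a renamed user is counted twice
        "unique_by_actor_login": len({r.actor_login for r in rows}),
    }


counts = count_stars(events)
print(counts)
```

With these sample rows the naive count is the largest, the actor ID count misses the event without an ID, and the login count counts the renamed user twice. None of the three can see unstars, which is why each one can still be above the number GitHub shows.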
Alternatively, thanks to the GHTorrent project:
#standardSQL
SELECT COUNT(*) stars
FROM `ghtorrent-bq.ght_2017_01_19.watchers` a
JOIN `ghtorrent-bq.ght_2017_01_19.projects` b
ON a.repo_id=b.id
WHERE url = 'https://api.github.com/repos/angular/angular'
LIMIT 10
This returns 20567 stars as of 2017/01/19.
Related:
- What happens when a project changes its name?
fooobar.com/questions/2400210 / ...
- How do I get updated GHTorrent data before the dataset is updated?