BigQuery: When is GHTorrent updated and how do I get the latest information?

The data in ghtorrent-bq is great for a GitHub snapshot. However, it is unclear when it gets updated and how I can get more up-to-date data.



2 answers


(related to fooobar.com/questions/2400213 / ... )

GHTorrent only publishes periodic snapshots of its data to BigQuery, while GitHub Archive is updated daily (possibly even hourly; I have not verified this).
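One way to check how fresh the GitHub Archive side is would be to ask for the newest event it currently holds. This is only a sketch: it assumes the daily tables live under `githubarchive.day` and that events carry a `created_at` timestamp, neither of which is stated above. The GHTorrent snapshot date, by contrast, can be read straight from the dataset name (for example `ght_2017_01_19`).

```sql
#standardSQL
-- Assumption: daily GitHub Archive tables are named githubarchive.day.YYYYMMDD
-- and each event row has a created_at TIMESTAMP column.
-- Returns the timestamp of the most recent event loaded for 2017.
SELECT MAX(created_at) AS latest_event
FROM `githubarchive.day.2017*`
```

If the result is only an hour or so old, the archive is being loaded hourly; if it lags by up to a day, it is being loaded daily.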



It would be great to have more frequent GHTorrent snapshots (maybe https://twitter.com/gousiosg could help). In the meantime, you can combine both datasets: take the watchers recorded in the GHTorrent snapshot, then add the latest stars from GitHub Archive:

#standardSQL
-- Distinct users who starred angular/angular: GHTorrent snapshot plus 2017 GitHub Archive events
SELECT COUNT(DISTINCT login) c
FROM (
  -- Watchers recorded in the GHTorrent snapshot
  SELECT c.login
  FROM `ghtorrent-bq.ght_2017_01_19.watchers` a
  JOIN `ghtorrent-bq.ght_2017_01_19.projects` b
  ON a.repo_id=b.id
  JOIN `ghtorrent-bq.ght_2017_01_19.users` c
  ON a.user_id=c.id
  WHERE b.url = 'https://api.github.com/repos/angular/angular'
  UNION ALL
  -- Stars seen by GitHub Archive during 2017
  SELECT actor.login
  FROM `githubarchive.month.2017*`
  WHERE repo.name='angular/angular'
  AND type = "WatchEvent"
)

      



In theory, it gets updated every time a new GHTorrent MySQL dump is released. In practice, there are almost always manual tweaks that need to be made to the generated CSVs, because fields like user locations contain a lot of weird text that CSV parsers cannot handle.



http://ghtorrent.org/gcloud.html
