The GitHub Data Connector enables federated SQL queries on various GitHub resources such as files, issues, pull requests, and commits by specifying github as the selector in the from value for the dataset.
fromThe from field specifies the GitHub resource to query. The owner and repository name are extracted from the path (e.g., github:github.com/spiceai/spiceai/issues targets the spiceai/spiceai repository). It supports the following formats:
| Format | Description |
|---|---|
github:github.com/<owner>/<repo>/files/<ref> | Query files from a repository at a specific branch or tag |
github:github.com/<owner>/<repo>/issues | Query issues from a repository |
github:github.com/<owner>/<repo>/pulls | Query pull requests from a repository |
github:github.com/<owner>/<repo>/commits | Query commits from a repository |
github:github.com/<owner>/<repo>/stargazers | Query stargazers from a repository |
github:github.com/<organization>/members | Query members from an organization |
nameThe dataset name. This will be used as the table name within Spice. The dataset name cannot be a reserved keyword.
params| Parameter Name | Description |
|---|---|
github_token | Required. GitHub personal access token to use to connect to the GitHub API. Learn more. |
GitHub Apps provide a secure and scalable way to integrate with GitHub's API, and works well when interacting with one or more GitHub organizations. Learn more.
| Parameter Name | Description |
|---|---|
github_client_id | Required. Specifies the client ID for GitHub App Installation auth mode. |
github_private_key | Required. Specifies the private key for GitHub App Installation auth mode. |
github_installation_id | Required. Specifies the installation ID for GitHub App Installation auth mode. |
The client ID and private key are generated when creating the GitHub app.
Getting the Installation ID
If the app is installed on a GitHub organization:
https://github.com/organizations/<ORG>/settings/installations)https://github.com/organizations/<ORG>/settings/installations/<INSTALLATION_ID>If the app is installed on a GitHub user:
https://github.com/settings/installations/<INSTALLATION_ID>:::note[Limitations]
With GitHub App Installation authentication, the connector's functionality depends on the permissions and scope of the GitHub App. Ensure that the app is installed on the repositories and configured with content, commits, issues and pull permissions to allow the corresponding datasets to work.
:::
| Parameter Name | Description |
|---|---|
github_query_mode | Optional. Specifies whether the connector should use the GitHub search API for improved filter performance. Defaults to auto, possible values of auto or search. |
github_endpoint | Optional. Base URL of the GitHub API. Defaults to https://api.github.com. Override to target a GitHub Enterprise Server instance (e.g., https://github.example.com/api/v3). |
github_include_comments | Optional. Pull-request connector only. Specifies the types of comments to fetch: all, review, discussion, or none. Defaults to none. See Comments Example. |
github_max_comments_fetched |
When using multiple GitHub datasets sharing the same GitHub token or GitHub app credentials, it is possible to exceed GitHub's primary and secondary rate limits. To mitigate this, use the github_concurrent_connections_limit setting under runtime.source_rate_control. This connections limit applies per GitHub token and per GitHub app installation, following GitHub's rate limit policy.
:::warning[Deprecated]
runtime.params.github_max_concurrent_connections is deprecated. Use runtime.source_rate_control.github_concurrent_connections_limit instead.
:::
Example Configuration:
The GitHub connector supports the following HTTP concurrency parameter:
| Parameter Name | Description |
|---|---|
max_concurrent_requests | Maximum number of concurrent HTTP requests to the same upstream origin. Overrides runtime.params.http_max_concurrent_requests. If both are unset, concurrency limiting is disabled. |
The GitHub connector uses its own rate limiter based on GitHub API X-RateLimit-* response headers. Multiple datasets targeting the same GitHub endpoint share this rate limiter.
GitHub queries support a github_query_mode parameter, which can be set to either auto or search for the following types:
auto. Query filters are only pushed down to the GitHub API in search mode.auto. Query filters are only pushed down to the GitHub API in search mode.Commits only supports auto mode. Query with filter push down is only enabled for the committed_date column. committed_date supports exact matches, or greater/less than matches for dates provided in ISO8601 format, like WHERE committed_date > '2024-09-24'.
When set to search, Issues and Pull Requests will use the GitHub Search API for improved filter performance when querying against the columns:
author and state; supports exact matches, or NOT matches. For example, WHERE author = 'peasee' or WHERE author <> 'peasee'.body and title; supports exact matches, or LIKE matches. For example, WHERE body LIKE '%duckdb%'.updated_at, created_at, merged_at and closed_at; supports exact matches, or greater/less than matches with dates provided in ISO8601 format. For example, WHERE created_at > '2024-09-24'.All other filters are supported when github_query_mode is set to search, but cannot be pushed down to the GitHub API for improved performance.
:::warning[Limitations]
:::
:::warning[Limitations]
content column is fetched only when acceleration is enabled.github_query_mode to search is not supported.:::
ref - Required. Specifies the GitHub branch or tag to fetch files from.include - Optional. Specifies a pattern to include specific files. Supports glob patterns. If not specified, all files are included by default.| Column Name | Data Type | Is Nullable |
|---|---|---|
| name | Utf8 | YES |
| path | Utf8 | YES |
| ref | Utf8 | NO |
| size | Int64 | YES |
| sha | Utf8 | YES |
| mode | Utf8 | YES |
| url | Utf8 | YES |
| download_url | Utf8 | YES |
| created_at | Timestamp | YES |
| updated_at | Timestamp | YES |
| content | Utf8 | YES |
created_at and updated_at are present only when github_include_commits is set to true.
:::warning[Limitations]
WHERE created_at > '2024-09-24'.:::
| Column Name | Data Type | Is Nullable |
|---|---|---|
| assignees | List(Utf8) | YES |
| author | Utf8 | YES |
| body | Utf8 | YES |
| closed_at | Timestamp | YES |
| comments | List(Struct) | YES |
| created_at | Timestamp | YES |
| id | Utf8 | YES |
| labels | List(Utf8) | YES |
| milestone_id | Utf8 | YES |
| milestone_title | Utf8 | YES |
| comments_count | Int64 | YES |
| number | Int64 | YES |
| state | Utf8 | YES |
| title | Utf8 | YES |
| updated_at | Timestamp | YES |
| url | Utf8 | YES |
:::warning[Limitations]
WHERE created_at > '2024-09-24'.:::
| Column Name | Data Type | Is Nullable |
|---|---|---|
| additions | Int64 | YES |
| assignees | List(Utf8) | YES |
| author | Utf8 | YES |
| body | Utf8 | YES |
| changed_files | Int64 | YES |
| closed_at | Timestamp | YES |
| comments_count | Int64 | YES |
| commits_count | Int64 | YES |
| created_at | Timestamp | YES |
| deletions | Int64 | YES |
| discussion | List(Struct(body: Utf8, author: Utf8, created_at: Timestamp)) | YES |
| hashes | List(Utf8) | YES |
| id | Utf8 | YES |
| labels | List(Utf8) | YES |
| merged_at | Timestamp | YES |
| number | Int64 | YES |
| review_comments | List(Struct(body: Utf8, author: Utf8, created_at: Timestamp)) | YES |
| reviews_count | Int64 | YES |
| state | Utf8 | YES |
| title | Utf8 | YES |
| updated_at | Timestamp | YES |
| url | Utf8 | YES |
Note: The discussion and review_comments columns are only included in the schema when the github_include_comments parameter is set accordingly.
:::warning[Limitations]
WHERE committed_date > '2024-09-24'.github_query_mode to search is not supported.:::
| Column Name | Data Type | Is Nullable |
|---|---|---|
| additions | Int64 | YES |
| associated_pull_request_number | Int64 | YES |
| author_email | Utf8 | YES |
| author_name | Utf8 | YES |
| changed_files | Int64 | YES |
| committed_date | Timestamp | YES |
| committer_date | Timestamp | YES |
| committer_email | Utf8 | YES |
| committer_name | Utf8 | YES |
| deletions | Int64 | YES |
| id | Utf8 | YES |
| message | Utf8 | YES |
| message_body | Utf8 | YES |
| message_head_line | Utf8 | YES |
| ref | Utf8 | YES |
| sha | Utf8 | YES |
| status | Utf8 | YES |
:::warning[Limitations]
WHERE starred_at > '2024-09-24'.github_query_mode to search is not supported.:::
| Column Name | Data Type | Is Nullable |
|---|---|---|
| starred_at | Timestamp | YES |
| login | Utf8 | YES |
| Utf8 | YES | |
| name | Utf8 | YES |
| company | Utf8 | YES |
| x_username | Utf8 | YES |
| location | Utf8 | YES |
| avatar_url | Utf8 | YES |
| bio | Utf8 | YES |
:::warning[Limitations]
WHERE created_at > '2024-09-24'.github_query_mode to search is not supported.:::
| Column Name | Data Type | Is Nullable |
|---|---|---|
| username | Utf8 | YES |
| name | Utf8 | YES |
| avatar_url | Utf8 | YES |
| url | Utf8 | YES |
| Utf8 | YES | |
| location | Utf8 | YES |
| company | Utf8 | YES |
| created_at | Timestamp | YES |
| bio | Utf8 | YES |
sql> select created_at, username from apache.members order by created_at desc limit 10; +---------------------+-------------------+ | created_at | username | +---------------------+-------------------+ | 2023-10-09T13:14:13 | heliang666s | | 2023-04-14T11:26:44 | cortlepp | | 2023-02-16T08:28:58 | ChengJie1053 | | 2023-02-11T03:51:52 | FinalT | | 2022-11-20T12:12:56 | Yanshuming1 | | 2022-10-10T23:29:29 | bernardodemarco | | 2022-10-07T05:06:37 | coldgust | | 2022-09-06T14:38:44 | No-SilverBullet | | 2022-08-18T13:31:44 | harshithasudhakar | | 2022-07-05T10:44:08 | bearslyricattack | +---------------------+-------------------+ Time: 0.054390375 seconds. 10 rows.
Optional. Pull-request connector only. Maximum number of comments to fetch per review thread (when github_include_comments is set to review or all) or per pull-request discussion (when set to discussion or all). Defaults to 25, and is capped at 75 to protect against GitHub secondary rate limits. |
github_include_commits | Optional. Files connector only. Whether to fetch commit metadata (adds the created_at and updated_at timestamp columns) for each file. Set to true to enable. Defaults to false. |
github_workflow_logs | Optional. Workflow-runs connector only (github.com/<owner>/<repo>/workflows/<workflow_file.yml>/runs). Set to enabled to download and include the workflow run logs for each row. Defaults to disabled. |