Summary
Propose adding a manifest-only (YAML) configuration DSL for file-based source connectors, analogous to the existing declarative CDK for HTTP API connectors. This would allow file-based connectors (Google Drive, S3, Azure Blob Storage, GCS, SFTP, OneDrive, SharePoint) to be defined entirely in YAML without custom Python code.
Motivation
Today, all 7 file-based connectors are Python connectors because the file-based source framework (FileBasedSource / AbstractFileBasedStreamReader) has no declarative equivalent. Each connector must implement Python classes for:
- Stream reader (AbstractFileBasedStreamReader): authentication, file listing, file reading, file upload/download
- Source (FileBasedSource): wiring up the stream reader, spec class, and optional permissions reader
- Spec (AbstractFileBasedSpec): connector-specific config (auth credentials, folder/bucket paths, delivery method)
- Permissions reader (optional, AbstractFileBasedStreamPermissionsReader): ACL/identity loading
By contrast, the HTTP declarative CDK has enabled hundreds of API connectors to be manifest-only YAML.
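To make the per-connector burden concrete, here is a minimal sketch of the stream reader responsibility. The class and method names (StreamReaderSketch, set_config, InMemoryStreamReader) are simplified stand-ins invented for illustration. They are not the CDK's real signatures, and an in-memory dict stands in for a storage backend.

```python
import fnmatch
import io
from abc import ABC, abstractmethod
from typing import IO, Any, Dict, Iterable, List, Mapping


class StreamReaderSketch(ABC):
    """Simplified stand-in for a file-based stream reader (not the CDK class)."""

    @abstractmethod
    def set_config(self, config: Mapping[str, Any]) -> None:
        """Receive the validated connector config."""

    @abstractmethod
    def get_matching_files(self, globs: List[str]) -> Iterable[str]:
        """Yield the paths of remote files that match any of the globs."""

    @abstractmethod
    def open_file(self, path: str) -> IO[bytes]:
        """Return a readable binary handle for one remote file."""


class InMemoryStreamReader(StreamReaderSketch):
    """Toy backend: a dict of path -> bytes plays the role of the storage system."""

    def __init__(self, files: Dict[str, bytes]) -> None:
        self._files = files
        self._config: Dict[str, Any] = {}

    def set_config(self, config: Mapping[str, Any]) -> None:
        self._config = dict(config)

    def get_matching_files(self, globs: List[str]) -> Iterable[str]:
        # Sorted for deterministic output; fnmatch's "*" also matches "/".
        for path in sorted(self._files):
            if any(fnmatch.fnmatch(path, pattern) for pattern in globs):
                yield path

    def open_file(self, path: str) -> IO[bytes]:
        return io.BytesIO(self._files[path])
```

Every real connector repeats this shape with its own SDK calls, which is the duplication a declarative layer would aim to remove.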
Sonar/Coral context
Of the 21 Sonar agent connectors, source-google-drive is the only file-based connector with a corresponding Airbyte replication source. Being able to define file-based connectors declaratively would reduce the maintenance burden and make it easier to add new file storage integrations.
First target: Google Drive
source-google-drive is the proposed first case because it exercises all major file-based abstractions:
Google Drive Python modules and their responsibilities
source.py — SourceGoogleDrive(FileBasedSource)
- Wires up SourceGoogleDriveStreamReader, SourceGoogleDriveSpec, and SourceGoogleDriveStreamPermissionsReader
- Defines OAuth AdvancedAuth spec (consent URL, token URL, scopes, output mappings)
stream_reader.py — SourceGoogleDriveStreamReader(AbstractFileBasedStreamReader)
- Authentication: OAuth2 or Service Account credentials → google.oauth2.credentials / service_account
- File listing: files().list() with recursive folder traversal, pagination (1000 per page), shared drive support, glob matching
- File reading: files().get_media() for regular files, files().export_media() for Google Docs/Sheets/Presentations/Drawings (with MIME type conversion)
- File upload/download: MediaIoBaseDownload chunked downloads with progress tracking and a 1.5 GB size limit
- File size: files().get() metadata retrieval
- Custom GoogleDriveRemoteFile model with id, original_mime_type, view_link, drive_id, created_at
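The file-reading split above (direct download versus export) is the kind of rule a manifest would need to express. A minimal sketch of that decision follows. The ReadPlan type, the plan_read helper, and the dotted API labels are hypothetical, and the mapping holds only the one Docs entry that appears in the strawman manifest.

```python
from typing import Dict, NamedTuple, Optional

# Assumed mapping of Google-native MIME types to export formats.
# Only the Google Docs entry comes from the proposal; a real table would be longer.
EXPORTABLE_TYPES: Dict[str, str] = {
    "application/vnd.google-apps.document": (
        "application/vnd.openxmlformats-officedocument.wordprocessingml.document"
    ),
}


class ReadPlan(NamedTuple):
    """Which API call to use and, for exports, the target MIME type."""

    api: str
    export_mime_type: Optional[str]


def plan_read(mime_type: str) -> ReadPlan:
    """Pick export_media for Google-native documents, get_media otherwise."""
    export_as = EXPORTABLE_TYPES.get(mime_type)
    if export_as is not None:
        return ReadPlan("drive.files.export_media", export_as)
    return ReadPlan("drive.files.get_media", None)
```

Keeping the mapping as data rather than branching logic is what would let it move into YAML later.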
stream_permissions_reader.py — SourceGoogleDriveStreamPermissionsReader(AbstractFileBasedStreamPermissionsReader)
- File permissions: permissions().list() for per-file ACLs, public access detection
- Identity groups: Google Admin Directory API (users().list(), groups().list(), members().list())
- Separate Google service client for Admin API with specific scopes
spec.py — SourceGoogleDriveSpec(AbstractFileBasedSpec)
- folder_url config field with URL pattern validation
- Two auth modes: OAuth (OAuthCredentials) and Service Account (ServiceAccountCredentials)
- Three delivery methods: DeliverRecords, DeliverRawFiles, DeliverPermissions
- Schema customization (removes legacy fields, hides API processing option)
utils.py
- get_folder_id(): URL parsing to extract the folder ID from a Google Drive URL
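For illustration, the URL parsing described for get_folder_id() could look roughly like the sketch below. This is not the connector's actual implementation. The function name parse_folder_id and the accepted URL shapes (/drive/folders/<id>, optionally with a /u/<n>/ segment) are assumptions, and the real helper may accept more forms.

```python
import re
from urllib.parse import urlparse

# Assumed path shape: /drive/folders/<id> or /drive/u/<n>/folders/<id>
_FOLDER_PATH = re.compile(r"^/drive/(?:u/\d+/)?folders/([A-Za-z0-9_-]+)/?$")


def parse_folder_id(url: str) -> str:
    """Extract the folder ID from a Google Drive folder URL, or raise ValueError."""
    parsed = urlparse(url)
    if parsed.scheme != "https" or parsed.netloc != "drive.google.com":
        raise ValueError(f"Not a Google Drive URL: {url!r}")
    match = _FOLDER_PATH.match(parsed.path)
    if match is None:
        raise ValueError(f"Could not find a folder ID in: {url!r}")
    return match.group(1)
```

Query strings such as ?usp=sharing are ignored because only the parsed path is matched.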
exceptions.py
- ErrorFetchingMetadata, ErrorDownloadingFile: custom error types extending BaseFileBasedSourceError
Proposed YAML DSL shape (strawman)
version: "1.0.0"
type: FileBasedSource
spec:
  type: GoogleDriveSpec
  folder_url:
    type: string
    pattern: "^https://drive.google.com/.+"
  credentials:
    type: oneOf
    options:
      - type: OAuthCredentials
        fields: [client_id, client_secret, refresh_token]
      - type: ServiceAccountCredentials
        fields: [service_account_info]
stream_reader:
  type: GoogleDriveStreamReader
  # Or, if we want to be more generic (commented out so `type` is not a duplicate key):
  # type: SDKBasedStreamReader
  # sdk: google-drive
  authentication:
    type: oneOf
    options:
      - type: oauth2
        credentials_path: "$.credentials"
      - type: service_account
        credentials_path: "$.credentials"
  file_listing:
    api: "drive.files.list"
    root_path: "$.folder_url"  # -> parsed to folder ID
    recursive: true
    page_size: 1000
    shared_drives: true
  file_reading:
    default: "drive.files.get_media"
    exportable_types:
      - mime_type: "application/vnd.google-apps.document"
        export_as: "application/vnd.openxmlformats-officedocument.wordprocessingml.document"
      # ...
This is intentionally a rough sketch. The actual design would need to balance generality (supporting S3, Azure, GCS, SFTP, etc.) with specificity (Google Drive API quirks like exportable documents).
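The "$."-prefixed values in the strawman ("$.credentials", "$.folder_url") read as references into the user-supplied connector config. A minimal sketch of how such references might be resolved, assuming plain dotted paths only (not full JSONPath), follows. The resolve_config_path helper is hypothetical.

```python
from collections.abc import Mapping
from typing import Any


def resolve_config_path(config: Mapping, path: str) -> Any:
    """Resolve a '$.a.b' style reference against a nested config mapping."""
    if not path.startswith("$."):
        raise ValueError(f"Expected a path starting with '$.', got {path!r}")
    current: Any = config
    for key in path[2:].split("."):
        # Fail loudly rather than returning None for a mistyped reference.
        if not isinstance(current, Mapping) or key not in current:
            raise KeyError(f"{path!r} not found in config (missing {key!r})")
        current = current[key]
    return current
```

For example, with a config of {"credentials": {"client_id": "abc"}}, the path "$.credentials.client_id" resolves to "abc", while a mistyped path raises KeyError instead of silently yielding nothing.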
Key design questions
- Scope of abstraction: Should the DSL abstract over the storage SDK entirely (like HttpRequester does for HTTP), or should it be a thinner layer that references pre-built stream reader implementations?
- Authentication: File storage systems use diverse auth mechanisms (OAuth, service accounts, IAM roles, connection strings, SSH keys). How much of this can be declaratively configured?
- File type handling: Google Drive has unique export semantics for Google Docs/Sheets/Presentations. How should storage-specific file handling be expressed?
- Permissions: Some connectors (Google Drive, SharePoint) support permissions/identity streams. Should this be part of the DSL?
- Incremental approach: Should we start with a minimal DSL that covers the common case (list + read files with simple auth) and extend over time?
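As one way to picture the "thinner layer" option from the first question, the sketch below maps a manifest's stream_reader.type to a pre-built implementation through a registry. Everything here is hypothetical: the registry, the register_reader decorator, and the StubStreamReader placeholder are illustrative and not part of the CDK.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Mapping


@dataclass
class StubStreamReader:
    """Placeholder for a pre-built stream reader implementation."""

    backend: str
    options: Mapping[str, Any]


ReaderFactory = Callable[[Mapping[str, Any]], StubStreamReader]
READER_REGISTRY: Dict[str, ReaderFactory] = {}


def register_reader(type_name: str) -> Callable[[ReaderFactory], ReaderFactory]:
    """Register a factory under the name used in the manifest's `type` field."""

    def decorator(factory: ReaderFactory) -> ReaderFactory:
        READER_REGISTRY[type_name] = factory
        return factory

    return decorator


@register_reader("GoogleDriveStreamReader")
def _google_drive(options: Mapping[str, Any]) -> StubStreamReader:
    return StubStreamReader("google-drive", options)


def build_stream_reader(manifest: Mapping[str, Any]) -> StubStreamReader:
    """Look up the reader named in the manifest and pass it the remaining options."""
    section = manifest["stream_reader"]
    type_name = section["type"]
    factory = READER_REGISTRY.get(type_name)
    if factory is None:
        raise ValueError(f"Unknown stream reader type: {type_name!r}")
    options = {key: value for key, value in section.items() if key != "type"}
    return factory(options)
```

Under this approach the YAML only selects and parameterizes readers. The SDK-specific logic stays in Python, which trades generality for a much smaller DSL surface.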
Affected file-based connectors (all currently Python)
| Connector | Key complexity |
| --- | --- |
| source-google-drive | OAuth + Service Account, exportable docs, permissions, shared drives |
| source-s3 | IAM auth, bucket listing, multiple file formats |
| source-azure-blob-storage | Connection string / SAS auth, container listing |
| source-gcs | Service account auth, bucket listing |
| source-sftp-bulk | SSH key / password auth, directory traversal |
| source-microsoft-onedrive | OAuth, Graph API file listing |
| source-microsoft-sharepoint | OAuth, Graph API, site/drive discovery, permissions |
Related issues
CDK classes and modules involved
airbyte_cdk.sources.file_based.file_based_source.FileBasedSource — base class for all file-based sources
airbyte_cdk.sources.file_based.file_based_stream_reader.AbstractFileBasedStreamReader — abstract stream reader (3 abstract methods: config setter, open_file, get_matching_files)
airbyte_cdk.sources.file_based.file_based_stream_permissions_reader.AbstractFileBasedStreamPermissionsReader — abstract permissions reader
airbyte_cdk.sources.file_based.config.abstract_file_based_spec.AbstractFileBasedSpec — abstract config spec
Requested by Aaron ("AJ") Steers (@aaronsteers) (AJ Steers)