Proposal: Manifest-only (YAML) configuration DSL for file-based connectors #901

## Summary

This proposal adds a manifest-only (YAML) configuration DSL for file-based source connectors, analogous to the existing declarative CDK for HTTP API connectors. It would allow file-based connectors (Google Drive, S3, Azure Blob Storage, GCS, SFTP, OneDrive, SharePoint) to be defined entirely in YAML, without custom Python code.

## Motivation

Today, **all 7 file-based connectors are Python connectors** because the file-based source framework (`FileBasedSource` / `AbstractFileBasedStreamReader`) has no declarative equivalent. Each connector must implement Python classes for:

1. **Stream reader** (`AbstractFileBasedStreamReader`) — authentication, file listing, file reading, file upload/download
2. **Source** (`FileBasedSource`) — wiring up the stream reader, spec class, and optional permissions reader
3. **Spec** (`AbstractFileBasedSpec`) — connector-specific config (auth credentials, folder/bucket paths, delivery method)
4. **Permissions reader** (optional, `AbstractFileBasedStreamPermissionsReader`) — ACL/identity loading
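
To make the per-connector burden concrete, the sketch below mirrors the shape of the stream reader contract using stand-in classes rather than the real CDK imports. `RemoteFileStub`, `StreamReaderContract`, and `InMemoryStreamReader` are hypothetical names, and the method signatures are simplified assumptions. Only the three method names (`config` setter, `open_file`, `get_matching_files`) come from this proposal.

```python
import fnmatch
import io
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Dict, Iterable, List, Optional


@dataclass
class RemoteFileStub:
    """Hypothetical stand-in for the CDK's remote file model."""
    uri: str


class StreamReaderContract(ABC):
    """Simplified stand-in for AbstractFileBasedStreamReader (signatures are assumptions)."""

    def __init__(self) -> None:
        self._config: Optional[dict] = None

    @property
    def config(self) -> Optional[dict]:
        return self._config

    @config.setter
    @abstractmethod
    def config(self, value: dict) -> None:
        ...

    @abstractmethod
    def open_file(self, file: RemoteFileStub) -> io.IOBase:
        ...

    @abstractmethod
    def get_matching_files(self, globs: List[str]) -> Iterable[RemoteFileStub]:
        ...


class InMemoryStreamReader(StreamReaderContract):
    """Toy implementation backed by a dict, standing in for a real storage SDK."""

    def __init__(self, blobs: Dict[str, bytes]) -> None:
        super().__init__()
        self._blobs = blobs

    @property
    def config(self) -> Optional[dict]:
        return self._config

    @config.setter
    def config(self, value: dict) -> None:
        self._config = value

    def open_file(self, file: RemoteFileStub) -> io.IOBase:
        return io.BytesIO(self._blobs[file.uri])

    def get_matching_files(self, globs: List[str]) -> Iterable[RemoteFileStub]:
        for uri in sorted(self._blobs):
            if any(fnmatch.fnmatchcase(uri, pattern) for pattern in globs):
                yield RemoteFileStub(uri=uri)
```

Even this toy version needs a class, a property override, and two methods before any storage-specific logic appears, which is the boilerplate a declarative layer would aim to remove.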

By contrast, the HTTP declarative CDK has enabled **hundreds of API connectors** to be manifest-only YAML.

### Sonar/Coral context

Of the 21 Sonar agent connectors, `source-google-drive` is the only file-based connector with a corresponding Airbyte replication source. Being able to define file-based connectors declaratively would reduce the maintenance burden and make it easier to add new file storage integrations.

## First target: Google Drive

`source-google-drive` is the proposed first case because it exercises all major file-based abstractions:

<details><summary>Google Drive Python modules and their responsibilities</summary>

### `source.py` — `SourceGoogleDrive(FileBasedSource)`
- Wires up `SourceGoogleDriveStreamReader`, `SourceGoogleDriveSpec`, and `SourceGoogleDriveStreamPermissionsReader`
- Defines OAuth `AdvancedAuth` spec (consent URL, token URL, scopes, output mappings)

### `stream_reader.py` — `SourceGoogleDriveStreamReader(AbstractFileBasedStreamReader)`
- **Authentication**: OAuth2 or Service Account credentials → `google.oauth2.credentials` / `service_account`
- **File listing**: `files().list()` with recursive folder traversal, pagination (1000 per page), shared drive support, glob matching
- **File reading**: `files().get_media()` for regular files, `files().export_media()` for Google Docs/Sheets/Presentations/Drawings (with MIME type conversion)
- **File upload/download**: `MediaIoBaseDownload` chunked downloads with progress tracking and a 1.5 GB size limit
- **File size**: `files().get()` metadata retrieval
- Custom `GoogleDriveRemoteFile` model with `id`, `original_mime_type`, `view_link`, `drive_id`, `created_at`
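
The listing behaviour described above (recursive traversal plus pagination) can be sketched independently of the Google client. In this sketch, `list_page` is a hypothetical callable standing in for a single `files().list()` request; the `files` / `nextPageToken` response keys and the folder MIME type follow the Drive API v3 response shape. It is an illustration of the control flow, not the connector's actual code.

```python
from typing import Callable, Dict, Iterator, List, Optional

FOLDER_MIME_TYPE = "application/vnd.google-apps.folder"

# Hypothetical signature: (folder_id, page_token) -> one page of results.
ListPage = Callable[[str, Optional[str]], Dict]


def iter_files(list_page: ListPage, root_folder_id: str) -> Iterator[Dict]:
    """Yield every non-folder entry under root_folder_id, following pagination."""
    pending: List[str] = [root_folder_id]
    seen = {root_folder_id}  # guards against visiting a folder twice
    while pending:
        folder_id = pending.pop()
        page_token: Optional[str] = None
        while True:
            page = list_page(folder_id, page_token)
            for entry in page.get("files", []):
                if entry["mimeType"] == FOLDER_MIME_TYPE:
                    if entry["id"] not in seen:
                        seen.add(entry["id"])
                        pending.append(entry["id"])
                else:
                    yield entry
            page_token = page.get("nextPageToken")
            if not page_token:
                break
```

A declarative `file_listing` block would effectively parameterize this loop (root folder, page size, recursion on or off) rather than have each connector reimplement it.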

### `stream_permissions_reader.py` — `SourceGoogleDriveStreamPermissionsReader(AbstractFileBasedStreamPermissionsReader)`
- **File permissions**: `permissions().list()` for per-file ACLs, public access detection
- **Identity groups**: Google Admin Directory API (`users().list()`, `groups().list()`, `members().list()`)
- Separate Google service client for Admin API with specific scopes

### `spec.py` — `SourceGoogleDriveSpec(AbstractFileBasedSpec)`
- `folder_url` config field with URL pattern validation
- Two auth modes: OAuth (`OAuthCredentials`) and Service Account (`ServiceAccountCredentials`)
- Three delivery methods: `DeliverRecords`, `DeliverRawFiles`, `DeliverPermissions`
- Schema customization (removes legacy fields, hides API processing option)

### `utils.py`
- `get_folder_id()` — URL parsing to extract folder ID from Google Drive URL
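
As a rough illustration of this kind of URL parsing, the sketch below extracts the path segment that follows `folders`. The function name `extract_folder_id` is hypothetical, and the assumed URL shape (`https://drive.google.com/drive/folders/<id>`, optionally with a `/u/<n>/` segment or a query string) is an assumption; the connector's own `get_folder_id()` may handle more cases.

```python
from urllib.parse import urlparse


def extract_folder_id(url: str) -> str:
    """Return the path segment that follows 'folders' in a Drive folder URL."""
    parsed = urlparse(url)
    if parsed.scheme != "https" or parsed.netloc != "drive.google.com":
        raise ValueError(f"Not a Google Drive URL: {url!r}")
    # Drop empty segments produced by leading or trailing slashes.
    segments = [segment for segment in parsed.path.split("/") if segment]
    if "folders" not in segments:
        raise ValueError(f"No 'folders' segment in URL: {url!r}")
    index = segments.index("folders")
    if index + 1 >= len(segments):
        raise ValueError(f"No folder ID after 'folders' in URL: {url!r}")
    return segments[index + 1]
```

In a manifest, this step is what the `# -> parsed to folder ID` comment on `root_path` would have to cover, either as a built-in transformation or as a named extractor.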

### `exceptions.py`
- `ErrorFetchingMetadata`, `ErrorDownloadingFile` — custom error types extending `BaseFileBasedSourceError`

</details>

## Proposed YAML DSL shape (strawman)

```yaml
version: "1.0.0"
type: FileBasedSource

spec:
  type: GoogleDriveSpec
  folder_url:
    type: string
    pattern: '^https://drive\.google\.com/.+'
  credentials:
    type: oneOf
    options:
      - type: OAuthCredentials
        fields: [client_id, client_secret, refresh_token]
      - type: ServiceAccountCredentials
        fields: [service_account_info]

stream_reader:
  # Option A: reference a pre-built, connector-specific reader:
  #   type: GoogleDriveStreamReader
  # Option B (shown here): a more generic, SDK-driven reader:
  type: SDKBasedStreamReader
  sdk: google-drive
  authentication:
    type: oneOf
    options:
      - type: oauth2
        credentials_path: "$.credentials"
      - type: service_account
        credentials_path: "$.credentials"
  file_listing:
    api: "drive.files.list"
    root_path: "$.folder_url"  # -> parsed to folder ID
    recursive: true
    page_size: 1000
    shared_drives: true
  file_reading:
    default: "drive.files.get_media"
    exportable_types:
      - mime_type: "application/vnd.google-apps.document"
        export_as: "application/vnd.openxmlformats-officedocument.wordprocessingml.document"
      # ...
```

This is intentionally a rough sketch. The actual design would need to balance generality (supporting S3, Azure, GCS, SFTP, etc.) with specificity (Google Drive API quirks like exportable documents).
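
To show how a runtime might interpret two pieces of the strawman, here is a minimal sketch that works on the manifest as a plain dict. `ReadPlan`, `plan_read`, and `resolve_path` are hypothetical names; the call name `drive.files.export_media` is an assumption modeled on the `drive.files.get_media` string in the strawman; and the path resolver handles only simple `$.a.b` paths, not full JSONPath.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional

# Mirrors the `file_reading` section of the strawman as a plain dict.
FILE_READING: Dict[str, Any] = {
    "default": "drive.files.get_media",
    "exportable_types": [
        {
            "mime_type": "application/vnd.google-apps.document",
            "export_as": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
        },
    ],
}


@dataclass(frozen=True)
class ReadPlan:
    """Which call to make for a file, and the export MIME type if one applies."""
    call: str
    export_as: Optional[str] = None


def plan_read(file_reading: Dict[str, Any], mime_type: str) -> ReadPlan:
    """Pick the export rule matching mime_type, else fall back to the default call."""
    for rule in file_reading.get("exportable_types", []):
        if rule["mime_type"] == mime_type:
            # Assumed call name, by analogy with the strawman's default.
            return ReadPlan(call="drive.files.export_media", export_as=rule["export_as"])
    return ReadPlan(call=file_reading["default"])


def resolve_path(config: Dict[str, Any], path: str) -> Any:
    """Resolve a simple '$.a.b' style path against the connector config."""
    if not path.startswith("$."):
        raise ValueError(f"Unsupported path: {path!r}")
    value: Any = config
    for key in path[2:].split("."):
        value = value[key]
    return value
```

Keeping the export rules as data is what would let the same reader logic serve storage systems that have no export semantics at all: they simply omit `exportable_types`.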

## Key design questions

1. **Scope of abstraction**: Should the DSL abstract over the storage SDK entirely (like `HttpRequester` does for HTTP), or should it be a thinner layer that references pre-built stream reader implementations?
2. **Authentication**: File storage systems use diverse auth mechanisms (OAuth, service accounts, IAM roles, connection strings, SSH keys). How much of this can be declaratively configured?
3. **File type handling**: Google Drive has unique export semantics for Google Docs/Sheets/Presentations. How should storage-specific file handling be expressed?
4. **Permissions**: Some connectors (Google Drive, SharePoint) support permissions/identity streams. Should this be part of the DSL?
5. **Incremental approach**: Should we start with a minimal DSL that covers the common case (list + read files with simple auth) and extend over time?
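
For question 1, the "thinner layer" option could look roughly like a registry that maps a manifest `type` to a pre-built reader class. This is a sketch under assumptions: `READER_REGISTRY`, `register_reader`, `build_stream_reader`, and `GoogleDriveReaderStub` are hypothetical names, and the stub stands in for a real stream reader implementation.

```python
from typing import Any, Callable, Dict

# Hypothetical registry: manifest `type` string -> reader factory.
READER_REGISTRY: Dict[str, Callable[..., Any]] = {}


def register_reader(type_name: str) -> Callable[[type], type]:
    """Class decorator that makes a reader addressable from a manifest."""
    def decorator(cls: type) -> type:
        READER_REGISTRY[type_name] = cls
        return cls
    return decorator


@register_reader("GoogleDriveStreamReader")
class GoogleDriveReaderStub:
    """Stand-in for a pre-built reader; it only records its options."""

    def __init__(self, **options: Any) -> None:
        self.options = options


def build_stream_reader(manifest: Dict[str, Any]) -> Any:
    """Instantiate the reader named by manifest['stream_reader']['type']."""
    section = dict(manifest["stream_reader"])  # copy so the manifest is not mutated
    type_name = section.pop("type")
    try:
        factory = READER_REGISTRY[type_name]
    except KeyError:
        raise ValueError(f"Unknown stream reader type: {type_name!r}") from None
    return factory(**section)
```

The trade-off is that each new storage system still needs a Python reader in the CDK, but individual connectors become pure YAML that selects and configures one.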

## Affected file-based connectors (all currently Python)

| Connector | Key complexity |
|-----------|---------------|
| `source-google-drive` | OAuth + Service Account, exportable docs, permissions, shared drives |
| `source-s3` | IAM auth, bucket listing, multiple file formats |
| `source-azure-blob-storage` | Connection string / SAS auth, container listing |
| `source-gcs` | Service account auth, bucket listing |
| `source-sftp-bulk` | SSH key / password auth, directory traversal |
| `source-microsoft-onedrive` | OAuth, Graph API file listing |
| `source-microsoft-sharepoint` | OAuth, Graph API, site/drive discovery, permissions |

## Related issues

- [#714 — Survey of Manifest-Only Connectors Using Custom Components (Feature Gaps)](https://github.com/airbytehq/airbyte-python-cdk/issues/714) — surveys API connector feature gaps; file-based connectors are a separate category not yet addressed
- [#713 — Custom components use case analysis: HTTP Requests for Configuration Determination](https://github.com/airbytehq/airbyte-python-cdk/issues/713)
- [#837 — Declarative: GitHub App authentication](https://github.com/airbytehq/airbyte-python-cdk/issues/837), [#838 — Declarative: Pattern-based partition routing](https://github.com/airbytehq/airbyte-python-cdk/issues/838), [#835 — Declarative: Multi-token authenticator](https://github.com/airbytehq/airbyte-python-cdk/issues/835) — related declarative CDK feature gaps for API connectors

## CDK classes and modules involved

- `airbyte_cdk.sources.file_based.file_based_source.FileBasedSource` — base class for all file-based sources
- `airbyte_cdk.sources.file_based.file_based_stream_reader.AbstractFileBasedStreamReader` — abstract stream reader (3 abstract methods: `config` setter, `open_file`, `get_matching_files`)
- `airbyte_cdk.sources.file_based.file_based_stream_permissions_reader.AbstractFileBasedStreamPermissionsReader` — abstract permissions reader
- `airbyte_cdk.sources.file_based.config.abstract_file_based_spec.AbstractFileBasedSpec` — abstract config spec

---
*Requested by @aaronsteers (AJ Steers)*
