vdk-data-source-git: data source for git POC #2859

Merged
antoniivanov merged 1 commit into main from person/aivanov/git
Nov 16, 2023

@antoniivanov antoniivanov commented Nov 1, 2023

Extracts content from Git repositories along with associated file metadata. See the README for more details.

This is needed because most other data sources (basically all vdk-singer data sources) are really relational data sources (JSON / dictionary). I needed a data source that produces blobs of data so that we can test those scenarios and find out the limitations of our ingestion. Git data from internal Git systems is also a natural data source for fine-tuning certain ML models.
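As a rough illustration of what "file content plus metadata" records could look like, here is a minimal standard-library sketch. It is not the code from this PR: it assumes the repository is already cloned to a local directory, and the names `FileRecord` and `iter_file_records` as well as the chosen metadata fields are hypothetical.

```python
import os
from dataclasses import dataclass
from typing import Iterator


@dataclass
class FileRecord:
    """Hypothetical payload shape: one file's content plus basic metadata."""
    path: str
    size_bytes: int
    extension: str
    content: str


def iter_file_records(checkout_dir: str) -> Iterator[FileRecord]:
    """Yield one record per UTF-8 text file in an already cloned repository."""
    for root, dirs, files in os.walk(checkout_dir):
        # Do not descend into git's internal directory.
        dirs[:] = [d for d in dirs if d != ".git"]
        for name in sorted(files):
            full_path = os.path.join(root, name)
            with open(full_path, "rb") as f:
                raw = f.read()
            try:
                text = raw.decode("utf-8")
            except UnicodeDecodeError:
                # Binary files are skipped in this sketch.
                continue
            yield FileRecord(
                path=os.path.relpath(full_path, checkout_dir),
                size_bytes=len(raw),
                extension=os.path.splitext(name)[1],
                content=text,
            )
```

Each yielded record is a single blob with its metadata, which is the kind of non-relational payload the description refers to.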



@data_source(name="git", config_class=GitDataSourceConfiguration)
class GitDataSource(IDataSource):
Contributor

This datasource will always only have a single stream?

Contributor Author

In this iteration of the data source implementation, yes. All Singer data sources are really relational data sources (JSON / dictionary); I needed a data source that produces blobs of data, so that's why I developed this one: to find out the limitations of our ingestion. Eventually it might make sense to allow users to configure streams (maybe branches, directories, or something else), but not in this first iteration.

Its development status is currently pre-alpha, or maybe alpha, as per https://martin-thoma.com/software-development-stages/

PS: The main limitation I found is that we want the payloads to be JSON serializable, so we don't really accept "bytes" in the payload.
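To make the limitation concrete: Python's standard `json` module raises `TypeError` for `bytes` values, so a blob has to be converted to text before it can go into a payload. A minimal sketch of one possible workaround, assuming base64 is an acceptable encoding for binary content (the helpers `to_json_safe` and `serialization_fails` are hypothetical and not part of VDK):

```python
import base64
import json


def to_json_safe(content: bytes) -> dict:
    """Hypothetical helper: wrap raw bytes in a JSON-serializable payload."""
    try:
        # Text files are stored as plain UTF-8 strings.
        return {"encoding": "utf-8", "content": content.decode("utf-8")}
    except UnicodeDecodeError:
        # Binary blobs fall back to base64-encoded text.
        encoded = base64.b64encode(content).decode("ascii")
        return {"encoding": "base64", "content": encoded}


def serialization_fails(payload: dict) -> bool:
    """Return True if json.dumps rejects the payload with a TypeError."""
    try:
        json.dumps(payload)
        return False
    except TypeError:
        return True
```

With these helpers, `serialization_fails({"content": b"abc"})` is true, while the dictionary returned by `to_json_safe` can always be passed to `json.dumps`. The trade-off is that consumers have to check the `encoding` field and decode base64 content themselves.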

@antoniivanov antoniivanov force-pushed the person/aivanov/git branch 2 times, most recently from b557d5a to bfd97e9 Compare November 3, 2023 12:59
Extracts content from Git repositories along with associated file
metadata. See README for more details
@antoniivanov antoniivanov merged commit 99eed5f into main Nov 16, 2023
@antoniivanov antoniivanov deleted the person/aivanov/git branch November 16, 2023 17:02