vdk-data-source-git: data source for git POC #2859
projects/vdk-plugins/vdk-data-source-git/src/vdk/plugin/data_source_git/git_source.py
@data_source(name="git", config_class=GitDataSourceConfiguration)
class GitDataSource(IDataSource):
Will this data source always have only a single stream?
At this iteration of the data source implementation, yes. All singer data sources are really relational data sources (JSON / dictionary). I needed a data source that produces blobs of data, so that's why I developed this one: to find out the limitations of our ingestion. Eventually it might make sense to allow users to configure streams (maybe branches, directories, or something else), but not in this first iteration.
Its development status is currently pre-alpha, or maybe alpha, as per https://martin-thoma.com/software-development-stages/
PS: The main limitation I found is that we want the payloads to be JSON-serializable, so we don't really accept "bytes" in the payload.
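To illustrate the limitation, here is a minimal sketch using only the Python standard library. `to_json_safe_payload` and `bytes_payload_is_rejected` are hypothetical helpers written for this example; they are not part of this PR or of VDK. Falling back to base64 for binary files is an assumption about one possible workaround, not what the data source necessarily does.

```python
import base64
import json


def bytes_payload_is_rejected() -> bool:
    """Show that the standard json module refuses raw bytes values."""
    try:
        json.dumps({"content": b"\x89PNG"})
    except TypeError:
        # json.dumps raises TypeError: Object of type bytes is not JSON serializable
        return True
    return False


def to_json_safe_payload(path: str, content: bytes) -> dict:
    """Hypothetical helper: wrap file bytes in a JSON-serializable dict."""
    try:
        # Text files can be stored directly as a string.
        return {
            "path": path,
            "encoding": "utf-8",
            "content": content.decode("utf-8"),
        }
    except UnicodeDecodeError:
        # Binary files are base64-encoded so the payload stays serializable.
        return {
            "path": path,
            "encoding": "base64",
            "content": base64.b64encode(content).decode("ascii"),
        }
```

With a wrapper like this, a text file keeps its content readable in the payload, while a binary file costs roughly a third more space because of the base64 encoding.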
Extracts content from Git repositories along with associated file metadata. See README for more details
This is needed because most other data sources (basically all vdk-singer data sources) are really relational data sources (JSON / dictionary). I needed a data source that produces blobs of data so that we can test those scenarios and find out the limitations of our ingestion. Git data from internal Git systems is also a natural data source for fine-tuning certain ML models.