A scalable, real-time data pipeline that ingests weather data from public APIs, processes it using Spark Structured Streaming, and stores it in an Iceberg data lake with Nessie version control.
- Real-time ingestion of weather data from multiple locations
- Kafka-based event streaming for reliable data delivery
- Spark Structured Streaming for processing and transformation
- Iceberg tables with Nessie version control for data governance
- MinIO (S3-compatible) storage backend
- AI-assisted querying using Claude 3 for natural language to SQL conversion
- Docker-based deployment for easy local development
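For orientation, the sketch below shows the rough shape of a single weather event as it might flow from Kafka through Spark into the Iceberg table. The field names, values, and topic layout are illustrative assumptions, not the project's actual schema.

```python
# Hypothetical shape of one weather event as published to Kafka.
# Field names are illustrative assumptions, not the project's actual schema.
sample_event = {
    "location": "Berlin, DE",
    "latitude": 52.52,
    "longitude": 13.405,
    "temperature_c": 18.4,
    "wind_speed_ms": 5.1,
    "rainfall_mm": 0.0,
    "observed_at": "2024-05-01T12:00:00Z",  # ISO-8601 UTC timestamp
}
```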
| Component | Purpose |
|---|---|
| Apache Kafka | Distributed event streaming |
| Apache Spark | Stream processing engine |
| Apache Iceberg | Open table format for analytics |
| Project Nessie | Git-like version control for data |
| MinIO | S3-compatible object storage |
| Claude 3 Sonnet | AI for SQL generation |
- Docker and Docker Compose
- Python 3.8+
- Java 11 (for Spark)
- AWS-style S3 credentials for MinIO access (exported as AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY below)
- Weather API token (set as WEATHER_API_TOKEN)
- Anthropic API key (set as ANTHROPIC_KEY) for AI queries
Clone the repository:

```bash
git clone https://github.com/yourusername/weather-data-pipeline.git
```
Start the infrastructure:

```bash
docker compose up -d
```
Set up environment variables:

```bash
export AWS_ACCESS_KEY_ID=minioadmin
export AWS_SECRET_ACCESS_KEY=minioadmin
export AWS_S3_ENDPOINT=http://localhost:9000
export WEATHER_API_TOKEN=your_api_token
export ANTHROPIC_KEY=your_anthropic_key
```
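The scripts are expected to pick these values up from the environment; a minimal sketch of how that might look in Python (the `load_config` helper is hypothetical, the variable names match the exports above):

```python
import os

# Read the pipeline configuration from the environment variables exported above.
# Raises KeyError early if a required variable is missing.
def load_config() -> dict:
    return {
        "s3_endpoint": os.environ["AWS_S3_ENDPOINT"],
        "s3_access_key": os.environ["AWS_ACCESS_KEY_ID"],
        "s3_secret_key": os.environ["AWS_SECRET_ACCESS_KEY"],
        "weather_api_token": os.environ["WEATHER_API_TOKEN"],
        "anthropic_key": os.environ["ANTHROPIC_KEY"],
    }

if __name__ == "__main__":
    print(load_config()["s3_endpoint"])  # e.g. http://localhost:9000
```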
To run the pipeline, start the data producer:

```bash
python main.py
```
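`main.py` itself isn't reproduced here; the sketch below shows one plausible shape for such a producer, assuming the `requests` and `kafka-python` packages, a broker on `localhost:9092`, a topic named `weather-events`, and a placeholder weather API URL.

```python
import json
import os
import time

import requests
from kafka import KafkaProducer  # pip install kafka-python

# Placeholder endpoint; substitute the real weather API the project uses.
WEATHER_API_URL = "https://api.example.com/v1/current"
TOPIC = "weather-events"  # assumed topic name
LOCATIONS = ["Berlin", "San Francisco", "Tokyo"]

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

while True:
    for city in LOCATIONS:
        resp = requests.get(
            WEATHER_API_URL,
            params={"q": city, "token": os.environ["WEATHER_API_TOKEN"]},
            timeout=10,
        )
        resp.raise_for_status()
        producer.send(TOPIC, resp.json())  # one event per location
    producer.flush()
    time.sleep(60)  # poll once per minute
```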
In a separate terminal, start the Spark stream processor:

```bash
spark-submit stream_processor.py
```
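Likewise, the following is only a sketch of what `stream_processor.py` could look like, wiring Spark Structured Streaming to Kafka and writing into an Iceberg table through a Nessie catalog. The catalog name, warehouse bucket, topic, schema, and table name are assumptions, and the Iceberg/Nessie runtime packages passed to `spark-submit` must match your Spark version.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import DoubleType, StringType, StructField, StructType

# Configure an Iceberg catalog named "nessie" backed by the Nessie server and MinIO.
spark = (
    SparkSession.builder.appName("weather-stream-processor")
    .config("spark.sql.catalog.nessie", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.nessie.catalog-impl", "org.apache.iceberg.nessie.NessieCatalog")
    .config("spark.sql.catalog.nessie.uri", "http://localhost:19120/api/v1")
    .config("spark.sql.catalog.nessie.ref", "main")
    .config("spark.sql.catalog.nessie.warehouse", "s3a://warehouse/")  # assumed bucket
    .config("spark.hadoop.fs.s3a.endpoint", "http://localhost:9000")
    .config("spark.hadoop.fs.s3a.path.style.access", "true")
    .getOrCreate()
)

# Assumed event schema; keep it aligned with the producer's payload.
schema = StructType([
    StructField("location", StringType()),
    StructField("temperature_c", DoubleType()),
    StructField("wind_speed_ms", DoubleType()),
    StructField("rainfall_mm", DoubleType()),
    StructField("observed_at", StringType()),
])

# Read raw Kafka records and parse the JSON payload into columns.
events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "weather-events")  # assumed topic name
    .load()
    .select(from_json(col("value").cast("string"), schema).alias("e"))
    .select("e.*")
)

# Append each micro-batch to the Iceberg table on Nessie's "main" branch.
query = (
    events.writeStream.format("iceberg")
    .outputMode("append")
    .option("checkpointLocation", "/tmp/checkpoints/weather")
    .toTable("nessie.weather.observations")  # assumed namespace/table name
)
query.awaitTermination()
```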
To query the data:

```bash
python query_iceberg_table.py
```
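For ad-hoc exploration you could also query the table directly with Spark SQL; a minimal sketch using the same assumed catalog and table names as above (the catalog configuration from the stream-processor sketch is omitted for brevity but would be required):

```python
from pyspark.sql import SparkSession

# Reuse the Nessie/Iceberg catalog configuration shown in the stream-processor
# sketch (omitted here); catalog and table names are assumptions.
spark = SparkSession.builder.appName("weather-query").getOrCreate()

spark.sql("""
    SELECT location,
           round(avg(temperature_c), 1) AS avg_temp_c,
           max(wind_speed_ms)           AS max_wind_ms
    FROM nessie.weather.observations
    GROUP BY location
    ORDER BY avg_temp_c DESC
""").show(truncate=False)
```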
The AI query interface supports natural language questions like:
"Show me the average temperature by country for the last 24 hours""Find locations with rainfall exceeding 10mm in the past hour""What was the maximum wind gust recorded in California yesterday?"
Access these services locally:
- MinIO Console: http://localhost:9001 (minioadmin/minioadmin)
- Spark UI: http://localhost:4040
- Nessie API: http://localhost:19120/api/v1
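A quick reachability check against these endpoints might look like the sketch below (the Nessie `/trees` path lists references and the MinIO `/minio/health/live` path is its liveness probe; both are assumptions about your local setup):

```python
import requests

# Simple reachability checks for the local services; ports match the list above.
checks = {
    "Nessie API": "http://localhost:19120/api/v1/trees",      # lists Nessie references
    "MinIO":      "http://localhost:9000/minio/health/live",  # MinIO liveness probe
    "Spark UI":   "http://localhost:4040",                    # only up while a job runs
}

for name, url in checks.items():
    try:
        status = requests.get(url, timeout=5).status_code
        print(f"{name}: HTTP {status}")
    except requests.ConnectionError:
        print(f"{name}: not reachable")
```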
Contributions are welcome! Please open an issue or submit a pull request.
Apache License 2.0

Note: This is a simplified example. For production use, consider adding:
- Proper error handling and retries
- Monitoring and alerting
- Security configurations
- CI/CD pipelines
