@coltonpadden @lakeFS Totally, this is a great fundamental layer for versioned storage.
But there are two other aspects where I think explicit versioning helps:
1. Deciding when to re-compute expensive data because the code logic changed
2. Communicating published versions for downstream consumption
Has semantic versioning been applied to data? Here's what I am thinking:
+patch for updates without a schema change.
+minor for additive changes like a new column, or narrowing a type (nullable->non-null).
+major for breaking changes like removing/renaming columns.
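To make the proposal above concrete, here is a minimal Python sketch of the bump rules. It is only an illustration, not an existing tool: `Column`, `classify_change`, and `bump_version` are hypothetical names, and treating a changed dtype or a non-null->nullable widening as major is my own assumption, since the rules above don't cover those cases.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Column:
    """Hypothetical column description: a type name plus nullability."""
    dtype: str
    nullable: bool = True


def classify_change(old, new):
    """Return "major", "minor", or "patch" for a change from schema old to new.

    Both schemas are dicts mapping column name -> Column.
    """
    # A removed column is breaking. A rename shows up as a removal plus an addition.
    if any(name not in new for name in old):
        return "major"
    bump = "patch"  # no schema change unless we find one below
    for name, col in new.items():
        if name not in old:
            bump = "minor"  # additive: new column
            continue
        prev = old[name]
        if col.dtype != prev.dtype:
            return "major"  # assumption: any dtype change is breaking
        if prev.nullable and not col.nullable:
            bump = "minor"  # narrowing: nullable -> non-null
        elif not prev.nullable and col.nullable:
            return "major"  # assumption: widening to nullable is breaking
    return bump


def bump_version(version, bump):
    """Apply a bump to a "MAJOR.MINOR.PATCH" string."""
    major, minor, patch = (int(part) for part in version.split("."))
    if bump == "major":
        return f"{major + 1}.0.0"
    if bump == "minor":
        return f"{major}.{minor + 1}.0"
    return f"{major}.{minor}.{patch + 1}"
```

For example, adding a column to a data set at 1.4.2 would classify as minor and give 1.5.0, while dropping a column would give 2.0.0.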
@FunWithTheCloud Yeah. We are considering lakeFS for S3 versioning and Nessie for catalog versioning.
But we use @dagster to materialize our data sets, and we need to design a per-asset versioning strategy for optimization. It could also help with publishing.
I think semver would work well.
There’s also lakeFS, which versions S3 objects and not just tables like Nessie.
But it requires a hosted gateway between you and S3, which scares me.
lakefs.io
Git-like branching, commits and tags for data lakes are starting to emerge.
We are reaching a point where data lakes of any size can be managed just like version-controlled code.
Make a change, push a PR, preview the change, merge to main if happy.
projectnessie.org
When it comes to your data, quality matters!
Incorrect data can harm a reputation, misdirect resources, and lead to false insights and missed opportunities 🤦‍♀️. Learn how teams today test data validity and accuracy to ensure #DataQuality.
lakefs.io/data-quality-t…
With more than 45 #unicorns, Israel is one of the world's leading startup hubs. Here are some of the best Israeli startups to watch in 2021!
startupstash.com/israeli-startu…
📣 We’re thrilled to announce a new integration between @Minio & lakeFS. MinIO users can now power their storage environment with Git-like operations to easily version data at scale.
Check out our blog to see how easy it is: bit.ly/2LqkFP1
Want to level-up your data lake?
Join us tomorrow (Wednesday, Nov 25) at @BigDataConfEU to learn best practices and principles in data versioning for big data sets.
Grab your seat: bit.ly/3q1YTkP
Finding creative solutions to big data problems is our thing! We’re looking for passionate data enthusiasts who love all things #opensource to join our team.
Open positions:
- Solution Architect
- Developer Advocate
To learn more and apply: bit.ly/3kh8osf
A data development environment contains everything required to build and deploy data-intensive applications.
Learn how easy it is to set up:
bit.ly/2HE7Xue
It's open!
Introducing lakeFS: a powerful open-source platform that delivers resilience and manageability to object-storage-based data lakes.
Check out our new blog, and get started today
lakefs.io/2020/08/03/int…