The story is a cross-posting from the Data Engineering Weekly. Please subscribe to the Data Engineering Newsletter for the latest update.

www.dataengineeringweekly.com

Welcome to the 30th edition of the data engineering newsletter. This week’s release is a new set of articles that focus on Uber’s schema-agnostic log analytics platform, Google’s open-source model search system, Intuit’s Data Mesh strategy, Salesforce’s secure data intelligence platform, Netflix’s composable data pipeline, BrightWind’s wind analytical data hub, Apache Pinot’s star tree indexing, Squarespace’s A/B testing platform, a Snowflake vs. Redshift comparison, and an overview of the modern analytical stack.

Uber: Fast and Reliable Schema-Agnostic Log Analytics Platform

Elasticsearch provides dynamic schema inference to improve the performance of log indexing. The dynamic type inference often leads to type…


Welcome to the 29th edition of the data engineering newsletter. This week’s release is a new set of articles that focus on Google’s research paper on Data Cascades in High-Stakes AI, Fiddler Labs’ debugging of ML model performance, Monte Carlo’s Data Observability Using SQL, Airbnb’s Superset adoption, Apache Kylin’s Evolution of Precomputation, Spotify’s Sorted Merge Bucket implementation, Doordash’s effective data science communication, Funding Societies’ data governance journey, QueryClick’s self-serve analytics journey, and Databricks Delta Lake 0.8.

Google: "Everyone wants…


Welcome to the 28th edition of the data engineering newsletter. This week’s release is a new set of articles that focus on Google’s ML for computer architecture, Microsoft’s PyTorch vs. TensorFlow, Capital One’s Time travel offline ML evaluation frameworks, Alibaba Cloud’s Data Lake introduction, PayPal’s Next-Gen data movement framework, Apache Pinot’s integration story with Presto, Gradient Flow’s growing importance of Metadata, Metadata Day 2020 overview, Monte Carlo Data’s data pipeline SLA, and TDD with Apache Airflow.

Google: Machine…


Welcome to the fourth edition of the data engineering newsletter. This week’s release is a new set of articles that focus on data orchestration, ML applications, tuning data workload, and Kafka on Kubernetes.

Airflow is a huge step forward over loosely coupled cron jobs for running data pipelines. Dagster, a data-aware, typed, self-describing, logical orchestration graph, takes data orchestration to the next level by focusing on local development, testable code before production, and linking data assets…


I’m excited to read about the GPU-accelerated streaming platform this week. NVIDIA writes about cuStreamz, the first GPU-accelerated streaming data processing library. Written in Python, it is built on top of RAPIDS, the GPU accelerator for data science libraries.

Continuing with GPU-accelerated stream processing, Apache Flink 1.11 introduces a new External Resource Framework, which allows you to request external resources from the underlying resource management systems (e.g., Kubernetes) and accelerate your workload with those resources. …


Apache Pinot is gaining momentum as a real-time OLAP system for data engineering needs. In this blog post, Sapient narrates its experience benchmarking Apache Pinot. An ingestion rate exceeding 120k entries/second on one node is impressive.

Netflix open-sourced Metaflow (metaflow.org) in December 2019. Metaflow follows a layered architecture approach to run the data workload, in contrast to Airflow’s tightly coupled scheduler architecture. In this post, Netflix explains how the scheduler layer integrates with AWS Step Functions.


Privacy is often an afterthought in the data world. There are numerous ways one might betray someone’s privacy, and they are evident in most everyday situations. The New York Times shares its thoughts on data privacy. The post is a good overview of privacy, with useful links and a summary of the steps NYT is taking in marketing and advertising to protect its users’ privacy.

The popularity of microservices makes it more complex to enforce data privacy policies over time. The data…


Welcome to the 27th edition of the data engineering newsletter. This week’s release is a new set of articles that focus on decentralized content moderation, Kafka as a database, Snowflake’s External Table, Dagster 0.10.0, Uber’s real-time data intelligence platform, Dropbox’s Superset adoption, Cloudflare’s data center operations using Airflow, Apache Hudi’s clustering, and Timeline’s data lake.

Martin Kleppmann: Decentralized content moderation

January 2021 has been a happening month, bringing a lot of debate over censorship and content moderation by social media…


Welcome to the 26th edition of the data engineering newsletter. This week’s release is a new set of articles that focus on Unusual Ventures’ data lineage forum, Uber’s metrics standardization journey, Adobe’s Apache Iceberg usage, Databricks’ talk on Lakehouse, Pinterest’s real-time search engine, Intuit’s take on the data lake, Microsoft’s take on cost management, and Grab’s real-time workflow engine.

Unusual Ventures: Unusual Roundtable Takeaways: Data Lineage and its Role in Data Unification

Unusual Ventures writes an excellent…


Welcome to the 25th edition of the data engineering newsletter. This week’s release is a new set of articles that focus on Kleiner Perkins’s future of computing and data infrastructure, LinkedIn’s fast ingestion with Gobblin, Intuit’s data journey, AWS’s PyDeequ, Alibaba Cloud’s Flink infra with 4 billion events per sec, Expedia’s ML deployment pattern, Delta Lake vs. Hudi, handling late-arriving dimensions, entity resolution for big data, Airflow 2.0, and Debezium’s year-in-review 2020.

Kleiner Perkins: Looking ahead to the…

Ananth Packkildurai

I break things for a living.
