Join us for our upcoming webinar on JSON data modeling for document databases! Discover best practices and techniques to optimize your data structures. Register now to secure your spot. #couchbase #cloud #database #NoSQL #JSON #DBaaS #AI
Dan M. Nacinovich’s Post
More Relevant Posts
-
dbt + Databricks: SQL-based ELT
Using SQL for data transformation is a powerful way to empower an analytics team to create their own optimized data model. However, best practices like version control and data tests are often skipped. dbt is an open-source tool that applies engineering best practices to SQL-based data transformations, giving you more confidence in your ELT pipeline. This talk introduces how dbt helps with SQL-based ELT and offers guidance on using dbt with a Databricks SQL Warehouse, covering patterns that use dbt Cloud with Databricks as the processing engine.
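The talk centers on dbt Cloud; as a point of reference only, here is a hedged sketch (not from the talk) of triggering the same build-and-test cycle with dbt Core's programmatic entry point. It assumes dbt-core 1.5+ and the dbt-databricks adapter are installed and that profiles.yml points at a Databricks SQL Warehouse; the "staging+" selector is hypothetical.

```python
# Hedged sketch: programmatic dbt invocation (dbt Core >= 1.5).
# Assumes a dbt project configured with the dbt-databricks adapter whose
# profiles.yml targets a Databricks SQL Warehouse; "staging" is a hypothetical model group.
from dbt.cli.main import dbtRunner, dbtRunnerResult

runner = dbtRunner()

# "build" runs models, tests, seeds and snapshots in DAG order, so every
# version-controlled SQL transformation is validated by its data tests in one pass.
result: dbtRunnerResult = runner.invoke(["build", "--select", "staging+"])

if not result.success:
    raise SystemExit("dbt build failed; inspect the failing tests before trusting the ELT run")
```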
-
Most analytics systems are strongly typed, which means it's a PITA to handle data that comes from DynamoDB / MongoDB. Sure, things get easier if the code that uses DynamoDB itself uses a schema.
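A minimal sketch of that second point, with a hypothetical table and fields: give the application code an explicit schema (here a dataclass over boto3's loosely typed items) so downstream analytics sees stable types. Assumes AWS credentials and a region are configured.

```python
# Hedged sketch: enforce a schema at the application boundary instead of
# letting untyped DynamoDB items leak into the analytics layer.
# The "orders" table and its fields are hypothetical.
import boto3
from dataclasses import dataclass

@dataclass
class Order:
    order_id: str
    amount: float      # DynamoDB returns numbers as Decimal; coerce explicitly

table = boto3.resource("dynamodb").Table("orders")
items = table.scan()["Items"]            # loosely typed dicts straight from DynamoDB

orders = [Order(order_id=i["order_id"], amount=float(i["amount"])) for i in items]
```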
-
AWS Glue Data Catalog now supports storage optimization of Apache Iceberg tables! Managed table optimization in the Data Catalog now automatically removes data files that are no longer needed. Combined with the catalog's automated compaction feature, these storage optimizations can help you reduce metadata overhead, control storage costs, and improve query performance. https://lnkd.in/gdMrecxe #data #analytics #awscloud #bigdata
The AWS Glue Data Catalog now supports storage optimization of Apache Iceberg tables | Amazon Web Services (aws.amazon.com)
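The managed optimizer automates maintenance you would otherwise schedule yourself. For context, here is a hedged PySpark sketch of the manual equivalent using Iceberg's built-in procedures, assuming a Spark session already configured with the Iceberg runtime and a Glue-backed catalog named glue_catalog, and a hypothetical db.events table.

```python
# Hedged sketch: the kind of Iceberg table maintenance the managed Glue optimizer
# now handles automatically. Catalog, database and table names are illustrative,
# and the Spark session must have the Iceberg runtime and Glue catalog configured.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("iceberg-maintenance").getOrCreate()

# Expire old snapshots so the data files they reference become removable.
spark.sql("""
  CALL glue_catalog.system.expire_snapshots(
    table => 'db.events',
    older_than => TIMESTAMP '2024-10-01 00:00:00')
""")

# Remove data files no longer referenced by any table snapshot.
spark.sql("CALL glue_catalog.system.remove_orphan_files(table => 'db.events')")
```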
-
Staff Data Engineer Advocate @Onehouse.ai | Apache Hudi, Iceberg Contributor | Author of "Engineering Lakehouses"
How does Apache Hudi organize the data and file layout in data lakes?

Lakehouse table formats such as Hudi abstract away the complexity of physical file formats and provide a 'metadata layer' on top of cloud object stores such as S3, GCS, or Azure Blob. This gives you a table-like abstraction with a well-defined schema to query data from.

Let's understand:
- how Hudi organizes physical data files in a distributed file system
- what types of files are used

A Hudi table, broken down into its physical and logical components, looks like this (depicted in the diagram):
✅ Tables are broken down into 'Partitions'
✅ Within each partition, we have 'File Groups'
✅ Each File Group contains multiple versions of a 'File Slice'
✅ File Slice = Base File + Log Files
✅ Base File = contains the main records in a Hudi table (optimized for reads, hence columnar formats like #parquet)
✅ Log File = contains the changes (updates/deletes) for the associated Base File (optimized for writes, hence row formats like #avro)

Concepts: File Groups & File Slices are logical; Partitions are physical.

This way of organizing files in #Apachehudi gives users control to optimize for read and write workloads. Most importantly, it ties back to how Hudi achieves versioning from its commit timeline and supports critical use cases such as incremental processing. Docs link in comments. #dataengineering #softwareengineering
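To make the layout concrete, here is a hedged PySpark sketch (table name, keys, partition column, and S3 path are all illustrative; the Hudi Spark bundle must be on the classpath) that writes a Merge-on-Read table, i.e., the layout where Parquet Base Files are paired with Avro Log Files.

```python
# Hedged sketch: writing a Merge-on-Read Hudi table. The record key, precombine
# field, partition column, table name and S3 path are all illustrative.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hudi-layout-demo").getOrCreate()

df = spark.createDataFrame(
    [("t1", "nyc", 100, 1700000000), ("t2", "sfo", 80, 1700000050)],
    ["trip_id", "city", "fare", "ts"],
)

hudi_options = {
    "hoodie.table.name": "trips",
    "hoodie.datasource.write.recordkey.field": "trip_id",    # drives File Group membership
    "hoodie.datasource.write.partitionpath.field": "city",   # physical partition folders
    "hoodie.datasource.write.precombine.field": "ts",        # picks the latest version of a record
    "hoodie.datasource.write.table.type": "MERGE_ON_READ",   # Base Files (parquet) + Log Files (avro)
}

(df.write.format("hudi")
   .options(**hudi_options)
   .mode("append")
   .save("s3://my-bucket/lake/trips"))
```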
-
Dynamic Java Developer with expertise in Microservices and Agile methodologies, seeking to leverage technical skills and leadership experience in a challenging IT role focused on software development and architecture.
Just finished the course “Data Modeling in MongoDB” by John Cokos! Check it out: https://lnkd.in/dNjSZMQi #mongodb #datamodeling.
Certificate of Completion (linkedin.com)
-
Wrapping up Sunday with a new Azure project! As part of my Azure learning journey, I built a pipeline in Azure Data Factory using the medallion architecture to move data from a Data Lake into Azure SQL Server. Check out the details here: #data #Azure #AzureDataFactory
End-to-End Data Pipeline in Azure with Data Lake and SQL Server (sites.google.com)
-
Data Engineer (ETL, Data Integration) | IBM DataStage | Snowflake | Talend DI | DBT (Data Build Tool)
Completed the "Refactoring SQL for Modularity" course on dbt (Data Build Tool).
Refactoring SQL for Modularity (learn.getdbt.com)
-
Serving Notice Period | Immediate Joiner | L.W.D : 29.11.24 | Data Engineer | DP-203 | Pyspark | SQL | Python | Azure | AWS
Understanding Azure Synapse: Serverless SQL Pool vs. Dedicated SQL Pool

Selecting the appropriate SQL pool in Azure Synapse is vital for optimizing performance and cost for your data workloads. Here's a concise comparison of the two main options (see the sketch below for the serverless pattern in practice):

Serverless SQL Pool:
- No infrastructure management: pay only for the data processed by your queries.
- Best for: querying external data sources (like Azure Data Lake) with a pay-per-query model.
- Cost-effective: ideal for lightweight queries, exploratory analysis, or quick data insights.
- Scaling: scales automatically based on the query's complexity.

Dedicated SQL Pool:
- Provisioned resources: you allocate compute and storage for high-performance workloads.
- Best for: large-scale data warehousing and complex queries requiring consistent performance.
- Performance: optimized for heavy-duty ETL processes and advanced reporting.
- Scaling: scale manually by adjusting Data Warehouse Units (DWUs).

Key decision: choose Serverless for flexible, on-demand querying, or Dedicated for large-scale, high-performance workloads that require persistent storage. Both have their use cases; it's all about aligning your data strategy with your business needs!

What's your go-to Synapse option? Share your thoughts! Follow Swarnali S. #AzureSynapse #DataEngineering #SQLPools #CloudData #DataWarehousing
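A hedged sketch of the serverless, pay-per-query pattern from Python: the workspace name, credentials, and storage path are placeholders, and it assumes pyodbc plus the Microsoft ODBC Driver 18 are installed.

```python
# Hedged sketch: querying Parquet files in the lake via the Synapse serverless
# SQL endpoint. All angle-bracketed values are placeholders, not real resources.
import pyodbc

conn = pyodbc.connect(
    "Driver={ODBC Driver 18 for SQL Server};"
    "Server=<workspace>-ondemand.sql.azuresynapse.net;"   # serverless endpoint naming
    "Database=master;Uid=<sql-user>;Pwd=<password>;Encrypt=yes;"
)

# OPENROWSET reads straight from the data lake; you are billed for data scanned,
# not for a provisioned pool.
query = """
SELECT TOP 10 *
FROM OPENROWSET(
    BULK 'https://<storage-account>.dfs.core.windows.net/<container>/sales/*.parquet',
    FORMAT = 'PARQUET'
) AS sales
"""

for row in conn.cursor().execute(query):
    print(row)
```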
-
Just finished the course “Data Modeling in MongoDB” by John Cokos! Check it out: https://lnkd.in/eye_GnEi #mongodb #datamodeling.
Certificate of Completion
linkedin.com
-
Azure Data Engineer | ADF | PySpark | SQL | PL/SQL | Python | Alteryx | ETL | Power BI | PL-300 | Data Analysis | Automation
💡 𝐌𝐚𝐧𝐚𝐠𝐞𝐝 𝐯𝐬. 𝐄𝐱𝐭𝐞𝐫𝐧𝐚𝐥 𝐓𝐚𝐛𝐥𝐞𝐬 𝐢𝐧 𝐀𝐩𝐚𝐜𝐡𝐞 𝐒𝐩𝐚𝐫𝐤 💡

When working with Apache Spark, understanding the difference between Managed and External tables is key to efficient data management. Here's a quick comparison (a short PySpark sketch follows below):

🔹 𝐌𝐚𝐧𝐚𝐠𝐞𝐝 𝐓𝐚𝐛𝐥𝐞
>> Spark handles both the metadata and the data.
>> Data is stored in a default warehouse directory (/user/hive/warehouse).
>> Dropping a managed table deletes both the data and the metadata.
>> Ideal for temporary or intermediary datasets.

🔹 𝐄𝐱𝐭𝐞𝐫𝐧𝐚𝐥 𝐓𝐚𝐛𝐥𝐞
>> You control the data location, while Spark manages only the metadata.
>> Data is stored outside the default directory, at a custom path you specify (e.g., in HDFS or cloud storage like ADLS).
>> Dropping an external table deletes only the metadata; the data remains intact.
>> Best for persistent datasets that are shared across multiple tools or environments.

Choosing between managed and external tables depends on the level of control you need over your data. Managed tables offer simplicity, while external tables provide more flexibility for complex data architectures.

#Spark #DataEngineering #BigData #DataManagement #ApacheSpark #Pyspark #Data
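A minimal PySpark sketch of the two creation paths; the table names and the ADLS path are illustrative.

```python
# Hedged sketch: the same DataFrame saved as a managed table and as an external
# table. Table names and the ADLS path are illustrative.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("managed-vs-external").getOrCreate()
df = spark.createDataFrame([(1, "A"), (2, "B")], ["order_id", "status"])

# Managed: Spark owns data + metadata; files land under the warehouse directory,
# and DROP TABLE removes both.
df.write.format("parquet").mode("overwrite").saveAsTable("orders_managed")

# External: we choose the location; DROP TABLE removes only the metadata.
(df.write.format("parquet")
   .mode("overwrite")
   .option("path", "abfss://lake@storageaccount.dfs.core.windows.net/bronze/orders")
   .saveAsTable("orders_external"))
```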