Join us for our upcoming webinar on JSON data modeling for document databases! Discover best practices and techniques to optimize your data structures. Register now to secure your spot. #couchbase #cloud #database #NoSQL #JSON #DBaaS #AI
Dan M. Nacinovich’s Post
More Relevant Posts
-
dbt + Databricks: SQL-based ELT
Using SQL for data transformation is a powerful way to empower an analytics team to create their own optimized data model. However, best practices like version control and data tests are often skipped. dbt is an open-source tool that applies engineering best practices to SQL-based data transformations, giving you more confidence in your ELT pipeline. This talk introduces how dbt helps with SQL-based ELT and offers guidance on using dbt with a Databricks SQL Warehouse, covering patterns that use dbt Cloud with Databricks as the processing engine.
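The talk centers on dbt Cloud; as a point of reference only, here is a hedged sketch (not from the talk) of triggering the same build-and-test cycle with dbt Core's programmatic entry point. It assumes dbt-core 1.5+ and the dbt-databricks adapter are installed and that profiles.yml points at a Databricks SQL Warehouse; the "staging+" selector is hypothetical.

```python
# Hedged sketch: programmatic dbt invocation (dbt Core >= 1.5).
# Assumes a dbt project configured with the dbt-databricks adapter whose
# profiles.yml targets a Databricks SQL Warehouse; "staging" is a hypothetical model group.
from dbt.cli.main import dbtRunner, dbtRunnerResult

runner = dbtRunner()

# "build" runs models, tests, seeds and snapshots in DAG order, so every
# version-controlled SQL transformation is validated by its data tests in one pass.
result: dbtRunnerResult = runner.invoke(["build", "--select", "staging+"])

if not result.success:
    raise SystemExit("dbt build failed; inspect the failing tests before trusting the ELT run")
```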
-
Most analytics systems are strongly typed, which means it's a PITA to handle data that comes from DynamoDB / MongoDB. Sure, things get easier if the code that uses DynamoDB itself uses a schema.
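A minimal sketch of that second point, with a hypothetical table and fields: give the application code an explicit schema (here a dataclass over boto3's loosely typed items) so downstream analytics sees stable types. Assumes AWS credentials and a region are configured.

```python
# Hedged sketch: enforce a schema at the application boundary instead of
# letting untyped DynamoDB items leak into the analytics layer.
# The "orders" table and its fields are hypothetical.
import boto3
from dataclasses import dataclass

@dataclass
class Order:
    order_id: str
    amount: float      # DynamoDB returns numbers as Decimal; coerce explicitly

table = boto3.resource("dynamodb").Table("orders")
items = table.scan()["Items"]            # loosely typed dicts straight from DynamoDB

orders = [Order(order_id=i["order_id"], amount=float(i["amount"])) for i in items]
```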
-
AWS Glue Data Catalog now supports storage optimization of Apache Iceberg tables! Managed table optimization in the Data Catalog now automatically removes data files that are no longer needed. Combined with the catalog's automated compaction feature, these storage optimizations can help you reduce metadata overhead, control storage costs, and improve query performance. https://lnkd.in/gdMrecxe #data #analytics #awscloud #bigdata
The AWS Glue Data Catalog now supports storage optimization of Apache Iceberg tables | Amazon Web Services (aws.amazon.com)
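The managed optimizer automates maintenance you would otherwise schedule yourself. For context, here is a hedged PySpark sketch of the manual equivalent using Iceberg's built-in procedures, assuming a Spark session already configured with the Iceberg runtime and a Glue-backed catalog named glue_catalog, and a hypothetical db.events table.

```python
# Hedged sketch: the kind of Iceberg table maintenance the managed Glue optimizer
# now handles automatically. Catalog, database and table names are illustrative,
# and the Spark session must have the Iceberg runtime and Glue catalog configured.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("iceberg-maintenance").getOrCreate()

# Expire old snapshots so the data files they reference become removable.
spark.sql("""
  CALL glue_catalog.system.expire_snapshots(
    table => 'db.events',
    older_than => TIMESTAMP '2024-10-01 00:00:00')
""")

# Remove data files no longer referenced by any table snapshot.
spark.sql("CALL glue_catalog.system.remove_orphan_files(table => 'db.events')")
```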
-
Staff Data Engineer Advocate @Onehouse.ai | Apache Hudi, Iceberg Contributor | Author of "Engineering Lakehouses"
How does Apache Hudi organize the data and file layout in data lakes?

Lakehouse table formats such as Hudi abstract away the complexity of physical file formats and provide a 'metadata layer' on top of cloud object stores such as S3, GCS, or Azure Blob. This gives you a table-like abstraction with a well-defined schema to query data from.

Let's understand:
- how Hudi organizes physical data files in a distributed file system
- what types of files are used

A Hudi table, broken down into its physical and logical components, looks like this (depicted in the diagram):
✅ Tables are broken down into 'Partitions'
✅ Within each partition, we have 'File Groups'
✅ Each File Group contains multiple versions of a 'File Slice'
✅ File Slice = Base File + Log Files
✅ Base File = contains the main records in a Hudi table (optimized for reads, hence columnar formats like #parquet)
✅ Log File = contains the changes (updates/deletes) for the associated Base File (optimized for writes, hence row formats like #avro)

Concepts: File Groups & File Slices are logical; Partitions are physical.

This way of organizing files in #Apachehudi gives users control to optimize for read and write workloads. Most importantly, it ties back to how Hudi achieves versioning from its commit timeline and supports critical use cases such as incremental processing. Docs link in comments. #dataengineering #softwareengineering
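To make the layout concrete, here is a hedged PySpark sketch (table name, keys, partition column, and S3 path are all illustrative; the Hudi Spark bundle must be on the classpath) that writes a Merge-on-Read table, i.e., the layout where Parquet Base Files are paired with Avro Log Files.

```python
# Hedged sketch: writing a Merge-on-Read Hudi table. The record key, precombine
# field, partition column, table name and S3 path are all illustrative.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hudi-layout-demo").getOrCreate()

df = spark.createDataFrame(
    [("t1", "nyc", 100, 1700000000), ("t2", "sfo", 80, 1700000050)],
    ["trip_id", "city", "fare", "ts"],
)

hudi_options = {
    "hoodie.table.name": "trips",
    "hoodie.datasource.write.recordkey.field": "trip_id",    # drives File Group membership
    "hoodie.datasource.write.partitionpath.field": "city",   # physical partition folders
    "hoodie.datasource.write.precombine.field": "ts",        # picks the latest version of a record
    "hoodie.datasource.write.table.type": "MERGE_ON_READ",   # Base Files (parquet) + Log Files (avro)
}

(df.write.format("hudi")
   .options(**hudi_options)
   .mode("append")
   .save("s3://my-bucket/lake/trips"))
```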
-
Dynamic Java Developer with expertise in Microservices and Agile methodologies, seeking to leverage technical skills and leadership experience in a challenging IT role focused on software development and architecture.
Just finished the course “Data Modeling in MongoDB” by John Cokos! Check it out: https://lnkd.in/dNjSZMQi #mongodb #datamodeling.
Certificate of Completion (linkedin.com)
-
Wrapping up Sunday with a new Azure project! As part of my Azure learning journey, I built a pipeline in Azure Data Factory using the medallion architecture to move data from a Data Lake into Azure SQL Server. Check out the details here: #data #Azure #AzureDataFactory
End-to-End Data Pipeline in Azure with Data Lake and SQL Server (sites.google.com)
-
Data Engineer (ETL, Data Integration) | IBM DataStage | Snowflake | Talend DI | DBT (Data Build Tool)
Completed the "Refactoring SQL for Modularity" course on dbt (Data Build Tool).
Refactoring SQL for Modularity (learn.getdbt.com)
-
Serving Notice Period | Immediate Joiner | L.W.D : 29.11.24 | Data Engineer | DP-203 | Pyspark | SQL | Python | Azure | AWS
Understanding Azure Synapse: Serverless SQL Pool vs. Dedicated SQL Pool

Selecting the appropriate SQL pool in Azure Synapse is vital for optimizing performance and cost for your data workloads. Here's a concise comparison of the two main options (see the sketch below for the serverless pattern in practice):

Serverless SQL Pool:
- No infrastructure management: pay only for the data processed by your queries.
- Best for: querying external data sources (like Azure Data Lake) with a pay-per-query model.
- Cost-effective: ideal for lightweight queries, exploratory analysis, or quick data insights.
- Scaling: scales automatically based on the query's complexity.

Dedicated SQL Pool:
- Provisioned resources: you allocate compute and storage for high-performance workloads.
- Best for: large-scale data warehousing and complex queries requiring consistent performance.
- Performance: optimized for heavy-duty ETL processes and advanced reporting.
- Scaling: scale manually by adjusting Data Warehouse Units (DWUs).

Key decision: choose Serverless for flexible, on-demand querying, or Dedicated for large-scale, high-performance workloads that require persistent storage. Both have their use cases; it's all about aligning your data strategy with your business needs!

What's your go-to Synapse option? Share your thoughts! Follow Swarnali S. #AzureSynapse #DataEngineering #SQLPools #CloudData #DataWarehousing
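A hedged sketch of the serverless, pay-per-query pattern from Python: the workspace name, credentials, and storage path are placeholders, and it assumes pyodbc plus the Microsoft ODBC Driver 18 are installed.

```python
# Hedged sketch: querying Parquet files in the lake via the Synapse serverless
# SQL endpoint. All angle-bracketed values are placeholders, not real resources.
import pyodbc

conn = pyodbc.connect(
    "Driver={ODBC Driver 18 for SQL Server};"
    "Server=<workspace>-ondemand.sql.azuresynapse.net;"   # serverless endpoint naming
    "Database=master;Uid=<sql-user>;Pwd=<password>;Encrypt=yes;"
)

# OPENROWSET reads straight from the data lake; you are billed for data scanned,
# not for a provisioned pool.
query = """
SELECT TOP 10 *
FROM OPENROWSET(
    BULK 'https://<storage-account>.dfs.core.windows.net/<container>/sales/*.parquet',
    FORMAT = 'PARQUET'
) AS sales
"""

for row in conn.cursor().execute(query):
    print(row)
```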
-
Just finished the course “Data Modeling in MongoDB” by John Cokos! Check it out: https://lnkd.in/eye_GnEi #mongodb #datamodeling.
Certificate of Completion
linkedin.com
-
Azure Data Engineer | ADF | PySpark | SQL | PL/SQL | Python | Alteryx | ETL | Power BI | PL-300 | Data Analysis | Automation
💡 𝐌𝐚𝐧𝐚𝐠𝐞𝐝 𝐯𝐬. 𝐄𝐱𝐭𝐞𝐫𝐧𝐚𝐥 𝐓𝐚𝐛𝐥𝐞𝐬 𝐢𝐧 𝐀𝐩𝐚𝐜𝐡𝐞 𝐒𝐩𝐚𝐫𝐤 💡

When working with Apache Spark, understanding the difference between Managed and External tables is key to efficient data management. Here's a quick comparison (a short PySpark sketch follows below):

🔹 𝐌𝐚𝐧𝐚𝐠𝐞𝐝 𝐓𝐚𝐛𝐥𝐞
>> Spark handles both the metadata and the data.
>> Data is stored in a default warehouse directory (/user/hive/warehouse).
>> Dropping a managed table deletes both the data and the metadata.
>> Ideal for temporary or intermediary datasets.

🔹 𝐄𝐱𝐭𝐞𝐫𝐧𝐚𝐥 𝐓𝐚𝐛𝐥𝐞
>> You control the data location, while Spark manages only the metadata.
>> Data is stored outside the default directory, at a custom path you specify (e.g., in HDFS or cloud storage like ADLS).
>> Dropping an external table deletes only the metadata; the data remains intact.
>> Best for persistent datasets that are shared across multiple tools or environments.

Choosing between managed and external tables depends on the level of control you need over your data. Managed tables offer simplicity, while external tables provide more flexibility for complex data architectures.

#Spark #DataEngineering #BigData #DataManagement #ApacheSpark #Pyspark #Data
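A minimal PySpark sketch of the two creation paths; the table names and the ADLS path are illustrative.

```python
# Hedged sketch: the same DataFrame saved as a managed table and as an external
# table. Table names and the ADLS path are illustrative.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("managed-vs-external").getOrCreate()
df = spark.createDataFrame([(1, "A"), (2, "B")], ["order_id", "status"])

# Managed: Spark owns data + metadata; files land under the warehouse directory,
# and DROP TABLE removes both.
df.write.format("parquet").mode("overwrite").saveAsTable("orders_managed")

# External: we choose the location; DROP TABLE removes only the metadata.
(df.write.format("parquet")
   .mode("overwrite")
   .option("path", "abfss://lake@storageaccount.dfs.core.windows.net/bronze/orders")
   .saveAsTable("orders_external"))
```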