Uber Engineering's Post

Ready to boost your Hadoop Data Lake security on GCP? Our latest blog dives into enabling security for Uber's modernized batch data lake on Google Cloud Storage! Read more: https://lnkd.in/gbUMvrUs #UberEngineering #UberEng #Cloud #GCP

More Relevant Posts
-
Excited to share our latest blog, where we explore enabling enhanced security for the Hadoop Data Lake, with data stored in Google Cloud Storage! #hadoop #datasecurity #GoogleCloud
Enabling Security for Hadoop Data Lake on Google Cloud Storage
uber.com
-
Here's Part 2 of our blog series on how we're modernizing Uber's batch data lake: Enabling Security for Hadoop Data Lake on Google Cloud Storage. [Part 1: https://lnkd.in/esyhMSSg]
Enabling Security for Hadoop Data Lake on Google Cloud Storage
uber.com
-
Cloud Partner Engineer at Google | Data & Analytics Specialist | Thought Leadership | Blogger | ex-AWS | ex-TCS
Uber, with one of the world's largest Hadoop infrastructures (managing an exabyte of data!), is moving its Big Data operations to Google Cloud. Check out this post to learn about the strategy and principles behind the migration. https://lnkd.in/gnPDTJqs #google #googlecloud #dataanalytics #migration
Modernizing Uber’s Batch Data Infrastructure with Google Cloud Platform
uber.com
-
Technical engineer who dreams of a better future. (All views & opinions are my own and don't reflect Google.)
So object storage can do this, too. 😉 GCS on Google Cloud now supports hierarchical namespaces in public preview. This turns the existing flat structure into the familiar tree structure, making additional capabilities and higher performance possible. (A short Python sketch follows after the link below.)

Existing Cloud Storage buckets consist of a flat namespace where all objects are stored in one logical layer of hierarchy. Folders are simulated in the UI and CLI through "/" prefixes, but are not backed by Cloud Storage resources and cannot be explicitly accessed via API. This can lead to performance and consistency issues with applications that expect file-oriented semantics, such as Hadoop/Spark analytics and AI/ML workloads. Much like a traditional file system, a hierarchical namespace organizes the bucket into a "tree"-like structure with folders that can contain other folders and objects.
Understanding new Cloud Storage hierarchical namespace | Google Cloud Blog
cloud.google.com
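To make this concrete, here is a minimal Python sketch of creating an HNS-enabled bucket. It assumes a recent google-cloud-storage release that exposes a hierarchical_namespace_enabled property on Bucket; that property name and the bucket name are assumptions to verify against your client version:

```python
from google.cloud import storage

client = storage.Client()

# "hns-demo-bucket" is a hypothetical, globally-unique name.
bucket = client.bucket("hns-demo-bucket")

# Hierarchical namespace must be set at creation time and requires
# uniform bucket-level access. (Property name assumed from recent
# google-cloud-storage releases -- verify against your version.)
bucket.hierarchical_namespace_enabled = True
bucket.iam_configuration.uniform_bucket_level_access_enabled = True

client.create_bucket(bucket, location="us-central1")

# With HNS, "/"-delimited object names map onto real folder resources
# that can be listed and renamed, instead of simulated prefixes.
bucket.blob("logs/2024/07/app.log").upload_from_string("hello")
```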
-
Sharing my Cloudera Community article discussing the critical role of HBase caching in achieving optimal read performance when implementing HBase on cloud storage. Check it out here: https://lnkd.in/evEG5cDu. #Cloudera #HBase #CloudStorage
wchevreuil
community.cloudera.com
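For context on why caching matters there: when HBase region servers read from cloud object storage, every cache miss pays object-store latency, so deployments typically pair the on-heap block cache with a file-backed BucketCache on local disk or SSD. A hedged hbase-site.xml sketch; the cache path and sizes are illustrative assumptions, not values from the article:

```xml
<configuration>
  <!-- On-heap LRU cache (fraction of regionserver heap) for index,
       bloom, and hot data blocks. -->
  <property>
    <name>hfile.block.cache.size</name>
    <value>0.4</value>
  </property>
  <!-- BucketCache backed by a local file/SSD, so data blocks are served
       locally instead of re-read from cloud storage on every miss. -->
  <property>
    <name>hbase.bucketcache.ioengine</name>
    <value>file:/mnt/cache/bucketcache</value>
  </property>
  <!-- BucketCache capacity in MB. -->
  <property>
    <name>hbase.bucketcache.size</name>
    <value>32768</value>
  </property>
</configuration>
```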
-
Good reference article comparing three cloud NoSQL DB offerings on pricing, limitations, storage, and replication: Azure Cosmos DB vs. AWS DynamoDB vs. GCP Bigtable/Datastore. #nosqldatabases #multicloud
NoSQL databases comparison: Cosmos DB vs DynamoDB vs Cloud Datastore and Bigtable
pluralsight.com
-
Excellent course on Building Batch and Streaming Data Pipelines on GCP.

Dataflow and Dataproc: code-based ETL solutions
1. ETL/ELT workloads can be built with Google Dataflow or Dataproc.
2. Existing on-prem Hadoop workloads can be migrated with a similar MapReduce codebase using Dataproc, Google's managed Hadoop/Spark environment.
3. For new pipelines, Google Dataflow is the preferred managed service.

Data Fusion: no-code, UI-based ETL solution
4. No-code ETL (similar to Informatica/DataStage) using Google Data Fusion, a UI-based ETL tool.
5. Data Fusion also offers data wrangling (basic data cleansing and transformation) capabilities.

Pub/Sub
Google's messaging service for continuous streaming applications. It provides a decoupled Publisher -> Topic -> Subscriber architecture. It is similar to Apache Kafka and supports both push- and pull-based subscriptions with at-least-once delivery.

DAGs (Directed Acyclic Graphs)
6. Cloud Composer is a serverless workflow management tool based on Apache Airflow for creating pipelines as DAGs (see the Airflow sketch below).
7. DAGs are acyclic, i.e. they contain no cycles and move in one direction as a step-by-step process.
8. DAGs support both periodic and event-driven pipelines.
9. DAG workflows can be monitored and used for troubleshooting.

dbt (Data Build Tool)
A framework for organizing SQL and SQL-based transformations so they are easier for analysts to understand. dbt models can be called from Composer; dbt + Apache Airflow is a powerful orchestration combination.

Terraform
HashiCorp's infrastructure-as-code tool, used to automate deployment of the required GCP infrastructure.
Completion Certificate for Building Batch Data Pipelines on Google Cloud
coursera.org
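As a companion to the Cloud Composer notes above, here is a minimal Airflow DAG sketch (Airflow 2.4+ syntax; the DAG id, task names, and commands are placeholders):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# A linear extract -> transform -> load pipeline; Cloud Composer runs
# DAG files like this on a managed Airflow environment.
with DAG(
    dag_id="example_batch_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",  # periodic trigger; sensors enable event-driven runs
    catchup=False,
) as dag:
    extract = BashOperator(task_id="extract", bash_command="echo extracting")
    transform = BashOperator(task_id="transform", bash_command="echo transforming")
    load = BashOperator(task_id="load", bash_command="echo loading")

    # >> declares dependencies, keeping the graph acyclic and linear here.
    extract >> transform >> load
```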
-
🚀 Exciting News from Uber Engineering! 🚀 We're thrilled to announce our partnership with Google Cloud Platform to modernize Uber's batch data infrastructure. Running one of the largest Hadoop installations globally, hosting over 1 exabyte of data, we aim to enhance our big data capabilities and keep pace with Uber's growing demands. 🔹 Migration Strategy: Our initial migration focuses on leveraging GCP's robust IaaS for our data lake, ensuring minimal disruption and maintaining our commitment to efficiency and security. 🔹 Future Plans: This move will not only boost our engineering velocity but also improve cost efficiency and expand data governance. We're setting the stage for a series of innovations and greater productivity across teams. Stay tuned as we share our progress and insights in a series of upcoming blog posts. Here's to making data-driven decisions faster and more accessible than ever! 🌐 #UberEngineering #CloudComputing #DataScience #GoogleCloud #BigData
Did you know that Uber is transitioning to the cloud? Learn more about how we're modernizing our batch data infrastructure with Google Cloud Platform. Read more: https://lnkd.in/gSzJGcwF #UberEng #UberEngineering
Modernizing Uber’s Batch Data Infrastructure with Google Cloud Platform
uber.com
-
Data Engineer @ Lloyds Tech | Big Data | Spark | Hadoop | Python | SQL | Hive | Sqoop | PySpark | NoSQL (HBase, Cassandra) | Git | ETL | Azure | Airflow | Shell Scripting | Blogger: medium.com/@ambansal1014 #DataAman
On-Premise Data Lake vs. Cloud Data Lake:

Let's try to understand the key differences between an on-premise data lake (e.g. HDFS) and a cloud data lake (e.g. Amazon S3):

-> In HDFS (Hadoop Distributed File System), all data is stored in the form of blocks.

-> In Amazon S3, all data is stored as objects, and each object has key features such as a key (unique identifier), a value (the actual data), and metadata (information about the object and its contents).

-> HDFS is not a persistent system, while Amazon S3 is. What do we mean by "persistent"? Example: suppose we have a 4-node cluster and we shut it down. With HDFS we can no longer access the data, because in an on-premise data lake storage and compute are tightly coupled: even if we only need storage, we still pay for the compute, which is expensive. Hence an on-premise data lake is said to be non-persistent. A cloud data lake (Amazon S3, Azure ADLS Gen2) is persistent: storage and compute are separate, decoupled services, so if we only need to store data we pay for storage alone, which is much cheaper. (A minimal boto3 sketch of the object model follows below.) #DataAman #BigData
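To make the object model concrete, here is a minimal boto3 sketch; the bucket name, key, and metadata are hypothetical illustrations of the key/value/metadata structure, not a data-lake setup:

```python
import boto3

s3 = boto3.client("s3")

# Each S3 object = a key (unique identifier within the bucket),
# a value (the bytes), and optional user-defined metadata.
s3.put_object(
    Bucket="my-data-lake",  # hypothetical bucket
    Key="raw/trips/2024/01/01/trips.json",
    Body=b'{"trip_id": 1}',
    Metadata={"source": "batch-ingest"},
)

# The object and its metadata remain available with no cluster running;
# storage is billed and managed independently of compute.
head = s3.head_object(Bucket="my-data-lake", Key="raw/trips/2024/01/01/trips.json")
print(head["Metadata"])  # {'source': 'batch-ingest'}
```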
-
Data Engineer | GCP Certified Professional Data Engineer | Databricks Certified Spark Developer | Cloudera Certified Hadoop & Spark Developer
Just published two articles from my ongoing series focused on Google Cloud Platform's Dataproc service. Part 1 - Introduction to Dataproc https://lnkd.in/gRcpDfi5 Part 2 - Different ways to create Dataproc cluster https://lnkd.in/gZeRvz_d I believe both beginners and experienced GCP users will find these articles valuable. I'd love to hear your thoughts, questions, or feedback on these articles. Stay tuned for the next articles in the series. Meanwhile, happy reading and happy cloud computing! #dataengineering #gcpdataengineer #bigdata #dataproc #cloudcomputing #spark #dataprocessing
Getting Started with Google Cloud Dataproc: A Beginner’s Guide — Part 1
medium.com
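For readers who prefer code to the console, here is a minimal sketch of creating a Dataproc cluster with the Python client library; the project, region, and machine types are placeholder assumptions, and the articles above walk through the options in more depth:

```python
from google.cloud import dataproc_v1

region = "us-central1"  # placeholder region
client = dataproc_v1.ClusterControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

# A small cluster: one master, two workers.
cluster = {
    "project_id": "my-project",  # placeholder project
    "cluster_name": "demo-cluster",
    "config": {
        "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-4"},
        "worker_config": {"num_instances": 2, "machine_type_uri": "n1-standard-4"},
    },
}

# create_cluster returns a long-running operation; result() blocks
# until the cluster is ready.
operation = client.create_cluster(
    request={"project_id": "my-project", "region": region, "cluster": cluster}
)
print(f"Cluster created: {operation.result().cluster_name}")
```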