Ascend.io’s Post

𝟭𝟬 𝗣𝘆𝘁𝗵𝗼𝗻 𝗰𝗼𝗺𝗺𝗮𝗻𝗱𝘀 𝗲𝘃𝗲𝗿𝘆 𝗱𝗮𝘁𝗮 𝗲𝗻𝗴𝗶𝗻𝗲𝗲𝗿 𝘀𝗵𝗼𝘂𝗹𝗱 𝗸𝗻𝗼𝘄! 🐍

1️⃣ 𝗽𝗮𝗻𝗱𝗮𝘀.𝗿𝗲𝗮𝗱_𝗰𝘀𝘃()
Load data from CSV files into a DataFrame for easy manipulation.

2️⃣ 𝗱𝗳.𝗵𝗲𝗮𝗱()
Quickly preview the first few rows of your DataFrame to understand its structure.

3️⃣ 𝗱𝗳.𝗱𝗿𝗼𝗽𝗻𝗮()
Remove missing values from your DataFrame to clean the data.

4️⃣ 𝗱𝗳.𝗴𝗿𝗼𝘂𝗽𝗯𝘆()
Group data by specific columns to perform aggregate functions.

5️⃣ 𝗱𝗳.𝘁𝗼_𝘀𝗾𝗹()
Save your DataFrame to a SQL database for persistent storage.

6️⃣ 𝗿𝗲𝗾𝘂𝗲𝘀𝘁𝘀.𝗴𝗲𝘁()
Fetch data from APIs and web services for integration.

7️⃣ 𝗷𝘀𝗼𝗻.𝗹𝗼𝗮𝗱𝘀()
Parse JSON strings to work with nested data structures.

8️⃣ 𝗼𝘀.𝗽𝗮𝘁𝗵.𝗷𝗼𝗶𝗻()
Construct file paths in a platform-independent manner.

9️⃣ 𝗴𝗹𝗼𝗯.𝗴𝗹𝗼𝗯()
Retrieve files matching a specified pattern for batch processing.

🔟 𝗺𝗮𝘁𝗽𝗹𝗼𝘁𝗹𝗶𝗯.𝗽𝘆𝗽𝗹𝗼𝘁.𝗽𝗹𝗼𝘁()
Create simple plots and visualizations to analyze data trends.

What other Python commands do you find essential? Share your thoughts below!

#dataengineering #python #dataleaders
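A minimal sketch tying several of these commands together into one small pipeline. The folder layout ("data/sales_*.csv"), the "region" and "revenue" columns, the SQLite database, and the API URL are illustrative placeholders, not something stated in the post.

import glob
import json
import os
import sqlite3

import matplotlib.pyplot as plt
import pandas as pd
import requests

# os.path.join + glob.glob + pandas.read_csv: load every matching CSV file
csv_paths = glob.glob(os.path.join("data", "sales_*.csv"))
df = pd.concat((pd.read_csv(p) for p in csv_paths), ignore_index=True)

print(df.head())                                  # preview the structure
df = df.dropna()                                  # drop rows with missing values
summary = df.groupby("region")["revenue"].sum()   # aggregate by a column

# persist the cleaned data to a SQL database
conn = sqlite3.connect("sales.db")
df.to_sql("sales", conn, if_exists="replace", index=False)

# fetch JSON from an API and parse it into Python objects
resp = requests.get("https://api.example.com/rates")
rates = json.loads(resp.text)

# quick visualization of the aggregated values
plt.plot(summary.index, summary.values)
plt.show()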
More Relevant Posts
Creating an External Table 📊🛠️

In Databricks, external tables are a handy way to manage metadata without affecting the underlying data. Here's how to create one:

-- Using SQL
USE db_name_default_location;

-- Create a temporary view to define the schema and options
CREATE OR REPLACE TEMPORARY VIEW temp_delays
USING CSV
OPTIONS (
  path = '${da.paths.working_dir}/flights/departuredelays.csv',
  header = "true",
  mode = "FAILFAST" -- Abort file parsing with a RuntimeException if any malformed lines are encountered
);

-- Create the external table and point it to the specified location
CREATE OR REPLACE TABLE external_table
LOCATION 'path/external_table'
AS SELECT * FROM temp_delays;

-- Query the external table
SELECT * FROM external_table;

In Python:

df.write.option("path", "/path/to/empty/directory").saveAsTable("table_name")

With external tables, you retain the flexibility to manage metadata independently from your underlying data. 🚀💡

#Databricks #ExternalTables #DataManagement
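A small follow-up sketch in PySpark, assuming an active spark session and the external_table created above: DESCRIBE EXTENDED shows whether Spark registered the table as external and where its files live, which backs up the point about metadata being independent of the data.

# Inspect the catalog metadata for the external table created above;
# the "Type" row should read EXTERNAL and "Location" should point to the
# path supplied in the CREATE TABLE statement.
spark.sql("DESCRIBE EXTENDED external_table").show(truncate=False)

# Dropping an external table only removes the metadata entry; the data
# files at the external location are left untouched.
spark.sql("DROP TABLE external_table")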
Data Engineer @ Wells Fargo || PySpark, Alteryx, AWS, Stored Procedures, Hadoop, Python, SQL, Airflow, Kafka, Iceberg, Delta Lake, Hive, BFSI, Telecom
𝗗𝗮𝘁𝗮 𝗘𝗻𝗴𝗶𝗻𝗲𝗲𝗿𝗶𝗻𝗴 𝗝𝗼𝘂𝗿𝗻𝗲𝘆: 📌 𝗗𝗔𝗬 𝟴𝟮/𝟵𝟬

📢 𝗗𝗮𝘁𝗮𝗘𝗻𝗴𝗶𝗻𝗲𝗲𝗿𝗶𝗻𝗴 -- 𝗣𝘆𝘁𝗵𝗼𝗻

🚩𝗤𝘂𝗲𝘀𝘁𝗶𝗼𝗻 - Read two txt files, merge them, and sort the values in ascending order (keeping the order stable).

📌𝗔𝗽𝗽𝗿𝗼𝗮𝗰𝗵-𝟭 - 𝗣𝘆𝘁𝗵𝗼𝗻
-----------------------
import pandas as pd
import re

# Read the text files into DataFrames
df_sales = pd.read_table("Sales.txt", header=1, sep=",")
df_Dept = pd.read_table("Dept.txt", header=1, sep=",")

# Combine the column labels of both DataFrames into a single list
merge_list = list(df_sales) + list(df_Dept)

# Split each item on spaces or commas, then flatten into one list of values
split_list = [re.split(r'[ ,]', item) for item in merge_list]
split_list = [elem for sublist in split_list for elem in sublist]

# Sort the values in ascending order (sorted() is stable)
sorted_list = sorted(split_list)

#SQL #DataEngineering #python #TechTips #dataengineer #Pyspark #Pysparkinterview #Bigdata #BigDataengineer #interview #sparkdeveloper #sparkbyexample #ApacheKafka #Kafka #BigData #StreamingData #DataArchitecture #EventDriven #RealTimeAnalytics #DataIntegration #DataPipeline
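An alternative sketch using plain file I/O rather than pandas, since read_table with header=1 works on column labels rather than data rows. It assumes Sales.txt and Dept.txt hold comma- or space-separated values, which is an assumption about the files rather than something the post specifies.

import re

values = []
for path in ("Sales.txt", "Dept.txt"):
    with open(path) as f:
        for line in f:
            # split each line on spaces or commas and keep non-empty tokens
            values.extend(tok for tok in re.split(r"[ ,]+", line.strip()) if tok)

# ascending sort; sorted() is stable, so equal values keep their original order
merged_sorted = sorted(values)
print(merged_sorted)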
Data Engineer at EY | EX - TCS | Spark | PySpark | Hadoop | Sqoop | Hive | AWS | CI/CD | Airflow | Kafka | Python | 3 x Cloud Computing Certified | MySQL | Power BI
Ways of creating a DataFrame in PySpark:

✅ Creating a DataFrame using the Spark reader API:
df = spark.read.format("csv").option("header", "true").load("file/path")

✅ Creating a DataFrame using Spark SQL:
df = spark.sql("select * from table")

✅ Creating a DataFrame from an existing table:
df = spark.table("tablename")

✅ Creating a DataFrame using the "range" function:
df = spark.range(1, 10, 2)  # Creates a single-column DataFrame, similar to Python's range function

✅ Creating a DataFrame from a local list (column names or a schema must be supplied, since plain Python values carry no column information):
df = spark.createDataFrame([("Alice", 34), ("Bob", 45)], ["name", "age"])

#bigdata #bigdataengineer #dataengineers #bigdatasolutions
Helping people become Data Analytics Engineers and SAP Consultants || Data Engineering expert and Trainer || SAP BODS and SAP CPIDS || Microsoft Azure Data Engineer || Microsoft Fabric
Data analysis with Python on a sample data set, with code. Here is a sample example:

▶ Select one column from a DataFrame

import pandas as pd

# Load data from a CSV file
data = pd.read_csv("grocery_items.csv")

# Select a single column using its name and store it in a new variable
prices = data["Price"]

# Display the first 5 entries of the 'prices' Series
print(prices.head())

# This code retrieves the "Price" column as a pandas Series named "prices"
# and displays the first 5 entries.

------ output ------
0    2.50
1    3.00
2    1.99
3    1.49
4    0.79
Name: Price, dtype: float64
--------------------

▶ Select multiple columns from a DataFrame
▶ Add a new column to a DataFrame

For the above, here is the link: https://lnkd.in/giPPwQiw

Follow me: ineed -Tech

#dataengineering #datascience #coder #code #databricks #computerscience #techno #machinelearning #linux #pythonprogramming #dataengineer #sparksql #pyspark #software #webdevelopment #webdeveloper #dataanalysis
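A short sketch of the two follow-up items above, using the same grocery_items.csv file; the "Item" and "Quantity" column names are assumptions for illustration only.

import pandas as pd

data = pd.read_csv("grocery_items.csv")

# Select multiple columns by passing a list of column names; the result is a DataFrame
subset = data[["Item", "Price"]]
print(subset.head())

# Add a new column computed from existing columns
data["Total"] = data["Price"] * data["Quantity"]
print(data.head())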
I’m happy to share that I’ve obtained a new certification: Build a Data Warehouse Using BigQuery from Starweaver!

What I learned:
o Creating efficient data warehouses and tables using Google BigQuery's UI and SQL.
o Building dynamic ETL pipelines with Python, incorporating partitioning and clustering strategies.
o Performing advanced queries, including aggregate and window functions, for insightful data analysis.

#BigQuery #DataWarehousing #DataAnalysis #ETLPipelines #Python #SQL
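For readers curious what partitioning and clustering look like in practice, here is a minimal sketch using the google-cloud-bigquery Python client; the project, dataset, table, and column names are placeholders and are not taken from the course.

from google.cloud import bigquery

client = bigquery.Client()  # uses your default GCP credentials

# Placeholder schema for an orders table
schema = [
    bigquery.SchemaField("order_id", "STRING"),
    bigquery.SchemaField("customer_id", "STRING"),
    bigquery.SchemaField("order_date", "DATE"),
    bigquery.SchemaField("amount", "NUMERIC"),
]

table = bigquery.Table("my-project.sales_dw.orders", schema=schema)

# Partition by the date column and cluster by customer for cheaper, faster scans
table.time_partitioning = bigquery.TimePartitioning(field="order_date")
table.clustering_fields = ["customer_id"]

table = client.create_table(table)
print("Created", table.full_table_id)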
In PySpark, getting your data into a DataFrame as early in your ETL pipeline as possible should be a priority. Why? It is then a very easy step to create a temp view on top of your DataFrame, which opens up your data to the rich set of PySpark SQL operations available. You can stick to pure DataFrame operations if you want, but most folks are more comfortable with SQL.

Let's say your data is in a Python list. We can use createDataFrame on it to get it into ... well, a DataFrame. If the data types of your list members are identical, it's easy:

from pyspark.sql.types import *

myList = [1, 2, 3, 4]

# Assuming your Spark session is already set up
myDf = spark.createDataFrame(myList, IntegerType())

# From here you can easily create a temp view
myDf.createOrReplaceTempView("nums")

# Now use SQL on your temp view
spark.sql("select * from nums").show()

+-----+
|value|
+-----+
|    1|
|    2|
|    3|
|    4|
+-----+

#PySpark #DataEngineering #DataScience #Python
To define a schema using DDL (Data Definition Language) in PySpark when reading a DataFrame, you can use the schema method of the DataFrameReader class. This lets you specify the schema either as a StructType object or a DDL-formatted string.

🚀 Example: Using DDL to Define a Schema When Reading a CSV File

# Define the schema using DDL
ddl_schema = "`name` STRING, `age` INT, `city` STRING"

# Read the CSV file with the defined schema
df = spark.read.schema(ddl_schema).csv("path/to/your/csvfile.csv")

# Show the DataFrame
df.show()

🔑 Key Points:
📝 The ddl_schema string defines the schema with three columns: name (STRING), age (INT), and city (STRING).
🎯 The schema method is used to apply this schema when reading the CSV file.

This approach is super handy for quickly defining schemas without needing to dive into more verbose StructType and StructField objects. 💡

#DataEngineering #PySpark #BigData #ETL #ApacheSpark #TechTips #LinkedInLearning #DataScience #Python #Automation
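For comparison, here is a sketch of the same schema spelled out with StructType and StructField, using the same assumed columns and file path as above.

from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Equivalent of the DDL string "`name` STRING, `age` INT, `city` STRING"
struct_schema = StructType([
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True),
    StructField("city", StringType(), True),
])

df = spark.read.schema(struct_schema).csv("path/to/your/csvfile.csv")
df.show()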
Cloud Data Engineer (AWS Snowflake) || Mentoring data professionals in learning and acing technical job interviews || Verified Mentor on @topmate.io
▶ What is a Snowflake Event Table?

Event tables are a special type of Snowflake table with several differences compared to standard tables:
- a predefined set of columns which can't be modified
- used only for tracking logging and tracing data
- you can have only one active event table associated with an account

Typical use cases for an event table are capturing logging information from your code handlers used in stored procedures and UDFs, or collecting tracing data from native apps.

Setting Up and Using an Event Table for Logging in Snowflake

1. Create the Event Table
#SQL_Query
CREATE EVENT TABLE my_db.logging.my_event_table;

2. Assign the Event Table to the Account
#SQL_Query
ALTER ACCOUNT SET EVENT_TABLE = my_db.logging.my_event_table;

3. Start Capturing Log Events
Add logging code to UDFs/UDTFs or stored procedures using native logging libraries.

Example in Python:

import logging
logger = logging.getLogger("my_logger")

def run():
    logger.info("Processing start")
    ...
    logger.error("Some error in your code")
    return value

Example in SQL:
#SQL_Query
create or replace procedure returning_table()
returns table(id number, name varchar)
language sql
as
declare
  result RESULTSET DEFAULT (SELECT 1 id, 'test value' name);
begin
  SYSTEM$LOG('info', 'Returning a table');
  return table(result);
end;

4. Query the Event Table
Each logged message includes a timestamp, scope, severity level, and log message. Query the event table to review logged events.

Follow Avinash S. for more.
Get Interview Resources: https://lnkd.in/dCcD-kBt
Get SQL, Python, Snowflake (25% OFF): https://lnkd.in/dRehzivv
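One possible way to run step 4 from Python is the Snowflake Python connector. In this sketch the connection parameters are placeholders, and the TIMESTAMP, RECORD, and VALUE columns are assumed from the event table's predefined schema.

import snowflake.connector

# Placeholder connection parameters
conn = snowflake.connector.connect(
    account="my_account",
    user="my_user",
    password="***",
    warehouse="my_wh",
)

cur = conn.cursor()
# Pull the most recent log events; RECORD holds the severity, VALUE the message
cur.execute("""
    SELECT timestamp, record['severity_text'] AS severity, value AS message
    FROM my_db.logging.my_event_table
    ORDER BY timestamp DESC
    LIMIT 100
""")
for row in cur.fetchall():
    print(row)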
Hello LinkedIn family,

I performed a sales data cleaning process with the pandas library, following a YouTube tutorial. I carried out the following tasks during the data cleaning process:

1. Import the pandas library.
2. List all CSV files and merge them into a single CSV file.
3. Create a DataFrame and save a copy of the original file.
4. Drop NaN values from the file.
5. Remove wrong data from the order date column.
6. Add a new month column to the DataFrame.
7. Convert data types.
8. Create a new city column.
9. Rename columns and convert them to lower case.

#dataanalysis #dataengineer #datascience #powerbi #sql #python
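A sketch of what these steps can look like in pandas; the folder name and the "Order Date", "Price Each", and "Purchase Address" columns are assumptions about the sales files, not details from the post.

import glob
import os
import pandas as pd

# 2) list all CSV files and merge them into one DataFrame / file
files = glob.glob(os.path.join("sales_data", "*.csv"))
df = pd.concat((pd.read_csv(f) for f in files), ignore_index=True)
df.to_csv("all_sales.csv", index=False)

# 3) keep a copy of the merged data before cleaning
raw = df.copy()

# 4) drop rows that are entirely NaN
df = df.dropna(how="all")

# 5) remove malformed rows where the header row was repeated in the data
df = df[df["Order Date"] != "Order Date"]

# 6) add a month column, 7) convert data types
df["Order Date"] = pd.to_datetime(df["Order Date"], errors="coerce")
df["Price Each"] = pd.to_numeric(df["Price Each"], errors="coerce")
df["month"] = df["Order Date"].dt.month

# 8) derive a city column from the address
df["city"] = df["Purchase Address"].str.split(",").str[1].str.strip()

# 9) rename columns to lower case with underscores
df.columns = df.columns.str.lower().str.replace(" ", "_")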
Site Reliability Engineer @JLR | DevOps Engineer | Machine Learning | Interested in ML Ops and Cloud infrastructure | MSc in Computer Science | University College Dublin
🎓 Just completed a certification on ETL pipelines: "ETL in Python and SQL" by Jennifer Ebe! Check it out: https://lnkd.in/edHWDbew

💼 Excited to dive deeper into this crucial aspect of data management. In today's market, the demand for skilled professionals in ETL pipelines is soaring, as businesses seek efficient ways to extract, transform, and load data for insightful analysis. Ready to leverage this expertise for impactful data-driven solutions!

#ETL #DataManagement #CertificationCompleted #Python #ExtractTransformLoad #SQL