Import CSV from FTP to PostgreSQL Using Python
Automating CSV Imports from FTP to PostgreSQL with Python: A Full ETL Walkthrough with Code Examples
In modern data-driven architectures—especially in IoT platforms, smart metering systems, and industrial automation—data rarely originates from a single source. Embedded devices often export measurements as CSV files and upload them to an FTP server because FTP is lightweight and universally supported by legacy hardware.
The challenge lies in moving this data into a structured PostgreSQL database for analysis. In this guide, I will show you how to build a Python script that monitors an FTP server, detects new files, and performs an "Upsert" (Update or Insert) into your database.
🛠️ Prerequisites & Architecture
Before diving into the code, make sure the required Python libraries are available. We use ftplib for the FTP connection and psycopg2 for high-performance database insertions; ftplib ships with the Python standard library, while psycopg2 must be installed separately (typically via the psycopg2-binary package).
The Workflow:
- Connect: Script logs into the FTP server.
- List: It retrieves a list of all CSV files.
- Compare: It checks a local "processed_files.json" log to see which files are new (a minimal sketch of this step follows the list).
- Process: It downloads new files, parses the CSV, and pushes data to Postgres.
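The comparison in step 3 can be as simple as a JSON file holding the names of files that have already been imported. Here is a minimal sketch, assuming the log lives next to the script (the helper names are my own, not from the repository):

```python
import json
import os

LOG_PATH = "processed_files.json"

def load_processed(path=LOG_PATH):
    # Return the set of filenames that were already imported
    if os.path.exists(path):
        with open(path) as f:
            return set(json.load(f))
    return set()

def mark_processed(filename, path=LOG_PATH):
    # Add a filename to the log and persist it to disk
    processed = load_processed(path)
    processed.add(filename)
    with open(path, "w") as f:
        json.dump(sorted(processed), f, indent=2)
```

Filtering the FTP listing then reduces to a single comprehension: `new_files = [f for f in ftp_files if f not in load_processed()]`.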
💻 Core Code Logic: The FTP Connection
While the full source code is available on GitHub (linked below), here is the critical logic for establishing a connection and listing files. (Note that plain FTP is unencrypted; if your server supports FTPS, ftplib.FTP_TLS is a near drop-in replacement.) This function ensures we only target .csv files:
```python
import ftplib

def get_files_from_ftp(host, user, pwd):
    # Establish connection to the FTP server
    ftp = ftplib.FTP(host)
    ftp.login(user, pwd)

    files = []
    try:
        # List all files in the current directory
        filenames = ftp.nlst()
        # Filter only for CSV files
        for filename in filenames:
            if filename.endswith(".csv"):
                files.append(filename)
    except ftplib.all_errors as e:
        print(f"FTP Error: {e}")

    return files, ftp
```
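With the connection returned above still open, each new file can be streamed to disk with retrbinary. A short sketch follows; the downloads directory and the helper name are my own choices for illustration:

```python
import os

def download_file(ftp, filename, local_dir="downloads"):
    # Stream the remote file to a local path in binary mode
    os.makedirs(local_dir, exist_ok=True)
    local_path = os.path.join(local_dir, filename)
    with open(local_path, "wb") as f:
        ftp.retrbinary(f"RETR {filename}", f.write)
    return local_path
```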
Once the files are identified, we use psycopg2.extras.execute_values to batch insert data, which is significantly faster than inserting row-by-row.
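Here is a sketch of how that batch insert can be combined with the upsert behaviour described at the start. The measurements table, its columns, and the unique constraint on (device_id, ts) are illustrative assumptions; the actual schema ships with the repository:

```python
import csv
from psycopg2.extras import execute_values

def import_csv(conn, csv_path):
    # Read the CSV up front; device exports are typically small files
    with open(csv_path, newline="") as f:
        rows = [(r["device_id"], r["ts"], r["value"]) for r in csv.DictReader(f)]

    with conn.cursor() as cur:
        # One round trip for the whole batch; ON CONFLICT turns
        # duplicate (device_id, ts) rows into updates (the "upsert")
        execute_values(
            cur,
            """INSERT INTO measurements (device_id, ts, value)
               VALUES %s
               ON CONFLICT (device_id, ts) DO UPDATE SET value = EXCLUDED.value""",
            rows,
        )
    conn.commit()
```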
📺 Watch the Step-by-Step Tutorial
I have recorded a two-part series demonstrating this code in a live production environment.
Part 1: Setting up the Environment
Part 2: The Python ETL Script
📦 Get the Full Source Code
The complete project, including the database schema, config file handling, and error logging, is available on my GitHub.
🔗 GitHub Repository:
View Project on GitHub
✅ Conclusion
Building your own ETL pipeline in Python gives you full control over how data is handled. Unlike expensive ETL tools, this script is free, lightweight, and can be scheduled with a simple cron job or systemd service.
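For reference, a crontab entry that runs the pipeline every 15 minutes could look like the line below; the script path, interval, and log location are placeholders to adapt to your own deployment:

```
*/15 * * * * /usr/bin/python3 /opt/etl/ftp_to_postgres.py >> /var/log/ftp_etl.log 2>&1
```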
If this guide helped you, please consider starring the repository on GitHub and subscribing to the YouTube channel for more IoT data engineering tutorials.