Job Summary: We are seeking a skilled Data Engineer with a strong understanding of web scraping, data extraction, and data pipeline development. The ideal candidate will have experience building, maintaining, and optimizing data pipelines, performing data transformation and cleaning, and ensuring data accuracy and reliability through both automated and manual processes. Proficiency in handling Excel, CSV, and other structured files, in file management, and in SQL, along with knowledge of web technologies, is essential.
Key Responsibilities:
Data Pipeline Development and Management:
- Design, build, and maintain scalable data pipelines for web scraping and data extraction.
- Implement ETL (Extract, Transform, Load) processes to move data from various sources into a central repository.
- Optimize and troubleshoot data pipelines to ensure high performance and reliability.
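The ETL responsibilities above can be pictured as a minimal sketch. This is illustrative only: the raw records are made-up sample data, and an in-memory SQLite table stands in for the "central repository" (the actual stack below uses PostgreSQL).

```python
import sqlite3

def extract() -> list[dict]:
    # Hypothetical raw records, e.g. as scraped or pulled from an API.
    return [
        {"sku": " A-100 ", "price": "19.99"},
        {"sku": "B-200", "price": "5.50"},
    ]

def transform(rows: list[dict]) -> list[tuple]:
    # Normalize whitespace and cast prices from strings to floats.
    return [(r["sku"].strip(), float(r["price"])) for r in rows]

def load(rows: list[tuple], conn: sqlite3.Connection) -> None:
    # Load cleaned rows into the target table.
    conn.execute("CREATE TABLE IF NOT EXISTS products (sku TEXT, price REAL)")
    conn.executemany("INSERT INTO products VALUES (?, ?)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")
load(transform(extract()), conn)
```

In production, each stage would typically be a separate, independently testable and retryable step rather than one linear script.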
Web Scraping and Data Extraction:
- Develop and manage scripts for data scraping using Python libraries such as Requests, Selenium, and Beautiful Soup.
- Extract data from websites and APIs, ensuring efficient and accurate data collection.
- Utilize HTML, CSS, XPath, and other web technologies related to scraping.
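The posting names Requests, Selenium, and Beautiful Soup for this work; as a dependency-free illustration of the core idea, here is a sketch using only the standard library's `html.parser` to extract links from an inline HTML snippet (a BeautifulSoup `find_all("a")` call expresses the same thing more concisely).

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect (href, text) pairs from anchor tags."""
    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None

# Inline sample markup standing in for a fetched page.
html = '<ul><li><a href="/a">First</a></li><li><a href="/b">Second</a></li></ul>'
parser = LinkExtractor()
parser.feed(html)
```

Selenium becomes necessary only when the target page renders its content with JavaScript; for static HTML, a plain HTTP fetch plus a parser like this is faster and cheaper.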
Data Transformation and Cleaning:
- Transform, clean, and standardize data to ensure high quality and consistency across datasets.
- Utilize Python and Pandas for data manipulation and preprocessing.
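A small sketch of the cleaning step described above, using Pandas as the posting specifies. The column names and sample rows are hypothetical; the operations (trimming whitespace, normalizing case, filling missing values, deduplicating) are the standard moves.

```python
import pandas as pd

def clean_contacts(df: pd.DataFrame) -> pd.DataFrame:
    """Standardize a raw contacts frame: trim whitespace, normalize
    email case, fill missing countries, and drop exact duplicates."""
    out = df.copy()
    out["name"] = out["name"].str.strip()
    out["email"] = out["email"].str.strip().str.lower()
    out["country"] = out["country"].fillna("Unknown")
    return out.drop_duplicates().reset_index(drop=True)

raw = pd.DataFrame({
    "name": ["  Ada Lovelace", "Ada Lovelace", "Alan Turing "],
    "email": ["ADA@example.com ", "ada@example.com", "alan@example.com"],
    "country": ["UK", "UK", None],
})
clean = clean_contacts(raw)
```

Note that the duplicate row only becomes an exact duplicate after normalization, which is why cleaning runs before `drop_duplicates`.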
Data Quality Assurance:
- Develop and implement automated and manual checks to ensure data accuracy and completeness.
- Perform data validation to identify and rectify inconsistencies and inaccuracies.
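An automated check of the kind described above might look like the following sketch. The field names and rules are illustrative, not from a real schema; the pattern (return a list of problems, empty meaning the record passed) makes checks easy to aggregate into a report.

```python
import re

# Deliberately simple email pattern for illustration only.
EMAIL_RE = re.compile(r"^[\w.+-]+@[\w-]+\.[\w.-]+$")

def validate_record(record: dict) -> list[str]:
    """Return a list of validation problems; an empty list means the
    record passed every check."""
    problems = []
    for field in ("id", "email", "price"):
        if record.get(field) in (None, ""):
            problems.append(f"missing {field}")
    email = record.get("email") or ""
    if email and not EMAIL_RE.match(email):
        problems.append("malformed email")
    price = record.get("price")
    if isinstance(price, (int, float)) and price < 0:
        problems.append("negative price")
    return problems
```

For example, `validate_record({"id": 2, "email": "not-an-email", "price": -1})` flags both the malformed email and the negative price.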
Data Enrichment:
- Conduct secondary research to enrich data fields with additional information.
- Integrate supplementary data sources to enhance datasets.
File Handling and Management:
- Efficiently manage and process multiple Excel and CSV files.
- Perform file operations such as merging, splitting, and cleaning data from multiple sources.
- Maintain organized file structures for easy access and retrieval.
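The merge operation mentioned above can be sketched with the standard library's `csv` module alone. The two input files and their columns are made-up sample data; the function assumes all inputs share the same header and writes it only once.

```python
import csv
import tempfile
from pathlib import Path

def merge_csvs(paths: list[Path], out_path: Path) -> int:
    """Concatenate CSV files that share a header, writing the header
    once. Returns the number of data rows written."""
    rows_written = 0
    with open(out_path, "w", newline="") as out:
        writer = None
        for path in paths:
            with open(path, newline="") as f:
                reader = csv.reader(f)
                header = next(reader)  # skip per-file header
                if writer is None:
                    writer = csv.writer(out)
                    writer.writerow(header)
                for row in reader:
                    writer.writerow(row)
                    rows_written += 1
    return rows_written

# Demonstration with two small temporary files.
tmp = Path(tempfile.mkdtemp())
(tmp / "a.csv").write_text("sku,qty\nA-100,3\n")
(tmp / "b.csv").write_text("sku,qty\nB-200,7\nC-300,1\n")
count = merge_csvs([tmp / "a.csv", tmp / "b.csv"], tmp / "merged.csv")
```

For Excel workbooks or files with mismatched columns, `pandas.concat` over a list of frames is usually the more robust route.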
Technical Skills:
- Proficiency in Python and experience with web scraping libraries such as Requests, Selenium, and Beautiful Soup.
- Strong knowledge of data manipulation and processing using Pandas.
- Hands-on experience with Excel, including the use of nested text functions and logical operations.
- Proficiency in SQL for data querying, manipulation, and building data pipelines.
- Strong understanding of HTML, CSS, XPath, and web technologies related to scraping.
- Experience with large-scale data storage and retrieval, particularly within data engineering contexts.
- Knowledge of ethical web scraping practices and legal compliance.
- Understanding of REST API development.
- Understanding of ETL pipelines and related tools such as FastAPI, Docker, PostgreSQL, and AWS services (Lambda, EC2, ECR, S3, CloudWatch).
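As a small illustration of the SQL querying skill listed above, here is a sketch using an in-memory SQLite database (standing in for the PostgreSQL instance in the stack below) with made-up order data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("acme", 120.0), ("acme", 30.0), ("globex", 75.0)],
)
# Aggregate spend per customer -- a typical reporting query.
totals = conn.execute(
    "SELECT customer, SUM(amount) FROM orders "
    "GROUP BY customer ORDER BY customer"
).fetchall()
```

The same `GROUP BY` query runs unchanged against PostgreSQL; only the connection setup differs.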
Tech Stack:
- Programming Languages: Python
- Libraries and Frameworks: Pandas, Requests, Selenium, Beautiful Soup, FastAPI
- Tools: Excel, SQL
- Database: PostgreSQL
- AWS Services: Lambda, EC2, ECR, S3, CloudWatch
- Good to have: Snowflake, DBT, Databricks, Salesforce