March 2019 - April 2019

Hitachi

Data Analyst Intern, Railway Systems Business Division (Worked on Data Extraction, Web Scraping, Automation)

Developed a Python web scraping project using Pandas, Scrapy, SciPy, NumPy, and Python-based automation to extract semi-structured (raw) data from web pages and prepare a structured dataset for further processing. Built a scraper that extracted the arrival times of Indian Railways trains to check the correctness of the National Train Enquiry System data displayed on websites against the actual real-time data used for public consumption. Performed data manipulation, wrangling, cleansing, and insight extraction in Python. Extracted data to assess the accuracy and consistency of train arrival times. Implemented a system for process improvement and corrective actions for railway operations. Designed a script-based solution in Python that automated real-time data extraction from websites and simultaneously stored the data in CSV format. Implemented crawlers (spiders) using Scrapy and extracted the raw data from HTML tags using CSS and XPath selectors. Used regular expressions and date and time packages to convert the date and time formats of the semi-structured data into Excel-readable formats for time series analysis. Removed anomalies in the data using Python libraries such as Pandas and NumPy to manipulate the DataFrames, and converted the data into an optimized format. Used Amazon EC2 cloud instances to run multiple instances of the Python spider scripts, carrying out the above analytical pipeline across different VMs.

Technical Skills: Python, scikit-learn, MS Excel, Beautiful Soup, Selenium (Automation)