How to do web scraping/crawling using Python with Selenium

Have you ever wondered how online shopping sites such as Amazon, Flipkart, and Meesho suggest or recommend products based on your search or browsing history? They can do this because their servers index all of that information in their records, so they can return the most relevant search-based results. Web crawlers handle this indexing process.

Data has become the key to growth for any business: over the past decade, the most successful organizations have relied on data-driven decision-making. With around five billion internet users creating billions of data points every second, companies gather data primarily for price and brand monitoring, price comparison, and big-data analysis that feed their decision-making process and business strategy.

Web scraping/crawling is used to find meaningful insights in that data, which help in making decisions for business growth. Let’s see how we can achieve this.

Web scraping is used to gather large amounts of data from websites. Collecting such data manually is very hard to manage because data on the web is unstructured; web scraping avoids this problem by extracting the data and storing it in a structured form.

Example – Python web scraping/crawling for Flipkart

Prerequisites – We need the following libraries to scrape Flipkart. To install these packages on your system, simply open cmd and run the following commands.

1.    Install Python 3+ (download it from python.org; pip ships with it)

2.    pip install selenium

3.    pip install requests

4.    pip install lxml
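
To confirm the packages import cleanly, you can run a quick one-liner from the same terminal (a minimal sanity check):

python -c "import selenium, requests, lxml"

If the command exits without an error, all three libraries are installed.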

With all the required libraries installed, we are good to go. Next, we need to add request headers to scrape information from the web. To find the request headers of a web page, follow the steps given below.

Step 1:

Open the URL of the webpage you want to scrape/crawl and type the name of the product you want information about into the search bar. A list of products is displayed on the page. Click on any product, then right-click on the page and select “Inspect”.

Step 2:

The developer-tools panel opens, showing the page’s HTML. Select the “Network” tab, reload the page, and click the first entry in the request list; it shows the request and response headers. Copy the request headers, as we need them for scraping.
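
Before copying every header, it can be worth testing what a minimal request returns. The sketch below sends only a User-Agent; sites like Flipkart often block requests that lack a realistic header set, so expect to need the full set copied from the browser:

import requests

test_headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.51 Safari/537.36'}
response = requests.get('https://www.flipkart.com/search?q=iphone', headers=test_headers, timeout=30)
print(response.status_code)  # 200 means the page was served; anything else means more headers are needed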

Step 3:

Create a Python file (a file name with the .py extension) and import all the required libraries that we are going to use.

Here, create a file named python_scrapy_demo.py:

import requests
from lxml import html
from csv import DictWriter

headers = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'Accept-Encoding': 'gzip, deflate, br',
    'Accept-Language': 'en-US,en;q=0.9,mr;q=0.8',
    'Cache-Control': 'max-age=0',
    'Connection': 'keep-alive',
    # session cookie copied from the browser's request headers; replace it with your own
    'Cookie': 'T=TI164041787357700171479106209992677828199995896121444018506202609225; _pxvid=9615f9eb-6555-11ec-a0e5-544468794e4b; Network-Type=4g; pxcts=e13bea52-9c71-11ec-8fbe-4e794558716a; _pxff_tm=1; AMCVS_17EB401053DAF4840A490D4C%40AdobeOrg=1; AMCV_17EB401053DAF4840A490D4C%40AdobeOrg=-227196251%7CMCIDTS%7C19057%7CMCMID%7C60020900103419608715489104597694502461%7CMCAID%7CNONE%7CMCOPTOUT-1646484547s%7CNONE%7CMCAAMLH-1647082147%7C3%7CMCAAMB-1647082147%7Cj8Odv6LonN4r3an7LhD3WZrU1bUpAkFkkiY1ncBR96t2PTI; _px3=517dd86b669bed026967b6bdfbfac15a6893b3fb6a0a48639f8c8cac65b3cd64:OFKVhuX/QOYMMgqjXTNst5364SHIk+eTiaOVpjTfYKc6cnY+68dfTvg1NUCBE2W7jjH0hr7tgdk6UkBvsJVm9A==:1000:rga8uP2RMWp7ee1XTv8PVYgqr/ZlUn4jscKqdAKTIK9OFsmlF4QbPjfaDpAcMZn18Eip7z8FZsgO3j/KJ5x3m7BeObZLpMhgigTALVggsTCobVWml0DqL55ZTywnb5ezOslK6Q9axT+/y3CK7meTirkm9bumQWlOwMSMinGilSmpFCek9gBrinbeKWgdDCzFIKhH9ZOdRDiYGKa0DUOu7w==; SN=VI1A3BE7DC80484037A949D48CB6847E12.TOKB37A0AFBA76F46BB84B8BC39EEE0C132.1646477414.LI; s_sq=flipkart-prd%3D%2526pid%253Dwww.flipkart.com%25253A%2526pidt%253D1%2526oid%253Dhttps%25253A%25252F%25252Fwww.flipkart.com%25252Fsearch%25253Fq%25253Diphone%252526sid%25253Dtyy%2525252C4io%252526as%25253Don%252526as-show%25253Don%252526otracker%25253DAS_QueryStore_Organ%2526ot%253DA; S=d1t13Pz8VPxZJPxkMPwA/GT8/P8fOp+2+MT5EGsfvlEGAzgqQGy0f0O82o91FZOXzCPWn/Wqqo3+892JiBWn5oEFyQg==; qH=0b3f45b266a97d70',
    'Host': 'www.flipkart.com',
    'Referer': 'https://www.google.com/',
    'sec-ch-ua': '" Not A;Brand";v="99", "Chromium";v="99", "Google Chrome";v="99"',
    'sec-ch-ua-mobile': '?0',
    'sec-ch-ua-platform': '"Windows"',
    'Sec-Fetch-Dest': 'document',
    'Sec-Fetch-Mode': 'navigate',
    'Sec-Fetch-Site': 'same-origin',
    'Sec-Fetch-User': '?1',
    'Upgrade-Insecure-Requests': '1',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.51 Safari/537.36'
}

# method to save the HTML page data in a file
def save_html_page(page_data):
    with open('flipkart_page.html', 'w+', encoding='utf-8') as fl:
        fl.write(page_data)

# method to save the scraped data in a CSV file
def data_save_csv(dt):
    headings = ['product_name', 'product_price']  # DictWriter needs ordered fieldnames, so use a list
    with open('flipkartdata.csv', 'a+', encoding='utf-8', newline='') as file:
        writer = DictWriter(file, fieldnames=headings)
        if file.tell() == 0:  # write the header row only once, while the file is still empty
            writer.writeheader()
        writer.writerow(dt)

In the code above, we save the HTML page and write the scraped data into a CSV file. The open() call uses “w+” with encoding='utf-8' so the data is written as Unicode. To extract the data (here, product_name and product_price), use the method below. The same code can extract other kinds of data as well: just add an XPath for whichever part of the product description you need, and it will return that data.
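
For instance, once the tree object is built in the method below, a third XPath call alongside product_name and prod_price could pull each product’s rating. The class name here is purely illustrative; inspect the live page for the real one, as Flipkart’s generated class names change frequently:

# hypothetical example: "_3LWZlK" is an illustrative class name, check the live page with Inspect
product_rating = tree.xpath('//div[@class="_3LWZlK"]/text()')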

def crawling_data():
    response = requests.get(
        url='https://www.flipkart.com/search?q=iphone&otracker=search&otracker1=search&marketplace=FLIPKART&as-show=on&as=off',
        headers=headers,
        timeout=30)
    # print(response.text)
    save_html_page(page_data=response.text)
    if response.status_code == 200:
        tree = html.fromstring(response.text)
        product_name = tree.xpath('//div[@class="_3pLy-c row"]/div[@class="col col-7-12"]/div[@class="_4rR01T"]/text()')
        prod_price = tree.xpath('//div[@class="col col-5-12 nlI3QM"]/div[@class="_3tbKJL"]/div[@class="_25b18c"]/div[@class="_30jeq3 _1_WHN1"]/text()')
        all_data = list(zip(product_name, prod_price))
        # print(all_data)
        product = {}
        for item in all_data:
            product['product_name'] = item[0]
            product['product_price'] = item[1].replace('₹', '')  # strip the rupee symbol
            print(product)
            data_save_csv(dt=product)

crawling_data()
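
The crawl above fetches the page with requests, which works as long as Flipkart serves the listings in the initial HTML. When a page builds its content with JavaScript, requests only sees the empty shell, and this is where Selenium comes in: it drives a real browser and hands back the rendered page. A minimal sketch, assuming Selenium 4+ and Chrome are installed (Selenium Manager fetches the matching driver automatically):

from selenium import webdriver
from lxml import html

# launch a real Chrome browser and load the search page
driver = webdriver.Chrome()
driver.get('https://www.flipkart.com/search?q=iphone')

# the rendered page can be parsed with the same lxml XPath code used above
tree = html.fromstring(driver.page_source)
driver.quit()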

Conclusion

We live in a world where technology continues to develop, particularly in the computer industry, and the market scenario and client demands change every second. To satisfy customer needs and drive business growth at the same time, a business has to keep adapting, and web scraping/crawling supplies the data those decisions are based on.