Data Crawling/Scraping

"Data is the new oil" - Clive Humby

  • What is Web Scraping?

Web scraping is an automated way to retrieve unstructured data from a website and store it in a structured format. For example, if you want to analyze which kinds of face masks sell better in Singapore, you might scrape all the face mask listings from e-commerce websites like Lazada, Amazon etc.

  • How does web scraping work?

Web scraping works like a bot browsing the different pages of a website and copying down all the contents. When you run the code, it sends a request to the server, and the data you need is contained in the response you get. You then parse the response and extract the parts you want.
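As a minimal sketch of that request-and-parse cycle, here is what it looks like with the requests library (the URL is just a placeholder):

# a minimal sketch of the request/parse idea (the URL is only a placeholder)
import requests

response = requests.get("https://example.com")  # send a request to the server
print(response.status_code)                     # 200 means the request succeeded
print(response.text[:200])                      # the raw HTML you would then parse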

I personally use the Scrapy framework (along with the requests library) to scrape data because it is more robust, more feature-complete, more extensible, and better maintained than most other web scraping tools.

For the best reference on Scrapy, check the official documentation linked below; I pick up new things about Scrapy from its documentation daily.

https://docs.scrapy.org/en/latest/

There are also other tools for scraping data, such as Beautiful Soup and Selenium.

Selenium is an excellent automation tool and Scrapy is by far the most robust web scraping framework. For web scraping, Scrapy is the better choice in terms of speed and efficiency, while Selenium can work better for JavaScript-based websites where we need to make AJAX/PJAX requests.

Source for the content below: pub.towardsai.net/web-scraping-in-scrapy-c2.. 👇

Steps for web scraping using Scrapy :-

Step 1: Install the Scrapy package

# install at terminal
pip install Scrapy

Or you can refer to the installation guide: https://docs.scrapy.org/en/latest/intro/install.html

Step 2: Create the Scrapy project

In the terminal, locate the folder where you want to store the scraping code, and then type

scrapy startproject <project_name>

Replace <project_name> with your own project name. In this example, I create a new project called ‘scraping_demo’.

It will create a folder with the structure shown below
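For the ‘scraping_demo’ project above, the layout typically looks like this (file names may vary slightly between Scrapy versions):

scraping_demo/
    scrapy.cfg            # deploy configuration file
    scraping_demo/        # the project's Python module
        __init__.py
        items.py          # item definitions
        middlewares.py    # project middlewares
        pipelines.py      # item pipelines
        settings.py       # project settings
        spiders/          # folder where you put your scrapers
            __init__.py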

Most of the files here are pre-configured and you do not need to touch them. The first file you should check out is settings.py.

By default, the freshly created settings.py only follows robots.txt rules and has no user agent or other settings. You can modify it by defining your user agent and setting DOWNLOAD_DELAY (as specified by the website rules).
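As a rough sketch, the changes might look like this (the user agent string and delay value are only illustrative):

# settings.py -- illustrative adjustments only
USER_AGENT = "scraping_demo (+https://www.example.com)"  # identify your scraper (placeholder value)
ROBOTSTXT_OBEY = True   # keep obeying robots.txt rules
DOWNLOAD_DELAY = 2      # seconds to wait between requests, per the website's rules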

The other thing you should do is create the scraper Python code in the spiders folder. Currently, there is only an __init__.py file there; you can create multiple scraper files in that folder and use them in different scenarios, and that will be our step 3.

Step 3: Create scraper code under the spiders folder

This step is the main part of writing a scraper. First create a .py file under the spiders folder. Then you can refer to the example below (adapted from the Scrapy tutorial) for the basic code structure.
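Here is a minimal sketch, following the quotes example from the Scrapy tutorial and scraping the demo site quotes.toscrape.com:

import scrapy

class QuotesSpider(scrapy.Spider):
    # the name used with 'scrapy crawl'; must be unique within the project
    name = "quotes"

    def start_requests(self):
        # yield a request for each start URL, with parse() as the callback
        urls = [
            "https://quotes.toscrape.com/page/1/",
            "https://quotes.toscrape.com/page/2/",
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        # extract structured data from the response using CSS selectors
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }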

In the new .py file you created, you define the scraper class and then give the scraper a name. Because you can create multiple scrapers in the spiders folder and you use the name to choose which one to run, you need to make sure the names of different scrapers are distinct.

After that, the basic structure requires two functions: ‘start_requests’ and ‘parse’. ‘start_requests’ plays a role similar to the requests library in Python: it sends requests to the website URLs you define. Note two things. First, you do not return from this function but rather yield, so the request is handed off and processing continues. Second, in each request you define the URL and also the callback function, which converts the website response into structured content.

The last function, parse, plays a role similar to BeautifulSoup: it parses the website contents. You can select content by CSS or XPath; you can find more information on selectors at https://docs.scrapy.org/en/latest/topics/selectors.html. Here you can either output the data directly or leave it to be exported later via the command line.

You can test the scraper's selectors by using the Scrapy shell. First, navigate to your project folder in the terminal and run ‘scrapy shell <url you want to scrape>’.

After that, if the request succeeds, you can use the response object to test whether your selectors, e.g. ‘response.xpath(‘//title/text()’).get()’, return the output you expect. You can find more details at https://docs.scrapy.org/en/latest/topics/shell.html
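For the demo spider sketched above, a quick shell session might look like this (the URL is the same demo site; the selectors are only examples):

# in the terminal, from the project folder
scrapy shell "https://quotes.toscrape.com/page/1/"

# inside the interactive shell, test selectors against the response object
>>> response.xpath("//title/text()").get()
>>> response.css("div.quote span.text::text").get()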

Last Step (Finally): Run the Scrapy scraper in the terminal

The last step is simple: navigate to your project folder in the terminal and run ‘scrapy crawl <the name you defined for the scraper>’. This step may take quite some time if you are scraping a large number of pages, and the scraping speed is largely determined by the download delay you defined in the settings.py file.
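For the spider named ‘quotes’ sketched above, that would be something like:

# from the project folder
scrapy crawl quotes

# or export the yielded items to a file (JSON, CSV, etc.)
scrapy crawl quotes -o quotes.json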

Soon I'll post my scraping projects on GitHub, so you can check them out too! See you in the next article.

Do subscribe to our newsletter for more informational content like this.

Thanks for reading this article. Hope you learned something new!!