Scrapy The Tool

date: 2019-05-19

As part of my job, I have to scrape some websites to help our sales team with data on the market. Until now they were doing it manually, which is a tedious job and consumes a lot of their productive time. After a bit of searching and going through different tools and frameworks, I came across a framework named Scrapy. So here I am going to share how to set up and use Scrapy.

Scrapy is a free and open-source web-crawling framework written in Python which is used to extract data from websites without much hassle. It has very nice documentation, which you can check out on the official Scrapy site.

Steps to Install Scrapy

sudo apt-get install python-dev python-pip libxml2-dev libxslt1-dev zlib1g-dev libffi-dev libssl-dev
pip install Scrapy

Steps to Create New Project

To create a Scrapy project, type this command in your terminal: `scrapy startproject <project name>`. The project structure will look like this:
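(The original screenshot of the structure is missing; for reference, `scrapy startproject` generates a layout along these lines:)

```
<project name>/
    scrapy.cfg            # deploy configuration file
    <project name>/       # the project's Python module
        __init__.py
        items.py          # item definitions
        middlewares.py    # spider/downloader middlewares
        pipelines.py      # item pipelines
        settings.py       # project settings
        spiders/          # directory where your spiders live
            __init__.py
```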

Now go ahead and create a Python file inside the spiders/ directory and paste the code below.

#!/usr/bin/env python3
import scrapy

class RedditSpider(scrapy.Spider):
    # Name of the scraper; it should be unique within the project.
    name = "reddit"
    # List of URLs to be crawled.
    start_urls = ['https://www.reddit.com/']

    # Called to do any operation on the response of the above URLs.
    def parse(self, response):
        # CSS selector of the anchor tags which contain the headers.
        top_posts = response.css("a.SQnoC3ObvgnGjWt90zD9Z")
        for post in top_posts:
            self.log(post.css('::text').extract_first())

To start scraping, type

`scrapy crawl reddit`

Here we are scraping the Reddit website for the latest posts and getting the header of each post. The spider logs each title to the console as it crawls.

Now, the best part of Scrapy: if you want to experiment with a website before creating any project, you can easily do that with the Scrapy shell.

`scrapy shell 'https://www.reddit.com/'`

And then you can try different CSS selectors on the response. There is a lot more you can do with Scrapy, like saving the results in JSON or CSV format, and even integrating it with a Django project. I might show that in a next post; till then, goodbye.

Cheers