Scrapy The Tool
-date: 2019-05-19
As part of my job, I have to scrape some websites to give our sales team data on the market. Until now they were doing this manually, which is tedious and eats up a lot of their productive time. After a bit of searching and going through different tools and frameworks, I came across a framework named Scrapy. So here I am going to share how to set up and use Scrapy.
Scrapy is a free and open source web-crawling framework written in Python, used to extract data from websites without much hassle. It has very nice documentation that you can check out here.
Steps to Install Scrapy
sudo apt-get install python-dev python-pip libxml2-dev libxslt1-dev zlib1g-dev libffi-dev libssl-dev
pip install Scrapy
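A quick way to confirm the install worked is to import the package from Python. This is only a small sketch; `scrapy_version` is a helper name made up here, not something Scrapy provides.

```python
def scrapy_version():
    """Return the installed Scrapy version string, or None if the import fails."""
    try:
        import scrapy  # only importable once the pip install above has succeeded
    except ImportError:
        return None
    return scrapy.__version__


installed = scrapy_version()
print(installed if installed else "Scrapy is not installed")
```

If this prints a version number, the `scrapy` command should also be available in your terminal.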
Steps to Create New Project
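To create a Scrapy project, type this command in your terminal: `scrapy startproject <project name>`.
The generated project contains a scrapy.cfg file and a Python package holding items.py, middlewares.py, pipelines.py, settings.py and a spiders directory.
Now go ahead and create a Python file in the spiders directory and paste in the code below.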
#!/usr/bin/env python3
import scrapy


class RedditSpider(scrapy.Spider):
    # Name of the scraper; it must be unique within the project.
    name = "reddit"
    # List of URLs the spider starts crawling from.
    start_urls = ['https://www.reddit.com/']

    # Called with the response downloaded for each URL above.
    def parse(self, response):
        # CSS selector of the anchor tags that contain the post titles.
        top_post = response.css("a.SQnoC3ObvgnGjWt90zD9Z")
        for post in top_post:
            self.log(post.css('::text').extract_first())
To start scraping, type
`scrapy crawl reddit`
Here we are scraping the Reddit website for the latest posts and getting the title of each post. The output of the above code will look like this.
- Trump Organization ‘Sold Property to Shell Company Linked to Maduro Regime,’ Says Report
- Blind people of Reddit, what do you find sexually attractive?
- A “caravan” of Americans is crossing the Canadian border to get affordable medical care
- A “caravan” of Americans is crossing the Canadian border to get affordable medical care
- [Post Game Thread] The Houston Rockets defeat the Golden State Warriors, 112-108, behind Harden's 38 points to level the series 2-2, despite the continued brilliance of Kevin Durant
- 18, my friend here is failing biology and thinks she's unroastable. Go for it guys, and go hard
- If you strike me down, I shall become more powerful than you can possibly imagine. [BOTW]
- ELI5: Why are all economies expected to “grow”? Why is an equilibrium bad?
....
The best part of Scrapy is that if you want to experiment with a website before creating a project, you can easily do that with the Scrapy shell.
scrapy shell 'https://www.reddit.com/'
You can then try different CSS selectors on the response. There is a lot more you can do with Scrapy, like saving the results in JSON or CSV format and even integrating with a Django project. I might show that in the next post; till then, goodbye.
Cheers