sandeepk

Scrapy The Tool

December 4, 2020

-date: 2019-05-19

As part of my job, I have to scrape some websites to give our sales team data on the market. Until now they were doing this manually, which is tedious and consumes a lot of their productive time. After some searching through different tools and frameworks, I came across a framework named Scrapy. So here I am going to share how to set up and use Scrapy.

Scrapy is a free and open-source web-crawling framework written in Python, used to extract data from websites without much hassle. It has very nice documentation, which you can check out here.

Steps to Install Scrapy

sudo apt-get install python-dev python-pip libxml2-dev libxslt1-dev zlib1g-dev libffi-dev libssl-dev
pip install Scrapy

Steps to Create New Project

To create a Scrapy project, type this command in your terminal: `scrapy startproject <project name>`. The command generates a project skeleton that includes a spiders directory.

Now go ahead and create a Python file in the spiders directory and paste in the code below.

#!/usr/bin/env python3
import scrapy

class RedditSpider(scrapy.Spider):
    # Name of the scraper; it should be unique within the project.
    name = "reddit"
    # List of the URLs to crawl.
    start_urls = ['https://www.reddit.com/']

    # Called with the response of each URL above.
    def parse(self, response):
        # CSS selector of the anchor tags which contain the post titles.
        top_post = response.css("a.SQnoC3ObvgnGjWt90zD9Z")
        for post in top_post:
            self.log(post.css('::text').extract_first())

To start scraping, type

`scrapy crawl reddit`

Here we are scraping the Reddit website for the latest posts and getting the title of each post. The output of the above code will look like this.

Trump Organization ‘Sold Property to Shell Company Linked to Maduro Regime,’ Says Report
Blind people of Reddit, what do you find sexually attractive?
A “caravan” of Americans is crossing the Canadian border to get affordable medical care
A “caravan” of Americans is crossing the Canadian border to get affordable medical care
[Post Game Thread] The Houston Rockets defeat the Golden State Warriors, 112-108, behind Harden's 38 points to level the series 2-2, despite the continued brilliance of Kevin Durant
18, my friend here is failing biology and thinks she's unroastable. Go for it guys, and go hard
If you strike me down, I shall become more powerful than you can possibly imagine. [BOTW]
ELI5: Why are all economies expected to “grow”? Why is an equilibrium bad?
....

Now the best part of Scrapy is that if you want to experiment with a website before creating a project, you can easily do that from the Scrapy shell.

scrapy shell 'https://www.reddit.com/'

You can then try different CSS selectors on the response. There is a lot more you can do with Scrapy, like saving the results in JSON or CSV format and even integrating with a Django project. I might show that in the next post; till then, goodbye.

Cheers

Understanding Generators in Python.

December 4, 2020

—date: 2019-10-03 originally posted here

A generator is a function that produces its values one at a time instead of running all of its code at once, as a normal function does. A normal function executes from the top down to its return statement. A function that contains a yield statement is called a generator function. Calling it returns a generator object without running the body; execution then pauses at each yield statement rather than ending at a return. To move on to the next statement, the next() function is called, which resumes execution from where it left off. When the body finishes without reaching another yield statement, a StopIteration exception is raised.

So let's see how to create and execute a generator in Python.

def fib(n):
    a, b = 0, 1
    while a <= n:
        yield a   # yield statement.
        a, b = b, a + b

Now let's execute the function fib().

fib_fun = fib(10)
next(fib_fun) # 0
next(fib_fun) # 1
next(fib_fun) # 1
.
.
.
next(fib_fun) # 8
next(fib_fun) # reached the end will raise StopIteration Error.
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
StopIteration

Alternatively, you can use a for loop, which calls next() in the background.

for fib_value in fib(10):
    print(fib_value)

# Output
0
1
1
2
3
5
8

So today we understood the concept of generators in Python. Now you may be wondering where we can use them, so let me state some use cases.

They can be used for memory management: instead of passing a whole list at once, a generator passes data one item at a time, which puts less load on memory.
Generators can be used to define infinite streams.
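
As a rough sketch of both use cases, the hypothetical naturals() generator below (it is not part of the original example) defines an infinite stream, and itertools.islice takes only as many values as needed, so the full sequence is never built. The size comparison at the end is only meant to hint at the memory difference, not to measure it precisely.

```python
import itertools
import sys


def naturals():
    """Infinite stream: yields 0, 1, 2, ... forever (illustrative helper)."""
    n = 0
    while True:
        yield n
        n += 1


# Take only the first five values; the rest are never computed.
first_five = list(itertools.islice(naturals(), 5))
print(first_five)

# A generator expression stays small no matter how many items it will yield,
# whereas the equivalent list holds every item in memory at once.
squares_gen = (i * i for i in range(100_000))
squares_list = [i * i for i in range(100_000)]
print(sys.getsizeof(squares_gen) < sys.getsizeof(squares_list))
```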

If you know any more use cases, please do share in the comments, and if you want to share something else or talk about generators, feel free to ping me on Twitter.

Till then Cheers :)
Happy Digging.

What are Closures?

December 4, 2020

—date:2019-09-19 originally posted here

Today we will talk about closures, a kind of function object in Python.

A closure is a function object which has access to the local variables (free variables) of its enclosing scope and can be executed outside that scope. A nested function is a closure if

It can access variables that are local to the enclosing scope.
It can be executed outside of its scope.

Closures can be used for

Replacing hard-coded constants.
Eliminating globals.
Data hiding, among many other things.
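
Here is a minimal sketch of the data-hiding and no-globals uses, assuming a hypothetical make_counter() helper that I made up for this illustration: the count lives only inside the closure, so no global variable is needed and nothing outside can rebind it directly.

```python
def make_counter(start=0):
    """Return a function that remembers its own count (illustrative helper)."""
    count = start  # hidden state: reachable only through the inner function

    def increment():
        nonlocal count  # rebind the enclosing variable instead of using a global
        count += 1
        return count

    return increment


counter = make_counter()
print(counter())  # 1
print(counter())  # 2
```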

So let's look at a code snippet to see how a closure and a nested function differ from each other.

# nested functions
def inc_x(x):
    def add_10(x=x):
        print("{0} is increased by 10 = {1}".format(x, x+10))
    return add_10()  # note the parentheses: add_10 is called here, inside inc_x.

inc_value = inc_x(10)  # Output: 10 is increased by 10 = 20
inc_value  # None, because add_10() does not return anything.

So the above function is a nested function, not a closure, because

The inner function add_10 doesn't access the local variable of the enclosing function inc_x. It uses a copy of the value of x, bound as a default argument, rather than a reference to the enclosing variable.
The inner function add_10 is never executed outside the scope of inc_x.

Now let's see an example of a closure.

# closure functions 
def inc_x(x):  
    def add_10(): 
        print("{0} is increased by 10 = {1}".format(x, x+10)) 
    return add_10 # returning function without parenthesis, passing only references.

inc_value = inc_x(10)
# We are able to execute the inner function outside its scope.
inc_value() 
#Output: 10 is increased by 10 = 20

So the above code is a closure rather than just a nested function because

The add_10 function accesses the local variable of the inc_x function. Here a reference to the local variable of inc_x is maintained in add_10.
add_10 can be executed even outside the body/scope of the inc_x function.

A closure in Python is created by a function call. Here, every time inc_x is called, a new instance of the add_10 function is created, and a binding reference is made to x, which is used in the add_10 function.
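
To illustrate that each call creates a new function with its own binding, here is a small sketch. The names add_to_1 and add_to_5 are made up for this example, and this variant returns the sum instead of printing it so the results are easy to compare.

```python
def inc_x(x):
    def add_10():
        return x + 10  # x is looked up in the enclosing call's scope
    return add_10


add_to_1 = inc_x(1)
add_to_5 = inc_x(5)

# Each call to inc_x produced a separate function object with its own x.
print(add_to_1 is add_to_5)    # False
print(add_to_1(), add_to_5())  # 11 15
```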

So let's see how these variable references are maintained under the hood.

The function attribute func_closure in Python 2.x, or __closure__ in Python 3.x, saves the references to these variables, also called free variables. Let's see how to access these values.

# Taking same example for the above code
def inc_x(x):  
    def add_10(): 
        print("{0} is increased by 10 = {1}".format(x, x+10)) 
    return add_10

add_10 = inc_x(30)

add_10()
# Output: 30 is increased by 10 = 40

# Checking whether it is Closure or not.

add_10.__closure__ is not None

# Output: True

# Getting the free variable value from closure.

add_10.__closure__[0].cell_contents 

# Output: 30

While talking about closures we also came across the term free variables, which is an interesting topic too; I will cover it in the next blog post. Till then

Cheers !! :)
Happy Learning

Shell: Day #5

July 13, 2020

Today I went through the commands to monitor processes and how to handle them.

ps – reports a snapshot of the current processes
init – the parent process of all the processes.
pstree – same as ps, but lists the processes in the form of a tree with more details.
top – lists all the running processes and refreshes the snapshot periodically.
kill – sends a signal to a process
- INT – 2 – interrupt, stop running
- TERM – 15 – ask a process to exit gracefully
- KILL – 9 – force the process to stop running
- TSTP – 18 – request the process to stop temporarily (the number is platform-dependent; on Linux x86 it is 20)
- HUP – 1 – hang up
nice – every process runs with a priority, and with nice we can control it; niceness ranges from +19 (very nice) to -20 (not very nice), and the lower the niceness, the higher the priority
renice – changes the priority of an existing process
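
Since the rest of my code on this blog is Python, here is a rough sketch of the same signal idea from Python, assuming a POSIX system. It starts a throwaway child process, sends it TERM the way `kill -TERM PID` would, and then prints the signal numbers of the current platform, which is a handy way to check values such as TSTP.

```python
import signal
import subprocess
import sys

# Start a child process that would otherwise sleep for 30 seconds.
child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(30)"])

# Equivalent of `kill -TERM <pid>`: ask the process to exit.
child.send_signal(signal.SIGTERM)
child.wait(timeout=10)

# On POSIX, a return code of -N means "terminated by signal N".
print(child.returncode == -signal.SIGTERM)

# Signal numbers on this platform; SIGTSTP in particular differs between systems.
for name in ("SIGHUP", "SIGINT", "SIGKILL", "SIGTERM", "SIGTSTP"):
    print(name, int(getattr(signal, name)))
```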


>> top
PID  User  PR  NI  VIRT     RES     SHR     S  %CPU %MEM Time+ Command
3911 user  20   0 2855988 206872 141304 S  72.2  2.6   7:15.09 Web Content                                                                       
31980 user  20   0 3703988 509176 188188 S  33.3  6.3  49:36.10 firefox                                                                           
 2839 user  20   0 2834092 191744 128268 S  27.8  2.4  16:13.39 Web Content    

>>ps
  PID TTY          TIME CMD
 2418 pts/2    00:00:00 zsh
 4318 pts/2    00:00:00 ps

>> pstree | less
systemd-+-NetworkManager-+-dhclient
        |                |-dnsmasq---dnsmasq
        |                |-{gdbus}
        |                `-{gmain}
        |-accounts-daemon-+-{gdbus}
        |                 `-{gmain}
        |-acpid
        |-agetty
        |-apache2---2*[apache2---26*[{apache2}]]
        |-at-spi-bus-laun-+-{dconf worker}
        |                 |-{gdbus}
        |                 `-{gmain}
...
>> nice -n 10 long-running-command &
>> renice 20 2984
>> renice 15 -u mike # changes the niceness of all the processes of the user mike
>> kill -9 PID

#shell #dgplug #ilugc

Shell: Day #4

July 10, 2020

Today was the day with the commands grep and sed.

grep – command used for pattern matching; it has many useful options
- -i: makes the search case-insensitive
- -r: searches through the files in a directory recursively
- -l: prints the names of the files with a matching string
- -c: prints the count of matches
- -n: numbers the matching lines in the output
- -v: acts like a NOT condition, printing the lines that do not match
sed – reads the input lines, runs a script on them, and writes them to stdout. This command is good for string replacement and editing of files.

Both these commands can be used with regex for the pattern matching.


>> grep -nv '^#\|^$' /etc/services | less
# will list all the lines from the file which do not start with *#* and are not empty.

>> sed 's/UNIX/LINUX/' file.txt

# sed command will replace the first occurrence of the word *UNIX* on each line with *LINUX*
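
For comparison, the same two regex ideas can be sketched in Python with the re module. The sample text is made up and only stands in for a real file; this is an illustration of the patterns, not a replacement for grep or sed.

```python
import re

# Sample text standing in for a file such as /etc/services (made up for this sketch).
text = """# comment line

ssh 22/tcp
UNIX is UNIX
"""

# Like grep -nv with the pattern ^# or ^$: keep numbered lines that are
# neither comments nor empty.
skip = re.compile(r"^#|^$")
kept = [(number, line)
        for number, line in enumerate(text.splitlines(), start=1)
        if not skip.search(line)]
print(kept)

# Like sed 's/UNIX/LINUX/': replace only the first match on each line.
replaced = [re.sub(r"UNIX", "LINUX", line, count=1) for line in text.splitlines()]
print(replaced)
```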

#shellrun #dgplug #ilugc

Shell: Day #3

July 9, 2020

Today was the day of basic File Management, Pipes, and Redirects. So let's jump to the commands.

mkdir – creates a directory for you; -p will create the parent directories if they do not exist.
rmdir – removes an empty directory
rm – removes files, and directories with -r.
pushd & popd – these are new commands I came across; pushd lets you save the current directory while moving to another one, and popd takes you back to the saved directory when you require. dirs lets you list the directories you can pop back to.
file – tells you about the format of the file
?, * – these are the wildcards which help in pattern matching, * for any number of characters and ? for exactly one character
| – called a pipe, it takes the output of one program and gives it as input to another program
Redirection (<, >) – < indicates the file to read input from, > indicates the file to write output to.
>> – appends the output to the end of the file; if the file does not exist, it creates a new one.
File Descriptors – Standard Input (0), Standard Output (1), Standard Error (2); make sure to check the example below to see how you can use them
xargs – reads text from standard input and passes it as arguments to the following command
tee – a combination of > and |; it lets you copy data from the input to both the output and a file

Now let's see these commands in action.


>> mkdir -p chess/pieces/board # creates the directory along with its missing parents.
>> rmdir -p chess/pieces/board # will delete each directory in the path, as long as it is empty
>> pushd /media/USB # moves to this path and saves the directory you came from
... # any command you run b/w
>> popd # this will get you back to the directory you were in before pushd
>> dirs # list all the dir paths saved
~/bash-trail ~ / ~/Code/tranzact ~/bash-trail/program

>> file shopping_list 
shopping_list: ASCII text

>> whoami | rev # will reverse the output of the *whoami* command
keepdnas

>> last > last-login.txt # will save the output of login user to the file

>> wc < last-login.txt # will pass the text from the file as input to the *wc* command
>> program 2> file # will write the Standard error from the program to the file. **File Descriptor**

>> find /media/USB | xargs -n 3 rm -f  # find lists the files in the USB dir and xargs passes 3 filenames at a time to rm to remove them.

>> last | tee everyone.txt | grep bob > bob.txt
# Saves the details of everyone's logins in everyone.txt and Bob's logins in bob.txt.
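
The file descriptor and redirection ideas above can also be sketched from Python with the subprocess module. The two tiny child programs here are made up for the illustration: the first writes one line to standard output and one to standard error, and the second counts the lines it receives on standard input.

```python
import subprocess
import sys

# A tiny child program that writes one line to stdout (fd 1) and one to stderr (fd 2).
child_code = "import sys; print('to stdout'); print('to stderr', file=sys.stderr)"

# Capturing both streams separately is the counterpart of `program > out 2> err`.
result = subprocess.run(
    [sys.executable, "-c", child_code],
    capture_output=True,
    text=True,
)
print(repr(result.stdout))
print(repr(result.stderr))

# Feeding text to a program's stdin, like `wc < file`; here the child counts lines.
count_lines = "import sys; print(len(sys.stdin.readlines()))"
counted = subprocess.run(
    [sys.executable, "-c", count_lines],
    input="one\ntwo\nthree\n",
    capture_output=True,
    text=True,
)
print(counted.stdout.strip())
```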

#shellrun #dgplug #ilugc

Shell: Day #2

July 8, 2020

Today I ran through the commands to process text streams from the shell.

less – a pager that lets you view the content of a file one screen at a time.
sort – helps you sort the output; -f does a case-insensitive sort and -n a numerical sort.
cut – helps to select fields (-f) or characters (-c)
fmt – formats the output of the file; you can specify the width with -w
tac – similar to the cat command, but prints the lines in reverse order
sed – this command is used to process each line of a file with a script

Let's see these commands in action.


>> less hello.txt
Hello World
THis is a text file
hello.txt (END)

>> cat shopping_list 
cucumber
bread
fish fingers

>> sort shopping_list         
bread
cucumber
fish fingers

>>date
Thu Jul  9 00:21:17 IST 2020

>>date | cut -d " " -f1
Thu

>>cat COPYING |  less  
The GNU General Public License is a free, copyleft license for
software and other kinds of works.

  The licenses for most software and other practical works are designed

>> cat COPYING |  less  | fmt -w 30
The GNU General Public
  License does not permit
  incorporating your program

>>cat copybump.py
#! /usr/bin/env python3

import datetime
import os
import re
import stat
import sys
...
if __name__ == '__main__':
    do_walk()

>> tac copybump.py
    do_walk()
if __name__ == '__main__':
...
import sys
import stat
import re
import os
import datetime

#! /usr/bin/env python3

>> sed -f spelling.sed < report.txt > corrected.txt # corrects the spelling mistakes in report.txt and writes the corrected text to corrected.txt
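
As a side note for Python folks, the sort and cut options above map roughly onto sorted() with a key and str.split(). The sample data below is made up for this sketch, and the mapping is only approximate since the shell tools have many more options.

```python
# Made-up sample data for this sketch.
words = ["cucumber", "Bread", "fish fingers", "apple"]
numbers = ["10", "9", "100", "1"]

# Python's default string sort compares code points, so uppercase comes first.
default_sorted = sorted(words)
print(default_sorted)

# Roughly like `sort -f`: ignore case when comparing.
folded_sorted = sorted(words, key=str.lower)
print(folded_sorted)

# Roughly like `sort -n`: compare as numbers rather than as text.
numeric_sorted = sorted(numbers, key=int)
print(numeric_sorted)

# Roughly like `cut -d " " -f1`: split on the delimiter and keep the first field.
first_field = "Thu Jul  9 00:21:17 IST 2020".split(" ")[0]
print(first_field)
```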

#shellrun #dgplug #ilugc

Shell: Day #1

July 7, 2020

This post and the ones that follow will be notes to share my journey of going through the shell, to brush up on the commands I had forgotten and to learn some new ones.

So here are a few things I learned today.

!! – will run the previous command again.
!String – runs the last command starting with the given string.
!$ – will give the last argument of the previous command
!^ – will give you the first argument of the previous command
^String^replacement – will rerun the previous command with the first occurrence of String replaced by the replacement string.
Ctrl + A – will get you to the start of the line.
Ctrl + E – will get you to the end of the line.
Ctrl + D – will delete the current character, and can even close your shell session :).
For loop – yup, we can write a for loop to perform certain repetitive actions.
locate – can help you search for files on the drive.
find – it not only helps you search for files by name, but also has many options to act on the results, and you can use regex for the file search too.

Now let's dive into some cool examples.

>> ls
...
>> clear
>>!! # will refer to the previous command
>> clear
>> !l # refers to the previous command start with a given string in our case `l`
>> ls
>> cd Documents
>> echo !$ # will refer to the *Documents* args from the previous command, same will be the case with *!^*  which refer the first args of the previous command
>> ls
>> echo !$
>> echo ls
>> echo Documents
>> for file in *; do echo ${file}; done # try to run this in the shell to see the output

#shellrun #dgplug #ilugc

Git: Rebase

June 1, 2020

Git rebase is a very handy command to integrate changes from one branch into another. We can run git rebase in two modes: standard and interactive. In standard mode, all the commits from the current branch are taken and applied on top of the head of the branch passed to the command; in interactive mode, you have more control over what to do with each commit.

So to understand how rebase works, let's take an example where we have a master branch with the following commit.

>>> git log
commit 8bfb8c19c3d7b795e9698a9818880d89ca3c214a
Author: Sandeep <sandeepchoudhary1507@gmail.com>
Date:   Sun Jun 14 01:06:07 2020 +0530

    New goals added

From this master branch, we create a new branch dev and make some changes/bug fixes.

>>> git checkout -b dev master
...
>>> git log
commit a05b6cd75e604df0f4434a574809a4fc14e4313e
Author: Sandeep <sandeepchoudhary1507@gmail.com>
Date:   Sun Jun 14 01:08:11 2020 +0530

    workaround bugs

commit 8bfb8c19c3d7b795e9698a9818880d89ca3c214a
Author: Sandeep <sandeepchoudhary1507@gmail.com>
Date:   Sun Jun 14 01:06:07 2020 +0530

    New goals added

In the meantime, another developer pushes changes to the master branch. To integrate those changes into your current branch, you can use the merge or rebase command; rebase helps you maintain a linear history of your workflow.

>>> git checkout master
>>> git log
commit 3c5d6baf13aeac37d9efb1218bbf3240ec5c2a12
Author: Sandeep <sandeepchoudhary1507@gmail.com>
Date:   Sun Jun 14 01:07:29 2020 +0530

    new release added

commit 8bfb8c19c3d7b795e9698a9818880d89ca3c214a
Author: Sandeep <sandeepchoudhary1507@gmail.com>
Date:   Sun Jun 14 01:06:07 2020 +0530

    New goals added

So now, to integrate the new changes from master into your branch dev without making the commit history complex, we can use the rebase command. Let's check it out.

>>> git checkout dev
>>> git rebase master
>>> git log
commit 184805896dd5684fc076b9bb9aa34eb3994251b1
Author: Sandeep <sandeepchoudhary1507@gmail.com>
Date:   Sun Jun 14 01:08:11 2020 +0530

    workaround bugs

commit 3c5d6baf13aeac37d9efb1218bbf3240ec5c2a12
Author: Sandeep <sandeepchoudhary1507@gmail.com>
Date:   Sun Jun 14 01:07:29 2020 +0530

    new release added

commit 8bfb8c19c3d7b795e9698a9818880d89ca3c214a
Author: Sandeep <sandeepchoudhary1507@gmail.com>
Date:   Sun Jun 14 01:06:07 2020 +0530

    New goals added

We can also run the rebase command in --interactive mode, which gives us the option to edit/squash/... the commits.

>>> git rebase -i master
pick 1848058 work around bugs

# Rebase 3c5d6ba..1848058 onto 3c5d6ba (1 command(s))
#
# Commands:
# p, pick = use commit
# r, reword = use commit, but edit the commit message
# e, edit = use commit, but stop for amending
# s, squash = use commit, but meld into previous commit
# f, fixup = like "squash", but discard this commit's log message
# x, exec = run command (the rest of the line) using shell
# d, drop = remove commit
#
# These lines can be re-ordered; they are executed from top to bottom.
#
# If you remove a line here THAT COMMIT WILL BE LOST.
#
# However, if you remove everything, the rebase will be aborted.
#
# Note that empty commits are commented out

One of the cool uses of the git rebase command is that you can change the base of a branch from one branch to another with the --onto option.

Let's assume you create a branch featureA from master and then another branch featureB from featureA. To change the base of the featureB branch from featureA to master:

# git rebase --onto <newbase> <oldbase> <branch>
>>> git rebase --onto master featureA featureB

So this is all about the git rebase command, which can help you keep your commit history clean and the commits of your current working branch in sync with the master branch.

#git

Python: Decorators Part II

May 10, 2020

In this post, we will talk about the decorators in the Python standard library and a few more things about decorators. If you haven't read the previous blog post about decorators, go check it out here. I will be waiting...

We will talk about the functools.lru_cache decorator from the Python standard library, where LRU means Least Recently Used.

lru_cache, as the name suggests, saves the previous results of a function call keyed by its arguments and reuses a saved result when the same arguments are passed again, to save expensive calculations.

functools.lru_cache(maxsize=128, typed=False)

maxsize is the number of results which can be cached; once the cache is full, the least recently used result is discarded. One should use a power of 2 as the maxsize value for optimal performance.
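
As a small sketch of the caching in action, here is a naive recursive Fibonacci function (my own choice of example) wrapped with lru_cache. The cache_info() method, which lru_cache adds to the wrapped function, reports how many calls were served from the cache.

```python
import functools


@functools.lru_cache(maxsize=128)
def fib(n):
    """Naive recursive Fibonacci; the cache keeps it from recomputing values."""
    if n < 2:
        return n
    return fib(n - 1) + fib(n - 2)


print(fib(30))           # 832040
print(fib.cache_info())  # hits, misses, maxsize and current size of the cache
```

Without the decorator the same call would recompute the smaller values over and over; with it, each value of n is computed only once.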

typed=True means arguments of different types are cached separately. By default, int and float values such as 1 and 1.0 are usually treated as the same argument, but if typed is set to True they will be treated differently.

>> 1 == 1.0
True
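
Here is a minimal sketch of the typed option, using a made-up double() function and a calls list that records every time the function body really runs.

```python
import functools

calls = []


@functools.lru_cache(maxsize=None, typed=True)
def double(x):
    calls.append(x)  # record every real (uncached) execution
    return x * 2


double(1)
double(1.0)  # cached separately from 1 because typed=True
double(1)    # served from the cache, the body does not run again

print(calls)  # [1, 1.0]
```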

lru_cache uses a dict to save the results, keyed by the positional and keyword arguments, so all the arguments passed to the decorated function should be hashable.

Some points to remember about decorators

Decorators are executed when the module is loaded by Python, and the decorated function is executed only if explicitly invoked.
Decorators have the power to return an entirely different function.
We can also have a parameterized decorator, as we have seen with the lru_cache decorator.
Stacked decorators means more than one decorator is applied to a function; the decorators are then applied starting from the one nearest to the function definition and moving outward. Let's see an example.

@d2
@d1
def func():
    print('f')

# the above is equivalent to
func = d2(d1(func))
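
To make the order visible, here is a runnable sketch with hypothetical d1 and d2 decorators (the post does not define them) that record what happens in a list. It shows both points: the decorators run as soon as the function is defined, nearest first, and the wrappers run from the outside in when the function is called.

```python
order = []


def d1(func):
    order.append("d1 applied")  # runs once, when the def statement is executed

    def wrapper():
        order.append("d1 wrapper")
        return func()
    return wrapper


def d2(func):
    order.append("d2 applied")

    def wrapper():
        order.append("d2 wrapper")
        return func()
    return wrapper


@d2
@d1
def func():
    order.append("func body")


print(order)      # decorators already ran: ['d1 applied', 'd2 applied']
func()
print(order[2:])  # ['d2 wrapper', 'd1 wrapper', 'func body']
```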

So that's a wrap from my side on the topic of decorators.

#python