Wednesday, 27 August 2014

Data Handling (4/4): Web data handling using a Python library

Today I am going to introduce another way to get web data, this time using a Python library.

I think it's much easier than the previous one.
Python ships with a useful standard library module called "urllib". It lets your program fetch a page's HTML source directly, without downloading or copying the page by hand.

To use this library, import it at the top of your Python program.


I am going to do the same thing we did in the previous posts.
All you have to do is call this library's urlopen method to get the HTML source. The rest of the Python code is much the same as before.




==================
import urllib   # Python 2: urlopen lives directly in urllib
import re

def main(filename):
    # Fetch the page and read its HTML source into a string.
    htmlfile = urllib.urlopen("http://www.nbastuffer.com/2013-2014_NBA_Regular_Season_Player_Stats.html")
    htmltext = htmlfile.read()
    htmlfile.close()

    # One table row: name (two words), team (three characters), position, age.
    pattern = re.compile(r'<td></td><td>(\w+\s\w+)</td><td>(\w\w\w)</td><td>(\w+)</td><td>(\d+)</td><td>')

    nba_contents = re.findall(pattern, htmltext)

    # Write one tab-separated line per player; "with" closes the file for us.
    with open(filename, "w") as fileopen:
        for (name, team, position, age) in nba_contents:
            data = '%s\t%s\t%s\t%s' % (name, team, position, age)
            fileopen.write(data + '\n')

if __name__ == '__main__':
    filename = raw_input('Enter Filename : ')
    main(filename)

=======================
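
The script above is written for Python 2, where urlopen lives directly in urllib. On Python 3 the same function is found in urllib.request and read() returns bytes instead of a string, so a rough equivalent might look like the sketch below. The helper names fetch_html and parse_players, the UTF-8 decoding, and the hand-written sample row are my own assumptions for illustration; they are not part of the original script. Keeping the parsing separate from the download also lets you try the pattern on a small HTML string first.

```python
import re
import urllib.request

# Same row pattern as the script above: name, team, position, age.
ROW_PATTERN = re.compile(
    r'<td></td><td>(\w+\s\w+)</td><td>(\w\w\w)</td><td>(\w+)</td><td>(\d+)</td><td>'
)

def fetch_html(url):
    """Download a page and return its source as text (assumes UTF-8)."""
    with urllib.request.urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")

def parse_players(htmltext):
    """Return a list of (name, team, position, age) tuples found in the HTML."""
    return ROW_PATTERN.findall(htmltext)

# Try the pattern on a hand-written row shaped the way the pattern expects.
sample = '<td></td><td>Steven Adams</td><td>Okc</td><td>C</td><td>20</td><td>'
print(parse_players(sample))
```

For the real page you would call parse_players(fetch_html(url)) with the URL from the script and write the tuples to a file exactly as before.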

Now save this source as WEB_NBA.py in an appropriate Linux directory and run it.


[hadoop15:52:36@NBA]$python WEB_NBA.py
Enter Filename : NBAFILE_Through_web.txt
[hadoop15:53:49@NBA]$cat NBAFILE_Through_web.txt  |more
Quincy Acy    Tor    SF    23
Quincy Acy    Sac    SF    23
Steven Adams    Okc    C    20
Jeff Adrien    Cha    PF    27
Jeff Adrien    Mil    PF    27
Arron Afflalo    Orl    SG    28
......


You can now see the file named 'NBAFILE_Through_web.txt' in the directory.
That's it, we're done.
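
If you would rather check the result from Python than with cat, the tab-separated file can be read back with the standard csv module. This is only a minimal sketch for Python 3: the function name read_players is hypothetical, and the demo writes its own temporary file containing two rows copied from the output above, so it runs without the real file.

```python
import csv
import os
import tempfile

def read_players(path):
    """Read tab-separated rows of name, team, position, age from a file."""
    with open(path, newline="") as f:
        return [tuple(row) for row in csv.reader(f, delimiter="\t")]

# Demo: write two rows in the same layout to a temporary file, then read them back.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("Quincy Acy\tTor\tSF\t23\n")
    tmp.write("Steven Adams\tOkc\tC\t20\n")

players = read_players(tmp.name)
os.remove(tmp.name)
print(players)
```

With the real output you would call read_players('NBAFILE_Through_web.txt') instead.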
