First, we need a appropriate table for targeting data format. I am going to upload 4 columns which are player's name, team, position and age.
Table format is like this
mysql> desc NBA_LOAD;
+----------+---------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+----------+---------------+------+-----+---------+-------+
| name | varchar(200) | YES | | NULL | |
| team | varchar(10) | YES | | NULL | |
| position | varchar(10) | YES | | NULL | |
| age | decimal(10,0) | YES | | NULL | |
+----------+---------------+------+-----+---------+-------+
4 rows in set (0.00 sec)
And then, my python program should use some modules such as a 'MySQLdb' module for database connection, and 're' module for searching for the specific patten.
1. Database connection
- MySQLdb.connect("localhost","dkkim","dkkim","dkkim")
2. HTML Pattern match
- nba_contents=re.findall('<td></td><td>([\w]+[\s][\w]+)</td><td>(\w\w\w)</td><td>(\w+)</td><td>(\d+)</td><td>',filecontents )
I use two functions.
First main function is written for searching a appropriate html pattern, I would like to hook and insert 4 kinds of meaningful columns which are name, team, positions, age.
Second function is created for data insert with row by row.
Of course, this program is not optimized for the best performance. Maybe you can make a better code in consideration of database performance.
import MySQLdb
import sys
import re
def main(filename):
print 'this is the name of the %s' %filename
fileopen=open(filename,"r+")
filecontents=fileopen.read()
nba_contents=re.findall('<td></td><td>([\w]+[\s][\w]+)</td><td>(\w\w\w)</td><td>(\w+)</td><td>(\d+)</td><td>',filecontents )
for filerows in nba_contents:
(name,team,pos,age)=filerows
dbinsert(name,team,pos,age)
def dbinsert(name,team,pos,age):
db = MySQLdb.connect("localhost","dkkim","dkkim","dkkim")
cursor=db.cursor()
cursor.execute("insert into NBA_LOAD values(%s,%s,%s,%s)",(name,team,pos,age))
db.commit()
db.close()
if __name__=='__main__':
filename = sys.argv[1]
main(filename)
Now everything is ready.
Let's execute this python script. check out whether expected data is loaded or not. (HTML data should be located in same directory of your python file.)
1. Execute python program
[hadoop22:47:39@NBA]$ls NBA.html
NBA.html
[hadoop22:47:43@NBA]$python NBAU_dbinsert.py NBA.html
this is the name of the NBA.html
2. Check your data by querying generated table.
mysql> select * from NBA_LOAD ;
+-----------------------+------+----------+------+
| name | team | position | age |
+-----------------------+------+----------+------+
| Quincy Acy | Tor | SF | 23 |
| Quincy Acy | Sac | SF | 23 |
| Steven Adams | Okc | C | 20 |
| Jeff Adrien | Cha | PF | 27 |
| Arron Afflalo | Orl | SG | 28 |
| Alexis Ajinca | Nor | C | 25 |
It's done.
No comments:
Post a Comment