Monday, 31 March 2014

Data Handling (3/4) - Load data to Mysql using python programming

In the previous post,  I used LOAD command to upload NBA raw data file into the MYSQL database. In this post, I am going to use python program as the same purpose.

First, we need a appropriate table for targeting data format. I am going to upload 4 columns which are player's name, team, position and age.

Table format is like this

mysql> desc NBA_LOAD;
+----------+---------------+------+-----+---------+-------+
| Field    | Type          | Null | Key | Default | Extra |
+----------+---------------+------+-----+---------+-------+
| name     | varchar(200)  | YES  |     | NULL    |       |
| team     | varchar(10)   | YES  |     | NULL    |       |
| position | varchar(10)   | YES  |     | NULL    |       |
| age      | decimal(10,0) | YES  |     | NULL    |       |
+----------+---------------+------+-----+---------+-------+
4 rows in set (0.00 sec)


And then, my python program should use some modules such as a 'MySQLdb' module for database connection, and 're' module for searching for the specific patten.

1. Database connection
- MySQLdb.connect("localhost","dkkim","dkkim","dkkim")
2. HTML Pattern match
- nba_contents=re.findall('<td></td><td>([\w]+[\s][\w]+)</td><td>(\w\w\w)</td><td>(\w+)</td><td>(\d+)</td><td>',filecontents )

I use two functions.
First main function is written for searching a appropriate html pattern, I would like to hook and insert  4 kinds of  meaningful columns which are name, team, positions, age.
Second function is created for data insert with row by row.

Of course, this program is not optimized for the best performance. Maybe you can make a better code in consideration of database performance.

import MySQLdb
import sys
import re

def main(filename):
  print 'this is the name of the %s' %filename
  fileopen=open(filename,"r+")
  filecontents=fileopen.read()
  nba_contents=re.findall('<td></td><td>([\w]+[\s][\w]+)</td><td>(\w\w\w)</td><td>(\w+)</td><td>(\d+)</td><td>',filecontents )
  for filerows in  nba_contents:
      (name,team,pos,age)=filerows
      dbinsert(name,team,pos,age)

def dbinsert(name,team,pos,age):
  db = MySQLdb.connect("localhost","dkkim","dkkim","dkkim")
  cursor=db.cursor()

  cursor.execute("insert into NBA_LOAD values(%s,%s,%s,%s)",(name,team,pos,age))
  db.commit()
  db.close()

if __name__=='__main__':
  filename = sys.argv[1]
  main(filename)


Now everything is ready.
Let's execute this python script. check out whether expected data is loaded or not. (HTML data should be located in same directory of your python file.)

1. Execute python program
[hadoop22:47:39@NBA]$ls NBA.html
NBA.html
[hadoop22:47:43@NBA]$python NBAU_dbinsert.py NBA.html
this is the name of the NBA.html

2. Check your data by querying generated table.

mysql> select * from NBA_LOAD ;
+-----------------------+------+----------+------+
| name                  | team | position | age  |
+-----------------------+------+----------+------+
| Quincy Acy            | Tor  | SF       |   23 |
| Quincy Acy            | Sac  | SF       |   23 |
| Steven Adams          | Okc  | C        |   20 |
| Jeff Adrien           | Cha  | PF       |   27 |
| Arron Afflalo         | Orl  | SG       |   28 |
| Alexis Ajinca         | Nor  | C        |   25 |




It's done.



No comments: