Monday, 27 January 2014

Data Handling (1/4) - unstructured data manipulation

Before I begin to talk about data mining, I would like to share my ideas about data handling. Data mining means handling a large amount of data, usually on a high-performance system, in order to find the insight hidden in it. If your result comes from computing a small data set on your PC, I don't think it is real data mining. In general, the larger the sample you analyze, the narrower the confidence interval of your estimates.

I think that to become a practical data analyst, you have to be familiar with data manipulation. Over the next several posts, I will introduce some data handling skills.

Today, I am going to show you how to turn unstructured data into a structured format using an interpreted language such as Python.

Let me start by opening the web page below.

 http://www.nbastuffer.com/2013-2014_NBA_Regular_Season_Player_Stats.html

This page provides NBA statistics: the '2013-2014 NBA Regular Season Player Stats'.





Let's assume that I would like to get the first four columns (player, team, position, age) of the table on that page. Of course, you might think that you can get the data by dragging your mouse over the table and copying it. However, I would like to do this with a programming approach.

First, get the HTML source of the page by choosing "View Page Source" from the menu that appears when you right-click on the page. Then save that source as a file on your system.

* In my case, I saved it as a new file named "NBA.html" on my Linux system.



There are many programming languages for handling data, but one of my favorites is Python. Python is an interpreted language, so there is no separate compile step before running a program. Furthermore, like R, Python has a wide variety of libraries.

My Python program searches the whole HTML source, extracts the appropriate part of the data, and finally writes the result to a file on your Linux file system.
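
The original program was posted as a picture, so here is a minimal sketch of the same idea, not the exact code I ran. It assumes the statistics sit in ordinary <tr>/<td> table rows, and it uses only the standard library. The names TableRowParser, extract_columns, and html_to_csv, and the output file name "NBA.csv", are made up for this example.

```python
import csv
from html.parser import HTMLParser


class TableRowParser(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # cells of the row being read, or None outside a row
        self._cell = None   # text pieces of the cell being read, or None

    def _close_cell(self):
        # Store the current cell (if any) in the current row.
        if self._cell is not None and self._row is not None:
            self._row.append("".join(self._cell).strip())
        self._cell = None

    def _close_row(self):
        # Store the current row (if any and not empty) in the result.
        self._close_cell()
        if self._row:
            self.rows.append(self._row)
        self._row = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._close_row()   # tolerate a missing </tr>
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._close_cell()  # tolerate a missing </td>
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._close_cell()
        elif tag == "tr":
            self._close_row()


def extract_columns(html_text, n_columns=4):
    """Return the first n_columns cells of every row that has at least that many."""
    parser = TableRowParser()
    parser.feed(html_text)
    parser.close()
    return [row[:n_columns] for row in parser.rows if len(row) >= n_columns]


def html_to_csv(html_path="NBA.html", csv_path="NBA.csv"):
    """Read the saved page and write the first four columns as CSV."""
    # Assumption: the saved page is UTF-8; undecodable bytes are replaced.
    with open(html_path, encoding="utf-8", errors="replace") as f:
        rows = extract_columns(f.read())
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        csv.writer(f).writerows(rows)
    return len(rows)
```

Calling html_to_csv() in the directory that holds "NBA.html" would write one line per table row. If the page contains other tables (menus, ads), their rows would be picked up too, so you may need to filter the rows further for the real page.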


Once you have generated the new file on your system, you can read the data from the R command line.

The data file on your system can be read with an R command.
Once the data has been read successfully, you can assign it to an appropriate variable.
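
Before loading the file into R, it can help to check that every line really has four fields. The following is only a small Python sketch of such a check, assuming the extracted file is comma-separated. The file name "NBA.csv" and the function names load_rows and malformed_rows are made up for this example.

```python
import csv


def load_rows(csv_path="NBA.csv"):
    """Read the extracted file back as a list of rows (lists of strings)."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        return list(csv.reader(f))


def malformed_rows(rows, n_columns=4):
    """Return (line_number, row) pairs whose field count is not n_columns."""
    return [(i, row) for i, row in enumerate(rows, start=1)
            if len(row) != n_columns]
```

If malformed_rows(load_rows()) returns an empty list, every line has the expected player, team, position, and age fields.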




Today, we saw how to turn unstructured web HTML into a structured data format using Python. In addition, I showed you how to read the data from a file on a Linux system using the appropriate R command. I think this process is very common in practice.

Next time, I will show you another case.





