Sunday, 28 September 2014

A simple data classification-part 4/4 (Using Python in a Hadoop filesystem)

Today I am going to run Python code on the Hadoop filesystem.
(The Hadoop version on my machine is 1.2.1.)

We already built the code structure in the last post, so this time we will see how to execute Python code on the Hadoop filesystem.

When we talk about Hadoop here, we really mean the Hadoop distributed file system (HDFS), which means the data is distributed across multiple nodes. The Hadoop distributed system takes care of distribution, replication and fault tolerance.

In order to execute a Python program in a Hadoop environment, you have to use the Hadoop streaming API.
Check out the following web page:
"http://hadoop.apache.org/docs/r1.1.2/streaming.html#Hadoop+Streaming"
As the document explains, any executable that reads its input from stdin and writes to stdout can be the mapper or the reducer. In our case, we use a Python mapper and reducer.

Now, we are ready to execute python code.

First, start up the Hadoop filesystem.
In my case there is only a single Hadoop node, because I installed Hadoop in a virtual machine on my PC.
   [ Hadoop Administration console ]


Second, copy the age data into the Hadoop filesystem. In my case, I make a new directory "/python" and copy the file from the local filesystem into HDFS. As soon as the text file is loaded into the Hadoop filesystem, it is distributed automatically.

[hadoop06:58:20@NBAAge]$ls -lrt NBA_Age_file.txt
-rw-rw-r-- 1 hadoop hadoop 1458 Sep 29 06:42 NBA_Age_file.txt
[hadoop06:57:44@NBAAge]$hadoop fs -mkdir /python
[hadoop06:58:16@NBAAge]$hadoop fs -copyFromLocal NBA_Age_file.txt /python/

   [ Location of Age file in Hadoop file system ]



Third, execute the Python map/reduce program using the streaming API. I specify the two Python programs as the mapper and the reducer, and I add the -file option because the programs are located on the local filesystem. The last part of the command specifies the output location for the result.

* If you want to check the mapper and reducer source, check out the post 3/4.

[hadoop07:25:26@NBAAge]$
hadoop jar /home/hadoop/hadoop/contrib/streaming/hadoop-streaming-1.2.1.jar \
    -file /home/hadoop/python2.7/MAPREDUCE/NBAAge/mapper.py \
    -mapper /home/hadoop/python2.7/MAPREDUCE/NBAAge/mapper.py \
    -file /home/hadoop/python2.7/MAPREDUCE/NBAAge/reducer.py \
    -reducer /home/hadoop/python2.7/MAPREDUCE/NBAAge/reducer.py \
    -input /python/NBA_Age_file.txt \
    -output /python/NBAAge

[ Map-Reduce execution log ]


Finally, once the map-reduce job succeeds, you can check the result using either the administration console or a Hadoop command.

Let's open the console, browse to the output directory we specified in the command, and check the result.
   [ Directory of result file ]

   [ Final result of map/reduce]


As you can see above, the frequency of each age has been calculated with map-reduce Python programming. It's fun.



Monday, 22 September 2014

A simple data classification-part 3/4 (Using Python in a Hadoop filesystem)

Through the last few posts, we made a Python program which calculates the frequency of each age. Today, we are going to adapt that code to a Big Data architecture. Suppose our age data were quite big, say 10 or 100 terabytes; processing it would take a very long time on one machine. In that case, we can consider a Big Data platform like the Hadoop filesystem. The Hadoop filesystem automatically distributes the huge data across multiple nodes, so it is much faster than a traditional single-node filesystem.

In a Hadoop filesystem, there is a particular programming model, the so-called "map-reduce" programming, for manipulating data.
There are many articles about Hadoop and map-reduce programming, so I won't explain it further; I will just focus on Python programming on a Hadoop platform.


The demo scenario is as follows.
1). The first step
    Get the individual age data from the HTML web page using the url module and the regular expression module.
2). The second step
    Once you have an age data file, copy the file to the Hadoop filesystem in order to distribute the data to multiple nodes.
3). The third step
    Write the mapper program.
4). The fourth step
    Write the reducer program.
5). The fifth step
    Run it in the Hadoop filesystem using the streaming service.


Monday, 15 September 2014

A simple data classification-part 2/4 (Using Python in a Hadoop filesystem)

In the last post, we reviewed the dictionary structure, which is composed of (key, value) pairs.
This time I am going to write a program which calculates the frequency of each age.
First of all, I would like to reuse the program posted under the title "Simple visualization using (python and R )_part1".
That program reads the HTML tags and extracts the values that match a given pattern.
Because the value we care about is just the age, we put all the age values into a list called "data2".

Now, I want to use this "data2" list as the argument for a new function called "frequency()".
Let's take a look at the source below. I defined a new function with a parameter called "agedata".

We need a dictionary. We read ages until the end of the list; if a given age is already in the dictionary, we increase its count, otherwise we add it with a count of one. This iteration continues until the end of the list.

def frequency(agedata):
    wordDic = {}
    for word in agedata:
        if wordDic.has_key(word):
            wordDic[word] = wordDic[word] + 1
        else:
            wordDic[word] = 1
    ##################################
    ## change the dictionary structure to the list structure
    ##################################
    wordDiclist = wordDic.items()
    wordDiclist.sort()
    for x, y in wordDiclist:
        print "result of age and frequency is : %s , %s " % (x, y)


As you might imagine, the result shows Age : Frequency pairs sorted by age.


[hadoop10:50:22@MATPLOT]$python agefrequency.py
result of age and frequency is : 19 , 2
result of age and frequency is : 20 , 16
result of age and frequency is : 21 , 20
result of age and frequency is : 22 , 38
result of age and frequency is : 23 , 50
result of age and frequency is : 24 , 49
result of age and frequency is : 25 , 54
result of age and frequency is : 26 , 40
result of age and frequency is : 27 , 39
result of age and frequency is : 28 , 38
result of age and frequency is : 29 , 29
result of age and frequency is : 30 , 19
result of age and frequency is : 31 , 21
result of age and frequency is : 32 , 14
result of age and frequency is : 33 , 23
result of age and frequency is : 34 , 8
result of age and frequency is : 35 , 9
result of age and frequency is : 36 , 6
result of age and frequency is : 37 , 7
result of age and frequency is : 38 , 2
result of age and frequency is : 39 , 2
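For reference, the same counting can be written more compactly with collections.Counter (a sketch in modern Python; the short sample list stands in for the full "data2" list built from the web page):

```python
from collections import Counter

def frequency(agedata):
    # Counter does the dictionary bookkeeping for us.
    for age, count in sorted(Counter(agedata).items()):
        print("result of age and frequency is : %s , %s " % (age, count))

frequency([22, 23, 23, 23, 40, 22, 22, 22])
```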

Next time, I am going to write a program with a different approach, but it should produce the same result.


Monday, 8 September 2014

A simple data classification-part 1/4 (Using Python in a Hadoop filesystem)

Through the previous two series, we learned two simple ways to visualize the NBA players' individual ages. Of course this is a meaningful task,
but people may be more interested in questions like: which age group has the greatest number of players, or vice versa?

In order to answer this we have to count the individual ages one by one. If the total number is quite small, that is feasible, but if there are thousands of values it is really tedious to do by hand.
Classification usually helps us identify complicated things more easily.

There are many classification methods.
This time I am going to write code that calculates the frequency of each age, so we can figure out which age includes the most players.

Before proceeding to the Python code, we have to understand the dictionary structure. The dictionary is one of the most common structures in Python programming. It is like a real dictionary: it is composed of (key, value) pairs.

For example, suppose we have these age values:

22, 23, 23, 23, 40, 22, 22, 22

You can see that most of the players' age is 22.
It can be described as follows:
  AGE   COUNTS
   22    :    4
   23    :    3
   40    :    1
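Counting the sample values above takes only a few lines of Python (a minimal sketch using a plain dict; the variable names are illustrative):

```python
# Build a {age: count} dictionary from the sample values.
ages = [22, 23, 23, 23, 40, 22, 22, 22]
counts = {}
for age in ages:
    counts[age] = counts.get(age, 0) + 1
print(counts)  # -> {22: 4, 23: 3, 40: 1}
```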

Monday, 1 September 2014

Simple visualization using (python and R )_part 2/2

This time, as you expected, I am going to generate one simple graph using R. R provides diverse visualization libraries, but this time I am going to use the plot function, which is very simple.

I am going to use the file generated in the last tutorial.
This file contains 4 columns (name, team, position, age) and 486 rows, but the column I am interested in is the "Age" column.

First, I read the data file at the R prompt.
Once the file data is stored in a variable, you can take the "Age" column and visualize it.


Let's try. Check the file name and location, and then glance over the file contents. As you can see below, there is no header information, so be careful with the read options when you read the file.

[hadoop08:37:06@NBA]$tail -10 NBAFILE_Through_web.txt
Jeff Withey    Nor    C    23
Nate Wolters    Mil    PG    22
Brandan Wright    Dal    C    26
Chris Wright    Mil    SF    25
Dorell Wright    Por    SF    28
Tony Wroten    Phi    SG    20
Nick Young    Lal    SG    28
Thaddeus Young    Phi    PF    25
Cody Zeller    Cha    C    21
Tyler Zeller    Cle    C    24




Then go into the R prompt and read the file with read.csv. A little extra effort is required for convenience. If you are not familiar with the plot function, read its description with a simple command:
>?plot


> nbafile = read.csv(file="/home/hadoop/python2.7/NBA/NBAFILE_Through_web.txt" , header=FALSE, sep="\t")
> names(nbafile) = c('NAME','TEAM','POSITION','AGE')
> plot(nbafile$AGE, xlab="player", ylab="Age" ,col="red", xlim=c(1,500), ylim=c(15,40), type="p", main="Individual Age")


Can you tell which graph was generated by the R command?





The answer is the right one. It is easy, because the left one is exactly the same graph we made in the previous tutorial.








Sunday, 31 August 2014

Simple visualization using (python and R )_part 1/2

Today I am going to make a simple graph with the data we generated in the last post.
We had a data set with 4 columns. I would like to visualize each individual age using Python and R.

We will compare two different visualization methods at the end of this post.

Let's look at the python case first.
In order to visualize the data, I am going to use the matplotlib library, which is very popular. Please check http://matplotlib.org/ for detailed information; you can review sample code and detailed instructions there.

This time I am going to just modify the code I wrote in the previous post.
Of course the matplotlib library is imported at the beginning, and some lines are added for displaying the graph.
As I said, my interest is the players' ages. As you can see in the source code, each individual's age is stored in a list.
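Since the original listing was shown as an image, here is a hedged sketch of what the matplotlib version looks like. The `ages` list stands in for the real list built from the web page (a few sample values are used here), and the labels mirror the R plot from part 2/2:

```python
# Hedged sketch of the matplotlib plot; "ages" is a stand-in for the
# list extracted from the web page.
import matplotlib
matplotlib.use("Agg")  # render without a display window
import matplotlib.pyplot as plt

ages = [23, 22, 26, 25, 28, 20, 28, 25, 21, 24]  # sample values only

plt.plot(range(1, len(ages) + 1), ages, "ro")  # one red dot per player
plt.xlabel("player")
plt.ylabel("Age")
plt.title("Individual Age")
plt.savefig("nba_age.png")
```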


Wednesday, 27 August 2014

Data Handling (4/4) Web Data handling using Python library

Today I am going to introduce another way to get web data using a Python library.

I think it's much easier than the previous one.
Python provides a useful library called "urllib". It lets you get the HTML source without downloading or copying it by hand.

In order to use this library you have to import it at the top of your Python program.


I am going to do the same thing we did in the previous posts.
All you have to do is use this library's open method to get the HTML source. The rest of the Python code is quite similar to the previous one.
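The approach can be sketched as follows (this uses Python 3's urllib.request rather than the Python 2 urllib the post used, and the `<td>` pattern is only an assumption about the page's markup, so adapt it to the real HTML):

```python
# Sketch: fetch HTML with urllib and pull out ages with a regular expression.
import re
import urllib.request  # Python 3 home of urlopen

def extract_ages(html):
    # Assumed row markup: <td>Name</td><td>Team</td><td>Pos</td><td>Age</td>
    return [int(age) for age in re.findall(r"<td>(\d{2})</td>", html)]

sample = "<tr><td>Cody Zeller</td><td>Cha</td><td>C</td><td>21</td></tr>"
print(extract_ages(sample))  # -> [21]

# Fetching the live page would look like:
# html = urllib.request.urlopen("http://...").read().decode("utf-8")
```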