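MADlib is an open-source software project that originated as a collaboration between UC Berkeley researchers and Greenplum. Its main goal is to extend the analytic capabilities of a database, and it supports PostgreSQL and Greenplum.
It loads easily into PostgreSQL or Greenplum to extend the database's analytic functionality. This is possible largely because PostgreSQL itself supports loadable modules.
To the user it appears in the database as a large set of analytic functions: version 1.0 contains 71 aggregate functions and 786 regular functions.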
http://db.cs.berkeley.edu/w/source-code/
An open source machine learning library on RDBMS for Big Data age
MADlib is an open-source library for scalable in-database analytics. It provides data-parallel implementations of mathematical, statistical and machine learning methods for structured and unstructured data.
The MADlib mission is to foster widespread development of scalable analytic skills by harnessing efforts from commercial practice, academic research, and open-source development. The library consists of various analytics methods, including linear regression, logistic regression, k-means clustering, decision trees, support vector machines, and more. That's not all: there is also a highly efficient user-defined data type for sparse vectors, with a number of arithmetic methods. It can be loaded and run in PostgreSQL 8.4 to 9.1 as well as Greenplum 4.0 to 4.2. This talk gives an overview of the concept, with an introduction to the problems we are tackling and our solutions to them. It will also touch on parallel data processing, which is a very active topic in both research and industry these days.
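MADlib requires Python 2.6 or later, together with PL/Python built against Python 2.6 or later.
If the database was built against an older Python, it must be rebuilt after the newer Python is installed.
Install Python 2.7.5. PL/Python needs Python's shared library, so Python must be configured with the --enable-shared option.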
tar -jxvf Python-2.7.5.tar.bz2
cd Python-2.7.5
./configure --enable-shared
make
make install
If the following error appears, the library directory needs to be added to the system's dynamic linker configuration:
[root@db-192-168-100-216 ~]# python -V
python: error while loading shared libraries: libpython2.7.so.1.0: cannot open shared object file: No such file or directory
[root@db-192-168-100-216 ~]# ldconfig -p|grep -i python
libpython2.4.so.1.0 (libc6,x86-64) => /usr/lib64/libpython2.4.so.1.0
libpython2.4.so (libc6,x86-64) => /usr/lib64/libpython2.4.so
libboost_python.so.2 (libc6,x86-64) => /usr/lib64/libboost_python.so.2
libboost_python.so.2 (libc6) => /usr/lib/libboost_python.so.2
libboost_python.so (libc6,x86-64) => /usr/lib64/libboost_python.so
libboost_python.so (libc6) => /usr/lib/libboost_python.so
Add it to the linker configuration:
[root@db-192-168-100-216 ~]# vi /etc/ld.so.conf.d/python2.7.conf
/usr/local/lib
[root@db-192-168-100-216 ~]# ldconfig
[root@db-192-168-100-216 ~]# ldconfig -p|grep -i python
libpython2.7.so.1.0 (libc6,x86-64) => /usr/local/lib/libpython2.7.so.1.0
libpython2.7.so (libc6,x86-64) => /usr/local/lib/libpython2.7.so
libpython2.4.so.1.0 (libc6,x86-64) => /usr/lib64/libpython2.4.so.1.0
libpython2.4.so (libc6,x86-64) => /usr/lib64/libpython2.4.so
libboost_python.so.2 (libc6,x86-64) => /usr/lib64/libboost_python.so.2
libboost_python.so.2 (libc6) => /usr/lib/libboost_python.so.2
libboost_python.so (libc6,x86-64) => /usr/lib64/libboost_python.so
libboost_python.so (libc6) => /usr/lib/libboost_python.so
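The interactive vi edit above can also be scripted. The following is a minimal sketch, assuming the library was installed under /usr/local/lib as shown in the transcript. The helper name register_libdir is hypothetical; it only writes the one-line configuration file, so ldconfig still has to be run as root afterwards.

```shell
# register_libdir is a hypothetical helper (not part of any package).
# It writes a one-line ld.so configuration file naming a library directory.
#   $1: directory containing the shared library, e.g. /usr/local/lib
#   $2: configuration file to create, e.g. /etc/ld.so.conf.d/python2.7.conf
register_libdir() {
    printf '%s\n' "$1" > "$2"
}

# Typical use, as root (commented out so the snippet has no side effects):
# register_libdir /usr/local/lib /etc/ld.so.conf.d/python2.7.conf
# ldconfig                          # rebuild the linker cache
# ldconfig -p | grep libpython2.7   # confirm the new entries are listed
```

Writing the file through a small function keeps the step repeatable on several hosts, which an interactive editor session does not.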
Python now starts normally:
[root@db-192-168-100-216 ~]# python -V
Python 2.7.5
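Because PL/Python links against libpython, it may be worth confirming that the new interpreter really was built as a shared library. The following is a minimal sketch that asks the interpreter for its Py_ENABLE_SHARED build setting through the standard sysconfig module. The function name is_shared_python is hypothetical, and the example path assumes the default /usr/local install prefix.

```shell
# is_shared_python is a hypothetical helper name.
# It succeeds (exit status 0) only if the interpreter given as $1 reports
# that it was configured with --enable-shared.
is_shared_python() {
    "$1" -c 'import sys, sysconfig; sys.exit(0 if sysconfig.get_config_var("Py_ENABLE_SHARED") else 1)' 2>/dev/null
}

# Example (assumes the default /usr/local install prefix):
# is_shared_python /usr/local/bin/python && echo "shared build"
```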
With Python 2.7.5 installed, build PostgreSQL:
tar -jxvf postgresql-9.2.4.tar.bz2
cd postgresql-9.2.4
./configure --prefix=/home/pg92/pgsql9.2.4 --with-pgport=2921 \
    --with-perl --with-tcl --with-python --with-openssl --with-pam \
    --without-ldap --with-libxml --with-libxslt --enable-thread-safety \
    --with-wal-blocksize=16 \
  && gmake world && gmake install-world
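PostgreSQL's configure script uses whichever python comes first on the PATH unless the PYTHON environment variable names another interpreter. After the build it may therefore be worth checking which libpython the PL/Python module links against. The following is a minimal sketch: plpython_libpython is a hypothetical helper, and the module file name plpython2.so is an assumption based on Python 2 builds of PostgreSQL 9.x.

```shell
# plpython_libpython is a hypothetical helper name.
# It prints the libpython line that ldd reports for the PL/Python module of
# the installation whose pg_config is given as $1, and fails if pg_config
# or the module cannot be found.
plpython_libpython() {
    pkglibdir="$("$1" --pkglibdir 2>/dev/null)" || return 1
    # plpython2.so is an assumed file name (Python 2 builds of PostgreSQL 9.x).
    ldd "$pkglibdir/plpython2.so" 2>/dev/null | grep -i libpython
}

# Example (assumes the --prefix used in the configure command above):
# plpython_libpython /home/pg92/pgsql9.2.4/bin/pg_config
```

If the printed line names libpython2.4 rather than libpython2.7, the build picked up the old interpreter and PostgreSQL needs to be reconfigured and rebuilt.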
Initialize and start the database:
[root@db-192-168-100-216 ~]# su - pg92
pg92@db-192-168-100-216-> initdb -D $PGDATA -E UTF8 --locale=C -W -U postgres
pg_ctl start
psql
create database digoal;
Install MADlib 1.0:
wget http://www.madlib.net/files/madlib-1.0-Linux.rpm
rpm -ivh madlib-1.0-Linux.rpm
After installation, the files are under /usr/local/madlib:
rpm -ql madlib
/usr/local/madlib/.....
Install MADlib into the database:
Make sure psql and python are on the PATH.
pg92@db-192-168-100-216-> which psql
~/pgsql/bin/psql
pg92@db-192-168-100-216-> which python
/usr/local/bin/python
pg92@db-192-168-100-216-> python -V
Python 2.7.5
pg92@db-192-168-100-216-> /usr/local/madlib/bin/madpack -p postgres -c postgres@127.0.0.1:2921/digoal install
Check that the installation is correct:
pg92@db-192-168-100-216-> /usr/local/madlib/bin/madpack -p postgres -c postgres@127.0.0.1:2921/digoal install-check
MADlib is installed into a schema named madlib.
pg92@db-192-168-100-216-> psql
psql (9.2.4)
Type "help" for help.
digoal=# \dn
  List of schemas
  Name  |  Owner
--------+----------
 madlib | postgres
 public | postgres
(2 rows)
The installation adds two tables and a large number of functions:
digoal=# set search_path="$user",madlib,public;
SET
digoal=# \dt
              List of relations
 Schema |       Name       | Type  |  Owner
--------+------------------+-------+----------
 madlib | migrationhistory | table | postgres
 madlib | training_info    | table | postgres
(2 rows)
digoal=# select * from migrationhistory;
 id | version |          applied
----+---------+----------------------------
  1 | 1.0     | 2013-07-31 15:05:50.900619
(1 row)
digoal=# select * from training_info ;
 classifier_name | result_table_oid | training_table_oid | training_metatable_oid | training_encoded_table_oid | validation_table_oid | how2handle_missing_value | split_criterion | sampling_percentage | num_feature_chosen | num_trees
-----------------+------------------+--------------------+------------------------+----------------------------+----------------------+--------------------------+-----------------+---------------------+--------------------+-----------
(0 rows)
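The function counts quoted at the top (71 aggregates and 786 regular functions) can be checked against the system catalogs. The following is a minimal sketch, assuming the connection settings passed to madpack above and a PostgreSQL release that still has the pg_proc.proisagg column (true of 9.2; the column was removed in PostgreSQL 11). The variable name MADLIB_COUNT_SQL and the function name count_madlib_functions are hypothetical.

```shell
# Query text: count the functions in the madlib schema, split into
# aggregates (proisagg = true) and all other functions.
MADLIB_COUNT_SQL="SELECT p.proisagg AS is_aggregate, count(*) AS functions
FROM pg_catalog.pg_proc p
JOIN pg_catalog.pg_namespace n ON n.oid = p.pronamespace
WHERE n.nspname = 'madlib'
GROUP BY p.proisagg
ORDER BY p.proisagg;"

# count_madlib_functions is a hypothetical helper. Host, port, user and
# database default to the values used with madpack in this walkthrough.
count_madlib_functions() {
    psql -h "${1:-127.0.0.1}" -p "${2:-2921}" -U "${3:-postgres}" \
         -d "${4:-digoal}" -c "$MADLIB_COUNT_SQL"
}

# Example:
# count_madlib_functions
```

Filtering on the schema name rather than on function names keeps the count independent of the current search_path.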