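Nutch Quick Start (Nutch 1.7)

Posted on 2014-01-21 | Category: Search-Engine

Nutch 2.2.1 currently performs worse than Nutch 1.7 (see "NUTCH FIGHT! 1.7 vs 2.2.1"), so for now I am still using Nutch 1.7.

##1 Download the prebuilt binary package and extract it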
$ wget http://psg.mtu.edu/pub/apache/nutch/1.7/apache-nutch-1.7-bin.tar.gz
$ tar zxf apache-nutch-1.7-bin.tar.gz

##2 Verify the installation

$ cd apache-nutch-1.7
$ bin/nutch

If you see "Permission denied", run the following command:

$ chmod +x bin/nutch

If a warning says that JAVA_HOME is not set, set JAVA_HOME.

##3 Add seed URLs

mkdir ~/urls
vim ~/urls/seed.txt
http://movie.douban.com/subject/5323968/

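##4 Set the URL filter rules
If you only want to crawl certain kinds of URLs, put regular expressions in conf/regex-urlfilter.txt; only URLs that match those expressions will then be crawled.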
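The rule format can be illustrated with a small sketch. Everything specific in it is an assumption for illustration and not from the original post: the three sample rules, the scratch file created with mktemp, and the helper name url_allowed. The script only emulates with grep -E how a Nutch-style filter file is read: each rule is a `+` (accept) or `-` (reject) followed by a regular expression, and the first rule whose pattern is found in the URL decides.

```shell
#!/bin/sh
# Sketch: emulate "first matching rule wins" for a Nutch-style regex filter file.
# The rules below are illustrative examples, not the author's configuration.
rules=$(mktemp)
trap 'rm -f "$rules"' EXIT

cat > "$rules" <<'EOF'
# skip common image suffixes
-\.(gif|jpg|png)$
# accept Douban movie subject pages
+^http://movie\.douban\.com/subject/[0-9]+/$
# reject everything else
-.
EOF

# url_allowed URL: exit status 0 if the first matching rule starts with '+'.
url_allowed() {
  while IFS= read -r rule; do
    case "$rule" in
      ''|'#'*) continue ;;          # skip blank lines and comments
    esac
    sign=${rule%"${rule#?}"}        # first character: '+' or '-'
    pattern=${rule#?}               # the rest of the line is the regex
    if printf '%s\n' "$1" | grep -Eq -- "$pattern"; then
      if [ "$sign" = "+" ]; then return 0; else return 1; fi
    fi
  done < "$rules"
  return 1                          # no rule matched: reject
}

for url in \
  "http://movie.douban.com/subject/5323968/" \
  "http://movie.douban.com/logo.png" \
  "http://www.douban.com/"
do
  if url_allowed "$url"; then echo "accept $url"; else echo "reject $url"; fi
done
```

Because the first match decides, the order of rules matters: a catch-all rule such as `-.` belongs at the end of the file.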


Running Nutch in Eclipse

Posted on 2014-01-20 | Category: Search-Engine

Environment: Ubuntu Desktop 12.04, JDK 1.7, Nutch 1.7

This post is mainly based on "Running Nutch in Eclipse".

##Prerequisites

Ant and Eclipse are installed on the machine
Eclipse has subclipse installed; the update site is http://subclipse.tigris.org/update_1.10.x
Eclipse has IvyDE installed; the update site is http://www.apache.org/dist/ant/ivyde/updatesite
Eclipse has the m2e plugin installed; the update site is http://download.eclipse.org/technology/m2e/releases

##1 Download the source code
There are two ways:

Download apache-nutch-1.7-src.tar.gz from the official website

Check it out with svn

$ svn co https://svn.apache.org/repos/asf/nutch/tags/release-1.7

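The second method is recommended: the SVN checkout includes a pom.xml (the Maven build file), whereas the tarball does not and contains only Ant's build.xml.

##2 Configuration
Copy conf/nutch-site.xml.template to conf/nutch-site.xml and add the following configuration to it: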

<property>
  <name>http.agent.name</name>
  <value>My Nutch Spider</value>
</property>
<property>
  <name>plugin.folders</name>
  <value>$NUTCH_HOME/build/plugins</value>
</property>

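Here $NUTCH_HOME stands for the root directory of the Nutch source tree; write your actual path in its place. Mine, for example, is /home/soulmachine/local/opt/apache-nutch-1.7.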
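As far as I know, the configuration file is not passed through a shell, so a literal `$NUTCH_HOME` in the XML would not be expanded. One way to avoid typing the path by hand is to let the shell fill it in while generating the file. This is a minimal sketch under stated assumptions: NUTCH_HOME falls back to the current directory when it is unset, and the file is written to a scratch directory (out_dir is a made-up name) instead of the real conf/ directory.

```shell
#!/bin/sh
# Sketch: write a nutch-site.xml whose plugin.folders value is an absolute path.
# Assumptions: NUTCH_HOME defaults to the current directory; output goes to a
# scratch directory so that no real configuration is overwritten.
NUTCH_HOME=${NUTCH_HOME:-$PWD}
out_dir=$(mktemp -d)
trap 'rm -rf "$out_dir"' EXIT

# The heredoc delimiter is unquoted, so the shell expands $NUTCH_HOME here.
cat > "$out_dir/nutch-site.xml" <<EOF
<?xml version="1.0"?>
<configuration>
  <property>
    <name>http.agent.name</name>
    <value>My Nutch Spider</value>
  </property>
  <property>
    <name>plugin.folders</name>
    <value>$NUTCH_HOME/build/plugins</value>
  </property>
</configuration>
EOF

cat "$out_dir/nutch-site.xml"
```

After checking the output, the generated file can be copied over conf/nutch-site.xml.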

##3 Generate the Eclipse project files (the .project file)

$ ant eclipse


Running Mahout's Naive Bayes Classifier

Posted on 2013-12-23 | Category: Machine-Learning

##1. Prepare the data

###1.1 Download and extract the dataset

wget http://people.csail.mit.edu/jrennie/20Newsgroups/20news-bydate.tar.gz
tar -xf 20news-bydate.tar.gz
# upload to HDFS
hadoop fs -put 20news-bydate-test .
hadoop fs -put 20news-bydate-train .

###1.2 Convert the format

# convert to sequence files
mahout seqdirectory -i 20news-bydate-train -o 20news-bydate-train-seq
mahout seqdirectory -i 20news-bydate-test -o 20news-bydate-test-seq
# convert to tf-idf vectors (the output names must match the input paths used for training below)
mahout seq2sparse -i 20news-bydate-train-seq -o 20news-bydate-train-vectors -lnorm -nv -wt tfidf
mahout seq2sparse -i 20news-bydate-test-seq -o 20news-bydate-test-vectors -lnorm -nv -wt tfidf

##2. Train the Naive Bayes model

mahout trainnb -i 20news-bydate-train-vectors/tfidf-vectors -el -o model -li labelindex -ow

##3. Test the Naive Bayes model

mahout testnb -i 20news-bydate-train-vectors/tfidf-vectors -m model -l labelindex -ow -o test-result

##4. Inspect the result of training

mahout seqdumper -i labelindex 

Input Path: labelindex
Key class: class org.apache.hadoop.io.Text Value Class: class org.apache.hadoop.io.IntWritable
Key: alt.atheism: Value: 0
Key: comp.graphics: Value: 1
Key: comp.os.ms-windows.misc: Value: 2
Key: comp.sys.ibm.pc.hardware: Value: 3
Key: comp.sys.mac.hardware: Value: 4
Key: comp.windows.x: Value: 5
Key: misc.forsale: Value: 6
Key: rec.autos: Value: 7
Key: rec.motorcycles: Value: 8
Key: rec.sport.baseball: Value: 9
Key: rec.sport.hockey: Value: 10
Key: sci.crypt: Value: 11
Key: sci.electronics: Value: 12
Key: sci.med: Value: 13
Key: sci.space: Value: 14
Key: soc.religion.christian: Value: 15
Key: talk.politics.guns: Value: 16
Key: talk.politics.mideast: Value: 17
Key: talk.politics.misc: Value: 18
Key: talk.religion.misc: Value: 19
Count: 20

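Building a Spark Cluster with Docker

Posted on 2013-10-27 | Category: Spark

Prerequisite: Docker is installed; see my other post, "Installing Docker".

There are two approaches:

The scripts in the docker folder of the official Spark repo. These are a thin wrapper that exposes as much of the necessary information as possible.
The small standalone project open-sourced by AMPLab for building a Spark cluster. Its scripts wrap far more: they bundle a DNS server and Hadoop and are highly automated, at the cost of hiding much of the detail.

1. The first approach

git clone the source

First, download the code from the official repo:

git clone git@github.com:apache/incubator-spark.git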

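(Optional) Change the apt sources

Within mainland China, switching apt to a domestic mirror, such as the 163 mirror, is much faster. In base/Dockerfile, replace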

RUN echo "deb http://archive.ubuntu.com/ubuntu precise main universe" > /etc/apt/sources.list

with

RUN echo "deb http://mirrors.163.com/ubuntu/ precise main restricted universe multiverse" > /etc/apt/sources.list
RUN echo "deb http://mirrors.163.com/ubuntu/ precise-security main restricted universe multiverse" >> /etc/apt/sources.list
RUN echo "deb http://mirrors.163.com/ubuntu/ precise-updates main restricted universe multiverse" >> /etc/apt/sources.list
RUN echo "deb http://mirrors.163.com/ubuntu/ precise-proposed main restricted universe multiverse" >> /etc/apt/sources.list
RUN echo "deb http://mirrors.163.com/ubuntu/ precise-backports main restricted universe multiverse" >> /etc/apt/sources.list
RUN echo "deb-src http://mirrors.163.com/ubuntu/ precise main restricted universe multiverse" >> /etc/apt/sources.list
RUN echo "deb-src http://mirrors.163.com/ubuntu/ precise-security main restricted universe multiverse" >> /etc/apt/sources.list
RUN echo "deb-src http://mirrors.163.com/ubuntu/ precise-updates main restricted universe multiverse" >> /etc/apt/sources.list
RUN echo "deb-src http://mirrors.163.com/ubuntu/ precise-proposed main restricted universe multiverse" >> /etc/apt/sources.list
RUN echo "deb-src http://mirrors.163.com/ubuntu/ precise-backports main restricted universe multiverse" >> /etc/apt/sources.list

Build the images

Add sudo in front of the docker commands in build and spark-test/build, then run the build script in the docker directory:

cd docker
./build

Start the master

sudo docker run -v $SPARK_HOME:/opt/spark spark-test-master

Start a worker

Open a new terminal window (tmux is strongly recommended) and start a worker:

sudo docker run -v $SPARK_HOME:/opt/spark spark-test-worker <master_ip>

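The worker's registration shows up in the master's terminal window.

You can open more terminal windows to start more workers.

2. The second approach

Upgrade wget

If wget does not recognize the --no-proxy option, you need to upgrade wget.

Download the images

To make the first run of the script faster, pull all the images manually. AMPLab has an official account on index.docker.io; pull all of its Spark-related repos: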

sudo docker pull amplab/apache-hadoop-hdfs-precise
sudo docker pull amplab/dnsmasq-precise
sudo docker pull amplab/spark-worker
sudo docker pull amplab/spark-master
sudo docker pull amplab/spark-shell

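git clone the scripts

git clone git@github.com:amplab/docker-scripts.git

These scripts bring up the whole cluster with a single command, which is great!

Start the Spark cluster with one command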

sudo ./deploy/deploy.sh -i amplab/spark:0.8.0 -w 3

Start the Spark shell

Start an interactive shell; the IP is the master IP printed by the previous step:

sudo docker run -i -t -dns 172.17.0.90 amplab/spark-shell:0.8.0

Run a simple example

scala> val textFile = sc.textFile("hdfs://master:9000/user/hdfs/test.txt")
scala> textFile.count()
scala> textFile.map({line => line}).collect()

Shut down the cluster

$ sudo ./deploy/kill_all.sh spark
$ sudo ./deploy/kill_all.sh nameserver

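Docker Quick Start

Posted on 2013-10-26 | Category: Docker

Prerequisite: Docker must be installed; see my other post, "Installing Docker".

Interactive command-line tutorial

First, I strongly recommend working through the official interactive tutorial, "Interactive commandline tutorial", even several times, until it sinks in.

After my first pass through the tutorial, Docker felt a lot like git to use.

When you are done, retype the same commands on your own real machine to get a feel for them.

Hello World

See the official "Hello World" documentation.

First pull the official ubuntu image: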

sudo docker pull ubuntu

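Then run hello world:

sudo docker run ubuntu /bin/echo hello world

Three modes of running commands

Docker has three ways of running a command: short-lived, interactive, and daemon.

Short-lived mode is what "hello world" just used: once the command finishes, the container stops, but it does not disappear. Run sudo docker ps -a to list all containers; the first one is the container that just ran, and it can be run again: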

sudo docker start container_id

This time "hello world" is not printed, only the ID; use the logs command to see the output:

sudo docker logs container_id

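You will see "hello world" twice, because the container has run twice.

Interactive mode:

sudo docker run -i -t image_name /bin/bash

Daemon mode runs the software as a long-lived service. This is SaaS!

For example, a script that prints in an infinite loop (swap in memcached, apache, and so on; the procedure stays the same):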

CONTAINER_ID=$(sudo docker run -d ubuntu /bin/sh -c "while true; do echo hello world; sleep 1; done")

View its output from outside the container:

sudo docker logs $CONTAINER_ID

Or attach to the container to watch it live:

sudo docker attach $CONTAINER_ID

Stop the container:

sudo docker stop $CONTAINER_ID

Check with sudo docker ps; the container is gone.

The docker ps command in detail

sudo docker ps lists all currently running containers.

sudo docker ps -l lists only the most recently created container, whether or not it is still running.

sudo docker ps -a lists all containers.

For other usage, see sudo docker ps -h.

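Do not confuse daemon mode with setting USER to daemon in a Dockerfile (see Dockerfile tutorial Level2): USER only selects the user that the container's processes run as, here the system user named daemon.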

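Adding an HTTP proxy

In mainland China, pull and push often cannot reach docker.com; the same happens inside a company where all traffic goes through a single proxy. In either case you can give the docker daemon a proxy when it starts, for example: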

sudo HTTP_PROXY=proxy_server:port docker -d &

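Docker does not seem to honor the http_proxy, https_proxy, and no_proxy environment variables, so the proxy has to be given on the command line; see GitHub Issue #402, "Using Docker behind a firewall".

If you specify HTTP_PROXY on the command line, unset the http_proxy and https_proxy environment variables. The reason is:

First, the docker daemon talks to docker.com over HTTP.
Second, the docker commands (run, login, and so on) also talk to the docker daemon over HTTP: they send JSON strings and the daemon replies with JSON strings. Sometimes the docker client does seem to honor http_proxy. A command then travels localhost -> http_proxy -> daemon, and the reply travels daemon -> proxy -> localhost. The proxy -> localhost leg fails, because the proxy cannot reach an internal IP.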
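Unsetting the variables for a single command, rather than for the whole session, can be sketched as follows. The helper name run_without_proxy and the proxy address are hypothetical, chosen only for the demonstration.

```shell
#!/bin/sh
# Sketch: run one command with the lowercase proxy variables removed, without
# touching the caller's environment. run_without_proxy is a hypothetical name.
run_without_proxy() (
  # The function body is a subshell, so the unset does not leak to the caller.
  unset http_proxy https_proxy
  "$@"
)

# Demonstration with a placeholder proxy address.
http_proxy=http://proxy.example:3128
export http_proxy
run_without_proxy sh -c 'echo "inside: ${http_proxy:-<unset>}"'
echo "outside: $http_proxy"
```

A docker client command could be wrapped the same way, for example `run_without_proxy docker ps`, while the rest of the session keeps its proxy settings.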

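I cover this step early because skipping it makes many later commands fail in baffling ways. I fell into this trap myself and spent a long time before working out that the real cause was an unreliable network connection.

Getting familiar with the Dockerfile

After playing through the interactive tutorial a few times, you will wonder how to build a customized image of your own, for example one with your usual software installed and packaged up. This is where the Dockerfile comes in. A Dockerfile is essentially a script that automates the creation of an image.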
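To give a feel for what such a script looks like, here is a minimal sketch that writes a three-instruction Dockerfile into a scratch directory. The choice of memcached, the image tag my/memcached, and the variable build_dir are assumptions for illustration; none of them come from the original post, and the script only prints the build command instead of running it.

```shell
#!/bin/sh
# Sketch: write a minimal Dockerfile into a scratch directory.
# Assumptions: memcached and the tag "my/memcached" are examples only.
build_dir=$(mktemp -d)
trap 'rm -rf "$build_dir"' EXIT

cat > "$build_dir/Dockerfile" <<'EOF'
# Start from the official ubuntu image
FROM ubuntu
# Install the software that should be baked into the image
RUN apt-get update && apt-get install -y memcached
# Default command when a container is started from this image
CMD ["memcached", "-u", "daemon"]
EOF

cat "$build_dir/Dockerfile"
# The image would then be built with something like:
echo "sudo docker build -t my/memcached $build_dir"
```

Each instruction produces a new layer on top of the previous one, so the file reads top to bottom like the commands you would otherwise type by hand inside a container.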


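Installing Docker

Posted on 2013-10-25 | Category: Docker

1 Installing Docker on CentOS 6.4

Docker currently officially supports only Ubuntu, so installing it on CentOS is somewhat troublesome (Issue #172).

The official Docker docs require Linux kernel 3.8 or later, and CentOS 6.4 ships a 2.6 kernel. So I laboriously compiled and installed the latest kernel, 3.11.6, but Docker still failed after the reboot. The cause turned out to be that I had forgotten to build in the aufs module. aufs has to be compiled together with the kernel, which is a hassle.

It need not be that hard, though: someone has already built a kernel with the aufs module; see "Installing docker.io on centos 6.4 (64-bit)".

1.1 Disable SELinux, because it interferes with LXC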

sudo vim /etc/selinux/config 
SELINUX=disabled
SELINUXTYPE=targeted

1.2 Install Fedora EPEL

sudo yum install http://ftp.riken.jp/Linux/fedora/epel/6/x86_64/epel-release-6-8.noarch.rpm

1.3 Add the hop5 repo

cd /etc/yum.repos.d
sudo wget http://www.hop5.in/yum/el6/hop5.repo

1.4 Install docker-io

sudo yum install docker-io

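This automatically installs a 3.10 kernel with the aufs module, along with the docker-io package.

1.5 Add the cgroup filesystem to `/etc/fstab`; Docker only works properly with this in place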

echo "none                    /sys/fs/cgroup          cgroup  defaults        0 0" | sudo tee -a /etc/fstab
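Running this step twice would add the entry twice. A minimal sketch of an append that is safe to repeat is shown below; append_once is a hypothetical helper, and the demonstration writes to a scratch file rather than to the real /etc/fstab.

```shell
#!/bin/sh
# Sketch: append a line to a file only if that exact line is not already there.
# append_once is a hypothetical helper; the demo targets a scratch file.
append_once() {
  file=$1
  line=$2
  # -x: match whole lines, -F: fixed string (no regex), -q: quiet
  grep -qxF -- "$line" "$file" 2>/dev/null || printf '%s\n' "$line" >> "$file"
}

fstab=$(mktemp)
trap 'rm -f "$fstab"' EXIT
entry="none                    /sys/fs/cgroup          cgroup  defaults        0 0"

append_once "$fstab" "$entry"
append_once "$fstab" "$entry"   # the second call changes nothing
cat "$fstab"
```

For the real /etc/fstab the write needs root. A redirection such as `>>` is performed by the calling shell, not by the command that sudo runs, which is why piping into `sudo tee -a /etc/fstab` is the usual form.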

1.6 Change the GRUB boot order

sudo vim /etc/grub.conf
default=0

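Set default to the position of the newly installed kernel, usually 0.

1.7 Reboot

sudo reboot

1.8 Check that the new kernel booted

After rebooting, check whether the new kernel came up: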

uname -r
3.10.5-3.el6.x86_64

This shows that it worked.

Check whether aufs is present:

grep aufs /proc/filesystems 
nodev   aufs

This shows that it is present.
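The two checks can be combined into one small script. This is a sketch under stated assumptions: kernel_at_least is a hypothetical helper that compares only the major and minor numbers of the release string, which is enough for the 3.8 requirement mentioned above.

```shell
#!/bin/sh
# Sketch: check a kernel release string against a minimum major.minor version.
# kernel_at_least is a hypothetical helper name.
kernel_at_least() {
  release=${1%%-*}        # "3.10.5-3.el6.x86_64" -> "3.10.5"
  major=${release%%.*}    # "3"
  rest=${release#*.}      # "10.5"
  minor=${rest%%.*}       # "10"
  [ "$major" -gt "$2" ] || { [ "$major" -eq "$2" ] && [ "$minor" -ge "$3" ]; }
}

if kernel_at_least "$(uname -r)" 3 8; then
  echo "kernel $(uname -r) is 3.8 or newer"
else
  echo "kernel $(uname -r) is older than 3.8"
fi

# /proc/filesystems lists the filesystems the running kernel supports.
if grep -qw aufs /proc/filesystems 2>/dev/null; then
  echo "aufs is available"
else
  echo "aufs is not available"
fi
```

Comparing the numbers separately matters here: a plain string comparison would rank 3.10 below 3.8.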

1.9 Start the docker daemon

sudo docker -d &

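If you are at a company where all Internet access goes through a proxy, you can tell docker about the proxy server with the following command (see the linked reference):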

sudo HTTP_PROXY=http://xxx:port docker -d &

1.10 Download the ubuntu image

sudo docker pull ubuntu

1.11 Run hello world


Installing a Spark 0.8 Cluster (on CentOS)

Posted on 2013-10-17 | Category: Spark

Environment: CentOS 6.4, Hadoop 1.1.2, JDK 1.7, Spark 0.8.0, Scala 2.9.3

For installing Spark 0.7.2, see my earlier post, "Installing a Spark Cluster (on CentOS)".

Installing Spark is simple. In one sentence: download, extract, and copy to every machine. That's it; no configuration is needed.

#1. Install JDK 1.7
yum search openjdk-devel
sudo yum install java-1.7.0-openjdk-devel.x86_64
/usr/sbin/alternatives --config java
/usr/sbin/alternatives --config javac
sudo vim /etc/profile

# add the following lines at the end
export JAVA_HOME=/usr/lib/jvm/java-1.7.0-openjdk-1.7.0.19.x86_64
export JRE_HOME=$JAVA_HOME/jre
export PATH=$PATH:$JAVA_HOME/bin
export CLASSPATH=.:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar
# save and exit vim
# make the bash profile take effect immediately
$ source /etc/profile
# test
$ java -version

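See my other post, "Detailed Steps for Installing and Configuring a CentOS Server".

#2. Install Scala 2.9.3
Spark 0.8.0 depends on Scala 2.9.3, so we must install Scala 2.9.3.

Download scala-2.9.3.tgz and save it to your home directory.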

$ tar -zxf scala-2.9.3.tgz
$ sudo mv scala-2.9.3 /usr/lib
$ sudo vim /etc/profile
# add the following lines at the end
export SCALA_HOME=/usr/lib/scala-2.9.3
export PATH=$PATH:$SCALA_HOME/bin
# save and exit vim
#make the bash profile take effect immediately
source /etc/profile
# test
$ scala -version

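#3. Download the prebuilt Spark
Download the prebuilt Spark package, spark-0.8.0-incubating-bin-hadoop1.tgz.

If you want to build from scratch, download the source package instead, but I do not recommend it: one of the Maven repositories, twitter4j.org, is blocked by the firewall, so the build has to go through a circumvention proxy, which is a real hassle. If you have the DIY spirit and can get past the firewall, give it a try.

#4. Local mode

##4.1 Extract

$ tar -zxf spark-0.8.0-incubating-bin-hadoop1.tgz

##4.2 (Optional) Set the SPARK_HOME environment variable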

$ vim ~/.bash_profile
# add the following lines at the end
export SPARK_HOME=$HOME/spark-0.8.0
# save and exit vim
#make the bash profile take effect immediately
$ source ~/.bash_profile

##4.3 Now you can run SparkPi

$ cd $SPARK_HOME
$ ./run-example org.apache.spark.examples.SparkPi local

#5. Cluster mode


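Concise Scala

Posted on 2013-08-29 | Category: Language

Scala puts great emphasis on consistency, and its concision is a product of that consistency.

Scala looks complicated, but it is conceptually very consistent. Once a few concepts are clear, it no longer feels complicated; if anything, it is simpler than Java.

1. OO + FP

1.1 Everything is an object

More precisely, "every value is an object".

Primitive types such as integers and floating-point numbers are objects: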
```
123.toByte
3.14.toInt
```
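In Java, primitive types are not objects, which breaks consistency.

Functions are objects:

val compare = (x: Int, y: Int) => x > y
compare(1, 2)

There are no more static methods or static fields. Java's static methods and static fields break object orientation somewhat, because they belong to a class rather than to an instance. In Scala, what would be static methods and fields also belong to an object, specifically a singleton object. This unifies static members with ordinary members: both are attached to some instance.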
```
object Dog {
  val sound = "wang wang" //static field
}
```

1.2 Functions are values

Functions are first-class citizens, no different from ordinary values.

They can be passed as arguments:

val compare = (x: Int, y: Int) => x > y
list sortWith compare

whether the function is a method of an instance

class AComparator {
  def compare(x: Int, y: Int) = x > y
}
list sortWith (new AComparator).compare

or an object that extends Function2 directly

object anonymous extends scala.Function2[Int, Int, Boolean] {
  override def apply(x: Int, y: Int) = x > y
}
list sortWith anonymous

1.3 Every operation is a function call


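Installing a Spark Cluster (on CentOS)

Posted on 2013-06-17 | Category: Spark

Environment: CentOS 6.4, Hadoop 1.1.2, JDK 1.7, Spark 0.7.2, Scala 2.9.3

After several days of tinkering I finally got a Spark cluster installed. It is actually much simpler than Hadoop, but most of the blog posts I found online still described versions that depended on Mesos, so I took quite a few detours.

#1. Install JDK 1.7
yum search openjdk-devel
sudo yum install java-1.7.0-openjdk-devel.x86_64
/usr/sbin/alternatives --config java
/usr/sbin/alternatives --config javac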
sudo vim /etc/profile

# add the following lines at the end
export JAVA_HOME=/usr/lib/jvm/java-1.7.0-openjdk-1.7.0.19.x86_64
export JRE_HOME=$JAVA_HOME/jre
export PATH=$PATH:$JAVA_HOME/bin
export CLASSPATH=.:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar
# save and exit vim
# make the bash profile take effect immediately
$ source /etc/profile
# test
$ java -version

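See my other post, "Detailed Steps for Installing and Configuring a CentOS Server".

#2. Install Scala 2.9.3
Spark 0.7.2 depends on Scala 2.9.3, so we must install Scala 2.9.3.

Download scala-2.9.3.tgz and save it to your home directory.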

$ tar -zxf scala-2.9.3.tgz
$ sudo mv scala-2.9.3 /usr/lib
$ sudo vim /etc/profile
# add the following lines at the end
export SCALA_HOME=/usr/lib/scala-2.9.3
export PATH=$PATH:$SCALA_HOME/bin
# save and exit vim
#make the bash profile take effect immediately
source /etc/profile
# test
$ scala -version

#3. Download the prebuilt Spark
Download the prebuilt Spark package, spark-0.7.2-prebuilt-hadoop1.tgz.

#4. Local mode

##4.1 Extract

$ tar -zxf spark-0.7.2-prebuilt-hadoop1.tgz

##4.2 Set the SPARK_EXAMPLES_JAR environment variable

$ vim ~/.bash_profile
# add the following lines at the end
export SPARK_EXAMPLES_JAR=$HOME/spark-0.7.2/examples/target/scala-2.9.3/spark-examples_2.9.3-0.7.2.jar
# save and exit vim
#make the bash profile take effect immediately
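$ source ~/.bash_profile

This step is actually the most crucial one, and unfortunately neither the official docs nor the blogs online mention it. I only added it after stumbling on two threads, "Running SparkPi" and "Null pointer exception when running ./run spark.examples.SparkPi local"; before that, SparkPi simply would not run.

##4.3 (Optional) Set the SPARK_HOME environment variable and add SPARK_HOME/bin to PATH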
$ vim ~/.bash_profile

# add the following lines at the end
export SPARK_HOME=$HOME/spark-0.7.2
export PATH=$PATH:$SPARK_HOME/bin
# save and exit vim
#make the bash profile take effect immediately
$ source ~/.bash_profile

##4.4 Now you can run SparkPi

$ cd ~/spark-0.7.2
$ ./run spark.examples.SparkPi local

#5. Cluster mode


Installing Spark on CentOS

Posted on 2013-06-14 | Category: Spark

Environment: CentOS 6.4, Hadoop 1.1.2, JDK 1.7, Spark 0.7.2, Scala 2.9.3

After a few days of hacking, I have found that installing a Spark cluster is extremely easy :)

#1. Install JDK 1.7
yum search openjdk-devel
sudo yum install java-1.7.0-openjdk-devel.x86_64
/usr/sbin/alternatives --config java
/usr/sbin/alternatives --config javac
sudo vim /etc/profile

# add the following lines at the end
export JAVA_HOME=/usr/lib/jvm/java-1.7.0-openjdk-1.7.0.19.x86_64
export JRE_HOME=$JAVA_HOME/jre
export PATH=$PATH:$JAVA_HOME/bin
export CLASSPATH=.:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar
# save and exit vim
# make the bash profile take effect immediately
$ source /etc/profile
# test
$ java -version

#2. Install Scala 2.9.3
Spark 0.7.2 depends on Scala 2.9.3, so we must install Scala 2.9.3.

Download scala-2.9.3.tgz and save it to home directory.

$ tar -zxf scala-2.9.3.tgz
$ sudo mv scala-2.9.3 /usr/lib
$ sudo vim /etc/profile
# add the following lines at the end
export SCALA_HOME=/usr/lib/scala-2.9.3
export PATH=$PATH:$SCALA_HOME/bin
# save and exit vim
# make the bash profile take effect immediately
source /etc/profile
# test
$ scala -version

#3. Download prebuilt packages
Download prebuilt packages, spark-0.7.2-prebuilt-hadoop1.tgz.

If you want to compile it from scratch, download the source package instead, but I don't recommend it: in mainland China the GFW blocks one of the Maven repositories, twitter4j.org, which makes compilation impossible unless you can get around the GFW.

#4. Local Mode

##4.1 Untar the tarball

$ tar -zxf spark-0.7.2-prebuilt-hadoop1.tgz

##4.2 Set the SPARK_EXAMPLES_JAR environment variable
$ vim ~/.bash_profile

# add the following lines at the end
export SPARK_EXAMPLES_JAR=$HOME/spark-0.7.2/examples/target/scala-2.9.3/spark-examples_2.9.3-0.7.2.jar
# save and exit vim
# make the bash profile take effect immediately
$ source ~/.bash_profile

This is the most important step, but unfortunately neither the official docs nor most blog posts mention it. I found it when I bumped into these two threads: "Running SparkPi" and "Null pointer exception when running ./run spark.examples.SparkPi local".

##4.3 (Optional) Set SPARK_HOME and add SPARK_HOME/bin to PATH

$ vim ~/.bash_profile
# add the following lines at the end
export SPARK_HOME=$HOME/spark-0.7.2
export PATH=$PATH:$SPARK_HOME/bin
# save and exit vim
# make the bash profile take effect immediately
$ source ~/.bash_profile

##4.4 Now you can run SparkPi.

$ cd ~/spark-0.7.2
$ ./run spark.examples.SparkPi local

#5. Cluster Mode

