Nutch-0.9 Study: Whole-web Crawling (Part 2)

lovejuan1314

Getting related links and dynamic content with Nutch

1. vi conf/crawl-urlfilter.txt

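#+[?*!@=]

# Added: accept links that contain the characters ?, = and &

# accept URLs containing certain characters as probable queries, etc.
+[?=&]

## Crawl the application links; /apps/application.php?id= appears in the pages as a dynamic, relative link
+^http://www\.test01\.com/apps/application\.php\?id=([0-9])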
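
In a rule like the one for application.php, `?` and `.` are regex metacharacters, so they only match literally when escaped with a backslash. The sketch below mimics how such a rule file could be evaluated, assuming the first rule found anywhere in the URL decides the outcome. The helper names `parse_rules` and `accept` are hypothetical and are not part of Nutch; Python's `re` module stands in for Java's regex engine here.

```python
import re

def parse_rules(text):
    """Parse filter lines: '+regex' accepts, '-regex' rejects, '#' starts a comment."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError(f"rule must start with + or -: {line!r}")
        rules.append((sign == "+", re.compile(pattern)))
    return rules

def accept(url, rules):
    """Return the verdict of the first rule whose pattern is found in the URL."""
    for allowed, pattern in rules:
        if pattern.search(url):
            return allowed
    return False  # assumption: a URL that matches no rule is dropped

URL = "http://www.test01.com/apps/application.php?id=7"

# Unescaped: 'php?' means "ph" followed by an optional "p", not a literal "?".
unescaped = parse_rules(r"+^http://www.test01.com/apps/application.php?id=([0-9])")
# Escaped: the dots and the question mark are matched literally.
escaped = parse_rules(r"+^http://www\.test01\.com/apps/application\.php\?id=([0-9])")
# The generic rule accepts any URL containing ?, = or &.
generic = parse_rules("+[?=&]")

print(accept(URL, unescaped))
print(accept(URL, escaped))
print(accept(URL, generic))
```

Under these assumptions the unescaped rule does not match the dynamic URL on its own; the URL would still get through only because the more general `+[?=&]` rule precedes it.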

2. vi conf/regex-urlfilter.txt

## Add the same rules as in step 1

Note: both files need to be modified, because Nutch loads the rules in the order crawl-urlfilter.txt -> regex-urlfilter.txt

3. vi conf/nutch-default.xml or conf/nutch-site.xml

<property>
  <name>urlfilter.order</name>
  <value>org.apache.nutch.urlfilter.regex.RegexURLFilter</value>
  <description>The order by which url filters are applied.
  If empty, all available url filters (as dictated by properties
  plugin-includes and plugin-excludes above) are loaded and applied in system
  defined order. If not empty, only named filters are loaded and applied
  in given order. For example, if this property has value:
  org.apache.nutch.urlfilter.regex.RegexURLFilter org.apache.nutch.urlfilter.prefix.PrefixURLFilter
  then RegexURLFilter is applied first, and PrefixURLFilter second.
  Since all filters are AND'ed, filter ordering does not have impact
  on end result, but it may have performance implication, depending
  on relative expensiveness of filters.
  </description>
</property>
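
The description's point that the filters are AND'ed, so their order changes cost but not the result, can be illustrated with a minimal sketch. The two filter functions below are hypothetical stand-ins, not the real RegexURLFilter or PrefixURLFilter plugins, and the URLs are made up for the example.

```python
import re
from itertools import permutations

def regex_filter(url):
    # Hypothetical stand-in for a regex filter: keep only URLs on one host.
    return url if re.search(r"^http://www\.test01\.com/", url) else None

def prefix_filter(url):
    # Hypothetical stand-in for a prefix filter: keep only URLs under /apps/.
    return url if url.startswith("http://www.test01.com/apps/") else None

def apply_filters(url, filters):
    """Pass the URL through each filter in order; stop at the first rejection."""
    for f in filters:
        url = f(url)
        if url is None:
            return None
    return url

urls = [
    "http://www.test01.com/apps/application.php?id=1",
    "http://www.test01.com/about.html",
    "http://other.example.org/apps/application.php?id=1",
]

# Every ordering of the filters yields the same accept/reject decisions.
for order in permutations([regex_filter, prefix_filter]):
    print([apply_filters(u, order) is not None for u in urls])
```

Because a URL survives only if every filter accepts it, reordering can change which filter rejects it first, but not whether it is rejected.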

4. Modify conf/nutch-default.xml

<property>
  <name>db.max.outlinks.per.page</name>
  <value>-1</value>
  <description>The maximum number of outlinks that we'll process for a page.
  If this value is nonnegative (>=0), at most db.max.outlinks.per.page outlinks
  will be processed for a page; otherwise, all outlinks will be processed.
  </description>
</property>
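
The rule stated in the description can be sketched as follows: a nonnegative value caps the number of outlinks processed per page, and a negative value such as -1 removes the cap. The function `limit_outlinks` is a hypothetical illustration of that rule, not Nutch's own code, and it assumes the first outlinks found are the ones kept.

```python
def limit_outlinks(outlinks, max_per_page):
    """Apply the db.max.outlinks.per.page rule as described in the property."""
    if max_per_page >= 0:
        # Nonnegative: process at most max_per_page outlinks.
        return list(outlinks[:max_per_page])
    # Negative (for example -1): process all outlinks.
    return list(outlinks)

links = [f"http://www.test01.com/apps/application.php?id={i}" for i in range(5)]

print(len(limit_outlinks(links, 2)))
print(len(limit_outlinks(links, 0)))
print(len(limit_outlinks(links, -1)))
```

Setting the value to -1, as in step 4, means pages with many dynamic links are not truncated.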

2009-09-09 19:10