银河里的星星

落在人间

Regular Expressions in a Metasearch Application

2009-03-12 12:13:24 | Category: Technical Topics

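I have recently been using regular expressions to extract results from web pages, and I ran into a few problems along the way. These are my rough notes.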

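1. Subexpressions: parentheses () extract a single field from the matched string

For example, Google search results can be extracted with (?<=<li class=g>).*?<.*?><a.*?href="(.*?)"[^>]*?>(.*?)</a>(.*?)<br>.*?<cite>(.*?)</CITE>.*?(?=</a></span>.*?</div>). Some of the parentheses in it exist to pull out finer-grained fields such as the link, the title, and the summary. Not all of them do: groups that begin with a question mark, such as the lookbehind (?<=<li class=g>), only assert context and capture nothing. Note that the pattern opens the tag as <cite> but closes it as </CITE>, so it matches only if case-insensitive matching is enabled or the page really mixes case that way.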
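As a rough illustration of how the capturing groups map to fields, here is a minimal Python sketch. The markup in `html` is hypothetical and much simpler than a real results page, and the pattern is a shortened form of the one above with the trailing lookahead dropped. The `re.IGNORECASE` flag is an assumption made so that `</CITE>` in the pattern matches the lowercase closing tag.

```python
import re

# Hypothetical, simplified result markup; real result pages differ and change often.
html = (
    '<li class=g><div><a href="http://example.com/a" class=l>First title</a>'
    'First summary<br><cite>example.com/a</cite></div></li>'
    '<li class=g><div><a href="http://example.com/b" class=l>Second title</a>'
    'Second summary<br><cite>example.com/b</cite></div></li>'
)

# The lookbehind (?<=...) only asserts context and captures nothing.
# The four plain (...) groups capture link, title, summary, and cited URL.
pattern = re.compile(
    r'(?<=<li class=g>).*?<.*?><a.*?href="(.*?)"[^>]*?>(.*?)</a>'
    r'(.*?)<br>.*?<cite>(.*?)</CITE>',
    re.IGNORECASE | re.DOTALL,
)

results = [m.groups() for m in pattern.finditer(html)]
for link, title, summary, cited in results:
    print(link, title, summary, cited, sep=" | ")
```

Each match yields one tuple, in the order in which the opening parentheses of the capturing groups appear in the pattern.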

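2. Non-greedy matching

Appending ? to a quantifier, as in ".*?", makes it match as few characters as possible. That is exactly the behavior this kind of page analysis needs.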
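The difference is easy to see on a small input. The snippet below is only a sketch, using a made-up two-link string, that compares the greedy and non-greedy forms of the same pattern in Python.

```python
import re

# Made-up input with two links on one line.
html = '<a href="/one">One</a><a href="/two">Two</a>'

# Greedy: .* runs to the end of the input, then backs off to the LAST quote.
greedy = re.findall(r'href="(.*)"', html)

# Non-greedy: .*? stops at the FIRST quote that lets the match succeed.
lazy = re.findall(r'href="(.*?)"', html)

print(greedy)  # one over-long capture spanning both links
print(lazy)    # one short capture per link
```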

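3. A "memory exhausted" runtime exception during matching

The simple fix is to catch the exception, but by then the problem is already severe enough to have exhausted memory. The underlying cause is that the regular expression cannot fully match its input, so the engine keeps trying alternatives and the match does not terminate until memory runs out. A more robust approach is usually to rewrite the matching step so that it runs in its own thread under a timer: when the time limit is exceeded, treat the match as failed and destroy the thread.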
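The timing idea can be sketched in Python as follows. Python threads cannot be forcibly destroyed, so this sketch substitutes a child process for the thread described above: the match runs in a separate interpreter, and `subprocess.run` kills it if it exceeds the limit. The function name `search_with_timeout`, the JSON hand-off, and the 5 second default are all assumptions made for illustration.

```python
import json
import subprocess
import sys

# Source of the child program: reads [pattern, text] as JSON on stdin and
# writes the match (whole match followed by its groups) or null to stdout.
_CHILD_SOURCE = """
import json, re, sys
pattern, text = json.load(sys.stdin)
m = re.search(pattern, text)
json.dump(None if m is None else [m.group(0), *m.groups()], sys.stdout)
"""


def search_with_timeout(pattern, text, timeout_s=5.0):
    """Run re.search in a child process; raise TimeoutError if it runs too long."""
    try:
        proc = subprocess.run(
            [sys.executable, "-c", _CHILD_SOURCE],
            input=json.dumps([pattern, text]),
            capture_output=True,
            text=True,
            timeout=timeout_s,
            check=True,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run has already killed the child when this is raised.
        raise TimeoutError(f"regex did not finish within {timeout_s}s") from None
    return json.loads(proc.stdout)


if __name__ == "__main__":
    print(search_with_timeout(r"h(.)llo", "say hello"))
    try:
        # Nested quantifiers plus a failing tail force exponential backtracking.
        search_with_timeout(r"(a+)+$", "a" * 40 + "b", timeout_s=1.0)
    except TimeoutError as exc:
        print("gave up:", exc)
```

Starting a process per match is expensive, so in practice one would probably batch many pages per child. The point here is only that a runaway match is cut off by the clock rather than by running out of memory.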

Alternatively, rewriting the pattern into a safer form may prevent the problem from occurring in the first place.
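One common way to make such a pattern safer, offered here as a suggestion rather than as the rewrite used in this project, is to replace an open-ended ".*?" with a negated character class that cannot run past its delimiter. The original pattern already does this once with [^>]*?. A small Python sketch on made-up input:

```python
import re

# Made-up input; attribute order varies between the two links on purpose.
html = '<a class="l" href="/one" title="x">One</a><a href="/two">Two</a>'

# Lazy dot: may extend over any character, including quotes, if the rest fails.
lazy = re.findall(r'href="(.*?)"', html)

# Negated class: can never cross the closing quote, which leaves the engine
# far fewer alternatives to retry when the overall match fails.
bounded = re.findall(r'href="([^"]*)"', html)

print(lazy)
print(bounded)
```

Both forms return the same captures on well-formed input. The bounded form reduces the room for backtracking but does not by itself guarantee termination in reasonable time for every pattern.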

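4. Not matching a given string

Excluding a single character is easy: [^c] does it. Excluding a whole string takes a different construction:

((?!regex).)* matches a string that does not contain "regex". When the result must hold for the whole input, anchor it as ^((?!regex).)*$. To keep the outer parentheses from acting as a capturing subexpression, rewrite it as (?:(?!regex).)*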
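A short Python sketch of this construction follows. The sample strings are made up, and the ^ and $ anchors are added so that the whole string, not just a prefix of it, must be free of "regex".

```python
import re

# Each character is consumed only if "regex" does not start at that position.
no_regex = re.compile(r'^(?:(?!regex).)*$')

lines = ["plain text", "a regex here", "regexp too", "rege x is fine"]
kept = [s for s in lines if no_regex.match(s)]
print(kept)

# The (?:...) form adds no capturing group; the plain (...) form adds one.
capturing_groups = re.compile(r'((?!regex).)*').groups
non_capturing_groups = re.compile(r'(?:(?!regex).)*').groups
print(capturing_groups, non_capturing_groups)
```

Without the anchors the pattern still succeeds on every input, because it can match the empty string or the prefix that precedes "regex".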
