一聚教程网:一个值得你收藏的教程网站

热门教程

C#.Net基于正则表达式抓取百度百家文章列表的方法示例

时间:2022-06-25 07:48:45 编辑:袖梨 来源:一聚教程网

工作之余,学习了一下正则表达式,鉴于实践是检验真理的唯一标准,于是便写了一个利用正则表达式抓取百度百家文章的例子,具体过程请看下面源码:

一、获取百度百家网页内容

publicList GetUrl()
{
  try
  {
    stringurl ="http://baijia.baidu.com/";
    WebRequest webRequest = WebRequest.Create(url);
    WebResponse webResponse = webRequest.GetResponse();
    StreamReader reader =newStreamReader(webResponse.GetResponseStream());
    stringresult = reader.ReadToEnd();
    reader.Close();
    webResponse.Close();
    returnAnalysisHtml(result);
  }
  catch(Exception ex)
  {
    throwex;
  }
}

二、通过正则表达式筛选

publicList AnalysisHtml(stringhtmlContent)
{
  List list =newList();
  stringstrPattern ="

(?[^<]+)</a></h3>.*s*<ps*class="feeds-item-text">(?<Abstract>[^<]+)<as*href="(?<Url>.*)"s*target="_blank"s*class="feeds-item-more"s*mon=".*s*">.*s*</a></p>"; Regex regex =newRegex(strPattern, RegexOptions.IgnoreCase | RegexOptions.Multiline | RegexOptions.CultureInvariant); if(regex.IsMatch(htmlContent)) { MatchCollection matchCollection = regex.Matches(htmlContent); foreach(Match matchinmatchCollection) { string[] str =newstring[3]; str[0] = match.Groups[1].Value;//获取到的是列表数据的标题 str[1] = match.Groups[2].Value;//获取到的是内容 str[2] = match.Groups[3].Value;//获取到的是链接到的地址 list.Add(str); } } returnlist; }</pre> </div> </div> <div class="articles"> <div class="tit02"> <h4>相关文章</h4> </div> <ul> <li> <a target="_blank" href="/new/424364.htm">蚂蚁新村小课堂今日答案12月12日 追得百花鲜迎来生活甜形容的是以下哪种职业的特点</a> <span>12-23</span> </li> <li> <a target="_blank" href="/new/424365.htm">蚂蚁庄园今日答案最新2024.12.13</a> <span>12-23</span> </li> <li> <a target="_blank" href="/new/424363.htm">无限暖暖奇迹之冠第二期巅峰赛怎么玩</a> <span>12-23</span> </li> <li> <a target="_blank" href="/new/424362.htm">光遇12.23每日任务怎么做</a> <span>12-23</span> </li> <li> <a target="_blank" href="/new/424361.htm">王者荣耀QQ飞车联动怎么样</a> <span>12-23</span> </li> <li> <a target="_blank" href="/new/424360.htm">原神5.3版本游迹怎么样 原神5.3版本绘想游迹新增活动介绍</a> <span>12-23</span> </li> </ul> </div> </div> <div class="pages art-detail"> </div> </div> </div> </div> </div> <div class="hot-column"> <div class="cont"> <div class="tit"> <h4>热门栏目</h4> </div> <ul class="clearfix"> <li> <h6><a href="/list-1/" target="_blank">php教程</a></h6> <a href="/list-45/" target="_blank">php入门</a> <a href="/list-46/" target="_blank">php安全</a> <a href="/list-47/" target="_blank">php安装</a> <a href="/list-48/" target="_blank">php常用代码</a> <a href="/list-49/" target="_blank">php高级应用</a> </li> <li> <h6><a href="/list-2/" target="_blank">asp.net教程</a></h6> <a href="/list-78/" target="_blank">基础入门</a> <a href="/list-79/" target="_blank">.Net开发</a> <a href="/list-80/" target="_blank">C语言</a> <a href="/list-81/" target="_blank">VB.Net语言</a> <a href="/list-82/" target="_blank">WebService</a> </li> <li> <h6><a href="/list-6/" target="_blank">手机开发</a></h6> <a href="/list-208/" target="_blank">安卓教程</a> <a href="/list-209/" target="_blank">ios7教程</a> <a href="/list-210/" target="_blank">Windows Phone</a> <a href="/list-211/" target="_blank">Windows Mobile</a> <a href="/list-212/" target="_blank">手机常见问题</a> </li> <li> <h6><a href="/list-3/" target="_blank">css教程</a></h6> <a href="/list-99/" target="_blank">CSS入门</a> <a href="/list-100/" target="_blank">常用代码</a> <a href="/list-101/" target="_blank">经典案例</a> <a href="/list-102/" target="_blank">样式布局</a> <a href="/list-103/" target="_blank">高级应用</a> </li> <li> <h6><a href="/list-4/" target="_blank">网页制作</a></h6> <a href="/list-136/" target="_blank">设计基础</a> <a href="/list-137/" target="_blank">Dreamweaver</a> <a href="/list-138/" target="_blank">Frontpage</a> <a href="/list-139/" target="_blank">js教程</a> <a href="/list-140/" target="_blank">XNL/XSLT</a> </li> <li> <h6><a href="/list-7/" target="_blank">办公数码</a></h6> <a href="/list-236/" target="_blank">word</a> <a href="/list-237/" target="_blank">excel</a> <a href="/list-238/" target="_blank">powerpoint</a> <a href="/list-239/" target="_blank">金山WPS</a> <a href="/list-240/" target="_blank">电脑新手</a> </li> <li> <h6><a href="/list-11/" target="_blank">jsp教程</a></h6> <a href="/list-68/" target="_blank">Application与Applet</a> <a href="/list-69/" target="_blank">J2EE/EJB/服务器</a> <a href="/list-70/" target="_blank">J2ME开发</a> <a href="/list-71/" target="_blank">Java基础</a> <a href="/list-72/" target="_blank">Java技巧及代码</a> </li> </ul> </div> </div> <div class="footer"> <div class="cont"> <p> <a href="/" target="_self">一聚教程网</a>| <a href="javascript:;" class="about" target="_self">关于我们</a>| <a href="javascript:;" class="contact" target="_self">联系我们</a>| <a href="javascript:;" class="gg_contact" target="_self">广告合作</a>| <a href="javascript:;" class="friend_link" target="_self">友情链接</a>| <a href="javascript:;" class="copyright_notice" target="_self">版权声明</a> </p> <p> <span>copyRight@2007-2022 www.111CN.NET AII Right Reserved <a href="https://beian.miit.gov.cn/" target="_blank" class="beian"></a></span> </p> <p> <span> 网站内容来自网络整理或网友投稿如有侵权行为请邮件:111cn.com@163.com 我们24小时内处理 </span> </p> </div> </div> <script> var advData = {"img_fixed_pc_adv":"https:\/\/img.111cn.net\/uploads\/20240509\/663c2e9729f58.jpg","img_fixed_mob_adv":"https:\/\/img.111cn.net\/uploads\/20240509\/663c2e8793225.jpg","url_adv":"http:\/\/shop.hushen.cn\/shop\/c\/baojianpin.html","str_adv":"\u864e\u795e\u5546\u57ce\uff1a\u5173\u7231\u7537\u6027\uff0c\u66f4\u61c2\u7537\u4eba\u3002\u89e3\u51b3\u5927\u4f17\u7684\u7537\u8a00\u4e4b\u9690","img_popup_adv":"https:\/\/img.111cn.net\/uploads\/20240509\/663c2e748238d.png","pc_show_img":"2","pc_show_popup":"2","pc_show_video":"2","mob_show_img":"2","mob_show_popup":"2","mob_show_video":"2","close_adv":"https:\/\/img.111cn.net\/uploads\/20240508\/663b20650801e.png","video_adv":"\/pc\/images\/pc-adv.mp4"}; </script> <script src="/jspc/func.js" type="text/javascript"></script> <!-- Google tag (gtag.js) --> <script async src="https://www.googletagmanager.com/gtag/js?id=G-DSRRGRV1TL"></script> <script> window.dataLayer = window.dataLayer || []; function gtag(){dataLayer.push(arguments);} gtag('js', new Date()); gtag('config', 'G-DSRRGRV1TL'); </script> <script src="/js/stat.js"></script> </body> </html>