C#: Extracting the Meaningful Content of a Web Page with Regular Expressions

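One of the more important stages in a search engine is extracting the meaningful content from a web page. Put simply, this means stripping the HTML markup from the HTML text and keeping only what we would see when opening the document in a browser such as IE (images are not considered here).
The markup in the HTML text is divided into comments, script, style, and all other tags, and each group is removed in turn: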
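1. Remove comments. The regex is:
// ".*?" with Singleline also matches comments that contain hyphens or line breaks.
output = Regex.Replace(input, @"<!--.*?-->", string.Empty, RegexOptions.Singleline);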
2. Remove script blocks. The regexes are:
output = Regex.Replace(input, @"<script[^>]*?>.*?</script>", string.Empty, RegexOptions.IgnoreCase | RegexOptions.Singleline);
output2 = Regex.Replace(output, @"<noscript[^>]*?>.*?</noscript>", string.Empty, RegexOptions.IgnoreCase | RegexOptions.Singleline);
3. Remove style blocks. The regex is:
output = Regex.Replace(input, @"<style[^>]*?>.*?</style>", string.Empty, RegexOptions.IgnoreCase | RegexOptions.Singleline);
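4. Remove the remaining HTML tags and decode common entities:
// Convert line breaks and strip tags first, so that a decoded "<" or ">" is not mistaken for markup.
result = result.Replace("<br>", "\r\n");
result = Regex.Replace(result, @"<[\s\S]*?>", string.Empty, RegexOptions.IgnoreCase);
result = result.Replace("&nbsp;", " ");
result = result.Replace("&quot;", "\"");
result = result.Replace("&lt;", "<");
result = result.Replace("&gt;", ">");
// Decode "&amp;" last so that text such as "&amp;lt;" is not decoded twice.
result = result.Replace("&amp;", "&");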
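As the code above shows, I pass RegexOptions.Singleline. This option matters: it makes "." (the dot) match newline characters as well. Without it, "." stops at every line break, so a script or style block that spans several lines is never matched, and on most real pages the regexes above would leave that content in place.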
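The effect of RegexOptions.Singleline can be seen in a small sketch. The class name SinglelineDemo and the sample string are made up for illustration; only the script regex comes from the steps above.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SinglelineDemo
{
    // Hypothetical sample: a script block that spans several lines.
    public const string Sample =
        "<p>before</p>\n<script>\nvar x = 1;\n</script>\n<p>after</p>";

    // The script regex from step 2.
    public const string ScriptPattern = @"<script[^>]*?>.*?</script>";

    // Without Singleline, "." cannot cross "\n", so the multi-line block is not matched.
    public static string StripWithoutSingleline(string html)
    {
        return Regex.Replace(html, ScriptPattern, string.Empty, RegexOptions.IgnoreCase);
    }

    // With Singleline, "." also matches "\n", so the whole block is removed.
    public static string StripWithSingleline(string html)
    {
        return Regex.Replace(html, ScriptPattern, string.Empty,
            RegexOptions.IgnoreCase | RegexOptions.Singleline);
    }
}
```

Calling StripWithoutSingleline(SinglelineDemo.Sample) returns the input unchanged, while StripWithSingleline removes the script block and keeps the two paragraphs.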
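HTML has grown into a fairly complex language, and the list above covers only the most important kinds of markup. I will add more tag-stripping regexes as the development of Rost WebSpider progresses.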
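One gap that is easy to close is entity decoding, since a handful of Replace calls covers only a few entities. The sketch below is an assumption on my part and not part of the original steps: it strips tags first and then hands the text to System.Net.WebUtility.HtmlDecode, which decodes named and numeric entities. The class and method names are hypothetical.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

public static class EntityDecodeSketch
{
    public static string StripTagsThenDecode(string html)
    {
        // Turn <br>, <br/> and <br /> into line breaks before the other tags are dropped.
        string result = Regex.Replace(html, @"<br\s*/?>", "\r\n", RegexOptions.IgnoreCase);

        // Remove every remaining tag.
        result = Regex.Replace(result, @"<[^>]*>", string.Empty);

        // Decode entities last, so a decoded "<" or ">" is never treated as markup.
        return WebUtility.HtmlDecode(result);
    }
}
```

One difference to keep in mind: HtmlDecode turns &nbsp; into the non-breaking space character U+00A0 rather than an ordinary space, so a further Replace may be needed if plain spaces are wanted.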
Below is a C# class that extracts the meaningful content from an HTML string:
using System;
using System.Collections.Generic;
using System.Text;
using System.Text.RegularExpressions;

class HtmlExtract
{
    #region private attributes
    private string _strHtml;
    #endregion

    #region public methods
    public HtmlExtract(string inStrHtml)
    {
        _strHtml = inStrHtml;
    }

    // Not an override: there is no base-class ExtractText to override.
    public string ExtractText()
    {
        string result = _strHtml;
        result = RemoveComment(result);
        result = RemoveScript(result);
        result = RemoveStyle(result);
        result = RemoveTags(result);
        return result.Trim();
    }
    #endregion

    #region private methods
    private string RemoveComment(string input)
    {
        string result = input;
        // Remove comments; Singleline lets ".*?" span hyphens and line breaks.
        result = Regex.Replace(result, @"<!--.*?-->", string.Empty, RegexOptions.Singleline);
        return result;
    }

    private string RemoveStyle(string input)
    {
        string result = input;
        // Remove all style blocks together with their content.
        result = Regex.Replace(result, @"<style[^>]*?>.*?</style>", string.Empty, RegexOptions.IgnoreCase | RegexOptions.Singleline);
        return result;
    }

    private string RemoveScript(string input)
    {
        string result = input;
        // Remove script and noscript blocks together with their content.
        result = Regex.Replace(result, @"<script[^>]*?>.*?</script>", string.Empty, RegexOptions.IgnoreCase | RegexOptions.Singleline);
        result = Regex.Replace(result, @"<noscript[^>]*?>.*?</noscript>", string.Empty, RegexOptions.IgnoreCase | RegexOptions.Singleline);
        return result;
    }

    private string RemoveTags(string input)
    {
        string result = input;
        // Convert line breaks and strip tags first, so that a decoded "<" or ">"
        // is not mistaken for markup.
        result = result.Replace("<br>", "\r\n");
        result = Regex.Replace(result, @"<[\s\S]*?>", string.Empty, RegexOptions.IgnoreCase);
        result = result.Replace("&nbsp;", " ");
        result = result.Replace("&quot;", "\"");
        result = result.Replace("&lt;", "<");
        result = result.Replace("&gt;", ">");
        // Decode "&amp;" last so that text such as "&amp;lt;" is not decoded twice.
        result = result.Replace("&amp;", "&");
        return result;
    }
    #endregion
}
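For a crawler that runs the same patterns over many pages, one possible variation is to build the Regex objects once and reuse them. The sketch below is a condensed illustration rather than the class above: the name HtmlTextSketch, the combined pattern with a backreference, and the use of RegexOptions.Compiled are all my own assumptions.

```csharp
using System;
using System.Text.RegularExpressions;

public static class HtmlTextSketch
{
    private const RegexOptions Opts =
        RegexOptions.IgnoreCase | RegexOptions.Singleline | RegexOptions.Compiled;

    // Comments, plus script, noscript and style blocks together with their content.
    // "\1" requires the closing tag to repeat the name captured by the opening tag.
    private static readonly Regex Blocks = new Regex(
        @"<!--.*?-->|<(script|noscript|style)\b[^>]*>.*?</\1\s*>", Opts);

    private static readonly Regex LineBreak = new Regex(@"<br\s*/?>", Opts);
    private static readonly Regex AnyTag = new Regex(@"<[^>]*>", Opts);

    public static string Extract(string html)
    {
        string result = Blocks.Replace(html, string.Empty);
        result = LineBreak.Replace(result, "\r\n");
        result = AnyTag.Replace(result, string.Empty);

        // Decode a few common entities after the tags are gone; "&amp;" goes last.
        result = result.Replace("&nbsp;", " ")
                       .Replace("&quot;", "\"")
                       .Replace("&lt;", "<")
                       .Replace("&gt;", ">")
                       .Replace("&amp;", "&");
        return result.Trim();
    }
}
```

A call such as HtmlTextSketch.Extract(pageSource), where pageSource is the downloaded HTML string, returns the visible text. Whether compiling the patterns pays off depends on how many pages are processed, so it is worth measuring before adopting it.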
