Fixing Garbled Chinese Text: Fetching the Source of Any Web Page


When fetching a web page's source code in C#, we often run into garbled Chinese characters:

         

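// Requires: using System; using System.IO; using System.Net;
WebRequest request = WebRequest.Create(textBox2.Text);
string tempCode = null;
try
{
    using (WebResponse response = request.GetResponse())
    using (Stream resStream = response.GetResponseStream())
    // Encoding.Default is used here, but garbled text still appears
    // whenever the page is not in the system's default encoding!
    using (StreamReader sr = new StreamReader(resStream, System.Text.Encoding.Default))
    {
        tempCode = sr.ReadToEnd();
    }
}
catch (Exception exc)
{
    // The request failed, so tempCode stays null. Record why instead of ignoring it.
    System.Diagnostics.Debug.WriteLine(exc.Message);
}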
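
To make the cause concrete, here is a minimal sketch of what goes wrong when the decoder does not match the bytes the server sent. The class name `MojibakeDemo` and the helper `DecodeAs` are hypothetical names used only for illustration, and ISO-8859-1 simply stands in for "some wrong encoding"; neither comes from the code above.

```csharp
using System;
using System.Text;

static class MojibakeDemo
{
    // Hypothetical helper: decodes raw bytes with the encoding of the given name.
    public static string DecodeAs(byte[] bytes, string encodingName)
    {
        return Encoding.GetEncoding(encodingName).GetString(bytes);
    }

    static void Main()
    {
        string original = "中文";

        // Pretend these are the bytes a UTF-8 page sends over the wire.
        byte[] bytes = Encoding.UTF8.GetBytes(original);

        string wrong = DecodeAs(bytes, "iso-8859-1"); // decoder does not match the bytes
        string right = DecodeAs(bytes, "utf-8");      // decoder matches the bytes

        Console.WriteLine(wrong == original); // False
        Console.WriteLine(right == original); // True
    }
}
```

The bytes are identical in both calls; only the decoder changes. That is why picking one fixed encoding, such as `Encoding.Default`, works for some pages and fails for others.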


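An improved version downloads the raw bytes, decodes them as UTF-8 only to find the charset the page declares, and then decodes the same bytes again with that encoding: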


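// Requires: using System.Net; using System.Text;
static string GetHtml(string url, Encoding encoding)
{
    byte[] buf;
    using (WebClient client = new WebClient())
    {
        buf = client.DownloadData(url);
    }

    // The caller knows the encoding: decode directly.
    if (encoding != null) return encoding.GetString(buf);

    // Otherwise decode as UTF-8 first, just to read the charset declaration.
    string html = Encoding.UTF8.GetString(buf);
    encoding = GetEncoding(html);

    // No usable declaration, or the page really is UTF-8: the first decode stands.
    if (encoding == null || encoding.CodePage == Encoding.UTF8.CodePage) return html;

    // Decode the original bytes again with the declared encoding.
    return encoding.GetString(buf);
}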


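// Extracts the page's Encoding from the charset declared in its HTML.
// Returns null when no charset is declared or the name is not supported.
// Note: on .NET Core / .NET 5+, legacy code pages such as gb2312 are only available
// after calling Encoding.RegisterProvider(CodePagesEncodingProvider.Instance).
// Requires: using System; using System.Text; using System.Text.RegularExpressions;
static Encoding GetEncoding(string html)
{
    // Allow optional whitespace and an optional quote, so <meta charset="utf-8"> also matches.
    string pattern = @"(?i)\bcharset\s*=\s*[""']?(?<charset>[-a-zA-Z_0-9]+)";
    string charset = Regex.Match(html, pattern).Groups["charset"].Value;
    if (charset.Length == 0) return null;
    try { return Encoding.GetEncoding(charset); }
    catch (ArgumentException) { return null; }
}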
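
To see what the charset lookup picks up, the following sketch runs a similar pattern against a few hand-written tags. The name `FindCharset` and the sample strings are assumptions made for illustration. The pattern here is a slightly looser variant that also tolerates whitespace and a quote after the `=`, on the assumption that some pages write `<meta charset="utf-8">`.

```csharp
using System;
using System.Text.RegularExpressions;

static class CharsetSniffDemo
{
    // Hypothetical helper: returns the declared charset name, or null if none is found.
    public static string FindCharset(string html)
    {
        Match m = Regex.Match(
            html, @"(?i)\bcharset\s*=\s*[""']?(?<charset>[-a-zA-Z_0-9]+)");
        return m.Success ? m.Groups["charset"].Value : null;
    }

    static void Main()
    {
        // Older style: charset inside the content attribute.
        string html4 = "<meta http-equiv=\"Content-Type\" content=\"text/html; charset=gb2312\">";
        // Newer style: dedicated, quoted charset attribute.
        string html5 = "<meta charset=\"utf-8\">";

        Console.WriteLine(FindCharset(html4));               // gb2312
        Console.WriteLine(FindCharset(html5));               // utf-8
        Console.WriteLine(FindCharset("<p>hi</p>") == null); // True
    }
}
```

This works on text that was decoded with the "wrong" encoding because the charset declaration consists of ASCII characters, which UTF-8 and the common Chinese encodings represent with the same bytes.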


// Usage:
string url = "http://www.1.com";
string tempCode = GetHtml(url, null);  // pass null as the second argument when the encoding is unknown
