Reading Word document files in C# .NET: .txt, .doc, .docx, .xls, .xlsx

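Solutions currently available

The most popular are NPOI, Microsoft.Office.Interop, and Spire.Doc. If this list is incomplete, corrections are welcome.

Word's doc and docx files use different storage formats, so the ways of parsing them differ accordingly. The main approaches are: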

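1. The DCOM interface of the MS Word application;

2. The DCOM interface of the WPS Word application, or of another office application such as Open Office;

3. The NPOI library;

4. MS Open XML;

5. The Spire.Doc library.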
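
Because doc and docx are stored differently (a binary OLE compound file versus a ZIP package of XML parts), a reader can choose between the approaches above by looking at the first bytes of the file instead of trusting the extension. The following is a minimal sketch of that idea; `WordFormat` and `WordFormatSniffer` are hypothetical names introduced only for illustration. The signatures identify the container, not Word specifically (xls and xlsx share them), so treat the result as a hint.

```csharp
using System;
using System.IO;

public enum WordFormat { Unknown, Doc, Docx }

public static class WordFormatSniffer
{
    // OLE2 compound file signature, the container of binary .doc files.
    private static readonly byte[] OleMagic = { 0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1 };
    // ZIP local file header signature; a .docx is a ZIP package.
    private static readonly byte[] ZipMagic = { 0x50, 0x4B, 0x03, 0x04 };

    public static WordFormat Detect(Stream stream)
    {
        byte[] head = new byte[8];
        int read = 0;
        // Stream.Read may return fewer bytes than requested, so loop until full or EOF.
        while (read < head.Length)
        {
            int n = stream.Read(head, read, head.Length - read);
            if (n == 0) break;
            read += n;
        }
        if (StartsWith(head, read, OleMagic)) return WordFormat.Doc;
        if (StartsWith(head, read, ZipMagic)) return WordFormat.Docx;
        return WordFormat.Unknown;
    }

    public static WordFormat Detect(string path)
    {
        using (FileStream fs = File.OpenRead(path))
            return Detect(fs);
    }

    private static bool StartsWith(byte[] data, int length, byte[] magic)
    {
        if (length < magic.Length) return false;
        for (int i = 0; i < magic.Length; i++)
            if (data[i] != magic[i]) return false;
        return true;
    }
}
```

A caller could, for example, send `WordFormat.Docx` files to NPOI and `WordFormat.Doc` files to the Word DCOM interface or Spire.Doc.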

--------------------------------------

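In practice, MS and other vendors such as Open Office define the Word format (or, more generally, the document format of an office suite's word processor) differently, so compatibility problems arise.

Even among MS docx files, the 2007, 2010, and 2013 versions all use XML-based format definitions but still differ from one another, so compatibility problems exist there as well.

As a result, when most client machines run MS Office, processing Word documents with another application such as Open Office causes many problems.

The applications that handle MS Word documents well are MS Office Word and WPS; the component libraries that do so are Spire.Doc and NPOI (docx only).

The advantage of a component library is that no application has to be installed, so deployment is simple. Calling a managed library from .NET Framework is also preferable to going through the DCOM interface.

Because Spire.Doc is a paid component, NPOI is the recommended way to parse Word files, though only for Word 2007 and later.

The key code is shown below for reference: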

StringBuilder stringBuilder = new StringBuilder();
// Open read-only; the using block closes the stream even if parsing fails
using (FileStream fileStream = new FileStream(fileName, FileMode.Open, FileAccess.Read))
{
    XWPFDocument xwpfDocument = new XWPFDocument(fileStream);
    foreach (XWPFParagraph xwpfParagraph in xwpfDocument.Paragraphs)
        stringBuilder.AppendLine(xwpfParagraph.ParagraphText);
}
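
Approach 4 in the list above (Open XML) relies on the fact that a docx file is a ZIP package whose main text normally lives in the part `word/document.xml`. The following is a minimal sketch of reading that part with only the .NET base class library (`System.IO.Compression` and `System.Xml.Linq`) rather than the Open XML SDK. `DocxPlainText` is a hypothetical helper name, and the sketch assumes the main part is at the conventional path instead of resolving it through the package relationships.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Linq;
using System.Text;
using System.Xml.Linq;

public static class DocxPlainText
{
    // WordprocessingML main namespace (the "w:" prefix in document.xml).
    private static readonly XNamespace W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    public static string Read(Stream docxStream)
    {
        using (var zip = new ZipArchive(docxStream, ZipArchiveMode.Read, leaveOpen: true))
        {
            ZipArchiveEntry entry = zip.GetEntry("word/document.xml");
            if (entry == null)
                throw new InvalidDataException("word/document.xml not found; not a docx package?");

            XDocument xml;
            using (Stream part = entry.Open())
                xml = XDocument.Load(part, LoadOptions.PreserveWhitespace);

            var sb = new StringBuilder();
            // Each w:p element is a paragraph; its visible text is stored in w:t elements.
            foreach (XElement p in xml.Descendants(W + "p"))
                sb.AppendLine(string.Concat(p.Descendants(W + "t").Select(t => t.Value)));
            return sb.ToString();
        }
    }

    public static string Read(string path)
    {
        using (FileStream fs = File.OpenRead(path))
            return Read(fs);
    }
}
```

This sketch ignores tabs, manual line breaks, headers, and footers, and text inside text boxes may appear more than once, so it suits quick text extraction rather than faithful conversion.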

--------------------------------------

I. NPOI, the most popular option

https://github.com/nissl-lab/npoi

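This project is the .NET version of the Java POI project. With NPOI you can read and write Office 2003/2007 files very easily.

Advantages of NPOI

It is completely free to use

It covers most Excel features (cell styles, data formats, formulas, and so on)

Supported formats: xls, xlsx, docx

It is designed around interfaces (see the NPOI.SS namespace)

It supports import as well as export

It runs on Windows and Linux

System requirements

.NET Standard 2.1 (.NET Core 3.x)

.NET Standard 2.0 (.NET Core 2.x)

.NET Framework 4.0 and later

Code for reading .docx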

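using System;
using System.Collections.Generic;
using System.IO;
using NPOI.XWPF.UserModel;

/// <summary>
/// Gets the content of a .docx file, parsed with NPOI.XWPF
/// </summary>
/// <param name="strFilePath">File path</param>
/// <returns>The paragraph text separated by HTML line breaks, or an error message</returns>
public string GetDocxContent(string strFilePath)
{
    string result = "Failed to parse the docx file";

    if (File.Exists(strFilePath))
    {
        // GetType is the author's own encoding-detection helper; it is not shown in this article
        System.Text.Encoding encoding = GetType(strFilePath);
        result = "File encoding: " + encoding.BodyName + "\r";

        try
        {
            // Open the file as a stream; the using block closes it even if parsing throws
            using (FileStream stream = new FileStream(strFilePath, FileMode.Open, FileAccess.Read))
            {
                // Create a Word document object from the file
                XWPFDocument docx = new XWPFDocument(stream);
                // Get all paragraph objects of the Word document
                IList<XWPFParagraph> paragraphs = docx.Paragraphs;
                foreach (var item in paragraphs)
                {
                    // Append each paragraph, followed by an HTML line break and an indent
                    result += item.ParagraphText + "<br>" + "&nbsp;&nbsp;";
                }

                // Get the text content and replace special characters
                //strContent = reader.ReadToEnd().Replace(" ", "&nbsp").Replace("\r\n", "<br>");
            }
        }
        catch (Exception e)
        {
            result = e.Message;
        }
    }

    return result;
}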

II. Microsoft.Office.Interop

NuGet Gallery | Microsoft.Office.Interop.Word 15.0.4797.1003

It is split into separate packages:

Spreadsheets: Microsoft.Office.Interop.Excel

Documents: Microsoft.Office.Interop.Word

Presentations: Microsoft.Office.Interop.PowerPoint


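The package version (15.0.x) shows that it currently targets Office 2013,

so a program that uses it needs Office 2013 installed.

Code for reading .doc

Add a reference to Microsoft.Office.Interop.Word first.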

/// <summary>
/// Gets the content of a .doc file, parsed with Microsoft.Office.Interop.Word
/// </summary>
/// <param name="strFilePath">File path</param>
/// <returns>The document text, or an error message</returns>
public string GetDocContent(string strFilePath)
{
  string result = "Failed to parse the doc file";

  if (System.IO.File.Exists(strFilePath))
  {
    object missing = System.Reflection.Missing.Value;
    object readOnly = true;
    object docxPath = strFilePath;
    Microsoft.Office.Interop.Word.Application wordApp = new Microsoft.Office.Interop.Word.Application();
    Microsoft.Office.Interop.Word.Document wordDoc = null;
    try
    {
      // The third argument opens the file read-only; all other optional arguments keep their defaults
      wordDoc = wordApp.Documents.Open(ref docxPath,
                      ref missing, ref readOnly, ref missing, ref missing, ref missing,
                      ref missing, ref missing, ref missing, ref missing, ref missing,
                      ref missing, ref missing, ref missing, ref missing, ref missing);
      result = wordDoc.Content.Text;
    }
    finally
    {
      // Always close the document and quit Word so no WINWORD process is left behind
      if (wordDoc != null) wordDoc.Close();
      wordApp.Quit();
    }
  }

  return result;
}

III. The Spire.Doc library

https://www.e-iceblue.cn/Introduce/Free-Spire-Doc-NET.html


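Free Spire.Doc for .NET is the free edition of Spire.Doc for .NET. It is a free .NET class library dedicated to working with Word documents, suitable for commercial or personal use. Its main purpose is to help developers create, edit, and convert Microsoft Word documents quickly and easily. As a standalone Word .NET component, Free Spire.Doc for .NET does not require Microsoft Word on the system it runs on (server or client), yet it lets developers integrate Word document operations into any .NET application.

Free Spire.Doc for .NET is a .NET API that can perform many Microsoft Word document processing tasks. It supports Word 97-2003, Word 2007, Word 2010, Word 2013, Word 2016, and Word 2019. It converts in both directions between Word 97/2003/2007/2010/2013/2016/2019 and formats such as XML, RTF, TXT, XPS, EPUB, EMF, and HTML, and it can also convert Word files to PDF with high quality.

Note:

The free edition has size limits. When loading or manipulating a Word document, the document must not exceed 500 paragraphs and 25 tables. When converting a Word document to PDF or XPS, only the first three pages are converted.

Compared with the commercial Spire.Doc, Free Spire.Doc shows no warning messages apart from the size limits, but the vendor maintains the free edition only irregularly.

Sample code

References: I referenced the commercial edition of Spire.Doc; for testing you can reference Free Spire.Doc instead.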


/// <summary>
/// Gets the text content of a Word document, parsed with Spire.Doc
/// </summary>
/// <param name="strFilePath">File path</param>
/// <returns>The paragraph text, or an error message</returns>
public string GetTxtContent2(string strFilePath)
{
    string result = "Failed to parse the file";

    if (File.Exists(strFilePath))
    {
        // Form1.GetType is the author's own encoding-detection helper; it is not shown in this article
        System.Text.Encoding encoding = Form1.GetType(strFilePath);
        result = "File encoding: " + encoding.BodyName + "\r";

        try
        {
            // Load the Word document
            Spire.Doc.Document doc = new Spire.Doc.Document();
            doc.LoadFromFile(strFilePath);

            StringBuilder sb = new StringBuilder();
            // Walk the sections and paragraphs and collect the text of each paragraph
            foreach (Spire.Doc.Section section in doc.Sections)
            {
                foreach (Spire.Doc.Documents.Paragraph paragraph in section.Paragraphs)
                {
                    sb.AppendLine(paragraph.Text);
                }
            }

            // Optionally save as a text file
            //File.WriteAllText("文本2.txt", sb.ToString());

            result = sb.ToString();
        }
        catch (Exception e)
        {
            result = e.Message;
        }
    }
    return result;
}

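I. Reading the content of Doc, Docx, and Pdf documents in C#

This example shows how to read the content of Doc, Docx, and Pdf documents in C#. It is shared here for reference. The details are as follows:

Doc documents: Microsoft Word 14.0 Object Library (a GAC object; Word must be installed before it can be called, and the COM version number depends on the installed Word version)
Docx documents: Microsoft Word 14.0 Object Library (a GAC object; Word must be installed before it can be called, and the COM version number depends on the installed Word version)
Pdf documents: PDFBox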

/*
 Author: GhostBear
 */
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.IO;
using System.Text.RegularExpressions;
using org.pdfbox.pdmodel;
using org.pdfbox.util;
using Microsoft.Office.Interop.Word;
namespace TestPdfReader
{
    class Program
    {
        static void Main(string[] args)
        {
            // PDF: extract the text with PDFBox, then strip tabs, line breaks, and spaces
            PDDocument doc = PDDocument.load(@"C:\resume.pdf");
            PDFTextStripper pdfStripper = new PDFTextStripper();
            string text = pdfStripper.getText(doc);
            doc.close();
            string result = text.Replace('\t', ' ').Replace('\n', ' ').Replace('\r', ' ').Replace(" ", "");
            Console.WriteLine(result);

            // Doc, Docx: both go through the same Word call; pass docxPath instead of docPath to read the .docx
            object docPath = @"C:\resume.doc";
            object docxPath = @"C:\resume.docx";
            object missing = System.Reflection.Missing.Value;
            object readOnly = true;
            Application wordApp = new Application();
            // The third argument opens the file read-only; the others keep their defaults
            Document wordDoc = wordApp.Documents.Open(ref docPath,
                ref missing, ref readOnly, ref missing, ref missing, ref missing,
                ref missing, ref missing, ref missing, ref missing, ref missing,
                ref missing, ref missing, ref missing, ref missing, ref missing);
            string text2 = FilterString(wordDoc.Content.Text);
            wordDoc.Close(ref missing, ref missing, ref missing);
            wordApp.Quit(ref missing, ref missing, ref missing);
            Console.WriteLine(text2);
            Console.Read();
        }

        // Removes BEL characters (Word's table cell markers), tabs, newlines, and all other whitespace
        private static string FilterString(string input)
        {
            return Regex.Replace(input, @"(\a|\t|\n|\s+)", "");
        }
    }
}
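
FilterString above removes every whitespace character, which suits Chinese text but also glues English words together. The text returned by Word's `Content.Text` uses `\r` as the paragraph mark and `\a` (BEL) as the end-of-cell marker in tables, so an alternative is to split on the paragraph marks and keep the spaces inside each line. The following is a minimal sketch of that alternative; `WordTextCleaner` is a hypothetical helper and not part of the original program.

```csharp
using System;
using System.Collections.Generic;

public static class WordTextCleaner
{
    // Splits text in the shape returned by Word's Range.Text into non-empty lines.
    public static List<string> ToLines(string rawText)
    {
        var lines = new List<string>();
        if (string.IsNullOrEmpty(rawText)) return lines;

        // '\r' ends a paragraph, '\v' is a manual line break, '\f' is a page break.
        foreach (string piece in rawText.Split('\r', '\n', '\v', '\f'))
        {
            // '\a' (BEL) marks the end of a table cell or row; drop it.
            string line = piece.Replace("\a", "").Trim();
            if (line.Length > 0) lines.Add(line);
        }
        return lines;
    }
}
```

With this helper each table cell ends up on its own line, and `string.Join(Environment.NewLine, WordTextCleaner.ToLines(wordDoc.Content.Text))` would give readable plain text.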

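II. Reading the content of a Word document in C#

To read a Word document you first have to add a reference, and different Word versions correspond to different references.

Some of the versions and their references:

Microsoft Word 11.0 Object Library corresponds to Office 2003

Microsoft Word 12.0 Object Library corresponds to Office 2007

Microsoft Word 14.0 Object Library corresponds to Office 2010

Microsoft Word 15.0 Object Library corresponds to Office 2013

Taking Word 2007 as the example, add Microsoft Word 12.0 Object Library: right-click the project in the solution, choose Add Reference, and select the library in the dialog that opens.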


Then write using Word = Microsoft.Office.Interop.Word; above the namespace declaration, and the reference is set up.

The code for reading the document is as follows:

protected string ReadFile_Word()
{
    string context = "";
    string path = @"F:\测试文档.docx";
    Word.Application app = new Microsoft.Office.Interop.Word.Application();
    Word.Document doc = null;
    object unknow = Type.Missing;
    //object nullobj = System.Reflection.Missing.Value;
    app.Visible = true;
    object file = path;
    doc = app.Documents.Open(ref file,
        ref unknow, ref unknow, ref unknow, ref unknow,
        ref unknow, ref unknow, ref unknow, ref unknow,
        ref unknow, ref unknow, ref unknow, ref unknow,
        ref unknow, ref unknow, ref unknow);
    string temp = doc.Paragraphs[1].Range.Text.Trim(); // read the first paragraph (Word collections are 1-based)
    context = doc.Content.Text; // read the content of the whole document
    doc.Close(ref unknow, ref unknow, ref unknow); // close the file
    app.Quit(ref unknow, ref unknow, ref unknow); // shut down the Word COM server
    return context;
}

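III. Reading doc, pdf, and ppt files in C#

Converting doc, pdf, and ppt to txt:

The components read the file out as character data. This is not simply a change of file extension, so the text that was read has to be written to a txt file.

Adding the Office references

When programming against Word and PowerPoint from .NET, make sure the Word and PowerPoint programmability components were installed together with Office (they are visible in a custom installation), or install the "Microsoft Office 2003 Primary Interop Assemblies".

After installation, add the references in the project:

Add Reference > COM > Microsoft PowerPoint 11.0 Object Library / Microsoft Word 11.0 Object Library;

The Office namespaces also have to be imported:

using System.IO;
using System.Text;
using Microsoft.Office.Core;
using Microsoft.Office.Interop.Word;
using Microsoft.Office.Interop.PowerPoint;
using org.pdfbox.pdmodel;
using org.pdfbox.util;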

public void pdf2txt(FileInfo file, FileInfo txtfile)
{
    // Extract all text from the PDF with PDFBox
    PDDocument doc = PDDocument.load(file.FullName);
    PDFTextStripper pdfStripper = new PDFTextStripper();
    string text = pdfStripper.getText(doc);
    doc.close();

    // Write the text to the txt file in GB2312; the using block flushes and closes the writer
    using (StreamWriter swPdfChange = new StreamWriter(txtfile.FullName, false, Encoding.GetEncoding("gb2312")))
    {
        swPdfChange.Write(text);
    }
}
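
The methods in this section all write their output with `Encoding.GetEncoding("gb2312")`. On .NET Framework that encoding is built in, but on .NET Core and later it is only available after the code-pages provider has been registered. The following is a minimal sketch of that registration, assuming a .NET Core 3.x or later target (older targets need the System.Text.Encoding.CodePages NuGet package); `Gb2312Text` is a hypothetical helper name.

```csharp
using System;
using System.IO;
using System.Text;

public static class Gb2312Text
{
    public static Encoding GetGb2312()
    {
        // Makes legacy code pages such as GB2312 available on .NET Core / .NET 5+.
        // Registering the same provider more than once is harmless.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        return Encoding.GetEncoding("gb2312");
    }

    // Writes text to a file encoded as GB2312, overwriting any existing file.
    public static void Write(string path, string text)
    {
        using (var writer = new StreamWriter(path, false, GetGb2312()))
            writer.Write(text);
    }

    // Reads a GB2312-encoded file back into a string.
    public static string Read(string path)
    {
        using (var reader = new StreamReader(path, GetGb2312()))
            return reader.ReadToEnd();
    }
}
```

If the output does not have to be GB2312 for an existing consumer, writing UTF-8 instead avoids both the registration and the loss of characters that GB2312 cannot represent.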

For tables in a doc file, the text that is read out has the grid lines removed and the content is read row by row.

public void word2text(FileInfo file, FileInfo txtfile)
{
    object readOnly = true;
    object missing = System.Reflection.Missing.Value;
    object fileName = file.FullName;

    // Fully qualified because the Word and PowerPoint namespaces both define Application
    Microsoft.Office.Interop.Word.Application wordapp = new Microsoft.Office.Interop.Word.Application();
    // The third argument opens the file read-only; the others keep their defaults
    Microsoft.Office.Interop.Word.Document doc = wordapp.Documents.Open(ref fileName,
        ref missing, ref readOnly, ref missing, ref missing, ref missing,
        ref missing, ref missing, ref missing, ref missing, ref missing,
        ref missing, ref missing, ref missing, ref missing, ref missing);
    string text = doc.Content.Text;
    doc.Close(ref missing, ref missing, ref missing);
    wordapp.Quit(ref missing, ref missing, ref missing);

    // Write the text to the txt file in GB2312
    using (StreamWriter swWordChange = new StreamWriter(txtfile.FullName, false, Encoding.GetEncoding("gb2312")))
    {
        swWordChange.Write(text);
    }
}

public void ppt2txt(FileInfo file, FileInfo txtfile)
{
    Microsoft.Office.Interop.PowerPoint.Application pa = new Microsoft.Office.Interop.PowerPoint.Application();
    // Open read-only, not as an untitled copy, and without a window
    Microsoft.Office.Interop.PowerPoint.Presentation pp = pa.Presentations.Open(file.FullName,
        Microsoft.Office.Core.MsoTriState.msoTrue,
        Microsoft.Office.Core.MsoTriState.msoFalse,
        Microsoft.Office.Core.MsoTriState.msoFalse);

    StringBuilder pps = new StringBuilder();
    foreach (Microsoft.Office.Interop.PowerPoint.Slide slide in pp.Slides)
    {
        foreach (Microsoft.Office.Interop.PowerPoint.Shape shape in slide.Shapes)
        {
            // Pictures and similar shapes have no text frame; reading their text would throw
            if (shape.HasTextFrame == Microsoft.Office.Core.MsoTriState.msoTrue)
                pps.Append(shape.TextFrame.TextRange.Text);
        }
    }
    pp.Close();
    pa.Quit();

    // Write the collected text to the txt file in GB2312
    using (StreamWriter swPPtChange = new StreamWriter(txtfile.FullName, false, Encoding.GetEncoding("gb2312")))
    {
        swPPtChange.Write(pps.ToString());
    }
}

Reading different file types

public StreamReader text2reader(FileInfo file)
{
    StreamReader st = null;
    // Convert the file to txt according to its extension, then return a reader over the txt
    switch (file.Extension.ToLower())
    {
        case ".txt":
            st = new StreamReader(file.FullName, Encoding.GetEncoding("gb2312"));
            break;
        case ".doc":
            // A relative path does not work here; this needs improvement
            FileInfo wordfile = new FileInfo(@"E:\my programs\200807program\FileSearch\App_Data\word2txt.txt");
            word2text(file, wordfile);
            st = new StreamReader(wordfile.FullName, Encoding.GetEncoding("gb2312"));
            break;
        case ".pdf":
            FileInfo pdffile = new FileInfo(@"E:\my programs\200807program\FileSearch\App_Data\pdf2txt.txt");
            pdf2txt(file, pdffile);
            st = new StreamReader(pdffile.FullName, Encoding.GetEncoding("gb2312"));
            break;
        case ".ppt":
            FileInfo pptfile = new FileInfo(@"E:\my programs\200807program\FileSearch\App_Data\ppt2txt.txt");
            ppt2txt(file, pptfile);
            st = new StreamReader(pptfile.FullName, Encoding.GetEncoding("gb2312"));
            break;
    }
    // null is returned for unsupported extensions
    return st;
}
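
The comment in text2reader notes that the hard-coded absolute path should be improved. One possible direction, sketched below under the assumption that the intermediate txt files are only scratch output, is to build the path at run time, either in the system temp folder or relative to the application's base directory. `ScratchFiles` and its method names are hypothetical, and the folder name "FileSearch" is simply borrowed from the path in the original code.

```csharp
using System;
using System.IO;

public static class ScratchFiles
{
    // Returns a FileInfo for a uniquely named .txt file in a per-application temp folder.
    public static FileInfo Create(string prefix)
    {
        string dir = Path.Combine(Path.GetTempPath(), "FileSearch");
        Directory.CreateDirectory(dir); // does nothing if the folder already exists
        string name = prefix + "_" + Guid.NewGuid().ToString("N") + ".txt";
        return new FileInfo(Path.Combine(dir, name));
    }

    // Resolves a path relative to the application's base directory,
    // for example Path.Combine("App_Data", "word2txt.txt").
    public static FileInfo UnderAppBase(string relativePath)
    {
        return new FileInfo(Path.Combine(AppDomain.CurrentDomain.BaseDirectory, relativePath));
    }
}
```

In text2reader, a call such as `ScratchFiles.Create("word2txt")` could then stand in for the fixed `E:\...` path, and the unique file name also keeps concurrent conversions from overwriting each other.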
