Compare Word and "HTML output from Word"

Hello,

I would like to compare a Word document with the HTML output generated from it. I have tried several approaches, but I can’t get the two to come out equal.

I’m less interested in formatting than in the text, tables, numbering, and so on.

I tried converting both formats to Markdown, but the Markdown produced from Word and the Markdown produced from HTML differ too much to be compared directly. (It might still work, but I would have to replace a lot of text first.)

Could you point me toward a way to solve this problem?

TB

@benestom

Can you please provide more details on the specific methods or code you have tried for comparing Word and HTML outputs?

For HTML:

 // HTML
 using (HTMLDocument document = new HTMLDocument(htmlPath))
 {
     // Extract the HTML content
     string extractedHtml = document.DocumentElement.OuterHTML;

     // Convert HTML to Markdown
     var converter = new Converter();
     string markdownString = converter.Convert(extractedHtml);

     // Save the Markdown to a file
     File.WriteAllText(Path.Combine(outputFolder, "outputHtml.md"), markdownString);
 }

For Word:

Document doc = new Document(docxPath);
doc.Range.Bookmarks.Clear();
MarkdownSaveOptions options = new MarkdownSaveOptions
{
    ImagesFolder = "images",  // Save images to a folder
    TableContentAlignment = TableContentAlignment.Left, // Align table content as in the HTML
    ListExportMode = MarkdownListExportMode.MarkdownSyntax, // Use standard Markdown syntax for lists
    ParagraphBreak = "  \n",  // Paragraph break formatting for better readability
    ExportHeadersFootersMode = TxtExportHeadersFootersMode.None
};
doc.Save(Path.Combine(outputFolder, "outputWord.md"), options);

0TS215pV02.docx (2.6 MB)

Save this document as HTML and compare the two.
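
To show what I mean by "replacing a lot", here is a minimal sketch of the comparison step I would run on the two Markdown files. It is only an illustration: `MarkdownNormalizer`, `Normalize` and `FirstDifference` are hypothetical names of my own, not part of any library, and the regular expressions are guesses at the kind of differences the two exports show (bullet characters, emphasis markers, backslash escapes, table separator rows, spacing around pipes, blank lines). It is deliberately lossy, so literal `*` or `_` characters in the text are dropped as well.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration only; not part of any library.
public static class MarkdownNormalizer
{
    // Reduces Markdown to a loose, line-based text form so two exports can be compared.
    public static string[] Normalize(string markdown)
    {
        return markdown
            .Replace("\r\n", "\n")
            .Split('\n')
            .Select(NormalizeLine)
            .Where(line => line.Length > 0)   // ignore blank lines
            .ToArray();
    }

    private static string NormalizeLine(string line)
    {
        string text = line.Trim();

        // Drop table separator rows such as |---|---| or | :--- | :--- |
        // (this also drops horizontal rules written as ---).
        if (Regex.IsMatch(text, @"^[\s|:\-]+$") && text.Contains("-"))
            return "";

        // Use one bullet character for unordered lists.
        text = Regex.Replace(text, @"^[*+-]\s+", "- ");

        // Remove emphasis markers and backslash escapes (assumption: formatting is irrelevant).
        text = Regex.Replace(text, @"[*_\\]", "");

        // Remove spacing around table pipes, then collapse remaining whitespace runs.
        text = Regex.Replace(text, @"\s*\|\s*", "|");
        text = Regex.Replace(text, @"\s+", " ");

        return text.Trim();
    }

    // Returns the index of the first differing line, or -1 if both inputs match.
    public static int FirstDifference(string[] a, string[] b)
    {
        int shared = Math.Min(a.Length, b.Length);
        for (int i = 0; i < shared; i++)
        {
            if (!string.Equals(a[i], b[i], StringComparison.Ordinal))
                return i;
        }
        return a.Length == b.Length ? -1 : shared;
    }
}
```

With the files written above, I would read `outputWord.md` and `outputHtml.md` with `File.ReadAllText`, pass each through `Normalize`, and call `FirstDifference` on the two results to find where they start to diverge.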