Hello,
I would like to compare a Word document with the HTML output generated from that same Word document. I have tried several approaches, but I can’t get the two to come out equal.
I’m not so much interested in formatting as in the text, tables, numbering, etc.
I tried converting both formats to Markdown, but the Word and HTML conversions produce different Markdown that cannot be compared directly. (It might still work, but I would have to replace a lot of text.)
Could you direct me to how I could solve this problem?
TB
@benestom
Can you please provide more details on the specific methods or code you have tried for comparing Word and HTML outputs?
For HTML:
// HTML
// Note: Converter is not defined in this post; it is assumed to be an
// HTML-to-Markdown converter (for example the one from ReverseMarkdown).
using (HTMLDocument document = new HTMLDocument(htmlPath))
{
    // Extract the HTML content
    string extractedHtml = document.DocumentElement.OuterHTML;
    // Convert HTML → Markdown
    var converter = new Converter();
    string markdownString = converter.Convert(extractedHtml);
    // Save the Markdown to a file
    File.WriteAllText(Path.Combine(outputFolder, "outputHtml.md"), markdownString);
}
For Word:
Document doc = new Document(docxPath);
doc.Range.Bookmarks.Clear();
MarkdownSaveOptions options = new MarkdownSaveOptions
{
    ImagesFolder = "images", // Save images to a folder
    TableContentAlignment = TableContentAlignment.Left, // Align table content as in the HTML
    ListExportMode = MarkdownListExportMode.MarkdownSyntax, // Use standard Markdown syntax for lists
    ParagraphBreak = " \n", // Paragraph break chosen for better readability
    ExportHeadersFootersMode = TxtExportHeadersFootersMode.None
};
doc.Save(Path.Combine(outputFolder, "outputWord.md"), options);
0TS215pV02.docx (2.6 MB)
Save this document as HTML and compare the two.
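Since the two Markdown files differ mostly in syntax rather than in content, one option might be to normalize both before comparing them. The following is only a minimal sketch of that idea, not an Aspose feature: `MarkdownNormalizer`, `Normalize` and `FirstDifference` are hypothetical names, it uses only the .NET base class library, and the set of stripped markers is an assumption that would need tuning against the real output files.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper; not part of Aspose.Words or Aspose.HTML.
public static class MarkdownNormalizer
{
    // Reduce Markdown to a list of plain, comparable text lines.
    public static List<string> Normalize(string markdown)
    {
        var result = new List<string>();
        foreach (string rawLine in markdown.Replace("\r\n", "\n").Split('\n'))
        {
            string line = rawLine;
            // Skip blank lines, table separator rows (|---|---|) and setext underlines (=====).
            if (Regex.IsMatch(line, @"^[\s|:\-=]*$")) continue;
            // Undo backslash escapes of punctuation, e.g. "1\." becomes "1.".
            line = Regex.Replace(line, @"\\([^\w\s])", "$1");
            // Drop images and keep only the visible text of links.
            line = Regex.Replace(line, @"!\[[^\]]*\]\([^)]*\)", "");
            line = Regex.Replace(line, @"\[([^\]]*)\]\([^)]*\)", "$1");
            // Strip leading heading, block quote and bullet markers.
            line = Regex.Replace(line, @"^\s*(#{1,6}\s+|>\s*|[-*+]\s+)", "");
            // Keep list numbers, but unify "1)" and "1." to "1. ".
            line = Regex.Replace(line, @"^\s*(\d+)[.)]\s+", "$1. ");
            // Remove emphasis and code markers; turn table pipes into spaces.
            line = Regex.Replace(line, @"[*_`]", "");
            line = line.Replace('|', ' ');
            // Collapse runs of whitespace.
            line = Regex.Replace(line, @"\s+", " ").Trim();
            if (line.Length > 0) result.Add(line);
        }
        return result;
    }

    // Index of the first differing line, or -1 when both lists are equal.
    public static int FirstDifference(IReadOnlyList<string> a, IReadOnlyList<string> b)
    {
        int common = Math.Min(a.Count, b.Count);
        for (int i = 0; i < common; i++)
        {
            if (!string.Equals(a[i], b[i], StringComparison.Ordinal)) return i;
        }
        return a.Count == b.Count ? -1 : common;
    }
}

public static class Program
{
    public static void Main()
    {
        // Made-up samples standing in for outputWord.md and outputHtml.md.
        string fromWord = "# Title\r\n\r\n1\\. First **item**\r\n\r\n| A | B |\r\n|---|---|\r\n| 1 | 2 |\r\n";
        string fromHtml = "Title\n=====\n\n1) First *item*\n\nA | B\n--- | ---\n1 | 2\n";

        List<string> word = MarkdownNormalizer.Normalize(fromWord);
        List<string> html = MarkdownNormalizer.Normalize(fromHtml);

        int diff = MarkdownNormalizer.FirstDifference(word, html);
        Console.WriteLine(diff < 0 ? "No differences" : "First difference at line " + diff);
    }
}
```

In practice the two strings would come from `File.ReadAllText` on `outputWord.md` and `outputHtml.md`. The normalization is deliberately lossy (for instance, underscores inside words are removed too), which seems acceptable here because formatting is not the point of the comparison.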