Words Cloud PDF -> Docx losing text with backgrounds and sporadically changing font

ddsam · August 1, 2023, 9:56am

We’re trying to convert PDFs to DOCX using Words Cloud, but we’re encountering some strange issues.

For any content in the PDF whose background colour is not a plain rectangle (i.e. the background has a corner radius), only the background colour/shape ends up in the Word document. Any text that was inside or overlapping the shape is removed.

In addition, for no apparent reason, the document occasionally comes back with all of the fonts changed from Verdana to Times New Roman. Usually it’s absolutely fine, but occasionally, bam: all Times New Roman.

The PDFs are generated from HTML using Puppeteer. We’ve tried a couple of workarounds for the shapes, e.g. replacing <div> elements that have backgrounds with <svg> elements containing a path, but the result is the same. Essentially, the only way we can get text to show up is if the background has square corners.
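To narrow this down, the square-versus-rounded comparison described above could be reduced to two minimal test pages. The sketch below only builds the HTML strings; `buildReproHtml` and the colour and padding values are hypothetical and not taken from the real templates.

```typescript
// Hypothetical helper: builds a minimal test page containing one text box
// whose background has the given corner radius (0 means square corners).
function buildReproHtml(cornerRadiusPx: number): string {
  return [
    "<!DOCTYPE html>",
    "<html>",
    '<body style="font-family: Verdana, sans-serif">',
    `<div style="background: #cce5ff; border-radius: ${cornerRadiusPx}px; padding: 8px">`,
    `Text on a background with radius ${cornerRadiusPx}px`,
    "</div>",
    "</body>",
    "</html>",
  ].join("\n");
}

// Two variants that differ only in the corner radius.
const squareHtml = buildReproHtml(0);
const roundedHtml = buildReproHtml(8);
```

Each string could then be rendered with Puppeteer's `page.setContent(...)` followed by `page.pdf({ printBackground: true })`, and both PDFs converted to DOCX. If only the rounded variant loses its text, the corner radius is isolated as the trigger.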

tilal.ahmad · August 1, 2023, 5:05pm

@ddsam

Please share your sample input and output documents with us. We will try to reproduce the issue and guide you accordingly.

ddsam · August 2, 2023, 6:16am

test.pdf (70.3 KB)
test.docx (35.8 KB)

The PDF and the resulting docx are attached. I’ll note that this docx was produced using the Aspose Words docker container, but we were getting the same/similar results from using the cloud service.

Some characters have gone missing, and the fonts aren’t remaining consistent (changing from Verdana to Times New Roman, losing bold, etc.). Words that have a background are also apparently not making it into the final document.

tilal.ahmad · August 2, 2023, 7:57am

@ddsam

We have noticed the missing text issue in PDF to DOCX conversion with your shared sample PDF document and logged a ticket (WORDSCLOUD-2424) for investigation and rectification.

tilal.ahmad · August 2, 2023, 8:52am

@ddsam

I’m afraid I’m unable to find any font information in your input PDF document. So the conversion API replaces the font with Times New Roman. Please share the expected output document with us; we will look into it and guide you accordingly.
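One rough way to check what font names a PDF declares is to scan its raw bytes for `/BaseFont` and `/FontName` entries. The sketch below is an assumption-heavy illustration, not how the conversion API inspects fonts: `listDeclaredFonts` is a hypothetical helper, it expects the file already decoded as latin1 text, and it cannot see entries stored inside compressed object streams, so an empty result is not proof that the PDF has no font information.

```typescript
// Hypothetical helper: lists font names declared via /BaseFont or /FontName
// in the uncompressed parts of a PDF. Input is the raw file as latin1 text.
function listDeclaredFonts(pdfText: string): string[] {
  const pattern = /\/(?:BaseFont|FontName)\s*\/([^\s\/<>\[\]()]+)/g;
  const names: string[] = [];
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(pdfText)) !== null) {
    const raw = match[1];
    if (raw === undefined) continue;
    // Decode #xx escapes that PDF names may contain.
    const decoded = raw.replace(/#([0-9A-Fa-f]{2})/g, (_m: string, hex: string) =>
      String.fromCharCode(parseInt(hex, 16))
    );
    // Drop the six-letter subset prefix, e.g. "ABCDEF+Verdana" becomes "Verdana".
    const name = decoded.replace(/^[A-Z]{6}\+/, "");
    if (names.indexOf(name) === -1) names.push(name);
  }
  return names.sort();
}

// Synthetic input for illustration only; not taken from test.pdf.
const sample =
  "<< /Type /Font /Subtype /TrueType /BaseFont /ABCDEF+Verdana >>\n" +
  "<< /Type /FontDescriptor /FontName /ABCDEF+Verdana-Bold >>";
const fonts = listDeclaredFonts(sample);
```

For a real file, the text could be obtained in Node with `fs.readFileSync(path).toString("latin1")`. Comparing the output for a run that converts correctly against one that falls back to Times New Roman may show whether the generated PDFs themselves differ.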