Convert PDF to DOCX - no text is extracted

lucka · May 24, 2018, 10:28am

Hi,

I need to convert PDF documents to Word (DOCX) format. Most of these documents are scanned PDFs. I use the following code for the conversion:

public String saveAsDocx() {
    String fn = parent.getNewFileName(file, "DOCX");

    // Save as DOCX in Flow recognition mode, with bullet recognition enabled
    DocSaveOptions saveOption = new DocSaveOptions();
    saveOption.setFormat(DocSaveOptions.DocFormat.DocX);
    saveOption.setMode(DocSaveOptions.RecognitionMode.Flow);
    saveOption.setRecognizeBullets(true);

    document.save(fn, saveOption);

    // Register the generated file with the surrounding application
    parent.addResult(fn, false);
    this.processedFile.otherFormats.add(new ProcessedFile(fn));

    return fn;
}

Here, document is of type com.aspose.pdf.Document.

The output I get is a Word document in which each page contains only an image of the original PDF page. I understand that this would be valid output if the text of the scanned PDF could not be recognized. However, when I extract the text from the same PDF document, it works well. For the text extraction I use the following code:

public String saveAsText() throws Exception {
    String fn = parent.getNewFileName(file, "TXT");

    try (Writer writer = Files.newBufferedWriter(Paths.get(fn), StandardCharsets.UTF_8)) {
        // Aspose page indexes are 1-based
        for (int page = 1; page <= document.getPages().size(); page++) {
            com.aspose.pdf.TextAbsorber textAbsorber = new com.aspose.pdf.TextAbsorber();
            textAbsorber.getExtractionOptions().setFormattingMode(TextExtractionOptions.TextFormattingMode.Pure);
            document.getPages().get_Item(page).accept(textAbsorber);
            String extractedText = textAbsorber.getText();

            // Write the page text followed by a visible page separator
            writer.write(extractedText);
            writer.write("\n ----------------- PAGE --------------- \n");
            writer.flush();
        }
    }

    // Register the generated file with the surrounding application
    parent.addResult(fn, false);
    this.processedFile.otherFormats.add(new ProcessedFile(fn));

    return fn;
}

The output of this method is a TXT file containing the text of the original PDF document, with no errors in it.
So my question is: how can I convert the PDF to DOCX, given that the text can be extracted cleanly? I can also provide the document if necessary.
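
For reference, one possible fallback while the conversion itself only produces page images is to write the already extracted text into a bare-bones DOCX by hand. The sketch below is only that: PlainTextDocx and its methods are hypothetical names, not part of Aspose.PDF, and it assumes the text of each page has already been collected into a list of strings (for example from textAbsorber.getText()). It uses only the JDK and keeps plain text with page breaks; layout, fonts, and images are lost.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

/** Hypothetical helper (not an Aspose API): writes page texts into a minimal DOCX package. */
final class PlainTextDocx {

    // Declares the content types of the parts in the package.
    private static final String CONTENT_TYPES =
        "<?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"yes\"?>"
        + "<Types xmlns=\"http://schemas.openxmlformats.org/package/2006/content-types\">"
        + "<Default Extension=\"rels\" ContentType=\"application/vnd.openxmlformats-package.relationships+xml\"/>"
        + "<Default Extension=\"xml\" ContentType=\"application/xml\"/>"
        + "<Override PartName=\"/word/document.xml\" ContentType=\"application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml\"/>"
        + "</Types>";

    // Points the package at its main document part.
    private static final String RELS =
        "<?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"yes\"?>"
        + "<Relationships xmlns=\"http://schemas.openxmlformats.org/package/2006/relationships\">"
        + "<Relationship Id=\"rId1\" Type=\"http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument\" Target=\"word/document.xml\"/>"
        + "</Relationships>";

    /** Drops the C0 control characters that XML 1.0 forbids and escapes markup characters. */
    static String escapeXml(String s) {
        return s.replaceAll("[\\x00-\\x08\\x0B\\x0C\\x0E-\\x1F]", "")
                .replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;");
    }

    /** Builds word/document.xml: one paragraph per text line, a page break between pages. */
    static String buildDocumentXml(List<String> pageTexts) {
        StringBuilder sb = new StringBuilder();
        sb.append("<?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"yes\"?>");
        sb.append("<w:document xmlns:w=\"http://schemas.openxmlformats.org/wordprocessingml/2006/main\"><w:body>");
        for (int p = 0; p < pageTexts.size(); p++) {
            if (p > 0) {
                sb.append("<w:p><w:r><w:br w:type=\"page\"/></w:r></w:p>");
            }
            for (String line : pageTexts.get(p).split("\\R")) {
                sb.append("<w:p><w:r><w:t xml:space=\"preserve\">")
                  .append(escapeXml(line))
                  .append("</w:t></w:r></w:p>");
            }
        }
        sb.append("</w:body></w:document>");
        return sb.toString();
    }

    /** Writes the three parts of a minimal DOCX into a ZIP container at the target path. */
    static void write(Path target, List<String> pageTexts) throws IOException {
        try (OutputStream out = Files.newOutputStream(target);
             ZipOutputStream zip = new ZipOutputStream(out, StandardCharsets.UTF_8)) {
            putEntry(zip, "[Content_Types].xml", CONTENT_TYPES);
            putEntry(zip, "_rels/.rels", RELS);
            putEntry(zip, "word/document.xml", buildDocumentXml(pageTexts));
        }
    }

    private static void putEntry(ZipOutputStream zip, String name, String xml) throws IOException {
        zip.putNextEntry(new ZipEntry(name));
        zip.write(xml.getBytes(StandardCharsets.UTF_8));
        zip.closeEntry();
    }
}
```

In the extraction loop, each page's extracted text would be added to a List<String> instead of (or in addition to) being written to the TXT file, and PlainTextDocx.write(Paths.get(fn), pageTexts) would then produce the DOCX. This only works around the problem; it does not explain why the DOCX conversion itself returns images.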

Thanks!

sohail.aspose · May 24, 2018, 11:17am

@lucka

Thank you for contacting Aspose Support.

This forum is for topics related to the Aspose REST APIs, whereas your query concerns the Aspose native (downloadable) APIs. Please follow the thread "Convert PDF to DOCX - no text is extracted - Free Support Forum - aspose.com" for an answer to your query.

tilal.ahmad · June 25, 2021, 3:43pm