Convert PDF to DOCX - no text is extracted


#1

Hi,

I need to convert PDF documents to the Word (DOCX) format. Most of these documents are scanned PDFs. I use the following code for the conversion:

public String saveAsDocx() {

    String fn = parent.getNewFileName(file, "DOCX");

    // Save as DOCX and ask for flow-based recognition of the page content.
    DocSaveOptions saveOption = new DocSaveOptions();
    saveOption.setFormat(DocSaveOptions.DocFormat.DocX);
    saveOption.setMode(DocSaveOptions.RecognitionMode.Flow);

    saveOption.setRecognizeBullets(true);

    document.save(fn, saveOption);

    parent.addResult(fn, false);

    this.processedFile.otherFormats.add(new ProcessedFile(fn));

    return fn;

}

Here, document is of type com.aspose.pdf.Document.

The output is a Word document in which each page contains only an image of the original PDF page. I understand that this is a valid result when the text of a scanned PDF cannot be recognized reliably. However, extracting the text from the same PDF document works well. For the text extraction I use the following code:

public String saveAsText() throws Exception {

    String fn = parent.getNewFileName(file, "TXT");
   
    try (Writer writer = Files.newBufferedWriter(Paths.get(fn), StandardCharsets.UTF_8)) {
        // The loop counter is 0-based, while get_Item expects a 1-based page number (hence i + 1).
        for (int i = 0; i < document.getPages().size(); i++) {

            // A fresh absorber per page keeps each page's text separate.
            com.aspose.pdf.TextAbsorber textAbsorber = new com.aspose.pdf.TextAbsorber();
            textAbsorber.getExtractionOptions().setFormattingMode(TextExtractionOptions.TextFormattingMode.Pure);
            document.getPages().get_Item(i + 1).accept(textAbsorber);
            String extractedText = textAbsorber.getText();

            writer.write(extractedText);
            writer.write("\n ----------------- PAGE --------------- \n");
            writer.flush();

        }
    }

    parent.addResult(fn, false);

    this.processedFile.otherFormats.add(new ProcessedFile(fn));

    return fn;

}

This method produces a TXT file containing the text of the original PDF document, with no errors in it.
So my question is: how can I convert the PDF to DOCX when I know that the text can be extracted correctly? I can also provide the document if necessary.
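
To narrow the problem down, it may help to confirm that every page really yields text, and not just the document as a whole. The following is only a rough sketch of such a check. It does not call Aspose at all: it assumes the per-page strings returned by textAbsorber.getText() have been collected into a list. The class name TextLayerCheck, the methods visibleChars and pagesWithoutText, and the minChars threshold are hypothetical names made up for this illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Aspose.PDF: flags pages whose extracted
// text is empty or nearly empty.
class TextLayerCheck {

    /** Counts the non-whitespace characters in one page's extracted text. */
    static int visibleChars(String pageText) {
        if (pageText == null) {
            return 0;
        }
        int count = 0;
        for (int i = 0; i < pageText.length(); i++) {
            if (!Character.isWhitespace(pageText.charAt(i))) {
                count++;
            }
        }
        return count;
    }

    /**
     * Returns the 1-based numbers of all pages whose extracted text has
     * fewer than minChars non-whitespace characters.
     */
    static List<Integer> pagesWithoutText(List<String> pageTexts, int minChars) {
        List<Integer> result = new ArrayList<>();
        for (int i = 0; i < pageTexts.size(); i++) {
            if (visibleChars(pageTexts.get(i)) < minChars) {
                result.add(i + 1); // report page numbers starting at 1
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Sample strings standing in for real per-page extraction results.
        List<String> pages = Arrays.asList("Invoice 2018-05-01 ...", "   \n", "Total: 42");
        System.out.println(pagesWithoutText(pages, 5)); // prints [2]
    }
}
```

If such a check reports no pages, the text is present on every page, and the image-only DOCX output would then point at the conversion step and not at the text in the document.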

Thanks!


#2

@lucka

Thank you for contacting Aspose Support.

This forum is for topics related to Aspose REST APIs, whereas your query concerns the Aspose Native/Downloadable APIs. Please follow this thread for an answer to your query: https://forum.aspose.com/t/convert-pdf-to-docx-no-text-is-extracted/177221