Hi,
I need to convert PDF documents to the Word (DOCX) format. Most of these documents are scanned PDFs. I use the following code for the conversion:
public String saveAsDocx() {
    String fn = parent.getNewFileName(file, "DOCX");
    // Configure the PDF-to-Word export
    DocSaveOptions saveOption = new DocSaveOptions();
    saveOption.setFormat(DocSaveOptions.DocFormat.DocX);
    saveOption.setMode(DocSaveOptions.RecognitionMode.Flow);
    saveOption.setRecognizeBullets(true);
    document.save(fn, saveOption);
    // Register the generated file with the surrounding application
    parent.addResult(fn, false);
    this.processedFile.otherFormats.add(new ProcessedFile(fn));
    return fn;
}
Here, document is of type com.aspose.pdf.Document.
The output is a Word document in which each page contains only an image of the original PDF page. I understand that this would be a valid result if the text of the scanned PDF could not be recognized reliably. However, extracting the text from the same PDF document works well. For the text extraction I use the following code:
public String saveAsText() throws Exception {
    String fn = parent.getNewFileName(file, "TXT");
    try (Writer writer = Files.newBufferedWriter(Paths.get(fn), StandardCharsets.UTF_8)) {
        for (int i = 0; i < document.getPages().size(); i++) {
            // Use a fresh absorber per page so each page's text is written separately
            com.aspose.pdf.TextAbsorber textAbsorber = new com.aspose.pdf.TextAbsorber();
            textAbsorber.getExtractionOptions().setFormattingMode(TextExtractionOptions.TextFormattingMode.Pure);
            // Page indexes are 1-based, hence i + 1
            document.getPages().get_Item(i + 1).accept(textAbsorber);
            String extractedText = textAbsorber.getText();
            writer.write(extractedText);
            writer.write("\n ----------------- PAGE --------------- \n");
            writer.flush();
        }
    }
    parent.addResult(fn, false);
    this.processedFile.otherFormats.add(new ProcessedFile(fn));
    return fn;
}
This method produces a TXT file containing the text of the original PDF document, with no errors in it.
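In case it helps to confirm this page by page, here is a minimal sketch that lists the pages whose extracted text is blank. It assumes the TXT file written by saveAsText() has already been read into a string. The class name PageTextCheck and the method blankPages are hypothetical helpers for illustration only; they are not part of Aspose.PDF, and the code uses just the JDK.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class PageTextCheck {
    // Same separator that saveAsText() writes after every page.
    static final String PAGE_SEPARATOR = "\n ----------------- PAGE --------------- \n";

    // Returns the 1-based numbers of pages whose extracted text is blank.
    static List<Integer> blankPages(String txtContent) {
        List<Integer> blank = new ArrayList<>();
        // Limit -1 keeps trailing empty pieces, so the piece count stays predictable.
        String[] pieces = txtContent.split(Pattern.quote(PAGE_SEPARATOR), -1);
        // The separator follows every page, so the last piece is whatever
        // comes after the final separator and is not a page.
        int pageCount = pieces.length - 1;
        for (int i = 0; i < pageCount; i++) {
            if (pieces[i].trim().isEmpty()) {
                blank.add(i + 1);
            }
        }
        return blank;
    }

    public static void main(String[] args) {
        // Hypothetical sample: page 1 has text, page 2 contains only whitespace.
        String sample = "first page text" + PAGE_SEPARATOR + "   " + PAGE_SEPARATOR;
        System.out.println(blankPages(sample)); // prints [2]
    }
}
```

An empty result for the real file would mean that every page yielded some text, which is the situation described above.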
So my question is: how can I convert the PDF to DOCX, given that its text can be extracted reliably? I can also provide the document if necessary.
Thanks!