I am converting PDF documents to markdown using the API and the python client according to
This works fine, but we have the following situation:
The documents contain several source links / footnotes. They are numbered from 1 to n and always at the end of the page.
The number of the link is added at the corresponding section of the content/text in superscript.
From the markdown, we import the content into our database. I would like to ignore the footnotes in the import, as they are not relevant for our use case. As each footnote gets its own line in the markdown and follows a strict format, these lines can be selected by regex and skipped during the import.
To remove the references in the text (the superscripted numbers), it would help if the markdown marked them as superscript, so we could simply strip all superscripted numbers from the document with a regex.
Is there an option to activate superscript format in markdown? Or is there a better way to somehow achieve this?
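As a rough sketch of the import-side filtering described above: the Python snippet below drops footnote lines and strips inline superscript markers. Both patterns are assumptions made for illustration. The footnote line format (`3 https://...`) is hypothetical, and the `<sup>3</sup>` form is the output wished for here, not something the converter is known to produce.

```python
import re

# Assumption: each footnote sits on its own line and looks like "3 https://...".
# The real pattern depends on the documents' strict formatting; adjust as needed.
FOOTNOTE_LINE = re.compile(r"^\s*\d+\s+https?://\S+.*$")

# Assumption: superscripted references would appear as <sup>3</sup>.
# This is the desired format, not one the conversion is confirmed to emit.
SUPERSCRIPT_REF = re.compile(r"<sup>\s*\d+\s*</sup>")


def strip_footnotes(markdown: str) -> str:
    """Drop footnote lines and remove inline superscript reference numbers."""
    kept = []
    for line in markdown.splitlines():
        if FOOTNOTE_LINE.match(line):
            continue  # skip the whole footnote line
        kept.append(SUPERSCRIPT_REF.sub("", line))
    return "\n".join(kept)
```

Keeping the two patterns separate means the footnote-line filter still works even when the superscript markers are not available.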
I am afraid there is currently no option to control superscript text in PDF to Markdown conversion. However, we would appreciate it if you could share your existing output and a sample of the expected output. We will look into these and try to provide a solution.
Thank you for the answer.
I uploaded the PDF and the generated markdown here: example_doc.zip (26.5 KB)
How the expected output should look is a bit hard to say.
The easiest for our case would be if the numbers 1 to 4 were marked as superscript (the documents do not contain any other superscripted content), but as you told us, this is not yet implemented.
While constructing an example, I found that if the references are created properly using Word's references feature, the numbers are marked in the markdown: *This is one block of [*1 ](#_page0_x96.40_y511.95) ...
The problem is that this is not the case for our documents, as the numbers are just superscripted text and not “real” references.
Do you have any idea how we could reliably find these numbers in the markdown anyway?
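For documents where the references are real Word references, the linked form shown in the snippet above could be stripped with a regex. This is only a sketch: the anchor shape (`#_page<N>_x<float>_y<float>`) is inferred from that single example and may differ in other outputs, and it does not help with plain superscripted numbers.

```python
import re

# Assumption: linked reference numbers look like "[*1 ](#_page0_x96.40_y511.95)",
# as in the single example above. Other anchor shapes will not match.
LINKED_REF = re.compile(
    r"\[\*?\s*\d+\s*\*?\s*\]\(#_page\d+_x[\d.]+_y[\d.]+\)"
)


def strip_linked_refs(markdown: str) -> str:
    """Remove page-anchor links whose link text is just a reference number."""
    return LINKED_REF.sub("", markdown)
```

Ordinary links are left untouched because the pattern requires both a numeric link text and a `#_page` anchor.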
We have logged a ticket WORDSCLOUD-1784 in our issue tracking system for further investigation and resolution. We will keep you updated about the issue resolution progress within this forum thread.
We have further investigated and found that you can remove superscript text in PDF to Markdown conversion with Python using Aspose.Words Cloud API as follows. Please check the sample code for reference and let us know if it suits your scenario.
1. Convert PDF to Word format
2. Read the footnotes from the Word document
3. Remove the footnotes from the Word document
4. Convert Word to MD format
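The order of these steps can be sketched in Python as below. This is not the Aspose.Words Cloud SDK: the four callables (`pdf_to_docx`, `list_footnotes`, `delete_footnote`, `docx_to_md`) are hypothetical placeholders for whatever API calls perform each step. The sketch also assumes that footnote indexes shift after a deletion, which is why it deletes from the highest index down.

```python
from typing import Callable, Sequence


def pdf_to_markdown_without_footnotes(
    pdf: bytes,
    pdf_to_docx: Callable[[bytes], bytes],            # hypothetical: step 1
    list_footnotes: Callable[[bytes], Sequence[int]],  # hypothetical: step 2
    delete_footnote: Callable[[bytes, int], bytes],    # hypothetical: step 3
    docx_to_md: Callable[[bytes], str],                # hypothetical: step 4
) -> str:
    """Run the four steps in order and return the resulting markdown."""
    docx = pdf_to_docx(pdf)
    indexes = list_footnotes(docx)
    # Assumption: indexes shift after each deletion, so remove the
    # highest index first to keep the remaining ones valid.
    for index in sorted(indexes, reverse=True):
        docx = delete_footnote(docx, index)
    return docx_to_md(docx)
```

Passing the steps in as functions keeps the sketch independent of any particular client version; each placeholder would be replaced by the corresponding API request.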
Unfortunately, the one problem we see right now is that the API does not detect footnotes in a PDF that doesn’t have a horizontal divider above the footnotes. This is the case with your shared sample PDF, so we have logged another ticket, PDF2WORD-961, to fix the issue. Hopefully, this problem will be fixed in the upcoming release, 21.12.
Sample Python Code Remove Superscript Text in PDF to MD Conversion Online
Thank you for the script.
The code runs and seems to work with PDFs that have the horizontal divider.
Unfortunately, none of our PDFs have the divider, so I will test this again with the new releases.
The issues you have found earlier (filed as WORDSCLOUD-1784) have been fixed in this update. This message was posted using Bugs notification tool by Ivanov_John
Yes, in the Aspose.Words Cloud 21.12 release we have fixed the case in which a PDF doesn’t have a horizontal divider above the footnotes. Please let us know if you still face any issues in this regard.
We are sorry for the inconvenience. I have noticed the footnote detection issue with your shared PDF document and logged a ticket (WORDSCLOUD-1904) for further investigation. We will keep you updated about the issue resolution progress in this thread.
The issues you have found earlier (filed as WORDSCLOUD-1904) have been fixed in this update. This message was posted using Bugs notification tool by Ivanov_John
After an initial investigation, we have logged a ticket WORDSCLOUD-1933 for further investigation and rectification. We will keep you updated about the issue resolution progress within this thread.