How To Remove Superscript Text in PDF to Markdown Conversion with Python

lars.peyer · October 25, 2021, 9:44am

I am converting PDF documents to markdown using the API and the python client according to

This works fine, but we have the following situation:
The documents contain several source links / footnotes. They are numbered from 1 to n and always placed at the end of the page.
The number of each link appears in superscript at the corresponding place in the text.

From the markdown, we import the content into our database. I would like to ignore the footnotes in the import, as they are not relevant for our use case. As each footnote gets its own line in the markdown and the footnotes follow a strict format, these lines can be selected by regex and ignored in the import.
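
A minimal sketch of that line filter, for illustration only. The footnote format used here (a number followed by a URL on its own line) is an assumption, since the thread does not show the actual lines; the pattern and the name drop_footnote_lines are hypothetical and would need to be adapted to the real documents.

```python
import re

# Assumed footnote format (not shown in the thread): "<number> <URL>" on its own line,
# e.g. "1 https://example.com/source". Adjust the pattern to the real format.
FOOTNOTE_LINE = re.compile(r"^\s*\d{1,3}\s+https?://\S+\s*$")


def drop_footnote_lines(markdown: str) -> str:
    """Return the Markdown text without the lines that look like footnotes."""
    kept = [line for line in markdown.splitlines() if not FOOTNOTE_LINE.match(line)]
    return "\n".join(kept)


sample = (
    "Some claim in the text.\n"
    "\n"
    "1 https://example.com/source\n"
    "2 https://example.com/other\n"
    "\n"
    "Next paragraph."
)
print(drop_footnote_lines(sample))
```

Anchoring the pattern to the whole line keeps body paragraphs that merely contain a number or a URL from being dropped.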

To remove the references in the text (superscripted numbers) it would be nice if the markdown would contain them as superscripted, so we could just regex all superscripted numbers out of the document.
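
To illustrate the idea: if the superscripts were emitted as inline HTML tags such as <sup>1</sup> (an assumption; the converter discussed here does not produce this), a single substitution would be enough. The name strip_superscript_numbers is hypothetical.

```python
import re

# Assumption: superscripted numbers appear as inline HTML, e.g. "text<sup>1</sup>".
# Only purely numeric superscripts are matched, so other <sup> content is left alone.
SUP_NUMBER = re.compile(r"<sup>\s*\d+\s*</sup>")


def strip_superscript_numbers(markdown: str) -> str:
    """Remove numeric <sup>...</sup> markers from the Markdown text."""
    return SUP_NUMBER.sub("", markdown)


print(strip_superscript_numbers("Revenue grew by 5%<sup>1</sup> last year.<sup>2</sup>"))
```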

Is there an option to activate superscript format in markdown? Or is there a better way to somehow achieve this?

tilal.ahmad · October 25, 2021, 2:54pm

@lars.peyer

I am afraid there is currently no option to control the superscript text in PDF to Markdown conversion. However, we would appreciate it if you could share your existing output and a sample of the expected output. We will look into these and try to provide a solution.

lars.peyer · October 27, 2021, 1:16pm

Thank you for the answer.
I uploaded the PDF and the generated markdown here: example_doc.zip (26.5 KB)

How the expected output should look is a bit hard to tell.
The easiest for our case would be if the numbers 1 to 4 were marked as superscripted (as the documents do not contain other superscripted content), but as you told us, this is not yet implemented.

While constructing an example, I found out that if the references are correctly done using the word-references-feature, the numbers are properly marked:
*This is one block of [*1 ](#_page0_x96.40_y511.95) ...

The problem is that this is not the case for the documents we have, as the numbers are just superscripted text and not “real” references.
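
For documents that do use real Word references, links of the form shown above could be removed with a regex. This is a sketch based only on the single example line in this post: the anchor format (#_page…_x…_y…) is assumed to be consistent, and the name strip_reference_links is hypothetical.

```python
import re

# Matches links like "[*1 ](#_page0_x96.40_y511.95)". An optional "*" before the
# number is treated as part of the link text, as in the example line.
REFERENCE_LINK = re.compile(r"\[\*?\s*\d+\s*\]\(#_page\d+_x[\d.]+_y[\d.]+\)")


def strip_reference_links(markdown: str) -> str:
    """Remove numbered reference links that point to page anchors."""
    return REFERENCE_LINK.sub("", markdown)


line = "*This is one block of [*1 ](#_page0_x96.40_y511.95) ..."
print(strip_reference_links(line))
```

Removing the link also removes the asterisk inside the brackets, so the emphasis markers on the rest of the line may need separate handling.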

Do you have any idea how we could properly find these numbers in the markdown anyway?

Thanks a lot for your help.

tilal.ahmad · October 28, 2021, 1:13am

@lars.peyer

Thanks for sharing the additional information. We are investigating your requirements and will share our findings soon.

tilal.ahmad · October 28, 2021, 11:46am

@lars.peyer

We have logged a ticket WORDSCLOUD-1784 in our issue tracking system for further investigation and resolution. We will keep you updated about the issue resolution progress within this forum thread.

tilal.ahmad · November 17, 2021, 4:40am

@lars.peyer

We have further investigated and found that you can remove superscript text in PDF to Markdown conversion with Python using Aspose.Words Cloud API as follows. Please check the sample code for reference and let us know if it suits your scenario.

1. Convert PDF to Word format
2. Read the footnotes from the Word document
3. Remove the footnotes from the Word document
4. Convert Word to MD format

Unfortunately, the only problem we see right now is that the API does not detect footnotes in a PDF that doesn’t have a horizontal divider above the footnotes. This is the case with your shared sample PDF, so we have logged another ticket, PDF2WORD-961, to fix the issue. Hopefully, this problem will be fixed in the upcoming release, 21.12.

Sample Python Code Remove Superscript Text in PDF to MD Conversion Online

# Import required modules
import asposewordscloud
import asposewordscloud.models.requests
from shutil import copyfile

# Init the client
client_id = 'xxxxxx-xxxx-xxxx-xxxx-xxxxxxxx'
client_secret = 'xxxxxxxxxxxxxxxxxxxxxxxxxx'
base_url = 'https://api.aspose.cloud'
words_api = asposewordscloud.WordsApi(client_id, client_secret, base_url)
words_api.api_client.configuration.host = base_url

# File names
documentName = 'document_with_footnotes'
inputFilePath = f'c:/tmp/{documentName}.pdf'
outputFilePath = f'c:/tmp/{documentName}.md'
tempCloudFolder = "pdf2md"

# Step 1. Upload PDF file to Cloud
# Open the file in a context manager so the handle is closed after the upload
with open(inputFilePath, 'rb') as input_document:
    upload_request = asposewordscloud.models.requests.UploadFileRequest(input_document, f'{tempCloudFolder}/{documentName}.pdf')
    upload_response = words_api.upload_file(upload_request)
print('Step 1: uploaded {}'.format(upload_response.uploaded))

# Step 2. Convert PDF to DOCX in Cloud
save_options = asposewordscloud.SaveOptionsData(save_format='docx', file_name=f'{documentName}.docx')
save_request = asposewordscloud.models.requests.SaveAsRequest(name=f'{documentName}.pdf', save_options_data=save_options, folder=tempCloudFolder)
save_response = words_api.save_as(save_request)
print('Step 2: converted to {}'.format(save_response.save_result.dest_document.href))

# Step 3. Get number of footnotes
get_footnotes_request = asposewordscloud.models.requests.GetFootnotesRequest(name=f'{documentName}.docx', folder=tempCloudFolder)
get_footnotes_response = words_api.get_footnotes(get_footnotes_request)
footnotes_count = len(get_footnotes_response.footnotes.list)
print(f'Step 3: found {footnotes_count} footnotes')

# Step 4. Delete footnotes
# Always delete index 0: the loop runs once per footnote and relies on the
# remaining footnotes moving up after each deletion
for number in range(footnotes_count):
    delete_footnote_request = asposewordscloud.models.requests.DeleteFootnoteRequest(name=f'{documentName}.docx', index=0, folder=tempCloudFolder)
    delete_footnote_response = words_api.delete_footnote(delete_footnote_request)
    print(f'Step 4: deleted footnote n.{number}')

# Step 5. Convert DOCX to MD
save_options = asposewordscloud.SaveOptionsData(save_format='md', file_name=f'{documentName}.md')
save_request = asposewordscloud.models.requests.SaveAsRequest(name=f'{documentName}.docx', save_options_data=save_options, folder=tempCloudFolder)
save_response = words_api.save_as(save_request)
print('Step 5: converted to {}'.format(save_response.save_result.dest_document.href))

# Step 6. Download MD file
download_request = asposewordscloud.models.requests.DownloadFileRequest(path=f'{tempCloudFolder}/{documentName}.md')
download_response = words_api.download_file(download_request)
copyfile(download_response, outputFilePath)
print(f'Step 6: downloaded {outputFilePath}')

lars.peyer · November 18, 2021, 9:26am

Thank you for the script.
The code runs and seems to work with PDFs that have the horizontal divider.
Unfortunately, none of our PDFs have the divider, so I will test this again with the new release.

Thanks a lot for all the effort so far.

aspose.notifier · December 13, 2021, 3:40am

The issues you have found earlier (filed as WORDSCLOUD-1784) have been fixed in this update. This message was posted using Bugs notification tool by Ivanov_John

lars.peyer · January 18, 2022, 1:04pm

Is there already an update or a timeframe for this issue?

tilal.ahmad · January 18, 2022, 6:23pm

@lars.peyer

Yes, we have fixed the case in which the PDF doesn’t have a horizontal divider above the footnotes in the Aspose.Words Cloud 21.12 release. Please let us know if you still face any issues in this regard.

lars.peyer · January 24, 2022, 4:06pm

Thanks for the feedback.

It seems to work with the linked document above.
I just tested it with a second document test.pdf (107.5 KB) and it seems that it does not work yet.

Can you quickly check what the problem with this document is?

tilal.ahmad · January 25, 2022, 4:05am

@lars.peyer

We are sorry for the inconvenience. I have noticed the footnote detection issue with your shared PDF document and logged a ticket (WORDSCLOUD-1904) for further investigation. We will keep you updated about the issue resolution progress in this thread.

aspose.notifier · February 15, 2022, 12:52pm

The issues you have found earlier (filed as WORDSCLOUD-1904) have been fixed in this update. This message was posted using Bugs notification tool by Ivanov_John

lars.peyer · February 23, 2022, 1:36pm

Thanks a lot for fixing the issue. It works super fine on the first page of a document.

If we have more than one page, it seems to find only the footnotes on page 1:
Example_ superscript example.pdf (319.6 KB)

Is this a problem in our PDF or is there a bug in the conversion?

tilal.ahmad · February 24, 2022, 3:25am

@lars.peyer

After an initial investigation, we have logged a ticket WORDSCLOUD-1933 for further investigation and rectification. We will keep you updated about the issue resolution progress within this thread.