Free Support Forum - aspose.cloud

Remove Superscript text in PDF to MD with Python Conversion

I am converting PDF documents to markdown using the API and the python client according to

This works fine, but we have the following situation:
The documents contain several source links / footnotes. They are numbered from 1 to n and always appear at the end of the page.
The number of each footnote is inserted in superscript at the corresponding place in the text.

From the markdown, we import the content into our database. I would like to ignore the footnotes in the import, as they are not relevant for our use case. As each footnote gets its own line in the markdown and follows a strict format, these lines can be selected by regex and ignored in the import.
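A minimal sketch of that filtering step. The footnote format is not shown in this thread, so the pattern below assumes each footnote line is a number followed by a URL; `FOOTNOTE_LINE` and `drop_footnote_lines` are hypothetical names, and the pattern would need to be adapted to the real format.

```python
import re

# Assumption: a footnote line is "<number> <URL>" and nothing else.
# Adjust this pattern to the actual footnote format in your markdown.
FOOTNOTE_LINE = re.compile(r"^\s*\d{1,3}\s+https?://\S+\s*$")


def drop_footnote_lines(markdown: str) -> str:
    """Return the markdown without lines that match FOOTNOTE_LINE."""
    kept = [line for line in markdown.splitlines()
            if not FOOTNOTE_LINE.match(line)]
    return "\n".join(kept)


sample = (
    "Some text with a reference1.\n"
    "\n"
    "1 https://example.com/source\n"
    "2 https://example.com/other"
)
cleaned = drop_footnote_lines(sample)
print(cleaned)
```

This only removes the footnote lines themselves; the reference numbers inside the running text are untouched.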

To remove the references in the text (superscripted numbers), it would be nice if the markdown marked them as superscript, so we could simply strip all superscripted numbers from the document with a regex.

Is there an option to activate superscript format in markdown? Or is there a better way to somehow achieve this?

@lars.peyer

I am afraid there is currently no option to control superscript text in PDF to Markdown conversion. However, we would appreciate it if you could share your existing output and a sample of the expected output. We will look into these and try to provide a solution.

Thank you for the answer.
I uploaded the PDF and the generated markdown here: example_doc.zip (26.5 KB)

How the expected output should look is a bit hard to say.
The easiest for our case would be if the numbers 1 to 4 were marked as superscript (as the documents do not contain other superscripted content), but as you told us, this is not yet implemented.

While constructing an example, I found out that if the references are created properly using the references feature in Word, the numbers are marked correctly:
*This is one block of [*1 ](#_page0_x96.40_y511.95) ...

The problem is that this is not the case for the documents we have, as the numbers are just superscripted text and not “real” references.
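For documents where the references do come out as links like the one above, a regex along these lines could strip them. This is a sketch only: it assumes every reference link targets an anchor of the form `#_page<N>_x<X>_y<Y>`, as in the single example shown, and `REFERENCE_LINK` and `strip_reference_links` are hypothetical names.

```python
import re

# Matches reference links such as "[*1 ](#_page0_x96.40_y511.95)":
# a bracketed number, optionally preceded by "*", linking to a "#_page..." anchor.
# Assumption: all reference links follow this anchor naming.
REFERENCE_LINK = re.compile(r"\[\*?\d+\s*\]\(#_page\d+_x[\d.]+_y[\d.]+\)")


def strip_reference_links(markdown: str) -> str:
    """Remove reference links that point at "#_page..." anchors."""
    return REFERENCE_LINK.sub("", markdown)


sample = "This is one block of [*1 ](#_page0_x96.40_y511.95) text."
cleaned = strip_reference_links(sample)
print(cleaned)
```

Emphasis markers around the removed link and the leftover double space may need a separate cleanup pass.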

Do you have any idea how we could reliably find these numbers in the markdown anyway?

Thanks a lot for your help.

@lars.peyer

Thanks for sharing the additional information. We are investigating your requirements and will share our findings soon.

@lars.peyer

We have logged a ticket WORDSCLOUD-1784 in our issue tracking system for further investigation and resolution. We will keep you updated about the issue resolution progress within this forum thread.

@lars.peyer

We have investigated further and found that you can achieve your requirements using the following steps. Please check the sample code for reference and let us know whether it suits your scenario.

  • Convert PDF to Word format
  • Read the footnotes from the Word document
  • Remove the footnotes from the Word document
  • Convert Word to MD format

Unfortunately, the only problem we see right now is that the API does not detect footnotes in a PDF that has no horizontal divider above the footnotes. This is the case with your shared sample PDF, so we have logged another ticket, PDF2WORD-961, to fix the issue. Hopefully, this problem will be fixed in the upcoming release, 21.12.

# Import required modules
import asposewordscloud
import asposewordscloud.models.requests
from shutil import copyfile

# Init the client
client_id = 'xxxxxx-xxxx-xxxx-xxxx-xxxxxxxx'
client_secret = 'xxxxxxxxxxxxxxxxxxxxxxxxxx'
base_url = 'https://api.aspose.cloud'
words_api = asposewordscloud.WordsApi(client_id, client_secret, base_url)
words_api.api_client.configuration.host = base_url

# File names
documentName = 'document_with_footnotes'
inputFilePath = f'c:/tmp/{documentName}.pdf'
outputFilePath = f'c:/tmp/{documentName}.md'
tempCloudFolder = "pdf2md"

# Step 1. Upload PDF file to Cloud
# Use a context manager so the local file handle is closed after the upload
with open(inputFilePath, 'rb') as input_document:
    upload_request = asposewordscloud.models.requests.UploadFileRequest(input_document, f'{tempCloudFolder}/{documentName}.pdf')
    upload_response = words_api.upload_file(upload_request)
print('Step 1: uploaded {}'.format(upload_response.uploaded))

# Step 2. Convert PDF to DOCX in Cloud
save_options = asposewordscloud.SaveOptionsData(save_format='docx', file_name=f'{documentName}.docx')
save_request = asposewordscloud.models.requests.SaveAsRequest(name=f'{documentName}.pdf', save_options_data=save_options, folder=tempCloudFolder)
save_response = words_api.save_as(save_request)
print('Step 2: converted to {}'.format(save_response.save_result.dest_document.href))

# Step 3. Get number of footnotes
get_footnotes_request = asposewordscloud.models.requests.GetFootnotesRequest(name=f'{documentName}.docx', folder=tempCloudFolder)
get_footnotes_response = words_api.get_footnotes_online(get_footnotes_request)
footnotes_count = len(get_footnotes_response.footnotes.list)
print(f'Step 3: found {footnotes_count} footnotes')

# Step 4. Delete footnotes
# Always delete index 0: after each deletion the remaining footnotes shift down,
# so the first remaining footnote is again at index 0.
for number in range(footnotes_count):
    delete_footnote_request = asposewordscloud.models.requests.DeleteFootnoteRequest(name=f'{documentName}.docx', index=0, folder=tempCloudFolder)
    delete_footnote_response = words_api.delete_footnote(delete_footnote_request)
    print(f'Step 4: deleted footnote n.{number}')

# Step 5. Convert DOCX to MD
save_options = asposewordscloud.SaveOptionsData(save_format='md', file_name=f'{documentName}.md')
save_request = asposewordscloud.models.requests.SaveAsRequest(name=f'{documentName}.docx', save_options_data=save_options, folder=tempCloudFolder)
save_response = words_api.save_as(save_request)
print('Step 5: converted to {}'.format(save_response.save_result.dest_document.href))

# Step 6. Download MD file
download_request = asposewordscloud.models.requests.DownloadFileRequest(path=f'{tempCloudFolder}/{documentName}.md')
download_response = words_api.download_file(download_request)
copyfile(download_response, outputFilePath)
print(f'Step 6: downloaded {outputFilePath}')

Thank you for the script.
The code runs and seems to work with PDFs that have the horizontal divider.
Unfortunately, none of our PDFs have the divider, so I will test this again with the new release.

Thanks a lot for all the effort so far.
