Node.js - How to Extract Paragraph Text from a Word Document with the Aspose.Words REST API

nativepear · February 21, 2023, 10:54am

I am extracting paragraphs from docx documents like so:

const fs = require('fs');
const { WordsApi, GetParagraphsOnlineRequest } = require("asposewordscloud");

// CLIENT_ID and CLIENT_SECRET are placeholders for the account credentials.
const wordsApi = new WordsApi(CLIENT_ID, CLIENT_SECRET);

const requestDocument = fs.createReadStream("test.docx");
const request = new GetParagraphsOnlineRequest({
    document: requestDocument,
});

wordsApi.getParagraphsOnline(request)
    .then((requestResult) => {
        console.log(requestResult);
    })
    .catch((error) => {
        // Log failed requests instead of leaving the rejection unhandled.
        console.error(error);
    });

Here is a sample of what gets returned in a document with three paragraphs, with the second paragraph (sentence one) containing a reference to a footnote at the bottom of the page:

{"paragraphLinkList": [
    {
      "link": {
        "href": "https://api.aspose.cloud/v4.0/words",
        "rel": "self"
      },
      "nodeId": "0.0.0",
      "text": "Testing paragraph number 1."
    },
    {
      "link": {
        "href": "https://api.aspose.cloud/v4.0/words",
        "rel": "self"
      },
      "nodeId": "0.0.1",
      "text": "This is test paragraph 2, sentence 1. Here is the footnote. This is the other part of test paragraph 2, sentence 2."
    },
    {
      "link": {
        "href": "https://api.aspose.cloud/v4.0/words",
        "rel": "self"
      },
      "nodeId": "0.0.1.4.0",
      "text": " Here is the footnote."
    },
    {
      "link": {
        "href": "https://api.aspose.cloud/v4.0/words",
        "rel": "self"
      },
      "nodeId": "0.0.2",
      "text": "Test paragraph 3."
    }
  ]
}

So “Here is the footnote.” is included both in the returned paragraph object (nodeId 0.0.1), and as a separate object (nodeId 0.0.1.4.0).

Is there any way of preventing the footnote from being included in the paragraph?
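
One possible client-side workaround is to post-process the returned list. The sketch below is not an Aspose feature: stripNestedParagraphs is a hypothetical helper, and it assumes that a nested paragraph (such as the footnote) always has a nodeId that extends its parent's nodeId with extra dot-separated segments, and that its text appears verbatim inside the parent's text, as in the sample response above.

```javascript
// Sample shaped like the response in this thread (links omitted for brevity).
const paragraphLinkList = [
  { nodeId: "0.0.0", text: "Testing paragraph number 1." },
  { nodeId: "0.0.1", text: "This is test paragraph 2, sentence 1. Here is the footnote. This is the other part of test paragraph 2, sentence 2." },
  { nodeId: "0.0.1.4.0", text: " Here is the footnote." },
  { nodeId: "0.0.2", text: "Test paragraph 3." },
];

// Hypothetical helper, not part of the SDK.
// Assumption: an entry is "nested" when another listed nodeId is a
// dot-separated prefix of its nodeId (e.g. 0.0.1 and 0.0.1.4.0).
function stripNestedParagraphs(items) {
  const ids = new Set(items.map((p) => p.nodeId));

  // Return the outermost listed ancestor of nodeId, or null if it has none.
  const outermostAncestor = (nodeId) => {
    const parts = nodeId.split(".");
    for (let n = 1; n < parts.length; n++) {
      const candidate = parts.slice(0, n).join(".");
      if (ids.has(candidate)) return candidate;
    }
    return null;
  };

  // Copy the top-level entries so the input is not mutated.
  const kept = new Map();
  for (const p of items) {
    if (outermostAncestor(p.nodeId) === null) kept.set(p.nodeId, { ...p });
  }

  // Remove each nested entry's text from its outermost ancestor.
  // Only the first verbatim occurrence is removed.
  for (const p of items) {
    const ancestorId = outermostAncestor(p.nodeId);
    if (ancestorId === null || !p.text) continue;
    const ancestor = kept.get(ancestorId);
    ancestor.text = ancestor.text.replace(p.text, "");
  }

  return [...kept.values()];
}

const cleaned = stripNestedParagraphs(paragraphLinkList);
console.log(cleaned.map((p) => p.text));
```

With the sample data this leaves three entries, and the second one no longer contains the footnote sentence. If the footnote text also occurs word for word in the paragraph body, the first occurrence is the one removed, which may be the wrong one.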

tilal.ahmad · February 21, 2023, 6:23pm

@nativepear

You can use the GetRangeTextOnline API method to get the paragraph text. It excludes the footnote/endnote text.

nativepear · February 22, 2023, 5:19am

Thank you for the help.

When I do the following:

const request = new GetRangeTextOnlineRequest({
    document: requestDocument,
    rangeStartIdentifier: "id0.0.0"
});

I only get back the text in the first paragraph. How do I get the text of the entire document?

If I do something like this:

rangeEndIdentifier: "id0.100000000.0"

I get the error:

"Element with the identifier specified in rangeStartIdentifier and/or rangeEndIdentifier not found."

I just want to grab the text of the entire document without having to know the IDs of each section beforehand.
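
The only workaround I can think of costs an extra call: fetch the paragraph list first and derive the identifiers from it. A rough sketch, assuming the list comes back in document order and that a range identifier is just "id" followed by the nodeId (as with "id0.0.0" above). rangeIdentifiersFor is a hypothetical helper, not part of the SDK, and I have not verified whether the range includes the text of the end paragraph itself.

```javascript
// Hypothetical helper: derives range identifiers from a paragraphLinkList
// by prefixing the first and last top-level nodeId with "id".
// Assumption: items are in document order.
function rangeIdentifiersFor(items) {
  const ids = items.map((p) => p.nodeId);

  // A nodeId is nested when another listed nodeId is a dot-separated prefix
  // of it (e.g. the footnote paragraph 0.0.1.4.0 inside 0.0.1).
  const isNested = (id) =>
    ids.some((other) => other !== id && id.startsWith(other + "."));

  const topLevel = ids.filter((id) => !isNested(id));
  if (topLevel.length === 0) {
    throw new Error("No paragraphs found");
  }

  return {
    rangeStartIdentifier: "id" + topLevel[0],
    rangeEndIdentifier: "id" + topLevel[topLevel.length - 1],
  };
}

// nodeIds taken from the sample response earlier in this thread.
const range = rangeIdentifiersFor([
  { nodeId: "0.0.0" },
  { nodeId: "0.0.1" },
  { nodeId: "0.0.1.4.0" },
  { nodeId: "0.0.2" },
]);
console.log(range);
```

The two resulting fields could then be passed to GetRangeTextOnlineRequest together with the document stream.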

tilal.ahmad · February 22, 2023, 5:18pm

@nativepear
We have opened the following new ticket(s) in our internal issue tracking system and will deliver their fixes according to the terms mentioned in Free Support Policies.

Issue ID(s): WORDSCLOUD-2243

You can obtain Paid Support Services if you need support on a priority basis, along with direct access to our Paid Support management team.

aspose.notifier · June 6, 2023, 7:21am

The issues you have found earlier (filed as WORDSCLOUD-2243) have been fixed in this update. This message was posted using Bugs notification tool by yaroslaw.ekimov