Comparing two .doc documents results in corrupt comparison file

Yovira · July 19, 2023, 2:18pm

I’ve been trying to compare two .doc documents using the /words/online/put/compareDocument API endpoint. The API responds with a 200 OK status code, yet the response body contains seemingly chaotic content that I’m unable to convert to a .doc document. The response Content-Type is ‘multipart/mixed; boundary="…"’. You can see the response body in the attached image.

I’m using TypeScript to send requests to the API (I write the requests myself, without the library, because of conflicts that the library causes). Is that response correct? If so, how do I convert it into a .doc document? If not, why do I get it, and what does a request to that endpoint look like when using only TypeScript features? In short, how do I get a response file that I can work with?

Thanks for a response in advance

tilal.ahmad · July 19, 2023, 7:17pm

@Yovira

Thanks for your inquiry. We are looking into your requirements and will guide you shortly.

tilal.ahmad · July 20, 2023, 6:23am

@Yovira

Yes, the response is correct. As the API response is multipart, you need to parse it to get the resultant document. For example, please check the sample Node.js code to compare two word documents from the local drive. It will give you an idea of how to resolve your issue.

const { WordsApi, CompareData, CompareDocumentOnlineRequest } = require("asposewordscloud");
const fs = require("fs");

const compare = async () => {
    // Get Client ID and Secret from https://dashboard.aspose.cloud/
    const wordsApi = new WordsApi("xxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx", "xxxxxxxxxxxxxxxxxxxxxxxxx");

    const requestDocument = fs.createReadStream("compareTestDoc1.doc");
    const requestCompareData = new CompareData({
        author: "author"
    });
    const requestComparingDocument = fs.createReadStream("compareTestDoc2.doc");
    const compareRequest = new CompareDocumentOnlineRequest({
        document: requestDocument,
        compareData: requestCompareData,
        comparingDocument: requestComparingDocument
        //destFileName: "CompareDocumentOut.doc"
    });

    // Await the call so that compare() resolves only after the file is written
    // and any failure reaches the .catch() handler below.
    const compareRequestResult = await wordsApi.compareDocumentOnline(compareRequest);

    // body.document is a Map of file name to Buffer; take the first entry.
    const compareOutputDocument = compareRequestResult.body.document.entries().next().value[1];
    fs.writeFileSync("CompareDocumentOut.doc", compareOutputDocument);
};

compare()
    .then(() => {
        console.log("Documents compared successfully");
    })
    .catch((err) => {
        console.log("Error occurred while comparing the documents:", err);
    });

Furthermore, please note that the Aspose.Words Cloud SDK for Node.js ships with .ts files, so you can use the SDK with TypeScript. If you are having issues using it, please share some details. We will look into them and help you use the SDK.

Yovira · July 20, 2023, 7:38am

Thanks for your fast response, @tilal.ahmad. Before opening this question, I had already tried (unsuccessfully) to parse this multipart response. I am also unable to resolve my issue using the code in your response.

So how can I parse the response without using the asposewordscloud library (preferably without using any library)?

And on top of that: why does it say the following inside the response?
{
…
"FileName": "testDest.doc",
"SourceFormat": "Docx",
…
}
Why is the SourceFormat Docx if I’m downloading a .doc document? Is that supposed to be like that?

tilal.ahmad · July 20, 2023, 3:07pm

@Yovira

I am afraid I am not good at typescript. You can google how to parse JSON using typescript. I think it will help you accomplish the task.

The default SourceFormat value is Docx. If you want to compare PDF documents, then you need to set it to PDF.

Yovira · July 20, 2023, 3:24pm

@tilal.ahmad
I know how to parse JSON, but the response I’m getting is not JSON. It’s multipart/mixed content in which the first part is JSON, though that part is not relevant to me. The second part is what matters, as it contains the file. But if I remove everything except the binary data (which I suppose is the file) and save it as a .doc document, Word tells me on opening that the file is corrupted (irreparably). Why is that the case, and how do I fix it? How do I get a usable file?

tilal.ahmad · July 20, 2023, 4:03pm

@Yovira

Please share your working sample code with us. We will try to replicate the issue and investigate it.

Yovira · July 21, 2023, 7:19am

@tilal.ahmad
In the following you can see the current code. accessToken is the access token generated by the https://api.aspose.cloud/connect/token endpoint (and is correct for sure). this.file1.file and this.file2.file are files of type File. The output variable stores my attempt of parsing the response. And as I said: The resulting file is a corrupted msword document.

    const ASPOSE_BASE_URL = "https://api.aspose.cloud/v4.0";

    let httpReq = new XMLHttpRequest();
    let requestData = new FormData();
    
    httpReq.open("PUT", `${ASPOSE_BASE_URL}/words/online/put/compareDocument?destFileName=testDest.doc`, false);
    httpReq.setRequestHeader("Authorization", "Bearer " + accessToken);
    
    requestData.append("Document", this.file1.file);
    requestData.append("CompareData", JSON.stringify({
        "Author": "author",
        "ComparingWithDocument": this.file2.file.name,
        "DateTime": "2015-10-26T00:00:00Z"
    }));
    requestData.append("ComparingDocument", this.file2.file);
    
    httpReq.onload = () => {
        const contentType = httpReq.getResponseHeader("Content-Type");
        const boundary = contentType.split(";")[1].trim().split("=")[1].slice(1, -1);

        let output = httpReq.responseText.split(boundary)[2];
        output = output.slice(output.indexOf("\n")+1, -4);
        output = output.slice(output.indexOf("\n")+1);
        output = output.slice(output.indexOf("\n")+1);
    
        let file = new Blob([output], {type: "application/msword"});
        const downloadLink = document.createElement("a");
        downloadLink.target = "_self";
        const datas = window.URL.createObjectURL(file);
        downloadLink.href = datas;
        downloadLink.download = "testDest";
        document.body.appendChild(downloadLink);
        downloadLink.click();
        document.body.removeChild(downloadLink);
    }
    httpReq.send(requestData);
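
One likely reason the snippet above yields a corrupt file is that httpReq.responseText is a decoded string rather than the raw bytes, so slicing it and wrapping it in a Blob cannot reproduce the original binary. Below is a minimal sketch of splitting a multipart body at the byte level without any library. The names splitMultipart, indexOfBytes and MultipartPart are hypothetical and introduced only for illustration. The sketch assumes CRLF line endings, a blank line after each part's headers, and that the body is available as a Uint8Array (for example via fetch followed by arrayBuffer()).

```typescript
// Hypothetical helpers for illustration; not part of any Aspose SDK.
const encoder = new TextEncoder();
const decoder = new TextDecoder("utf-8");

interface MultipartPart {
  headers: Record<string, string>;
  body: Uint8Array;
}

/** Index of the first occurrence of `needle` in `haystack` at or after `from`, or -1. */
function indexOfBytes(haystack: Uint8Array, needle: Uint8Array, from = 0): number {
  for (let i = Math.max(from, 0); i + needle.length <= haystack.length; i++) {
    let matched = true;
    for (let j = 0; j < needle.length; j++) {
      if (haystack[i + j] !== needle[j]) {
        matched = false;
        break;
      }
    }
    if (matched) return i;
  }
  return -1;
}

/** Split a multipart body into parts without ever treating the content as text. */
function splitMultipart(body: Uint8Array, boundary: string): MultipartPart[] {
  const firstDelimiter = encoder.encode("--" + boundary);
  const nextDelimiter = encoder.encode("\r\n--" + boundary);
  const headerEnd = encoder.encode("\r\n\r\n");
  const parts: MultipartPart[] = [];

  let start = indexOfBytes(body, firstDelimiter);
  while (start !== -1) {
    const afterDelimiter = start + firstDelimiter.length;
    // "--" directly after the delimiter marks the closing boundary.
    if (body[afterDelimiter] === 0x2d && body[afterDelimiter + 1] === 0x2d) break;

    const headerStop = indexOfBytes(body, headerEnd, afterDelimiter);
    if (headerStop === -1) break; // malformed part: no blank line after the headers
    const contentStart = headerStop + headerEnd.length;
    const contentStop = indexOfBytes(body, nextDelimiter, contentStart);
    if (contentStop === -1) break; // truncated body: no further delimiter

    // Only the header lines are decoded as text; the content stays as bytes.
    const headers: Record<string, string> = {};
    const headerText = decoder.decode(body.subarray(afterDelimiter, headerStop));
    for (const line of headerText.split("\r\n")) {
      const colon = line.indexOf(":");
      if (colon > 0) {
        headers[line.slice(0, colon).trim().toLowerCase()] = line.slice(colon + 1).trim();
      }
    }

    parts.push({ headers, body: body.slice(contentStart, contentStop) });
    start = contentStop + 2; // skip the CRLF that precedes the next "--boundary"
  }
  return parts;
}
```

Under these assumptions, the part whose content-disposition names Document would hold the file bytes, which can then be passed to a Blob with type "application/msword". If that part's own content-type is multipart/mixed, as the SDK code later in this thread suggests it can be, the same function may be applied again to the part's body with the inner boundary.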

tilal.ahmad · July 21, 2023, 7:18pm

@Yovira

Thanks for sharing the sample code. We have logged a ticket (WORDSCLOUD-2409) to investigate the problem and will keep you updated on the resolution progress in this forum thread.

Yovira · July 21, 2023, 7:27pm

@tilal.ahmad

Thanks a lot. Looking forward to hearing from you.

tilal.ahmad · July 22, 2023, 5:22am

@Yovira

Certainly, once we make some significant progress towards issue resolution, we will keep you informed.

Yovira · July 24, 2023, 11:17am

@tilal.ahmad

Within what time frame can I expect the issue to be resolved?

tilal.ahmad · July 24, 2023, 5:33pm

@Yovira

We have planned the issue’s investigation for this week and will share our findings with you accordingly.

Yovira · July 28, 2023, 4:10pm

@tilal.ahmad

Hey, is there an update on the issue?

tilal.ahmad · July 28, 2023, 6:32pm

@Yovira

I am afraid the issue is still not resolved. I have asked for an update and will share it with you as soon as possible.

Yovira · August 2, 2023, 9:29am

@tilal.ahmad

We might have an idea of why the issue occurs: the response we get when sending the request without the library is encoded with UTF-8, yet when sending the request with the library, the response is encoded with ANSI. We tried to set the loadEncoding parameter to ansi, but that yields an error response (500, message="‘ansi’ is not a supported encoding name. For information on defining a custom encoding, see the documentation for the Encoding.RegisterProvider method. (Parameter ‘name’)").

How do we change the encoding such that we get back ansi, which we may be able to process?
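
The difference observed here may come less from the server's encoding than from how the client reads the body: the bytes on the wire are presumably the same either way, but reading a binary body as text and then writing that text back out is lossy. A small sketch of that effect follows, using the first eight bytes of the OLE2 container format in which legacy .doc files are stored. The variable and function names are illustrative only.

```typescript
// First eight bytes of an OLE2 compound file, the container used by legacy .doc files.
const original = new Uint8Array([0xd0, 0xcf, 0x11, 0xe0, 0xa1, 0xb1, 0x1a, 0xe1]);

// Reading the body as text amounts to decoding the bytes as UTF-8.
// Byte sequences that are not valid UTF-8 become U+FFFD replacement characters.
const asText = new TextDecoder("utf-8").decode(original);

// Writing that text into a file encodes it as UTF-8 again.
const roundTripped = new TextEncoder().encode(asText);

/** Byte-for-byte comparison of two arrays. */
function sameBytes(a: Uint8Array, b: Uint8Array): boolean {
  return a.length === b.length && a.every((value, index) => value === b[index]);
}

console.log("bytes preserved:", sameBytes(original, roundTripped));
console.log("replacement characters present:", asText.includes("\uFFFD"));
```

If this is what happens in the browser, requesting the body as binary (for example XMLHttpRequest with responseType "arraybuffer" on an asynchronous request, or fetch with arrayBuffer()) avoids the decode step entirely, and no encoding parameter would be needed on the API side.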

tilal.ahmad · August 2, 2023, 5:13pm

@Yovira

Thanks for sharing your findings. We are already working on a solution that uses a plain REST client instead of the Aspose.Words Cloud SDKs, and we will share the sample code with you shortly.

tilal.ahmad · September 27, 2023, 4:17pm

@Yovira

The code below is extracted from the current SDK and gives a detailed description of CompareDocumentOnline response data parsing. Hopefully, it will give you some idea of how to parse the response.

/**
     * create response
     */
    createResponse(_response: Buffer, _headers: http.IncomingHttpHeaders): any {
        const result = new CompareDocumentOnlineResponse();
        const boundary = getBoundary(_headers);
        const parts = parseMultipart(_response, boundary);
        result.model = ObjectSerializer.deserialize(JSON.parse(findMultipartElement(parts, "Model").body.toString()), "DocumentResponse");


        const partDocument = findMultipartElement(parts, "Document");
        result.document = parseFilesCollection(partDocument.body, partDocument.headers);

        return result;
    }

    /**
     * Get boundary for IncomingHttpHeaders
     */
    export function getBoundary(headers: http.IncomingHttpHeaders): string {
        return parseContentType(headers["content-type"]);
    }

    /**
     * Get boundary value from content-type header
     */
    function parseContentType(contentType: string) : string {
        return contentType.split(" ")[1].split("=")[1].slice(1, -1);
    }

    /**
     * Parse multipart response body for given boundary
     */
    export function parseMultipart(body: Buffer, boundary: string): any {
        const allParts = [];

        let partHeaders = [];
        let buffer = [];

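        // Parser states:
        //   UNKNOWN      - before the first boundary line has been seen
        //   PART_HEADERS - reading the header lines of a part
        //   CONTENT      - collecting the body bytes of a part
        //   PART_END     - just past a boundary, waiting for the end of that line
        const UNKNOWN = 0;
        const PART_HEADERS = 1;
        const CONTENT = 4;
        const PART_END = 5;

        let state = UNKNOWN;
        let lastline = '';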

        for (let i = 0; i < body.length; i++) {
            const oneByte = body[i];
            const prevByte = i > 0 ? body[i-1] : null;
            const newLineDetected = ((oneByte === 0x0a) && (prevByte === 0x0d)) ? true : false;
            const newLineChar = ((oneByte === 0x0a) || (oneByte === 0x0d)) ? true : false;

            if(!newLineChar)
                lastline += String.fromCharCode(oneByte);

            if((UNKNOWN === state) && newLineDetected){
                if(("--"+boundary) === lastline){
                    state = PART_HEADERS;
                    lastline = '';
                };
            } else
            if((PART_HEADERS === state) && newLineDetected){
                if (lastline !== '') {
                    partHeaders.push(lastline);
                }
                else {
                    state = CONTENT;
                }
                lastline = '';
            } else  
            if(CONTENT === state){
                if(lastline.length > (boundary.length+4)) lastline='';
                if(((("--" + boundary) === lastline))){              
                    const part = { 
                        headers: partHeaders.reduce((headers, header) => {
                            if (header.indexOf(':') !== -1) {
                                const [ key, value ] = header.split(/:\s+/)
                                headers[key.toLowerCase()] = value
                            }
                            return headers
                            }, {}),
                        body: Buffer.from(buffer.slice(0,buffer.length - lastline.length - 1))
                    };

                    allParts.push(part);

                    buffer = []; lastline = ''; state = PART_END; partHeaders = [];
                } else {
                    buffer.push(oneByte);
                }
                if(newLineDetected) lastline='';
            } else
            if(PART_END === state){
                if(newLineDetected)
                    state = PART_HEADERS;
            }
        }
        return allParts;
    }

    /**
     * Get multipart part by name
     */
    export function findMultipartElement(parts: any[], name: string): any {
        for (const part of parts) {
            const disp = part.headers['content-disposition'];
            const subs = disp.split(';');
            let subn = null;
            subs.forEach(element => {
                if (element.trim().startsWith("name=")) {
                    subn = element.trim().substr(5).replace(new RegExp('"', 'g'), '');
                }
            });
            if (subn === name) {
                return part;
            }
        }

        return null;
    }

    /**
     * Get files collection from Response
     */
    export function parseFilesCollection(response: Buffer, headers: http.IncomingHttpHeaders): Map<string, Buffer> {
        const result = new Map<string, Buffer>();
        if (headers["content-type"]?.startsWith("multipart/mixed")) {
            const boundary = getBoundary(headers);
            const parts = parseMultipart(response, boundary);
            for (const part of parts) {
                const disp = part.headers['content-disposition'];
                const subs = disp.split(';');
                let filename = null;
                subs.forEach(element => {
                    if (element.trim().startsWith("filename=")) {
                        filename = element.trim().substr(9).replace(new RegExp('"', 'g'), '');
                    }
                });
                result.set(filename, part.body);
            };
        }
        else {
            result.set("", response);
        }

        return result;
    }
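
The parseContentType helper above assumes the header has exactly one space after the semicolon and a quoted boundary; a header without the space, or with an unquoted boundary, would not be handled. A slightly more tolerant variant is sketched below. extractBoundary is a hypothetical name rather than an SDK function, and the regular expression covers only the quoted and unquoted forms, not every case the MIME grammar permits.

```typescript
/**
 * Extract the boundary parameter from a Content-Type header value.
 * Returns null when no boundary parameter is present.
 * Hypothetical helper for illustration; not part of the Aspose SDK.
 */
function extractBoundary(contentType: string): string | null {
  // Group 1 captures a quoted boundary, group 2 an unquoted one.
  const match = /;\s*boundary=(?:"([^"]+)"|([^;\s]+))/i.exec(contentType);
  if (!match) return null;
  return match[1] ?? match[2] ?? null;
}

console.log(extractBoundary('multipart/mixed; boundary="abc-123"'));
console.log(extractBoundary("multipart/mixed;boundary=xyz; charset=utf-8"));
```

Returning null instead of throwing leaves it to the caller to decide whether a missing boundary means a non-multipart response or an error.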