How to extract email address from PDF - API example

tim330i · November 2, 2023, 9:27pm

I want to take an arbitrary PDF, pass it to an API, and get back the email address(es) it contains. Can you direct me to the right API, documentation, or example code, please?

Thanks,
Tim

alexey.noskov · November 2, 2023, 9:27pm

@tim330i You can use Aspose.Words and its Find/Replace functionality to achieve this. For example, see the following code, which uses a regular expression to find e-mail addresses in the document:

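using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;
using Aspose.Words;
using Aspose.Words.Replacing;

// Load the PDF as an Aspose.Words document.
Document doc = new Document(@"C:\Temp\in.pdf");

// The callback is invoked for every regex match and records the address.
EmailsCollector collector = new EmailsCollector();
FindReplaceOptions opt = new FindReplaceOptions();
opt.ReplacingCallback = collector;

// The empty replacement is never applied: the callback returns ReplaceAction.Skip.
doc.Range.Replace(new Regex(@"([\w\.\-]+)@([\w\-]+)((\.(\w){2,3})+)"), "", opt);

foreach (string email in collector.EMails)
    Console.WriteLine(email);

// Collects unique e-mail addresses without modifying the document.
class EmailsCollector : IReplacingCallback
{
    public ReplaceAction Replacing(ReplacingArgs args)
    {
        string email = args.Match.Value;
        if (!mEMails.Contains(email))
            mEMails.Add(email);
        return ReplaceAction.Skip;
    }

    public List<string> EMails
    {
        get { return mEMails; }
    }

    private readonly List<string> mEMails = new List<string>();
}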

Please note that Aspose.Words is designed primarily to work with MS Word documents, and loading PDF documents is supported only in the .NET and Python versions of Aspose.Words.

Aspose.PDF is the product designed to work with PDF documents. My colleagues from the Aspose.PDF team will shortly explain how to achieve the same with Aspose.PDF.

asad.ali · November 2, 2023, 9:27pm

@alexey.noskov

With Aspose.PDF for .NET, you can use the sample code below to extract email addresses from a PDF:

using System;
using System.Text.RegularExpressions;
using Aspose.Pdf;
using Aspose.Pdf.Text;

// Load the PDF document
Document pdfDocument = new Document("input.pdf");

// Create a regular expression pattern to match email addresses
string pattern = @"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,4}\b";
Regex regex = new Regex(pattern);

// Create a TextFragmentAbsorber that searches for the pattern
TextFragmentAbsorber absorber = new TextFragmentAbsorber(pattern);

// Treat the search phrase as a regular expression rather than literal text
absorber.TextSearchOptions = new TextSearchOptions(true);

// Search for email addresses on every page of the PDF
pdfDocument.Pages.Accept(absorber);

// Print the matched email addresses
foreach (TextFragment textFragment in absorber.TextFragments)
{
    // Defensive re-check of each fragment against the same pattern
    if (regex.IsMatch(textFragment.Text))
    {
        Console.WriteLine("Email Address: " + textFragment.Text);
    }
}

tim330i · November 2, 2023, 9:27pm

Awesome! Thank you for two great suggestions.

I planned to do this through calls to an API. Is there an option to do that?

Tim

tilal.ahmad · November 3, 2023, 9:19am

@tim330i

You can use the GetText API method of Aspose.PDF Cloud to read email addresses with a regular expression. Please check out the following quick start article to get started with Aspose Cloud APIs and SDKs. Please feel free to contact us if you need any further assistance.
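
Whichever route returns the text, the last step is the same: run a regular expression over the returned string and keep the unique matches. The following is only a minimal sketch of that step using the .NET standard library. It does not call Aspose.PDF Cloud at all (take the actual request format from the Aspose documentation), and the names EmailExtractor and ExtractEmails, the sample text, and the pattern itself are illustrative assumptions rather than part of any Aspose API.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper: the class and method names are illustrative only.
public static class EmailExtractor
{
    // Illustrative pattern; it is a pragmatic filter, not a full RFC 5322 validator.
    private static readonly Regex EmailPattern = new Regex(
        @"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b",
        RegexOptions.Compiled);

    // Returns the unique addresses found in the text, in order of first appearance.
    public static List<string> ExtractEmails(string text)
    {
        // Case-insensitive set so "Tim@Example.com" and "tim@example.com" count once.
        var seen = new HashSet<string>(StringComparer.OrdinalIgnoreCase);
        var result = new List<string>();

        foreach (Match match in EmailPattern.Matches(text))
        {
            if (seen.Add(match.Value))
                result.Add(match.Value);
        }

        return result;
    }

    public static void Main()
    {
        // Stand-in for text returned by whichever API extracted it from the PDF.
        string sample = "Contact tim@example.com or sales@example.org. Tim@Example.com is a repeat.";

        foreach (string email in ExtractEmails(sample))
            Console.WriteLine(email);
        // Prints:
        // tim@example.com
        // sales@example.org
    }
}
```

Deduplicating with a case-insensitive set is a design choice: the same address often appears several times in a document (header, footer, signature) with different capitalisation, and most callers want it only once.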