I want to take an arbitrary PDF, pass it to an API, and get any email address(es) it contains in return. Can you direct me to the right API, documentation, or example code, please?
Thanks,
Tim
@tim330i You can use Aspose.Words and its Find/Replace functionality to achieve this. For example, see the following code, which uses a regular expression to find e-mail addresses in the document:
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;
using Aspose.Words;
using Aspose.Words.Replacing;

// Load the PDF as an Aspose.Words document.
Document doc = new Document(@"C:\Temp\in.pdf");
EmailsCollector collector = new EmailsCollector();
FindReplaceOptions opt = new FindReplaceOptions();
// The callback is invoked for every regex match found in the document.
opt.ReplacingCallback = collector;
doc.Range.Replace(new Regex(@"([\w\.\-]+)@([\w\-]+)((\.(\w){2,3})+)"), "", opt);
foreach (string email in collector.EMails)
    Console.WriteLine(email);

private class EmailsCollector : IReplacingCallback
{
    public ReplaceAction Replacing(ReplacingArgs args)
    {
        // Record each distinct address once.
        string email = args.Match.Value;
        if (!mEMails.Contains(email))
            mEMails.Add(email);
        // Skip the replacement so the document itself is left unchanged.
        return ReplaceAction.Skip;
    }

    public List<string> EMails
    {
        get { return mEMails; }
    }

    private readonly List<string> mEMails = new List<string>();
}
Please note that Aspose.Words is designed primarily to work with MS Word documents, and loading PDF documents is supported only in the .NET and Python versions of Aspose.Words.
Aspose.PDF is the product designed to work with PDF documents. My colleagues from the Aspose.PDF team will shortly explain how to achieve the same result using Aspose.PDF.
With Aspose.PDF for .NET, you can use the sample code below to extract email addresses from a PDF:
using System;
using System.Text.RegularExpressions;
using Aspose.Pdf;
using Aspose.Pdf.Text;

// Load the PDF document
Document pdfDocument = new Document("input.pdf");
// Create a regular expression pattern to match email addresses
string pattern = @"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,4}\b";
Regex regex = new Regex(pattern);
// Create a TextFragmentAbsorber that searches for the pattern
TextFragmentAbsorber absorber = new TextFragmentAbsorber(pattern);
// Treat the search phrase as a regular expression rather than literal text
absorber.TextSearchOptions = new TextSearchOptions(true);
// Search for email addresses on all pages of the PDF
pdfDocument.Pages.Accept(absorber);
// Extract and print the email addresses
foreach (TextFragment textFragment in absorber.TextFragments)
{
    // Re-check each fragment against the pattern before printing it
    if (regex.IsMatch(textFragment.Text))
    {
        Console.WriteLine("Email Address: " + textFragment.Text);
    }
}
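If you already have the document text as a plain string, the matching and de-duplication step can be tried on its own without either library. The following is only a minimal sketch using the standard .NET regex classes: `EmailExtractor` and `Extract` are hypothetical names, not part of any Aspose product, and the pattern is the one from the Aspose.PDF sample above with the top-level domain length relaxed from `{2,4}` to `{2,}`, which is an assumption made here so that longer domains are not cut off.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper (not an Aspose API): pulls distinct e-mail addresses
// out of a plain string using only the .NET base class library.
public static class EmailExtractor
{
    // Pattern from the Aspose.PDF sample above, with the TLD length relaxed
    // to "two or more letters" (an assumption, not from the original sample).
    private static readonly Regex EmailRegex = new Regex(
        @"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b",
        RegexOptions.Compiled);

    public static List<string> Extract(string text)
    {
        // Compare case-insensitively so "Tim@x.com" and "tim@x.com" count once.
        var seen = new HashSet<string>(StringComparer.OrdinalIgnoreCase);
        var result = new List<string>();

        foreach (Match match in EmailRegex.Matches(text))
        {
            // HashSet.Add returns false for duplicates; keep the first spelling seen.
            if (seen.Add(match.Value))
                result.Add(match.Value);
        }
        return result;
    }
}
```

Calling `EmailExtractor.Extract(text)` returns the addresses in the order they first appear. A simple pattern like this will not cover every valid address form, so treat it as a starting point rather than a validator.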
Awesome! Thank you for two great suggestions.
I planned to do this through calls to an API. Is there an option to do that?
Tim