How to parse emails from HTML in Python

In this tutorial we are going to get an idea of how to parse email addresses out of HTML pages using Python.

Python is a scripting language that is easy to get started with, and it is perfect for tasks like parsing emails.

So let’s outline how the parsing approach works:

  1. Initialize a queue of URLs. The first item will be the initial URL.
  2. Initialize a set of already visited URLs to avoid repetitions.
  3. Start parsing the current URL from the queue.
  4. Add the URL to the processed URLs set.
  5. Extract the whole HTML and search for an email pattern using a regex (a short sketch of this step follows the list).
  6. If one or multiple emails were found, write them to CSV.
  7. Loop through the <a> tags found.
  8. Check whether each URL is relative or absolute.
  9. Check if the URL is already in the processed URLs set. If not, add it to the processing queue.
  10. Repeat from step 3.
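
Step 5 is the heart of the approach, so here is a minimal, self-contained sketch of just the regex extraction; the HTML fragment below is made up for illustration, while the pattern is the same one the full script uses:

import re

# a made-up HTML fragment for illustration
html = '<p>Contact us at hello@example.com or sales@example.org</p>'

# the same email pattern used in the full script below
pattern = r"[a-z0-9\.\-+_]+@[a-z0-9\.\-+_]+\.[a-z]+"

emails = set(re.findall(pattern, html, re.I))
print(emails)  # e.g. {'hello@example.com', 'sales@example.org'} (set order may vary)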

Before launching the script, don’t forget to install the required libraries. Note that urllib.parse, collections, re and csv ship with Python’s standard library, so only two packages need installing.

Using the command line do:

pip install requests
pip install beautifulsoup4

Once you have the libraries installed, you can move on to the script itself (it targets Python 3).

from bs4 import BeautifulSoup
import requests
import requests.exceptions
from urllib.parse import urlsplit
from collections import deque
import re
import csv

# initialize the CSV writer and the output file
cw = csv.writer(open("Singa.csv", 'a', newline=''), delimiter=',')

# a queue of urls to be crawled; the first item is the initial url
new_urls = deque(['https://foundersgrid.com/50-singapore-startups/'])

# a set of urls that we have already crawled
processed_urls = set()

# a set of crawled emails
emails = set()

# process urls one by one until we exhaust the queue
while new_urls:

    # extract the next url from the queue
    url = new_urls.popleft()
    # mark it as visited by adding it to the processed urls
    processed_urls.add(url)

    # break the url down into the base url, needed to resolve relative links
    parts = urlsplit(url)
    base_url = "{0.scheme}://{0.netloc}".format(parts)
    path = url[:url.rfind('/') + 1] if '/' in parts.path else url

    # get the url's content and handle exceptions, if any
    try:
        response = requests.get(url)
    except (requests.exceptions.MissingSchema, requests.exceptions.ConnectionError):
        # skip pages with errors
        continue

    # extract all email addresses and add them into the resulting set
    new_emails = set(re.findall(r"[a-z0-9\.\-+_]+@[a-z0-9\.\-+_]+\.[a-z]+", response.text, re.I))
    emails.update(new_emails)
    print(new_emails)
    # write the new emails to the CSV;
    # alternatively you can write the whole emails set to the CSV after parsing
    for em in new_emails:
        cw.writerow([em])

    # create a BeautifulSoup object as a representation of the html page
    soup = BeautifulSoup(response.text, "html.parser")

    # walk through the <a> anchors
    for anchor in soup.find_all("a"):
        # extract the link url from the anchor
        link = anchor.attrs["href"] if "href" in anchor.attrs else ''
        # resolve relative links
        if link.startswith('/'):
            link = base_url + link
        elif not link.startswith('http'):
            link = path + link
        # add the new url to the queue if it has not been enqueued or processed yet
        if link not in new_urls and link not in processed_urls:
            new_urls.append(link)
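
One caveat: on anything but a small site the queue above will practically never empty, so in practice you may want a stopping condition. Here is a minimal sketch of the same queue loop with a page limit; MAX_PAGES is a made-up name for illustration:

from collections import deque

new_urls = deque(['https://foundersgrid.com/50-singapore-startups/'])
processed_urls = set()
MAX_PAGES = 100  # made-up safety limit for illustration

# stop either when the queue is empty or when the limit is reached
while new_urls and len(processed_urls) < MAX_PAGES:
    url = new_urls.popleft()
    processed_urls.add(url)
    # ... fetch, parse and enqueue new links here, exactly as in the full script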

As you can see, parsing emails in Python is a rather simple task.

If you have any questions about this tutorial, you can contact us at [email protected]

Also, if you need assistance with data collection or any other digital service, please let us know.

Don’t forget to share the tutorial and visit us at https://cyberwhale.tech

P.S. In the next tutorial we will discuss how to parse dynamic HTML content using Python.

Update XML node in Python

I like Python because it’s minimalistic and elegant.
Let’s see how to update an XML node using ElementTree.

We’ll use a CD catalog in XML as the data source.

<?xml version="1.0" encoding="iso-8859-1" ?>
<?xml-stylesheet type="text/xsl" href="cdcatalog.xsl"?>
<catalog>
  <cd>
    <title>empire burlesque</title>
    <artist>bob dylan</artist>
    <country>usa</country>
    <company>columbia</company>
    <price>10.90</price>
    <year>1985</year>
  </cd>
  <cd>
    <title>hide your heart</title>
    <artist>bonnie tyler</artist>
    <country>uk</country>
    <company>cbs records</company>
    <price>9.90</price>
    <year>1988</year>
  </cd>
  <cd>
    <title>greatest hits</title>
    <artist>dolly parton</artist>
    <country>usa</country>
    <company>rca</company>
    <price>9.90</price>
    <year>1982</year>
  </cd>
</catalog>

Here is the Python script itself.

import xml.etree.ElementTree as ET

# parse the XML file
tree = ET.parse('catalog_.xml')

# get the root element
root = tree.getroot()
# iterate over each price node (a subchild of every cd node)
for price in root.iter('price'):
    # get the price of the CD and multiply it by 10
    new_price = float(price.text) * 10
    # update the text (value) of the node
    price.text = str(new_price)
    # add an 'updated' attribute to mark the node as updated="yes"
    price.set('updated', 'yes')

# you can also write to the same file if you want to update it in place
tree.write('catalog_new.xml')

And the output is the following:

<catalog>
  <cd>
    <title>empire burlesque</title>
    <artist>bob dylan</artist>
    <country>usa</country>
    <company>columbia</company>
    <price updated="yes">109.0</price>
    <year>1985</year>
  </cd>
  <cd>
    <title>hide your heart</title>
    <artist>bonnie tyler</artist>
    <country>uk</country>
    <company>cbs records</company>
    <price updated="yes">99.0</price>
    <year>1988</year>
  </cd>
  <cd>
    <title>greatest hits</title>
    <artist>dolly parton</artist>
    <country>usa</country>
    <company>rca</company>
    <price updated="yes">99.0</price>
    <year>1982</year>
  </cd>
</catalog>
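
If you only need to update one specific node instead of all of them, ElementTree can also locate it with find() and a simple XPath-like path. A minimal sketch using the same catalog file; the new value and the output file name below are made up for illustration:

import xml.etree.ElementTree as ET

tree = ET.parse('catalog_.xml')
root = tree.getroot()

# find the <price> node of the first <cd> (XPath-like path syntax)
price = root.find('cd/price')
if price is not None:
    # overwrite the value and mark the node as updated
    price.text = '12.50'
    price.set('updated', 'yes')

# 'catalog_single.xml' is just an illustrative output name
tree.write('catalog_single.xml')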

Remove duplicate lines from a file in Scala

How do you remove duplicate lines from a CSV or TXT file?

The answer is quite straightforward: you basically need a BufferedReader and a BufferedWriter, and this works well even for fairly large files, as long as the set of unique lines fits in memory.

import java.io.{BufferedReader, BufferedWriter, FileReader, FileWriter}
import scala.collection.mutable

def removeDuplicatesFromFile(fileName: String): Unit = {

  val reader = new BufferedReader(new FileReader(fileName))
  // LinkedHashSet drops duplicates while keeping the original line order
  val lines = new mutable.LinkedHashSet[String]()
  var line: String = null
  while ({ line = reader.readLine(); line != null }) {
    lines.add(line)
  }
  reader.close()

  val writer = new BufferedWriter(new FileWriter(fileName))
  for (unique <- lines) {
    writer.write(unique)
    writer.newLine()
  }
  writer.close()
}

Top 5 useful Java Libs

Java is an advanced language, but nonetheless there are libs that make life even easier. We would like to share five useful libs to help you with projects of all kinds.

FileUtils – Apache Commons

A small but very useful lib that helps you deal with files. It greatly simplifies working with files, making you more productive and sparing you boilerplate code.

List<String> lines = FileUtils.readLines(new File("myfile.txt"), "UTF-8");

StringUtils – Apache Commons

Another small but powerful library. It has all the string methods you always find yourself missing.

String title = StringUtils.substringBetween(someText, "The", "end");

Jsoup Library

This is arguably the best Java library for parsing HTML and XML, or other markup in general.

Document doc = Jsoup.connect("http://example.com/").get();
String title = doc.title();

OpenCSV

Parsing CSV looks like a trivial task, but it can still cause trouble (quoted fields, separators inside values and so on). OpenCSV is a minimalistic library that helps you with this.

CSVReader reader = new CSVReader(new FileReader("yourfile.csv"));
String[] nextLine;
while ((nextLine = reader.readNext()) != null) {
    // nextLine[] is an array of values from the line
    System.out.println(nextLine[0] + nextLine[1] + "etc...");
}

org.json

You usually do a lot of networking in Java, and what you really need then is a good JSON parser. org.json is a popular and minimalistic Java library for working with JSON data.

String str = "{ \"firstName\": \"Vladimir\", \"age\": 30 }";
JSONObject obj = new JSONObject(str);
String n = obj.getString("firstName");
int a = obj.getInt("age");
System.out.println(n + " " + a);  // prints "Vladimir 30"

We would also point out other libs such as FasterXML’s Jackson, FilenameUtils (also from Apache Commons) and Unirest.

Hope you’ll find these minimalistic Java libs helpful and powerful.

In any case, you can check with us to see if we can help you develop your Java application.

Python networking example

Here is a small example demonstrating GET requests in Python.

pip install requests

And here is the code itself:


# import the requests library
import requests

# prepare the parameters
parameters = {'date': '2000:2010', 'format': 'xml'}

# prepare the URL
url = 'http://api.worldbank.org/countries/br/indicators/SP.POP.TOTL'

# call the get method and save the data into the response
r = requests.get(url, params=parameters)

# print the url with the params included
print(r.url)

# check the status code
status_code = r.status_code

# if the request failed print a message, else print the response headers and text
if status_code != 200:
    print('Request Failed')
else:
    print(r.headers['content-type'])
    # use r.json() for json responses
    print(r.text)
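
If you need JSON instead of XML, the same request can simply ask for format=json and be parsed with r.json(). A minimal sketch, assuming the same World Bank endpoint as above:

import requests

url = 'http://api.worldbank.org/countries/br/indicators/SP.POP.TOTL'
# the same query as above, but asking the API for JSON instead of XML
r = requests.get(url, params={'date': '2000:2010', 'format': 'json'})

if r.status_code == 200:
    data = r.json()  # parse the JSON body into Python lists and dicts
    print(data)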

Who we are

We are Cyber Whale, and we are here to deliver high-quality digital services at affordable prices.

We operate worldwide, making our customers happy.

Some of our services are:

  • Mobile application development
  • Web apps and rich-content web apps
  • Cloud deployment
  • Data mining and business intelligence services
  • Quality assurance and help desk

Visit https://cyberwhale.tech for more info.