tutorial – Cyber Whale – Digital Services blog

December 8, 2023December 11, 2023

Code of conduct at Cyber Whale

A. Basic Rules of Work Ethics

To work at Cyber Whale, it is essential to consider compliance and adhere to necessary norms when dealing with colleagues and clients. Diligent, timely, and clear adherence to client and tech lead preferences ensures the prompt achievement of results with minimal revisions.
Crucial strategies and decisions essential for the company’s operation (in technical, ethical, financial, and organizational terms) are not discussed with clients without notifying and involving the managers.
Every employee in our company can be confident that they will be evaluated solely based on their professional qualities. We stand against discrimination on any grounds and appreciate the individuality, personal stance, and cultural characteristics of each colleague. In case of any observed discrimination within the team, we take immediate measures to protect the rights of the colleague facing discrimination.

B. Confidentiality, Privacy, and Transparency

The company’s policy emphasizes complete transparency and honest feedback with our clients, as well as the clients themselves and employees engaged in relevant projects. At the same time, we highly respect the confidentiality of our colleagues and guarantee that no personal data of colleagues, except those necessary for work activities, will leave the company. You can fully trust both our managers and the clients you work with.
All work-related data handled by company employees is confidential, and all personal data of the employees themselves is private and is not to be disclosed to third parties, except in the case of intra-corporate interactions within the scope of the contract or special legal proceedings. Managers and clients, on their part, are also obligated to adhere to this directive.
Adhere to digital security. When using the internet from a work computer, ensure the safety of corporate data you are working with, whether they are on your computer or directly accessible online through various accounts.
We guarantee transparency in using artificial intelligence technologies in carrying out work tasks. The client must be informed that, in performing tasks such as content generation, coding, or management, we employ AI assistance.
Whenever we collect others’ data, record audio/video materials with colleagues or clients, we always seek the person’s permission. Anything otherwise goes against the values of our company and our clients.

C. Organization of Working Time: General Provisions

We provide a flexible work schedule, allowing the choice of working location (office or remote) and working hours from 10:00 to 19:00, with a one-hour lunch break. Short breaks for rest during working hours are allowed, and a slight adjustment to the boundaries of the working day is permissible.
Communication among colleagues is welcome, but during working hours, focus should be solely on work-related topics, ensuring that a colleague can allocate time to you either immediately or later. By agreement, work-related issues can be discussed until 8 p.m., while other matters are better addressed before 10 a.m. The exception is high urgency, emergency situations, acute health deterioration, etc. Work-related issues are not discussed on weekends (except for compensatory time off or part-time work).
Before taking leave, it is necessary to inform the department head and HR at least 2 weeks in advance, and before resignation, one month in advance. In this case, relevant applications (in 2 copies) should be prepared and signed by the department head or director after submission. Application templates can be obtained from HR.
It is better to submit an application for sick leave than to jeopardize the project and the client with slow and poor-quality work.
Before taking leave, it is necessary to notify the department head and HR in advance, and on the nearest working day, compensate for the time off.
For us, the balance between work and life matters. We do not force our colleagues to live for work, spending more time on it than the regulated hours or tackling unmanageable tasks. We do not obstruct their desire to take a vacation or sick leave. Regular extracurricular events are held to help employees feel the company’s care, relax, enjoy good vibes, and interact with colleagues. We support employees’ desire to appreciate the results of their work at Cyber Whale in both work and non-working hours.
When sending any application to the department head, also notify HR and PM, including placing them in copy when sending an email or message via messenger.
For the most effective project coordination, if you live in the city where the company’s headquarters is located, it is recommended to work at the company’s office regularly, at least once a week. In other cases, rely on the goal-setting of the department head.

D. Organization of Working Time: Daily Provisions

It is important to value each other’s time. Approach colleagues if you are sure that the information you provide will be informative, acceptable, unintrusive, and timely. Strive to structure thoughts clearly and concisely.
Respect for time is one of the reasons why we actively use information search in browsers and with the help of AI. Practice shows that this is an effective strategy that significantly reduces micromanagement, saves managers’ and tech leads’ time, and positively influences colleagues’ ability to ask the right questions and efficiently find the necessary information. It is better to approach the tech lead or manager with well-clarified information and ensure that there are no remaining questions. These questions, compiled in a list, are discussed in subsequent calls or video conferences, after which colleagues return to improving previous tasks or completing new ones.
Project management primarily relies on voice and video communication with the project group or individual colleagues. This allows for clearer conveyance of all project and task nuances, more precise regulation of work, improved coordination, and better time management, eliminating downtime due to lengthy and disorganized text-based discussions.

September 20, 2017

How to write trained Word2Vec model to CSV with DeepLearning4j

I used DeepLearning4j to train word2vec model. Then I had to save the dictionary to CSV so I can run some clustering algorithms on it.

Sounded like a simple task, but it took a while, and here is the code to do this:

   private void writeIndexToCsv(String csvFileName, Word2Vec model) {

        CSVWriter writer = null;
        try {
            writer = new CSVWriter(new FileWriter(csvFileName));
        } catch (IOException e) {
            e.printStackTrace();
        }

        VocabCache<VocabWord> vocCache =  model.vocab();
        Collection<VocabWord> wrds = vocCache.vocabWords();

        for(VocabWord w : wrds) {
            String s = w.getWord();
            System.out.println("Looking into the word:");
            System.out.println(s);
            StringBuilder sb = new StringBuilder();
            sb.append(s).append(",");
            double[] wordVector = model.getWordVector(s);
            for(int i = 0; i < wordVector.length; i++) {
                sb.append(wordVector[i]).append(",");
            }

            writer.writeNext(sb.toString().split(","), false);
        }

        try {
            writer.close();
        } catch (IOException e) {
            e.printStackTrace();
        }

    }

February 1, 2017

How to parse dynamic HTML content using Python

In the previous tutorial we learning how to parse HTML in Python. In the Python tutorial we are going to learn to to parse dynamic HTML content generated by JavaScript, jQuery, Ajax, Angular or other dynamic pages technology.

What’s the problem with parsing dynamic HTML content in Python and in general?

The problem is that when you request contents of a HTML page, you are presented HTML, CSS and scripts returned from the server. If the page is dynamic, what you get is only a couple of scripts that are meant to be interpreted by your browser that, in its turn, will eventually display HTML content for a user.

That leads us to the idea that we should first render the page and then grab its HTML. Also it should take some time to render the page since sometimes the content is quite “heavy” and it takes some time to load it.

So, along with pure Python we should use some kind of UI component and in particular a Web View or some kind of Web frame.

One of the options is to use Qt for Python and to handle page rendering events and another one (which I honestly prefer more) is to use selenium for python.

So, let’s get down to writing some code but before that let’s elaborate and approach.

Open web view with URL.
Wait untill the page is loaded. Often the criteria here is a loaded div of some class.
Grab the rendered HTML.
Process it further using beautiful soup

You will need Chrome Web Driver to run the web view.

Also you will have to install selenium as well as libs from previous tutorial:

pip install selenium

So here is the Python code to parse dynamic content:

#import selenium compnents, urllib, beautiful soup
from bs4 import BeautifulSoup
from selenium import webdriver
from urllib import urlopen
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By


#url - the url to fetch dynamic content from.
#delay - second for web view to wait
#block_name - id of the tag to be loaded as criteria for page loaded state.
def fetchHtmlForThePage(url, delay, block_name):
	#supply the local path of web driver.
	#in this example we use chrome driver
	browser = webdriver.Chrome('/Applications/chromedriver')
	#open the browser with the URL
	#a browser windows will appear for a little while
	browser.get(url)
	try:
	#check for presence of the element you're looking for
		element_present = EC.presence_of_element_located((By.ID, block_name))
		WebDriverWait(browser, delay).until(element_present)

	#unless found, catch the exception
	except TimeoutException:
		print "Loading took too much time!"	

	#grab the rendered HTML
	html = browser.page_source
	#close the browser
	browser.quit()
	#return html
	return html


#call the fetching function we created
html = fetchHtmlForThePage(url, 5, 're-Searchresult')
#grab HTML document
soup = BeautifulSoup(html)
#process it further as you wish.....
#.....
processFetchedUrls(soup, path)

So here how to parse dynamic HTML content generated with JavaScript with the of Python.

Visit us to get help with your Python challenge of let us know if can help you with your digital needs.

February 1, 2017February 1, 2017

How to parse emails from HTML in Python

In this tutorial we are going to get an idea of how to parse emails from HTML using Python.

Python is a scripting language easy to get started and is perfect for tasks like parsing emails.

So let’s elaborate an approach of how parsing works:

Initialize a queue of URLs. The first item will be the initial URL.
Initialize a set of already visited URL to avoid repetitions.
Start parsing the current URL from the queue.
Add the URL to processed URLs.
Extract the whole HTML, search for an email pattern using a regex.
If one or multiple emails were found, write to CSV.
Loop through <a> tags found.
Check if URL is relative or absolute.
Check if URL is already in the processed URLs set. If not, add to the processing queue
Repeat from step 3.

Before launching the script don’t forget to install proper libraries.

Using command line do:

pip install requests

pip install urlparse

pip install csv

pip install beautifulsoup4

Once you have the libraries installed, you’ll be able to check the script.

from bs4 import BeautifulSoup
import requests
import requests.exceptions
from urlparse import urlparse
from urlparse import urlsplit
from collections import deque
import re
import csv

#initialize CSV writer and filename
cw = csv.writer(open("Singa.csv",'a'), delimiter=',')
# a queue of urls, start
new_urls = deque(['https://foundersgrid.com/50-singapore-startups/'])

# a set of urls that we have already crawled
processed_urls = set()

# a set of crawled emails
emails = set()

# process urls one by one until we exhaust the queue
while len(new_urls):

    #extract the last one from queue
	url = new_urls.popleft()
	#mark as visited by adding to proccessed URLs
	processed_urls.add(url)

    # break down the extract the base url to resolve relative links
	parts = urlsplit(url)
	base_url = "{0.scheme}://{0.netloc}".format(parts)
	path = url[:url.rfind('/')+1] if '/' in parts.path else url

    # get url's content
	#handle exception if any
	try:
		response = requests.get(url)
	except (requests.exceptions.MissingSchema, requests.exceptions.ConnectionError):
        # skip pages with errors
		continue

    # extract all email addresses and add them into the resulting set
	new_emails = set(re.findall(r"[a-z0-9\.\-+_]+@[a-z0-9\.\-+_]+\.[a-z]+", response.text, re.I))
	emails.update(new_emails)
	print new_emails
	#write to CSV the new mails.
	#alternatively you can write the emails set to CSV after parsing
	for em in new_emails:
		cw.writerow([em,])

    # create a beutiful soup object as representation of the html page
	soup = BeautifulSoup(response.text)

    # walk through a anchords
	for anchor in soup.find_all("a"):
        # extract link url from the anchor
		link = anchor.attrs["href"] if "href" in anchor.attrs else ''
        # resolve relative links
		if link.startswith('/'):
			link = base_url + link
		elif not link.startswith('http'):
			link = path + link
        # add the new url to the queue if it was not enqueued nor processed yet
		if not link in new_urls and not link in processed_urls:
			new_urls.append(link)

As you can see, parsing emails in Python is rather a simple task.

If you have any questions on this tutorial, you can contact us [email protected]

Also, if you need assistance with data collection or any other digital service, please let us know.

Don’t forget to share the tutorial and visit us at https://cyberwhale.tech

PS. In the next tutorial we will discuss how to parse dynamic HTML content using Python.

January 10, 2017January 10, 2017

Update XML node in Python

I like python because it’s minimalistic and elegant.
Let’s see how to update an XML node using ElementTree.

We use CD catalog in XML as a datasource.

<?xml version="1.0" encoding="iso-8859-1" ?>
<?xml-stylesheet type="text/xsl" href="cdcatalog.xsl"?>
 <catalog>
<cd>
  <title>empire burlesque</title> 
  <artist>bob dylan</artist> 
  <country>usa</country> 
  <company>columbia</company> 
  <price>10.90</price> 
  <year>1985</year> 
  </cd>
 <cd>
  <title>hide your heart</title> 
  <artist>bonnie tyler</artist> 
  <country>uk</country> 
  <company>cbs records</company> 
  <price>9.90</price> 
  <year>1988</year> 
  </cd>
 <cd>
  <title>greatest hits</title> 
  <artist>dolly parton</artist> 
  <country>usa</country> 
  <company>rca</company> 
  <price>9.90</price> 
  <year>1982</year> 
  </cd>
</catalog>

Here is the python script itself.

import xml.etree.ElementTree as ET	

#parse XML file
tree = ET.parse('catalog_.xml')

#get root
root = tree.getroot()
#iterate over each price node (which is subchild of cd node)
for price in root.iter('price'):
	#get the price of CD, multiply 10
	new_price = float(price.text) * 10
	#update the text (value) of the node
	price.text = str(new_price)
	#add 'updated' attribute to mark node updated=yes
	price.set('updated', 'yes')

#can also use the same file if you want to directly update file.
tree.write('catalog_new.xml')

And the output is the following:

<catalog>
<cd>
  <title>empire burlesque</title> 
  <artist>bob dylan</artist> 
  <country>usa</country> 
  <company>columbia</company> 
  <price updated="yes">109.0</price> 
  <year>1985</year> 
  </cd>
 <cd>
  <title>hide your heart</title> 
  <artist>bonnie tyler</artist> 
  <country>uk</country> 
  <company>cbs records</company> 
  <price updated="yes">99.0</price> 
  <year>1988</year> 
  </cd>
 <cd>
  <title>greatest hits</title> 
  <artist>dolly parton</artist> 
  <country>usa</country> 
  <company>rca</company> 
  <price updated="yes">99.0</price> 
  <year>1982</year> 
  </cd>
</catalog>

January 5, 2017January 6, 2017

Python networking example

Here is a small example demonstrating get requests in Python.

pip install requests

And the code itself


import library
import requests

#prepare paramteters
parameters = {'date:':'2000:2010', 'format':'xml'}

#prepare URL
url = 'http://api.worldbank.org/countries/br/indicators/SP.POP.TOTL'

#call get method and save data into the response
r = requests.get(url, params=parameters)

#print the url considering the params
print r.url

#check for status code
statusCode = int(r.status_code)

#if failed request print the mesage, else print response headers and text
if statusCode != 200:
print 'Request Failed'
else:
print r.headers['content-type']
#print r.json() for json requests
print r.text