Windows Subsystem for Linux Development

The Linux shell is very useful for development: it is color coded, easier to manage, and offers more commands. Windows provides WSL (Windows Subsystem for Linux) to run a Linux system inside Windows. The official guide explains how to enable it.

Basically, you need to enable Windows Subsystem for Linux under Windows Features, then pick one of the Linux distributions from the Microsoft Store. The one I recommend is Debian: it is small, and Ubuntu is based on Debian.

After you restart and launch Debian, Linux is ready to go. You are now running a fully operational Linux inside Windows, and it can access your Windows directories (they are mounted under /mnt/c, /mnt/d, and so on).

Installing software is also easier. For git, just run sudo apt install git.

I wanted my IDE (IntelliJ IDEA) to run git through WSL, and I found a bat file from here.

@echo off
rem Forward the git command line that the IDE runs to git inside WSL
setlocal enabledelayedexpansion
set command=%*
rem IntelliJ passes the commit message as a Windows temp-file path;
rem rewrite it to the /mnt/c/... path that WSL can see
set find=C:\Users\%USERNAME%\AppData\Local\Temp\git-commit-msg-.txt
set replace=/mnt/c/Users/%USERNAME%/AppData/Local/Temp/git-commit-msg-.txt
call set command=%%command:!find!=!replace!%%
rem Sysnative bypasses 32-bit file-system redirection so the 64-bit bash.exe is found
echo | C:\Windows\Sysnative\bash.exe -c 'git %command%'

It works beautifully.

This is still far from a complete development setup, since there is more to install. I installed Java on the Windows side, so I can use it more easily without going through WSL every time.

CentOS: install a newer gcc package via yum

CentOS is popular for its stability and security, and it meets all kinds of commercial requirements. However, to maintain that stability, most of its packages are outdated, and it is hard to run programs built against a newer toolchain with the old system gcc.

One option is to compile a newer gcc ourselves; it is not that hard, but it takes time. There is an easier way: thanks to Software Collections, a newer gcc is provided by RHSCL (Red Hat Software Collections) as devtoolset-7. We simply need to enable the repository and install it, which can be done in the following steps. You can find more information here.

  1. Install centos-release-scl from the CentOS repository
    sudo yum install centos-release-scl
    
  2. Enable the RHSCL repository
    sudo yum-config-manager --enable rhel-server-rhscl-7-rpms
    
  3. Install the collection
    sudo yum install devtoolset-7
    
  4. Enable the toolset (this starts a new shell with the devtoolset-7 gcc on the PATH)
    scl enable devtoolset-7 bash
    

You can find other packages on the website as well. I was also able to install Node.js from there, so we do not need to spend time compiling the software ourselves.

Scrapy: grab files from a website

I wanted to download some material from my course website, but there are a lot of files on the server and I did not want to download them one by one. So I tried to write a script that downloads all the files at once.

I found Scrapy, a crawler framework that can extract data from websites.

First of all, install Scrapy. In my case, I use conda instead of pip; it is similar to pip, but conda lets me set up separate environments. The installation command is conda install -c conda-forge scrapy.
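
To check that the installation worked, you can print the version from within Python (Scrapy exposes it as scrapy.__version__):

import scrapy
print(scrapy.__version__) # prints the installed Scrapy version, confirming the import works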

After installing, we can follow this guide. First, I created the project with scrapy startproject yorku, which generates a preconfigured sample Scrapy project. It looks like this:

yorku/
   ├── yorku/
   │    ├── __init__.py
   │    ├── items.py
   │    ├── middlewares.py
   │    ├── pipelines.py
   │    ├── settings.py
   │    └── spiders/
   │        └── __init__.py
   └── scrapy.cfg

It contains only the basic files for setting up a spider. We can generate a spider with the command scrapy genspider <name> <domain name>. I typed scrapy genspider wikieecs wiki.eecs.yorku.ca, which generates a spider that starts like this:

import scrapy
import re
import os
from scrapy.utils.python import to_native_str

class WikieecsSpider(scrapy.Spider):

    name = 'wikieecs'
    allowed_domains = ['wiki.eecs.yorku.ca']
    start_urls = ['http://wiki.eecs.yorku.ca/']

    def parse(self, response):
        pass

We can now write the logic that finds the links to the files. Scrapy starts by requesting the start_urls; when a page finishes loading, it passes the page to the parse function as the response object. From there we can extract the file links, which is easy with the built-in CSS selectors or XPath.

response.css('a::attr(href)') # select the href attribute of every link on the page
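
If you prefer XPath, an equivalent selector looks like this:

response.xpath('//a/@href') # same result with XPath instead of CSS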

It returns a list of selector objects. We can then loop over all the links and pick out the files.

for url in response.css('a::attr(href)'):
    link = url.extract() # get the link text from the selector
    if link.endswith(".pdf"): # if it is a pdf
        yield response.follow(link, self.save_file) # instruct scrapy to download the file

Because of the yield, parse becomes a generator that produces one download request per matching link. If you want to download files with several different extensions, e.g. docx and pptx, you can put them all into a list like this:

ext = ['.pdf', '.ppt', '.pptx', '.doc', '.docx', '.txt', '.v', '.c', '.tar', '.tar.gz', '.zip']
link.endswith(tuple(ext)) # if it ends with any of the ext

That way you can download a specified list of file types. Next we need to write the callback save_file that actually saves the file.

def save_file(self, response):
    filename = response.url.split("/")[-1] # assume the last segment of the URL is the file name
    with open(filename, 'wb') as f: # open a new file
        f.write(response.body)      # write the downloaded content
    self.logger.info('Save file %s', filename) # log the file that was saved

However, our school website requires us to log in before downloading, and it uses basic access authentication, so each request needs an Authorization header. Scrapy supports this out of the box with HttpAuthMiddleware; we just need to set http_user and http_pass in the spider.

# imports omitted
class WikieecsSpider(scrapy.Spider):
    http_user = 'xxxx'
    http_pass = 'xxxx'

    # other definition and functions

Then you will have a working spider. The final code looks like this:

import scrapy
import re
import os
from scrapy.utils.python import to_native_str

class WikieecsSpider(scrapy.Spider):
    http_user = 'xxxx'
    http_pass = 'xxxx'

    name = 'wikieecs'
    allowed_domains = ['wiki.eecs.yorku.ca', 'www.eecs.yorku.ca']
    start_urls = ['https://wiki.eecs.yorku.ca/course_archive/2018-19/W/2011/assignments:start']
    
    ext = ['.pdf', '.ppt', '.pptx', '.doc', '.docx', '.txt', '.v', '.c', '.tar', '.tar.gz', '.zip']

    def parse(self, response):
        for url in response.css('a::attr(href)'):
            link = url.extract()

            if link.endswith(tuple(self.ext)):
                yield response.follow(link, self.save_file)
    
    def save_file(self, response):
        filename = response.url.split("/")[-1]
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.logger.info('Save file %s', filename)

We just need to put all of the URLs into start_urls, and the spider will scan each page and download the matching files.
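
For example, to crawl several course pages in one run (the second URL here is just a hypothetical illustration), the list could look like this:

start_urls = [
    'https://wiki.eecs.yorku.ca/course_archive/2018-19/W/2011/assignments:start',
    'https://wiki.eecs.yorku.ca/course_archive/2018-19/W/2011/labs:start',  # hypothetical second page
]

The spider is then started with scrapy crawl wikieecs.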

Recurrent Neural Networks

\[\begin{align*} \frac{\partial\varepsilon}{\partial\theta}&=\sum_{1 \leq t \leq T} \frac{\partial\varepsilon_{t}}{\partial\theta}\\ \frac{\partial\varepsilon_{t}}{\partial\theta}&=\sum_{1 \leq k \leq t} \left(\frac{\partial\varepsilon_{t}}{\partial x_{t}}\frac{\partial x_{t}}{\partial x_{k}}\frac{\partial^{+}x_{k}}{\partial \theta}\right)\\ \frac{\partial x_{t}}{\partial x_{k}}&=\prod_{t \geq i > k} \frac{\partial x_{i}}{\partial x_{i-1}}=\prod_{t \geq i > k} W_{rec}^{T}\,\mathrm{diag}(\sigma^{'}(x_{i-1})) \end{align*}\]

A small \(W_{rec}\) (\(<1\)) causes vanishing gradients.
A large \(W_{rec}\) (\(>1\)) causes exploding gradients.
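
A minimal numeric sketch of why this happens, with a scalar standing in for \(W_{rec}\) (the values 0.9 and 1.1 are arbitrary):

steps = 50
for w in (0.9, 1.1):          # stand-ins for a "small" and a "large" W_rec
    grad = 1.0
    for _ in range(steps):
        grad *= w             # one multiplication per time step, as in the product above
    print(f"w = {w}: gradient factor after {steps} steps = {grad:.3e}")
# w = 0.9 shrinks toward zero (vanishing); w = 1.1 blows up (exploding)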

Solutions to the exploding and vanishing gradient problems:

  • Exploding Gradient
    • Truncated Backpropagation
    • Penalties
    • Gradient Clipping (see the sketch after this list)
  • Vanishing Gradient
    • Weight Initialization
    • Echo State Networks
    • Long Short-Term Memory Networks (LSTMs)
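
A minimal sketch of gradient clipping by global norm, assuming plain NumPy gradient arrays and an arbitrary threshold:

import numpy as np

def clip_by_global_norm(grads, max_norm=5.0):
    # rescale all gradients together so their combined norm is at most max_norm
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if total_norm > max_norm:
        grads = [g * (max_norm / total_norm) for g in grads]
    return grads

# an exploding gradient is scaled down while its direction is preserved
print(clip_by_global_norm([np.array([30.0, 40.0]), np.array([0.1, -0.2])]))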

LSTM

Set \(W_{rec}\) to 1. This is a good post about it. It basically says that traditional recurrent networks struggle to connect information across long gaps in the context, while LSTMs are explicitly designed to handle these long-term dependencies. An LSTM adds a few gating operations around the activation function, and that allows the cell to carry information across many time steps without the gradient vanishing.
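
For reference, the standard LSTM cell equations (this is the usual textbook formulation, not something taken from the post) are:

\[\begin{align*} f_{t}&=\sigma(W_{f}\cdot[h_{t-1},x_{t}]+b_{f})\\ i_{t}&=\sigma(W_{i}\cdot[h_{t-1},x_{t}]+b_{i})\\ \tilde{C}_{t}&=\tanh(W_{C}\cdot[h_{t-1},x_{t}]+b_{C})\\ C_{t}&=f_{t}\ast C_{t-1}+i_{t}\ast\tilde{C}_{t}\\ o_{t}&=\sigma(W_{o}\cdot[h_{t-1},x_{t}]+b_{o})\\ h_{t}&=o_{t}\ast\tanh(C_{t}) \end{align*}\]

The additive update of the cell state \(C_{t}\) is what keeps the gradient path close to 1, which is the "set \(W_{rec}\) to 1" idea above.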

GTX 1070 cannot work with TensorFlow

I was trying to set up deep learning on my Windows PC. However, after installing Anaconda, I could not use my GPU for training. It shows:

E tensorflow/stream_executor/cuda/cuda_blas.cc:461] failed to run cuBLAS routine cublasSgemm_v2: CUBLAS_STATUS_EXECUTION_FAILED

I searched many sources trying to solve this problem. I tried to set up with this guide, and then I found this more detailed one. The setup was smooth, but I still could not run the training because of the error.

It was a nightmare to track down. Then I found the answer on Stack Overflow.

The solution is to add the following at the top of your code, right after import tensorflow as tf:

config = tf.ConfigProto()
config.gpu_options.allow_growth = True
session = tf.Session(config=config)

It works well once allow_growth is set to true; the GPU can then train the network lightning fast.

UPDATE 2019-04-05: It was working well for some time, but then I found a serious problem with the graphics card: my GTX 1070 crashes everything if memory usage exceeds 3 GB. I noticed it while playing BF5. I have now sent the card in for repair. So if you run into a problem like this, check whether your graphics card is in good condition.