
Python Teaches You: Build a Question-and-Answer Search Engine with BERT in 3 Minutes


Most students have probably heard of the famous BERT algorithm. It is Google's "blockbuster" pre-trained model for NLP: it has broken records on several NLP tasks and achieved state-of-the-art results.

But many deep-learning beginners find the BERT model hard to build and difficult to get started with; an ordinary person might need several days of study just to stand up a model.

No problem. Today we introduce a module that lets you build a question-and-answer search engine based on the BERT algorithm in 3 minutes: the bert-as-service project. This open-source project lets you quickly stand up a BERT service on a multi-GPU machine (fine-tuned models are supported), and multiple clients can use it concurrently.

1. Get ready

Before you start, make sure Python and pip are installed on your computer. If not, install them first.

(Option 1) If you use Python mainly for data analysis, you can install Anaconda directly.

(Option 2) In addition, the VSCode editor is recommended; it has many advantages.

Choose one of the following ways to open a terminal and install the dependencies:
1. Windows: open Cmd (Start → Run → cmd).
2. macOS: open Terminal (Command + Space, then type "Terminal").
3. If you use the VSCode editor or PyCharm, you can use the built-in Terminal directly.

pip install bert-serving-server  # server
pip install bert-serving-client  # client

Please note the server-side version requirements: Python >= 3.5 and TensorFlow >= 1.10.
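
A quick sanity check of both versions before starting can save time. A minimal sketch (note that bert-serving-server runs on the TensorFlow 1.x line, not 2.x):

import sys
import tensorflow as tf

# bert-serving-server requires Python >= 3.5 and TensorFlow >= 1.10
print(sys.version_info)   # expect major=3, minor>=5
print(tf.__version__)     # expect a 1.x version, e.g. 1.10 or later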

In addition, download a pre-trained BERT model; the models are listed at https://github.com/hanxiao/bert-as-service#install.

These pre-trained models can also be downloaded by replying "bert-as-service" on the Python Practical Dictionary official account.

When the download is complete, extract the zip file into a folder, for example /tmp/english_L-12_H-768_A-12/.
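
If you prefer to script the download, here is a minimal sketch using only the standard library; the URL below is the BERT-Base Uncased archive listed in the google-research/bert README, so substitute whichever model you actually chose:

import urllib.request
import zipfile

# BERT-Base Uncased archive (from the google-research/bert README); swap in your model
url = 'https://storage.googleapis.com/bert_models/2018_10_18/uncased_L-12_H-768_A-12.zip'
urllib.request.urlretrieve(url, '/tmp/model.zip')

# extracting creates /tmp/uncased_L-12_H-768_A-12/
with zipfile.ZipFile('/tmp/model.zip') as zf:
    zf.extractall('/tmp')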

2. Basic usage of bert-as-service

After installation, run the following command to start the BERT service:

bert-serving-start -model_dir /tmp/english_L-12_H-768_A-12/ -num_worker=4

-num_worker=4 starts a service with four workers, meaning it can handle up to four concurrent requests. Concurrent requests beyond four are queued in the load balancer until a worker is free.

When the server starts correctly, it prints its configuration and then reports that the workers are ready to serve requests.

Using the client to get sentence encodings

Now you can encode sentences simply, as shown below:

from bert_serving.client import BertClient
bc = BertClient()
bc.encode(['First do it', 'then do it right', 'then do it better'])
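
encode returns a NumPy array with one fixed-length vector per sentence; for a BERT-Base model (H-768) each vector has 768 dimensions. A quick check, assuming the service above is running:

from bert_serving.client import BertClient

bc = BertClient()
vecs = bc.encode(['First do it', 'then do it right', 'then do it better'])
print(vecs.shape)  # (3, 768): one 768-dim vector per sentence for BERT-Base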

As a feature of BERT, you can join a pair of sentences with ||| (with a space on each side) to get the encoding of the sentence pair, for example:

bc.encode(['First do it ||| then do it right'])
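
The pair is still encoded into a single fixed-length vector, which is convenient for sentence-pair tasks. A quick check, assuming the same running service:

pair_vec = bc.encode(['First do it ||| then do it right'])
print(pair_vec.shape)  # (1, 768): one vector for the whole pair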

Using the BERT service remotely

You can also start the service on one (GPU) machine and call it from another (CPU) machine, as shown below:

# on another CPU machine
from bert_serving.client import BertClient
bc = BertClient(ip='xx.xx.xx.xx') # ip address of the GPU machine
bc.encode(['First do it', 'then do it right', 'then do it better'])
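
By default the client connects on ports 5555 and 5556; if the server was started with non-default -port / -port_out values, pass matching ports explicitly. A sketch, with a placeholder IP:

from bert_serving.client import BertClient

# port / port_out must match the -port / -port_out the server was started with
bc = BertClient(ip='xx.xx.xx.xx', port=5555, port_out=5556)
bc.encode(['First do it'])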

3. Build a Q&A search engine

Using bert-as-service, we will find the question in the FAQ list most similar to the one entered by the user and return the corresponding answer.

The FAQ list can also be downloaded by replying "bert-as-service" on the Python Practical Dictionary official account.

First, load all the questions and display some statistics:

import numpy as np

prefix_q = '##### **Q:** '
with open('README.md') as fp:
    # keep only the lines that mark a question in the FAQ file
    questions = [v.replace(prefix_q, '').strip() for v in fp if v.strip() and v.startswith(prefix_q)]
    print('%d questions loaded, avg. len of %d' % (len(questions), np.mean([len(d.split()) for d in questions])))
    # 33 questions loaded, avg. len of 9

A total of 33 questions were loaded, with an average length of 9 words.

Then start a BERT service with the pre-trained model uncased_L-12_H-768_A-12. The client below connects on ports 4000 and 4001, so the server must be started with matching -port / -port_out values:

bert-serving-start -num_worker=1 -model_dir=/data/cips/data/lab/data/model/uncased_L-12_H-768_A-12 -port=4000 -port_out=4001

Next, encode our questions as vectors:

bc = BertClient(port=4000, port_out=4001)  # ports must match the server's -port / -port_out
doc_vecs = bc.encode(questions)  # one 768-dim vector per question

Finally, we are ready to receive user queries and perform a simple "fuzzy" search.

To do this, each time a new query arrives, we encode it as a vector, compute its dot product with doc_vecs, sort the results in descending order, and return the top k similar questions:

topk = 5  # how many similar questions to return

while True:
    query = input('your question: ')
    query_vec = bc.encode([query])[0]
    # compute normalized dot product as score
    score = np.sum(query_vec * doc_vecs, axis=1) / np.linalg.norm(doc_vecs, axis=1)
    topk_idx = np.argsort(score)[::-1][:topk]
    for idx in topk_idx:
        print('> %s\t%s' % (score[idx], questions[idx]))
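
Note that this score normalizes only the document vectors, not the query vector. Because ||query_vec|| is the same for every document, the ranking is identical to full cosine similarity; if you want scores bounded in [-1, 1], divide by the query norm as well:

# full cosine similarity: same ranking, but scores bounded in [-1, 1]
score = (doc_vecs @ query_vec) / (np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(query_vec))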

Done! Now run the code and enter a query to see how this search engine handles fuzzy matching.

The complete code is as follows (it can also be downloaded by replying to the keyword on the official account):


import numpy as np
from bert_serving.client import BertClient
from termcolor import colored
prefix_q = '##### **Q:** '
topk = 5
with open('README.md') as fp:
    questions = [v.replace(prefix_q, '').strip() for v in fp if v.strip() and v.startswith(prefix_q)]
    print('%d questions loaded, avg. len of %d' % (len(questions), np.mean([len(d.split()) for d in questions])))
with BertClient(port=4000, port_out=4001) as bc:
    doc_vecs = bc.encode(questions)
    while True:
        query = input(colored('your question: ', 'green'))
        query_vec = bc.encode([query])[0]
        # compute normalized dot product as score
        score = np.sum(query_vec * doc_vecs, axis=1) / np.linalg.norm(doc_vecs, axis=1)
        topk_idx = np.argsort(score)[::-1][:topk]
        print('top %d questions similar to "%s"' % (topk, colored(query, 'green')))
        for idx in topk_idx:
            print('> %s\t%s' % (colored('%.1f' % score[idx], 'cyan'), colored(questions[idx], 'yellow')))

Simple enough? Of course, this is just a simple example of building a QA search model with a pre-trained BERT model.

You can also fine-tune the model so that it performs even better overall: put your data in a directory, then run run_classifier.py (from the google-research/bert repository) to fine-tune the model.

