Connect with us

AI

Build a work-from-home posture tracker with AWS DeepLens and GluonCV

Working from home can be a big change to your ergonomic setup, which can make it hard for you to keep a healthy posture and take frequent breaks throughout the day. To help you maintain good posture and have fun with machine learning (ML) in the process, this post shows you how to build a […]

Published

on

Working from home can be a big change to your ergonomic setup, which can make it hard for you to keep a healthy posture and take frequent breaks throughout the day. To help you maintain good posture and have fun with machine learning (ML) in the process, this post shows you how to build a posture tracker project with AWS DeepLens, the AWS programmable video camera for developers to learn ML. You will learn how to use the latest pose estimation ML models from GluonCV to map out body points from profile images of yourself working from home and send yourself text message alerts whenever your code detects bad posture. GluonCV is a computer vision library built on top of the Apache MXNet ML framework that provides off-the-shelf ML models from state-of-the-art deep learning research. With the ability run GluonCV models on AWS DeepLens, engineers, researchers, and students can quickly prototype products, validate new ideas, and learn computer vision. In addition to detecting bad posture, you will learn to analyze your posture data over time with Amazon QuickSight, an AWS service that lets you easily create and publish interactive dashboards from your data.

This tutorial includes the following steps:

  1. Experiment with AWS DeepLens and GluonCV
  2. Classify postures with the GluonCV pose key points
  3. Deploy pre-trained GluonCV models to AWS DeepLens
  4. Send text message reminders to stretch when the tracker detects bad posture
  5. Visualize your posture data over time with Amazon QuickSight

The following diagram shows the architecture of our posture tracker solution.

Prerequisites

Before you begin this tutorial, make sure you have the following prerequisites:

  • An AWS account
  • An AWS DeepLens device. Available on the following Amazon websites:

Experimenting with AWS DeepLens and GluonCV

Normally, AWS developers use Jupyter notebooks hosted in Amazon SageMaker to experiment with GluonCV models. Jupyter notebook is an open-source web application that allows you to create and share documents that contain live code, equations, visualizations, and narrative text. In this tutorial you are going to create and run Jupyter notebooks directly on an AWS DeepLens device, just like any other Linux computer, in order to enable rapid experimentation.

Starting with version AWS DeepLens software version 1.4.5, you can run GluonCV pretrained models directly on AWS DeepLens. To check the version number and update your software, go to the AWS DeepLens console, under Devices select your DeepLens device, and look at the Device status section. You should see the version number similar to the following screenshot.

To start experimenting with GluonCV models on DeepLens, complete the following steps:

  1. SSH into your AWS DeepLens device.

To do so, you need the IP address of AWS DeepLens on the local network. To find the IP address, select your device on the AWS DeepLens console. Your IP address is listed in the Device Details section.

You also need to make sure that SSH is enabled for your device. For more information about enabling SSH on your device, see View or Update Your AWS DeepLens 2019 Edition Device Settings.

Open a terminal application on your computer. SSH into your DeepLens by entering the following code into your terminal application:

ssh [email protected]<YOUR_DEEPLENS_IP>

When you see a password prompt, enter the SSH password you chose when you set up SSH on your device.

  1. Install Jupyter notebook and GluonCV on your DeepLens. Enter each of the following commands one at a time in the SSH terminal. Press Enter after each line entry.
    sudo python3 -m pip install –-upgrade pip sudo python3 -m pip install notebook sudo python3.7 -m pip install ipykernel python3.7 -m ipykernel install --name 'Python3.7' --user sudo python3.7 -m pip install gluoncv
    

  2. Generate a default configuration file for Jupyter notebook:
    jupyter notebook --generate-config

  3. Edit the Jupyter configuration file in your SSH session to allow access to the Jupyter notebook running on AWS DeepLens from your laptop.
    nano ~/.jupyter/jupyter_notebook_config.py

  4. Add the following lines to the top of the config file:
    c.NotebookApp.ip = '0.0.0.0'
    c.NotebookApp.open_browser = False
    

  5. Save the file (if you are using the nano editor, press Ctrl+X and then Y).
  6. Open up a port in the AWS DeepLens firewall to allow traffic to Jupyter notebook. See the following code:
    sudo ufw allow 8888

  7. Run the Jupyter notebook server with the following code:
    jupyter notebook

    You should see output like the following screenshot:

  8. Copy the link and replace the IP portion (DeepLens or 127.0.0.1). See the following code:
    http://(DeepLens or 127.0.0.1):8888/?token=sometoken

    For example, the URL based on the preceding screenshot is http://10.0.0.250:8888/?token=7adf9c523ba91f95cfc0ba3cacfc01cd7e7b68a271e870a8.

  9. Enter this link into your laptop web browser.

You should see something like the following screenshot.

  1. Choose New to create a new notebook.
  2. Choose Python3.7.

Capturing a frame from your camera

To capture a frame from the camera, first make sure you aren’t running any projects on AWS DeepLens.

  1. On the AWS Deeplens console, go to your device page.
  2. If a project is deployed, you should see a project name in the Current Project pane. Choose Remove Project if there is a project deployed to your AWS DeepLens.
  3. Now go back to the Jupyter notebook running on your AWS DeepLens, enter the following code into your first code cell:
    import awscam
    import cv2 ret,frame = awscam.getLastFrame()
    print(frame.shape)
    

  4. Press Shift+Enter to execute the code inside the cell.

Alternatively, you can press the Run button in the Jupyter toolbar as shown in the screenshot below:

You should see the size of the image captured by AWS DeepLens similar to the following text:

(1520, 2688, 3)

The three numbers show the height, width, and number of color channels (red, green, blue) of the image.

  1. To view the image, enter the following code in the next code cell:
    %matplotlib inline
    from matplotlib import pyplot as plt
    plt.imshow(frame)
    plt.show()
    

    You should see an image similar to the following screenshot:

Detecting people and poses

Now that you have an image, you can use GluonCV pre-trained models to detect people and poses. For more information, see Predict with pre-trained Simple Pose Estimation models from the GluonCV model zoo.

  1. In a new code cell, enter the following code to import the necessary dependencies:
    import mxnet as mx
    from gluoncv import model_zoo, data, utils
    from gluoncv.data.transforms.pose import detector_to_simple_pose, heatmap_to_coord
    

  2. You load two pre-trained models, one to detect people (yolo3_mobilenet1.0_coco) in the frame and one to detect the pose (simple_pose_resnet18_v1b) for each person detected. To load the pre-trained models, enter the following code in a new code cell:
    people_detector = model_zoo.get_model('yolo3_mobilenet1.0_coco', pretrained=True)
    pose_detector = model_zoo.get_model('simple_pose_resnet18_v1b', pretrained=True)
    

  3. Because the yolo_mobilenet1.0_coco pre-trained model is trained to detect many types of objects in addition to people, the code below narrows down the detection criteria to just people so that the model runs faster. For more information about the other types of objects that the model can predict, see the GluonCV MSCoco Detection source code.
    people_detector.reset_class(["person"], reuse_weights=['person'])

  4. The following code shows how to use the people detector to detect people in the frame. The outputs of the people detector are the class_IDs (just “person” in this use case because we’ve limited the model’s search scope), the confidence scores, and a bounding box around each person detected in the frame.
    img = mx.nd.array(frame)
    x, img = data.transforms.presets.ssd.transform_test(img, short=256)
    class_IDs, scores, bounding_boxs = people_detector(x)
    

  5. Enter the following code to feed the results from the people detector into the pose detector for each person found. Normally you need to use the bounding boxes to crop out each person found in the frame by the people detector, then resize each cropped person image into appropriately sized inputs for the pose detector. Fortunately GluonCV comes with a detector_to_simple_pose function that takes care of cropping and resizing for you.
    pose_input, upscale_bbox = detector_to_simple_pose(img, class_IDs, scores, bounding_boxs) predicted_heatmap = pose_detector(pose_input)
    pred_coords, confidence = heatmap_to_coord(predicted_heatmap, upscale_bbox)
    

  6. The following code overlays the results of the pose detector onto the original image so you can visualize the result:
    ax = utils.viz.plot_keypoints(img, pred_coords, confidence, class_IDs, bounding_boxs,scores, box_thresh=0.5, keypoint_thresh=0.2)
    plt.show(ax)

After completing steps 1-6, you should see an image similar to the following screenshot.

If you get an error similar to the ValueError output below, make sure you have at least one person in the camera’s view.

ValueError: In HybridBlock, there must be one NDArray or one Symbol in the input. Please check the type of the args

So far, you experimented with a pose detector on AWS DeepLens using Jupyter notebooks. You can now collect some data to figure out how to detect when someone is hunching, sitting, or standing. To collect data, you can save the image frame from the camera out to disk using the built-in OpenCV module. See the following code:

cv2.imwrite('output.jpg', frame)

Classifying postures with the GluonCV pose key points

After you have collected a few samples of different postures, you can start to detect bad posture by applying some rudimentary rules.

Understanding the GluonCV pose estimation key points

The GluonCV pose estimation model outputs 17 key points for each person detected. In this section, you see how those points are mapped to human body joints and how to apply simple rules to determine if a person is sitting, standing, or hunching.

This solution makes the following assumptions:

  • The camera sees your entire body from head to toe, regardless of whether you are sitting or standing
  • The camera sees a profile view of your body
  • No obstacles exist between camera and the subject

The following is an example input image. We’ve asked the actor in this image to face the camera instead of showing the profile view to illustrate the key body joints produced by the pose estimation model.

The following image is the output of the model drawn as lines and key points onto the input image. The cyan rectangle shows where the people detector thinks a person is in the image.

The following code shows the raw results of the pose detector. The code comments show how each entry maps to point on the a human body:

array([[142.96875, 84.96875],# Nose [152.34375, 75.59375],# Right Eye [128.90625, 75.59375],# Left Eye [175.78125, 89.65625],# Right Ear [114.84375, 99.03125],# Left Ear [217.96875, 164.65625],# Right Shoulder [ 91.40625, 178.71875],# Left Shoulder [316.40625, 197.46875],# Right Elblow [ 9.375 , 232.625 ],# Left Elbow [414.84375, 192.78125],# Right Wrist [ 44.53125, 244.34375],# Left Wrist [199.21875, 366.21875],# Right Hip [128.90625, 366.21875],# Left Hip [208.59375, 506.84375],# Right Knee [124.21875, 506.84375],# Left Knee [215.625 , 570.125 ],# Right Ankle [121.875 , 570.125 ]],# Left Ankle

Deploying pre-trained GluonCV models to AWS DeepLens

In the following steps, you convert your code written in the Jupyter notebook to an AWS Lambda inference function to run on AWS DeepLens. The inference function optimizes the model to run on AWS DeepLens and feeds each camera frame into the model to get predictions.

This tutorial provides an example inference Lambda function for you to use. You can also copy and paste code sections directly from the Jupyter notebook you created earlier into the Lambda code editor.

Before creating the Lambda function, you need an Amazon Simple Storage Service (Amazon S3) bucket to save the results of your posture tracker for analysis in Amazon QuickSight. If you don’t have an Amazon S3 Bucket, see How to create an S3 bucket.

To create a Lambda function to deploy to AWS DeepLens, complete the following steps:

  1. Download aws-deeplens-posture-lambda.zip onto your computer.
  2. On the Lambda console, choose Create Function.
  3. Choose Author from scratch and choose the following options:
    1. For Runtime, choose Python 3.7.
    2. For Choose or create an execution role, choose Use an existing role.
    3. For Existing role, enter service-role/AWSDeepLensLambdaRole.
  4. After you create the function, go to function’s detail page.
  5. For Code entry type¸ choose Upload zip.
  6. Upload the aws-deeplens-posture-lambda.zip you downloaded earlier.
  7. Choose Save.
  8. In the AWS Lambda code editor, select the lambda_funtion.py file and enter an Amazon S3 bucket where you want to store the results.
    S3_BUCKET = '<YOUR_S3_BUCKET_NAME>'

  9. Choose Save.
  10. From the Actions drop-down menu, choose Publish new version.
  11. Enter a version number and choose Publish. Publishing the function makes it available on the AWS DeepLens console so you can add it to your custom project.
  12. Give your AWS DeepLens Lambda function permissions to put files in the Amazon S3 bucket. Inside your Lambda function editor, click on Permissions, then click on the AWSDeepLensLambda role name.
  13. You will be directed to the IAM editor for the AWSDeepLensLambda role. Inside the IAM role editor, click Attach Policies.
  14. Type in S3 to search for the AmazonS3 policy and check the AmazonS3FullAccess policy. Click Attach Policy.

Understanding the Lambda function

This section walks you through some important parts of the Lambda function.

You load the GluonCV model with the following code:

detector = model_zoo.get_model('yolo3_mobilenet1.0_coco', pretrained=True, root='/opt/awscam/artifacts/')
pose_net = model_zoo.get_model('simple_pose_resnet18_v1b', pretrained=True, root='/opt/awscam/artifacts/') # Note that we can reset the classes of the detector to only include
# human, so that the NMS process is faster. detector.reset_class(["person"], reuse_weights=['person'])

You run the model frame-per-frame over the images from the camera with the following code:

ret, frame = awscam.getLastFrame()
img = mx.nd.array(frame)
x, img = data.transforms.presets.ssd.transform_test(img, short=200) class_IDs, scores, bounding_boxs = detector(x)
pose_input, upscale_bbox = detector_to_simple_pose(img, class_IDs, scores, bounding_boxs) predicted_heatmap = pose_net(pose_input)
pred_coords, confidence = heatmap_to_coord(predicted_heatmap, upscale_bbox)

The following code shows you how to send the text prediction results back to the cloud. Viewing the text results in the cloud is a convenient way to make sure the model is working correctly. Each AWS DeepLens device has a dedicated iot_topic automatically created to receive the inference results.

# Send the top k results to the IoT console via MQTT
cloud_output = { 'boxes': bounding_boxs, 'box_scores': scores, 'coords': pred_coords, 'coord_scors': confidence }
client.publish(topic=iot_topic, payload=json.dumps(cloud_output))

Using the preceding key points, you can apply the geometric rules shown in the following sections to calculate angles between the body joints to determine if the person is sitting, standing, or hunching. You can change the geometric rules to suit your setup. As a follow-up activity to this tutorial, you can collect the pose data and train a simple ML model to more accurately predict when someone is standing or sitting.

Sitting vs. Standing

To determine if a person is standing or sitting, use the angle between the horizontal (ground) and the line connecting the hip and knee.

Hunching

When a person hunches, their head is typically looking down and their back is crooked. You can use the angles between the ear and shoulder and the shoulder and hip to determine if someone is hunching. Again, you can modify these geometric rules as you see fit. The following code inside the provided AWS DeepLens Lambda function determines if a person is hunching:

def hip_and_hunch_angle(left_array): ''' :param left_array: pass in the left most coordinates of a person , should be ok, since from side left and right overlap :return: ''' # hip to knee angle hipX = left_array[-2][0] - left_array[-3][0] hipY = left_array[-2][1] - left_array[-3][1] # hunch angle = (hip to shoulder ) - (shoulder to ear ) # (hip to shoulder ) hunchX1 = left_array[-3][0] - left_array[-6][0] hunchY1 = left_array[-3][1] - left_array[-6][1] ang1 = degrees(atan2(hunchY1, hunchX1)) # (shoulder to ear) hunchX2 = left_array[-6][0] - left_array[-7][0] hunchY2 = left_array[-6][1] - left_array[-7][1] ang2 = degrees(atan2(hunchY2, hunchX2)) return degrees(atan2(hipY, hipX)), abs(ang1 - ang2) def sitting_and_hunching(left_array): hip_ang, hunch_ang = hip_and_hunch_angle(left_array) if hip_ang < 25 or hip_ang > 155: print("sitting") hip = 0 else: print("standing") hip = 1 if hunch_ang < 3: print("no hunch") hunch = 0 else: hunch = 1 return hip, hunch

Deploying the Lambda inference function to your AWS DeepLens device

To deploy your Lambda inference function to your AWS DeepLens device, complete the following steps:

  1. On the AWS DeepLens console, under Projects, choose Create new project.
  2. Choose Create a new blank project.
  3. For Project name, enter posture-tracker.
  4. Choose Add model.

To deploy a project, AWS DeepLens requires you to select a model and a Lambda function. In this tutorial, you are downloading the GluonCV models directly onto AWS DeepLens from inside your Lambda function so you can choose any existing model on the AWS DeepLens console to be deployed. The model selected on the AWS DeepLens console only serves as a stub and isn’t be used in the Lambda function. If you don’t have an existing model, deploy a sample project and select the sample model.

  1. Choose Add function.
  2. Choose the Lambda function you created earlier.
  3. Choose Create.
  4. Select your newly created project and choose Deploy to device.
  5. On the Target device page, select your device from the list.
  6. Choose Review.
  7. On the Review and deploy page, choose Deploy.

To verify that the project has deployed successfully, you can check the text prediction results sent back to the cloud via AWS IoT Greengrass. For instructions on how to view the text results, see Viewing text output of custom model in AWS IoT Greengrass.

In addition to the text results, you can view the pose detection results overlaid on top of your AWS DeepLens live video stream. For instructions on viewing the live video stream, see Viewing AWS DeepLens Output Streams.

The following screenshot shows what you will see in the project stream:

Sending text messages to reminders to stand and stretch

In this section, you use Amazon Simple Notification Service (Amazon SNS) to send reminder text messages when your posture tracker determines that you have been sitting or hunching for an extended period of time.

  1. Register a new SNS topic to publish messages to.
  2. After you create the topic, copy and save the topic ARN, which you need to refer to in the AWS DeepLens Lambda inference code.
  3. Subscribe your phone number to receive messages posted to this topic.

Amazon SNS sends a confirmation text message before your phone number can receive messages.

You can now change the access policy for the SNS topic to allow AWS DeepLens to publish to the topic.

  1. On the Amazon SNS console, choose Topics.
  2. Choose your topic.
  3. Choose Edit.
  4. On the Access policy tab, enter the following code:
    { "Version": "2008-10-17", "Id": "lambda_only", "Statement": [ { "Sid": "allow-lambda-publish", "Effect": "Allow", "Principal": { "Service": "lambda.amazonaws.com" }, "Action": "sns:Publish", "Resource": "arn:aws:sns:us-east-1:your-account-no:your-topic-name", "Condition": { "StringEquals": { "AWS:SourceOwner": "your-AWS-account-no" } } } ]
    }
    

  5. Update the AWS DeepLens Lambda function with the ARN for the SNS topic. See the following code:
    def publishtoSNSTopic(SittingTime=None, hunchTime=None): sns = boto3.client('sns') # Publish a simple message to the specified SNS topic response = sns.publish( TopicArn='arn:aws:sns:us-east-1:xxxxxxxxxx:deeplenspose', # update topic arn Message='Alert: You have been sitting for {}, Stand up and stretch, and you have hunched for {}'.format( SittingTime, hunchTime), ) print(SittingTime, hunchTime)
    

Visualizing your posture data over time with Amazon QuickSight

This next section shows you how to visualize your posture data with Amazon QuickSight. You first need to store the posture data in Amazon S3.

Storing the posture data in Amazon S3

The following code example records posture data one time every second; you can adjust this interval to suit your needs. The code writes the records to a CSV file every 60 seconds and uploads the results to the Amazon S3 bucket you created earlier.

 if len(physicalList) > 60: try: with open('/tmp/temp2.csv', 'w') as f: writer = csv.writer(f) writer.writerows(physicalList) physicalList = [] write_to_s3('/tmp/temp2.csv', S3_BUCKET, "Deeplens-posent/gluoncvpose/physicalstate-" + datetime.datetime.now().strftime( "%Y-%b-%d-%H-%M-%S") + ".csv") except Exception as e: print(e)

Your Amazon S3 bucket now starts to fill up with CSV files containing posture data. See the following screenshot.

Using Amazon QuickSight

You can now use Amazon QuickSight to create an interactive dashboard to visualize your posture data. First, make sure that Amazon QuickSight has access to the S3 bucket with your pose data.

  1. On the Amazon QuickSight console, from the menu bar, choose Manage QuickSight.
  2. Choose Security & permissions.
  3. Choose Add or remove.
  4. Select Amazon S3.
  5. Choose Select S3 buckets.
  6. Select the bucket containing your pose data.
  7. Choose Update.
  8. On the Amazon QuickSight landing page, choose New analysis.
  9. Choose New data set.

You see a variety of options for data sources.

  1. Choose S3.

A pop-up window appears that asks for your data source name and manifest file. A manifest file tells Amazon QuickSight where to look for your data and how your dataset is structured.

  1. To build a manifest file for your posture data files in Amazon S3, open your preferred text editor and enter the following code:
    { "fileLocations": [ { "URIPrefixes": ["s3://YOUR_BUCKET_NAME/FOLDER_OF_POSE_DATA" ] } ], "globalUploadSettings": { "format": "CSV", "delimiter": ",", "textqualifier": "'", "containsHeader": "true" } }

  2. Save the text file with the name manifest.json.
  3. In the New S3 data source window, select Upload.
  4. Upload your manifest file.
  5. Choose Connect.

If you set up the data source successfully, you see a confirmation window like the following screenshot.

To troubleshoot any access or permissions errors, see How do I allow Amazon QuickSight access to my S3 bucket when I have a deny policy?

  1. Choose Visualize.

You can now experiment with the data to build visualizations. See the following screenshot.

The following bar graphs show visualizations you can quickly make with the posture data.

For instructions on creating more complex visualizations, see Tutorial: Create an Analysis.

Conclusion

In this post, you learned how to use Jupyter notebooks to prototype with AWS DeepLens, deploy a pre-trained GluonCV pose detection model to AWS DeepLens, send text messages using Amazon SNS based on triggers from the pose model, and visualize the posture data with Amazon QuickSight. You can deploy other GluonCV pre-trained models to AWS DeepLens or replace the hard-coded rules for classifying standing and sitting positions with a robust machine learning model. You can also dive deeper with Amazon QuickSight to reveal posture patterns over time.

For a detailed walkthrough of this tutorial and other tutorials, sample code, and project ideas with AWS DeepLens, see AWS DeepLens Recipes.


About the Authors

Phu Nguyen is a Product Manager for AWS DeepLens. He builds products that give developers of any skill level an easy, hands-on introduction to machine learning.

Raj Kadiyala is an AI/ML Tech Business Development Manager in AWS WWPS Partner Organization. Raj has over 12 years of experience in Machine Learning and likes to spend his free time exploring machine learning for practical every day solutions and staying active in the great outdoors of Colorado.

Source: https://aws.amazon.com/blogs/machine-learning/build-a-work-from-home-posture-tracker-with-aws-deeplens-and-gluoncv/

AI

10 Advanced Features to Make your Dating App Successful

Online dating is all the rage; people all over the world are using dating applications and websites to find people for dating, love, and romance. […]

The post 10 Advanced Features to Make your Dating App Successful appeared first on Quytech Blog.

Published

on

Online dating is all the rage; people all over the world are using dating applications and websites to find people for dating, love, and romance. These dating apps have made it convenient to meet new people without even stepping out of home. Just download the app on your smartphone, enter your particular partner preferences, and get a list of profiles matching your search criteria.

Send request to the profile you like and begin your dating journey. It is no less than a cakewalk. With people preferring e-dating applications over the other mediums of finding a partner, the demand for online dating applications is rising high. The online dating industry is expected to reach a market volume of US$1,968m by 2023 (Source: Statista). This is the best opportunity for the startups that are looking out for a mobile app idea that can be successful.

However, building a dating application can be beneficial only if you integrate highly advanced and unique features into it. We are saying so because the Play Store and App Stores already have hundreds of such applications. In this scenario, developing a new app and then expecting it to outdo all others in the market is a task that requires a lot of things to consider. And, adding exceptional features to the app is the first consideration among all.

To make you help with this, we have provided ten advanced features a dating app should have to get desired success and popularity. Check them out below:

dating app
  • Profile verification

While the core motive of a dating app should be creating a platform to meet new people, ensuring the safety of users cannot be ignored. Connecting new people is exciting but it also arouses a sense of fear in the people, especially to ones who already have experienced harassment and other related issues in the past.

Therefore, you have to make sure that your dating app has a feature to verify the profile by asking the user to upload identification proof before letting him/her search and connect to other profiles. This will make your app trustworthy for everyone who wants to experience online dating. Profile verification can also be done by making users connect their social profiles with their dating accounts.

  • AI-based chatbots

Offering AI-based chatbots would not only make your app unique from others, but it will also help your users, especially the introverts, to break the ice. Such a chatbot would suggest to them the beginning lines to start a conversation. It could also show them the suggestions for replying to a particular message. It is an essential feature because “what you say and how you say” impacts a lot when you talk to a stranger.

  • Explore events and meet-ups

You can add this exceptional feature so that people can explore nearby events that match their interests. People with similar interests can attend these events together and take a step further to their relationship. You can create categories like events for “animal lovers”, “romantic people”, “fitness enthusiasts”, and more.

  • Advanced search

When it comes to choosing a partner, everyone has their own preferences. Some want a partner for dating or hangout, while others could be interested in a serious relationship.  Apart from this, people might be particular about age, gender, religion, cast, or astrological sign. Therefore, it is imperative to provide advanced search options with filters to help them narrow down their search. The filter you provide should be developed using strict algorithms to deliver the desired results in no time.

  • Profile performance checker

A profile is the first thing any user checks before hitting the “Send a Friend Request” button. However, not everyone knows how to create a dating profile that can grab the attention of other users. By providing a profile performance checker in your app, you can help users, mainly the first-time daters, to create a profile that can make them find a partner they have been looking for.

ai whitepaper
  • Behavior Analysis

Artificial intelligence has improved to such an extent that it can now recognize behavior of users. When integrated into a dating app, AI’s behavior analysis feature can help dating app users to find people with similar interests based on what they have mentioned in their profile.

  • Profile recommendation

Most of the dating apps are available with a search option or filters to help users find their partner for dating, love, and romance. However, AI can take this whole scenario to the next level by automatically suggesting or recommending you the profiles that match your search criteria or partner preferences you have mentioned in your profile.

  • Gamification

Gamification is the new trend in the mobile app industry. It increases user engagement. By gamifying a dating app, you can augment your user base and make your app entertaining. Users can take part in different competitions and earn score or points. For example, in the case of a dating app, you can give the matched profiles a chance to take part in a competition where they have to tell likes and dislikes of each other.

  • AI-based Video calls

Video calling powered by artificial intelligence is one of the unique features you can add to a dating app to make it successful. Now, you must be thinking about the role of AI in video calling; the feature can be integrated even without this technology. It is true, but, AI can help to identify the nudity or obscenity during a call.

Read More: 4 Ways Artificial Intelligence empowered Dating Apps

  • Facial recognition to find lookalikes

Using AI’s facial recognition technology, you can make people find a partner that looks like someone they want. Users can upload a photo of a celebrity, public figure, or even their ex, and the app will display lookalikes. This is indeed a unique feature for a dating app.

To build a feature-rich and successful online dating app, contact a trusted dating app development company or hire app developers today.

Final Words

Online dating is a new trend in the dating industry. Anyone, irrespective of their age, race, community, can access dating applications to meet new people. While the use of dating apps is increasing, the demand for dating apps with modern features is also skyrocketing. If you are thinking of fulfilling this demand by developing a dating app with highly advanced features, then this article is for.

Check out some of the innovative features listed in this article to integrate them into your app and paving a path of success for it. To begin developing a dating app with these features, you can contact a reliable mobile app development company or hire mobile app developers with considerable experience in dating app development. Make sure you choose the right company that provides you a perfect combination of cost and quality.

hire mobile development

Source: https://www.quytech.com/blog/10-advanced-features-to-make-your-dating-app-successful/

Continue Reading

AI

How to Get the Best Start at Sports Betting

If you are looking into getting into sports betting, then you might be hesitant about how to start, and the whole idea of it can be quite daunting. There are many techniques to get the best possible start at sports betting and, in this article, we will have a look at some of the best […]

The post How to Get the Best Start at Sports Betting appeared first on 1redDrop.

Published

on

If you are looking into getting into sports betting, then you might be hesitant about how to start, and the whole idea of it can be quite daunting. There are many techniques to get the best possible start at sports betting and, in this article, we will have a look at some of the best tips for that.

Mental preparation

This sounds a bit pretentious, but it is very important to understand some things about betting before starting so you can not only avoid nasty surprises but also avoid losing too much money. Firstly, you need to know that, in the beginning, you will not be good at betting. It is through experience and learning from your mistakes that you will get better. It is imperative that you do not convince yourself that you are good at betting, especially if you win some early bets, because I can guarantee it will have been luck – and false confidence is not your friend. 

It is likely that you will lose some money at first, but this is to be expected. Almost any hobby that you are interested in will cost you some money so, instead, look at it as an investment. However, do not invest ridiculous amounts; rather, wait until you are confident in your betting ability to start placing larger stakes. 

Set up different accounts

This is the best way to start with sports betting, as the welcome offers will offset a lot of the risk. These offers are designed to be profitable to entice you into betting with the bookie, but it is completely legal to just profit from the welcome offer and not bet with the bookie again. 

If you do this with the most bookies, as you can, you are minimising the risk involved with your betting and maximising possible returns, so it really is a no-brainer.

As well as this clear advantage, different betting companies offer different promotions. Ladbrokes offer a boost every day, for example, where you can choose your bet and boost it a little bit, and the Parimatch betting website chooses a bet for big events and doubles the odds. 

If you are making sure you stay aware of the best offers across these platforms, then you will be able to use the most lucrative ones and, as such, you will be giving yourself the best chance of making money. The house always wins, as they say, but if you use this tip, you are skewing the odds back in your favour. 

Remember, the house wins because of gamblers that do not put in the effort and do not bet smart. Avoid those mistakes and you will massively increase your chances of making money.

Tipsters

On Twitter, especially, but also other social media platforms, there are tipsters who offer their bets for free. It is not so much the bets themselves that you are interested in, but rather why they are betting on this. It is important that you find tipsters who know what they are doing, though, because there are a lot of tipsters who are essentially scamming their customers. It is quite easy to find legitimate tipsters because they are not afraid to show their mistakes. 

Once you have found good tipsters, then you need to understand the reasoning behind their bets. When you have done that, you can start placing these bets yourself, and they will likely be of better value since some tipsters influence the betting markets considerably. You can also follow their bets as they are likely to be sensible bets, although this does not necessarily translate to success.

Source: https://1reddrop.com/2020/10/20/how-to-get-the-best-start-at-sports-betting/?utm_source=rss&utm_medium=rss&utm_campaign=how-to-get-the-best-start-at-sports-betting

Continue Reading

AI

Natural Language Processing in Production: 27 Fast Text Pre-Processing Methods

Estimates state that 70%–85% of the world’s data is text (unstructured data) [1]. New deep learning language models (transformers) have caused explosive growth in industry applications [5,6,11]. This blog is not an article introducing you to Natural Language Processing. Instead, it assumes you are familiar with noise reduction and normalization of text. It covers text preprocessing up […]

The post Natural Language Processing in Production: 27 Fast Text Pre-Processing Methods appeared first on TOPBOTS.

Published

on

text pre-processing

Estimates state that 70%–85% of the world’s data is text (unstructured data) [1]. New deep learning language models (transformers) have caused explosive growth in industry applications [5,6,11].

This blog is not an article introducing you to Natural Language Processing. Instead, it assumes you are familiar with noise reduction and normalization of text. It covers text preprocessing up to producing tokens and lemmas from the text.

We stop at feeding the sequence of tokens into a Natural Language model.

The feeding of that sequence of tokens into a Natural Language model to accomplish a specific model task is not covered here.

In production-grade Natural Language Processing (NLP), what is covered in this blog is that fast text pre-processing (noise cleaning and normalization) is critical.

  1. I discuss packages we use for production-level NLP;
  2. I detail the production-level NLP preprocessing text tasks with python code and packages;
  3. Finally. I report benchmarks for NLP text pre-processing tasks;

Dividing NLP Processing into Two Steps

We segment NLP into two major steps (for the convenience of this article):

  1. Text pre-processing into tokens. We clean (noise removal) and then normalize the text. The goal is to transform the text into a corpus that any NLP model can use. A goal is rarely achieved until the introduction of the transformer [2].
  2. A corpus is an input (text preprocessed into a sequence of tokens) into NLP models for training or prediction.

The rest of this article is devoted to noise removal text and normalization of text into tokens/lemmas (Step 1: text pre-processing). Noise removal deletes or transforms things in the text that degrade the NLP task model. It is usually an NLP task-dependent. For example, e-mail may or may not be removed if it is a text classification task or a text redaction task. We’ll cover replacement and removal of the noise.

Normalization of the corpus is transforming the text into a common form. The most frequent example is normalization by transforming all characters to lowercase. In follow-on blogs, we will cover different deep learning language models and Transformers (Steps 2-n) fed by the corpus token/lemma stream.

NLP Text Pre-Processing Package Factoids

There are many NLP packages available. We use spaCy [2], textacy [4], Hugging Face transformers [5], and regex [7] in most of our NLP production applications. The following are some of the “factoids” we used in our decision process.

Note: The following “factoids” may be biased. That is why we refer to them as “factoids.”

NLTK [3]

  • NLTK is a string processing library. All the tools take strings as input and return strings or lists of strings as output [3].
  • NLTK is a good choice if you want to explore different NLP with a corpus whose length is less than a million words.
  • NLTK is a bad choice if you want to go into production with your NLP application [3].

Regex

The use of regex is pervasive throughout our text-preprocessing code. Regex is a fast string processor. Regex, in various forms, has been around for over 50 years. Regex support is part of the standard library of Java and Python, and is built into the syntax of others, including Perl and ECMAScript (JavaScript);

spaCy [2]

  • spaCy is a moderate choice if you want to research different NLP models with a corpus whose length is greater than a million words.
  • If you use a selection from spaCy [3], Hugging Face [5], fast.ai [13], and GPT-3 [6], then you are performing SOTA (state-of-the-art) research of different NLP models (my opinion at the time of writing this blog).
  • spaCy is a good choice if you want to go into production with your NLP application.
  • spaCy is an NLP library implemented both in Python and Cython. Because of the Cython, parts of spaCy are faster than if implemented in Python [3];
  • spacy is the fastest package, we know of, for NLP operations;
  • spacy is available for operating systems MS Windows, macOS, and Ubuntu [3];
  • spaCy runs natively on Nvidia GPUs [3];
  • explosion/spaCy has 16,900 stars on Github (7/22/2020);
  • spaCy has 138 public repository implementations on GitHub;
  • spaCy comes with pre-trained statistical models and word vectors;
  • spaCy transforms text into document objects, vocabulary objects, word- token objects, and other useful objects resulting from parsing the text ;
  • Doc class has several useful attributes and methods. Significantly, you can create new operations on these objects as well as extend a class with new attributes (adding to the spaCy pipeline);
  • spaCy features tokenization for 50+ languages;

Do you find this in-depth technical education about NLP applications to be useful? Subscribe below to be updated when we release new relevant content.

Creating long_s Practice Text String

We create long_, a long string that has extra whitespace, emoji, email addresses, $ symbols, HTML tags, punctuation, and other text that may or may not be noise for the downstream NLP task and/or model.

MULPIPIER = int(3.8e3)
text_l = 300 %time long_s = ':( 😻 😈 #google +1 608-444-0000 08-444-0004 608-444-00003 ext. 508 '
long_s += ' 888 eihtg DoD Fee https://medium.com/ #hash ## Document Title</title> '
long_s += ':( cat- \n nip'
long_s += ' immed- \n natedly <html><h2>2nd levelheading</h2></html> . , '
long_s += '# [email protected] [email protected] can\'t Be a ckunk. $4 $123,456 won\'t seven '
long_s +=' $Shine $$beighty?$ '
long_s *= MULPIPIER
print('size: {:g} {}'.format(len(long_s),long_s[:text_l]))

output =>

CPU times: user 3 µs, sys: 1 µs, total: 4 µs
Wall time: 8.11 µs
size: 1.159e+06 :( 😻 😈 #google +1 608-444-0000 08-444-0004 608-444-00003 ext. 508 888 eihtg DoD Fee https://medium.com/ #hash ## Document Title</title> :( cat- nip immed- natedly <html><h2>2nd levelheading</h2></html> . , # [email protected] [email protected] can't Be a ckunk. $4 $123,456 won't seven $Shine $$beigh

A string, long_s of 1.159 million characters is created in 8.11 µs.

Python String Corpus Pre-processing Step and Benchmarks

All benchmarks are run within a Docker container on MacOS Version 14.0 (14.0).

Model Name: Mac Pro
Processor Name: 12-Core Intel Xeon E5
Processor Speed: 2.7 GHz
Total Number of Cores: 24
L2 Cache (per Core): 256 KB
L3 Cache: 30 MB
Hyper-Threading Technology: Enabled Memory: 64 GB

Note: Corpus/text pre-processing is dependent on the end-point NLP analysis task. Sentiment Analysis requires different corpus/text pre-processing steps than document redaction. The corpus/text pre-processing steps given here are for a range of NLP analysis tasks. Usually. a subset of the given corpus/text pre-processing steps is needed for each NLP task. Also, some of required corpus/text pre-processing steps may not be given here.

1. NLP text preprocessing: Replace Twitter Hash Tags

from textacy.preprocessing.replace import replace_hashtags
%time text = replace_hashtags(long_s,replace_with= 'HASH')
print('size: {:g} {}'.format(len(text),text[:text_l])))

output =>

CPU times: user 223 ms, sys: 66 µs, total: 223 ms
Wall time: 223 ms
size: 1.159e+06 :
( 😻 😈 _HASH_ +1 608-444-0000 08-444-0004 608-444-00003 ext. 508 888 eihtg DoD Fee https://medium.com/ _HASH_ ## Document Title</title> :( cat- nip immed- natedly <html><h2>2nd levelheading</h2></html> . , # [email protected] [email protected] can't Be a ckunk. $4 $123,456 won't seven $Shine $$beigh

Notice that #google and #hash are swapped with_HASH_,and ##and _# are untouched. A million characters were processed in 200 ms. Fast enough for a big corpus of a billion characters (example: web server log).

2. NLP text preprocessing: Remove Twitter Hash Tags

from textacy.preprocessing.replace import replace_hashtags
%time text = replace_hashtags(long_s,replace_with= '')
print('size: {:g} {}'.format(len(text),text[:text_l]))

output =>

CPU times: user 219 ms, sys: 0 ns, total: 219 ms
Wall time: 220 ms
size: 1.1134e+06 :( 😻 😈 +1 608-444-0000 08-444-0004 608-444-00003 ext. 508 888 eihtg DoD Fee https://medium.com/ ## Document Title</title> :( cat- nip immed- natedly <html><h2>2nd levelheading</h2></html> . , # [email protected] [email protected] can't Be a ckunk. $4 $123,456 won't seven $Shine $$beighty?$

Notice that #google and #hash are removed and ##,and _# are untouched. A million characters were processed in 200 ms.

3. NLP text preprocessing: Replace Phone Numbers

from textacy.preprocessing.replace import replace_phone_numbers
%time text = replace_phone_numbers(long_s,replace_with= 'PHONE')
print('size: {:g} {}'.format(len(text),text[:text_l]))

output =>

CPU times: user 384 ms, sys: 1.59 ms, total: 386 ms
Wall time: 383 ms
size: 1.0792e+06
:( 😻 😈 PHONE 08-PHONE 608-444-00003 ext. 508 888 eihtg

Notice phone number 08-444-0004 and 608-444-00003 ext. 508 were not transformed.

4. NLP text preprocessing: Replace Phone Numbers – better

RE_PHONE_NUMBER: Pattern = re.compile( # core components of a phone number r"(?:^|(?<=[^\w)]))(\+?1[ .-]?)?(\(?\d{2,3}\)?[ .-]?)?(\d{2,3}[ .-]?\d{2,5})" # extensions, etc. r"(\s?(?:ext\.?|[#x-])\s?\d{2,6})?(?:$|(?=\W))", flags=re.UNICODE | re.IGNORECASE) text = RE_PHONE_NUMBER.sub('_PHoNE_', long_s)
print('size: {:g} {}'.format(len(text),text[:text_l]))

output =>

CPU times: user 353 ms, sys: 0 ns, total: 353 ms
Wall time: 350 ms
size: 1.0108e+06 :( 😻 😈 _PHoNE_ _PHoNE_ _PHoNE_ 888 eihtg DoD Fee https://medium.com/ ## Document Title</title> :( cat- nip immed- natedly <html><h2>2nd levelheading</h2></html> . , # [email protected] [email protected] can't Be a ckunk. $4 $123,456 won't seven $Shine $$beighty?$

Notice phone number 08-444-0004 and 608-444-00003 ext. 508 were transformed. A million characters were processed in 350 ms.

5. NLP text preprocessing: Remove Phone Numbers

Using the improved RE_PHONE_NUMBER pattern, we put '' in for ‘PHoNE' to remove phone numbers from the corpus.

text = RE_PHONE_NUMBER.sub('', long_s)
print('size: {:g} {}'.format(len(text),text[:text_l]))

output =>

CPU times: user 353 ms, sys: 459 µs, total: 353 ms
Wall time: 351 ms
size: 931000 :( 😻 😈 888 eihtg DoD Fee https://medium.com/ ## Document Title</title> :( cat- nip immed- natedly <html><h2>2nd levelheading</h2></html> . , # [email protected] [email protected] can't Be a ckunk. $4 $123,456 won't seven $Shine $$beighty?$

A million characters were processed in 375 ms.

6. NLP text preprocessing: Removing HTML metadata

I admit removing HTML metadata is my favorite. Not because I like the task, but because I screen-scrape frequently. There is a lot of useful data that resides on an IBM mainframe, VAX-780 (huh?), or whatever terminal-emulation that results in an HTML-based report.

These techniques of web scraping of reports generate text that has HTML tags. HTML tags are considered noise typically as they are parts of the text with little or no value in the follow-on NLP task.

Remember, we created a test string (long_s) a little over million characters long with some HTML tags. We remove the HTML tags using BeautifulSoup.

from bs4 import BeautifulSoup
%time long_s = BeautifulSoup(long_s,'html.parser').get_text()
print('size: {:g} {}'.format(len(long_s),long_s[:text_l])))

output =>

CPU times: user 954 ms, sys: 17.7 ms, total: 971 ms
Wall time: 971 ms
size: 817000 :( 😻 😈 888 eihtg DoD Fee https://medium.com/ ## Document Title :( cat- nip immed- natedly 2nd levelheading 

The result is that BeautifulSoup is able to remove over 7,000 HTML tags in a million character corpus in one second. Scaling linearly, a billion character corpus, about 200 million word, or approxiately 2000 books, would require about 200 seconds.

The rate for HTML tag removal byBeautifulSoup is about 0. 1 second per book. An acceptable rate for our production requirements.

I only benchmark BeautifulSoup. If you know of a competitive alternative method, please let me know.

Note: The compute times you get may be multiples of time longer or shorter if you are using the cloud or Spark.

7. NLP text preprocessing: Replace currency symbol

The currency symbols “[$¢£¤¥ƒ֏؋৲৳૱௹฿៛ℳ元円圆圓﷼\u20A0-\u20C0] “ are replaced with _CUR_using the textacy package:

%time textr = textacy.preprocessing.replace.replace_currency_symbols(long_s)
print('size: {:g} {}'.format(len(textr),textr[:text_l]))

output =>

CPU times: user 31.2 ms, sys: 1.67 ms, total: 32.9 ms
Wall time: 33.7 ms
size: 908200 :( 😻 😈 888 eihtg DoD Fee https://medium.com/ ## Document Title :( cat- nip immed- natedly 2nd levelheading . , # [email protected] [email protected] can't Be a ckunk. _CUR_4 _CUR_123,456 won't seven _CUR_Shine _CUR__CUR_beighty?_CUR_

Note: The option textacy replace_<something> enables you to specify the replacement text. _CUR_ is the default substitution text for replace_currency_symbols.

You may have the currency symbol $ in your text. In this case you can use a regex:

%time text = re.sub('\$', '_DOL_', long_s)
print('size: {:g} {}'.format(len(text),text[:250]))

output =>

CPU times: user 8.06 ms, sys: 0 ns, total: 8.06 ms
Wall time: 8.25 ms
size: 1.3262e+06 :( 😻 😈 #google +1 608-444-0000 08-444-0004 608-444-00003 ext. 508 888 eihtg DoD Fee https://medium.com/ #hash ## <html><title>Document Title</title></html> :( cat- nip immed- natedly <html><h2>2nd levelheading</h2></html> . , # [email protected] [email protected] can't Be a ckunk. _DOL_4 _DOL_123,456 won't seven _DOL_Shine _DOL__DOL_beighty?_DOL_ :

Note: All symbol $ in your text will be removed. Don’t use if you have LaTex or any text where multiple symbol $ are used.

8. NLP text preprocessing: Replace URL String

from textacy.preprocessing.replace import replace_urls
%time text = replace_urls(long_s,replace_with= '_URL_')
print('size: {:g} {}'.format(len(text),text[:text_l]))

output =>

CPU times: user 649 ms, sys: 112 µs, total: 649 ms
Wall time: 646 ms
size: 763800
:( 😻 😈 888 eihtg DoD Fee _URL_ ## Document Title :(

9. NLP text preprocessing: Remove URL String

from textacy.preprocessing.replace import replace_urls
%time text = replace_urls(long_s,replace_with= '')
print('size: {:g} {}'.format(len(text),text[:text_l]))

output =>

CPU times: user 633 ms, sys: 1.35 ms, total: 635 ms
Wall time: 630 ms
size: 744800
:( 😻 😈 888 eihtg DoD Fee ## Document Title :(

The rate for URL replace or removal is about 4,000 URLs per 1 million characters per second. Fast enough for 10 books in a corpus.

10. NLP text preprocessing: Replace E-mail string

%time text = textacy.preprocessing.replace.replace_emails(long_s)
print('size: {:g} {}'.format(len(text),text[:text_l]))

output =>

CPU times: user 406 ms, sys: 125 µs, total: 406 ms
Wall time: 402 ms
size: 725800
:( 😻 😈 888 eihtg DoD Fee ## Document Title :( cat-
nip immed-
natedly 2nd levelheading . , # _EMAIL_ _EMAIL_ can't Be a ckunk. $4 $123,456 won't seven $Shine $$beighty?$

The rate for email reference replace is about 8,000 emails per 1.7 million characters per second. Fast enough for 17 books in a corpus.

11. NLP text pre-processing: Remove E-mail string

from textacy.preprocessing.replace import replace_emails

%time text = textacy.preprocessing.replace.replace_emails(long_s,replace_with= '')
print('size: {:g} {}'.format(len(text),text[:text_l]))

output =>

CPU times: user 413 ms, sys: 1.68 ms, total: 415 ms
Wall time: 412 ms
size: 672600 :( 😻 😈 888 eihtg DoD Fee ## Document Title :( cat-
nip immed-
natedly 2nd levelheading . , # can't Be a ckunk. $4 $123,456 won't seven $Shine $$beighty?$

The rate for email reference removal is about 8,000 emails per 1.1 million characters per second. Fast enough for 11 books in a corpus.

12. NLP text preprocessing: normalize_hyphenated_words

from textacy.preprocessing.normalize import normalize_hyphenated_words
%time long_s = normalize_hyphenated_words(long_s)
print('size: {:g} {}'.format(len(long_s),long_s[:text_l])))

output =>

CPU times: user 186 ms, sys: 4.58 ms, total: 191 ms
Wall time: 190 ms
size: 642200 :
( 😻 😈 888 eihtg DoD Fee ## Document Title :( catnip immednatedly

Approximately 8,000 hyphenated-words, cat — nip and immed- iately (mispelled) were corrected in a corpus of 640,000 characters in 190 ms or abouut 3 million per second.

13. NLP text preprocessing: Convert all characters to lower case

### - **all characters to lower case;**
%time long_s = long_s.lower()
print('size: {:g} {}'.format(len(long_s),long_s[:text_l]))

output =>

CPU times: user 4.82 ms, sys: 953 µs, total: 5.77 ms
Wall time: 5.97 ms
size: 642200
:( 😻 😈 888 eihtg dod fee ## document title :( catnip immednatedly 2nd levelheading . , # can't be a ckunk. $4 $123,456 won't seven $shine $$beighty?$

I only benchmark the .lower Python function. The rate for lower case transformation by.lower() of a Python string of a million characters is about 6 ms. A rate that far exceeds our production rate requirements.

14. NLP text preprocessing: Whitespace Removal

%time text = re.sub(' +', ' ', long_s)
print('size: {:g} {}'.format(len(text),text[:text_l]))

output =>

CPU times: user 44.9 ms, sys: 2.89 ms, total: 47.8 ms
Wall time: 47.8 ms
size: 570000
:( 😻 😈 888 eihtg dod fee ## document title :( catnip immednatedly 2nd levelheading . , # can't be a ckunk. $4 $123,456 won't seven $shine $$beighty?$

The rate is about 0.1 seconds for 1 million characters.

15. NLP text preprocessing: Whitespace Removal (slower)

from textacy.preprocessing.normalize import normalize_whitespace

%time text= normalize_whitespace(long_s)
print('size: {:g} {}'.format(len(text),text[:text_l]))

output =>

CPU times: user 199 ms, sys: 3.06 ms, total: 203 ms
Wall time: 201 ms
size: 569999
:( 😻 😈 888 eihtg dod fee ## document title :( catnip immednatedly 2nd levelheading . , # can't be a ckunk. $4 $123,456 won't seven $shine $$beighty?$

normalize_whitespce is 5x slower but more general. For safety in production, we use normalize_whitespce.To date, we do not think we had any problems with faster regex.

16. NLP text preprocessing: Remove Punctuation

from textacy.preprocessing.remove import remove_punctuation

%time text = remove_punctuation(long_s, marks=',.#$?')
print('size: {:g} {}'.format(len(text),text[:text_l]))

output =>

CPU times: user 34.5 ms, sys: 4.82 ms, total: 39.3 ms
Wall time: 39.3 ms
size: 558599
:( 😻 😈 888 eihtg dod fee document title :( catnip immednatedly 2nd levelheading can't be a ckunk 4 123 456 won't seven shine beighty

spaCy

Creating the spaCy pipeline and Doc

In order to text pre-process with spaCy, we transform the text into a corpus Doc object. We can then use the sequence of word tokens objects of which a Doc object consists. Each token consists of attributes (discussed above) that we use later in this article to pre-process the corpus.

Our text pre-processing end goal (usually) is to produce tokens that feed into our NLP models.

  • spaCy reverses the stream of pre-processing text and then transforming text into tokens. spaCy creates a Doc of tokens. You then pre-process the tokens by their attributes.

The result is that parsing text into a Doc object is where the majority of computation lies. As we will see, pre-processing the sequence of tokens by their attributes is fast.

Adding emoji cleaning in the spaCy pipeline

import en_core_web_lg
nlp = en_core_web_lg.load() do = nlp.disable_pipes(["tagger", "parser"])
%time emoji = Emoji(nlp)
nlp.max_length = len(long_s) + 10
%time nlp.add_pipe(emoji, first=True)
%time long_s_doc = nlp(long_s)
print('size: {:g} {}'.format(len(long_s_doc),long_s_doc[:text_l]))

output =>

CPU times: user 303 ms, sys: 22.6 ms, total: 326 ms
Wall time: 326 ms
CPU times: user 23 µs, sys: 0 ns, total: 23 µs
Wall time: 26.7 µs
CPU times: user 7.22 s, sys: 1.89 s, total: 9.11 s
Wall time: 9.12 s
size: 129199
:( 😻 😈 888 eihtg dod fee document title :( catnip immednatedly 2nd levelheading can't be a ckunk 4 123 456 won't seven shine beighty

Creating the token sequence required at 14,000 tokens per second. We will quite a speedup when we use NVIDIA gpu.

nlp.pipe_names output => ['emoji', 'ner']

Note: The tokenizer is a “special” component and isn’t part of the regular pipeline. It also doesn’t show up in nlp.pipe_names. The reason is that there can only be one tokenizer, and while all other pipeline components take a Doc and return it, the tokenizer takes a string of text and turns it into a Doc. You can still customize the tokenizer. You can either create your own Tokenizer class from scratch, or even replace it with an entirely custom function.

spaCy Token Attributes for Doc Token Preprocessing

As we saw earlier, spaCy provides convenience methods for many other pre-processing tasks. It turns — for example, to remove stop words you can reference the .is_stop attribute.

dir(token[0]) output=> 'ancestors', 'check_flag', 'children', 'cluster', 'conjuncts', 'dep', 'dep_', 'doc', 'ent_id', 'ent_id_', 'ent_iob', 'ent_iob_', 'ent_kb_id', 'ent_kb_id_', 'ent_type', 'ent_type_', 'get_extension', 'has_extension', 'has_vector', 'head', 'i', 'idx', 'is_alpha', 'is_ancestor', 'is_ascii', 'is_bracket', 'is_currency', 'is_digit', 'is_left_punct', 'is_lower', 'is_oov', 'is_punct', 'is_quote', 'is_right_punct', 'is_sent_end', 'is_sent_start', 'is_space', 'is_stop', 'is_title', 'is_upper', 'lang', 'lang_', 'left_edge', 'lefts', 'lemma', 'lemma_', 'lex_id', 'like_email', 'like_num', 'like_url', 'lower', 'lower_', 'morph', 'n_lefts', 'n_rights', 'nbor', 'norm', 'norm_', 'orth', 'orth_', 'pos', 'pos_', 'prefix', 'prefix_', 'prob', 'rank', 'remove_extension', 'right_edge', 'rights', 'sent', 'sent_start', 'sentiment', 'set_extension', 'shape', 'shape_', 'similarity', 'string', 'subtree', 'suffix', 'suffix_', 'tag', 'tag_', 'tensor', 'text', 'text_with_ws', 'vector', 'vector_norm', 'vocab', 'whitespace_']

Attributes added by emoji and new.

dir(long_s_doc[0]._) output => ['emoji_desc', 'get', 'has', 'is_emoji', 'set', 'trf_alignment', 'trf_all_attentions', 'trf_all_hidden_states', 'trf_d_all_attentions', 'trf_d_all_hidden_states', 'trf_d_last_hidden_state', 'trf_d_pooler_output', 'trf_end', 'trf_last_hidden_state', 'trf_pooler_output', 'trf_separator', 'trf_start', 'trf_word_pieces', 'trf_word_pieces_'

I show spaCy performing preprocessing that results in a Python string corpus. The corpus is used to create a new sequence of spaCy tokens (Doc).

There is a faster way to accomplish spaCy preprocessing with spaCy pipeline extensions [2], which I show in an upcoming blog.

17. EMOJI Sentiment Score

EMOJI Sentiment Score is not a text preprocessor in the classic sense.

However, we find that emoji almost always is the dominating text in a document.

For example, two similar phrases from legal notes e-mail with opposite sentiment.

The client was challenging. :( The client was difficult. :)

We calcuate only emoji when present in a note or e-mail.

%time scl = [EMOJI_TO_SENTIMENT_VALUE[token.text] for token in long_s_doc if (token.text in EMOJI_TO_SENTIMENT_VALUE)]
len(scl), sum(scl), sum(scl)/len(scl)

output =>

CPU times: user 179 ms, sys: 0 ns, total: 179 ms
Wall time: 178 ms
(15200, 1090.7019922523152, 0.07175671001659968)

The sentiment was 0.07 (neutral) for 0.5 million character “note” with 15,200 emojis and emojicons in 178 ms. A fast sentiment analysis calculation!

18. NLP text preprocessing: Removing emoji

You can remove emoji using spaCy pipeline add-on

%time long_s_doc_no_emojicon = [token for token in long_s_doc if token._.is_emoji == False]
print('size: {:g} {}'.format(len(long_s_doc_no_emojicon),long_s_doc_no_emojicon[:int(text_l/5)]))

output =>

CPU times: user 837 ms, sys: 4.98 ms, total: 842 ms
Wall time: 841 ms
size: 121599
[:(, 888, eihtg, dod, fee, , document, title, :(, catnip, immednatedly, 2nd, levelheading, , ca, n't, be, a, ckunk, , 4, , 123, 456, wo, n't, seven, , shine, , beighty, , :(, 888, eihtg, dod, fee, , document, title, :(, catnip, immednatedly, 2nd, levelheading, , ca, n't, be, a, ckunk, , 4, , 123, 456, wo, n't, seven, , shine, , beighty, , :(, 888, eihtg, dod, fee, ]

The emoji spacy pipeline addition detected the emojicons, 😻 😈, but missed :) and :(.

19. NLP text pre-processing: Removing emoji (better)

We developed EMOJI_TO_PHRASEto detect the emojicons, 😻 😈, and emoji, such as :) and :(. and removed them [8,9].

%time text = [token.text if (token.text in EMOJI_TO_PHRASE) == False \
else '' for token in long_s_doc]
%time long_s = ' '.join(text)
print('size: {:g} {}'.format(len(long_s),long_s[:text_l]))

output =>

CPU times: user 242 ms, sys: 3.76 ms, total: 245 ms
Wall time: 245 ms
CPU times: user 3.37 ms, sys: 73 µs, total: 3.45 ms
Wall time: 3.46 ms
size: 569997
888 eihtg dod fee document title catnip immednatedly 2nd levelheading ca n't be a ckunk 4 123 456 wo n't seven shine beighty 888 eihtg dod fee document title catnip immednatedly 2nd levelheading ca n't be a ckunk 4 123 456 wo n't seven shine beighty 888 eihtg dod fee document title catnip imm

20. NLP text pre-processing: Replace emojis with a phrase

We can translate emojicon into a natural language phrase.

%time text = [token.text if token._.is_emoji == False else token._.emoji_desc for token in long_s_doc]
%time long_s = ' '.join(text)
print('size: {:g} {}'.format(len(long_s),long_s[:250]))

output =>

CPU times: user 1.07 s, sys: 7.54 ms, total: 1.07 s
Wall time: 1.07 s
CPU times: user 3.78 ms, sys: 0 ns, total: 3.78 ms
Wall time: 3.79 ms
size: 794197
:( smiling cat face with heart-eyes smiling face with horns 888 eihtg dod fee document title :( catnip immednatedly 2nd levelheading ca n't be a ckunk 4 123 456 wo n't seven shine beighty

The emoji spaCy pipeline addition detected the emojicons, 😻 😈, but missed :) and :(.

21. NLP text pre-processing: Replace emojis with a phrase (better)

We can translate emojicons into a natural language phrase.

%time text = [token.text if (token.text in EMOJI_TO_PHRASE) == False \
else EMOJI_TO_PHRASE[token.text] for token in long_s_doc]
%time long_s = ' '.join(text)
print('size: {:g} {}'.format(len(long_s),long_s[:text_l]))

output =>

CPU times: user 251 ms, sys: 5.57 ms, total: 256 ms
Wall time: 255 ms
CPU times: user 3.54 ms, sys: 91 µs, total: 3.63 ms
Wall time: 3.64 ms
size: 904397
FROWNING FACE SMILING CAT FACE WITH HEART-SHAPED EYES SMILING FACE WITH HORNS 888 eihtg dod fee document title FROWNING FACE catnip immednatedly 2nd levelheading ca n't be a ckunk 4 123 456 wo n't seven shine beighty FROWNING FAC

Again. EMOJI_TO_PHRASE detected the emojicons, 😻 😈, and emoji, such as :) and :(. and substituted a phrase.

22. NLP text preprocessing: Correct Spelling

We will use symspell for spelling correction [14].

SymSpell, based on the Symmetric Delete spelling correction algorithm, just took 0.000033 seconds (edit distance 2) and 0.000180 seconds (edit distance 3) on an old MacBook Pro [14].

%time sym_spell_setup() 
%time tk = [check_spelling(token.text) for token in long_s_doc[0:99999]]
%time long_s = ' '.join(tk)
print('size: {:g} {}'.format(len(long_s),long_s[:250]))

output =>

CPU times: user 5.22 s, sys: 132 ms, total: 5.35 s
Wall time: 5.36 s
CPU times: user 25 s, sys: 12.9 ms, total: 25 s
Wall time: 25.1 s
CPU times: user 3.37 ms, sys: 42 µs, total: 3.41 ms
Wall time: 3.42 ms
size: 528259 FROWNING FACE SMILING CAT FACE WITH HEART a SHAPED EYES SMILING FACE WITH HORNS 888 eight do fee document title FROWNING FACE catnip immediately and levelheading a not be a chunk a of 123 456 to not seven of shine of eighty

Spell correction was accomplished for immednatedly, ckunk and beight. Correcting mis-spelled words is our largest computation. It required 30 seconds for 0.8 million characters.

23. NLP text preprocessing: Replacing Currency Symbol (spaCy)

%time token = [token.text if token.is_currency == False else '_CUR_' for token in long_s_doc]
%time long_s = ' '.join(token)
print('size: {:g} {}'.format(len(long_s),long_s[:text_l]))aa

Note: spacy removes all punctuation including :) emoji and emoticon. You can protect the emoticon with:

%time long_s_doc = [token for token in long_s_doc if token.is_punct == False or token._.is_emoji == True]
print('size: {:g} {}'.format(len(long_s_doc),long_s_doc[:50]))

However, replace_currency_symbols and regex ignore context and replace any currency symbol. You may have multiple use of $ in your text and thus can not ignore context. In this case you can use spaCy.

%time tk = [token.text if token.is_currency == False else '_CUR_' for token in long_s_doc]
%time long_s = ' '.join(tk)
print('size: {:g} {}'.format(len(long_s),long_s[:250]))

output =>

CPU times: user 366 ms, sys: 13.9 ms, total: 380 ms
Wall time: 381 ms
CPU times: user 9.7 ms, sys: 0 ns, total: 9.7 ms
Wall time: 9.57 ms
size: 1.692e+06 😻 👍 🏿 < title > Document Title</title > :( < html><h2>2nd levelheading</h2></html > [email protected] [email protected] [email protected] ca n't bc$$ ef$4 5 66 _CUR_ wo nt seven eihtg _CUR_ nine _CUR_ _CUR_ zer$ 😻 👍 🏿 < title > Document Title</title > :( < html><h2>2nd leve

24. NLP text preprocessing: Removing e-mail address (spacy)

%time tokens = [token for token in long_s_doc if not token.like_email]
print('size: {:g} {}'.format(len(tokens),tokens[:int(text_l/3)]))

output =>

CPU times: user 52.7 ms, sys: 3.09 ms, total: 55.8 ms
Wall time: 54.8 ms
size: 99999

About 0.06 second for 1 million characters.

25. NLP text preprocessing: Remove whitespace and punctuation (spaCy)

%time tokens = [token.text for token in long_s_doc if (token.pos_ not in ['SPACE','PUNCT'])]
%time text = ' '.join(tokens)
print('size: {:g} {}'.format(len(text),text[:text_l]))

26. NLP text preprocessing: Removing stop-words

NLP models (ex: logistic regression and transformers) and NLP tasks (Sentiment Analysis) continue to be added. Some benefit from stopword removal, and some will not. [2]

Note: We now only use different deep learning language models (transformers) and do not remove stopwords.

%time tokens = [token.text for token in long_s_doc if token.is_stop == False]
%time long_s = ' '.join(tokens)
print('size: {:g} {}'.format(len(long_s),long_s[:text_l]))

27. NLP text pre-processing: Lemmatization

Lemmatization looks beyond word reduction and considers a language’s full vocabulary to apply a morphological analysis to words.

Lemmatization looks at the surrounding text to determine a given word’s part of speech. It does not categorize phrases.

%time tokens = [token.lemma_ for token in long_s_doc]
%time long_s = ' '.join(tokens)
print('size: {:g} {}'.format(len(long_s),long_s[:text_l]))

output =>

CPU times: user 366 ms, sys: 13.9 ms, total: 380 ms
Wall time: 381 ms
CPU times: user 9.7 ms, sys: 0 ns, total: 9.7 ms
Wall time: 9.57 ms
size: 1.692e+06 😻 👍 🏿 < title > Document Title</title > :( < html><h2>2nd levelheading</h2></html > [email protected] [email protected] [email protected] ca n't bc$$ ef$4 5 66 _CUR_ wo nt seven eihtg _CUR_ nine _CUR_ _CUR_ zer$ 😻 👍 🏿 < title > Document Title</title > :( < html><h2>2nd leve

Note: Spacy does not have stemming. You can add if it is you want. Stemming does not work as well as Lemmazatation because Stemming does not consider context [2] (Why some researcher considers spacy “opinionated”).

Note: If you do not know what is Stemming, you can still be on the Survivor show. (my opinion)

Conclusion

Whatever the NLP task, you need to clean (pre-process) the data (text) into a corpus (document or set of documents) before it is input into any NLP model.

I adopt a text pre-processing framework that has three major categories of NLP text pre-processing:

  1. Noise Removal
  • Transform Unicode characters into text characters.
  • convert a document image into segmented image parts and text snippets [10];
  • extract data from a database and transform into words;
  • remove markup and metadata in HTML, XML, JSON, .md, etc.;
  • remove extra whitespaces;
  • remove emoji or convert emoji into phases;
  • Remove or convert currency symbol, URLs, email addresses, phone numbers, hashtags, other identifying tokens;
  • The correct mis-spelling of words (tokens) [7];
  • Remove remaining unwanted punctuation;

2. Tokenization

  • They are splitting strings of text into smaller pieces, or “tokens.” Paragraphs segment into sentences, and sentences tokenize into words.

3. Normalization

  • Change all characters to lower case;
  • Remove English stop words, or whatever language the text is in;
  • Perform Lemmatization or Stemming.

Note: The tasks listed in Noise Removal and Normalization can move back and forth. The categorical assignment is for explanatory convenience.

Note: We do not remove stop-words anymore. We found that our current NLP models have higher F1 scores when we leave in stop-words.

Note: Stop-word removal is expensive computationally. We found the best way to achieve faster stop-word removal was not to do it.

Note: We saw no significant change in Deep Learning NLP models’ speed with or without stop-word removal.

Note: The Noise Removal and Normalization lists are not exhaustive. These are some of the tasks I have encountered.

Note: The latest NLP Deep Learning models are more accurate than older models. However, Deep Learning models can be impractically slow to train and are still too slow for prediction. We show in a follow-on article how we speed-up such models for production.

Note: Stemming algorithms drop off the end of the beginning of the word, a list of common prefixes and suffixes to create a base root word.

Note: Lemmatization uses linguistic knowledge bases to get the correct roots of words. Lemmatization performs morphological analysis of each word, which requires the overhead of creating a linguistic knowledge base for each language.

Note: Stemming is faster than lemmatization.

Note: Intuitively and in practice, lemmatization yields better results than stemming in an NLP Deep Learning model. Stemming generally reduces precision accuracy and increases recall accuracy because it injects semi-random noise when wrong.

Read more in How and Why to Implement Stemming and Lemmatization from NLTK.

Text preprocessing Action benchmarks

Our unique implementations, spaCy, and textacy are our current choice for short text preprocessing production fast to use. If you don’t mind the big gap in performance, I would recommend using it for production purposes, over NLTK’s implementation of Stanford’s NER.

In the next blogs, We see how performance changes using multi-processing, multithreading, Nvidia GPUs, and pySpark. Also, I will write about how and why our implementations, such as EMOJI_TO_PHRASEand EMOJI_TO_SENTIMENT_VALUE and or how to add emoji, emoticon, or any Unicode symbol.

References

[1] How Much Data Do We Create Every Day? The Mind-Blowing Stats Everyone Should Read.

[2] Industrial-Strength Natural Language Processing;Turbo-charge your spaCy NLP pipeline.

[3] NLTK 3.5 Documentation.

[4] Textacy: Text (Pre)-processing.

[5] Hugging Face.

[6] Language Models are Few-Shot Learners.

[7] re — Regular expression operations.

[8] Using millions of emoji occurrences to learn any-domain representations for detecting sentiment, emotion and sarcasm.

[9] How I Built Emojitracker.

[10] Classifying e-commerce products based on images and text.

[11] DART: Open-Domain Structured Data Record to Text Generation.

[12] Adv-BERT: BERT is not robust on misspellings! Generating nature adversarial samples on BERT.

[13] fast.ai .

[14] 1000x faster Spelling Correction.

This article was originally published on Medium and re-published to TOPBOTS with permission from the author. Read more technical guides by Bruce Cottman, Ph.D. on Medium.

Enjoy this article? Sign up for more AI and NLP updates.

We’ll let you know when we release more in-depth technical education.

Continue Reading
AI1 hour ago

10 Advanced Features to Make your Dating App Successful

AI21 hours ago

How to Get the Best Start at Sports Betting

AI21 hours ago

Natural Language Processing in Production: 27 Fast Text Pre-Processing Methods

AI23 hours ago

Microsoft BOT Framework: Building Blocks

AI23 hours ago

Are Banking Chatbots Vulnerable to Attacks?

AI23 hours ago

TikTok Alexa Skill — Dance to the Tunes Hands-free!

AI24 hours ago

How to Survive the Avalanche of Customer Interactions with Automation

AI2 days ago

How does it know?! Some beginner chatbot tech for newbies.

AI2 days ago

Who is chatbot Eliza?

AI3 days ago

FermiNet: Quantum Physics and Chemistry from First Principles

AI3 days ago

How to take S3 backups with DejaDup on Ubuntu 20.10

AI4 days ago

How banks and finance enterprises can strengthen their support with AI-powered customer service…

AI4 days ago

GBoard Introducing Voice — Smooth Texting and Typing

AI5 days ago

Automatically detecting personal protective equipment on persons in images using Amazon Rekognition

AI5 days ago

Automatically detecting personal protective equipment on persons in images using Amazon Rekognition

AI5 days ago

Automatically detecting personal protective equipment on persons in images using Amazon Rekognition

AI5 days ago

Automatically detecting personal protective equipment on persons in images using Amazon Rekognition

AI5 days ago

Automatically detecting personal protective equipment on persons in images using Amazon Rekognition

AI5 days ago

Automatically detecting personal protective equipment on persons in images using Amazon Rekognition

AI5 days ago

Automatically detecting personal protective equipment on persons in images using Amazon Rekognition

Trending