Need data from the forum

achow101 (OP)

Staff
Legendary

Offline

Activity: 3444
Merit: 6785

Just writing some code

Need data from the forum

June 16, 2016, 01:09:41 PM

I'm going to try using the Google Prediction API (https://cloud.google.com/prediction/) to detect spam. However, in order to do so, it needs to be trained to know what spam and not spam look like. Would it be possible for the mods or admins to give me a file with all of the posts that were deleted for being spam and the topic they were posted in?

If that is not possible, could the mods, as you delete spam posts, put them into this spreadsheet: https://docs.google.com/spreadsheets/d/16frPDZkHcg-WYuWtj_Qqkc0fzPtoj4kBKrjpCrlU9h4/edit?usp=sharing on the sheet labeled SPAM.

Users can also help. If you see a post that you think is spam, you can put it into the above spreadsheet. You can also put posts that you think are good and not spam into the sheet labeled NOT SPAM. If you put things into the spreadsheet, it would be best not to include quoted text.

I know that this sounds like a lot of work for users and mods, but I also think that having a prediction model for spam would be a beneficial thing for this forum.

If anyone wants to help me here, feel free and please do so. If anyone has any suggestions for me for any part of this project, please let me know.

Bitcoin Core contributor | Tip Me! | GitHub | GPG Key Fingerprint 0x17565732E08E5E41

CIYAM

Legendary

Offline

Activity: 1890
Merit: 1081

Ian Knowles - CIYAM Lead Developer

Re: Need data from the forum

June 16, 2016, 01:13:03 PM

As much as you probably won't like it - if you just got rid of every single ad-sig post then you would get rid of at least 99% of the spam (and my guess is that any AI type analysis will probably end up identifying the ad-sig as being the key indicator for spammers).

(to keep posts made by people such as yourself would require a few exceptions to the ad-sig rule but I haven't seen more than a dozen accounts with ad-sigs that are not spammers)

With CIYAM anyone can create 100% generated C++ web applications in literally minutes.

GPG Public Key | 1ciyam3htJit1feGa26p2wQ4aw6KFTejU

achow101 (OP)

Staff
Legendary

Offline

Activity: 3444
Merit: 6785

Just writing some code

Re: Need data from the forum

June 16, 2016, 01:18:38 PM

Quote from: CIYAM on June 16, 2016, 01:13:03 PM

As much as you probably won't like it - if you just got rid of every single ad-sig post then you would get rid of at least 99% of the spam (and my guess is that any AI type analysis will probably end up identifying the ad-sig as being the key indicator).

(to keep posts made by people such as yourself would require a few exceptions to the ad-sig rule but I haven't seen more than a dozen accounts with ad-sigs that are not spammers)

It's probably a good place to start; just going through and adding the posts of users in 777coin or yobit to the spam part. But I'm not going to train it with people's sigs in the posts. I want it to strictly be just the post text, although I'm not sure how I'm going to handle quotes.

Right now my method is to choose some users who I know have good post quality (e.g. DannyHamilton) and put their posts into the NOT SPAM sheet. Then I will do the same with users who I know have terrible post quality, putting their posts into the SPAM sheet.
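The author-based labeling described above could be sketched roughly as follows. This is only an illustration, not the code used for the project: the function names, the `spam` / `not_spam` label strings, and the two-column "label, text" CSV layout are all assumptions, and the exact input format a given training service expects would need to be checked against its documentation. Labeling by author is also a noisy proxy, since good posters occasionally write low-value posts and vice versa.

```python
import csv
import io


def build_training_rows(posts, good_authors, spam_authors):
    """Turn (author, text) pairs into (label, text) rows.

    Posts by authors in neither set are skipped, because this
    labeling scheme has no opinion about them. All names and
    label strings here are hypothetical.
    """
    rows = []
    for author, text in posts:
        if author in spam_authors:
            label = "spam"
        elif author in good_authors:
            label = "not_spam"
        else:
            continue  # unknown author: leave the post unlabeled
        # Collapse whitespace so each post occupies a single CSV line.
        rows.append((label, " ".join(text.split())))
    return rows


def rows_to_csv(rows):
    """Serialize (label, text) rows as CSV text with the label first."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerows(rows)
    return buffer.getvalue()
```

Signatures are left out on purpose, matching the stated goal of training on post text only.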


Lauda

Legendary

Offline

Activity: 2674
Merit: 2965

Terminated.

Re: Need data from the forum

June 16, 2016, 01:19:36 PM

Quote from: knightdk on June 16, 2016, 01:09:41 PM

Would it be possible for the mods or admins to give me a file with all of the posts that were deleted for being spam and the topic they were posted in?

I doubt that. Not all of the posts/threads that are deleted/trashed are spam.

Quote from: knightdk on June 16, 2016, 01:09:41 PM

If that is not possible, could the mods, as you delete spam posts, put them into this spreadsheet: https://docs.google.com/spreadsheets/d/16frPDZkHcg-WYuWtj_Qqkc0fzPtoj4kBKrjpCrlU9h4/edit?usp=sharing on the sheet labeled SPAM.

The 'URL' would not work for trashed and deleted threads. We could only manually copy the body. We have a few patterns that we look out for. You can just stick around in the speculation section and will encounter various spam. Additionally, our own system might be less effective if we make patterns public.

"The Times 03/Jan/2009 Chancellor on brink of second bailout for banks"
😼 Bitcoin Core (onion)

CIYAM

Legendary

Offline

Activity: 1890
Merit: 1081

Ian Knowles - CIYAM Lead Developer

Re: Need data from the forum

June 16, 2016, 01:21:02 PM

Quote from: knightdk on June 16, 2016, 01:18:38 PM

I want it to strictly be just the post text, although I'm not sure how I'm going to handle quotes.

Why not just strip the quotes (if you end up with nothing else other than say +1 it was a pointless post anyway)?


achow101 (OP)

Staff
Legendary

Offline

Activity: 3444
Merit: 6785

Just writing some code

Re: Need data from the forum

June 16, 2016, 01:29:33 PM

Quote from: Lauda on June 16, 2016, 01:19:36 PM

The URL is probably not necessary. I'm thinking that I might need it to provide context to the posts for training the model, but it probably isn't that useful anyways.

Quote from: Lauda on June 16, 2016, 01:19:36 PM

Additionally, our own system might be less effective if we make patterns public.

You could just PM me.

Quote from: CIYAM on June 16, 2016, 01:21:02 PM

Quote from: knightdk on June 16, 2016, 01:18:38 PM

I want it to strictly be just the post text, although I'm not sure how I'm going to handle quotes.

Why not just strip the quotes (if you end up with nothing else other than say +1 it was a pointless post anyway)?

When I feed data to the model for spam detection after it's done training, I will strip out the quotes, as that part will be automated. But for gathering the data, I noticed that stripping out the quotes by hand can be a hassle, especially for the posts of people who respond to things line by line (like myself) and use quotes a lot. And it is pretty much impossible to strip out the quoted text when the entire post is copied as plain text, since nothing indicates where the quote stops.
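For the automated part, quote stripping might look something like the sketch below. It assumes the raw post source is available with BBCode-style `[quote ...]...[/quote]` tags; that availability is an assumption, and as noted above, it does not help with posts copied as plain text, where the quote boundaries are already lost. The function name `strip_quotes` is made up for illustration.

```python
import re

# Matches an innermost quote block: an opening [quote ...] tag, then text
# that contains no further opening quote tag, then the closing tag.
_INNER_QUOTE = re.compile(
    r"\[quote[^\]]*\](?:(?!\[quote).)*?\[/quote\]",
    re.IGNORECASE | re.DOTALL,
)


def strip_quotes(raw_post: str) -> str:
    """Remove all (possibly nested) quote blocks from a raw post."""
    text = raw_post
    previous = None
    # Peel off innermost quotes repeatedly until nothing changes,
    # which handles quotes nested inside other quotes.
    while previous != text:
        previous = text
        text = _INNER_QUOTE.sub("", text)
    # Collapse the blank-line runs left behind by removed quotes.
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()
```

If the result is empty or only a token reply, the post carried little content of its own, which fits the suggestion made earlier in the thread.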


KenR

Hero Member

Offline

Activity: 910
Merit: 1000


Re: Need data from the forum

June 16, 2016, 01:40:24 PM

Google Prediction API performs the worst when it comes to NLP. Understanding its heuristics is next to impossible. Perhaps if you write your own genetic algorithms, you might get closest to detecting spam and hopefully filtering it out.
It's not as easy as it seems, since you would have to write your own model which analyzes the topic first and then filters out the spam.

Practically speaking, it's impossible, because the model will fail most of the time: a post can be off topic yet constructive to the discussion. You know, there are a few things bots can't beat humans at.


achow101 (OP)

Staff
Legendary

Offline

Activity: 3444
Merit: 6785

Just writing some code

Re: Need data from the forum

June 16, 2016, 01:58:02 PM

Quote from: KenR on June 16, 2016, 01:40:24 PM

If Google's Prediction API doesn't work, then I will find some other Machine Learning platform and see how well they do. If all else fails, I can attempt to figure out TensorFlow. Either way, I'm going to need the same data.
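As one possible fallback that needs no external service, the same labeled data could be fed to a small hand-written classifier. The sketch below is a plain multinomial naive Bayes with add-one smoothing, which is a swapped-in technique for illustration and not anything proposed in the thread; the class name, tokenizer, and label strings are all assumptions.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


class NaiveBayes:
    """Minimal multinomial naive Bayes over word counts."""

    def __init__(self):
        self.word_counts = {}        # label -> Counter of token counts
        self.doc_counts = Counter()  # label -> number of training posts
        self.vocab = set()

    def train(self, examples):
        """Consume (label, text) pairs, e.g. rows from the spreadsheet."""
        for label, text in examples:
            self.doc_counts[label] += 1
            counts = self.word_counts.setdefault(label, Counter())
            for token in tokenize(text):
                counts[token] += 1
                self.vocab.add(token)

    def predict(self, text):
        """Return the label with the highest log-probability."""
        total_docs = sum(self.doc_counts.values())
        tokens = tokenize(text)
        best_label, best_score = None, -math.inf
        for label, counts in self.word_counts.items():
            # Log prior plus smoothed log likelihood of each token.
            score = math.log(self.doc_counts[label] / total_docs)
            denom = sum(counts.values()) + len(self.vocab)
            for token in tokens:
                score += math.log((counts[token] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label
```

A word-count model like this sees only the post itself, so it cannot judge whether a post is on topic; that limitation is exactly the objection raised earlier in the thread.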

Quote from: KenR on June 16, 2016, 01:40:24 PM

Practically speaking, it's impossible, because the model will fail most of the time: a post can be off topic yet constructive to the discussion.

That's also why I'm looking for the Topic URLs. I may be able to use the rest of the topic as context and train it to analyze the post in context.

Quote from: KenR on June 16, 2016, 01:40:24 PM

You know, there are a few things bots can't beat humans at.

Some AIs are very intelligent, especially ones on cloud platforms with a ton of computing power backing them.

