Bitcoin Forum
Author Topic: Need data from the forum  (Read 513 times)
achow101 (OP)
Staff
Legendary
*
Offline Offline

Activity: 3388
Merit: 6635


Just writing some code


View Profile WWW
June 16, 2016, 01:09:41 PM
 #1

I'm thinking about (going to try) using the Google Prediction API: https://cloud.google.com/prediction/ to detect spam. However, in order to do so, it needs to be trained to know what spam and not spam looks like. Would it be possible for the mods or admins to give me a file with all of the posts that were deleted for being spam and the topic they were posted in?

If that is not possible, could the mods, as you delete spam posts, put them into this spreadsheet: https://docs.google.com/spreadsheets/d/16frPDZkHcg-WYuWtj_Qqkc0fzPtoj4kBKrjpCrlU9h4/edit?usp=sharing on the sheet labeled SPAM.

Users can also help. If you see a post that you think is spam, you can put it into the above spreadsheet. You can also put posts that you think are good and not spam into the sheet labeled NOT SPAM. If you put things into the spreadsheet, it would be best not to include quoted text.

I know that this sounds like a lot of work for users and mods, but I also think that having a prediction model for spam would be a beneficial thing for this forum.

If anyone wants to help me here, feel free and please do so. If anyone has any suggestions for me for any part of this project, please let me know.

Activity + Trust + Earned Merit == The Most Recognized Users on Bitcointalk
CIYAM
Legendary
*
Offline Offline

Activity: 1890
Merit: 1078


Ian Knowles - CIYAM Lead Developer


View Profile WWW
June 16, 2016, 01:13:03 PM
 #2

As much as you probably won't like it - if you just got rid of every single ad-sig post then you would get rid of at least 99% of the spam (and my guess is that any AI type analysis will probably end up identifying the ad-sig as being the key indicator for spammers).

(to keep posts made by people such as yourself would require a few exceptions to the ad-sig rule but I haven't seen more than a dozen accounts with ad-sigs that are not spammers)
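The ad-sig rule with a few exceptions could be sketched roughly as below. This is only an illustration: the campaign pattern and the exemption list are hypothetical placeholders (777coin and yobit are the only campaigns named in this thread), and the function name `is_ad_sig_post` is made up for the example.

```python
import re

# Hypothetical campaign names; extend as needed. Only 777coin and yobit
# are mentioned in the thread.
AD_CAMPAIGN_PATTERN = re.compile(r"777coin|yobit", re.IGNORECASE)

# Hypothetical exceptions: accounts with ad signatures that are known not to spam.
EXEMPT_AUTHORS = {"achow101"}


def is_ad_sig_post(author: str, signature: str) -> bool:
    """Return True if the post would be dropped under the ad-sig rule."""
    if author in EXEMPT_AUTHORS:
        return False
    return bool(AD_CAMPAIGN_PATTERN.search(signature))
```

A rule like this looks only at the signature, never at the post text, so it can serve at most as a baseline to compare a trained model against.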

With CIYAM anyone can create 100% generated C++ web applications in literally minutes.

GPG Public Key | 1ciyam3htJit1feGa26p2wQ4aw6KFTejU
achow101 (OP)
June 16, 2016, 01:18:38 PM
 #3

As much as you probably won't like it - if you just got rid of every single ad-sig post then you would get rid of at least 99% of the spam (and my guess is that any AI type analysis will probably end up identifying the ad-sig as being the key indicator).

(to keep posts made by people such as yourself would require a few exceptions to the ad-sig rule but I haven't seen more than a dozen accounts with ad-sigs that are not spammers)

It's probably a good place to start; just going through and adding the posts of users in 777coin or yobit to the spam part. But I'm not going to train it with people's sigs in the posts. I want it to strictly be just the post text, although I'm not sure how I'm going to handle quotes.

Right now my method is to choose some users who I know have good post quality (e.g. DannyHamilton) and put their posts into the NOT SPAM sheet. Then I will do the same for the SPAM sheet with users who I know have terrible post quality.
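A minimal sketch of that labeling step, assuming posts are available as (author, text) pairs and that a two-column layout (label, then text) is acceptable for training. The author sets and helper names are hypothetical; only DannyHamilton is named in this thread.

```python
import csv
import io

# Hypothetical author lists; only DannyHamilton is named in the thread.
GOOD_AUTHORS = {"DannyHamilton"}
BAD_AUTHORS = {"example_spammer"}


def label_posts(posts):
    """Yield (label, text) pairs for posts whose author is on either list."""
    for author, text in posts:
        if author in GOOD_AUTHORS:
            yield "NOT SPAM", text
        elif author in BAD_AUTHORS:
            yield "SPAM", text
        # Posts by anyone else are skipped rather than guessed at.


def to_csv(rows):
    """Serialize labeled rows as CSV, label first, text second."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerows(rows)
    return buffer.getvalue()
```

Labeling by author is coarse: every post by a listed user gets the same label, so an occasional good post by a "bad" user ends up as a mislabeled training example.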

Lauda
Legendary
*
Offline Offline

Activity: 2674
Merit: 2965


Terminated.


View Profile WWW
June 16, 2016, 01:19:36 PM
 #4

Would it be possible for the mods or admins to give me a file with all of the posts that were deleted for being spam and the topic they were posted in?
I doubt that. Not all of the posts/threads that are deleted/trashed are spam.

If that is not possible, could the mods, as you delete spam posts, put them into this spreadsheet: https://docs.google.com/spreadsheets/d/16frPDZkHcg-WYuWtj_Qqkc0fzPtoj4kBKrjpCrlU9h4/edit?usp=sharing on the sheet labeled SPAM.
The 'URL' would not work for trashed and deleted threads. We could only manually copy the body. We have a few patterns that we look out for. You can just stick around in the speculation section and you will encounter various spam. Additionally, our own system might be less effective if we make patterns public.

"The Times 03/Jan/2009 Chancellor on brink of second bailout for banks"
😼 Bitcoin Core (onion)
CIYAM
June 16, 2016, 01:21:02 PM
 #5

I want it to strictly be just the post text, although I'm not sure how I'm going to handle quotes.

Why not just strip the quotes (if you end up with nothing else other than say +1 it was a pointless post anyway)?

With CIYAM anyone can create 100% generated C++ web applications in literally minutes.

GPG Public Key | 1ciyam3htJit1feGa26p2wQ4aw6KFTejU
achow101 (OP)
June 16, 2016, 01:29:33 PM
 #6

The 'URL' would not work for trashed and deleted threads. We could only manually copy the body. We have a few patterns that we look out for. You can just stick around in the speculation section and you will encounter various spam.
The URL is probably not necessary. I'm thinking that I might need it to provide context to the posts for training the model, but it probably isn't that useful anyways.

Additionally, our own system might be less effective if we make patterns public.
You could just PM me.

I want it to strictly be just the post text, although I'm not sure how I'm going to handle quotes.

Why not just strip the quotes (if you end up with nothing else other than say +1 it was a pointless post anyway)?

When I feed data to the model for spam detection after it's done training, I will strip out the quotes, as that part will be automated. But for gathering the data, I noticed that stripping out the quotes by hand can be a hassle, especially for the posts of people who respond to things line by line (like myself) and use quotes a lot. And it is pretty much impossible to strip out the quotes when the entire post is copied as plain text, since nothing indicates where the quote stops.
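The automated part could look something like the sketch below, assuming the raw post is available as SMF-style BBCode with `[quote ...]` and `[/quote]` tags. The function name `strip_quotes` is made up for the example. It does not help with the copied-as-plain-text case, where the quote markers are already gone.

```python
import re

# Matches an opening [quote] / [quote author=...] tag or a closing [/quote] tag.
QUOTE_TAG = re.compile(r"\[(/?)quote(?:[ =][^\]]*)?\]", re.IGNORECASE)


def strip_quotes(bbcode: str) -> str:
    """Remove [quote]...[/quote] blocks, including nested ones."""
    kept = []
    depth = 0      # how many quote blocks we are currently inside
    position = 0   # end of the last tag seen
    for match in QUOTE_TAG.finditer(bbcode):
        if depth == 0:
            # Text outside any quote block is the author's own writing.
            kept.append(bbcode[position:match.start()])
        if match.group(1):
            depth = max(depth - 1, 0)   # closing tag
        else:
            depth += 1                  # opening tag
        position = match.end()
    if depth == 0:
        kept.append(bbcode[position:])
    # If a quote is never closed, the trailing text is dropped with it.
    return "".join(kept).strip()
```

A post that reduces to an empty string or something like "+1" after this step can then be flagged or skipped before it ever reaches the model.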

KenR
Hero Member
*****
Offline Offline

Activity: 910
Merit: 1000


「きみはこれ&#


View Profile
June 16, 2016, 01:40:24 PM
 #7

Google Prediction API fails the most when it comes to NLP. Understanding the heuristics it uses is next to impossible. Perhaps if you write your own genetic algorithms, they might come closest to detecting spam and hopefully filtering it out.
It is not as easy as it seems, since you would have to write your own model which analyzes the topic first and then filters out the spam.

Speaking practically, it's impossible, because the model will fail most of the time: a post could be off topic yet constructive to the discussion. You know there are a few things bots can't beat humans at.

achow101 (OP)
June 16, 2016, 01:58:02 PM
 #8

Google Prediction API fails the most when it comes to NLP. Understanding the heuristics it uses is next to impossible. Perhaps if you write your own genetic algorithms, they might come closest to detecting spam and hopefully filtering it out.
It is not as easy as it seems, since you would have to write your own model which analyzes the topic first and then filters out the spam.
If Google's Prediction API doesn't work, then I will find some other machine learning platform and see how well it does. If all else fails, I can attempt to figure out TensorFlow. Either way, I'm going to need the same data.
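To illustrate why the same labeled data is useful regardless of platform, here is a small baseline that trains directly on (label, text) pairs. It is a plain multinomial naive Bayes classifier written with the standard library only, not the Prediction API or TensorFlow, and the class and function names are made up for the example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


class NaiveBayes:
    """Multinomial naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.word_counts = {}        # label -> Counter of token counts
        self.doc_counts = Counter()  # label -> number of training posts
        self.vocabulary = set()

    def train(self, examples):
        for label, text in examples:
            self.doc_counts[label] += 1
            counts = self.word_counts.setdefault(label, Counter())
            for token in tokenize(text):
                counts[token] += 1
                self.vocabulary.add(token)

    def predict(self, text):
        tokens = tokenize(text)
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label, counts in self.word_counts.items():
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(self.doc_counts[label] / total_docs)
            denominator = sum(counts.values()) + len(self.vocabulary)
            for token in tokens:
                score += math.log((counts[token] + 1) / denominator)
            if score > best_score:
                best_label, best_score = label, score
        return best_label
```

Usage is `model = NaiveBayes(); model.train(rows); model.predict(post_text)`, where `rows` is any iterable of (label, text) pairs. A bag-of-words model like this ignores topic context entirely, so it cannot tell an off-topic but constructive post from spam.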

Speaking practically, it's impossible, because the model will fail most of the time: a post could be off topic yet constructive to the discussion.
That's also why I'm looking for the Topic URLs. I may be able to use the rest of the topic as context and train it to analyze the post in context.


You know there are a few things bots can't beat humans at.
Some AIs are very intelligent, especially ones on cloud platforms where there is a ton of computing power backing them.
