Bitcoin Forum

Other => Meta => Topic started by: LoyceV on October 01, 2019, 02:05:33 PM



Title: How to get all posts through "recent"?
Post by: LoyceV on October 01, 2019, 02:05:33 PM
While scraping recent (https://bitcointalk.org/index.php?action=recent), I noticed I missed some posts. My logs show this:
Quote
Downloading recent.html
1. userID: 819696 - username: Hypnosis00 - msgID: 52615289
2. userID: 2286354 - username: FrequencyRules058 - msgID: 52615288
3. userID: 662400 - username: kzv - msgID: 52615287
4. userID: 1226689 - username: phoen - msgID: 52615285
5. userID: 93751 - username: ltcdice - msgID: 52615284
6. userID: 947291 - username: Polar91 - msgID: 52615283
7. userID: 2480302 - username: Bullrunking - msgID: 52615282
8. userID: 543165 - username: citronick - msgID: 52615281
9. userID: 2294946 - username: reena024 - msgID: 52615280
10. userID: 1000199 - username: krogothmanhattan - msgID: 52615279
The post with the msgID ending in 86 (https://bitcointalk.org/index.php?topic=178336.msg52615286#msg52615286) is missing. I missed another post from the same thread too. I don't have the board on ignore, and some other posts in the same thread show up as expected.

It's missing from halfway down the recent page, and the same post is also missing from the downloads I made a few seconds earlier and later. That means the post was really missing from the page, which makes me think it's a bug in "recent".
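
A gap like this can be spotted mechanically. The sketch below is only an illustration, not the scraper's actual code: it assumes log lines shaped like the ones quoted above, and the names `MSG_ID_RE` and `find_gaps` are made up for this example. It collects the msgIDs from one download and lists every ID between the lowest and highest that never appeared.

```python
import re

# Matches the "msgID: 52615289" part of a log line (format assumed from the log above).
MSG_ID_RE = re.compile(r"msgID:\s*(\d+)")


def find_gaps(log_lines):
    """Return the msgIDs between the lowest and highest seen ID that never appear."""
    seen = set()
    for line in log_lines:
        match = MSG_ID_RE.search(line)
        if match:
            seen.add(int(match.group(1)))
    if not seen:
        return []
    return [i for i in range(min(seen), max(seen) + 1) if i not in seen]


# The ten log lines from the post above.
log = """\
1. userID: 819696 - username: Hypnosis00 - msgID: 52615289
2. userID: 2286354 - username: FrequencyRules058 - msgID: 52615288
3. userID: 662400 - username: kzv - msgID: 52615287
4. userID: 1226689 - username: phoen - msgID: 52615285
5. userID: 93751 - username: ltcdice - msgID: 52615284
6. userID: 947291 - username: Polar91 - msgID: 52615283
7. userID: 2480302 - username: Bullrunking - msgID: 52615282
8. userID: 543165 - username: citronick - msgID: 52615281
9. userID: 2294946 - username: reena024 - msgID: 52615280
10. userID: 1000199 - username: krogothmanhattan - msgID: 52615279
""".splitlines()

print(find_gaps(log))  # prints [52615286]
```

A reported gap only says the ID was absent from that download; it does not say why.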


Title: Re: Bug in "recent"? Missing posts
Post by: suchmoon on October 01, 2019, 02:32:37 PM
Doesn't "recent" show only the most recent post in each thread? Not sure where I got that from but I thought that's how it worked.


Title: Re: Bug in "recent"? Missing posts
Post by: LoyceV on October 01, 2019, 02:34:58 PM
Doesn't "recent" show only the most recent post in each thread?
You're right! Mind blown :O

I never knew that. I'll edit the title to my new question: how do I get all posts? This messes up my data projects.


Title: Re: How to get all posts through "recent"?
Post by: o_e_l_e_o on October 01, 2019, 03:25:15 PM
Are you sure? See posts 26 and 30 in the screenshot below - both show up for me in recent, both in the same thread (Here: https://bitcointalk.org/index.php?topic=5174107.msg52617047#msg52617047).

https://i.imgur.com/EdUk6FE.jpg



Edit: Confirmed I can see this post and suchmoon's test post below both on recent at the same time, albeit on different pages.


Title: Re: How to get all posts through "recent"?
Post by: suchmoon on October 01, 2019, 03:26:55 PM
test

Edit: it looks like I was wrong, sorry LoyceV for confusing you. I got the same result as o_e_l_e_o. No idea then why you missed those posts.


Title: Re: How to get all posts through "recent"?
Post by: hosseinimr93 on October 01, 2019, 03:42:48 PM
As far as I know all posts should be shown in "recent".
These numbers are the IDs of missed posts in http://loyce.club/archive/posts/5259/ and http://loyce.club/archive/posts/5260/
52590233
52591100
52591174
52591179
52591311
52591721
52592748
52597731
52598319
52598892
52602024
52602357
52604597
52607589
It seems that there is a bug. It could be in either Loyce.club or Bitcointalk.


Title: Re: How to get all posts through "recent"?
Post by: suchmoon on October 01, 2019, 03:49:38 PM
These numbers are the IDs of missed posts in http://loyce.club/archive/posts/5259/ and http://loyce.club/archive/posts/5260/

Some of those might be missing legitimately - e.g. quickly deleted, or posted on an invisible board, for example:

52591179
52591721
52592748
52598319
52598892
52602024
52602357

All others seem to exist in the WO thread, except this one in a different thread:

https://bitcointalk.org/index.php?topic=5026942.msg52597731#msg52597731


Title: Re: How to get all posts through "recent"?
Post by: hosseinimr93 on October 01, 2019, 04:17:11 PM
Some of those might be missing legitimately - e.g. quickly deleted, or posted on an invisible board, for example:
Invisible boards?
Which boards are invisible? Are there some boards that are only visible to moderators?

All others seem to exist in the WO thread, except this one in a different thread:
Do you mean this thread (https://bitcointalk.org/index.php?topic=178336.0)?
So, there is a bug. Am I right?
All of the posts in this thread should be shown in "Recent" too.

https://bitcointalk.org/index.php?topic=5026942.msg52597731#msg52597731
May I know how you could find this post knowing only the msgID?
The links of posts contain topic number too.


Title: Re: How to get all posts through "recent"?
Post by: Halab on October 01, 2019, 05:07:33 PM
Some of those might be missing legitimately - e.g. quickly deleted, or posted on an invisible board, for example:
[...]
52602024
52602357

I didn't check the other ids, but these two are the last 2 posts in the Staff forum.

Invisible boards?
Which boards are invisible? Are there some boards that are only visible to moderators?

Yes there is a special board for the Staff. Another one for the VIPs. And maybe other boards, but I don't have access to these ones :).
And a special one for the April Fool's Day ideas, Theymos takes this very seriously :).

OK, I'm lying about the last one.


Title: Re: How to get all posts through "recent"?
Post by: suchmoon on October 01, 2019, 05:46:59 PM
May I know how you could find this post knowing only the msgID?

Quote a post - any post, doesn't matter. Then in the URL that looks like this:

Code:
https://bitcointalk.org/index.php?action=post;quote=52617659;topic=5189156.0;num_replies=8;sesc=...

Replace the number after "quote=" with the ID of the post you're looking for. You'll get the post quoted in the text box and then you can use the user's post history and the contents of the post to find it. Note that the link in the quote is not valid, e.g. if you click Preview and click the link it won't go to the correct thread.
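
The URL edit itself is easy to script. This is a minimal sketch, assuming the URL has the shape shown above; `swap_quote_id` is a hypothetical helper name. It only rewrites the string. The `sesc=...` part is elided in the post and is left exactly as it is here, so the result is not a ready-to-fetch link.

```python
import re


def swap_quote_id(quote_url, msg_id):
    """Replace the number after 'quote=' with msg_id, leaving the rest untouched."""
    new_url, count = re.subn(r"(?<=quote=)\d+", str(msg_id), quote_url, count=1)
    if count == 0:
        raise ValueError("URL has no 'quote=<number>' parameter")
    return new_url


# URL copied from the post above; the trailing "sesc=..." stays elided.
template = (
    "https://bitcointalk.org/index.php?action=post;quote=52617659;"
    "topic=5189156.0;num_replies=8;sesc=..."
)

print(swap_quote_id(template, 52597731))
```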


Title: Re: How to get all posts through "recent"?
Post by: theymos on October 01, 2019, 10:47:49 PM
All posts you can see should be listed there, though due to database concurrency limitations, ones made in the last few seconds might not show up, even if others before/after them do.

Note that if you don't need to get posts ASAP, it may be easier and more efficient for you to use https://bitcointalk.org/sitemap.php. All of the last-modification times are accurate to within a couple of hours.
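
One way a scraper might cope with the "last few seconds" limitation is to merge several overlapping downloads and only treat a gap as real once it is no longer among the newest IDs. The sketch below assumes exactly that approach; it is not how the forum or any existing scraper works. The names `merge_snapshots` and `settled_gaps`, the `settle_margin` parameter (counted in IDs, not seconds), and the small sample IDs are all made up for illustration.

```python
def merge_snapshots(snapshots):
    """Union of all msgIDs seen across several downloads of the page."""
    seen = set()
    for snapshot in snapshots:
        seen.update(snapshot)
    return seen


def settled_gaps(seen, settle_margin):
    """IDs absent from every download, ignoring the newest `settle_margin` IDs."""
    if not seen:
        return []
    cutoff = max(seen) - settle_margin
    return [i for i in range(min(seen), cutoff + 1) if i not in seen]


# Three overlapping downloads (made-up small IDs for readability).
# 103 is absent from the first download but appears in the second.
polls = [[105, 104, 102, 101], [107, 106, 105, 104, 103], [109, 107, 106, 105]]
seen = merge_snapshots(polls)

print(settled_gaps(seen, settle_margin=3))  # prints []
print(settled_gaps(seen, settle_margin=0))  # prints [108]
```

With a margin of 3 the missing 108 is still considered too fresh to report; with no margin it is flagged immediately.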


Title: Re: How to get all posts through "recent"?
Post by: LoyceV on October 03, 2019, 03:43:47 PM
Are you sure?
No, apparently I wasn't sure. I tested it again, and all 5 test posts in this thread showed up in "recent".

I checked 40,000 of my scraped posts, and I have:
9990/10000 (http://loyce.club/archive/posts/5259/)
9996/10000 (http://loyce.club/archive/posts/5260/)
9996/10000 (http://loyce.club/archive/posts/5261/)
9993/10000 (http://loyce.club/archive/posts/5262/)

That means it must have been a coincidence that I missed 2 posts in the Wall Observer thread in a short time span, right at the moment I was testing my scraper there (https://bitcointalk.org/index.php?topic=178336.msg52614956#msg52614956).

I can live with missing less than 0.1% of all posts (some of the missing posts are on hidden boards, of which I only know the VIP and Staff boards, and Investigations is excluded).
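
For reference, the "less than 0.1%" figure follows directly from the four counts above. A small sketch of the arithmetic, using only those numbers (the variable names are illustrative):

```python
# (saved, expected) per archive directory, numbers taken from the list above.
batches = {
    "5259": (9990, 10000),
    "5260": (9996, 10000),
    "5261": (9996, 10000),
    "5262": (9993, 10000),
}

saved = sum(s for s, _ in batches.values())
expected = sum(e for _, e in batches.values())
missing = expected - saved
miss_rate = missing / expected

print(f"{saved}/{expected} saved, {missing} missing ({miss_rate:.4%})")
# prints 39975/40000 saved, 25 missing (0.0625%)
```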

May I know how you could find this post knowing only the msgID?
Quote a post - any post, doesn't matter. Then in the URL that looks like this:

Code:
https://bitcointalk.org/index.php?action=post;quote=52617659;topic=5189156.0;num_replies=8;sesc=...

Replace the number after "quote=" with the ID of the post you're looking for.
That's a neat trick! But it's difficult to automate, so I can't really use it to check for missing posts.

All posts you can see should be listed there, though due to database concurrency limitations, ones made in the last few seconds might not show up, even if others before/after them do.
Thanks. If it's a known limitation, I'll just let it be :)

Quote
Note that if you don't need to get posts ASAP, it may be easier and more efficient for you to use https://bitcointalk.org/sitemap.php. All of the last-modification times are accurate to within a couple of hours.
That page looks different in Firefox and in Chrome, but I can't really figure out what I'm looking at.


Title: Re: How to get all posts through "recent"?
Post by: hosseinimr93 on October 03, 2019, 06:31:28 PM
Most of the missed posts are in the Wall Observer thread (https://bitcointalk.org/index.php?topic=178336.0)
25 out of those 40,000 posts have been missed.
14 out of 25 posts are in hidden threads. So we can say that 11 out of 40,000 posts have been missed. 9 out of 11 missed posts are in Wall Observer thread. That's 82% of missed posts.
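
The breakdown can be checked with a few lines. This sketch uses only the counts stated in this post; the function name `breakdown` is made up for the example.

```python
def breakdown(total_posts, missed, hidden, in_wall_observer):
    """Return (visible posts missed, their share of all posts, Wall Observer share)."""
    visible_missed = missed - hidden
    return (
        visible_missed,
        visible_missed / total_posts,
        in_wall_observer / visible_missed,
    )


visible_missed, visible_rate, wo_share = breakdown(40_000, 25, 14, 9)

print(f"{visible_missed} visible posts missed ({visible_rate:.4%} of all posts)")
# prints 11 visible posts missed (0.0275% of all posts)
print(f"Wall Observer share: {wo_share:.0%}")
# prints Wall Observer share: 82%
```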


Title: Re: How to get all posts through "recent"?
Post by: theymos on October 03, 2019, 09:14:05 PM
That page looks different in Firefox and in Chrome, but I can't really figure out what I'm looking at.

It's an XML sitemap file. Search engines use that file to keep up-to-date on forum posts. It's designed for computers to process, not humans; different browsers display it differently.
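
As an illustration of how a program might read such a file, here is a minimal sketch using Python's standard library. It assumes the file follows the common sitemaps.org `urlset` layout with `<loc>` and `<lastmod>` entries, which has not been checked against the actual output of sitemap.php. The sample XML, its dates, and the name `changed_since` are made up for this example.

```python
import xml.etree.ElementTree as ET
from datetime import date

# Namespace used by the sitemaps.org protocol.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hypothetical sample; the lastmod values are invented for illustration.
SAMPLE = """\
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://bitcointalk.org/index.php?topic=5189156.0</loc>
    <lastmod>2019-10-03</lastmod>
  </url>
  <url>
    <loc>https://bitcointalk.org/index.php?topic=178336.0</loc>
    <lastmod>2019-10-01</lastmod>
  </url>
</urlset>"""


def changed_since(xml_text, since):
    """Return the <loc> URLs whose <lastmod> date is on or after `since`."""
    root = ET.fromstring(xml_text)
    urls = []
    for url in root.findall("sm:url", NS):
        loc = url.findtext("sm:loc", namespaces=NS)
        lastmod = url.findtext("sm:lastmod", namespaces=NS)
        # Only the date part is compared; any time-of-day suffix is ignored.
        if loc and lastmod and date.fromisoformat(lastmod[:10]) >= since:
            urls.append(loc)
    return urls


print(changed_since(SAMPLE, date(2019, 10, 2)))
```

A scraper could then revisit only the topics returned here instead of polling "recent" continuously.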