Sinclair Heaven and issue hosting and OCRing

Everything else about Zzap
fogartylee
Techno Teaboy
Posts: 16
Joined: Sat Mar 05, 2005 12:40 pm
Location: Nottingham

Post by fogartylee »

*Gets over the shock of joining a C64 forum*

Hi folks,

I will be loading the first few issues of C&VG onto the server this afternoon. There is a page viewer which, errmm, views the pages.

If you become a member, there is also a PDF download, which is currently just the images in one file. I have a project running to do proper PDFs of magazines so they are searchable, but this is a huge task, so if anyone wants to help out, you are more than welcome. Just check the WoS forum under Preservation.

If anyone would like other magazines hosted to preserve bandwidth, drop me a mail and I'll see what I can do. My site is for Sinclair stuff, but I see no reason not to put others on there.

Please be gentle with the downloads (i.e. don't do them all at once!). Bandwidth is not an issue, but it slows down other people's downloads.

You can see the site at http://www.sinclair-heaven.net

It's still being developed, so be patient!
Iain
Admin
Posts: 2222
Joined: Tue Jun 17, 2003 6:42 pm
Location: Cavan, Ireland

Post by Iain »

Excuse my French ( ;-) ) but that's some amazing shit! How the hell do you have access to over 25GB of space and no bandwidth issues??!

Regarding the PDFs, I assume at the moment they are just a collection of image files?

To make them searchable you'd have to OCR them, which is easier on the older issues, before they started putting mad colourful designs behind the text.

It would be great to have all the issues OCRed and searchable, but it's a massive job since OCR software hasn't quite got human-like AI at the moment :(
fogartylee
Techno Teaboy
Posts: 16
Joined: Sat Mar 05, 2005 12:40 pm
Location: Nottingham

Post by fogartylee »

I have a project running at the moment to OCR all magazines on my site. I'm starting with reviews as it gives people something to aim for. It's pretty boring just OCRing pages, and it takes so long that people are likely to just give up. So at the moment, they are just PDFs of the scans. As the site is around 90% PHP-scripted, it takes a while to do the PDF, then move everything into the correct folders so the scripts work as intended. I tend to do around 10 a day.

I feel sorry for Mort as it's even more of a pain to scan them in the first place! At least the OCR software can read his scans.

As for the bandwidth - you might notice my site is a bit slow. I had 53 magazine downloads today, and at about 30MB per magazine, it can slow things down. I am hosting from one of my home PCs, so the only restriction I have is NTL's upload speed. It's a bit crap, but it's free!
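As a rough illustration of why that many downloads can slow a home connection, the sketch below totals the day's traffic from the figures above and estimates how long it would take to push out. The 256 kbit/s upload rate is purely a hypothetical number (the actual NTL upload speed isn't given here), and 1 MB is treated as 10^6 bytes.

```python
def total_traffic_mb(downloads: int, mb_per_file: float) -> float:
    """Total data served, in MB, for a number of same-sized downloads."""
    return downloads * mb_per_file


def upload_hours(total_mb: float, upload_kbit_per_s: float) -> float:
    """Hours needed to send total_mb at a sustained upload rate.

    Assumes 1 MB = 10^6 bytes = 8000 kbit and a link that is fully used.
    """
    total_kbit = total_mb * 8 * 1000
    return total_kbit / upload_kbit_per_s / 3600


if __name__ == "__main__":
    traffic = total_traffic_mb(53, 30)      # figures from the post
    hours = upload_hours(traffic, 256)      # 256 kbit/s is an assumed rate
    print(f"{traffic:.0f} MB served, roughly {hours:.1f} hours of upload time")
```

Plugging in a different upload rate shows how quickly the time shrinks or grows; the point is only that a day's downloads can occupy the link for many hours.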

I have just installed a 200GB SATA drive, so space is never going to be an issue. This is why I have offered loads of people sections on the site (Gamebase & Sam Coupe being two of them). It pads the site out & all I need to do is drop the files in.

Like I said - the site is still being developed & some of the scripts I am developing are also becoming commercial, so I need to concentrate on those first. In particular, my stats package & forum. It just takes ages to write them & test them (thank god for WoS users!)

Lastly - you have a strange grasp of French. It looks English to me. Unless this forum has a babel converter. (Now there's an idea!)
Iain
Admin
Posts: 2222
Joined: Tue Jun 17, 2003 6:42 pm
Location: Cavan, Ireland

Post by Iain »

fogartylee wrote: At least the OCR software can read his scans.
It can?!! The OCR software I am/was using, TextBridge, couldn't make out his scans at 957xwhatever resolution. I found I needed to rescan the pages at around 2000x4000 or something to get a clean run at OCRing them. Very boring job alright! Maybe we could outsource it to India or something? Did they get English copies of Zzap there in the 80s? ;-)
fogartylee
Techno Teaboy
Posts: 16
Joined: Sat Mar 05, 2005 12:40 pm
Location: Nottingham

Post by fogartylee »

I'm using OmniPage Pro 14, and it's great. I need to reformat some pages afterwards, but can't complain. I've seen the results of the same page with TextBridge & it's not that different. Have you tried saving the output to a Word document?
I can't speak for your strange magazines, but the Sinclair ones are easy enough.

By the way, I guess there is a conspiracy with Commodore & Google. The only reason I knew my site had been mentioned here was because I did a search for 'Sinclair Heaven', and the ONLY result with my site name in it was this one!
Lloyd Mangram
King of Ludlow
Posts: 1151
Joined: Thu Jun 19, 2003 10:22 pm
Location: Ludlow

Post by Lloyd Mangram »

fogartylee wrote:
By the way, I guess there is a conspiracy with Commodore & Google. The only reason I knew my site had been mentioned here was because I did a search for 'Sinclair Heaven', and the ONLY result with my site name in it was this one!
:D That's because Speccies suck, of course, and Commodore rules. :wink:
Once again I emerge from beneath a massive pile of paper which makes my desk groan to bring you the world’s most amazing posts.
fogartylee
Techno Teaboy
Posts: 16
Joined: Sat Mar 05, 2005 12:40 pm
Location: Nottingham

Post by fogartylee »

You could be right.

Hang on a minute - is that the same Mr.Zzapback that became a member of my site at the weekend?

hmmm you guys better beware, I think you have a spy in the camp!

Seriously though, I noticed you had downloaded some mags. Any comments?
Iain
Admin
Posts: 2222
Joined: Tue Jun 17, 2003 6:42 pm
Location: Cavan, Ireland

Post by Iain »

fogartylee wrote: I'm using OmniPage Pro 14, and it's great. I need to reformat some pages afterwards, but can't complain. I've seen the results of the same page with TextBridge & it's not that different. Have you tried saving the output to a Word document?
I can't speak for your strange magazines, but the Sinclair ones are easy enough.
While Mort's scans are fine for reading with the human eye, they don't have enough detail for OCRing; the output text comes out with a LOAD of errors. But if I rescan the page at a higher resolution and then OCR it, there are a lot fewer errors. As regards later issues with lots of crappy multicolour backgrounds... it's a dead loss.

Do you Photoshop the scans before you OCR them?
fogartylee
Techno Teaboy
Posts: 16
Joined: Sat Mar 05, 2005 12:40 pm
Location: Nottingham

Post by fogartylee »

No. These are 'raw' results from one of Mort's scans. I didn't do these, by the way, but I have done the OmniPage one and got the same results:

http://www.sinclair-heaven.net/crash_omnipage.zip

http://www.sinclair-heaven.net/crash_textbridge.zip

These were both cut-and-paste jobs into a Word document.
Lloyd Mangram
King of Ludlow
Posts: 1151
Joined: Thu Jun 19, 2003 10:22 pm
Location: Ludlow

Post by Lloyd Mangram »

fogartylee wrote: No. These are 'raw' results from one of Mort's scans. I didn't do these, by the way, but I have done the OmniPage one and got the same results:

http://www.sinclair-heaven.net/crash_omnipage.zip

http://www.sinclair-heaven.net/crash_textbridge.zip

These were both cut-and-paste jobs into a Word document.
That looks quite interesting.
Imagine all the Crash & Zzap (etc.) text in a database, together with a search engine. Now that would be ideal!
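To make the database-plus-search idea a little more concrete, here is a minimal sketch of an in-memory inverted index over OCRed page text. Everything in it is an assumption for illustration: the page identifiers and the sample text are placeholders (not real magazine content), and a real site would more likely use a proper database or full-text engine.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lower-case the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(pages):
    """Map each word to the set of page ids whose OCR text contains it.

    pages: dict of page id -> OCR text for that page.
    """
    index = defaultdict(set)
    for page_id, text in pages.items():
        for word in tokenize(text):
            index[word].add(page_id)
    return index


def search(index, query):
    """Return the page ids that contain every word in the query."""
    words = tokenize(query)
    if not words:
        return set()
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results


# Placeholder pages; the ids and text are made up for the example.
SAMPLE_PAGES = {
    "crash-issue01-p12": "Review: a fast platform game with smooth scrolling",
    "zzap-issue03-p40": "Review: a shoot-em-up with smooth scrolling and great sound",
}

if __name__ == "__main__":
    idx = build_index(SAMPLE_PAGES)
    print(sorted(search(idx, "smooth scrolling")))
    print(sorted(search(idx, "platform")))
```

OCR errors would obviously hurt matching, so in practice the index is only as good as the proofreading behind it.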
fogartylee
Techno Teaboy
Posts: 16
Joined: Sat Mar 05, 2005 12:40 pm
Location: Nottingham

Post by fogartylee »

Yep - that is one of the reasons I wanted the reviews done first. It will then be easier, and not as boring, to do the rest of the mags as PDFs.
Iznogoud
Techno Teaboy
Posts: 4
Joined: Fri Apr 22, 2005 2:41 am

Post by Iznogoud »

Well, I tried the links but they don't work anymore. The result of the link I find quite interesting, though. BTW the first "you mum" needs an "r". Good luck with the scanning.
gizmomelb
Ken's Fishy Friend
Posts: 33
Joined: Mon Aug 22, 2005 2:33 am

Post by gizmomelb »

I'm planning on doing some scanning and OCRing in the near future of various 80s and 90s magazines I have at hand (ZZap!64, PCG, CD32, The One, Amiga Format etc.) and was after some general advice on scanning/OCRing, please.

Space is no issue for me (large HDD); I'd just like to keep the 'best' quality copies I can, as well as have them easily searchable. Any suggestions are most welcome re: procedure, settings and software.
gizmomelb
Ken's Fishy Friend
Posts: 33
Joined: Mon Aug 22, 2005 2:33 am

Post by gizmomelb »

hmm, no comments.. oh well.

Also, Sinclair Heaven seems to have lost its tracker? There's no seed anymore for any of the torrents.
Iain
Admin
Posts: 2222
Joined: Tue Jun 17, 2003 6:42 pm
Location: Cavan, Ireland

Post by Iain »

Well, for OCRing, I use TextBridge Pro. I scan the pages at 300dpi or so to give a horizontal resolution of over 2000 pixels.
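The relationship between dpi and pixel width is just dpi times the page width in inches. A small sketch of that arithmetic follows, assuming an A4-sized page (210 x 297 mm); the actual trim size of the magazines is an assumption here and may differ slightly.

```python
import math

MM_PER_INCH = 25.4

# Assumed page size (A4); the real magazine dimensions may differ.
PAGE_WIDTH_MM = 210
PAGE_HEIGHT_MM = 297


def scan_pixels(dpi, length_mm):
    """Pixels produced along one edge when scanning length_mm at dpi."""
    return round(dpi * length_mm / MM_PER_INCH)


def min_dpi_for_width(target_pixels, width_mm):
    """Smallest whole-number dpi that yields at least target_pixels across."""
    return math.ceil(target_pixels * MM_PER_INCH / width_mm)


if __name__ == "__main__":
    w = scan_pixels(300, PAGE_WIDTH_MM)
    h = scan_pixels(300, PAGE_HEIGHT_MM)
    print(f"300 dpi gives about {w} x {h} pixels")
    print("dpi needed for 2000 px wide:", min_dpi_for_width(2000, PAGE_WIDTH_MM))
```

Under the A4 assumption, 300dpi comfortably clears the 2000-pixel mark, which matches the figure quoted above.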

Coloured backgrounds, or especially changing backgrounds, can really screw up the OCRing, so sometimes I have to use Paint Shop Pro to colour-replace them with just bare white etc.
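For anyone wanting to script that kind of clean-up, the idea can be sketched as a simple brightness threshold: keep dark pixels (assumed to be text) and turn everything else white. This is not what Paint Shop Pro's colour replace does internally; it is a swapped-in, simplified technique that assumes dark text on a lighter background, and the threshold value of 80 is an arbitrary starting point. The image is represented as a plain list of rows of (r, g, b) tuples to keep the example dependency-free.

```python
WHITE = (255, 255, 255)


def luminance(rgb):
    """Approximate perceived brightness (0-255) using the common Rec. 601 weights."""
    r, g, b = rgb
    return 0.299 * r + 0.587 * g + 0.114 * b


def whiten_background(pixels, text_threshold=80):
    """Return a copy of the image with every non-dark pixel set to white.

    pixels: list of rows, each row a list of (r, g, b) tuples.
    text_threshold: pixels darker than this are treated as text and kept.
    """
    return [
        [p if luminance(p) < text_threshold else WHITE for p in row]
        for row in pixels
    ]


if __name__ == "__main__":
    # Tiny made-up 2x2 "page": two dark text pixels on coloured background.
    page = [
        [(10, 10, 10), (200, 50, 50)],
        [(250, 240, 100), (0, 0, 0)],
    ]
    for row in whiten_background(page):
        print(row)
```

Light-coloured text on a dark panel would need the test inverted, and busy photographic backgrounds will still defeat a single global threshold.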

Then it's time to actually do the OCRing, which by this stage is fairly painless, although it takes a while to format the output text.

It's a very slow, boring process unfortunately, but it's worth it in the end I guess! :)

Feel free to OCR any Zzap stuff for this site! :) Just make sure it hasn't been done already first.