Posted: Sat Mar 05, 2005 12:54 pm
by fogartylee
*Gets over the shock of joining a C64 forum*

Hi folks,

I will be loading the first few issues of C&VG onto the server this afternoon. There is a page viewer which, errmm, views the pages.

If you become a member, there is also a PDF download, which is currently just the page images in one file. I have a project running to do proper PDFs of magazines so they are searchable, but this is a huge task, so if anyone wants to help out, you are more than welcome. Just check the WoS forum under Preservation.

If anyone would like other magazines hosting to preserve bandwidth, drop me a mail and I'll see what I can do. My site is for Sinclair stuff, but I see no reason not to put others on there.

Please be gentle with the downloads (i.e. don't do them all at once!). Bandwidth is not an issue, but it slows down other people's downloads.

You can see the site at http://www.sinclair-heaven.net

It's still being developed, so be patient!

Posted: Mon Mar 07, 2005 12:17 am
by Iain
Excuse my French ( ;-) ) but that's some amazing shit! How the hell do you have access to over 25GB of space and no bandwidth issues??!

Regarding the PDFs, I assume at the moment they are just a collection of image files?

To make them searchable you'd have to OCR them, which is easier on the older issues, before they started putting mad colourful designs behind the text.

It would be great to have all the issues OCRed and searchable, but it's a massive job since OCR software hasn't quite got human-like AI at the moment :(

Posted: Mon Mar 07, 2005 12:47 am
by fogartylee
I have a project running at the moment to OCR all magazines on my site. I'm starting with reviews as it gives people something to aim for. It's pretty boring just OCRing pages, and it takes so long that people are likely to just give up. So at the moment, they are just PDFs of the scans. As the site is around 90% PHP scripted, it takes a while to do the PDF, then move everything into the correct folders so the scripts work as intended. I tend to do around 10 a day.

I feel sorry for Mort as it's even more of a pain to scan them in the first place! At least the OCR software can read his scans.

As for the bandwidth - you might notice my site is a bit slow. I had 53 magazine downloads today, and at about 30 MB per magazine, it can slow things down. I am hosting from one of my home PCs, so the only restriction I have is NTL's upload speed. It's a bit crap, but it's free!
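
For a rough sense of scale, the figures above can be put into a back-of-envelope sketch in Python. The 53 downloads and 30 MB come from the post; the upload speed is not stated anywhere in the thread, so the 256 kbit/s value below is purely an illustrative assumption.

```python
# Figures from the post: 53 downloads in one day, roughly 30 MB each.
DOWNLOADS_PER_DAY = 53
MB_PER_MAGAZINE = 30

# ASSUMPTION: the real upload speed is not given in the thread.
# 256 kbit/s is a made-up figure used only to illustrate the arithmetic.
ASSUMED_UPLOAD_KBIT_PER_S = 256


def daily_upload_mb(downloads: int, mb_each: float) -> float:
    """Total data served in a day, in megabytes."""
    return downloads * mb_each


def hours_to_upload(total_mb: float, kbit_per_s: float) -> float:
    """Time needed to push total_mb through a link of kbit_per_s."""
    total_kbit = total_mb * 8 * 1000  # decimal units: 1 MB = 8000 kbit
    return total_kbit / kbit_per_s / 3600


total = daily_upload_mb(DOWNLOADS_PER_DAY, MB_PER_MAGAZINE)
hours = hours_to_upload(total, ASSUMED_UPLOAD_KBIT_PER_S)
print(f"{total:.0f} MB served, about {hours:.1f} hours of upload at the assumed speed")
```

The total works out to 1590 MB, so roughly 1.6 GB a day; how long that ties up the line depends entirely on the real upload speed.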

I have just installed a 200 GB SATA drive, so space is never going to be an issue. This is why I have offered loads of people sections on the site (Gamebase & Sam Coupe being two of them). It pads the site out & all I need to do is drop the files in.

Like I said - the site is still being developed & some of the scripts I am developing are also becoming commercial, so I need to concentrate on those first. In particular, my stats package & forum. It just takes ages to write them & test them (thank god for WoS users!)

Lastly - you have a strange grasp of French. It looks English to me. Unless this forum has a babel converter. (Now there's an idea!)

Posted: Mon Mar 07, 2005 1:17 am
by Iain
fogartylee wrote: At least the OCR software can read his scans.
It can?!! The OCR software I am/was using, TextBridge, couldn't make out his scans at 957xwhatever resolution. I found I needed to rescan the pages at around 2000x4000 or something to get a clean run at OCRing them. Very boring job alright! Maybe we could outsource it to India or something? Did they get English copies of Zzap there in the 80s? ;-)

Posted: Mon Mar 07, 2005 7:53 am
by fogartylee
I'm using OmniPage Pro 14, and it's great. I need to reformat some pages after, but can't complain. I've seen the same results with TextBridge & it's not that different. Have you tried saving the output to a Word document?
I can't speak for your strange magazines, but the Sinclair ones are easy enough.

By the way, I guess there is a conspiracy with Commodore & Google. The only reason I knew my site had been mentioned here was because I did a search for 'Sinclair Heaven', and the ONLY result with my site name in it was this one!

Posted: Mon Mar 07, 2005 8:55 am
by Lloyd Mangram
fogartylee wrote:
By the way, I guess there is a conspiracy with Commodore & Google. The only reason I knew my site had been mentioned here was because I did a search for 'Sinclair Heaven', and the ONLY result with my site name in it was this one!
:D That's because Speccies suck, of course, and Commodore rules. :wink:

Posted: Mon Mar 07, 2005 9:54 am
by fogartylee
You could be right.

Hang on a minute - is that the same Mr.Zzapback that became a member of my site at the weekend?

hmmm you guys better beware, I think you have a spy in the camp!

Seriously though, I noticed you had downloaded some mags. Any comments?

Posted: Mon Mar 07, 2005 6:30 pm
by Iain
fogartylee wrote: I'm using OmniPage Pro 14, and it's great. I need to reformat some pages after, but can't complain. I've seen the same results with TextBridge & it's not that different. Have you tried saving the output to a Word document?
I can't speak for your strange magazines, but the Sinclair ones are easy enough.
While Mort's scans are fine for reading with the human eye, they don't have enough detail for OCRing, so the output text comes out with a LOAD of errors. But if I rescan the page at a higher resolution and then OCR it, there are a lot fewer errors. But as regards later issues with lots of crappy multicolour backgrounds...... it's a dead loss.

Do you photoshop the scans before you OCR them?

Posted: Mon Mar 07, 2005 7:12 pm
by fogartylee
No. These are 'raw' results from one of Mort's scans. I didn't do these by the way, but I have done the OmniPage one and got the same results:

http://www.sinclair-heaven.net/crash_omnipage.zip

http://www.sinclair-heaven.net/crash_textbridge.zip

These were both cut-and-paste jobs into a Word document.

Posted: Mon Mar 07, 2005 7:26 pm
by Lloyd Mangram
fogartylee wrote: No. These are 'raw' results from one of Mort's scans. I didn't do these by the way, but I have done the OmniPage one and got the same results:

http://www.sinclair-heaven.net/crash_omnipage.zip

http://www.sinclair-heaven.net/crash_textbridge.zip

These were both cut-and-paste jobs into a Word document.
That looks quite interesting.
Imagine all Crash & Zzap (etc) text in a database, together with a search engine, now that would be ideal!
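
To make the idea concrete, here is a minimal Python sketch of a searchable text store, assuming the OCR output is already available as plain text per page. It is a toy inverted index, not anything the sites in this thread actually run; the page IDs and sample sentences are placeholders invented for illustration, not real magazine text.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lower-case the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(pages):
    """Map each word to the set of page IDs whose text contains it."""
    index = defaultdict(set)
    for page_id, text in pages.items():
        for word in tokenize(text):
            index[word].add(page_id)
    return index


def search(index, query):
    """Return the page IDs that contain every word in the query."""
    words = tokenize(query)
    if not words:
        return set()
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results


# Placeholder pages (hypothetical IDs and text, standing in for OCR output).
pages = {
    "sample-a": "A colourful platform game with smooth scrolling.",
    "sample-b": "A tough shoot-em-up with a great soundtrack.",
    "sample-c": "Platform puzzles and a catchy soundtrack.",
}

index = build_index(pages)
print(sorted(search(index, "platform")))
print(sorted(search(index, "platform soundtrack")))
```

A real archive would more likely sit in a proper database with full-text search, but the principle is the same: the OCRed text is what gets indexed, and each hit points back at a scanned page.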

Posted: Tue Mar 08, 2005 12:11 am
by fogartylee
Yep - that's one of the reasons I wanted the reviews done first. It will then be easier to do the rest of the mags as PDFs, and not as boring.

Posted: Fri Apr 22, 2005 3:11 am
by Iznogoud
Well, I tried the links but they don't work anymore. I find the result of the link quite interesting though. BTW the first "you mum" needs an "r". Good luck with the scanning.

Posted: Mon Aug 22, 2005 8:44 am
by gizmomelb
I'm planning on doing some scanning and OCRing in the near future of various 80s and 90s magazines I have at hand (Zzap!64, PCG, CD32, The One, Amiga Format etc.) and was after some general advice on scanning/OCRing please.

Space is no issue for me (large HDD); I'd just like to keep the 'best' quality copies I can as well as have them easily searchable. Any suggestions most welcome re: procedure, settings and software.

Posted: Wed Aug 24, 2005 7:58 am
by gizmomelb
hmm, no comments.. oh well.

Also, Sinclair Heaven seems to have lost its tracker? There's no seed anymore for any of the torrents.

Posted: Wed Aug 24, 2005 12:20 pm
by Iain
Well, for OCRing I use TextBridge Pro. I scan the pages at 300 dpi or so to give a horizontal resolution of over 2000 pixels.
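
The link between dpi and pixel width is simple arithmetic, sketched below in Python. The thread never gives the page size, so A4 (210 x 297 mm) is an assumption here; a magazine with a different trim size will give slightly different numbers.

```python
import math

MM_PER_INCH = 25.4

# ASSUMPTION: the page size is not stated in the thread; A4 is used here.
PAGE_WIDTH_MM = 210.0
PAGE_HEIGHT_MM = 297.0


def scan_size_px(dpi, width_mm=PAGE_WIDTH_MM, height_mm=PAGE_HEIGHT_MM):
    """Pixel dimensions of a full-page scan at the given dpi."""
    width_px = round(width_mm / MM_PER_INCH * dpi)
    height_px = round(height_mm / MM_PER_INCH * dpi)
    return width_px, height_px


def min_dpi_for_width(target_px, width_mm=PAGE_WIDTH_MM):
    """Smallest whole dpi that yields at least target_px across the page."""
    return math.ceil(target_px / (width_mm / MM_PER_INCH))


for dpi in (150, 300, 600):
    print(dpi, "dpi ->", scan_size_px(dpi))

print("dpi needed for 2000 px across:", min_dpi_for_width(2000))
```

Under the A4 assumption, 300 dpi gives a page about 2480 pixels wide, which fits the "over 2000 pixels" figure above.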

Coloured backgrounds, or especially changing backgrounds, can really screw up the OCRing, so sometimes I have to use Paint Shop Pro to colour-replace them with just bare white etc.
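
The colour-replace step can be sketched in a few lines of Python. This is only an illustration of the idea, not what Paint Shop Pro does internally: the image is a plain nested list of RGB tuples, and the background colour and tolerance are hypothetical values that would need tuning per page.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]
Image = List[List[Pixel]]

WHITE: Pixel = (255, 255, 255)


def colour_distance_sq(a: Pixel, b: Pixel) -> int:
    """Squared Euclidean distance between two RGB colours."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def replace_background(image: Image, background: Pixel, tolerance: int = 40) -> Image:
    """Return a copy with every pixel close to `background` turned white.

    `tolerance` is a hypothetical knob: the maximum RGB distance at which a
    pixel still counts as background.
    """
    limit = tolerance ** 2
    return [
        [WHITE if colour_distance_sq(pixel, background) <= limit else pixel
         for pixel in row]
        for row in image
    ]


# Tiny made-up example: yellow background with one dark "text" pixel.
YELLOW: Pixel = (250, 220, 60)
TEXT: Pixel = (20, 20, 20)
page: Image = [
    [YELLOW, TEXT],
    [(245, 215, 70), YELLOW],  # slightly different shade of the background
]

cleaned = replace_background(page, YELLOW)
print(cleaned)
```

A single reference colour works for flat backgrounds; a page whose background changes across the sheet would need the replacement run once per colour or region, which is where it gets painful.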

Then it's time to actually do the OCRing, which by this stage is fairly painless, although it takes a while to format the output text.

It's a very slow, boring process unfortunately, but it's worth it in the end I guess! :)

Feel free to OCR any Zzap stuff for this site! :) Just make sure it hasn't been done already first.