iTextSharp is an open source PDF library. In most of the examples below, I tried to alter or copy a template PDF and then save it as a brand new file. How to duplicate PDF text but rasterize graphics. This class is part of the book “iText in Action – 2nd Edition” * written by Bruno Lowagie (ISBN: ) * For more info, go to.
|Published (Last):||26 February 2007|
Wednesday, March 25, 6: Looking for advice on the best approach to do something others may have tried. I have PDFs with text and graphics. They are very large, due to the graphics. I want to read each PDF and produce a new PDF with the text just as it was in the original, but rasterize the rest of the graphics into a fairly low-res bitmap to be added behind the text, reducing the overall file size.
I do not need to manipulate the text, just replicate it. Everything else can be bitmapped. Any starting places would be greatly appreciated.
How to duplicate PDF text but rasterize graphics. First of all, I’m going to ass-u-me that you mean “raster image” when you say “graphic”. It’ll have to go Something Like This: Heck, plain ol’ Java might do the trick in some cases. You can set a PRStream’s data directly, though you might have to deal with some compression filters that iText doesn’t know about yet.
It really depends on the format. You need to track down the instructions used to draw the image (no mean feat), then Change Them (potential nightmare). For example, a simple image draw command might look like This: If the image was rotated, you’ll get to Have Fun With Trigonometry. Fortunately, you shouldn’t have to track the current transformation matrix, just the proportionate difference between the old resolution and the new resolution.
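To put some numbers on the "proportionate difference" and the trigonometry mentioned above, here is a minimal sketch. It assumes the image is placed by a single `cm` matrix `[a b c d e f]` with nothing else scaling the page, and the default user space of 72 units per inch. `ImagePlacement` and its methods are hypothetical helpers made up for illustration; they are not part of iText.

```java
// Hypothetical helper (not an iText class): works out how large an image is
// drawn on the page from the a, b, c, d entries of its placement matrix.
final class ImagePlacement {
    final double widthPt;      // drawn width in points
    final double heightPt;     // drawn height in points
    final double rotationDeg;  // counter-clockwise rotation of the image's x axis

    ImagePlacement(double a, double b, double c, double d) {
        // The image occupies a unit square that the matrix maps onto the page,
        // so the lengths of the two column vectors are the drawn dimensions.
        widthPt = Math.hypot(a, b);
        heightPt = Math.hypot(c, d);
        rotationDeg = Math.toDegrees(Math.atan2(b, a));
    }

    // Effective resolution, assuming 72 points per inch.
    double horizontalDpi(int pixelWidth) {
        return pixelWidth * 72.0 / widthPt;
    }

    double verticalDpi(int pixelHeight) {
        return pixelHeight * 72.0 / heightPt;
    }

    // Pixel count needed to bring the image from currentDpi down to targetDpi.
    static int resample(int pixels, double currentDpi, double targetDpi) {
        return Math.max(1, (int) Math.round(pixels * targetDpi / currentDpi));
    }
}
```

With these assumptions, a 600-pixel-wide image drawn 144 points (2 inches) wide sits at 300 dpi, and taking it down to 100 dpi means resampling it to 200 pixels wide.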
I don’t think you’ll need to change the x,y offsets at all, so long as your output is scaled properly.
Render the pages to some image format at the resolution you want using any available PDF renderer (Ghostscript, for example).
Draw that into the page as the “under content” using a PdfStamper. Now parse the content streams, keeping track of all the graphic state as you go, and yank out all those graphics you don’t want, leaving the text Where It Was Before (no small trick).
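For the rendering step, the resolution you pick decides how big the background bitmap gets. The sketch below is plain arithmetic, assuming page dimensions in points at 72 per inch; `PageRaster` is a hypothetical name and none of this is iText or Ghostscript API.

```java
// Hypothetical helper: pixel dimensions of a page rendered at a given resolution.
final class PageRaster {
    // Converts a length in points (1/72 inch) to pixels, rounding up so the
    // bitmap always covers the whole page.
    static int pixels(double points, double dpi) {
        return (int) Math.ceil(points * dpi / 72.0);
    }

    // Uncompressed size of a 24-bit RGB bitmap of the page, in bytes.
    static long rawRgbBytes(double widthPt, double heightPt, double dpi) {
        return 3L * pixels(widthPt, dpi) * pixels(heightPt, dpi);
    }
}
```

A US Letter page (612 x 792 points) rendered at 100 dpi comes out at 850 x 1100 pixels, about 2.8 MB of raw RGB before any compression, which gives a rough feel for the trade-off between resolution and file size.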
This approach will work even if the “Graphics” you’re trying to remove are line art, pattern fills, or what have you. It’ll just be Very Difficult to do all the extra parsing. This approach is cutting a lot of corners from a General Solution because when you have a limited number of programs producing your PDFs (hopefully “1”), you can start to make Assumptions about how their content streams will be laid out. This can be Quite Dangerous, particularly if they change their content formatting in some minor revision and blow your corner-cutting parser all to hell.
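A corner-cutting parser of the kind described above might look roughly like this. It is only a sketch under loud assumptions: the content stream is already decompressed and handed over as a string, there are no inline images (`BI`/`ID`/`EI`) or comments, and dropping clipping paths is acceptable. `TextOnlyFilter` is a hypothetical class, not something iText provides. It drops path, shading, and `Do` operators together with their operands, and leaves `q`, `Q`, `cm`, and everything between `BT` and `ET` alone so the text stays where it was.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: strips drawing operators from a content stream,
// keeping graphics-state and text operators untouched.
final class TextOnlyFilter {
    // Path construction, path painting, clipping, shading, XObject invocation.
    private static final Set<String> DROPPED = Set.of(
        "m", "l", "c", "v", "y", "h", "re",
        "S", "s", "f", "F", "f*", "B", "B*", "b", "b*", "n",
        "W", "W*", "sh", "Do");

    static String keepText(String content) {
        List<String> kept = new ArrayList<>();
        List<String> operands = new ArrayList<>();
        for (String token : tokenize(content)) {
            if (isOperand(token)) {
                operands.add(token);
                continue;
            }
            // token is an operator: keep or drop it along with its operands
            if (!DROPPED.contains(token)) {
                operands.add(token);
                kept.add(String.join(" ", operands));
            }
            operands.clear();
        }
        return String.join("\n", kept);
    }

    private static boolean isOperand(String t) {
        char c = t.charAt(0);
        return "(<>[]/+-.".indexOf(c) >= 0 || Character.isDigit(c)
            || t.equals("true") || t.equals("false") || t.equals("null");
    }

    static List<String> tokenize(String s) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < s.length()) {
            char c = s.charAt(i);
            int start = i;
            if (Character.isWhitespace(c)) { i++; continue; }
            if (c == '(') {
                // literal string: honour nested parentheses and backslash escapes
                int depth = 0;
                do {
                    char d = s.charAt(i);
                    if (d == '\\') i++;
                    else if (d == '(') depth++;
                    else if (d == ')') depth--;
                    i++;
                } while (depth > 0 && i < s.length());
                i = Math.min(i, s.length());
            } else if (c == '[' || c == ']') {
                i++;
            } else if ((c == '<' || c == '>') && i + 1 < s.length() && s.charAt(i + 1) == c) {
                i += 2; // dictionary delimiter << or >>
            } else if (c == '<') {
                int end = s.indexOf('>', i); // hex string
                i = end < 0 ? s.length() : end + 1;
            } else {
                // name, number, or operator: read up to the next delimiter
                i++;
                while (i < s.length() && "()<>[]/".indexOf(s.charAt(i)) < 0
                        && !Character.isWhitespace(s.charAt(i))) i++;
            }
            out.add(s.substring(start, i));
        }
        return out;
    }
}
```

The output is reformatted to one operator per line, which is harmless to a PDF viewer. Anything the producing program does outside these assumptions will break it, which is exactly the danger described above.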
You Have Been Warned.
I take it from your question that you don’t know all that much about PDF? You miiight want to contract this one out. Would it be too mercenary to suggest itextsoftware.
In reply to this post by Doug Moreland.
What patents or publications came out of this effort?
Can you outline the issues? I would think with html you could just swap out images. Is there a key objective this would not accomplish? I have to buy a product that took a heroic effort to produce just to let me integrate data from multiple sources, and this is an “enhancement”?
The comments in the spec about reflowing and the importance of logical structure make it sound like there is the potential here for a reasonably well authored document to appeal to both the automated data processor and the viewer-of-nice-pictures. With html you could just swap out the images. That is great but then you end up with situations like the US IRS offering documents to people who are unable to extract their own tax numbers from the artwork because no one enabled “user rights” or found some other features for those with proprietary interests.
I don’t have an algebra for pictures but I’ve managed to get pretty good with integer math. I’ll even concede that “along with freedom comes responsibility”, and if you offer a versatile format it can be difficult to sell the right defaults to every customer, but in this case it seems the format lacks some versatility (versatility has to be realizable, not just hypothetical) and it sounds like this comparatively simple task is not simple.
You just like arguing. I don’t think you are understanding the problem that the person is having. It’s not about simply finding already rastered images and replacing them with alternate versions – that’s pretty simple and there is an example of using iText for doing just that. In the process, you will probably want to optimize the output. Actually, this problem is difficult with HTML too.
Your task is to reduce that to the smallest number of objects that produce the same visual result with all text kept intact.
Now, add to that all the complexities of the PDF rendering model – overlapping Z-ordered objects, color management, rich transparency model, etc.
And these are all things that are being considered for HTML5, so the same problems would now manifest themselves in that environment as well.
Leonard
Original Message From: Thursday, March 26, 8:
No, I was just trying to do my taxes. The text is fixed or unrelated, no? At worst, then, this comes down to the same “typesetting” or “reflow” issues that always come up.
So the complexity depends on quality? It would depend on what you want to replace the SVG stuff- best quality for the size or something like a placeholder.
I guess I would like to get some idea of the model storage capabilities too. If you can store more complicated information in a form usable by something other than canned proprietary apps, that would be great. I’ll postpone the witch-hunt until I have some better direction.
In reply to this post by Mike Marchywka
Regarding the suggestions on how to tackle my problem, thank you.
I had reviewed PdfBox and iText literature, hoping there might be a more trivial approach, but it seems to be confirmed that I will have to parse the PDF at a low level moving the text while rendering the non-text objects to a bitmap sized appropriately.
I am a programmer so it’s not out of the question; it’s just a big investment in time I had hoped to avoid: It looks like it should do what I want and more.
Have not gotten the selective rasterization to do what I want yet; it seems to rasterize the text too, but that is probably my mistake. Will seek assistance from Apago.
Consider the following pseudo-coded SVG: And that’s even assuming no transparency or filter effects in place. You suggested the user just open this up in an editor (an SVG editor, in this case).
That might be OK if I had access to all the same fonts that the author did (assuming I am not the original author of the document), but what if I don’t? There would be no way for me to “reauthor the content” and ensure that things did NOT reflow or relayout, let alone change the appearance.
Thursday, March 26, 9:
If you have a more apropos 3 letter acronym for Itetsharp I can use that. So if a line is drawn from A to B, the Location has changed.
Because you can push and pop the graphic state, it’s quite possible to isolate various graphic elements such that they do not affect one another at all. PDF is all but write-only. It doesn’t do any hand holding.
There’s a programming analogy that seems to fit: Basic is like a suction cup dart gun. Which one do you want when you’re learning?
Which one do you want when you go bear hunting? Things that are trivial in one are impossible or nearly so in the other. Lots of things have been added to both to broaden their appeal, with varying degrees of success. Properly implemented, structure can do All Kinds of Spiffy Things. Only Adobe properly implements it (that I’ve ever seen), and even they mess it up.
To really live up to its promise, structure needs to be more wide spread than it is. It has thus far failed to reach techno-critical-mass.
PDF puts appearance and the consistency of that appearance first. HTML puts meaning first. PDF has done various things to try to add meaning. HTML has done various things to improve its appearance and I suspect you have Intimate Knowledge of just how consistent that appearance can be across different versions of different browsers. A collection of characters.
And you can’t tell which until you peel the PDF open and start rooting around in it. A “table” in PDF is where stuff happens to be itexxtsharp. Many of your attacks on PDF lol! Thanks for clearing that up. Use the right tool for the job. And stop trying to hang sheet rock with a voltage meter. I mean, can you imagine how you look to someone who knows how to pick up a nail gun?